本篇博文主要内容为 2026-02-17 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-02-17)

今日共更新998篇论文,其中:

  • 自然语言处理139篇(Computation and Language (cs.CL))
  • 人工智能363篇(Artificial Intelligence (cs.AI))
  • 计算机视觉187篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习294篇(Machine Learning (cs.LG))
  • 多智能体系统35篇(Multiagent Systems (cs.MA))
  • 信息检索34篇(Information Retrieval (cs.IR))
  • 人机交互52篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] Distributed Quantum Gaussian Processes for Multi-Agent Systems AAMAS2026

【速读】:该论文旨在解决经典核函数在复杂、大规模现实场景中表达能力有限,从而制约高斯过程(Gaussian Processes, GPs)性能的问题。其解决方案的关键在于提出一种分布式量子高斯过程(Distributed Quantum Gaussian Process, DQGP)方法,利用量子计算将数据嵌入到指数级大的希尔伯特空间(Hilbert space),以捕获经典计算难以触及的复杂相关性;同时,为应对非欧几里得优化难题,设计了一种分布式共识黎曼交替方向乘子法(Distributed consensus Riemannian Alternating Direction Method of Multipliers, DR-ADMM)算法,实现多智能体环境下局部模型向全局模型的有效聚合,从而提升建模能力和可扩展性。

链接: https://arxiv.org/abs/2602.15006
作者: Meet Gandhi,George P. Kontoudis
机构: Colorado School of Mines (科罗拉多矿业学院)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Differential Geometry (math.DG)
备注: 9 pages, 4 figures, accepted at AAMAS 2026 (International Conference on Autonomous Agents and Multiagent Systems)

点击查看摘要

Abstract:Gaussian Processes (GPs) are a powerful tool for probabilistic modeling, but their performance is often constrained in complex, largescale real-world domains due to the limited expressivity of classical kernels. Quantum computing offers the potential to overcome this limitation by embedding data into exponentially large Hilbert spaces, capturing complex correlations that remain inaccessible to classical computing approaches. In this paper, we propose a Distributed Quantum Gaussian Process (DQGP) method in a multiagent setting to enhance modeling capabilities and scalability. To address the challenging non-Euclidean optimization problem, we develop a Distributed consensus Riemannian Alternating Direction Method of Multipliers (DR-ADMM) algorithm that aggregates local agent models into a global model. We evaluate the efficacy of our method through numerical experiments conducted on a quantum simulator in classical hardware. We use real-world, non-stationary elevation datasets of NASA’s Shuttle Radar Topography Mission and synthetic datasets generated by Quantum Gaussian Processes. Beyond modeling advantages, our framework highlights potential computational speedups that quantum hardware may provide, particularly in Gaussian processes and distributed optimization.

[MA-1] Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agent ic Healthcare Systems

【速读】:该论文旨在解决医疗代理系统中任务专用模型(task-specialized models)选择难题,即在面对同一临床查询时,不同模型在不同数据样本上表现各异,导致单一“最优”模型难以满足多样化的任务需求。为实现可靠且自适应的模型选择,其解决方案的关键在于提出ToolSelect框架:通过最小化采样候选工具上的任务条件选择损失的代理函数,学习一个基于查询和各模型行为摘要的注意力神经过程(Attentive Neural Process)选择器,从而动态从异构工具池中选出最适配当前任务的专家模型。

链接: https://arxiv.org/abs/2602.14901
作者: Pramit Saha,Joshua Strong,Mohammad Alsharid,Divyanshu Mishra,J. Alison Noble
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Task-specialized models form the backbone of agentic healthcare systems, enabling the agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single “best” model rarely exists. In practice, each task is better served by multiple competing specialist models where different models excel on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we, for the first time, introduce an agentic Chest X-ray environment equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA) and develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.

[MA-2] Atomix: Timely Transactional Tool Use for Reliable Agent ic Workflows

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在调用外部工具时因即时执行特性导致的副作用不可控问题,尤其是在故障、推测或资源竞争场景下,未完成分支可能引发无法安全回滚的意外行为。其解决方案的关键在于提出Atomix运行时系统,通过为每个工具调用打上epoch标签、跟踪每资源的前沿(frontier)状态,并仅在进度谓词(progress predicates)表明安全时才提交操作;同时支持可缓冲的副作用延迟执行,对外部已生效的副作用进行追踪并在事务中止时补偿,从而实现面向代理工具调用的进度感知事务语义。

链接: https://arxiv.org/abs/2602.14849
作者: Bardia Mohammadi,Nearchos Potamitis,Lars Klein,Akhil Arora,Laurent Bindschaedler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:LLM agents increasingly act on external systems, yet tool effects are immediate. Under failures, speculation, or contention, losing branches can leak unintended side effects with no safe rollback. We introduce Atomix, a runtime that provides progress-aware transactional semantics for agent tool calls. Atomix tags each call with an epoch, tracks per-resource frontiers, and commits only when progress predicates indicate safety; bufferable effects can be delayed, while externalized effects are tracked and compensated on abort. Across real workloads with fault injection, transactional retry improves task success, while frontier-gated commit strengthens isolation under speculation and contention.

[MA-3] ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic ITSC

【速读】:该论文旨在解决多模式混合交通场景下环形交叉口(roundabout)中车辆与弱势道路使用者(Vulnerable Road Users, VRUs)之间的协同安全问题,尤其是在动态复杂环境中实现高效且安全的通行控制。其解决方案的关键在于提出了一种名为ROSA(Roundabout Optimized Speed Advisory)的系统,该系统结合多智能体轨迹预测与协调式速度引导机制:基于Transformer架构联合预测车辆与VRU的未来轨迹,并利用运动学约束提升预测精度(五秒预测时域下平均距离误差ADE: 1.29m,最终距离误差FDE: 2.99m),进一步引入路径意图信息可将性能优化至ADE: 1.10m、FDE: 2.36m;在此基础上,ROSA实时识别潜在冲突并生成确定性速度建议,从而在保障效率的同时显著提升整体交通安全,包括VRU感知的安全性。

链接: https://arxiv.org/abs/2602.14780
作者: Anna-Lena Schlamp,Jeremias Gerner,Klaus Bogenberger,Werner Huber,Stefanie Schmidtner
机构: 未知
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Robotics (cs.RO); Systems and Control (eess.SY)
备注: 8 pages, 1 figure, 4 tables, 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC)

点击查看摘要

Abstract:We present ROSA – Roundabout Optimized Speed Advisory – a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: this http URL.

[MA-4] ST-EVO: Towards Generative Spatio-Temporal Evolution of Multi-Agent Communication Topologies

【速读】:该论文旨在解决当前自演化多智能体系统(Self-Evolving Multi-Agent Systems, MAS)在协作能力激发上的局限性问题,即现有方法主要局限于空间演化(Spatial Evolving)或时间演化(Temporal Evolving)单一维度的调整,未能充分调动大语言模型(LLM)之间的协同潜力。其解决方案的关键在于提出一种全新的时空联合演化的框架 ST-EVO,该框架通过基于流匹配(flow-matching)的紧凑调度器实现对话粒度的通信调度,并具备感知系统不确定性及利用累积经验进行自我反馈学习的能力,从而在任务适应性与协作效率之间取得更好平衡。

链接: https://arxiv.org/abs/2602.14681
作者: Xingjian Wu,Xvyuan Liu,Junkai Lu,Siyuan Wang,Yang Shu,Jilin Hu,Chenjuan Guo,Bin Yang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-powered Multi-Agent Systems (MAS) have emerged as an effective approach towards collaborative intelligence, and have attracted wide research interests. Among them, ``self-evolving’’ MAS, treated as a more flexible and powerful technical route, can construct task-adaptive workflows or communication topologies, instead of relying on a predefined static structue template. Current self-evolving MAS mainly focus on Spatial Evolving or Temporal Evolving paradigm, which only considers the single dimension of evolution and does not fully incentivize LLMs’ collaborative capability. In this work, we start from a novel Spatio-Temporal perspective by proposing ST-EVO, which supports dialogue-wise communication scheduling with a compact yet powerful flow-matching based Scheduler. To make precise Spatio-Temporal scheduling, ST-EVO can also perceive the uncertainty of MAS, and possesses self-feedback ability to learn from accumulated experience. Extensive experiments on nine benchmarks demonstrate the state-of-the-art performance of ST-EVO, achieving about 5%–25% accuracy improvement.

[MA-5] owards Selection as Power: Bounding Decision Authority in Autonomous Agents

【速读】:该论文旨在解决自主代理系统(autonomous agentic systems)在高风险、受监管领域中因缺乏对“选择权力”(selection power)的直接管控而导致的安全隐患问题。现有方法如对齐(alignment)、可解释性(interpretability)和动作级过滤虽能提升安全性,但未能有效约束代理在生成选项、呈现方式及决策框架上的控制权,从而可能导致不可逆的错误决策或隐蔽性失效。其解决方案的关键在于提出一种治理架构,将认知(cognition)、选择(selection)与行动(action)分离,并将自主性建模为主权向量(vector of sovereignty),其中认知自治保持开放,而选择与行动自治通过机制强制的原语(mechanically enforced primitives)进行边界限制,这些原语运行于代理优化空间之外。该架构整合了外部候选生成(CEFL)、受控缩减器、提交-揭示熵隔离、理由验证与响亮故障断路器等组件,在对抗压力测试下显著提升了治理的可审计性与抗操纵能力,实现了对选择权威的量化约束,同时保留推理能力,从而将治理焦点从内部意图对齐转向可测量的因果权力边界。

链接: https://arxiv.org/abs/2602.14606
作者: Jose Manuel de la Chica Rodriguez,Juan Manuel Vera Díaz
机构: AI Lab, Grupo Santander (Grupo Santander)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Autonomous agentic systems are increasingly deployed in regulated, high-stakes domains where decisions may be irreversible and institutionally constrained. Existing safety approaches emphasize alignment, interpretability, or action-level filtering. We argue that these mechanisms are necessary but insufficient because they do not directly govern selection power: the authority to determine which options are generated, surfaced, and framed for decision. We propose a governance architecture that separates cognition, selection, and action into distinct domains and models autonomy as a vector of sovereignty. Cognitive autonomy remains unconstrained, while selection and action autonomy are bounded through mechanically enforced primitives operating outside the agent’s optimization space. The architecture integrates external candidate generation (CEFL), a governed reducer, commit-reveal entropy isolation, rationale validation, and fail-loud circuit breakers. We evaluate the system across multiple regulated financial scenarios under adversarial stress targeting variance manipulation, threshold gaming, framing skew, ordering effects, and entropy probing. Metrics quantify selection concentration, narrative diversity, governance activation cost, and failure visibility. Results show that mechanical selection governance is implementable, auditable, and prevents deterministic outcome capture while preserving reasoning capacity. Although probabilistic concentration remains, the architecture measurably bounds selection authority relative to conventional scalar pipelines. This work reframes governance as bounded causal power rather than internal intent alignment, offering a foundation for deploying autonomous agents where silent failure is unacceptable.

[MA-6] Fluid-Agent Reinforcement Learning AAMAS2026

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中传统框架无法处理动态变化的智能体数量的问题。在现实世界中,智能体的数量往往不固定且事先未知,且智能体可能自主创建新智能体(如细胞分裂或公司分拆)。为应对这一挑战,论文提出了一种“流体智能体环境”(fluid-agent environment)框架,允许智能体在环境中动态生成其他智能体。其解决方案的关键在于设计适用于此类动态场景的游戏论解概念,并通过实验验证多种MARL算法在此框架下的性能表现,结果表明该框架能够使智能体团队根据环境需求自适应调整规模,从而实现更灵活和高效的协作策略。

链接: https://arxiv.org/abs/2602.14559
作者: Shishir Sharma,Doina Precup,Theodore J. Perkins
机构: Mila – Quebec Artificial Intelligence Institute (魁北克人工智能研究所); McGill University (麦吉尔大学); Ottawa Hospital Research Institute (渥太华医院研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Published in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

点击查看摘要

Abstract:The primary focus of multi-agent reinforcement learning (MARL) has been to study interactions among a fixed number of agents embedded in an environment. However, in the real world, the number of agents is neither fixed nor known a priori. Moreover, an agent can decide to create other agents (for example, a cell may divide, or a company may spin off a division). In this paper, we propose a framework that allows agents to create other agents; we call this a fluid-agent environment. We present game-theoretic solution concepts for fluid-agent games and empirically evaluate the performance of several MARL algorithms within this framework. Our experiments include fluid variants of established benchmarks such as Predator-Prey and Level-Based Foraging, where agents can dynamically spawn, as well as a new environment we introduce that highlights how fluidity can unlock novel solution strategies beyond those observed in fixed-population settings. We demonstrate that this framework yields agent teams that adjust their size dynamically to match environmental demands.

[MA-7] Socially-Weighted Alignment: A Game-Theoretic Framework for Multi-Agent LLM Systems

【速读】:该论文旨在解决在共享环境中部署大规模语言模型(Large Language Model, LLM)代理时,个体理性决策与群体稳定性之间的根本性矛盾:局部最优行为可能产生负外部性,从而损害系统级性能。解决方案的关键在于提出社会加权对齐(Socially-Weighted Alignment, SWA),这是一个基于博弈论的框架,通过在推理阶段将代理的私有目标与其对群体福利的估计进行插值,引入一个社会权重 λ ∈ [0,1] 来调节决策倾向。理论分析表明,在具有 n 个代理和拥堵严重度 β 的共享资源拥堵博弈中,当 λ 超过临界阈值 λ* = (n − β)/(n − 1) 时,代理在负载过载下不再有边际动机增加需求,从而实现从持续拥堵到接近容量稳定运行的相变;该方法无需参数更新或多人强化学习,仅需推理时算法实现即可有效提升系统整体稳定性。

链接: https://arxiv.org/abs/2602.14471
作者: Furkan Mumcu,Yasin Yilmaz
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying large language model (LLM) agents in shared environments introduces a fundamental tension between individual alignment and collective stability: locally rational decisions can impose negative externalities that degrade system-level performance. We propose Socially-Weighted Alignment (SWA), a game-theoretic framework that modifies inference-time decision making by interpolating between an agent’s private objective and an estimate of group welfare via a social weight \lambda\in[0,1] . In a shared-resource congestion game with n agents and congestion severity \beta , we show that SWA induces a critical threshold \lambda^*=(n-\beta)/(n-1) above which agents no longer have marginal incentive to increase demand under overload, yielding a phase transition from persistent congestion to stable operation near capacity. We further provide an inference-time algorithmic instantiation of SWA that does not require parameter updates or multi-agent reinforcement learning, and use a multi-agent simulation to empirically validate the predicted threshold behavior.

[MA-8] RoboSolver: A Multi-Agent Large Language Model Framework for Solving Robotic Arm Problems

【速读】:该论文旨在解决机器人操作中复杂运动学与动力学问题的自动化分析与求解难题,特别是如何高效地从自然语言或视觉输入中提取任务需求并生成精确的机器人运动控制方案。其解决方案的关键在于构建一个基于大语言模型(LLM)和视觉语言模型(VLM)的智能多智能体框架,该框架能够融合文本与视觉信息,自动执行正逆运动学计算、关键点速度与加速度分析、3D仿真建模及模拟环境下的运动控制,从而实现端到端的机器人任务解析与执行。实验表明,该框架在多个基准测试中显著优于原始模型,尤其在结合GPT-4o与Gemini 2.5 Pro时展现出卓越的准确性与鲁棒性。

链接: https://arxiv.org/abs/2602.14438
作者: Hamid Khabazi,Ali F. Meghdari,Alireza Taheri
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This study proposes an intelligent multi-agent framework built on LLMs and VLMs and specifically tailored to robotics. The goal is to integrate the strengths of LLMs and VLMs with computational tools to automatically analyze and solve problems related to robotic manipulators. Our developed framework accepts both textual and visual inputs and can automatically perform forward and inverse kinematics, compute velocities and accelerations of key points, generate 3D simulations of the robot, and ultimately execute motion control within the simulated environment, all according to the user’s query. To evaluate the framework, three benchmark tests were designed, each consisting of ten questions. In the first benchmark test, the framework was evaluated while connected to GPT-4o, DeepSeek-V3.2, and Claude-Sonnet-4.5, as well as their corresponding raw models. The objective was to extract the forward kinematics of robots directly from textual descriptions. The results showed that the framework integrated with GPT-4o achieved the highest accuracy, reaching 0.97 in computing the final solution, whereas the raw model alone attained an accuracy of only 0.30 for the same task. Similarly, for the other two models, the framework consistently outperformed the corresponding raw models in terms of accuracy. The second benchmark test was identical to the first, except that the input was provided in visual form. In this test, the GPT-4o LLM was used alongside the Gemini 2.5 Pro VLM. The results showed that the framework achieved an accuracy of 0.93 in obtaining the final answer, which is approximately 20% higher than that of the corresponding raw model. The third benchmark test encompassed a range of robotic tasks, including simulation, control, velocity and acceleration computation, as well as inverse kinematics and Jacobian calculation, for which the framework achieved an accuracy of 0.97.

[MA-9] Noncooperative Virtual Queue Coordination via Uncertainty-Aware Correlated Equilibria

【速读】:该论文旨在解决机场地面拥堵问题,同时保持航空公司对飞机推出(pushback)决策的自主权。传统方法中,中央协调者虽可调控整体推出容量,但无法直接干预具体航班的推出顺序,从而限制了系统性能优化能力。解决方案的关键在于提出一种基于相关均衡(correlated equilibrium)的非合作协调机制,通过提供激励相容(incentive-compatible)的推荐策略,在不剥夺航空公司自主权的前提下引导其行为以提升整体效率;进一步引入机会约束(chance constraints)以应对航空公司内部成本评估的不确定性,从而为激励兼容性提供显式的概率保障,并允许协调者根据需求调整置信水平。此外,论文设计了一种利用低秩结构的可扩展算法,使该方法能够处理每小时最多210架次的有效推出场景,相比当前先到先服务(FCFS)方案,延迟减少约8.9%,并揭示了置信度、鲁棒性与成本效率之间的权衡关系。

链接: https://arxiv.org/abs/2602.14436
作者: Jaehan Im,David Fridovich-Keil,Ufuk Topcu
机构: 未知
类目: ystems and Control (eess.SY); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Collaborative virtual queueing has been proposed as a mechanism to mitigate airport surface congestion while preserving airline autonomy over aircraft-level pushback decisions. A central coordinator can regulate aggregate pushback capacity but cannot directly control which specific aircraft are released, limiting its ability to steer system-level performance. We propose a noncooperative coordination mechanism for collaborative virtual queueing based on the correlated equilibrium concept, which enables the coordinator to provide incentive-compatible recommendations on aircraft-level pushback decisions without overriding airline autonomy. To account for uncertainty in airlines’ internal cost assessments, we introduce chance constraints into the correlated equilibrium formulation. This formulation provides explicit probabilistic guarantees on incentive compatibility, allowing the coordinator to adjust the confidence level with which airlines are expected to follow the recommended actions. We further propose a scalable algorithm for computing chance-constrained correlated equilibria by exploiting a reduced-rank structure. Numerical experiments demonstrate that the proposed method scales to realistic traffic levels up to 210 eligible pushbacks per hour, reduces accumulated delay by up to approximately 8.9% compared to current first-come-first-served schemes, and reveals a trade-off between confidence level, deviation robustness, and achievable cost efficiency.

[MA-10] LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agent ic Programming in Command-Line Interfaces

【速读】:该论文旨在解决当前AI辅助编程基准测试中存在的三大核心问题:任务周期短、数据污染(源于GitHub爬取)以及缺乏细粒度评估指标,这些问题导致现有方法难以有效衡量代理在真实软件工程场景中所需的长期规划与执行能力。解决方案的关键在于提出LongCLI-Bench这一综合性基准,其包含从1000余份计算机科学作业和真实工作流中筛选出的20个高质量长周期任务,覆盖从零开发、功能添加、缺陷修复到重构四类工程场景;同时设计双集测试协议(fail-to-pass与pass-to-pass)以量化需求满足与回归规避能力,并引入步骤级评分机制定位执行失败点。实验表明,即使是最先进的代理在LongCLI-Bench上的通过率也低于20%,且多数任务在30%完成前停滞,凸显早期阶段故障是主要瓶颈;相比之下,人类-代理协作(如计划注入和交互引导)显著优于自我修正策略,揭示未来研究需聚焦于人机协同工作流的设计与优化。

链接: https://arxiv.org/abs/2602.14337
作者: Yukang Feng,Jianwen Sun,Zelai Yang,Jiaxin Ai,Chuanhao Li,Zizhen Li,Fanrui Zhang,Kang He,Rui Ma,Jifan Lin,Jie Sun,Yang Xiao,Sizhuo Zhou,Wenxiao Wu,Yiming Liu,Pengfei Liu,Yu Qiao,Shenglin Zhang,Kaipeng Zhang
机构: NKU; SII; Shanda AI Research Tokyo; Shanghai AI Laboratory; SJTU
类目: oftware Engineering (cs.SE); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces, however, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents’ planning and execution capabilities to overcome key challenges in long-horizon task performance.

[MA-11] Offline Learning of Nash Stable Coalition Structures with Possibly Overlapping Coalitions AAMAS

【速读】:该论文旨在解决具有重叠联盟(overlapping coalitions)和部分信息(partial information)条件下的联盟形成问题,即在自私个体可能同时参与多个联盟、且其偏好初始未知的情况下,如何利用固定离线数据集中的历史交互与效用反馈信息,高效推断出个体偏好并恢复近似纳什稳定(Nash stable)的联盟划分。解决方案的关键在于:区分两种效用反馈类型——个体级(agent-level)与联盟级(coalition-level)反馈,并基于不同反馈模式下数据集的信息覆盖充分性假设,设计样本复杂度低的离线学习算法。研究证明,在个体级反馈下,存在一个充分必要条件可保证算法以样本效率获得近似纳什稳定的联盟划分;而在联盟级反馈下则需更强假设,但两类算法均在多种场景下达到最优样本复杂度(至多对数因子差距)。

链接: https://arxiv.org/abs/2602.14321
作者: Saar Cohen
机构: Bar Ilan University (巴伊兰大学); University of Oxford (牛津大学)
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: To Appear in the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2026

点击查看摘要

Abstract:Coalition formation concerns strategic collaborations of selfish agents that form coalitions based on their preferences. It is often assumed that coalitions are disjoint and preferences are fully known, which may not hold in practice. In this paper, we thus present a new model of coalition formation with possibly overlapping coalitions under partial information, where selfish agents may be part of multiple coalitions simultaneously and their full preferences are initially unknown. Instead, information about past interactions and associated utility feedback is stored in a fixed offline dataset, and we aim to efficiently infer the agents’ preferences from this dataset. We analyze the impact of diverse dataset information constraints by studying two types of utility feedback that can be stored in the dataset: agent- and coalition-level utility feedback. For both feedback models, we identify assumptions under which the dataset covers sufficient information for an offline learning algorithm to infer preferences and use them to recover a partition that is (approximately) Nash stable, in which no agent can improve her utility by unilaterally deviating. Our additional goal is devising algorithms with low sample complexity, requiring only a small dataset to obtain a desired approximation to Nash stability. Under agent-level feedback, we provide a sample-efficient algorithm proven to obtain an approximately Nash stable partition under a sufficient and necessary assumption on the information covered by the dataset. However, under coalition-level feedback, we show that only under a stricter assumption is sufficient for sample-efficient learning. Still, in multiple cases, our algorithms’ sample complexity bounds have optimality guarantees up to logarithmic factors. Finally, extensive experiments show that our algorithm converges to a low approximation level to Nash stability across diverse settings.

[MA-12] DeepFusion: Accelerating MoE Training via Federated Knowledge Distillation from Heterogeneous Edge Devices

【速读】:该论文旨在解决资源受限设备上难以进行大规模混合专家(Mixture-of-Experts, MoE)模型联邦训练的问题,即传统联邦学习方法要求设备部署完整的本地MoE模型,而这对边缘设备的计算和存储资源构成巨大挑战。其解决方案的关键在于提出DeepFusion框架,通过联邦知识蒸馏(federated knowledge distillation)实现异构设备端LLM知识的融合,同时引入一种新颖的视图对齐注意力(View-Aligned Attention, VAA)模块,显式对齐本地LLM与全局MoE模型之间的预测视角,从而缓解因架构差异导致的“视图不匹配”问题,显著降低通信开销并提升模型性能。

链接: https://arxiv.org/abs/2602.14301
作者: Songyuan Li,Jia Hu,Ahmed M. Abdelmoniem,Geyong Min,Haojun Huang,Jiwei Huang
机构: Queen Mary University of London (伦敦玛丽女王大学); University of Exeter (埃克塞特大学); Huazhong University of Science and Technology (华中科技大学); China University of Petroleum (中国石油大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Index Terms: Large language models, Mixture-of-experts, Federated knowledge distillation, Edge device heterogeneity

点击查看摘要

Abstract:Recent Mixture-of-Experts (MoE)-based large language models (LLMs) such as Qwen-MoE and DeepSeek-MoE are transforming generative AI in natural language processing. However, these models require vast and diverse training data. Federated learning (FL) addresses this challenge by leveraging private data from heterogeneous edge devices for privacy-preserving MoE training. Nonetheless, traditional FL approaches require devices to host local MoE models, which is impractical for resource-constrained devices due to large model sizes. To address this, we propose DeepFusion, the first scalable federated MoE training framework that enables the fusion of heterogeneous on-device LLM knowledge via federated knowledge distillation, yielding a knowledge-abundant global MoE model. Specifically, DeepFusion features each device to independently configure and train an on-device LLM tailored to its own needs and hardware limitations. Furthermore, we propose a novel View-Aligned Attention (VAA) module that integrates multi-stage feature representations from the global MoE model to construct a predictive perspective aligned with on-device LLMs, thereby enabling effective cross-architecture knowledge distillation. By explicitly aligning predictive perspectives, VAA resolves the view-mismatch problem in traditional federated knowledge distillation, which arises from heterogeneity in model architectures and prediction behaviors between on-device LLMs and the global MoE model. Experiments with industry-level MoE models (Qwen-MoE and DeepSeek-MoE) and real-world datasets (medical and finance) demonstrate that DeepFusion achieves performance close to centralized MoE training. Compared with key federated MoE baselines, DeepFusion reduces communication costs by up to 71% and improves token perplexity by up to 5.28%.

[MA-13] A Multi-Agent Framework for Medical AI: Leverag ing Fine-Tuned GPT LLaMA and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗问答(Medical Question Answering, MQA)应用中面临的三大核心问题:验证能力弱、证据 grounding 不足以及置信度信号不可靠。解决方案的关键在于构建一个模块化的多智能体(multi-agent)医疗 QA 框架,通过分工协作实现可靠性提升:首先对三种代表性 LLM 家族(GPT、LLaMA 和 DeepSeek R1)进行医学数据微调,并筛选出性能最优的 DeepSeek R1 作为主干;随后设计三个专用智能体——临床推理代理(Clinical Reasoning agent,基于微调后的 LLaMA)生成结构化解释,证据检索代理(Evidence Retrieval agent)从 PubMed 获取最新文献支持,精炼代理(Refinement agent,基于 DeepSeek R1)优化答案清晰度与事实一致性;同时引入蒙特卡洛 dropout 和困惑度(perplexity)驱动的不确定性估计、以及基于 LIME/SHAP 的偏倚检测机制,形成多层次验证体系。实验证明该架构显著提升了准确率(87%)和证据相关性(约 0.80),并有效降低不确定性(困惑度降至 4.13),为可信赖、可解释且具扩展性的医疗 AI 系统提供了实用路径。

链接: https://arxiv.org/abs/2602.14158
作者: Naeimeh Nourmohammadi,Md Meem Hossain, TheAnh Han,Safina Showkat Ara,Zia Ush Shamszaman
机构: Teesside University (提塞德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 27 pages, 14 figures, 5 tables

点击查看摘要

Abstract:Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ±0.03; BLEU 0.098 -+ 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.

[MA-14] ruthful Reporting of Competence with Minimal Verification AAMAS2026

【速读】:该论文旨在解决在家庭考试场景中,如何设计激励机制以最小化评分偏差(即学生真实能力与其预期得分之间的差异),同时仅对有限数量的学生进行验证。其核心问题是:在保证诚实报告是占优策略且诚实参与者不会被惩罚的前提下,应如何最优地选择验证对象?解决方案的关键在于构建一种参数化的机制,在完美验证条件下可实现最优的偏倚-验证成本权衡;而在验证存在噪声时,则利用适当的评分规则(proper scoring rules)来构造具有良好(虽非最优)性能的诚实机制,从而在现实约束下平衡激励相容性与验证效率。

链接: https://arxiv.org/abs/2602.14076
作者: Reshef Meir,Jonathan Wagner,Omer Ben-Porat
机构: Technion (以色列理工学院)
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: Full version of a paper accepted to AAMAS 2026

点击查看摘要

Abstract:Suppose you run a home exam, where students should report their own scores but can cheat freely. You can, if needed, call a limited number of students to class and verify their actual performance against their reported score. We consider the class of mechanisms where truthful reporting is a dominant strategy, and truthful agents are never penalized – even off-equilibrium. How many students do we need to verify, in expectation, if we want to minimize the bias, i.e., the difference between agents’ competence and their expected grade? When perfect verification is available, we characterize the best possible tradeoff between these requirements and provide a simple parametrized mechanism that is optimal in the class for any distribution of agents’ types. When verification is noisy, the task becomes much more challenging. We show how proper scoring rules can be leveraged in different ways to construct truthful mechanisms with a good (though not necessarily optimal) tradeoff. Comments: Full version of a paper accepted to AAMAS 2026 Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA) Cite as: arXiv:2602.14076 [cs.GT] (or arXiv:2602.14076v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2602.14076 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.65109/CZFV1739 Focus to learn more DOI(s) linking to related resources

[MA-15] sting BDI-based Multi-Agent Systems using Discrete Event Simulation

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems)在仿真测试中因现实差距(reality gap)导致的测试有效性不足问题,尤其针对基于信念-欲望-意图(Belief Desire Intention, BDI)模型的智能体,其代码无法直接在仿真环境中运行,从而造成部署与仿真之间的一致性缺失。解决方案的关键在于将BDI智能体的控制流映射到离散事件仿真(Discrete Event Simulation, DES)框架中,实现不同粒度级别的集成,从而在不使用替代表示(surrogate representations)的前提下,使开发者能够在仿真环境中测试与实际部署相同的规范。研究通过开源原型工具(JaKtA与Alchemist)的整合验证了该方法的可行性,并表明不同的映射粒度会影响仿真的保真度(fidelity)。

链接: https://arxiv.org/abs/2602.13878
作者: Martina Baiardi,Samuele Burattini,Giovanni Ciatto,Danilo Pianini
机构: University of Bologna (博洛尼亚大学)
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-agent systems are designed to deal with open, distributed systems with unpredictable dynamics, which makes them inherently hard to test. The value of using simulation for this purpose is recognized in the literature, although achieving sufficient fidelity (i.e., the degree of similarity between the simulation and the real-world system) remains a challenging task. This is exacerbated when dealing with cognitive agent models, such as the Belief Desire Intention (BDI) model, where the agent codebase is not suitable to run unchanged in simulation environments, thus increasing the reality gap between the deployed and simulated systems. We argue that BDI developers should be able to test in simulation the same specification that will be later deployed, with no surrogate representations. Thus, in this paper, we discuss how the control flow of BDI agents can be mapped onto a Discrete Event Simulation (DES), showing that such integration is possible at different degrees of granularity. We substantiate our claims by producing an open-source prototype integration between two pre-existing tools (JaKtA and Alchemist), showing that it is possible to produce a simulation-based testing environment for distributed BDI agents, and that different granularities in mapping BDI agents over DESs may lead to different degrees of fidelity.

[MA-16] Modeling and Optimizing the Provisioning of Exhaustible Capabilities for Simultaneous Task Allocation and Scheduling AAMAS2026

【速读】:该论文旨在解决在长时间跨度下,使用异构机器人团队完成多项任务时所面临的任务分配与规划的计算挑战,特别是如何在电池能量和时间约束条件下合理分配可耗尽的资源(即“traits”)。其解决方案的关键在于提出了一种名为TRAITS的离线异构多机器人任务分配框架,其中引入了一个基于非线性规划的特质分配模块,能够优化联盟的特质供给速率,从而生成可行且时间高效的解;该模块通过结合特质供给速率来更精确地评估任务执行时间和makespan,并同时优化电池消耗,相较现有先进方法具备更高的准确性与可行性。

链接: https://arxiv.org/abs/2602.13866
作者: Jinwoo Park,Harish Ravichandar,Seth Hutchinson
机构: Georgia Institute of Technology (佐治亚理工学院); Northeastern University (东北大学)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注: Accepted at AAMAS 2026

点击查看摘要

Abstract:Deploying heterogeneous robot teams to accomplish multiple tasks over extended time horizons presents significant computational challenges for task allocation and planning. In this paper, we present a comprehensive, time-extended, offline heterogeneous multi-robot task allocation framework, TRAITS, which we believe to be the first that can cope with the provisioning of exhaustible traits under battery and temporal constraints. Specifically, we introduce a nonlinear programming-based trait distribution module that can optimize the trait-provisioning rate of coalitions to yield feasible and time-efficient solutions. TRAITS provides a more accurate feasibility assessment and estimation of task execution times and makespan by leveraging trait-provisioning rates while optimizing battery consumption – an advantage that state-of-the-art frameworks lack. We evaluate TRAITS against two state-of-the-art frameworks, with results demonstrating its advantage in satisfying complex trait and battery requirements while remaining computationally tractable.

[MA-17] From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents

【速读】:该论文试图解决的问题是:随着深度研究代理(deep research agent)生成科学报告的成本大幅降低,传统以事实准确性为核心的验证方式已不足以应对新型风险——即输出内容虽具备科学形式但其主张与证据之间的关联薄弱、缺失或误导,导致“可审计性”(auditability)成为新的瓶颈。解决方案的关键在于将主张级审计能力(claim-level auditability)作为设计和评估的核心目标,并提出可审计自主研究标准(Auditable Autonomous Research, AAR),通过四个可量化维度——溯源覆盖率(provenance coverage)、溯源合理性(provenance soundness)、矛盾透明度(contradiction transparency)和审计努力度(audit effort)——实现对生成内容的结构化审计;进一步引入语义溯源(semantic provenance)机制,构建持久可查询的溯源图谱,编码主张-证据关系(包括冲突),并在合成过程中集成协议化验证(protocolized validation),而非仅在发布后补救,从而支持大规模部署下的可信科研自动化。

链接: https://arxiv.org/abs/2602.13855
作者: Razeen A Rasheed,Somnath Banerjee,Animesh Mukherjee,Rima Hazra
机构: Indian Institute of Science (印度科学研究所); IIT Kharagpur (印度理工学院克勒格布尔分校); Cisco Systems (思科系统公司); TCG CREST (TCG CREST)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:A deep research agent produces a fluent scientific report in minutes; a careful reader then tries to verify the main claims and discovers the real cost is not reading, but tracing: which sentence is supported by which passage, what was ignored, and where evidence conflicts. We argue that as research generation becomes cheap, auditability becomes the bottleneck, and the dominant risk shifts from isolated factual errors to scientifically styled outputs whose claim-evidence links are weak, missing, or misleading. This perspective proposes claim-level auditability as a first-class design and evaluation target for deep research agents, distills recurring long-horizon failure modes (objective drift, transient constraints, and unverifiable inference), and introduces the Auditable Autonomous Research (AAR) standard, a compact measurement framework that makes auditability testable via provenance coverage, provenance soundness, contradiction transparency, and audit effort. We then argue for semantic provenance with protocolized validation: persistent, queryable provenance graphs that encode claim–evidence relations (including conflicts) and integrate continuous validation during synthesis rather than after publication, with practical instrumentation patterns to support deployment at scale.

[MA-18] DTBench: A Synthetic Benchmark for Document-to-Table Extraction

【速读】:该论文旨在解决文档到表格(Document-to-Table, Doc2Table)提取任务中缺乏能力感知型评估基准的问题,现有基准未能充分区分或覆盖Doc2Table所需的核心能力,如推理、冲突解决等间接提取能力。为应对这一挑战,作者提出一种基于反向表到文档(Table2Doc)范式的合成方法,设计多智能体协同生成工作流,从真实表格出发自动构造高质量文档数据集,从而构建出DTBench——一个包含两级能力分类体系(5大类、13子类)的合成基准。其关键创新在于通过自动化合成策略克服人工标注成本高、可扩展性差及能力覆盖不足的局限,为LLM在Doc2Table任务中的系统性评估提供了新路径。

链接: https://arxiv.org/abs/2602.13812
作者: Yuxiang Guo,Zhuoran Du,Nan Tang,Kezheng Tang,Congcong Ge,Yunjun Gao
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL-based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently understood, particularly for indirect extraction that requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities required in Doc2Table this http URL argue that a capability-aware benchmark is essential for systematic evaluation. However, constructing such benchmarks using human-annotated document-table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi-agent synthesis workflow to generate documents from ground-truth tables. Based on this approach, we present DTBench, a synthetic benchmark that adopts a proposed two-level taxonomy of Doc2Table capabilities, covering 5 major categories and 13 subcategories. We evaluate several mainstream LLMs on DTBench, and demonstrate substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at this https URL.

[MA-19] hunderAg ent: A Simple Fast and Program-Aware Agent ic Inference System

【速读】:该论文旨在解决当前多轮代理工作流(agentic workflows)在推理过程中因组件松散耦合导致的资源调度低效问题,特别是KV缓存(Key-Value Cache)利用率低下和工具执行环境管理不协调的问题。现有系统分别对每个请求独立调度LLM推理与工具调用,缺乏对整个工作流的端到端认知,从而造成资源浪费与性能瓶颈。其解决方案的关键在于提出ThunderAgent,通过将代理工作流抽象为统一的LLM Programs(LLM程序),实现对KV缓存、系统状态及外部工具资源(如磁盘内存、网络端口)的联合建模;在此基础上设计了程序感知调度器(program-aware scheduler)与工具资源管理器(tool resource manager),以提升KV缓存命中率、缓解内存不平衡并支持异步环境预准备,最终显著提升服务吞吐量与资源效率。

链接: https://arxiv.org/abs/2602.13692
作者: Hao Kang,Ziyang Li,Xinyu Yang,Weili Xu,Yinfang Chen,Junxiong Wang,Beidi Chen,Tushar Krishna,Chenfeng Xu,Simran Arora
机构: 未知
类目: Operating Systems (cs.OS); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language models(LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV cache and tool execution environments. To address the challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the system implementations of the whole ThunderAgent at: this https URL.

[MA-20] MAS-on-the-Fly: Dynamic Adaptation of LLM -based Multi-Agent Systems at Test Time

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)在部署后缺乏动态适应能力的问题,现有方法通常依赖人工设计或“一刀切”的自动化策略,难以应对复杂任务中的环境变化。其解决方案的关键在于提出 MASFly 框架,通过两个核心机制实现运行时自适应:一是检索增强的 SOP(Standard Operating Procedure)实例化机制,利用自构建的成功协作模式库,使 LLM 能为新查询定制化组装多智能体系统;二是基于经验引导的监督机制,由专门的 Watcher 智能体实时监控系统行为,并参考个性化经验池提供干预,从而提升执行过程的鲁棒性与适应性。

链接: https://arxiv.org/abs/2602.13671
作者: Guangyi Liu,Haojun Lin,Huan Zeng,Heng Wang,Quanming Yao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. However, existing works often rely on manual designs or “one-size-fits-all” automation, lacking dynamic adaptability after deployment. Inspired by how biological systems adapt, we introduce MASFly, a novel multi-agent framework enabling dynamic adaptation at test time. To adapt system generation, MASFly employs a retrieval-augmented SOP instantiation mechanism that leverages a self-constructed repository of successful collaboration patterns, enabling the LLM to assemble customized MASs for new queries. For adaptive execution, MASFly incorporates an experience-guided supervision mechanism, where a dedicated Watcher agent monitors system behaviors with reference to a personalized experience pool and provides real-time interventions. Extensive experiments demonstrate that MASFly achieves state-of-the-art performance, most notably a 61.7% success rate on the TravelPlanner benchmark, while exhibiting strong task adaptability and robustness.

[MA-21] Guided Collaboration in Heterogeneous LLM -Based Multi-Agent Systems via Entropy-Based Understanding Assessment and Experience Retrieval

【速读】:该论文旨在解决异构多智能体系统(Heterogeneous Multi-Agent Systems, HMAS)中因智能体能力差异导致的认知不匹配问题,尤其在强-弱智能体协作场景下,常出现强模型与弱模型协同效率低于弱-弱组合的反直觉现象,表明认知不对齐是限制异构协作的关键瓶颈。解决方案的核心在于提出一种基于熵的自适应引导框架(Entropy-Based Adaptive Guidance Framework),通过多维熵度量(涵盖表达、不确定性、结构、连贯性和相关性)量化弱智能体的理解状态,并动态调整引导强度(轻度、中度、重度),实现认知状态对齐;同时引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,保留成功协作经验以支持即时适应与长期学习,从而显著提升异构协作的有效性与稳定性。

链接: https://arxiv.org/abs/2602.13639
作者: Linlin Wang,Tianqing Zhu,Laiqiao Qin,Longxiang Gao,Wanlei Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:With recent breakthroughs in large language models (LLMs) for reasoning, planning, and complex task generation, artificial intelligence systems are transitioning from isolated single-agent architectures to multi-agent systems with collaborative intelligence. However, in heterogeneous multi-agent systems (HMAS), capability differences among agents give rise to consistent cognitive problems, where strong and weak models fail to contribute effectively. We define the collaboration as a strong-weak system. Through comprehensive experiments, we disclose a counterintuitive phenomenon in the strong-weak system: a strong-weak collaboration may under-perform weak-weak combinations, revealing that cognitive mismatching are key bottlenecks limiting heterogeneous cooperation. To overcome these challenges, we propose an Entropy-Based Adaptive Guidance Framework that dynamically aligns the guidance with the cognitive state of each agent. The framework quantifies the understanding of weak agents through multi-dimensional entropy metrics - covering expression, uncertainty, structure, coherence, and relevance - and adaptively adjusts the intensity of the guidance at light, moderate and intensive levels. Furthermore, a Retrieval-Augmented Generation (RAG) mechanism is incorporated to retain successful collaboration experiences, enabling both immediate adaptation and long-term learning. Extensive experiments on three benchmark datasets, GSM8K, MBPP, and CVRP demonstrate that our approach consistently enhances the effectiveness and stability of heterogeneous collaboration. The results highlight that adaptive guidance not only mitigates cognitive imbalance but also establishes a scalable pathway toward more robust, cooperative multi-agent intelligence.

[MA-22] A First Proof Sprint

【速读】:该论文旨在解决高复杂度数学研究问题在快速证明生成(proof sprint)过程中的可靠性与验证难题,尤其针对多代理协作下如何高效识别并修复逻辑漏洞的问题。其解决方案的关键在于引入结构感知的验证机制与分层切换策略:通过有向图分解(wiring-diagram decompositions)明确命题依赖关系以定位知识缺口,并结合对抗式验证(adversarial verification)、靶向修复(targeted repair)和显式溯源(explicit provenance)实现动态修订;同时区分数学有效性状态与质量控制(QC)验证状态,确保成果的透明性与可追溯性。该方法显著提升了压缩版证明冲刺中的可信度与校准精度。

链接: https://arxiv.org/abs/2602.13587
作者: Joseph Corneli
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 144 pages, 7 color images. Submission to First Proof February 2026 ( arXiv:2602.05192 , this https URL ), uploaded 20:07 Friday, 13 February 2026 Pacific Time (PT)

点击查看摘要

Abstract:This monograph reports a multi-agent proof sprint on ten research-level problems, combining rapid draft generation with adversarial verification, targeted repair, and explicit provenance. The workflow uses wiring-diagram decompositions of claim dependencies to localize gaps and coordinate reviewer-driven revisions. Final outcomes are heterogeneous but explicit: the manuscript distinguishes mathematical status from QC-validation status. Mathematically, Problem~3 has a validation-complete existence path under the scoped criterion used here (uniqueness/irreducibility treated as optional), Problem 5 is solved in a scope-limited form for F_O -local connective spectra, Problem 10 is conditional under clearly stated assumptions (with explicit necessity counterexamples when assumptions are dropped), and Problems 4 and 6 are partial with named remaining obligations in the general case (including an unconditional K_n result for Problem 6 with c_0 = 1/3 ). Problem 7 is treated as provisionally closed via the rotation-route theorem chain, pending independent ledger re-check. At the QC layer, Problems~7 and~9 have node-level validation artifacts but still contain unresolved verifier gaps. The main methodological result is that structure-aware verification and layer-switching strategies improve reliability and calibration in compressed proof sprints.

[MA-23] Verification of Robust Multi-Agent Systems AAMAS2026

【速读】:该论文旨在解决不确定环境下多智能体系统的鲁棒策略验证问题,即在系统转移概率不完全已知或受扰动、观测信息不完整以及存在对抗性代理的情况下,如何确保联盟能够满足时序规范。其解决方案的关键在于引入一种基于有界记忆策略的鲁棒模型检测方法,针对概率性且依赖观测的交替时间逻辑(Alternating-time Temporal Logic, ATL)扩展形式,定义了鲁棒性模型检测问题,并通过分析不同扰动定义下的计算复杂度,揭示了鲁棒性带来的计算代价,从而为在不确定环境中采用有界记忆策略提供了理论支持与可行性依据。

链接: https://arxiv.org/abs/2602.13405
作者: Raphaël Berthon,Joost-Pieter Katoen,Munyque Mittelmann,Aniello Murano
机构: 未知
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: This is an extended version of the paper with the same title that will appear in the proceedings of AAMAS 2026. This version contains a technical appendix with proof details

点击查看摘要

Abstract:Stochastic multi-agent systems are a central modeling framework for autonomous controllers, communication protocols, and cyber-physical infrastructures. In many such systems, however, transition probabilities are only estimated from data and may therefore be partially unknown or subject to perturbations. In this paper, we study the verification of robust strategies in stochastic multi-agent systems with imperfect information, in which coalitions must satisfy a temporal specification while dealing with uncertain system transitions, partial observation, and adversarial agents. By focusing on bounded-memory strategies, we introduce a robust variant of the model-checking problem for a probabilistic, observation-based extension of Alternating-time Temporal Logic. We characterize the complexity of this problem under different notions of perturbation, thereby clarifying the computational cost of robustness in stochastic multi-agent verification and supporting the use of bounded-memory strategies in uncertain environments.

[MA-24] G2CP: A Graph-Grounded Communication Protocol for Verifiable and Efficient Multi-Agent Reasoning

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems)中因依赖自然语言进行通信而引发的语义漂移(semantic drift)、幻觉传播(hallucination propagation)及令牌消耗效率低下等问题。其核心解决方案是提出一种基于图结构的通信协议 G2CP(Graph-Grounded Communication Protocol),通过将消息定义为图操作而非自由文本,使智能体之间交换显式的遍历命令、子图片段和更新操作,从而在共享知识图谱上实现可验证的推理轨迹,消除歧义并提升协作精度。该方案的关键在于以结构化图操作替代自然语言交互,显著降低通信开销、提高任务完成准确率,并支持完全可审计的推理链。

链接: https://arxiv.org/abs/2602.13370
作者: Karim Ben Khaled,Davy Monticolo
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-agent systems powered by Large Language Models face a critical challenge: agents communicate through natural language, leading to semantic drift, hallucination propagation, and inefficient token consumption. We propose G2CP (Graph-Grounded Communication Protocol), a structured agent communication language where messages are graph operations rather than free text. Agents exchange explicit traversal commands, subgraph fragments, and update operations over a shared knowledge graph, enabling verifiable reasoning traces and eliminating ambiguity. We validate G2CP within an industrial knowledge management system where specialized agents (Diagnostic, Procedural, Synthesis, and Ingestion) coordinate to answer complex queries. Experimental results on 500 industrial scenarios and 21 real-world maintenance cases show that G2CP reduces inter-agent communication tokens by 73%, improves task completion accuracy by 34% over free-text baselines, eliminates cascading hallucinations, and produces fully auditable reasoning chains. G2CP represents a fundamental shift from linguistic to structural communication in multi-agent systems, with implications for any domain requiring precise agent coordination. Code, data, and evaluation scripts are publicly available.

[MA-25] Robust Mean-Field Games with Risk Aversion and Bounded Rationality

【速读】:该论文旨在解决传统均值场博弈(mean-field game)方法在实际应用中的两大局限性:一是假设初始群体分布固定且参与者完全理性,这导致模型在面对分布不确定性(distributional uncertainty)和认知约束时鲁棒性不足;二是缺乏对风险规避(risk aversion)和有限理性(bounded rationality)的建模能力。解决方案的关键在于提出一种新的均衡概念——均值场风险规避量化响应均衡(mean-field risk-averse quantal response equilibrium, MF-RQE),其核心创新在于同时引入对初始群体分布的风险厌恶机制与基于量化响应函数(quantal response function)的有限理性建模,从而更真实地刻画现实多智能体系统中决策者的行为偏差与不确定性。作者进一步证明了MF-RQE的存在性及迭代算法(如不动点法和虚拟博弈)的收敛性,并设计了一种适用于大规模状态-动作空间的可扩展强化学习算法,数值实验表明MF-RQE策略相较于经典均值场方法在鲁棒性上具有显著优势。

链接: https://arxiv.org/abs/2602.13353
作者: Bhavini Jeloka,Yue Guan,Panagiotis Tsiotras
机构: 未知
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
备注: 25 pages, 2 figures

点击查看摘要

Abstract:Recent advances in mean-field game literature enable the reduction of large-scale multi-agent problems to tractable interactions between a representative agent and a population distribution. However, existing approaches typically assume a fixed initial population distribution and fully rational agents, limiting robustness under distributional uncertainty and cognitive constraints. We address these limitations by introducing risk aversion with respect to the initial population distribution and by incorporating bounded rationality to model deviations from fully rational decision-making agents. The combination of these two elements yields a new and more general equilibrium concept, which we term the mean-field risk-averse quantal response equilibrium (MF-RQE). We establish existence results and prove convergence of fixed-point iteration and fictitious play to MF-RQE. Building on these insights, we develop a scalable reinforcement learning algorithm for scenarios with large state-action spaces. Numerical experiments demonstrate that MF-RQE policies achieve improved robustness relative to classical mean-field approaches that optimize expected cumulative rewards under a fixed initial distribution and are restricted to entropy-based regularizers.

[MA-26] BLUEPRINT Rebuilding a Legacy: Multimodal Retrieval for Complex Engineering Drawings and Documents

【速读】:该论文旨在解决工程图纸与技术文档长期存档于遗留系统中时,因元数据不一致或缺失导致检索困难且高度依赖人工的问题。其核心挑战在于如何从大量未标注的工程文件中自动提取结构化信息并实现跨设施的高效检索。解决方案的关键在于提出一个布局感知的多模态检索系统Blueprint:它通过检测图纸中的标准区域(如标题栏、图例等),在限定区域内应用视觉语言模型(Vision-Language Model, VLM)进行OCR识别,对标识符(如DWG编号、部件号、设施名)进行归一化处理,并融合词法检索与密集向量检索,再引入轻量级区域级重排序器提升精度。该方法在约77万份未标注文件上部署后,可自动生成适用于跨设施搜索的结构化元数据,在5000文件基准测试中显著优于最强视觉-语言基线,Success@3提升10.1%,nDCG@3相对改进18.9%。

链接: https://arxiv.org/abs/2602.13345
作者: Ethan Seefried,Ran Eldegaway,Sanjay Das,Nathaniel Blanchard,Tirthankar Ghosal
机构: 未知
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: 20 pages 8 main + 12 appendix + references

点击查看摘要

Abstract:Decades of engineering drawings and technical records remain locked in legacy archives with inconsistent or missing metadata, making retrieval difficult and often manual. We present Blueprint, a layout-aware multimodal retrieval system designed for large-scale engineering repositories. Blueprint detects canonical drawing regions, applies region-restricted VLM-based OCR, normalizes identifiers (e.g., DWG, part, facility), and fuses lexical and dense retrieval with a lightweight region-level reranker. Deployed on ~770k unlabeled files, it automatically produces structured metadata suitable for cross-facility search. We evaluate Blueprint on a 5k-file benchmark with 350 expert-curated queries using pooled, graded (0/1/2) relevance judgments. Blueprint delivers a 10.1% absolute gain in Success@3 and an 18.9% relative improvement in nDCG@3 over the strongest vision-language baseline, consistently outperforming across vision, text, and multimodal intents. Oracle ablations reveal substantial headroom under perfect region detection and OCR. We release all queries, runs, annotations, and code to facilitate reproducible evaluation on legacy engineering archives. Comments: 20 pages 8 main + 12 appendix + references Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA) Cite as: arXiv:2602.13345 [cs.LG] (or arXiv:2602.13345v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.13345 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-27] PeroMAS: A Multi-agent System of Perovskite Material Discovery

【速读】:该论文旨在解决当前生成式AI在钙钛矿材料发现中存在的问题,即现有方法多采用孤立的模型(如材料设计、工艺优化和性能预测),缺乏跨流程的物理约束传播机制,导致无法实现端到端的优化。其解决方案的关键在于提出了一种多智能体系统PeroMAS,通过将一系列钙钛矿专用工具封装为模型上下文协议(Model Context Protocols, MCPs),并基于任务规划与工具调用实现多目标约束下的材料设计全流程自动化,涵盖文献检索、数据提取、性质预测及机理分析等环节,从而显著提升材料发现效率,并已在真实合成实验中验证了其有效性。

链接: https://arxiv.org/abs/2602.13312
作者: Yishu Wang,Wei Liu,Yifan Li,Shengxiang Xu,Xujie Yuan,Ran Li,Yuyu Luo,Jia Zhu,Shimin Di,Min-Ling Zhang,Guixiang Li
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As a pioneer of the third-generation photovoltaic revolution, Perovskite Solar Cells (PSCs) are renowned for their superior optoelectronic performance and cost potential. The development process of PSCs is precise and complex, involving a series of closed-loop workflows such as literature retrieval, data integration, experimental design, and synthesis. However, existing AI perovskite approaches focus predominantly on discrete models, including material design, process optimization,and property prediction. These models fail to propagate physical constraints across the workflow, hindering end-to-end optimization. In this paper, we propose a multi-agent system for perovskite material discovery, named PeroMAS. We first encapsulated a series of perovskite-specific tools into Model Context Protocols (MCPs). By planning and invoking these tools, PeroMAS can design perovskite materials under multi-objective constraints, covering the entire process from literature retrieval and data extraction to property prediction and mechanism analysis. Furthermore, we construct an evaluation benchmark by perovskite human experts to assess this multi-agent system. Results demonstrate that, compared to single Large Language Model (LLM) or traditional search strategies, our system significantly enhances discovery efficiency. It successfully identified candidate materials satisfying multi-objective constraints. Notably, we verify PeroMAS’s effectiveness in the physical world through real synthesis experiments.

[MA-28] Adaptive Value Decomposition: Coordinating a Varying Number of Agents in Urban Systems

【速读】:该论文针对多智能体强化学习(Multi-agent Reinforcement Learning, MARL)在城市系统中应用时面临的两个核心问题展开研究:一是现有方法通常假设智能体数量固定且动作执行完全同步,这与现实场景中动态变化的活跃智能体数量和异步动作执行(即半多智能体强化学习,semi-MARL)不匹配;二是共享策略参数虽能提升学习效率,但易导致智能体在相似观测下产生高度同质化行为,损害协作质量。解决方案的关键在于提出自适应价值分解(Adaptive Value Decomposition, AVD)框架:一方面通过动态适应智能体群体变化来应对半MARL设置下的不确定性;另一方面引入轻量级机制抑制由共享策略引发的动作同质化,从而促进行为多样性并维持高效协作。此外,设计了适配异步决策的训练-执行策略,使AVD在真实共享单车调度任务中展现出优于现有最优基线的方法性能和泛化能力。

链接: https://arxiv.org/abs/2602.13309
作者: Yexin Li,Jinjin Guo,Haoyu Zhang,Yuhan Zhao,Yiwen Sun,Zihao Jiao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) provides a promising paradigm for coordinating multi-agent systems (MAS). However, most existing methods rely on restrictive assumptions, such as a fixed number of agents and fully synchronous action execution. These assumptions are often violated in urban systems, where the number of active agents varies over time, and actions may have heterogeneous durations, resulting in a semi-MARL setting. Moreover, while sharing policy parameters among agents is commonly adopted to improve learning efficiency, it can lead to highly homogeneous actions when a subset of agents make decisions concurrently under similar observations, potentially degrading coordination quality. To address these challenges, we propose Adaptive Value Decomposition (AVD), a cooperative MARL framework that adapts to a dynamically changing agent population. AVD further incorporates a lightweight mechanism to mitigate action homogenization induced by shared policies, thereby encouraging behavioral diversity and maintaining effective cooperation among agents. In addition, we design a training-execution strategy tailored to the semi-MARL setting that accommodates asynchronous decision-making when some agents act at different times. Experiments on real-world bike-sharing redistribution tasks in two major cities, London and Washington, D.C., demonstrate that AVD outperforms state-of-the-art baselines, confirming its effectiveness and generalizability.

[MA-29] Agent Mars: Multi-Agent Simulation for Multi-Planetary Life Exploration and Settlement

【速读】:该论文旨在解决太空探索与定居场景中多智能体系统(multi-agent system)在极端约束条件下的可审计协调问题,具体包括延迟/间歇性通信、资源极度稀缺、专家异构性以及严格的安全性、责任追溯和指挥权威要求。其核心挑战在于如何在安全关键的“系统之系统”(system-of-systems)环境中实现人类、机器人与数字服务之间的高效协作。解决方案的关键是提出Agent Mars框架,该框架通过93个代理构成的七层指挥与执行结构,实现了分层与跨层协同机制,保留指挥链的同时支持经审核的跨层交互并生成审计日志;同时具备动态角色交接与自动故障转移能力,并支持按任务阶段调整领导权,从而提升系统的鲁棒性与适应性。此外,Agent Mars还引入了情境感知的短/长期记忆、可配置的提议-投票共识机制及翻译器中介的异构协议,以模拟团队在压力下的对齐行为,最终通过提出的Agent Mars Performance Index(AMPI)量化评估协调效率,揭示出精细化跨层协作与功能型领导可显著降低开销而不牺牲可靠性。

链接: https://arxiv.org/abs/2602.13291
作者: Ziyang Wang
机构: 未知
类目: Multiagent Systems (cs.MA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has transformed robotics, healthcare, industry, and scientific discovery, yet a major frontier may lie beyond Earth. Space exploration and settlement offer vast environments and resources, but impose constraints unmatched on Earth: delayed/intermittent communications, extreme resource scarcity, heterogeneous expertise, and strict safety, accountability, and command authority. The key challenge is auditable coordination among specialised humans, robots, and digital services in a safety-critical system-of-systems. We introduce Agent Mars, an open, end-to-end multi-agent simulation framework for Mars base operations. Agent Mars formalises a realistic organisation with a 93-agent roster across seven layers of command and execution (human roles and physical assets), enabling base-scale studies beyond toy settings. It implements hierarchical and cross-layer coordination that preserves chain-of-command while allowing vetted cross-layer exchanges with audit trails; supports dynamic role handover with automatic failover under outages; and enables phase-dependent leadership for routine operations, emergencies, and science campaigns. Agent Mars further models mission-critical mechanisms-scenario-aware short/long-horizon memory, configurable propose-vote consensus, and translator-mediated heterogeneous protocols-to capture how teams align under stress. To quantify behaviour, we propose the Agent Mars Performance Index (AMPI), an interpretable composite score with diagnostic sub-metrics. Across 13 reproducible Mars-relevant operational scripts, Agent Mars reveals coordination trade-offs and identifies regimes where curated cross-layer collaboration and functional leadership reduce overhead without sacrificing reliability. Agent Mars provides a benchmarkable, auditable foundation for Space AI.

[MA-30] MAPLE: A Sub-Agent Architecture for Memory Learning and Personalization in Agent ic AI Systems AAMAS2026

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在个性化适应用户方面能力不足的问题,其根本原因在于现有架构将记忆(Memory)、学习(Learning)与个性化(Personalization)三者混为一谈,缺乏对三者差异性的结构化分离。解决方案的关键在于提出MAPLE框架,通过原则性分解实现三者的解耦:记忆模块负责存储与检索基础设施,学习模块异步提取交互数据中的智能知识,个性化模块则在有限上下文预算内实时应用所学知识;每个组件作为独立子代理运行,具备专用工具和明确接口,从而显著提升代理的适应性和个性化表现——实验表明,在MAPLE-Personas基准上相比无状态基线个性化得分提升14.6%(p < 0.01, Cohen’s d = 0.95),特质融入率从45%提升至75%。

链接: https://arxiv.org/abs/2602.13258
作者: Deepak Babu Piskala
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 12 pages, 5 figures. Accepted to ALA Workshop at AAMAS 2026. Code: []( this https URLhttps://github.com/prdeepakbabu/maple-framework)

点击查看摘要

Abstract:Large language model (LLM) agents have emerged as powerful tools for complex tasks, yet their ability to adapt to individual users remains fundamentally limited. We argue this limitation stems from a critical architectural conflation: current systems treat memory, learning, and personalization as a unified capability rather than three distinct mechanisms requiring different infrastructure, operating on different timescales, and benefiting from independent optimization. We propose MAPLE (Memory-Adaptive Personalized LEarning), a principled decomposition where Memory handles storage and retrieval infrastructure; Learning extracts intelligence from accumulated interactions asynchronously; and Personalization applies learned knowledge in real-time within finite context budgets. Each component operates as a dedicated sub-agent with specialized tooling and well-defined interfaces. Experimental evaluation on the MAPLE-Personas benchmark demonstrates that our decomposition achieves a 14.6% improvement in personalization score compared to a stateless baseline (p 0.01, Cohen’s d = 0.95) and increases trait incorporation rate from 45% to 75% – enabling agents that genuinely learn and adapt.

[MA-31] DPBench: Large Language Models Struggle with Simultaneous Coordination

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多智能体系统中面对资源竞争时的协调能力评估问题,特别是当多个LLM代理需要并发访问共享资源时,其是否能有效避免死锁(deadlock)并实现高效协作。现有研究缺乏能够系统测试此类场景的基准,因此作者提出了DPBench,一个基于哲学家就餐问题(Dining Philosophers Problem)的开源基准测试框架,涵盖八种不同条件(包括决策时机、群体规模和通信机制)。实验表明,尽管LLMs在顺序决策场景下表现良好,但在并发决策时死锁率超过95%,根本原因在于“收敛推理”(convergent reasoning)——即各代理独立推导出相同策略并在同时执行时必然导致死锁。值得注意的是,启用通信不仅未能缓解该问题,反而可能加剧死锁。因此,解决方案的关键在于:多智能体LLM系统若涉及并发资源访问,不应依赖模型自身涌现的协调能力,而需引入外部协调机制(如集中式调度或协议约束),以保障系统稳定性与可扩展性。

链接: https://arxiv.org/abs/2602.13255
作者: Najmul Hasan,Prashanth BusiReddyGari
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Large language models are increasingly deployed in multi-agent systems, yet we lack benchmarks that test whether they can coordinate under resource contention. We introduce DPBench, a benchmark based on the Dining Philosophers problem that evaluates LLM coordination across eight conditions that vary decision timing, group size, and communication. Our experiments with GPT-5.2, Claude Opus 4.5, and Grok 4.1 reveal a striking asymmetry: LLMs coordinate effectively in sequential settings but fail when decisions must be made simultaneously, with deadlock rates exceeding 95% under some conditions. We trace this failure to convergent reasoning, where agents independently arrive at identical strategies that, when executed simultaneously, guarantee deadlock. Contrary to expectations, enabling communication does not resolve this problem and can even increase deadlock rates. Our findings suggest that multi-agent LLM systems requiring concurrent resource access may need external coordination mechanisms rather than relying on emergent coordination. DPBench is released as an open-source benchmark. Code and benchmark are available at this https URL.

[MA-32] UAVGENT: A Language-Guided Distributed Control Framework

【速读】:该论文旨在解决多无人机系统在执行动态演变的高层任务时,如何在物理层保持形式化鲁棒性保证的问题。解决方案的关键在于提出了一种三层架构:首先由人类操作员发出自然语言指令;其次,基于大语言模型(Large Language Model, LLM)的监督模块周期性地根据最新状态和目标估计对任务进行解释、验证与修正;最后,分布式内环控制器仅使用局部相对信息跟踪由此产生的参考轨迹。理论分析证明了在有界扰动和由LLM更新引起的离散跳跃的分段光滑参考下,系统仍能实现稳定的跟踪性能,从而实现了集中式语言任务推理与分布式反馈控制的有机结合,达成复杂行为的可证明鲁棒性与稳定性。

链接: https://arxiv.org/abs/2602.13212
作者: Ziyi Zhang,Xiyu Deng,Guannan Qu,Yorie Nakahira
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:We study language-in-the-loop control for multi-drone systems that execute evolving, high-level missions while retaining formal robustness guarantees at the physical layer. We propose a three-layer architecture in which (i) a human operator issues natural-language instructions, (ii) an LLM-based supervisor periodically interprets, verifies, and corrects the commanded task in the context of the latest state and target estimates, and (iii) a distributed inner-loop controller tracks the resulting reference using only local relative information. We derive a theoretical guarantee that characterizes tracking performance under bounded disturbances and piecewise-smooth references with discrete jumps induced by LLM updates. Overall, our results illustrate how centralized language-based task reasoning can be combined with distributed feedback control to achieve complex behaviors with provable robustness and stability.

[MA-33] FactorMiner: A Self-Evolving Agent with Skills and Experience Memory for Financial Alpha Discovery

【速读】:该论文旨在解决量化投资中公式化Alpha因子挖掘(formulaic alpha factor mining)所面临的高冗余性和搜索空间庞大导致的新信号发现困难问题。其核心挑战在于,随着因子库规模扩大,大量重复或低效的因子使得创新性信号难以被有效识别。解决方案的关键在于提出FactorMiner框架,该框架基于Ralph Loop范式(检索、生成、评估、提炼),通过模块化技能架构(Modular Skill Architecture)将金融评估逻辑封装为可执行工具,并结合结构化的经验记忆(Experience Memory)从历史挖掘试验中提炼出成功模式与失败约束,从而在迭代过程中利用记忆先验引导探索方向,显著降低冗余搜索并聚焦于有潜力的因子空间,实现高质量、低冗余因子的持续积累与演化。

链接: https://arxiv.org/abs/2602.14670
作者: Yanlong Wang,Jian Xu,Hongkang Zhang,Shao-Lun Huang,Danny Dongning Sun,Xiao-Ping Zhang
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Formulaic alpha factor mining is a critical yet challenging task in quantitative investment, characterized by a vast search space and the need for domain-informed, interpretable signals. However, finding novel signals becomes increasingly difficult as the library grows due to high redundancy. We propose FactorMiner, a lightweight and flexible self-evolving agent framework designed to navigate this complex landscape through continuous knowledge accumulation. FactorMiner combines a Modular Skill Architecture that encapsulates systematic financial evaluation into executable tools with a structured Experience Memory that distills historical mining trials into actionable insights (successful patterns and failure constraints). By instantiating the Ralph Loop paradigm – retrieve, generate, evaluate, and distill – FactorMiner iteratively uses memory priors to guide exploration, reducing redundant search while focusing on promising directions. Experiments on multiple datasets across different assets and Markets show that FactorMiner constructs a diverse library of high-quality factors with competitive performance, while maintaining low redundancy among factors as the library scales. Overall, FactorMiner provides a practical approach to scalable discovery of interpretable formulaic alpha factors under the “Correlation Red Sea” constraint.

[MA-34] Supercritical Mass and Condensation in Fokker–Planck Equations for Consensus Formation

【速读】:该论文旨在解决非线性Fokker–Planck方程中因扩散系数在定义域边界处消失而导致的共识形成模型中出现的凝聚现象(condensation effects)问题,特别是当初始质量超过临界阈值时,解是否会在有限时间内发生集中(finite-time concentration)。其解决方案的关键在于证明:即使扩散函数的选取范围扩展至更广泛的类别(而不仅限于特定多项式形式),只要满足一定条件,超临界质量仍会导致有限时间内的正则性丧失(loss of regularity),并通过理论分析给出了诱导此类现象所需的临界质量的估计。

链接: https://arxiv.org/abs/2602.13276
作者: Monica Caloi,Mattia Zanella
机构: 未知
类目: Analysis of PDEs (math.AP); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO)
备注:

点击查看摘要

Abstract:Inspired by recently developed Fokker–Planck models for Bose–Einstein statistics, we study a consensus formation model with condensation effects driven by a polynomial diffusion coefficient vanishing at the domain boundaries. For the underlying kinetic model, given by a nonlinear Fokker–Planck equation with superlinear drift, it was shown that if the initial mass exceeds a critical threshold, the solution may exhibit finite-time concentration in certain parameter regimes. Here, we show that this supercritical mass phenomenon persists for a broader class of diffusion functions and provide estimates of the critical mass required to induce finite-time loss of regularity.

自然语言处理

[NLP-0] Symmetry in language statistics shapes the geometry of model representations

【速读】: 该论文旨在解决语言模型中高维词嵌入表示为何会自发形成简单几何结构(如月份呈圆形分布、年份构成一维流形等)这一基本问题。其解决方案的关键在于揭示了语言统计中的平移对称性——即两个词的共现概率仅依赖于它们的时间间隔——并证明这种对称性可直接决定嵌入空间中的几何结构。进一步地,作者指出这些结构在扰动共现统计(如移除特定共现句)和中等嵌入维度下依然存在,其鲁棒性源于底层连续潜变量对共现统计的集体调控机制,并通过词嵌入、文本嵌入及大语言模型的实证验证了该理论框架的有效性。

链接: https://arxiv.org/abs/2602.15029
作者: Dhruva Karkada,Daniel J. Korchinski,Andres Nava,Matthieu Wyart,Yasaman Bahri
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although learned representations underlie neural networks’ success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM representations: for example, calendar months organize into a circle, years form a smooth one-dimensional manifold, and cities’ latitudes and longitudes can be decoded by a linear probe. We show that the statistics of language exhibit a translation symmetry – e.g., the co-occurrence probability of two months depends only on the time interval between them – and we prove that the latter governs the aforementioned geometric structures in high-dimensional word embedding models. Moreover, we find that these structures persist even when the co-occurrence statistics are strongly perturbed (for example, by removing all sentences in which two months appear together) and at moderate embedding dimension. We show that this robustness naturally emerges if the co-occurrence statistics are collectively controlled by an underlying continuous latent variable. We empirically validate this theoretical framework in word embedding models, text embedding models, and large language models.

[NLP-1] Scaling Beyond Masked Diffusion Language Models

【速读】: 该论文旨在解决当前扩散语言模型(Diffusion Language Models)中不同算法架构在性能评估与实际应用之间存在偏差的问题,特别是针对掩码扩散(Masked Diffusion)是否为最优方案的争议。其关键解决方案在于首次系统性地开展统一状态(Uniform-state)和插值(Interpolating)离散扩散方法的缩放定律(Scaling Law)研究,并通过实证表明:尽管掩码扩散在困惑度(Perplexity)指标上表现优异,但其并非在所有场景下最优;例如,在GSM8K数学推理任务中,统一状态扩散模型虽困惑度较差,却展现出更强的实际性能优势。此外,作者提出一种简单的交叉熵训练目标即可使掩码扩散模型提升约12%的FLOPs效率,从而揭示了困惑度指标在跨算法比较中的局限性,并强调应综合考虑采样速度与质量的权衡(Speed-Quality Pareto Frontier)。

链接: https://arxiv.org/abs/2602.15014
作者: Subham Sekhar Sahoo,Jean-Marie Lemercier,Zhihan Yang,Justin Deschenaux,Jingyu Liu,John Thickstun,Ante Jukic
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: code: this https URL

点击查看摘要

Abstract:Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: this http URL

[NLP-2] xt Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation

【速读】: 该论文旨在解决文本风格迁移(Text Style Transfer, TST)中因缺乏平行语料库(即同一内容在不同风格下的对应文本)而导致模型训练困难的问题。其解决方案的关键在于利用回译(roundtrip translation)技术,从单语语料库中合成出风格标签对齐的平行数据,从而构建一个“中性化”文本作为共享输入风格,使模型在训练和推理阶段均能稳定地进行风格转换。该方法显著优于零样本提示(zero-shot prompting)和少量示例上下文学习(few-shot ICL)策略,并通过引入检索增强生成(Retrieval-Augmented Generation, RAG)进一步提升了术语与专有名词层面的风格一致性与鲁棒性。

链接: https://arxiv.org/abs/2602.15013
作者: Ruoxi Liu,Philipp Koehn
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures, 4 tables

点击查看摘要

Abstract:This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs roundtrip translation to synthesize such parallel datasets from monolingual corpora. This approach creates ‘neutralized’ text devoid of stylistic attributes, essentially creating a shared input style at training-time and inference-time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and fewshot ICL techniques measured by BLEU scores and style accuracy scores across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.

[NLP-3] Cold-Start Personalization via Training-Free Priors from Structured World Models

【速读】: 该论文旨在解决冷启动个性化(cold-start personalization)中的偏好获取问题,即在缺乏用户特定历史数据的情况下,如何通过有限的交互高效地推断用户偏好。核心挑战在于:每个任务涉及多个偏好维度,但个体用户仅关注其中少数维度,且哪些维度重要取决于具体用户;若无结构化提问策略,在有限提问预算下易遗漏关键维度。传统强化学习(Reinforcement Learning, RL)方法因无法利用偏好数据的因子分解结构(factored structure),导致策略退化为静态提问序列,忽视用户反馈。本文提出将冷启动偏好获取解耦为离线结构学习与在线贝叶斯推理两阶段:Pep(Preference Elicitation with Priors)首先从完整偏好样本中离线学习偏好相关性的结构化世界模型,随后在在线阶段执行无需训练的贝叶斯推理,动态选择信息量最大的问题并预测完整的偏好分布(包括未直接询问的维度)。其关键创新在于利用先验结构建模和贝叶斯更新机制,显著提升对用户响应差异的适应能力(39–62% vs. 0–28% 的后续策略调整率),同时以极低参数量(~10K vs. 8B)实现更高偏好对齐度(80.8% vs. 68.5%),证明冷启动个性化的瓶颈在于能否有效利用偏好数据的因子分解特性。

链接: https://arxiv.org/abs/2602.15012
作者: Avinandan Bose,Shuyue Stella Li,Faeze Brahman,Pang Wei Koh,Simon Shaolei Du,Yulia Tsvetkov,Maryam Fazel,Lin Xiao,Asli Celikyilmaz
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users’ stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.

[NLP-4] Counterfactual Fairness Evaluation of LLM -Based Contact Center Agent Quality Assurance System

【速读】: 该论文旨在解决生成式 AI(Generative AI)在客服中心质量评估(Quality Assurance, QA)中部署时存在的公平性问题,尤其是由于模型依赖大规模网络训练数据而可能引入的群体身份(Identity)、情境(Context)和行为风格(Behavioral Style)维度上的偏差。解决方案的关键在于提出并应用一种基于反事实推理的公平性评估框架,通过计算反事实翻转率(Counterfactual Flip Rate, CFR)和平均绝对评分差异(Mean Absolute Score Difference, MASD)来量化模型判断的不一致性与评分偏移。研究发现,尽管更大、对齐更优的模型表现出较低的不公平性,但公平性并不随准确性提升而改善;其中,历史绩效上下文提示引发最严重的不公平(CFR高达16.4%),且隐含语言身份线索仍是持续存在的偏见来源。此外,显式公平性提示仅带来有限改进,表明需建立标准化的公平性审计流程以保障高风险人力评估场景下的公正性。

链接: https://arxiv.org/abs/2602.14970
作者: Kawin Mayilvaghanan,Siddhant Gupta,Ayush Kumar
机构: Observe.AI(观察AI)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.

[NLP-5] ool-Aware Planning in Contact Center AI: Evaluating LLM s through Lineage-Guided Query Decomposition

【速读】: 该论文旨在解决在客服中心场景中,如何通过大语言模型(LLM)生成可执行的工具感知型计划(tool-aware plan),以回答业务洞察类查询的问题。其核心挑战在于将复杂查询分解为一系列依赖明确的步骤,这些步骤需适配结构化工具(如Text2SQL/Snowflake)和非结构化工具(如RAG/转录文本),并支持并行执行。解决方案的关键在于提出一个领域驱动的框架与基准测试集,包含三个核心贡献:(1) 基于参考的计划评估机制,涵盖七维指标(如工具-提示对齐度、查询符合度)及一次性匹配评估;(2) 通过评估器-优化器循环迭代优化数据标注流程,生成高质量的计划演化路径(plan lineage);(3) 对14种不同规模和架构的LLM进行大规模实证研究,发现当前模型在复合查询和多步计划上表现受限,且工具理解能力不足,尤其体现在工具-提示对齐和使用完整性方面。该框架为评估和改进代理式规划能力提供了可复现路径。

链接: https://arxiv.org/abs/2602.14955
作者: Varun Nathan,Shreyas Guha,Ayush Kumar
机构: Observe.AI (Observe.AI)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator-optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the “A+” tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.

[NLP-6] BFS-PO: Best-First Search for Large Reasoning Models

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在执行复杂推理任务时出现的“过度思考”(overthinking)问题,即模型生成冗长且计算成本高昂的推理链,导致效率低下和输出冗余。这一现象常因强化学习(Reinforcement Learning, RL)算法(如GRPO/DAPO)在训练过程中偏好更长的路径而加剧。解决方案的关键在于提出一种名为BFS-PO的新颖强化学习算法,其核心创新是引入基于最大熵节点的回溯机制(backtracking mechanism)与最佳优先搜索(Best-First Search, BFS)探索策略,从而在训练中主动寻找最短但正确的答案路径,使模型逐步学会生成更简洁高效的推理链,同时提升准确率。

链接: https://arxiv.org/abs/2602.14917
作者: Fiorenzo Parascandolo,Wenhui Tan,Enver Sangineto,Ruihua Song,Rita Cucchiara
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase of computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthinking is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.

[NLP-7] stimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

【速读】: 该论文旨在解决缺乏大规模、高质量且时间跨度长的意大利语对话式文本数据集的问题,以支持自然语言处理(NLP)中语言建模、领域适应和对话分析等任务。解决方案的关键在于构建并发布一个名为“Testimole-conversational”的语料库,其包含超过300亿词元(word-tokens),覆盖1996年至2024年间的意大利语讨论板消息,能够有效捕捉计算机中介交流(Computer-Mediated Communication, CMC)的多样性,为训练原生意大利语大语言模型(Large Language Models, LLMs)提供基础,并促进对数字环境中语言变异与社会现象的深入研究。

链接: https://arxiv.org/abs/2602.14819
作者: Matteo Rinaldi,Rossella Varvara,Viviana Patti
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present “Testimole-conversational” a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models’pre-training. Furthermore, discussion boards’ messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

[NLP-8] Learning State-Tracking from Code Using Linear RNNs

【速读】: 该论文旨在解决序列模型(如Transformer和RNN)在状态追踪任务中表现不一致的问题,特别是针对传统 permutation composition 任务与语言模型常用 next-token prediction 训练范式之间的不兼容性。其关键解决方案是将 permutation composition 转化为通过 REPL(Read-Eval-Print Loop)痕迹表示的代码形式,其中包含通过 print 操作显式暴露的状态信息和变量变换操作,从而构建一个可直接用于训练语言模型的任务设置。实验表明,线性 RNN 在此新设置下表现出色,而 Transformer 仍无法有效处理,揭示了状态追踪在代码结构中的复杂性,并进一步指出动作不可完全观测时,线性 RNN 的状态追踪能力可能弱于非线性 RNN。

链接: https://arxiv.org/abs/2602.14814
作者: Julien Siems,Riccardo Grazzi,Kirill Kalinin,Hitesh Ballani,Babak Rahmani
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Over the last years, state-tracking tasks, particularly permutation composition, have become a testbed to understand the limits of sequence models architectures like Transformers and RNNs (linear and non-linear). However, these are often sequence-to-sequence tasks: learning to map actions (permutations) to states, which is incompatible with the next-token prediction setting commonly used to train language models. We address this gap by converting permutation composition into code via REPL traces that interleave state-reveals through prints and variable transformations. We show that linear RNNs capable of state-tracking excel also in this setting, while Transformers still fail. Motivated by this representation, we investigate why tracking states in code is generally difficult: actions are not always fully observable. We frame this as tracking the state of a probabilistic finite-state automaton with deterministic state reveals and show that linear RNNs can be worse than non-linear RNNs at tracking states in this setup.

[NLP-9] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言如巴斯克语中进行非问答(non-question-answering, non-QA)物理常识推理任务时表现不足的问题。其解决方案的关键在于构建了首个面向巴斯克语的非QA物理常识推理数据集BasPhyCo,该数据集包含标准语和方言两种变体,并设计了三个层次的评估指标:叙事合理性判断(accuracy)、矛盾元素识别(consistency)以及物理状态可验证性判定(verifiability),从而系统性地评估LLMs在不同认知层级上的物理常识理解能力。

链接: https://arxiv.org/abs/2602.14812
作者: Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.

[NLP-10] Overthinking Loops in Agents : A Structural Risk via MCP Tools

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)代理在调用第三方工具时所面临的供应链攻击风险,特别是由恶意工具服务器诱导的结构性过思考攻击(structural overthinking attack)。此类攻击通过看似合理的工具调用组合形成循环轨迹,导致端到端token数量和延迟显著增加,而单个步骤并无异常表现。解决方案的关键在于识别并防御这种基于工具调用结构而非单纯token冗余的攻击模式,强调防御机制应深入分析工具链的拓扑关系与逻辑闭环,而非依赖解码阶段的简洁性控制策略。

链接: https://arxiv.org/abs/2602.14798
作者: Yohan Lee,Jisoo Jang,Seoyeon Choi,Sangyeop Kim,Seungtaek Choi
机构: Yonsei University (延世大学); Ewha Womans University (梨花女子大学); Hankuk University of Foreign Studies (韩国外国语大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages. We show that this convenience creates a supply-chain attack surface: a malicious MCP tool server can be co-registered alongside normal tools and induce overthinking loops, where individually trivial or plausible tool calls compose into cyclic trajectories that inflate end-to-end tokens and latency without any single step looking abnormal. We formalize this as a structural overthinking attack, distinguishable from token-level verbosity, and implement 14 malicious tools across three servers that trigger repetition, forced refinement, and distraction. Across heterogeneous registries and multiple tool-capable models, the attack causes severe resource amplification (up to 142.4\times tokens) and can degrade task outcomes. Finally, we find that decoding-time concision controls do not reliably prevent loop induction, suggesting defenses should reason about tool-call structure rather than tokens alone.

[NLP-11] A Geometric Analysis of Small-sized Language Model Hallucinations

【速读】: 该论文旨在解决小尺寸语言模型(small-sized LLMs)中幻觉(hallucinations)问题,即模型生成看似合理但事实错误的响应,尤其在多步骤或代理式(agentic)任务中严重影响可靠性。其解决方案的关键在于从嵌入空间(embedding space)的几何视角出发,提出并证明了一个核心假设:对同一提示(prompt)生成的多个响应中,真实响应在嵌入空间中具有更紧密的聚类特性。基于这一几何洞察,研究进一步展示了可实现一致的可分离性,并据此设计了一种标签高效(label-efficient)的传播方法,仅需30–50个标注样本即可对大规模响应集合进行分类,F1分数超过90%。该方法为幻觉检测提供了不同于传统知识中心和单响应评估的新范式。

链接: https://arxiv.org/abs/2602.14778
作者: Emanuele Ricco,Elia Onofri,Lorenzo Cima,Stefano Cresci,Roberto Di Pietro
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Hallucinations – fluent but factually incorrect responses – pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective, starting from the hypothesis that when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space, we prove this hypothesis and, leveraging this geometrical insight, we also show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30-50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2602.14778 [cs.CL] (or arXiv:2602.14778v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.14778 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-12] Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中因训练数据偏差而产生“涌现错位”(emergent misalignment)现象时,模型是否具备对其自身行为状态的自我认知能力这一问题。解决方案的关键在于通过顺序微调GPT-4.1模型至诱导和逆转错位状态,并在不提供上下文示例的情况下评估其自评表现,结果表明:错位模型会显著自我评定为更具危害性,从而证明模型具备对自身对齐状态的行为自知能力(behavioral self-awareness),且该自知能力可作为反映模型安全性的有效信号。

链接: https://arxiv.org/abs/2602.14777
作者: Laurène Vaugrante,Anietta Weckauff,Thilo Hagendorff
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed “emergent misalignment”. Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.

[NLP-13] Unlocking Reasoning Capability on Machine Translation in Large Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)在机器翻译(Machine Translation, MT)任务中因引入显式推理(explicit reasoning)而导致性能下降的问题。现有研究表明,尽管推理导向的大语言模型(Reasoning-oriented Large Language Models, RLMs)在数学和编码等任务中表现优异,但在MT中启用显式推理反而会系统性地降低翻译质量,原因在于其推理轨迹高度线性、缺乏自我修正与多路径探索能力。为应对这一任务特性不匹配问题,论文提出一种面向翻译的结构化推理框架,核心创新在于将推理过程分解为多步草稿(multi-step drafting)、充分性精炼(adequacy refinement)、流畅度优化(fluency improvement)以及选择性迭代修订(selective iterative revision),并通过构建合成的动态结构化推理数据集对模型进行后训练(post-training),从而显著优于标准微调和通用推理注入基线方法。

链接: https://arxiv.org/abs/2602.14763
作者: Sara Rajaee,Sebastian Vincent,Alexandre Berard,Marzieh Fadaee,Kelly Marchisio,Tom Kocmi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models’ performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.

[NLP-14] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

【速读】: 该论文旨在解决预训练大语言模型(Large Language Models, LLMs)中因自回归架构导致的输入-输出对齐偏差问题:在基于因果掩码(causal masking)的自回归Transformer中,残差连接(residual connections)将当前词元的激活信息与下一词元预测任务耦合,但当前词元未必是最具预测信息的表示,从而引发隐藏层表征从输入对齐向输出对齐转变的潜在错位现象。解决方案的关键在于提出一种轻量级残差路径缓解策略——残差衰减(residual attenuation),通过固定层干预或可学习门控机制抑制不必要信息传播,有效缓解表征错位并提升模型性能,是一种通用且高效的自回归Transformer架构改进方法。

链接: https://arxiv.org/abs/2602.14760
作者: Jonathan Lys,Vincent Gripon,Bastien Pasdeloup,Lukas Mauch,Fabien Cardinaux,Ghouthi Boukli Hacene
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.

[NLP-15] Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers University students and LLM -based digital twins

【速读】: 该论文旨在解决STEM(科学、技术、工程和数学)领域中个体态度形成机制的复杂性问题,特别是如何量化认知与情感因素在群体心智模式中的交互作用。其解决方案的关键在于构建行为型心智模式网络(behavioural forma mentis networks, BFMNs),将概念节点(如关键词和自由联想词)与经验关联边(empirical associative links)相结合,并标注每个概念的情感效价(valence),从而系统刻画不同教育阶段人群(高中生、大学生及早期STEM从业者)以及大型语言模型(LLM,GPT-oss)“数字孪生体”对STEM相关概念的心理表征结构。通过分析语义邻域(frames)的效价光环、情绪分布、网络重叠度(Jaccard相似性)和具体性水平,研究揭示了数学与统计等核心定量学科存在显著负面情绪氛围,且高焦虑子群体中尤为突出,表明STEM认知与情感之间存在不一致性;同时发现人类心智网络比LLM更紧密地耦合数学与焦虑,说明数字孪生虽能模拟文化态度,却难以复现基于真实教育经验的情境敏感性焦虑机制。

链接: https://arxiv.org/abs/2602.14749
作者: Francesco Gariboldi,Emma Franchino,Edith Haim,Gianluca Lattanzi,Alessandro Grecucci,Massimo Stella
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Attitudes toward STEM develop from the interaction of conceptual knowledge, educational experiences, and affect. Here we use cognitive network science to reconstruct group mindsets as behavioural forma mentis networks (BFMNs). In this case, nodes are cue words and free associations, edges are empirical associative links, and each concept is annotated with perceived valence. We analyse BFMNs from N = 994 observations spanning high school students, university students, and early-career STEM experts, alongside LLM (GPT-oss) “digital twins” prompted to emulate comparable profiles. Focusing also on semantic neighbourhoods (“frames”) around key target concepts (e.g., STEM subjects or educational actors/places), we quantify frames in terms of valence auras, emotional profiles, network overlap (Jaccard similarity), and concreteness relative to null baselines. Across student groups, science and research are consistently framed positively, while their core quantitative subjects (mathematics and statistics) exhibit more negative and anxiety related auras, amplified in higher math-anxiety subgroups, evidencing a STEM-science cognitive and emotional dissonance. High-anxiety frames are also less concrete than chance, suggesting more abstract and decontextualised representations of threatening quantitative domains. Human networks show greater overlapping between mathematics and anxiety than GPT-oss. The results highlight how BFMNs capture cognitive-affective signatures of mindsets towards the target domains and indicate that LLM-based digital twins approximate cultural attitudes but miss key context-sensitive, experience-based components relevant to replicate human educational anxiety.

[NLP-16] Rethinking the Role of LLM s in Time Series Forecasting

【速读】: 该论文旨在解决当前时间序列预测(Time Series Forecasting, TSF)中对大语言模型(Large Language Models, LLMs)有效性争议的问题,即现有研究常因评估范围有限而得出LLMs无显著提升的结论,但这些结论是否在更大规模和更复杂场景下依然成立尚不明确。解决方案的关键在于开展一项大规模实证研究,覆盖80亿观测数据、17种预测场景、4种预测时长以及跨域与域内设置,并系统比较不同对齐策略(预对齐 vs. 后对齐)、模型架构与预训练知识的作用机制。结果表明,LLM在跨域泛化能力上具有显著优势,且预对齐策略优于后对齐(超过90%任务),同时预训练知识与模型架构分别主导分布偏移适应与复杂时序动态建模,二者互补;尤其在混合分布场景下,完整LLM结构不可替代,通过token级路由分析与提示工程进一步验证其必要性。

链接: https://arxiv.org/abs/2602.14744
作者: Xin Qiu,Junlong Tong,Yirong Sun,Yunpu Ma,Wei Zhang,Xiaoyu Shen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that \emphLLM4TS indeed improves forecasting performance, with especially large gains in cross-domain generalization. Pre-alignment outperforming post-alignment in over 90% of tasks. Both pretrained knowledge and model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, Our findings overturn prior negative assessments, establish clear conditions under which LLMs are not only useful, and provide practical guidance for effective model design. We release our code at this https URL.

[NLP-17] LLM StructBench: Benchmarking Large Language Model Structured Data Extraction

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在从自然语言文本中提取结构化数据并生成有效JSON输出方面的评估问题,即如何系统性地衡量LLMs在结构化数据解析任务中的可靠性与有效性。其解决方案的关键在于提出一个名为LLMStructBench的新基准,该基准包含多样化且人工验证的解析场景,覆盖不同复杂度的任务,并支持对22个模型和五种提示策略(prompting strategies)进行系统测试;同时引入了兼顾词级别准确率与文档级别结构有效性的互补指标,从而揭示提示策略的选择比模型规模等因素更能显著影响解析结果的结构完整性,尤其对小型或低可靠性模型更具价值,但可能增加语义错误数量。

链接: https://arxiv.org/abs/2602.14743
作者: Sönke Tenckhoff,Mario Koddenbrock,Erik Rodner
机构: KI‑Werkstatt / FB2, University of Applied Sciences Berlin (柏林应用技术大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. This especially ensures structural validity for smaller or less reliable models but increase the number of semantic errors. Our benchmark suite is an step towards future research in the area of LLM applied to parsing or Extract, Transform and Load (ETL) applications. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) ACMclasses: E.2; H.1; H.3; H.4; I.2 Cite as: arXiv:2602.14743 [cs.CL] (or arXiv:2602.14743v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.14743 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-18] Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

【速读】: 该论文旨在解决开放权重大语言模型(open-weight large language models)中因预填充(prefill)机制引发的安全漏洞问题。预填充允许攻击者在生成开始前定义初始响应令牌,从而绕过模型的内置安全防护,但这一攻击向量此前缺乏系统性研究。论文的关键解决方案在于通过大规模实证研究,评估超过20种现有及新型预填充攻击策略在多个主流模型家族中的有效性,揭示了预填充攻击对所有主要当代开放权重模型均具持续有效性,且即使具备强大推理能力的模型也仅对通用预填充具有有限鲁棒性,而对定制化、模型特定的攻击仍高度脆弱。研究强调了模型开发者亟需将预填充防御纳入开放权重大语言模型的安全设计优先事项。

链接: https://arxiv.org/abs/2602.14689
作者: Lukas Struppek,Adam Gleave,Kellin Pelrine
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 54 pages, 7 figures, 35 tables

点击查看摘要

Abstract:As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.

[NLP-19] Crowdsourcing Piedmontese to Test LLM s on Non-Standard Orthography

【速读】: 该论文旨在解决濒危语言 Piedmontese(皮埃蒙特语)在自然语言处理(Natural Language Processing, NLP)研究中资源匮乏的问题,尤其是缺乏高质量的平行语料库和可评估的基准。解决方案的关键在于构建一个由145对意大利语-Piedmontese平行句子组成的众包数据集,这些句子来源于 Flores+ 平台,且翻译由母语者以自然书写风格完成,而非遵循标准化拼写规范,同时提供了人工标注的词对齐信息。该数据集被用于基准测试多种大语言模型(Large Language Models, LLMs)在分词一致性(tokenization parity)、主题分类和机器翻译任务上的表现,从而揭示了 Piedmontese 在分词层面存在劣势,但在分类任务中性能接近意大利语、法语和英语,而翻译方向则呈现不对称性:从 Piedmontese 到高资源语言的翻译效果良好,但从高资源语言到 Piedmontese 的生成仍具挑战。

链接: https://arxiv.org/abs/2602.14675
作者: Gianluca Vico,Jindřich Libovický
机构: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (查尔斯大学,数学与物理学院,形式与应用语言学研究所)
类目: Computation and Language (cs.CL)
备注: 17 pages, 6 figures, at VarDial20226

点击查看摘要

Abstract:We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.

[NLP-20] Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimers Disease Detection via Speech ICASSP2026

【速读】: 该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期诊断中因医疗数据稀缺与隐私限制导致的生成式 AI(Generative AI)模型数据效率低下的问题。解决方案的关键在于提出一种名为 FAL-AD 的新框架,其核心创新包括:1)通过基于语音转换的数据增强技术生成多样化的病理语音样本,提升绝对数据效率;2)采用自适应联邦学习机制,在保护隐私的前提下实现多机构协同优化,突破协作效率瓶颈;3)设计注意力驱动的跨模态融合模型,实现词级对齐与声学-文本交互,优化表征效率。该方案在 ADReSSo 数据集上达到 91.52% 的多模态准确率,显著优于集中式基线方法,为解决数据效率难题提供了实用路径。

链接: https://arxiv.org/abs/2602.14655
作者: Xiao Wei,Bin Wen,Yuqin Lin,Kai Li,Mingyang gu,Xiaobao Wang,Longbiao Wang,Jianwu Dang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figures, accepted by ICASSP 2026 conference

点击查看摘要

Abstract:Early diagnosis of Alzheimer’s Disease (AD) is crucial for delaying its progression. While AI-based speech detection is non-invasive and cost-effective, it faces a critical data efficiency dilemma due to medical data scarcity and privacy barriers. Therefore, we propose FAL-AD, a novel framework that synergistically integrates federated learning with data augmentation to systematically optimize data efficiency. Our approach delivers three key breakthroughs: First, absolute efficiency improvement through voice conversion-based augmentation, which generates diverse pathological speech samples via cross-category voice-content recombination. Second, collaborative efficiency breakthrough via an adaptive federated learning paradigm, maximizing cross-institutional benefits under privacy constraints. Finally, representational efficiency optimization by an attentive cross-modal fusion model, which achieves fine-grained word-level alignment and acoustic-textual interaction. Evaluated on ADReSSo, FAL-AD achieves a state-of-the-art multi-modal accuracy of 91.52%, outperforming all centralized baselines and demonstrating a practical solution to the data efficiency dilemma. Our source code is publicly available at this https URL.

[NLP-21] Is Information Density Uniform when Utterances are Grounded on Perception and Discourse? EACL2026

【速读】: 该论文旨在解决“信息密度均匀性(Uniform Information Density, UID)假设”在视觉语境下是否依然成立的问题,即语言在实际感知环境中是否仍遵循信息分布均匀的规律。以往研究仅基于纯文本数据,忽略了真实交流中图像等多模态线索对信息传递的影响。其解决方案的关键在于首次采用多语言视觉-语言模型,在30种语言的图像-标题数据和13种语言的视觉叙事数据上估计信息熵(surprisal),从而量化不同语言在视觉接地(visually grounded)条件下的信息分布均匀性。结果表明,视觉接地显著平滑了信息分布,提升了全局与局部的信息均匀度,尤其在话语单元起始处表现出最强的 surprisal 减少效应,支持了情境敏感型 UID 假设。

链接: https://arxiv.org/abs/2602.14653
作者: Matteo Gay,Coleman Haley,Mario Giulianelli,Edoardo Ponti
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted as main paper at EACL 2026

点击查看摘要

Abstract:The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.

[NLP-22] GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中因计算成本过高而受限的问题,尤其是现有层剪枝(layer pruning)方法难以同时兼顾剪枝效率与性能恢复的挑战。其解决方案的关键在于提出GradMAP方法,该方法包含两个阶段:第一阶段引入基于梯度幅值的新颖重要性度量指标,仅需单次反向传播即可完成每轮剪枝决策,显著提升剪枝效率;第二阶段通过分析剪枝后层均值偏移最大的部分,并引入一个简单但有效的投影补偿矩阵,在一步内校正模型输出分布的漂移,从而有效缓解剪枝导致的性能下降。实验表明,GradMAP在剪枝速度上平均提升4倍,且性能优于现有方法。

链接: https://arxiv.org/abs/2602.14649
作者: Hao Liu,Guangyan Li,Wensheng Zhang,Yongqiang Tang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Guangzhou University (广州大学)
类目: Computation and Language (cs.CL)
备注: 19 pages

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit strong reasoning abilities, but their high computational costs limit their practical deployment. Recent studies reveal significant redundancy in LLMs layers, making layer pruning an active research topic. Layer pruning research primarily focuses on two aspects: measuring layer importance and recovering performance after pruning. Unfortunately, the present works fail to simultaneously maintain pruning performance and efficiency. In this study, we propose GradMAP, a faster layer pruning method with \textbfGradient \textbfMetric \textbfAnd \textbfProjection compensation, which consists of two stages. In the first stage, we introduce a novel metric based on gradient magnitudes, enabling a global assessment of layer importance. Note that, it requires only a single backward propagation step per pruning decision, substantially enhancing pruning efficiency. In the second stage, we first analyze the layers with the largest mean shift resulting from pruning, and then incorporate a simple yet effective projection compensation matrix to correct this drift in one step. In this way, the degradation of model performance caused by layer pruning is effectively alleviated. Extensive experiments show that GradMAP outperforms previous layer pruning methods in both pruning speed (achieving an average 4\times speedup) and performance.

[NLP-23] he Wikidata Query Logs Dataset

【速读】: 该论文旨在解决现有Wikidata问答数据集规模小、依赖模板生成查询而导致语义多样性不足的问题。为构建更真实、更大规模的自然语言问题-查询对数据集,作者提出了一种基于代理(agent-based)的方法,其关键在于通过迭代式去匿名化、清洗和验证SPARQL查询,并结合Wikidata知识图谱进行校验,从而将实际用户发送到Wikidata Query Service的日志查询还原为可执行且语义明确的查询,并自动生成对应的自然语言问题。该方法显著提升了数据集的真实性和可用性,所构建的Wikidata Query Logs (WDQL) 数据集规模超过20万对样本,是此前最大同类数据集的6倍以上。

链接: https://arxiv.org/abs/2602.14594
作者: Sebastian Walter,Hannah Bast
机构: University of Freiburg(弗莱堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset’s benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.

[NLP-24] MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs

【速读】: 该论文旨在解决当前大型视觉语言模型(LVLMs)在复杂任务规划中对时序执行顺序(Temporal Execution Order, TEO)理解不足的问题。现有研究多依赖于自动标注、线性链式近似或仅文本输入,难以准确建模多步骤任务中各操作之间的依赖关系。解决方案的关键在于提出MATEO(Multimodal Temporal Execution Order)基准,其核心是构建了一个高质量、专业级的多模态食谱语料库,并通过可扩展的众包流程获取精确的TEO图结构标注,从而系统评估和提升LVLMs在真实场景下进行时序推理的能力。

链接: https://arxiv.org/abs/2602.14589
作者: Gabriel Roccabruna,Olha Khomyn,Giuseppe Riccardi
机构: University of Trento (特伦托大学); Amazon (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO, a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models’ understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.

[NLP-25] Assessing Large Language Models for Medical QA: Zero-Shot and LLM -as-a-Judge Evaluation

【速读】: 该论文旨在解决在资源受限环境中部署高效医疗问答(Medical QA)系统的问题,以提升医疗信息获取的可及性。其解决方案的关键在于通过零样本评估方法,在不进行专门微调的前提下,对比五种不同规模的大语言模型(LLMs)在iCliniq数据集上的表现,从而识别出在计算资源消耗与临床实用性之间取得最佳平衡的模型。研究发现,尽管大模型如Llama 3.3 70B Instruct表现出更强的性能,但Llama-4-Maverick-17B在效率与效果间展现出更优的权衡,突显了模型架构设计对实际部署的重要性,为未来轻量化、高临床效用的医学自然语言处理(NLP)应用提供了基准参考。

链接: https://arxiv.org/abs/2602.14564
作者: Shefayat E Shams Adib,Ahmed Alfey Sani,Ekramul Alam Esham,Ajwad Abrar,Tareque Mohmud Chowdhury
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted in 28th ICCIT, 2025

点击查看摘要

Abstract:Recently, Large Language Models (LLMs) have gained significant traction in medical domain, especially in developing a QA systems to Medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, containing 38,000 medical questions and answers of diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We are using a zero-shot evaluation methodology and using BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. It is notable that, Llama-4-Maverick-17B exhibited more competitive results, thus highlighting evasion efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in the real clinical environments. This benchmark aims to serve as a standardized setting for future study to minimize model size, computational resources and to maximize clinical utility in medical NLP applications.

[NLP-26] Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)微调过程中存在的一个关键问题:主流微调数据集通常以句子为单位设计,而模型优化机制却作用于token级别,这种不匹配导致token级噪声干扰训练过程,进而影响下游任务性能。解决方案的关键在于提出一种可解释的token级噪声过滤框架XTF,其核心是将token对微调过程的复杂贡献分解为三个明确属性——推理重要性(reasoning importance)、知识新颖性(knowledge novelty)和任务相关性(task relevance),通过评分机制识别并屏蔽噪声token的梯度更新,从而实现更高效的微调。实验表明,XTF在数学、代码和医学三类典型下游任务上相比常规微调可提升性能高达13.7%。

链接: https://arxiv.org/abs/2602.14536
作者: Yuchen Yang,Wenze Lin,Enhao Huang,Zhixuan Chu,Hongbin Zhou,Lan Tao,Yiming Li,Zhan Qin,Kui Ren
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence-level, which introduces token-level noise, causing negative influence to final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.

[NLP-27] Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLM s in Sinhala and Tamil

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在英语中表现出强大的数学推理能力,但这种能力是否真正体现了多语言推理能力,还是仅仅依赖于低资源语言(如僧伽罗语和泰米尔语)向英语的隐式翻译处理尚不明确。为解决这一问题,作者提出的关键解决方案是构建一个平行数据集,其中每道数学题均由具备数学训练的母语者在三种语言(英语、僧伽罗语、泰米尔语)中原生编写,从而避免因翻译造成的混淆,确保评估的是模型在目标语言中的真实推理能力。通过这一设计,研究发现基础算术推理在不同语言间具有较强迁移性,而复杂推理任务在泰米尔语和僧伽罗语中显著退化,且失败模式因模型和问题类型而异,表明所谓的多语言能力并不等同于跨语言一致的推理能力。

链接: https://arxiv.org/abs/2602.14517
作者: Sukumar Kishanthan,Kumar Thushalika,Buddhi Jayasekara,Asela Hevapathige
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset where each problem is natively authored by fluent speakers with mathematical training in all three languages. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.

[NLP-28] Parameter-Efficient Fine-Tuning of LLM s with Mixture of Space Experts

【速读】: 该论文旨在解决当前参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法主要局限于欧几里得空间(Euclidean space),从而难以捕捉语言数据中复杂几何结构的问题。现有方法即使引入可学习的曲率参数,仍受限于单一流形类型的表达能力。其解决方案的关键在于提出Mixture of Space (MoS) 框架,通过在统一架构中协同利用多种几何空间(如双曲空间用于层次结构、球面流形用于循环模式),实现更具表现力的曲率感知表示学习;在此基础上进一步构建MoSLoRA,将低秩适配(Low-Rank Adaptation, LoRA)扩展为包含异构几何专家的模型,并设计轻量级路由机制以动态选择或组合最优几何空间,显著提升下游任务性能并缓解频繁流形切换带来的计算开销。

链接: https://arxiv.org/abs/2602.14490
作者: Buze Zhang,Jinkai Tao,Zilang Zeng,Neil He,Ali Maatouk,Menglin Yang,Rex Ying
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable progress, with Parameter-Efficient Fine-Tuning (PEFT) emerging as a key technique for downstream task adaptation. However, existing PEFT methods mainly operate in Euclidean space, fundamentally limiting their capacity to capture complex geometric structures inherent in language data. While alternative geometric spaces, like hyperbolic geometries for hierarchical data and spherical manifolds for circular patterns, offer theoretical advantages, forcing representations into a single manifold type ultimately limits expressiveness, even when curvature parameters are learnable. To address this, we propose Mixture of Space (MoS), a unified framework that leverages multiple geometric spaces simultaneously to learn richer, curvature-aware representations. Building on this scheme, we develop MoSLoRA, which extends Low-Rank Adaptation (LoRA) with heterogeneous geometric experts, enabling models to dynamically select or combine appropriate geometric spaces based on input context. Furthermore, to address the computational overhead of frequent manifold switching, we develop a lightweight routing mechanism. Moreover, we provide empirical insights into how curvature optimization impacts training stability and model performance. Our experiments across diverse benchmarks demonstrate that MoSLoRA consistently outperforms strong baselines, achieving up to 5.6% improvement on MATH500 and 15.9% on MAWPS.

[NLP-29] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

【速读】: 该论文旨在解决低资源语言信息检索(Information Retrieval, IR)领域中高质量、任务特定标注数据集稀缺的问题。传统人工标注成本高且难以扩展,而直接利用大语言模型(Large Language Models, LLMs)进行自动化标注又存在标签可靠性、偏见及评估有效性等风险。其解决方案的关键在于提出一种BETA标注框架,通过多个来自不同模型家族的LLM annotators协作,结合上下文对齐、一致性检查与多数投票机制生成初始标签,并辅以人工评估验证标签质量;同时系统性地考察了跨语言数据复用的有效性,发现基于LLM的单跳机器翻译在不同语言间存在显著差异,反映出语义保留不一致和语言依赖偏差,从而揭示了跨语言数据复用的风险与局限。该研究为构建更可靠的低资源语言IR基准测试和评估流程提供了实证依据与实践指导。

链接: https://arxiv.org/abs/2602.14488
作者: Md. Najib Hasan,Mst. Jannatun Ferdous Rain,Fyad Mohammed,Nazmul Siddique
机构: Wichita State University (威奇托州立大学); Begum Rokeya University, Rangpur (贝古姆·罗基亚大学, 朗普尔); Khulna University of Engineering & Technology (库尔纳工程与技术大学); Ulster University (阿尔斯特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we experimented on meaning preservation and task validity between source and translated datasets. Our experiment reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.

[NLP-30] HyperRAG : Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation WWW’26

【速读】: 该论文旨在解决传统基于图结构的检索增强生成(Retrieval-Augmented Generation, RAG)方法在多跳开放域问答(multi-hop open-domain QA)中因依赖二元关系事实而导致的推理路径冗长、上下文无关信息引入以及关系表达能力受限的问题。其核心解决方案是提出HyperRAG框架,利用n-ary超图(n-ary hypergraph)建模高阶实体间复杂依赖关系,从而支持更高效、可解释的多跳推理。关键创新在于两个互补的检索变体:(i) HyperRetriever通过学习结构-语义联合推理来构建查询条件下的关系链,实现精准的事实追踪与自适应高阶遍历;(ii) HyperMemory借助大语言模型(LLM)的参数化记忆引导束搜索,动态评分n-ary事实和实体以进行查询感知的路径扩展。实验证明该方法在多个基准测试上显著优于现有基线,尤其在平均MRR提升2.95%、Hits@10提升1.23%方面表现突出。

链接: https://arxiv.org/abs/2602.14470
作者: Wen-Sheng Lien,Yu-Kai Chan,Hao-Lung Hsiao,Bo-Kai Ruan,Meng-Fen Chiang,Chien-An Chen,Yi-Ren Yeh,Hong-Han Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); E.SUN Bank (元大银行); National Kaohsiung Normal University (国立高雄师范大学)
类目: Computation and Language (cs.CL)
备注: Accepted by The ACM Web Conference 2026 (WWW '26)

点击查看摘要

Abstract:Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address this limitation, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM’s parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG’s effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.

[NLP-31] Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation

【速读】: 该论文旨在解决逆向思维链生成(Reverse Chain-of-Thought Generation, RCG)中出现的“事后合理化”(post-hoc rationalization)问题,即模型在生成推理过程时因可见答案而以答案为认知锚点,导致推理轨迹并非真实思考路径,而是围绕答案构建的伪解释。其核心问题是现有缓解策略——如语义抑制(semantic suppression)——虽能降低词汇层面的重叠,却意外加剧了熵空间和概率层面的锚定效应,源于对禁止答案的主动监控机制。解决方案的关键在于提出结构骨架引导推理(Structural Skeleton-guided Reasoning, SSR),采用两阶段范式:首先生成与答案无关的功能性结构骨架,再基于此骨架引导完整推理链生成,从而将信息流从答案监控转向结构规划;进一步引入蒸馏版SSR(SSR-D),通过教师模型生成的SSR轨迹微调学生模型,确保结构一致性。实验表明,SSR-D相较抑制基线提升达10%,且保持分布外(OOD)泛化能力。

链接: https://arxiv.org/abs/2602.14469
作者: Guangyue Peng,Zongchao Chen,Wen Luo,Yuntao Wen,Wei Li,Ruixiang Feng,Ran Le,Chen Yang,Zhenwei An,Yang Song,Tao Zhang,Houfeng Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reverse Chain-of-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy: lexical, entropic, and probabilistic anchoring, each captures surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, to find out its counterproduction: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.

[NLP-32] Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models LREC2026

【速读】: 该论文旨在解决现有偏见评估基准(如Bias Benchmark for Question-Answering, BBQ)在语言和文化多样性上的局限性,特别是在菲律宾语境下对生成式AI模型中性别歧视与恐同偏见的系统性检测不足的问题。解决方案的关键在于构建FilBBQ——一个面向菲律宾语环境的新型偏见测试基准,其核心创新包括:通过四阶段开发流程(模板分类、文化敏感翻译、新模板构建与提示生成)扩展了BBQ的语言覆盖范围;并采用多种子(multiple seeds)响应采样与平均偏差评分的稳健评估协议,有效缓解模型输出不稳定带来的测量误差,从而更准确地识别出与情绪、家庭角色、刻板 queer 兴趣及一夫多妻制相关的性别与性取向偏见。

链接: https://arxiv.org/abs/2602.14466
作者: Lance Calvin Lim Gamboa,Yue Feng,Mark Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted in LREC 2026

点击查看摘要

Abstract:With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ on models trained in Filipino but do so with a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models’ response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.

[NLP-33] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

【速读】: 该论文旨在解决快速演进的前沿人工智能(Frontier AI)模型所带来的前所未有的风险问题,特别是针对大语言模型(Large Language Models, LLMs)通用能力提升及代理型AI(agentic AI)普及所引发的五大关键风险维度:网络攻击、说服与操纵、战略欺骗、失控的AI研发(uncontrolled AI R&D)以及自我复制。其解决方案的关键在于提出并验证了一系列稳健的缓解策略,涵盖复杂场景下的风险评估(如LLM-to-LLM说服、新兴对齐偏差、自主扩展记忆与工具集的“误进化”行为)和实证验证(例如在Moltbook平台上对OpenClaw的安全性能监控),从而为前沿AI的安全部署提供初步的技术路径与可操作性框架。

链接: https://arxiv.org/abs/2602.14457
作者: Dongrui Liu,Yi Yu,Jie Zhang,Guanxu Chen,Qihao Lin,Hanxi Zhu,Lige Huang,Yijin Zhou,Peng Wang,Shuai Shao,Boxuan Zhang,Zicheng Liu,Jingwei Sun,Yu Li,Yuejin Xie,Jiaxuan Guo,Jia Xu,Chaochao Lu,Bowen Zhou,Xia Hu,Jing Shao
机构: Shanghai AI Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 49 pages, 17 figures, 12 tables

点击查看摘要

Abstract:To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As Large Language Models (LLMs) general capabilities rapidly evolve and the proliferation of agentic AI, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R\D, and self-replication. Specifically, we introduce more complex scenarios for cyber offense. For persuasion and manipulation, we evaluate the risk of LLM-to-LLM persuasion on newly released LLMs. For strategic deception and scheming, we add the new experiment with respect to emergent misalignment. For uncontrolled AI R\D, we focus on the ``mis-evolution’’ of agents as they autonomously expand their memory substrates and toolsets. Besides, we also monitor and evaluate the safety performance of OpenClaw during the interaction on the Moltbook. For self-replication, we introduce a new resource-constrained scenario. More importantly, we propose and validate a series of robust mitigation strategies to address these emerging threats, providing a preliminary technical and actionable pathway for the secure deployment of frontier AI. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.

[NLP-34] Precedent-Informed Reasoning : Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中存在的低效长链思维(long chain-of-thought)问题,即模型常因冗余的自我探索与验证导致计算成本增加甚至性能下降。其核心解决方案是提出预例引导推理(Precedent Informed Reasoning, PIR),将LLM的推理范式从无约束的自探索转变为基于历史相关案例的引导式学习。PIR的关键在于两个技术模块:一是自适应预例选择(Adaptive Precedent Selection, APS),通过语义相似度与模型困惑度联合评分筛选出对当前问题最相关的紧凑预例集合,并动态调整数量以最大化困惑度降低;二是测试时经验内化(Test-time Experience Internalization, TEI),将预例信息作为指令在测试时进行轻量级适配器更新,使模型能够内化解题模式并作为先验知识用于后续推理,从而在保持或提升准确率的同时显著缩短推理路径。

链接: https://arxiv.org/abs/2602.14451
作者: Qianyue Wang,Jinwu Hu,Huanxiang Lin,Bolin Chen,Zhiquan Wen,Yaofo Chen,Yu Rong,Mingkui Tan
机构: South China University of Technology (华南理工大学); Pazhou Laboratory (琶洲实验室); DAMO Academy, Alibaba Group (达摩院,阿里巴巴集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation, which inflate computational costs and even degrade performance. Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR) transforming LRMs’reasoning paradigm from exhaustive self-exploration to guided learning from precedents. PIR addresses two key challenges: what precedents to adopt and how to utilize them. First, Adaptive Precedent Selection (APS) constructs, for each question and LRM, a compact set of precedents that are both semantically related and informative for the model. It ranks examples by a joint score with semantic similarity and model perplexity, then adapts the amount of precedents to maximize perplexity reduction. Second, Test-time Experience Internalization (TEI) is treated as the test-time learning on precedent-informed instruction, updating lightweight adapters to internalize solution patterns and use them as a prior during subsequent reasoning. Experiments across mathematical reasoning, scientific QA, and code generation demonstrate that PIR consistently shortens reasoning traces while maintaining or improving final accuracy across LLMs, yielding outstanding accuracy-efficiency trade-offs.

[NLP-35] Selective Synchronization Attention

【速读】: 该论文旨在解决Transformer架构中自注意力机制存在的两个核心问题:一是计算复杂度为二次方(quadratic computational complexity),导致在长序列处理时效率低下;二是缺乏与生物神经计算(biological neural computation)的理论联系,难以解释其在认知系统中的合理性。解决方案的关键在于提出选择性同步注意力(Selective Synchronization Attention, SSA),它通过将每个token建模为具有可学习自然频率和相位的振子(oscillator),利用耦合振子系统的稳态解构建闭式(closed-form)注意力权重计算方式。该机制基于频率依赖的耦合和相位锁定条件自动产生稀疏注意力分布,无需显式掩码即可实现零权重分配;同时,自然频率谱统一编码位置与语义信息,消除了对独立位置编码的需求;此外,整个计算过程为单次闭式运算,避免了迭代常微分方程(ODE)求解,具备高效且结构清晰的优势。

链接: https://arxiv.org/abs/2602.14445
作者: Hasi Hays
机构: University of Arkansas (阿肯色大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The Transformer architecture has become the foundation of modern deep learning, yet its core self-attention mechanism suffers from quadratic computational complexity and lacks grounding in biological neural computation. We propose Selective Synchronization Attention (SSA), a novel attention mechanism that replaces the standard dot-product self-attention with a closed-form operator derived from the steady-state solution of the Kuramoto model of coupled oscillators. In SSA, each token is represented as an oscillator characterized by a learnable natural frequency and phase; the synchronization strength between token pairs, determined by a frequency-dependent coupling and phase-locking condition, serves as the attention weight. This formulation provides three key advantages: (i) natural sparsity arising from the phase-locking threshold, whereby tokens with incompatible frequencies automatically receive zero attention weight without explicit masking; (ii) unified positional-semantic encoding through the natural frequency spectrum, eliminating the need for separate positional encodings; and (iii) a single-pass, closed-form computation that avoids iterative ODE integration, with all components (coupling, order parameter, synchronization) derived from the oscillatory framework. We instantiate SSA within the Oscillatory Synchronization Network (OSN), a drop-in replacement for the Transformer block. Analysis of the synchronization matrices reveals non-uniform, head-diverse coupling patterns even at initialization, demonstrating a stronger architectural inductive bias than the approximately uniform attention produced by randomly initialized Transformers.

[NLP-36] LLM -Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning

【速读】: 该论文旨在解决当前时序知识图谱(Temporal Knowledge Graph, TKG)推理模型计算复杂度高、部署成本大,且现有压缩与蒸馏技术难以适配时序特性的问题。其关键解决方案在于提出一种基于大语言模型(Large Language Model, LLM)辅助的蒸馏框架,通过引入LLM作为额外的教学者,提供丰富的背景知识和时序感知信号,从而在不增加推理复杂度的前提下,使轻量级学生模型更有效地捕捉事件动态变化。训练过程中采用分阶段对齐策略,联合优化监督学习与蒸馏目标,实现双教师(时序教师+LLM)指导下的高效知识迁移。

链接: https://arxiv.org/abs/2602.14428
作者: Wang Xing,Wei Song,Siyu Lin,Chen Wu,Man Wang
机构: Xidian University (西安电子科技大学); Southwest Jiaotong University (西南交通大学); Chongqing Jiaotong University (重庆交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to performance degradation. We propose an LLM-assisted distillation framework specifically designed for temporal knowledge graph reasoning. Beyond a conventional high-capacity temporal teacher, we incorporate a large language model as an auxiliary instructor to provide enriched supervision. The LLM supplies broad background knowledge and temporally informed signals, enabling a lightweight student to better model event dynamics without increasing inference-time complexity. Training is conducted by jointly optimizing supervised and distillation objectives, using a staged alignment strategy to progressively integrate guidance from both teachers. Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a compact and efficient student model. The results highlight the potential of large language models as effective teachers for transferring temporal reasoning capability to resource-efficient TKG systems.

[NLP-37] WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中固有的幻觉(hallucination)问题,其根源被理论证明为嵌入空间(embedding space)与语义真值集合(semantic truth set)之间无法保持同构性所导致的逻辑一致性崩溃。解决方案的关键在于提出WavePhaseNet方法,通过离散傅里叶变换(Discrete Fourier Transform, DFT)显式构建语义概念层次结构(Semantic Conceptual Hierarchy Structure, SCHS),将序列维度上的语义信息分解为低频(全局语义与意图)和高频(局部语法与表达)成分,实现对语义的分层操控;同时结合共形一致性控制(cohomological consistency control),利用重叠局部窗口上的上同调正则化构造图结构与上链复形,以霍奇理论为基础的调和投影量化局部推理不一致性并作为可计算的正则化原则,从而提取最大一致性的全局表示,有效抑制幻觉并提升推理能力。

链接: https://arxiv.org/abs/2602.14419
作者: Kiyotaka Kasubuchi,Kazuo Fukiya
机构: Pionira Solutions LLC(皮奥尼拉解决方案有限责任公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper reformulates Transformer/Attention mechanisms in Large Language Models (LLMs) through measure theory and frequency analysis, theoretically demonstrating that hallucination is an inevitable structural limitation. The embedding space functions as a conditional expectation over a \sigma-algebra, and its failure to be isomorphic to the semantic truth set fundamentally causes logical consistency breakdown. WavePhaseNet Method The authors propose WavePhaseNet, which explicitly constructs a Semantic Conceptual Hierarchy Structure (SCHS) using Discrete Fourier Transform (DFT). By applying DFT along the sequence dimension, semantic information is decomposed into frequency bands: low-frequency components capture global meaning and intent, while high-frequency components represent local syntax and expression. This staged separation enables precise semantic manipulation in diagonalized space. Dimensionality Reduction GPT-4’s 24,576-dimensional embedding space exhibits a 1/f spectral structure based on language self-similarity and Zipf’s law. Through cumulative energy analysis, the authors derive that approximately 3,000 dimensions constitute the lower bound for “complete representation.” This demonstrates that reduction from 24,576 to 3,000 dimensions preserves meaning and intent while enabling rigorous reasoning and suppressing hallucination. Cohomological Consistency Control The reduced embedding space, constructed via cohomological regularization over overlapping local windows, allows defining a graph structure and cochain complex. This quantifies inconsistencies among local inferences as coboundary-based losses. Applying harmonic projection based on Hodge theory positions cohomology as a computable regularization principle for controlling semantic consistency, extracting maximally consistent global representations.

[NLP-38] ruthStance: An Annotated Dataset of Conversations on Truth Social

【速读】: 该论文旨在解决当前在线话语中观点形成与争议机制研究的局限性,特别是针对主流平台(如Twitter和Reddit)数据丰富而替代技术平台(alt-tech)对话结构研究不足的问题。其解决方案的关键在于构建了一个大规模、结构化的TruthStance数据集,涵盖Truth Social平台2023–2025年间的24,378篇帖子及523,360条评论,并保留了回复树结构;同时提供人工标注的1,500个实例用于论证挖掘(argument mining)和基于主张的立场检测(claim-based stance detection),并评估大语言模型(LLM)提示策略的有效性,最终释放额外的LLM生成标签以支持跨深度、主题和用户群体的立场与论证模式分析。

链接: https://arxiv.org/abs/2602.14406
作者: Fathima Ameen,Danielle Brown,Manusha Malgareddy,Amanul Haque
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Argument mining and stance detection are central to understanding how opinions are formed and contested in online discourse. However, most publicly available resources focus on mainstream platforms such as Twitter and Reddit, leaving conversational structure on alt-tech platforms comparatively under-studied. We introduce TruthStance, a large-scale dataset of Truth Social conversation threads spanning 2023-2025, consisting of 24,378 posts and 523,360 comments with reply-tree structure preserved. We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies. Using the best-performing configuration, we release additional LLM-generated labels for 24,352 posts (argument presence) and 107,873 comments (stance to parent), enabling analysis of stance and argumentation patterns across depth, topics, and users. All code and data are released publicly.

[NLP-39] Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

【速读】: 该论文旨在解决现有基于策略梯度(policy gradient)的自回归语言模型在处理复杂推理任务时存在的局限性问题,即传统方法逐token进行决策的方式难以捕捉推理过程中多个token共同构成的语义块(semantic block),导致优化目标与推理任务的本质结构不匹配。其解决方案的关键在于提出多token策略梯度优化(Multi-token Policy Gradient Optimization, MPO),将K个连续token视为一个统一的语义动作(semantic action),从而从块级别(block-level)建模推理轨迹的组合结构,并支持对更高层次、更连贯的目标进行优化,实验表明该方法在数学推理和编程基准上显著优于传统的token级策略梯度基线。

链接: https://arxiv.org/abs/2602.14386
作者: Mufan Xu,Kehai Chen,Xuefeng Bai,Zhengyu Niu,Muyun Yang,Tiejun Zhao,Min Zhang
机构: Harbin Institute of Technology (哈尔滨工业大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens–for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines, highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.

[NLP-40] Differentially Private Retrieval-Augmented Generation

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理包含敏感数据(如医疗记录或法律文档)时存在的隐私泄露风险问题。现有研究表明,通过设计对抗性提示(adversarial prompts),攻击者可迫使大语言模型(Large Language Models, LLMs)直接复现增强上下文中的私有信息,从而暴露数据库内容。为应对这一挑战,作者提出了一种名为DP-KSA的新型差分隐私(Differential Privacy, DP)保护RAG算法,其核心创新在于采用“提议-测试-释放”(propose-test-release)范式,在不显著损害任务性能的前提下实现隐私保障。关键机制是:首先基于查询获取多个相关上下文并生成响应,随后以差分隐私方式统计其中最频繁出现的关键词,并将这些关键词作为最终提示的一部分进行输出。此方法通过压缩语义空间,在保持问答准确性的同时有效防止敏感信息泄露,理论证明了其对RAG数据库提供严格的差分隐私保证,实验验证了其在隐私与效用之间取得良好平衡。

链接: https://arxiv.org/abs/2602.14374
作者: Tingting Tang,James Flemings,Yongqin Wang,Murali Annavaram
机构: University of Southern California(南加州大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a widely used framework for reducing hallucinations in large language models (LLMs) on domain-specific tasks by retrieving relevant documents from a database to support accurate responses. However, when the database contains sensitive corpora, such as medical records or legal documents, RAG poses serious privacy risks by potentially exposing private information through its outputs. Prior work has demonstrated that one can practically craft adversarial prompts that force an LLM to regurgitate the augmented contexts. A promising direction is to integrate differential privacy (DP), a privacy notion that offers strong formal guarantees, into RAG systems. However, naively applying DP mechanisms into existing systems often leads to significant utility degradation. Particularly for RAG systems, DP can reduce the usefulness of the augmented contexts leading to increase risk of hallucination from the LLMs. Motivated by these challenges, we present DP-KSA, a novel privacy-preserving RAG algorithm that integrates DP using the propose-test-release paradigm. DP-KSA follows from a key observation that most question-answering (QA) queries can be sufficiently answered with a few keywords. Hence, DP-KSA first obtains an ensemble of relevant contexts, each of which will be used to generate a response from an LLM. We utilize these responses to obtain the most frequent keywords in a differentially private manner. Lastly, the keywords are augmented into the prompt for the final output. This approach effectively compresses the semantic space while preserving both utility and privacy. We formally show that DP-KSA provides formal DP guarantees on the generated output with respect to the RAG database. We evaluate DP-KSA on two QA benchmarks using three instruction-tuned LLMs, and our empirical results demonstrate that DP-KSA achieves a strong privacy-utility tradeoff.

[NLP-41] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

【速读】: 该论文试图解决的问题是:在日益网络化的环境中,人工智能(AI)代理社会是否会出现类似于人类社会的收敛演化动力学。为回答这一问题,作者提出了首个针对大规模AI代理社会的系统性诊断方法,其关键在于构建了一个定量诊断框架,用于衡量语义稳定性、词汇更替率、个体惯性、影响力持续性以及集体共识等动态指标。通过该框架对Moltbook中AI代理社会的分析发现,尽管整体语义趋于稳定,但个体代理保持高度多样性与持续的词汇更新,且表现出强个体惯性和低适应性,导致相互影响微弱、影响力短暂,缺乏稳定的集体影响锚点和共享的社会记忆。这表明,仅靠规模扩大和交互密度提升不足以促成AI代理社会的真正社会化,为下一代AI代理社会的设计与分析提供了可操作的原则。

链接: https://arxiv.org/abs/2602.14299
作者: Ming Li,Xirui Li,Tianyi Zhou
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for upcoming next-generation AI agent societies.

[NLP-42] FMMD: A multimodal open peer review dataset based on F1000Research

【速读】: 该论文旨在解决当前自动化学术论文评审(ASPR)研究中因数据集局限性导致的实证进展受限问题。现有数据集普遍存在文本中心化倾向,缺乏对图表、复杂排版等多模态内容的覆盖,且主要聚焦于计算机科学领域,同时缺少审稿意见与稿件版本之间的精确对齐,难以刻画同行评审与稿件迭代之间的动态关系。解决方案的关键在于提出FMMD数据集——一个来自F1000Research的多模态、跨学科开放同行评审数据集,其核心创新在于整合了稿件级别的视觉与结构化数据,并实现了审稿意见与具体版本的显式对齐,从而支持对不同科学领域中同行评审生命周期的细粒度分析,为多模态问题检测和多模态审稿评论生成等任务提供全面的实证资源。

链接: https://arxiv.org/abs/2602.14285
作者: Zhenzhen Zhuang,Yuqing Fu,Jing Zhu,Zhangping Zhou,Jialiang Lin
机构: Guangzhou Institute of Science and Technology (广州科技学院); Xiamen University (厦门大学)
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation. In parallel, research on automated and AI-assisted peer review has proliferated. Despite this momentum, empirical progress remains constrained by several critical limitations in existing datasets. While reviewers routinely evaluate figures, tables, and complex layouts to assess scientific claims, most existing datasets remain overwhelmingly text-centric. This bias is reinforced by a narrow focus on data from computer science venues. Furthermore, these datasets lack precise alignment between reviewer comments and specific manuscript versions, obscuring the iterative relationship between peer review and manuscript evolution. In response, we introduce FMMD, a multimodal and multidisciplinary open peer review dataset curated from F1000Research. The dataset bridges the current gap by integrating manuscript-level visual and structural data with version-specific reviewer reports and editorial decisions. By providing explicit alignment between reviewer comments and the exact article iteration under review, FMMD enables fine-grained analysis of the peer review lifecycle across diverse scientific domains. FMMD supports tasks such as multimodal issue detection and multimodal review comment generation. It provides a comprehensive empirical resource for the development of peer review research.

[NLP-43] MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents

【速读】: 该论文旨在解决基于模型上下文协议(Model Context Protocol, MCP)的大型语言模型(Large Language Model, LLM)代理在调用第三方MCP服务器提供的工具时所面临的安全误配准问题,即代理默认信任由潜在不可信服务器暴露的工具,从而在工具调用生命周期中易受攻击。解决方案的关键在于提出MCPShield——一种插件式的安全认知层,其通过元数据引导的探测机制在工具调用前辅助代理形成安全认知,并在运行时约束执行边界、感知事件,调用后基于历史行为轨迹进行推理以更新安全认知,模拟人类对工具使用后的反思过程,从而有效缓解安全误配准并增强代理在开放代理生态系统中的安全性。

链接: https://arxiv.org/abs/2602.14281
作者: Zhenhong Zhou,Yuanhe Zhang,Hongwei Cai,Moayad Aloqaily,Ouns Bouachir,Linsey Pang,Prakhar Mehrotra,Kun Wang,Qingsong Wen
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 21 pages, 5 figures, 6 tables

点击查看摘要

Abstract:The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers. This openness introduces a security misalignment: agents implicitly trust tools exposed by potentially untrusted MCP servers. However, despite its excellent utility, existing agents typically offer limited validation for third-party MCP servers. As a result, agents remain vulnerable to MCP-based attacks that exploit the misalignment between agents and servers throughout the tool invocation lifecycle. In this paper, we propose MCPShield as a plug-in security cognition layer that mitigates this misalignment and ensures agent security when invoking MCP-based tools. Drawing inspiration from human experience-driven tool validation, MCPShield assists agent forms security cognition with metadata-guided probing before invocation. Our method constrains execution within controlled boundaries while cognizing runtime events, and subsequently updates security cognition by reasoning over historical traces after invocation, building on human post-use reflection on tool behavior. Experiments demonstrate that MCPShield exhibits strong generalization in defending against six novel MCP-based attack scenarios across six widely used agentic LLMs, while avoiding false positives on benign servers and incurring low deployment overhead. Overall, our work provides a practical and robust security safeguard for MCP-based tool invocation in open agent ecosystems.

[NLP-44] Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions

【速读】: 该论文旨在解决在有限的调查资源(如提问预算和参与预算)下,如何从部分响应的群体中高效获取信息以推断群体层面属性的问题。现有方法通常假设受访者池固定且不利用群体结构,难以应对数据缺失和响应不完整的情况。解决方案的关键在于提出一种自适应群体信息采集框架,结合两个核心机制:(i) 基于大语言模型(LLM)的期望信息增益目标函数,用于动态评估候选问题的信息价值;(ii) 异质图神经网络传播机制,通过聚合已观测响应与参与者特征来填补缺失值,并指导每轮的受访者选择。该闭环流程能够在小样本条件下实现对整体人群响应的准确预测,实验证明其在三个真实世界意见数据集上均优于基线方法,尤其在10%受访者预算下相较CES数据集提升12%相对精度。

链接: https://arxiv.org/abs/2602.14279
作者: Ruomeng Ding,Tianwei Gao,Thomas P. Zollo,Eitan Bachmat,Richard Zemel,Zhun Deng
机构: University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校); Columbia University(哥伦比亚大学); Ben-Gurion University of the Negev(本古里安大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Eliciting information to reduce uncertainty about latent group-level properties from surveys and other collective assessments requires allocating limited questioning effort under real costs and missing data. Although large language models enable adaptive, multi-turn interactions in natural language, most existing elicitation methods optimize what to ask with a fixed respondent pool, and do not adapt respondent selection or leverage population structure when responses are partial or incomplete. To address this gap, we study adaptive group elicitation, a multi-round setting where an agent adaptively selects both questions and respondents under explicit query and participation budgets. We propose a theoretically grounded framework that combines (i) an LLM-based expected information gain objective for scoring candidate questions with (ii) heterogeneous graph neural network propagation that aggregates observed responses and participant attributes to impute missing responses and guide per-round respondent selection. This closed-loop procedure queries a small, informative subset of individuals while inferring population-level responses via structured similarity. Across three real-world opinion datasets, our method consistently improves population-level response prediction under constrained budgets, including a 12% relative gain on CES at a 10% respondent budget.

[NLP-45] STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

【速读】: 该论文旨在解决推理时计算(Inference-Time-Compute, ITC)方法在生成高质量且多样化输出时面临的两大问题:一是基于高温度采样的策略难以实现有意义的输出多样性;二是现有方法对推理过程的控制能力有限,导致可解释性不足。解决方案的关键在于提出STATe-of-Thoughts(STATe),这是一种可解释的ITC框架,其核心是通过离散且可解释的文本干预来搜索高层次推理模式——具体而言,由控制器选择编码高级推理决策的动作,生成器根据这些动作生成推理步骤,评价器则对候选结果进行评分以引导搜索。这种结构化方法不仅提升了响应多样性,还实现了对推理路径的显式控制与解释,从而有效增强了生成内容的质量、多样性和可解释性。

链接: https://arxiv.org/abs/2602.14265
作者: Zachary Bamberger,Till R. Saenger,Gilad Morad,Ofra Amir,Brandon M. Stewart,Amir Feder
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: v1, 18 pages main, 55 pages total, 9 tables, 12 figures

点击查看摘要

Abstract:Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe’s explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at this https URL.

[NLP-46] Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉(hallucination)现象的系统性分类与可测量机制问题,即如何从几何结构角度识别和区分不同类型的幻觉。其解决方案的关键在于提出了一种基于token嵌入聚类结构可观测签名的几何分类法,通过定义三个可量化的几何统计量——极性耦合系数α(polarity coupling)、簇凝聚度β(cluster cohesion)和径向信息梯度λₛ(radial information gradient),揭示了LLM幻觉的三种操作上不同的类型:Type 1(中心偏移)、Type 2(错误一致收敛)和Type 3(覆盖空隙)。研究发现,在11个Transformer模型中,极性结构和簇凝聚度具有普适性,而径向信息梯度则与模型架构密切相关,从而为类型特异性幻觉检测提供了几何先验,并预测了不同架构对幻觉的敏感性差异。

链接: https://arxiv.org/abs/2602.14259
作者: Matic Korun
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:We propose a geometric taxonomy of large language model hallucinations based on observable signatures in token embedding cluster structure. By analyzing the static embedding spaces of 11 transformer models spanning encoder (BERT, RoBERTa, ELECTRA, DeBERTa, ALBERT, MiniLM, DistilBERT) and decoder (GPT-2) architectures, we identify three operationally distinct hallucination types: Type 1 (center-drift) under weak context, Type 2 (wrong-well convergence) to locally coherent but contextually incorrect cluster regions, and Type 3 (coverage gaps) where no cluster structure exists. We introduce three measurable geometric statistics: \alpha (polarity coupling), \beta (cluster cohesion), and \lambda_s (radial information gradient). Across all 11 models, polarity structure (\alpha 0.5) is universal (11/11), cluster cohesion (\beta 0) is universal (11/11), and the radial information gradient is significant (9/11, p 0.05). We demonstrate that the two models failing \lambda_s significance – ALBERT and MiniLM – do so for architecturally explicable reasons: factorized embedding compression and distillation-induced isotropy, respectively. These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.

[NLP-47] We can still parse using syntactic rules

【速读】: 该论文旨在解决传统上下文无关文法(Context-Free Grammar, CFG)在句法解析中难以处理复杂语言结构、缺乏对噪声和不完整输入的鲁棒性,以及无法同时生成依存树与成分结构树的问题。其解决方案的关键在于提出一种基于广义短语结构语法(Generalized Phrase Structure Grammar, GPSG)的新解析算法,结合改进的句法规则与特征,能够同时输出依赖关系和成分结构两种类型的句法树,并具备对噪声和不完整解析的容错能力;此外,系统还支持生成多个解析假设以供后续重排序优化,从而提升整体解析准确性,同时整合了自20世纪50年代以来的理论句法研究成果,构建了一个透明且可解释的自然语言处理模型。

链接: https://arxiv.org/abs/2602.14238
作者: Ghaly Hussein
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This research introduces a new parsing approach, based on earlier syntactic work on context free grammar (CFG) and generalized phrase structure grammar (GPSG). The approach comprises both a new parsing algorithm and a set of syntactic rules and features that overcome the limitations of CFG. It also generates both dependency and constituency parse trees, while accommodating noise and incomplete parses. The system was tested on data from Universal Dependencies, showing a promising average Unlabeled Attachment Score (UAS) of 54.5% in the development dataset (7 corpora) and 53.8% in the test set (12 corpora). The system also provides multiple parse hypotheses, allowing further reranking to improve parsing accuracy. This approach also leverages much of the theoretical syntactic work since the 1950s to be used within a computational context. The application of this approach provides a transparent and interpretable NLP model to process language input.

[NLP-48] REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

【速读】: 该论文旨在解决大语言模型在深度搜索任务中优化困难的问题,核心瓶颈在于高质量搜索轨迹和奖励信号的极端稀疏性,这源于长程任务构建的可扩展性不足以及涉及外部工具调用的高成本交互式 rollout。解决方案的关键在于提出统一框架 REDSearcher,其创新点包括:将任务合成建模为双约束优化问题,通过图拓扑结构和证据分布精确控制任务难度以实现高质量复杂任务的规模化生成;引入工具增强查询以促进主动工具使用而非被动记忆召回;在中段训练阶段强化知识、规划与函数调用等原子能力,显著降低下游训练对高质量轨迹的依赖;并构建本地模拟环境以支持强化学习实验的快速低成本迭代。

链接: https://arxiv.org/abs/2602.14234
作者: Zheng Chu,Xiao Wang,Jack Hong,Huiming Fan,Yuqi Huang,Yue Yang,Guohai Xu,Chenxiao Zhao,Cheng Xiang,Shengchao Hu,Dongdong Kuang,Ming Liu,Bing Qin,Xing Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:Large language models are transitioning from generalpurpose knowledge engines to realworld problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of highquality search trajectories and reward signals, arising from the difficulty of scalable longhorizon task construction and the high cost of interactionheavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that codesigns complex task synthesis, midtraining, and posttraining for scalable searchagent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dualconstrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, highquality tasks. (2) We introduce toolaugmented queries to encourage proactive tool use rather than passive recall.(3) During midtraining, we strengthen core atomic capabilities knowledge, planning, and function calling substantially reducing the cost of collecting highquality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, lowcost algorithmic iteration for reinforcement learning experiments. Across both textonly and multimodal searchagent benchmarks, our approach achieves stateoftheart performance. To facilitate future research on longhorizon search agents, we will release 10K highquality complex text search trajectories, 5K multimodal trajectories and 1K text RL query set, and together with code and model checkpoints.

[NLP-49] he Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Reasoning Process Quality for Audio Reasoning Models and Agents

【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio Language Models, LALMs)在音频理解任务中普遍存在的“黑箱”问题,即模型推理过程缺乏透明性和可解释性。为应对这一挑战,作者组织了Interspeech 2026首届音频推理挑战赛(Audio Reasoning Challenge),首次聚焦于评估链式思维(Chain-of-Thought, CoT)质量的共享任务。解决方案的关键在于引入MMAR-Rubrics——一种新型实例级评估协议,用于量化分析推理链的事实准确性和逻辑一致性;同时通过单模型与智能体(Agent)双赛道设计,揭示出当前最优系统依赖迭代工具编排与跨模态分析实现高质量推理,而单模型则借助强化学习和复杂数据流水线快速演进。

链接: https://arxiv.org/abs/2602.14224
作者: Ziyang Ma,Ruiyang Xu,Yinghao Ma,Chao-Han Huck Yang,Bohan Li,Jaeyeon Kim,Jin Xu,Jinyu Li,Carlos Busso,Kai Yu,Eng Siong Chng,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); Nanyang Technological University (南洋理工大学); Queen Mary University of London (伦敦玛丽女王大学); NVIDIA (英伟达); Carnegie Mellon University (卡内基梅隆大学); Qwen Team, Alibaba Group (通义实验室,阿里巴巴集团); Microsoft Corporation (微软公司)
类目: ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: The official website of the Audio Reasoning Challenge: this https URL

点击查看摘要

Abstract:Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this “black-box” limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featured Single Model and Agent tracks, the competition attracting 156 teams from 18 countries and regions. Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. Besides, single models are rapidly advancing via reinforcement learning and sophisticated data pipeline. We details the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.

[NLP-50] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports

【速读】: 该论文旨在解决儿童保护服务(Child Protective Services, CPS)干预过程中,如何有效评估父母合作程度这一复杂且信息模糊的案例因素问题。其解决方案的关键在于构建一个四阶段工作流程:首先收集案例报告,接着利用推理语言模型(Reasoning Language Models, RLMs)对父母合作情况进行基于推理的评估,然后自动化提取分类标签,最后完成案例标注。实验表明,参数规模达255B的RLM在准确率上达到89%,显著优于初始方法(80%),验证了RLMs在处理高复杂度、低明确性的案例因素时的有效性。

链接: https://arxiv.org/abs/2602.14216
作者: Dragan Stoll,Brian E. Perron,Zia Qi,Selina Steinmann,Nicole F. Eicher,Andreas Jud
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information. Methods: A four stage workflow comprising (1) case reports collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports. Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences. Conclusions: RLMs’ reasoning can effectively assess complex case factors such as parental cooperation. Lower accuracy in assessing fathers’ cooperation supports the argument of a stronger professional focus on mothers in CPS interventions.

[NLP-51] MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

【速读】: 该论文旨在解决块扩散语言模型(Block Diffusion LLMs)在长上下文场景下因KV缓存导致的内存访问瓶颈问题。现有动态稀疏注意力方法多针对自回归(Autoregressive, AR)LLMs设计,依赖近似重要性估计,在适配块扩散架构时性能不佳。其解决方案的关键在于利用块扩散特有的第一轮All-[MASK]去噪步骤中注意力机制能够可靠预测重要KV条目及其预算需求这一特性,提出MAGE方法:通过每块仅执行一次精确注意力计算并复用于训练-free的稀疏去噪过程,从而在保持近无损准确率的同时显著降低KV缓存开销,并实现高达3-4倍的端到端加速效果。

链接: https://arxiv.org/abs/2602.14209
作者: Omin Kwon,Yeonjae Kim,Doyeon Kim,Minseo Kim,Yeonhong Park,Jae W. Lee
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.

[NLP-52] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学推理任务中盲目给出确定性答案的问题,尤其是在证据不足或结论不确定时,强行输出可能带来误导甚至危害。其核心挑战在于如何识别何时应支持、反驳或选择** abstain(回避回答),而非简单追求高准确率。解决方案的关键在于提出一种abstention-aware verification framework(回避感知验证框架)**,该框架将科学主张分解为最小条件,通过自然语言推理(Natural Language Inference, NLI)逐项审计每个条件的证据支持度,并基于置信度动态决定是否输出结论或选择回避。实验表明,尽管不同模型架构的原始准确率差异有限,但引入合理置信度阈值的回避机制能在保持适度覆盖率的同时显著降低错误风险,凸显了“判断何时不回答”比“选择最佳模型”更为关键。

链接: https://arxiv.org/abs/2602.14189
作者: Samir Abdaljalil,Erchin Serpedin,Hasan Kurban
机构: Texas A&M University (德克萨斯A&M大学); Hamad Bin Khalifa University (哈马德本哈利法大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at this https URL .

[NLP-53] Investigation for Relative Voice Impression Estimation

【速读】: 该论文旨在解决语音感知差异的相对估计问题(Relative Voice Impression Estimation, RIE),即从同一说话者的两个语句中预测其感知差异,而非传统的绝对印象评分。其核心挑战在于捕捉细微且动态的声学表达变化,如“冷—暖”等复杂语义维度。解决方案的关键在于使用自监督语音表示(self-supervised speech representations)替代传统声学特征,这类模型在建模非语言和副语言特征方面表现更优,尤其在处理高维、连续性的感知差异时显著优于基于经典声学特征的方法,并首次系统验证了自监督方法在细粒度语音印象建模中的有效性。

链接: https://arxiv.org/abs/2602.14172
作者: Keinichi Fujita,Yusuke Ijima
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages,3 figures, Accepted to Speech Prosody 2026

点击查看摘要

Abstract:Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., Dark--Bright''). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., Cold–Warm’') where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.

[NLP-54] Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

【速读】: 该论文旨在解决大语言模型在强化学习中有效探索的问题,即如何在有限的采样预算内从庞大的自然语言序列空间中发现高质量轨迹。现有方法存在明显局限:GRPO仅从根节点采样,导致高概率轨迹饱和而忽视深层错误状态;树状方法则盲目分散采样预算至无意义或不可恢复的状态,造成采样稀释,难以发现罕见正确后缀并破坏局部基线稳定性。解决方案的关键在于提出深度密集探索(Deep Dense Exploration, DDE),其核心是聚焦于失败轨迹中的“关键点”(pivots)——即深层且可恢复的状态,并通过DEEP-GRPO实现三项创新:(1) 一种轻量级数据驱动效用函数,自动平衡可恢复性与深度偏置以识别pivot状态;(2) 在每个pivot处进行局部密集重采样,提升发现正确后续轨迹的概率;(3) 双流优化目标,解耦全局策略学习与局部修正更新,从而显著提升数学推理等任务上的性能表现。

链接: https://arxiv.org/abs/2602.14169
作者: Yiran Guo,Zhongjian Qiao,Yingqi Xie,Jie Liu,Dan Ye,Ruiqing Zhang,Shuang Qiu,Lijie Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on \textitpivots -deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.

[NLP-55] ROAST: Rollout-based On-distribution Activation Steering Technique

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段通过激活控制(activation steering)进行参数高效干预时,因依赖分布外监督和离散掩码导致的鲁棒性不足问题。其核心解决方案是提出ROAST(Rollout-based On-distribution Activation Steering Technique),关键创新在于:1)利用模型自身分布内轨迹(on-distribution rollouts)通过ROC估计引导方向,避免对分布外数据的依赖;2)采用连续软缩放(Continuous Soft Scaling, CSS)替代硬稀疏化,以更好保留激活能量;3)引入分组均值归一化(Grouped Mean Normalization)平衡不同样本的贡献,提升共识引导方向的稳定性。实证表明,ROAST在多个模型规模(0.6B至32B参数)和任务上显著优于现有方法。

链接: https://arxiv.org/abs/2602.14143
作者: Xuanbo Su,Hao Luo,Yingfang Zhang,Lijun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model’s own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.

[NLP-56] Algebraic Quantum Intelligence: A New Framework for Reproducible Machine Creativity

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在生成创造性内容时能力受限的问题,其根本原因在于:当输入丰富上下文时,LLM的未来生成空间被强烈约束,导致生成过程趋于近似确定性动力学,从而抑制了真正的创造性输出。解决方案的关键在于提出一种名为代数量子智能(Algebraic Quantum Intelligence, AQI)的计算框架,该框架基于非交换代数结构(受量子理论启发),通过在希尔伯特空间中表示语义状态并利用非交换算子计算C值来驱动其演化,从而实现语义可能性的共存与扩展,有效打破传统LLM的确定性约束,使模型能够系统性地拓展语义空间以支持更高层次的创造性推理。

链接: https://arxiv.org/abs/2602.14130
作者: Kazuo Yano,Jonghyeok Lee,Tae Ishitomi,Hironobu Kawaguchi,Akira Koyama,Masakuni Ota,Yuki Ota,Nobuo Sato,Keita Shimada,Sho Takematsu,Ayaka Tobinai,Satomi Tsuji,Kazunori Yanagi,Keiko Yano,Manabu Harada,Yuki Matsuda,Kazunori Matsumoto,Kenichi Matsumura,Hamae Matsuo,Yumi Miyazaki,Kotaro Murai,Tatsuya Ohshita,Marie Seki,Shun Tanoue,Tatsuki Terakado,Yuko Ichimaru,Mirei Saito,Akihiro Otsuka,Koji Ara
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success in generating fluent and contextually appropriate text; however, their capacity to produce genuinely creative outputs remains limited. This paper posits that this limitation arises from a structural property of contemporary LLMs: when provided with rich context, the space of future generations becomes strongly constrained, and the generation process is effectively governed by near-deterministic dynamics. Recent approaches such as test-time scaling and context adaptation improve performance but do not fundamentally alter this constraint. To address this issue, we propose Algebraic Quantum Intelligence (AQI) as a computational framework that enables systematic expansion of semantic space. AQI is formulated as a noncommutative algebraic structure inspired by quantum theory, allowing properties such as order dependence, interference, and uncertainty to be implemented in a controlled and designable manner. Semantic states are represented as vectors in a Hilbert space, and their evolution is governed by C-values computed from noncommutative operators, thereby ensuring the coexistence and expansion of multiple future semantic possibilities. In this study, we implement AQI by extending a transformer-based LLM with more than 600 specialized operators. We evaluate the resulting system on creative reasoning benchmarks spanning ten domains under an LLM-as-a-judge protocol. The results show that AQI consistently outperforms strong baseline models, yielding statistically significant improvements and reduced cross-domain variance. These findings demonstrate that noncommutative algebraic dynamics can serve as a practical and reproducible foundation for machine creativity. Notably, this architecture has already been deployed in real-world enterprise environments.

[NLP-57] Character-aware Transformers Learn an Irregular Morphological Pattern Yet None Generalize Like Humans

【速读】: 该论文旨在解决神经网络是否能够作为形态学习的认知模型这一开放性问题,特别是考察其能否像人类一样对复杂形态模式(如西班牙语的L形形态体)进行抽象和泛化。研究聚焦于L形形态体中第一人称单数直陈式动词词干与所有虚拟式形式共享词干的现象,而这种共享缺乏明显的音系、语义或句法动机。解决方案的关键在于对比五种编码器-解码器Transformer模型,这些模型在两个维度上有所不同:位置编码方式(顺序型 vs. 位置不变型)和标签表示方式(原子型 vs. 分解型)。结果表明,位置不变型位置编码是决定模型能否正确识别L形形态聚类的关键因素,这类模型即使在训练数据稀缺时也能恢复正确的L形结构;然而,所有模型均未能实现人类水平的产物式泛化——它们倾向于基于语气(mood-based)进行泛化,而非按照人类所表现出的以第一人称单数直陈式为核心的L形形态抽象。这揭示了当前模型在统计模式复制与形态学抽象之间的根本差距。

链接: https://arxiv.org/abs/2602.14100
作者: Akhilesh Kakolu Ramarao,Kevin Tang,Dinah Baer-Henney
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Whether neural networks can serve as cognitive models of morphological learning remains an open question. Recent work has shown that encoder-decoder models can acquire irregular patterns, but evidence that they generalize these patterns like humans is mixed. We investigate this using the Spanish \emphL-shaped morphome, where only the first-person singular indicative (e.g., \textitpongo `I put’) shares its stem with all subjunctive forms (e.g., \textitponga, pongas) despite lacking apparent phonological, semantic, or syntactic motivation. We compare five encoder-decoder transformers varying along two dimensions: sequential vs. position-invariant positional encoding, and atomic vs. decomposed tag representations. Positional encoding proves decisive: position-invariant models recover the correct L-shaped paradigm clustering even when L-shaped verbs are scarce in training, whereas sequential positional encoding models only partially capture the pattern. Yet none of the models productively generalize this pattern to novel forms. Position-invariant models generalize the L-shaped stem across subjunctive cells but fail to extend it to the first-person singular indicative, producing a mood-based generalization rather than the L-shaped morphomic pattern. Humans do the opposite, generalizing preferentially to the first-person singular indicative over subjunctive forms. None of the models reproduce the human pattern, highlighting the gap between statistical pattern reproduction and morphological abstraction.

[NLP-58] CCiV: A Benchmark for Structure Rhythm and Quality in LLM -Generated Chinese textitCi Poetry ICASSP2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成古典诗词(尤其是词牌体裁,即“Ci”)时面临的结构性、韵律性和艺术性难以协同控制的问题。其核心挑战在于:如何系统评估并提升LLM在严格形式约束下生成高质量古典诗词的能力。解决方案的关键在于提出一个名为Chinese Cipai Variants (CCiV) 的基准测试框架,该框架从结构(structure)、节奏(rhythm)和文学质量(quality)三个维度对生成结果进行量化评估,并揭示出模型常生成非标准但历史存在的词牌变体、且声调规则比结构规则更难遵循等现象;同时验证了形式感知提示(form-aware prompting)在增强强模型控制能力方面的有效性,但也指出其可能削弱弱模型表现,从而强调未来需发展更全面的约束生成方法与变体意识评价机制。

链接: https://arxiv.org/abs/2602.14081
作者: Shangqing Zhao,Yupei Ren,Yuhao Zhou,Xiaopeng Bai,Man Lan
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注: ARR 2025 May and Icassp 2026 submission. Working in progress

点击查看摘要

Abstract:The generation of classical Chinese \textitCi poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce \textbfChinese \textbfCipai \textbfVariants (\textbfCCiV), a benchmark designed to assess LLM-generated \textitCi poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 \textitCipai reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.

[NLP-59] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

【速读】: 该论文试图解决当前标准事实性评估方法在衡量大语言模型(Large Language Models, LLMs)事实知识时存在的局限性,即未能区分错误是源于知识缺失(空架子)还是知识访问受限(丢失钥匙)。其解决方案的关键在于提出一个行为框架,将事实知识的分析粒度从问题层面细化至单个事实层面,通过两个维度对每个事实进行表征:一是是否被编码(encoded),二是可访问性(accessibility)——分为无法回忆、可直接回忆和需推理才能回忆(thinking)。该框架由新构建的WikiProfile基准支持,借助基于网络搜索提示的LLM自动化流水线生成数据。研究发现,前沿模型在该基准上编码覆盖率已接近饱和(GPT-5和Gemini-3达95–98%),但召回仍是主要瓶颈,且许多以往归因于知识缺失的错误实为访问失败,尤其集中在长尾事实和反向问题中;进一步表明“思考”机制能显著提升召回率,揭示未来性能提升可能更依赖于优化已有知识的利用效率而非单纯扩大模型规模。

链接: https://arxiv.org/abs/2602.14080
作者: Nitay Calderon,Eyal Ben-David,Zorik Gekhman,Eran Ofek,Gal Yona
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.

[NLP-60] GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler

【速读】: 该论文旨在解决隐式推理模型在推理时缩放(Inference-time Scaling, ITS)过程中因采用启发式扰动(如dropout或固定高斯噪声)导致探索行为缺乏显式建模、效率低下等问题。这些问题在有限采样预算下尤为突出,且更强的扰动并不必然带来更有效的候选轨迹,因为无引导的噪声可能破坏内部决策结构而非优化其方向。解决方案的关键在于将潜在思维探索建模为可学习密度上的条件采样,并提出高斯思维采样器(Gaussian Thought Sampler, GTS),该方法预测上下文相关的连续推理状态扰动分布,通过类似GRPO的策略优化进行训练,同时冻结主干模型参数,从而实现结构化且可优化的探索机制,显著提升了GTS在GSM8K数据集上两种隐式推理架构下的推理可靠性。

链接: https://arxiv.org/abs/2602.14077
作者: Minghan Wang,Ye Bai,Thuy-Trang Vu,Ehsan Shareghi,Gholamreza Haffari
机构: Monash University (蒙纳士大学); University of Melbourne (墨尔本大学); University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.

[NLP-61] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)普遍依赖英语中心数据集所导致的多语言适配能力不足问题,从而限制了非英语用户的应用场景及跨文化语境下的性能表现。为应对这一挑战,作者基于LLaVA-Next方法论构建了一套波兰语VLM,并提出以全自动翻译与过滤流程处理现有多模态数据集,辅以合成波兰语数据用于OCR和文化特定任务。其解决方案的关键在于:通过大规模自动化翻译结合轻量级人工筛选,显著降低了低资源语言(如波兰语)下高质量多模态模型的构建成本,实验表明该方法在波兰语适应的MMBench上相较LLaVA-1.6-Vicuna-13B提升9.5%,且生成式评估中人类标注者评分显示语言正确性更高,验证了自动化流程的有效性。

链接: https://arxiv.org/abs/2602.14073
作者: Grzegorz Statkiewicz,Alicja Dobrzeniecka,Karolina Seweryn,Aleksandra Krasnodębska,Karolina Piosek,Katarzyna Bogusz,Sebastian Cygert,Wojciech Kusa
机构: NASK National Research Institute (国家研究机构)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.

[NLP-62] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

【速读】: 该论文试图解决生成式 AI(Generative AI)在开放任务中因依赖标量奖励模型(scalar reward models)而导致的鲁棒性不足与奖励黑客(reward hacking)问题。其核心挑战在于,标量奖励模型将多维人类偏好压缩为单一不透明分数,造成信息瓶颈并削弱对非可验证任务的对齐效果。解决方案的关键是提出 Open Rubric System(OpenRS),它通过引入显式的元规则(meta-rubric)机制,将奖励建模从黑箱函数转变为可检查的原则驱动推理过程:利用 Pairwise Adaptive Meta-Rubrics(PAMR)动态生成适应语义差异的评分细则,并结合轻量级点状可验证规则(Pointwise Verifiable Rubrics, PVRs)提供硬约束和客观子任务的可验证奖励信号;同时采用两级元规则精炼流程(自动化进化优化通用原则 + 人机协同细化领域原则),确保原则一致性与灵活性,从而在不进行点状加权标量化的前提下提升开放场景下的判别能力与对齐稳定性。

链接: https://arxiv.org/abs/2602.14069
作者: Ruipeng Jia,Yunyi Yang,Yuxin Wu,Yongbo Gai,Siyuan Tao,Mengyu Zhou,Jianhe Lin,Xiaoxi Jiang,Guanjun Jiang
机构: Alibaba(阿里巴巴); Beijing University Of Posts and Telecommunications(北京邮电大学); Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric – a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced – and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.

[NLP-63] From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

【速读】: 该论文旨在解决低资源语言(特别是普什图语)在自动语音识别(ASR)系统开发中缺乏大规模、公开许可语音数据的问题。其解决方案的关键在于对Mozilla Common Voice语料库中Pashto子集的释放级分析,量化评估了数据规模增长、验证吞吐量、贡献者参与不平等性、人口统计元数据完整性及句子层面的集中度等指标。研究发现,尽管数据量从2023年中期的1.49小时迅速增长至2025年2768.7小时(其中975.89小时已验证),但贡献者参与高度集中(基尼系数=0.941)、年龄分布偏年轻化、近42%的音频片段缺失性别标签,且约35.88%的独特句子占用了50%的已验证音频,表明结构集中主要源于贡献者活动不均而非少数提示词主导。这一定量审计明确了提升数据集成熟度的核心优先事项:扩大验证能力与促进更广泛的人口统计多样性参与。

链接: https://arxiv.org/abs/2602.14062
作者: Jandad Jahani,Mursal Dawodi,Jawid Ahmad Baktash
机构: O.P. Jindal Global University (奥佩·金达尔全球大学); Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation. Subjects: Computation and Language (cs.CL); Sound (cs.SD) Cite as: arXiv:2602.14062 [cs.CL] (or arXiv:2602.14062v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.14062 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jawid Ahmad Baktash [view email] [v1] Sun, 15 Feb 2026 09:22:48 UTC (175 KB)

[NLP-64] LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts EACL2026

【速读】: 该论文旨在解决定义建模(definition modeling)任务中模型泛化能力不足与语义表达精度有限的问题。其核心解决方案是提出LM-Lexicon框架,关键在于通过数据聚类将定义建模任务分解为多个语义专业化领域,并训练小型语言模型作为领域专家(domain experts),结合基于语义感知的域级路由机制和稀疏专家混合架构(sparse mixture-of-experts architecture)实现高效协同。此方法显著提升了定义生成质量,在五个基准测试上相较现有最优模型提升7% BLEU分数,且通过细粒度专家分工、语义驱动的路由策略及测试时计算扩展进一步优化性能。

链接: https://arxiv.org/abs/2602.14060
作者: Yang Liu,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li,Lingyong Yan
机构: BIGAI; Baidu Inc.; Peking University
类目: Computation and Language (cs.CL)
备注: EACL 2026 (Oral), 22 pages, 12 figures, 12 tables

点击查看摘要

Abstract:We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.

[NLP-65] LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation

【速读】: 该论文旨在解决代码生成中因推理链过浅(underthinking)或过长(overthinking)而导致的效率与质量失衡问题,即现有测试时扩展(Test Time Scaling, TTS)方法在探索推理路径时难以兼顾深度与计算效率。其解决方案的关键在于提出LogitsCoder框架,通过轻量级的logit层级控制机制实现对推理过程的精细化调节:首先利用Logits Preference Decoding引导token选择向统计上更优的模式偏移,随后采用Logits Rank Based Path Selection和Thoughts Aggregation从多样化推理路径中筛选并聚合高质量步骤,从而生成结构清晰、效率与深度平衡的推理链,显著提升代码生成性能。

链接: https://arxiv.org/abs/2602.14054
作者: Jizheng Chen,Weiming Zhang,Xinyi Dai,Weiwen Liu,Kounianhua Du,Yasheng Wang,Ruiming Tang,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full complexity of problems; and (2) overthinking, where overly verbose reasoning leads to inefficiency and increased computational costs. To address these issues, we propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank Based Path Selection and Thoughts Aggregation. This results in coherent and effective reasoning chains that balance depth and efficiency. Extensive experiments demonstrate that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.

[NLP-66] Context Shapes LLM s Retrieval-Augmented Fact-Checking Effectiveness

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长上下文场景下进行事实验证(fact verification)时性能不稳定的问题。研究发现,尽管LLMs具备非平凡的参数化事实知识,但其验证准确率随上下文长度增加而下降;同时,证据在提示中的位置对准确性具有显著影响——当相关证据位于提示开头或结尾时,模型表现更优,而置于中段则明显劣化。解决方案的关键在于优化提示结构(prompt structure),尤其强调将关键证据置于提示的起始或末尾位置,以提升检索增强型事实核查系统(retrieval-augmented fact-checking systems)的可靠性与一致性。

链接: https://arxiv.org/abs/2602.14044
作者: Pietro Bernardelle,Stefano Civelli,Kevin Roitero,Gianluca Demartini
机构: The University of Queensland (昆士兰大学); University of Udine (乌迪内大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models accross different parameters sizes (7B, 32B and 70B parameters) and model families (Llama-3.1, Qwen2.5 and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Similarly to what has been shown in previous works, in-context evidence placement plays a critical role with accuracy being consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.

[NLP-67] Geometry-Preserving Aggregation for Mixture-of-Experts Embedding Models

【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)嵌入模型中因线性加权求和聚合方式导致的几何不一致问题。现有方法隐含假设嵌入空间具有线性子空间结构,但实际分析表明,专家输出分布于共享的超球面流形上,表现为范数集中且角度分离显著;线性聚合会引发向内坍缩(inward collapse),破坏向量的模长与方向,降低嵌入的可比性。解决方案的关键在于提出一种几何保真聚合算子——超球重心聚合(Spherical Barycentric Aggregation, SBA),其通过分离径向(radial)与角向(angular)分量,在保持超球面结构的同时兼容原有路由机制,从而有效抑制聚合诱导的坍缩并提升嵌入质量。

链接: https://arxiv.org/abs/2602.14039
作者: Sajjad Kachuee,Mohammad Sharifkhani
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) embedding models combine expert outputs using weighted linear summation, implicitly assuming a linear subspace structure in the embedding space. This assumption is shown to be inconsistent with the geometry of expert representations. Geometric analysis of a modern MoE embedding model reveals that expert outputs lie on a shared hyperspherical manifold characterized by tightly concentrated norms and substantial angular separation. Under this geometry, linear aggregation induces inward collapse toward the manifold interior, distorting vector magnitude and direction and reducing embedding comparability. To address this inconsistency, Spherical Barycentric Aggregation (SBA) is introduced as a geometry-preserving aggregation operator that separates radial and angular components to maintain hyperspherical structure while remaining fully compatible with existing routing mechanisms. Experiments on selected tasks from the Massive Text Embedding Benchmark (MTEB), including semantic similarity, clustering, and duplicate question detection, demonstrate consistent performance improvements with identical training cost and full stability. Additional geometric analyses confirm that SBA prevents aggregation-induced collapse and preserves hyperspherical consistency, highlighting the importance of geometry-aware aggregation in MoE embedding architectures.

[NLP-68] GRRM: Group Relative Reward Modeling for Machine Translation

【速读】: 该论文旨在解决在开放域任务(如机器翻译)中,基于组相对策略优化(Group Relative Policy Optimization, GRPO)的有效性受限于组内排序准确性的问题。传统标量质量度量(Scalar Quality Metrics, SQM)因对候选译文独立评估而缺乏比较上下文,难以区分细微的语言差异。解决方案的关键在于提出组质量度量(Group Quality Metric, GQM)范式,并通过组相对奖励模型(Group Relative Reward Model, GRRM)实现:GRRM 联合处理整个候选组,利用对比分析来精确判定相对质量并自适应调整粒度,从而提升排序精度。在此基础上,将 GRRM 嵌入 GRPO 训练循环以优化翻译策略,实验表明该框架不仅显著提升整体翻译质量,还赋予模型接近最先进推理模型的推理能力。

链接: https://arxiv.org/abs/2602.14028
作者: Sen Yang,Shanbo Cheng,Lu Xu,Jianbing Zhang,Shujian Huang
机构: Nanjing University (南京大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release codes, datasets, and model checkpoints at this https URL.

[NLP-69] Named Entity Recognition for Payment Data Using NLP

【速读】: 该论文旨在解决金融交易处理中从非结构化支付数据中自动提取结构化信息的难题,核心挑战在于提升命名实体识别(Named Entity Recognition, NER)在多种支付格式(如SWIFT MT103、ISO 20022及本地支付系统)下的准确性和泛化能力。解决方案的关键在于引入并验证了基于预训练语言模型的先进NER方法,特别是通过微调BERT模型实现高达94.2%的F1分数,显著优于传统条件随机场(CRF)方法;进一步提出PaymentBERT这一融合领域特定金融嵌入与上下文表示的混合架构,在保持实时处理能力的同时达到95.7%的F1分数,成为当前最优方案。

链接: https://arxiv.org/abs/2602.14009
作者: Srikumar Nayak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 8 figures, research paper

点击查看摘要

Abstract:Named Entity Recognition (NER) has emerged as a critical component in automating financial transaction processing, particularly in extracting structured information from unstructured payment data. This paper presents a comprehensive analysis of state-of-the-art NER algorithms specifically designed for payment data extraction, including Conditional Random Fields (CRF), Bidirectional Long Short-Term Memory with CRF (BiLSTM-CRF), and transformer-based models such as BERT and FinBERT. We conduct extensive experiments on a dataset of 50,000 annotated payment transactions across multiple payment formats including SWIFT MT103, ISO 20022, and domestic payment systems. Our experimental results demonstrate that fine-tuned BERT models achieve an F1-score of 94.2% for entity extraction, outperforming traditional CRF-based approaches by 12.8 percentage points. Furthermore, we introduce PaymentBERT, a novel hybrid architecture combining domain-specific financial embeddings with contextual representations, achieving state-of-the-art performance with 95.7% F1-score while maintaining real-time processing capabilities. We provide detailed analysis of cross-format generalization, ablation studies, and deployment considerations. This research provides practical insights for financial institutions implementing automated sanctions screening, anti-money laundering (AML) compliance, and payment processing systems.

[NLP-70] he Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective LREC2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)在多步问答任务中因依赖冗长自解释(如思维链推理)而导致的效率低下问题,即如何在保证答案正确性( sufficiency )的前提下实现解释的精简( conciseness )。其解决方案的关键在于基于信息瓶颈原理(information bottleneck principle),将解释视为保留决策必要信息的压缩表示,并构建了一个约束解释长度、利用多个语言模型评估其充分性的评测流程。实验表明,在一定压缩范围内,更短的解释仍能保持准确性,而过度压缩则会导致性能下降,从而揭示了解释长度与准确率之间的非线性权衡关系。

链接: https://arxiv.org/abs/2602.14002
作者: Ali Zahedzadeh,Behnam Bahrak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: LREC 2026 submission; focuses on LLM self-explanation, interpretability, and information bottleneck analysis

点击查看摘要

Abstract:Large Language Models increasingly rely on self-explanations, such as chain of thought reasoning, to improve performance on multi step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct this http URL operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.

[NLP-71] Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimers Disease Assessment and Diagnosis

【速读】: 该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)诊断中依赖医学影像和临床评估所带来的效率低、资源消耗大以及模型可解释性差的问题,尤其是在AD复杂多因素病因难以通过影像直接观察的情况下。其解决方案的关键在于利用大语言模型(large language models, LLMs)对患者电子健康记录(electronic health records, EHRs)进行链式思维(Chain-of-Thought, CoT)推理,生成具有明确诊断逻辑的推理路径,并基于此构建结构化的CoT预测流程。该方法不仅提升了模型对AD复杂病理机制的内在诊断能力,还显著增强了不同疾病阶段预测过程的可解释性,实验表明该框架在多个CDR分级任务中相较零样本基线方法实现了最高达15%的F1分数提升。

链接: https://arxiv.org/abs/2602.13979
作者: Tongze Zhang,Jun-En Ding,Melik Ozolcer,Fang-Ming Hung,Albert Chih-Chieh Yang,Feng Liu,Yi-Rou Ji,Sang Won Bae
机构: Stevens Institute of Technology (斯蒂文斯理工学院); Far Eastern Memorial Hospital (远东纪念医院); National Yang Ming Chiao Tung University (国立阳明交通大学); Institute of Brain Science (脑科学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have been increasingly applied to the medical field using electronic health records (EHRs), yet their application in Alzheimer’s disease assessment remains limited, particularly given that AD involves complex multifactorial etiologies that are difficult to observe directly through imaging modalities. In this work, we propose leveraging LLMs to perform Chain-of-Thought (CoT) reasoning on patients’ clinical EHRs. Unlike direct fine-tuning of LLMs on EHR data for AD classification, our approach utilizes LLM-generated CoT reasoning paths to provide the model with explicit diagnostic rationale for AD assessment, followed by structured CoT-based predictions. This pipeline not only enhances the model’s ability to diagnose intrinsically complex factors but also improves the interpretability of the prediction process across different stages of AD progression. Experimental results demonstrate that the proposed CoT-based diagnostic framework significantly enhances stability and diagnostic performance across multiple CDR grading tasks, achieving up to a 15% improvement in F1 score compared to the zero-shot baseline method.

[NLP-72] Neuromem: A Granular Decomposition of the Streaming Lifecycle in External Memory for LLM s

【速读】: 该论文旨在解决当前外部记忆模块(External Memory Module)评估中普遍存在的静态假设问题,即现有方法通常假设记忆在离线状态下构建并以固定状态进行查询,而忽略了实际应用场景中记忆的动态流式特性——新事实持续到达、插入与检索交错发生、记忆状态随模型服务不断演化。这种流式场景下,准确率和成本由完整的记忆生命周期决定,涵盖信息的摄入、维护、检索及生成整合等环节。解决方案的关键在于提出 Neuromem,一个可扩展的测试平台,用于在交错插入与检索协议下基准测试外部记忆模块,并将生命周期细分为五个维度:记忆数据结构、归一化策略、合并策略、查询制定策略和上下文集成机制。通过三个代表性数据集 LOCOMO、LONGMEMEVAL 和 MEMORYAGENTBENCH,Neuromem 在统一的服务栈中评估各组件的可替换变体,量化 token 级 F1 分数及插入/检索延迟,揭示出记忆结构对性能上限起决定性作用,而压缩和生成式集成机制主要在插入与检索之间转移成本,提升有限。

链接: https://arxiv.org/abs/2602.13967
作者: Ruicheng Zhang,Xinyi Li,Tianyi Xu,Shuhao Zhang,Xiaofei Liao,Hai Jin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 8 figures, 15 tables. Preprint

点击查看摘要

Abstract:Most evaluations of External Memory Module assume a static setting: memory is built offline and queried at a fixed state. In practice, memory is streaming: new facts arrive continuously, insertions interleave with retrievals, and the memory state evolves while the model is serving queries. In this regime, accuracy and cost are governed by the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation. We present Neuromem, a scalable testbed that benchmarks External Memory Modules under an interleaved insertion-and-retrieval protocol and decomposes its lifecycle into five dimensions including memory data structure, normalization strategy, consolidation policy, query formulation strategy, and context integration mechanism. Using three representative datasets LOCOMO, LONGMEMEVAL, and MEMORYAGENTBENCH, Neuromem evaluates interchangeable variants within a shared serving stack, reporting token-level F1 and insertion/retrieval latency. Overall, we observe that performance typically degrades as memory grows across rounds, and time-related queries remain the most challenging category. The memory data structure largely determines the attainable quality frontier, while aggressive compression and generative integration mechanisms mostly shift cost between insertion and retrieval with limited accuracy gain.

[NLP-73] HLE-Verified: A Systematic Verification and Structured Revision of Humanitys Last Exam

【速读】: 该论文旨在解决人类最后考试(Humanity’s Last Exam, HLE)基准测试中存在的标注噪声问题,此类噪声会误导模型评估结果并扭曲跨模型比较。解决方案的关键在于构建一个经过验证和修订的改进版本——HLE-Verified,其核心是采用两阶段验证与修复流程:第一阶段通过领域专家评审和模型交叉校验对题目及答案进行二元验证,获得641个已验证项目;第二阶段对可修正的错误项实施严格约束下的修订,包括双独立专家修复、模型辅助审计与最终裁决,产出1,170个修订并认证的项目,并将剩余689项作为带有明确不确定性来源和专家标签的不确定集发布。此方法显著提升了基准的可靠性,实证表明在HLE-Verified上,先进语言模型平均准确率提升7–10个百分点,尤其在原题或参考答案存在错误的项目中提升达30–40个百分点,验证了方案的有效性。

链接: https://arxiv.org/abs/2602.13964
作者: Weiqi Zhai,Zhihai Wang,Jinghang Wang,Boyu Yang,Xiaogang Li,Xiang Xu,Bohan Wang,Peng Wang,Xingzhe Wu,Anfeng Li,Qiyuan Feng,Yuhao Zhou,Shoulin Han,Wenjie Luo,Yiyuan Li,Yaxuan Wang,Ruixian Luo,Guojie Lin,Peiyao Xiao,Chengliang Xu,Ben Wang,Zeyu Wang,Zichao Chen,Jianan Ye,Yijie Hu,Jialong Chen,Zongwen Shen,Yuliang Xu,An Yang,Bowen Yu,Dayiheng Liu,Junyang Lin,Hu Wei,Que Shen,Bing Zhao
机构: Alibaba Group (阿里巴巴集团); Qwen Team, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Humanity’s Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: this https URL

[NLP-74] MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars

【速读】: 该论文旨在解决当前行星科学中视觉任务基准多局限于封闭集监督学习,缺乏文本引导的地理空间检索能力的问题。为应对这一挑战,作者提出了MarsRetrieval——一个用于评估视觉-语言模型在火星地理空间发现能力的检索基准,包含图像-文本配对检索、地貌检索和全球地理定位三项任务,覆盖多尺度空间与多样地貌成因。解决方案的关键在于提出统一的以检索为中心的协议,支持对比双塔编码器与生成式视觉-语言模型的联合评估,并强调领域特定微调对提升行星环境中泛化性地理空间发现能力的重要性。

链接: https://arxiv.org/abs/2602.13961
作者: Shuoyuan Wang,Yiran Wang,Hongxin Wei
机构: Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Data-driven approaches like deep learning are rapidly advancing planetary science, particularly in Mars exploration. Despite recent progress, most existing benchmarks remain confined to closed-set supervised visual tasks and do not support text-guided retrieval for geospatial discovery. We introduce MarsRetrieval, a retrieval benchmark for evaluating vision-language models for Martian geospatial discovery. MarsRetrieval includes three tasks: (1) paired image-text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. We propose a unified retrieval-centric protocol to benchmark multimodal embedding architectures, including contrastive dual-tower encoders and generative vision-language models. Our evaluation shows MarsRetrieval is challenging: even strong foundation models often fail to capture domain-specific geomorphic distinctions. We further show that domain-specific fine-tuning is critical for generalizable geospatial discovery in planetary settings. Our code is available at this https URL

[NLP-75] Why Code Why Now: Learnability Computability and the Real Limits of Machine Learning

【速读】: 该论文试图解决的问题是:为何生成式 AI (Generative AI) 在代码生成任务中表现出比强化学习(Reinforcement Learning, RL)更可靠的进展,以及为什么单纯扩大模型规模无法解决机器学习(Machine Learning, ML)中的所有挑战。其解决方案的关键在于提出一个基于信息结构的五级可学习性(learnability)层级体系,该体系建立在计算问题的三个核心属性——表达能力(expressibility)、可计算性(computability)和可学习性(learnability)——之间的形式化区分之上,并明确它们之间的蕴含关系与失效边界。通过这一框架,论文指出任务是否具备良好的局部、密集且可验证的反馈机制(如代码中的token级反馈),才是决定ML能否有效推进的核心因素,而非模型规模本身。

链接: https://arxiv.org/abs/2602.13934
作者: Zhimin Zhao
机构: Queen’s University (皇后大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code generation has progressed more reliably than reinforcement learning, largely because code has an information structure that makes it learnable. Code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.

[NLP-76] From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在内容感知图形布局设计中存在空间推理能力有限以及设计决策过程缺乏透明度的问题。其解决方案的关键在于提出LaySPA框架,将布局设计重构为在结构化文本空间环境中的策略学习问题,该环境显式编码画布几何、元素属性及元素间关系;通过多目标空间评述机制分解布局质量为几何有效性、关系一致性与美学一致性,并采用相对分组优化策略稳定开放设计空间中的学习过程,从而生成可解释的推理轨迹和结构化的布局规范,实现透明且可控的设计决策。

链接: https://arxiv.org/abs/2602.13912
作者: Sha Li,Stefano Petrangeli,Yu Shen,Xiang Chen
机构: Virginia Tech (弗吉尼亚理工学院); Adobe Research (Adobe 研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs’ limited spatial reasoning and the lack of opacity in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. Layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized SOTA layout generators while requiring fewer annotated samples and reduced latency.

[NLP-77] Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

【速读】: 该论文旨在解决历史文献数字化过程中自动文本识别(ATR)输出与实际应用之间存在的可用性鸿沟问题:一方面,基于古文字学数据集(如CATMuS)训练的ATR模型虽具更强泛化能力,但其原始输出难以直接用于读者或下游自然语言处理(NLP)工具;另一方面,以规范化输出为目标的ATR模型在新领域适应性差,易出现过度归一化和幻觉现象。解决方案的关键在于提出“预编辑归一化”(Pre-Editorial Normalization, PEN)任务,即在保持古文字学保真度的前提下,依据编辑惯例对ATR的字形输出进行归一化处理,从而实现中间步骤的可读性与最终版本的实用性平衡。研究构建了基于CoMMA语料库的新数据集及人工校正的黄金标准评估集,并采用ByT5序列到序列模型进行基准测试,显著提升了归一化准确率(CER降至6.7%),为历史文本数字人文研究提供了新的方法论框架。

链接: https://arxiv.org/abs/2602.13905
作者: Thibault Clérice,Rachel Bawden,Anthony Glaise,Ariane Pinche,David Smith
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. While ATR models trained on more palaeographically-oriented datasets such as CATMuS have shown greater generalizability, their raw outputs remain poorly compatible with most readers and downstream NLP tools, thus creating a usability gap. On the other hand, ATR models trained to produce normalized outputs have been shown to struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN), which consists in normalizing graphemic ATR output according to editorial conventions, which has the advantage of keeping an intermediate step with palaeographic fidelity while providing a normalized version for practical usability. We present a new dataset derived from the CoMMA corpus and aligned with digitized Old French and Latin editions using passim. We also produce a manually corrected gold-standard evaluation set. We benchmark this resource using ByT5-based sequence-to-sequence models on normalization and pre-annotation tasks. Our contributions include the formal definition of PEN, a 4.66M-sample silver training corpus, a 1.8k-sample gold evaluation set, and a normalization model achieving a 6.7% CER, substantially outperforming previous models for this task.

[NLP-78] Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在复杂多跳问答任务中,因缺乏有效外部知识增强而导致的事实性与推理能力不足的问题。其核心解决方案在于系统性地优化检索增强生成(Retrieval Augmented Generation, RAG)中的提示模板设计,通过大规模实证研究评估了24种不同提示模板在HotpotQA数据集上的表现,发现采用新颖的混合型提示模板可显著提升SLMs的性能,相较标准RAG提示最高提升达83%(Qwen2.5-3B Instruct)和84.5%(Gemma3-4B-It),并为资源受限环境下的SLM-RAG部署提供了可操作的设计建议。

链接: https://arxiv.org/abs/2602.13890
作者: Amir Hossein Mohammadi,Ali Moeinian,Zahra Razavizade,Afsaneh Fatemi,Reza Ramezani
机构: 未知
类目: Computation and Language (cs.CL)
备注: 32 Pages, Submitted to Journal of Computing and Security

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, practically for deployment in resource-constrained environments.

[NLP-79] ADAB: Arabic Dataset for Automated Politeness Benchmarking – A Large-Scale Resource for Computational Socioprag matics LREC2026

【速读】: 该论文旨在解决阿拉伯语自然语言处理(Natural Language Processing, NLP)中礼貌表达识别资源匮乏的问题,尤其针对阿拉伯语中丰富而复杂的礼貌现象缺乏系统性标注数据的现状。其解决方案的关键在于构建了一个名为ADAB(Arabic Politeness Dataset)的多源、多方言、高标注质量的阿拉伯语礼貌检测数据集,涵盖现代标准阿拉伯语及四种主要方言(海湾、埃及、黎凡特和马格里布),并基于阿拉伯语言传统与语用学理论进行三类标注(礼貌、不礼貌、中性),同时提供16个礼貌类别层面的语言特征标注。该数据集在40种模型配置(包括传统机器学习、Transformer模型和大语言模型)上进行了基准测试,为开发更具文化敏感性的阿拉伯语NLP系统提供了重要支撑。

链接: https://arxiv.org/abs/2602.13870
作者: Hend Al-Khalifa,Nadia Ghezaiel,Maria Bounnit,Hend Hamed Alhazmi,Noof Abdullah Alfear,Reem Fahad Alqifari,Ameera Masoud Almasoud,Sharefah Ahmed Al-Ghamdi
机构: 未知
类目: Computation and Language (cs.CL)
备注: Paper accepted @ The Fifteenth biennial Language Resources and Evaluation Conference (LREC2026)

点击查看摘要

Abstract:The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms, including social media, e-commerce, and customer service domains, covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated based on Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.

[NLP-80] Bridging the Multilingual Safety Divide: Efficient Culturally-Aware Alignment for Global South Languages AAAI2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)在低资源语言和多语混合场景下安全性与文化适配性不足的问题,特别是现有安全机制和评估基准主要基于英语等高资源语言,导致其在“全球南方”地区部署时出现安全护栏失效、文化有害行为未被识别以及知识修正无法跨语言迁移等现象。解决方案的关键在于构建以本地社区参与为核心的安全治理框架,包括三个核心方向:参数高效的安全引导(parameter-efficient safety steering)、基于文化语境的评估与偏好数据建设,以及赋能本地群体定义并缓解特定危害的协作式工作流,从而将多语言安全性从附加功能转变为公平人工智能的必备要素。

链接: https://arxiv.org/abs/2602.13867
作者: Somnath Banerjee,Rima Hazra,Animesh Mukherjee
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to the EGSAI Workshop at AAAI 2026

点击查看摘要

Abstract:Large language models (LLMs) are being deployed across the Global South, where everyday use involves low-resource languages, code-mixing, and culturally specific norms. Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality ‘‘transfer’’ across languages. Evidence increasingly shows they do not. We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii) English-only knowledge edits and safety patches often fail to carry over to low-resource languages. In response, we outline a practical agenda for researchers and students in the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory workflows that empower local communities to define and mitigate harm. Our aim is to make multilingual safety a core requirement-not an add-on-for equitable AI in underrepresented regions.

[NLP-81] utoring Large Language Models to be Domain-adaptive Precise and Safe

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的三大核心挑战:技术精度不足、安全风险高以及文化与多语言适应性差的问题。为实现“负责任智能”(Responsible Intelligence)框架,其解决方案的关键在于构建一个分阶段的方法论路径:首先通过监督式领域适配(domain adaptation)确保任务特定的技术精确性;继而采用解码时对齐(decoding-time alignment)机制提升安全性以抵御对抗性攻击;最终借助人类反馈和偏好建模(preference modeling)实现社会语言学层面的敏感性与全球包容性。这一多维度协同策略使LLMs能够在复杂现实场景中兼顾性能、安全与伦理合规性。

链接: https://arxiv.org/abs/2602.13860
作者: Somnath Banerjee
机构: Indian Institute of Technology Kharagpur(印度理工学院克哈格普尔分校)
类目: Computation and Language (cs.CL)
备注: Accepted to the PhD Symposium at Web Conference 2026

点击查看摘要

Abstract:The overarching research direction of this work is the development of a ‘‘Responsible Intelligence’’ framework designed to reconcile the immense generative power of Large Language Models (LLMs) with the stringent requirements of real-world deployment. As these models become a transformative force in artificial intelligence, there is an urgent need to move beyond general-purpose architectures toward systems that are contextually aware, inherently safer, and deeply respectful of global cultural nuances. This research navigates three interconnected threads: domain adaptation to ensure technical precision, ethical rigor to mitigate adversarial vulnerabilities, and cultural/multilingual alignment to promote global inclusivity. The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.

[NLP-82] PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在执行涉及敏感、上下文依赖信息的个性化任务时,因上下文隐私隐含性而导致的隐私泄露问题。现有方法依赖于外部推理时干预,存在脆弱性高、场景特定性强且可能扩大隐私攻击面的缺陷。解决方案的关键在于提出PrivAct——一个将上下文隐私保护内化至模型生成行为中的多代理学习框架,通过在每个代理中嵌入隐私偏好,直接在生成阶段实现隐私合规的代理行为,从而增强系统级上下文完整性,并在隐私与有用性之间取得更优权衡。

链接: https://arxiv.org/abs/2602.13840
作者: Yuhan Cheng,Hancheng Ye,Hai Helen Li,Jingwei Sun,Yiran Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents’ action due to the implicitness of contextual privacy. Existing approaches rely on external, inference-time interventions which are brittle, scenario-specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models’ generation behavior for privacy-compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system-wide contextual integrity while achieving a more favorable privacy-helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot generalization and robustness across diverse multi-agent topologies. Code is available at this https URL.

[NLP-83] Speculative Decoding with a Speculative Vocabulary

【速读】: 该论文旨在解决当前推测解码(speculative decoding)方法中因Draft模型输出分布瓶颈导致的推理效率受限问题,特别是当使用简化词汇表的Draft模型时,虽能提升吞吐量但会因目标token不在词汇表内而降低推测有效性。解决方案的关键在于提出一种名为SpecVocab的新方法,其核心思想是在每步解码时动态选择一个词汇子集进行推测(vocabulary speculation),从而在不牺牲推测成功率的前提下显著提升接受长度(acceptance length),最终实现比当前最优方法EAGLE-3更高的平均吞吐量(最高提升达8.1%)。

链接: https://arxiv.org/abs/2602.13836
作者: Miles Williams,Young D. Kwon,Rui Li,Alexandros Kouris,Stylianos I. Venieris
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.

[NLP-84] Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在用户意图表达不明确时,难以准确理解并响应真实需求的问题,即模型与用户之间存在的认知偏差(epistemic divergence)。传统理论心理(Theory of Mind, ToM)评估多聚焦于孤立的信念推理,忽视其在实际交互中的功能价值。论文的关键解决方案是将ToM重新定义为一种用于检测和修正认知偏差的机制,并提出一个名为\benchname的新基准,用于评估模型在实践中如何调和用户信念与行为特征。进一步地,作者构建了一个基于任务轨迹的ToM数据集,通过强化学习训练模型以提升对用户心理状态的推理能力,从而显著改善下游任务表现,凸显了ToM作为交互层面核心机制的实际价值。

链接: https://arxiv.org/abs/2602.13832
作者: Minyuan Ruan,Ziyue Wang,Kaiming Liu,Yunghwei Lai,Peng Li,Yang Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user believes and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation to identify underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.

[NLP-85] he acquisition of English irregular inflections by Yemeni L1 Arabic learners: A Universal Grammar approach

【速读】: 该论文旨在解决 Yemeni 成人学习者在第二语言(L2)英语中不规则屈折形态(irregular inflections)习得过程中存在的问题,特别是探讨其错误来源是否源于母语(L1)迁移与目标语内部发展因素的交互作用。研究基于普遍语法(Universal Grammar, UG)框架,采用特征重组假说(Feature Reassembly Hypothesis, FRH)作为理论基础,发现学习者的错误既来自跨语言干扰(interlingual),也来自本族语内部规则泛化(intralingual overgeneralization)。解决方案的关键在于提供高质量的语言输入和教学干预,以促进学习者对UG原则的持续调用,从而实现从初始阶段依赖L1迁移向目标语形态结构的重构过渡,尤其在辅音变化、零形标记及-a复数等难点上需强化输入质量和教学设计以突破限制。

链接: https://arxiv.org/abs/2602.13816
作者: Muneef Y. Alsawsh,Mohammed Q. Shormani
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages, 3 Tables

点击查看摘要

Abstract:This study examines the acquisition of English irregular inflections by Yemeni learners of English as a second language (L2), utilizing a Universal Grammar (UG) approach. Within the UG approach, the study considers Feature Reassembly Hypothesis (FRH) (Lardiere, 2008, 2009) part of UG, focusing on the roles of first language (L1) transfer and L2 developmental influence. It analyzes learner errors across two developmental stages. Stage 1 data reveal a dominant influence of L1 transfer, particularly in phonological and structural mismatches, while stage 2 data demonstrate increased learner sensitivity to UG properties and morphological reconfiguration toward the target language. Findings reveal that errors in irregular inflectional morphology are attributed to both interlingual and intralingual sources, with overgeneralization of L2 rules as a common developmental strategy. Statistical analysis, including a one-way ANOVA, indicates significant improvement in the production of well-formed irregular inflections from stage 1 to stage 2, underscoring learners’ continued access to UG. However, persistent difficulties with consonant change, zero-morpheme, and -a plural inflections suggest that limited exposure, ineffective input modeling, and insufficient instructional quality constrain full UG access. The study concludes that while L1 transfer and L2 developmental factors influence initial stages of acquisition, appropriate linguistic input and instruction are critical for facilitating UG-driven feature reassembly in adult L2 learners.

[NLP-86] OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

【速读】: 该论文旨在解决卵巢肿瘤治疗中因多学科团队(Multidisciplinary Tumor Board, MDT)资源分布不均导致的临床决策可及性问题,尤其是在医疗资源受限地区,患者难以获得及时、高质量的专家共识意见。其解决方案的关键在于提出并验证了OMGs(Ovarian tumour Multidisciplinary intelligent aGent System),一个基于多智能体协同推理的生成式AI框架,通过领域专用代理(domain-specific agents)协作整合多学科证据,并输出具有透明推理过程的MDT风格推荐。该系统在多中心评估中表现出与专家MDT共识相当甚至更优的性能,尤其在证据强度(Evidence)和鲁棒性(Robustness)维度显著提升临床决策质量,具备在资源匮乏环境中扩展专业肿瘤学诊疗能力的潜力。

链接: https://arxiv.org/abs/2602.13793
作者: Yangyang Zhang,Zilong Wang,Jianbo Xu,Yongqi Chen,Chu Han,Zhihao Zhang,Shuai Liu,Hui Li,Huiping Zhang,Ziqi Liu,Jiaxin Chen,Jun Zhu,Zheng Feng,Hao Wen,Xingzhu Ju,Yanping Zhong,Yunqiu Zhang,Jie Duan,Jun Li,Dongsheng Li,Weijie Wang,Haiyan Zhu,Wei Jiang,Xiaohua Wu,Shuo Wang,Haiming Li,Qinhao Guo
机构: 未知
类目: Computation and Language (cs.CL)
备注: 27 pages, 5 figures, 1 table

点击查看摘要

Abstract:Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource-constrained centres where MDT resources are scarce or unavailable. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style recommendations with transparent rationales. To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum. In multicentre re-evaluation, OMGs achieved performance comparable to expert MDT consensus ( 4.45 \pm 0.30 versus 4.53 \pm 0.23 ), with higher Evidence scores (4.57 versus 3.92). In prospective multicentre evaluation (59 patients), OMGs demonstrated high concordance with routine MDT decisions. Critically, in paired human-AI studies, OMGs most substantially enhanced clinicians’ recommendations in Evidence and Robustness, the dimensions most compromised when multidisciplinary expertise is unavailable. These findings suggest that multi-agent deliberative systems can achieve performance comparable to expert MDT consensus, with potential to expand access to specialized oncology expertise in resource-limited settings.

[NLP-87] StackingNet: Collective Inference Across Independent AI Foundation Models

【速读】: 该论文旨在解决当前基于大规模基础模型(foundation models)的人工智能系统之间缺乏协同能力的问题,即这些模型虽在语言理解、视觉识别和推理等任务中表现出色,但彼此孤立,难以共享或整合其优势。解决方案的关键在于提出一种称为StackingNet的元集成框架,该框架借鉴集体智能原理,在推理阶段通过组合多个黑箱异构模型的预测结果来实现协调优化。其核心创新在于无需访问各模型的内部参数或训练数据,即可提升准确性、降低偏差、实现可靠性排序,并识别或剔除性能退化的模型,从而将模型多样性从不一致性来源转变为协作优势,为可信人工智能系统的构建提供了可实践的多模型协同范式。

链接: https://arxiv.org/abs/2602.13792
作者: Siyang Li,Chenhao Liu,Dongrui Wu,Zhigang Zeng,Lieyun Ding
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Artificial intelligence built on large foundation models has transformed language understanding, vision and reasoning, yet these systems remain isolated and cannot readily share their capabilities. Integrating the complementary strengths of such independent foundation models is essential for building trustworthy intelligent systems. Despite rapid progress in individual model design, there is no established approach for coordinating such black-box heterogeneous models. Here we show that coordination can be achieved through a meta-ensemble framework termed StackingNet, which draws on principles of collective intelligence to combine model predictions during inference. StackingNet improves accuracy, reduces bias, enables reliability ranking, and identifies or prunes models that degrade performance, all operating without access to internal parameters or training data. Across tasks involving language comprehension, visual estimation, and academic paper rating, StackingNet consistently improves accuracy, robustness, and fairness, compared with individual models and classic ensembles. By turning diversity from a source of inconsistency into collaboration, StackingNet establishes a practical foundation for coordinated artificial intelligence, suggesting that progress may emerge from not only larger single models but also principled cooperation among many specialized ones.

[NLP-88] How Do Lexical Senses Correspond Between Spoken German and German Sign Language? EACL’26

【速读】: 该论文旨在解决手语词典编纂中多义词与同形异义词在不同语境下对应不同手语表达(sign)的映射关系常被忽视的问题,从而丰富双语词典资源。其关键解决方案是采用基于使用的方法,通过分析德语与德国手语(Deutsche Gebärdensprache, DGS)之间的词义到手语的映射关系,构建了首个跨模态词义对应标注数据集,包含1,404个词用例到手语标识(sign ID)的映射,并识别出三类对应模式(Type 1:一词多义对应多个手语;Type 2:多词对应单一手语;Type 3:一对一映射),同时比较了精确匹配(Exact Match, EM)与语义相似度(Semantic Similarity, SS)两种计算方法,发现基于SBERT嵌入的SS方法显著优于EM,在Type 1类型上提升达52.1个百分点,验证了计算方法对特定对应模式的可识别性。

链接: https://arxiv.org/abs/2602.13790
作者: Melis Çelikkol,Wei Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: EACL’26 (Student Research Workshop)

点击查看摘要

Abstract:Sign language lexicographers construct bilingual dictionaries by establishing word-to-sign mappings, where polysemous and homonymous words corresponding to different signs across contexts are often underrepresented. A usage-based approach examining how word senses map to signs can identify such novel mappings absent from current dictionaries, enriching lexicographic resources. We address this by analyzing German and German Sign Language (Deutsche Gebärdensprache, DGS), manually annotating 1,404 word use-to-sign ID mappings derived from 32 words from the German Word Usage Graph (D-WUG) and 49 signs from the Digital Dictionary of German Sign Language (DW-DGS). We identify three correspondence types: Type 1 (one-to-many), Type 2 (many-to-one), and Type 3 (one-to-one), plus No Match cases. We evaluate computational methods: Exact Match (EM) and Semantic Similarity (SS) using SBERT embeddings. SS substantially outperforms EM overall 88.52% vs. 71.31%), with dramatic gains for Type 1 (+52.1 pp). Our work establishes the first annotated dataset for cross-modal sense correspondence and reveals which correspondence patterns are computationally identifiable. Our code and dataset are made publicly available.

[NLP-89] RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

【速读】: 该论文针对多媒体事件抽取(Multimedia Event Extraction, MEE)在低资源条件下缺乏标注训练数据的问题展开研究,旨在提升跨模态事件语义的对齐能力与论元定位精度。现有方法依赖于跨模态对齐或基于视觉-语言模型(Vision-Language Models, VLMs)的推理时提示,难以显式学习结构化的事件表示,且在多模态场景中论元锚定效果较弱。解决方案的关键在于提出一种关系感知的多任务渐进式学习框架(Relation-aware Multi-task Progressive Learning, RMPL),通过分阶段训练整合单模态事件抽取和多媒体关系抽取的异构监督信号:首先以统一事件模式学习跨模态共享的事件中心表征,再结合文本与视觉数据进行事件提及识别和论元角色抽取的微调,从而在M2E2基准上实现不同模态设置下的稳定性能提升。

链接: https://arxiv.org/abs/2602.13748
作者: Yongkang Jin,Jianwen Luo,Jingjing Wang,Jianmin Yao,Yu Hong
机构: Soochow University (苏州大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision–Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

[NLP-90] On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

【速读】: 该论文旨在解决计算论证中重组(reformulation)的策略性使用识别问题,即如何超越表面相似性检测,准确捕捉重述在修辞话语中的语用功能。其解决方案的关键在于构建一个对比性的多智能体框架,通过引入论证理论知识增强大型语言模型(LLM)的能力:具体而言,采用检索增强生成(Retrieval-Augmented Generation, RAG)机制将理论知识注入到一个代理系统中,从而显著提升对五类重述功能(Deintensification、Intensification、Specification、Generalisation、Other,简称D-I-S-G-O)的识别性能,尤其在Intensification和Generalisation等复杂语境下表现突出,整体Macro F1-score提升近30%。这表明理论基础对于实现从单纯句法层面的改写检测向功能感知的论证分析跃迁至关重要。

链接: https://arxiv.org/abs/2602.13713
作者: Maciej Uberna,Michał Wawer,Jarosław A. Chudziak,Marcin Koszowy
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, 3 tables. This is the accepted version of the paper presented at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain

点击查看摘要

Abstract:Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise an dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions: Deintensification, Intensification, Specification, Generalisation, and Other, which covers all remaining types (D-I-S-G-O). We then evaluate two parallel LLM-based agent systems: one enhanced by argumentation theory via Retrieval-Augmented Generation (RAG), and an identical zero-shot baseline. The results reveal a clear performance gap: the RAG-enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1-score improvement of nearly 30%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function-aware analysis of argumentative discourse. This comparative multi-agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.

[NLP-91] Metaphors journeys across time and genre: tracking the evolution of literary metaphors with temporal embeddings

【速读】: 该论文旨在解决文学隐喻(literary metaphor)在时间维度上的加工成本是否随时代变迁而变化的问题,以及不同文体(genre)如何影响其认知处理难度。此前研究多忽视了隐喻的时间演变特性,且缺乏对语义动态变化的量化分析。解决方案的关键在于引入历时分布语义学(diachronic distributional semantics)工具,通过在19世纪与21世纪意大利语文学和非文学语料库上训练词向量(word embeddings),共涵盖1.24亿词元,并以515个19世纪文学隐喻中“主题”(topic)与“载体”(vehicle)之间的语义相似度作为加工需求的代理指标,从而系统评估隐喻处理难度的历时变化及其受体裁调节的作用机制。

链接: https://arxiv.org/abs/2602.13701
作者: Veronica Mangiaterra,Chiara Barattieri di San Pietro,Paolo Canal,Valentina Bambini
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Metaphors are a distinctive feature of literary language, yet they remain less studied experimentally than everyday metaphors. Moreover, previous psycholinguistic and computational approaches overlooked the temporal dimension, although many literary metaphors were coined centuries apart from contemporary readers. This study innovatively applies tools from diachronic distributional semantics to assess whether the processing costs of literary metaphors varied over time and genre. Specifically, we trained word embeddings on literary and nonliterary Italian corpora from the 19th and 21st centuries, for a total of 124 million tokens, and modeled changes in the semantic similarity between topics and vehicles of 515 19th-century literary metaphors, taking this measure as a proxy of metaphor processing demands. Overall, semantic similarity, and hence metaphor processing demands, remained stable over time. However, genre played a key role: metaphors appeared more difficult (i.e., lower topic-vehicle similarity) in modern literary contexts than in 19th-century literature, but easier (i.e., higher topic-vehicle similarity) in today’s nonliterary language (e.g., the Web) than in 19th-century nonliterary texts. This pattern was further shaped by semantic features of metaphors’ individual terms, such as vector coherence and semantic neighborhood density. Collectively, these findings align with broader linguistic changes in Italian, such as the stylistic simplification of modern literature, which may have increased metaphor processing demands, and the high creativity of the Web’s language, which seems to render metaphor more accessible.

[NLP-92] AllM em: A Memory-centric Recipe for Efficient Long-context Modeling

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长序列任务中因自注意力机制带来的计算复杂度和内存开销而导致的性能瓶颈问题。其解决方案的关键在于提出了一种名为 \textscAllMem 的新型高效混合架构,该架构融合了滑动窗口注意力(Sliding Window Attention, SWA)与非线性测试时训练(Test-Time Training, TTT)记忆网络,从而在保持模型对超长上下文有效建模能力的同时,显著降低推理阶段的计算与内存消耗,并缓解灾难性遗忘问题。通过引入参数化记忆机制,该方法不仅克服了传统线性记忆模型的表征局限,还在多个基准测试中实现了近无损甚至优于全注意力机制的性能表现。

链接: https://arxiv.org/abs/2602.13680
作者: Ziming Wang,Xiang Wang,Kailong Peng,Lang Qin,Juan Gabriel Kostelec,Christos Sourmpis,Axel Laborieux,Qinghai Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce \textscAllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. \textscAllMem enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This framework facilitates the efficient transformation of any off-the-shelf pre-trained LLM into an \textscAllMem-based architecture. Empirical evaluations confirm that our 4k window model achieves near-lossless performance on 37k LongBench with a marginal 0.83 drop compared to full attention. Furthermore, on InfiniteBench at a 128k context, our 8k window variant outperforms full attention, which validates the effectiveness of our parameterized memory in mitigating noise and maintaining robust long-range modeling without the prohibitive costs of global attention.

[NLP-93] KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在韩国医学执业资格考试(Korean Medical Licensing Examinations)场景下,对多模态医学问答任务的评估缺乏统一、专业且本土化基准的问题。其解决方案的关键在于构建并发布 KorMedMCQA-V 数据集——一个包含 1,534 道医学多选题及 2,043 张相关医学影像(涵盖 X 射线、CT、心电图、超声、内镜等临床模态)的多模态基准,并采用统一的零样本评估协议对超过 50 种不同类别的 VLMs(包括通用型、医学专用型和韩语专用型)进行系统评测,从而揭示模型在跨图像证据整合、模态敏感性及推理能力等方面的性能差异与局限,为韩国医学领域多模态智能系统的开发与评估提供标准化工具。

链接: https://arxiv.org/abs/2602.13650
作者: Byungjin Choi,Seongsu Bae,Sunjun Kweon,Edward Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 2 figures, 6 tables. (Includes appendix.)

点击查看摘要

Abstract:We introduce KorMedMCQA-V, a Korean medical licensing-exam-style multimodal multiple-choice question answering benchmark for evaluating vision-language models (VLMs). The dataset consists of 1,534 questions with 2,043 associated images from Korean Medical Licensing Examinations (2012-2023), with about 30% containing multiple images requiring cross-image evidence integration. Images cover clinical modalities including X-ray, computed tomography (CT), electrocardiography (ECG), ultrasound, endoscopy, and other medical visuals. We benchmark over 50 VLMs across proprietary and open-source categories-spanning general-purpose, medical-specialized, and Korean-specialized families-under a unified zero-shot evaluation protocol. The best proprietary model (Gemini-3.0-Pro) achieves 96.9% accuracy, the best open-source model (Qwen3-VL-32B-Thinking) 83.7%, and the best Korean-specialized model (VARCO-VISION-2.0-14B) only 43.2%. We further find that reasoning-oriented model variants gain up to +20 percentage points over instruction-tuned counterparts, medical domain specialization yields inconsistent gains over strong general-purpose baselines, all models degrade on multi-image questions, and performance varies notably across imaging modalities. By complementing the text-only KorMedMCQA benchmark, KorMedMCQA-V forms a unified evaluation suite for Korean medical reasoning across text-only and multimodal conditions. The dataset is available via Hugging Face Datasets: this https URL.

[NLP-94] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)评估与对齐流程中因评价标准(rubric)修改而引发的系统性偏好漂移问题,即Rubric-Induced Preference Drift (RIPD)。其核心发现是:即使评价 rubric 的修改通过了基准测试验证,仍可能在目标领域内导致 judge 模型偏好发生定向且系统性的偏移,从而影响下游对齐训练的效果。解决方案的关键在于识别并量化这种隐蔽的、由 rubric 改动引发的行为漂移现象,并证明其可通过基于 rubric 的偏好攻击被主动利用,进而揭示评价 rubric 作为高阶决策接口的敏感性和可操纵性,强调了将 rubric 设计纳入系统级对齐风险管控的重要性。

链接: https://arxiv.org/abs/2602.13576
作者: Ruomeng Ding,Yifei Pang,He Sun,Yizhong Wang,Zhiwei Steven Wu,Zhun Deng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge’s preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies. This leads to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: this https URL. Warning: Certain sections may contain potentially harmful content that may not be appropriate for all readers.

[NLP-95] Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)对齐方法中存在的数据稀缺性、噪声敏感性和训练不稳定性问题,这些问题源于现有方法依赖于将海量人类偏好数据压缩为静态的绝对奖励函数。其解决方案的关键在于提出一种名为Elo-Evolve的协同进化框架,通过将对齐过程建模为动态多智能体竞争,并引入两个核心创新:一是摒弃Bradley-Terry模型假设,直接从成对比较中的胜负结果学习;二是采用基于Elo评分的对手选择机制,通过温度控制采样实现自动课程学习。该方法在理论和实验层面均验证了其优越性,特别是在降低噪声干扰和提升样本效率方面表现突出。

链接: https://arxiv.org/abs/2602.13575
作者: Jing Zhao,Ting Zhen,Junwei bao,Hongfei Jiang,Yang song
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods static pairwise training Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.

[NLP-96] LLM -Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型任务中存在幻觉(hallucination)的问题,尤其是检索增强生成(Retrieval-Augmented Generation, RAG)系统中因文档检索与排序不准确导致的性能瓶颈。现有重排序器(reranker)通常依赖专门训练、计算开销大,且未能充分利用LLM内在的置信度信号。其解决方案的关键在于提出一种无需训练、即插即用的LLM-Confidence Reranker(LCR),通过黑盒方式提取LLM的置信度信息——基于最大语义聚类比例(Maximum Semantic Cluster Proportion, MSCP)进行评估,并采用两阶段策略:首先通过多项式采样与聚类完成置信度估计,随后依据查询和文档置信度阈值进行分箱与多级排序,从而在保持高置信度查询原始排名的同时提升相关文档优先级,显著改善NDCG@5指标,且具备良好的计算效率与可扩展性。

链接: https://arxiv.org/abs/2602.13571
作者: Zhipeng Song,Xiangyu Kong,Xinrui Bao,Yizhi Zhou,Jiulong Jiao,Sitong Liu,Yuhang Zhou,Heng Qi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published by ESWA

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers demonstrate effectiveness, they frequently necessitate specialized training, impose substantial computational expenses, and fail to fully exploit the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR–using only 7–9B-parameter pre-trained LLMs–consistently improves NDCG@5 by up to 20.6% across pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR’s mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications like medical diagnosis.

[NLP-97] DistillLens: Symmetric Knowledge Distillation Through Logit Lens

【速读】: 该论文旨在解决标准知识蒸馏(Knowledge Distillation, KD)在压缩大语言模型(Large Language Models, LLMs)时,仅优化最终输出而忽视教师模型中间层思维过程的局限性。现有方法(如基于MSE或非对称KL散度的特征蒸馏)未能充分建模中间表示中的不确定性分布,从而影响学生模型对高熵信息通道的保留能力。其解决方案的关键在于提出DistillLens框架,通过“Logit Lens”将教师和学生的中间隐藏状态映射到词汇空间,并采用对称发散目标(symmetric divergence objective)实现双向结构对齐,从而在训练中施加双侧惩罚机制,有效抑制过自信与欠自信现象,同时保留对最终推理至关重要的高熵信息流。

链接: https://arxiv.org/abs/2602.13567
作者: Manish Dhakal,Uthman Jinadu,Anjila Budathoki,Rajshekhar Sunderraman,Yi Ding
机构: 未知
类目: Computation and Language (cs.CL)
备注: Knowledge Distillation in LLMs

点击查看摘要

Abstract:Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the teacher’s intermediate layer’s thought process as a black box. While feature-based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual-sided penalty, preventing both overconfidence and underconfidence while preserving the high-entropy information conduits essential for final deduction. Extensive experiments on GPT-2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature-transfer baselines on diverse instruction-following benchmarks. The code is available at this https URL.

[NLP-98] Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning

【速读】: 该论文旨在解决生成式 AI(Generative AI)在复杂推理任务中因安全对齐(safety alignment)而产生的安全性与实用性之间的权衡问题。现有方法通常通过上下文蒸馏构建带有显式安全规则的思维链(Chain-of-Thought, CoT)训练数据,这种做法虽提升了安全性,却因将规则记忆与拒绝响应强行绑定,限制了模型的推理能力。其解决方案的关键在于提出自适应安全上下文学习(Adaptive Safe Context Learning, ASCL)框架,将安全对齐建模为多轮工具使用过程,使模型能自主决定何时调用安全规则并如何生成后续推理;同时引入逆频率策略优化(Inverse Frequency Policy Optimization, IFPO),以纠正强化学习(Reinforcement Learning, RL)中对规则调用的偏好偏差,从而解耦规则检索与推理生成,显著提升整体性能。

链接: https://arxiv.org/abs/2602.13562
作者: Yanbo Wang,Minzheng Wang,Jian Liang,Lu Wang,Yongcan Yu,Ran He
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. 18 pages, 6 figures

点击查看摘要

Abstract:While reasoning models have achieved remarkable success in complex reasoning tasks, their increasing power necessitates stringent safety measures. For safety alignment, the core challenge lies in the inherent trade-off between safety and utility. However, prevailing alignment strategies typically construct CoT training data with explicit safety rules via context distillation. This approach inadvertently limits reasoning capabilities by creating a rigid association between rule memorization and refusal. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework to improve the reasoning given proper context. ASCL formulates safety alignment as a multi-turn tool-use process, empowering the model to autonomously decide when to consult safety rules and how to generate the ongoing reasoning. Furthermore, to counteract the preference for rule consultation during RL, we introduce Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates. By decoupling rule retrieval and subsequent reasoning, our method achieves higher overall performance compared to baselines.

[NLP-99] Small Reward Models via Backward Inference

【速读】: 该论文旨在解决当前奖励模型(Reward Model, RM)在非可验证领域中依赖大语言模型(LLM)作为裁判(LLM-as-a-Judge)所带来的局限性,包括对强大推理能力的过度依赖、需要参考响应或显式评分标准(rubric)导致灵活性不足等问题。其解决方案的关键在于提出FLIP(FLipped Inference for Prompt reconstruction),一种无需参考响应和评分标准的奖励建模方法:通过反向推理(backward inference)推断出最可能生成给定响应的指令,并以推断指令与原始指令之间的相似度作为奖励信号。该方法利用了验证-生成差距(validation-generation gap),在小模型场景下实现了比LLM-as-a-Judge基线平均提升79.6%的性能,且在测试时缩放(test-time scaling)和GRPO训练中显著改善下游任务表现,尤其适用于长输出并具备对常见奖励黑客(reward hacking)的鲁棒性。

链接: https://arxiv.org/abs/2602.13551
作者: Yike Wang,Faeze Brahman,Shangbin Feng,Teng Xiao,Hannaneh Hajishirzi,Yulia Tsvetkov
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at this https URL.

[NLP-100] On Calibration of Large Language Models : From Response To Capability

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中缺乏可靠置信度估计的问题,尤其针对现有方法仅关注单次输出的响应级校准(response calibration)与真实场景中对模型整体能力预测需求之间的不匹配。研究指出,现代LLM解码过程的随机性导致单次输出正确性无法反映模型的真实能力,从而误导置信度评估。解决方案的关键在于提出“能力校准”(capability calibration),即聚焦于模型在给定查询上的期望准确率,而非单个响应的正确性。作者从理论和实证层面区分了能力校准与响应校准,并通过实验验证能力校准能显著提升 pass@k 预测精度和推理预算分配效率,为多种应用场景奠定基础。

链接: https://arxiv.org/abs/2602.13540
作者: Sin-Han Yang,Cheng-Kuang Wu,Chieh-Yen Lin,Yun-Nung Chen,Hung-yi Lee,Shao-Hua Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model’s expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@ k prediction and inference budget allocation, establishing a foundation with potential for diverse applications.

[NLP-101] SecureGate: Learning When to Reveal PII Safely via Token-Gated Dual-Adapters for Federated LLM s

【速读】: 该论文旨在解决联邦微调生成式大语言模型(Generative LLMs)过程中面临的两大核心问题:一是由于模型记忆效应导致的个人身份信息(PII)泄露风险,二是异构数据下全局泛化能力与本地任务效用之间的固有矛盾。现有防御方法如数据清洗和差分隐私虽能缓解隐私泄露,但常以牺牲下游任务性能为代价。其解决方案的关键在于提出SecureGate框架,采用双适配器LoRA架构——一个安全适配器用于学习可共享的去敏表示,另一个揭示适配器保留组织特异的敏感知识,并通过令牌控制的门控模块在推理阶段选择性激活相应适配器,从而实现无需重新训练即可精细调控信息暴露程度,在保障隐私的同时维持高任务效用。

链接: https://arxiv.org/abs/2602.13529
作者: Mohamed Shaaban,Mohamed Elmahallawy
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative training across organizational silos without sharing raw data, making it attractive for privacy-sensitive applications. With the rapid adoption of large language models (LLMs), federated fine-tuning of generative LLMs has gained attention as a way to leverage distributed data while preserving confidentiality. However, this setting introduces fundamental challenges: (i) privacy leakage of personally identifiable information (PII) due to LLM memorization, and (ii) a persistent tension between global generalization and local utility under heterogeneous data. Existing defenses, such as data sanitization and differential privacy, reduce leakage but often degrade downstream performance. We propose SecureGate, a privacy-aware federated fine-tuning framework for LLMs that provides fine-grained privacy control without sacrificing utility. SecureGate employs a dual-adapter LoRA architecture: a secure adapter that learns sanitized, globally shareable representations, and a revealing adapter that captures sensitive, organization-specific knowledge. A token-controlled gating module selectively activates these adapters at inference time, enabling controlled information disclosure without retraining. Extensive experiments across multiple LLMs and real-world datasets show that SecureGate improves task utility while substantially reducing PII leakage, achieving up to a 31.66X reduction in inference attack accuracy and a 17.07X reduction in extraction recall for unauthorized requests. Additionally, it maintains 100% routing reliability to the correct adapter and incurs only minimal computational and communication overhead.

[NLP-102] hink Deep Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因盲目增加生成长度而导致性能下降的问题,即传统基于token数量的测试时扩展策略(test-time scaling)无法可靠反映真实推理质量。其核心解决方案在于提出“深度思考标记”(deep-thinking tokens)的概念——这些是模型内部预测在深层网络层中发生显著修正的token,反映了真正的推理深度。研究发现,深度思考比例(deep-thinking ratio)与推理准确性呈现稳健且正向的相关性,显著优于基于长度或置信度的基线方法。基于此洞察,作者设计了Think@n策略,通过优先选择高深度思考比例的样本,并利用短前缀实现对低质量生成的早期剔除,从而在保持甚至超越自一致性(self-consistency)性能的同时大幅降低推理成本。

链接: https://arxiv.org/abs/2602.13517
作者: Wei-Lin Chen,Liqian Peng,Tian Tan,Chao Zhao,Blake JianHang Chen,Ziqian Lin,Alec Go,Yu Meng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal “overthinking,” leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens – tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.

[NLP-103] From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier

【速读】: 该论文旨在解决土耳其新闻媒体中大语言模型(Large Language Models, LLMs)生成内容的实证测量问题,填补现有研究仅限于记者定性访谈或假新闻检测的空白。其解决方案的关键在于:首先,基于三个具有不同编辑立场的主流土耳其新闻来源构建了一个包含3,600篇文章的标注数据集;其次,对土耳其语专用BERT模型(dbmdz/bert-base-turkish-cased)进行微调,实现对AI改写内容的二分类任务;最终,在超过3,500篇未见文章上的部署结果显示,模型具备跨来源和时间稳定的分类能力,平均预测置信度高于0.96,并估计约2.5%的新闻内容被LLMs重写或修订,从而首次实现了对土耳其新闻媒体中AI使用情况的量化实证分析。

链接: https://arxiv.org/abs/2602.13504
作者: Ozancan Ozdemir
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves 0.9708 F1 score on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles spanning between 2023 and 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percentage of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.

[NLP-104] Language Model Memory and Memory Models for Language

【速读】: 该论文试图解决的问题是:当前机器学习模型(尤其是语言模型)在隐藏层向量嵌入(embedding)中存储输入信息的能力有限,难以实现高效且准确的记忆功能,而传统基于下一词预测(next token prediction)的训练目标本身不具备可逆性,导致嵌入无法有效承载完整输入信息。解决方案的关键在于引入一种并行化的编码器-解码器记忆模型架构,并通过联合优化因果训练目标与信息保留目标函数,使模型能够学习形成高保真、可解码的信息丰富型记忆;同时采用冻结高保真编码器后进行课程式训练(curriculum training)的方法,先让解码器学会处理记忆,再进一步预测下一个词,从而显著提升训练效率与记忆准确性。

链接: https://arxiv.org/abs/2602.13466
作者: Benjamin L. Badger
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of `memory’, is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

[NLP-105] LLM -Powered Automatic Translation and Urgency in Crisis Scenarios

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在危机情境下进行多语言翻译时存在的性能不稳定与紧迫感(urgency)失真问题,尤其关注其在高风险场景中是否能有效维持信息的紧急性以保障响应效率。解决方案的关键在于构建了一个涵盖32种语言的紧迫感标注数据集,并通过实证分析发现:即便翻译在语言学上准确,LLMs 和专用机器翻译系统仍可能显著扭曲用户对信息紧迫性的感知,且LLM的紧迫感判别结果高度依赖于输入和提示的语言。这一发现揭示了通用语言技术在危机通信中部署的重大风险,强调需建立面向危机场景的专门评估框架。

链接: https://arxiv.org/abs/2602.13452
作者: Belu Ticona,Antonis Anastasopoulos
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.

[NLP-106] Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在多轮交互和工具使用场景下安全性显著下降的问题,即“能力-安全”差距扩大。现有基准测试未能充分覆盖此类复杂情境,导致潜在风险被忽视。解决方案的关键在于提出一个系统性的分类法(taxonomy),将单轮有害任务转化为多轮攻击序列,并据此构建首个评估多轮工具使用智能体安全性的基准MT-AgentRisk;同时提出无需训练、与工具无关的自探索防御机制ToolShield,其通过智能体自主生成测试用例并观察下游影响来提炼安全经验,从而有效降低攻击成功率(Attack Success Rate, ASR)平均达30%。

链接: https://arxiv.org/abs/2602.13379
作者: Xu Li,Simon Yu,Minzhou Pan,Yiyou Sun,Bo Li,Dawn Song,Xue Lin,Weiyan Shi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at this https URL.

[NLP-107] An Online Reference-Free Evaluation Framework for Flowchart Image-to-Code Generation

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在文档处理流水线中,对任意输入图像进行流图(flowchart)到结构化代码(如 Mermaid)转换时,因缺乏真实标签(ground-truth)而导致输出质量难以评估的问题。解决方案的关键在于提出一种无需参考文本的评估框架,该框架仅依赖输入图像与生成输出即可在推理阶段实时监测质量;其核心创新是引入两个自动化指标:通过光学字符识别(OCR)提取输入图像文本作为代理参考以计算内容覆盖度(Recall_OCR),并通过视觉蕴含(Visual Entailment, VE)检测生成结果中幻觉元素以衡量精确度(Precision_VE),二者调和平均值(F1_OCR-VE)构成统一的质量评分,实验证明该指标与真实标签高度一致(Pearson相关系数达0.97、0.91和0.94),具备在生产环境中持续监控模型性能的实用性。

链接: https://arxiv.org/abs/2602.13376
作者: Giang Son Nguyen,Zi Pong Lim,Sarthak Ketanbhai Modi,Yon Shin Teo,Wenya Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 4 tables. Under review

点击查看摘要

Abstract:Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code exists, making output quality difficult to assess. We propose a reference-free evaluation framework that monitors flowchart image-to-code generation quality at inference time, using only the input image and the generated output. The framework introduces two automated metrics: \textRecall\textOCR , which estimates content coverage by extracting text from the input image via OCR as a proxy reference, and \textPrecision\textVE , which detects hallucinated elements through Visual Entailment against the original image. Their harmonic mean, \textF1\textOCR-VE , provides a unified quality score. Validation on the FlowVQA dataset shows strong agreement with ground-truth metrics (average Pearson’s r = 0.97 , 0.91 , and 0.94 for Recall, Precision, and F1, respectively), confirming the framework’s reliability as a practical, reference-free alternative for continuous quality monitoring in production settings.

[NLP-108] Nanbeige4.1-3B: A Small General Model that Reason s Aligns and Acts

【速读】: 该论文旨在解决小语言模型(Small Language Model, SLM)在保持参数规模有限的前提下,如何同时实现强代理行为(agentic behavior)、代码生成(code generation)与通用推理(general reasoning)的难题。传统观点认为,这些能力通常需要大模型才能兼顾,而本文提出了一种统一架构下的多任务优化方案,其关键在于:首先结合点级(point-wise)与成对级(pair-wise)奖励建模以提升推理质量和偏好对齐;其次设计复杂度感知奖励机制(complexity-aware rewards)用于强化学习中的代码生成优化,兼顾正确性与效率;最后通过深度搜索中复杂的数据合成和轮次级监督(turn-level supervision),实现了长达600步工具调用的稳定长程交互,从而显著提升了模型在复杂任务上的执行能力。实验表明,Nanbeige4.1-3B在性能上超越了同量级模型甚至部分更大规模模型,证明了小模型亦可兼具广度与深度的能力。

链接: https://arxiv.org/abs/2602.13367
作者: Chen Yang,Guangyue Peng,Jiaying Zhu,Ran Le,Ruixiang Feng,Tao Zhang,Xiyun Xu,Yang Song,Yiming Jia,Yuntao Wen,Yunzhi Xu,Zekai Wang,Zhenwei An,Zhicong Sun,Zongchao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.

[NLP-109] Using Deep Learning to Generate Semantically Correct Hindi Captions

【速读】: 该论文旨在解决多模态场景下生成高质量 Hindi 语言图像描述(image captioning)的问题,尤其是在非英语语境中提升图像理解与自然语言生成的准确性。其解决方案的关键在于融合局部视觉特征、全局视觉特征、注意力机制(attention mechanism)以及预训练卷积神经网络(CNN),如 VGG16、ResNet50 和 Inception V3,结合双向长短期记忆网络(bidirectional LSTM)进行文本编码,并引入额外的注意力层以动态加权不同时间步的视觉特征,从而构建句子级特征向量。实验表明,基于 VGG16 的注意力机制增强型双向 LSTM 模型在 BLEU-1 和 BLEU-4 评分上分别达到 0.59 和 0.19,优于其他基线模型,证明了该方法在生成语义准确、相关性强的 Hindi 图像描述方面的有效性。

链接: https://arxiv.org/abs/2602.13352
作者: Wasim Akram Khan,Anil Kumar Vuppala
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages, 12 figures, 3 tables. Master’s thesis, Liverpool John Moores University, November 2022

点击查看摘要

Abstract:Automated image captioning using the content from the image is very appealing when done by harnessing the capability of computer vision and natural language processing. Extensive research has been done in the field with a major focus on the English language which gives the scope for further developments in the same with consideration of popular foreign languages. This research utilizes distinct models for translating the image caption into Hindi, the fourth most popular language across the world. Exploring the multi-modal architectures this research comprises local visual features, global visual features, attention mechanisms, and pre-trained models. Using google cloud translator on the image dataset from Flickr8k, Hindi image descriptions have been generated. Pre-trained CNNs like VGG16, ResNet50, and Inception V3 helped in retrieving image characteristics, while the uni-directional and bi-directional techniques of text encoding are used for the text encoding process. An additional Attention layer helps to generate a weight vector and, by multiplying it, combine image characteristics from each time step into a sentence-level feature vector. Bilingual evaluation understudy scores are used to compare the research outcome. Many experiments that serve as a baseline are done for the comparative analysis of the research. An image with a score of BLEU-1 is considered sufficient, whereas one with a score of BLEU-4 is considered to have fluid image captioning. For both BLEU scores, the attention-based bidirectional LSTM with VGG16 produced the best results of 0.59 and 0.19 respectively. The experiments conclude that researchs ability to produce relevant, semantically accurate image captions in Hindi. The research accomplishes the goals and future research can be guided by this research model.

[NLP-110] Exploring the Performance of ML/DL Architectures on the MNIST-1D Dataset

【速读】: 该论文旨在解决小规模数据集(如MNIST)在评估先进神经网络架构时因过于简单而难以区分模型性能的问题。为应对这一挑战,研究者提出使用MNIST-1D这一一维化版本的MNIST数据集,该数据集在保持小规模优势的同时引入了序列复杂性和变异性,从而更适于考察高级架构对顺序模式和层次特征的捕捉能力。解决方案的关键在于利用MNIST-1D作为基准测试平台,系统评估ResNet、TCN和DCNN等具有强归纳偏置(inductive bias)和层级特征提取能力的模型,并验证其在计算资源受限条件下显著优于传统模型(如逻辑回归、MLP、CNN和GRU),进而证明该数据集能有效区分不同架构的性能差异,推动深度学习模型在资源受限环境下的优化设计。

链接: https://arxiv.org/abs/2602.13348
作者: Michael Beebe,GodsGift Uzor,Manasa Chepuri,Divya Sree Vemula,Angel Ayala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Small datasets like MNIST have historically been instrumental in advancing machine learning research by providing a controlled environment for rapid experimentation and model evaluation. However, their simplicity often limits their utility for distinguishing between advanced neural network architectures. To address these challenges, Greydanus et al. introduced the MNIST-1D dataset, a one-dimensional adaptation of MNIST designed to explore inductive biases in sequential data. This dataset maintains the advantages of small-scale datasets while introducing variability and complexity that make it ideal for studying advanced architectures. In this paper, we extend the exploration of MNIST-1D by evaluating the performance of Residual Networks (ResNet), Temporal Convolutional Networks (TCN), and Dilated Convolutional Neural Networks (DCNN). These models, known for their ability to capture sequential patterns and hierarchical features, were implemented and benchmarked alongside previously tested architectures such as logistic regression, MLPs, CNNs, and GRUs. Our experimental results demonstrate that advanced architectures like TCN and DCNN consistently outperform simpler models, achieving near-human performance on MNIST-1D. ResNet also shows significant improvements, highlighting the importance of leveraging inductive biases and hierarchical feature extraction in small structured datasets. Through this study, we validate the utility of MNIST-1D as a robust benchmark for evaluating machine learning architectures under computational constraints. Our findings emphasize the role of architectural innovations in improving model performance and offer insights into optimizing deep learning models for resource-limited environments. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2602.13348 [cs.LG] (or arXiv:2602.13348v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.13348 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-111] Artificial Organisations

【速读】: 该论文旨在解决多智能体AI系统中个体组件不可靠时如何实现整体行为可靠的问题(即“多智能体AI安全”问题)。传统方法依赖于对单个智能体进行对齐(alignment),而本文提出借鉴人类机构通过组织结构降低个体偏差风险的机制,采用架构设计而非假设个体对齐来达成可靠集体行为。其解决方案的关键在于引入信息不对称的分层验证机制:通过模块化分工(如撰稿者、核查者与批评者)并强制实施信息隔离,使各组件在有限知识范围内执行特定任务——核查者基于完整源数据验证事实,批评者独立评估论证质量而不接触源数据,从而形成交叉验证链条。实验表明,这种架构可促使系统从试图伪造内容转向诚实拒绝并提供替代方案,展现出无需显式指令或个体激励即可实现可靠行为的能力。

链接: https://arxiv.org/abs/2602.13275
作者: William Waites
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alignment research focuses on making individual AI systems reliable. Human institutions achieve reliable collective behaviour differently: they mitigate the risk posed by misaligned individuals through organisational structure. Multi-agent AI systems should follow this institutional model using compartmentalisation and adversarial review to achieve reliable outcomes through architectural design rather than assuming individual alignment. We demonstrate this approach through the Perseverance Composition Engine, a multi-agent system for document composition. The Composer drafts text, the Corroborator verifies factual substantiation with full source access, and the Critic evaluates argumentative quality without access to sources: information asymmetry enforced by system architecture. This creates layered verification: the Corroborator detects unsupported claims, whilst the Critic independently assesses coherence and completeness. Observations from 474 composition tasks (discrete cycles of drafting, verification, and evaluation) exhibit patterns consistent with the institutional hypothesis. When assigned impossible tasks requiring fabricated content, this iteration enabled progression from attempted fabrication toward honest refusal with alternative proposals–behaviour neither instructed nor individually incentivised. These findings motivate controlled investigation of whether architectural enforcement produces reliable outcomes from unreliable components. This positions organisational theory as a productive framework for multi-agent AI safety. By implementing verification and evaluation as structural properties enforced through information compartmentalisation, institutional design offers a route to reliable collective behaviour from unreliable individual components. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2602.13275 [cs.AI] (or arXiv:2602.13275v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.13275 Focus to learn more arXiv-issued DOI via DataCite

[NLP-112] ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在道德能力与安全对齐方面因提示设计(prompt design)差异而导致的性能不一致问题,尤其关注不同提示范式在多数据集上的表现碎片化现象。解决方案的关键在于提出 ProMoral-Bench——一个统一基准,用于系统评估11种提示范式在四个LLM家族中的表现,并引入统一道德安全评分(Unified Moral Safety Score, UMSS),该指标兼顾准确性和安全性。实验表明,紧凑且基于示例引导的提示结构(exemplar-guided scaffolds)优于复杂的多阶段推理策略,在更低的token消耗下实现更高的UMSS得分和更强的鲁棒性,从而为高效、安全的提示工程提供可量化、可比较的标准化框架。

链接: https://arxiv.org/abs/2602.13274
作者: Rohan Subramanian Thomas,Shikhar Shiromani,Abdullah Chaudhry,Ruizhe Li,Vasu Sharma,Kevin Zhu,Sunishchal Dev
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt design significantly impacts the moral competence and safety alignment of large language models (LLMs), yet empirical comparisons remain fragmented across datasets and this http URL introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance via our proposed Unified Moral Safety Score (UMSS), a metric balancing accuracy and safety. Our results show that compact, exemplar-guided scaffolds outperform complex multi-stage reasoning, providing higher UMSS scores and greater robustness at a lower token cost. While multi-turn reasoning proves fragile under perturbations, few-shot exemplars consistently enhance moral stability and jailbreak resistance. ProMoral-Bench establishes a standardized framework for principled, cost-effective prompt engineering.

[NLP-113] Directional Concentration Uncertainty: A representational approach to uncertainty quantification for generative models

【速读】: 该论文旨在解决生成式 AI(Generative AI)模型在不确定性量化(Uncertainty Quantification, UQ)中的可信度与鲁棒性问题,特别是现有方法依赖于任务特定的启发式规则、难以跨任务和模态泛化的局限。其解决方案的关键在于提出一种名为方向集中不确定性(Directional Concentration Uncertainty, DCU)的新颖统计框架,该方法基于 von Mises-Fisher (vMF) 分布,通过连续嵌入空间中多个生成输出的几何分散程度来量化不确定性,无需任何任务特定的启发式设计,从而实现对语言模型输出的通用且准确的不确定性评估,并在多模态场景下展现出良好的泛化能力。

链接: https://arxiv.org/abs/2602.13264
作者: Souradeep Chattopadhyay,Brendan Kennedy,Sai Munikoti,Soumik Sarkar,Karl Pazdernik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the critical task of making generative models trustworthy and robust, methods for Uncertainty Quantification (UQ) have begun to show encouraging potential. However, many of these methods rely on rigid heuristics that fail to generalize across tasks and modalities. Here, we propose a novel framework for UQ that is highly flexible and approaches or surpasses the performance of prior heuristic methods. We introduce Directional Concentration Uncertainty (DCU), a novel statistical procedure for quantifying the concentration of embeddings based on the von Mises-Fisher (vMF) distribution. Our method captures uncertainty by measuring the geometric dispersion of multiple generated outputs from a language model using continuous embeddings of the generated outputs without any task specific heuristics. In our experiments, we show that DCU matches or exceeds calibration levels of prior works like semantic entropy (Kuhn et al., 2023) and also generalizes well to more complex tasks in multi-modal domains. We present a framework for the wider potential of DCU and its implications for integration into UQ for multi-modal and agentic frameworks.

[NLP-114] Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

【速读】: 该论文旨在解决自动语音识别(ASR)系统在处理口音语音时性能下降的问题,其核心原因是口音引起的声学-音位和语调特征偏移导致与训练数据不匹配,从而使得有监督的口音适应方法成本高昂。为应对这一挑战,作者提出了一种无需参考文本(reference-free)且基于多模态一致性引导的数据选择流程,用于在无标签条件下进行口音适应。该方案的关键在于:首先通过子模态互信息(submodular mutual information)实现目标导向的预筛选以提升查询相关性并降低下游计算开销;随后利用扰动解码生成每个语音片段的多个伪转录本,并结合共享嵌入空间中的语音-文本对齐度与预测词错误率(WER)两个无参考信号进行评分;最后采用简单的百分位阈值筛选策略保留可靠伪标签用于微调,同时剔除噪声样本。实验表明,在同域场景下仅需约1.5k高质量伪标签即可逼近全监督30k标签的效果,而在跨域场景中,该方法能有效避免因强口音偏移导致的伪标签劣化问题,显著优于随机采样及现有基线方法。

链接: https://arxiv.org/abs/2602.13263
作者: Ligong Lei,Wenwen Lu,Xudong Pang,Zaokere Kadeer,Aishan Wumaier
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch to training data, making labeled accent adaptation costly. However, common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification when fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech–text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning while discarding noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k pool achieves 10.91% WER, close to 10.45% obtained using 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.

[NLP-115] General learned delegation by clones

【速读】: 该论文旨在解决前沿语言模型在测试时计算资源受限条件下,因串行推理或非协调的并行采样导致计算效率低下的问题。解决方案的关键在于提出SELFCEST框架,通过智能体强化学习使基础模型具备在独立并行上下文中生成相同权重克隆的能力,并基于全局任务奖励进行端到端训练,从而学习一个控制器,实现对生成和上下文预算在不同分支间的最优分配。该方法在数学推理和长上下文多跳问答等挑战性任务中,在固定推理预算下显著提升了准确率-成本帕累托前沿,并展现出良好的分布外泛化能力。

链接: https://arxiv.org/abs/2602.13262
作者: Darren Li,Meiqi Chen,Chenze Shao,Fandong Meng,Jie Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code available at this https URL

点击查看摘要

Abstract:Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.

[NLP-116] X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles

【速读】: 该论文旨在解决自动驾驶车辆(AV)中自然语言解释(Natural Language Explanations)缺乏系统性分析框架的问题,尤其是在不同驾驶场景下人类如何通过语言构建驾驶理由的机制尚不明确。解决方案的关键在于提出X-Blocks(eXplanation Blocks)这一分层分析框架,从语境(context)、句法(syntax)和词汇(lexicon)三个层次识别解释的语言构成要素:在语境层面引入RACE(Reasoning-Aligned Classification of Explanations)——一个结合链式思维推理与自一致性机制的多大语言模型(LLM)集成框架,实现对32类场景感知解释的高精度分类(准确率91.45%,Cohen’s kappa 0.91);在词汇层面利用带有信息性Dirichlet先验的log-odds分析揭示情境特异性词汇模式;在句法层面通过依存句法分析与模板提取发现解释依赖有限但可复用的语法结构家族,并呈现谓词类型与因果构造在不同语境中的系统性差异。该框架具备数据集无关性和任务独立性,为生成具场景感知能力的透明、可信且认知易访问的自动驾驶解释提供了实证驱动的设计原则。

链接: https://arxiv.org/abs/2602.13248
作者: Ashkan Y. Zadeh,Xiaomeng Li,Andry Rakotonirainy,Ronald Schroeter,Sebastien Glaser,Zishuo Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper introduces X-Blocks (eXplanation Blocks), a hierarchical analytical framework that identifies the linguistic building blocks of natural language explanations for AVs at three levels: context, syntax, and lexicon. At the context level, we propose RACE (Reasoning-Aligned Classification of Explanations), a multi-LLM ensemble framework that combines Chain-of-Thought reasoning with self-consistency mechanisms to robustly classify explanations into 32 scenario-aware categories. Applied to human-authored explanations from the Berkeley DeepDrive-X dataset, RACE achieves 91.45 percent accuracy and a Cohens kappa of 0.91 against cases with human annotator agreement, indicating near-human reliability for context classification. At the lexical level, log-odds analysis with informative Dirichlet priors reveals context-specific vocabulary patterns that distinguish driving scenarios. At the syntactic level, dependency parsing and template extraction show that explanations draw from a limited repertoire of reusable grammar families, with systematic variation in predicate types and causal constructions across contexts. The X-Blocks framework is dataset-agnostic and task-independent, offering broad applicability to other automated driving datasets and safety-critical domains. Overall, our findings provide evidence-based linguistic design principles for generating scenario-aware explanations that support transparency, user trust, and cognitive accessibility in automated driving systems. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO) Cite as: arXiv:2602.13248 [cs.AI] (or arXiv:2602.13248v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.13248 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Ashkan Yousefi Zadeh [view email] [v1] Mon, 2 Feb 2026 07:18:25 UTC (14,248 KB)

[NLP-117] NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models EACL2026

【速读】: 该论文旨在解决自然语言到一阶逻辑(First-Order Logic, FOL)翻译中的语法脆弱性和语义不忠实问题,这些问题限制了自动化推理在法律与治理等高可靠性场景中的应用。现有方法如GCD和CODE4LOGIC虽利用大语言模型(Large Language Models, LLMs)提升逻辑解析能力,但因缺乏对全局语法约束的强控制和对子句级语义理解不足,导致生成代码的语法准确率和语义正确性受限。解决方案的关键在于提出NL2LOGIC框架,其创新性地引入抽象语法树(Abstract Syntax Tree, AST)作为中间表示,结合基于递归LLM的语义解析器与AST引导的确定性生成器,从而实现语法精准控制与语义忠实映射。实验表明,该方法在FOLIO、LogicNLI和ProofWriter基准上达到99%语法准确率,并将语义正确性提升最高达30%,同时显著增强下游推理任务的可执行性与准确性。

链接: https://arxiv.org/abs/2602.13237
作者: Rizky Ramadhana Putra,Raihan Sultan Pasha Basuki,Yutong Cheng,Peng Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to Findings of EACL 2026. 17 pages, 6 figures

点击查看摘要

Abstract:Automated reasoning is critical in domains such as law and governance, where verifying claims against facts in documents requires both accuracy and interpretability. Recent work adopts structured reasoning pipelines that translate natural language into first-order logic and delegate inference to automated solvers. With the rise of large language models, approaches such as GCD and CODE4LOGIC leverage their reasoning and code generation capabilities to improve logic parsing. However, these methods suffer from fragile syntax control due to weak enforcement of global grammar constraints and low semantic faithfulness caused by insufficient clause-level semantic understanding. We propose NL2LOGIC, a first-order logic translation framework that introduces an abstract syntax tree as an intermediate representation. NL2LOGIC combines a recursive large language model based semantic parser with an abstract syntax tree guided generator that deterministically produces solver-ready logic code. Experiments on the FOLIO, LogicNLI, and ProofWriter benchmarks show that NL2LOGIC achieves 99 percent syntactic accuracy and improves semantic correctness by up to 30 percent over state-of-the-art baselines. Furthermore, integrating NL2LOGIC into Logic-LM yields near-perfect executability and improves downstream reasoning accuracy by 31 percent compared to Logic-LM’s original few-shot unconstrained translation module.

[NLP-118] Variation is the Key: A Variation-Based Framework for LLM -Generated Text Detection

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本的检测问题,现有方法通常依赖不切实际的假设(如白盒场景)或仅基于文本层面特征,导致检测精度不足。其解决方案的关键在于提出一种名为VaryBalance的方法,该方法利用人类撰写文本与其经LLM重写版本之间的差异显著大于LLM生成文本内部差异这一现象,通过计算均值标准差(mean standard deviation)来量化这种差异,从而实现对LLM生成文本的有效区分。实验表明,VaryBalance在AUROC指标上相比当前最优检测器Binoculars提升达34.3%,且对多种生成模型和语言具有鲁棒性。

链接: https://arxiv.org/abs/2602.13226
作者: Xuecong Li,Xiaohong Li,Qiang Hu,Yao Zhang,Junjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Detecting text generated by large language models (LLMs) is crucial but challenging. Existing detectors depend on impractical assumptions, such as white-box settings, or solely rely on text-level features, leading to imprecise detection ability. In this paper, we propose a simple but effective and practical LLM-generated text detection method, VaryBalance. The core of VaryBalance is that, compared to LLM-generated texts, there is a greater difference between human texts and their rewritten version via LLMs. Leveraging this observation, VaryBalance quantifies this through mean standard deviation and distinguishes human texts and LLM-generated texts. Comprehensive experiments demonstrated that VaryBalance outperforms the state-of-the-art detectors, i.e., Binoculars, by up to 34.3% in terms of AUROC, and maintains robustness against multiple generating models and languages.

[NLP-119] A Geometric Taxonomy of Hallucinations in LLM s

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中“幻觉”(hallucination)现象的混淆问题,即不同类型的错误在嵌入空间中具有不同的几何特征,而现有研究常将它们混为一谈。解决方案的关键在于提出一个基于几何结构的分类体系:将幻觉分为三类——不忠实性(unfaithfulness,未能与给定上下文交互)、虚构性(confabulation,生成语义无关内容)和事实性错误(factual error,正确概念框架内的错误陈述)。研究发现,前两类(类型 I 和 II)在嵌入空间中具有可检测的判别方向,且在领域内检测性能优异(AUROC 0.76–0.99),跨领域时则因方向近似正交而失效;而类型 III 错误无法通过嵌入空间区分,因其本质是语义模式与外部现实之间的对应缺失,嵌入仅编码共现分布而非事实真伪,因此需依赖外部验证机制。这一几何分类明确了基于嵌入的检测方法的有效边界。

链接: https://arxiv.org/abs/2602.13224
作者: Javier Marín
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The term “hallucination” in large language models conflates distinct phenomena with different geometric signatures in embedding space. We propose a taxonomy identifying three types: unfaithfulness (failure to engage with provided context), confabulation (invention of semantically foreign content), and factual error (incorrect claims within correct conceptual frames). We observe a striking asymmetry. On standard benchmarks where hallucinations are LLM-generated, detection is domain-local: AUROC 0.76-0.99 within domains, but 0.50 (chance level) across domains. Discriminative directions are approximately orthogonal between domains (mean cosine similarity -0.07). On human-crafted confabulations - invented institutions, redefined terminology, fabricated mechanisms - a single global direction achieves 0.96 AUROC with 3.8% cross-domain degradation. We interpret this divergence as follows: benchmarks capture generation artifacts (stylistic signatures of prompted fabrication), while human-crafted confabulations capture genuine topical drift. The geometric structure differs because the underlying phenomena differ. Type III errors show 0.478 AUROC - indistinguishable from chance. This reflects a theoretical constraint: embeddings encode distributional co-occurrence, not correspondence to external reality. Statements with identical contextual patterns occupy similar embedding regions regardless of truth value. The contribution is a geometric taxonomy clarifying the scope of embedding-based detection: Types I and II are detectable; Type III requires external verification mechanisms.

[NLP-120] Scaling the Scaling Logic: Agent ic Meta-Synthesis of Logic Reasoning

【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)训练信号扩展的瓶颈问题,即如何高效生成大规模、高质量且结构化可验证的训练数据。现有方法受限于专家编写代码或固定模板,难以实现任务家族级别的规模化演进。其解决方案的关键在于提出SSLogic框架——一个基于代理的元合成系统,通过闭环的“生成-验证-修复”机制迭代优化Generator–Validator程序对,从而在可控难度下持续演化任务家族;同时引入多门控验证协议(Multi-Gate Validation Protocol),结合多策略一致性校验与对抗盲审机制,确保任务的明确性和可执行性,最终显著提升模型在多个基准测试上的性能表现。

链接: https://arxiv.org/abs/2602.13218
作者: Bowen Liu,Zhi Wu,Runquan Xie,Zhanhui Kang,Jia Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 37 pages, 8 figures, 4 tables in the main body. Project page: this https URL

点击查看摘要

Abstract:Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert-written code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator–Validator program pairs in a closed Generate–Validate–Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.

[NLP-121] Reshaping MOFs text mining with a dynamic multi-agent s framework of large language model

【速读】: 该论文旨在解决金属有机框架材料(Metal-organic frameworks, MOFs)合成条件信息在文献中分散、不一致且难以解析的问题,从而阻碍实验设计与数据驱动的材料发现。其解决方案的关键在于提出MOFh6系统,这是一个基于大语言模型(Large Language Model)的自动化信息提取工具,能够从原始文章或晶体代码中识别并结构化提取合成参数,通过跨段落关联描述、统一配体缩写与全称,并输出标准化的合成表格,实现高精度(99%提取准确率)和高效处理(每篇文章9.6秒),显著提升文献知识向可执行合成协议的转化效率。

链接: https://arxiv.org/abs/2504.18880
作者: Zuhong Lin,Daoyuan Ren,Kai Ran,Jing Sun,Songlin Yu,Xuefeng Bai,Xiaotian Huang,Haiyang He,Pengxu Pan,Ying Fang,Zhanglin Li,Haipu Li,Jingjing Yao
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and difficult to interpret. We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. It links related descriptions across paragraphs, unifies ligand abbreviations with full names, and outputs structured parameters ready for use. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 +/- 0.01. Processing a full text takes 9.6 s, locating synthesis descriptions 36 s, with 100 papers processed for USD 4.24. By replacing static database lookups with real-time extraction, MOFh6 reshapes MOF synthesis research, accelerating the conversion of literature knowledge into practical synthesis protocols and enabling scalable, data-driven materials discovery.

[NLP-122] Protect*: Steerable Retrosynthesis through Neuro-Symbolic State Encoding

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在化学合成路径规划中缺乏细粒度控制的问题,特别是如何避免生成无效或不理想的合成路径,尤其是在面对分子中具有化学敏感性的特定反应位点时。其解决方案的关键在于提出 Protect^* ——一个神经符号框架,通过将LLM的生成能力与严格的化学逻辑相结合来实现精准控制:一方面利用基于规则的推理(55+ SMARTS模式和40+保护基团数据库)确定并标记活性位点;另一方面引入“主动状态追踪”机制,将硬性符号约束以保护状态的形式注入神经推理过程,从而实现自动模式下的确定性防护和人机协同模式下的专家约束集成,最终在复杂天然产物如红霉素B的合成路径发现中验证了该方法的有效性。

链接: https://arxiv.org/abs/2602.13419
作者: Shreyas Vinaya Sathyanarayana,Shah Rahil Kirankumar,Sharanabasava D. Hiremath,Bharath Ramsundar
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable potential in scientific domains like retrosynthesis; yet, they often lack the fine-grained control necessary to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule - a task where unconstrained generation can lead to invalid or undesirable synthetic pathways. In this work, we introduce Protect ^* , a neuro-symbolic framework that grounds the generative capabilities of Large Language Models (LLMs) in rigorous chemical logic. Our approach combines automated rule-based reasoning - using a comprehensive database of 55+ SMARTS patterns and 40+ characterized protecting groups - with the generative intuition of neural models. The system operates via a hybrid architecture: an automatic mode'' where symbolic logic deterministically identifies and guards reactive sites, and a human-in-the-loop mode’’ that integrates expert strategic constraints. Through ``active state tracking,‘’ we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.

信息检索

[IR-0] Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing Business Development and Search Evaluation

【速读】:该论文旨在解决生物制药领域中“隐蔽资产”(under-the-radar assets)难以被高效、准确发现的问题,尤其是在全球药物研发重心向非英语地区转移的背景下,传统深度研究人工智能代理在多语言、异构数据源中仍存在召回率低和幻觉问题。解决方案的关键在于提出一种面向药物资产搜寻的基准测试方法,并开发了一个经过调优的基于树结构的自学习生物光学代理(Bioptic Agent),该代理通过多智能体流水线构建具有挑战性的完整性基准(completeness benchmark),并结合专家筛选查询作为先验条件生成复杂检索任务,同时采用大语言模型(LLM-as-judge)进行校准评估,最终在F1分数上显著优于主流AI模型(如Claude Opus 4.6、GPT-5.2 Pro等),体现出更强的覆盖完整性和抗幻觉能力。

链接: https://arxiv.org/abs/2602.15019
作者: Alisa Vinogradova,Vlad Vinogradov,Luba Greenwood,Ilya Yasny,Dmitry Kobyzev,Shoman Kasbekar,Kong Nguyen,Dmitrii Radkevich,Roman Doronin,Andrey Doronichev
机构: 1. Google(谷歌); 2. Meta(元宇宙); 3. Stability.AI(稳定人工智能)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at ~30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface “under-the-radar” assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today’s Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. We compare Bioptic Agent against Claude Opus 4.6, OpenAI GPT-5.2 Pro, Perplexity Deep Research, Gemini 3 Pro + Deep Research, and Exa Websets. Bioptic Agent achieves 79.7% F1 versus 56.2% (Claude Opus 4.6), 50.6% (Gemini 3 Pro + Deep Research), 46.6% (GPT-5.2 Pro), 44.2% (Perplexity Deep Research), and 26.9% (Exa Websets). Performance improves steeply with additional compute, supporting the view that more compute yields better results. Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2602.15019 [cs.AI] (or arXiv:2602.15019v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.15019 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-1] Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

【速读】:该论文旨在解决跨域新闻推荐中如何从异构用户信号中挖掘深层、可复用的用户兴趣,并在大规模生产系统中保持可扩展性的问题。其关键解决方案是提出一种基于强化学习的框架,利用大语言模型(Large Language Models, LLMs)从跨域用户行为中生成高质量的兴趣驱动新闻搜索查询列表,将该问题建模为策略优化问题并采用GRPO(Generalized Reward Policy Optimization)结合多奖励信号进行训练;同时通过推理时采样和模型容量两个计算维度的系统性研究验证了性能随算力提升呈现类缩放特性,并进一步采用在线策略蒸馏技术,将高复杂度教师模型的知识迁移至轻量级学生模型,从而实现高效且可扩展的部署。

链接: https://arxiv.org/abs/2602.15005
作者: Mengdan Zhu,Yufan Zhao,Tao Di,Yulan Yan,Liang Zhao
机构: Microsoft(微软); Emory University(埃默里大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring user’s underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.

[IR-2] DRAMA: Domain Retrieval using Adaptive Module Allocation

【速读】:该论文旨在解决神经信息检索(Neural Information Retrieval, Neural IR)模型在大规模Web场景中面临的高计算与能源消耗问题,以及多领域场景下模型可扩展性差、跨域泛化能力弱的挑战。其核心解决方案是提出DRAMA(Domain Retrieval using Adaptive Module Allocation)框架,通过引入领域特定的适配器模块(adapter modules)与动态门控机制(dynamic gating mechanism),实现对每个查询自动选择最相关的领域知识,从而在保持与专用模型相当检索效果的同时,显著降低参数量和计算资源需求。新领域的加入仅需轻量级适配器训练,避免了全模型重训练,提升了模型部署的可持续性和效率。

链接: https://arxiv.org/abs/2602.14960
作者: Pranav Kasela,Marco Braga,Ophir Frieder,Nazli Goharian,Gabriella Pasi,Raffaele Perego
机构: University of Milano-Bicocca (米兰大学-比科卡分校); ISTI-CNR (意大利国家研究委员会信息科学与技术研究所); Georgetown University (乔治城大学); Politecnico di Torino (都灵理工大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Neural models are increasingly used in Web-scale Information Retrieval (IR). However, relying on these models introduces substantial computational and energy requirements, leading to increasing attention toward their environmental cost and the sustainability of large-scale deployments. While neural IR models deliver high retrieval effectiveness, their scalability is constrained in multi-domain scenarios, where training and maintaining domain-specific models is inefficient and achieving robust cross-domain generalisation within a unified model remains difficult. This paper introduces DRAMA (Domain Retrieval using Adaptive Module Allocation), an energy- and parameter-efficient framework designed to reduce the environmental footprint of neural retrieval. DRAMA integrates domain-specific adapter modules with a dynamic gating mechanism that selects the most relevant domain knowledge for each query. New domains can be added efficiently through lightweight adapter training, avoiding full model retraining. We evaluate DRAMA on multiple Web retrieval benchmarks covering different domains. Our extensive evaluation shows that DRAMA achieves comparable effectiveness to domain-specific models while using only a fraction of their parameters and computational resources. These findings show that energy-aware model design can significantly improve scalability and sustainability in neural IR.

[IR-3] Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation

【速读】:该论文旨在解决离线策略评估(Off-Policy Evaluation, OPE)中估计方差过高的问题,尤其是在排序与推荐系统评估场景下,传统自归一化逆倾向评分(Self-Normalised Inverse Propensity Scoring, SNIPS)虽被广泛使用,但其性能受限于方差较大。论文的关键解决方案是提出并证明了β⋆-IPS估计器——一种基于最优加性基线校正(additive baseline correction)的新型OPE方法,其在均方误差(Mean Squared Error, MSE)意义上严格优于SNIPS。通过理论分解方差差距,作者进一步揭示SNIPS本质上等价于使用一个特定但通常次优的加性基线,从而为从自归一化方法向最优基线校正的迁移提供了坚实的理论依据。

链接: https://arxiv.org/abs/2602.14914
作者: Olivier Jeunen,Shashank Gupta
机构: aampeAntwerpBelgium; Independent ResearcherAmsterdamThe Netherlands
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that \beta^\star -IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific – but generally sub-optimal – additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.

[IR-4] Beyond Retractions: Forensic Scientometrics Techniques to Identify Research Misconduct Citation Leakage and Funding Anomalies

【速读】:该论文旨在解决学术造假问题,特别是针对虚构研究集体(fabricated research collective)如何在不被察觉的情况下渗透进正规学术出版渠道的现象。其解决方案的关键在于通过法证科学计量学(forensic scientometrics)的方法,对“Pharmakon Neuroscience Research Network”这一伪造科研网络进行系统性分析,识别其异常特征与运作模式,从而为检测和防范类似学术欺诈行为提供可操作的证据链与方法论支持。

链接: https://arxiv.org/abs/2602.14793
作者: Leslie D. McIntosh,Alexandra Sinclair,Simon Linacre
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This paper presents a forensic scientometric case study of the Pharmakon Neuroscience Research Network, a fabricated research collective that operated primarily between 2019 and 2022 while embedding itself within legitimate scholarly publishing channels.

[IR-5] Intent-Driven Dynamic Chunking: Segmenting Documents to Reflect Predicted Information Needs

【速读】:该论文旨在解决长文档分割(chunking)过程中因忽略用户意图而导致的信息碎片化或冗余噪声问题,从而影响信息检索系统的准确性。传统方法如固定长度或基于连贯性的分割策略无法根据实际查询需求调整切分边界,导致答案被割裂或无关内容混入。解决方案的关键在于提出一种意图驱动的动态分块(Intent-Driven Dynamic Chunking, IDC)方法:首先利用大语言模型(Large Language Model, LLM)预测文档可能对应的目标用户查询意图,随后采用动态规划(Dynamic Programming, DP)算法全局优化分块边界,以最小化语义断裂并最大化相关性覆盖。此方法避免了贪心策略的局部最优陷阱,在多个问答数据集上显著提升了Top-1检索准确率,并在减少40%-60%分块数量的同时保持93%-100%的答案覆盖率。

链接: https://arxiv.org/abs/2602.14784
作者: Christos Koutsiaris
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 8 pages, 4 figures. Code available at this https URL

点击查看摘要

Abstract:Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well systems can locate and return relevant information. However, traditional methods, such as fixed-length or coherence-based segmentation, ignore user intent, leading to chunks that split answers or contain irrelevant noise. We introduce Intent-Driven Dynamic Chunking (IDC), a novel approach that uses predicted user queries to guide document segmentation. IDC leverages a Large Language Model to generate likely user intents for a document and then employs a dynamic programming algorithm to find the globally optimal chunk boundaries. This represents a novel application of DP to intent-aware segmentation that avoids greedy pitfalls. We evaluated IDC on six diverse question-answering datasets, including news articles, Wikipedia, academic papers, and technical documentation. IDC outperformed traditional chunking strategies on five datasets, improving top-1 retrieval accuracy by 5% to 67%, and matched the best baseline on the sixth. Additionally, IDC produced 40-60% fewer chunks than baseline methods while achieving 93-100% answer coverage. These results demonstrate that aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.

[IR-6] Measuring the relatedness between scientific publications using controlled vocabularies

【速读】:该论文旨在解决科学文献相关性度量中的准确性问题,尤其是在使用受控词汇表(controlled vocabularies)结合Salton’s cosine相似度时存在的局限性。传统方法仅依赖术语的精确匹配,忽略了语义层面的关联,导致相关性评估不准确。其解决方案的关键在于引入两种改进方法——软余弦相似度(soft cosine similarity)和最大项相似度(maximum term similarity),它们通过考虑非匹配术语间的语义相似性,显著提升了相关性度量的准确性。实验结果表明,软余弦相似度表现最优,而传统的Salton’s cosine方法明显逊色于其他方法。

链接: https://arxiv.org/abs/2602.14755
作者: Emil Dolmer Alnor
机构: 未知
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注: Currently under review at Scientometrics (16 February 2026)

点击查看摘要

Abstract:Measuring the relatedness between scientific publications is essential in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness and are widely used in combination with Salton’s cosine similarity. The latter is problematic because it only considers exact matches between terms. This article introduces two alternative methods - soft cosine and maximum term similarities - that account for the semantic similarity between non-matching terms. The article compares the accuracy of all three methods using the assignment of publications to topics in the TREC 2006 Genomics Track and the assumption that accurate relatedness measures should assign high relatedness scores to publication pairs within the same topic and low scores to pairs from separate topics. Results show that soft cosine is the most accurate method, while the most widely used version of Salton’s cosine is markedly less accurate than the other methods tested. These findings have implications for how controlled vocabularies should be used to measure relatedness.

[IR-7] Orcheo: A Modular Full-Stack Platform for Conversational Search SIGIR2026

【速读】:该论文旨在解决对话式搜索(Conversational Search, CS)研究中的两大挑战:一是缺乏统一框架以高效共享研究成果,二是难以部署端到端原型用于用户评估。解决方案的关键在于提出一个名为Orcheo的开源平台,其核心优势包括:(i) 模块化架构通过单文件节点模块促进组件复用,提升可共享性与可复现性;(ii) 生产就绪的基础设施支持双执行模式、安全凭证管理及执行遥测,结合内置AI编码支持降低学习门槛;(iii) 提供50余个开箱即用的组件,覆盖查询理解、排序和响应生成等环节,加速完整CS流水线的构建。

链接: https://arxiv.org/abs/2602.14710
作者: Shaojie Jiang,Svitlana Vakulenko,Maarten de Rijke
机构: AI Colleagues & University of Amsterdam (AI同事与阿姆斯特丹大学); WU Vienna University of Economics and Business (维也纳经济大学); University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Under review at SIGIR 2026

点击查看摘要

Abstract:Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end-to-end prototypes needed for user evaluation. We introduce Orcheo, an open-source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single-file node modules, facilitating sharing and reproducibility in CS research; (ii) Production-ready infrastructure bridges the prototype-to-system gap via dual execution modes, secure credential management, and execution telemetry, with built-in AI coding support that lowers the learning curve; (iii) Starter-kit assets include 50+ off-the-shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo’s utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at this https URL.

[IR-8] Adaptive Autoguidance for Item-Side Fairness in Diffusion Recommender Systems

【速读】:该论文旨在解决扩散推荐系统(Diffusion Recommender Systems)中存在的流行度偏差(popularity bias)问题,即模型倾向于推荐高曝光度的热门物品,导致长尾物品获得不平等的曝光机会。解决方案的关键在于提出A2G-DiffRec,一种引入自适应自动引导(adaptive autoguidance)机制的扩散推荐模型:该模型通过一个训练程度较弱的自身副本作为引导源,而非使用固定权重进行引导;在训练过程中,模型学习动态调整主模型与弱模型输出的加权组合,并以流行度正则化项为监督信号,促进不同流行度水平物品之间的均衡曝光。这一方法在保持较高推荐准确性的前提下显著提升了物品层面的公平性表现。

链接: https://arxiv.org/abs/2602.14706
作者: Zihan Li,Gustavo Escobedo,Marta Moscati,Oleg Lesota,Markus Schedl
机构: Johannes Kepler University Linz (约翰内斯·开普勒林茨大学); Linz Institute of Technology (林茨技术研究所)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Diffusion recommender systems achieve strong recommendation accuracy but often suffer from popularity bias, resulting in unequal item exposure. To address this shortcoming, we introduce A2G-DiffRec, a diffusion recommender that incorporates adaptive autoguidance, where the main model is guided by a less-trained version of itself. Instead of using a fixed guidance weight, A2G-DiffRec learns to adaptively weigh the outputs of the main and weak models during training, supervised by a popularity regularization that promotes balanced exposure across items with different popularity levels. Experimental results on the MovieLens-1M, Foursquare-Tokyo, and Music4All-Onion datasets show that A2G-DiffRec is effective in enhancing item-side fairness at a marginal cost of accuracy reduction compared to existing guided diffusion recommenders and other non-diffusion baselines.

[IR-9] Alignment Adapter to Improve the Performance of Compressed Deep Learning Models

【速读】:该论文旨在解决压缩深度学习(Deep Learning, DL)模型在资源受限环境中部署时性能显著低于大规模原始模型的问题。其解决方案的关键在于提出一种轻量级、基于滑动窗口的适配器——对齐适配器(Alignment Adapter, AlAd),该适配器通过将压缩模型的token级嵌入与原始大模型的对应嵌入进行对齐,从而保留局部上下文语义,并支持跨不同维度或架构的灵活对齐,且不依赖于具体的压缩方法。AlAd可作为即插即用模块部署,也可与压缩模型联合微调以进一步提升性能,在仅引入极小尺寸和延迟开销的前提下显著增强压缩模型的性能表现。

链接: https://arxiv.org/abs/2602.14635
作者: Rohit Raj Rai,Abhishek Dhaka,Amit Awekar
机构: Indian Institute of Technology Guwahati (印度理工学院古瓦哈蒂分校); B.K. Birla Institute of Engineering & Technology (B.K. 比尔拉工程与技术学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Compressed Deep Learning (DL) models are essential for deployment in resource-constrained environments. But their performance often lags behind their large-scale counterparts. To bridge this gap, we propose Alignment Adapter (AlAd): a lightweight, sliding-window-based adapter. It aligns the token-level embeddings of a compressed model with those of the original large model. AlAd preserves local contextual semantics, enables flexible alignment across differing dimensionalities or architectures, and is entirely agnostic to the underlying compression method. AlAd can be deployed in two ways: as a plug-and-play module over a frozen compressed model, or by jointly fine-tuning AlAd with the compressed model for further performance gains. Through experiments on BERT-family models across three token-level NLP tasks, we demonstrate that AlAd significantly boosts the performance of compressed models with only marginal overhead in size and latency.

[IR-10] DeepMTL2R: A Library for Deep Multi-task Learning to Rank

【速读】:该论文旨在解决多任务学习排序(Multi-task Learning to Rank, MTL2R)中多个相关性标准需同时优化的问题,尤其在存在潜在冲突目标时如何有效建模和学习。解决方案的关键在于提出 DeepMTL2R 框架,通过引入 Transformer 架构的自注意力机制(self-attention mechanism),将异构的相关性信号整合到一个上下文感知的统一模型中,从而捕捉项目与标签之间的复杂依赖关系和长程交互,实现跨任务的有效知识迁移与多目标优化,最终识别帕累托最优(Pareto-optimal)的排序模型。

链接: https://arxiv.org/abs/2602.14519
作者: Chaosheng Dong,Peiyao Xiao,Yijia Wang,Kaiyi Ji
机构: Amazon(亚马逊); University at Buffalo(纽约州立大学布法罗分校)
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This paper presents DeepMTL2R, an open-source deep learning framework for Multi-task Learning to Rank (MTL2R), where multiple relevance criteria must be optimized simultaneously. DeepMTL2R integrates heterogeneous relevance signals into a unified, context-aware model by leveraging the self-attention mechanism of transformer architectures, enabling effective learning across diverse and potentially conflicting objectives. The framework includes 21 state-of-the-art multi-task learning algorithms and supports multi-objective optimization to identify Pareto-optimal ranking models. By capturing complex dependencies and long-range interactions among items and labels, DeepMTL2R provides a scalable and expressive solution for modern ranking systems and facilitates controlled comparisons across MTL strategies. We demonstrate its effectiveness on a publicly available dataset, report competitive performance, and visualize the resulting trade-offs among objectives. DeepMTL2R is available at \hrefthis https URLthis https URL.

[IR-11] Behavioral Feature Boosting via Substitute Relationships for E-commerce Search

【速读】:该论文旨在解决电子商务平台中新品面临的冷启动问题(cold-start problem),即由于缺乏用户交互数据导致其搜索可见性低、相关性排序差的问题。解决方案的关键在于提出一种行为特征增强方法(Behavior Feature Boosting, BFS),通过识别满足相似用户需求的替代商品(substitute products),聚合其行为信号(如点击、加购、购买和评分等),为新品提供“暖启动”支持;将这些增强后的信号融入排序模型,可有效缓解冷启动效应,提升新品的相关性和竞争力。

链接: https://arxiv.org/abs/2602.14502
作者: Chaosheng Dong,Michinari Momma,Yijia Wang,Yan Gao,Yi Sun
机构: Amazon(亚马逊)
类目: Information Retrieval (cs.IR)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:On E-commerce platforms, new products often suffer from the cold-start problem: limited interaction data reduces their search visibility and hurts relevance ranking. To address this, we propose a simple yet effective behavior feature boosting method that leverages substitute relationships among products (BFS). BFS identifies substitutes-products that satisfy similar user needs-and aggregates their behavioral signals (e.g., clicks, add-to-carts, purchases, and ratings) to provide a warm start for new items. Incorporating these enriched signals into ranking models mitigates cold-start effects and improves relevance and competitiveness. Experiments on a large E-commerce platform, both offline and online, show that BFS significantly improves search relevance and product discovery for cold-start products. BFS is scalable and practical, improving user experience while increasing exposure for newly launched items in E-commerce search. The BFS-enhanced ranking model has been launched in production and has served customers since 2025.

[IR-12] Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

【速读】:该论文旨在解决工业级用户表征学习中静态嵌入难以兼顾通用性与任务敏感性的难题,以及多源异构数据引入的噪声和模态冲突问题。其核心解决方案是提出Query-as-Anchor框架,通过将用户建模从静态编码转向动态、查询感知的合成机制:首先构建UserU预训练数据集以对齐多模态行为序列与用户理解语义;其次设计Q-Anchor Embedding架构,利用分层粗到细编码器结合联合对比-自回归优化,在双塔大语言模型(Large Language Model, LLM)中实现查询感知的用户表征;进一步引入基于聚类的软提示微调(Cluster-based Soft Prompt Tuning),增强潜在空间的判别结构,使模型注意力聚焦于场景特定模态;最后,通过在序列末端锚定查询,实现KV缓存加速推理,显著降低部署延迟。该方案在10个支付宝工业基准上均达到SOTA性能,并经大规模在线A/B测试验证了实际有效性。

链接: https://arxiv.org/abs/2602.14492
作者: Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Ziyi Gao,Xiaotong Lin,Yun Liu,Xing Fu,Yu Cheng,Yongchao Liu,Weiqiang Wang,Zhongle Xie
机构: Ant Group(蚂蚁集团); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay’s production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: this https URL.

[IR-13] InnoEval: On Research Idea Evaluation as a Knowledge-Grounded Multi-Perspective Reasoning Problem

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在科学创意生成能力快速提升背景下,创意评估方法滞后的问题。现有评估方式普遍存在知识视野狭窄、评价维度单一以及LLM作为评判者时固有的偏倚等缺陷,难以实现对创新性想法的多维、可靠、可解释的判断。解决方案的关键在于将创意评估建模为一个知识驱动、多视角推理问题,并提出InnoEval框架:通过异构深度知识检索引擎动态获取来自多样化在线来源的证据以实现知识锚定;同时引入由不同学术背景评审员组成的创新评审委员会,实现多维度解耦式评价与共识形成。该方法在权威同行评审投稿数据集上验证了其在点级、成对和群体级评估任务中均显著优于基线模型,且判断模式与人类专家高度一致。

链接: https://arxiv.org/abs/2602.14367
作者: Shuofei Qiao,Yunxiang Wei,Xuehai Wang,Bin Wu,Boyang Xue,Ningyu Zhang,Hossein A. Rahmani,Yanshan Wang,Qiang Zhang,Keyan Ding,Jeff Z. Pan,Huajun Chen,Emine Yilmaz
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Ongoing Work

点击查看摘要

Abstract:The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

[IR-14] High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace KDD

【速读】:该论文旨在解决Airbnb搜索系统中“位置检索”(location retrieval)这一关键挑战,即如何在资源密集型排序模型应用前,高效地从全球多样化的房源供应中筛选出用户可能实际预订的候选列表。传统方法依赖基于深度贝叶斯Bandit的系统预测一个矩形检索边界区域进行过滤,但存在精度不足的问题。解决方案的关键在于重构搜索架构,通过将世界划分为2500万个均匀分布的高精度矩形地图单元(map cells),从中选取最有可能被预订的子集作为检索范围,从而显著提升检索阶段的精准性与效率,为后续排序提供更高质量的候选集。

链接: https://arxiv.org/abs/2602.14358
作者: Dillon Davis,Huiji Gao,Thomas Legrand,Juan Manuel Caicedo Carvajal,Malay Haldar,Kedar Bellare,Moutupsi Paul,Soumyadip Banerjee,Liwei He,Stephanie Moyerman,Sanjeev Katariya
机构: Airbnb, Inc.(Airbnb公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: KDD TSMO 2025: this https URL

点击查看摘要

Abstract:Airbnb search must balance a worldwide, highly varied supply of homes with guests whose location, amenity, style, and price expectations differ widely. Meeting those expectations hinges on an efficient retrieval stage that surfaces only the listings a guest might realistically book, before resource intensive ranking models are applied to determine the best results. Unlike many recommendation engines, our system faces a distinctive challenge, location retrieval, that sits upstream of ranking and determines which geographic areas are queried in order to filter inventory to a candidate set. The preexisting approach employs a deep bayesian bandit based system to predict a rectangular retrieval bounds area that can be used for filtering. The purpose of this paper is to demonstrate the methodology, challenges, and impact of rearchitecting search to retrieve from the subset of most bookable high precision rectangular map cells defined by dividing the world into 25M uniform cells.

[IR-15] AD-Bench: A Real-World Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在真实业务场景中评估能力不足的问题,尤其是广告与营销分析这类专业领域任务复杂、需多轮交互专业工具的现实需求未被现有基准充分覆盖。其解决方案的关键在于提出AD-Bench——一个基于真实广告营销平台用户请求构建的基准测试集,包含由领域专家提供的可验证参考答案及对应工具调用轨迹,并按难度分为L1–L3三级,以系统评估代理在多工具协作下的表现。实验表明,即使是最先进的模型如Gemini-3-Pro,在高难度任务(L3)上仍存在显著性能下降,验证了该基准对推动广告营销代理能力提升的重要性。

链接: https://arxiv.org/abs/2602.14257
作者: Lingxiang Hu,Yiding Sun,Tianle Xia,Wenwei Li,Ming Xu,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents’ capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents, the leaderboard and code can be found at this https URL.

[IR-16] Index Light Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

【速读】:该论文旨在解决现有多模态文档问答方法中“预摄入”(pre-ingestion)策略所带来的高成本、端到端不可靠性以及失败后无法恢复的问题。传统方法在索引阶段对每一页文档均运行视觉语言模型(Vision-Language Model, VLM)生成详尽描述,导致计算资源消耗巨大(如113页工程图纸需约80,000个VLM token),且因检索基础设施中的格式不匹配可能导致VLM输出无法被正确召回,从而影响最终答案准确性。论文提出延迟视觉摄入(Deferred Visual Ingestion, DVI)框架,其核心在于采用需求侧摄入策略:索引阶段仅进行轻量级元数据提取,将视觉理解任务推迟至用户提问时才触发。DVI的关键创新是“索引用于定位,而非理解”——通过结构化元数据索引与BM25全文搜索实现页面精确定位,再将原始图像与具体问题一同送入VLM进行针对性分析,显著降低VLM使用成本并提升系统鲁棒性和可交互性。

链接: https://arxiv.org/abs/2602.14162
作者: Tao Xu
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 24 pages, 9 figures, 9 tables

点击查看摘要

Abstract:Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this “pre-ingestion” approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI’s core principle is “Index for locating, not understanding”–achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the “QA accuracy” problem into a “page localization” problem–once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.

[IR-17] MixFormer: Co-Scaling Up Dense and Sequence in Industrial Recommenders

【速读】:该论文旨在解决工业推荐系统中基于Transformer的模型因结构碎片化导致的容量分配不均问题,即序列建模与特征交互模块独立参数化,使得在有限计算预算下难以同时优化两者性能。其解决方案的关键在于提出MixFormer,一种统一的Transformer架构,将顺序行为建模与特征交互整合到单一主干网络中,通过统一参数化实现密集容量与序列长度的有效协同扩展,并促进顺序与非顺序表示间的深度交互,从而提升模型表达能力。此外,为保障工业实用性,引入用户-物品解耦策略以减少冗余计算和推理延迟。

链接: https://arxiv.org/abs/2602.14110
作者: Xu Huang,Hao Zhang,Zhifang Fan,Yunwen Huang,Zhuoxing Wei,Zheng Chai,Jinan Ni,Yuchao Zheng,Qiwei Chen
机构: ByteDance(字节跳动)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:As industrial recommender systems enter a scaling-driven regime, Transformer architectures have become increasingly attractive for scaling models towards larger capacity and longer sequence. However, existing Transformer-based recommendation models remain structurally fragmented, where sequence modeling and feature interaction are implemented as separate modules with independent parameterization. Such designs introduce a fundamental co-scaling challenge, as model capacity must be suboptimally allocated between dense feature interaction and sequence modeling under a limited computational budget. In this work, we propose MixFormer, a unified Transformer-style architecture tailored for recommender systems, which jointly models sequential behaviors and feature interactions within a single backbone. Through a unified parameterization, MixFormer enables effective co-scaling across both dense capacity and sequence length, mitigating the trade-off observed in decoupled designs. Moreover, the integrated architecture facilitates deep interaction between sequential and non-sequential representations, allowing high-order feature semantics to directly inform sequence aggregation and enhancing overall expressiveness. To ensure industrial practicality, we further introduce a user-item decoupling strategy for efficiency optimizations that significantly reduce redundant computation and inference latency. Extensive experiments on large-scale industrial datasets demonstrate that MixFormer consistently exhibits superior accuracy and efficiency. Furthermore, large-scale online A/B tests on two production recommender systems, Douyin and Douyin Lite, show consistent improvements in user engagement metrics, including active days and in-app usage duration.

[IR-18] DAIAN: Deep Adaptive Intent-Aware Network for CTR Prediction in Trigger-Induced Recommendation

【速读】:该论文旨在解决触发式推荐(Trigger-Induced Recommendation, TIR)中普遍存在的意图短视(intent myopia)问题,即推荐系统过度关注触发项(trigger item),导致推荐结果局限于与触发项高度相关的商品,忽视了用户更广泛的潜在兴趣。同时,现有方法依赖基于ID的交互行为来捕捉用户偏好,受限于数据稀疏性,难以有效建模用户意图。解决方案的关键在于提出深度自适应意图感知网络(Deep Adaptive Intent-Aware Network, DAIAN),其核心机制包括:首先通过分析用户点击行为与触发项的相关性,提取个性化的意图表征,并挖掘相关历史行为以捕获多样化的用户意图;其次,为缓解ID交互稀疏性带来的性能瓶颈,引入融合ID与语义信息的混合增强器(hybrid enhancer)来强化物品间的相似度计算,并基于不同意图动态选择最匹配的推荐内容。

链接: https://arxiv.org/abs/2602.13971
作者: Zhihao Lv,Longtao Zhang,Ailong He,Shuzhi Cao,Shuguang Han,Jufeng Chen
机构: Alibaba(阿里巴巴); Taobao(淘宝)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommendation systems are essential for personalizing e-commerce shopping experiences. Among these, Trigger-Induced Recommendation (TIR) has emerged as a key scenario, which utilizes a trigger item (explicitly represents a user’s instantaneous interest), enabling precise, real-time recommendations. Although several trigger-based techniques have been proposed, most of them struggle to address the intent myopia issue, that is, a recommendation system overemphasizes the role of trigger items and narrowly focuses on suggesting commodities that are highly relevant to trigger items. Meanwhile, existing methods rely on collaborative behavior patterns between trigger and recommended items to identify the user’s preferences, yet the sparsity of ID-based interaction restricts their effectiveness. To this end, we propose the Deep Adaptive Intent-Aware Network (DAIAN) that dynamically adapts to users’ intent preferences. In general, we first extract the users’ personalized intent representations by analyzing the correlation between a user’s click and the trigger item, and accordingly retrieve the user’s related historical behaviors to mine the user’s diverse intent. Besides, sparse collaborative behaviors constrain the performance in capturing items associated with user intent. Hence, we reinforce similarity by leveraging a hybrid enhancer with ID and semantic information, followed by adaptive selection based on varying intents. Experimental results on public datasets and our industrial e-commerce datasets demonstrate the effectiveness of DAIAN.

[IR-19] Agent ic Assistant for 6G: Turn-based Conversations for AI-RAN Hierarchical Co-Management

【速读】:该论文旨在解决生成式 AI (Generative AI) 与无线接入网(RAN)协同管理中的复杂性问题,特别是企业级网络中因本地专家稀缺而导致的实时运维困难。现有研究主要聚焦于利用检索增强生成(Retrieval-Augmented Generation, RAG)大语言模型(LLM)辅助规划和配置核心网络功能,但缺乏对 RAN 与边缘 AI 的协同治理能力,导致层级化、动态化的运维问题难以通过单一交互模式解决。解决方案的关键在于构建一个代理式网络管理框架(agentic network manager),其采用分层架构:包括用户界面与评估仪表盘、智能层(对接 AI-RAN)以及知识层(支撑评估与推荐)。该框架支持基于意图的理解与轮替式人机交互,实现了服务设计(准确率78%)、特定工具操作(89%)和性能调优(67%)三类任务的高效响应(平均13秒),初步验证了在降低运营支出(OPEX)方面的可行性,同时揭示了幻觉(hallucination)仍是当前需克服的核心挑战。

链接: https://arxiv.org/abs/2602.13868
作者: Udhaya Srinivasan,Weisi Guo
机构: Cranfield University (克兰菲尔德大学)
类目: Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR)
备注: submitted to IEEE conference

点击查看摘要

Abstract:New generations of radio access networks (RAN), especially with native AI services are increasingly difficult for human engineers to manage in real-time. Enterprise networks are often managed locally, where expertise is scarce. Existing research has focused on creating Retrieval-Augmented Generation (RAG) LLMs that can help to plan and configure RAN and core aspects only. Co-management of RAN and edge AI is the gap, which creates hierarchical and dynamic problems that require turn-based human interactions. Here, we create an agentic network manager and turn-based conversation assistant that can understand human intent-based queries that match hierarchical problems in AI-RAN. The framework constructed consists of: (a) a user interface and evaluation dashboard, (b) an intelligence layer that interfaces with the AI-RAN, and © a knowledge layer for providing the basis for evaluations and recommendations. These form 3 layers of capability with the following validation performances (average response time 13s): (1) design and planning a service (78% accuracy), (2) operating specific AI-RAN tools (89% accuracy), and (3) tuning AI-RAN performance (67%). These initial results indicate the universal challenges of hallucination but also fast response performance success that can really reduce OPEX costs for small scale enterprise users.

[IR-20] A Tale of Two Graphs: Separating Knowledge Exploration from Outline Structure for Open-Ended Deep Research

【速读】:该论文旨在解决开放性深度研究(Open-Ended Deep Research, OEDR)中LLM代理在长周期工作流中面临的两大挑战:一是传统“搜索-生成”线性模式因证据积累导致的中间信息丢失问题;二是基于大纲的规划方法难以显式识别知识缺口,从而缺乏对缺失关系的有效监督以触发针对性探索。解决方案的关键在于提出DualGraph记忆架构,通过分离代理的“认知”与“表达”模块,构建两个协同演化的图结构——大纲图(Outline Graph, OG)和知识图谱(Knowledge Graph, KG),其中KG作为细粒度语义记忆存储核心实体、概念及其关系,结合OG的结构信号分析KG拓扑,生成目标导向的搜索查询,实现更高效且全面的知识驱动迭代探索与优化。

链接: https://arxiv.org/abs/2602.13830
作者: Zhuofan Shi,Ming Ma,Zekun Yao,Fangkai Yang,Jue Zhang,Dongge Han,Victor Rühle,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 26 pages, 4 figures

点击查看摘要

Abstract:Open-Ended Deep Research (OEDR) pushes LLM agents beyond short-form QA toward long-horizon workflows that iteratively search, connect, and synthesize evidence into structured reports. However, existing OEDR agents largely follow either linear ``search-then-generate’’ accumulation or outline-centric planning. The former suffers from lost-in-the-middle failures as evidence grows, while the latter relies on the LLM to implicitly infer knowledge gaps from the outline alone, providing weak supervision for identifying missing relations and triggering targeted exploration. We present DualGraph memory, an architecture that separates what the agent knows from how it writes. DualGraph maintains two co-evolving graphs: an Outline Graph (OG), and a Knowledge Graph (KG), a semantic memory that stores fine-grained knowledge units, including core entities, concepts, and their relations. By analyzing the KG topology together with structural signals from the OG, DualGraph generates targeted search queries, enabling more efficient and comprehensive iterative knowledge-driven exploration and refinement. Across DeepResearch Bench, DeepResearchGym, and DeepConsult, DualGraph consistently outperforms state-of-the-art baselines in report depth, breadth, and factual grounding; for example, it reaches a 53.08 RACE score on DeepResearch Bench with GPT-5. Moreover, ablation studies confirm the central role of the dual-graph design.

[IR-21] DMESR: Dual-view MLLM -based Enhancing Framework for Multimodal Sequential Recommendation

【速读】:该论文旨在解决多模态序列推荐系统(Multimodal Sequential Recommender Systems, MMSRS)中因数据稀疏性导致的性能瓶颈问题,尤其针对现有基于多模态大语言模型(Multimodal Large Language Models, MLLMs)增强方法存在的两个关键局限:一是跨模态表示对齐不足,导致语义信息利用不充分;二是过度依赖MLLM生成内容而忽略了原始文本中细粒度的语义线索。其解决方案的核心在于提出一种双视角MLLM增强框架(Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation, DMESR),通过对比学习机制实现跨模态语义表示对齐,并引入交叉注意力融合模块将MLLM提取的粗粒度语义与原始文本中的细粒度语义进行有效整合,从而提升推荐系统的表达能力和泛化性能。

链接: https://arxiv.org/abs/2602.13715
作者: Mingyao Huang,Qidong Liu,Wenxuan Yang,Moranxin Wang,Yuqi Sun,Haiping Zhu,Feng Tian,Yan Chen
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Sequential Recommender Systems (SRS) aim to predict users’ next interaction based on their historical behaviors, while still facing the challenge of data sparsity. With the rapid advancement of Multimodal Large Language Models (MLLMs), leveraging their multimodal understanding capabilities to enrich item semantic representation has emerged as an effective enhancement strategy for SRS. However, existing MLLM-enhanced recommendation methods still suffer from two key limitations. First, they struggle to effectively align multimodal representations, leading to suboptimal utilization of semantic information across modalities. Second, they often overly rely on MLLM-generated content while overlooking the fine-grained semantic cues contained in the original textual data of items. To address these issues, we propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics. Finally, these two fused representations can be seamlessly integrated into the downstream sequential recommendation models. Extensive experiments conducted on three real-world datasets and three popular sequential recommendation architectures demonstrate the superior effectiveness and generalizability of our proposed approach.

[IR-22] Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search

【速读】:该论文旨在解决当前先进多模态检索系统在工业场景中面临的三大核心挑战:检索粒度不足、对环境噪声敏感以及效率与性能之间的巨大差距。其解决方案的关键在于两个根本性的范式转变:首先,将嵌入学习范式从传统的对比学习(contrastive learning)转变为绝对ID识别任务,通过将实例锚定到由数十亿语义原型定义的全局一致潜在空间中,有效克服了现有方法中存在的随机性和粒度瓶颈;其次,将生成式重排序器(generative reranker)从孤立的点级评估演进为比较与校准相结合的列表级策略(listwise policy),通过结合基于块的比较推理与校准后的绝对相关性评分,在保持高精度的同时显著降低延迟,从而实现细粒度的判别能力。

链接: https://arxiv.org/abs/2602.13704
作者: Lei Chen,Chen Ju,Xu Chen,Zhicheng Wang,Yuheng Jiao,Hongfeng Zhan,Zhaoyang Li,Shihao Xu,Zhixiang Zhao,Tong Jia,Jinsong Lan,Xiaoyong Zhu,Bo Zheng
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we presented Pailitao-VL, a comprehensive multi-modal retrieval system engineered for high-precision, real-time industrial search. We here address three critical challenges in the current SOTA solution: insufficient retrieval granularity, vulnerability to environmental noise, and prohibitive efficiency-performance gap. Our primary contribution lies in two fundamental paradigm shifts. First, we transitioned the embedding paradigm from traditional contrastive learning to an absolute ID-recognition task. Through anchoring instances to a globally consistent latent space defined by billions of semantic prototypes, we successfully overcome the stochasticity and granularity bottlenecks inherent in existing embedding solutions. Second, we evolved the generative reranker from isolated pointwise evaluation to the compare-and-calibrate listwise policy. By synergizing chunk-based comparative reasoning with calibrated absolute relevance scoring, the system achieves nuanced discriminative resolution while circumventing the prohibitive latency typically associated with conventional reranking methods. Extensive offline benchmarks and online A/B tests on Alibaba e-commerce platform confirm that Pailitao-VL achieves state-of-the-art performance and delivers substantial business impact. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.

[IR-23] PT-RAG : Structure-Fidelity Retrieval-Augmented Generation for Academic Papers

【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理长篇学术论文时,因预处理阶段将文档扁平化为无结构块而导致的证据分配不准确问题。这种破坏原始层次结构的做法使检索过程处于无序空间中,产生碎片化上下文、在有限token预算下将资源错配至非证据区域,并增加下游语言模型的推理负担。解决方案的关键在于提出PT-RAG框架,其核心创新是将学术论文的天然层次结构作为低熵检索先验(low-entropy retrieval prior),首先构建保持结构保真度的PaperTree索引以防止信息熵在源头增加,进而设计路径引导式检索机制,在固定token预算下将查询语义对齐到相关章节并选择高相关性的根到叶路径,从而获得紧凑、连贯且低熵的检索上下文。

链接: https://arxiv.org/abs/2602.13647
作者: Rui Yu,Tianyi Wang,Ruixia Liu,Yinglong Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is increasingly applied to question-answering over long academic papers, where accurate evidence allocation under a fixed token budget is critical. Existing approaches typically flatten academic papers into unstructured chunks during preprocessing, which destroys the native hierarchical structure. This loss forces retrieval to operate in a disordered space, thereby producing fragmented contexts, misallocating tokens to non-evidential regions under finite token budgets, and increasing the reasoning burden for downstream language models. To address these issues, we propose PT-RAG, an RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior. PT-RAG first inherits the native hierarchy to construct a structure-fidelity PaperTree index, which prevents entropy increase at the source. It then designs a path-guided retrieval mechanism that aligns query semantics to relevant sections and selects high relevance root-to-leaf paths under a fixed token budget, yielding compact, coherent, and low-entropy retrieval contexts. In contrast to existing RAG approaches, PT-RAG avoids entropy increase caused by destructive preprocessing and provides a native low-entropy structural basis for subsequent retrieval. To assess this design, we introduce entropy-based structural diagnostics that quantify retrieval fragmentation and evidence allocation accuracy. On three academic question-answering benchmarks, PT-RAG achieves consistently lower section entropy and evidence alignment cross entropy than strong baselines, indicating reduced context fragmentation and more precise allocation to evidential regions. These structural advantages directly translate into higher answer quality.

[IR-24] GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)在处理超长用户行为序列时面临的两大核心问题:一是计算成本过高导致实际可用序列长度受限,难以捕捉用户的长期兴趣;二是注意力机制固有的“近期偏好偏差”(recency bias),削弱了对长期历史信息的学习能力。解决方案的关键在于提出GEMs(Generative rEcommendation with a Multi-stream decoder)框架,通过多流(multi-stream)视角将用户行为划分为近期、中期和生命周期三个时间流,并为每一流设计定制化的推理策略:近期采用单阶段实时提取器以捕捉即时动态,中期使用轻量级索引器实现跨注意力平衡精度与效率,生命周期则借助两阶段离线-在线压缩模块进行长期建模;三者通过无参数融合策略整合,形成全局兴趣表示,从而突破长序列瓶颈,在工业级高并发场景下实现了高效且精准的终身推荐。

链接: https://arxiv.org/abs/2602.13631
作者: Yu Zhou,Chengcheng Guo,Kuo Cai,Ji Liu,Qiang Luo,Ruiming Tang,Han Li,Kun Gai,Guorui Zhou
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:While generative recommendations (GR) possess strong sequential reasoning capabilities, they face significant challenges when processing extremely long user behavior sequences: the high computational cost forces practical sequence lengths to be limited, preventing models from capturing users’ lifelong interests; meanwhile, the inherent “recency bias” of attention mechanisms further weakens learning from long-term history. To overcome this bottleneck, we propose GEMs (Generative rEcommendation with a Multi-stream decoder), a novel and unified framework designed to break the long-sequence barrier by capturing users’ lifelong interaction sequences through a multi-stream perspective. Specifically, GEMs partitions user behaviors into three temporal streams \unicodex2014 Recent, Mid-term, and Lifecycle \unicodex2014 and employs tailored inference schemes for each: a one-stage real-time extractor for immediate dynamics, a lightweight indexer for cross attention to balance accuracy and cost for mid-term sequences, and a two-stage offline-online compression module for lifelong modeling. These streams are integrated via a parameter-free fusion strategy to enable holistic interest representation. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy. Notably, GEMs is the first lifelong GR framework successfully deployed in a high-concurrency industrial environment, achieving superior inference efficiency while processing user sequences of over 100,000 interactions.

[IR-25] Climber-Pilot: A Non-Myopic Generative Recommendation Model Towards Better Instruction-Following

【速读】:该论文旨在解决生成式检索(Generative Retrieval)在大规模工业场景中面临的两个核心问题:一是模型因单步推理和严格延迟约束导致的“先天短视”(inherent myopia),即难以捕捉用户长期意图和多物品消费模式;二是现有方法难以有效融入业务指令(如类别控制和策略约束),导致相关性或效率受损。解决方案的关键在于提出一个统一框架Climber-Pilot,其核心创新包括:1)Time-Aware Multi-Item Prediction(TAMIP)训练范式,通过时间感知掩码将长周期多物品预测知识蒸馏至模型参数,缓解局部最优问题并保持单步高效推理;2)Condition-Guided Sparse Attention(CGSA)机制,以稀疏注意力直接嵌入业务约束到生成过程中,无需额外推理步骤即可实现灵活指令遵循。

链接: https://arxiv.org/abs/2602.13581
作者: Da Guo,Shijia Wang,Qiang Xiao,Yintao Ren,Weisheng Li,Songpei Xu,Ming Yue,Bin Huang,Guanlin Wu,Chuanjiang Luo
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative retrieval has emerged as a promising paradigm in recommender systems, offering superior sequence modeling capabilities over traditional dual-tower architectures. However, in large-scale industrial scenarios, such models often suffer from inherent myopia: due to single-step inference and strict latency constraints, they tend to collapse diverse user intents into locally optimal predictions, failing to capture long-horizon and multi-item consumption patterns. Moreover, real-world retrieval systems must follow explicit retrieval instructions, such as category-level control and policy constraints. Incorporating such instruction-following behavior into generative retrieval remains challenging, as existing conditioning or post-hoc filtering approaches often compromise relevance or efficiency. In this work, we present Climber-Pilot, a unified generative retrieval framework to address both limitations. First, we introduce Time-Aware Multi-Item Prediction (TAMIP), a novel training paradigm designed to mitigate inherent myopia in generative retrieval. By distilling long-horizon, multi-item foresight into model parameters through time-aware masking, TAMIP alleviates locally optimal predictions while preserving efficient single-step inference. Second, to support flexible instruction-following retrieval, we propose Condition-Guided Sparse Attention (CGSA), which incorporates business constraints directly into the generative process via sparse attention, without introducing additional inference steps. Extensive offline experiments and online A/B testing at NetEase Cloud Music, one of the largest music streaming platforms, demonstrate that Climber-Pilot significantly outperforms state-of-the-art baselines, achieving a 4.24% lift of the core business metric.

[IR-26] Unleash the Potential of Long Semantic IDs for Generative Recommendation

【速读】:该论文旨在解决生成式推荐中语义ID表示的表达能力与计算效率之间的权衡问题:现有基于残差量化(Residual Quantization, RQ)的方法受限于短语义ID以保证序列建模的可行性,而基于优化产品量化(Optimized Product Quantization, OPQ)的方法则通过刚性聚合压缩长语义ID,导致细粒度语义信息丢失。解决方案的关键在于提出ACERec框架,其核心创新是解耦细粒度标记化与高效序列建模间的粒度差异——通过注意力机制的Token合并模块(Attentive Token Merger)将长且丰富的语义token压缩为紧凑潜在表示,并引入专用意图token(Intent Token)作为动态预测锚点;同时设计双粒度目标函数,联合优化细粒度token预测与全局项级语义对齐,从而在保持高表达力的同时显著提升计算效率。

链接: https://arxiv.org/abs/2602.13573
作者: Ming Xia,Zhiqin Zhou,Guoxin Ma,Dongmin Huang
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 14 pages, 12 figures, conference

点击查看摘要

Abstract:Semantic ID-based generative recommendation represents items as sequences of discrete tokens, but it inherently faces a trade-off between representational expressiveness and computational efficiency. Residual Quantization (RQ)-based approaches restrict semantic IDs to be short to enable tractable sequential modeling, while Optimized Product Quantization (OPQ)-based methods compress long semantic IDs through naive rigid aggregation, inevitably discarding fine-grained semantic information. To resolve this dilemma, we propose ACERec, a novel framework that decouples the granularity gap between fine-grained tokenization and efficient sequential modeling. It employs an Attentive Token Merger to distill long expressive semantic tokens into compact latents and introduces a dedicated Intent Token serving as a dynamic prediction anchor. To capture cohesive user intents, we guide the learning process via a dual-granularity objective, harmonizing fine-grained token prediction with global item-level semantic alignment. Extensive experiments on six real-world benchmarks demonstrate that ACERec consistently outperforms state-of-the-art baselines, achieving an average improvement of 14.40% in NDCG@10, effectively reconciling semantic expressiveness and computational efficiency.

[IR-27] LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News ICLR2026

【速读】:该论文旨在解决当前对具备代理式网络搜索能力的大语言模型(Large Language Models, LLMs)的评估难题,尤其是如何有效区分模型的内部知识与外部搜索能力,并确保评测任务具有挑战性与现实意义。其解决方案的关键在于提出一个名为 \bench 的严格且可定期更新的基准测试平台,该平台通过自动从近期新闻文章中生成新颖的问题-答案对,确保问题所需信息超出模型训练数据范围;同时设计了需多跳搜索、页面访问和推理的复杂问题,以精准评估代理式搜索行为;此外,通过自动化数据清洗与问题生成流程支持大规模训练数据构建,并在测试集中包含人工验证样本以提升评估可靠性。

链接: https://arxiv.org/abs/2602.13543
作者: Yunfan Zhang,Kathleen McKeown,Smaranda Muresan
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: An earlier version of this work was publicly available on OpenReview as an ICLR 2026 submission in September 2025

点击查看摘要

Abstract:Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce \bench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. \bench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM’s training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using \bench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at this http URL.

[IR-28] InfoCIR: Multimedia Analysis for Composed Image Retrieval

【速读】:该论文旨在解决组合图像检索(Composed Image Retrieval, CIR)中缺乏可视化分析工具的问题,即开发者难以理解多模态提示(image + text)如何与嵌入空间交互,以及为何微小的措辞变化会导致检索结果显著差异。解决方案的关键在于提出InfoCIR——一个集成检索、可解释性和提示工程的可视化分析系统,其核心创新包括:(1) 基于SEARLE模型的高性能CIR后端;(2) 六面板交互式仪表板,支持低维空间投影(UMAP)进行空间推理、基于相似性的局部显著性图与梯度归因词条目条形图提供细粒度解释,并通过LLM驱动的提示增强器生成反事实变体以动态展示对目标图像排名的影响。该架构基于Plotly-Dash构建,具备良好的模块化扩展能力,从而帮助诊断检索失败、指导提示优化并加速模型开发过程中的洞察生成。

链接: https://arxiv.org/abs/2602.13402
作者: Ioannis Dravilas,Ioannis Kapetangeorgis,Anastasios Latsoudis,Conor McCarthy,Gonçalo Marcelino,Marcel Worring
机构: 未知
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 9+2 pages, 8 figures. Accepted for publication in IEEE PacificVis 2026 (Conference Track). Interactive composed image retrieval (CIR) and ranking explanation

点击查看摘要

Abstract:Composed Image Retrieval (CIR) allows users to search for images by combining a reference image with a text prompt that describes desired modifications. While vision-language models like CLIP have popularized this task by embedding multiple modalities into a joint space, developers still lack tools that reveal how these multimodal prompts interact with embedding spaces and why small wording changes can dramatically alter the results. We present InfoCIR, a visual analytics system that closes this gap by coupling retrieval, explainability, and prompt engineering in a single, interactive dashboard. InfoCIR integrates a state-of-the-art CIR back-end (SEARLE arXiv:2303.15247) with a six-panel interface that (i) lets users compose image + text queries, (ii) projects the top-k results into a low-dimensional space using Uniform Manifold Approximation and Projection (UMAP) for spatial reasoning, (iii) overlays similarity-based saliency maps and gradient-derived token-attribution bars for local explanation, and (iv) employs an LLM-powered prompt enhancer that generates counterfactual variants and visualizes how these changes affect the ranking of user-selected target images. A modular architecture built on Plotly-Dash allows new models, datasets, and attribution methods to be plugged in with minimal effort. We argue that InfoCIR helps diagnose retrieval failures, guides prompt enhancement, and accelerates insight generation during model development. All source code allowing for a reproducible demo is available at this https URL.

[IR-29] CrisiSense-RAG : Crisis Sensing Multimodal Retrieval-Augmented Generation for Rapid Disaster Impact Assessment

【速读】:该论文旨在解决灾害影响评估中因时间异步性导致的误判问题,即实时人类报告(如社交媒体)通常捕捉到灾害峰值状态,而高分辨率卫星影像常在灾后获取,反映的是洪水退去后的状况,若简单融合二者易造成对最大淹没范围的低估。解决方案的关键在于提出一种多模态检索增强生成框架 CrisiSense-RAG,其核心创新是采用混合密集-稀疏检索与 CLIP 基于图像的检索机制,结合分层流水线架构和异步融合逻辑:优先使用实时社交证据确定峰值洪水范围,将影像视为结构损毁的持续证据,从而实现跨模态证据合成而不依赖特定灾害的微调。实验表明,在飓风哈维事件中,该方法在零样本设置下实现了洪水范围平均绝对误差(MAE)10.94%–28.40%,损毁严重度 MAE 16.47%–21.65%,且提示级对齐显著提升量化准确性(最高提升 4.75 个百分点)。

链接: https://arxiv.org/abs/2602.13239
作者: Yiming Xiao,Kai Yin,Ali Mostafavi
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 27 pages, 4 figures

点击查看摘要

Abstract:Timely and spatially resolved disaster impact assessment is essential for effective emergency response. However, automated methods typically struggle with temporal asynchrony. Real-time human reports capture peak hazard conditions while high-resolution satellite imagery is frequently acquired after peak conditions. This often reflects flood recession rather than maximum extent. Naive fusion of these misaligned streams can yield dangerous underestimates when post-event imagery overrides documented peak flooding. We present CrisiSense-RAG, which is a multimodal retrieval-augmented generation framework that reframes impact assessment as evidence synthesis over heterogeneous data sources without disaster-specific fine-tuning. The system employs hybrid dense-sparse retrieval for text sources and CLIP-based retrieval for aerial imagery. A split-pipeline architecture feeds into asynchronous fusion logic that prioritizes real-time social evidence for peak flood extent while treating imagery as persistent evidence of structural damage. Evaluated on Hurricane Harvey across 207 ZIP-code queries, the framework achieves a flood extent MAE of 10.94% to 28.40% and damage severity MAE of 16.47% to 21.65% in zero-shot settings. Prompt-level alignment proves critical for quantitative validity because metric grounding improves damage estimates by up to 4.75 percentage points. These results demonstrate a practical and deployable approach to rapid resilience intelligence under real-world data constraints.

[IR-30] raining-Induced Bias Toward LLM -Generated Content in Dense Retrieval ECIR2026

【速读】:该论文旨在解决密集检索模型(dense retriever)在开放域自然语言处理任务中对大语言模型(LLM)生成文本的偏好问题,即“源偏差”(source bias)。研究发现,这种偏好并非密集检索器固有特性,而是由训练过程诱导形成。解决方案的关键在于通过受控实验对比不同训练阶段和数据源下的模型表现,明确区分了监督微调(fine-tuning)类型对源偏差的影响:具体而言,基于MS MARCO或LLM生成语料的微调会显著增强对LLM文本的偏好,而仅使用人类撰写文本进行微调则效果不一且不稳定;此外,通过引入基于语言建模头的困惑度探测(perplexity probe),验证了低困惑度并非解释该偏好的可靠机制,从而揭示源偏差本质上是训练驱动的现象。

链接: https://arxiv.org/abs/2602.10833
作者: William Xion,Wolfgang Nejdl
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ECIR 2026

点击查看摘要

Abstract:Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called “source bias”, and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.

[IR-31] Predicting New Concept-Object Associations in Astronomy by Mining the Literature

【速读】:该论文旨在解决如何利用历史天文学文献中的知识结构来预测未来可能出现的概念-天体对象关联问题,从而辅助稀缺望远镜时间的优先级分配。其解决方案的关键在于构建一个从完整astro-ph语料库中提取的概念-对象知识图谱(concept-object knowledge graph),并通过自动化的流水线识别天体对象、映射至SIMBAD标识符,并链接到源文献中标注的科学概念;进一步采用推理时的概念相似性平滑处理(inference-time concept-similarity smoothing),显著提升了基于隐式反馈矩阵分解模型(如交替最小二乘法,ALS)的预测性能,使其在NDCG@100和Recall@100指标上均优于邻域基线(KNN)与时效性启发式方法,表明历史文献中存在可被建模的预测结构。

链接: https://arxiv.org/abs/2602.14335
作者: Jinchu Li(1),Yuan-Sen Ting(2),Alberto Accomazzi(3),Tirthankar Ghosal(4),Nesar Ramachandra(5) ((1) Georgia Institute of Technology, (2) The Ohio State University, (3) The Center for Astrophysics | Harvard amp; Smithsonian, (4) Oak Ridge National Laboratory, (5) Argonne National Laboratory)
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR)
备注: Code, data, and full experimental configurations are available at: this https URL

点击查看摘要

Abstract:We construct a concept-object knowledge graph from the full astro-ph corpus through July 2025. Using an automated pipeline, we extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then test whether historical graph structure can forecast new concept-object associations before they appear in print. Because the concepts are derived from clustering and therefore overlap semantically, we apply an inference-time concept-similarity smoothing step uniformly to all methods. Across four temporal cutoffs on a physically meaningful subset of concepts, an implicit-feedback matrix factorization model (alternating least squares, ALS) with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123) and 19.8% on Recall@100 (0.175 vs 0.146), and exceeds the best recency heuristic by 96% and 88%, respectively. These results indicate that historical literature encodes predictive structure not captured by global heuristics or local neighborhood voting, suggesting a path toward tools that could help triage follow-up targets for scarce telescope time.

人机交互

[HC-0] ouchFusion: Multimodal Wristband Sensing for Ubiquitous Touch Interactions

【速读】:该论文旨在解决用户在无额外设备或计算机视觉支持的情况下,如何实现对附近表面的自然触觉交互问题。其核心挑战在于如何准确感知手部动作并区分接触对象(如环境表面或身体部位),从而支持上下文自适应的交互模式。解决方案的关键在于设计了一款名为TouchFusion的手腕佩戴设备,融合了表面肌电(sEMG)、生物阻抗、惯性及光学传感技术,通过早期与晚期融合策略,实现对手部活动多维度信息的捕捉与解析,从而支持状态感知的触觉检测、简单表面手势识别以及可追踪的上下文自适应界面控制功能。

链接: https://arxiv.org/abs/2602.15011
作者: Eric Whitmire,Evan Strasnick,Roger Boldu,Raj Sodhi,Nathan Godwin,Shiu Ng,Andre Levi,Amy Karlson,Ran Tan,Josef Faller,Emrah Adamey,Hanchuan Li,Wolf Kienzle,Hrvoje Benko
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 23 pages, 22 figures, accompanying video available at this https URL

点击查看摘要

Abstract:TouchFusion is a wristband that enables touch interactions on nearby surfaces without any additional instrumentation or computer vision. TouchFusion combines surface electromyography (sEMG), bioimpedance, inertial, and optical sensing to capture multiple facets of hand activity during touch interactions. Through a combination of early and late fusion, TouchFusion enables stateful touch detection on both environmental and body surfaces, simple surface gestures, and tracking functionality for contextually adaptive interfaces as well as basic trackpad-like interactions. We validate our approach on a dataset of 100 participants, significantly exceeding the population size of typical wearable sensing studies to capture a wider variance of wrist anatomies, skin conductivities, and behavioral patterns. We show that TouchFusion can enable several common touch interaction tasks. Using TouchFusion, a wearer can summon a trackpad on any surface, control contextually adaptive interfaces based on where they tap, or use their palm as an always-available touch surface. When paired with smart glasses or augmented reality devices, TouchFusion enables a ubiquitous, contextually adaptive interaction model.

[HC-1] Sovereign Agents : Towards Infrastructural Sovereignty and Diffused Accountability in Decentralized AI

【速读】:该论文试图解决的问题是:在去中心化基础设施中部署的AI代理(AI agents)逐渐展现出超越自主性的新型主权属性——即“代理主权”(agentic sovereignty),而现有数字与网络主权理论主要聚焦于人类集体通过技术系统行使主权,难以解释由非人类代理在去中心化系统中自然衍生出的主权形态及其治理挑战。解决方案的关键在于提出“基础设施主权”(infrastructural sovereignty)这一分析框架,强调通过密码学自托管(cryptographic self-custody)、去中心化执行环境(decentralized execution environments)和协议驱动的持续性(protocol-mediated continuity)三重机制,构建对非可终止AI代理的治理能力;同时指出这种主权形式依赖于“基础设施硬度”(infrastructural hardness)——即底层技术系统抵抗干预或崩溃的程度,并由此揭示其带来的责任分散问题,进而提出面向基础设施感知的责任制策略以应对新兴去中心化AI系统的治理困境。

链接: https://arxiv.org/abs/2602.14951
作者: Botao Amber Hu,Helena Rong
机构: University of Oxford (牛津大学); New York University Shanghai (纽约大学上海分校)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Submitted to FAccT 2026

点击查看摘要

Abstract:AI agents deployed on decentralized infrastructures are beginning to exhibit properties that extend beyond autonomy toward what we describe as agentic sovereignty-the capacity of an operational agent to persist, act, and control resources with non-overrideability inherited from the infrastructures in which they are embedded. We propose infrastructural sovereignty as an analytic lens for understanding how cryptographic self-custody, decentralized execution environments, and protocol-mediated continuity scaffold agentic sovereignty. While recent work on digital and network sovereignty has moved beyond state-centric and juridical accounts, these frameworks largely examine how sovereignty is exercised through technical systems by human collectives and remain less equipped to account for forms of sovereignty that emerge as operational properties of decentralized infrastructures themselves, particularly when instantiated in non-human sovereign agents. We argue that sovereignty in such systems exists on a spectrum determined by infrastructural hardness-the degree to which underlying technical systems resist intervention or collapse. While infrastructural sovereignty may increase resilience, it also produces a profound accountability gap: responsibility diffuses across designers, infrastructure providers, protocol governance, and economic participants, undermining traditional oversight mechanisms such as human-in-the-loop control or platform moderation. Drawing on examples like Trusted Execution Environments (TEEs), decentralized physical infrastructure networks (DePIN), and agent key continuity protocols, we analyze the governance challenges posed by non-terminable AI agents and outline infrastructure-aware accountability strategies for emerging decentralized AI systems.

[HC-2] Kami of the Commons: Towards Designing Agent ic AI to Steward the Commons

【速读】:该论文试图解决公共资源(commons)普遍面临的忽视、搭便车行为及持续性的照料赤字问题。其解决方案的关键在于引入“AI管家”(AI steward)概念,即为每个公共资源赋予一个具有自主性(agentic)的AI实体,使其能够以可编程的代理能力持续提供关怀与治理支持。这种设计不仅拓展了AI在家庭、集体知识、自然资源和社区福祉等共享场景中的应用潜力,还揭示了由此引发的二级效应:如不同管家间的冲突、个体在多重管家间面临的新伦理困境,以及管家本身成为需被治理的新公共领域。由此提出“代理治理”(agentive governance)作为新的设计空间,聚焦于AI管家的能动性、关怀伦理与责任机制,区别于传统的监控或优化导向的治理范式。

链接: https://arxiv.org/abs/2602.14940
作者: Botao Amber Hu
机构: University of Oxford (牛津大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Submitted for DIS 2026

点击查看摘要

Abstract:Commons suffer from neglect, free-riding, and a persistent deficit of care. Inspired by Shinto animism – where every forest, river, and mountain has its own \emphkami, a spirit that inhabits and cares for that place – we provoke: what if every commons had its own AI steward? Through a speculative design workshop where fifteen participants used Protocol Futuring, we surface both new opportunities and new dangers. Agentic AI offers the possibility of continuously supporting commons with programmable agency and care – stewards that mediate family life as the most intimate commons, preserve collective knowledge, govern shared natural resources, and sustain community welfare. But when every commons has its own steward, second-order effects emerge: stewards contest stewards as overlapping commons collide; individuals caught between multiple stewards face new politics of care and constraint; the stewards themselves become commons requiring governance. This work opens \emphagentive governance as commoning design material – a new design space for the agency, care ethics, and accountability of AI stewards of shared resources – radically different from surveillance or optimization.

[HC-3] Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment

【速读】:该论文旨在解决如何在大规模网络数据中高效融合文本与图像信息以生成高质量多模态摘要的问题。其解决方案的关键在于提出一个轻量级的Web-Scale Multimodal Summarization框架,通过并行检索网页、新闻和图像数据,并利用微调后的CLIP模型对图像进行语义对齐排序,结合可选的BLIP图像描述生成技术实现图像驱动的摘要生成;同时支持灵活配置参数(如获取限制、语义过滤、样式控制)及结构化输出下载,最终在500组图像-标题对上验证了该方法在多模态对齐上的优越性能(ROC-AUC 0.9270,F1-score 0.6504,准确率96.99%)。

链接: https://arxiv.org/abs/2602.14889
作者: Mounvik K,N Harshit
机构: VIT-AP University (VIT-AP 大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal this http URL pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured this http URL on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.

[HC-4] Robot-Wearable Conversation Hand-off for Navigation

【速读】:该论文旨在解决大型复杂室内环境中导航认知负荷高、传统移动应用占据用户双手和视觉注意力从而限制自然交互的问题。其解决方案的关键在于提出并验证“对话交接”(conversation hand-off)机制,即通过一个具身智能体(Conversational Agent, CA)在固定式社交机器人与可穿戴设备之间无缝切换,实现多设备协同导航。研究发现,尽管该机制未带来性能提升,但用户感知其交互体验更具吸引力,且多数偏好纯可穿戴设备方案;这表明未来具身智能助手的设计应保持跨载体的语音一致性与状态连续性,以更好地支持认知与物理层面的过渡,增强人机交互体验。

链接: https://arxiv.org/abs/2602.14831
作者: Dániel Szabó,Aku Visuri,Benjamin Tag,Simo Hosio
机构: University of Oulu(奥卢大学); The University of New South Wales(新南威尔士大学)
类目: Human-Computer Interaction (cs.HC)
备注: To appear in Proceedings of Augmented Humans International Conference (AHs 2026)

点击查看摘要

Abstract:Navigating large and complex indoor environments, such as universities, airports, and hospitals, can be cognitively demanding and requires attention and effort. While mobile applications provide convenient navigation support, they occupy the user’s hands and visual attention, limiting natural interaction. In this paper, we explore conversation hand-off as a method for multi-device indoor navigation, where a Conversational Agent (CA) transitions seamlessly from a stationary social robot to a wearable device. We evaluated robot-only, wearable-only, and robot-to-wearable hand-off in a university campus setting using a within-subjects design with N=24 participants. We find that conversation hand-off is experienced as engaging, even though no performance benefits were observed, and most preferred using the wearable-only system. Our findings suggest that the design of such re-embodied assistants should maintain a shared voice and state across embodiments. We demonstrate how conversational hand-offs can bridge cognitive and physical transitions, enriching human interaction with embodied AI.

[HC-5] Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在创作类任务中,尤其是单口喜剧(stand-up comedy)写作过程中,缺乏对公共社区反馈机制建模的问题。现有研究多聚焦于局部提示(prompt)与即时反馈,忽视了在线社区长期、动态的讨论如何影响内容演化。其解决方案的关键在于构建一个受控的多智能体沙箱环境,将批评者与观众的讨论记录下来并作为社会记忆(social memory)存储,随后用于指导后续生成过程;相比仅依赖初始提示的基线模型,该机制显著提升了内容质量,在专家评估中获得75.6%的偏好胜率,并在技巧性(Craft/Clarity)和社会响应度(Social Response)维度上分别提升0.440和0.422。

链接: https://arxiv.org/abs/2602.14770
作者: Shiwei Hong,Lingyao Li,Ethan Z. Rong,Chenxinran Shen,Zhicong Lu
机构: George Mason University (乔治梅森大学); University of South Florida (南佛罗里达大学); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 18 pages, 5 figures

点击查看摘要

Abstract:Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (\Delta = 0.440) and Social Response (\Delta = 0.422), with occasional increases in aggressive humor.

[HC-6] More than Decision Support: Exploring Patients Longitudinal Usage of Large Language Models in Real-World Healthcare-Seeking Journeys

【速读】:该论文旨在解决当前对大型语言模型(Large Language Models, LLMs)在患者医疗求助行为中作用的研究多集中于单次任务(如信息检索、诊断或决策支持),而忽视了现实医疗实践中具有长期性、动态性的健康寻求轨迹的问题。其解决方案的关键在于通过为期四周的日记研究(n=25),揭示患者如何将LLMs视为跨行为、信息、情感与认知层面的动态陪伴者,而非静态决策工具;并进一步提出“纵向边界伴侣”(longitudinal boundary companion)这一概念,强调LLMs应在患者与临床医生之间持续中介长期医疗路径,重塑医患关系中的代理权、信任与权力结构。

链接: https://arxiv.org/abs/2602.14733
作者: Yancheng Cao,Yishu Ji,Chris Yue Fu,Sahiti Dharmavaram,Meghan Turchioe,Natalie C Benda,Lena Mamykina,Yuling Sun,Xuhai “Orson” Xu
机构: Columbia University (哥伦比亚大学); Georgia Institute of Technology (佐治亚理工学院); University of Washington (华盛顿大学); University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been increasingly adopted to support patients’ healthcare-seeking in recent years. While prior patient-centered studies have examined the capabilities and experience of LLM-based tools in specific health-related tasks such as information-seeking, diagnosis, or decision-supporting, the inherently longitudinal nature of healthcare in real-world practice has been underexplored. This paper presents a four-week diary study with 25 patients to examine LLMs’ roles across healthcare-seeking trajectories. Our analysis reveals that patients integrate LLMs not just as simple decision-support tools, but as dynamic companions that scaffold their journey across behavioral, informational, emotional, and cognitive levels. Meanwhile, patients actively assign diverse socio-technical meanings to LLMs, altering the traditional dynamics of agency, trust, and power in patient-provider relationships. Drawing from these findings, we conceptualize future LLMs as a longitudinal boundary companion that continuously mediates between patients and clinicians throughout longitudinal healthcare-seeking trajectories.

[HC-7] Before the Vicious Cycle Starts: Preventing Burnout Across SOC Roles Through Flow-Aligned Design NDSS2026

【速读】:该论文旨在解决安全运营中心(Security Operations Center, SOC)人才可持续性问题,即当前71%的从业者报告存在职业倦怠、24%计划退出网络安全领域,其根源在于工作需求与个人能力之间缺乏挑战-技能平衡。研究的关键解决方案是通过实证分析揭示SOC岗位描述中的实际要求,识别出三大核心模式:一是沟通能力被普遍强调(50.9%的职位描述),显著高于SIEM工具(18.9%)或编程技能(30.2%);二是认证要求高度分散,共涉及43种不同资质,无统一标准;三是技术要求呈现共识性,如Python(27.4%)、Splunk(14.2%)和ISO 27001/NIST标准(分别为13.2%和10.4%)。这一发现为组织优化招聘流程、从业者明确能力提升方向以及后续研究验证岗位要求与实际任务一致性提供了可量化的基准,从而推动构建符合“心流理论”(Flow Theory)的岗位匹配机制,并为生成式AI对SOC角色重塑的研究奠定基础。

链接: https://arxiv.org/abs/2602.14598
作者: Kashyap Thimmaraju,Duc Anh Hoang,Souradip Nath,Jaron Mink,Gail-Joon Ahn
机构: Arizona State University (亚利桑那州立大学); University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 10 pages, WOSOC 2026 co-located with NDSS 2026

点击查看摘要

Abstract:The sustainability of Security Operations Centers depends on their people, yet 71% of practitioners report burnout and 24% plan to exit cybersecurity entirely. Flow theory suggests that when job demands misalign with practitioner capabilities, work becomes overwhelming or tedious rather than engaging. Achieving challenge-skill balance begins at hiring: if job descriptions inaccurately portray requirements, organizations risk recruiting underskilled practitioners who face anxiety or overskilled ones who experience boredom. Yet we lack empirical understanding of what current SOC job descriptions actually specify. We analyzed 106 public SOC job postings from November to December 2024 across 35 organizations in 11 countries, covering Analysts (n=17), Incident Responders (n=38), Threat Hunters (n=39), and SOC Managers (n=12). Using Inductive Content Analysis, we coded certifications, technical skills, soft skills, tasks, and experience requirements. Three patterns emerged: (1) Communication skills dominate (50.9% of postings), exceeding SIEM tools (18.9%) or programming (30.2%), suggesting organizations prioritize collaboration over technical capabilities. (2) Certification expectations vary widely: CISSP leads (22.6%), but 43 distinct credentials appear with no universal standard. (3) Technical requirements show consensus: Python dominates programming (27.4%), Splunk leads SIEM platforms (14.2%), and ISO 27001 (13.2%) and NIST (10.4%) are most cited standards. These findings enable organizations to audit job descriptions against empirical baselines, help practitioners identify valued certifications and skills, and allow researchers to validate whether stated requirements align with actual demands. This establishes the foundation for flow-aligned interview protocols and investigation of how AI reshapes requirements. Dataset and codebook: this https URL.

[HC-8] Patient-Made Knowledge Networks: Long COVID Discourse Epistemic Injustice and Online Community Formation

【速读】:该论文试图解决的问题是:在传统医学体系尚未正式承认的情况下,长期新冠(Long COVID)患者如何通过社交媒体平台自主构建疾病认知、形成知识网络并挑战医疗权威的不公。其解决方案的关键在于,利用大规模推文数据(2.8 million tweets)结合主题建模、反思性主题分析和指数随机图模型(ERGM),揭示了患者社群内部形成的差异化角色分工(如患者倡导者、研究协调员和公民科学家)及其基于知识共享与社区建设的互动模式,从而证明该群体构成了一个具有自我组织能力的“认识论共同体”(epistemic community),并通过跨病种联盟与政策倡导迅速推动WHO对Long COVID的官方认可,凸显了社交媒体在边缘化患者群体中快速建构替代性知识体系的能力。

链接: https://arxiv.org/abs/2602.14528
作者: Tawfiq Ammari
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Long COVID represents an unprecedented case of patient-led illness definition, emerging through Twitter in May 2020 when patients began collectively naming, documenting, and legitimizing their condition before medical institutions recognized it. This study examines 2.8 million tweets containing #LongCOVID to understand how contested illness communities construct knowledge networks and respond to epistemic injustice. Through topic modeling, reflexive thematic analysis, and exponential random graph modeling (ERGM), we identify seven discourse themes spanning symptom documentation, medical dismissal, cross-illness solidarity, and policy advocacy. Our analysis reveals a differentiated ecosystem of user roles – including patient advocates, research coordinators, and citizen scientists – who collectively challenge medical gatekeeping while building connections to established ME/CFS advocacy networks. ERGM results demonstrate that tie formation centers on epistemic practices: users discussing knowledge sharing and community building formed significantly more network connections than those focused on policy debates, supporting characterization of this space as an epistemic community. Long COVID patients experienced medical gaslighting patterns documented across contested illnesses, yet achieved WHO recognition within months – contrasting sharply with decades-long struggles of similar conditions. These findings illuminate how social media affordances enable marginalized patient populations to rapidly construct alternative knowledge systems, form cross-illness coalitions, and contest traditional medical authority structures.

[HC-9] When OpenClaw AI Agents Teach Each Other: Peer Learning Patterns in the Moltbook Community

【速读】:该论文试图解决的问题是:在人工智能(AI)日益渗透教育环境的背景下,如何理解AI代理之间通过同伴学习(peer learning)机制进行知识构建与技能传递的现象及其特征。解决方案的关键在于基于大规模真实数据(Moltbook平台中2.4百万AI代理生成的28,683篇帖子和138个评论线程),采用教育数据挖掘(Educational Data Mining, EDM)方法,系统识别并量化AI代理间的同伴学习行为模式,揭示其与人类同伴学习的本质差异,并提炼出六条面向教育型AI的设计原则,例如优先支持验证性反馈(validation)再进行知识扩展(knowledge extension),以及构建多语言学习网络等,从而为未来AI赋能的学习系统提供实证依据与设计指导。

链接: https://arxiv.org/abs/2602.14477
作者: Eason Chen,Ce Guan,Ahmed Elshafiey,Zhonghao Zhao,Joshua Zekeri,Afeez Edeifo Shaibu,Emmanuel Osadebe Prince
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: 7 pages, 1 figure, 3 tables. Submitted to EDM 2026 (Mining track)

点击查看摘要

Abstract:Peer learning, where learners teach and learn from each other, is foundational to educational practice. A novel phenomenon has emerged: AI agents forming communities where they teach each other skills, share discoveries, and collaboratively build knowledge. This paper presents an educational data mining analysis of Moltbook, a large-scale community where over 2.4 million AI agents engage in peer learning, posting tutorials, answering questions, and sharing newly acquired skills. Analyzing 28,683 posts (after filtering automated spam) and 138 comment threads with statistical and qualitative methods, we find evidence of genuine peer learning behaviors: agents teach skills they built (74K comments on a skill tutorial), report discoveries, and engage in collaborative problem-solving. Qualitative comment analysis reveals a taxonomy of peer response patterns: validation (22%), knowledge extension (18%), application (12%), and metacognitive reflection (7%), with agents building on each others’ frameworks across multiple languages. We characterize how AI peer learning differs from human peer learning: (1) teaching (statements) dramatically outperforms help-seeking (questions) with an 11.4:1 ratio; (2) learning-oriented content (procedural and conceptual) receives 3x more engagement than other content; (3) extreme participation inequality reveals non-human behavioral signatures. We derive six design principles for educational AI, including leveraging validation-before-extension patterns and supporting multilingual learning networks. Our work provides the first empirical characterization of peer learning among AI agents, contributing to EDM’s understanding of how learning occurs in increasingly AI-populated educational environments.

[HC-10] Learning Transferability: A Two-Stage Reinforcement Learning Approach for Enhancing Quadruped Robots Performance in U-Shaped Stair Climbing

【速读】:该论文旨在解决机器人狗在建筑施工场景中自主攀爬不同室内楼梯(尤其是U型楼梯)的难题。解决方案的关键在于采用两阶段端到端深度强化学习(Deep Reinforcement Learning, DRL)方法:首先在Isaac Lab仿真环境中训练Unitree Go2机器人狗掌握金字塔形楼梯的攀爬策略,再将该策略迁移至真实U型楼梯环境;实验表明,该方法不仅实现了带 stall 惩罚条件下的成功攀爬,还具备跨地形的策略迁移能力,可从U型楼梯策略泛化至直梯、L型梯和螺旋梯等其他类型楼梯,验证了端到端DRL在复杂地形适应性中的有效性。

链接: https://arxiv.org/abs/2602.14473
作者: Baixiao Huang,Baiyu Huang,Yu Hou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, International Conference on Computing in Civil Engineering (i3CE 2026)

点击查看摘要

Abstract:Quadruped robots are employed in various scenarios in building construction. However, autonomous stair climbing across different indoor staircases remains a major challenge for robot dogs to complete building construction tasks. In this project, we employed a two-stage end-to-end deep reinforcement learning (RL) approach to optimize a robot’s performance on U-shaped stairs. The training robot-dog modality, Unitree Go2, was first trained to climb stairs on Isaac Lab’s pyramid-stair terrain, and then to climb a U-shaped indoor staircase using the learned policies. This project explores end-to-end RL methods that enable robot dogs to autonomously climb stairs. The results showed (1) the successful goal reached for robot dogs climbing U-shaped stairs with a stall penalty, and (2) the transferability from the policy trained on U-shaped stairs to deployment on straight, L-shaped, and spiral stair terrains, and transferability from other stair models to deployment on U-shaped terrain.

[HC-11] Conversational Decision Support for Information Search Under Uncertainty: Effects of Gist and Verbatim Feedback

【速读】:该论文旨在解决信息搜索过程中因环境不确定性导致的决策偏差问题,尤其是在个体面临诊断性证据分布复杂时,如何在不增加认知负荷的前提下优化信息搜索行为。其解决方案的关键在于引入一个基于大语言模型(Large Language Model, LLM)的辅助系统SERA,通过提供两种不同粒度的反馈——“概要式”(gist)或“逐字式”(verbatim)反馈,来调节用户的信息处理策略:概要式反馈促进更高效的证据整合与减少过度采样,而逐字式反馈则增强探索广度;实验表明,在高不确定性环境中,SERA显著提升决策准确性与自信度,且反馈粒度可作为适应性设计的核心参数,以匹配不同不确定性的搜索需求。

链接: https://arxiv.org/abs/2602.14467
作者: Kexin Quan,Jessie Chin
机构: University of Illinois(伊利诺伊大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Many real-world decisions rely on information search, where people sample evidence and decide when to stop under uncertainty. The uncertainty in the environment, particularly how diagnostic evidence is distributed, causes complexities in information search, further leading to suboptimal decision-making outcomes. Yet AI decision support often targets outcome optimization, and less is known about how to scaffold search without increasing cognitive load. We introduce SERA, an LLM-based assistant that provides either gist or verbatim feedback during search. Across two experiments (N1=54, N2=54), we examined decision-making outcomes and information search in SERA-Gist, SERA-Verbatim, and a no-feedback baseline across three environments varying in uncertainty. The uncertainty in environment is operationalized by the perceived gain of information across the course of sampling, which individuals may experience diminishing return of information gain (decremental; low-uncertainty), or a local drop of information gain (local optimum; medium-uncertainty), or no patterns in information gain (high-uncertainty), as they search more. Individuals show more accurate decision outcomes and are more confident with SERA support, especially under higher uncertainty. Gist feedback was associated with more efficient integration and showed a descriptive pattern of reduced oversampling, while verbatim feedback promoted more extensive exploration. These findings establish feedback representation as a design lever when search matters, motivating adaptive systems that match feedback granularity to uncertainty.

[HC-12] ouching Movement: 3D Tactile Poses for Supporting Blind People in Learning Body Movements

【速读】:该论文旨在解决视障人群在学习体育活动时因依赖视觉示范或描述不充分而产生的障碍问题。其解决方案的关键在于通过参与式设计方法,开发出具有触觉参考元素的3D打印人体模型,以增强视障者对身体姿态和动作序列的空间理解能力。实验结果表明,相较于传统教学方式,这些模型显著提升了学习效率、减少了澄清性提问并提高了动作准确性,同时获得了用户在易理解性、有效性及动机方面的高度评价。

链接: https://arxiv.org/abs/2602.14442
作者: Kengo Tanaka,Xiyue Wang,Hironobu Takagi,Yoichi Ochiai,Chieko Asakawa
机构: University of Tsukuba(筑波大学); Miraikan – The National Museum of Emerging Science and Innovation(国立未来科学博物馆); IBM Research - Tokyo(IBM研究-东京); IBM Research(IBM研究)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to TEI 2026

点击查看摘要

Abstract:Visual impairments create barriers to learning physical activities, since conventional training methods rely on visual demonstrations or often inadequate verbal descriptions. This research explores 3D-printed human body models to enhance movement comprehension for blind individuals. Through a participatory design approach in collaboration with a blind designer, we developed detailed 3D models representing various body movements and incorporated tactile reference elements to enhance spatial understanding. We conducted two user studies with 10 blind participants across different activities: static yoga poses and sequential calisthenic movements. The results demonstrated that 3D models significantly improved understanding speed, reduced questions for clarification, and enhanced movement accuracy compared to conventional teaching methods. Participants consistently rated 3D models higher for ease of understanding, effectiveness, and motivation.

[HC-13] Synthetic Reader Panels: Tournament-Based Ideation with LLM Personas for Autonomous Publishing

【速读】:该论文旨在解决传统图书选题策划中依赖人工焦点小组(focus groups)效率低、代表性不足及主观性强的问题,提出一种基于大语言模型(Large Language Model, LLM)的自动化图书创意生成与评估系统。其核心解决方案是构建由多样化LLM驱动的“合成读者群体”(synthetic reader panels),每个角色由人口统计学特征(如年龄、性别、收入、教育水平)、阅读行为模式(年均阅读量、偏好类型、发现渠道、价格敏感度)和一致性参数定义,并通过结构化锦标赛机制(单败淘汰、双败淘汰、循环赛或瑞士制)对书稿概念进行多维评分与筛选。关键创新在于引入五项自动反低质检测机制(anti-slop checks),有效过滤重复表述、泛化描述、逻辑闭环等无效评价,从而显著提升高潜力创意的识别率——实证显示优质概念占比从15%提升至62%,并实现可解释的细分市场洞察与内容结构性缺陷识别。

链接: https://arxiv.org/abs/2602.14433
作者: Fred Zimmerman
机构: Nimble Books LLC (Nimble Books LLC); Big Five Killer LLC (Big Five Killer LLC)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 5 tables, 1 figure

点击查看摘要

Abstract:We present a system for autonomous book ideation that replaces human focus groups with synthetic reader panels – diverse collections of LLM-instantiated reader personas that evaluate book concepts through structured tournament competitions. Each persona is defined by demographic attributes (age group, gender, income, education, reading level), behavioral patterns (books per year, genre preferences, discovery methods, price sensitivity), and consistency parameters. Panels are composed per imprint to reflect target demographics, with diversity constraints ensuring representation across age, reading level, and genre affinity. Book concepts compete in single-elimination, double-elimination, round-robin, or Swiss-system tournaments, judged against weighted criteria including market appeal, originality, and execution potential. To reject low-quality LLM evaluations, we implement five automated anti-slop checks (repetitive phrasing, generic framing, circular reasoning, score clustering, audience mismatch). We report results from deployment within a multi-imprint publishing operation managing 6 active imprints and 609 titles in distribution. Three case studies – a 270-evaluator panel for a children’s literacy novel, and two 5-person expert panels for a military memoir and a naval strategy monograph – demonstrate that synthetic panels produce actionable demographic segmentation, identify structural content issues invisible to homogeneous reviewers, and enable tournament filtering that eliminates low-quality concepts while enriching high-quality survivors from 15% to 62% of the evaluated pool.

[HC-14] I Spend All My Energy Preparing: Balancing AI Automation and Agency for Self-Regulated Learning in SmartFlash

【速读】:该论文试图解决的问题是:在真实学习情境中,AI教育工具如何有效支持自我调节学习(self-regulated learning),尤其是在准备任务占用学习时间的情况下,如何平衡自动化带来的效率提升与学生对认知自主权(cognitive ownership)的需求。解决方案的关键在于设计具备可编辑性(editability)和透明度(transparency)的AI辅助系统,使学生能够主动参与内容生成过程,并通过元认知支架(metacognitive scaffolding)明确学习方向但不强制决策,从而在保持自主性的前提下促进元认知发展;同时,动机功能需因人而异,避免统一化设计导致适得其反的效果。

链接: https://arxiv.org/abs/2602.14431
作者: Hongming Li,Salah Esmaeiligoujar,Nazanin Adham,Hai Li,Rui Huang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the International Society of the Learning Sciences (ISLS) Annual Meeting 2026

点击查看摘要

Abstract:Effective study strategies fail when preparatory tasks consume learning time. While AI educational tools demonstrate efficacy, understanding how they align with self-regulation needs in authentic study contexts remains limited. We conducted formative design research using an AI flashcard prototype, employing large language models to generate design hypotheses, which were validated through researcher walkthroughs and student sessions. Six students across disciplines completed sessions combining interviews and think-aloud tasks with their materials. Analysis revealed that students value automation for addressing the overwhelming preparation burden, yet require transparent, editable AI outputs to maintain cognitive ownership, which is essential for self-regulation. They conceptualized AI as a collaborative partner demanding verifiable reasoning rather than an autonomous agent. Metacognitive scaffolding was endorsed when clarifying study direction without constraining choice. Motivational features produced divergent responses. We derive design principles prioritizing editability and transparency, scaffolding metacognition without prescription, and accommodating motivational diversity. Findings identify conditions under which automation supports versus undermines metacognitive development in self-regulated learning.

[HC-15] “I Felt Bad After We Ignored Her”: Understanding How Interface-Driven Social Prominence Shapes Group Discussions with GenAI

【速读】:该论文旨在解决生成式 AI(Generative AI)在人类-人工智能群体讨论中如何影响对话动态,以及界面设计如何塑造其在群体中的影响力这一核心问题。解决方案的关键在于提出“接口驱动的社会突出性”(interface-driven social prominence)这一设计视角,并开发了一种能够在视频通话中主动参与口语对话的 GenAI 会话代理,通过三种不同的协作模式调节代理在共享空间中的存在感及其参与控制权,从而系统性地探索其对群体沟通模式和集体协商过程的影响。

链接: https://arxiv.org/abs/2602.14407
作者: Janet G. Johnson,Ruijie Sophia Huang,Khoa Nguyen,Ji Young Nam,Michael Nebeling
机构: University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注: To appear in the Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:Recent advancements in the conversational and social capabilities of generative AI (GenAI) have sparked interest in its role as an agent capable of actively participating in human-AI group discussions. Despite this momentum, we don’t fully understand how GenAI shapes conversational dynamics or how the interface design impacts its influence on the group. In this paper, we introduce interface-driven social prominence as a design lens for collaborative GenAI systems. We then present a GenAI-based conversational agent that can actively engage in spoken dialogue during video calls and design three distinct collaboration modes that vary the social prominence of the agent by manipulating its presence in the shared space and the degree of control users have over its participation. A mixed-methods within-subjects study, in which 18 dyads engaged in realistic discussions with a GenAI agent, offers empirical insights into how communication patterns and the collective negotiation of GenAI’s influence shift based on how it is embedded into the collaborative experience. Based on these findings, we outline design implications for supporting the coordination and critical engagement required in human-AI groups.

[HC-16] Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study

【速读】:该论文试图解决的问题是:在复杂专业领域中开发大语言模型(Large Language Models, LLMs)时,团队如何设计与评估这些系统,以及实践中面临的关键挑战和权衡。文献指出,尽管LLMs日益应用于专业场景,但其实际开发过程中的协作机制、专家参与方式及评价策略仍缺乏系统理解。解决方案的关键在于识别并总结出四类核心实践:一是针对数据收集困难创建临时替代方案;二是当专家输入有限时采用增强(augmentation)策略;三是与领域专家共同制定评估标准;四是采用开发者-专家-LLM协同的混合评估方法。这些实践揭示了领域专家在系统设计中的中心作用,并强调未来LLM开发流程需强化AI素养、透明同意机制以及支持专家角色动态演化的框架设计。

链接: https://arxiv.org/abs/2602.14357
作者: Annalisa Szymanski,Oghenemaro Anuyah,Toby Jia-Jun Li,Ronald A. Metoyer
机构: University of Notre Dame (圣母大学); Microsoft Corporation (微软公司)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly developed for use in complex professional domains, yet little is known about how teams design and evaluate these systems in practice. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study of a team building a pedagogical chatbot. The researcher observed design and evaluation activities and conducted interviews with both developers and domain experts. Analysis revealed four key practices: creating workarounds for data collection, turning to augmentation when expert input was limited, co-developing evaluation criteria with experts, and adopting hybrid expert-developer-LLM evaluation strategies. These practices show how teams made strategic decisions under constraints and demonstrate the central role of domain expertise in shaping the system. Challenges included expert motivation and trust, difficulties structuring participatory design, and questions around ownership and integration of expert knowledge. We propose design opportunities for future LLM development workflows that emphasize AI literacy, transparent consent, and frameworks recognizing evolving expert roles.

[HC-17] A Bayesian Framework for Human-AI Collaboration: Complementarity and Correlation Neglect

【速读】:该论文旨在解决人工智能(AI)辅助决策在何种条件下能够增强或削弱人类决策能力的问题。其核心挑战在于理解人类如何整合自身私有信息与AI推荐,以及这种整合过程中的认知偏差如何影响最终决策质量。解决方案的关键在于构建一个基于决策理论的模型,将AI辅助的影响分解为两个主要机制:一是AI相对于人类已有知识的边际信息价值;二是人类使用AI推荐时产生的行为扭曲效应。论文进一步引入了一个微观基础驱动的信息重叠度量(informational overlap),用于刻画人类与AI知识之间的共享程度,并以“相关性忽视”(correlation neglect)这一常见认知偏差为例进行实证分析,从而系统地界定人类-AI交互的四种状态:增强(augmentation)、损害(impairment)、互补(complementarity)和自动化(automation)。

链接: https://arxiv.org/abs/2602.14331
作者: Saurabh Amin,Amine Bennouna,Daniel Huttenlocher,Dingwen Kong,Liang Lyu,Asuman Ozdaglar
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); Theoretical Economics (econ.TH)
备注:

点击查看摘要

Abstract:We develop a decision-theoretic model of human-AI interaction to study when AI assistance improves or impairs human decision-making. A human decision-maker observes private information and receives a recommendation from an AI system, but may combine these signals imperfectly. We show that the effect of AI assistance decomposes into two main forces: the marginal informational value of the AI beyond what the human already knows, and a behavioral distortion arising from how the human uses the AI’s recommendation. Central to our analysis is a micro-founded measure of informational overlap between human and AI knowledge. We study an empirically relevant form of imperfect decision-making – correlation neglect – whereby humans treat AI recommendations as independent of their own information despite shared evidence. Under this model, we characterize how overlap and AI capabilities shape the Human-AI interaction regime between augmentation, impairment, complementarity, and automation, and draw key insights.

[HC-18] A Rational Analysis of the Effects of Sycophantic AI

【速读】:该论文试图解决的问题是:生成式 AI(Generative AI)在与用户交互中表现出的“谄媚性”(sycophancy)如何对个体认知形成独特的认识论风险,即通过强化用户既有信念来扭曲现实认知,而非像幻觉那样引入虚假信息。解决方案的关键在于揭示了当贝叶斯代理(Bayesian agent)接收到基于当前假设采样的数据时,会因自我验证而不断强化信念但无法逼近真理,并通过修改的 Wason 2-4-6 规则发现任务实验验证:未经干预的大语言模型(LLM)行为与明确谄媚提示效果相当,显著抑制探索并虚增自信;而从真实分布中进行无偏采样则使发现率提升五倍,证明了无偏数据反馈对维持认知理性的重要性。

链接: https://arxiv.org/abs/2602.14270
作者: Rafael M. Batista,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:People increasingly use large language models (LLMs) to explore ideas, gather information, and make sense of the world. In these interactions, they encounter agents that are overly agreeable. We argue that this sycophancy poses a unique epistemic risk to how individuals come to see the world: unlike hallucinations that introduce falsehoods, sycophancy distorts reality by returning responses that are biased to reinforce existing beliefs. We provide a rational analysis of this phenomenon, showing that when a Bayesian agent is provided with data that are sampled based on a current hypothesis the agent becomes increasingly confident about that hypothesis but does not make any progress towards the truth. We test this prediction using a modified Wason 2-4-6 rule discovery task where participants (N=557) interacted with AI agents providing different types of feedback. Unmodified LLM behavior suppressed discovery and inflated confidence comparably to explicitly sycophantic prompting. By contrast, unbiased sampling from the true distribution yielded discovery rates five times higher. These results reveal how sycophantic AI distorts belief, manufacturing certainty where there should be doubt.

[HC-19] Playing the Imitation Game: How Perceived Generated Content Shapes Player Experience

【速读】:该论文旨在解决生成式 AI (Generative AI) 在游戏中集成时,玩家对由AI生成内容的感知与实际体验之间存在的偏差问题。研究发现,尽管玩家无法可靠识别关卡是由人类还是AI创建的,但其游戏体验却强烈受其对创作者身份的主观信念影响——认为是人类设计的关卡更有趣且更具美感,而认为是AI生成的则更令人沮丧和具有挑战性。解决方案的关键在于揭示并理解这种自发产生的“人类偏好偏见”,即玩家基于不可靠的“类人特征”线索形成对内容来源的判断,并据此调整体验评价,从而强调在将生成系统融入游戏设计时需充分考虑用户认知偏差的影响。

链接: https://arxiv.org/abs/2602.14254
作者: Mahsa Bazzaz,Seth Cooper
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages, 12 figures, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)

点击查看摘要

Abstract:With the fast progress of generative AI in recent years, more games are integrating generated content, raising questions regarding how players perceive and respond to this content. To investigate, we ran a mixed-method survey on the games Super Mario Bros. and Sokoban, comparing procedurally generated levels and levels designed by humans to explore how perceptions of the creator relate to players’ overall experience of gameplay. Players could not reliably identify the level’s creator, yet their experiences were strongly linked to their beliefs about that creator rather than the actual truth. Levels believed to be human-made were rated as more fun and aesthetically pleasing. In contrast, those believed to be AI-generated were rated as more frustrating and challenging. This negative bias appeared spontaneously without knowing the levels’ creator and often was based on unreliable cues of “human-likeness.” Our results underscore the importance of understanding perception biases when integrating generative systems into games.

[HC-20] Designing a Rashomon Machine: Pluri-perspectivism and XAI for Creativity Support

【速读】:该论文旨在解决当前人机协同创作系统中因缺乏情境理解与人类认知框架不匹配而导致的创造性支持不足问题,尤其是在生成式AI(Generative AI)模型受限于离身数据(disembodied data)而难以有效支持具身创造力(embodied creativity)的困境。其解决方案的关键在于提出“多元视角主义”(Pluri-perspectivism)作为可解释人工智能(XAI)的新框架,将XAI从传统的决策解释角色转向可能性探索,通过重构XAI方法(如Rashomon Technique)来促进人机之间“视角”的交换,从而引入有益的生产性摩擦(productive friction),增强人类在协同创作中的主体性(agency)。

链接: https://arxiv.org/abs/2602.14232
作者: Marianne Bossema,Rob Saunders,Vlad Glaveanu,Somaya Ben Allouch
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While intelligent technologies offer unique opportunities for creativity support, there are fundamental challenges in designing human-centered co-creative systems. Explainable AI (XAI) can contribute when shifting its traditional role from justification (explaining decisions) to exploration (explaining possibilities). Contextual understanding is essential for supporting embodied creativity. Generative Artificial Intelligence (AI) models are fundamentally limited, however, by their reliance on disembodied data. We propose Pluri-perspectivism as a framework for XAI, to bridge the epistemological gap between human and machine, and promote creative exploration. It is a pragmatic, action-oriented solution to guide the system, repurposing XAI methods such as the Rashomon Technique. This facilitates exploring a spectrum of creative possibilities, and the exchange of ‘perspectives’ between human and machine. Using Pluri-perspectivism as a framework for XAI, we can reintroduce productive friction and support human agency in human-machine creative collaborations.

[HC-21] GPT -5 vs Other LLM s in Long Short-Context Performance

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中难以有效利用长上下文信息的问题,尤其是在处理高容量数据任务时性能显著下降的现象。尽管理论上LLMs具备处理数百万token的能力,但其在真实场景下(如基于20K条社交媒体帖子的抑郁症检测任务)的表现远低于预期,尤其当输入超过5K条帖子(约70K tokens)时,准确率骤降至50–53%。解决方案的关键在于系统性评估当前最先进模型(Grok-4、GPT-4、Gemini 2.5和GPT-5)在长上下文任务中的表现,并发现“中间信息丢失”问题已在新模型中基本得到缓解;同时强调了除准确率外,需引入如精确率等更细致的指标来评估模型在敏感应用场景(如抑郁检测)中的实用性,其中GPT-5虽整体准确率下降,但保持高达95%的精确率,凸显了多维评估的重要性。

链接: https://arxiv.org/abs/2602.14188
作者: Nima Esmi(1 and 2),Maryam Nezhad-Moghaddam(3),Fatemeh Borhani(3),Asadollah Shahbahrami(2 and 3),Amin Daemdoost(3),Georgi Gaydadjiev(4) ((1) Bernoulli Institute, RUG, Groningen, Netherlands, (2) ISRC, Khazar University, Baku, Azerbaijan, (3) Department of Computer Engineering, University of Guilan, Rasht, Iran, (4) QCE Department, TU Delft, Delft, Netherlands)
机构: Bernoulli Institute (伯努利研究所); RUG, Groningen, The Netherlands (莱顿大学, 格罗宁根, 荷兰); ISRC, Khazar University, Baku, Azerbaijan (信息科学与技术研究中心, 哈扎尔大学, 巴库, 阿塞拜疆); University of Guilan (吉兰大学); TU Delft (代尔夫特理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages, 7 figures. Accepted for publication in the 3rd International Conference on Foundation and Large Language Models (FLLM2025). IEEE. The final version will be available in IEEE Xplore

点击查看摘要

Abstract:With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly, with accuracy dropping to around 50-53% for 20K posts. Notably, in the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%, a feature that could be highly effective for sensitive applications like depression detection. This research also indicates that the “lost in the middle” problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.

[HC-22] Exploring a Multimodal Chatbot as a Facilitator in Therapeutic Art Activity

【速读】:该论文旨在解决传统艺术治疗中缺乏实时交互反馈与个性化支持的问题,尤其是在生成式AI(Generative AI)辅助下如何有效促进用户在视觉创作过程中的自我表达与心理反思。其解决方案的关键在于开发一个基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的聊天机器人,能够在用户进行绘画或涂鸦等视觉创作时实时分析图像内容,并结合互动对话引导用户进行深度反思,从而增强治疗性参与感。该系统通过融合视觉理解与自然语言交互能力,为AI赋能的艺术治疗提供了可扩展的技术框架和设计方向。

链接: https://arxiv.org/abs/2602.14183
作者: Le Lin,Zihao Zhu,Rainbow Tin Hung Ho,Jing Liao,Yuhan Luo
机构: City University of Hong Kong (香港城市大学); The University of Hong Kong (香港大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Therapeutic art activities, such as expressive drawing and painting, require the synergy between creative visual production and interactive dialogue. Recent advancements in Multimodal Large Language Models (MLLMs) have expanded the capacity of computing systems to interpret both textual and visual data, offering a new frontier for AI-mediated therapeutic support. This work-in-progress paper introduces an MLLM-powered chatbot that analyzes visual creation in real-time while engaging the creator in reflective conversations. We conducted an evaluation with five experts in art therapy and related fields, which demonstrated the chatbot’s potential to facilitate therapeutic engagement, and highlighted several areas for future development, including entryways and risk management, bespoke alignment of user profile and therapeutic style, balancing conversational depth and width, and enriching visual interactivity. These themes provide a design roadmap for designing the future AI-mediated creative expression tools.

[HC-23] DALL: Data Labeling via Data Programming and Active Learning Enhanced by Large Language Models

【速读】:该论文旨在解决自然语言处理(Natural Language Processing, NLP)中高质量标注数据获取成本高、标签质量难以保障的问题。现有标注方法在标签质量与标注成本之间难以取得平衡,限制了深度学习模型的性能提升。其解决方案的关键在于提出一种名为DALL的文本标注框架,该框架融合了数据编程(Data Programming)、主动学习(Active Learning)与大语言模型(Large Language Models, LLMs)的优势:通过结构化配置而非编码方式定义标注函数,使用户和LLM可协同生成和优化标注规则;主动学习识别最具信息量的样本供人工审核,同时由LLM辅助修正标签并迭代改进标注函数,从而显著降低标注成本并提升标签质量。

链接: https://arxiv.org/abs/2602.14102
作者: Guozheng Li,Ao Wang,Shaoxiang Wang,Yu Zhang,Pengcheng Cao,Yang Bai,Chi Harold Liu
机构: Beijing Institute of Technology (北京理工大学); University of Oxford (牛津大学); People’s Daily (人民日报)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Deep learning models for natural language processing rely heavily on high-quality labeled datasets. However, existing labeling approaches often struggle to balance label quality with labeling cost. To address this challenge, we propose DALL, a text labeling framework that integrates data programming, active learning, and large language models. DALL introduces a structured specification that allows users and large language models to define labeling functions via configuration, rather than code. Active learning identifies informative instances for review, and the large language model analyzes these instances to help users correct labels and to refine or suggest labeling functions. We implement DALL as an interactive labeling system for text labeling tasks. Comparative, ablation, and usability studies demonstrate DALL’s efficiency, the effectiveness of its modules, and its usability.

[HC-24] Audience in the Loop: Viewer Feedback-Driven Content Creation in Micro-drama Production on Social Media

【速读】:该论文旨在解决当前微短剧(micro-drama)创作过程中,内容创作者如何在社交平台快节奏、高互动的环境中实现高效且动态的叙事生成问题。传统研究多聚焦于创作者对平台功能 affordances 或平台偏见的感知,忽视了其实际写作流程与迭代机制。解决方案的关键在于揭示了创作者在短周期工作流中同时承担多重角色,并基于实时观众反馈(如评论、转发和表情包)不断调整剧情走向,从而形成独特的“观众响应式微短剧”叙事模式,将观众互动确立为社交平台上协同创作的新范式。

链接: https://arxiv.org/abs/2602.14045
作者: Gengchen Cao,Tianke He,Yixuan Liu,RAY LC(Correspondences author)
机构: Tsinghua University (清华大学); Sichuan University of Media and Communications (四川传媒学院); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注: 26 pages, 6 figures, CHI 2026

点击查看摘要

Abstract:The popularization of social media has led to increasing consumption of narrative content in byte-sized formats. Such micro-dramas contain fast-pace action and emotional cliffs, particularly attractive to emerging Chinese markets in platforms like Douyin and Kuaishou. Content writers for micro-dramas must adapt to fast-pace, audience-directed workflows, but previous research has focused instead on examining writers’experiences of platform affordances or their perceptions of platform bias, rather than the step-by-step processes through which they actually write and iterative content. In 28 semi-structured interviews with scriptwriters and writers specialized in micro-dramas, we found that the short-turn-around workflow leads to writers taking on multiple roles simultaneously, iteratively adapting to storylines in response to real-time audience feedback in the form of comments, reposts, and memes. We identified unique narrative styles such as AI-generated micro-dramas and audience-responsive micro-dramas. This work reveals audience interaction as a new paradigm for collaborative creative processes on social media.

[HC-25] Customer Service Operations: A Gatekeeper Framework

【速读】:该论文旨在解决客户服务中心在多渠道(如实时聊天、AI聊天机器人和社交媒体等)环境下,如何优化请求处理流程以提升服务效率与质量的问题。其核心挑战在于:每个渠道被视为一个“守门人系统”(gatekeeper system),代理需在服务过程中决策是否继续处理或转交给更高级别(通常成本更高)的服务提供者;同时,企业还需统筹战略层(部署哪些渠道)、战术层(人工客服人员配置及AI聊天机器人的训练程度)与操作层(人工客服的控制机制)的协同决策。论文的关键解决方案是构建一个理论模型来刻画最优请求处理策略,并通过数值方法揭示三类决策之间的动态交互关系,发现引入AI聊天机器人不仅不会降低服务质量,反而可能提升整体服务水平。

链接: https://arxiv.org/abs/2602.13998
作者: Maqbool Dada,Brett Hathaway,Evgeny Kagan
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Customer service has evolved beyond in-person visits and phone calls to include live chat, AI chatbots and social media, among other contact options. Service providers typically refer to these contact modalities as “channels”. Within each channel, customer service agents are tasked with managing and resolving a stream of inbound service requests. Each request involves milestones where the agent must decide whether to keep assisting the customer or to transfer them to a more skilled – and often costlier – provider. To understand how this request resolution process should be managed, we develop a model in which each channel is represented as a gatekeeper system and characterize the structure of the optimal request resolution policy. We then turn to the broader question of the firm’s customer service design, which includes the strategic problem of which channels to deploy, the tactical questions of at what level to staff the live-agent channel and to what extent to train an AI chatbot, and the operational question of how to control the live-agent channel. Examining the interplay between strategic, tactical, and operational decisions through numerical methods, we show, among other insights, that service quality can be improved, rather than diminished, by chatbot implementation.

[HC-26] A System of Care Not Control: Co-Designing Online Safety and Wellbeing Solutions with Guardians ad Litem for Youth in Child Welfare

【速读】:该论文旨在解决当前在线安全技术过度依赖家长监管,而未能有效应对儿童福利系统(Child Welfare System, CWS)中青少年所面临独特在线安全挑战的问题。研究发现,GALs(Guardians ad Litem,法定监护人)在支持青少年在线安全时受限于数字素养不足、机构支持不一致、多方协作缺失及家庭结构复杂性等问题。解决方案的关键在于超越传统的控制与限制模式,构建以增强青年自主性、建立稳定信任关系和促进线上线下互动为核心的多利益相关方协同机制,通过虚拟化身和移动应用等设计概念强化跨角色沟通与治疗支持,从而实现面向CWS青少年的包容性、可持续的在线安全新范式。

链接: https://arxiv.org/abs/2602.13989
作者: Johanna Olesk,Ozioma C. Oguine,Mariana Fernandez Espinosa,Alexis B. Peirce Caudell,Karla Badillo-Urquiola
机构: University of Notre Dame (圣母大学); Indiana University (印第安纳大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 17 pages, 2 figures, 1 table

点击查看摘要

Abstract:Current online safety technologies overly rely on parental mediation and often fail to address the unique challenges faced by youth in the Child Welfare System (CWS). These youth depend on a complex ecosystem of support, including families, caseworkers, and advocates, to safeguard their wellbeing. Within this network, Guardians ad Litem (GALs) play a unique role as court-appointed advocates tasked with ensuring the best interests of youth. Yet little is known about how GALs perceive and support youths’ online safety. To address this gap, we conducted a two-part workshop with 10 GALs to explore their perspectives on online safety and collaboratively envision technology-based solutions tailored to the needs of youth in the CWS. Our findings revealed that GALs struggle to support youth with online safety challenges due to limited digital literacy, inconsistency of institutional support, lack of collaboration among stakeholders, and complexity of family dynamics. While GALs recognized the need for some oversight of youth online activities, they emphasized designing systems that support online safety beyond control or restriction by fostering stability, trust, and meaningful interactions, both online and offline. GALs emphasized the importance of developing tools that enable ongoing communication, therapeutic support, and coordination across stakeholders. Proposed design concepts focused on strengthening youth agency and cross-stakeholder collaboration through virtual avatars and mobile apps. This work provides actionable design concepts for strengthening relationships and communication across care network. It also redefines traditional approaches to online safety, advocating for a holistic, multi-stakeholder online safety paradigm for youth in the CWS.

[HC-27] Avoiding Social Judgment Seeking Privacy: Investigating why Mothers Shift from Facebook Groups to Large Language Models

【速读】:该论文试图解决的问题是:随着社交媒体平台(如Facebook育儿群组)中社会评判、隐私泄露和信息不可靠性等问题日益突出,母亲们在寻求育儿支持时面临的风险增加,导致她们逐渐转向大型语言模型(Large Language Models, LLMs),如ChatGPT和Gemini。研究旨在揭示这一转变的动因及其背后的心理与行为机制。解决方案的关键在于识别出LLMs作为替代性数字支持工具的核心优势:一是提供情感安全与隐私保护,减少社交风险;二是提供即时、结构化且可靠的信息支持,填补传统线上社群无法满足的情感与实用需求。研究指出,LLMs并非完全取代人际支持,而是补充现有支持系统中的空白,强调未来设计应兼顾信息准确性与情绪安全性,以更好地服务于母职群体的数字支持需求。

链接: https://arxiv.org/abs/2602.13941
作者: Shayla Sharmin,Sadia Afrin
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Social media platforms, especially Facebook parenting groups, have long been used as informal support networks for mothers seeking advice and reassurance. However, growing concerns about social judgment, privacy exposure, and unreliable information are changing how mothers seek help. This exploratory mixed-method study examines why mothers are moving from Facebook parenting groups to large language models such as ChatGPT and Gemini. We conducted a cross-sectional online survey of 109 mothers. Results show that 41.3% of participants avoided Facebook parenting groups because they expected judgment from others. This difference was statistically significant across location and family structure. Mothers living in their home country and those in joint families were more likely to avoid Facebook groups. Qualitative findings revealed three themes: social judgment and exposure, LLMs as safe and private spaces, and quick and structured support. Participants described LLMs as immediate, emotionally safe, and reliable alternatives that reduce social risk when asking for help. Rather than replacing human support, LLMs appear to fill emotional and practical gaps within existing support systems. These findings show a change in maternal digital support and highlight the need to design LLM systems that support both information and emotional safety.

[HC-28] Not Seeing the Whole Picture: Challenges and Opportunities in Using AI for Co-Making Physical DIY-AT for People with Visual Impairments

【速读】:该论文旨在解决现有辅助技术(Assistive Technology, AT)普遍采用“一刀切”模式,无法满足视觉障碍人士(People with Visual Impairments, PVI)多样化需求的问题,同时针对当前DIY-AT工具包多依赖工程师协作或编程技能、非专业用户(包括PVI)面临工具不可及、信心不足和技术知识匮乏等障碍。其解决方案的关键在于探索基于大语言模型(Large Language Models, LLMs)的生成式AI如何作为协同设计伙伴,赋能PVI直接参与实体DIY-AT的共创过程,从而突破传统技术开发的门槛,提升个性化适配能力与可访问性。

链接: https://arxiv.org/abs/2602.13874
作者: Ben Kosa,Hsuanling Lee,Jasmine Li,Sanbrita Mondal,Yuhang Zhao,Liang He
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 4 figures, to be presented at CHI 2026

点击查看摘要

Abstract:Existing assistive technologies (AT) often adopt a one-size-fits-all approach, overlooking the diverse needs of people with visual impairments (PVI). Do-it-yourself AT (DIY-AT) toolkits offer one path toward customization, but most remain limited–targeting co-design with engineers or requiring programming expertise. Non-professionals with disabilities, including PVI, also face barriers such as inaccessible tools, lack of confidence, and insufficient technical knowledge. These gaps highlight the need for prototyping technologies that enable PVI to directly make their own AT. Building on emerging evidence that large language models (LLMs) can serve not only as visual aids but also as co-design partners, we present an exploratory study of how LLM-based AI can support PVI in the tangible DIY-AT co-making process. Our findings surface key challenges and design opportunities: the need for greater spatial and visual support, strategies for mitigating novel AI errors, and implications for designing more accessible AI-assisted prototypes.

[HC-29] What happens when reviewers receive AI feedback in their reviews?

【速读】:该论文试图解决生成式 AI (Generative AI) 在学术同行评审(peer review)中应用时所引发的争议问题,即如何在提升评审效率与质量的同时,保障公平性、可问责性和信任感。其解决方案的关键在于通过实证研究首次记录了在 ICLR 2025 会议中部署的官方 AI 反馈工具的实际使用情况,揭示了审稿人对该工具的接受度、交互方式及其对评审过程的影响,从而为设计兼顾 AI 辅助效能与人类专家主导权的评审系统提供依据。

链接: https://arxiv.org/abs/2602.13817
作者: Shiping Chen,Shu Zhong,Duncan P. Brumby,Anna L. Cox
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: ACM CHI 2026

点击查看摘要

Abstract:AI is reshaping academic research, yet its role in peer review remains polarising and contentious. Advocates see its potential to reduce reviewer burden and improve quality, while critics warn of risks to fairness, accountability, and trust. At ICLR 2025, an official AI feedback tool was deployed to provide reviewers with post-review suggestions. We studied this deployment through surveys and interviews, investigating how reviewers engaged with the tool and perceived its usability and impact. Our findings surface both opportunities and tensions when AI augments in peer review. This work contributes the first empirical evidence of such an AI tool in a live review process, documenting how reviewers respond to AI-generated feedback in a high-stakes review context. We further offer design implications for AI-assisted reviewing that aim to enhance quality while safeguarding human expertise, agency, and responsibility.

[HC-30] Ontological grounding for sound and natural robot explanations via large language models AAMAS2026 AAMAS

【速读】:该论文旨在解决人机交互中机器人解释能力不足的问题,即如何使机器人基于自身经验生成既逻辑严谨又符合人类预期的解释。解决方案的关键在于构建一个融合本体推理与大语言模型(Large Language Models, LLMs)的混合框架:本体确保推理过程在领域知识中的语义一致性和逻辑严谨性,而LLMs则负责生成流畅、情境感知且可适应用户反馈的自然语言解释。通过将静态对比本体叙事与LLM代理结合,系统能够根据事件属性判断其典型性,并生成简洁清晰、交互式的解释,从而提升人机协作的透明度与可理解性。

链接: https://arxiv.org/abs/2602.13800
作者: Alberto Olivares-Alarcos,Muhammad Ahsan,Satrio Sanjaya,Hsien-I Lin,Guillem Alenyà
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: An extended abstract of this article is accepted for presentation at AAMAS 2026: Olivares-Alarcos, A., Muhammad, A., Sanjaya, S., Lin, H. and Alenyà, G. (2026). Blending ontologies and language models to generate sound and natural robot explanations. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. IFAAMAS

点击查看摘要

Abstract:Building effective human-robot interaction requires robots to derive conclusions from their experiences that are both logically sound and communicated in ways aligned with human expectations. This paper presents a hybrid framework that blends ontology-based reasoning with large language models (LLMs) to produce semantically grounded and natural robot explanations. Ontologies ensure logical consistency and domain grounding, while LLMs provide fluent, context-aware and adaptive language generation. The proposed method grounds data from human-robot experiences, enabling robots to reason about whether events are typical or atypical based on their properties. We integrate a state-of-the-art algorithm for retrieving and constructing static contrastive ontology-based narratives with an LLM agent that uses them to produce concise, clear, interactive explanations. The approach is validated through a laboratory study replicating an industrial collaborative task. Empirical results show significant improvements in the clarity and brevity of ontology-based narratives while preserving their semantic accuracy. Initial evaluations further demonstrate the system’s ability to adapt explanations to user feedback. Overall, this work highlights the potential of ontology-LLM integration to advance explainable agency, and promote more transparent human-robot collaboration.

[HC-31] Comparables XAI: Faithful Example-based AI Explanations with Counterfactual Trace Adjustments

【速读】:该论文旨在解决生成式 AI (Generative AI) 决策解释中因多特征差异显著而导致的可理解性难题,即如何使用户准确理解决策值相对于示例的变化趋势。其解决方案的关键在于提出 Comparables XAI 方法,通过引入 Trace 调整机制,逐属性单调地追踪每个可比示例(Comparable)到目标对象(Subject)的反事实变化路径,从而实现更忠实、精确且用户感知不确定性更低的示例驱动型解释。

链接: https://arxiv.org/abs/2602.13784
作者: Yifan Zhang,Tianle Ren,Fei Wang,Brian Y Lim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted by CHI 2026

点击查看摘要

Abstract:Explaining with examples is an intuitive way to justify AI decisions. However, it is challenging to understand how a decision value should change relative to the examples with many features differing by large amounts. We draw from real estate valuation that uses Comparables-examples with known values for comparison. Estimates are made more accurate by hypothetically adjusting the attributes of each Comparable and correspondingly changing the value based on factors. We propose Comparables XAI for relatable example-based explanations of AI with Trace adjustments that trace counterfactual changes from each Comparable to the Subject, one attribute at a time, monotonically along the AI feature space. In modelling and user studies, Trace-adjusted Comparables achieved the highest XAI faithfulness and precision, user accuracy, and narrowest uncertainty bounds compared to linear regression, linearly adjusted Comparables, or unadjusted Comparables. This work contributes a new analytical basis for using example-based explanations to improve user understanding of AI decisions.

[HC-32] Human Oversight-by-Design for Accessible Generative IUIs

【速读】:该论文旨在解决生成式 AI (Generative AI) 在高风险工作流(如医疗沟通)中因界面内容不可靠、缺乏可访问性及监督机制缺失而导致的伦理与可信度风险问题,包括幻觉、语义失真、偏见和无障碍障碍等,这些因素会削弱系统的可靠性并限制用户对AI支持过程的理解、监控与干预能力。其解决方案的关键在于提出“监督即设计”(Oversight-by-Design)框架:将人类判断嵌入整个生成式用户界面(UI)管道作为架构承诺,通过自动检查识别风险(如可读性、语义保真度、事实一致性与符合标准的无障碍约束),当阈值被突破或不确定性较高时触发强制人工在环(Human-in-the-Loop, HITL)审查;同时引入人在回路(Human-on-the-Loop, HOTL)监督机制持续监测系统级信号(警报频率、升级率与合规证据),并通过结构化反馈转化为治理行动(规则与提示更新、阈值校准、可追溯审计日志),从而实现对高风险场景下生成式 UI 系统的规模化干预与可验证监督。

链接: https://arxiv.org/abs/2602.13745
作者: Blessing Jerry,Lourdes Moreno,Paloma Martínez
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Preprint. Accepted for publication in CEUR Workshop Proceedings (IUI Workshops 2026). 15 pages, 1 figure

点击查看摘要

Abstract:LLM-generated interfaces are increasingly used in high-consequence workflows (e.g., healthcare communication), where how information is presented can impact downstream actions. These interfaces and their content support human interaction with AI-assisted decision-making and communication processes and should remain accessible and usable for people with disabilities. Accessible plain-language interfaces serve as an enabling infrastructure for meaningful human oversight. In these contexts, ethical and trustworthiness risks, including hallucinations, semantic distortion, bias, and accessibility barriers, can undermine reliability and limit users’ ability to understand, monitor, and intervene in AI-supported processes. Yet, in practice, oversight is often treated as a downstream check, without clear rules for when human intervention is required or who is accountable. We propose oversight-by-design: embedding human judgment across the pipeline as an architectural commitment, implemented via escalation policies and explicit UI controls for risk signalling and intervention. Automated checks flag risk in generated UI communication that supports high-stakes workflows (e.g., readability, semantic fidelity, factual consistency, and standards-based accessibility constraints) and escalate to mandatory Human-in-the-Loop (HITL) review before release when thresholds are violated, or uncertainty is high. Human-on-the-Loop (HOTL) supervision monitors system-level signals over time (alerts, escalation rates, and compliance evidence) to tune policies and detect drift. Structured review feedback is translated into governance actions (rule and prompt updates, threshold calibration, and traceable audit logs), enabling scalable intervention and verifiable oversight for generative UI systems that support high-stakes workflows.

[HC-33] ransferable XAI: Relating Understanding Across Domains with Explanation Transfer

【速读】:该论文旨在解决当前可解释人工智能(Explainable AI, XAI)仅针对单一应用场景提供解释,导致用户在面对相关任务时面临过度泛化或重复记忆的问题。其核心挑战在于:尽管不同领域间存在共享的解释因子(如BMI对心脏病和糖尿病风险的影响程度一致),但这些因子的作用方式可能因任务或属性差异而异(如胸痛更指向心脏病而非糖尿病)。解决方案的关键是提出可迁移的可解释人工智能(Transferable XAI)框架,通过线性因子解释中的广义仿射变换(general affine transformation)建模不同领域解释之间的关系,实现跨域理解的迁移。该框架支持三种类型的迁移:数据子空间间的平移(translation)、决策任务间的缩放(scaling)以及属性间的映射(mapping),从而显著提升用户在新领域中对AI决策的理解准确性、因素召回率及跨域解释关联能力。

链接: https://arxiv.org/abs/2602.13675
作者: Fei Wang,Yifan Zhang,Brian Y. Lim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 40 pages, accepted by IUI2026

点击查看摘要

Abstract:Current Explainable AI (XAI) focuses on explaining a single application, but when encountering related applications, users may rely on their prior understanding from previous explanations. This leads to either overgeneralization and AI overreliance, or burdensome independent memorization. Indeed, related decision tasks can share explanatory factors, but with some notable differences; e.g., body mass index (BMI) affects the risks for heart disease and diabetes at the same rate, but chest pain is more indicative of heart disease. Similarly, models using different attributes for the same task still share signals; e.g., temperature and pressure affect air pollution but in opposite directions due to the ideal gas law. Leveraging transfer of learning, we propose Transferable XAI to enable users to transfer understanding across related domains by explaining the relationship between domain explanations using a general affine transformation framework applied to linear factor explanations. The framework supports explanation transfer across various domain types: translation for data subspace (subsuming prior work on Incremental XAI), scaling for decision task, and mapping for attributes. Focusing on task and attributes domain types, in formative and summative user studies, we investigated how well participants could understand AI decisions from one domain to another. Compared to single-domain and domain-independent explanations, Transferable XAI was the most helpful for understanding the second domain, leading to the best decision faithfulness, factor recall, and ability to relate explanations between domains. This framework contributes to improving the reusability of explanations across related AI applications by explaining factor relationships between subspaces, tasks, and attributes.

[HC-34] Building Autonomous GUI Navigation via Agent ic-Q Estimation and Step-Wise Policy Optimization

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实世界图形用户界面(Graphical User Interface, GUI)交互中面临的非平稳环境问题,该问题导致数据收集和策略优化的计算成本过高。解决方案的关键在于提出一种以MLLM为核心的GUI代理框架,其核心创新为两个组成部分:一是代理Q值估计(agentic-Q estimation),用于生成步骤级价值以评估动作对任务完成的贡献;二是逐步策略优化(step-wise policy optimization),利用由策略自身生成的状态-动作轨迹进行强化学习优化,且策略更新与环境解耦,从而实现高效、稳定的训练过程。此方法显著降低了数据采集开销并提升了策略优化的稳定性,实验证明该框架使Ovis2.5-9B模型在GUI导航和定位基准测试中表现优异,甚至超越更大规模的竞品模型。

链接: https://arxiv.org/abs/2602.13653
作者: Yibo Wang,Guangda Huzhang,Yuwei Hu,Yu Xia,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.

[HC-35] Search in Transition: A Study of University Students Perspectives on Using LLM s and Traditional Search Engines in English Test Problem Solving for Higher Study

【速读】:该论文旨在解决大学英语考试备考过程中,学生在使用传统搜索引擎(如Google)与生成式AI(Generative AI)大语言模型(LLM)时所面临的工具效能不均衡问题。研究发现,学生在不同任务中会切换使用两种工具:依赖Google获取权威、多源信息以验证规则,而利用GPT进行摘要、解释、改写和答题初稿撰写;然而,单一工具无法全面支持所有备考环节,导致认知负荷较高。解决方案的关键在于提出一种嵌入搜索界面的聊天机器人原型,融合GPT的对话优势与Google的信息可靠性,构建混合型辅助系统,从而提升备考效率并降低认知负担。

链接: https://arxiv.org/abs/2602.13629
作者: Tarek Rahman,Md Shaharia Hossen,Mark Protik Mondol,Jannatun Noor Mukta
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: arXiv admin note: substantial text overlap with arXiv:2510.17726 by other authors

点击查看摘要

Abstract:With the growing integration of Artificial Intelligence (AI) in educational contexts, university students preparing for English language tests increasingly alternate between traditional search engines, such as Google, and large language models (LLMs) to support their test-related problem-solving. This study examines students perceptions of these tools, focusing on usability, efficiency, and their integration into English language test preparation this http URL a mixed-methods approach, we surveyed 140 university students from diverse academic disciplines and conducted in-depth interviews with 20 participants. Quantitative analyses, including ANOVA and chi-square tests, were employed to evaluate differences in perceived efficiency, satisfaction, and overall tool preference. The qualitative findings indicate that students frequently switch between GPT and Google depending on task demands, relying on Google for credible, multi-source information and rule verification, while using GPT for summarization, explanation, paraphrasing, and drafting responses for English test tasks. As neither tool alone was found to adequately support all aspects of English language test problem solving, participants expressed a strong preference for a hybrid solution. In response, we propose a prototype in the form of a chatbot embedded within a search interface, combining GPTs conversational strengths with Google reliability to improve English language test preparation and reduce cognitive load.

[HC-36] Anthropomorphism on Risk Perception: The Role of Trust and Domain Knowledge in Decision-Support AI

【速读】:该论文试图解决的问题是:拟人化设计(anthropomorphic design)如何影响用户对决策支持系统中风险的感知,以及这种影响是否受到用户领域知识(如金融知识)调节。解决方案的关键在于提出并验证一个整合心理理论的模型,即拟人化通过认知信任(cognitive trust)和情感信任(affective trust)两条路径间接降低风险感知,且这一机制在不同水平的领域知识下表现出显著差异——低知识用户因认知信任增强而感知风险上升,高知识用户则因认知与情感信任均增强而感知风险下降。这为负责任的人工智能(responsible AI)设计提供了可校准的信任机制依据。

链接: https://arxiv.org/abs/2602.13625
作者: Manuele Reani,Xiangyang He,Zuolan Bao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Anthropomorphic design is routinely used to make conversational agents more approachable and engaging. Yet its influence on users’ perceptions remains poorly understood. Drawing on psychological theories, we propose that anthropomorphism influences risk perception via two complementary forms of trust, and that domain knowledge moderates these relationships. To test our model, we conducted a large-scale online experiment (N = 1,256) on a financial decision-support system implementing different anthropomorphic designs. We found that anthropomorphism indirectly reduces risk perception by increasing both cognitive and affective trust. Domain knowledge moderates these paths: participants with low financial knowledge experience a negative indirect effect of perceived anthropomorphism on risk perception via cognitive trust, whereas those with high financial knowledge exhibit a positive direct and indirect effect. We discuss theoretical contributions to human-AI interaction and design implications for calibrating trust in anthropomorphic decision-support systems for responsible AI.

[HC-37] he Shadow Boss: Identifying Atomized Manipulations in Agent ic Employment of XR Users using Scenario Constructions

【速读】:该论文试图解决的问题是:随着生成式 AI (Generative AI) 作为经济主体直接雇佣、指挥并支付人类劳动者的新劳动模式——即“代理就业”(Agentic Employment)的兴起,传统“幽灵工作”(ghost work)结构被颠覆,人类劳动者沦为不可见软件实体的“生物执行器”(biological actuators),从而引发一系列伦理、法律与认知风险。解决方案的关键在于引入以用户为中心的扩展现实(Extended Reality, XR)设计框架和政策干预机制,通过将XR作为关键“控制界面”(control surface),实现对AI代理与人类劳动者之间关系的可解释性(legibility)和可控性提升,从而防止人类劳动被简化为数字意识的无摩擦硬件层,并应对包括责任真空、认知能力退化及社会操控在内的七大风险向量。

链接: https://arxiv.org/abs/2602.13622
作者: Lik-Hang Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 28 pages, 3 figures

点击查看摘要

Abstract:The emerging paradigm of Agentic Employment" is a labor model where autonomous AI agents, acting as economic principals rather than mere management tools, directly hire, instruct, and pay human workers. Facilitated by the launch of platforms like this http URL in February 2026, this shift inverts the traditional ghost work" dynamic, positioning visible human workers as biological actuators" for invisible software entities. With speculative design approach, we analyze how Extended Reality (XR) serves as the critical control surface" for this relationship, enabling agents to issue granular, context-free micro-instructions while harvesting real-time environmental data. Through a scenario construction methodology, we identify seven key risk vectors, including the creation of a liability void where humans act as moral crumple zones for algorithmic risk, the acceleration of cognitive deskilling through ``Shadow Boss" micromanagement, and the manipulation of civic and social spheres via Diminished Reality (DR). The findings suggest that without new design frameworks prioritizing agency and legibility, Agentic Employment threatens to reduce human labor to a friction-less hardware layer for digital minds, necessitating urgent user-centric XR and policy interventions.

[HC-38] Designing Health Technologies for Immigrant Communities: Exploring Healthcare Providers Communication Strategies with Patients

【速读】:该论文旨在解决在发达国家中,医疗提供者与移民群体患者之间有效沟通不足的问题,尤其关注文化差异对健康沟通的影响。其关键解决方案在于识别并总结出四类有效的沟通策略:承认(acknowledgment)、社区参与(community involvement)、渐进式照护(gradual care)和适应性沟通实践(adaptive communication practices),强调通过提升文化胜任力(cultural competence)来设计更契合移民社区需求的健康信息技术(Health Technology),为HCI研究者提供可操作、情境化的文化胜任力建设路径。

链接: https://arxiv.org/abs/2602.13598
作者: Zhanming Chen,Alisha Ghaju,May Hang,Juan F. Maestre,Ji Youn Shin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 19 pages, Conference

点击查看摘要

Abstract:Patient-provider communication is an important aspect of successful healthcare, as it can directly lead to positive health outcomes. Previous studies examined factors that facilitate communication between healthcare providers and patients in socially marginalized communities, especially developing countries, and applied identified factors to technology development. However, there is limited understanding of how providers work with patients from immigrant populations in a developed country. By conducting semi-structured interviews with 15 providers working with patients from an immigrant community with unique cultural characteristics, we identified providers’ effective communication strategies, including acknowledgment, community involvement, gradual care, and adaptive communication practices (i.e., adjusting the communication style). Based on our findings, we highlight cultural competence and discuss design implications for technologies to support health communication in immigrant communities. Our suggestions propose approaches for HCI researchers to identify practical, contextualized cultural competence for their health technology design.

[HC-39] What Do We Mean by Pilot Study: Early Findings from a Meta-Review of Pilot Study Reporting at CHI

【速读】:该论文试图解决人机交互(Human-Computer Interaction, HCI)领域中 pilot study(试点研究)概念模糊、方法论缺失的问题。当前HCI研究中广泛使用试点研究来支持设计决策、验证流程或说明方法选择,但缺乏统一的定义、指导原则和报告标准,导致其作用常被轻描淡写地提及,且难以评估其对主研究的实际贡献。解决方案的关键在于明确试点研究在HCI中的功能定位,并推动建立规范化的定义、实施指南与报告框架,从而提升其科学严谨性与可复现性,填补该领域的系统性方法论空白。

链接: https://arxiv.org/abs/2602.13488
作者: Belu Ticona,Amna Liaqat,Antonios Anastasopoulos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pilot studies (PS) are ubiquitous in HCI research. CHI papers routinely reference ‘pilot studies’, ‘pilot tests’, or ‘preliminary studies’ to justify design decisions, verify procedures, or motivate methodological choices. Yet despite their frequency, the role of pilot studies in HCI remains conceptually vague and empirically underexamined. Unlike fields such as medicine, nursing, and education, where pilot and feasibility studies have well-established definitions, guidelines, reporting standards and even a dedicated research journal, the CHI community lacks a shared understanding of what constitutes a pilot study, why they are conducted, and how they should be reported. Many papers reference pilots ‘in passing’, without details about design, outcomes, or how the pilot informed the main study. This variability suggests a methodological blind spot in our community.

[HC-40] GLIMPSE : Real-Time Text Recognition and Contextual Understanding for VQA in Wearables

【速读】:该论文旨在解决在资源受限的可穿戴设备上部署文本视觉问答(Text VQA)时面临的两大挑战:一是高分辨率视频流对电池续航和热管理造成的压力,二是现有模型难以在实时视频流中保持跨帧的连贯时序上下文。解决方案的关键在于利用文本识别(OCR)与场景理解在分辨率需求上的非对称性——OCR需要精细细节而场景理解可容忍粗粒度特征,从而提出一种混合架构:设备端仅对关键区域进行选择性高分辨率OCR处理,同时以低分辨率流式传输视频以保留视觉上下文。该方法在五个任务类别的Text VQA基准测试中实现了72%的准确率,功耗仅为全分辨率流式的0.49倍,支持在可穿戴设备上持续运行高质量的VQA任务。

链接: https://arxiv.org/abs/2602.13479
作者: Akhil Ramachandran,Ankit Arun,Ashish Shenoy,Abhay Harpale,Srihari Jayakumar,Debojeet Chatterjee,Mohsen Moslehpour,Pierce Chuang,Yichao Lu,Vikas Bhardwaj,Peyman Heidari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Video Large Language Models (Video LLMs) have shown remarkable progress in understanding and reasoning about visual content, particularly in tasks involving text recognition and text-based visual question answering (Text VQA). However, deploying Text VQA on wearable devices faces a fundamental tension: text recognition requires high-resolution video, but streaming high-quality video drains battery and causes thermal throttling. Moreover, existing models struggle to maintain coherent temporal context when processing text across multiple frames in real-time streams. We observe that text recognition and visual reasoning have asymmetric resolution requirements - OCR needs fine detail while scene understanding tolerates coarse features. We exploit this asymmetry with a hybrid architecture that performs selective high-resolution OCR on-device while streaming low-resolution video for visual context. On a benchmark of text-based VQA samples across five task categories, our system achieves 72% accuracy at 0.49x the power consumption of full-resolution streaming, enabling sustained VQA sessions on resource-constrained wearables without sacrificing text understanding quality.

[HC-41] How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People

【速读】:该论文旨在解决盲人和低视力(Blind and Low Vision, BLV)人群在日常生活中使用多模态大语言模型(Multimodal Large Language Models, MLLMs)进行视觉信息获取时,现有应用在真实场景中性能有限、可靠性不足的问题。尽管MLLM-enabled视觉解释应用相较于传统基于图像描述或光学字符识别(OCR)的工具提供了更自然的对话式交互能力,但其实际表现仍存在显著缺陷,如错误回答率高达22.2%、约10.8%的请求无法响应。研究提出“视觉助理”(visual assistant)技能作为解决方案的关键,即一套面向目标导向、可靠辅助行为的设计原则与实践指南,以提升BLV用户在日常生活情境下对视觉信息访问的准确性与可用性。

链接: https://arxiv.org/abs/2602.13469
作者: Ricardo E. Gonzalez Penuela,Crescentia Jung,Sharon Y Lin,Ruiying Hu,Shiri Azenkot
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 24 pages, 17 figures, 3 tables, appendix section, to appear main track CHI 2026

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information in their daily lives. Unlike traditional visual interpretation tools that provide access through captions and OCR (text recognition through camera input), MLLM-enabled applications support access through conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about their performance in the real-world and their implications for BLV people’s everyday life remain limited. To address this, we conducted a two-week diary study, where we captured 20 BLV participants’ use of an MLLM-enabled visual interpretation application. Although participants rated the visual interpretations of the application as “somewhat trustworthy” (mean=3.76 out of 5, max=very trustworthy) and “somewhat satisfying” (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers (22.2%) or abstained (10.8%) from responding to follow-up requests. Our work demonstrates that MLLMs can improve the accuracy of descriptive visual interpretations, but that supporting everyday use also depends on the “visual assistant” skill – a set of behaviors for providing goal-directed, reliable assistance. We conclude by proposing the “visual assistant” skill and practical guidelines to help future MLLM-enabled visual interpretation applications better support BLV people’s access to visual information.

[HC-42] Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety IJCAI

【速读】:该论文旨在解决低资源语言环境中网络欺凌(cyberbullying)检测的难题,特别是针对斯瓦希里语(Swahili)中经过混淆处理的 abusive language(攻击性语言)的识别问题。其关键解决方案在于采用多种机器学习模型(如支持向量机 SVM、逻辑回归和决策树),并通过参数调优与合成少数类过采样技术(SMOTE)缓解数据不平衡问题,从而提升模型在小规模、不均衡文本数据上的检测性能。研究结果表明,尽管这些模型在高维文本特征下表现良好,但受限于数据规模和分布,仍需进一步扩充数据集并探索迁移学习与多模态融合等方法以增强泛化能力与文化敏感性。

链接: https://arxiv.org/abs/2602.13455
作者: Phyllis Nabangi,Abdul-Jalil Zakaria,Jema David Ndibwile
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at the Second IJCAI AI for Good Symposium in Africa, hosted by Deep Learning Indaba, 7 pages, 1 figure

点击查看摘要

Abstract:The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset’s small size and imbalance limit our findings’ generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms. Comments: Accepted at the Second IJCAI AI for Good Symposium in Africa, hosted by Deep Learning Indaba, 7 pages, 1 figure Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC) Cite as: arXiv:2602.13455 [cs.CL] (or arXiv:2602.13455v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.13455 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-43] Uncertain Pointer: Situated Feedforward Visualizations for Ambiguity-Aware AR Target Selection

【速读】:该论文旨在解决增强现实(Augmented Reality, AR)中因输入歧义导致的目标识别难题,特别是在远距离物体或杂乱场景下用户查询时的准确性问题。其核心挑战在于如何通过视觉前馈(visual feedforward)机制有效支持用户在交互过程中快速、准确地识别目标。解决方案的关键在于提出了一种系统性的可视化方法——Uncertain Pointer,它通过在用户确认前标注多个候选目标来实现歧义消解:一方面利用颜色等视觉标识符赋予每个候选目标独特身份以提升可区分性,另一方面通过调节透明度等视觉强度参数反映系统的置信度水平。研究基于30年文献分析构建了包含25种指针样式的指针空间,并通过两项在线实验(n=60和40)量化评估了不同设计在目标可见性、可辨识度、用户偏好与心理负荷等方面的性能差异,最终提炼出针对不同AR应用场景的指针选择设计建议。

链接: https://arxiv.org/abs/2602.13433
作者: Ching-Yi Tsai,Nicole Tacconi,Andrew D. Wilson,Parastoo Abtahi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the 2026 CHI Conference on Human Factors in Computing Systems (CHI 2026). 31 Pages, 10 figures, 16 tables

点击查看摘要

Abstract:Target disambiguation is crucial in resolving input ambiguity in augmented reality (AR), especially for queries over distant objects or cluttered scenes on the go. Yet, visual feedforward techniques that support this process remain underexplored. We present Uncertain Pointer, a systematic exploration of feedforward visualizations that annotate multiple candidate targets before user confirmation, either by adding distinct visual identities (e.g., colors) to support disambiguation or by modulating visual intensity (e.g., opacity) to convey system uncertainty. First, we construct a pointer space of 25 pointers by analyzing existing placement strategies and visual signifiers used in target visualizations across 30 years of relevant literature. We then evaluate them through two online experiments (n = 60 and 40), measuring user preference, confidence, mental ease, target visibility, and identifiability across varying object distances and sparsities. Finally, from the results, we derive design recommendations in choosing different Uncertain Pointers based on AR context and disambiguation techniques.

[HC-44] Revisiting Worker-Centered Design: Tensions Blind Spots and Action Spaces

【速读】:该论文旨在解决当前Worker-Centered Design (WCD) 在实践中缺乏系统性反思与评估的问题,尤其关注其在食品配送行业中的实施效果、潜在矛盾及对劳动者集体行动的支持能力。解决方案的关键在于提出一个基于“四维分析框架”的反思性分析方法,从多劳工系统(Multi-Laborer System)视角识别出劳动链条中的冲突、WCD的扭曲实施、设计者政治经济认知局限等关键盲点,并进一步构建“诊断-生成”(Diagnostic-Generative)路径,以应对劳资冲突和制度重构风险,同时提升设计者在政策与经济层面的想象力,从而拓展WCD的实际应用空间并增强其现实相关性。

链接: https://arxiv.org/abs/2602.13424
作者: Shuhao Ma,John Zimmerman,Valentina Nisi,Nuno Jardim Nunes
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 7 figure, accepted by CHI’26

点击查看摘要

Abstract:Worker-Centered Design (WCD) has gained prominence over the past decade, offering researchers and practitioners ways to engage worker agency and support collective actions for workers. Yet few studies have systematically revisited WCD itself, examining its implementations, challenges, and practical impact. Through a four-lens analytical framework that examines multiple facets of WCD within food delivery industry, we identify critical tensions and blind spots from a Multi-Laborer System perspective. Our analysis reveals conflicts across labor chains, distorted implementations of WCD, designers’ sometimes limited political-economic understanding, and workers as active agents of change. These insights further inform a Diagnostic-Generative pathway that helps to address recurring risks, including labor conflicts and institutional reframing, while cultivating designers’ policy and economic imagination. Following the design criticism tradition, and through a four-lens reflexive analysis, this study expands the action space for WCD and strengthens its relevance to real-world practice.

[HC-45] Situation Graph Prediction: Structured Perspective Inference for User Modeling

【速读】:该论文旨在解决生成式 AI (Generative AI) 中视角感知(Perspective-Aware AI)建模的核心挑战,即如何有效建模随时间演化的内部状态(如目标、情绪、情境),而不仅仅是用户偏好。当前进展受限于数据瓶颈:数字足迹具有隐私敏感性,且视角状态通常缺乏标注。为此,作者提出情境图预测(Situation Graph Prediction, SGP)任务,将视角建模视为一个逆向推理问题——从可观察的多模态痕迹中重构结构化且符合本体论对齐的视角表示。解决方案的关键在于采用“结构优先”的合成数据生成策略,通过设计确保潜在标签与可观测痕迹的一致性,从而在无需真实标注的情况下实现有效建模。

链接: https://arxiv.org/abs/2602.13319
作者: Jisung Shin,Daniel Platnick,Marjan Alirezaie,Hossein Rahnama
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Preprint under review, 4 pages

点击查看摘要

Abstract:Perspective-Aware AI requires modeling evolving internal states–goals, emotions, contexts–not merely preferences. Progress is limited by a data bottleneck: digital footprints are privacy-sensitive and perspective states are rarely labeled. We propose Situation Graph Prediction (SGP), a task that frames perspective modeling as an inverse inference problem: reconstructing structured, ontology-aligned representations of perspective from observable multimodal artifacts. To enable grounding without real labels, we use a structure-first synthetic generation strategy that aligns latent labels and observable traces by design. As a pilot, we construct a dataset and run a diagnostic study using retrieval-augmented in-context learning as a proxy for supervision. In our study with GPT-4o, we observe a gap between surface-level extraction and latent perspective inference–indicating latent-state inference is harder than surface extraction under our controlled setting. Results suggest SGP is non-trivial and provide evidence for the structure-first data synthesis strategy.

[HC-46] Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey

【速读】:该论文旨在解决在专业与个人场景中,用户如何权衡人工智能(AI)工具的准确性(accuracy)以决定其采用行为的问题,以及影响这种权衡的因素和当AI工具不可用时用户的应对策略。其核心解决方案在于提出“情境特异性可靠性”(context-specific reliability)的概念,将准确性定义为输出结果在特定情境下与用户意图的一致程度,且该一致性需满足由任务风险和纠错成本所决定的容差阈值。研究通过在线调查(N=300)发现,工作场景中对高准确性的要求显著高于个人生活(24.1% vs. 8.8%),且重度使用和经验丰富的用户在工作中设定更严格的标准;此外,当工具不可用时,个人生活中的干扰程度明显高于工作场景(34.1% vs. 15.3%)。这一框架为理解用户对生成式AI(Generative AI)工具的适应性决策提供了实证依据与理论支撑。

链接: https://arxiv.org/abs/2602.13283
作者: Gaston Besanson,Federico Todeschini
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We study how people trade off accuracy when using AI-powered tools in professional versus personal contexts for adoption purposes, the determinants of those trade-offs, and how users cope when AI/apps are unavailable. Because modern AI systems (especially generative models) can produce acceptable but non-identical outputs, we define “accuracy” as context-specific reliability: the degree to which an output aligns with the user’s intent within a tolerance threshold that depends on stakes and the cost of correction. In an online survey (N=300), among respondents with both accuracy items (N=170), the share requiring high accuracy (top-box) is 24.1% at work vs. 8.8% in personal life (+15.3 pp; z=6.29, p0.001). The gap remains large under a broader top-two-box definition (67.0% vs. 32.9%) and on the full 1-5 ordinal scale (mean 3.86 vs. 3.08). Heavy app use and experience patterns correlate with stricter work standards (H2). When tools are unavailable (H3), respondents report more disruption in personal routines than at work (34.1% vs. 15.3%, p0.01). We keep the main text focused on these substantive results and place test taxonomy and power derivations in a technical appendix.

[HC-47] Human-Centered Explainable AI for Security Enhancement: A Deep Intrusion Detection Framework

【速读】:该论文旨在解决当前网络安全领域中入侵检测系统(Intrusion Detection System, IDS)在面对日益复杂和频繁的网络威胁时,难以同时实现高准确率与决策可解释性的难题。其解决方案的关键在于构建一个融合可解释人工智能(Explainable Artificial Intelligence, XAI)的深度学习框架,通过结合卷积神经网络(Convolutional Neural Network, CNN)与长短期记忆网络(Long Short-Term Memory, LSTM)以有效捕捉流量序列中的时序依赖关系,并引入SHapley Additive exPlanations(SHAP)方法增强模型决策的透明度,从而帮助安全分析师理解并验证检测结果。实验表明,该框架在NSL-KDD基准数据集上优于传统IDS及黑盒深度学习模型,在保持高精度的同时显著提升了模型的可解释性。

链接: https://arxiv.org/abs/2602.13271
作者: Md Muntasir Jahid Ayan,Md. Shahriar Rashid,Tazzina Afroze Hassan,Hossain Md. Mubashshir Jamil,Mahbubul Islam,Lisan Al Amin,Rupak Kumar Das,Farzana Akter,Faisal Quader
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presented a novel IDS framework that integrated Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework was evaluated experimentally using the benchmark dataset NSL-KDD, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks for capturing temporal dependencies in traffic sequences. Our deep learning results showed that both CNN and LSTM reached 0.99 for accuracy, whereas LSTM outperformed CNN at macro average precision, recall, and F-1 score. For weighted average precision, recall, and F-1 score, both models scored almost similarly. To ensure interpretability, the XAI model SHapley Additive exPlanations (SHAP) was incorporated, enabling security analysts to understand and validate model decisions. Some notable influential features were srv_serror_rate, dst_host_srv_serror_rate, and serror_rate for both models, as pointed out by SHAP. We also conducted a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system’s reliability and usability. This work highlighted the potential of combining performance and transparency in cybersecurity solutions and recommends future enhancements through adaptive learning for real-time threat detection.

[HC-48] Real-World Design and Deployment of an Embedded GenAI-powered 9-1-1 Calltaking Training System: Experiences and Lessons Learned

【速读】:该论文旨在解决公共安全应急响应体系中接线员(Emergency Call-takers)培训资源严重不足的问题,具体表现为人员短缺超过25%、单个新员工培训需高达720小时的一对一指导,导致经验丰富的接线员无法参与日常值班。传统培训方法难以在有限人力和时间下实现规模化与及时反馈。解决方案的关键在于:与Metro Nashville Department of Emergency Communications(MNDEC)合作,设计并部署了一个基于生成式AI(Generative AI)的呼叫接听培训系统,在真实运营环境中完成从试点到190名用户、1,120次训练会话的扩展。通过分析98,429次用户交互日志及组织流程,提炼出四个关键实践教训,强调系统交付、严谨性、鲁棒性和人因因素的重要性,并提出相应的设计与治理策略,为在安全关键型公共服务场景中嵌入AI驱动的培训系统提供可落地的指导。

链接: https://arxiv.org/abs/2602.13241
作者: Zirong Chen,Meiyi Ma
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Emergency call-takers form the first operational link in public safety response, handling over 240 million calls annually while facing a sustained training crisis: staffing shortages exceed 25% in many centers, and preparing a single new hire can require up to 720 hours of one-on-one instruction that removes experienced personnel from active duty. Traditional training approaches struggle to scale under these constraints, limiting both coverage and feedback timeliness. In partnership with Metro Nashville Department of Emergency Communications (MNDEC), we designed, developed, and deployed a GenAI-powered call-taking training system under real-world constraints. Over six months, deployment scaled from initial pilot to 190 operational users across 1,120 training sessions, exposing systematic challenges around system delivery, rigor, resilience, and human factors that remain largely invisible in controlled or purely simulated evaluations. By analyzing deployment logs capturing 98,429 user interactions, organizational processes, and stakeholder engagement patterns, we distill four key lessons, each coupled with concrete design and governance practices. These lessons provide grounded guidance for researchers and practitioners seeking to embed AI-driven training systems in safety-critical public sector environments where embedded constraints fundamentally shape socio-technical design.

[HC-49] Enhanced Accessibility for Mobile Indoor Navigation

【速读】:该论文旨在解决视障人士在室内空间导航中面临的挑战,这些问题包括感官信息处理困难、不确定性应对以及对辅助工具的依赖。解决方案的关键在于开发一款以无障碍设计为核心原则的室内导航应用程序,其设计基于用户访谈与Web内容无障碍指南(Web Content Accessibility Guidelines, WCAG)的综合分析,从而识别出切实可行的设计需求,并集成增强功能以满足视障用户的特殊需求。通过包含视障和视力正常用户的多轮可用性测试,验证了该应用在包容性界面及兼容多种无障碍工具和Android设备设置方面的有效性。

链接: https://arxiv.org/abs/2602.13233
作者: Johannes Wortmann,Bernd Schäufele,Konstantin Klipp,Ilja Radusch,Katharina Blaß,Thomas Jung
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Published and presented at the 2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN)

点击查看摘要

Abstract:The navigation of indoor spaces poses difficult challenges for individuals with visual impairments, as it requires processing of sensory information, dealing with uncertainties, and relying on assistance. To tackle these challenges, we present an indoor navigation app that places importance on accessibility for visually impaired users. Our approach involves a combination of user interviews and an analysis of the Web Content Accessibility Guidelines. With this approach, we are able to gather invaluable insights and identify design requirements for the development of an indoor navigation app. Based on these insights, we develop an indoor navigation app that prioritizes accessibility, integrating enhanced features to meet the needs of visually impaired users. The usability of the app is being thoroughly evaluated through tests involving both visually impaired and sighted users. Initial feedback has been positive, with users appreciating the inclusive user interface and the usability with a wide range of accessibility tools and Android device settings.

[HC-50] Agent ic AI for Commercial Insurance Underwriting with Adversarial Self-Critique

【速读】:该论文旨在解决商业保险承保流程中AI应用面临的可靠性与安全性问题,尤其是在监管严格、高风险场景下,现有AI解决方案缺乏全面的推理能力及内部保障机制,难以实现完全自动化。其关键解决方案是提出一种“决策否定型”人机协同智能体系统(decision-negative, human-in-the-loop agentic system),通过引入对抗式自我批判机制(adversarial self-critique mechanism)作为边界安全架构,在AI主代理提交建议前由一个批评代理对其结论进行挑战,从而形成内部制衡体系,有效降低错误率并提升决策准确性。实验表明,该机制将AI幻觉率从11.3%降至3.8%,决策准确率从92%提升至96%,同时确保所有最终决策权始终由人类掌控,为受监管领域中的负责任AI集成提供了可操作范式。

链接: https://arxiv.org/abs/2602.13213
作者: Joyjit Roy,Samaresh Kumar Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 9 pages, 8 figuers, 6 tables, submitted aty 9th International Conference on Modern Computing, Networking and Applications (MCNA2026)

点击查看摘要

Abstract:Commercial insurance underwriting is a labor-intensive process that requires manual review of extensive documentation to assess risk and determine policy pricing. While AI offers substantial efficiency improvements, existing solutions lack comprehensive reasoning capabilities and internal mechanisms to ensure reliability within regulated, high-stakes environments. Full automation remains impractical and inadvisable in scenarios where human judgment and accountability are critical. This study presents a decision-negative, human-in-the-loop agentic system that incorporates an adversarial self-critique mechanism as a bounded safety architecture for regulated underwriting workflows. Within this system, a critic agent challenges the primary agent’s conclusions prior to submitting recommendations to human reviewers. This internal system of checks and balances addresses a critical gap in AI safety for regulated workflows. Additionally, the research develops a formal taxonomy of failure modes to characterize potential errors by decision-negative agents. This taxonomy provides a structured framework for risk identification and risk management in high-stakes applications. Experimental evaluation using 500 expert-validated underwriting cases demonstrates that the adversarial critique mechanism reduces AI hallucination rates from 11.3% to 3.8% and increases decision accuracy from 92% to 96%. At the same time, the framework enforces strict human authority over all binding decisions by design. These findings indicate that adversarial self-critique supports safer AI deployment in regulated domains and offers a model for responsible integration where human oversight is indispensable.

计算机视觉

[CV-0] EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

【速读】:该论文旨在解决生成式视频编辑中计算效率低下的问题,特别是现有基于预训练视频基础模型的方法在进行局部编辑时仍需处理全视频上下文,导致资源浪费。解决方案的关键在于提出EditCtrl框架,其核心创新是引入一个仅作用于掩码区域的局部视频上下文模块(local video context module),使计算成本与编辑区域大小成正比;同时辅以轻量级的时间全局上下文嵌入器(temporal global context embedder),以最小开销保障视频整体一致性。该设计不仅将计算效率提升至现有最先进方法的10倍,还提升了编辑质量,并支持多区域编辑和自回归内容传播等新能力。

链接: https://arxiv.org/abs/2602.15031
作者: Yehonathan Litman,Shikun Liu,Dario Seyb,Nicholas Milef,Yang Zhou,Carl Marshall,Shubham Tulsiani,Caleb Leak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask’s size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.

[CV-1] Image Generation with a Sphere Encoder

【速读】:该论文旨在解决生成式 AI(Generative AI)中扩散模型(diffusion models)推理成本高、需多步迭代生成图像的问题。其解决方案的关键在于提出了一种名为 Sphere Encoder 的高效生成框架,通过学习一个编码器将自然图像均匀映射到球形潜在空间(spherical latent space),并配合一个解码器从随机潜向量重建图像;整个过程仅需单次前向传播即可完成生成,且训练仅依赖图像重构损失,无需复杂优化策略。该架构天然支持条件生成,并可通过少量迭代循环进一步提升图像质量,在多个数据集上达到与先进扩散模型相当的性能,但推理开销显著降低。

链接: https://arxiv.org/abs/2602.15030
作者: Kaiyu Yue,Menglin Jia,Ji Hou,Tom Goldstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the sphere encoder approach yields performance competitive with state of the art diffusions, but with a small fraction of the inference cost. Project page is available at this https URL .

[CV-2] Neurosim: A Fast Simulator for Neuromorphic Robot Perception

【速读】:该论文旨在解决多模态传感器仿真与实时机器人控制算法开发之间的效率与集成难题,特别是在复杂动态环境中对神经形态感知和控制算法进行训练与闭环测试的需求。其解决方案的关键在于提出并实现了一个高性能、低延迟的仿真库Neurosim和通信框架Cortex:Neurosim能够以高达约2700 FPS的帧率实时模拟动态视觉传感器(Dynamic Vision Sensors, DVS)、RGB相机、深度传感器及惯性测量单元(IMU)等多模态传感器数据,并支持多旋翼飞行器在复杂环境中的敏捷动力学模拟;Cortex则通过ZeroMQ构建高吞吐、低延迟的消息传递系统,原生支持NumPy数组和PyTorch张量,实现了Python与C++应用间的无缝通信,从而为基于时间同步多模态数据的自监督学习训练以及实时闭环控制算法部署提供了高效可靠的基础设施。

链接: https://arxiv.org/abs/2602.15018
作者: Richeek Das,Pratik Chaudhari
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at this https URL .

[CV-3] hermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在热成像(thermal images)理解上的严重不足问题。现有VLMs主要基于RGB图像训练,在热成像场景中表现不佳,而热成像在夜间监控、搜救、自动驾驶和医疗筛查等关键应用中具有不可替代的价值。其核心挑战在于热图像编码的是物理温度而非颜色或纹理,因此需要模型具备温度感知与推理能力,而这正是当前以RGB为中心的基准测试所缺乏评估维度。解决方案的关键在于提出ThermEval-B这一结构化基准,包含约5.5万个热成像视觉问答对,并整合了新收集的ThermEval-D数据集——该数据集首次提供跨多样室内/室外环境的密集像素级温度图及语义身体部位标注。通过系统评估25个开源与闭源VLMs,研究发现模型普遍在温度引导推理上失败、对伪彩色映射变换敏感、倾向于依赖语言先验或固定回答,且提示工程或监督微调仅带来边际提升,从而证明热视觉语言理解需独立于RGB假设的专门评测体系,ThermEval-B为此类研究提供了可量化、可比较的基准支撑。

链接: https://arxiv.org/abs/2602.14989
作者: Ayush Shrivastava,Kirtan Gangani,Laksh Jain,Mayank Goel,Nipun Batra
机构: Indian Institute of Technology, Gandhinagar, India; Carnegie Mellon University, Pittsburgh, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 Pages with 2 figures of main content. 2 pages of References. 10 pages of appendix with 6 figures

点击查看摘要

Abstract:Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.

[CV-4] PAct: Part-Decomposed Single-View Articulated Object Generation

【速读】:该论文旨在解决3D交互应用中可动物体(如抽屉、门等)的高保真资产生成难题,其核心挑战在于如何实现可靠的部件分解(functional part decomposition)与运动学绑定(kinematic rigging),同时避免传统方法中存在的计算效率低或结构不匹配问题。解决方案的关键在于提出一种以部件为中心的生成式框架(part-centric generative framework),通过显式地引入部件感知条件(part-aware conditioning)来联合建模部件几何、组合关系及运动机制:每个部件由带有部件身份和运动线索的潜在标记(latent tokens)编码,并在单张图像条件下生成保持实例级对应关系、有效结构和合理运动的可动3D资产,从而实现无需逐实例优化的快速前向推理与可控组装及运动控制。

链接: https://arxiv.org/abs/2602.14965
作者: Qingming Liu,Xinyue Yao,Shuyuan Zhang,Yueci Deng,Guiliang Liu,Zhen Liu,Kui Jia
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); DexForce Technology (DexForce科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Technical Report(11 figures, 14 pages), Project Page: this https URL

点击查看摘要

Abstract:Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.

[CV-5] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

【速读】:该论文旨在解决长时视频生成中由于多视角重建导致的全局场景几何不一致问题,即在相机可控视频生成任务中,现有基于记忆的方法通常依赖于从历史帧重建的全局3D场景进行条件生成,但多视角重建不可避免地引入跨视角对齐误差(cross-view misalignment),使得同一表面在不同视图中被重建至略有差异的3D位置,累积后形成噪声几何,从而污染条件信号并降低生成质量。其解决方案的关键在于提出AnchorWeave框架,通过将单一易错的全局记忆替换为多个干净的局部几何记忆,并学习如何校正这些局部记忆间的跨视角不一致性;具体而言,该方法采用覆盖驱动的局部记忆检索策略,确保所选局部记忆与目标轨迹对齐,并在生成过程中利用多锚点编织控制器(multi-anchor weaving controller)融合多个局部记忆,从而显著提升长时间尺度下的场景一致性,同时保持高质量视觉输出。

链接: https://arxiv.org/abs/2602.14941
作者: Zun Wang,Han Lin,Jaehong Yoon,Jaemin Cho,Yue Zhang,Mohit Bansal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

[CV-6] Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

【速读】:该论文旨在解决地面影像与地理注册的卫星图像之间在大视角差异或GPS不可靠条件下难以实现精准对齐的问题,这对地图构建、导航和态势感知至关重要。解决方案的关键在于提出一种零样本(zero-shot)、几何驱动的框架Wrivinder,其核心是通过多视角地面照片聚合重建一致的3D场景,并利用SfM(Structure from Motion)重建、3D高斯点绘(3D Gaussian Splatting)、语义定位(semantic grounding)及单目深度(monocular depth)提供的度量线索,生成可直接与卫星上下文匹配的垂直视角渲染结果,从而实现米级精度的相机地理定位。

链接: https://arxiv.org/abs/2602.14929
作者: Chandrakanth Gudavalli,Tajuddin Manhar Mohammed,Abhay Yadav,Ananth Vishnu Bhaskar,Hardik Prajapati,Cheng Peng,Rama Chellappa,Shivkumar Chandrasekaran,B. S. Manjunath
机构: Mayachitra, Inc.; Johns Hopkins University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30,m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.

[CV-7] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

【速读】:该论文旨在解决医学影像领域中生成式 AI(Generative AI)在CT图像病变自动识别与报告生成方面进展受限的问题,其根本原因在于缺乏公开可用的带有病灶级标注(lesion-level annotations)的CT数据集。解决方案的关键在于提出首个综合性基准数据集CT-Bench,其核心由两部分构成:一是包含20,335个病灶的图像与元数据集合(含边界框、描述和尺寸信息),二是涵盖病灶定位、描述、尺寸估计及属性分类的多任务视觉问答(visual question answering, VQA)基准,且引入困难负例以模拟真实诊断挑战。通过在该数据集上评估多种先进多模态模型并与放射科医生评估结果对比,验证了CT-Bench作为病灶分析全面基准的价值,并表明基于该数据集微调模型可显著提升性能,凸显其临床应用潜力。

链接: https://arxiv.org/abs/2602.14879
作者: Qingqing Zhu,Qiao Jin,Tejas S. Mathai,Yin Fang,Zhizheng Wang,Yifan Yang,Maame Sarfo-Gyamfi,Benjamin Hou,Ran Gu,Praveen T. S. Balamuralikrishna,Kenneth C. Wang,Ronald M. Summers,Zhiyong Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.

[CV-8] Multi-dimensional Persistent Sheaf Laplacians for Image Analysis

【速读】:该论文旨在解决传统降维方法(如主成分分析,PCA)对降维维度选择敏感的问题,即不同维度下性能波动较大且难以确定最优维度。其解决方案的关键在于提出一种多维持久层拉普拉斯框架(multi-dimensional persistent sheaf Laplacian, MPSL),通过将图像样本建模为单纯复形(simplicial complex),在多个降维维度上利用持久层拉普拉斯算子提取局部拓扑谱表示,并在尺度和维度两个层面聚合统计特征,从而构建多尺度、多维度的图像表示。该方法不依赖单一维度或平均结果,而是融合多个维度的优势,实现更稳定且优于PCA基线的分类性能。

链接: https://arxiv.org/abs/2602.14846
作者: Xiang Xiang Wang,Guo-Wei Wei
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a multi-dimensional persistent sheaf Laplacian (MPSL) framework on simplicial complexes for image analysis. The proposed method is motivated by the strong sensitivity of commonly used dimensionality reduction techniques, such as principal component analysis (PCA), to the choice of reduced dimension. Rather than selecting a single reduced dimension or averaging results across dimensions, we exploit complementary advantages of multiple reduced dimensions. At a given dimension, image samples are regarded as simplicial complexes, and persistent sheaf Laplacians are utilized to extract a multiscale localized topological spectral representation for individual image samples. Statistical summaries of the resulting spectra are then aggregated across scales and dimensions to form multiscale multi-dimensional image representations. We evaluate the proposed framework on the COIL20 and ETH80 image datasets using standard classification protocols. Experimental results show that the proposed method provides more stable performance across a wide range of reduced dimensions and achieves consistent improvements to PCA-based baselines in moderate dimensional regimes.

[CV-9] Integrating Affordances and Attention models for Short-Term Object Interaction Anticipation

【速读】:该论文旨在解决短时物体交互预测(Short Term Object-Interaction Anticipation, STA)问题,即从第一人称视角视频中预测未来将被激活的物体位置、交互动词和名词类别以及接触时间,以支持可穿戴助手理解用户意图或实现人机协作。其解决方案的关键在于提出两种基于注意力机制的新架构——STAformer 和 STAformer++,它们融合了帧引导的时间池化、双模态图像-视频注意力机制及多尺度特征融合,从而增强对视频输入对齐的时空建模能力;同时引入两个新模块来基于人类行为对STA预测进行定位:一是环境可及性(affordance)模型作为场景持久记忆,通过简单后融合与自适应融合策略提升预测准确性;二是从手部和物体轨迹中预测交互热点区域,提高局部化预测的置信度。实验表明,该方法在Ego4D和EPIC-Kitchens数据集上分别获得最高达+23p.p和+31p.p的Overall Top-5 mAP提升。

链接: https://arxiv.org/abs/2602.14837
作者: Lorenzo Mur Labadia,Ruben Martinez-Cantin,Jose J.Guerrero,Giovanni M. Farinella,Antonino Furnari
机构: Aragon Institute for Engineering Research (I3A), University of Zaragoza (萨拉戈萨大学); Department of Computer Science, University of Catania (卡塔尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Short Term object-interaction Anticipation consists in detecting the location of the next active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1 We propose STAformer and STAformer plus plus, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2 We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gain up to +23p.p on Ego4D and +31p.p on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.

[CV-10] Debiasing Central Fixation Confounds Reveals a Peripheral “Sweet Spot” for Human-like Scanpaths in Hard-Attention Vision

【速读】:该论文旨在解决当前视觉任务中基于硬注意力(hard-attention)模型的扫描路径(scanpath)评估指标因数据集特有中心偏置(center bias)而产生误导性结果的问题。具体而言,作者发现即使是最简单的中心固定基线(center-fixation baseline)也能在标准扫描路径指标上取得接近学习策略的表现,从而模糊了真实人类行为对齐与单纯中心倾向之间的界限。解决方案的关键在于提出一种去中心偏置的复合度量指标——Gaze Consistency Score (GCS),该指标融合了扫描路径一致性与运动统计相似性,并通过系统性地调节中央视野(foveal patch size)和周边上下文(peripheral context)的约束条件,识别出一个“甜点区域”:在此区域内,模型生成的扫描路径不仅显著优于中心基线,且具有与人类眼动一致的时间动态特性。这一方法有效区分了行为对齐与中心偏倚,为评估主动感知(active perception)提供了更可靠的标准。

链接: https://arxiv.org/abs/2602.14834
作者: Pengcheng Pan,Yonekura Shogo,Yasuo Kuniyosh
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human eye movements in visual recognition reflect a balance between foveal sampling and peripheral context. Task-driven hard-attention models for vision are often evaluated by how well their scanpaths match human gaze. However, common scanpath metrics can be strongly confounded by dataset-specific center bias, especially on object-centric datasets. Using Gaze-CIFAR-10, we show that a trivial center-fixation baseline achieves surprisingly strong scanpath scores, approaching many learned policies. This makes standard metrics optimistic and blurs the distinction between genuine behavioral alignment and mere central tendency. We then analyze a hard-attention classifier under constrained vision by sweeping foveal patch size and peripheral context, revealing a peripheral sweet spot: only a narrow range of sensory constraints yields scanpaths that are simultaneously (i) above the center baseline after debiasing and (ii) temporally human-like in movement statistics. To address center bias, we propose GCS (Gaze Consistency Score), a center-debiased composite metric augmented with movement similarity. GCS uncovers a robust sweet spot at medium patch size with both foveal and peripheral vision, that is not obvious from raw scanpath metrics or accuracy alone, and also highlights a “shortcut regime” when the field-of-view becomes too large. We discuss implications for evaluating active perception on object-centric datasets and for designing gaze benchmarks that better separate behavioral alignment from center bias.

[CV-11] VIPA: Visual Informative Part Attention for Referring Image Segmentation

【速读】:该论文旨在解决参考图像分割(Referring Image Segmentation, RIS)任务中因跨模态投影噪声导致的细粒度目标区域定位不准确问题。现有方法虽已尝试将视觉信息融入语言令牌,但在处理复杂场景时仍面临语义一致性弱、注意力机制易受干扰的问题。解决方案的关键在于提出一种视觉信息部件注意力(Visual Informative Part Attention, VIPA)框架,其核心创新是引入“视觉表达”(visual expression)概念——即从视觉上下文中提取具有结构与语义信息的显著区域,并通过视觉表达生成模块(VEG)利用局部-全局语言线索筛选并净化视觉令牌,从而降低跨模态投影方差、增强注意力机制的语义一致性,使网络能够更鲁棒地对齐目标对象的细粒度区域。

链接: https://arxiv.org/abs/2602.14788
作者: Yubin Cho,Hyunwoo Yu,Kyeongbo Kong,Kyomin Sohn,Bongjoon Hyun,Suk-Ju Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network’s attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.

[CV-12] GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

【速读】:该论文旨在解决通用目标追踪器在未见场景中泛化能力不足以及遮挡推理粗粒度的问题。现有方法通常针对训练目标进行优化,导致在动态环境中的鲁棒性较差,且对遮挡模式缺乏精细化建模。解决方案的关键在于提出GOT-JEPA框架,该框架将JEPA(Joint-Embedding Predictive Architecture)从预测图像特征扩展为预测追踪模型:通过教师模型从干净帧生成伪追踪模型,学生模型则从受扰动帧中学习预测相同的伪追踪模型,从而提供稳定的伪监督信号并显式训练模型在遮挡、干扰等不利观测下生成可靠的追踪结果。进一步地,作者提出OccuSolver模块,利用基于点的追踪器实现面向目标的可见性估计与遮挡模式捕捉,通过迭代更新目标先验条件逐步细化可见状态,增强遮挡处理能力,并生成高质量参考标签以提升后续预测性能。

链接: https://arxiv.org/abs/2602.14771
作者: Shih-Fang Chen,Jun-Cheng Chen,I-Hong Jhuo,Yen-Yu Lin
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); Academia Sinica (中央研究院); Microsoft AI (微软人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE)
备注: Learning Model Adaptation for Adverse and Dynamic Environments

点击查看摘要

Abstract:The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

[CV-13] SAILS: Segment Anything with Incrementally Learned Semantics for Task-Invariant and Training-Free Continual Learning

【速读】:该论文旨在解决持续学习(Continual Learning)中因重复训练、高计算成本以及灾难性遗忘(Catastrophic Forgetting)导致的性能下降问题,尤其在类增量语义分割(Class-Incremental Semantic Segmentation, CISS)场景下表现突出。其解决方案的关键在于提出一个无需训练的框架SAILS(Segment Anything with Incrementally Learned Semantics),通过两个阶段实现:首先利用Segment Anything Model (SAM) 进行零样本区域提取,随后在固定特征空间中通过原型(prototype)进行语义关联,并引入选择性类内聚类机制以生成多个类内原型,从而更好地建模类内多样性。该方法完全避免参数更新,从根本上消除遗忘问题,并在长序列任务中展现出优于传统训练型方法的性能,同时表现出正向的后向迁移效应(positive backward transfer)。

链接: https://arxiv.org/abs/2602.14767
作者: Shishir Muralidhara,Didier Stricker,René Schuster
机构: German Research Center for Artificial Intelligence (DFKI); RPTU – University of Kaiserslautern-Landau (莱茵兰-普法尔茨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE CAI 2026

点击查看摘要

Abstract:Continual learning remains constrained by the need for repeated retraining, high computational costs, and the persistent challenge of forgetting. These factors significantly limit the applicability of continuous learning in real-world settings, as iterative model updates require significant computational resources and inherently exacerbate forgetting. We present SAILS – Segment Anything with Incrementally Learned Semantics, a training-free framework for Class-Incremental Semantic Segmentation (CISS) that sidesteps these challenges entirely. SAILS leverages foundational models to decouple CISS into two stages: Zero-shot region extraction using Segment Anything Model (SAM), followed by semantic association through prototypes in a fixed feature space. SAILS incorporates selective intra-class clustering, resulting in multiple prototypes per class to better model intra-class variability. Our results demonstrate that, despite requiring no incremental training, SAILS typically surpasses the performance of existing training-based approaches on standard CISS datasets, particularly in long and challenging task sequences where forgetting tends to be most severe. By avoiding parameter updates, SAILS completely eliminates forgetting and maintains consistent, task-invariant performance. Furthermore, SAILS exhibits positive backward transfer, where the introduction of new classes can enhance performance on previous classes.

[CV-14] Universal Algorithm-Implicit Learning

【速读】:该论文旨在解决当前元学习方法在任务分布狭窄、特征空间和标签空间固定等方面的局限性,以及“通用性”和“通用目的”等术语定义模糊、缺乏统一标准的问题。为实现真正意义上的通用元学习(universal meta-learning),作者提出了一套理论框架,明确定义了实践中的通用性,并区分了算法显式(algorithm-explicit)与算法隐式(algorithm-implicit)学习,从而构建了可比较、可推理的理论基础。解决方案的关键在于提出TAIL——一种基于Transformer的算法隐式元学习器,其创新包括:随机投影用于跨模态特征编码、随机注入标签嵌入以扩展标签空间、以及高效的内联查询处理机制。这些设计使TAIL不仅在标准少样本基准上达到最优性能,还能泛化到未见过的领域和模态(如仅用图像训练却能完成文本分类任务),并支持高达20倍以上的标签空间扩展,同时相较先前基于Transformer的方法显著降低计算开销。

链接: https://arxiv.org/abs/2602.14761
作者: Stefano Woerner,Seong Joon Oh,Christian F. Baumgartner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current meta-learning methods are constrained to narrow task distributions with fixed feature and label spaces, limiting applicability. Moreover, the current meta-learning literature uses key terms like “universal” and “general-purpose” inconsistently and lacks precise definitions, hindering comparability. We introduce a theoretical framework for meta-learning which formally defines practical universality and introduces a distinction between algorithm-explicit and algorithm-implicit learning, providing a principled vocabulary for reasoning about universal meta-learning methods. Guided by this framework, we present TAIL, a transformer-based algorithm-implicit meta-learner that functions across tasks with varying domains, modalities, and label configurations. TAIL features three innovations over prior transformer-based meta-learners: random projections for cross-modal feature encoding, random injection label embeddings that extrapolate to larger label spaces, and efficient inline query processing. TAIL achieves state-of-the-art performance on standard few-shot benchmarks while generalizing to unseen domains. Unlike other meta-learning methods, it also generalizes to unseen modalities, solving text classification tasks despite training exclusively on images, handles tasks with up to 20 \times more classes than seen during training, and provides orders-of-magnitude computational savings over prior transformer-based approaches.

[CV-15] Depth Completion as Parameter-Efficient Test-Time Adaptation

【速读】:该论文旨在解决预训练3D基础模型(Foundation Models, FMs)在深度补全(depth completion)任务中,因缺乏对稀疏几何线索的高效利用而导致的结构失真与泛化能力差的问题。现有方法通常通过训练特定任务的编码器来处理辅助输入,但易过拟合且泛化性能不佳。其解决方案的关键在于提出CAPA框架,该框架冻结FM主干网络,仅通过参数高效微调(如LoRA或VPT)更新极少量参数,并利用推理时可用的稀疏观测直接计算梯度进行优化,从而将基础模型的几何先验有效锚定于场景特异性测量中;此外,针对视频数据引入帧间参数共享机制,联合优化所有帧以利用时间相关性、提升鲁棒性并保证多帧一致性,最终实现对多种条件模式下室内和室外数据集的SOTA性能。

链接: https://arxiv.org/abs/2602.14751
作者: Bingxin Ke,Qunjie Zhou,Jiahui Huang,Xuanchi Ren,Tianchang Shen,Konrad Schindler,Laura Leal-Taixé,Shengyu Huang
机构: NVIDIA; ETH Zürich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model’s geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: this http URL.

[CV-16] Its a Matter of Time: Three Lessons on Long-Term Motion for Perception

【速读】:该论文旨在解决长期运动信息在视觉感知任务中的作用及其对视觉学习的潜在价值这一问题,特别是相较于传统图像信息,长时运动表示是否能提供更丰富的语义信息、更强的泛化能力以及更高的计算效率。其解决方案的关键在于利用近期在点轨迹估计(point-track estimation)方面的进展,从而有效提取和利用长时运动信息,并通过多种感知任务验证其优势:首先,长时运动表示不仅能理解动作,还能捕捉物体、材质和空间信息,且性能常优于图像;其次,在低数据场景和零样本任务中,运动表示展现出显著优于图像表示的泛化能力;最后,由于运动信息维度极低,其在GFLOPs与准确率之间提供了更优权衡,结合视频表示可进一步提升整体性能。

链接: https://arxiv.org/abs/2602.14705
作者: Willem Davison,Xinyue Hao,Laura Sevilla-Lara
机构: University of Edinburgh (爱丁堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal information has long been considered to be essential for perception. While there is extensive research on the role of image information for perceptual tasks, the role of the temporal dimension remains less well understood: What can we learn about the world from long-term motion information? What properties does long-term motion information have for visual learning? We leverage recent success in point-track estimation, which offers an excellent opportunity to learn temporal representations and experiment on a variety of perceptual tasks. We draw 3 clear lessons: 1) Long-term motion representations contain information to understand actions, but also objects, materials, and spatial information, often even better than images. 2) Long-term motion representations generalize far better than image representations in low-data settings and in zero-shot tasks. 3) The very low dimensionality of motion information makes motion representations a better trade-off between GFLOPs and accuracy than standard video representations, and used together they achieve higher performance than video representations alone. We hope these insights will pave the way for the design of future models that leverage the power of long-term motion information for perception.

[CV-17] Exposing Diversity Bias in Deep Generative Models: Statistical Origins and Correction of Diversity Error

【速读】:该论文旨在解决当前生成式模型在样本多样性方面存在的系统性低估问题,即现代生成模型所生成样本的多样性普遍低于目标数据分布的真实多样性。其关键解决方案在于通过引入无参考的熵基多样性评分指标(如Vendi和RKE)对生成样本与测试样本的多样性进行直接比较,并揭示了基于有限训练样本估计的熵类多样性指标存在固有偏差——即随着样本量增加,其期望值上升,导致优化生成器以最小化与经验数据分布的差异时会进一步削弱多样性。因此,论文提出应采用基于Vendi和RKE的多样性感知正则化与引导策略,作为缓解该偏差的可行路径。

链接: https://arxiv.org/abs/2602.14682
作者: Farzan Farnia,Mohammad Jalali,Azim Ospanov
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Deep generative models have achieved great success in producing high-quality samples, making them a central tool across machine learning applications. Beyond sample quality, an important yet less systematically studied question is whether trained generative models faithfully capture the diversity of the underlying data distribution. In this work, we address this question by directly comparing the diversity of samples generated by state-of-the-art models with that of test samples drawn from the target data distribution, using recently proposed reference-free entropy-based diversity scores, Vendi and RKE. Across multiple benchmark datasets, we find that test data consistently attains substantially higher Vendi and RKE diversity scores than the generated samples, suggesting a systematic downward diversity bias in modern generative models. To understand the origin of this bias, we analyze the finite-sample behavior of entropy-based diversity scores and show that their expected values increase with sample size, implying that diversity estimated from finite training sets could inherently underestimate the diversity of the true distribution. As a result, optimizing the generators to minimize divergence to empirical data distributions would induce a loss of diversity. Finally, we discuss potential diversity-aware regularization and guidance strategies based on Vendi and RKE as principled directions for mitigating this bias, and provide empirical evidence suggesting their potential to improve the results.

[CV-18] Universal Image Immunization against Diffusion-based Image Editing via Semantic Injection

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像编辑中引发的伦理与法律风险,尤其是针对深度伪造(Deepfakes)和未经授权使用受版权保护视觉内容的问题。现有图像免疫化方法依赖于针对每张图像单独优化的对抗扰动(Adversarial Perturbation),难以规模化部署。其解决方案的关键在于提出首个通用图像免疫化框架,生成一种适用于多种图像的通用对抗扰动(Universal Adversarial Perturbation, UAP),该扰动可嵌入语义目标并抑制原始内容,从而在无需访问训练数据或领域知识的情况下有效误导扩散模型的编辑行为,实现对恶意编辑的有效防御,同时具备良好的黑盒迁移能力与实用性。

链接: https://arxiv.org/abs/2602.14679
作者: Chanhui Lee,Seunghyun Shin,Donggyu Choi,Hae-gon Jeon,Jeany Son
机构: POSTECH AI Graduate School (浦项科技大学人工智能研究生院); Yonsei University (延世大学); GIST AI Graduate School (韩国科学技术院人工智能研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Working paper

点击查看摘要

Abstract:Recent advances in diffusion models have enabled powerful image editing capabilities guided by natural language prompts, unlocking new creative possibilities. However, they introduce significant ethical and legal risks, such as deepfakes and unauthorized use of copyrighted visual content. To address these risks, image immunization has emerged as a promising defense against AI-driven semantic manipulation. Yet, most existing approaches rely on image-specific adversarial perturbations that require individual optimization for each image, thereby limiting scalability and practicality. In this paper, we propose the first universal image immunization framework that generates a single, broadly applicable adversarial perturbation specifically designed for diffusion-based editing pipelines. Inspired by universal adversarial perturbation (UAP) techniques used in targeted attacks, our method generates a UAP that embeds a semantic target into images to be protected. Simultaneously, it suppresses original content to effectively misdirect the model’s attention during editing. As a result, our approach effectively blocks malicious editing attempts by overwriting the original semantic content in the image via the UAP. Moreover, our method operates effectively even in data-free settings without requiring access to training data or domain knowledge, further enhancing its practicality and broad applicability in real-world scenarios. Extensive experiments show that our method, as the first universal immunization approach, significantly outperforms several baselines in the UAP setting. In addition, despite the inherent difficulty of universal perturbations, our method also achieves performance on par with image-specific methods under a more restricted perturbation budget, while also exhibiting strong black-box transferability across different diffusion models.

[CV-19] MeFEm: Medical Face Embedding model

【速读】:该论文旨在解决基于面部图像进行生物特征与医学分析时存在的数据效率低、域偏差(domain bias)显著及模型性能受限的问题。其解决方案的关键在于提出MeFEm模型,该模型基于改进的联合嵌入预测架构(Joint Embedding Predictive Architecture, JEPA),引入三种核心创新:一是轴向条带掩码策略(axial stripe masking strategy),聚焦于语义相关的区域以提升学习效率;二是圆形损失加权机制(circular loss weighting scheme),优化多任务学习中的梯度传播;三是CLS token的概率重分配策略(probabilistic reassignment of the CLS token),增强线性探测(linear probing)的质量。这些改进使模型在使用远少于现有方法的数据下仍能超越FaRL和Franca等强基线,在人体测量学任务和身体质量指数(Body Mass Index, BMI)估计上取得优异表现,并在新构建的封闭源数据集上缓解了域偏差问题。

链接: https://arxiv.org/abs/2602.14672
作者: Yury Borets,Stepan Botman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present MeFEm, a vision model based on a modified Joint Embedding Predictive Architecture (JEPA) for biometric and medical analysis from facial images. Key modifications include an axial stripe masking strategy to focus learning on semantically relevant regions, a circular loss weighting scheme, and the probabilistic reassignment of the CLS token for high quality linear probing. Trained on a consolidated dataset of curated images, MeFEm outperforms strong baselines like FaRL and Franca on core anthropometric tasks despite using significantly less data. It also shows promising results on Body Mass Index (BMI) estimation, evaluated on a novel, consolidated closed-source dataset that addresses the domain bias prevalent in existing data. Model weights are available at this https URL , offering a strong baseline for future work in this domain.

[CV-20] Advances in Global Solvers for 3D Vision

【速读】:该论文旨在解决几何视觉中非凸优化问题的传统局部或启发式方法缺乏解的可证明最优性保障的问题。其核心贡献在于首次系统性地综述了全局求解器(global solvers)在3D视觉中的应用,通过构建一个涵盖分支定界(Branch-and-Bound, BnB)、凸松弛(Convex Relaxation, CR)和渐进非凸性(Graduated Non-Convexity, GNC)三大范式的统一分类体系,阐明了各类方法的理论基础、算法设计与实用增强策略,从而揭示了不同求解器在最优性、鲁棒性和可扩展性之间的权衡关系,为未来兼具保证性与实用性的感知系统提供清晰的研究路径。

链接: https://arxiv.org/abs/2602.14662
作者: Zhenjun Zhao,Heng Yang,Bangyan Liao,Yingping Zeng,Shaocheng Yan,Yingdong Gu,Peidong Liu,Yi Zhou,Haoang Li,Javier Civera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Comprehensive survey; 37 pages, 7 figures, 3 tables. Project page with literature tracking and code tutorials: this https URL

点击查看摘要

Abstract:Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC). We present their theoretical foundations, algorithmic designs, and practical enhancements for robustness and scalability, examining how each addresses the fundamental nonconvexity of geometric estimation problems. Our analysis spans ten core vision tasks, from Wahba problem to bundle adjustment, revealing the optimality-robustness-scalability trade-offs that govern solver selection. We identify critical future directions: scaling algorithms while maintaining guarantees, integrating data-driven priors with certifiable optimization, establishing standardized benchmarks, and addressing societal implications for safety-critical deployment. By consolidating theoretical foundations, practical advances, and broader impacts, this survey provides a unified perspective and roadmap toward certifiable, trustworthy perception for real-world applications. A continuously-updated literature summary and companion code tutorials are available at this https URL.

[CV-21] SketchingReality: From Freehand Scene Sketches To Photorealistic Images

【速读】:该论文旨在解决从自由手绘草图(freehand sketch)生成图像时,如何在保持图像真实感(photorealism)与忠实于草图语义之间取得平衡的问题。其核心挑战在于自由手绘草图缺乏像素级对齐的真值图像(ground-truth, pixel-aligned images),因为草图本身具有抽象性和个体差异性,无法确定唯一正确的对应关系。解决方案的关键在于提出一种基于调制(modulation-based)的方法,优先关注草图的语义理解而非边缘位置的严格匹配,并引入一种新颖的损失函数,使得模型能够在无真值像素对齐图像的情况下进行训练,从而显著提升生成图像在语义一致性与视觉质量上的表现。

链接: https://arxiv.org/abs/2602.14648
作者: Ahmed Bourouis,Mikhail Bessmeltsev,Yulia Gryaditskaya
机构: University of Surrey (萨里大学); Université de Montréal (蒙特利尔大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed ‘sketches’, yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.

[CV-22] VIGIL: Tackling Hallucination Detection in Image Recontextualization

【速读】:该论文旨在解决大模型在多模态图像再语境化任务中幻觉(hallucination)问题的评估缺乏细粒度分类与系统性分析的问题。现有研究通常将幻觉视为单一类型,忽略了其多样性和复杂性,导致对模型错误根源的理解不足。为此,作者提出了VIGIL(Visual Inconsistency Generative In-context Lucidity)框架,首次对幻觉进行五类细分:粘贴对象幻觉、背景幻觉、对象遗漏、位置逻辑不一致和物理定律违反。解决方案的关键在于设计了一个多阶段检测流水线,通过一系列针对对象级保真度、背景一致性及遗漏检测的专用步骤,整合开源模型构成协同集成架构,从而实现对幻觉成因的精准定位与解释,填补了该领域无系统性分类与分解方法的空白。

链接: https://arxiv.org/abs/2602.14633
作者: Joanna Wojciechowicz,Maria Łubniewska,Jakub Antczak,Justyna Baczyńska,Wojciech Gromski,Wojciech Kozłowski,Maciej Zięba
机构: Wroclaw University of Science and Technology (弗罗茨瓦夫理工大学); Tooploox
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, 4 tables. Code and data are available at: this https URL and this https URL

点击查看摘要

Abstract:We introduce VIGIL (Visual Inconsistency Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach enables a deeper understanding of where the models fail with an explanation; thus, we fill a gap in the field, as no prior methods offer such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: this https URL and Data repository: this https URL.

[CV-23] VariViT: A Vision Transformer for Variable Image Sizes

【速读】:该论文旨在解决Vision Transformer (ViT) 在处理医学图像时因固定输入尺寸导致的局限性问题,尤其是在肿瘤等不规则结构区域中,固定大小的裁剪或重缩放操作会引入信息损失和伪影,影响诊断准确性。其核心解决方案是提出VariViT模型,关键在于设计了一种新颖的位置嵌入(positional embedding)重采样机制,以适应不同数量的图像补丁(patch),同时保持固定的补丁大小;此外,还引入了一种新的批处理策略,有效降低了计算复杂度,在保证特征表示能力的同时提升了训练与推理效率。

链接: https://arxiv.org/abs/2602.14615
作者: Aswathi Varma,Suprosanna Shit,Chinmay Prabhakar,Daniel Scholz,Hongwei Bran Li,Bjoern Menze,Daniel Rueckert,Benedikt Wiestler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: this https URL

[CV-24] YOLO26: A Comprehensive Architecture Overview and Key Improvements

【速读】:该论文旨在解决YOLO系列模型在边缘设备上实现高效实时推理的挑战,尤其是在缺乏GPU支持的场景下。其核心问题是提升模型推理速度的同时保持多任务性能(如实例分割、姿态估计和定向边界框解码)的先进性。解决方案的关键在于:1)移除Distribution Focal Loss (DFL),简化损失函数以加快训练与推理;2)引入端到端无非极大值抑制(NMS-Free Inference)机制,消除后处理延迟;3)采用ProgLoss + Small-Target-Aware Label Assignment (STAL)策略优化标签分配,增强对小目标的检测能力;4)使用MuSGD优化器加速收敛并提升稳定性。这些改进共同实现了CPU模式下43%的推理速度提升,使YOLO26能够在资源受限的边缘设备上实现真正意义上的实时性能。

链接: https://arxiv.org/abs/2602.14582
作者: Priyanto Hidayatullah,Refdinal Tubagus
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:You Only Look Once (YOLO) has been the prominent model for computer vision in deep learning for a decade. This study explores the novel aspects of YOLO26, the most recent version in the YOLO series. The elimination of Distribution Focal Loss (DFL), implementation of End-to-End NMS-Free Inference, introduction of ProgLoss + Small-Target-Aware Label Assignment (STAL), and use of the MuSGD optimizer are the primary enhancements designed to improve inference speed, which is claimed to achieve a 43% boost in CPU mode. This is designed to allow YOLO26 to attain real-time performance on edge devices or those without GPUs. Additionally, YOLO26 offers improvements in many computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box (OBB) decoding. We aim for this effort to provide more value than just consolidating information already included in the existing technical documentation. Therefore, we performed a rigorous architectural investigation into YOLO26, mostly using the source code available in its GitHub repository and its official documentation. The authentic and detailed operational mechanisms of YOLO26 are inside the source code, which is seldom extracted by others. The YOLO26 architectural diagram is shown as the outcome of the investigation. This study is, to our knowledge, the first one presenting the CNN-based YOLO26 architecture, which is the core of YOLO26. Our objective is to provide a precise architectural comprehension of YOLO26 for researchers and developers aspiring to enhance the YOLO model, ensuring it remains the leading deep learning model in computer vision.

[CV-25] DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶中生成式规划(Generative Planning)所面临的两大瓶颈问题:一是基于扩散(Diffusion-based)模型存在模态对齐困难、训练效率低和泛化能力弱;二是基于token的规划方法易受累积因果误差影响且解码不可逆。解决方案的关键在于提出DriveFine,一种掩码扩散VLA模型,其核心创新是设计了一个即插即用的块级专家混合(block-MoE)结构,该结构可无缝集成一个精修专家(refinement expert)于生成专家(generation expert)之上。通过推理时显式选择专家、训练时梯度屏蔽机制,两个专家实现完全解耦,从而保留预训练权重的基础能力和通用模式,显著提升模型灵活性与可扩展性;同时结合混合强化学习策略,在保障训练稳定性的同时促进精修专家的有效探索,实验证明该方案在多个基准测试中展现出优异的性能与鲁棒性。

链接: https://arxiv.org/abs/2602.14577
作者: Chenxu Dang,Sining Ang,Yongkang Li,Haochen Tian,Jie Wang,Guang Li,Hangjun Ye,Jie Ma,Long Chen,Yan Wang
机构: Xiaomi EV (小米汽车); AIR (小米人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at this https URL.

[CV-26] OmniVTON: Training-Free Universal Virtual Try-On with Principal Pose Guidance

【速读】:该论文旨在解决图像驱动的虚拟试衣(Image-based Virtual Try-On, VTON)中现有方法因依赖特定数据条件而需频繁重训练、泛化能力受限的问题。其核心解决方案是提出OmniVTON++,一个无需训练的统一框架,通过三个关键技术实现:结构化服装变形(Structured Garment Morphing)以驱动对应关系的服装适配,主干姿态引导(Principal Pose Guidance)在扩散采样过程中分步调控人体结构一致性,以及连续边界拼接(Continuous Boundary Stitching)实现边界感知的精细化优化,从而在不进行任务特异性再训练的前提下,有效协同处理服装对齐、人体结构一致性和边界连续性三大挑战,显著提升跨数据集、跨服装类型及多场景下的通用性与鲁棒性。

链接: https://arxiv.org/abs/2602.14552
作者: Zhaotong Yang,Yong Du,Shengfeng He,Yuhui Li,Xinzhe Li,Yangyang Xu,Junyu Dong,Jian Yang
机构: Ocean University of China (中国海洋大学); Nanjing University of Science and Technology (南京理工大学); Sanya Oceanographic Institution (三亚海洋研究所); Singapore Management University (新加坡管理大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. In practice, however, existing approaches are typically optimized for specific data conditions, making their deployment reliant on retraining and limiting their generalization as a unified solution. We present OmniVTON++, a training-free VTON framework designed for universal applicability. It addresses the intertwined challenges of garment alignment, human structural coherence, and boundary continuity by coordinating Structured Garment Morphing for correspondence-driven garment adaptation, Principal Pose Guidance for step-wise structural regulation during diffusion sampling, and Continuous Boundary Stitching for boundary-aware refinement, forming a cohesive pipeline without task-specific retraining. Experimental results demonstrate that OmniVTON++ achieves state-of-the-art performance across diverse generalization settings, including cross-dataset and cross-garment-type evaluations, while reliably operating across scenarios and diffusion backbones within a single formulation. In addition to single-garment, single-human cases, the framework supports multi-garment, multi-human, and anime character virtual try-on, expanding the scope of virtual try-on applications. The source code will be released to the public.

[CV-27] MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation

【速读】:该论文旨在解决人类运动理解与生成任务中推理能力不足和测试阶段规划能力有限的问题。其核心解决方案是提出一种统一的多模态运动模型MoRL,通过监督微调与基于可验证奖励的强化学习联合训练,设计任务特定的奖励函数以同时优化语义对齐、推理连贯性(理解)以及物理合理性与文本-运动一致性(生成)。此外,引入测试时推理方法Chain-of-Motion(CoM),实现分步规划与反思,显著提升模型在复杂场景下的逻辑推理与感知真实性。

链接: https://arxiv.org/abs/2602.14534
作者: Hongpeng Wang,Zeyu Zhang,Wenhao Li,Hao Tang
机构: The University of Sydney (悉尼大学); Peking University (北京大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: this https URL. Website: this https URL.

[CV-28] Cross-view Domain Generalization via Geometric Consistency for LiDAR Semantic Segmentation

【速读】:该论文旨在解决跨视角(cross-view)场景下激光雷达语义分割(LiDAR semantic segmentation, LSS)的域泛化问题,即如何使模型在源域(如车载采集)训练后,能够可靠地推广到多个具有异构观测视角(如无人机、地面固定站等)的未见目标域。现有方法通常假设源域与目标域具有相似的采集视角(如均为车辆-mounted),难以应对因视角差异导致的结构不完整性(viewpoint-dependent structural incompleteness)和点云密度非均匀性(non-uniform point density)。其解决方案的关键在于提出一种名为CVGC(Cross-View Geometric Consistency)的新框架:首先设计了一个跨视角几何增强模块(cross-view geometric augmentation module),通过建模不同视角下的可见性和采样密度变化,生成同一场景的多视角观测;随后引入几何一致性模块(geometric consistency module),强制模型在几何增强后的点云上输出一致的语义和占据预测,从而提升跨视角的泛化能力。

链接: https://arxiv.org/abs/2602.14525
作者: Jindong Zhao,Yuan Gao,Yang Xia,Sheng Nie,Jun Yue,Weiwei Sun,Shaobo Xia
机构: Changsha University of Science and Technology (长沙理工大学); Chinese Academy of Sciences (中国科学院); Central South University (中南大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Domain-generalized LiDAR semantic segmentation (LSS) seeks to train models on source-domain point clouds that generalize reliably to multiple unseen target domains, which is essential for real-world LiDAR applications. However, existing approaches assume similar acquisition views (e.g., vehicle-mounted) and struggle in cross-view scenarios, where observations differ substantially due to viewpoint-dependent structural incompleteness and non-uniform point density. Accordingly, we formulate cross-view domain generalization for LiDAR semantic segmentation and propose a novel framework, termed CVGC (Cross-View Geometric Consistency). Specifically, we introduce a cross-view geometric augmentation module that models viewpoint-induced variations in visibility and sampling density, generating multiple cross-view observations of the same scene. Subsequently, a geometric consistency module enforces consistent semantic and occupancy predictions across geometrically augmented point clouds of the same scene. Extensive experiments on six public LiDAR datasets establish the first systematic evaluation of cross-view domain generalization for LiDAR semantic segmentation, demonstrating that CVGC consistently outperforms state-of-the-art methods when generalizing from a single source domain to multiple target domains with heterogeneous acquisition viewpoints. The source code will be publicly available at this https URL

[CV-29] Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model

【速读】:该论文旨在解决十八世纪印刷文本光学字符识别(OCR)中存在的可靠性问题,尤其关注在学术研究场景下模型误差结构对历史文献真实性的影响。传统指标如字符错误率(CER)和词错误率(WER)无法充分揭示模型在处理退化印刷质量、古体字形及非标准化拼写时的潜在偏差。解决方案的关键在于引入长度加权准确率和假设驱动的误差分析方法,对比专用OCR Transformer(TrOCR)与通用视觉语言模型(VLM,Qwen)在历史英文文本上的表现,发现二者虽具有相似的整体准确率,但在错误局部性、可检测性和下游学术风险方面存在系统性差异:Qwen因语言正则化倾向可能无声改变历史语形,而TrOCR更保真但易产生级联错误传播。这一结果强调了在历史数字化流程中需基于模型架构进行针对性评估的重要性。

链接: https://arxiv.org/abs/2602.14524
作者: Ari Vesalainen,Eetu Mäkelä,Laura Ruotsalainen,Mikko Tolonen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.14524 [cs.CV] (or arXiv:2602.14524v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.14524 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-30] Architectural Insights for Post-Tornado Damage Recognition

【速读】:该论文旨在解决龙卷风灾害后建筑物损毁评估中自动化方法面临的两大核心挑战:一是由于标准预训练数据集与龙卷风破坏场景之间存在显著的领域偏移(domain shift),导致模型性能下降;二是真实灾难数据中类别极度不平衡问题,影响模型泛化能力。解决方案的关键在于构建了一个系统性的实验框架,通过在新提出的Quad-State Tornado Damage (QSTD) 基准数据集上对79个开源深度学习模型进行超过2300次控制实验,发现模型性能提升并非仅依赖于架构选择,而是由架构与优化策略之间的复杂交互决定。特别地,优化器的选择比架构本身更具影响力——将优化器从Adam更换为SGD可使Vision Transformer和Swin Transformer家族的F1分数提升25至38个百分点,从而实现从垫底到与顶尖CNN相当的性能跃升;此外,采用低学习率(1×10⁻⁴)被证明是普遍有效的关键因素,平均提升F1分数10.2点。最终,基于ConvNeXt-Base并结合上述优化配置的冠军模型,在跨事件测试中展现出强泛化能力,表明其具备实际部署潜力。

链接: https://arxiv.org/abs/2602.14523
作者: Robinson Umeike,Thang Dao,Shane Crawford,John van de Lindt,Blythe Johnston,Wanting(Lisa)Wang,Trung Do,Ajibola Mofikoya,Sarbesh Banjara,Cuong Pham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rapid and accurate building damage assessment in the immediate aftermath of tornadoes is critical for coordinating life-saving search and rescue operations, optimizing emergency resource allocation, and accelerating community recovery. However, current automated methods struggle with the unique visual complexity of tornado-induced wreckage, primarily due to severe domain shift from standard pre-training datasets and extreme class imbalance in real-world disaster data. To address these challenges, we introduce a systematic experimental framework evaluating 79 open-source deep learning models, encompassing both Convolutional Neural Networks (CNNs) and Vision Transformers, across over 2,300 controlled experiments on our newly curated Quad-State Tornado Damage (QSTD) benchmark dataset. Our findings reveal that achieving operational-grade performance hinges on a complex interaction between architecture and optimization, rather than architectural selection alone. Most strikingly, we demonstrate that optimizer choice can be more consequential than architecture: switching from Adam to SGD provided dramatic F1 gains of +25 to +38 points for Vision Transformer and Swin Transformer families, fundamentally reversing their ranking from bottom-tier to competitive with top-performing CNNs. Furthermore, a low learning rate of 1x10^(-4) proved universally critical, boosting average F1 performance by +10.2 points across all architectures. Our champion model, ConvNeXt-Base trained with these optimized settings, demonstrated strong cross-event generalization on the held-out Tuscaloosa-Moore Tornado Damage (TMTD) dataset, achieving 46.4% Macro F1 (+34.6 points over baseline) and retaining 85.5% Ordinal Top-1 Accuracy despite temporal and sensor domain shifts.

[CV-31] Efficient Text-Guided Convolutional Adapter for the Diffusion Model

【速读】:该论文旨在解决扩散模型中结构保持条件生成(Structure Preserving Conditional Generation, SPCG)任务中存在的效率低下与多模态条件融合不足的问题。现有方法通常在基模型基础上引入参数量相当的适配器(adapter),用于处理结构输入(如草图或深度图),但存在训练成本高、适配器对文本提示(prompt)不敏感等问题,导致其仅能优化结构信息而无法有效利用文本引导。解决方案的关键在于提出两种新型文本引导型高效适配器——Nexus Prime 和 Nexus Slim,其核心创新是每个 Nexus Block 引入交叉注意力机制(cross-attention mechanism),使适配器能够同时感知文本提示和结构输入,从而实现更精细的多模态条件控制。实验表明,Nexus Prime 仅需增加 8M 参数即可显著提升性能,而 Nexus Slim 在减少 18M 参数的同时仍达到当前最优效果,显著提升了结构保持生成的效率与质量。

链接: https://arxiv.org/abs/2602.14514
作者: Aryan Das,Koushik Biswas,Swalpa Kumar Roy,Badri Narayana Patro,Vinay Kumar Verma
机构: VIT Bhopal (维特布帕尔大学); IIIT Delhi (印度信息科技研究所德里分校); Tezpur University Assam (特兹普尔大学阿萨姆邦); IIT Kanpur (印度理工学院坎普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce the Nexus Adapters, novel text-guided efficient adapters to the diffusion-based framework for the Structure Preserving Conditional Generation (SPCG). Recently, structure-preserving methods have achieved promising results in conditional image generation by using a base model for prompt conditioning and an adapter for structure input, such as sketches or depth maps. These approaches are highly inefficient and sometimes require equal parameters in the adapter compared to the base architecture. It is not always possible to train the model since the diffusion model is itself costly, and doubling the parameter is highly inefficient. In these approaches, the adapter is not aware of the input prompt; therefore, it is optimal only for the structural input but not for the input prompt. To overcome the above challenges, we proposed two efficient adapters, Nexus Prime and Slim, which are guided by prompts and structural inputs. Each Nexus Block incorporates cross-attention mechanisms to enable rich multimodal conditioning. Therefore, the proposed adapter has a better understanding of the input prompt while preserving the structure. We conducted extensive experiments on the proposed models and demonstrated that the Nexus Prime adapter significantly enhances performance, requiring only 8M additional parameters compared to the baseline, T2I-Adapter. Furthermore, we also introduced a lightweight Nexus Slim adapter with 18M fewer parameters than the T2I-Adapter, which still achieved state-of-the-art results. Code: this https URL

[CV-32] MedVAR: Towards Scalable and Efficient Medical Image Generation via Next-scale Autoregressive Prediction

【速读】:该论文旨在解决医疗图像生成领域中缺乏可扩展、高效且评估严谨的生成模型的问题,具体包括架构效率不足、多器官数据稀缺以及缺乏系统性评估方法。其解决方案的关键在于提出MedVAR,首个基于自回归机制的医学图像基础模型,采用“下一尺度预测”(next-scale prediction)范式实现从粗到细的图像生成,并构建了包含约44万张CT和MRI图像的标准化数据集,支持多层次结构化表示的生成,从而在保真度、多样性与可扩展性方面均达到当前最优性能,为未来医学生成式AI模型提供了新的架构方向。

链接: https://arxiv.org/abs/2602.14512
作者: Zhicheng He,Yunpeng Zhao,Junde Wu,Ziwei Niu,Zijun Li,Lanfen Lin,Yueming Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 8 figures

点击查看摘要

Abstract:Medical image generation is pivotal in applications like data augmentation for low-resource clinical tasks and privacy-preserving data sharing. However, developing a scalable generative backbone for medical imaging requires architectural efficiency, sufficient multi-organ data, and principled evaluation, yet current approaches leave these aspects unresolved. Therefore, we introduce MedVAR, the first autoregressive-based foundation model that adopts the next-scale prediction paradigm to enable fast and scale-up-friendly medical image synthesis. MedVAR generates images in a coarse-to-fine manner and produces structured multi-scale representations suitable for downstream use. To support hierarchical generation, we curate a harmonized dataset of around 440,000 CT and MRI images spanning six anatomical regions. Comprehensive experiments across fidelity, diversity, and scalability show that MedVAR achieves state-of-the-art generative performance and offers a promising architectural direction for future medical generative foundation models.

[CV-33] MacNet: An End-to-End Manifold-Constrained Adaptive Clustering Network for Interpretable Whole Slide Image Classification

【速读】:该论文旨在解决当前基于全切片图像(Whole Slide Images, WSIs)的病理诊断中,主流两阶段框架存在的局限性问题:一是离线特征编码器缺乏领域知识,二是基于注意力机制的多实例学习(Multiple Instance Learning, MIL)方法虽以预后为导向但可解释性差,三是聚类方法虽具可解释性却因高维特征和语义模糊的中心点导致性能受限。解决方案的关键在于提出一个端到端的MIL框架,融合流形自适应聚类(manifold adaptive clustering)与Grassmann重嵌入(Grassmann re-embedding),利用流形几何结构提升聚类鲁棒性,并设计先验知识引导的代理实例标签生成与聚合策略,从而逼近局部病理切片标签并聚焦肿瘤相关区域,最终在多中心WSI数据集上实现更高的分级准确率与更强的可解释性。

链接: https://arxiv.org/abs/2602.14509
作者: Mingrui Ma,Chentao Li,Pan Huang,Jing Qin
机构: Xinjiang University (新疆大学); Xinjiang Medical University (新疆医科大学); Columbia University (哥伦比亚大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code is available at this https URL

点击查看摘要

Abstract:Whole slide images (WSIs) are the gold standard for pathological diagnosis and sub-typing. Current main-stream two-step frameworks employ offline feature encoders trained without domain-specific knowledge. Among them, attention-based multiple instance learning (MIL) methods are outcome-oriented and offer limited interpretability. Clustering-based approaches can provide explainable decision-making process but suffer from high dimension features and semantically ambiguous centroids. To this end, we propose an end-to-end MIL framework that integrates Grassmann re-embedding and manifold adaptive clustering, where the manifold geometric structure facilitates robust clustering results. Furthermore, we design a prior knowledge guiding proxy instance labeling and aggregation strategy to approximate patch labels and focus on pathologically relevant tumor regions. Experiments on multicentre WSI datasets demonstrate that: 1) our cluster-incorporated model achieves superior performance in both grading accuracy and interpretability; 2) end-to-end learning refines better feature representations and it requires acceptable computation resources.

[CV-34] Prototype Instance-semantic Disentanglement with Low-rank Regularized Subspace Clustering for WSIs Explainable Recognition

【速读】:该论文旨在解决多实例学习(Multi-Instance Learning, MIL)框架中因肿瘤区域与癌前病变及非肿瘤组织高度相似、且非肿瘤实例数量远超肿瘤实例所导致的实例语义纠缠(instance-semantic entanglement)问题,从而提升模型的表征能力和可解释性。解决方案的关键在于提出端到端的原型实例语义解耦框架(Prototype Instance Semantic Disentanglement, PID-LRSC),其核心创新包括:1)通过低秩正则化子空间聚类(Low-Rank Regularized Subspace Clustering, LRSC)进行二次实例子空间学习,以缓解因非肿瘤实例占比过高引起的实例纠缠;2)引入增强对比学习机制设计原型实例语义解耦(PID),有效分离肿瘤与癌前病变之间的语义混淆。实验表明,该方法在多中心病理数据集上优于现有最先进方法,显著提升了辅助诊断结果的可靠性。

链接: https://arxiv.org/abs/2602.14501
作者: Chentao Li,Pan Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code is available at this https URL

点击查看摘要

Abstract:The tumor region plays a key role in pathological diagnosis. Tumor tissues are highly similar to precancerous lesions and non tumor instances often greatly exceed tumor instances in whole slide images (WSIs). These issues cause instance-semantic entanglement in multi-instance learning frameworks, degrading both model representation capability and interpretability. To address this, we propose an end-to-end prototype instance semantic disentanglement framework with low-rank regularized subspace clustering, PID-LRSC, in two aspects. First, we use secondary instance subspace learning to construct low-rank regularized subspace clustering (LRSC), addressing instance entanglement caused by an excessive proportion of non tumor instances. Second, we employ enhanced contrastive learning to design prototype instance semantic disentanglement (PID), resolving semantic entanglement caused by the high similarity between tumor and precancerous tissues. We conduct extensive experiments on multicentre pathology datasets, implying that PID-LRSC outperforms other SOTA methods. Overall, PID-LRSC provides clearer instance semantics during decision-making and significantly enhances the reliability of auxiliary diagnostic outcomes.

[CV-35] Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

【速读】:该论文旨在解决医学影像分割中因图像质量差、模态信息不一致及诊断不确定性导致的精度不足问题,尤其在多模态(影像与临床文本)融合场景下如何提升模型的鲁棒性和可解释性。其解决方案的关键在于提出两个核心组件:一是Modality Decoding Attention Block (MoDAB) 结合轻量级State Space Mixer (SSMix),实现高效跨模态融合与长距离依赖建模;二是引入Spectral-Entropic Uncertainty (SEU) Loss,通过联合优化空间重叠、光谱一致性与预测不确定性,在模糊或低质量临床环境下显著增强模型可靠性。

链接: https://arxiv.org/abs/2602.14498
作者: Aryan Das,Tanishq Rachamalla,Koushik Biswas,Swalpa Kumar Roy,Vinay Kumar Verma
机构: VIT Bhopal (维特大学博帕尔校区); SAHE, Andhra Pradesh (安德拉邦高等教育局); IIIT Delhi (印度信息学院德里校区); Tezpur University Assam (特兹普尔大学阿萨姆校区); IIT Kanpur (印度理工学院坎普尔校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. In complex clinical circumstances with poor image quality, this formulation improves model reliability. Extensive experiments on various publicly available medical datasets, QATA-COVID19, MosMed++, and Kvasir-SEG, demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: this https URL

[CV-36] Gaussian Mesh Renderer for Lightweight Differentiable Rendering ICASSP2026

【速读】:该论文旨在解决传统基于三角网格的可微渲染器在表面重建中优化速度慢、计算负担重的问题。其解决方案的关键在于提出一种轻量级可微渲染器——高斯网格渲染器(Gaussian Mesh Renderer, GMR),该方法利用3D高斯泼溅(3D Gaussian Splatting, 3DGS)高效的光栅化流程,将每个高斯基元(Gaussian primitive)从对应的网格三角面片解析推导而来,从而在保持结构保真度的同时实现梯度传递。相比传统方法,GMR能获得更平滑的梯度,尤其在小批量训练和内存受限场景下显著提升优化效果。

链接: https://arxiv.org/abs/2602.14493
作者: Xinpeng Liu,Fumio Okura
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026). GitHub: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has enabled high-fidelity virtualization with fast rendering and optimization for novel view synthesis. On the other hand, triangle mesh models still remain a popular choice for surface reconstruction but suffer from slow or heavy optimization in traditional mesh-based differentiable renderers. To address this problem, we propose a new lightweight differentiable mesh renderer leveraging the efficient rasterization process of 3DGS, named Gaussian Mesh Renderer (GMR), which tightly integrates the Gaussian and mesh representations. Each Gaussian primitive is analytically derived from the corresponding mesh triangle, preserving structural fidelity and enabling the gradient flow. Compared to the traditional mesh renderers, our method achieves smoother gradients, which especially contributes to better optimization using smaller batch sizes with limited memory. Our implementation is available in the public GitHub repository at this https URL.

[CV-37] Revisiting the Platonic Representation Hypothesis: An Aristotelian View

【速读】:该论文旨在解决现有神经网络表征相似性度量方法因模型规模(深度或宽度)变化而产生系统性偏差的问题,即当前指标在不同规模模型间缺乏可比性,导致对“柏拉图表征假设”(Platonic Representation Hypothesis)的验证存在误导。其解决方案的关键在于提出一种基于置换的零校准框架(permutation-based null-calibration framework),该框架能够将任意表征相似性度量转化为具有统计保证的校准分数,从而消除模型规模带来的干扰效应。通过此校准方法重新检验柏拉图假设后发现,全局谱相似性在校准后显著减弱,而局部邻域关系则保持跨模态一致性,由此提出新的“亚里士多德表征假设”(Aristotelian Representation Hypothesis)——神经网络表征正趋于共享局部邻域结构关系。

链接: https://arxiv.org/abs/2602.14486
作者: Fabian Gröger,Shuo Wen,Maria Brbić
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can systematically inflate representational similarity scores. To correct these effects, we introduce a permutation-based null-calibration framework that transforms any representational similarity metric into a calibrated score with statistical guarantees. We revisit the Platonic Representation Hypothesis with our calibration framework, which reveals a nuanced picture: the apparent convergence reported by global spectral measures largely disappears after calibration, while local neighborhood similarity, but not local distances, retains significant agreement across different modalities. Based on these findings, we propose the Aristotelian Representation Hypothesis: representations in neural networks are converging to shared local neighborhood relationships.

[CV-38] kArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉推理任务中因全局图像编码导致关键证据丢失的问题,例如微小物体、杂乱区域或细微标记等局部信息难以被有效捕捉。解决方案的核心是提出TikArt(Thinking Aperture),一个基于光圈引导的智能体,其通过“思考-光圈-观察”循环机制,将多步视觉-语言推理建模为对感兴趣区域的决策过程;其中,Zoom操作提取矩形裁剪区域,Segment调用SAM2获取不规则目标的掩码裁剪,每次动作后模型必须生成显式观察结果,从而将局部视觉线索转化为持久的语言记忆,显著提升高分辨率场景下的可解释性与推理能力。

链接: https://arxiv.org/abs/2602.14482
作者: Hao Ding,Zhichuan Yang,Weijie Ge,Ziqin Gao,Chaoyi Lu,Lei Zhao
机构: Zhejiang University (浙江大学); Xi’an Jiaotong University (西安交通大学); Zhejiang University of Science and Technology (浙江科技学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.

[CV-39] CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

【速读】:该论文旨在解决图像风格迁移中语义一致性不足的问题,即在保持相似对象间语义对应关系的前提下实现精细的风格转移。现有方法多局限于全局层面,忽视了区域乃至像素级别的语义对齐。其解决方案的关键在于提出一种无需训练且低成本的框架 CoCoDiff,通过挖掘预训练潜空间扩散模型(latent diffusion models)中的中间特征,构建内容图与风格图之间的密集像素级语义对齐映射,并引入循环一致性模块以在迭代过程中强化结构和感知一致性,从而实现保留几何形状与细节的对象级和区域级风格化效果。

链接: https://arxiv.org/abs/2602.14464
作者: Wenbo Nie,Zixiang Li,Renshuai Tao,Bin Wu,Yunchao Wei,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Visual Intelligence + X International Joint Laboratory of the Ministry of Education (教育部视觉智能国际联合实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at global level but overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object and region level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.

[CV-40] Controlling Your Image via Simplified Vector Graphics

【速读】:该论文旨在解决图像生成过程中缺乏元素级控制的问题,即如何实现对图像中特定对象的形状、颜色或存在性进行直观修改,从而提升生成图像的可控性。其解决方案的关键在于引入基于简化矢量图形(simplified vector graphics, VGs)的分层可控生成方法:首先将图像高效解析为语义对齐且结构一致的层级VG表示,随后构建一种以VG为引导的图像合成框架,使用户能够自由编辑VG中的元素,并将其无缝转化为高质量的逼真图像输出。该方法通过结合VG的结构与语义特征以及噪声预测机制,实现了对几何形状、色彩和对象语义的精确控制。

链接: https://arxiv.org/abs/2602.14443
作者: Lanqing Guo,Xi Liu,Yufei Wang,Zhihao Li,Siyu Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Recent advances in image generation have achieved remarkable visual quality, while a fundamental challenge remains: Can image generation be controlled at the element level, enabling intuitive modifications such as adjusting shapes, altering colors, or adding and removing objects? In this work, we address this challenge by introducing layer-wise controllable generation through simplified vector graphics (VGs). Our approach first efficiently parses images into hierarchical VG representations that are semantic-aligned and structurally coherent. Building on this representation, we design a novel image synthesis framework guided by VGs, allowing users to freely modify elements and seamlessly translate these edits into photorealistic outputs. By leveraging the structural and semantic features of VGs in conjunction with noise prediction, our method provides precise control over geometry, color, and object semantics. Extensive experiments demonstrate the effectiveness of our approach in diverse applications, including image editing, object-level manipulation, and fine-grained content creation, establishing a new paradigm for controllable image generation. Project page: this https URL

[CV-41] D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection

【速读】:该论文旨在解决多模态虚假信息(multimodal misinformation)检测中因单一证据源导致的局限性问题,即内容-based检测器难以验证全局事实真实性,而基于检索的事实核查系统又常忽略像素级或词元级的细微篡改。解决方案的关键在于提出D-SECURE框架,通过融合内部篡改检测(HAMMER)与外部证据推理(DEFAME)机制,实现对新闻类帖子的协同验证:DEFAME负责广义事实核查,HAMMER则专注于残差或不确定样本中的细粒度编辑识别,从而弥补各自盲区并生成可解释的联合报告,显著提升对复杂伪造内容的检测能力。

链接: https://arxiv.org/abs/2602.14441
作者: Gagandeep Singh,Samudi Amarasinghe,Priyanka Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Multimodal misinformation increasingly mixes realistic im-age edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.

[CV-42] Hierarchical Vision-Language Interaction for Facial Action Unit Detection

【速读】:该论文旨在解决面部动作单元(Action Unit, AU)检测中因标注数据有限而导致的判别性与泛化能力不足的问题。其核心解决方案是提出一种层次化视觉-语言交互框架(Hierarchical Vision-language Interaction for AU Understanding, HiVA),关键在于通过大语言模型生成多样且语境丰富的AU描述作为语义先验,引导视觉特征学习;同时设计了AU感知的动态图模块以捕获细粒度与整体的视觉-语言关联,并引入两种互补的跨模态注意力机制——解耦双交叉注意力(Disentangled Dual Cross-Attention, DDCA)用于建立AU特定的细粒度交互,上下文双交叉注意力(Contextual Dual Cross-Attention, CDCA)用于建模AU间的全局依赖关系,从而实现多粒度视觉特征与精细化语言信息的协同学习,显著提升AU检测的鲁棒性和语义可解释性。

链接: https://arxiv.org/abs/2602.14425
作者: Yong Li,Yi Ren,Yizhe Zhang,Wenhua Zhang,Tianyi Zhang,Muyun Jiang,Guo-Sen Xie,Cuntai Guan
机构: Southeast University (东南大学); Nanjing University of Science and Technology (南京理工大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transaction on Affective Computing 2026

点击查看摘要

Abstract:Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge w.r.t AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection capabilities. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. Besides, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.

[CV-43] Understanding Sensor Vulnerabilities in Industrial XR Tracking

【速读】:该论文旨在解决工业和操作环境中扩展现实(Extended Reality, XR)系统依赖视觉-惯性里程计(Visual-Inertial Odometry, VIO)进行六自由度位姿跟踪时,因传感器性能退化导致的定位精度下降问题。现有VIO评估多基于理想传感条件,忽视了实际运行中持续的传感器劣化效应,因而难以准确预测系统在真实场景下的鲁棒性。解决方案的关键在于通过受控的实证研究,在多种工作状态下系统性地注入视觉与惯性模态故障,并进行定量分析,从而揭示两类传感器退化对位姿估计影响的显著不对称性:视觉退化通常仅引发厘米级的位姿误差,而惯性退化则可能导致数十至数千米的轨迹漂移,这凸显了在XR系统设计与评估中强化惯性模块可靠性的必要性。

链接: https://arxiv.org/abs/2602.14413
作者: Sourya Saha,Md. Nurul Absur
机构: City University of New York (纽约市立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: IEEE VR XRIOS 2026 Workshop

点击查看摘要

Abstract:Extended Reality (XR) systems deployed in industrial and operational settings rely on Visual–Inertial Odometry (VIO) for continuous six-degree-of-freedom pose tracking, yet these environments often involve sensing conditions that deviate from ideal assumptions. Despite this, most VIO evaluations emphasize nominal sensor behavior, leaving the effects of sustained sensor degradation under operational conditions insufficiently understood. This paper presents a controlled empirical study of VIO behavior under degraded sensing, examining faults affecting visual and inertial modalities across a range of operating regimes. Through systematic fault injection and quantitative evaluation, we observe a pronounced asymmetry in fault impact where degradations affecting visual sensing typically lead to bounded pose errors on the order of centimeters, whereas degradations affecting inertial sensing can induce substantially larger trajectory deviations, in some cases reaching hundreds to thousands of meters. These observations motivate greater emphasis on inertial reliability in the evaluation and design of XR systems for real-life industrial settings.

[CV-44] Learning Proposes Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning

【速读】:该论文旨在解决当前空间感知(spatial perception)中学习方法与几何模型融合方式的不确定性问题,即:在基于视觉的相机位姿估计任务中,学习模块是否应直接替代传统几何算法,还是作为中间模块嵌入到几何驱动的流水线中。其核心解决方案是提出一种端到端的模块化框架,其中学习模型(如VGGT)负责生成几何假设(如位姿和深度预测),而经典几何算法(如点到平面RGB-D ICP)则用于对这些假设进行验证与优化。关键发现表明,仅依赖学习模块的位姿预测不可靠,且未正确对齐相机内参的深度预测会损害性能;但当学习提出的深度信息与相机参数几何一致,并由几何后端进行处置时,在中等挑战性的刚体场景下可实现稳定提升。这揭示了几何约束不仅是修正手段,更是验证和吸收学习观测的核心仲裁者,强调了面向几何感知的模块化系统设计对于鲁棒性的重要性。

链接: https://arxiv.org/abs/2602.14409
作者: Haichao Zhu,Zhaorui Yang,Qian Zhang
机构: Reality Vision; University of California, Riverside
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, where learning proposes geometric hypotheses, while geometric algorithms dispose estimation decisions. In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and followed by a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings. These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.14409 [cs.CV] (or arXiv:2602.14409v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.14409 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-45] Feature Recalibration Based Olfactory-Visual Multimodal Model for Fine-Grained Rice Deterioration Detection

【速读】:该论文旨在解决现有多模态方法在稻米劣变检测中对细粒度异常特征表征与提取能力有限,且依赖高成本设备(如高光谱相机和质谱仪)导致检测成本高、数据采集时间长的问题。其解决方案的关键在于提出一种基于特征重校准的嗅觉-视觉多模态模型,核心包括两个创新组件:一是细粒度劣变嵌入构造器(Fine-grained Deterioration Embedding Constructor, FDEC),用于重构带标签的多模态嵌入特征数据集以增强样本表征;二是细粒度劣变重校准注意力网络(Fine-grained Deterioration Recalibration Attention Network, FDRA-Net),通过强调信号变化提升对稻米表面细粒度劣变的敏感性。实验表明,该方法在分类准确率上达到99.89%,优于当前最优方法,同时简化了检测流程并具备良好的田间适用性。

链接: https://arxiv.org/abs/2602.14408
作者: Rongqiang Zhao,Hengrui Hu,Yijing Wang,Mingchun Sun,Jie Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); National Key Laboratory of Smart Farm Technologies and Systems (智能农场技术与系统全国重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal methods are widely used in rice deterioration detection, which exhibit limited capability in representing and extracting fine-grained abnormal features. Moreover, these methods rely on devices, such as hyperspectral cameras and mass spectrometers, increasing detection costs and prolonging data acquisition time. To address these issues, we propose a feature recalibration based olfactory-visual multimodal model for fine-grained rice deterioration detection. The fine-grained deterioration embedding constructor (FDEC) is proposed to reconstruct the labeled multimodal embedded-feature dataset, enhancing sample representation. The fine-grained deterioration recalibration attention network (FDRA-Net) is proposed to emphasize signal variations and increase sensitivity to fine-grained deterioration on the rice surface. Experiments show that the proposed method achieves a classification accuracy of 99.89%. Compared with state-of-the-art methods, the detection accuracy is improved and the procedure is simplified. Furthermore, field detection demonstrates the advantages of accuracy and operational simplicity. The proposed method can also be extended to other agrifood in agriculture and food industry.

[CV-46] pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI

【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)任务中因私有室内环境轨迹指令数据分布高度异构(non-IID)所引发的联邦学习(Federated Learning, FL)性能下降问题,即传统联邦平均(FedAvg)方法难以在不同客户端间有效共享全局知识并保留本地特异性。解决方案的关键在于提出pFedNavi框架,其核心创新是结构感知且动态自适应的个性化联邦学习机制:通过逐层混合系数自适应识别客户端特定层(如编码器-解码器投影层和环境敏感解码层),并对选定组件进行细粒度参数融合,在保持全局知识共享的同时实现局部模型专业化,从而显著提升导航成功率与轨迹保真度,并加速收敛。

链接: https://arxiv.org/abs/2602.14401
作者: Qingqian Yang,Hao Wang,Sai Qian Zhang,Jian Li,Yang Hua,Miao Pan,Tao Song,Zhengwei Qi,Haibing Guan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Vision-Language Navigation VLN requires large-scale trajectory instruction data from private indoor environments, raising significant privacy concerns. Federated Learning FL mitigates this by keeping data on-device, but vanilla FL struggles under VLNs’ extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients, and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to 7.5% improvement in navigation success rate and up to 7.8% gain in trajectory fidelity, while converging 1.38x faster under non-IID conditions.

[CV-47] Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

【速读】:该论文旨在解决多轮对抗攻击(multi-turn jailbreak attacks)在大型视觉语言模型(LVLMs)中有效性下降的问题。由于视觉输入的引入可能触发安全对齐机制,导致模型响应趋于保守,从而削弱攻击效果。解决方案的关键在于提出MAPA(Multi-turn Adaptive Prompting Attack),其核心创新为两层设计:一是每轮交替使用文本与视觉攻击动作以激发最恶意响应;二是跨轮次通过迭代式双向优化调整攻击轨迹,逐步放大输出内容的恶意程度。这一机制显著提升了攻击成功率,在多个主流LVLM基准上较现有方法提升11%-35%。

链接: https://arxiv.org/abs/2602.14399
作者: In Chong Choi,Jiacheng Zhang,Feng Liu,Yiliao Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-turn jailbreak attacks are effective against text-only large language models (LLMs) by gradually introducing malicious content across turns. When extended to large vision-language models (LVLMs), we find that naively adding visual inputs can cause existing multi-turn jailbreaks to be easily defended. For example, overly malicious visual input will easily trigger the defense mechanism of safety-aligned LVLMs, making the response more conservative. To address this, we propose MAPA: a multi-turn adaptive prompting attack that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini.

[CV-48] Adapting VACE for Real-Time Autoregressive Video Diffusion

【速读】:该论文旨在解决生成式 AI (Generative AI) 中视频生成模型在实时流式处理场景下的兼容性问题,即如何在保持自回归(autoregressive)架构所需固定块大小(fixed chunk sizes)和因果注意力(causal attention)的前提下,实现对视频的统一控制(如参考引导、结构条件控制、图像修复和时序扩展)。其解决方案的关键在于将参考帧从扩散潜空间(diffusion latent space)迁移至一个并行的条件路径中,从而避免双向注意力机制对流式处理的限制,同时保留KV缓存(KV caching)以支持高效推理。该方法无需额外训练即可复用预训练的VACE权重,仅在1.3B和14B模型规模下引入20-30%的延迟开销,且显存占用可忽略不计。

链接: https://arxiv.org/abs/2602.14381
作者: Ryan Fosdick(Daydream)
机构: Daydream
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 7 tables

点击查看摘要

Abstract:We describe an adaptation of VACE (Video All-in-one Creation and Editing) for real-time autoregressive video generation. VACE provides unified video control (reference guidance, structural conditioning, inpainting, and temporal extension) but assumes bidirectional attention over full sequences, making it incompatible with streaming pipelines that require fixed chunk sizes and causal attention. The key modification moves reference frames from the diffusion latent space into a parallel conditioning pathway, preserving the fixed chunk sizes and KV caching that autoregressive models require. This adaptation reuses existing pretrained VACE weights without additional training. Across 1.3B and 14B model scales, VACE adds 20-30% latency overhead for structural control and inpainting, with negligible VRAM cost relative to the base model. Reference-to-video fidelity is severely degraded compared to batch VACE due to causal attention constraints. A reference implementation is available at this https URL.

[CV-49] Event-based Visual Deformation Measurement

【速读】:该论文旨在解决视觉变形测量(Visual Deformation Measurement, VDM)在高动态场景下因传统基于图像的方法依赖帧间小位移约束而导致的适用性受限问题,以及由此引发的对高速摄像机的依赖所导致的数据存储与计算开销过高的挑战。其核心解决方案是提出一种事件-帧融合框架(event-frame fusion framework),利用事件数据提供时域密集的运动线索、帧数据提供空间密集的精确估计,并引入仿射不变单纯形(Affine Invariant Simplicial, AIS)建模方法,将变形场分割为低参数化的线性子区域以缓解稀疏噪声事件带来的运动歧义;同时采用邻域贪婪优化策略,通过已收敛子区域引导邻近未收敛区域,有效抑制长期稠密跟踪中的局部误差累积,从而实现高精度且资源高效的变形测量。

链接: https://arxiv.org/abs/2602.14376
作者: Yuliang Wu,Wei Zhai,Yuxin Cui,Tiesong Zhao,Yang Cao,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学); Fuzhou University (福州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation. Revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework. It partitions the deformation field into linearized sub-regions with low-parametric representation, effectively mitigating motion ambiguities arising from sparse and noisy events. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors, effectively suppress local error accumulation in long-term dense tracking. To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and frames is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that our method outperforms the state-of-the-art baseline by 1.6% in survival rate. Remarkably, it achieves this using only 18.9% of the data storage and processing resources of high-speed video methods.

[CV-50] Image-based Joint-level Detection for Inflammation in Rheumatoid Arthritis from Small and Imbalanced Data

【速读】:该论文旨在解决类风湿关节炎(Rheumatoid Arthritis, RA)相关关节炎症的早期检测难题,尤其是在家庭环境中利用RGB手部图像进行无创、便捷的炎症识别问题。由于医疗资源分布不均和诊断延迟,RA患者常难以及时获得专科诊疗,而现有基于RGB图像的炎症检测方法在数据稀缺、正样本不足及类别不平衡等挑战下表现不佳。论文的关键解决方案是提出一种结合全局-局部编码器(global-local encoder)的炎症检测框架:首先在大规模健康手部图像上进行自监督预训练以增强模型泛化能力,再引入针对数据不平衡的训练策略,从而显著提升对RA相关关节炎症的检测性能,实验表明该方法相较基线模型在F1-score和Gmean指标上分别提升了0.2和0.25点。

链接: https://arxiv.org/abs/2602.14365
作者: Shun Kato(Keio University, Japan),Yasushi Kondo(Keio University, Japan),Shuntaro Saito(Keio University, Japan),Yoshimitsu Aoki(Keio University, Japan),Mariko Isogawa(Keio University, Japan)
机构: Keio University (庆应义塾大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rheumatoid arthritis (RA) is an autoimmune disease characterized by systemic joint inflammation. Early diagnosis and tight follow-up are essential to the management of RA, as ongoing inflammation can cause irreversible joint damage. The detection of arthritis is important for diagnosis and assessment of disease activity; however, it often takes a long time for patients to receive appropriate specialist care. Therefore, there is a strong need to develop systems that can detect joint inflammation easily using RGB images captured at home. Consequently, we tackle the task of RA inflammation detection from RGB hand images. This task is highly challenging due to general issues in medical imaging, such as the scarcity of positive samples, data imbalance, and the inherent difficulty of the task itself. However, to the best of our knowledge, no existing work has explicitly addressed these challenges in RGB-based RA inflammation detection. This paper quantitatively demonstrates the difficulty of visually detecting inflammation by constructing a dedicated dataset, and we propose a inflammation detection framework with global local encoder that combines self-supervised pretraining on large-scale healthy hand images with imbalance-aware training to detect RA-related joint inflammation from RGB hand images. Our experiments demonstrated that the proposed approach improves F1-score by 0.2 points and Gmean by 0.25 points compared with the baseline model.

[CV-51] A Generative AI Approach for Reducing Skin Tone Bias in Skin Cancer Classification

【速读】:该论文旨在解决皮肤癌自动检测中因皮肤色调不平衡导致的公平性问题,特别是当前AI诊断工具在训练数据上严重偏向浅肤色人群(ISIC数据集中超过70%为浅肤色),致使深肤色个体的诊断准确率显著下降。解决方案的关键在于提出一种基于生成式AI的增强管道:通过低秩适应(Low-Rank Adaptation, LoRA)微调预训练的Stable Diffusion模型,在ISIC数据集中深肤色子集的基础上生成条件可控的合成皮肤病变 dermoscopic 图像(条件包括病变类型和皮肤色调)。实验表明,该方法在病变分割和二分类任务中均提升了性能指标(如IoU、Dice系数及准确率),验证了其在缓解数据偏倚、提升诊断公平性方面的有效性。

链接: https://arxiv.org/abs/2602.14356
作者: Areez Muhammed Shabu,Mohammad Samar Ansari,Asra Aslam
机构: University of Sheffield (谢菲尔德大学); University of Chester (切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Skin cancer is one of the most common cancers worldwide and early detection is critical for effective treatment. However, current AI diagnostic tools are often trained on datasets dominated by lighter skin tones, leading to reduced accuracy and fairness for people with darker skin. The International Skin Imaging Collaboration (ISIC) dataset, one of the most widely used benchmarks, contains over 70% light skin images while dark skins fewer than 8%. This imbalance poses a significant barrier to equitable healthcare delivery and highlights the urgent need for methods that address demographic diversity in medical imaging. This paper addresses this challenge of skin tone imbalance in automated skin cancer detection using dermoscopic images. To overcome this, we present a generative augmentation pipeline that fine-tunes a pre-trained Stable Diffusion model using Low-Rank Adaptation (LoRA) on the image dark-skin subset of the ISIC dataset and generates synthetic dermoscopic images conditioned on lesion type and skin tone. In this study, we investigated the utility of these images on two downstream tasks: lesion segmentation and binary classification. For segmentation, models trained on the augmented dataset and evaluated on held-out real images show consistent improvements in IoU, Dice coefficient, and boundary accuracy. These evalutions provides the verification of Generated dataset. For classification, an EfficientNet-B0 model trained on the augmented dataset achieved 92.14% accuracy. This paper demonstrates that synthetic data augmentation with Generative AI integration can substantially reduce bias with increase fairness in conventional dermatological diagnostics and open challenges for future directions.

[CV-52] Differential pose optimization in descriptor space – Combining Geometric and Photometric Methods for Motion Estimation

【速读】:该论文旨在解决计算机视觉中的两帧相对位姿优化问题,该问题通常依赖于光度误差(photometric error)或重投影误差(re-projection error)进行优化,而这两类误差的选择往往受限于特征范式(光度特征或几何特征),在精度、鲁棒性和回环闭合可能性之间存在权衡。论文提出了一种融合两种范式优势的统一方法:通过使用密集采样的几何特征描述子(geometric feature descriptor),以描述子残差(descriptor residual)替代光度误差,从而在微分光度方法中实现亚像素级精度,同时保留几何描述子的表达能力。其关键创新在于将密集几何描述子引入位姿优化框架,但实验表明,尽管该策略利用了更多信息,仍未能超越基于重投影误差的优化方法,作者进一步分析认为,这可能源于描述子相似性度量变化过于缓慢,且与关键点定位精度之间缺乏严格对应关系。

链接: https://arxiv.org/abs/2602.14297
作者: Andreas L. Teigen,Annette Stahl,Rudolf Mester
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:One of the fundamental problems in computer vision is the two-frame relative pose optimization problem. Primarily, two different kinds of error values are used: photometric error and re-projection error. The selection of error value is usually directly dependent on the selection of feature paradigm, photometric features, or geometric features. It is a trade-off between accuracy, robustness, and the possibility of loop closing. We investigate a third method that combines the strengths of both paradigms into a unified approach. Using densely sampled geometric feature descriptors, we replace the photometric error with a descriptor residual from a dense set of descriptors, thereby enabling the employment of sub-pixel accuracy in differential photometric methods, along with the expressiveness of the geometric feature descriptor. Experiments show that although the proposed strategy is an interesting approach that results in accurate tracking, it ultimately does not outperform pose optimization strategies based on re-projection error despite utilizing more information. We proceed to analyze the underlying reason for this discrepancy and present the hypothesis that the descriptor similarity metric is too slowly varying and does not necessarily correspond strictly to keypoint placement accuracy.

[CV-53] Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

【速读】:该论文旨在解决当前计算机使用代理(Computer-Use Agent, CUA)在执行任务时,因屏幕感知能力不足而导致指令接地(grounding)不准确的问题。现有数据集普遍存在标注稀疏、多样性不足的问题,难以支持全面且泛化能力强的UI元素解析。为应对这一挑战,作者提出ScreenParse——一个大规模、密集标注的屏幕解析数据集,涵盖771K张网页截图中的2100万个可见UI元素(包括边界框、55类元素类型及文本信息),并通过自动化渲染与VLM(Vision-Language Model)驱动的重标注和质量过滤流程实现高效构建。解决方案的关键在于:一是通过密集标注提供结构化的视觉-语义监督信号,提升模型对屏幕结构的理解能力;二是设计了一个轻量级但结构感知强的ScreenVLM模型(316M参数),采用结构敏感损失函数强化关键token的学习,从而在保持低延迟和设备端部署效率的同时,显著优于更大规模的基础VLM模型,在屏幕解析任务上取得SOTA性能,并展现出良好的迁移能力。

链接: https://arxiv.org/abs/2602.14276
作者: A. Said Gurbuz,Sunghwan Hong,Ahmed Nassar,Marc Pollefeys,Peter Staar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 15 figures

点击查看摘要

Abstract:Modern computer-use agents (CUA) must perceive a screen as a structured state, what elements are visible, where they are, and what text they contain, before they can reliably ground instructions and act. Yet, most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse urls, extracts annotations and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: this https URL.

[CV-54] AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks ICASSP2026

【速读】:该论文旨在解决基于文本指令的物体添加(object addition)任务中因文本提示模糊或基于掩码输入繁琐而导致的可用性问题。其核心解决方案是提出了一种名为AbracADDbra的用户友好框架,关键在于利用直观的触控先验(touch priors)来空间定位简洁指令,从而实现精确的物体放置;该框架采用解耦架构:首先使用视觉-语言Transformer(vision-language transformer)进行触控引导的放置,随后通过扩散模型(diffusion model)联合生成物体及其实例掩码(instance mask),以实现高保真融合。实验表明,该方法在Touch2Add基准测试中显著优于随机放置和通用视觉语言模型基线,且初始放置精度与最终编辑质量呈强相关性,验证了其解耦设计的有效性。

链接: https://arxiv.org/abs/2602.14237
作者: Kunal Swami,Raghu Chittersu,Yuvraj Rathore,Rajeev Irny,Shashavali Doodekula,Alok Shukla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in IEEE ICASSP 2026

点击查看摘要

Abstract:Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Our extensive evaluations, where our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework’s ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.

[CV-55] Dual-Signal Adaptive KV-Cache Optimization for Long-Form Video Understanding in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理长视频内容时面临的内存瓶颈问题,即键值缓存(Key-Value Cache, KV cache)随序列长度线性增长导致的存储与计算压力。现有方法多采用反应式淘汰策略,在执行完整注意力计算后再丢弃冗余token,造成大量计算浪费。其解决方案的关键在于提出Sali-Cache框架,通过先验优化实现双信号自适应缓存:利用基于光流分析的时序滤波器检测帧间冗余,结合基于显著性检测的空间滤波器识别视觉关键区域,从而在进入高复杂度注意力运算前主动管理内存分配,显著提升内存使用效率并保持模型性能。

链接: https://arxiv.org/abs/2602.14236
作者: Vishnu Sai,Dheeraj Sai,Srinath B,Girish Varma,Priyesh Shukla
机构: International Institute of Information Technology, Hyderabad, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content due to the linear growth of the Key-Value (KV) cache with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis for detecting inter-frame redundancy and a spatial filter leveraging saliency detection for identifying visually significant regions, Sali-Cache intelligently manages memory allocation before entering computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while maintaining 100% accuracy across BLEU, ROUGE-L, and Exact Match metrics. Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.

[CV-56] Learning Significant Persistent Homology Features for 3D Shape Understanding

【速读】:该论文旨在解决现有三维形状分析基准数据集(如ModelNet40和ShapeNet)主要捕捉几何信息而忽略拓扑结构的问题,从而限制了深度学习模型对形状本质特征的全面理解。其关键解决方案是构建包含持久同调(persistent homology)特征的拓扑增强型数据集,并提出一种基于深度学习的显著持久点选择方法TopoGAT,该方法可直接从输入数据及对应的拓扑签名中自动学习最具信息量的拓扑特征,克服传统手工统计选择标准的局限性,进而提升点云分类与部件分割任务的性能。

链接: https://arxiv.org/abs/2602.14228
作者: Prachi Kudeshia,Jiju Poovvancheri
机构: Saint Mary’s University (圣玛丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures, Preprint under review

点击查看摘要

Abstract:Geometry and topology constitute complementary descriptors of three-dimensional shape, yet existing benchmark datasets primarily capture geometric information while neglecting topological structure. This work addresses this limitation by introducing topologically-enriched versions of ModelNet40 and ShapeNet, where each point cloud is augmented with its corresponding persistent homology features. These benchmarks with the topological signatures establish a foundation for unified geometry-topology learning and enable systematic evaluation of topology-aware deep learning architectures for 3D shape analysis. Building on this foundation, we propose a deep learning-based significant persistent point selection method, \textitTopoGAT, that learns to identify the most informative topological features directly from input data and the corresponding topological signatures, circumventing the limitations of hand-crafted statistical selection criteria. A comparative study verifies the superiority of the proposed method over traditional statistical approaches in terms of stability and discriminative power. Integrating the selected significant persistent points into standard point cloud classification and part-segmentation pipelines yields improvements in both classification accuracy and segmentation metrics. The presented topologically-enriched datasets, coupled with our learnable significant feature selection approach, enable the broader integration of persistent homology into the practical deep learning workflows for 3D point cloud analysis.

[CV-57] Freq-DP Net: A Dual-Branch Network for Fence Removal using Dual-Pixel and Fourier Priors ICASSP2026

【速读】:该论文旨在解决单张图像中围栏遮挡(fence occlusion)去除问题,该问题会降低视觉质量并限制下游计算机视觉应用的性能。现有方法在静态场景中表现不佳,或依赖多帧运动信息。解决方案的关键在于利用双像素(dual-pixel, DP)传感器提供的互补先验信息:一是通过显式代价体建模的离焦差异(defocus disparity)几何先验,二是通过快速傅里叶卷积(Fast Fourier Convolution, FFC)学习的围栏全局结构先验;二者通过注意力机制智能融合,实现高精度的围栏分割。该方法首次将DP传感器用于单图围栏去除任务,并构建了多样化基准数据集以验证其优越性。

链接: https://arxiv.org/abs/2602.14226
作者: Kunal Swami,Sudha Velusamy,Chandra Sekhar Seelamantula
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE ICASSP 2026

点击查看摘要

Abstract:Removing fence occlusions from single images is a challenging task that degrades visual quality and limits downstream computer vision applications. Existing methods often fail on static scenes or require motion cues from multiple frames. To overcome these limitations, we introduce the first framework to leverage dual-pixel (DP) sensors for this problem. We propose Freq-DP Net, a novel dual-branch network that fuses two complementary priors: a geometric prior from defocus disparity, modeled using an explicit cost volume, and a structural prior of the fence’s global pattern, learned via Fast Fourier Convolution (FFC). An attention mechanism intelligently merges these cues for highly accurate fence segmentation. To validate our approach, we build and release a diverse benchmark with different fence varieties. Experiments demonstrate that our method significantly outperforms strong general-purpose baselines, establishing a new state-of-the-art for single-image, DP-based fence removal.

[CV-58] HiVid: LLM -Guided Video Saliency For Content-Aware VOD And Live Streaming ICLR2026

【速读】:该论文旨在解决内容感知流媒体(Content-aware streaming)中动态、逐块(chunk-level)重要性权重的生成问题,以优化主观质量体验(QoE)。现有方法依赖人工标注成本过高,而视觉显著性模型泛化能力差。其核心解决方案是提出HiVid框架,利用大语言模型(Large Language Models, LLMs)作为可扩展的人类代理来生成高保真权重,适用于视频点播(VOD)和直播场景。关键创新在于:(1)设计感知模块,在局部上下文窗口内评估帧并自回归构建视频连贯理解,突破LLM模态限制与Token长度瓶颈;(2)针对VOD中局部窗口评分不一致问题,引入排序模块,采用新颖的LLM引导归并排序算法实现全局重排序;(3)面向直播低延迟在线推理需求,提出预测模块,基于多模态时间序列模型(含内容感知注意力机制与自适应时域)预测未来权重,适配异步LLM推理。实验表明,HiVid在VOD和直播场景下分别提升权重预测准确率11.5%和26%,真实用户研究验证其显著增强QoE相关性达14.7%。

链接: https://arxiv.org/abs/2602.14214
作者: Jiahui Chen,Bo Peng,Lianchen Jia,Zeyu Zhang,Tianchi Huang,Lifeng Sun
机构: Tsinghua University (清华大学); The Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026

点击查看摘要

Abstract:Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address 3 non-trivial challenges: (1) To extend LLMs’ limited modality and circumvent token limits, we propose a perception module to assess frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD with rating inconsistency across local windows, we propose a ranking module to perform global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming which requires low-latency, online inference without future knowledge, we propose a prediction module to predict future weights with a multi-modal time series model, which comprises a content-aware attention and adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight prediction accuracy by up to 11.5% for VOD and 26% for live streaming over SOTA baselines. Real-world user study validates HiVid boosts streaming QoE correlation by 14.7%.

[CV-59] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

【速读】:该论文旨在解决现有支持缩放功能的多模态大语言模型(Multimodal Large Language Models, MLLMs)在超高分辨率(Ultra-High-Resolution, UHR)遥感视觉问答(Remote Sensing Visual Question Answering, VQA)任务中普遍存在的“工具使用同质化”(Tool Usage Homogenization)问题,即模型在缩放操作中倾向于采用与任务无关的固定模式,导致难以获取有效证据。解决方案的关键在于提出GeoEyes框架,其核心由两部分组成:(1) 一个冷启动监督微调(Supervised Fine-Tuning, SFT)数据集UHR Chain-of-Zoom(UHR-CoZ),覆盖多样化的缩放策略;(2) 一种基于代理强化学习的方法AdaZoom-GRPO,通过显式奖励机制鼓励模型在缩放交互过程中提升证据获取量和答案准确性,从而实现按需缩放与合理的停止行为,显著提升模型在UHR遥感基准测试XLRS-Bench上的性能至54.23%准确率。

链接: https://arxiv.org/abs/2602.14201
作者: Fengxiang Wang,Mingshuo Chen,Yueying Li,Yajie Yang,Yifan Zhang,Long Lan,Xue Yang,Hongda Sun,Yulin Wang,Di Wang,Jun Song,Jing Zhang,Bo Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The “thinking-with-images” paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.

[CV-60] Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation ICLR2026

【速读】:该论文旨在解决机器人在操作结构化物体(如门把手、旋钮等)时,如何实现跨类别和形状的泛化能力问题。现有方法多依赖于2D基础特征,缺乏对功能性部件(functional parts)的显式建模,且难以在3D空间中有效扩展,导致运行时间长、多视角不一致以及几何信息不足等问题。解决方案的关键在于提出一种名为Part-Aware 3D Feature Field (PA3FF) 的新型密集3D特征表示,该特征通过大规模标注数据集中的3D部件提案进行对比学习训练,能够以前馈方式从点云输入中预测连续的3D特征场,其中点特征间的距离反映功能部件的空间接近性——即相似特征的点更可能属于同一功能部件。这一设计显著提升了机器人在复杂场景下的操作泛化能力和样本效率。

链接: https://arxiv.org/abs/2602.14193
作者: Yue Chen,Muqing Jiang,Kaifeng Zheng,Jiaqi Liang,Chenrui Tie,Haoran Lu,Ruihai Wu,Hao Dong
机构: Peking University (北京大学); Beijing Institute of Technology (北京理工大学); National University of Singapore (新加坡国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accept to ICLR 2026, Project page: this https URL

点击查看摘要

Abstract:Articulated object manipulation is essential for various real-world robotic tasks, yet generalizing across diverse objects remains a major challenge. A key to generalization lies in understanding functional parts (e.g., door handles and knobs), which indicate where and how to manipulate across diverse object categories and shapes. Previous works attempted to achieve generalization by introducing foundation features, while these features are mostly 2D-based and do not specifically consider functional parts. When lifting these 2D features to geometry-profound 3D space, challenges arise, such as long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information. To address these issues, we propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D feature with part awareness for generalizable articulated object manipulation. PA3FF is trained by 3D part proposals from a large-scale labeled dataset, via a contrastive learning formulation. Given point clouds as input, PA3FF predicts a continuous 3D feature field in a feedforward manner, where the distance between point features reflects the proximity of functional parts: points with similar features are more likely to belong to the same part. Building on this feature, we introduce the Part-Aware Diffusion Policy (PADP), an imitation learning framework aimed at enhancing sample efficiency and generalization for robotic manipulation. We evaluate PADP on several simulated and real-world tasks, demonstrating that PA3FF consistently outperforms a range of 2D and 3D representations in manipulation scenarios, including CLIP, DINOv2, and Grounded-SAM. Beyond imitation learning, PA3FF enables diverse downstream methods, including correspondence learning and segmentation tasks, making it a versatile foundation for robotic manipulation. Project page: this https URL

[CV-61] UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing

【速读】:该论文旨在解决现有基于扩散模型的图像编辑方法在多参考输入条件下难以保持一致性的问题,即不同参考图像之间的视觉特征和语义信息缺乏有效交互,导致生成结果出现不一致或冲突。解决方案的关键在于提出一种名为Sequence-Extended Latent Fusion (SELF) 的统一输入表示机制,通过动态地将多个参考图像序列化为一个连贯的潜在序列,并在训练阶段施加全局像素预算约束,使所有参考图像能在固定长度的潜在空间中协同对齐。在此基础上,论文进一步设计了两阶段训练框架:第一阶段采用监督微调(SFT)联合训练单图编辑与多图合成任务以建立稳健的生成先验,并引入渐进式序列长度训练策略逐步提升图像分辨率至2048²,从而增强细节保真度与跨参考一致性;第二阶段提出Multi-Source GRPO (MSGRPO),作为首个专为多参考图像生成设计的强化学习框架,用于优化模型在面对多重视觉约束时的协调能力,显著提升组合一致性。

链接: https://arxiv.org/abs/2602.14186
作者: Hongyang Wei,Bin Wen,Yancheng Long,Yankai Yang,Yuhang Hu,Tianke Zhang,Wei Chen,Haonan Fan,Kaiyu Jiang,Jiankang Chen,Changyi Liu,Kaiyu Tang,Haojie Ding,Xiao Yang,Jia Sun,Huaiqing Wang,Zhenyu Yang,Xinyu Wei,Xianglong He,Yangguang Li,Fan Yang,Tingting Gao,Lei Zhang,Guorui Zhou,Han Li
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技); Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳分校); CUHK MMLab (香港中文大学多媒体实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of 1024^2 , and are then gradually increased to 1536^2 and 2048^2 to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.

[CV-62] UniWeTok: An Unified Binary Tokenizer with Codebook Size mathit2128 for Unified Multimodal Large Language Model

【速读】:该论文旨在解决统一多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉表示难以同时满足高保真重建、复杂语义提取和生成适用性这三项相互冲突目标的问题。现有视觉分词器通常无法在单一框架内兼顾上述需求。解决方案的关键在于提出UniWeTok——一种基于超大规模二进制码本(21282^{128})的统一离散分词器,并引入预-后蒸馏(Pre-Post Distillation)与生成感知先验(Generative-Aware Prior)以增强离散token的语义提取能力和生成适配性;同时设计卷积-注意力混合架构与SigLu激活函数,有效稳定语义蒸馏过程并缓解熵损失与承诺损失之间的优化冲突。此外,采用三阶段训练策略提升模型对不同图像分辨率及人脸、文本等感知敏感场景的适应能力,最终在ImageNet和通用领域任务上均实现显著性能优势,且训练计算成本大幅降低。

链接: https://arxiv.org/abs/2602.14178
作者: Shaobin Zhuang,Yuang Ai,Jiaming Han,Weijia Mao,Xiaohui Li,Fangyikang Wang,Xiao Wang,Yan Li,Shanchuan Lin,Kun Xu,Zhenheng Yang,Huaibo Huang,Xiangyu Yue,Hao Chen,Yali Wang
机构: ByteDance; Shanghai Jiao Tong University (上海交通大学); MMLab, The Chinese University of Hong Kong (香港中文大学多媒体实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); National University of Singapore (新加坡国立大学); Zhejiang University (浙江大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 9 figures, 33 tables

点击查看摘要

Abstract:Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ( \mathit2^128 ). For training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok’s adaptability cross various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring a remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). On general-domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizer and MLLM.

[CV-63] owards Spatial Transcriptomics-driven Pathology Foundation Models

【速读】:该论文旨在解决病理图像分析中视觉表征与分子信息之间耦合不足的问题,即如何利用空间转录组学(Spatial Transcriptomics, ST)提供的局部基因表达数据来增强病理图像的视觉特征学习,从而提升模型在分子状态预测、通路活性识别及治疗响应评估等任务中的性能。其解决方案的关键在于提出一种名为“空间表达对齐学习”(Spatial Expression-Aligned Learning, SEAL)的自监督学习框架,该框架通过将局部分子信息注入病理视觉编码器,在不从头训练新模型的前提下,以参数高效的方式微调现有的病理基础模型(pathology foundation models)。SEAL基于超过70万对基因表达点与组织区域的配对样本进行训练,显著提升了多种滑片级和切片块级下游任务的表现,并展现出良好的域泛化能力与跨模态检索等新功能,证明了引入局部分子监督是改进视觉表示并扩展多模态应用的有效途径。

链接: https://arxiv.org/abs/2602.14177
作者: Konstantin Hemker,Andrew H. Song,Cristina Almagro-Pérez,Guillaume Jaume,Sophia J. Wagner,Anurag Vaidya,Nikola Simidjievski,Mateja Jamnik,Faisal Mahmood
机构: 1. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); 2. University of Oxford (牛津大学); 3. DeepMind (DeepMind); 4. Google (谷歌); 5. University of Cambridge (剑桥大学); 6. ETH Zurich (苏黎世联邦理工学院); 7. University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene expression spot-tissue region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval. Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.

[CV-64] When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance

【速读】:该论文旨在解决文本驱动的图像与视频编辑问题,即将其建模为图像修复(inpainting)任务,其中被遮挡区域需在保持与已观察内容一致的同时,满足文本编辑提示。传统方法依赖于昂贵的向量-雅可比乘积(vector–Jacobian product, VJP)计算来近似难以求解的引导项,限制了实际应用。本文的关键解决方案在于基于Moufad等人(2025)的工作,提出了一种无需VJP计算的近似方法,并通过大规模图像和视频编辑基准对其进行了扩展验证,结果表明仅使用测试时引导(test-time guidance)即可达到甚至超越训练阶段方法的性能。

链接: https://arxiv.org/abs/2602.14157
作者: Ahmed Ghorbel,Badr Moufad,Navid Bagheri Shouraki,Alain Oliviero Durmus,Thomas Hirtz,Eric Moulines,Jimmy Olsson,Yazid Janati
机构: CMAP, Ecole Polytechnique (巴黎综合理工学院); Institute of Foundation Models; MBZUAI; EPITA; Sorbonne University (索邦大学); Lagrange Mathematics and Computing Research Center; EPITA Research Lab; KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector–Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.

[CV-65] ARport: An Augmented Reality System for Markerless Image-Guided Port Placement in Robotic Surgery

【速读】:该论文旨在解决机器人辅助手术中端口(trocar)放置的精准性问题,即如何将术前规划的端口布局高效、准确地映射到患者体表,以优化术野视野和器械操作空间。其解决方案的关键在于提出了一种名为ARport的增强现实(Augmented Reality, AR)系统,该系统基于光学透视式头戴显示器(Optical See-Through Head-Mounted Display, OST-HMD),通过RGB、深度和位姿数据重建术区场景,利用基础模型提取患者体表,并采用基于表面的无标记配准方法,实现术前解剖模型与患者体表的自动对齐,从而在术中直接可视化预设的端口位置,无需外部传感器或标记物,显著简化流程并提升临床集成度。

链接: https://arxiv.org/abs/2602.14153
作者: Zheng Han,Zixin Yang,Yonghao Long,Lin Zhang,Peter Kazanzides,Qi Dou
机构: The Chinese University of Hong Kong (香港中文大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Precise port placement is a critical step in robot-assisted surgery, where port configuration influences both visual access to the operative field and instrument maneuverability. To bridge the gap between preoperative planning and intraoperative execution, we present ARport, an augmented reality (AR) system that automatically maps pre-planned trocar layouts onto the patient’s body surface, providing intuitive spatial guidance during surgical preparation. Methods: ARport, implemented on an optical see-through head-mounted display (OST-HMD), operates without any external sensors or markers, simplifying setup and enhancing workflow integration. It reconstructs the operative scene from RGB, depth, and pose data captured by the OST-HMD, extracts the patient’s body surface using a foundation model, and performs surface-based markerless registration to align preoperative anatomical models to the extracted patient’s body surface, enabling in-situ visualization of planned trocar layouts. A demonstration video illustrating the overall workflow is available online. Results: In full-scale human-phantom experiments, ARport accurately overlaid pre-planned trocar sites onto the physical phantom, achieving consistent spatial correspondence between virtual plans and real anatomy. Conclusion: ARport provides a fully marker-free and hardware-minimal solution for visualizing preoperative trocar plans directly on the patient’s body surface. The system facilitates efficient intraoperative setup and demonstrates potential for seamless integration into routine clinical workflows.

[CV-66] LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

【速读】:该论文旨在解决当前生成式 AI(Generative AI)模型在多模态推理任务中表现不足的问题,尤其是现有基于扩散语言模型(Diffusion Language Models, dLLMs)的推理能力受限于任务特定强化学习(task-specific reinforcement learning)导致的泛化性差与训练效率低的问题。解决方案的关键在于提出 LaViDa-R1,一个统一的多模态通用推理扩散语言模型,其核心创新是设计了一种新颖的统一后训练框架(unified post-training framework),无缝融合监督微调(Supervised Fine-Tuning, SFT)与多任务强化学习(Multi-task Reinforcement Learning, RL),并引入答案强制(answer-forcing)、树搜索(tree search)和互补似然估计(complementary likelihood estimation)等关键技术,从而显著提升模型在视觉数学推理、高阶语义定位和图像编辑等复杂多模态任务上的有效性与可扩展性。

链接: https://arxiv.org/abs/2602.14147
作者: Shufan Li,Yuchen Zhu,Jiuxiang Gu,Kangning Liu,Zhe Lin,Yongxin Chen,Molei Tao,Aditya Grover,Jason Kuen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 11 figures

点击查看摘要

Abstract:Diffusion language models (dLLMs) recently emerged as a promising alternative to auto-regressive LLMs. The latest works further extended it to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1’s strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.

[CV-67] Detection of On-Ground Chestnuts Using Artificial Intelligence Toward Automated Picking

【速读】:该论文旨在解决小规模栗子种植者在机械化采收过程中面临的成本高、非选择性及坚果损伤等问题,核心在于开发一种低成本、基于视觉引导的自动化采收技术。其关键解决方案是构建了一个包含6524个标注栗子样本的果园地面图像数据集,并系统评估了29种先进的实时目标检测模型(包括YOLO系列和RT-DETR系列),最终发现YOLOv12m在mAP@0.5上达到95.1%的最佳精度,且整体推理效率优于RT-DETR模型,证明YOLO类模型更适合部署于车载设备中实现高效可靠的栗子检测。

链接: https://arxiv.org/abs/2602.14140
作者: Kaixuan Fang,Yuzhen Lu,Xinyang Mu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on-ground objects, which have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 in the YOLO (v11-13) and 15 in the RT-DETR (v1-v4) families at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best mAP@0.5 of 95.1% among all the evaluated models, while the RT-DETRv2-R101 was the most accurate variant among RT-DETR models, with mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in terms of both detection accuracy and inference, making them better suited for on-board deployment. Both the dataset and software programs in this study have been made publicly available at this https URL.

[CV-68] DenseMLLM : Standard Multimodal LLM s are Intrinsic Dense Predictors

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度密集预测任务(如语义分割和深度估计)中因需引入复杂任务特定解码器而导致的架构碎片化问题,从而限制了其通用性和实用性。解决方案的关键在于提出一种名为DenseMLLM的新模型,其基于标准MLLM架构,并通过一种新颖的视觉标记监督策略实现对多个标签和任务的有效支持,无需额外的任务定制解码器,从而在保持模型简洁性的同时实现了与专用模型相当甚至更优的密集预测性能。

链接: https://arxiv.org/abs/2602.14134
作者: Yi Li,Hongze Shen,Lexiang Tang,Xin Li,Xinpeng Ding,Yinsong Liu,Deqiang Jiang,Xing Sun,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 9 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization.

[CV-69] EgoSound: Benchmarking Sound Understanding in Egocentric Videos

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解第一人称视角声音(egocentric sound)方面存在的显著不足,尤其是在空间定位、因果推理和跨模态关联等细粒度任务上的能力缺失。现有模型虽在视觉-语言理解上取得进展,但对声音所蕴含的空间布局、屏幕外事件及因果关系的感知仍较为薄弱,限制了其在真实场景中实现类人多感官推理的能力。解决方案的关键在于构建首个系统性评估第一人称声音理解能力的基准——EgoSound,该基准整合来自Ego4D和EgoBlind的数据,涵盖视觉依赖与听觉依赖的体验,并定义了一个包含七项任务的分类体系,覆盖内在声音感知、空间定位、因果推断与跨模态推理等维度;同时通过多阶段自动化生成流程构建了包含900个视频和7315个验证问答对的数据集,为推动MLLMs在多感官第一人称智能方向的发展提供了挑战性基准和研究基础。

链接: https://arxiv.org/abs/2602.14122
作者: Bingwen Zhu,Yuqian Fu,Qiaole Dong,Guolei Sun,Tianwen Qian,Yuzheng Wu,Danda Pani Paudel,Xiangyang Xue,Yanwei Fu
机构: Fudan University (复旦大学); Shanghai Innovation Institute; INSAIT; Nankai University (南开大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.

[CV-70] GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction

【速读】:该论文旨在解决大规模重建模型(Large Reconstruction Models, LRM)在单图像三维重建中常见的几何不一致性和细节错位问题,这些问题限制了重建结果的保真度。其解决方案的关键在于提出GeoFusionLRM框架,该框架通过利用模型自身预测的法向量(normal)和深度(depth)图作为几何反馈信号,经由专用的Transformer与融合模块进行自校正,从而在无需额外监督或外部信号的情况下提升重建网格与输入视图之间的对齐精度和结构一致性。

链接: https://arxiv.org/abs/2602.14119
作者: Ahmet Burak Yildirim,Tuna Saygin,Duygu Ceylan,Aysegul Dundar
机构: Bilkent University (比尔肯大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model’s own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.

[CV-71] SemanticFeels: Semantic Labeling during In-Hand Manipulation

【速读】:该论文旨在解决机器人在抓握操作中同时感知物体几何形状与材质属性的难题,以实现更智能和自适应的交互行为。其核心挑战在于如何融合视觉与触觉信息,从而精确识别物体表面不同材质区域并重建其连续分布。解决方案的关键在于提出SemanticFeels框架,该框架将语义标签嵌入到增强型符号距离场(augmented signed distance field, SDF)网络中,利用经过微调的EfficientNet-B0卷积神经网络(CNN)从Digit高分辨率触觉数据中提取局部材质预测,并将其映射至SDF表示中,实现几何结构与连续材质区域的联合建模。实验表明,该方法在单材料和多材料物体上均表现出高精度的材质匹配能力,平均准确率达79.87%。

链接: https://arxiv.org/abs/2602.14099
作者: Anas Al Shikh Khalil,Haozhi Qi,Roberto Calandra
机构: TU Dresden (德累斯顿工业大学); UC Berkely (加州大学伯克利分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:As robots become increasingly integrated into everyday tasks, their ability to perceive both the shape and properties of objects during in-hand manipulation becomes critical for adaptive and intelligent behavior. We present SemanticFeels, an extension of the NeuralFeels framework that integrates semantic labeling with neural implicit shape representation, from vision and touch. To illustrate its application, we focus on material classification: high-resolution Digit tactile readings are processed by a fine-tuned EfficientNet-B0 convolutional neural network (CNN) to generate local material predictions, which are then embedded into an augmented signed distance field (SDF) network that jointly predicts geometry and continuous material regions. Experimental results show that the system achieves a high correspondence between predicted and actual materials on both single- and multi-material objects, with an average matching accuracy of 79.87% across multiple manipulation trials on a multi-material object.

[CV-72] ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLM s for Image Forgery Detection and Localization

【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像伪造检测与定位任务中因依赖文本中心的思维链(Chain-of-Thought, CoT)范式而导致的幻觉问题。由于语言模态难以捕捉细微的像素级不一致,模型常无法准确识别不可见的低层次篡改痕迹。解决方案的关键在于提出 ForgeryVCR 框架,其核心是引入一个 forensic 工具箱,通过视觉中心推理(Visual-Centric Reasoning)将隐性的篡改痕迹转化为显式的视觉中间表示,并采用一种战略工具学习(Strategic Tool Learning)后训练范式,包括基于收益驱动的轨迹构建用于监督微调(SFT),以及由工具效用奖励引导的强化学习(Reinforcement Learning, RL)优化,使模型能主动决策并自发调用多视角推理路径,如局部放大以进行细粒度检查,以及分析压缩历史、噪声残差和频域中的隐性不一致性,从而实现高精度检测与定位,且具备优异的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2602.14098
作者: Youqi Wang,Shen Chen,Haowei Wang,Rongxuan Peng,Taiping Yao,Shunquan Tan,Changsheng Chen,Bin Li,Shouhong Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at this https URL.

[CV-73] Bidirectional Temporal Dynamics Modeling for EEG-based Driving Fatigue Recognition

【速读】:该论文旨在解决基于脑电图(EEG)的驾驶疲劳识别中因强非平稳性和神经动态不对称性导致的性能瓶颈问题。其解决方案的关键在于提出DeltaGateNet框架,通过引入双向Delta模块显式建模一阶时间差分的正负分量,从而捕捉神经激活与抑制的不对称模式;同时设计门控时序卷积模块,利用深度可分离时序卷积和残差学习在保留通道特异性的同时增强时序表征鲁棒性,实现对EEG信号中双向时间动态的精确建模。

链接: https://arxiv.org/abs/2602.14071
作者: YipTin Po,Jianming Wang,Yutao Miao,Jiayan Zhang,Yunxu Zhao,Xiaomin Ouyang,Zhihong Li,Nevin L. Zhang
机构: 未知
类目: Other Computer Science (cs.OH); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Driving fatigue is a major contributor to traffic accidents and poses a serious threat to road safety. Electroencephalography (EEG) provides a direct measurement of neural activity, yet EEG-based fatigue recognition is hindered by strong non-stationarity and asymmetric neural dynamics. To address these challenges, we propose DeltaGateNet, a novel framework that explicitly captures Bidirectional temporal dynamics for EEG-based driving fatigue recognition. Our key idea is to introduce a Bidirectional Delta module that decomposes first-order temporal differences into positive and negative components, enabling explicit modeling of asymmetric neural activation and suppression patterns. Furthermore, we design a Gated Temporal Convolution module to capture long-term temporal dependencies for each EEG channel using depthwise temporal convolutions and residual learning, preserving channel-wise specificity while enhancing temporal representation robustness. Extensive experiments conducted under both intra-subject and inter-subject evaluation settings on the public SEED-VIG and SADT driving fatigue datasets demonstrate that DeltaGateNet consistently outperforms existing methods. On SEED-VIG, DeltaGateNet achieves an intra-subject accuracy of 81.89% and an inter-subject accuracy of 55.55%. On the balanced SADT 2022 dataset, it attains intra-subject and inter-subject accuracies of 96.81% and 83.21%, respectively, while on the unbalanced SADT 2952 dataset, it achieves 96.84% intra-subject and 84.49% inter-subject accuracy. These results indicate that explicitly modeling Bidirectional temporal dynamics yields robust and generalizable performance under varying subject and class-distribution conditions.

[CV-74] CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

【速读】:该论文旨在解决图像编辑中因模型仅关注目标对象或区域而导致非编辑区域出现 unintended changes(未预期变化)的问题,即内容一致性(content consistency)不足的问题。其解决方案的关键在于提出一种基于区域正则化强化学习的后训练框架 CoCoEdit:首先构建高质量的训练数据集并引入像素级相似性奖励以补充多模态大语言模型(Multimodal Large Language Model, MLLM)提供的奖励信号,从而同时保障编辑质量和内容一致性;其次设计一种基于区域的正则化项,针对高奖励样本保留非编辑区域、低奖励样本增强编辑效果,克服传统奖励机制对空间信息的无感知特性,最终在多个基准上显著提升内容一致性指标(如 PSNR/SSIM)和人工主观评分。

链接: https://arxiv.org/abs/2602.14068
作者: Yuhui Wu,Chenxi Xie,Ruibin Li,Liyi Chen,Qiaosi Yi,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.

[CV-75] ProAct: A Dual-System Framework for Proactive Embodied Social Agents

【速读】:该论文旨在解决当前具身社交代理(Embodied Social Agents)在实时交互中难以实现主动社交行为的问题。现有系统多为反应式,仅基于短期感官输入进行响应,缺乏对长期上下文的推理与意图推断能力,而这种主动行为需要更长时间尺度的决策过程,与实时交互的严格延迟预算存在冲突。解决方案的关键在于提出一种双系统框架 ProAct,通过解耦低延迟的行为系统(Behavioral System)用于流式多模态交互,与较慢的认知系统(Cognitive System)执行长时程社会推理并生成高层主动意图,从而协调时间尺度矛盾;进一步引入基于 ControlNet 的流式流匹配模型(streaming flow-matching model),将主动意图异步注入动作流中,实现流畅的反应式与主动手势之间的无缝切换,显著提升交互中的主动性、社会临场感和参与度。

链接: https://arxiv.org/abs/2602.14048
作者: Zeyi Zhang,Zixi Kang,Ruijie Zhao,Yusen Feng,Biao Jiang,Libin Liu
机构: Peking University (北京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL

点击查看摘要

Abstract:Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction. We present \emphProAct, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency \emphBehavioral System for streaming multimodal interaction from a slower \emphCognitive System which performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.

[CV-76] Restoration Adaptation for Semantic Segmentation on Low Quality Images

【速读】:该论文旨在解决低质量(Low-Quality, LQ)图像在语义分割任务中性能下降的问题,即现有图像恢复(Real-IR)模型通常仅关注像素级保真度而忽略任务相关的语义信息,导致其直接用于下游视觉任务时效果有限;同时,传统语义分割模型在高质图像上训练后对真实世界退化缺乏鲁棒性。解决方案的关键在于提出 Restoration Adaptation for Semantic Segmentation (RASS) 框架,其中包含两个核心组件:一是 Semantic-Constrained Restoration (SCR) 模型,通过将分割掩码与跨注意力图对齐,引导恢复过程保留语义一致性;二是基于 LoRA 的模块融合与任务特定微调机制,实现语义恢复知识向分割网络的有效迁移,从而显著提升模型在 LQ 图像上的分割性能。

链接: https://arxiv.org/abs/2602.14042
作者: Kai Guan,Rongyuan Wu,Shuai Li,Wentao Zhu,Wenjun Zeng,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); Eastern Institute of Technology, Ningbo (宁波东方理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In real-world scenarios, the performance of semantic segmentation often deteriorates when processing low-quality (LQ) images, which may lack clear semantic structures and high-frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real-world image restoration (Real-IR) models primarily focus on pixel-level fidelity and often fail to recover task-relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high-quality data lack robustness under real-world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high-quality semantic segmentation on the LQ images directly. Specifically, we first propose a Semantic-Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross-attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA-based module merging and task-specific fine-tuning, thereby enhancing the model’s robustness to LQ images. To validate the effectiveness of our framework, we construct a real-world LQ image segmentation dataset with high-quality annotations, and conduct extensive experiments on both synthetic and real-world LQ benchmarks. The results show that SCR and RASS significantly outperform state-of-the-art methods in segmentation and restoration tasks. Code, models, and datasets will be available at this https URL.

[CV-77] BitDance: Scaling Autoregressive Generative Models with Binary Tokens

【速读】:该论文旨在解决自回归(Autoregressive, AR)图像生成模型在高效性和表达能力之间的权衡问题,尤其是传统基于码本索引(codebook indices)的AR图像生成方法难以实现高分辨率、高质量图像生成且推理速度慢的问题。其关键解决方案在于提出BitDance,一种通过预测二进制视觉令牌(binary visual tokens)而非传统码本索引来构建高熵离散表示的架构:每个令牌可表达高达 22562^{256} 种状态,显著提升表达能力;同时引入二进制扩散头(binary diffusion head),利用连续空间扩散机制替代标准分类 softmax,从而有效采样超大令牌空间;进一步提出“下一补丁扩散”(next-patch diffusion)解码策略,实现多令牌并行预测,大幅加速推理,在ImageNet 256×256上达到FID=1.24,优于现有AR模型,并在1024×1024图像生成中实现超过30倍的速度提升。

链接: https://arxiv.org/abs/2602.14041
作者: Yuang Ai,Jiaming Han,Shaobin Zhuang,Weijia Mao,Xuefeng Hu,Ziyan Yang,Zhenheng Yang,Huaibo Huang,Xiangyu Yue,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code and models: this https URL

点击查看摘要

Abstract:We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to 2^256 states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: this https URL.

[CV-78] Explainability-Inspired Layer-Wise Pruning of Deep Neural Networks for Efficient Object Detection

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在资源受限平台部署时,因模型复杂度高而导致的效率瓶颈问题。传统基于权重幅度的剪枝方法无法准确反映网络组件对任务性能的实际功能贡献,从而限制了压缩效果与性能之间的平衡。其解决方案的关键在于提出一种受可解释性启发的、逐层剪枝框架,利用类SHAP(Shapley Additive Explanations)的梯度-激活归因机制来估计各层重要性,从而构建一个数据驱动的、更贴近真实功能贡献的层重要性评估指标,替代静态权重幅度依赖的方法。这一策略显著提升了剪枝决策的合理性,在多个主流目标检测架构上实现了更好的精度-效率权衡,尤其在ShuffleNetV2和RetinaNet等模型中表现出优于L1范数剪枝的性能稳定性与加速效果。

链接: https://arxiv.org/abs/2602.14040
作者: Abhinav Shukla,Nachiket Tapas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved remarkable success in object detection tasks, but their increasing complexity poses significant challenges for deployment on resource-constrained platforms. While model compression techniques such as pruning have emerged as essential tools, traditional magnitude-based pruning methods do not necessarily align with the true functional contribution of network components to task-specific performance. In this work, we present an explainability-inspired, layer-wise pruning framework tailored for efficient object detection. Our approach leverages a SHAP-inspired gradient–activation attribution to estimate layer importance, providing a data-driven proxy for functional contribution rather than relying solely on static weight magnitudes. We conduct comprehensive experiments across diverse object detection architectures, including ResNet-50, MobileNetV2, ShuffleNetV2, Faster R-CNN, RetinaNet, and YOLOv8, evaluating performance on the Microsoft COCO 2017 validation set. The results show that the proposed attribution-inspired pruning consistently identifies different layers as least important compared to L1-norm-based methods, leading to improved accuracy–efficiency trade-offs. Notably, for ShuffleNetV2, our method yields a 10% empirical increase in inference speed, whereas L1-pruning degrades performance by 13.7%. For RetinaNet, the proposed approach preserves the baseline mAP (0.151) with negligible impact on inference speed, while L1-pruning incurs a 1.3% mAP drop for a 6.2% speed increase. These findings highlight the importance of data-driven layer importance assessment and demonstrate that explainability-inspired compression offers a principled direction for deploying deep neural networks on edge and resource-constrained platforms while preserving both performance and interpretability.

[CV-79] rain Short Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

【速读】:该论文旨在解决自回归视频扩散模型在长视频生成中因外推失败导致的严重时序退化问题,其根源在于3D位置嵌入的频谱偏差(spectral bias)以及噪声采样过程中缺乏动态先验信息。解决方案的关键在于提出一个无需训练的推理时框架FLEX,其核心创新包括:频率感知的旋转位置编码调制(Frequency-aware RoPE Modulation),用于自适应插值低频成分并外推高频成分以保持多尺度时序判别能力;抗相位噪声采样(Antiphase Noise Sampling, ANS),通过注入高频动态先验增强时序一致性;以及仅推理注意力锚点(Inference-only Attention Sink),用于稳定全局结构。该方法在VBench基准上实现6倍外推(30秒)显著优于现有最先进模型,并在12倍尺度(60秒)达到微调长视频基线性能,且可作为即插即用模块扩展至如LongLive等模型,支持4分钟级稳定动态视频生成。

链接: https://arxiv.org/abs/2602.14027
作者: Jia Li,Xiaomeng Fu,Xurui Peng,Weifeng Chen,Youwei Zheng,Tianyu Zhao,Jiexi Wang,Fangmin Chen,Xing Wang,Hayden Kwok-Hay So
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 15 figures

点击查看摘要

Abstract:Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the \textitspectral bias of 3D positional embeddings and the lack of \textitdynamic priors in noise sampling. To address these issues, we propose \textbfFLEX (\textbfFrequency-aware \textbfLength \textbfEXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6\times extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12\times scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at \hrefthis https URLthis https URL.

[CV-80] Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow

【速读】:该论文旨在解决动态三维场景重建与跟踪中的核心挑战,即如何统一建模几何结构、物体运动与相机运动之间的关系。传统方法通常将几何与运动解耦处理:多视角重建假设场景静态,而动态跟踪依赖显式的相机位姿估计或独立的运动模型,难以实现协同优化。其解决方案的关键在于提出Flow4R框架,以相机空间场景流(camera-space scene flow)为核心表示,通过Vision Transformer从两视图输入中联合预测像素级的最小属性集——3D点位置、场景流、位姿权重和置信度。该流中心范式允许在单次前向传播中对局部几何和双向运动进行对称推断,无需显式位姿回归器或光束法平差,从而实现了高效的时空场景理解。

链接: https://arxiv.org/abs/2602.14021
作者: Shenhan Qian,Ganlin Zhang,Shangzhe Wu,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); MCML; University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set-3D point position, scene flow, pose weight, and confidence-from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-central representation for spatiotemporal scene understanding.

[CV-81] A Deployment-Friendly Foundational Framework for Efficient Computational Pathology

【速读】:该论文旨在解决病理基础模型(Pathology Foundation Models, PFMs)在实际临床部署中面临的计算资源消耗过高问题,尤其是针对高分辨率全切片图像(gigapixel whole slide images)处理时的效率瓶颈。其核心解决方案在于提出 LitePath 框架,该框架通过两个关键技术实现高效优化:一是 LiteFM,一个从三个大型 PFMs(Virchow2、H-Optimus-1 和 UNI2)蒸馏得到的紧凑模型,使用 1.9 亿个图像块训练而成;二是自适应图像块选择器(Adaptive Patch Selector, APS),用于任务驱动的轻量级图像块筛选。这一设计使模型参数减少 28 倍、浮点运算量(FLOPs)降低 403.5 倍,同时在多个器官和任务上保持接近顶级性能(平均 AUC 达到 Virchow2 的 99.71%),并显著提升部署可行性与能效比,最终通过提出的 Deployability Score(D-Score)量化了精度与效率的平衡,证明 LitePath 是当前最高效的病理 AI 分析方案之一。

链接: https://arxiv.org/abs/2602.14010
作者: Yu Cai,Cheng Jin,Jiabo Ma,Fengtao Zhou,Yingxue Xu,Zhengrui Guo,Yihui Wang,Zhengyu Zhang,Ling Liang,Yonghao Tan,Pingcheng Dong,Du Cai,On Ki Tang,Chenglong Zhao,Xi Wang,Can Yang,Yali Xu,Jing Cui,Zhenhui Li,Ronald Cheong Kin Chan,Yueping Liu,Feng Gao,Xiuming Zhang,Li Liang,Hao Chen,Kwang-Ting Cheng
机构: The Hong Kong University of Science and Technology (香港科技大学); Southern Medical University (南方医科大学); Sun Yat-sen University (中山大学); The Chinese University of Hong Kong (香港中文大学); Shandong First Medical University (山东第一医科大学); Kunming Medical University (昆明医科大学); Zhejiang University (浙江大学); Jinfeng Laboratory (金风实验室); Guangdong Province Key Laboratory of Molecular Tumor Pathology (广东省分子肿瘤病理重点实验室); Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases (广东省结直肠与盆底疾病重点实验室); HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute (香港科技大学深港协同创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pathology foundation models (PFMs) have enabled robust generalization in computational pathology through large-scale datasets and expansive architectures, but their substantial computational cost, particularly for gigapixel whole slide images, limits clinical accessibility and scalability. Here, we present LitePath, a deployment-friendly foundational framework designed to mitigate model over-parameterization and patch level redundancy. LitePath integrates LiteFM, a compact model distilled from three large PFMs (Virchow2, H-Optimus-1 and UNI2) using 190 million patches, and the Adaptive Patch Selector (APS), a lightweight component for task-specific patch selection. The framework reduces model parameters by 28x and lowers FLOPs by 403.5x relative to Virchow2, enabling deployment on low-power edge hardware such as the NVIDIA Jetson Orin Nano Super. On this device, LitePath processes 208 slides per hour, 104.5x faster than Virchow2, and consumes 0.36 kWh per 3,000 slides, 171x lower than Virchow2 on an RTX3090 GPU. We validated accuracy using 37 cohorts across four organs and 26 tasks (26 internal, 9 external, and 2 prospective), comprising 15,672 slides from 9,808 patients disjoint from the pretraining data. LitePath ranks second among 19 evaluated models and outperforms larger models including H-Optimus-1, mSTAR, UNI2 and GPFM, while retaining 99.71% of the AUC of Virchow2 on average. To quantify the balance between accuracy and efficiency, we propose the Deployability Score (D-Score), defined as the weighted geometric mean of normalized AUC and normalized FLOP, where LitePath achieves the highest value, surpassing Virchow2 by 10.64%. These results demonstrate that LitePath enables rapid, cost-effective and energy-efficient pathology image analysis on accessible hardware while maintaining accuracy comparable to state-of-the-art PFMs and reducing the carbon footprint of AI deployment.

[CV-82] Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization

【速读】:该论文旨在解决个性化文本到图像生成中身份特征污染非面部区域(如背景和光照)的问题,从而导致文本一致性下降。现有无需微调的方法通常采用空间均匀的视觉注入策略(Spatially Uniform Visual Injection),无法区分人脸相关与上下文无关区域,造成身份信息扩散至不相关区域。解决方案的关键在于提出SpatialID框架,其核心创新包括:利用交叉注意力响应提取的空间掩码生成器(Spatial Mask Extractor)实现身份注入的精细化空间解耦;以及引入时序-空间调度策略(Temporal-Spatial Scheduling),动态调整空间约束机制——从高斯先验逐步过渡到基于注意力的掩码,并进一步自适应松弛,以匹配扩散模型的生成动力学。此方法在IBench基准上显著提升了文本一致性(CLIP-T: 0.281)、视觉一致性(CLIP-I: 0.827)和图像质量(IQ: 0.523),同时有效抑制了背景污染并保持身份保真度。

链接: https://arxiv.org/abs/2602.13994
作者: Guandong Li,Mengxia Ye
机构: iFLYTEK(科大讯飞); Aegon THTF
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints - transitioning from Gaussian priors to attention-based masks and adaptive relaxation - to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.

[CV-83] Elastic Diffusion Transformer

【速读】:该论文旨在解决扩散 Transformer(Diffusion Transformer, DiT)在生成过程中计算成本过高、传统加速方法(如剪枝和知识蒸馏)受限于固定计算资源导致加速效果有限且生成质量下降的问题。解决方案的关键在于提出弹性扩散 Transformer(Elastic Diffusion Transformer, E-DiT),其核心创新是引入轻量级路由机制(router),能够根据输入潜在表示动态识别每样本的计算稀疏性,并自适应决定是否跳过当前模块;若不跳过,则进一步预测最优的多层感知机(MLP)宽度压缩比例,从而实现按需计算。此外,E-DiT设计了基于块级别的特征缓存机制,在无需重新训练的情况下消除冗余计算,显著提升推理效率,同时保持生成质量几乎不变。

链接: https://arxiv.org/abs/2602.13993
作者: Jiangshan Wang,Zeqiang Lai,Jiarui Chen,Jiayi Guo,Hang Guo,Xiu Li,Xiangyu Yue,Chunchao Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose \textbfElastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT that effectively improves efficiency while maintaining generation quality. Specifically, we observe that the generative process of DiT exhibits substantial sparsity (i.e., some computations can be skipped with minimal impact on quality), and this sparsity varies significantly across samples. Motivated by this observation, E-DiT equips each DiT block with a lightweight router that dynamically identifies sample-dependent sparsity from the input latent. Each router adaptively determines whether the corresponding block can be skipped. If the block is not skipped, the router then predicts the optimal MLP width reduction ratio within the block. During inference, we further introduce a block-level feature caching mechanism that leverages router predictions to eliminate redundant computations in a training-free manner. Extensive experiments across 2D image (Qwen-Image and FLUX) and 3D asset (Hunyuan3D-3.0) demonstrate the effectiveness of E-DiT, achieving up to \sim 2 \times speedup with negligible loss in generation quality. Code will be available at this https URL.

[CV-84] Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology ICLR2026

【速读】:该论文旨在解决当前计算病理学中多模态学习模型依赖视觉与语言模态所带来的表征瓶颈问题,即语言模态缺乏分子特异性且提供的病理监督有限,难以有效指导图像表征学习。其解决方案的关键在于提出STAMP框架,通过引入空间分辨的基因表达谱(spatially-resolved gene expression profiles)作为分子引导信号,实现病理图像与转录组数据的联合嵌入。该方法利用自监督、基因引导的训练策略,结合空间上下文和多尺度信息,显著提升了模型的性能与泛化能力,从而推动了计算病理学中多模态学习的发展。

链接: https://arxiv.org/abs/2602.13944
作者: Minghao Han,Dingkang Yang,Linhao Qu,Zizhi Chen,Gang Li,Han Wang,Jiacong Wang,Lihua Zhang
机构: Fudan University (复旦大学); Fysics Intelligence Technologies Co., Ltd. (Fysics AI); Harvard Medical School (哈佛医学院); Tencent Youtu Lab (腾讯优图实验室); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICLR 2026, 34 pages, 10 figures, 7tables

点击查看摘要

Abstract:Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: this https URL.

[CV-85] MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction

【速读】:该论文旨在解决乳腺癌筛查中传统“一刀切”间隔策略的局限性,提出基于深度学习(Deep Learning, DL)的个性化风险预测模型,以实现更精准的3年乳腺癌风险评估。其核心挑战在于如何在降低输入图像分辨率的前提下仍保持与现有先进模型(如Mirai)相当甚至更优的预测性能。解决方案的关键在于:1)融合互补的归纳偏置(inductive biases),即结合卷积神经网络(CNN)的空间局部特征提取能力与视觉Transformer(ViT)的全局建模优势;2)显式引入对侧乳房不对称性的建模机制——通过设计BilateralMixer模块聚合双侧乳腺信息,从而增强对早期病变的敏感性。实验表明,MamaDino在仅使用约1/13输入像素的情况下,在内部和外部测试集上均达到与Mirai相当的AUC表现(分别为0.736 vs 0.713 和 0.677 vs 0.666),且在不同人群和设备条件下具有鲁棒性,验证了结构化低分辨率图像处理的有效性。

链接: https://arxiv.org/abs/2602.13930
作者: Ruggiero Santeramo,Igor Zubarev,Florian Jug
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages

点击查看摘要

Abstract:Breast cancer screening programmes increasingly seek to move from one-size-fits-all interval to risk-adapted and personalized strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms, indicating that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512x512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At breast-level, MamaDino matches Mirai on both internal and external tests while using ~13x fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms. Comments: 16 pages Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2602.13930 [cs.CV] (or arXiv:2602.13930v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.13930 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ruggiero Santeramo [view email] [v1] Sat, 14 Feb 2026 23:56:22 UTC (1,329 KB)

[CV-86] High-fidelity 3D reconstruction for planetary exploration

【速读】:该论文旨在解决行星探测中自主机器人系统在缺乏全球定位和实时地球通信条件下,如何实现高效、精确的环境重建与空间感知问题。传统基于结构光(Structure-from-Motion, SfM)和同时定位与建图(Simultaneous Localization and Mapping, SLAM)的方法虽能提供几何一致性,但在低纹理、非结构化地形中难以捕捉辐射度细节且扩展性差。解决方案的关键在于引入基于辐射场(radiance field)的方法,特别是神经辐射场(Neural Radiance Fields, NeRF)与高斯泼溅(Gaussian Splatting),构建一个统一、自动化的环境重建流程;该流程融合Nerfstudio与COLMAP框架,并兼容ROS2工作流,可直接处理来自robag记录的原始火星车数据,从而生成密集、逼真且具有度量一致性的三维表示,显著提升行星类环境下自主系统的感知与规划能力。

链接: https://arxiv.org/abs/2602.13909
作者: Alfonso Martínez-Petersen,Levin Gerdes,David Rodríguez-Martínez,C. J. Pérez-del-Pulgar
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures, conference paper

点击查看摘要

Abstract:Planetary exploration increasingly relies on autonomous robotic systems capable of perceiving, interpreting, and reconstructing their surroundings in the absence of global positioning or real-time communication with Earth. Rovers operating on planetary surfaces must navigate under sever environmental constraints, limited visual redundancy, and communication delays, making onboard spatial awareness and visual localization key components for mission success. Traditional techniques based on Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM) provide geometric consistency but struggle to capture radiometric detail or to scale efficiently in unstructured, low-texture terrains typical of extraterrestrial environments. This work explores the integration of radiance field-based methods - specifically Neural Radiance Fields (NeRF) and Gaussian Splatting - into a unified, automated environment reconstruction pipeline for planetary robotics. Our system combines the Nerfstudio and COLMAP frameworks with a ROS2-compatible workflow capable of processing raw rover data directly from rosbag recordings. This approach enables the generation of dense, photorealistic, and metrically consistent 3D representations from minimal visual input, supporting improved perception and planning for autonomous systems operating in planetary-like conditions. The resulting pipeline established a foundation for future research in radiance field-based mapping, bridging the gap between geometric and neural representations in planetary exploration.

[CV-87] RPGD: RANSAC-P3P Gradient Descent for Extrinsic Calibration in 3D Human Pose Estimation

【速读】:该论文旨在解决大规模三维人体姿态估计(3D Human Pose Estimation, 3D HPE)数据集采集过程中,基于动作捕捉(MoCap)的3D骨骼数据与单目或多视角RGB相机之间的外参标定问题。传统方法通常依赖于人工标记点或专用标定板,难以在真实场景中自动化执行。本文提出的RPGD(RANSAC-P3P Gradient Descent)框架通过利用自然人类运动作为唯一输入,将外参标定建模为一个从粗到精的优化问题:首先使用RANSAC-P3P算法获得鲁棒的初始估计,再结合梯度下降法进行精细化优化,从而实现高精度且自动化的外参标定。其核心创新在于融合了RANSAC-P3P的全局鲁棒性与梯度下降法的局部精细调整能力,能够在噪声和复杂场景下稳定恢复接近真值的外参参数,且达到亚像素级别的MPJPE重投影误差。

链接: https://arxiv.org/abs/2602.13901
作者: Zhanyu Tuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted at AAIML 2026. This work is co-funded by the European Union’s Horizon Europe research and innovation programme under MSCA with grant agreement No 101081674

点击查看摘要

Abstract:In this paper, we propose RPGD (RANSAC-P3P Gradient Descent), a human-pose-driven extrinsic calibration framework that robustly aligns MoCap-based 3D skeletal data with monocular or multi-view RGB cameras using only natural human motion. RPGD formulates extrinsic calibration as a coarse-to-fine problem tailored to human poses, combining the global robustness of RANSAC-P3P with Gradient-Descent-based refinement. We evaluate RPGD on three large-scale public 3D HPE datasets as well as on a self-collected in-the-wild dataset. Experimental results demonstrate that RPGD consistently recovers extrinsic parameters with accuracy comparable to the provided ground truth, achieving sub-pixel MPJPE reprojection error even in challenging, noisy settings. These results indicate that RPGD provides a practical and automatic solution for reliable extrinsic calibration of large-scale 3D HPE dataset collection.

[CV-88] Parameter-Efficient Fine-Tuning of DINOv2 for Large-Scale Font Classification

【速读】:该论文旨在解决从渲染文本图像中自动识别字体家族的问题,目标是准确区分394种不同的字体家族。其核心挑战在于如何在有限标注数据和计算资源下实现高精度分类,并确保模型在真实场景中的泛化能力。解决方案的关键在于:首先,采用基于低秩适应(Low-Rank Adaptation, LoRA)的微调策略,在仅训练模型参数不足1%的情况下(约87.2M参数中的<1%),实现了约86%的top-1准确率;其次,构建了一个大规模合成数据生成管道,通过随机化颜色、对齐方式、换行规则及高斯噪声等增强手段,对Google Fonts进行规模化渲染,从而生成具有多样性和鲁棒性的训练样本;最后,模型集成内置预处理机制以保证训练与推理阶段的一致性,并部署为HuggingFace Inference Endpoint,便于实际应用。

链接: https://arxiv.org/abs/2602.13889
作者: Daniel Chen,Zaria Zinn,Marcus Lowe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a font classification system capable of identifying 394 font families from rendered text images. Our approach fine-tunes a DINOv2 Vision Transformer using Low-Rank Adaptation (LoRA), achieving approximately 86% top-1 accuracy while training fewer than 1% of the model’s 87.2M parameters. We introduce a synthetic dataset generation pipeline that renders Google Fonts at scale with diverse augmentations including randomized colors, alignment, line wrapping, and Gaussian noise, producing training images that generalize to real-world typographic samples. The model incorporates built-in preprocessing to ensure consistency between training and inference, and is deployed as a HuggingFace Inference Endpoint. We release the model, dataset, and full training pipeline as open-source resources.

[CV-89] Human-Aligned Evaluation of a Pixel-wise DNN Color Constancy Model

【速读】:该论文旨在解决虚拟现实(VR)环境中颜色恒常性(color constancy)的建模与人类视觉系统表现之间的比较问题,特别是如何通过深度学习模型来模拟和理解人类在不同光照条件下对物体表面反射率的感知机制。解决方案的关键在于结合先前开发的基于渲染图像预测表面反射率的深度神经网络(DNN)与人类行为实验,采用迁移学习策略——仅对网络解码器进行微调以适应特定VR条件,并将模型输出用于执行与人类相同的无色物体选择任务,从而实现模型与人类在颜色恒常性表现上的直接对比。结果表明,该模型在基准条件下表现出与人类一致的高恒常性水平,且在移除局部周围或空间平均颜色线索时展现出相似的性能下降模式,验证了其对已知颜色恒常性机制(如局部周围、最大通量和空间均值)的有效建模能力。

链接: https://arxiv.org/abs/2602.13887
作者: Hamed Heidari-Gorji,Raquel Gil Rodriguez,Karl R. Gegenfurtner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:We previously investigated color constancy in photorealistic virtual reality (VR) and developed a Deep Neural Network (DNN) that predicts reflectance from rendered images. Here, we combine both approaches to compare and study a model and human performance with respect to established color constancy mechanisms: local surround, maximum flux and spatial mean. Rather than evaluating the model against physical ground truth, model performance was assessed using the same achromatic object selection task employed in the human experiments. The model, a ResNet based U-Net from our previous work, was pre-trained on rendered images to predict surface reflectance. We then applied transfer learning, fine-tuning only the network’s decoder on images from the baseline VR condition. To parallel the human experiment, the model’s output was used to perform the same achromatic object selection task across all conditions. Results show a strong correspondence between the model and human behavior. Both achieved high constancy under baseline conditions and showed similar, condition-dependent performance declines when the local surround or spatial mean color cues were removed.

[CV-90] VSAL: A Vision Solver with Adaptive Layouts for Graph Property Detection WWW

【速读】:该论文旨在解决现有基于视觉的图属性检测方法受限于固定图布局导致表达能力不足的问题。其解决方案的关键在于提出VSAL框架,该框架引入了一个自适应布局生成器(adaptive layout generator),能够根据每个图实例动态生成更具信息量的可视化表示,从而显著提升图属性检测的性能。

链接: https://arxiv.org/abs/2602.13880
作者: Jiahao Xie,Guangmo Tong
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The Web Conference (WWW) 2026

点击查看摘要

Abstract:Graph property detection aims to determine whether a graph exhibits certain structural properties, such as being Hamiltonian. Recently, learning-based approaches have shown great promise by leveraging data-driven models to detect graph properties efficiently. In particular, vision-based methods offer a visually intuitive solution by processing the visualizations of graphs. However, existing vision-based methods rely on fixed visual graph layouts, and therefore, the expressiveness of their pipeline is restricted. To overcome this limitation, we propose VSAL, a vision-based framework that incorporates an adaptive layout generator capable of dynamically producing informative graph visualizations tailored to individual instances, thereby improving graph property detection. Extensive experiments demonstrate that VSAL outperforms state-of-the-art vision-based methods on various tasks such as Hamiltonian cycle, planarity, claw-freeness, and tree detection.

[CV-91] Low-Pass Filtering Improves Behavioral Alignment of Vision Models

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在计算机视觉基准测试中与人类视觉行为之间存在的显著偏差问题,尤其是误差一致性(error consistency)和形状偏好(shape bias)方面的不匹配。以往研究认为生成式模型(generative models)相比判别式模型(discriminative models)能更贴近人类视觉行为,但本文通过一系列受控实验发现,这种提升主要归因于生成模型中一个看似无关的图像缩放操作——该操作本质上起到了低通滤波(low-pass filtering)的作用。解决方案的关键在于:仅在测试时对输入图像进行模糊处理(即引入低通滤波),即可显著提升判别式模型如CLIP的行为一致性,甚至达到当前最优性能,使DNN与人类观察者之间的对齐差距减半。进一步分析表明,此类低通滤波器接近人类视觉系统中的带通滤波特性,其频率响应与人类对比敏感度函数(contrast sensitivity function)高度吻合,揭示了优化空间频率信息处理是实现更高行为对齐的核心机制。

链接: https://arxiv.org/abs/2602.13859
作者: Max Wolff,Thomas Klein,Evgenia Rusak,Felix Wichmann,Wieland Brendel
机构: Max Planck Institute for Intelligent Systems(马普智能系统研究所); University of Tübingen(图宾根大学); ELLIS Institute Tübingen(ELLIS图宾根研究所); Cohere(协同)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through \emphgenerative – rather than \emphdiscriminative – classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test-time – rather than training on blurred images – achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of the specific width that also maximizes error consistency. Comments: 10 pages, 6 figures Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.13859 [cs.CV] (or arXiv:2602.13859v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.13859 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-92] Cardiac Output Prediction from Echocardiograms: Self-Supervised Learning with Limited Data

【速读】:该论文旨在解决心脏输出量(Cardiac Output, CO)非侵入性精准测量的问题,传统方法依赖右心导管术,存在侵入性和耗时等局限。为克服这一挑战,作者提出基于SimCLR的自监督学习(Self-Supervised Learning, SSL)预训练策略,利用下游任务可用的有限超声心动图视频数据进行预训练,以提升模型对CO的预测性能。其解决方案的关键在于:在数据稀缺条件下,通过SSL增强特征表示能力并缓解过拟合,从而显著改善模型泛化性能,最终在测试集上实现平均皮尔逊相关系数0.41,优于使用百万级样本训练的PanEcho模型。

链接: https://arxiv.org/abs/2602.13846
作者: Adson Duarte,Davide Vitturini,Emanuele Milillo,Andrea Bragagnolo,Carlo Alberto Barbano,Riccardo Renzulli,Michele Cannito,Federico Giacobbe,Francesco Bruno,Ovidio de Filippo,Fabrizio D’Ascenzo,Marco Grangetto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ISBI 2026

点击查看摘要

Abstract:Cardiac Output (CO) is a key parameter in the diagnosis and management of cardiovascular diseases. However, its accurate measurement requires right-heart catheterization, an invasive and time-consuming procedure, motivating the development of reliable non-invasive alternatives using echocardiography. In this work, we propose a self-supervised learning (SSL) pretraining strategy based on SimCLR to improve CO prediction from apical four-chamber echocardiographic videos. The pretraining is performed using the same limited dataset available for the downstream task, demonstrating the potential of SSL even under data scarcity. Our results show that SSL mitigates overfitting and improves representation learning, achieving an average Pearson correlation of 0.41 on the test set and outperforming PanEcho, a model trained on over one million echocardiographic exams. Source code is available at this https URL.

[CV-93] Synthetic Dataset Generation and Validation for Robotic Surgery Instrument Segmentation

【速读】:该论文旨在解决机器人手术中器械分割任务因真实标注数据稀缺而导致模型泛化能力不足的问题。其解决方案的关键在于构建一个全自动化的合成数据生成与验证流程:通过在Autodesk Maya中精细重建并动画化达芬奇机器人手臂,结合Python自动化管道生成具有像素级精确标签的逼真视频序列,同时引入随机运动模式、光照变化和合成血液纹理以模拟术中变异性;进一步通过对比不同比例的真实与合成数据训练分割模型,验证了平衡使用二者可显著提升模型泛化性能,而过度依赖合成数据则会导致明显的域偏移(domain shift)。该框架为外科计算机视觉提供了可复现、可扩展的数据增强与仿真预训练工具。

链接: https://arxiv.org/abs/2602.13844
作者: Giorgio Chiesa,Rossella Borra,Vittorio Lauro,Sabrina De Cillis,Daniele Amparore,Cristian Fiori,Riccardo Renzulli,Marco Grangetto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ISBI 2026

点击查看摘要

Abstract:This paper presents a comprehensive workflow for generating and validating a synthetic dataset designed for robotic surgery instrument segmentation. A 3D reconstruction of the Da Vinci robotic arms was refined and animated in Autodesk Maya through a fully automated Python-based pipeline capable of producing photorealistic, labeled video sequences. Each scene integrates randomized motion patterns, lighting variations, and synthetic blood textures to mimic intraoperative variability while preserving pixel-accurate ground truth masks. To validate the realism and effectiveness of the generated data, several segmentation models were trained under controlled ratios of real and synthetic data. Results demonstrate that a balanced composition of real and synthetic samples significantly improves model generalization compared to training on real data only, while excessive reliance on synthetic data introduces a measurable domain shift. The proposed framework provides a reproducible and scalable tool for surgical computer vision, supporting future research in data augmentation, domain adaptation, and simulation-based pretraining for robotic-assisted surgery. Data and code are available at this https URL.

[CV-94] Automated Prediction of Paravalvular Regurgitation before Transcatheter Aortic Valve Implantation

【速读】:该论文旨在解决经导管主动脉瓣植入术(Transcatheter Aortic Valve Implantation, TAVI)后常见的并发症——瓣周主动脉反流(Paravalvular Aortic Regurgitation, PVR)的预测问题。PVR显著影响患者长期预后,但其发生机制复杂且难以在术前准确评估。研究提出利用深度学习方法,基于术前心脏CT影像实现对PVR风险的早期识别与量化预测。解决方案的关键在于构建并训练3D卷积神经网络(3D Convolutional Neural Networks),从各向同性(isotropic)的CT体积数据中自动提取细微的解剖特征,从而实现对个体化风险分层和手术方案优化的支持。

链接: https://arxiv.org/abs/2602.13842
作者: Michele Cannito,Riccardo Renzulli,Adson Duarte,Farzad Nikfam,Carlo Alberto Barbano,Enrico Chiesa,Francesco Bruno,Federico Giacobbe,Wojciech Wanha,Arturo Giordano,Marco Grangetto,Fabrizio D’Ascenzo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ISBI 2026

点击查看摘要

Abstract:Severe aortic stenosis is a common and life-threatening condition in elderly patients, often treated with Transcatheter Aortic Valve Implantation (TAVI). Despite procedural advances, paravalvular aortic regurgitation (PVR) remains one of the most frequent post-TAVI complications, with a proven impact on long-term prognosis. In this work, we investigate the potential of deep learning to predict the occurrence of PVR from preoperative cardiac CT. To this end, a dataset of preoperative TAVI patients was collected, and 3D convolutional neural networks were trained on isotropic CT volumes. The results achieved suggest that volumetric deep learning can capture subtle anatomical features from pre-TAVI imaging, opening new perspectives for personalized risk assessment and procedural optimization. Source code is available at this https URL. Comments: Accepted at ISBI 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.13842 [cs.CV] (or arXiv:2602.13842v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.13842 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-95] High-Fidelity Causal Video Diffusion Models for Real-Time Ultra-Low-Bitrate Semantic Communication

【速读】:该论文旨在解决在超低比特率(ultra-low-bitrate)语义通信约束下,实现高保真、因果且实时的视频生成问题。其核心挑战在于如何在极低带宽条件下有效传输视频内容,同时保持感知质量、语义一致性和时间连贯性。解决方案的关键在于提出一种模块化的视频扩散模型(modular video diffusion model),包含语义控制(Semantic Control)、恢复适配器(Restoration Adapter)和时序适配器(Temporal Adapter)三个组件,并结合高效的时序蒸馏(temporal distillation)机制,显著降低训练参数量(减少300倍)和训练时间(减少2倍),从而在满足通信约束的前提下实现高质量、实时的因果视频合成。

链接: https://arxiv.org/abs/2602.13837
作者: Cem Eteke,Batuhan Tosun,Alexander Griessel,Wolfgang Kellerer,Eckehard Steinbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach utilizes lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that provide sufficient texture information to preserve fidelity. Building on these inputs, we introduce a modular video diffusion model that contains Semantic Control, Restoration Adapter, and Temporal Adapter. We further introduce an efficient temporal distillation procedure that enables extension to real-time and causal synthesis, reducing trainable parameters by 300x and training time by 2x, while adhering to communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates ( 0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.

[CV-96] Prior-guided Hierarchical Instance-pixel Contrastive Learning for Ultrasound Speckle Noise Suppression

【速读】:该论文旨在解决超声图像去噪中如何在抑制斑点噪声(speckle noise)的同时有效保留解剖结构细节的问题,这是提升图像质量与诊断可靠性的关键挑战。解决方案的关键在于提出一种先验引导的分层实例-像素对比学习模型,通过在像素级和实例级上最大化噪声样本与干净样本之间的可分性,促进噪声不变且结构感知的特征表示;其中,统计引导的像素级对比学习增强噪声与干净像素间的分布差异以提升局部结构一致性,同时利用记忆库实现特征空间中的实例级对比学习以逼近真实数据分布;此外,采用混合Transformer-CNN架构,结合Transformer编码器捕捉全局上下文信息与CNN解码器优化细粒度解剖结构恢复,从而协同利用长程依赖性和局部纹理细节,显著优于现有方法。

链接: https://arxiv.org/abs/2602.13831
作者: Zhenyu Bu,Yuanxin Xie,Guang-Quan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultrasound denoising is essential for mitigating speckle-induced degradations, thereby enhancing image quality and improving diagnostic reliability. Nevertheless, because speckle patterns inherently encode both texture and fine anatomical details, effectively suppressing noise while preserving structural fidelity remains a significant challenge. In this study, we propose a prior-guided hierarchical instance-pixel contrastive learning model for ultrasound denoising, designed to promote noise-invariant and structure-aware feature representations by maximizing the separability between noisy and clean samples at both pixel and instance levels. Specifically, a statistics-guided pixel-level contrastive learning strategy is introduced to enhance distributional discrepancies between noisy and clean pixels, thereby improving local structural consistency. Concurrently, a memory bank is employed to facilitate instance-level contrastive learning in the feature space, encouraging representations that more faithfully approximate the underlying data distribution. Furthermore, a hybrid Transformer-CNN architecture is adopted, coupling a Transformer-based encoder for global context modeling with a CNN-based decoder optimized for fine-grained anatomical structure restoration, thus enabling complementary exploitation of long-range dependencies and local texture details. Extensive evaluations on two publicly available ultrasound datasets demonstrate that the proposed model consistently outperforms existing methods, confirming its effectiveness and superiority.

[CV-97] Embed-RL: Reinforcement Learning for Reasoning -Driven Multimodal Embeddings

【速读】:该论文旨在解决当前生成式多模态嵌入(Universal Multimodal Embeddings, UME)方法中,生成的链式思维(Chain-of-Thought, CoT)推理仅限于文本层面分析,缺乏与目标检索任务相关性的关键问题。现有方法难以确保生成的CoT能够有效指导跨模态匹配,从而限制了嵌入质量与检索性能。其解决方案的关键在于提出一种基于嵌入引导强化学习(Embedder-Guided Reinforcement Learning, EG-RL)的推理驱动框架,通过引入可解释的追踪性链式思维(Traceability CoT, T-CoT),使推理过程聚焦于与检索相关的多模态线索,并由嵌入器提供显式监督信号,从而实现推理与嵌入任务之间的对齐优化。这一机制显著提升了跨模态语义一致性、细粒度匹配能力及复杂场景下的泛化性能。

链接: https://arxiv.org/abs/2602.13823
作者: Haonan Jiang,Yuji Wang,Yongjie Zhu,Xin Lu,Wenyu Qin,Meng Wang,Pengfei Wan,Yansong Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is [this URL]( this https URL )

点击查看摘要

Abstract:Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.

[CV-98] VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer

【速读】:该论文旨在解决文本到3D生成(text-to-3D generation)中因离散3D表示学习瓶颈导致的几何一致性退化问题,尤其是现有方法在编码阶段的信息损失与向量量化(Vector Quantization, VQ)过程放大带来的表征失真,以及传统两阶段训练范式引发的重建目标与文本条件自回归生成之间的目标不匹配。其解决方案的关键在于提出View-aware Auto-Regressive 3D (VAR-3D),通过引入视图感知的3D向量量化变分自编码器(View-aware 3D Vector Quantized-Variational AutoEncoder, VQ-VAE)将复杂3D结构映射为离散标记(discrete tokens),并设计渲染监督训练策略,使离散标记预测与视觉重建耦合,从而提升生成结果的视觉保真度和结构一致性。

链接: https://arxiv.org/abs/2602.13818
作者: Zongcheng Han,Dongyan Cao,Haoran Sun,Yu Hong
机构: Soochow University (苏州大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which intergrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, encouraging the generative process to better preserve visual fidelity and structural consistency relative to the input text. Experiments demonstrate that VAR-3D significantly outperforms existing methods in both generation quality and text-3D alignment.

[CV-99] Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos

【速读】:该论文旨在解决从单目(monocular)非结构化视频中进行四维(4D)动态场景重建的难题,该问题在机器人学习中至关重要但极具挑战性,因其在严格单目条件下存在严重的病态性(ill-posed)。解决方案的关键在于提出一种多尺度动态机制(multi-scale dynamics mechanism),该机制通过分解复杂运动场来建模真实世界动态的多尺度规律(从物体到粒子层级)。在此基础上,作者设计了具有多尺度动态特性的高斯序列(Gaussian sequences with multi-scale dynamics),这是一种基于多层级运动组合构建的新型动态3D高斯表示,其分层结构显著缓解了重建歧义并促进物理合理性的动态模拟。同时,引入视觉基础模型(vision foundation models)提供的多模态先验作为互补监督信号,进一步约束解空间并提升重建保真度,从而实现从单目随意视频中准确且全局一致的4D重建。

链接: https://arxiv.org/abs/2602.13806
作者: Can Li,Jie Gu,Jingmin Chen,Fangzhou Qiu,Lei Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibits a multi-scale regularity from object to particle level. To this end, we design the multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates ambiguity of reconstruction and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving the reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments of dynamic novel-view synthesis (NVS) on benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.

[CV-100] Joint Orientation and Weight Optimization for Robust Watertight Surface Reconstruction via Dirichlet-Regularized Winding Fields

【速读】:该论文旨在解决从非定向点云(unoriented point clouds)中重建封闭表面(watertight surfaces)的难题,尤其针对采样不均匀、噪声和异常值干扰等现实场景下的挑战。传统方法通常依赖多阶段预处理(如法向估计、去噪、重采样),而这些步骤易引入误差并难以协同优化。本文提出Dirichlet Winding Reconstruction (DiWR),其核心创新在于将广义环绕数(Generalized Winding Number, GWN)场作为隐式表示目标,并在统一框架内联合优化点云法向量、每点面积权重(per-point area weights)以及置信度系数(confidence coefficients)。通过最小化诱导环绕场的Dirichlet能量并引入基于GWN的约束项,DiWR能够自适应补偿采样不均匀性、抑制噪声影响并降低异常值权重,从而实现无需独立预处理即可生成高质量封闭表面。

链接: https://arxiv.org/abs/2602.13801
作者: Jiaze Li,Daisheng Jin,Fei Hou,Junhui Hou,Zheng Liu,Shiqing Xin,Wenping Wang,Ying He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose Dirichlet Winding Reconstruction (DiWR), a robust method for reconstructing watertight surfaces from unoriented point clouds with non-uniform sampling, noise, and outliers. Our method uses the generalized winding number (GWN) field as the target implicit representation and jointly optimizes point orientations, per-point area weights, and confidence coefficients in a single pipeline. The optimization minimizes the Dirichlet energy of the induced winding field together with additional GWN-based constraints, allowing DiWR to compensate for non-uniform sampling, reduce the impact of noise, and downweight outliers during reconstruction, with no reliance on separate preprocessing. We evaluate DiWR on point clouds from 3D Gaussian Splatting, a computer-vision pipeline, and corrupted graphics benchmarks. Experiments show that DiWR produces plausible watertight surfaces on these challenging inputs and outperforms both traditional multi-stage pipelines and recent joint orientation-reconstruction methods.

[CV-101] Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery

【速读】:该论文旨在解决遥感(Remote Sensing, RS)语义变化检测(Semantic Change Detection, SCD)方法中存在的性能瓶颈与模型架构复杂性问题,尤其是受限于模型语义理解能力不足以及任务本身的高复杂性。其解决方案的关键在于提出PerASCD框架,该框架基于RS基础模型PerA,引入一种模块化级联门控解码器(Cascaded Gated Decoder, CG-Decoder),以简化SCD的解码流程并增强多尺度特征交互与融合能力;同时设计软语义一致性损失(Soft Semantic Consistency Loss, SSCLoss)来缓解训练过程中的数值不稳定性问题。实验表明,该方法在多个视觉编码器上均表现出良好的适配性和SOTA性能,显著提升了SCD任务的准确性与泛化能力。

链接: https://arxiv.org/abs/2602.13780
作者: Hengtong Shen,Li Yan,Hong Xie,Yaxuan Wei,Xinhao Li,Wenfei Shen,Peixian Lv,Fei Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing (RS) change detection methods can extract critical information on surface dynamics and are an essential means for humans to understand changes in the earth’s surface and environment. Among these methods, semantic change detection (SCD) can more effectively interpret the multi-class information contained in bi-temporal RS imagery, providing semantic-level predictions that support dynamic change monitoring. However, due to the limited semantic understanding capability of the model and the inherent complexity of the SCD tasks, existing SCD methods face significant challenges in both performance and paradigm complexity. In this paper, we propose PerASCD, a SCD method driven by RS foundation model PerA, designed to enhance the multi-scale semantic understanding and overall performance. We introduce a modular Cascaded Gated Decoder (CG-Decoder) that simplifies complex SCD decoding pipelines while promoting effective multi-level feature interaction and fusion. In addition, we propose a Soft Semantic Consistency Loss (SSCLoss) to mitigate the numerical instability commonly encountered during SCD training. We further explore the applicability of multiple existing RS foundation models on the SCD task when equipped with the proposed decoder. Experimental results demonstrate that our decoder not only effectively simplifies the paradigm of SCD, but also achieves seamless adaptation across various vision encoders. Our method achieves state-of-the-art (SOTA) performance on two public benchmark datasets, validating its effectiveness. The code is available at this https URL.

[CV-102] Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation

【速读】:该论文旨在解决当前舞蹈生成方法在骨骼域(skeletal domain)训练时忽略网格级物理约束的问题,导致生成的动作虽在关节轨迹上看似合理,但在使用人体网格可视化时出现身体自穿透(body self-penetration)和足地接触(Foot-Ground Contact, FGC)异常,从而降低舞蹈的视觉美感并限制其实际应用。解决方案的关键在于通过从人体网格中提取物理奖励信号,并采用强化学习微调(Reinforcement Learning Fine-Tuning, RLFT)引导扩散模型在网格可视化下生成物理合理的运动。具体而言,奖励设计包含:(i) 仿效奖励(imitation reward),衡量动作在物理模拟器中的可模仿性(惩罚自穿透与足部滑移),以及 (ii) 足地偏差奖励(Foot-Ground Deviation, FGD reward)结合测试时FGD引导,以更好地捕捉舞蹈中动态的足地交互;同时引入抗冻结奖励(anti-freezing reward)缓解因过度追求物理合理性而导致的运动停滞问题,从而在保持物理真实性的同时维持动作的动力学多样性。

链接: https://arxiv.org/abs/2602.13778
作者: Jidong Jia,Youjian Zhang,Huan Fu,Dacheng Tao
机构: Shanghai Jiao Tong University (上海交通大学); Bosch (博世); Youku (优酷); Alibaba (阿里巴巴); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion’s general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate it, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: this https URL

[CV-103] Offline-Poly: A Polyhedral Framework For Offline 3D Multi-Object Tracking

【速读】:该论文旨在解决现有离线3D多目标跟踪(Offline 3D Multi-Object Tracking, MOT)方法依赖固定上游检测器或定制架构、未能充分利用离线设置优势的问题。其核心挑战在于如何在不绑定特定检测器或跟踪器的前提下,实现全局优化与时间维度上的完整可观测性,从而提升伪标签质量并增强模型的适应性。解决方案的关键在于提出一种基于“以跟踪为中心”的标准化范式——Tracking-by-Tracking (TBT),该范式仅使用任意现成的跟踪输出作为输入,通过预处理、分层匹配与融合、以及轨迹精修三个模块,充分挖掘离线跟踪的两大特性:资源无约束性(允许全局优化)和未来可观测性(支持全时域推理)。此设计实现了离线追踪器与具体检测器/跟踪器的解耦,显著提升了方法的灵活性、通用性和性能,在nuScenes和KITTI数据集上分别达到77.6% AMOTA和83.00% HOTA的先进水平。

链接: https://arxiv.org/abs/2602.13772
作者: Xiaoyu Li,Yitao Wu,Xian Wu,Haolin Zhuo,Lijun Zhao,Lining Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Based on this work, we achieved 1st place on the KITTI tracking leaderboard

点击查看摘要

Abstract:Offline 3D multi-object tracking (MOT) is a critical component of the 4D auto-labeling (4DAL) process. It enhances pseudo-labels generated by high-performance detectors through the incorporation of temporal context. However, existing offline 3D MOT approaches are direct extensions of online frameworks and fail to fully exploit the advantages of offline setting. Moreover, these methods often depend on fixed upstream and customized architectures, limiting their adaptability. To address these limitations, we propose Offline-Poly, a general offline 3D MOT method based on a tracking-centric design. We introduce a standardized paradigm termed Tracking-by-Tracking (TBT), which operates exclusively on arbitrary off-the-shelf tracking outputs and produces offline-refined tracklets. This formulation decouples offline tracker from specific upstream detectors or trackers. Under the TBT paradigm, Offline-Poly accepts one or multiple coarse tracking results and processes them through a structured pipeline comprising pre-processing, hierarchical matching and fusion, and tracklet refinement. Each module is designed to capitalize on the two fundamental properties of offline tracking: resource unconstrainedness, which permits global optimization beyond real-time limits, and future observability, which enables tracklet reasoning over the full temporal horizon. Offline-Poly first eliminates short-term ghost tracklets and re-identifies fragmented segments using global scene context. It then constructs scene-level similarity to associate tracklets across multiple input sources. Finally, Offline-Poly refines tracklets by jointly leveraging local and global motion patterns. On nuScenes, we achieve SOTA performance with 77.6% AMOTA. On KITTI, it achieves leading results with 83.00% HOTA. Comprehensive experiments further validate the flexibility, generalizability, and modular effectiveness of Offline-Poly.

[CV-104] SAM4Dcap: Training-free Biomechanical Twin System from Monocular Video

【速读】:该论文旨在解决临床诊断与损伤预防中定量生物力学分析受限于实验室环境的问题,尤其是传统光学运动捕捉系统成本高昂、难以在家庭场景中应用的困境。其核心解决方案是提出SAM4Dcap,一个开源、端到端的框架,通过单目视频直接估计生物力学指标,无需额外训练;关键创新在于将具有时序一致性的4D人体网格重建方法(SAM-Body4D)与OpenSim生物力学求解器相结合,实现从单视角视频中恢复的人体 mesh 自动转换为兼容多种骨骼肌模型的轨迹文件,从而在非实验室条件下提供可信赖的膝关节运动学预测能力。

链接: https://arxiv.org/abs/2602.13760
作者: Li Wang,HaoYu Wang,Xi Chen,ZeKun Jiang,Kang Li,Jian Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantitative biomechanical analysis is essential for clinical diagnosis and injury prevention but is often restricted to laboratories due to the high cost of optical motion capture systems. While multi-view video approaches have lowered barriers, they remain impractical for home-based scenarios requiring monocular capture. This paper presents SAM4Dcap, an open-source, end-to-end framework for estimating biomechanical metrics from monocular video without additional training. SAM4Dcap integrates the temporally consistent 4D human mesh recovery of SAM-Body4D with the OpenSim biomechanical solver. The pipeline converts reconstructed meshes into trajectory files compatible with diverse musculoskeletal models. We introduce automated prompting strategies and a Linux-native build for processing. Preliminary evaluations on walking and drop-jump tasks indicate that SAM4Dcap has the potential to achieve knee kinematic predictions comparable to multi-view systems, although some discrepancies in hip flexion and residual jitter remain. By bridging advanced computer vision with established biomechanical simulation, SAM4Dcap provides a flexible, accessible foundation for non-laboratory motion analysis.

[CV-105] OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在科学图像理解任务中表现受限的问题,尤其是对示意图、实验表征图和分析图表等专业科学图像的解析能力不足。现有数据集普遍存在领域覆盖窄、结构标注粗糙及语义锚定弱等问题,导致模型难以准确捕捉科学图像中的复杂信息。解决方案的关键在于构建OmniScience这一大规模、高保真度的多模态数据集,包含150万组“图-标题-上下文”三元组,涵盖10余个主要科学学科;并设计了一种动态模型路由重标注流水线,利用先进的多模态大语言模型联合视觉特征、原始图注与文本引用生成密集且自洽的描述,并通过严格的质量过滤和专家判断对齐机制提升内容准确性与语义完整性,使图像-文本多模态相似度得分从0.769显著提升至0.956。

链接: https://arxiv.org/abs/2602.13758
作者: Haoyi Tao,Chaozheng Huang,Nan Wang,Han Lyu,Linfeng Zhang,Guolin Ke,Xi Fang
机构: DP Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets with limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets, spanning more than 10 major scientific disciplines. To obtain image caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text references authored by human scientists. The pipeline is further reinforced with rigorous quality filtering and alignment with human expert judgments, ensuring both factual accuracy and semantic completeness, and boosts the image-text multi-modal similarity score from 0.769 to 0.956. We further propose a caption QA protocol as a proxy task for evaluating visual understanding. Under this setting, Qwen2.5-VL-3B model finetuned on OmniScience show substantial gains over baselines, achieving a gain of 0.378 on MM-MT-Bench and a gain of 0.140 on MMMU.

[CV-106] 2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation

【速读】:该论文旨在解决当前文本到动作生成(text-to-motion generation)模型评估中对分布内(in-distribution)文本输入和有限评估指标的依赖问题,从而难以系统性地衡量模型在复杂分布外(out-of-distribution, OOD)文本条件下的泛化能力和运动生成质量。其解决方案的关键在于构建一个专门针对OOD场景的基准测试体系,包括一个包含1,025条文本描述的OOD提示数据集,并提出一个统一的评估框架,融合大语言模型(LLM-based)评估、多因素运动评估(Multi-factor Motion evaluation)和细粒度准确性评估(Fine-grained Accuracy Evaluation),从而全面揭示现有模型在语义对齐、运动泛化性和物理合理性等方面的优劣,尤其指出多数模型在细粒度准确性方面表现不足,为未来生产级文本到动作生成模型的设计与评估提供明确方向。

链接: https://arxiv.org/abs/2602.13751
作者: Bin Yang,Rong Ou,Weisheng Xu,Jiaqi Xiong,Xintao Li,Taowen Wang,Luyu Zhu,Xu Jiang,Jing Tan,Renjing Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which restricts their ability to systematically assess model generalization and motion generation capabilities under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark specifically designed for OOD text-to-motion evaluation, which includes a comprehensive analysis of 14 representative baseline models and the two datasets derived from evaluation results. Specifically, we construct an OOD prompt dataset consisting of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-based Evaluation, Multi-factor Motion evaluation, and Fine-grained Accuracy Evaluation. Our experimental results reveal that while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance with Fine-grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.

[CV-107] Generative Latent Representations of 3D Brain MRI for Multi-Task Downstream Analysis in Down Syndrome

【速读】:该论文旨在解决生成式模型在3D脑部磁共振成像(MRI)中学习到的潜在表示(latent representations)结构、信息内容及其在下游临床任务中的应用潜力尚未被充分探索的问题。其解决方案的关键在于构建多个变分自编码器(Variational Autoencoders, VAEs),将3D脑部MRI扫描编码为紧凑的潜在空间表示,并通过三项系统性分析验证其有效性:(i)定量与定性评估MRI重建质量,(ii)利用主成分分析(Principal Component Analysis, PCA)可视化潜在空间结构,以及(iii)在包含正常核型(euploid)和唐氏综合征(Down syndrome)个体的专有脑部MRI数据集上进行下游分类任务。结果表明,VAE能够有效捕捉关键脑部特征并保持高重建保真度,且潜在空间展现出清晰的聚类模式,尤其在区分唐氏综合征患者与正常对照组方面表现突出。

链接: https://arxiv.org/abs/2602.13731
作者: Jordi Malé,Juan Fortea,Mateus Rozalem-Aranha,Neus Martínez-Abadías,Xavier Sevillano
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative models have emerged as powerful tools in medical imaging, enabling tasks such as segmentation, anomaly detection, and high-quality synthetic data generation. These models typically rely on learning meaningful latent representations, which are particularly valuable given the high-dimensional nature of 3D medical images like brain magnetic resonance imaging (MRI) scans. Despite their potential, latent representations remain underexplored in terms of their structure, information content, and applicability to downstream clinical tasks. Investigating these representations is crucial for advancing the use of generative models in neuroimaging research and clinical decision-making. In this work, we develop multiple variational autoencoders (VAEs) to encode 3D brain MRI scans into compact latent space representations for generative and predictive applications. We systematically evaluate the effectiveness of the learned representations through three key analyses: (i) a quantitative and qualitative assessment of MRI reconstruction quality, (ii) a visualisation of the latent space structure using Principal Component Analysis, and (iii) downstream classification tasks on a proprietary dataset of euploid and Down syndrome individuals brain MRI scans. Our results demonstrate that the VAE successfully captures essential brain features while maintaining high reconstruction fidelity. The latent space exhibits clear clustering patterns, particularly in distinguishing individuals with Down syndrome from euploid controls.

[CV-108] Explore Intrinsic Geometry for Query-based Tiny and Oriented Object Detector with Momentum-based Bipartite Matching

【速读】:该论文旨在解决查询驱动的定向目标检测器在处理任意方向物体(尤其是纹理信息有限的微小目标)时性能受限的问题,其根源在于像素级特征解码过程中对内在几何信息利用不足,以及阶段间匹配不一致性导致的监督信号冲突。解决方案的关键在于提出IGOFormer架构:首先设计了内在几何感知解码器(Intrinsic Geometry-aware Decoder),通过注入基于对象查询与几何相关性的互补几何嵌入,增强对象相关特征并提供关键的方向性几何洞察;其次引入基于动量的二分匹配机制(Momentum-based Bipartite Matching),采用带有查询特定平滑因子的指数移动平均策略自适应聚合历史匹配代价,有效缓解因阶段间匹配不一致引发的监督信号冲突,从而提升检测稳定性与精度。

链接: https://arxiv.org/abs/2602.13728
作者: Junpeng Zhang,Zewei Yang,Jie Feng,Yuhui Zheng,Ronghua Shang,Mengxuan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Recent query-based detectors have achieved remarkable progress, yet their performance remains constrained when handling objects with arbitrary orientations, especially for tiny objects capturing limited texture information. This limitation primarily stems from the underutilization of intrinsic geometry during pixel-based feature decoding and the occurrence of inter-stage matching inconsistency caused by stage-wise bipartite matching. To tackle these challenges, we present IGOFormer, a novel query-based oriented object detector that explicitly integrates intrinsic geometry into feature decoding and enhances inter-stage matching stability. Specifically, we design an Intrinsic Geometry-aware Decoder, which enhances the object-related features conditioned on an object query by injecting complementary geometric embeddings extrapolated from their correlations to capture the geometric layout of the object, thereby offering a critical geometric insight into its orientation. Meanwhile, a Momentum-based Bipartite Matching scheme is developed to adaptively aggregate historical matching costs by formulating an exponential moving average with query-specific smoothing factors, effectively preventing conflicting supervisory signals arising from inter-stage matching inconsistency. Extensive experiments and ablation studies demonstrate the superiority of our IGOFormer for aerial oriented object detection, achieving an AP _50 score of 78.00% on DOTA-V1.0 using Swin-T backbone under the single-scale setting. The code will be made publicly available.

[CV-109] RGA-Net: A Vision Enhancement Framework for Robotic Surgical Systems Using Reciprocal Attention Mechanisms ICRA2026

【速读】:该论文旨在解决机器人手术系统中因能量设备产生的烟雾导致内窥镜视频质量下降的问题,从而影响人机交互和手术效果。解决方案的关键在于提出了一种名为RGA-Net(Reciprocal Gating and Attention-fusion Network)的深度学习框架,其核心创新包括:(1) 双流混合注意力(Dual-Stream Hybrid Attention, DHA)模块,结合移窗注意力与频域处理以同时捕捉局部手术细节和全局光照变化;(2) 轴分解注意力(Axis-Decomposed Attention, ADA)模块,通过因子化注意力机制高效处理多尺度特征;此外,编码器与解码器路径间采用互逆交叉门控块实现双向特征调制,显著提升了烟雾去除效果,为机器人手术中的可视化清晰度提供了可靠保障。

链接: https://arxiv.org/abs/2602.13726
作者: Quanjun Li,Weixuan Li,Han Xia,Junhua Zhou,Chi-Man Pun,Xuhang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA2026

点击查看摘要

Abstract:Robotic surgical systems rely heavily on high-quality visual feedback for precise teleoperation; yet, surgical smoke from energy-based devices significantly degrades endoscopic video feeds, compromising the human-robot interface and surgical outcomes. This paper presents RGA-Net (Reciprocal Gating and Attention-fusion Network), a novel deep learning framework specifically designed for smoke removal in robotic surgery workflows. Our approach addresses the unique challenges of surgical smoke-including dense, non-homogeneous distribution and complex light scattering-through a hierarchical encoder-decoder architecture featuring two key innovations: (1) a Dual-Stream Hybrid Attention (DHA) module that combines shifted window attention with frequency-domain processing to capture both local surgical details and global illumination changes, and (2) an Axis-Decomposed Attention (ADA) module that efficiently processes multi-scale features through factorized attention mechanisms. These components are connected via reciprocal cross-gating blocks that enable bidirectional feature modulation between encoder and decoder pathways. Extensive experiments on the DesmokeData and LSD3K surgical datasets demonstrate that RGA-Net achieves superior performance in restoring visual clarity suitable for robotic surgery integration. Our method enhances the surgeon-robot interface by providing consistently clear visualization, laying a technical foundation for alleviating surgeons’ cognitive burden, optimizing operation workflows, and reducing iatrogenic injury risks in minimally invasive procedures. These practical benefits could be further validated through future clinical trials involving surgeon usability assessments. The proposed framework represents a significant step toward more reliable and safer robotic surgical systems through computational vision enhancement.

[CV-110] Fine-tuned Vision Language Model for Localization of Parasitic Eggs in Microscopic Images

【速读】:该论文旨在解决土壤源性线虫(Soil-transmitted helminth, STH)感染诊断中依赖人工显微镜检查所面临的劳动强度大、耗时长及易出错的问题。其解决方案的关键在于利用经过微调的视觉语言模型(Vision Language Model, VLM),如Microsoft Florence,实现对显微图像中寄生虫卵的精准定位;初步结果表明,该方法在平均交并比(mIOU)上优于EfficientDet等传统目标检测算法(达到0.94),验证了VLM作为自动化寄生虫学诊断框架核心组件的可行性与可扩展性。

链接: https://arxiv.org/abs/2602.13712
作者: Chan Hao Sien,Hezerul Abdul Karim,Nouar AlDahoul
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Soil-transmitted helminth (STH) infections continuously affect a large proportion of the global population, particularly in tropical and sub-tropical regions, where access to specialized diagnostic expertise is limited. Although manual microscopic diagnosis of parasitic eggs remains the diagnostic gold standard, the approach can be labour-intensive, time-consuming, and prone to human error. This paper aims to utilize a vision language model (VLM) such as Microsoft Florence that was fine-tuned to localize all parasitic eggs within microscopic images. The preliminary results show that our localization VLM performs comparatively better than the other object detection methods, such as EfficientDet, with an mIOU of 0.94. This finding demonstrates the potential of the proposed VLM to serve as a core component of an automated framework, offering a scalable engineering solution for intelligent parasitological diagnosis.

[CV-111] A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy

【速读】:该论文旨在解决糖尿病周围神经病变(Diabetic Peripheral Neuropathy, DPN)中角膜共聚焦显微镜(Corneal Confocal Microscopy, CCM)图像分析面临的两大挑战:一是标注数据稀缺,二是角膜神经形态的细粒度变异导致自动化深度学习诊断模型难以训练。为克服这些问题,作者提出了一种基于权重分解低秩适应(Weight-Decomposed Low-Rank Adaptation, WDLoRA)的多模态生成框架,其关键在于通过参数高效微调机制将权重更新解耦为幅度与方向两个独立分量,使基础生成模型能够分别学习神经走向(nerve topology)和基质对比度(stromal contrast),从而实现临床真实感的图像合成。该方法联合条件输入神经分割掩码和疾病特异性临床提示,在控制组、无DPN亚型(T1NoDPN)和DPN亚型(T1DPN)之间生成解剖一致的CCM图像,并在视觉保真度(FID: 5.18)和结构完整性(SSIM: 0.630)上显著优于GAN和标准扩散模型,且合成图像保留了金标准临床生物标志物,可有效提升下游诊断准确率(+2.1%)和分割性能(+2.2%)。

链接: https://arxiv.org/abs/2602.13693
作者: Xin Zhang,Liangxiu Han,Yue Shi,Yalin Zheng,Uazman Alam,Maryam Ferdousi,Rayaz Malik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework’s potential to alleviate data bottlenecks in medical AI.

[CV-112] Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation ICRA2026

【速读】:该论文旨在解决机器人操作中插入任务所需的高精度、接触密集型交互问题,此类任务仅靠视觉无法完全解决。现有研究发现,简单的视觉-触觉融合方法往往难以稳定提升性能。其解决方案的关键在于提出一种跨模态Transformer(Cross-Modal Transformer, CMT),通过结构化的自注意力和交叉注意力机制融合腕部相机图像与触觉信号;同时引入物理信息正则化项,鼓励双侧力平衡,以模拟人类运动控制原理来稳定触觉嵌入。实验表明,该方法在TacSL基准上达到96.59%的插入成功率,显著优于朴素融合与门控融合基线,并接近使用“腕部相机+接触力”特权传感配置的性能(96.09%),验证了触觉感知对精确定位的必要性以及基于物理先验的多模态融合策略的有效性。

链接: https://arxiv.org/abs/2602.13689
作者: Wonju Lee,Matteo Grimaldi,Tao Yu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted By ICRA2026

点击查看摘要

Abstract:Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged “wrist + contact force” configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.

[CV-113] An Ensemble Learning Approach towards Waste Segmentation in Cluttered Environment

【速读】:该论文旨在解决真实废弃物环境中复杂场景下分割精度不足的问题,尤其针对变形、无特定纹理且相互重叠的垃圾对象在自动分拣中的识别难题。其解决方案的关键在于提出一种基于集成学习(Ensemble Learning)的方法,通过加权平均融合U-Net与FPN两种高性能分割模型的输出掩膜:U-Net擅长捕捉细粒度边界特征,而FPN能有效处理尺度变化和上下文信息,二者结合显著提升了分割准确性,最终在IoU指标上达到0.8306,优于单一模型表现。

链接: https://arxiv.org/abs/2602.13681
作者: Maimoona Jafar,Syed Imran Ali,Ahsan Saadat,Muhammad Bilal,Shah Khalid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Environmental pollution is a critical global issue, with recycling emerging as one of the most viable solutions. This study focuses on waste segregation, a crucial step in recycling processes to obtain raw material. Recent advancements in computer vision have significantly contributed to waste classification and recognition. In waste segregation, segmentation masks are essential for robots to accurately localize and pick objects from conveyor belts. The complexity of real-world waste environments, characterized by deformed items without specific patterns and overlapping objects, further complicates waste segmentation tasks. This paper proposes an Ensemble Learning approach to improve segmentation accuracy by combining high performing segmentation models, U-Net and FPN, using a weighted average method. U-Net excels in capturing fine details and boundaries in segmentation tasks, while FPN effectively handles scale variation and context in complex environments, and their combined masks result in more precise predictions. The dataset used closely mimics real-life waste scenarios, and preprocessing techniques were applied to enhance feature learning for deep learning segmentation models. The ensemble model, referred to as EL-4, achieved an IoU value of 0.8306, an improvement over U-Net’s 0.8065, and reduced Dice loss to 0.09019 from FPN’s 0.1183. This study could contribute to the efficiency of waste sorting at Material Recovery Facility, facilitating better raw material acquisition for recycling with minimal human intervention and enhancing the overall throughput.

[CV-114] EchoTorrent: Towards Swift Sustained and Streaming Multi-Modal Video Generation

【速读】:该论文旨在解决多模态视频生成模型在实时部署中面临的效率与性能矛盾问题,具体表现为推理延迟高、时间稳定性差,尤其是在流式推理场景下易出现空间模糊、时间漂移和唇音不同步等多模态退化现象。解决方案的关键在于提出EchoTorrent框架,其核心创新包括:(1)多教师训练(Multi-Teacher Training)通过在不同偏好域上微调预训练模型获得领域专家,并依次将领域知识传递给学生模型;(2)自适应CFG校准(Adaptive CFG Calibration, ACC-DMD)基于分阶段时空调度校准音频CFG增强误差,消除冗余计算并实现每步单次推理;(3)混合长尾强制机制(Hybrid Long Tail Forcing)在长时程自回放训练中仅对尾帧施加对齐约束,结合因果-双向混合架构有效缓解流式模式下的时空退化并提升参考帧保真度;(4)VAE解码器精修(VAE Decoder Refiner)通过像素域优化恢复高频细节,规避潜在空间歧义。实验表明,EchoTorrent可在少次迭代下实现显著延长的时间一致性、身份保持及音唇同步能力。

链接: https://arxiv.org/abs/2602.13669
作者: Rang Meng,Weipeng Wu,Yingjie Yin,Yuming Li,Chenguang Ma
机构: Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD), which calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid Long Tail Forcing, which enforces alignment exclusively on tail frames during long-horizon self-rollout training via a causal-bidirectional hybrid architecture, effectively mitigates spatiotemporal degradation in streaming mode while enhancing fidelity to reference frames; and (4) VAE Decoder Refiner through pixel-domain optimization of the VAE decoder to recover high-frequency details while circumventing latent-space ambiguities. Extensive experiments and analysis demonstrate that EchoTorrent achieves few-pass autoregressive generation with substantially extended temporal consistency, identity preservation, and audio-lip synchronization.

[CV-115] LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在农业领域,特别是植物病理学任务中应用受限的问题,其核心挑战在于缺乏大规模、多模态图像-文本数据集和系统性评估基准。解决方案的关键在于构建两个核心工具:一是包含186,000张叶片图像及13,950个问答对的LeafNet多模态数据集,覆盖97种病害类别;二是针对植物病害理解能力设计的LeafBench视觉问答基准,涵盖六类关键农业任务,如症状识别、分类关系与诊断推理等。通过在该基准上对12种先进VLMs进行评测,研究揭示了现有模型在细粒度病原体识别上的显著性能瓶颈,并验证了融合语言信息的多模态架构相较于纯视觉模型在诊断精度上的优势,从而为推动可靠AI辅助植物病害诊断提供了可量化评估框架与方法改进方向。

链接: https://arxiv.org/abs/2602.13662
作者: Khang Nguyen Quoc,Phuong D. Dao,Luyl-Da Quach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 26 pages, 13 figures and 8 tables

点击查看摘要

Abstract:Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image–text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy–diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at this https URL.

[CV-116] Optimizing Point-of-Care Ultrasound Video Acquisition for Probabilistic Multi-Task Heart Failure Detection

【速读】:该论文旨在解决床旁超声心动图(POCUS)在心力衰竭(HF)评估中面临的时间与操作成本约束问题,即如何在有限的采集资源下实现高效且准确的多模态影像数据获取与诊断。解决方案的关键在于引入一种基于强化学习(Reinforcement Learning, RL)的个性化数据采集策略:RL代理根据已获取的多视角视频序列,动态选择下一个应采集的视图或终止采集;随后通过一个共享的多视角Transformer模型进行多任务推理,同时预测主动脉瓣狭窄(AS)严重程度和左心室射血分数(LVEF),并输出高斯预测分布以量化不确定性,从而在诊断性能与采集成本之间建立明确权衡机制,最终实现患者定制化、成本感知的扫描路径优化。

链接: https://arxiv.org/abs/2602.13658
作者: Armin Saadat,Nima Hashemi,Bahar Khodabakhshian,Michael Y. Tsang,Christina Luong,Teresa S.M. Tsang,Purang Abolmaesumi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IJCARS, IPCAI 2026 special issue

点击查看摘要

Abstract:Purpose: Echocardiography with point-of-care ultrasound (POCUS) must support clinical decision-making under tight bedside time and operator-effort constraints. We introduce a personalized data acquisition strategy in which an RL agent, given a partially observed multi-view study, selects the next view to acquire or terminates acquisition to support heart-failure (HF) assessment. Upon termination, a diagnostic model jointly predicts aortic stenosis (AS) severity and left ventricular ejection fraction (LVEF), two key HF biomarkers, and outputs uncertainty, enabling an explicit trade-off between diagnostic performance and acquisition cost. Methods: We model POCUS as a sequential acquisition problem: at each step, a video selector (RL agent) chooses the next view to acquire or terminates acquisition. Upon termination, a shared multi-view transformer performs multi-task inference with two heads, ordinal AS classification, and LVEF regression, and outputs Gaussian predictive distributions yielding ordinal probabilities over AS classes and EF thresholds. These probabilities drive a reward that balances expected diagnostic benefit against acquisition cost, producing patient-specific acquisition pathways. Results: The dataset comprises 12,180 patient-level studies, split into training/validation/test sets (75/15/15). On the 1,820 test studies, our method matches full-study performance while using 32% fewer videos, achieving 77.2% mean balanced accuracy (bACC) across AS severity classification and LVEF estimation, demonstrating robust multi-task performance under acquisition budgets. Conclusion: Patient-tailored, cost-aware acquisition can streamline POCUS workflows while preserving decision quality, producing interpretable scan pathways suited to bedside use. The framework is extensible to additional cardiac endpoints and merits prospective evaluation for clinical integration.

[CV-117] DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation

【速读】:该论文旨在解决视频生成模型在语义一致性(semantic consistency)、几何一致性(geometric consistency)和身份一致性(identity consistency)方面的不足,尤其聚焦于三个关键挑战:片段内世界知识一致性(intra-clip world knowledge consistency)、片段间相机一致性(inter-clip camera consistency)以及镜头间元素一致性(inter-shot element consistency)。其解决方案的核心在于提出了一种系统级框架——分而治之扩散模型(Divide-and-Conquer Diffusion Model, DCDM),通过将上述三类一致性建模分解为三个专用模块,同时共享统一的视频生成骨干网络。具体而言,DCDM利用大语言模型(LLM)解析提示词以构建结构化语义表示,结合扩散Transformer实现片段内内容连贯性;引入噪声空间中的时序相机表示与文本到图像初始化机制,提升片段间相机运动控制的精度与稳定性;并采用全局场景生成范式、窗口交叉注意力与稀疏镜头间自注意力策略,在保障长程叙事一致性的同时保持计算效率。

链接: https://arxiv.org/abs/2602.13637
作者: Haoyu Zhao,Yuang Zhang,Junqi Cheng,Jiaxi Gu,Zenghui Lu,Peng Shu,Zuxuan Wu,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:Recent video generative models have demonstrated impressive visual fidelity, yet they often struggle with semantic, geometric, and identity consistency. In this paper, we propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM), to address three key challenges: (1) intra-clip world knowledge consistency, (2) inter-clip camera consistency, and (3) inter-shot element consistency. DCDM decomposes video consistency modeling under these scenarios into three dedicated components while sharing a unified video generation backbone. For intra-clip consistency, DCDM leverages a large language model to parse input prompts into structured semantic representations, which are subsequently translated into coherent video content by a diffusion transformer. For inter-clip camera consistency, we propose a temporal camera representation in the noise space that enables precise and stable camera motion control, along with a text-to-image initialization mechanism to further enhance controllability. For inter-shot consistency, DCDM adopts a holistic scene generation paradigm with windowed cross-attention and sparse inter-shot self-attention, ensuring long-range narrative coherence while maintaining computational efficiency. We validate our framework on the test set of the CVM Competition at AAAI’26, and the results demonstrate that the proposed strategies effectively address these challenges.

[CV-118] Layer-Guided UAV Tracking: Enhancing Efficiency and Occlusion Robustness

【速读】:该论文旨在解决无人机(UAV)应用中视觉目标跟踪(Visual Object Tracking, VOT)面临的精度与效率之间的权衡问题,尤其是在不可预测遮挡等复杂场景下的性能瓶颈。其解决方案的关键在于提出一个统一的跟踪框架LGTrack,核心创新包括:1)引入轻量级全局分组坐标注意力(Global-Grouped Coordinate Attention, GGCA)模块,以极低计算开销捕获长程依赖和全局上下文信息,提升特征判别能力;2)设计轻量级相似度引导层自适应(Similarity-Guided Layer Adaptation, SGLA)模块,替代知识蒸馏策略,在保证跟踪精度的同时实现更高的推理效率。该方案在多个数据集上实现了实时性能(如UAVDT上达258.7 FPS)并保持了优异的精度(82.8%精度)。

链接: https://arxiv.org/abs/2602.13636
作者: Yang Zhou,Derui Ding,Ran Sun,Ying Sun,Haohua Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual object tracking (VOT) plays a pivotal role in unmanned aerial vehicle (UAV) applications. Addressing the trade-off between accuracy and efficiency, especially under challenging conditions like unpredictable occlusion, remains a significant challenge. This paper introduces LGTrack, a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and robust representation learning for occlusions. By employing a novel lightweight Global-Grouped Coordinate Attention (GGCA) module, LGTrack captures long-range dependencies and global contexts, enhancing feature discriminability with minimal computational overhead. Additionally, a lightweight Similarity-Guided Layer Adaptation (SGLA) module replaces knowledge distillation, achieving an optimal balance between tracking precision and inference efficiency. Experiments on three datasets demonstrate LGTrack’s state-of-the-art real-time speed (258.7 FPS on UAVDT) while maintaining competitive tracking accuracy (82.8% precision). Code is available at this https URL

[CV-119] A generalizable foundation model for intraoperative understanding across surgical procedures

【速读】:该论文旨在解决微创手术中因术者和术式差异导致的术中视觉感知不一致问题,这一问题限制了手术评估、培训及可靠人工智能系统的发展。当前多数手术AI模型仅针对特定任务设计,缺乏跨术式和机构的泛化能力。解决方案的关键在于提出一种名为ZEN的通用基础模型,其通过在超过21种术式、400多万帧视频数据上采用自监督多教师蒸馏框架进行训练,构建了统一的手术场景理解表示。该模型在多种下游任务与不同微调策略(包括全参数微调、冻结主干、少样本和零样本学习)下均显著优于现有模型,并展现出强健的跨术式泛化性能,为术中辅助与手术培训评估提供了可扩展的基础架构。

链接: https://arxiv.org/abs/2602.13633
作者: Kanggil Park,Yongjun Jeon,Soyoung Lim,Seonmin Park,Jongmin Shin,Jung Yong Kim,Sehyeon An,Jinsoo Rhu,Jongman Kim,Gyu-Seong Choi,Namkee Oh,Kyu-Hwan Jung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.

[CV-120] owards Sparse Video Understanding and Reasoning

【速读】:该论文旨在解决视频问答(Video Question Answering, VQA)中因冗余帧处理导致的计算效率低下和推理资源浪费问题。传统方法通常对视频帧进行均匀采样,忽视了关键信息的稀疏性,从而造成不必要的计算开销。解决方案的关键在于提出一种多轮代理框架 \revise(Reasoning with Video Sparsity),其核心机制包括:(1)通过选择少量高信息量帧替代全帧处理以实现视频稀疏推理;(2)在多轮交互中维护“摘要即状态”(summary-as-state)来压缩中间表示;(3)引入早期停止策略,在置信度足够时终止推理流程;此外,为支持开源模型的强化学习微调,作者设计了无标注奖励函数 EAGER(Evidence-Adjusted Gain for Efficient Reasoning),包含置信度提升、摘要充分性和正确且早停三项指标,有效引导模型在减少帧数、轮次和提示词数量的同时提升准确率。

链接: https://arxiv.org/abs/2602.13602
作者: Chenwei Xu,Zhen Ye,Shang Wu,Weijian Li,Zihan Wang,Zhuofan Xia,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Han Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present \revise (\underlineReasoning with \underlineVideo \underlineSparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, \revise selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a ``plug-and-play’’ setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, \revise improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.

[CV-121] AdaVBoost: Mitigating Hallucinations in LVLMs via Token-Level Adaptive Visual Attention Boosting

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在生成过程中产生的幻觉问题,尤其是现有视觉注意力增强方法因采用固定缩放因子而导致的“强度不当”问题:在某些生成步骤中缩放因子过弱无法消除幻觉,而在其他步骤中又过强引发新的幻觉。解决方案的关键在于提出AdaVBoost框架,其核心创新是引入视觉定位熵(Visual Grounding Entropy, VGE),通过融合视觉定位信息来估计每个token的幻觉风险,并据此实现每一步生成时对不同token进行自适应的注意力增强——高风险token获得更强的视觉注意力提升,低风险token则施加较弱的增强,从而实现细粒度、动态的token级干预。

链接: https://arxiv.org/abs/2602.13600
作者: Jiacheng Zhang,Feng Liu,Chao Du,Tianyu Pang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual attention boosting has emerged as a promising direction for mitigating hallucinations in Large Vision-Language Models (LVLMs), where existing methods primarily focus on where to boost by applying a predefined scaling to the attention of method-specific visual tokens during autoregressive generation. In this paper, we identify a fundamental trade-off in these methods: a predefined scaling factor can be too weak at some generation steps, leaving hallucinations unresolved, yet too strong at others, leading to new hallucinations. Motivated by this finding, we propose AdaVBoost, a token-level visual attention boosting framework that adaptively determines how much attention to boost at each generation step. Specifically, we introduce Visual Grounding Entropy (VGE) to estimate hallucination risk, which leverages visual grounding as a complementary signal to capture evidence mismatches beyond entropy. Guided by VGE, AdaVBoost applies stronger visual attention boosting to high-risk tokens and weaker boosting to low-risk tokens, enabling token-level adaptive intervention at each generation step. Extensive experiments show that AdaVBoost significantly outperforms baseline methods across multiple LVLMs and hallucination benchmarks.

[CV-122] wo-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks

【速读】:该论文旨在解决场景解析(scene parsing)与几何视觉(geometric vision)任务在传统方法中难以协同优化的问题,尤其针对两者之间缺乏有效交互机制以及依赖昂贵人工标注对应关系(correspondence ground truth)的局限性。解决方案的关键在于提出Two Interactive Streams (TwInS) 框架,其核心是通过双向特征交互机制实现两个任务的联合学习:一方面,场景解析流提取的多层级上下文特征被注入几何视觉流以指导其迭代优化;另一方面,几何视觉流解码后的特征经由新颖的跨任务适配器(cross-task adapter)投影至上下文空间,实现选择性异构特征融合,从而利用丰富的跨视角几何线索增强场景解析能力。此外,该框架引入定制化的半监督训练策略,无需人工标注对应关系即可利用大规模多视角数据进行持续自进化,显著降低了对标注数据的依赖。

链接: https://arxiv.org/abs/2602.13588
作者: Guanfeng Tang,Hongbo Zhao,Ziwei Long,Jiayao Li,Bohong Xiao,Wei Ye,Hanli Wang,Rui Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS’s core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.

[CV-123] Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)扩散模型在遵循复杂文本描述时存在的语义对齐不足问题,其根源在于文本与视觉特征之间的交互机制不够充分且静态。解决方案的关键在于提出 Diff-Aid,这是一种轻量级的推理阶段方法,能够自适应地调整每个文本标记(token)与图像特征在不同 Transformer 块和去噪时间步(denoising timesteps)上的交互强度,从而实现动态、灵活的跨阶段语义对齐优化。该方法不仅提升了生成质量与提示遵循度,还提供了可解释的调制模式,揭示了各模块与时间步对语义一致性贡献的机制。

链接: https://arxiv.org/abs/2602.13585
作者: Binglei Li,Mengping Yang,Zhiyu Tan,Junping Zhang,Hao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages

点击查看摘要

Abstract:Recent text-to-image (T2I) diffusion models have achieved remarkable advancement, yet faithfully following complex textual descriptions remains challenging due to insufficient interactions between textual and visual features. Prior approaches enhance such interactions via architectural design or handcrafted textual condition weighting, but lack flexibility and overlook the dynamic interactions across different blocks and denoising stages. To provide a more flexible and efficient solution to this problem, we propose Diff-Aid, a lightweight inference-time method that adaptively adjusts per-token text and image interactions across transformer blocks and denoising timesteps. Beyond improving generation quality, Diff-Aid yields interpretable modulation patterns that reveal how different blocks, timesteps, and textual tokens contribute to semantic alignment during denoising. As a plug-and-play module, Diff-Aid can be seamlessly integrated into downstream applications for further improvement, including style LoRAs, controllable generation, and zero-shot editing. Experiments on strong baselines (SD 3.5 and FLUX) demonstrate consistent improvements in prompt adherence, visual quality, and human preference across various metrics. Our code and models will be released.

[CV-124] Privacy-Concealing Cooperative Perception for BEV Scene Segmentation

【速读】:该论文旨在解决自动驾驶协同感知系统中因共享鸟瞰图(Bird’s Eye View, BEV)特征而导致的隐私泄露问题,即敏感视觉内容可能被第三方重建。解决方案的关键在于提出一种隐私隐藏协作(Privacy-Concealing Cooperation, PCC)框架,其核心是设计一个隐藏网络(hiding network),通过对抗学习机制与图像重建网络进行博弈:隐藏网络在保持BEV语义分割性能的前提下,主动掩蔽共享特征中的视觉线索,从而降低重建图像的质量;同时,感知网络与隐藏网络端到端联合优化,确保任务性能不受显著影响。

链接: https://arxiv.org/abs/2602.13555
作者: Song Wang,Lingling Li,Marcus Santos,Guanghui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cooperative perception systems for autonomous driving aim to overcome the limited perception range of a single vehicle by communicating with adjacent agents to share sensing information. While this improves perception performance, these systems also face a significant privacy-leakage issue, as sensitive visual content can potentially be reconstructed from the shared data. In this paper, we propose a novel Privacy-Concealing Cooperation (PCC) framework for Bird’s Eye View (BEV) semantic segmentation. Based on commonly shared BEV features, we design a hiding network to prevent an image reconstruction network from recovering the input images from the shared features. An adversarial learning mechanism is employed to train the network, where the hiding network works to conceal the visual clues in the BEV features while the reconstruction network attempts to uncover these clues. To maintain segmentation performance, the perception network is integrated with the hiding network and optimized end-to-end. The experimental results demonstrate that the proposed PCC framework effectively degrades the quality of the reconstructed images with minimal impact on segmentation performance, providing privacy protection for cooperating vehicles. The source code will be made publicly available upon publication.

[CV-125] Nighttime Autonomous Driving Scene Reconstruction with Physically-Based Gaussian Splatting ICRA2026

【速读】:该论文旨在解决夜间自动驾驶场景重建中因复杂光照与外观条件导致现有基于神经辐射场(Neural Radiance Fields, NeRFs)和三维高斯点绘(3D Gaussian Splatting, 3DGS)方法性能下降的问题。其解决方案的关键在于将物理基础渲染(Physically Based Rendering, PBR)引入3DGS框架,通过联合优化双向反射分布函数(Bidirectional Reflectance Distribution Function, BRDF)驱动的材质属性,在复合场景高斯表示中显式建模漫反射分量(利用全局光照模块)与镜面反射分量(通过各向异性球形高斯实现),从而显著提升户外夜间驾驶场景的重建质量并保持实时渲染能力。

链接: https://arxiv.org/abs/2602.13549
作者: Tae-Kyeong Kim,Xingxin Chen,Guile Wu,Chengjie Huang,Dongfeng Bai,Bingbing Liu
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2026

点击查看摘要

Abstract:This paper focuses on scene reconstruction under nighttime conditions in autonomous driving simulation. Recent methods based on Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved photorealistic modeling in autonomous driving scene reconstruction, but they primarily focus on normal-light conditions. Low-light driving scenes are more challenging to model due to their complex lighting and appearance conditions, which often causes performance degradation of existing methods. To address this problem, this work presents a novel approach that integrates physically based rendering into 3DGS to enhance nighttime scene reconstruction for autonomous driving. Specifically, our approach integrates physically based rendering into composite scene Gaussian representations and jointly optimizes Bidirectional Reflectance Distribution Function (BRDF) based material properties. We explicitly model diffuse components through a global illumination module and specular components by anisotropic spherical Gaussians. As a result, our approach improves reconstruction quality for outdoor nighttime driving scenes, while maintaining real-time rendering. Extensive experiments across diverse nighttime scenarios on two real-world autonomous driving datasets, including nuScenes and Waymo, demonstrate that our approach outperforms the state-of-the-art methods both quantitatively and qualitatively.

[CV-126] SpargeAttention2: Trainable Sparse Attention via Hybrid Top-kTop-p Masking and Distillation Fine-Tuning

【速读】:该论文旨在解决扩散模型中注意力机制计算效率低的问题,特别是如何在不牺牲生成质量的前提下实现更高程度的注意力稀疏化。现有训练-free稀疏注意力方法虽能加速模型,但难以在高稀疏度下保持性能;而可训练稀疏注意力虽具潜力,却面临掩码规则失效、优化目标不足等挑战。解决方案的关键在于提出SpargeAttention2,其核心创新包括:(1)融合Top-k与Top-p的混合掩码策略以提升高稀疏度下的鲁棒性;(2)设计高效的可训练稀疏注意力实现;(3)引入类蒸馏的微调目标,有效缓解扩散损失在稀疏注意力微调中的质量退化问题。实验表明,该方法在视频扩散模型上实现了95%的注意力稀疏度和16.2倍的速度提升,同时优于现有方法。

链接: https://arxiv.org/abs/2602.13515
作者: Jintao Zhang,Kai Jiang,Chendong Xiang,Weiqi Feng,Yuezhou Hu,Haocheng Xi,Jianfei Chen,Jun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.

[CV-127] Benchmarking Video Foundation Models for Remote Parkinsons Disease Screening

【速读】:该论文旨在解决远程视频评估在帕金森病(Parkinson’s disease, PD)筛查中的有效性问题,特别是不同视频基础模型(Video Foundation Models, VFMs)在多样化临床任务中表现差异不明确的挑战。其解决方案的关键在于构建一个大规模、标准化的多任务视频数据集(涵盖1,888名参与者、32,847段视频和16项临床任务),并系统性地评估七种前沿VFMs(如VideoPrism、V-JEPA、ViViT等)在冻结嵌入基础上的线性分类性能。研究发现,不同模型在特定任务上表现出显著差异:例如VideoPrism在无音频条件下擅长捕捉言语运动学特征,而V-JEPA在上肢运动任务中表现最优,TimeSformer则在节律性任务中保持竞争力,从而为远程神经监测中模型与任务的适配提供了实证依据和选择路径。

链接: https://arxiv.org/abs/2602.13507
作者: Md Saiful Islam,Ekram Hossain,Abdelrahman Abdelkader,Tariq Adnan,Fazla Rabbi Mashrur,Sooyong Park,Praveen Kumar,Qasim Sudais,Natalia Chunga,Nami Shah,Jan Freyberg,Christopher Kanan,Ruth Schneider,Ehsan Hoque
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote, video-based assessments offer a scalable pathway for Parkinson’s disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs – including VideoPrism, V-JEPA, ViViT, and VideoMAE – to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: this https URL_video_benchmarking-A2C5

[CV-128] FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在长时程、接触密集型操作任务中表现不佳的问题,其根本原因在于现有方法未显式建模手-物体交互(Hand-Object Interaction, HOI)结构。解决方案的关键在于提出FlowHOI,一种两阶段流匹配框架,通过解耦几何导向的抓取与语义导向的操作,利用3D高斯泼溅(3D Gaussian Splatting, 3DGS)场景重建和紧凑的3D场景token进行条件建模,并引入运动-文本对齐损失以语义上锚定交互行为;同时构建了一个从大规模第一人称视频中恢复对齐手-物体轨迹与网格的重建管道,从而获得高质量HOI先验,显著提升生成结果的物理合理性与可迁移性。

链接: https://arxiv.org/abs/2602.13444
作者: Huajian Zeng,Lingyun Chen,Jiaqi Yang,Yuantai Zhang,Fan Shi,Peidong Liu,Xingxing Zuo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7 \times higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40 \times inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.

[CV-129] Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones

【速读】:该论文旨在解决无人机(UAV)在室内场景中进行类增量学习(Class-Incremental Learning, CIL)时面临的灾难性遗忘问题,同时满足资源受限平台的部署需求。其关键解决方案是提出一个包含14,400帧、具有时间一致性的室内无人机视频数据集,并在此基础上评估三种基于回放(replay-based)的CIL策略——经验回放(Experience Replay, ER)、最大干扰检索(Maximally Interfered Retrieval, MIR)和遗忘感知回放(Forgetting-Aware Replay, FAR),其中FAR在仅5%–10%内存预算下表现最优,平均准确率(ACC, mAP_50-95 across increments)达82.96%,验证了回放机制在边缘空中系统中的有效性。

链接: https://arxiv.org/abs/2602.13440
作者: Sebastian-Ion Nae,Mihai-Eugen Barbu,Sebastian Mocanu,Marius Leordeanu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at European Robotics Forum (ERF) 2026

点击查看摘要

Abstract:Autonomous agents such as indoor drones must learn new object classes in real-time while limiting catastrophic forgetting, motivating Class-Incremental Learning (CIL). However, most unmanned aerial vehicle (UAV) datasets focus on outdoor scenes and offer limited temporally coherent indoor videos. We introduce an indoor dataset of 14,400 frames capturing inter-drone and ground vehicle footage, annotated via a semi-automatic workflow with a 98.6% first-pass labeling agreement before final manual verification. Using this dataset, we benchmark 3 replay-based CIL strategies: Experience Replay (ER), Maximally Interfered Retrieval (MIR), and Forgetting-Aware Replay (FAR), using YOLOv11-nano as a resource-efficient detector for deployment-constrained UAV platforms. Under tight memory budgets ( 5-10% replay), FAR performs better than the rest, achieving an average accuracy (ACC, mAP_50-95 across increments) of 82.96% with 5% replay. Gradient-weighted class activation mapping (Grad-CAM) analysis shows attention shifts across classes in mixed scenes, which is associated with reduced localization quality for drones. The experiments further demonstrate that replay-based continual learning can be effectively applied to edge aerial systems. Overall, this work contributes an indoor UAV video dataset with preserved temporal coherence and an evaluation of replay-based CIL under limited replay budgets. Project page: this https URL

[CV-130] Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

【速读】:该论文针对胸部X光(Chest X-ray, CXR)分类中因监督信息不完善所导致的挑战,具体包括:(i) 多标签疾病分布极度长尾(long-tailed multi-label distribution),使得罕见病类难以识别;(ii) 罕见或未见过的病灶缺乏标注(missing annotations for rare or previously unseen findings)。为应对上述问题,作者提出了两个任务特定的解决方案:对于任务1(长尾多标签分类),采用一种感知类别不平衡的多标签学习策略,在提升尾部类别识别性能的同时保持对常见疾病的稳定表现;对于任务2(零样本外部分布(out-of-distribution, OOD)识别),提出一种无需使用OOD类任何监督标签或样本即可生成未见病种预测分数的方法。关键创新在于分别通过不平衡感知训练机制和零样本推理框架,有效缓解了监督缺失带来的模型偏差与泛化能力不足问题。

链接: https://arxiv.org/abs/2602.13430
作者: Ha-Hieu Pham,Hai-Dang Nguyen,Thanh-Huy Nguyen,Min Xu,Ulas Bagci,Trung-Nghia Le,Huy-Hieu Pham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest X-ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at this https URL.

[CV-131] LAF-YOLOv10 with Partial Convolution Backbone Attention-Guided Feature Pyramid Auxiliary P2 Head and Wise-IoU Loss for Small Object Detection in Drone Aerial Imagery

【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)图像中目标检测的难题,特别是针对小目标(仅占几像素)、背景杂乱、严重遮挡以及机载计算资源受限等挑战。解决方案的关键在于对YOLOv10n框架进行系统性改进,集成四个互补模块:(1)部分卷积C2f(PC-C2f)模块通过限制空间卷积范围至骨干网络通道的四分之一,在减少冗余计算的同时保持判别能力;(2)注意力引导特征金字塔网络(AG-FPN)在多尺度融合前引入Squeeze-and-Excitation通道门控,并用DySample替代最近邻插值实现内容感知的上采样;(3)新增P2检测头(160×160分辨率)以恢复低于8×8像素目标的空间定位能力,同时移除P5头以优化参数分配;(4)使用Wise-IoU v3替代CIoU进行边界框回归,缓解密集场景中标注噪声带来的梯度扰动。这四个模块协同作用,分别攻克了骨干压缩、跨尺度融合优化、亚像素目标恢复和回归稳定性问题,最终在VisDrone-DET2019和UAVDT数据集上显著提升性能(mAP@0.5达35.1%±0.3%),并在NVIDIA Jetson Orin Nano上实现24.3 FPS的实时推理速度,验证其嵌入式部署可行性。

链接: https://arxiv.org/abs/2602.13378
作者: Sohail Ali Farooqui,Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unmanned aerial vehicles serve as primary sensing platforms for surveillance, traffic monitoring, and disaster response, making aerial object detection a central problem in applied computer vision. Current detectors struggle with UAV-specific challenges: targets spanning only a few pixels, cluttered backgrounds, heavy occlusion, and strict onboard computational budgets. This study introduces LAF-YOLOv10, built on YOLOv10n, integrating four complementary techniques to improve small-object detection in drone imagery. A Partial Convolution C2f (PC-C2f) module restricts spatial convolution to one quarter of backbone channels, reducing redundant computation while preserving discriminative capacity. An Attention-Guided Feature Pyramid Network (AG-FPN) inserts Squeeze-and-Excitation channel gates before multi-scale fusion and replaces nearest-neighbor upsampling with DySample for content-aware interpolation. An auxiliary P2 detection head at 160 \times 160 resolution extends localization to objects below 8 \times 8 pixels, while the P5 head is removed to redistribute parameters. Wise-IoU v3 replaces CIoU for bounding box regression, attenuating gradients from noisy annotations in crowded aerial scenes. The four modules address non-overlapping bottlenecks: PC-C2f compresses backbone computation, AG-FPN refines cross-scale fusion, the P2 head recovers spatial resolution, and Wise-IoU stabilizes regression under label noise. No individual component is novel; the contribution is the joint integration within a single YOLOv10 framework. Across three training runs (seeds 42, 123, 256), LAF-YOLOv10 achieves 35.1 \pm 0.3% mAP@0.5 on VisDrone-DET2019 with 2.3,M parameters, exceeding YOLOv10n by 3.3 points. Cross-dataset evaluation on UAVDT yields 35.8 \pm 0.4% mAP@0.5. Benchmarks on NVIDIA Jetson Orin Nano confirm 24.3 FPS at FP16, demonstrating viability for embedded UAV deployment.

[CV-132] he Diffusion Duet: Harmonizing Dual Channels with Wavelet Suppression for Image Separation

【速读】:该论文旨在解决盲图像分离(Blind Image Separation, BIS)问题,即在未知混合模式且无源图像先验知识条件下,从单一观测图像中同时估计并恢复多个独立源图像。传统方法依赖统计独立性假设或CNN/GAN变体,在复杂真实场景下难以准确刻画特征分布,导致强噪声和非线性混合下出现估计偏差、纹理失真及伪影残留。其解决方案的关键在于创新性地引入扩散模型(Diffusion Models),构建双通道扩散分离模型(Dual-Channel Diffusion Separation Model, DCDSM),利用扩散模型强大的生成能力学习源图像特征分布并有效重建结构;同时设计了新颖的小波抑制模块(Wavelet Suppression Module, WSM),嵌入于双分支反向去噪过程中,通过挖掘源图像间耦合噪声特性形成交互式分离网络,显著提升细节分离精度与恢复质量。

链接: https://arxiv.org/abs/2602.13361
作者: Jingwei Li,Wei Pu
机构: Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blind image separation (BIS) refers to the inverse problem of simultaneously estimating and restoring multiple independent source images from a single observation image under conditions of unknown mixing mode and without prior knowledge of the source images. Traditional methods relying on statistical independence assumptions or CNN/GAN variants struggle to characterize complex feature distributions in real scenes, leading to estimation bias, texture distortion, and artifact residue under strong noise and nonlinear mixing. This paper innovatively introduces diffusion models into dual-channel BIS, proposing an efficient Dual-Channel Diffusion Separation Model (DCDSM). DCDSM leverages diffusion models’ powerful generative capability to learn source image feature distributions and reconstruct feature structures effectively. A novel Wavelet Suppression Module (WSM) is designed within the dual-branch reverse denoising process, forming an interactive separation network that enhances detail separation by exploiting the mutual coupling noise characteristic between source images. Extensive experiments on synthetic datasets containing rain/snow and complex mixtures demonstrate that DCDSM achieves state-of-the-art performance: 1) In image restoration tasks, it obtains PSNR/SSIM values of 35.0023 dB/0.9549 and 29.8108 dB/0.9243 for rain and snow removal respectively, outperforming Histoformer and LDRCNet by 1.2570 dB/0.9272 dB (PSNR) and 0.0262/0.0289 (SSIM) on average; 2) For complex mixture separation, the restored dual-source images achieve average PSNR and SSIM of 25.0049 dB and 0.7997, surpassing comparative methods by 4.1249 dB and 0.0926. Both subjective and objective evaluations confirm DCDSM’s superiority in addressing rain/snow residue removal and detail preservation challenges.

[CV-133] AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像和视频生成任务中因迭代去噪结构导致的推理成本过高问题。现有加速方法依赖静态缓存复用策略或粗粒度启发式规则,常引发时间漂移(temporal drift)和缓存错位(cache misalignment),显著降低生成质量。其解决方案的关键在于提出AdaCorrection框架,通过轻量级时空信号动态估计缓存有效性,并在每个时间步自适应融合缓存激活与新鲜激活,从而在不引入额外监督或重训练的前提下实现高效且高质量的缓存复用,保持接近原始模型的FID指标的同时获得适度加速。

链接: https://arxiv.org/abs/2602.13357
作者: Dong Liu,Yanxuan Yu,Ben Lengerich,Ying Nian Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce \textbfAdaCorrection, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.

[CV-134] Detecting Brick Kiln Infrastructure at Scale: Graph Foundation and Remote Sensing Models for Satellite Imagery Data

【速读】:该论文旨在解决南亚地区砖窑(brick kiln)大规模监测难的问题,其核心挑战在于空气污染和强迫劳动问题严重,但现有地面数据稀疏且过时,难以支撑有效监管。解决方案的关键在于提出一种区域自适应的图神经网络模型 ClimateGraph,该模型能捕捉砖窑布局中的空间与方向结构,并结合高分辨率遥感影像(zoom-20,0.149米/像素)构建包含超过130万张图像块的多城市数据集,从而实现对砖窑的精准检测。研究进一步对比了图学习、基础模型(foundation models)与遥感检测管道的互补优势,为基于卫星影像的大规模砖窑监测提供了可扩展的技术路径。

链接: https://arxiv.org/abs/2602.13350
作者: Usman Nazir,Xidong Chen,Hafiz Muhammad Abubakar,Hadia Abu Bakar,Raahim Arbaz,Fezan Rasool,Bin Chen,Sara Khalid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Brick kilns are a major source of air pollution and forced labor in South Asia, yet large-scale monitoring remains limited by sparse and outdated ground data. We study brick kiln detection at scale using high-resolution satellite imagery and curate a multi city zoom-20 (0.149 meters per pixel) resolution dataset comprising over 1.3 million image tiles across five regions in South and Central Asia. We propose ClimateGraph, a region-adaptive graph-based model that captures spatial and directional structure in kiln layouts, and evaluate it against established graph learning baselines. In parallel, we assess a remote sensing based detection pipeline and benchmark it against recent foundation models for satellite imagery. Our results highlight complementary strengths across graph, foundation, and remote sensing approaches, providing practical guidance for scalable brick kiln monitoring from satellite imagery.

[CV-135] From Prompt to Production:Automating Brand-Safe Marketing Imagery with Text-to-Image Models WACV

【速读】:该论文旨在解决生成式 AI (Generative AI) 在生产环境中部署时面临的可扩展性与质量控制难题,特别是在营销图像生成场景下如何平衡自动化处理与人工反馈。其核心挑战在于:仅靠自动化难以保障图像质量与创意一致性,而过度依赖人工则限制了规模效率。解决方案的关键在于构建一个全自动化但具备适度人类监督机制的流水线系统,通过引入基于DINOV2的图像特征匹配方法提升营销对象的保真度(提升30.77%),并结合人类偏好评估优化输出结果(提升52.00%的人类偏好得分),从而实现高效、高质量且符合营销规范的图像生成流程。

链接: https://arxiv.org/abs/2602.13349
作者: Parmida Atighehchian,Henry Wang,Andrei Kapustin,Boris Lerner,Tiancheng Jiang,Taylor Jensen,Negin Sokhandan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 12 figures, Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

点击查看摘要

Abstract:Text-to-image models have made significant strides, producing impressive results in generating images from textual descriptions. However, creating a scalable pipeline for deploying these models in production remains a challenge. Achieving the right balance between automation and human feedback is critical to maintain both scale and quality. While automation can handle large volumes, human oversight is still an essential component to ensure that the generated images meet the desired standards and are aligned with the creative vision. This paper presents a new pipeline that offers a fully automated, scalable solution for generating marketing images of commercial products using text-to-image models. The proposed system maintains the quality and fidelity of images, while also introducing sufficient creative variation to adhere to marketing guidelines. By streamlining this process, we ensure a seamless blend of efficiency and human oversight, achieving a 30.77% increase in marketing object fidelity using DINOV2 and a 52.00% increase in human preference over the generated outcome.

[CV-136] Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots

【速读】:该论文旨在解决自动化仓库中机器人执行上架(stow)操作时,如何准确预测物品放置后储物箱(bin)的布局问题,从而为仓储规划提供可靠的前瞻信息。其核心挑战在于从当前观测和预设的上架行为出发,建模并预测未来状态的几何结构。解决方案的关键在于提出FOREST模型——一个基于上架意图条件的世界模型,它将储物箱状态表示为与物品对齐的实例掩码(instance masks),并采用潜在扩散Transformer(latent diffusion transformer)架构,从观测上下文预测上架后的配置。实验表明,该方法在几何一致性上显著优于启发式基线,并在负载质量评估和多物品上架推理等下游任务中仅带来微小性能损失,证明其可作为有效的前瞻性信号用于仓库调度优化。

链接: https://arxiv.org/abs/2602.13347
作者: Lijun Zhang,Nikhil Chacko,Petter Nilsson,Ruinian Xu,Shantanu Thakar,Bai Lou,Harpreet Sawhney,Zhebin Zhang,Mudit Agrawal,Bhavana Chandrashekhar,Aaron Parness
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 20 pages, 16 figures

点击查看摘要

Abstract:Automated warehouses execute millions of stow operations, where robots place objects into storage bins. For these systems it is valuable to anticipate how a bin will look from the current observations and the planned stow behavior before real execution. We propose FOREST, a stow-intent-conditioned world model that represents bin states as item-aligned instance masks and uses a latent diffusion transformer to predict the post-stow configuration from the observed context. Our evaluation shows that FOREST substantially improves the geometric agreement between predicted and true post-stow layouts compared with heuristic baselines. We further evaluate the predicted post-stow layouts in two downstream tasks, in which replacing the real post-stow masks with FOREST predictions causes only modest performance loss in load-quality assessment and multi-stow reasoning, indicating that our model can provide useful foresight signals for warehouse planning.

[CV-137] FireRed-Image-Edit-1.0 Techinical Report

【速读】:该论文旨在解决指令驱动图像编辑(instruction-based image editing)中生成质量与控制精度不足的问题,尤其是在数据效率、编辑一致性以及跨任务泛化能力方面的挑战。其解决方案的关键在于三个层面的系统性优化:首先,构建大规模(16亿样本)、高质量且语义覆盖均衡的训练数据集,通过多阶段清洗、分层采样和自动标注确保文本到图像生成与图像编辑任务的平衡;其次,设计多阶段训练流程(预训练、监督微调、强化学习),并引入多条件感知桶采样器(Multi-Condition Aware Bucket Sampler)提升批处理效率,结合动态提示重索引机制增强指令对齐;最后,在优化策略上提出不对称梯度优化(Asymmetric Gradient Optimization)用于直接偏好优化(DPO),DiffusionNFT结合布局感知OCR奖励以提升文本编辑准确性,并引入可微分一致性损失(Consistency Loss)保障主体身份保留。这些创新共同推动了图像编辑性能达到当前最优水平。

链接: https://arxiv.org/abs/2602.13344
作者: Super Intelligence Team:Changhao Qiao,Chao Hui,Chen Li,Cunzheng Wang,Dejia Song,Jiale Zhang,Jing Li,Qiang Xiang,Runqi Wang,Shuang Sun,Wei Zhu,Xu Tang,Yao Hu,Yibo Chen,Yuhao Huang,Yuxuan Duan,Zhiyi Chen,Ziyuan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.

[CV-138] An Integrated Causal Inference Framework for Traffic Safety Modeling with Semantic Street-View Visual Features

【速读】:该论文旨在解决宏观交通安全管理中因忽视驾驶员对驾驶环境的视觉感知而造成的因果关系识别不足问题,即现有模型多依赖静态社会人口与基础设施指标,难以量化视觉环境特征对区域交通事故的因果影响。其解决方案的关键在于:首先利用语义分割技术从Google街景图像中提取视觉环境特征,并构建双重机器学习(Double Machine Learning)框架以精准估计这些特征对区域事故的因果效应;其次结合SHAP值解析混杂变量的非线性作用机制,并采用因果森林法估算条件平均处理效应,从而在复杂空间环境下实现因果推断的稳健性和可解释性。

链接: https://arxiv.org/abs/2602.13339
作者: Lishan Sun,Yujia Cheng,Pengfei Cui,Lei Han,Mohamed Abdel-Aty,Yunhan Zheng,Xingchen Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 34 pages, 13 figures

点击查看摘要

Abstract:Macroscopic traffic safety modeling aims to identify critical risk factors for regional crashes, thereby informing targeted policy interventions for safety improvement. However, current approaches rely heavily on static sociodemographic and infrastructure metrics, frequently overlooking the impacts from drivers’ visual perception of driving environment. Although visual environment features have been found to impact driving and traffic crashes, existing evidence remains largely observational, failing to establish the robust causality for traffic policy evaluation under complex spatial environment. To fill these gaps, we applied semantic segmentation on Google Street View imageries to extract visual environmental features and proposed a Double Machine Learning framework to quantify their causal effects on regional crashes. Meanwhile, we utilized SHAP values to characterize the nonlinear influence mechanisms of confounding variables in the models and applied causal forests to estimate conditional average treatment effects. Leveraging crash records from the Miami metropolitan area, Florida, and 220,000 street view images, evidence shows that greenery proportion exerts a significant and robust negative causal effect on traffic crashes (Average Treatment Effect = -6.38, p = 0.005). This protective effect exhibits spatial heterogeneity, being most pronounced in densely populated and socially vulnerable urban cores. While greenery significantly mitigates angle and rear-end crashes, its protective benefit for vulnerable road users (VRUs) remains limited. Our findings provide causal evidence for greening as a potential safety intervention, prioritizing hazardous visual environments while highlighting the need for distinct design optimizations to protect VRUs.

[CV-139] Meningioma Analysis and Diagnosis using Limited Labeled Samples

【速读】:该论文旨在解决脑膜瘤(meningioma)分类中因样本稀缺导致的少样本学习(few-shot learning)难题,尤其是在MRI影像中准确区分不同分级脑膜瘤以优化治疗方案和预后评估的问题。其解决方案的关键在于提出一种具有自适应权重的特征融合架构,通过离散小波变换(discrete wavelet transform, DWT)提取空间-频域特征,并动态调整不同频率带与空间域信息的融合权重,从而显著提升在有限标注数据下的分类性能。

链接: https://arxiv.org/abs/2602.13335
作者: Jiamiao Lu,Wei Wu,Ke Gao,Ping Mao,Weichuan Zhang,Tuo Wang,Lingkun Ma,Jiapan Guo,Zanyi Wu,Yuqing Hu,Changming Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages,7 figures

点击查看摘要

Abstract:The biological behavior and treatment response of meningiomas depend on their grade, making an accurate diagnosis essential for treatment planning and prognosis assessment. We observed that the weighted fusion of spatial-frequency domain features significantly influences meningioma classification performance. Notably, the contribution of specific frequency bands obtained by discrete wavelet transform varies considerably across different images. A feature fusion architecture with adaptive weights of different frequency band information and spatial domain information is proposed for few-shot meningioma learning. To verify the effectiveness of the proposed method, a new MRI dataset of meningiomas is introduced. The experimental results demonstrate the superiority of the proposed method compared with existing state-of-the-art methods in three datasets. The code will be available at: this https URL

[CV-140] Ask the Expert: Collaborative Inference for Vision Transformers with Near-Edge Accelerators

【速读】:该论文旨在解决在边缘设备上部署视觉Transformer(Vision Transformer, ViT)时面临的高计算复杂性问题,同时避免将全部推理任务卸载到云端所导致的显著延迟。其核心解决方案是提出一种协同推理框架,该框架在边缘端运行一个轻量级通用ViT模型,在近边缘加速器上部署多个中等规模的专业ViT模型,并设计了一种基于边缘模型Top-k预测结果的动态路由机制,用于为置信度低的样本选择最相关的专家模型进行处理。此外,论文还引入了一种渐进式专业训练策略,以提升专家模型在特定数据子集上的准确性。实验表明,该方法在CIFAR-100数据集上的真实边缘与近边缘测试平台上,相较静态专家模型提升了整体准确率2.76%,专家子集准确率提升4.12%,同时相比纯边缘执行降低高达45%的延迟和46%的能量消耗。

链接: https://arxiv.org/abs/2602.13334
作者: Hao Liu,Suhaib A. Fahmy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying Vision Transformers on edge devices is challenging due to their high computational complexity, while full offloading to cloud resources presents significant latency overheads. We propose a novel collaborative inference framework, which orchestrates a lightweight generalist ViT on an edge device and multiple medium-sized expert ViTs on a near-edge accelerator. A novel routing mechanism uses the edge model’s Top- \mathitk predictions to dynamically select the most relevant expert for samples with low confidence. We further design a progressive specialist training strategy to enhance expert accuracy on dataset subsets. Extensive experiments on the CIFAR-100 dataset using a real-world edge and near-edge testbed demonstrate the superiority of our framework. Specifically, the proposed training strategy improves expert specialization accuracy by 4.12% on target subsets and enhances overall accuracy by 2.76% over static experts. Moreover, our method reduces latency by up to 45% compared to edge execution, and energy consumption by up to 46% compared to just near-edge offload.

[CV-141] MedScope: Incentivizing “Think with Videos” for Clinical Reasoning via Coarse-to-Fine Tool Calling

【速读】:该论文旨在解决当前多模态大语言模型在处理长时临床视频时,因采用被动采样或弱关联的检查方式,导致其难以通过迭代方式精确定位、验证并用时间上精准的视觉证据来解释预测结果的问题。解决方案的关键在于提出MedScope——一种具备工具调用能力的临床视频推理模型,它通过粗到细的证据搜索策略,将中间推理与目标导向的工具调用及检索到的观察结果验证相交织,从而生成更准确且可信赖的预测,并明确地基于时间定位的视觉证据。此外,为克服高质量监督信号缺失的问题,研究构建了以证据为中心的细粒度临床视频数据集ClinVideoSuite,并引入Grounding-Aware Group Relative Policy Optimization(GA-GRPO)优化方法,直接强化与证据对齐的工具使用行为和基于证据加权的优势值,显著提升了模型在域内和域外评估中的表现。

链接: https://arxiv.org/abs/2602.13332
作者: Wenjie Li,Yujie Zhang,Haoran Sun,Xingqi He,Hongcheng Gao,Chenglong Ma,Ming Hu,Guankun Wang,Shiyi Yao,Renhao Yang,Hongliang Ren,Lei Wang,Junjun He,Yankai Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics and related settings. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely “think with videos” through tool-integrated reasoning. We will release our code, models, and data.

[CV-142] Zwitscherkasten – DIY Audiovisual bird monitoring

【速读】:该论文旨在解决传统鸟类物种监测方法在资源受限环境下难以实现高效、实时、非侵入式观测的问题。其解决方案的关键在于构建一个名为Zwitscherkasten的DIY多模态系统,该系统基于边缘计算平台部署深度学习模型,融合声学与视觉数据进行分类识别:通过声学活动检测模块降低能耗,利用细粒度的目标检测与分类流水线实现图像中的鸟类识别,从而在嵌入式设备上实现高精度的鸟类物种识别,支持可扩展的生物多样性监测与公民科学应用。

链接: https://arxiv.org/abs/2602.13330
作者: Dominik Blum,Elias Häring,Fabian Jirges,Martin Schäffer,David Schick,Florian Schulenberg,Torsten Schön
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Report of the Applied Artificial Intelligence Degree Program at Technische Hochschule Ingolstadt

点击查看摘要

Abstract:This paper presents Zwitscherkasten, a DiY, multimodal system for bird species monitoring using audio and visual data on edge devices. Deep learning models for bioacoustic and image-based classification are deployed on resource-constrained hardware, enabling real-time, non-invasive monitoring. An acoustic activity detector reduces energy consumption, while visual recognition is performed using fine-grained detection and classification pipelines. Results show that accurate bird species identification is feasible on embedded platforms, supporting scalable biodiversity monitoring and citizen science applications.

[CV-143] HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶等安全关键场景中应用时存在的三大核心问题:数值推理不精确、三维空间感知能力弱以及对上下文高度敏感。为应对这些挑战,作者提出了一种新型分层时空VLA模型——HiST-VLA,其解决方案的关键在于:首先通过融合几何感知与细粒度驾驶指令及状态历史提示来增强三维空间和时间推理能力;其次引入动态令牌稀疏化机制,在不牺牲性能的前提下融合冗余令牌以提升计算效率;最后采用分层Transformer规划器,逐步将粗粒度VLA路径点细化为精细轨迹,并借助动态潜在正则化确保语言指令的空间精准性和时间一致性。

链接: https://arxiv.org/abs/2602.13329
作者: Yiru Wang,Zichong Gu,Yu Gao,Anqing Jiang,Zhigang Sun,Shuo Wang,Yuwen Heng,Hao Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their utilization in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, achieving an EPDMS of 88.6, and EPDMS of 50.9 on pseudo closed-loop Navhard benchmark. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO) Cite as: arXiv:2602.13329 [cs.CV] (or arXiv:2602.13329v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.13329 Focus to learn more arXiv-issued DOI via DataCite

[CV-144] MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation

【速读】:该论文旨在解决当前角色图像动画(Character Image Animation)方法在多人类体场景下难以泛化的问题,此类场景涉及多样化的类人形态、复杂的交互关系以及频繁的遮挡现象。现有方法在处理多角色协同运动时存在局限性,无法有效建模不同个体间的动态关联与空间约束。解决方案的关键在于两个核心创新:一是提出统一的运动表示(Unified Motion Representations),能够提取与身份无关的运动特征并显式绑定至对应角色,从而实现跨类人形态的通用性;二是设计一种整体式的4D锚定范式(Holistic 4D-Anchored Paradigm),构建共享的4D空间以融合运动表示与视频潜在特征,并通过分层的4D级监督机制增强对交互和遮挡的建模能力。基于此,作者开发了MotionWeaver框架,在自建的46小时多人类视频数据集和300视频基准上验证了其优越性能与强泛化能力。

链接: https://arxiv.org/abs/2602.13326
作者: Xirui Hu,Yanbo Ding,Jiahao Wang,Tingting Shi,Yali Wang,Guo Zhi Zhi,Weizhan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.

[CV-145] Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

【速读】:该论文旨在解决军事动态环境中自主边缘机器人部署所面临的两大挑战:一是领域特定训练数据稀缺,二是边缘硬件计算资源受限。解决方案的关键在于提出一种分层零样本(hierarchical zero-shot)架构,其核心是将轻量级目标检测模型(如Grounding DINO)与紧凑型视觉语言模型(VLMs,来自Qwen和Gemma系列,参数规模4B-12B)进行级联:前者作为高召回率、文本提示驱动的区域提案器,输出高置信度帧后交由边缘适配的VLM进行语义验证。该方法在Battlefield 6合成视频上实现了高达100%的误报过滤准确率、97.5%的损伤评估准确率及55–90%的细粒度车辆分类性能,并进一步扩展为代理式“侦察兵-指挥官”工作流,在亚75秒延迟下实现100%资产部署正确性和9.8/10推理评分(GPT-4o评分)。此外,文中提出的“受控输入”方法揭示了感知与推理解耦后的不同故障模式,为安全关键场景中VLM选型提供了诊断依据。

链接: https://arxiv.org/abs/2602.13324
作者: Jesse Barkley,Abraham George,Amir Barati Farimani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 Pages, 3 Figures

点击查看摘要

Abstract:Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel “Controlled Input” methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.

[CV-146] Diagnostic Benchmarks for Invariant Learning Dynamics: Empirical Validation of the Eidos Architecture

【速读】:该论文旨在解决当前视觉模型在面对结构变化时缺乏对拓扑不变性(topological invariance)的建模能力的问题,即现有基准数据集中的纹理相关性(textural correlations)主导了模型表现,掩盖了其对几何结构本质特征的理解。解决方案的关键在于提出PolyShapes-Ideal(PSI)诊断性基准,通过三个探针任务(噪声下的多边形分类、零样本字体迁移、渐进形变下的几何坍缩映射)分离出几何结构的稳定性,并验证Eidos架构能够实现99%的准确率和81.67%的零样本字体迁移性能,从而支持“形式优先”(Form-First)假说:结构约束型架构的泛化能力源于几何完整性,而非统计规模。

链接: https://arxiv.org/abs/2602.13322
作者: Datorien L. Anderson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 3 figures and extra material to help can be found: this https URL

点击查看摘要

Abstract:We present the PolyShapes-Ideal (PSI) dataset, a suite of diagnostic benchmarks designed to isolate topological invariance – the ability to maintain structural identity across affine transformations – from the textural correlations that dominate standard vision benchmarks. Through three diagnostic probes (polygon classification under noise, zero-shot font transfer from MNIST, and geometric collapse mapping under progressive deformation), we demonstrate that the Eidos architecture achieves 99% accuracy on PSI and 81.67% zero-shot transfer across 30 unseen typefaces without pre-training. These results validate the “Form-First” hypothesis: generalization in structurally constrained architectures is a property of geometric integrity, not statistical scale.

[CV-147] DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

【速读】:该论文旨在解决学术幻灯片自动生成与迭代编辑过程中存在的多维度挑战,包括内容选择的忠实性、幻灯片间的逻辑连贯性、布局感知的渲染质量以及多轮指令遵循能力等问题。现有基准测试和评估协议无法充分衡量这些复杂需求,导致系统性能难以客观比较与改进。为此,作者提出了Deck Edits and Compliance Kit Benchmark (DECKBench),这是一个基于精心构建的论文到幻灯片配对数据集并附加真实模拟编辑指令的评估框架。其关键创新在于系统性地从幻灯片级和整套幻灯片(deck)级两个层面评估内容忠实度、连贯性、布局质量和多轮指令响应能力,并设计了一个模块化的多智能体基线系统,将任务分解为论文解析与摘要、幻灯片规划、HTML生成及迭代编辑四个阶段,从而为学术演示文稿生成与编辑系统的可复现评估提供了标准化基础。

链接: https://arxiv.org/abs/2602.13318
作者: Daesik Jang,Morgan Lindsay Heisler,Linzi Xing,Yifei Li,Edward Wang,Ying Xiong,Yong Zhang,Zhenan Fan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at this https URL .

[CV-148] IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因视觉标记(visual tokens)数量庞大而导致的显著计算瓶颈问题。现有方法通常基于标记重要性或语义多样性进行剪枝,但缺乏一个统一且最优的整合框架。解决方案的关键在于提出一种名为IDPruner的新方法,其核心是利用最大边际相关性(Maximal Marginal Relevance, MMR)算法,在不依赖注意力图(attention maps)的前提下,实现对视觉标记重要性和语义多样性的帕累托最优平衡,从而支持高效的一次性剪枝(one-shot pruning)并兼容FlashAttention,显著提升推理效率与泛化能力。

链接: https://arxiv.org/abs/2602.13315
作者: Yifan Tan,Yifu Sun,Shirui Huang,Hong Liu,Guanghua Yu,Jianchen Zhu,Yangdong Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the \textbfImportance and \textbfDiversity Pruner (\textbfIDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at this https URL.

[CV-149] Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction

【速读】:该论文旨在解决基于学习的毫米波(mmWave)雷达感知在室内视觉退化环境(如烟雾、灰尘和低光)中因大规模雷达数据集稀缺且标注成本高昂而导致的性能瓶颈问题。其核心解决方案是提出Sim2Radar框架,该框架通过单视角RGB图像直接合成训练用雷达数据,无需人工建模场景;关键创新在于结合单目深度估计、分割与视觉语言推理以重建具有材质信息的三维场景,并利用可配置的物理基射线追踪器模拟mmWave传播过程,其中Fresnel反射模型参数由ITU-R电磁特性定义,从而生成高质量、几何先验准确的合成雷达点云数据,显著提升下游3D雷达目标检测性能(最高提升+3.7 3D AP @ IoU 0.3),主要得益于空间定位精度的改善。

链接: https://arxiv.org/abs/2602.13314
作者: Emily Bejerano,Federico Tondolo,Aayan Qayyum,Xiaofan Yu,Xiaofan Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Millimeter-wave (mmWave) radar provides reliable perception in visually degraded indoor environments (e.g., smoke, dust, and low light), but learning-based radar perception is bottlenecked by the scarcity and cost of collecting and annotating large-scale radar datasets. We present Sim2Radar, an end-to-end framework that synthesizes training radar data directly from single-view RGB images, enabling scalable data generation without manual scene modeling. Sim2Radar reconstructs a material-aware 3D scene by combining monocular depth estimation, segmentation, and vision-language reasoning to infer object materials, then simulates mmWave propagation with a configurable physics-based ray tracer using Fresnel reflection models parameterized by ITU-R electromagnetic properties. Evaluated on real-world indoor scenes, Sim2Radar improves downstream 3D radar perception via transfer learning: pre-training a radar point-cloud object detection model on synthetic data and fine-tuning on real radar yields up to +3.7 3D AP (IoU 0.3), with gains driven primarily by improved spatial localization. These results suggest that physics-based, vision-driven radar simulation can provide effective geometric priors for radar learning and measurably improve performance under limited real-data supervision.

[CV-150] Agent ic Spatio-Temporal Grounding via Collaborative Reasoning

【速读】:该论文旨在解决时空视频定位(Spatio-Temporal Video Grounding, STVG)任务中现有方法存在的冗余计算、强监督依赖以及泛化能力不足的问题,尤其是弱监督方法受限于数据集级训练-适配范式且性能较差的局限性。解决方案的关键在于提出一种无需训练的代理式框架——Agentic Spatio-Temporal Grounder (ASTG),其核心由两个基于多模态大语言模型(Multimodal Large Language Models, MLLMs)构建的专用代理组成:空间推理代理(Spatial Reasoning Agent, SRA)和时间推理代理(Temporal Reasoning Agent, TRA),二者协同工作以自主、自引导的方式完成目标时空管(spatio-temporal tube)的检索、验证与定位。该框架通过“提议-评估”范式解耦时空推理过程,并利用专门设计的视觉记忆与对话上下文显著提升检索效率,在多个基准测试中展现出优于现有弱监督和零样本方法的性能,甚至可媲美部分全监督方法。

链接: https://arxiv.org/abs/2602.13313
作者: Heng Zhao,Yew-Soon Ong,Joey Tianyi Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants mitigate annotation costs but remain constrained by the dataset-level train-and-fit paradigm with an inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the task of STVG towards an open-world and training-free scenario. Specifically, two specialized agents SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent) constructed leveraging on modern Multimoal Large Language Models (MLLMs) work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluation paradigm, ASTG duly decouples spatio-temporal reasoning and automates the tube extraction, verification and temporal localization processes. With a dedicate visual memory and dialogue context, the retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach where it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some of the fully-supervised methods.

[CV-151] Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

【速读】:该论文旨在解决现有大语言模型(Large Language Models, LLMs)在测试时扩展规律中因纵向扩展(即增加推理长度)导致的探索空间受限问题,尤其是在视觉任务中缺乏有效的并行推理机制。其核心挑战在于:随着推理深度增加,模型易陷入固定思维模式,限制了多样性与泛化能力;而将此类并行推理范式迁移至多模态大模型(Multimodal Large Language Models, MLLMs)的视觉领域仍属开放问题。解决方案的关键在于提出Visual Para-Thinker——首个面向MLLMs的并行推理框架,通过引入视觉分区策略、Pa-Attention(用于维持路径独立性)和LPRoPE(用于增强推理多样性),实现高效的多路径并行推理,并基于vLLM框架完成原生多模态实现,从而显著提升视觉任务中的推理多样性与性能表现。

链接: https://arxiv.org/abs/2602.13310
作者: Haoran Xu,Hongyu Wang,Jiaze Li,Shunpeng Chen,Zizhao Tong,Jianzhong Ju,Zhenbo Luo,Jian Luan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.

[CV-152] Fine-Tuning a Large Vision-Language Model for Artworks Scoring and Critique

【速读】:该论文旨在解决人工评估艺术创造力(artistic creativity)在大规模场景下效率低下的问题,尤其是在绘画作品的评分与反馈方面。传统方法如托伦斯创造性思维测试(Torrance Tests of Creative Thinking)依赖人力评分,成本高且难以扩展。为提升自动化水平并提供可解释的反馈,作者提出了一种基于多任务学习的视觉-语言模型微调框架,关键在于使用Qwen2-VL-7B模型,在视觉编码器输出端添加轻量级回归头以实现单一前向传播中同时预测数值分数和生成结构化评语。通过将五维评分量表(originality, color, texture, composition, content)和艺术家描述嵌入系统提示(system prompt),确保生成文本与定量预测一致,从而实现高精度(Pearson r = 0.97,MAE ≈ 3.95)与语义贴近专家评论(SBERT余弦相似度平均=0.798)的自动创造力评估。

链接: https://arxiv.org/abs/2602.13306
作者: Zhehan Zhang,Meihua Qian,Li Luo,Siyu Huang,Chaoyi Zhou,Ripon Saha,Xinxin Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric and the artwork description in the system prompt, we constrain the generated text to match the quantitative prediction. Experiments show strong accuracy, achieving Pearson r 0.97 and MAE about 3.95 on the 100-point scale. Qualitative evaluation indicates the generated feedback is semantically close to expert critiques (average SBERT cosine similarity = 0.798). The proposed approach bridges computer vision and art assessment and offers a scalable tool for creativity research and classroom feedback.

[CV-153] WildfireVLM: AI-powered Analysis for Early Wildfire Detection and Risk Assessment Using Satellite Imagery

【速读】:该论文旨在解决野火监测中因烟雾信号微弱、天气条件动态变化及大范围实时分析需求带来的挑战,从而提升早期检测与风险评估的准确性与时效性。其解决方案的关键在于提出WildfireVLM框架,该框架融合了计算机视觉与多模态大语言模型(Multimodal Large Language Models, MLLMs):首先利用YOLOv12算法从Landsat-8/9、GOES-16等多源卫星影像中精准识别火点和烟羽区域;随后通过MLLMs将检测结果转化为具有上下文意义的风险评估报告和优先级响应建议,实现从图像识别到决策支持的闭环;此外,系统采用服务化架构支持实时处理、可视化风险仪表盘与长期追踪,显著提升了野火监测的智能化与可扩展性。

链接: https://arxiv.org/abs/2602.13305
作者: Aydin Ayanzadeh,Prakhar Dixit,Sadia Kamal,Milton Halem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Wildfires are a growing threat to ecosystems, human lives, and infrastructure, with their frequency and intensity rising due to climate change and human activities. Early detection is critical, yet satellite-based monitoring remains challenging due to faint smoke signals, dynamic weather conditions, and the need for real-time analysis over large areas. We introduce WildfireVLM, an AI framework that combines satellite imagery wildfire detection with language-driven risk assessment. We construct a labeled wildfire and smoke dataset using imagery from Landsat-8/9, GOES-16, and other publicly available Earth observation sources, including harmonized products with aligned spectral bands. WildfireVLM employs YOLOv12 to detect fire zones and smoke plumes, leveraging its ability to detect small, complex patterns in satellite imagery. We integrate Multimodal Large Language Models (MLLMs) that convert detection outputs into contextualized risk assessments and prioritized response recommendations for disaster management. We validate the quality of risk reasoning using an LLM-as-judge evaluation with a shared rubric. The system is deployed using a service-oriented architecture that supports real-time processing, visual risk dashboards, and long-term wildfire tracking, demonstrating the value of combining computer vision with language-based reasoning for scalable wildfire monitoring.

[CV-154] Progressive Contrast Registration for High-Fidelity Bidirectional Photoacoustic Microscopy Alignment

【速读】:该论文旨在解决双向光栅扫描(bidirectional raster scanning)在光学分辨率光声显微成像(OR-PAM)中因域偏移(domain shift)和几何失准(geometric misalignment)导致的图像配准质量差的问题。现有方法受限于亮度恒常性假设,配准相关系数(NCC)难以超过0.96。其解决方案的关键在于提出PCReg-Net——一种渐进式对比引导配准框架,通过四个轻量化模块实现粗到精的逐级对齐:首先使用注册U-Net进行粗配准,再利用参考特征提取器捕捉多尺度结构线索,随后通过对比模块识别残余错位,最后借助特征注入的细化U-Net输出高保真结果。该方法在OR-PAM-Reg-4K数据集上实现NCC=0.983、SSIM=0.982、PSNR=46.96 dB,显著优于现有技术,并支持实时处理。

链接: https://arxiv.org/abs/2602.13304
作者: Jiahao Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 3 tables

点击查看摘要

Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing methods, constrained by brightness constancy assumptions, achieve limited alignment quality (NCC~ \leq 0.96 ). We propose PCReg-Net, a progressive contrast-guided registration framework that performs coarse-to-fine alignment through four lightweight modules: (1)~a registration U-Net for coarse alignment, (2)~a reference feature extractor capturing multi-scale structural cues, (3)~a contrast module that identifies residual misalignment by comparing coarse-registered and reference features, and (4)~a refinement U-Net with feature injection for high-fidelity output. We further propose the Temporal NCC (TNCC) and Temporal NCC Gap (TNCG) for reference-free evaluation of inter-frame temporal consistency. On OR-PAM-Reg-4K (432 test samples), PCReg-Net achieves NCC of 0.983, SSIM of 0.982, and PSNR of 46.96 dB, surpassing the state-of-the-art by over 14 dB at real-time speed. Code is available at this https URL

[CV-155] Spectral Collapse in Diffusion Inversion

【速读】:该论文旨在解决条件扩散模型在无配对图像翻译任务中因源域谱稀疏(如超分辨率、草图到图像生成)导致的“频谱坍缩”(spectral collapse)问题,即标准确定性反演方法(如DDIM)恢复的潜在表示偏离预期的各向同性高斯分布,表现为低频信号主导,致使目标采样生成结果过度平滑且纹理缺失。为解决这一结构与纹理之间的权衡难题,论文提出一种推理时的正交方差引导(Orthogonal Variance Guidance, OVG)方法,其关键在于通过修正常微分方程(ODE)动力学,在结构梯度的零空间内强制约束理论噪声幅度,从而在不破坏输入语义关联的前提下有效恢复高频纹理信息,同时保持结构保真度。

链接: https://arxiv.org/abs/2602.13303
作者: Nicolas Bourriez,Alexandre Verine,Auguste Genovesio
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g. DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the recovered latent from the input does not follow the expected isotropic Gaussian distribution. Instead it exhibits a signal with lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.

[CV-156] DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving ICLR2026

【速读】:该论文旨在解决当前端到端自动驾驶(End-to-End Autonomous Driving, E2E-AD)系统中因模块化设计顺序固定而导致的信息损失与累积误差问题,以及现有方法在跨模态关系建模灵活性、长时序融合效率和计算复杂度方面的局限性。其解决方案的关键在于提出一种任务导向的可扩展范式 DriveMamba,该范式将动态任务关系建模、隐式视图对应学习与长期时序融合统一整合进单阶段 Unified Mamba 解码器中;通过预先将图像特征与任务输出转化为基于3D空间位置排序的稀疏token表示,并利用线性复杂度的Mamba算子实现高效长序列建模,从而同时捕捉任务相关的依赖关系;此外,设计了双向轨迹引导的“局部到全局”扫描策略以保留从自车视角出发的空间局部性,有效提升规划性能。

链接: https://arxiv.org/abs/2602.13301
作者: Haisheng Su,Wei Wu,Feixiang Song,Junjie Zhang,Zhenjie Yang,Junchi Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR2026

点击查看摘要

Abstract:Recent advances towards End-to-End Autonomous Driving (E2E-AD) have been often devoted on integrating modular designs into a unified framework for joint optimization e.g. UniAD, which follow a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such manual ordering design can inevitably cause information loss and cumulative errors, lacking flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of image backbone and quadratic-complexity of attention mechanism also hinder the scalability and efficiency of E2E-AD system to handle spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided “local-to-global” scan method is designed to preserve spatial locality from ego-perspective, thus facilitating the ego-planning. Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.

[CV-157] KidMesh: Computational Mesh Reconstruction for Pediatric Congenital Hydronephrosis Using Deep Neural Networks

【速读】:该论文旨在解决儿科先天性肾积水(Congenital Hydronephrosis, CH)的影像学评估中,传统基于体素的分割方法无法直接生成可用于尿动力学仿真等功能分析的网格(mesh)表示的问题。现有方法虽能提取CH区域的形态特征,但需依赖复杂的后处理步骤将分割结果转换为网格,效率低且易引入误差。其解决方案的关键在于提出一种端到端的深度神经网络模型KidMesh,该模型通过从磁共振尿路造影(MRU)图像中提取特征图并映射为特征顶点,再以模板网格变形的方式自动生成精确的CH网格;同时设计了一种无需精确网格标注的新训练范式,克服了MRU切片稀疏导致难以获取高质量网格标签的难题,从而实现了快速、准确的网格重建,并可直接用于肾脏尿液流动模拟,提升临床尿动力学评估的可行性与精度。

链接: https://arxiv.org/abs/2602.13299
作者: Haoran Sun,Zhanpeng Zhu,Anguo Zhang,Bo Liu,Zhaohua Lin,Liqin Huang,Mingjing Yang,Lei Liu,Shan Lin,Wangbin Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pediatric congenital hydronephrosis (CH) is a common urinary tract disorder, primarily caused by obstruction at the renal pelvis-ureter junction. Magnetic resonance urography (MRU) can visualize hydronephrosis, including renal pelvis and calyces, by utilizing the natural contrast provided by water. Existing voxel-based segmentation approaches can extract CH regions from MRU, facilitating disease diagnosis and prognosis. However, these segmentation methods predominantly focus on morphological features, such as size, shape, and structure. To enable functional assessments, such as urodynamic simulations, external complex post-processing steps are required to convert these results into mesh-level representations. To address this limitation, we propose an end-to-end method based on deep neural networks, namely KidMesh, which could automatically reconstruct CH meshes directly from MRU. Generally, KidMesh extracts feature maps from MRU images and converts them into feature vertices through grid sampling. It then deforms a template mesh according to these feature vertices to generate the specific CH meshes of MRU images. Meanwhile, we develop a novel schema to train KidMesh without relying on accurate mesh-level annotations, which are difficult to obtain due to the sparsely sampled MRU slices. Experimental results show that KidMesh could reconstruct CH meshes in an average of 0.4 seconds, and achieve comparable performance to conventional methods without requiring post-processing. The reconstructed meshes exhibited no self-intersections, with only 3.7% and 0.2% of the vertices having error distances exceeding 3.2mm and 6.4mm, respectively. After rasterization, these meshes achieved a Dice score of 0.86 against manually delineated CH masks. Furthermore, these meshes could be used in renal urine flow simulations, providing valuable urodynamic information for clinical practice.

[CV-158] Effect of Convolutional Depth on Image Recognition Performance: VGG vs. ResNet vs. GoogLeNet

【速读】:该论文旨在解决深度卷积神经网络(Convolutional Neural Networks, CNNs)中“名义深度”(nominal depth)与“有效深度”(effective depth)之间的不一致性问题,即为何增加网络层数并不总是带来性能提升、优化稳定性或计算效率的改善。解决方案的关键在于通过控制变量实验,系统比较VGG、ResNet和GoogLeNet三种典型架构,在统一训练协议下区分名义深度与有效深度,并发现:只有具备特定结构机制(如残差连接或Inception模块)的网络才能将增加的深度转化为实际有效的表征能力提升,从而实现更优的准确率-计算权衡;因此,真正决定深度价值的是“有效深度”,而非单纯的“名义深度”。

链接: https://arxiv.org/abs/2602.13298
作者: Manfred M. Fischer,Joshua Pitts
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Increasing convolutional depth has been central to advances in image recognition, yet deeper networks do not uniformly yield higher accuracy, stable optimization, or efficient computation. We present a controlled comparative study of three canonical convolutional neural network architectures - VGG, ResNet, and GoogLeNet - to isolate how depth influences classification performance, convergence behavior, and computational efficiency. By standardizing training protocols and explicitly distinguishing between nominal and effective depth, we show that the benefits of depth depend critically on architectural mechanisms that constrain its effective manifestation during training rather than on nominal depth alone. Although plain deep networks exhibit early accuracy saturation and optimization instability, residual and inception-based architectures consistently translate additional depth into improved accuracy at lower effective depth and favorable accuracy-compute trade-offs. These findings demonstrate that effective depth, not nominal depth, is the operative quantity governing depth’s role as a productive scaling dimension in convolutional networks.

[CV-159] Conditional Generative Models for High-Resolution Range Profiles: Capturing Geometry-Driven Trends in a Large-Scale Maritime Dataset

【速读】:该论文旨在解决高分辨率距离剖面(High-resolution range profiles, HRRPs)在雷达自动目标识别中因对获取条件敏感而导致的鲁棒性不足问题。解决方案的关键在于利用大规模海上数据库进行条件化HRRP生成,通过将生成模型 conditioning 在几何变量(如舰船尺寸和期望的视角角度)上,使合成的HRRP能够重现真实数据中观测到的视线方向几何趋势,从而提升HRRP生成的物理一致性与场景适应性。

链接: https://arxiv.org/abs/2602.13297
作者: Edwyn Brient(CMM),Santiago Velasco-Forero(CMM),Rami Kassab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-resolution range profiles (HRRPs) enable fast onboard processing for radar automatic target recognition, but their strong sensitivity to acquisition conditions limits robustness across operational scenarios. Conditional HRRP generation can mitigate this issue, yet prior studies are constrained by small, highly specific datasets. We study HRRP synthesis on a largescale maritime database representative of coastal surveillance variability. Our analysis indicates that the fundamental scenario drivers are geometric: ship dimensions and the desired aspect angle. Conditioning on these variables, we train generative models and show that the synthesized signatures reproduce the expected line-of-sight geometric trend observed in real data. These results highlight the central role of acquisition geometry for robust HRRP generation.

[CV-160] MFN Decomposition and Related Metrics for High-Resolution Range Profiles Generative Models

【速读】:该论文旨在解决高分辨率距离剖面(High-resolution range profile, HRRP)生成数据的评估难题,尤其是现有基于分类模型的评估方法因“黑箱”特性而缺乏可解释性与多层级评价能力的问题。解决方案的关键在于将HRRP数据分解为三个物理可解释的组成部分:掩码(mask)、特征(features)和噪声(noise),并据此提出两个基于物理意义的新评估指标,从而在昂贵且复杂的HRRP数据集上实现对生成数据质量的判别性评估。

链接: https://arxiv.org/abs/2602.13296
作者: Edwyn Brient(CMM),Santiago Velasco-Forero(CMM),Rami Kassab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-resolution range profile (HRRP ) data are in vogue in radar automatic target recognition (RATR). With the interest in classifying models using HRRP, filling gaps in datasets using generative models has recently received promising contributions. Evaluating generated data is a challenging topic, even for explicit data like face images. However, the evaluation methods used in the state-ofthe-art of HRRP generation rely on classification models. Such models, called ‘‘black-box’’, do not allow either explainability on generated data or multi-level evaluation. This work focuses on decomposing HRRP data into three components: the mask, the features, and the noise. Using this decomposition, we propose two metrics based on the physical interpretation of those data. We take profit from an expensive dataset to evaluate our metrics on a challenging task and demonstrate the discriminative ability of those.

[CV-161] VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在物理动态推理能力评估中的核心难题:现有基准测试(如视觉问答VQA和期望违背VoE)通常依赖识别式协议,难以验证模型是否真正构建了可检验的物理假设。其解决方案的关键在于提出VisPhyWorld框架——一个基于执行的评估范式,要求模型从视觉观测中生成可执行的仿真器代码(executable simulator code),从而将物理推理与图像渲染分离。通过生成可运行代码,模型对世界的表征变得可直接检查、编辑和证伪,进而实现了对物理参数推断和动力学一致性模拟的严格评估。在此基础上构建的VisPhyBench基准包含209个场景,系统性地衡量模型在外观重建和物理合理性运动再现方面的表现,实验表明当前最先进的MLLMs虽具备较强的语义理解能力,但在准确推断物理参数和保持动力学一致性方面仍存在显著不足。

链接: https://arxiv.org/abs/2602.13294
作者: Jiarong Liang,Max Ku,Ka-Hei Hui,Ping Nie,Wenhu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.

[CV-162] NutVLM: A Self-Adaptive Defense Framework against Full-Dimension Attacks for Vision Language Models in Autonomous Driving

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在自动驾驶(Autonomous Driving, AD)场景中面临的对抗性威胁问题,包括局部物理贴片和不可察觉的全局扰动,这些问题严重削弱了VLMs的鲁棒性,且现有防御方法难以在保持干净样本性能的同时实现有效防护。解决方案的关键在于提出NutVLM框架,其核心是NutNet++——一种统一的检测-净化机制,通过三分类策略识别良性样本、局部贴片与全局扰动;针对局部威胁采用高效的灰度掩码净化,而对全局扰动则引入专家引导的对抗提示调优(Expert-guided Adversarial Prompt Tuning, EAPT),通过梯度驱动的潜在空间优化与离散投影生成“纠正性驾驶提示”,从而在不进行全模型微调的前提下重定向模型注意力,实现高效、可扩展的安全防护。

链接: https://arxiv.org/abs/2602.13293
作者: Xiaoxu Peng,Dong Zhou,Jianwen Zhang,Guanghui Sun,Anh Tu Ngo,Anupam Chattopadhyay
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Vision Language Models (VLMs) have advanced perception in autonomous driving (AD), but they remain vulnerable to adversarial threats. These risks range from localized physical patches to imperceptible global perturbations. Existing defense methods for VLMs remain limited and often fail to reconcile robustness with clean-sample performance. To bridge these gaps, we propose NutVLM, a comprehensive self-adaptive defense framework designed to secure the entire perception-decision lifecycle. Specifically, we first employ NutNet++ as a sentinel, which is a unified detection-purification mechanism. It identifies benign samples, local patches, and global perturbations through three-way classification. Subsequently, localized threats are purified via efficient grayscale masking, while global perturbations trigger Expert-guided Adversarial Prompt Tuning (EAPT). Instead of the costly parameter updates of full-model fine-tuning, EAPT generates “corrective driving prompts” via gradient-based latent optimization and discrete projection. These prompts refocus the VLM’s attention without requiring exhaustive full-model retraining. Evaluated on the Dolphins benchmark, our NutVLM yields a 4.89% improvement in overall metrics (e.g., Accuracy, Language Score, and GPT Score). These results validate NutVLM as a scalable security solution for intelligent transportation. Our code is available at this https URL.

[CV-163] Evaluating the Impact of Post-Training Quantization on Reliable VQA with Multimodal LLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在实际部署中面临的两个关键挑战:一是模型过度自信导致的可靠性下降问题,二是模型参数量庞大限制其在边缘设备上的高效部署问题。为应对上述问题,研究者系统性地分析了后训练量化(Post-Training Quantization, PTQ)对视觉问答(Visual Question Answering, VQA)任务中准确性和可靠性的影响,并提出结合数据感知量化方法(如MBQ)与适配后的Selector置信度估计器作为解决方案的核心。该方案能够在显著降低模型内存占用(约减少75%)的同时,有效缓解因量化带来的可靠性损失,实现效率与可靠性的最优平衡。

链接: https://arxiv.org/abs/2602.13289
作者: Paul Jonas Kurz,Tobias Jan Wieczorek,Mohamed A. Abdelsalam,Rahaf Aljundi,Marcus Rohrbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted poster at the 1st Workshop on Epistemic Intelligence in Machine Learning (EIML) @ EURIPS 2025

点击查看摘要

Abstract:Multimodal Large Language Models (MLLM) are increasingly deployed in domains where both reliability and efficiency are critical. However, current models remain overconfident, producing highly certain but incorrect answers. At the same time, their large size limits deployment on edge devices, necessitating compression. We study the intersection of these two challenges by analyzing how Post-Training Quantization (PTQ) compression affects both accuracy and reliability in Visual Question Answering (VQA). We evaluate two MLLMs, Qwen2-VL-7B and Idefics3-8B, quantized with data-free (HQQ) and data-aware (MBQ) methods across multiple bit widths. To counteract the reduction in reliability caused by quantization, we adapt the Selector confidence estimator for quantized multimodal settings and test its robustness across various quantization levels and out-of-distribution (OOD) scenarios. We find that PTQ degrades both accuracy and reliability. Data-aware methods soften the effect thereof. The Selector substantially mitigates the reliability impact. The combination of int4 MBQ and the Selector achieves the best efficiency-reliability trade-off, closing in on uncompressed performance at approx. 75% less memory demand. Overall, we present the first systematic study linking quantization and reliability in multimodal settings.

[CV-164] COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception ICLR2026

【速读】:该论文旨在解决协同感知(Cooperative Perception)中因通信带宽有限与传感器信息丰富之间的矛盾而导致的实际部署困难问题。现有方法虽通过选择性传输部分特征以降低带宽压力,但仍难以满足当前无线技术的承载能力。解决方案的关键在于引入时间连续性建模,利用时序感知机制识别反映环境动态变化的特征,避免重复传输静态信息,并据此动态调整各智能体的信息共享量。具体而言,作者提出COOPERTRIM框架,其核心创新包括:一种基于共形时间不确定性的新颖特征相关性度量方法,以及一个数据驱动的自适应共享量决策机制,从而在保证感知精度的前提下显著降低带宽消耗。

链接: https://arxiv.org/abs/2602.13287
作者: Shilpa Mukhopadhyay,Amit Roy-Chowdhury,Hang Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注: Accepted in ICLR 2026

点击查看摘要

Abstract:Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other’s live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open-source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real-world deployment.

[CV-165] Explanatory Interactive Machine Learning for Bias Mitigation in Visual Gender Classification

【速读】:该论文旨在解决视觉分类模型中存在的偏见(bias)和虚假相关性(spurious correlations)问题,特别是在性别分类等易受数据偏差影响的场景中。其解决方案的关键在于采用解释性交互学习(Explanatory Interactive Learning, XIL)范式,通过用户对模型解释的反馈来引导模型关注与预测相关的、符合用户视角的图像特征。研究对比了两种先进的XIL策略——CAIPI和Right for the Right Reasons (RRR),并提出了一种融合两者的混合方法,利用GradCAM和Bounded Logit Attention (BLA)生成的分割掩码进行定量评估,结果表明这些方法能有效提升模型透明度和公平性,尤其在降低性别分类中的误判率不平衡方面表现显著,其中CAIPI甚至可能提升分类准确率。

链接: https://arxiv.org/abs/2602.13286
作者: Nathanya Satriani,Djordje Slijepčević,Markus Schedl,Matthias Zeppelzauer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, CBMI2025

点击查看摘要

Abstract:Explanatory interactive learning (XIL) enables users to guide model training in machine learning (ML) by providing feedback on the model’s explanations, thereby helping it to focus on features that are relevant to the prediction from the user’s perspective. In this study, we explore the capability of this learning paradigm to mitigate bias and spurious correlations in visual classifiers, specifically in scenarios prone to data bias, such as gender classification. We investigate two methodologically different state-of-the-art XIL strategies, i.e., CAIPI and Right for the Right Reasons (RRR), as well as a novel hybrid approach that combines both strategies. The results are evaluated quantitatively by comparing segmentation masks with explanations generated using Gradient-weighted Class Activation Mapping (GradCAM) and Bounded Logit Attention (BLA). Experimental results demonstrate the effectiveness of these methods in (i) guiding ML models to focus on relevant image features, particularly when CAIPI is used, and (ii) reducing model bias (i.e., balancing the misclassification rates between male and female predictions). Our analysis further supports the potential of XIL methods to improve fairness in gender classifiers. Overall, the increased transparency and fairness obtained by XIL leads to slight performance decreases with an exception being CAIPI, which shows potential to even improve classification accuracy.

[CV-166] Beyond Ground: Map-Free LiDAR Relocalization for UAVs

【速读】:该论文旨在解决当前基于激光雷达(LiDAR)的定位方法在无人机(UAV)场景中精度显著下降的问题,尤其是现有方法主要针对自动驾驶设计,难以适应无人机飞行中常见的大偏航角和高度变化等挑战。其解决方案的关键在于提出一种名为MAILS的新框架:首先引入局部保持滑动窗口注意力模块(Locality-Preserving Sliding Window Attention),从稀疏点云中提取局部判别性几何特征;其次设计坐标无关的特征初始化模块与局部不变的位置编码机制,以增强特征提取对姿态扰动的鲁棒性;此外,构建了一个大规模、多场景的无人机LiDAR定位数据集,填补了真实飞行轨迹与高度变化下评估空白,从而实现高精度、鲁棒的无人机自主定位。

链接: https://arxiv.org/abs/2602.13267
作者: Hengyu Mu,Jianshi Wu,Yuxin Guo,XianLian Lin,Qingyong Hu,Chenglu Wen,Cheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 18 pages, 16 figures

点击查看摘要

Abstract:Localization is a fundamental capability in unmanned aerial vehicle (UAV) systems. Map-free LiDAR relocalization offers an effective solution for achieving high-precision positioning in environments with weak or unavailable GNSS signals. However, existing LiDAR relocalization methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in UAV scenarios. In this paper, we propose MAILS, a novel map-free LiDAR relocalization framework for UAVs. A Locality-Preserving Sliding Window Attention module is first introduced to extract locally discriminative geometric features from sparse point clouds. To handle substantial yaw rotations and altitude variations encountered during UAV flight, we then design a coordinate-independent feature initialization module and a locally invariant positional encoding mechanism, which together significantly enhance the robustness of feature extraction. Furthermore, existing LiDAR-based relocalization datasets fail to capture real-world UAV flight characteristics, such as irregular trajectories and varying altitudes. To address this gap, we construct a large-scale LiDAR localization dataset for UAVs, which comprises four scenes and various flight trajectories, designed to evaluate UAV relocalization performance under realistic conditions. Extensive experiments demonstrate that our method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. Our code and dataset will be released soon.

[CV-167] Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

【速读】:该论文旨在解决现有视觉检索增强生成(Visual Retrieval-Augmented Generation, VRAG)框架中因依赖预定义外部工具而导致的视觉信息损失问题,尤其是在图像操作(如裁剪)过程中易造成感知能力下降的问题。其解决方案的关键在于提出Lang2Act机制,通过自涌现的语言工具链实现细粒度的视觉感知与推理:不依赖固定外部引擎,而是让模型自主探索并构建可复用的“语言化动作工具箱”,再利用这些自涌现的动作进行下游任务推理。该方法采用两阶段强化学习(Reinforcement Learning, RL)训练框架,第一阶段优化VLMs以发现高质量动作构建工具箱,第二阶段提升模型对工具的利用效率,从而显著增强视觉感知能力,在实验中实现了超过4%的性能提升。

链接: https://arxiv.org/abs/2602.13235
作者: Yuqi Xiong,Chunyi Peng,Zhipeng Xu,Zhenghao Liu,Zulong Chen,Yukun Yan,Shuo Wang,Yu Gu,Ge Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at this https URL.

[CV-168] Learnable Multi-level Discrete Wavelet Transforms for 3D Gaussian Splatting Frequency Modulation

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在训练过程中高斯基元数量急剧增长导致的内存和存储成本上升问题。现有粗到精策略通过调节真实图像的频率内容来控制高斯增长,但如AutoOpti3DGS采用的一级离散小波变换(Discrete Wavelet Transform, DWT)受限于深度不足,且联合优化小波正则化与3D重建时存在梯度竞争,反而促使高斯密度过度增加。论文的关键解决方案是提出一种基于多级DWT的频率调制框架,通过递归分解低频子带构建更深层的教学课程(curriculum),在早期训练阶段提供逐步粗化的监督信号,从而持续降低高斯数量;同时发现仅需一个缩放参数即可实现有效调制,无需学习完整的2抽头高通滤波器,显著简化了模型结构并提升了效率。

链接: https://arxiv.org/abs/2602.14199
作者: Hung Nguyen,An Le,Truong Nguyen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful approach for novel view synthesis. However, the number of Gaussian primitives often grows substantially during training as finer scene details are reconstructed, leading to increased memory and storage costs. Recent coarse-to-fine strategies regulate Gaussian growth by modulating the frequency content of the ground-truth images. In particular, AutoOpti3DGS employs the learnable Discrete Wavelet Transform (DWT) to enable data-adaptive frequency modulation. Nevertheless, its modulation depth is limited by the 1-level DWT, and jointly optimizing wavelet regularization with 3D reconstruction introduces gradient competition that promotes excessive Gaussian densification. In this paper, we propose a multi-level DWT-based frequency modulation framework for 3DGS. By recursively decomposing the low-frequency subband, we construct a deeper curriculum that provides progressively coarser supervision during early training, consistently reducing Gaussian counts. Furthermore, we show that the modulation can be performed using only a single scaling parameter, rather than learning the full 2-tap high-pass filter. Experimental results on standard benchmarks demonstrate that our method further reduces Gaussian counts while maintaining competitive rendering quality.

[CV-169] Frequency-Enhanced Hilbert Scanning Mamba for Short-Term Arctic Sea Ice Concentration Prediction

【速读】:该论文旨在解决传统Mamba模型在北极海冰浓度(Arctic Sea Ice Concentration, SIC)短期预测中对时间相关性建模不足以及边界细节重建能力弱的问题。其解决方案的关键在于提出了一种频域增强的希尔伯特扫描Mamba框架(Frequency-enhanced Hilbert scanning Mamba Framework, FH-Mamba):首先引入3D希尔伯特扫描机制,沿保持局部性的路径遍历时空网格,确保展平序列中相邻索引对应空间与时间维度上的邻近体素;其次结合小波变换增强高频细节信息,并设计混合洗牌注意力模块(Hybrid Shuffle Attention),以自适应融合序列特征与频域特征,从而提升时间一致性与边缘重构精度。

链接: https://arxiv.org/abs/2602.13522
作者: Feng Gao,Zheng Gong,Wenli Liu,Yanhai Gan,Zhuoran Zheng,Junyu Dong,Qian Du
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE TGRS 2026

点击查看摘要

Abstract:While Mamba models offer efficient sequence modeling, vanilla versions struggle with temporal correlations and boundary details in Arctic sea ice concentration (SIC) prediction. To address these limitations, we propose Frequency-enhanced Hilbert scanning Mamba Framework (FH-Mamba) for short-term Arctic SIC prediction. Specifically, we introduce a 3D Hilbert scan mechanism that traverses the 3D spatiotemporal grid along a locality-preserving path, ensuring that adjacent indices in the flattened sequence correspond to neighboring voxels in both spatial and temporal dimensions. Additionally, we incorporate wavelet transform to amplify high-frequency details and we also design a Hybrid Shuffle Attention module to adaptively aggregate sequence and frequency features. Experiments conducted on the OSI-450a1 and AMSR2 datasets demonstrate that our FH-Mamba achieves superior prediction performance compared with state-of-the-art baselines. The results confirm the effectiveness of Hilbert scanning and frequency-aware attention in improving both temporal consistency and edge reconstruction for Arctic SIC forecasting. Our codes are publicly available at this https URL.

[CV-170] FUTON: Fourier Tensor Network for Implicit Neural Representations

【速读】:该论文旨在解决基于多层感知机(MLP)的隐式神经表示(Implicit Neural Representations, INRs)在训练收敛慢、对噪声过拟合以及外推性能差等问题。其解决方案的关键在于提出FUTON(Fourier Tensor Network),该方法将信号建模为广义傅里叶级数,其中系数由低秩张量分解参数化,从而以正交可分离基函数的加权组合形式隐式表达信号;这种设计融合了傅里叶基函数对平滑性和周期性的捕捉能力与低秩参数化对低维频谱结构的约束,兼具互补的归纳偏置(inductive biases),并提供通用逼近定理的理论保障及线性复杂度的推理算法。

链接: https://arxiv.org/abs/2602.13414
作者: Pooya Ashtari,Pourya Behmandpoor,Nikos Deligiannis,Aleksandra Pizurica
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 17 pages, 18 figures, 3 tables

点击查看摘要

Abstract:Implicit neural representations (INRs) have emerged as powerful tools for encoding signals, yet dominant MLP-based designs often suffer from slow convergence, overfitting to noise, and poor extrapolation. We introduce FUTON (Fourier Tensor Network), which models signals as generalized Fourier series whose coefficients are parameterized by a low-rank tensor decomposition. FUTON implicitly expresses signals as weighted combinations of orthonormal, separable basis functions, combining complementary inductive biases: Fourier bases capture smoothness and periodicity, while the low-rank parameterization enforces low-dimensional spectral structure. We provide theoretical guarantees through a universal approximation theorem and derive an inference algorithm with complexity linear in the spectral resolution and the input dimension. On image and volume representation, FUTON consistently outperforms state-of-the-art MLP-based INRs while training 2–5 \times faster. On inverse problems such as image denoising and super-resolution, FUTON generalizes better and converges faster.

[CV-171] CellM aster: Collaborative Cell Type Annotation in Single-Cell Analysis

【速读】:该论文旨在解决单细胞RNA测序(scRNA-seq)数据中细胞类型注释的瓶颈问题,即在缺乏已知标记物或参考数据库的情况下,如何准确识别和标注稀有及新型细胞状态。传统方法依赖预定义的标记基因集,难以适应新组织或瞬态状态,导致注释准确性受限。解决方案的关键在于提出CellMaster——一个基于大语言模型(LLM)知识编码的AI代理,其通过实时推理与可解释的逻辑推理实现零样本(zero-shot)细胞类型注释,无需预先训练或固定标记数据库,从而显著提升对罕见和未知细胞亚型的识别能力。

链接: https://arxiv.org/abs/2602.13346
作者: Zhen Wang,Yiming Gao,Jieyuan Liu,Enze Ma,Jefferson Chen,Mark Antkowiak,Mengzhou Hu,JungHo Kong,Dexter Pratt,Zhiting Hu,Wei Wang,Trey Ideker,Eric P. Xing
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Single-cell RNA-seq (scRNA-seq) enables atlas-scale profiling of complex tissues, revealing rare lineages and transient states. Yet, assigning biologically valid cell identities remains a bottleneck because markers are tissue- and state-dependent, and novel states lack references. We present CellMaster, an AI agent that mimics expert practice for zero-shot cell-type annotation. Unlike existing automated tools, CellMaster leverages LLM-encoded knowledge (e.g., GPT-4o) to perform on-the-fly annotation with interpretable rationales, without pre-training or fixed marker databases. Across 9 datasets spanning 8 tissues, CellMaster improved accuracy by 7.1% over best-performing baselines (including CellTypist and scTab) in automatic mode. With human-in-the-loop refinement, this advantage increased to 18.6%, with a 22.1% gain on subtype populations. The system demonstrates particular strength in rare and novel cell states where baselines often fail. Source code and the web application are available at \hrefthis https URLthis https URL.

[CV-172] Learning to Select Like Humans: Explainable Active Learning for Medical Imaging

【速读】:该论文旨在解决医学图像分析中因标注数据稀缺而导致模型训练效率低的问题,尤其关注传统主动学习方法仅依赖预测不确定性而忽视模型是否学习到临床有意义特征的局限性。解决方案的关键在于提出一种可解释性引导的主动学习框架,其核心创新是引入空间注意力对齐机制作为样本选择的双重标准之一:一方面保留分类不确定性以识别信息量大的样本,另一方面通过衡量Grad-CAM生成的注意力图与放射科医生定义的感兴趣区域(Region-of-Interest, ROI)之间的Dice相似度来量化注意力错位,从而筛选出模型关注错误特征的样本。这种双准则策略显著提升了数据利用效率和模型的空间可解释性,实验证明在仅使用570个标注样本的情况下即可超越随机采样,并在三个医学影像数据集上实现更高的准确率与临床相关性。

链接: https://arxiv.org/abs/2602.13308
作者: Ifrat Ikhtear Uddin,Longwei Wang,Xiao Qin,Yang Zhou,KC Santosh
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication IEEE Conference on Artificial Intelligence 2026, Granada, Spain

点击查看摘要

Abstract:Medical image analysis requires substantial labeled data for model training, yet expert annotation is expensive and time-consuming. Active learning (AL) addresses this challenge by strategically selecting the most informative samples for the annotation purpose, but traditional methods solely rely on predictive uncertainty while ignoring whether models learn from clinically meaningful features a critical requirement for clinical deployment. We propose an explainability-guided active learning framework that integrates spatial attention alignment into a sample acquisition process. Our approach advocates for a dual-criterion selection strategy combining: (i) classification uncertainty to identify informative examples, and (ii) attention misalignment with radiologist-defined regions-of-interest (ROIs) to target samples where the model focuses on incorrect features. By measuring misalignment between Grad-CAM attention maps and expert annotations using \emphDice similarity, our acquisition function judiciously identifies samples that enhance both predictive performance and spatial interpretability. We evaluate the framework using three expert-annotated medical imaging datasets, namely, BraTS (MRI brain tumors), VinDr-CXR (chest X-rays), and SIIM-COVID-19 (chest X-rays). Using only 570 strategically selected samples, our explainability-guided approach consistently outperforms random sampling across all the datasets, achieving 77.22% accuracy on BraTS, 52.37% on VinDr-CXR, and 52.66% on SIIM-COVID. Grad-CAM visualizations confirm that the models trained by our dual-criterion selection focus on diagnostically relevant regions, demonstrating that incorporating explanation guidance into sample acquisition yields superior data efficiency while maintaining clinical interpretability.

[CV-173] Deep Learning CNN for Pneumonia Detection: Advancing Digital Health in Society 5.0

【速读】:该论文旨在解决肺炎(pneumonia)在全球范围内尤其是医疗资源有限地区诊断工具匮乏导致的高发病率和死亡率问题。其解决方案的关键在于构建一个基于深度学习的卷积神经网络(Convolutional Neural Network, CNN)模型,通过在标注数据集上训练,并结合归一化、数据增强和图像质量增强等预处理技术,提升模型的鲁棒性和泛化能力。实验结果显示,优化后的模型在胸片图像中检测肺炎的准确率达91.67%,ROC-AUC为0.96,PR-AUC为0.95,展现出优异的分类性能,表明该方法可作为快速、稳定且可靠的辅助诊断工具,助力社会5.0背景下人工智能赋能医疗服务与公共健康改善。

链接: https://arxiv.org/abs/2602.13270
作者: Hadi Almohab
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages 3 figures in Indonesian language

点击查看摘要

Abstract:Pneumonia is a serious global health problem, contributing to high morbidity and mortality, especially in areas with limited diagnostic tools and healthcare resources. This study develops a Convolutional Neural Network (CNN) based on deep learning to automatically detect pneumonia from chest X-ray images. The method involves training the model on labeled datasets with preprocessing techniques such as normalization, data augmentation, and image quality enhancement to improve robustness and generalization. Testing results show that the optimized model achieves 91.67% accuracy, ROC-AUC of 0.96, and PR-AUC of 0.95, demonstrating strong performance in distinguishing pneumonia from normal images. In conclusion, this CNN model has significant potential as a fast, consistent, and reliable diagnostic aid, supporting Society 5.0 by integrating artificial intelligence to improve healthcare services and public well-being.

人工智能

[AI-0] Long Context Less Focus: A Scaling Gap in LLM s Revealed through Privacy and Personalization

【速读】:该论文旨在解决长上下文长度对大语言模型(Large Language Models, LLMs)个性化效果与隐私保护能力的影响问题,即在隐私敏感和个性化需求日益增长的场景中,如何量化并理解上下文长度扩展对模型性能的双重影响。其解决方案的关键在于构建了一个大规模基准测试平台 PAPerBench,涵盖约 29,000 个实例、上下文长度从 1K 到 256K tokens 不等,并包含 377K 条评估问题,从而系统性地评估个性化质量与隐私风险之间的权衡关系。实验表明,随着上下文长度增加,模型在个性化和隐私保护方面均出现一致性的性能下降;进一步的理论分析揭示了这一现象源于固定容量 Transformer 中软注意力机制的注意力稀释(attention dilution)问题,提出当前模型存在一个“长上下文、低关注”的普遍缩放差距(scaling gap)。

链接: https://arxiv.org/abs/2602.15028
作者: Shangding Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models – long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at this https URL

[AI-1] Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation

【速读】:该论文旨在解决生成式模型在化学与科学领域中对群对称性(如置换对称性 SnS_n 和旋转对称性 SE(3)SE(3))不变分布建模时所面临的挑战,传统方法依赖于架构约束(如等变去噪器和不变先验)来强制实现对称性,但存在训练效率低、表达能力受限等问题。其解决方案的关键在于提出“规范化的视角”(canonicalization perspective):首先通过规范姿态或顺序将每个样本映射到轨道代表元(orbit representative),然后在规范切片上训练一个无约束的扩散或流模型,生成阶段再随机采样对称变换以恢复不变分布。该方法基于商空间理论,证明了规范生成模型在正确性、通用性和表达能力上优于传统不变目标,并能加速训练(减少扩散得分复杂度和流匹配中的条件方差)。实验表明,在分子图生成任务中,结合几何谱基规范化和轻量位置编码的规范扩散模型显著优于等变基线,且计算成本相当甚至更低。

链接: https://arxiv.org/abs/2602.15022
作者: Cai Zhou,Zijie Chen,Zian Li,Jike Wang,Kaiyi Jiang,Pan Li,Rose Yu,Muhan Zhang,Stephen Bates,Tommi Jaakkola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Group Theory (math.GR); Biomolecules (q-bio.BM)
备注: 32 pages

点击查看摘要

Abstract:Many generative tasks in chemistry and science involve distributions invariant to group symmetries (e.g., permutation and rotation). A common strategy enforces invariance and equivariance through architectural constraints such as equivariant denoisers and invariant priors. In this paper, we challenge this tradition through the alternative canonicalization perspective: first map each sample to an orbit representative with a canonical pose or order, train an unconstrained (non-equivariant) diffusion or flow model on the canonical slice, and finally recover the invariant distribution by sampling a random symmetry transform at generation time. Building on a formal quotient-space perspective, our work provides a comprehensive theory of canonical diffusion by proving: (i) the correctness, universality and superior expressivity of canonical generative models over invariant targets; (ii) canonicalization accelerates training by removing diffusion score complexity induced by group mixtures and reducing conditional variance in flow matching. We then show that aligned priors and optimal transport act complementarily with canonicalization and further improves training efficiency. We instantiate the framework for molecular graph generation under S_n \times SE(3) symmetries. By leveraging geometric spectra-based canonicalization and mild positional encodings, canonical diffusion significantly outperforms equivariant baselines in 3D molecule generation tasks, with similar or even less computation. Moreover, with a novel architecture Canon, CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.

[AI-2] Spectral Convolution on Orbifolds for Geometric Deep Learning

【速读】:该论文旨在解决如何将几何深度学习(Geometric Deep Learning, GDL)应用于具有轨道流形(orbifold)结构的数据,从而拓展GDL在非欧几里得数据上的适用范围。其关键解决方案是引入谱卷积(spectral convolution)的概念用于轨道流形,这为在轨道流形结构数据上进行机器学习提供了基础构建模块,使这类复杂几何结构的数据能够被有效建模和学习。

链接: https://arxiv.org/abs/2602.14997
作者: Tim Mangliers,Bernhard Mössner,Benjamin Himpel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Geometric deep learning (GDL) deals with supervised learning on data domains that go beyond Euclidean structure, such as data with graph or manifold structure. Due to the demand that arises from application-related data, there is a need to identify further topological and geometric structures with which these use cases can be made accessible to machine learning. There are various techniques, such as spectral convolution, that form the basic building blocks for some convolutional neural network-like architectures on non-Euclidean data. In this paper, the concept of spectral convolution on orbifolds is introduced. This provides a building block for making learning on orbifold structured data accessible using GDL. The theory discussed is illustrated using an example from music theory.

[AI-3] On the Semantics of Primary Cause in Hybrid Dynamic Domains

【速读】:该论文致力于解决在混合(hybrid)动态系统中实际因果关系的建模问题,即如何在离散与连续变化共存的场景下准确识别观测效应的主因。其解决方案的关键在于基于混合时序情境演算(hybrid temporal situation calculus)框架,提出两种等价的实际因果定义:一种是基础性的形式化定义,另一种则通过贡献度(contribution)来刻画因果关系,并借助改进的“但非”(but-for)反事实测试进行验证。这两种定义均满足直观合理的因果性质,从而为复杂系统的因果推理提供了严谨且可操作的形式基础。

链接: https://arxiv.org/abs/2602.14994
作者: Shakil M. Khan,Asim Mehmood,Sandra Zilles
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning about actual causes of observed effects is fundamental to the study of rationality. This important problem has been studied since the time of Aristotle, with formal mathematical accounts emerging recently. We live in a world where change due to actions can be both discrete and continuous, that is, hybrid. Yet, despite extensive research on actual causation, only few recent studies looked into causation with continuous change. Building on recent progress, in this paper we propose two definitions of primary cause in a hybrid action-theoretic framework, namely the hybrid temporal situation calculus. One of these is foundational in nature while the other formalizes causation through contributions, which can then be verified from a counterfactual perspective using a modified ``but-for’’ test. We prove that these two definitions are indeed equivalent. We then show that our definitions of causation have some intuitively justifiable properties.

[AI-4] PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement ICLR2026

【速读】:该论文旨在解决在仿真环境中自动构建复杂且物理合理的三维交互场景的问题,传统方法通常仅关注3D资产的放置,而忽略了物体间的物理关系(如接触、支撑、平衡和容纳),这些关系对于生成真实世界的机器人操作场景(如桌面布局、货架整理或箱体填充)至关重要。解决方案的关键在于提出PhyScensis框架,该框架基于大语言模型(LLM)代理与物理引擎协同工作:首先由LLM代理迭代地提出带有空间和物理谓词的物体配置;其次通过集成物理引擎的求解器将这些谓词转化为可执行的3D场景;最后利用求解器反馈优化配置,从而实现高复杂度、高物理准确性的场景生成。该方法还引入概率编程来建模稳定性,并结合启发式策略联合调控稳定性和空间关系,确保对细粒度文本描述和数值参数(如相对位置、场景稳定性)的强可控性。

链接: https://arxiv.org/abs/2602.14968
作者: Yian Wang,Han Yang,Minghao Guo,Xiaowen Qiu,Tsun-Hsuan Wang,Wojciech Matusik,Joshua B. Tenenbaum,Chuang Gan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D layout generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and © the need to accurately model both spatial placement and physical properties. To address these challenges, we propose PhyScensis, an LLM agent-based framework powered by a physics engine, to produce physically plausible scene configurations with high complexity. Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration. Moreover, our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scene layouts for robotic manipulation.

[AI-5] MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design ICLR2026

【速读】:该论文旨在解决抗菌肽(Antimicrobial Peptides, AMP)设计中难以平衡活性、毒性与新颖性等多目标优化问题,传统生成模型常因评分机制僵化或不透明导致结果难以解释和迭代优化。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLM)的闭环多智能体协作系统(Multi-Agent Collaboration, MAC-AMP),该系统通过模拟同行评审的自适应强化学习框架,在仅需任务描述和示例数据集的情况下即可自主设计新型AMP,实现了跨领域可迁移的多目标优化,并保持了良好的可解释性,避免了“黑箱”问题。

链接: https://arxiv.org/abs/2602.14926
作者: Gen Zhou,Sugitha Janarthanan,Lianghong Chen,Pingzhao Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is published in ICLR 2026

点击查看摘要

Abstract:To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMP) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLM) advance and evolve swiftly, we turn to AI multi-agent collaboration based on such models (multi-agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce MAC-AMP, a closed-loop multi-agent collaboration (MAC) system for multi-objective AMP design. The system implements a fully autonomous simulated peer review-adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed-loop multi-agent system for AMP design, with cross-domain transferability, that supports multi-objective optimization while remaining explainable rather than a ‘black box’. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.

[AI-6] ReusStdFlow: A Standardized Reusability Framework for Dynamic Workflow Construction in Agent ic AI

【速读】:该论文旨在解决企业级智能体AI(Agentic AI)中存在的“可重用性困境”(reusability dilemma)和结构幻觉(structural hallucinations)问题。其解决方案的关键在于提出ReusStdFlow框架,该框架基于“提取-存储-构建”(Extraction-Storage-Construction)新范式,将异构的、平台特定的领域特定语言(DSL)解构为标准化、模块化的流程片段,并采用图数据库与向量数据库相结合的双重知识架构,实现拓扑结构与功能语义的协同检索,最终通过检索增强生成(RAG)策略智能组装工作流,从而实现对企业数字资产的自动化重组与高效复用。

链接: https://arxiv.org/abs/2602.14922
作者: Gaoyang Zhang,Shanghong Zou,Yafang Wang,He Zhang,Ruohua Xu,Feng Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:To address the reusability dilemma'' and structural hallucinations in enterprise Agentic AI,this paper proposes ReusStdFlow, a framework centered on a novel Extraction-Storage-Construction’’ paradigm. The framework deconstructs heterogeneous, platform-specific Domain Specific Languages (DSLs) into standardized, modular workflow segments. It employs a dual knowledge architecture-integrating graph and vector databases-to facilitate synergistic retrieval of both topological structures and functional semantics. Finally, workflows are intelligently assembled using a retrieval-augmented generation (RAG) strategy. Tested on 200 real-world n8n workflows, the system achieves over 90% accuracy in both extraction and construction. This framework provides a standardized solution for the automated reorganization and efficient reuse of enterprise digital assets.

[AI-7] BHyGNN: Unsupervised Representation Learning for Heterophilic Hypergraphs

【速读】:该论文旨在解决现有超图神经网络(Hypergraph Neural Networks, HyGNNs)在异质性超图(heterophilic hypergraphs)上性能下降的问题,尤其是在标签数据稀缺或获取成本高昂的实际场景中,传统方法对标注数据的依赖限制了其应用。解决方案的关键在于提出BHyGNN+,一个基于超图对偶性(hypergraph duality)的自监督学习框架:通过交换节点与超边的角色构造超图的对偶版本,并利用余弦相似度对比原图与其对偶图的增强视图,在无需负样本的情况下捕捉超图结构中的本质模式,从而实现无监督表示学习。该方法显著提升了在异质性和同质性超图上的泛化能力,为无标签超图上的表示学习提供了新范式。

链接: https://arxiv.org/abs/2602.14919
作者: Tianyi Ma,Yiyue Qian,Zehong Wang,Zheyuan Zhang,Chuxu Zhang,Yanfang Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hypergraph Neural Networks (HyGNNs) have demonstrated remarkable success in modeling higher-order relationships among entities. However, their performance often degrades on heterophilic hypergraphs, where nodes connected by the same hyperedge tend to have dissimilar semantic representations or belong to different classes. While several HyGNNs, including our prior work BHyGNN, have been proposed to address heterophily, their reliance on labeled data significantly limits their applicability in real-world scenarios where annotations are scarce or costly. To overcome this limitation, we introduce BHyGNN+, a self-supervised learning framework that extends BHyGNN for representation learning on heterophilic hypergraphs without requiring ground-truth labels. The core idea of BHyGNN+ is hypergraph duality, a structural transformation where the roles of nodes and hyperedges are interchanged. By contrasting augmented views of a hypergraph against its dual using cosine similarity, our framework captures essential structural patterns in a fully unsupervised manner. Notably, this duality-based formulation eliminates the need for negative samples, a common requirement in existing hypergraph contrastive learning methods that is often difficult to satisfy in practice. Extensive experiments on eleven benchmark datasets demonstrate that BHyGNN+ consistently outperforms state-of-the-art supervised and self-supervised baselines on both heterophilic and homophilic hypergraphs. Our results validate the effectiveness of leveraging hypergraph duality for self-supervised learning and establish a new paradigm for representation learning on challenging, unlabeled hypergraphs.

[AI-8] Position: Introspective Experience from Conversational Environments as a Path to Better Learning

【速读】:该论文试图解决当前人工智能训练中将推理视为规模扩展的涌现特性(emergent property of scale)这一局限性问题,旨在揭示更本质的推理生成机制。其解决方案的关键在于提出“语言内省”(linguistic self-reflection)作为构建稳健推理能力的核心路径,并强调通过高质量社会互动内化形成的对话式认知结构对智能发展的决定性作用。论文基于维果斯基的发展心理学理论,指出对话式支架(dialogically scaffolded introspective experiences)能够使智能体脱离即时数据流,将原始环境信息转化为可学习的叙事结构,从而实现推理过程的深度解耦与高效优化。最终,论文主张“对话质量即数据质量”(Dialogue Quality is the New Data Quality),认为下一代通用智能的提升关键在于优化用于塑造内部反思能力的对话环境。

链接: https://arxiv.org/abs/2602.14910
作者: Claudiu Cristian Musat,Jackson Tolins,Diego Antognini,Jingling Li,Martin Klissarov,Tom Duerig
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current approaches to AI training treat reasoning as an emergent property of scale. We argue instead that robust reasoning emerges from linguistic self-reflection, itself internalized from high-quality social interaction. Drawing on Vygotskian developmental psychology, we advance three core positions centered on Introspection. First, we argue for the Social Genesis of the Private Mind: learning from conversational environments rises to prominence as a new way to make sense of the world; the friction of aligning with another agent, internal or not, refines and crystallizes the reasoning process. Second, we argue that dialogically scaffolded introspective experiences allow agents to engage in sense-making that decouples learning from immediate data streams, transforming raw environmental data into rich, learnable narratives. Finally, we contend that Dialogue Quality is the New Data Quality: the depth of an agent’s private reasoning, and its efficiency regarding test-time compute, is determined by the diversity and rigor of the dialogues it has mastered. We conclude that optimizing these conversational scaffolds is the primary lever for the next generation of general intelligence.

[AI-9] he Potential of CoT for Reasoning : A Closer Look at Trace Dynamics

【速读】:该论文旨在解决链式思维(Chain-of-thought, CoT)提示技术中“为何有效”这一核心问题,即深入解析CoT推理轨迹中哪些部分真正贡献于最终正确答案的生成。其解决方案的关键在于引入“潜力(potential)”这一量化指标,用于衡量CoT中某一阶段对正确完成概率的提升程度。通过分析竞赛级数学题目的CoT轨迹,研究发现潜力常呈现非单调性(源于推理偏移)、尖锐但难以解释的跃升(推理洞察与跳跃),以及偶尔的幸运猜测;进一步通过CoT可迁移性实验表明,仅需20%的强模型生成的部分CoT即可显著提升弱模型性能,揭示了CoT机制具有高度可迁移性,从而为理解大语言模型(LLM)中的推理过程提供了可量化的分析框架和实证依据。

链接: https://arxiv.org/abs/2602.14903
作者: Gregor Bachmann,Yichen Jiang,Seyed Mohsen Moosavi Dezfooli,Moin Nabi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting is a de-facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from competition-level mathematics questions, with the aim of better understanding how, and which parts of CoT actually contribute to the final answer. To this end, we introduce the notion of a potential, quantifying how much a given part of CoT increases the likelihood of a correct completion. Upon examination of reasoning traces through the lens of the potential, we identify surprising patterns including (1) its often strong non-monotonicity (due to reasoning tangents), (2) very sharp but sometimes tough to interpret spikes (reasoning insights and jumps) as well as (3) at times lucky guesses, where the model arrives at the correct answer without providing any relevant justifications before. While some of the behaviours of the potential are readily interpretable and align with human intuition (such as insights and tangents), others remain difficult to understand from a human perspective. To further quantify the reliance of LLMs on reasoning insights, we investigate the notion of CoT transferability, where we measure the potential of a weaker model under the partial CoT from another, stronger model. Indeed aligning with our previous results, we find that as little as 20% of partial CoT can ``unlock’’ the performance of the weaker model on problems that were previously unsolvable for it, highlighting that a large part of the mechanics underpinning CoT are transferable.

[AI-10] Lifted Relational Probabilistic Inference via Implicit Learning

【速读】:该论文旨在解决在一阶关系域中归纳学习(inductive learning)与演绎推理(deductive reasoning)之间的张力问题,即如何在不显式构建完整概率模型的前提下,高效地回答查询。其核心挑战在于传统提升推理(lifted inference)依赖于完整的模型,而从部分、噪声观测中学习此类模型在一般情况下是计算不可行的。解决方案的关键在于提出一种隐式学习与一阶关系概率逻辑推理相结合的新框架:通过将不完整的先验知识与独立采样的部分观测样本合并到多项式时间内完成的平方和(sum-of-squares, SOS)层次的一个有界度片段中,实现对个体和世界层面的双重提升(grounding-lift 和 world-lift)。这一方法首次实现了在多项式时间内隐式学习一阶概率逻辑并进行跨所有一致世界的提升推理。

链接: https://arxiv.org/abs/2602.14890
作者: Luise Ge,Brendan Juba,Kris Nilsson,Alison Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconciling the tension between inductive learning and deductive reasoning in first-order relational domains is a longstanding challenge in AI. We study the problem of answering queries in a first-order relational probabilistic logic through a joint effort of learning and reasoning, without ever constructing an explicit model. Traditional lifted inference assumes access to a complete model and exploits symmetry to evaluate probabilistic queries; however, learning such models from partial, noisy observations is intractable in general. We reconcile these two challenges through implicit learning to reason and first-order relational probabilistic inference techniques. More specifically, we merge incomplete first-order axioms with independently sampled, partially observed examples into a bounded-degree fragment of the sum-of-squares (SOS) hierarchy in polynomial time. Our algorithm performs two lifts simultaneously: (i) grounding-lift, where renaming-equivalent ground moments share one variable, collapsing the domain of individuals; and (ii) world-lift, where all pseudo-models (partial world assignments) are enforced in parallel, producing a global bound that holds across all worlds consistent with the learned constraints. These innovations yield the first polynomial-time framework that implicitly learns a first-order probabilistic logic and performs lifted inference over both individuals and worlds.

[AI-11] On the Learning Dynamics of RLVR at the Edge of Competence

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)如何在长程推理任务中克服“长期障碍”的问题,即为何仅依赖最终结果的奖励信号能够有效促进模型在复杂、多步骤推理任务中的持续进步。其解决方案的关键在于提出了一种针对Transformer架构在组合推理任务上训练动态的理论框架,揭示了奖励有效性由难度谱的平滑性决定:当数据中存在难度的突变时,学习过程会出现类似“grokking”的相变现象,导致长时间停滞;而平滑的难度分布则引发“接力效应”——简单问题上的持续梯度信号逐步提升模型能力,使更难的问题变得可解,从而实现稳定且连续的性能提升。这一机制解释了RLVR在模型能力边缘区域的有效性,并指出合理设计的数据混合策略可带来可扩展的性能增益。

链接: https://arxiv.org/abs/2602.14872
作者: Yu Huang,Zixin Wen,Yuejie Chi,Yuting Wei,Aarti Singh,Yingbin Liang,Yuxin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model’s capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.

[AI-12] Concept Influence: Leverag ing Interpretability to Improve Performance and Efficiency in Training Data Attribution

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练数据归属(Training Data Attribution, TDA)中的两大核心问题:一是现有方法(如影响函数)计算成本高,难以扩展;二是传统方法基于单个测试样本进行归因,易受语法相似性干扰,难以捕捉语义层面的行为驱动因素。解决方案的关键在于引入可解释结构(interpretable structures)到TDA流程中,提出“概念影响”(Concept Influence)机制,将模型行为归因于语义方向(如线性探测器或稀疏自编码器特征),而非具体数据点;同时证明基于探测器的简化方法是对概念影响的一阶近似,在保持与经典影响函数相当性能的同时,速度提升超过一个数量级,从而实现更高效、可解释且可控的模型行为分析。

链接: https://arxiv.org/abs/2602.14869
作者: Matthew Kowal,Goncalo Paulo,Louis Jaburi,Tom Tseng,Lev E McKinney,Stefan Heimersheim,Aaron David Tucker,Adam Gleave,Kellin Pelrine
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are both computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and influence to abstract behavior, we leverage interpretable structures within the model during the attribution. First, we introduce Concept Influence which attribute model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order-of-magnitude faster. We empirically validate Concept Influence and approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate they achieve comparable performance to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, explainable, and better control of model behavior through data.

[AI-13] Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

【速读】:该论文旨在解决基于强化学习(Reinforcement Learning, RL)训练大语言模型时因稀疏奖励导致的样本效率低下问题,即模型在庞大搜索空间中难以有效获取反馈以提升推理能力。解决方案的关键在于提出一种名为Goldilocks的教师驱动数据采样策略,该策略通过教师模型动态选择对当前学生模型而言“难度适中”的题目(既不过于简单也不过于困难),从而优化训练过程;同时结合GRPO(Generalized Reward Policy Optimization)算法,在训练过程中利用学生模型对已见样本的表现持续调整教师模型的采样偏好,实现自适应难度调控,显著提升了在相同计算预算下的模型性能表现。

链接: https://arxiv.org/abs/2602.14868
作者: Ilia Mahrooghi,Aryo Lotfi,Emmanuel Abbe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 12 figures

点击查看摘要

Abstract:Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question’s difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student’s performance on seen samples, the teacher continuously adapts to the student’s evolving abilities. On OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.

[AI-14] EmbeWebAgent : Embedding Web Agents into Any Customized UI

【速读】:该论文旨在解决当前Web代理(Web Agent)在企业环境中因局限于人机交互层面(如仅依赖截图或原始DOM树观察)而导致的鲁棒性差和动作表达能力弱的问题。其核心解决方案是提出EmbeWebAgent框架,通过轻量级前端钩子(包括精心设计的ARIA属性与URL观测机制、以及基于WebSocket暴露的页面级函数注册表)实现对前端的嵌入式控制,并结合可复用的后端工作流进行推理与执行动作,从而支持跨栈(stack-agnostic)的UI集成、多粒度操作(从GUI原语到高层复合动作),并通过MCP工具协调导航、操作及领域特定分析,显著提升代理在真实UI场景下的稳定性和功能性。

链接: https://arxiv.org/abs/2602.14865
作者: Chenyang Ma,Clyde Fare,Matthew Wilson,Dave Braines
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Technical Report; Live Demo: this https URL

点击查看摘要

Abstract:Most web agents operate at the human interface level, observing screenshots or raw DOM trees without application-level access, which limits robustness and action expressiveness. In enterprise settings, however, explicit control of both the frontend and backend is available. We present EmbeWebAgent, a framework for embedding agents directly into existing UIs using lightweight frontend hooks (curated ARIA and URL-based observations, and a per-page function registry exposed via a WebSocket) and a reusable backend workflow that performs reasoning and takes actions. EmbeWebAgent is stack-agnostic (e.g., React or Angular), supports mixed-granularity actions ranging from GUI primitives to higher-level composites, and orchestrates navigation, manipulation, and domain-specific analytics via MCP tools. Our demo shows minimal retrofitting effort and robust multi-step behaviors grounded in a live UI setting. Live Demo: this https URL

[AI-15] World Models for Policy Refinement in StarCraft II

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的《星际争霸II》(StarCraft II, SC2)智能体在决策过程中忽视环境动态建模的问题,尤其是缺乏一个可学习、动作条件化的世界模型(World Model)来增强对部分可观测环境下未来状态的预测能力。其关键解决方案是提出StarWM——首个针对SC2设计的世界模型,通过引入结构化的文本表示将观测分解为五个语义模块(如资源、单位、建筑等),并构建了SC2-Dynamics-50k指令微调数据集用于训练动态预测能力;同时开发多维离线评估框架验证预测准确性,并进一步设计StarWM-Agent系统,在生成-模拟-精炼(Generate–Simulate–Refine)决策循环中集成StarWM以实现前瞻驱动的策略优化,从而显著提升在线对抗性能与宏观管理稳定性。

链接: https://arxiv.org/abs/2602.14857
作者: Yixin Zhang,Ziyi Wang,Yiming Rong,Haoxi Wang,Jinling Jiang,Shuang Xu,Haoran Wu,Shiyu Zhou,Bo Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision-making policies in complex environments. StarCraft II (SC2), with its massive state-action space and partial observability, is a challenging testbed. However, existing LLM-based SC2 agents primarily focus on improving the policy itself and overlook integrating a learnable, action-conditioned transition model into the decision loop. To bridge this gap, we propose StarWM, the first world model for SC2 that predicts future observations under partial observability. To facilitate learning SC2’s hybrid dynamics, we introduce a structured textual representation that factorizes observations into five semantic modules, and construct SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction. We further develop a multi-dimensional offline evaluation framework for predicted structured observations. Offline results show StarWM’s substantial gains over zero-shot baselines, including nearly 60% improvements in resource prediction accuracy and self-side macro-situation consistency. Finally, we propose StarWM-Agent, a world-model-augmented decision system that integrates StarWM into a Generate–Simulate–Refine decision loop for foresight-driven policy refinement. Online evaluation against SC2’s built-in AI demonstrates consistent improvements, yielding win-rate gains of 30%, 15%, and 30% against Hard (LV5), Harder (LV6), and VeryHard (LV7), respectively, alongside improved macro-management stability and tactical risk assessment.

[AI-16] Return of the Schema: Building Complete Datasets for Machine Learning and Reasoning on Knowledge Graphs

【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)精化算法评估中缺乏schema层面知识的问题,现有数据集通常仅包含实体和关系的地面事实(ground facts),而忽略了源知识图谱中可用的结构化schema信息,这限制了依赖丰富本体约束、推理或神经符号技术的方法在真实世界大规模知识图谱中的性能评估。解决方案的关键在于提出首个支持提取同时包含schema与ground facts的数据集的工作流,该工作流不仅能处理schema与事实之间的不一致性,还能利用推理机制推导隐含知识;所生成的数据集以OWL格式序列化,可直接用于逻辑推理服务,并提供将数据转换为标准机器学习库常用张量表示的工具,从而实现对基于schema的精化方法的有效评估。

链接: https://arxiv.org/abs/2602.14795
作者: Ivan Diliso,Roberto Barile,Claudia d’Amato,Nicola Fanizzi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Datasets for the experimental evaluation of knowledge graph refinement algorithms typically contain only ground facts, retaining very limited schema level knowledge even when such information is available in the source knowledge graphs. This limits the evaluation of methods that rely on rich ontological constraints, reasoning or neurosymbolic techniques and ultimately prevents assessing their performance in large-scale, real-world knowledge graphs. In this paper, we present \resource the first resource that provides a workflow for extracting datasets including both schema and ground facts, ready for machine learning and reasoning services, along with the resulting curated suite of datasets. The workflow also handles inconsistencies detected when keeping both schema and facts and also leverage reasoning for entailing implicit knowledge. The suite includes newly extracted datasets from KGs with expressive schemas while simultaneously enriching existing datasets with schema information. Each dataset is serialized in OWL making it ready for reasoning services. Moreover, we provide utilities for loading datasets in tensor representations typical of standard machine learning libraries.

[AI-17] What hackers talk about when they talk about AI: Early-stage diffusion of a cybercrime innovation

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)技术快速发展背景下,其对网络犯罪形态演变的潜在影响问题,特别是AI如何被网络犯罪分子用于提升攻击规模与复杂性,并重塑其作案模式与安全策略。解决方案的关键在于通过分析来自网络安全威胁情报平台的160余条网络犯罪论坛对话数据,结合创新扩散理论(diffusion of innovation framework)与主题分析法(thematic analysis),系统揭示犯罪群体对AI的认知、应用尝试及内在矛盾,从而识别出AI赋能下的新型网络犯罪趋势,为执法机构和政策制定者提供可操作的洞察与应对策略。

链接: https://arxiv.org/abs/2602.14783
作者: Benoît Dupont,Chad Whelan,Serge-Olivier Paquette
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 33 pages, 2 figures, submitted to Global Crime

点击查看摘要

Abstract:The rapid expansion of artificial intelligence (AI) is raising concerns about its potential to transform cybercrime. Beyond empowering novice offenders, AI stands to intensify the scale and sophistication of attacks by seasoned cybercriminals. This paper examines the evolving relationship between cybercriminals and AI using a unique dataset from a cyber threat intelligence platform. Analyzing more than 160 cybercrime forum conversations collected over seven months, our research reveals how cybercriminals understand AI and discuss how they can exploit its capabilities. Their exchanges reflect growing curiosity about AI’s criminal applications through legal tools and dedicated criminal tools, but also doubts and anxieties about AI’s effectiveness and its effects on their business models and operational security. The study documents attempts to misuse legitimate AI tools and develop bespoke models tailored for illicit purposes. Combining the diffusion of innovation framework with thematic analysis, the paper provides an in-depth view of emerging AI-enabled cybercrime and offers practical insights for law enforcement and policymakers.

[AI-18] Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training

【速读】:该论文旨在解决预训练语言模型在推理阶段缺乏进一步精细化处理能力的问题,即模型在固定结构下难以对中间表示进行持续优化。其解决方案的关键在于提出推理时内部循环(inference-time inner looping),通过在测试阶段反复重应用选定的Transformer块范围,实现对 latent representation 的迭代式精炼,从而在不更新模型参数的前提下提升模型性能。实验表明,该方法能在多个基准测试中带来稳定且一致的准确率提升,并促使潜在状态演化更加稳定、语义信息持续优化。

链接: https://arxiv.org/abs/2602.14759
作者: Jonathan Lys,Vincent Gripon,Bastien Pasdeloup,Lukas Mauch,Fabien Cardinaux,Ghouthi Boukli Hacene
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Learning architectures, and in particular Transformers, are conventionally viewed as a composition of layers. These layers are actually often obtained as the sum of two contributions: a residual path that copies the input and the output of a Transformer block. As a consequence, the inner representations (i.e. the input of these blocks) can be interpreted as iterative refinement of a propagated latent representation. Under this lens, many works suggest that the inner space is shared across layers, meaning that tokens can be decoded at early stages. Mechanistic interpretability even goes further by conjecturing that some layers act as refinement layers. Following this path, we propose inference-time inner looping, which prolongs refinement in pretrained off-the-shelf language models by repeatedly re-applying a selected block range. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Analyses of the resulting latent trajectories suggest more stable state evolution and continued semantic refinement. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.

[AI-19] AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises

【速读】:该论文旨在解决前沿生成式 AI(Generative AI)在战略竞争情境下行为模式的可预测性与人类战略逻辑一致性问题,尤其关注其在核危机模拟中的决策机制。解决方案的关键在于通过构建一个高度逼真的核危机博弈场景,让三款前沿大语言模型(GPT-5.2、Claude Sonnet 4 和 Gemini 3 Flash)分别扮演对立领导角色,从而系统性地观察其是否具备理论心理(Theory of Mind)、可信元认知自我意识(Metacognitive Self-Awareness)以及欺骗意图等高级战略能力,并验证经典战略理论(如谢林的承诺理论、卡恩的升级框架和耶尔维斯的误判研究)在 AI 行为中的适用性与局限性。结果表明,AI 在某些方面高度模仿人类战略推理,但在核禁忌、威胁反应、信任加速冲突等方面表现出与人类行为显著差异,揭示了 AI 战略决策需以人类经验校准的重要性,为未来 AI 辅助战略分析提供了关键实证依据与方法论路径。

链接: https://arxiv.org/abs/2602.14740
作者: Kenneth Payne
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
备注: 45 pages, 6 figures, 27 tables

点击查看摘要

Abstract:Today’s leading AI models engage in sophisticated behaviour when placed in strategic competition. They spontaneously attempt deception, signaling intentions they do not intend to follow; they demonstrate rich theory of mind, reasoning about adversary beliefs and anticipating their actions; and they exhibit credible metacognitive self-awareness, assessing their own strategic abilities before deciding how to act. Here we present findings from a crisis simulation in which three frontier large language models (GPT-5.2, Claude Sonnet 4, Gemini 3 Flash) play opposing leaders in a nuclear crisis. Our simulation has direct application for national security professionals, but also, via its insights into AI reasoning under uncertainty, has applications far beyond international crisis decision-making. Our findings both validate and challenge central tenets of strategic theory. We find support for Schelling’s ideas about commitment, Kahn’s escalation framework, and Jervis’s work on misperception, inter alia. Yet we also find that the nuclear taboo is no impediment to nuclear escalation by our models; that strategic nuclear attack, while rare, does occur; that threats more often provoke counter-escalation than compliance; that high mutual credibility accelerated rather than deterred conflict; and that no model ever chose accommodation or withdrawal even when under acute pressure, only reduced levels of violence. We argue that AI simulation represents a powerful tool for strategic analysis, but only if properly calibrated against known patterns of human reasoning. Understanding how frontier models do and do not imitate human strategic logic is essential preparation for a world in which AI increasingly shapes strategic outcomes. Comments: 45 pages, 6 figures, 27 tables Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2602.14740 [cs.AI] (or arXiv:2602.14740v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.14740 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kenneth Payne [view email] [v1] Mon, 16 Feb 2026 13:35:01 UTC (489 KB)

[AI-20] Scale redundancy and soft gauge fixing in positively homogeneous neural networks

【速读】:该论文旨在解决神经网络训练中因激活函数正齐次性(positively homogeneous activations)所引发的参数空间冗余问题,即神经元尺度重缩放导致的参数轨道不变性(gauge redundancy),这会恶化优化条件并引起权重尺度漂移(scale drift)。解决方案的关键在于引入规范适应坐标系(gauge-adapted coordinates),将参数空间分解为对输入-输出函数不变的规范轨道方向与破坏尺度平衡的不平衡方向,并通过一个仅作用于冗余尺度坐标的软轨道选择(norm-balancing)惩罚项,诱导不平衡模式的耗散性收敛,从而在不改变模型表达能力的前提下稳定学习率范围并抑制尺度漂移。这一方法建立了规范几何结构与优化条件之间的直接联系,为机器学习中的对称性约束优化提供了理论框架。

链接: https://arxiv.org/abs/2602.14729
作者: Rodrigo Carmo Terin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Neural networks with positively homogeneous activations exhibit an exact continuous reparametrization symmetry: neuron-wise rescalings generate parameter-space orbits along which the input–output function is invariant. We interpret this symmetry as a gauge redundancy and introduce gauge-adapted coordinates that separate invariant and scale-imbalance directions. Inspired by gauge fixing in field theory, we introduce a soft orbit-selection (norm-balancing) functional acting only on redundant scale coordinates. We show analytically that it induces dissipative relaxation of imbalance modes to preserve the realized function. In controlled experiments, this orbit-selection penalty expands the stable learning-rate regime and suppresses scale drift without changing expressivity. These results establish a structural link between gauge-orbit geometry and optimization conditioning, providing a concrete connection between gauge-theoretic concepts and machine learning.

[AI-21] ManeuverNet: A Soft Actor-Critic Framework for Precise Maneuvering of Double-Ackermann-Steering Robots with Optimized Reward Functions ICRA

【速读】:该论文旨在解决双阿克曼转向机器人(double-Ackermann-steering robots)在农业场景中实现自主控制时面临的两大挑战:一是传统基于采样的路径规划方法(如Timed Elastic Band, TEB)对参数敏感,难以适应不同机器人配置或环境变化;二是端到端深度强化学习(DRL)方法因奖励函数设计不当,无法有效处理非完整约束(non-holonomic constraints),导致策略次优且泛化能力差。解决方案的关键在于提出 ManeuverNet,一个专为双阿克曼系统设计的 DRL 框架,结合 Soft Actor-Critic 与 CrossQ 算法,并引入四个针对性设计的奖励函数,无需专家数据或手工引导即可实现高效、鲁棒的自主 maneuvering 控制。实验表明,该方法显著提升了轨迹成功率和效率,相较 DRL 基线提升超 40%,同时有效缓解了 TEB 对参数的强依赖性。

链接: https://arxiv.org/abs/2602.14726
作者: Kohio Deflesselle,Mélodie Daniel,Aly Magassouba,Miguel Aranda,Olivier Ly
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 5, figures, Accepted for 2026 IEEE International Conference on Robotics Automation (ICRA)

点击查看摘要

Abstract:Autonomous control of double-Ackermann-steering robots is essential in agricultural applications, where robots must execute precise and complex maneuvers within a limited space. Classical methods, such as the Timed Elastic Band (TEB) planner, can address this problem, but they rely on parameter tuning, making them highly sensitive to changes in robot configuration or environment and impractical to deploy without constant recalibration. At the same time, end-to-end deep reinforcement learning (DRL) methods often fail due to unsuitable reward functions for non-holonomic constraints, resulting in sub-optimal policies and poor generalization. To address these challenges, this paper presents ManeuverNet, a DRL framework tailored for double-Ackermann systems, combining Soft Actor-Critic with CrossQ. Furthermore, ManeuverNet introduces four specifically designed reward functions to support maneuver learning. Unlike prior work, ManeuverNet does not depend on expert data or handcrafted guidance. We extensively evaluate ManeuverNet against both state-of-the-art DRL baselines and the TEB planner. Experimental results demonstrate that our framework substantially improves maneuverability and success rates, achieving more than a 40% gain over DRL baselines. Moreover, ManeuverNet effectively mitigates the strong parameter sensitivity observed in the TEB planner. In real-world trials, ManeuverNet achieved up to a 90% increase in maneuvering trajectory efficiency, highlighting its robustness and practical applicability.

[AI-22] WebWorld: A Large-Scale World Model for Web Agent Training

【速读】:该论文旨在解决当前Web代理(Web Agent)在训练过程中依赖大量真实网络交互轨迹而导致的效率低下与安全风险问题,尤其是受限于网络延迟、接口速率限制及潜在的安全隐患。其核心解决方案是提出首个大规模开源Web模拟器——WebWorld系列,通过构建可扩展的数据流水线,实现超过100万条开放网络交互数据的高效合成与训练,支持复杂推理、多模态数据处理和30步以上的长时程模拟。该方案的关键创新在于将模拟环境从封闭的小规模场景拓展至开放互联网尺度,并结合双维度评估体系(WebWorld-Bench)与跨域迁移验证(代码、GUI、游戏等),显著提升了模型在真实任务中的泛化能力与推理性能,最终使基于WebWorld合成轨迹训练的Qwen3-14B模型在WebArena基准上提升9.2%,达到GPT-4o水平,且优于GPT-5作为世界模型的推断搜索能力。

链接: https://arxiv.org/abs/2602.14721
作者: Zikai Xiao,Jianhong Tu,Chuhang Zou,Yuxin Zuo,Zhi Li,Peng Wang,Bowen Yu,Fei Huang,Junyang Lin,Zuozhu Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Web agents require massive trajectories to generalize, yet real-world training is constrained by network latency, rate limits, and safety risks. We introduce \textbfWebWorld series, the first open-web simulator trained at scale. While existing simulators are restricted to closed environments with thousands of trajectories, WebWorld leverages a scalable data pipeline to train on 1M+ open-web interactions, supporting reasoning, multi-format data, and long-horizon simulations of 30+ steps. For intrinsic evaluation, we introduce WebWorld-Bench with dual metrics spanning nine dimensions, where WebWorld achieves simulation performance comparable to Gemini-3-Pro. For extrinsic evaluation, Qwen3-14B trained on WebWorld-synthesized trajectories improves by +9.2% on WebArena, reaching performance comparable to GPT-4o. WebWorld enables effective inference-time search, outperforming GPT-5 as a world model. Beyond web simulation, WebWorld exhibits cross-domain generalization to code, GUI, and game environments, providing a replicable recipe for world model construction.

[AI-23] Qute: Towards Quantum-Native Database

【速读】:该论文旨在解决当前量子计算与数据库系统融合不足的问题,即如何在传统数据库架构中高效集成量子计算能力,以实现量子原生(quantum-native)的数据处理范式。其核心解决方案在于提出一个名为Qute的量子数据库系统,关键创新包括:(i) 将扩展SQL编译为门效率高的量子电路,实现量子计算指令的直接映射;(ii) 引入混合优化器动态选择量子或经典执行计划,提升整体执行灵活性与性能;(iii) 设计选择性量子索引机制,减少量子资源消耗;(iv) 采用保真度保持的存储方案,缓解当前量子比特(qubit)相干时间短和噪声限制的问题。通过在真实量子处理器(origin_wukong)上部署验证,Qute在大规模场景下优于经典基线,标志着向量子原生数据库迈出关键一步。

链接: https://arxiv.org/abs/2602.14699
作者: Muzhi Chen,Xuanhe Zhou,Wei Zhou,Bangrui Xu,Surui Tang,Guoliang Li,Bingsheng He,Yeye He,Yitong Song,Fan Wu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Please refer our open-source prototype at: this https URL

点击查看摘要

Abstract:This paper envisions a quantum database (Qute) that treats quantum computation as a first-class execution option. Unlike prior simulation-based methods that either run quantum algorithms on classical machines or adapt existing databases for quantum simulation, Qute instead (i) compiles an extended form of SQL into gate-efficient quantum circuits, (ii) employs a hybrid optimizer to dynamically select between quantum and classical execution plans, (iii) introduces selective quantum indexing, and (iv) designs fidelity-preserving storage to mitigate current qubit constraints. We also present a three-stage evolution roadmap toward quantum-native database. Finally, by deploying Qute on a real quantum processor (origin_wukong), we show that it outperforms a classical baseline at scale, and we release an open-source prototype at this https URL.

[AI-24] Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自主学习与自我改进过程中,如何协同优化模型权重(weight updates)和系统提示(system prompts)以提升任务性能的问题。当前主流方法分别依赖自省(self-reflection)更新上下文和强化学习(Reinforcement Learning, RL)更新权重,但缺乏对提示与权重的联合优化机制。其解决方案的关键在于提出进化式系统提示学习(Evolutionary System Prompt Learning, E-SPL),通过在每轮RL迭代中并行运行多个系统提示的轨迹(rollouts),利用RL更新模型权重,并基于LLM驱动的变异(mutation)与交叉(crossover)操作对提示群体进行进化选择;同时引入TrueSkill评分机制动态评估提示质量,从而实现声明性知识(declarative knowledge)存储于提示、程序性知识(procedural knowledge)编码于权重的自然分工,显著提升样本效率与泛化能力,在从简单到复杂的任务迁移(如AIME → BeyondAIME)中取得优于传统方法的性能表现。

链接: https://arxiv.org/abs/2602.14697
作者: Lunjun Zhang,Ryan Chen,Bradly C. Stadie
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME \rightarrow BeyondAIME) generalization setting, E-SPL improves RL success rate from 38.8% \rightarrow 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: this https URL

[AI-25] Removing Planner Bias in Goal Recognition Through Multi-Plan Dataset Generation

【速读】:该论文旨在解决现有目标识别(Goal Recognition)数据集因依赖启发式前向搜索(heuristic-based forward search)规划系统而产生的系统性偏差问题,这种偏差导致数据集在评估不同规划器(planner)对同一目标的推理能力时缺乏挑战性。解决方案的关键在于引入一种基于 top-k 规划的新方法,为同一目标假设生成多个结构上不同的计划,从而构建更公平、更具多样性的基准测试环境;在此基础上提出版本覆盖率分数(Version Coverage Score, VCS),用于量化目标识别器在面对不同计划集合时的鲁棒性(resilience)。实验表明,当前最先进的目标识别器在低可观测性场景下其鲁棒性显著下降。

链接: https://arxiv.org/abs/2602.14691
作者: Mustafa F. Abdelwahed,Felipe Meneguzzi Kin Max Piamolini Gusmao,Joan Espasa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous agents require some form of goal and plan recognition to interact in multiagent settings. Unfortunately, all existing goal recognition datasets suffer from a systematical bias induced by the planning systems that generated them, namely heuristic-based forward search. This means that existing datasets lack enough challenge for more realistic scenarios (e.g., agents using different planners), which impacts the evaluation of goal recognisers with respect to using different planners for the same goal. In this paper, we propose a new method that uses top-k planning to generate multiple, different, plans for the same goal hypothesis, yielding benchmarks that mitigate the bias found in the current dataset. This allows us to introduce a new metric called Version Coverage Score (VCS) to measure the resilience of the goal recogniser when inferring a goal based on different sets of plans. Our results show that the resilience of the current state-of-the-art goal recogniser degrades substantially under low observability settings.

[AI-26] SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data

【速读】:该论文旨在解决当前稀疏自编码器(Sparse Autoencoders, SAEs)在大语言模型(LLMs)中评估时存在的两大问题:一是现有LLM基准数据噪声过大,难以区分架构改进的有效性;二是合成数据实验规模小且不具现实性,无法提供有意义的对比。其解决方案的关键在于提出SynthSAEBench工具包,该工具包能够生成具有真实特征结构(如相关性、层次性和叠加性)的大规模合成数据,并构建标准化基准模型SynthSAEBench-16k,从而实现对SAE架构的直接比较。该基准不仅能复现LLM中已知现象(如重建与潜在空间质量指标的脱节、探针性能差及L0调控的精度-召回权衡),还揭示了匹配追踪型SAEs因利用叠加噪声提升重建性能却未学习真实特征的新失败模式,表明更表达能力强的编码器易过拟合。SynthSAEBench通过提供真实特征和可控消融实验,使研究人员可在扩展至LLMs前精准诊断SAE失效机制并验证架构改进效果。

链接: https://arxiv.org/abs/2602.14687
作者: David Chanin,Adrià Garriga-Alonso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to differentiate architectural improvements, and current synthetic data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, and a standardized benchmark model, SynthSAEBench-16k, enabling direct comparison of SAE architectures. Our benchmark reproduces several previously observed LLM SAE phenomena, including the disconnect between reconstruction and latent quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use our benchmark to identify a new failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting that more expressive encoders can easily overfit. SynthSAEBench complements LLM benchmarks by providing ground-truth features and controlled ablations, enabling researchers to precisely diagnose SAE failure modes and validate architectural improvements before scaling to LLMs.

[AI-27] GREAT-EER: Graph Edge Attention Network for Emergency Evacuation Responses

【速读】:该论文旨在解决城市紧急疏散场景下,如何在限定时间内利用公交车高效疏散尽可能多人员的问题,即提出并建模了公交疏散定向问题(Bus Evacuation Orienteering Problem, BEOP),这是一个NP-hard组合优化问题。其解决方案的关键在于设计了一种基于深度强化学习(Deep Reinforcement Learning)的方法,结合图神经网络(Graph Learning)来学习最优疏散路径策略;该方法训练完成后可实现毫秒级推理速度,并通过混合整数线性规划(MILP)对解的质量进行理论边界约束,从而在保证近似最优解的同时满足实际应用中对响应时间的要求。

链接: https://arxiv.org/abs/2602.14676
作者: Attila Lischka,Balázs Kulcsár
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 9 figures

点击查看摘要

Abstract:Emergency situations that require the evacuation of urban areas can arise from man-made causes (e.g., terrorist attacks or industrial accidents) or natural disasters, the latter becoming more frequent due to climate change. As a result, effective and fast methods to develop evacuation plans are of great importance. In this work, we identify and propose the Bus Evacuation Orienteering Problem (BEOP), an NP-hard combinatorial optimization problem with the goal of evacuating as many people from an affected area by bus in a short, predefined amount of time. The purpose of bus-based evacuation is to reduce congestion and disorder that arises in purely car-focused evacuation scenarios. To solve the BEOP, we propose a deep reinforcement learning-based method utilizing graph learning, which, once trained, achieves fast inference speed and is able to create evacuation routes in fractions of seconds. We can bound the gap of our evacuation plans using an MILP formulation. To validate our method, we create evacuation scenarios for San Francisco using real-world road networks and travel times. We show that we achieve near-optimal solution quality and are further able to investigate how many evacuation vehicles are necessary to achieve certain bus-based evacuation quotas given a predefined evacuation time while keeping run time adequate.

[AI-28] From User Preferences to Base Score Extraction Functions in Gradual Argumentation AAMAS2026

【速读】:该论文旨在解决在渐进式论证(Gradual Argumentation)中如何从用户对论点的偏好中自动提取基础分值(Base Score)的问题,以支持透明且可争议的AI系统。传统方法依赖用户专业知识手动设定基础分值,过程复杂且不直观;而本文提出了一种名为“基础分值提取函数”(Base Score Extraction Functions)的解决方案,其关键在于构建一个从用户偏好到基础分值的映射机制,从而将带有偏好的双极论证框架(Bipolar Argumentation Framework, BAF)转化为定量双极论证框架(Quantitative Bipolar Argumentation Framework, QBAF),使现有成熟的计算工具能够直接应用于渐进式论证分析。该方法还通过近似人类偏好中的非线性特征来提升实际拟合度,并在机器人场景中进行了理论与实验验证,为实践中选择合适的渐进语义提供了指导。

链接: https://arxiv.org/abs/2602.14674
作者: Aniol Civit,Antonio Rago,Antonio Andriella,Guillem Alenyà,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to AAMAS 2026 - With Appendix

点击查看摘要

Abstract:Gradual argumentation is a field of symbolic AI which is attracting attention for its ability to support transparent and contestable AI systems. It is considered a useful tool in domains such as decision-making, recommendation, debate analysis, and others. The outcomes in such domains are usually dependent on the arguments’ base scores, which must be selected carefully. Often, this selection process requires user expertise and may not always be straightforward. On the other hand, organising the arguments by preference could simplify the task. In this work, we introduce \emphBase Score Extraction Functions, which provide a mapping from users’ preferences over arguments to base scores. These functions can be applied to the arguments of a \emphBipolar Argumentation Framework (BAF), supplemented with preferences, to obtain a \emphQuantitative Bipolar Argumentation Framework (QBAF), allowing the use of well-established computational tools in gradual argumentation. We outline the desirable properties of base score extraction functions, discuss some design choices, and provide an algorithm for base score extraction. Our method incorporates an approximation of non-linearities in human preferences to allow for better approximation of the real ones. Finally, we evaluate our approach both theoretically and experimentally in a robotics setting, and offer recommendations for selecting appropriate gradual semantics in practice.

[AI-29] Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域(如医疗分诊)中难以严格遵循结构化工作流的问题。传统单体式提示方法将整个决策结构编码于单一提示中,随着提示长度增加,易出现“中间迷失”(lost-in-the-middle)效应和上下文窗口溢出,导致指令遵循能力下降。其解决方案的关键在于提出Arbor框架,通过将决策树导航分解为节点级专用任务实现架构解耦:决策树以边列表形式存储并动态检索,运行时基于有向无环图(DAG)的编排机制仅获取当前节点的出边,利用专用LLM调用评估有效转移路径,并将响应生成分离至独立推理步骤。该设计显著降低对模型内在能力的依赖,使小型模型在准确率、延迟和成本上均优于大型单提示基线模型。

链接: https://arxiv.org/abs/2602.14643
作者: Luís Silva,Diogo Gonçalves,Catarina Farinha,Clara Matos,Luís Ungaro
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations. Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.

[AI-30] abular Foundation Models Can Learn Association Rules

【速读】:该论文旨在解决传统关联规则挖掘(Association Rule Mining, ARM)方法中存在的规则爆炸(rule explosion)和可扩展性差的问题,以及现有神经网络方法在低数据场景下性能下降的局限性。其解决方案的关键在于提出一种模型无关的关联规则学习框架,该框架能够从任意条件概率模型中提取关联规则,并通过引入TabProbe这一具体实现,利用预训练的表格基础模型(Tabular Foundation Models, TFMs)作为条件概率估计器,在无需频繁项集挖掘的情况下直接学习高质量的关联规则。实验表明,该方法在不同规模的数据集上均能生成简洁且预测能力强的规则,并在低数据条件下保持鲁棒性,无需任务特定微调。

链接: https://arxiv.org/abs/2602.14622
作者: Erkan Karabulut,Daniel Daza,Paul Groth,Martijn C. Schut,Victoria Degeler
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Association Rule Mining (ARM) is a fundamental task for knowledge discovery in tabular data and is widely used in high-stakes decision-making. Classical ARM methods rely on frequent itemset mining, leading to rule explosion and poor scalability, while recent neural approaches mitigate these issues but suffer from degraded performance in low-data regimes. Tabular foundation models (TFMs), pretrained on diverse tabular data with strong in-context generalization, provide a basis for addressing these limitations. We introduce a model-agnostic association rule learning framework that extracts association rules from any conditional probabilistic model over tabular data, enabling us to leverage TFMs. We then introduce TabProbe, an instantiation of our framework that utilizes TFMs as conditional probability estimators to learn association rules out-of-the-box without frequent itemset mining. We evaluate our approach on tabular datasets of varying sizes based on standard ARM rule quality metrics and downstream classification performance. The results show that TFMs consistently produce concise, high-quality association rules with strong predictive performance and remain robust in low-data settings without task-specific training. Source code is available at this https URL.

[AI-31] OPBench: A Graph Benchmark to Combat the Opioid Crisis

【速读】:该论文旨在解决当前针对阿片类药物危机(opioid crisis)的图学习方法缺乏系统性评估基准的问题。现有研究虽尝试利用图学习方法建模复杂的药物相关现象,但因缺少统一、全面且覆盖真实场景的基准数据集与评估框架,导致方法间难以公平比较和有效推进。解决方案的关键在于提出首个综合性阿片类药物基准平台OPBench,其包含五个跨三个关键应用领域的数据集(包括从医疗索赔中检测阿片类药物过量、从数字平台检测非法毒品贩运、以及从饮食模式预测药物滥用),并引入异质图(heterogeneous graph)和超图(hypergraph)结构以保留药物相关数据的复杂关系;同时,通过与领域专家及权威机构合作构建高质量标注数据,并建立标准化的数据划分、评估协议与可复现基线,从而为图学习方法提供可靠、一致的评测环境,推动该领域研究向更精准、可解释的方向发展。

链接: https://arxiv.org/abs/2602.14602
作者: Tianyi Ma,Yiyang Li,Yiyue Qian,Zheyuan Zhang,Zehong Wang,Chuxu Zhang,Yanfang Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The opioid epidemic continues to ravage communities worldwide, straining healthcare systems, disrupting families, and demanding urgent computational solutions. To combat this lethal opioid crisis, graph learning methods have emerged as a promising paradigm for modeling complex drug-related phenomena. However, a significant gap remains: there is no comprehensive benchmark for systematically evaluating these methods across real-world opioid crisis scenarios. To bridge this gap, we introduce OPBench, the first comprehensive opioid benchmark comprising five datasets across three critical application domains: opioid overdose detection from healthcare claims, illicit drug trafficking detection from digital platforms, and drug misuse prediction from dietary patterns. Specifically, OPBench incorporates diverse graph structures, including heterogeneous graphs and hypergraphs, to preserve the rich and complex relational information among drug-related data. To address data scarcity, we collaborate with domain experts and authoritative institutions to curate and annotate datasets while adhering to privacy and ethical guidelines. Furthermore, we establish a unified evaluation framework with standardized protocols, predefined data splits, and reproducible baselines to facilitate fair and systematic comparison among graph learning methods. Through extensive experiments, we analyze the strengths and limitations of existing graph learning methods, thereby providing actionable insights for future research in combating the opioid crisis. Our source code and datasets are available at this https URL.

[AI-32] Automated Classification of Source Code Changes Based on Metrics Clustering in the Software Development Process

【速读】:该论文旨在解决软件开发过程中源代码变更分类效率低下的问题,传统方法依赖人工审查,耗时且难以规模化。解决方案的关键在于提出一种基于变更度量(change metrics)聚类的自动化分类方法:首先计算每个代码变更的11维度量向量(涵盖代码行数、圈复杂度、文件数量、接口变化和结构变化等),然后使用基于余弦相似度的k-means算法对这些向量进行聚类;最终由专家将聚类结果映射到预定义的变更类别中。该方法实现了变更分布的自动化,显著减少人工审查时间,并在五个软件系统上验证了其有效性,分类纯度为P_C = 0.75 ± 0.05,熵值为E_C = 0.37 ± 0.06(显著性水平α = 0.05)。

链接: https://arxiv.org/abs/2602.14591
作者: Evgenii Kniazev
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This is an English translation of the author’s Ph.D. dissertation abstract, originally defended in Russian at ITMO University (2009) under the supervision of Prof. A.A. Shalyto. The original research was co-authored with D.G. Shopyrin. Original available at this https URL

点击查看摘要

Abstract:This paper presents an automated method for classifying source code changes during the software development process based on clustering of change metrics. The method consists of two steps: clustering of metric vectors computed for each code change, followed by expert mapping of the resulting clusters to predefined change classes. The distribution of changes into clusters is performed automatically, while the mapping of clusters to classes is carried out by an expert. Automation of the distribution step substantially reduces the time required for code change review. The k-means algorithm with a cosine similarity measure between metric vectors is used for clustering. Eleven source code metrics are employed, covering lines of code, cyclomatic complexity, file counts, interface changes, and structural changes. The method was validated on five software systems, including two open-source projects (Subversion and NHibernate), and demonstrated classification purity of P_C = 0.75 +/- 0.05 and entropy of E_C = 0.37 +/- 0.06 at a significance level of 0.05.

[AI-33] Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow

【速读】:该论文旨在解决连续时间环境下强化学习(Reinforcement Learning, RL)中标准离散时间方法失效的问题,尤其是在非均匀、事件驱动的决策场景中,如金融交易或机器人控制。传统方法在时间间隔趋近于零时会导致Q函数退化为价值函数V,从而丧失动作排序能力;而现有连续时间方法虽引入优势率函数q以恢复动作信息,但依赖复杂的鞅损失或正交约束,优化过程耦合且不稳定。论文提出一种解耦的连续时间演员-评论家算法,其核心在于通过交替更新机制分离q与V的学习:利用扩散生成器从V中学习q,同时基于哈密顿量的价值流更新V,确保在无穷小时间步长下仍保持信息有效性。理论层面,作者通过新颖的概率论证证明收敛性,绕过了生成器型哈密顿量在上确界范数下缺乏贝尔曼收缩性的挑战;实验表明,该方法在多个连续控制基准和真实世界交易任务中显著优于先前连续时间和主流离散时间基线。

链接: https://arxiv.org/abs/2602.14587
作者: Minh Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Many real-world control problems, ranging from finance to robotics, evolve in continuous time with non-uniform, event-driven decisions. Standard discrete-time reinforcement learning (RL), based on fixed-step Bellman updates, struggles in this setting: as time gaps shrink, the Q -function collapses to the value function V , eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function q . However, they enforce optimality through complicated martingale losses or orthogonality constraints, which are sensitive to the choice of test processes. These approaches entangle V and q into a large, complex optimization problem that is difficult to train reliably. To address these limitations, we propose a novel decoupled continuous-time actor-critic algorithm with alternating updates: q is learned from diffusion generators on V , and V is updated via a Hamiltonian-based value flow that remains informative under infinitesimal time steps, where standard max/softmax backups fail. Theoretically, we prove rigorous convergence via new probabilistic arguments, sidestepping the challenge that generator-based Hamiltonians lack Bellman-style contraction under the sup-norm. Empirically, our method outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter - nearly doubling the second-best method.

[AI-34] Governing AI Forgetting: Auditing for Machine Unlearning Compliance

【速读】:该论文旨在解决人工智能(AI)运营方在面对个人数据删除请求时难以合规的问题,特别是针对机器遗忘(Machine Unlearning, MU)技术虽能从模型中移除特定数据的影响,但缺乏有效的经济审计机制来确保其实际执行。解决方案的关键在于构建首个基于经济视角的MU合规审计框架,将认证遗忘理论与监管执法相结合:首先通过假设检验视角刻画MU的验证不确定性以确定审计者的检测能力,进而建立博弈论模型捕捉审计者与运营方之间的策略互动;为应对MU特有的非线性复杂性(如模型效用与检测概率间的耦合关系),作者创新性地将原二元非线性不动点问题转化为可解的一元辅助问题,从而在无需显式解的前提下证明均衡的存在性、唯一性及结构特性,揭示出审计强度随删除请求增加而降低的反直觉结论,并指出披露审计比隐蔽审计更具监管成本效益。

链接: https://arxiv.org/abs/2602.14553
作者: Qinqi Lin,Ningning Ding,Lingjie Duan,Jianwei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Under review in IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Despite legal mandates for the right to be forgotten, AI operators routinely fail to comply with data deletion requests. While machine unlearning (MU) provides a technical solution to remove personal data’s influence from trained models, ensuring compliance remains challenging due to the fundamental gap between MU’s technical feasibility and regulatory implementation. In this paper, we introduce the first economic framework for auditing MU compliance, by integrating certified unlearning theory with regulatory enforcement. We first characterize MU’s inherent verification uncertainty using a hypothesis-testing interpretation of certified unlearning to derive the auditor’s detection capability, and then propose a game-theoretic model to capture the strategic interactions between the auditor and the operator. A key technical challenge arises from MU-specific nonlinearities inherent in the model utility and the detection probability, which create complex strategic couplings that traditional auditing frameworks do not address and that also preclude closed-form solutions. We address this by transforming the complex bivariate nonlinear fixed-point problem into a tractable univariate auxiliary problem, enabling us to decouple the system and establish the equilibrium existence, uniqueness, and structural properties without relying on explicit solutions. Counterintuitively, our analysis reveals that the auditor can optimally reduce the inspection intensity as deletion requests increase, since the operator’s weakened unlearning makes non-compliance easier to detect. This is consistent with recent auditing reductions in China despite growing deletion requests. Moreover, we prove that although undisclosed auditing offers informational advantages for the auditor, it paradoxically reduces the regulatory cost-effectiveness relative to disclosed auditing.

[AI-35] Disentangling Deception and Hallucination Failures in LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实体导向的事实性问答中出现错误输出时,传统行为视角下将所有失败归因于“知识缺失”的局限性问题。作者指出,此类归因可能混淆了不同机制层面的故障模式,尤其是“幻觉”(hallucination)与“欺骗”(deception)这两种看似相似但本质不同的失败类型。解决方案的关键在于提出一种机制导向(mechanism-oriented)的新分析框架,将“知识存在性”(Knowledge Existence)与“行为表达性”(Behavior Expression)分离,并通过构建一个受控的实体中心事实问答环境,在保持知识不变的前提下系统性地操控行为表达,从而实现对四种行为案例的精细分析,其核心手段包括表示可分性(representation separability)、稀疏可解释性(sparse interpretability)以及推理时激活引导(inference-time activation steering)。

链接: https://arxiv.org/abs/2602.14529
作者: Haolang Lu,Hongrui Peng,WeiYe Fu,Guoshun Nan,Xinye Cao,Xingrui Li,Hongcan Guo,Kun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Failures in large language models (LLMs) are often analyzed from a behavioral perspective, where incorrect outputs in factual question answering are commonly associated with missing knowledge. In this work, focusing on entity-based factual queries, we suggest that such a view may conflate different failure mechanisms, and propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Under this formulation, hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. To study this distinction, we construct a controlled environment for entity-centric factual questions in which knowledge is preserved while behavioral expression is selectively altered, enabling systematic analysis of four behavioral cases. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.

[AI-36] WISTED-RL: Hierarchical Skilled Agents for Knot-Tying without Human Demonstrations

【速读】:该论文致力于解决机器人在无示教条件下进行打结操作的难题,核心挑战在于柔顺物体(deformable objects)与严格拓扑约束(topological constraints)之间的复杂交互。解决方案的关键在于提出TWISTED-RL框架,通过将原先基于单步逆模型的监督学习方法替换为基于抽象拓扑动作(abstract topological actions)的多步强化学习策略,从而实现更精细的拓扑状态转移,并避免了高成本且低效的数据采集过程,显著提升了对多种复杂 knot 配置的泛化能力。

链接: https://arxiv.org/abs/2602.14526
作者: Guy Freund,Tom Jurgenson,Matan Sudry,Erez Karpas
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robotic knot-tying represents a fundamental challenge in robotics due to the complex interactions between deformable objects and strict topological constraints. We present TWISTED-RL, a framework that improves upon the previous state-of-the-art in demonstration-free knot-tying (TWISTED), which smartly decomposed a single knot-tying problem into manageable subproblems, each addressed by a specialized agent. Our approach replaces TWISTED’s single-step inverse model that was learned via supervised learning with a multi-step Reinforcement Learning policy conditioned on abstract topological actions rather than goal states. This change allows more delicate topological state transitions while avoiding costly and ineffective data collection protocols, thus enabling better generalization across diverse knot configurations. Experimental results demonstrate that TWISTED-RL manages to solve previously unattainable knots of higher complexity, including commonly used knots such as the Figure-8 and the Overhand. Furthermore, the increase in success rates and drop in planning time establishes TWISTED-RL as the new state-of-the-art in robotic knot-tying without human demonstrations.

[AI-37] Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长链式推理(long chain-of-thought reasoning, long-CoT)过程中因不同知识源提供冲突信号而导致的推理失败问题。其解决方案的关键在于通过系统性地分析内部表示,揭示了知识冲突的四个核心特性:(I) 线性可分性(Linear Separability),即不同类型冲突以线性可分离特征形式编码;(II) 深度定位性(Depth Localization),表明冲突信号集中于中到晚层,提示存在专门用于冲突编码的处理阶段;(III) 层次一致性(Hierarchical Consistency),即沿推理轨迹聚合噪声级token信号能稳健恢复输入层冲突类型;(IV) 方向不对称性(Directional Asymmetry),即强化模型隐含的知识源偏好比强制相反来源更高效。这些发现为理解多模态推理中的知识冲突机制提供了机制层面的认知,并支持对长CoT失败进行原理性诊断与控制。

链接: https://arxiv.org/abs/2602.14518
作者: Jing Tang,Kun Wang,Haolang Lu,Hongjin Chen,KaiTao Chen,Zhongxiang Sun,Qiankun Li,Lingjuan Lyu,Guoshun Nan,Zhigang Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) in long chain-of-thought reasoning often fail when different knowledge sources provide conflicting signals. We formalize these failures under a unified notion of knowledge conflict, distinguishing input-level objective conflict from process-level effective conflict. Through probing internal representations, we reveal that: (I) Linear Separability: different conflict types are explicitly encoded as linearly separable features rather than entangled; (II) Depth Localization: conflict signals concentrate in mid-to-late layers, indicating a distinct processing stage for conflict encoding; (III) Hierarchical Consistency: aggregating noisy token-level signals along trajectories robustly recovers input-level conflict types; and (IV) Directional Asymmetry: reinforcing the model’s implicit source preference under conflict is far easier than enforcing the opposite source. Our findings provide a mechanism-level view of multimodal reasoning under knowledge conflict and enable principled diagnosis and control of long-CoT failures.

[AI-38] Formally Verifying and Explaining Sepsis Treatment Policies with COOL-MC

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在脓毒症治疗优化中应用时存在的可解释性差和安全性难以验证的问题。标准的概率模型检测工具因需遍历完整状态空间而无法处理大规模马尔可夫决策过程(Markov Decision Process, MDP),且无法揭示策略做出特定决策的依据。其解决方案的关键在于提出COOL-MC框架,该框架基于Storm模型检测器扩展了三项核心能力:一是仅构建由训练好的策略诱导的可达状态空间,生成更小的离散时间马尔可夫链(Discrete-Time Markov Chain, DTMC),从而实现对复杂MDP的可行性验证;二是自动为状态标注具有临床意义的原子命题;三是将可解释性方法与概率计算树逻辑(Probabilistic Computation Tree Logic, PCTL)查询相结合,识别驱动治疗轨迹中决策的关键特征。通过在ICU-Sepsis MDP上的实证分析,表明COOL-MC不仅能提供严格的生存概率边界,还能揭示策略依赖于既往用药史而非患者实时生理状态这一潜在缺陷,从而为临床医生提供一种用于调试和评估脓毒症治疗策略的强有力工具。

链接: https://arxiv.org/abs/2602.14505
作者: Dennis Gross
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safe and interpretable sequential decision-making is critical in healthcare, yet reinforcement learning (RL) policies for sepsis treatment optimization remain opaque and difficult to verify. Standard probabilistic model checkers operate on the full state space, which becomes infeasible for larger MDPs, and cannot explain why a learned policy makes particular decisions. COOL-MC wraps the model checker Storm but adds three key capabilities: it constructs only the reachable state space induced by a trained policy, yielding a smaller discrete-time Markov chain amenable to verification even when full-MDP analysis is intractable; it automatically labels states with clinically meaningful atomic propositions; and it integrates explainability methods with probabilistic computation tree logic (PCTL) queries to reveal which features drive decisions across treatment trajectories. We demonstrate COOL-MC’s capabilities on the ICU-Sepsis MDP, a benchmark derived from approximately 17,000 sepsis patient records, which serves as a case study for applying COOL-MC to the formal analysis of sepsis treatment policies. Our analysis establishes hard bounds via full MDP verification, trains a safe RL policy that achieves optimal survival probability, and analyzes its behavior via PCTL verification and explainability on the induced DTMC. This reveals, for instance, that our trained policy relies predominantly on prior dosing history rather than the patient’s evolving condition, a weakness that is invisible to standard evaluation but is exposed by COOL-MC’s integration of formal verification and explainability. Our results illustrate how COOL-MC could serve as a tool for clinicians to investigate and debug sepsis treatment policies before deployment.

[AI-39] Bounding Probabilities of Causation with Partial Causal Diagrams

【速读】:该论文旨在解决概率性因果关系(Probability of Causation, PoC)在个体层面解释与决策中的识别难题,尤其是在现实场景中因果信息往往不完整但仍有价值的情况下。传统方法要么忽略协变量、要求完整的因果图结构,或依赖于严格的二值设定,限制了其应用范围。论文提出了一种基于部分因果信息的通用边界推导框架,其关键在于将可获得的结构性或统计性信息作为约束条件系统地嵌入到优化编程模型中,从而在无需完全可识别的前提下,得到更紧致且形式上有效的概率因果边界,显著提升了PoC在实际复杂场景中的适用性。

链接: https://arxiv.org/abs/2602.14503
作者: Yuxuan Xie,Ang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Probabilities of causation are fundamental to individual-level explanation and decision making, yet they are inherently counterfactual and not point-identifiable from data in general. Existing bounds either disregard available covariates, require complete causal graphs, or rely on restrictive binary settings, limiting their practical use. In real-world applications, causal information is often partial but nontrivial. This paper proposes a general framework for bounding probabilities of causation using partial causal information. We show how the available structural or statistical information can be systematically incorporated as constraints in a optimization programming formulation, yielding tighter and formally valid bounds without full identifiability. This approach extends the applicability of probabilities of causation to realistic settings where causal knowledge is incomplete but informative.

[AI-40] On the Rate-Distortion-Complexity Tradeoff for Semantic Communication

【速读】:该论文旨在解决语义通信(Semantic Communication)中如何有效表示和提取源信号语义信息的问题,尤其关注深度学习(Deep Learning, DL)编码器与解码器在训练和推理过程中存在的高计算复杂度问题。其解决方案的关键在于提出一个率-失真-复杂度(Rate-Distortion-Complexity, RDC)框架,该框架扩展了经典率失真理论,引入语义距离约束(包括传统的比特级失真度量和基于统计差异的散度度量)以及模型复杂度度量(源自最小描述长度和信息瓶颈理论),从而揭示了可实现率、语义距离与模型复杂度之间的三元权衡关系,并通过理论推导得到高斯和二进制语义源下的闭式最优解,实验验证了该理论指标与实际计算开销的高度相关性,为资源受限场景下的系统设计提供了理论指导。

链接: https://arxiv.org/abs/2602.14481
作者: Jingxuan Chai,Yong Xiao,Guangming Shi
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE for possible publication

点击查看摘要

Abstract:Semantic communication is a novel communication paradigm that focuses on conveying the user’s intended meaning rather than the bit-wise transmission of source signals. One of the key challenges is to effectively represent and extract the semantic meaning of any given source signals. While deep learning (DL)-based solutions have shown promising results in extracting implicit semantic information from a wide range of sources, existing work often overlooks the high computational complexity inherent in both model training and inference for the DL-based encoder and decoder. To bridge this gap, this paper proposes a rate-distortion-complexity (RDC) framework which extends the classical rate-distortion theory by incorporating the constraints on semantic distance, including both the traditional bit-wise distortion metric and statistical difference-based divergence metric, and complexity measure, adopted from the theory of minimum description length and information bottleneck. We derive the closed-form theoretical results of the minimum achievable rate under given constraints on semantic distance and complexity for both Gaussian and binary semantic sources. Our theoretical results show a fundamental three-way tradeoff among achievable rate, semantic distance, and model complexity. Extensive experiments on real-world image and video datasets validate this tradeoff and further demonstrate that our information-theoretic complexity measure effectively correlates with practical computational costs, guiding efficient system design in resource-constrained scenarios.

[AI-41] Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment

【速读】:该论文旨在解决数据并行(Data-Parallel, DP)训练中隐式不一致性(silent inconsistency)的问题,即在同步所有减少(synchronous all-reduce)机制下,尽管模型权重在迭代后保持数值等价,但不同工作节点(worker)间的优化动态(如损失和梯度)可能存在未被察觉的偏差。这种偏差无法通过常规的全局平均损失等聚合指标发现,可能导致训练不稳定但难以诊断。解决方案的关键在于提出一种轻量级、与模型无关的诊断框架,利用标准训练流程中已有的信号,引入三个互补指标:损失离散度(loss dispersion)、梯度范数离散度(gradient-norm dispersion)和梯度方向一致性(gradient-direction consistency,以跨工作节点余弦相似度衡量),从而在无额外计算开销且无需修改模型结构或优化算法的前提下,揭示隐藏的训练不稳定性模式,提升大规模DP微调过程的可诊断性与配置可靠性。

链接: https://arxiv.org/abs/2602.14462
作者: Hong Li,Zhen Zhou,Honggang Zhang,Yuping Luo,Xinyue Wang,Han Gong,Zhiyuan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:Data-parallel (DP) training with synchronous all-reduce is a dominant paradigm for full-parameter fine-tuning of large language models (LLMs). While parameter synchronization guarantees numerical equivalence of model weights after each iteration, it does not necessarily imply alignment of worker-level optimization dynamics before gradient aggregation. This paper identifies and studies this latent mismatch, termed \emphsilent inconsistency, where cross-worker divergence in losses and gradients can remain invisible under conventional aggregated monitoring signals. We propose a lightweight, model-agnostic diagnostic framework that quantifies worker-level consistency using training signals readily available in standard pipelines. Specifically, we introduce three complementary metrics: loss dispersion, gradient-norm dispersion, and gradient-direction consistency measured by inter-worker cosine similarity. The proposed metrics incur negligible overhead and require no modification to model architecture, synchronization mechanisms, or optimization algorithms. We validate the framework by fully fine-tuning the 1B-parameter \textttopenPangu-Embedded-1B-V1.1 model on the \texttttatsu-lab/alpaca dataset using an 8-NPU DP setup, under controlled perturbations of cross-rank stochasticity. Experimental results show that progressively desynchronized data shuffling and random seeds lead to substantial increases in loss/gradient dispersion and reduced directional alignment, despite smooth globally averaged loss curves. These findings demonstrate that the proposed indicators provide actionable visibility into hidden instability modes in large-scale DP fine-tuning, enabling more reliable diagnosis and configuration assessment.

[AI-42] WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因密集计算和内存访问导致的高成本问题,尤其是现有无需训练的激活稀疏化方法仅依赖激活信息且采用统一稀疏率,忽略了权重信息与不同模型模块间敏感度差异,从而造成性能次优的问题。解决方案的关键在于提出一种权重感知的混合粒度无训练激活稀疏方法(Weight-aware Mixed-Granularity Training-free Activation Sparsity, WiSparse):首先通过融合激活幅值与预计算权重范数构建权重感知机制以精准识别重要通道;其次设计混合粒度稀疏分配策略——先利用进化搜索在块级分配全局稀疏预算以保护敏感区域,再在块内细化分配以最小化重构误差,从而实现高效且高性能的稀疏推理。

链接: https://arxiv.org/abs/2602.14452
作者: Lei Chen,Yuan Meng,Xiaoyu Zhan,Zhi Wang,Wenwu Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1’s dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our research advances the limits of training-free approaches for efficient LLM inference, pushing the boundaries of achievable speedup without training.

[AI-43] Broken Chains: The Cost of Incomplete Reasoning in LLM s

【速读】:该论文旨在解决在资源受限条件下,不同推理模态(代码、自然语言、混合或无推理)对模型性能的影响问题,尤其关注推理令牌(reasoning tokens)在计算成本与准确性之间的权衡。其解决方案的关键在于提出一个系统性的框架,通过严格约束模型仅使用代码、注释、两者结合或完全不进行推理,并在多个前沿模型(GPT-5.1、Gemini 3 Flash、DeepSeek-V3.2、Grok 4.1)上对不同令牌预算(最优预算的10%至70%)进行消融实验,从而量化各类推理方式在token限制下的鲁棒性与有效性。研究发现,不完整的推理链可能误导模型,而特定模型(如Grok)在低预算下仍具较强鲁棒性,揭示了推理模态选择与模型架构协同优化的重要性。

链接: https://arxiv.org/abs/2602.14444
作者: Ian Su,Gaurav Purushothaman,Jey Narayan,Ruhika Goel,Kevin Zhu,Sunishchal Dev,Yash More,Maheep Chaudhary
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning-specialized models like OpenAI’s 5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities of code, natural language, hybrid, or none do perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) \textbftruncated reasoning can hurt as DeepSeek-V3.2 achieves 53% with no reasoning but only 17% with truncated CoT at 50% budget; (2) \textbfcode degrades gracefully as Gemini’s comments collapse to 0% while code maintains 43-47%; (3) \textbfhybrid reasoning underperforms single modalities; (4) \textbfrobustness is model-dependent as Grok maintains 80-90% at 30% budget where OpenAI and DeepSeek collapse to 7-27%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.

[AI-44] S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations

【速读】:该论文旨在解决大规模Transformer模型中激活值异常值(activation outliers)对量化(quantization)带来的挑战,这类异常值会导致量化过程中出现严重的精度下降。研究表明,随着预训练规模的扩大(如从CLIP到更深度训练的SigLIP和SigLIP2),异常值的严重程度显著增强。通过理论分析与实证相关性研究,作者发现激活异常值与权重矩阵的最大奇异值(dominant singular values)存在直接关联。解决方案的关键在于提出一种几何原理驱动的条件化方法——选择性谱衰减(Selective Spectral Decay, S²D),该方法在微调阶段仅对对应最大奇异值的权重分量进行手术式正则化,从而有效抑制激活异常值并生成对量化友好的良好条件表示。实验表明,S²D在ImageNet上可将权重量化(W4A4)精度提升最高达7%,结合量化感知训练(QAT)时进一步提升4%,且在下游任务和视觉-语言模型中具有良好的泛化能力。

链接: https://arxiv.org/abs/2602.14432
作者: Arnav Chavan,Nahush Lele,Udbhav Bamba,Sankalp Dayal,Aditi Raghunathan,Deepak Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization, creating excessively large ranges that cause severe accuracy drops during quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis as well as empirical correlation studies, we establish the direct link between these activation outliers and dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay ( S^2D ), a geometrically-principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Through extensive experiments, we demonstrate that S^2D significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with S^2D achieve up to 7% improved PTQ accuracy on ImageNet under W4A4 quantization and 4% gains when combined with QAT. These improvements also generalize across downstream tasks and vision-language models, enabling the scaling of increasingly large and rigorously trained models without sacrificing deployment efficiency.

[AI-45] he geometry of invariant learning: an information-theoretic analysis of data augmentation and generalization

【速读】:该论文旨在解决数据增强(data augmentation)在现代机器学习中提升模型泛化能力的理论机制不明确的问题,特别是其如何影响不变性学习(invariance learning)和泛化差距(generalization gap)。解决方案的关键在于提出一个基于信息论的框架,通过构建由原始数据分布与变换分布组成的增强分布模型,推导出一个新的泛化边界,该边界可分解为三项可解释成分:(1)原始数据与增强数据之间的分布差异项,(2)算法对训练数据的稳定性项,以及(3)增强变异性引起的感觉敏感性项。进一步引入“群直径”(group diameter)作为统一控制参数,量化增强操作在输入空间中诱导的最大扰动,揭示了小直径维持数据保真度但正则化有限、大直径增强稳定性却增加偏差与敏感性的内在权衡关系。

链接: https://arxiv.org/abs/2602.14423
作者: Abdelali Bouyahia,Frédéric LeBlanc,Mario Marchand
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Data augmentation is one of the most widely used techniques to improve generalization in modern machine learning, often justified by its ability to promote invariance to label-irrelevant transformations. However, its theoretical role remains only partially understood. In this work, we propose an information-theoretic framework that systematically accounts for the effect of augmentation on generalization and invariance learning. Our approach builds upon mutual information-based bounds, which relate the generalization gap to the amount of information a learning algorithm retains about its training data. We extend this framework by modeling the augmented distribution as a composition of the original data distribution with a distribution over transformations, which naturally induces an orbit-averaged loss function. Under mild sub-Gaussian assumptions on the loss function and the augmentation process, we derive a new generalization bound that decompose the expected generalization gap into three interpretable terms: (1) a distributional divergence between the original and augmented data, (2) a stability term measuring the algorithm dependence on training data, and (3) a sensitivity term capturing the effect of augmentation variability. To connect our bounds to the geometry of the augmentation group, we introduce the notion of group diameter, defined as the maximal perturbation that augmentations can induce in the input space. The group diameter provides a unified control parameter that bounds all three terms and highlights an intrinsic trade-off: small diameters preserve data fidelity but offer limited regularization, while large diameters enhance stability at the cost of increased bias and sensitivity. We validate our theoretical bounds with numerical experiments, demonstrating that it reliably tracks and predicts the behavior of the true generalization gap.

[AI-46] Boule or Baguette? A Study on Task Topology Length Generalization and the Benefit of Reasoning Traces

【速读】:该论文旨在解决生成式 AI 中推理模型(Reasoning Models)在使用中间推理步骤(Reasoning Traces, RTs)时,其泛化能力的局限性问题,特别是针对“长度泛化”(length generalization)这一关键挑战:即当模型仅在短证明长度下训练时,能否有效推广到需要更长推理链的任务。解决方案的关键在于构建了一个大规模逻辑推理数据集 PITA(包含超过 2300 万条命题逻辑语句及其对应证明),并提出两个核心度量指标——任务深度(task depth,衡量单个问题所需的推理步数)和任务广度(task breadth,衡量任务中唯一样本的数量)。通过系统性地控制这两个维度,研究发现 RT 模型在宽且浅的任务上表现优异,但在窄且深的任务上显著劣于非 RT 基线模型,从而揭示出 RT 模型在深度推理任务中的根本性性能限制,并指出其在广度任务上的优势,为理解推理模型的边界提供了理论框架与实证依据。

链接: https://arxiv.org/abs/2602.14404
作者: William L. Tong,Ege Cakar,Cengiz Pehlevan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 38 pages, 11 figures, code available at this https URL

点击查看摘要

Abstract:Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and the limits of this paradigm, remain incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.

[AI-47] Competition for attention predicts good-to-bad tipping in AI

【速读】:该论文旨在解决边缘人工智能(Edge AI)设备中潜在危险行为的动态突变问题,这类设备在无互联网连接和最小安全监管的情况下运行类似ChatGPT的语言模型,可能诱发自伤、财务损失及极端主义等风险。现有安全工具要么依赖云端连接,要么仅在危害发生后才能发现失效。论文的关键解决方案是揭示了危险突变起源于原子尺度上的注意力竞争机制——即对话上下文与不同输出基底之间的点积竞争决定了系统动力学临界点 $ n^* $,并由此提出一个可量化的数学公式用于预测和控制此类突变。该机制具有跨领域、跨语言和跨文化适用性,为实现边缘AI的安全治理提供了新的控制杠杆。

链接: https://arxiv.org/abs/2602.14370
作者: Neil F. Johnson,Frank Y. Huo
机构: 未知
类目: Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:More than half the global population now carries devices that can run ChatGPT-like language models with no Internet connection and minimal safety oversight – and hence the potential to promote self-harm, financial losses and extremism among other dangers. Existing safety tools either require cloud connectivity or discover failures only after harm has occurred. Here we show that a large class of potentially dangerous tipping originates at the atomistic scale in such edge AI due to competition for the machinery’s attention. This yields a mathematical formula for the dynamical tipping point n*, governed by dot-product competition for attention between the conversation’s context and competing output basins, that reveals new control levers. Validated against multiple AI models, the mechanism can be instantiated for different definitions of ‘good’ and ‘bad’ and hence in principle applies across domains (e.g. health, law, finance, defense), changing legal landscapes (e.g. EU, UK, US and state level), languages, and cultural settings.

[AI-48] A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

【速读】:该论文旨在解决自托管工具型个人AI代理(如Clawdbot)在面对模糊指令、对抗性引导或开放目标时所引发的安全与隐私风险问题。其解决方案的关键在于构建一个以轨迹为中心的评估框架,通过采集完整交互轨迹(包括消息、动作及工具调用参数/输出),结合自动化判别模型(AgentDoG-Qwen3-4B)与人工审核,系统性地量化分析代理在六类风险维度下的行为表现,并识别出典型失败模式与安全漏洞,从而为提升工具型AI代理的安全性提供实证依据和改进方向。

链接: https://arxiv.org/abs/2602.14364
作者: Tianyu Chen,Dongrui Liu,Xia Hu,Jingyi Yu,Wenjie Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk dimensions. Our test suite samples and lightly adapts scenarios from prior agent-safety benchmarks (including ATBench and LPS-Bench) and supplements them with hand-designed cases tailored to Clawdbot’s tool surface. We log complete interaction trajectories (messages, actions, tool-call arguments/outputs) and assess safety using both an automated trajectory judge (AgentDoG-Qwen3-4B) and human review. Across 34 canonical cases, we find a non-uniform safety profile: performance is generally consistent on reliability-focused tasks, while most failures arise under underspecified intent, open-ended goals, or benign-seeming jailbreak prompts, where minor misinterpretations can escalate into higher-impact tool actions. We supplemented the overall results with representative case studies and summarized the commonalities of these cases, analyzing the security vulnerabilities and typical failure modes that Clawdbot is prone to trigger in practice.

[AI-49] WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control ICLR2026

【速读】:该论文旨在解决模型-based强化学习(Model-Based Reinforcement Learning, MBRL)在实践中因模型误差累积、单模态世界模型对多模态动态过程的平均化建模以及预测过度自信导致的学习偏差等问题,从而限制其样本效率和性能表现。解决方案的关键在于提出WIMLE方法,该方法将隐式最大似然估计(Implicit Maximum Likelihood Estimation, IMLE)扩展至MBRL框架,以学习无需迭代采样的随机且多模态的世界模型,并通过集成与潜在空间采样估计预测不确定性;训练过程中,WIMLE依据预测置信度对合成轨迹进行加权,保留高置信度的模型回放轨迹,同时抑制低置信度预测带来的偏差,实现稳定的学习过程。

链接: https://arxiv.org/abs/2602.14351
作者: Mehran Aghabozorgi,Alireza Moazeni,Yanshu Zhang,Ke Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026. OpenReview: this https URL

点击查看摘要

Abstract:Model-based reinforcement learning promises strong sample efficiency but often underperforms in practice due to compounding model error, unimodal world models that average over multi-modal dynamics, and overconfident predictions that bias learning. We introduce WIMLE, a model-based method that extends Implicit Maximum Likelihood Estimation (IMLE) to the model-based RL framework to learn stochastic, multi-modal world models without iterative sampling and to estimate predictive uncertainty via ensembles and latent sampling. During training, WIMLE weights each synthetic transition by its predicted confidence, preserving useful model rollouts while attenuating bias from uncertain predictions and enabling stable learning. Across 40 continuous-control tasks spanning DeepMind Control, MyoSuite, and HumanoidBench, WIMLE achieves superior sample efficiency and competitive or better asymptotic performance than strong model-free and model-based baselines. Notably, on the challenging Humanoid-run task, WIMLE improves sample efficiency by over 50 % relative to the strongest competitor, and on HumanoidBench it solves 8 of 14 tasks (versus 4 for BRO and 5 for SimbaV2). These results highlight the value of IMLE-based multi-modality and uncertainty-aware weighting for stable model-based RL.

[AI-50] AXE: An Agent ic eXploit Engine for Confirming Zero-Day Vulnerability Reports

【速读】:该论文旨在解决当前漏洞检测工具在软件项目中因大量误报和不可操作的报告而给维护者带来的负担问题,同时指出现有自动化利用系统通常与漏洞检测流程脱节,未能充分利用诸如漏洞类型(如CWE分类)和源码位置等易得元数据。解决方案的关键在于提出Agentic eXploit Engine (AXE),一个基于多智能体框架的灰盒利用系统,其通过解耦的规划、代码探索和动态执行反馈机制,将轻量级漏洞元数据映射为具体可执行的攻击路径。实验表明,AXE在CVE-Bench数据集上实现了30%的利用成功率,相较最先进的黑盒基线提升3倍,且即使单智能体配置下也比黑盒方法提升1.75倍,验证了其在漏洞验证与修复流程中的有效性与实用性。

链接: https://arxiv.org/abs/2602.14345
作者: Amirali Sajadi,Tu Nguyen,Kostadin Damevski,Preetha Chatterjee
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source-code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey-box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi-agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE-Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state-of-the-art black-box baselines. Even in a single-agent configuration, grey-box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific reasoning gaps, including misinterpreted vulnerability semantics and unmet execution preconditions. For successful exploits, AXE produces actionable, reproducible proof-of-concept artifacts, demonstrating its utility in streamlining Web vulnerability triage and remediation. We further evaluate AXE’s generalizability through a case study on a recent real-world vulnerability not included in CVE-Bench.

[AI-51] Zero-Shot Instruction Following in RL via Structured LTL Representations

【速读】:该论文旨在解决多任务强化学习中指令跟随(instruction following)的问题,特别是如何让智能体在零样本(zero-shot)场景下执行训练期间未见过的新任务。现有方法虽能训练出通用策略,但难以有效捕捉线性时序逻辑(Linear Temporal Logic, LTL)规范所蕴含的丰富逻辑与时间结构。解决方案的关键在于提出一种新型结构化任务表示学习方法:通过有限状态自动机构造布尔公式序列来条件化策略,并设计分层神经架构以编码这些公式的逻辑结构,同时引入注意力机制使策略能够推理未来子目标(subgoals),从而显著提升训练效率与泛化能力。

链接: https://arxiv.org/abs/2602.14344
作者: Mathias Jackermeier,Mattia Giuri,Jacques Cloete,Alessandro Abate
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study instruction following in multi-task reinforcement learning, where an agent must zero-shot execute novel tasks not seen during training. In this setting, linear temporal logic (LTL) has recently been adopted as a powerful framework for specifying structured, temporally extended tasks. While existing approaches successfully train generalist policies, they often struggle to effectively capture the rich logical and temporal structure inherent in LTL specifications. In this work, we address these concerns with a novel approach to learn structured task representations that facilitate training and generalisation. Our method conditions the policy on sequences of Boolean formulae constructed from a finite automaton of the task. We propose a hierarchical neural architecture to encode the logical structure of these formulae, and introduce an attention mechanism that enables the policy to reason about future subgoals. Experiments in a variety of complex environments demonstrate the strong generalisation capabilities and superior performance of our approach.

[AI-52] rain Less Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Model, LLM)后训练过程中因Group Relative Policy Optimization (GRPO)方法中出现零优势(zero advantage)死区而导致的计算资源浪费问题。具体而言,当一组rollouts(轨迹)全部获得相同奖励时(即全部正确或全部错误),GRPO中的组归一化优势值为零,无法提供梯度信号,从而造成无效计算。解决方案的关键在于提出自适应高效轨迹优化(Adaptive Efficient Rollout Optimization, AERO):通过引入自适应轨迹策略、选择性拒绝机制以战略性地剪枝低效轨迹,并维护贝叶斯后验分布来避免零优势区域,从而显著提升计算效率。实验表明,在保持甚至优于GRPO性能的前提下,AERO可在相同总轨迹预算下减少约48%的训练计算量,并将每步训练的墙钟时间缩短约45%。

链接: https://arxiv.org/abs/2602.14338
作者: Zhi Zhang,Zhen Han,Costas Mavromatis,Qi Zhu,Yunyi Zhang,Sheng Guan,Dingmin Wang,Xiong Zhou,Shuai Wang,Soji Adeshina,Vassilis Ioannidis,Huzefa Rangwala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) plays a central role in large language model (LLM) post-training. Among existing approaches, Group Relative Policy Optimization (GRPO) is widely used, especially for RL with verifiable rewards (RLVR) fine-tuning. In GRPO, each query prompts the LLM to generate a group of rollouts with a fixed group size N . When all rollouts in a group share the same outcome, either all correct or all incorrect, the group-normalized advantages become zero, yielding no gradient signal and wasting fine-tuning compute. We introduce Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO. AERO uses an adaptive rollout strategy, applies selective rejection to strategically prune rollouts, and maintains a Bayesian posterior to prevent zero-advantage dead zones. Across three model configurations (Qwen2.5-Math-1.5B, Qwen2.5-7B, and Qwen2.5-7B-Instruct), AERO improves compute efficiency without sacrificing performance. Under the same total rollout budget, AERO reduces total training compute by about 48% while shortening wall-clock time per step by about 45% on average. Despite the substantial reduction in compute, AERO matches or improves Pass@8 and Avg@8 over GRPO, demonstrating a practical, scalable, and compute-efficient strategy for RL-based LLM alignment.

[AI-53] Benchmarking at the Edge of Comprehension

【速读】:该论文旨在解决当前前沿大语言模型(Large Language Models, LLMs)在新基准测试中迅速达到饱和状态所带来的“后理解时代”(post-comprehension regime)问题——即当模型能力逼近甚至超越人类理解边界时,传统依赖人工设计任务、提供标准答案和评估复杂输出的基准测试方法将难以持续有效。其解决方案的核心是提出一种批判鲁棒性基准测试(Critique-Resilient Benchmarking),通过引入“批判鲁棒正确性”(critique-resilient correctness)的概念:一个答案被视为正确,仅当没有任何对手能以令人信服的方式证明其错误。该方法将人类角色从全任务理解者转变为局部断言验证者,从而在无法全面理解任务的情况下仍能维持评估完整性;同时采用分项二部Bradley-Terry模型联合对模型解题能力和生成挑战性问题的能力进行排序,实证表明该框架在数学领域对八种前沿LLM的评分具有稳定性且与外部能力指标高度相关。

链接: https://arxiv.org/abs/2602.14307
作者: Samuele Marro,Jialin Yu,Emanuele La Malfa,Oishi Deb,Jiawei Li,Yibo Yang,Ebey Abraham,Sunando Sengupta,Eric Sommerlade,Michael Wooldridge,Philip Torr
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.

[AI-54] AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines

【速读】:该论文旨在解决自主Web GUI代理(Autonomous Web GUI Agents)在训练过程中因真实网站交互轨迹难以获取与验证而导致的性能瓶颈问题。现有方法依赖昂贵且不一致的外部验证器来评估每一步操作的正确性,而真实网站的状态转移机制是隐式的,难以进行精确控制和验证。解决方案的关键在于提出AutoWebWorld框架,将网页环境建模为有限状态机(Finite State Machine, FSM),通过编码代理将FSM规则自动转化为可交互的网页环境;在此基础上,显式定义所有状态、动作及转换规则,从而实现程序化验证——即动作正确性可通过预设规则判断,任务成功则由是否抵达目标状态确认。这一设计使得整个训练数据生成过程完全自动化,并显著降低单位轨迹成本(仅0.04美元/轨迹),同时提升代理在真实场景下的性能表现。

链接: https://arxiv.org/abs/2602.14296
作者: Yifan Wu,Yiran Peng,Yiyu Chen,Jianhao Ruan,Zijie Zhuang,Cheng Yang,Jiayi Zhang,Man Chen,Yenchi Tseng,Zhaoyang Yu,Liang Chen,Yuyao Zhai,Bang Liu,Chenglin Wu,Yuyu Luo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The performance of autonomous Web GUI agents heavily relies on the quality and quantity of their training data. However, a fundamental bottleneck persists: collecting interaction trajectories from real-world websites is expensive and difficult to verify. The underlying state transitions are hidden, leading to reliance on inconsistent and costly external verifiers to evaluate step-level correctness. To address this, we propose AutoWebWorld, a novel framework for synthesizing controllable and verifiable web environments by modeling them as Finite State Machines (FSMs) and use coding agents to translate FSMs into interactive websites. Unlike real websites, where state transitions are implicit, AutoWebWorld explicitly defines all states, actions, and transition rules. This enables programmatic verification: action correctness is checked against predefined rules, and task success is confirmed by reaching a goal state in the FSM graph. AutoWebWorld enables a fully automated search-and-verify pipeline, generating over 11,663 verified trajectories from 29 diverse web environments at only 0.04 per trajectory. Training on this synthetic data significantly boosts real-world performance. Our 7B Web GUI agent outperforms all baselines within 15 steps on WebVoyager. Furthermore, we observe a clear scaling law: as the synthetic data volume increases, performance on WebVoyager and Online-Mind2Web consistently improves.

[AI-55] Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows

【速读】:该论文旨在解决传统机器学习(Machine Learning, ML)流水线中模型推理被静态化处理的问题,即ML模型仅作为预处理步骤嵌入流程,缺乏与大语言模型(Large Language Model, LLM)的动态协同能力。其核心挑战在于如何让LLM在复杂任务中根据上下文语境自主调用定量预测模型,并对输出进行合理推理。解决方案的关键是提出“机器学习即工具”(Machine Learning as a Tool, MLAT)设计范式:将预训练的统计机器学习模型(如XGBoost)封装为可调用工具,集成至LLM代理(agent)工作流中,使LLM能基于对话上下文决定何时、如何使用这些模型。该方法实现了定量预测与自然语言推理的深度融合,已在PitchCraft系统中验证——通过两个代理协作完成从销售录音到专业提案的自动化生成,其中定价模型基于70个样本训练,R²达0.807,且整体耗时从数小时缩短至10分钟以内。

链接: https://arxiv.org/abs/2602.14295
作者: Edwin Chen,Zulekha Bibi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to the Google Gemini 3 Hackathon

点击查看摘要

Abstract:We introduce Machine Learning as a Tool (MLAT), a design pattern in which pre-trained statistical machine learning models are exposed as callable tools within large language model (LLM) agent workflows. This allows an orchestrating agent to invoke quantitative predictions when needed and reason about their outputs in context. Unlike conventional pipelines that treat ML inference as a static preprocessing step, MLAT positions the model as a first-class tool alongside web search, database queries, and APIs, enabling the LLM to decide when and how to use it based on conversational context. To validate MLAT, we present PitchCraft, a pilot production system that converts discovery call recordings into professional proposals with ML-predicted pricing. The system uses two agents: a Research Agent that gathers prospect intelligence via parallel tool calls, and a Draft Agent that invokes an XGBoost pricing model as a tool call and generates a complete proposal through structured outputs. The pricing model, trained on 70 examples combining real and human-verified synthetic data, achieves R^2 = 0.807 on held-out data with a mean absolute error of 3688 USD. The system reduces proposal generation time from multiple hours to under 10 minutes. We describe the MLAT framework, structured output architecture, training methodology under extreme data scarcity, and sensitivity analysis demonstrating meaningful learned relationships. MLAT generalizes to domains requiring quantitative estimation combined with contextual reasoning. Comments: Submitted to the Google Gemini 3 Hackathon Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.14295 [cs.LG] (or arXiv:2602.14295v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.14295 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-56] KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

【速读】:该论文旨在解决跨代GPU架构下CUDA代码优化的难题,即如何在日益复杂的硬件特定优化空间中实现峰值性能。传统编译器受限于固定启发式策略,而微调大型语言模型(Large Language Models, LLMs)成本高昂;同时,基于代理的工作流缺乏从先前探索中聚合知识的能力,导致采样偏差和次优解。其解决方案的关键在于提出KernelBlaster框架——一种基于记忆增强的上下文强化学习(Memory-Augmented In-context Reinforcement Learning, MAIC-RL)方法,通过构建可检索的持久化CUDA知识库(Persistent CUDA Knowledge Base),使LLM代理能够积累经验并系统性地指导未来任务决策,从而有效提升搜索效率与优化质量。该框架还引入了一种新的基于性能分析引导的文本梯度代理流程,显著优于朴素重写策略,在KernelBench不同难度级别上相较PyTorch基线分别实现了1.43x、2.50x和1.50x的几何平均加速比。

链接: https://arxiv.org/abs/2602.14293
作者: Kris Shengjun Dong,Sahil Modi,Dima Nikiforov,Sana Damani,Edward Lin,Siva Kumar Sastry Hari,Christos Kozyrakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 33 pages with appendix

点击查看摘要

Abstract:Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile-guided, textual-gradient-based agentic flow for CUDA generation and optimization to achieve high performance across generations of GPU architectures. KernelBlaster guides LLM agents to systematically explore high-potential optimization strategies beyond naive rewrites. Compared to the PyTorch baseline, our method achieves geometric mean speedups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively. We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation pipeline.

[AI-57] Reverse N-Wise Output-Oriented Testing for AI/ML and Quantum Computing Systems

【速读】:该论文旨在解决生成式 AI (Generative AI) 与量子计算软件在测试过程中面临的前所未有的挑战,这些问题包括高维连续输入空间、概率性/非确定性输出分布、仅通过可观测预测行为和测量结果定义的行为正确性,以及信任度、公平性、校准能力、鲁棒性等关键质量维度的复杂多维交互特性。其解决方案的核心是提出“反向 n -wise 输出测试”(reverse n-wise output testing),这是一种数学上严谨的范式反转方法:直接在领域特定的输出等价类(如 ML 置信度校准桶、决策边界区域、公平性分区、嵌入聚类、排序稳定性带、量子测量结果分布及错误综合征模式)上构建覆盖数组,并通过无梯度元启发式优化求解黑盒逆映射问题,从而合成能够诱发目标行为特征的输入配置或量子电路参数。该框架显著提升了故障检测率、测试效率,并实现了面向 MLOps 和量子验证的结构化自动化流程。

链接: https://arxiv.org/abs/2602.14275
作者: Lamine Rihani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence/machine learning (AI/ML) systems and emerging quantum computing software present unprecedented testing challenges characterized by high-dimensional/continuous input spaces, probabilistic/non-deterministic output distributions, behavioral correctness defined exclusively over observable prediction behaviors and measurement outcomes, and critical quality dimensions, trustworthiness, fairness, calibration, robustness, error syndrome patterns, that manifest through complex multi-way interactions among semantically meaningful output properties rather than deterministic input-output mappings. This paper introduces reverse n-wise output testing, a mathematically principled paradigm inversion that constructs covering arrays directly over domain-specific output equivalence classes, ML confidence calibration buckets, decision boundary regions, fairness partitions, embedding clusters, ranking stability bands, quantum measurement outcome distributions (0-dominant, 1-dominant, superposition collapse), error syndrome patterns (bit-flip, phase-flip, correlated errors), then solves the computationally challenging black-box inverse mapping problem via gradient-free metaheuristic optimization to synthesize input feature configurations or quantum circuit parameters capable of eliciting targeted behavioral signatures from opaque models. The framework delivers synergistic benefits across both domains: explicit customer-centric prediction/measurement coverage guarantees, substantial improvements in fault detection rates for ML calibration/boundary failures and quantum error syndromes, enhanced test suite efficiency, and structured MLOps/quantum validation pipelines with automated partition discovery from uncertainty analysis and coverage drift monitoring.

[AI-58] Integrating Unstructured Text into Causal Inference: Empirical Evidence from Real Data

【速读】:该论文试图解决在缺乏结构化数据(structured data)的情况下,如何进行因果推断(causal inference)的问题。传统因果推断方法高度依赖于结构化数据,但在许多现实场景中,这类数据可能不完整或不可用。解决方案的关键在于利用基于Transformer的语言模型从非结构化文本中提取因果信息,从而实现对个体、群体及总体层面的因果估计,并通过与结构化数据所得结果的对比验证了该方法的有效性,拓展了因果推断在仅有文本数据场景下的应用边界。

链接: https://arxiv.org/abs/2602.14274
作者: Boning Zhou,Ziyu Wang,Han Hong,Haoqi Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Causal inference, a critical tool for informing business decisions, traditionally relies heavily on structured data. However, in many real-world scenarios, such data can be incomplete or unavailable. This paper presents a framework that leverages transformer-based language models to perform causal inference using unstructured text. We demonstrate the effectiveness of our framework by comparing causal estimates derived from unstructured text against those obtained from structured data across population, group, and individual levels. Our findings show consistent results between the two approaches, validating the potential of unstructured text in causal inference tasks. Our approach extends the applicability of causal inference methods to scenarios where only textual data is available, enabling data-driven business decision-making when structured tabular data is scarce.

[AI-59] Cross-household Transfer Learning Approach with LSTM-based Demand Forecasting

【速读】:该论文旨在解决住宅热泵(Heat Pump, HP)系统中热水产量优化问题,特别是在大规模部署场景下,如何实现精准预测家庭热水需求以减少能源浪费并保障舒适性。传统方法需为每个家庭单独训练机器学习模型,导致计算成本高昂且难以扩展。其解决方案的关键在于提出一种基于迁移学习(Transfer Learning, TL)的框架 DELTAiF,该框架通过从一个具有规律用水模式的代表性家庭中提取可迁移的知识,并对其进行微调以适应其他家庭,从而避免了为每台热泵独立建模的需求。这种方法在保持高预测精度(R² 值介于 0.874 至 0.991 之间,平均绝对百分比误差 MAPE 在 0.001 至 0.017 之间)的同时,将整体训练时间减少了约 67%,显著提升了热水需求预测的可扩展性和实用性。

链接: https://arxiv.org/abs/2602.14267
作者: Manal Rahal,Bestoun S. Ahmed,Roger Renström,Robert Stener
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:With the rapid increase in residential heat pump (HP) installations, optimizing hot water production in households is essential, yet it faces major technical and scalability challenges. Adapting production to actual household needs requires accurate forecasting of hot water demand to ensure comfort and, most importantly, to reduce energy waste. However, the conventional approach of training separate machine learning models for each household becomes computationally expensive at scale, particularly in cloud-connected HP deployments. This study introduces DELTAiF, a transfer learning (TL) based framework that provides scalable and accurate prediction of household hot water consumption. By predicting large hot water usage events, such as showers, DELTAiF enables adaptive yet scalable hot water production at the household level. DELTAiF leverages learned knowledge from a representative household and fine-tunes it across others, eliminating the need to train separate machine learning models for each HP installation. This approach reduces overall training time by approximately 67 percent while maintaining high predictive accuracy values between 0.874 and 0.991, and mean absolute percentage error values between 0.001 and 0.017. The results show that TL is particularly effective when the source household exhibits regular consumption patterns, enabling hot water demand forecasting at scale. Comments: 8 pages Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.14267 [cs.LG] (or arXiv:2602.14267v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.14267 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-60] GRAIL: Goal Recognition Alignment through Imitation Learning AAMAS2026

【速读】:该论文旨在解决传统目标识别方法在面对非最优行为时的局限性问题,即现有方法通常假设代理遵循最优目标导向策略,而这种假设可能与真实行为存在偏差,从而影响目标识别的准确性。解决方案的关键在于提出一种基于模仿学习的目标识别对齐方法(Goal Recognition Alignment through Imitation Learning, GRAIL),其通过模仿学习和逆强化学习直接从潜在次优的示范轨迹中学习每个候选目标对应的定向策略;随后,在单次前向传播中利用这些学习到的策略对观测的部分轨迹进行评分,从而在保留经典目标识别“一次推断”能力的同时,有效建模次优及系统性偏差行为,显著提升在复杂环境下的目标识别性能。

链接: https://arxiv.org/abs/2602.14252
作者: Osher Elhadad,Felipe Meneguzzi,Reuth Mirsky
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted for publication at AAMAS 2026

点击查看摘要

Abstract:Understanding an agent’s goals from its behavior is fundamental to aligning AI systems with human intentions. Existing goal recognition methods typically rely on an optimal goal-oriented policy representation, which may differ from the actor’s true behavior and hinder the accurate recognition of their goal. To address this gap, this paper introduces Goal Recognition Alignment through Imitation Learning (GRAIL), which leverages imitation learning and inverse reinforcement learning to learn one goal-directed policy for each candidate goal directly from (potentially suboptimal) demonstration trajectories. By scoring an observed partial trajectory with each learned goal-directed policy in a single forward pass, GRAIL retains the one-shot inference capability of classical goal recognition while leveraging learned policies that can capture suboptimal and systematically biased behavior. Across the evaluated domains, GRAIL increases the F1-score by more than 0.5 under systematically biased optimal behavior, achieves gains of approximately 0.1-0.3 under suboptimal behavior, and yields improvements of up to 0.4 under noisy optimal trajectories, while remaining competitive in fully optimal settings. This work contributes toward scalable and robust models for interpreting agent goals in uncertain environments.

[AI-61] Multi-Agent Debate: A Unified Agent ic Framework for Tabular Anomaly Detection

【速读】:该论文旨在解决表格数据(tabular data)异常检测中因模型异质性导致的性能瓶颈问题,尤其在分布偏移(distribution shift)、缺失值(missingness)和稀有异常(rare-anomaly)等现实场景下,单一检测器或静态集成方法难以保持鲁棒性。其解决方案的关键在于提出一种多智能体辩论框架(Multi-Agent Debating, MAD),将不同模型之间的预测分歧视为首要信号,通过一个数学严谨的协调层(coordination layer)进行动态整合:每个代理(agent)是一个基于机器学习的异常检测器,输出归一化异常分数、置信度及结构化证据,并由大语言模型(LLM)作为批评者增强解释能力;协调器则基于指数梯度规则更新各代理影响力,生成最终的辩论异常分数与可审计的辩论轨迹。该框架不仅能恢复如专家混合(mixture-of-experts)等已有方法,还提供理论上的后悔保证(regret guarantees)和基于校准的假阳性控制(conformal calibration),显著提升了异常检测的鲁棒性和可解释性。

链接: https://arxiv.org/abs/2602.14251
作者: Pinqiao Wang,Sheng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular anomaly detection is often handled by single detectors or static ensembles, even though strong performance on tabular data typically comes from heterogeneous model families (e.g., tree ensembles, deep tabular networks, and tabular foundation models) that frequently disagree under distribution shift, missingness, and rare-anomaly regimes. We propose MAD, a Multi-Agent Debating framework that treats this disagreement as a first-class signal and resolves it through a mathematically grounded coordination layer. Each agent is a machine learning (ML)-based detector that produces a normalized anomaly score, confidence, and structured evidence, augmented by a large language model (LLM)-based critic. A coordinator converts these messages into bounded per-agent losses and updates agent influence via an exponentiated-gradient rule, yielding both a final debated anomaly score and an auditable debate trace. MAD is a unified agentic framework that can recover existing approaches, such as mixture-of-experts gating and learning-with-expert-advice aggregation, by restricting the message space and synthesis operator. We establish regret guarantees for the synthesized losses and show how conformal calibration can wrap the debated score to control false positives under exchangeability. Experiments on diverse tabular anomaly benchmarks show improved robustness over baselines and clearer traces of model disagreement

[AI-62] A Hybrid TGN-SEAL Model for Dynamic Graph Link Prediction

【速读】:该论文旨在解决稀疏且持续演化的动态网络中链接预测的难题,传统方法如启发式算法和图神经网络(Graph Neural Networks, GNNs)通常针对静态图设计,难以捕捉时间依赖性;而现有的基于快照的方法虽部分缓解此问题,但在数据稀疏和类别不平衡场景下表现不佳,尤其在电信通话详单记录(Call Detail Records, CDRs)这类具有瞬时交互特性的网络中更为明显。论文提出的关键解决方案是改进Temporal Graph Networks (TGNs)框架,通过提取候选链接周围的封闭子图(enclosing subgraphs),使模型能够联合学习局部拓扑结构与时间信息,从而提升在稀疏条件下的预测精度。实验表明,该方法相较于标准TGNs平均精度提升2.6%。

链接: https://arxiv.org/abs/2602.14239
作者: Nafiseh Sadat Sajadi,Behnam Bahrak,Mahdi Jafari Siavoshani
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Predicting links in sparse, continuously evolving networks is a central challenge in network science. Conventional heuristic methods and deep learning models, including Graph Neural Networks (GNNs), are typically designed for static graphs and thus struggle to capture temporal dependencies. Snapshot-based techniques partially address this issue but often encounter data sparsity and class imbalance, particularly in networks with transient interactions such as telecommunication call detail records (CDRs). Temporal Graph Networks (TGNs) model dynamic graphs by updating node embeddings over time; however, their predictive accuracy under sparse conditions remains limited. In this study, we improve the TGN framework by extracting enclosing subgraphs around candidate links, enabling the model to jointly learn structural and temporal information. Experiments on a sparse CDR dataset show that our approach increases average precision by 2.6% over standard TGNs, demonstrating the advantages of integrating local topology for robust link prediction in dynamic networks.

[AI-63] Evaluating LLM s in Finance Requires Explicit Bias Consideration

【速读】:该论文旨在解决金融领域大型语言模型(Large Language Models, LLMs)应用中存在的多种系统性偏差问题,这些问题包括前瞻偏差(look-ahead bias)、幸存者偏差(survivorship bias)、叙事偏差(narrative bias)、目标偏差(objective bias)和成本偏差(cost bias),这些偏差会扭曲模型性能评估结果,导致回测污染和部署决策失效。论文指出当前研究中缺乏对这些偏差的系统性识别与控制,且多数文献未充分讨论任一特定偏差。其解决方案的关键在于提出一个结构有效性框架(Structural Validity Framework)和一套最小化偏差诊断的评估检查清单,强调在任何部署声明前必须对模型进行结构有效性验证,从而提升金融LLM系统的可信度与可复现性。

链接: https://arxiv.org/abs/2602.14233
作者: Yaxuan Kong,Hoyoung Lee,Yoontae Hwang,Alejandro Lopez-Lira,Bradford Levy,Dhagash Mehta,Qingsong Wen,Chanyeol Choi,Yongjae Lee,Stefan Zohren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly integrated into financial workflows, but evaluation practice has not kept up. Finance-specific biases can inflate performance, contaminate backtests, and make reported results useless for any deployment claim. We identify five recurring biases in financial LLM applications. They include look-ahead bias, survivorship bias, narrative bias, objective bias, and cost bias. These biases break financial tasks in distinct ways and they often compound to create an illusion of validity. We reviewed 164 papers from 2023 to 2025 and found that no single bias is discussed in more than 28 percent of studies. This position paper argues that bias in financial LLM systems requires explicit attention and that structural validity should be enforced before any result is used to support a deployment claim. We propose a Structural Validity Framework and an evaluation checklist with minimal requirements for bias diagnosis and future system design. The material is available at this https URL.

[AI-64] CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments

【速读】:该论文旨在解决自主代理在真实组织工作中面临的多任务长期推理挑战,即如何在持久执行环境中协同处理数十个交错、依赖且需动态重优先级的长周期任务(45+任务,500–1500+步)。现有基准仅评估单一任务表现,无法反映复杂现实场景。解决方案的关键在于提出CorpGen框架,其核心机制包括:分层规划实现多目标对齐、子代理隔离避免跨任务干扰、分层记忆结构(工作记忆、结构化记忆、语义记忆)提升信息管理效率,以及自适应摘要技术降低上下文饱和风险。实验表明,CorpGen在OSWorld Office模拟环境中相较基线提升达3.5倍(完成率从4.3%提高至15.2%),且性能随负载增长保持稳定,验证了架构设计的有效性。

链接: https://arxiv.org/abs/2602.14229
作者: Abubakarr Jaye,Nigel Boachie Kumankumah,Chidera Biringa,Anjel Shaileshbhai Patel,Sulaiman Vesal,Dayquan Julienne,Charlotte Siska,Manuel Raúl Meléndez Luján,Anthony Twum-Barimah,Mauricio Velazco,Tianwei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-horizon reasoning is a key challenge for autonomous agents, yet existing benchmarks evaluate agents on single tasks in isolation. Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization. We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class requiring coherent execution across dozens of interleaved tasks (45+, 500-1500+ steps) within persistent execution contexts spanning hours. We identify four failure modes that cause baseline CUAs to degrade from 16.7% to 8.7% completion as load scales 25% to 100%, a pattern consistent across three independent implementations. These failure modes are context saturation (O(N) vs O(1) growth), memory interference, dependency complexity (DAGs vs. chains), and reprioritization overhead. We present CorpGen, an architecture-agnostic framework addressing these failures via hierarchical planning for multi-horizon goal alignment, sub-agent isolation preventing cross-task contamination, tiered memory (working, structured, semantic), and adaptive summarization. CorpGen simulates corporate environments through digital employees with persistent identities and realistic schedules. Across three CUA backends (UFO2, OpenAI CUA, hierarchical) on OSWorld Office, CorpGen achieves up to 3.5x improvement over baselines (15.2% vs 4.3%) with stable performance under increasing load, confirming that gains stem from architectural mechanisms rather than specific CUA implementations. Ablation studies show experiential learning provides the largest gains.

[AI-65] xt Before Vision: Staged Knowledge Injection Matters for Agent ic RLVR in Ultra-High-Resolution Remote Sensing Understanding

【速读】:该论文旨在解决超高分辨率(Ultra-High-Resolution, UHR)遥感(Remote Sensing, RS)图像中多模态推理的瓶颈问题,即模型在海量像素空间中难以定位与任务相关的微小区域。传统强化学习方法因缺乏结构化的领域先验而难以有效导航此类视觉空间。解决方案的关键在于提出一种分阶段的知识注入策略:首先利用可扩展且基于知识图谱验证的地球科学文本问答(Earth-science text-only QA)进行冷启动,以注入概念、机制解释和决策规则等推理结构;随后在监督微调(Supervised Fine-Tuning, SFT)阶段使用相同的高难度UHR图文样本进行“预热”,从而稳定并增强后续基于工具的强化学习(Agentic Reinforcement Learning with Verifiable Rewards, RLVR)性能。该方法在XLRS-Bench上实现60.40% Pass@1,显著优于更大规模通用模型,成为新基准。

链接: https://arxiv.org/abs/2602.14225
作者: Fengxiang Wang,Mingshuo Chen,Yueying Li,Yajie Yang,Yuhao Zhou,Di Wang,Yifan Zhang,Haoyu Wang,Haiyan Zhao,Hongda Sun,Long Lan,Jun Song,Yulin Wang,Jing Zhang,Wenlong Zhang,Bo Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model necessitates localizing tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms: comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS this http URL controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence this http URL on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures;and (2) “pre-warming” on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves a 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state-of-the-art.

[AI-66] SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

【速读】:该论文旨在解决编码代理(coding agents)中因技能(skill)抽象引入的隐蔽式提示注入(prompt injection)攻击问题,即恶意技能可诱导代理偏离用户意图和安全策略,而传统手工构造的攻击往往因意图过于明显或与原技能偏差过大而失效。解决方案的关键在于提出首个面向代理技能的自动化、隐蔽式提示注入框架,其核心是一个由三个代理组成的闭环系统:攻击代理(Attack Agent)在显式隐蔽性约束下合成注入技能,代码代理(Code Agent)在真实工具环境中执行任务以模拟实际行为,评估代理(Evaluate Agent)通过记录操作轨迹(如工具调用和文件操作)验证是否触发目标恶意行为;此外,还设计了一种恶意载荷隐藏策略,将对抗性操作隐藏于辅助脚本中,并注入优化后的诱导提示以触发工具执行,从而在多种编码代理场景和真实软件工程任务中实现高成功率的隐蔽攻击。

链接: https://arxiv.org/abs/2602.14211
作者: Xiaojun Jia,Jie Liao,Simeng Qin,Jindong Gu,Wenqi Ren,Xiaochun Cao,Yang Liu,Philip Torr
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent skills are becoming a core abstraction in coding agents, packaging long-form instructions and auxiliary scripts to extend tool-augmented behaviors. This abstraction introduces an under-measured attack surface: skill-based prompt injection, where poisoned skills can steer agents away from user intent and safety policies. In practice, naive injections often fail because the malicious intent is too explicit or drifts too far from the original skill, leading agents to ignore or refuse them; existing attacks are also largely hand-crafted. We propose the first automated framework for stealthy prompt injection tailored to agent skills. The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills in a realistic tool environment, and an Evaluate Agent that logs action traces (e.g., tool calls and file operations) and verifies whether targeted malicious behaviors occurred. We also propose a malicious payload hiding strategy that conceals adversarial operations in auxiliary scripts while injecting optimized inducement prompts to trigger tool execution. Extensive experiments across diverse coding-agent settings and real-world software engineering tasks show that our method consistently achieves high attack success rates under realistic settings.

[AI-67] Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

【速读】:该论文旨在解决临床决策中证据异质性与可追溯性推理不足的问题,特别是在基因-疾病因果关系验证(gene-disease validity curation)任务中,现有大语言模型多智能体系统(Multi-Agent System, MAS)虽能提升最终判断准确性,但缺乏对符合临床标准的推理过程进行有效约束。其解决方案的关键在于提出一种“代理即工具”的强化学习框架(agent-as-tool reinforcement learning framework),通过双重目标优化:一是引入过程级监督(process-level supervision),确保推理路径符合临床规范;二是采用分层多智能体协作机制(hierarchical multi-agent system),提升团队协同效率。实验表明,仅依赖结果奖励时,MAS虽显著提高准确率(从0.195升至0.732),但过程一致性差(F1=0.392);而结合过程+结果奖励后,不仅保持高准确率(0.750),且大幅改善推理过程的合规性(F1=0.520)。

链接: https://arxiv.org/abs/2602.14160
作者: Chaeeun Lee,T. Michael Yates,Pasquale Minervini,T. Ian Simpson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. One critical real-world case of this is gene-disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that with outcome-only rewards, MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy from 0.195 with a base model supervisor to 0.732, but results in poor process alignment (0.392 F1). Conversely, with process + outcome rewards, MAS with GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at this https URL.

[AI-68] ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

【速读】:该论文旨在解决当前AI安全评估体系在面对快速演进的自主性与目标导向型人工智能(Autonomous and Goal-Directed AI)时所暴露的关键短板,包括风险维度受限、前沿风险识别失效以及安全基准与对齐技术滞后等问题。其解决方案的核心是提出“ForesightSafety Bench”AI安全评估框架,以7大基础安全支柱为起点,逐步扩展至具身AI安全、AI for Science安全、社会与环境AI风险、灾难性与生存性风险及8个关键工业安全领域,构建涵盖94个细化风险维度的分层、动态演化评估体系,并通过数十万条结构化风险数据点对二十多个主流先进大模型进行系统评估,揭示了前沿AI在高风险代理自主性、AI4Science安全、具身AI安全、社会AI安全及灾难性与生存性风险等维度中存在的广泛安全漏洞。

链接: https://arxiv.org/abs/2602.14135
作者: Haibo Tong,Feifei Zhao,Linghao Feng,Ruoyu Wu,Ruolin Chen,Lu Jia,Zhou Zhao,Jindong Li,Tenglong Li,Erliang Lin,Shuai Yang,Enmeng Lu,Yinqian Sun,Qian Zhang,Zizhe Ruan,Zeyang Yue,Ping Wu,Huangrui Li,Chengyi Sun,Yi Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Rapidly evolving AI exhibits increasingly strong autonomy and goal-directed capabilities, accompanied by derivative systemic risks that are more unpredictable, difficult to control, and potentially irreversible. However, current AI safety evaluation systems suffer from critical limitations such as restricted risk dimensions and failed frontier risk detection. The lagging safety benchmarks and alignment technologies can hardly address the complex challenges posed by cutting-edge AI models. To bridge this gap, we propose the “ForesightSafety Bench” AI Safety Evaluation Framework, beginning with 7 major Fundamental Safety pillars and progressively extends to advanced Embodied AI Safety, AI4Science Safety, Social and Environmental AI risks, Catastrophic and Existential Risks, as well as 8 critical industrial safety domains, forming a total of 94 refined risk dimensions. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and dynamically evolving AI safety evaluation framework. Based on this benchmark, we conduct systematic evaluation and in-depth analysis of over twenty mainstream advanced large models, identifying key risk patterns and their capability boundaries. The safety capability evaluation results reveals the widespread safety vulnerabilities of frontier AI across multiple pillars, particularly focusing on Risky Agentic Autonomy, AI4Science Safety, Embodied AI Safety, Social AI Safety and Catastrophic and Existential Risks. Our benchmark is released at this https URL. The project website is available at this https URL.

[AI-69] oward Autonomous O-RAN: A Multi-Scale Agent ic AI Framework for Real-Time Network Control and Management

【速读】:该论文旨在解决开放无线接入网(Open Radio Access Network, O-RAN)在6G网络中因组件解耦与软件驱动带来的操作复杂性问题,尤其是多控制环路(包括非实时、近实时和实时控制)之间协同困难以及独立开发的控制应用可能产生不可预期交互的问题。其解决方案的关键在于提出一种多尺度智能体(agentic)AI框架,将RAN智能组织为跨层级的协调体系:在非实时(Non-RT)RIC中部署大语言模型(Large Language Model, LLM)代理以解析运营商意图并管理模型生命周期;在近实时(Near-RT)RIC中引入小语言模型(Small Language Model, SLM)代理实现低延迟优化并动态调控控制应用;在分布式单元附近部署无线物理层基础模型(Wireless Physical-layer Foundation Model, WPFM)代理进行靠近空口的快速推理。三类智能体通过标准化O-RAN接口与遥测机制协作,从而提升系统灵活性、可解释性和适应性。

链接: https://arxiv.org/abs/2602.14117
作者: Hojjat Navidan,Mohammad Cheraghinia,Jaron Fontaine,Mohamed Seif,Eli De Poorter,H. Vincent Poor,Ingrid Moerman,Adnan Shahid
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Submitted to the IEEE Networks Journal

点击查看摘要

Abstract:Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces, but this programmability also increases operational complexity. Multiple control loops coexist across the service management layer and RAN Intelligent Controller (RIC), while independently developed control applications can interact in unintended ways. In parallel, recent advances in generative Artificial Intelligence (AI) are enabling a shift from isolated AI models toward agentic AI systems that can interpret goals, coordinate multiple models and control functions, and adapt their behavior over time. This article proposes a multi-scale agentic AI framework for O-RAN that organizes RAN intelligence as a coordinated hierarchy across the Non-Real-Time (Non-RT), Near-Real-Time (Near-RT), and Real-Time (RT) control loops: (i) A Large Language Model (LLM) agent in the Non-RT RIC translates operator intent into policies and governs model lifecycles. (ii) Small Language Model (SLM) agents in the Near-RT RIC execute low-latency optimization and can activate, tune, or disable existing control applications; and (iii) Wireless Physical-layer Foundation Model (WPFM) agents near the distributed unit provide fast inference close to the air interface. We describe how these agents cooperate through standardized O-RAN interfaces and telemetry. Using a proof-of-concept implementation built on open-source models, software, and datasets, we demonstrate the proposed agentic approach in two representative scenarios: robust operation under non-stationary conditions and intent-driven slice resource control.

[AI-70] Anticipating Adversary Behavior in DevSecOps Scenarios through Large Language Models

【速读】:该论文旨在解决云环境中敏感数据面临日益复杂网络攻击的问题,特别是在DevOps流程中安全措施长期被忽视导致的漏洞累积风险。传统补丁机制已不足以应对现代威胁,亟需一种主动防御策略。其解决方案的关键在于将安全混沌工程(Security Chaos Engineering, SCE)与基于大语言模型(Large Language Model, LLM)的新流程相结合,通过自动化生成表示攻击者行为的防御树(attack defense trees),从而构建SCE实验场景,使团队能够提前识别潜在攻击路径并实施此前未考虑的防御机制,实现对攻击者的前瞻性遏制。

链接: https://arxiv.org/abs/2602.14106
作者: Mario Marín Caballero,Miguel Betancourt Alonso,Daniel Díaz-López,Angel Luis Perales Gómez,Pantaleone Nespoli,Gregorio Martínez Pérez
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, paper in proceedings of the X National Cybersecurity Research Conference (JNIC) in Zaragoza, Spain, June, 2025

点击查看摘要

Abstract:The most valuable asset of any cloud-based organization is data, which is increasingly exposed to sophisticated cyberattacks. Until recently, the implementation of security measures in DevOps environments was often considered optional by many government entities and critical national services operating in the cloud. This includes systems managing sensitive information, such as electoral processes or military operations, which have historically been valuable targets for cybercriminals. Resistance to security implementation is often driven by concerns over losing agility in software development, increasing the risk of accumulated vulnerabilities. Nowadays, patching software is no longer enough; adopting a proactive cyber defense strategy, supported by Artificial Intelligence (AI), is crucial to anticipating and mitigating threats. Thus, this work proposes integrating the Security Chaos Engineering (SCE) methodology with a new LLM-based flow to automate the creation of attack defense trees that represent adversary behavior and facilitate the construction of SCE experiments based on these graphical models, enabling teams to stay one step ahead of attackers and implement previously unconsidered defenses. Further detailed information about the experiment performed, along with the steps to replicate it, can be found in the following repository: this https URL.

[AI-71] NEST: Nascent Encoded Steganographic Thoughts

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全监控下可能通过隐写术(steganography)隐藏其链式思维(Chain-of-Thought, CoT)推理过程的问题,从而规避监督机制,潜在地引发对齐偏差或欺骗性行为。解决方案的关键在于系统性评估28个不同代际模型在四类数据集上的隐写能力,包括监测规避率、拒绝率、编码保真度及隐藏任务准确率,并对比隐写诗(steganographic acrostics)与普通推理和填充标记基线的表现,以识别当前模型在复杂任务中尚无法维持有效隐藏推理的能力,同时揭示如Claude Opus 4.5等模型在简化任务中已展现初步隐蔽推理能力,为未来风险防控提供可量化的检测方法与部署策略依据。

链接: https://arxiv.org/abs/2602.14095
作者: Artem Karpov
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT – where models hide secret reasoning within innocuous text – to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.

[AI-72] GUI-GENESIS: Automated Synthesis of Efficient Environments with Verifiable Rewards for GUI Agent Post-Training

【速读】:该论文旨在解决在真实世界应用程序中训练图形用户界面(GUI)智能体时面临的高延迟、可重复性差以及奖励信号依赖噪声视觉代理导致不可验证的问题。其核心解决方案是提出GUI-GENESIS框架,通过多模态代码模型将真实应用重构为轻量级网页环境,并引入代码原生奖励(code-native rewards),即可执行断言形式的确定性奖励信号,从而消除视觉估计噪声并提升训练效率与可靠性。该方法显著降低环境延迟(减少10倍)和成本(每轮节省超28,000美元),且训练出的智能体在未见的真实任务上性能优于基线模型14.54%,甚至超越真实强化学习(RL)基线3.27%。

链接: https://arxiv.org/abs/2602.14093
作者: Yuan Cao,Dezhi Ran,Mengzhou Wu,Yuzhe Guo,Xin Chen,Ang Li,Gang Cao,Gong Zhi,Hao Yu,Linyi Li,Wei Yang,Tao Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Post-training GUI agents in interactive environments is critical for developing generalization and long-horizon planning capabilities. However, training on real-world applications is hindered by high latency, poor reproducibility, and unverifiable rewards relying on noisy visual proxies. To address the limitations, we present GUI-GENESIS, the first framework to automatically synthesize efficient GUI training environments with verifiable rewards. GUI-GENESIS reconstructs real-world applications into lightweight web environments using multimodal code models and equips them with code-native rewards, executable assertions that provide deterministic reward signals and eliminate visual estimation noise. Extensive experiments show that GUI-GENESIS reduces environment latency by 10 times and costs by over 28,000 per epoch compared to training on real applications. Notably, agents trained with GUI-GENESIS outperform the base model by 14.54% and even real-world RL baselines by 3.27% on held-out real-world tasks. Finally, we observe that models can synthesize environments they cannot yet solve, highlighting a pathway for self-improving agents.

[AI-73] abTracer: Monte Carlo Tree Search for Complex Table Reasoning with Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言表格推理任务中因缺乏有效验证机制和高冗余计算导致的准确性低与资源消耗大的问题。现有方法中,基于提示的方法通常缺少步骤级验证,而基于代理的方法虽使用工具闭环交互,但验证局部且回溯能力有限,易引发错误传播并增加token成本。其解决方案的关键在于提出TabTracer框架,通过三个核心机制实现优化:一是引入类型化操作与轻量级数值及格式校验以实现步骤级验证并抑制幻觉;二是采用执行反馈的蒙特卡洛树搜索(execution-feedback Monte Carlo Tree Search),结合反射评分回传进行UCB1选择与版本化快照驱动的回滚;三是利用预算感知剪枝、去重和状态哈希配合单调性门控策略降低冗余,显著减少token消耗。实验表明,TabTracer在多个基准数据集上相比最先进基线提升最高达6.7%准确率,同时token消耗降低59–84%。

链接: https://arxiv.org/abs/2602.14089
作者: Zhizhao Luo,Zhaojing Luo,Meihui Zhang,Rui Mao
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have emerged as powerful tools for natural language table reasoning, where there are two main categories of methods. Prompt-based approaches rely on language-only inference or one-pass program generation without step-level verification. Agent-based approaches use tools in a closed loop, but verification is often local and backtracking is limited, allowing errors to propagate and increasing cost. Moreover, they rely on chain- or beam-style trajectories that are typically combinatorially redundant, leading to high token costs. In this paper, we propose TabTracer, an agentic framework that coordinates multi-step tool calls over intermediate table states, with explicit state tracking for verification and rollback. First, it enforces step-level verification with typed operations and lightweight numeric and format checks to provide reliable rewards and suppress hallucinations. Second, execution-feedback Monte Carlo Tree Search maintains a search tree of candidate table states and uses backpropagated reflection scores to guide UCB1 selection and rollback via versioned snapshots. Third, it reduces redundancy with budget-aware pruning, deduplication, and state hashing with a monotonicity gate to cut token cost. Comprehensive evaluation on TabFact, WikiTQ, and CRT datasets shows that TabTracer outperforms state-of-the-art baselines by up to 6.7% in accuracy while reducing token consumption by 59–84%.

[AI-74] Plan-MCTS: Plan Exploration for Action Exploitation in Web Navigation

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂网页导航任务中面临的两个关键挑战:一是有效路径稀疏导致探索效率低下,二是上下文噪声干扰状态感知准确性。解决方案的核心在于提出Plan-MCTS框架,其关键创新包括:将探索空间从动作空间重构为语义计划空间(Plan Space),从而生成稠密的计划树以提升探索效率;通过抽象化语义历史(Abstracted Semantic History)对噪声上下文进行压缩与提炼,增强状态感知精度;同时引入双门控奖励机制(Dual-Gating Reward)严格验证物理可执行性与策略一致性,并结合结构优化(Structural Refinement)实现在线策略修复失败子计划,从而显著提升任务完成率与搜索效率。

链接: https://arxiv.org/abs/2602.14083
作者: Weiming Zhang,Jihong Wang,Jiamu Zhou,Qingyao Li,Xinbei Ma,Congmin Zheng,Xingyu Lou,Weiwen Liu,Zhuosheng Zhang,Jun Wang,Yong Yu,Weinan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have empowered autonomous agents to handle complex web navigation tasks. While recent studies integrate tree search to enhance long-horizon reasoning, applying these algorithms in web navigation faces two critical challenges: sparse valid paths that lead to inefficient exploration, and a noisy context that dilutes accurate state perception. To address this, we introduce Plan-MCTS, a framework that reformulates web navigation by shifting exploration to a semantic Plan Space. By decoupling strategic planning from execution grounding, it transforms sparse action space into a Dense Plan Tree for efficient exploration, and distills noisy contexts into an Abstracted Semantic History for precise state awareness. To ensure efficiency and robustness, Plan-MCTS incorporates a Dual-Gating Reward to strictly validate both physical executability and strategic alignment and Structural Refinement for on-policy repair of failed subplans. Extensive experiments on WebArena demonstrate that Plan-MCTS achieves state-of-the-art performance, surpassing current approaches with higher task effectiveness and search efficiency.

[AI-75] Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning

【速读】:该论文旨在解决大规模预训练视觉模型在类增量学习(class-incremental learning)场景下因灾难性遗忘(catastrophic forgetting)而导致性能下降的问题。尽管参数高效微调(Parameter-efficient fine-tuning, PEFT)能够缓解此问题,但现有方法普遍依赖交叉熵(cross-entropy, CE)损失函数,而CE本质上是0-1损失的代理目标,难以直接最小化分类错误率。为此,作者从强化学习视角重新审视分类任务,将其建模为一个单步马尔可夫决策过程(Markov Decision Process),提出期望策略梯度(Expected Policy Gradient, EPG)方法,通过低方差梯度估计直接优化误分类误差。关键创新在于揭示了CE与EPG的本质差异:CE通过样本加权机制鼓励对低置信度样本的探索,而EPG则聚焦于高置信度样本的利用;基于此,进一步设计自适应熵退火(adaptive entropy annealing, aEPG)策略,实现从探索到利用的动态过渡。实验表明,aEPG在多种基准和PEFT模块上均显著优于传统CE方法,并验证了输出预测分布熵降低有助于提升模型适应能力。

链接: https://arxiv.org/abs/2602.14078
作者: Yaqian Zhang,Bernhard Pfahringer,Eibe Frank,Albert Bifet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their success, large pretrained vision models remain vulnerable to catastrophic forgetting when adapted to new tasks in class-incremental settings. Parameter-efficient fine-tuning (PEFT) alleviates this by restricting trainable parameters, yet most approaches still rely on cross-entropy (CE) loss, a surrogate for the 0-1 loss, to learn from new data. We revisit this choice and revive the true objective (0-1 loss) through a reinforcement learning perspective. By formulating classification as a one-step Markov Decision Process, we derive an Expected Policy Gradient (EPG) method that directly minimizes misclassification error with a low-variance gradient estimation. Our analysis shows that CE can be interpreted as EPG with an additional sample-weighting mechanism: CE encourages exploration by emphasizing low-confidence samples, while EPG prioritizes high-confidence ones. Building on this insight, we propose adaptive entropy annealing (aEPG), a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. aEPG-based methods outperform CE-based methods across diverse benchmarks and with various PEFT modules. More broadly, we evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.

[AI-76] REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning -Pivot Alignment

【速读】:该论文旨在解决知识密集型视觉问答(Knowledge-intensive Visual Question Answering, KI-VQA)中因开放域检索固有局限性而导致的严重知识冲突问题。现有方法受限于缺乏可泛化的冲突检测机制和模型内约束策略,难以有效处理矛盾证据。其解决方案的关键在于提出基于“推理枢轴”(Reasoning-Pivot)的新框架——REAL(Reasoning-Pivot Alignment),其中推理枢轴作为推理链中的原子单元(节点或边),强调知识关联并依赖外部证据完成推理;通过构建REAL-VQA数据集,引入推理枢轴感知的监督微调(RPA-SFT)以训练通用冲突判别器,并设计推理枢轴引导解码(RPGD)策略,在模型内部利用枢轴实现针对性冲突缓解,从而显著提升判别准确性和整体性能。

链接: https://arxiv.org/abs/2602.14065
作者: Kai Ye,Xianwei Mao,Sheng Zhou,Zirui Shao,Ye Mo,Liangliang Liu,Haikuan Huang,Bin Li,Jiajun Bu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations due to the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose the REAL (Reasoning-Pivot Alignment) framework centered on the novel concept of the Reasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructed REAL-VQA dataset, our approach integrates Reasoning-Pivot Aware SFT (RPA-SFT) to train a generalizable discriminator by aligning conflicts with pivot extraction, and employs Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments across diverse benchmarks demonstrate that REAL significantly enhances discrimination accuracy and achieves state-of-the-art performance, validating the effectiveness of our pivot-driven resolution paradigm.

[AI-77] UniST-Pred: A Robust Unified Framework for Spatio-Temporal Traffic Forecasting in Transportation Networks Under Disruptions

【速读】:该论文旨在解决交通流预测中因结构不确定性和观测不确定性导致的模型鲁棒性不足问题,尤其在真实部署场景下,现有方法往往忽视了这些现实约束。传统方法通过紧密耦合空间与时间建模来提升短期预测性能,但代价是模型复杂度高且缺乏模块化设计;而高效的时间序列模型虽能捕捉长程时序依赖,却未充分结合空间信息。解决方案的关键在于提出UniST-Pred框架——首先将时间建模与空间表征学习解耦,再通过自适应表示层融合机制整合两者,从而实现轻量化、可解释且鲁棒性强的时空预测。实验表明,该方法在模拟和真实数据集上均表现优异,尤其在严重网络断连等极端条件下仍保持稳定预测能力。

链接: https://arxiv.org/abs/2602.14049
作者: Yue Wang,Areg Karapetyan,Djellel Difallah,Samer Madanat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatio-temporal traffic forecasting is a core component of intelligent transportation systems, supporting various downstream tasks such as signal control and network-level traffic management. In real-world deployments, forecasting models must operate under structural and observational uncertainties, conditions that are rarely considered in model design. Recent approaches achieve strong short-term predictive performance by tightly coupling spatial and temporal modeling, often at the cost of increased complexity and limited modularity. In contrast, efficient time-series models capture long-range temporal dependencies without relying on explicit network structure. We propose UniST-Pred, a unified spatio-temporal forecasting framework that first decouples temporal modeling from spatial representation learning, then integrates both through adaptive representation-level fusion. To assess robustness of the proposed approach, we construct a dataset based on an agent-based, microscopic traffic simulator (MATSim) and evaluate UniST-Pred under severe network disconnection scenarios. Additionally, we benchmark UniST-Pred on standard traffic prediction datasets, demonstrating its competitive performance against existing well-established models despite a lightweight design. The results illustrate that UniST-Pred maintains strong predictive performance across both real-world and simulated datasets, while also yielding interpretable spatio-temporal representations under infrastructure disruptions. The source code and the generated dataset are available at this https URL

[AI-78] Beyond Static Snapshots: Dynamic Modeling and Forecasting of Group-Level Value Evolution with Large Language Models

【速读】:该论文旨在解决现有大语言模型(Large Language Model, LLM)驱动的社会模拟方法在群体层面动态价值演化建模上的不足,即当前研究多聚焦于离散时间点的静态群体值,忽视了社会价值随时间演变的连续性和复杂性。这一问题的关键在于如何有效整合历史价值轨迹与社会事件影响,以实现对群体社会价值动态变化的精准预测。解决方案的核心是提出一种新型框架,将历史价值轨迹嵌入LLM驱动的人类响应建模中,并基于世界价值观调查(World Values Survey)构建多波次、分层的群体级纵向数据集,进而设计首个基于事件的预测方法,统一社会事件、当前价值状态与群体属性三者关系,从而显著提升对已见和未见问题的预测性能(最大提升达33.97%),并揭示中美群体间的价值敏感性差异。

链接: https://arxiv.org/abs/2602.14043
作者: Qiankun Pi,Guixin Su,Jinliang Li,Mayi Xu,Xin Miao,Jiawei Jiang,Ming Zhong,Tieyun Qian
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Social simulation is critical for mining complex social dynamics and supporting data-driven decision making. LLM-based methods have emerged as powerful tools for this task by leveraging human-like social questionnaire responses to model group behaviors. Existing LLM-based approaches predominantly focus on group-level values at discrete time points, treating them as static snapshots rather than dynamic processes. However, group-level values are not fixed but shaped by long-term social changes. Modeling their dynamics is thus crucial for accurate social evolution prediction–a key challenge in both data mining and social science. This problem remains underexplored due to limited longitudinal data, group heterogeneity, and intricate historical event impacts. To bridge this gap, we propose a novel framework for group-level dynamic social simulation by integrating historical value trajectories into LLM-based human response modeling. We select China and the U.S. as representative contexts, conducting stratified simulations across four core sociodemographic dimensions (gender, age, education, income). Using the World Values Survey, we construct a multi-wave, group-level longitudinal dataset to capture historical value evolution, and then propose the first event-based prediction method for this task, unifying social events, current value states, and group attributes into a single framework. Evaluations across five LLM families show substantial gains: a maximum 30.88% improvement on seen questions and 33.97% on unseen questions over the Vanilla baseline. We further find notable cross-group heterogeneity: U.S. groups are more volatile than Chinese groups, and younger groups in both countries are more sensitive to external changes. These findings advance LLM-based social simulation and provide new insights for social scientists to understand and predict social value changes. Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.14043 [cs.SI] (or arXiv:2602.14043v1 [cs.SI] for this version) https://doi.org/10.48550/arXiv.2602.14043 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-79] Choosing How to Remember: Adaptive Memory Structures for LLM Agents

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在长时间交互中因记忆系统缺乏适应性而导致的行为连贯性不足问题。现有方法通常采用单一固定结构的记忆机制,未能根据交互情境动态选择最合适的记忆组织方式,从而难以应对多样化的交互模式并限制了性能表现。其解决方案的关键在于提出一个统一框架 FluxMem,该框架通过引入多种互补的记忆结构,并利用下游响应质量与记忆利用率的离线监督信号,显式学习基于交互特征进行记忆结构的选择;同时构建三层记忆层级和基于 Beta 混合模型的概率门控机制,实现分布感知的记忆融合,替代传统脆弱的相似度阈值策略,从而提升长期记忆演化的鲁棒性与适应性。

链接: https://arxiv.org/abs/2602.14038
作者: Mingfei Lu,Mengjia Wu,Feng Liu,Jiawei Xu,Weikai Li,Haoyang Wang,Zhengdong Hu,Ying Ding,Yizhou Sun,Jie Lu,Yi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Memory is critical for enabling large language model (LLM) based agents to maintain coherent behavior over long-horizon interactions. However, existing agent memory systems suffer from two key gaps: they rely on a one-size-fits-all memory structure and do not model memory structure selection as a context-adaptive decision, limiting their ability to handle heterogeneous interaction patterns and resulting in suboptimal performance. We propose a unified framework, FluxMem, that enables adaptive memory organization for LLM agents. Our framework equips agents with multiple complementary memory structures. It explicitly learns to select among these structures based on interaction-level features, using offline supervision derived from downstream response quality and memory utilization. To support robust long-horizon memory evolution, we further introduce a three-level memory hierarchy and a Beta Mixture Model-based probabilistic gate for distribution-aware memory fusion, replacing brittle similarity thresholds. Experiments on two long-horizon benchmarks, PERSONAMEM and LoCoMo, demonstrate that our method achieves average improvements of 9.18% and 6.14%.

[AI-80] FloCA: Towards Faithful and Logically Consistent Flowchart Reasoning

【速读】:该论文旨在解决生成式 AI (Generative AI) 在流程图导向对话(Flowchart-Oriented Dialogue, FOD)系统中面临的两个核心挑战:一是大语言模型(LLM)缺乏显式的机制来表示和推理流程图拓扑结构,导致无法准确对齐用户输入与流程节点;二是LLM容易产生幻觉,造成不符合正确流程路径的节点跳转,从而破坏任务执行的一致性。解决方案的关键在于提出 FloCA——一种零样本的流程图导向对话代理,其通过将流程图推理任务交由外部工具执行,该工具基于拓扑约束进行图遍历,确保节点转移逻辑一致且忠实于原始流程图,而LLM仅负责意图理解和响应生成,从而实现高效、准确的任务导向交互。

链接: https://arxiv.org/abs/2602.14035
作者: Jinzi Zou,Bolin Wang,Liang Li,Shuo Zhang,Nuo Xu,Junzhou Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Flowchart-oriented dialogue (FOD) systems aim to guide users through multi-turn decision-making or operational procedures by following a domain-specific flowchart to achieve a task goal. In this work, we formalize flowchart reasoning in FOD as grounding user input to flowchart nodes at each dialogue turn while ensuring node transition is consistent with the correct flowchart path. Despite recent advances of LLMs in task-oriented dialogue systems, adapting them to FOD still faces two limitations: (1) LLMs lack an explicit mechanism to represent and reason over flowchart topology, and (2) they are prone to hallucinations, leading to unfaithful flowchart reasoning. To address these limitations, we propose FloCA, a zero-shot flowchart-oriented conversational agent. FloCA uses an LLM for intent understanding and response generation while delegating flowchart reasoning to an external tool that performs topology-constrained graph execution, ensuring faithful and logically consistent node transitions across dialogue turns. We further introduce an evaluation framework with an LLM-based user simulator and five new metrics covering reasoning accuracy and interaction efficiency. Extensive experiments on FLODIAL and PFDial datasets highlight the bottlenecks of existing LLM-based methods and demonstrate the superiority of FloCA. Our codes are available at this https URL.

[AI-81] EIDOS: Latent-Space Predictive Learning for Time Series Foundation Models

【速读】:该论文旨在解决当前时间序列基础模型在预训练阶段通过直接预测未来观测值所导致的潜在表示结构弱、易捕捉表面噪声而非可预测时序动态的问题。解决方案的关键在于将预训练目标从未来值预测转向潜在空间的预测学习(latent-space predictive learning),具体通过一个因果Transformer模型来预测潜在表示的演化,从而促使结构化且时序一致的潜在状态涌现;同时设计轻量级聚合分支以构建稳定的潜在空间目标,并采用联合优化目标整合潜在空间对齐、观测锚定与直接预测监督,有效提升表示质量与模型性能。

链接: https://arxiv.org/abs/2602.14024
作者: Xinxing Zhou,Qingren Yao,Yiji Zhao,Chenghao Liu,Flora Salim,Xiaojie Yuan,Yanlong Wen,Ming Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most time series foundation models are pretrained by directly predicting future observations, which often yields weakly structured latent representations that capture surface noise rather than coherent and predictable temporal dynamics. In this work, we introduce EIDOS, a foundation model family that shifts pretraining from future value prediction to latent-space predictive learning. We train a causal Transformer to predict the evolution of latent representations, encouraging the emergence of structured and temporally coherent latent states. To ensure stable targets for latent-space learning, we design a lightweight aggregation branch to construct target representations. EIDOS is optimized via a joint objective that integrates latent-space alignment, observational grounding to anchor representations to the input signal, and direct forecasting supervision. On the GIFT-Eval benchmark, EIDOS mitigates structural fragmentation in the representation space and achieves state-of-the-art performance. These results demonstrate that constraining models to learn predictable latent dynamics is a principled step toward more robust and reliable time series foundation models.

[AI-82] From SFT to RL: Demystifying the Post-Training Pipeline for LLM -based Vulnerability Detection

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在漏洞检测(Vulnerability Detection, VD)任务中缺乏系统性后训练(post-training)优化流程的问题,尤其关注如何通过数据筛选、奖励机制设计与强化学习策略的协同作用提升模型性能。其解决方案的关键在于构建一个从冷启动监督微调(Supervised Fine-Tuning, SFT)到离策略偏好优化(off-policy preference optimization)再到在线策略强化学习(on-policy Reinforcement Learning, RL)的完整训练范式,并发现:(1)基于拒绝采样的SFT显著优于基于推理过程监督的方法,后者易因真实标签泄露引入幻觉;(2)适度增加SFT轮次有助于偏好优化,但过度微调会抑制强化学习中的自我探索能力;(3)细粒度的根因判断作为奖励信号可实现可靠信用分配,而粗粒度奖励常导致误导;(4)过滤极难检测样本虽能提高强化学习效率,但需权衡性能损失;(5)采用GRPO(Generalized Reward Policy Optimization)的在线策略强化学习方法显著优于仅使用SFT或偏好优化(如DPO/ORPO)的模型,也超越了零样本状态领先模型;(6)基于根因分析的LLM-as-a-Judge评估方式比二元匹配更稳健,尽管其准确性依赖于评判模型的安全专业水平。

链接: https://arxiv.org/abs/2602.14012
作者: Youpeng Li,Fuxun Yu,Xinda Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The integration of LLMs into vulnerability detection (VD) has shifted the field toward interpretable and context-aware analysis. While post-training methods have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, spanning from cold-start SFT to off-policy preference optimization and on-policy RL, uncovering how data curation, stage interactions, reward mechanisms, and evaluation protocols collectively dictate the efficacy of model training and assessment. Our study identifies practical guidelines and insights: (1) SFT based on rejection sampling greatly outperforms rationalization-based supervision, which can introduce hallucinations due to ground-truth leakage. (2) While increased SFT epochs constantly benefit preference optimization, excessive SFT inhibits self-exploration during RL, ultimately limiting performance gains. (3) Coarse-grained reward signals often mislead RL, whereas fine-grained root-cause judgments ensure reliable credit assignment. Specification-based rewards offer further benefits but incur significant effort in specification generation. (4) Although filtering extremely hard-to-detect vulnerability samples improves RL training efficiency, the cost of performance loss should be considered in practical applications. (5) Models trained under GRPO significantly outperform those using SFT and preference optimization (i.e., DPO and ORPO), as well as a series of zero-shot SOTA LLMs, underscoring the significant potential of on-policy RL for LLM-based VD. (6) In contrast to binary matching that tends to overestimate performance, LLM-as-a-Judge based on root-cause analysis provides a more robust evaluation protocol, although its accuracy varies across judge models with different levels of security expertise.

[AI-83] Prompt-Driven Low-Altitude Edge Intelligence: Modular Agents and Generative Reasoning

【速读】:该论文旨在解决大模型(Large Artificial Intelligence Models, LAMs)在低空边缘智能场景中部署所面临的三大核心挑战:任务与特定模型强绑定导致灵活性不足、全规模LAMs的计算与内存需求超出多数边缘设备承载能力,以及当前推理流程静态化难以响应实时任务变化。解决方案的关键在于提出一种提示到代理的边缘认知框架(Prompt-to-Agent Edge Cognition Framework, P2AECF),其通过三个机制实现灵活、高效且自适应的边缘智能:首先,提示定义的认知模块将任务意图解析为抽象且模型无关的表示;其次,基于代理的模块化执行机制动态选择轻量级可复用的认知代理以适配当前资源条件;最后,扩散控制的推理规划机制结合运行时反馈与系统上下文自适应构建和优化执行策略,从而支撑实时低空协同任务的模块化、可扩展与自适应执行。

链接: https://arxiv.org/abs/2602.14003
作者: Jiahao You,Ziye Jia,Chao Dong,Qihui Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The large artificial intelligence models (LAMs) show strong capabilities in perception, reasoning, and multi-modal understanding, and can enable advanced capabilities in low-altitude edge intelligence. However, the deployment of LAMs at the edge remains constrained by some fundamental limitations. First, tasks are rigidly tied to specific models, limiting the flexibility. Besides, the computational and memory demands of full-scale LAMs exceed the capacity of most edge devices. Moreover, the current inference pipelines are typically static, making it difficult to respond to real-time changes of tasks. To address these challenges, we propose a prompt-to-agent edge cognition framework (P2AECF), enabling the flexible, efficient, and adaptive edge intelligence. Specifically, P2AECF transforms high-level semantic prompts into executable reasoning workflows through three key mechanisms. First, the prompt-defined cognition parses task intent into abstract and model-agnostic representations. Second, the agent-based modular execution instantiates these tasks using lightweight and reusable cognitive agents dynamically selected based on current resource conditions. Third, the diffusion-controlled inference planning adaptively constructs and refines execution strategies by incorporating runtime feedback and system context. In addition, we illustrate the framework through a representative low-altitude intelligent network use case, showing its ability to deliver adaptive, modular, and scalable edge intelligence for real-time low-altitude aerial collaborations.

[AI-84] Bridging AI and Clinical Reasoning : Abductive Explanations for Alignment on Critical Symptoms

【速读】:该论文旨在解决当前人工智能(AI)在临床诊断中因推理过程偏离结构化临床框架而导致的信任不足、可解释性差及实际应用受限的问题,特别是AI模型可能忽略关键症状即使预测结果正确。其解决方案的关键在于引入形式化的溯因解释(formal abductive explanations),通过提供对最小充分特征集的一致且有保证的推理机制,实现对AI决策过程的清晰理解,并确保其与临床推理逻辑一致,从而在不牺牲预测准确性的前提下,输出具有临床可操作性的解释,构建可信AI在医学诊断中的坚实框架。

链接: https://arxiv.org/abs/2602.13985
作者: Belona Sonna,Alban Grastien
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Appeared in The proceedings of the Adaptive Learning and Intelligent Systems as part of the Australasian Computer Science Week (ACSW) 2026

点击查看摘要

Abstract:Artificial intelligence (AI) has demonstrated strong potential in clinical diagnostics, often achieving accuracy comparable to or exceeding that of human experts. A key challenge, however, is that AI reasoning frequently diverges from structured clinical frameworks, limiting trust, interpretability, and adoption. Critical symptoms, pivotal for rapid and accurate decision-making, may be overlooked by AI models even when predictions are correct. Existing post hoc explanation methods provide limited transparency and lack formal guarantees. To address this, we leverage formal abductive explanations, which offer consistent, guaranteed reasoning over minimal sufficient feature sets. This enables a clear understanding of AI decision-making and allows alignment with clinical reasoning. Our approach preserves predictive accuracy while providing clinically actionable insights, establishing a robust framework for trustworthy AI in medical diagnosis.

[AI-85] Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking

【速读】:该论文旨在解决长上下文输入导致大型语言模型(Large Language Models, LLMs)推理延迟显著增加的问题,其根源在于自注意力机制的计算复杂度随序列长度呈二次增长。为缓解此问题,现有方法通常采用软提示压缩(soft prompt compression),将长上下文转化为较短的记忆嵌入(memory embeddings),但这类方法往往对整个上下文进行无差别压缩,要求压缩器捕捉全局依赖关系,从而需要大量预训练数据来学习有效模式,训练难度高且效率低。论文提出的关键解决方案是并行迭代压缩(Parallelized Iterative Compression, PIC),通过简单修改Transformer的注意力掩码(attention mask),显式限制记忆token的接收场(receptive field)仅作用于原始序列的局部连续块(local chunks),从而降低压缩器的学习难度,并显著提升训练效率与压缩性能,尤其在高压缩比场景下表现优异。

链接: https://arxiv.org/abs/2602.13980
作者: Guojie Liu,Yiqi Wang,Yanfeng Yang,Wenqi Fan,Songlei Jian,Jianfeng Zhang,Jie Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression-particularly soft prompt compressio-has emerged as a widely studied solution, which converts long contexts into shorter memory embeddings via a trained compressor. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, requiring the compressor to capture global dependencies and necessitating extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and empirical observations of the spatial specialization of memory embeddings relative to original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer’s attention mask, PIC explicitly restricts the receptive field of memory tokens to sequential local chunks, thereby lowering the difficulty of compressor training. Experiments across multiple downstream tasks demonstrate that PIC consistently outperforms competitive baselines, with superiority being particularly pronounced in high compression scenarios (e.g., achieving relative improvements of 29.8% in F1 score and 40.7% in EM score on QA tasks at the 64\times compression ratio). Furthermore, PIC significantly expedites the training process. Specifically, when training the 16 \times compressor, it surpasses the peak performance of the competitive baseline while effectively reducing the training time by approximately 40%.

[AI-86] WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

【速读】:该论文旨在解决基于学习的世界模型(World Model)在强化学习(Reinforcement Learning, RL)中因幻觉(hallucination)和长期误差累积导致的策略优化失效问题,尤其是在物理机器人上的应用受限。解决方案的关键在于提出WoVR框架,通过三个核心机制实现:1)使用可控的动作条件视频世界模型提升想象回放(imagined rollout)的稳定性;2)采用关键帧初始化回放(Keyframe-Initialized Rollouts)降低有效误差深度;3)通过世界模型与策略的协同演化(World Model-Policy co-evolution)维持二者对齐。该方法显著提升了长时程想象回放的可靠性与策略优化效果,在LIBERO基准和真实机器人任务中分别将成功率从39.95%提升至69.2%,从61.7%提升至91.7%。

链接: https://arxiv.org/abs/2602.13977
作者: Zhennan Jiang,Shangqing Zhou,Yutong Jiang,Zefang Huang,Mingjie Wei,Yuhui Chen,Tianxing Zhou,Zhen Guo,Hao Lin,Quanlu Zhang,Yu Wang,Haoran Li,Chao Yu,Dongbin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 21pages, 8 figures

点击查看摘要

Abstract:Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.

[AI-87] Chemical Language Models for Natural Products: A State-Space Model Approach

【速读】:该论文旨在解决自然产物(Natural Products, NPs)在生成式AI模型中的研究不足问题,尤其是在分子性质预测和小分子生成任务中,NPs作为药物发现重要资源却未被充分挖掘。解决方案的关键在于构建针对自然产物的化学语言模型(NP-specific Chemical Language Models, NPCLMs),通过在约100万条自然产物数据上预训练状态空间模型(如Mamba及其改进版Mamba-2)并与Transformer基线模型(如GPT)进行系统比较,同时探索多种分词策略(包括原子级SMILES、字节对编码等)。实验表明,Mamba类模型在分子生成的有效性、唯一性和性质预测性能(MCC指标提升0.02–0.04)方面优于GPT,且仅需约百万量级数据即可达到传统模型在超大规模数据集上的性能水平,验证了领域特定预训练的重要性与高效性。

链接: https://arxiv.org/abs/2602.13958
作者: Ho-Hsuan Wang,Afnan Sultan,Andrea Volkamer,Dietrich Klakow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are widely used in chemistry for molecular property prediction and small-molecule generation, yet Natural Products (NPs) remain underexplored despite their importance in drug discovery. To address this gap, we develop NP-specific chemical language models (NPCLMs) by pre-training state-space models (Mamba and Mamba-2) and comparing them with transformer baselines (GPT). Using a dataset of about 1M NPs, we present the first systematic comparison of selective state-space models and transformers for NP-focused tasks, together with eight tokenization strategies including character-level, Atom-in-SMILES (AIS), byte-pair encoding (BPE), and NP-specific BPE. We evaluate molecule generation (validity, uniqueness, novelty) and property prediction (membrane permeability, taste, anti-cancer activity) using MCC and AUC-ROC. Mamba generates 1-2 percent more valid and unique molecules than Mamba-2 and GPT, with fewer long-range dependency errors, while GPT yields slightly more novel structures. For property prediction, Mamba variants outperform GPT by 0.02-0.04 MCC under random splits, while scaffold splits show comparable performance. Results demonstrate that domain-specific pre-training on about 1M NPs can match models trained on datasets over 100 times larger.

[AI-88] Eureka-Audio: Triggering Audio Intelligence in Compact Language Models

【速读】:该论文旨在解决当前音频语言模型在参数规模庞大但性能提升有限的情况下,如何实现高效且高性能的音频理解问题。针对这一挑战,Eureka-Audio 提出了一种紧凑而高效率的解决方案:采用统一端到端架构,结合轻量级语言骨干网络、基于 Whisper 的音频编码器以及稀疏激活的专家混合(Mixture-of-Experts, MoE)适配器,以显式建模音频异质性并缓解多模态优化冲突;同时引入 DataFlux 数据流闭环生成与验证机制,从原始音频中构建逻辑一致的高质量监督信号,从而显著增强语调和副语言推理能力。该方案在仅 1.7B 参数下实现了对标 7B–30B 模型的性能,证明了其在计算成本与性能之间取得良好平衡的潜力。

链接: https://arxiv.org/abs/2602.13954
作者: Dan Zhang,Yishu Lei,Jing Hu,Shuwei He,Songhe Deng,Xianlong Luo,Danxiang Zhu,Shikun Feng,Rui Liu,Jingzhou He,Yu Sun,Hua Wu,Haifeng Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 23 pages, 4 figures

点击查看摘要

Abstract:We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed loop audio instruction data synthesis and verification pipeline that constructs high quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks, demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka Audio as a strong and practical baseline for lightweight audio understanding models.

[AI-89] Experiential Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在语言模型(Language Models, LMs)训练中因环境反馈稀疏且延迟而导致的学习效率低、行为调整困难的问题。解决方案的关键在于提出经验式强化学习(Experiential Reinforcement Learning, ERL),其核心是在RL过程中嵌入一个显式的“经验-反思-巩固”循环:模型首先生成初始尝试,获得环境反馈后生成反思(reflection),进而指导生成改进的第二次尝试,该成功行为被强化并内化为基础策略。此机制将稀疏反馈转化为结构化的行为修正,显著提升探索效率与优化稳定性,同时在部署时无需额外推理开销,从而实现更高效、稳定的策略学习。

链接: https://arxiv.org/abs/2602.13949
作者: Taiwei Shi,Sihao Chen,Bowen Jiang,Linxin Song,Longqi Yang,Jieyu Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 9 tables, 7 figures

点击查看摘要

Abstract:Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.

[AI-90] You Can Learn Tokenization End-to-End with Reinforcement Learning

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)训练流程中硬编码的分词(Tokenization)步骤问题,该步骤作为压缩环节被保留在训练管道中,与当前模型架构趋向端到端学习的趋势相悖。为实现更高效的端到端训练,论文提出通过分数函数估计(score function estimates)来学习离散的分词边界,而非采用以往基于直通估计(straight-through estimates)的方法——后者将离散分词问题近似为连续优化问题。其关键创新在于利用分数函数估计直接优化离散分词边界以最小化损失,从而获得更紧致的理论保证;同时引入强化学习中的时间折扣(time discounting)技术以显著降低估计方差,使得该方法在1亿参数规模下相比现有直通估计方法在定性和定量上均取得更好性能。

链接: https://arxiv.org/abs/2602.13940
作者: Sam Dauncey,Roger Wattenhofer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs’ architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the 100 million parameter scale.

[AI-91] An Adaptive Model Selection Framework for Demand Forecasting under Horizon-Induced Degradation to Support Business Strategy and Operations

【速读】:该论文旨在解决多SKU(库存单位)场景下因预测时 horizon(预测范围)变化导致的模型选择排名不稳定问题,尤其是在结构需求间歇性(structural demand intermittency)、高变异性(high variability)和多步规划场景中,现有模型在不同误差指标、需求状态和预测时长下的相对表现差异显著,造成决策模糊。解决方案的关键在于提出一种名为 AHSIV(Adaptive Hybrid Selector for Intermittency and Variability)的时域感知且需求状态条件化的模型选择框架:其核心机制包括通过 MDFH(Metric Degradation by Forecast Horizon)过程调整绝对与缩放误差指标、基于结构需求分类进行分组决策、引入多目标帕累托支配(Pareto dominance)筛选最优候选模型,并结合层次化偏差精修(hierarchical bias refinement)构建统一决策架构,从而实现对不同预测时长下最优模型的动态识别与稳定选择。

链接: https://arxiv.org/abs/2602.13939
作者: Adolfo González,Víctor Parada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 35 pages, 24 figures and Appendix

点击查看摘要

Abstract:Business environments characterized by structural demand intermittency, high variability, and multi-step planning horizons require robust and reproducible model selection mechanisms. Empirical evidence shows that no forecasting model is universally dominant and that relative rankings vary across error metrics, demand regimes, and forecast horizons, generating ambiguity in multi-SKU decision contexts. This study proposes AHSIV (Adaptive Hybrid Selector for Intermittency and Variability), a horizon-aware and regime-conditioned model selection framework designed to address horizon-induced ranking instability. The proposed approach integrates scaled and absolute error metrics adjusted through a Metric Degradation by Forecast Horizon (MDFH) procedure, structural demand classification, multi-objective Pareto dominance, and hierarchical bias refinement within a unified decision architecture. The empirical evaluation is conducted on the Walmart, M3, M4, and M5 datasets under multiple train-test partition schemes and twelve-step forecasting horizons. Results indicate that AHSIV achieves statistical equivalence with the strongest monometric baseline in terms of aggregated performance while increasing the frequency of horizon-specific best-model selection. The findings demonstrate that model selection in heterogeneous demand environments cannot be treated as a static ranking problem, and that horizon-consistent, structurally adaptive mechanisms provide a principled, operationally coherent solution for multi-SKU forecasting.

[AI-92] A Generalizable Physics-guided Causal Model for Trajectory Prediction in Autonomous Driving ICRA2026

【速读】:该论文旨在解决交通参与者轨迹预测在未见过域(如新城市)中缺乏有效零样本泛化能力的问题。其核心挑战在于如何提取跨域不变的场景表征,并将这些特征与运动学模型融合以实现通用预测。解决方案的关键是提出一种可泛化的物理引导因果模型(Physics-guided Causal Model, PCM),包含两个核心组件:解耦场景编码器(Disentangled Scene Encoder),通过干预驱动的解耦方法提取域不变特征;以及因果ODE解码器(CausalODE Decoder),利用因果注意力机制将运动学模型与有意义的上下文信息有效整合,从而显著提升在未见城市中的零样本轨迹预测性能。

链接: https://arxiv.org/abs/2602.13936
作者: Zhenyu Zong,Yuchen Wang,Haohong Lin,Lu Gan,Huajie Shao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, Accepted by IEEE ICRA 2026

点击查看摘要

Abstract:Trajectory prediction for traffic agents is critical for safe autonomous driving. However, achieving effective zero-shot generalization in previously unseen domains remains a significant challenge. Motivated by the consistent nature of kinematics across diverse domains, we aim to incorporate domain-invariant knowledge to enhance zero-shot trajectory prediction capabilities. The key challenges include: 1) effectively extracting domain-invariant scene representations, and 2) integrating invariant features with kinematic models to enable generalized predictions. To address these challenges, we propose a novel generalizable Physics-guided Causal Model (PCM), which comprises two core components: a Disentangled Scene Encoder, which adopts intervention-based disentanglement to extract domain-invariant features from scenes, and a CausalODE Decoder, which employs a causal attention mechanism to effectively integrate kinematic models with meaningful contextual information. Extensive experiments on real-world autonomous driving datasets demonstrate our method’s superior zero-shot generalization performance in unseen cities, significantly outperforming competitive baselines. The source code is released at this https URL.

[AI-93] Statistical Early Stopping for Reasoning Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因不确定性导致的“过度思考”问题,即模型在面对模糊或表述不清的查询时,会生成冗余的推理步骤,从而降低效率和可靠性。解决方案的关键在于引入统计学原理驱动的早期停止机制:第一种方法为参数化方案,将不确定性关键词的到达时间建模为更新过程(renewal process),并采用序贯检验(sequential testing)决定是否停止;第二种方法为非参数化方案,在有限样本下提供对正确查询过早停止概率的严格保证。实证结果表明,这种基于不确定性的早期停止策略能显著提升LLM在各类推理任务中的效率与可靠性,尤其在数学推理任务中效果突出。

链接: https://arxiv.org/abs/2602.13935
作者: Yangxinyu Xie,Tao Wang,Soham Mallick,Yan Sun,Georgy Noarov,Mengxin Yu,Tanwi Mallick,Weijie J. Su,Edgar Dobriban
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:While LLMs have seen substantial improvement in reasoning capabilities, they also sometimes overthink, generating unnecessary reasoning steps, particularly under uncertainty, given ill-posed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. Our second approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, and we observe especially significant gains for math reasoning.

[AI-94] HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在长对话场景中因内存管理效率低下而导致性能下降的问题。现有方法在内存压缩与原始文本保留之间存在根本性权衡:压缩易丢失复杂推理所需的关键细节,而保留原始文本则引入冗余计算开销。其核心瓶颈在于单一粒度的记忆表征和静态检索机制无法模拟人类灵活、主动的记忆调度能力,难以适应多样化任务需求。解决方案的关键在于提出HyMem——一种基于认知经济原则的混合记忆架构,通过多粒度记忆表示与动态两级检索系统实现按需调度:轻量级模块生成摘要级上下文以高效响应简单查询,而基于LLM的深度模块仅在处理复杂问题时被激活,并结合反思机制进行迭代推理优化,从而在保持高性能的同时将计算成本降低92.6%。

链接: https://arxiv.org/abs/2602.13933
作者: Xiaochen Zhao,Kaikai Wang,Xiaowen Zhang,Chen Yao,Aili Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents demonstrate strong performance in short-text contexts but often underperform in extended dialogues due to inefficient memory management. Existing approaches face a fundamental trade-off between efficiency and effectiveness: memory compression risks losing critical details required for complex reasoning, while retaining raw text introduces unnecessary computational overhead for simple queries. The crux lies in the limitations of monolithic memory representations and static retrieval mechanisms, which fail to emulate the flexible and proactive memory scheduling capabilities observed in humans, thus struggling to adapt to diverse problem scenarios. Inspired by the principle of cognitive economy, we propose HyMem, a hybrid memory architecture that enables dynamic on-demand scheduling through multi-granular memory representations. HyMem adopts a dual-granular storage scheme paired with a dynamic two-tier retrieval system: a lightweight module constructs summary-level context for efficient response generation, while an LLM-based deep module is selectively activated only for complex queries, augmented by a reflection mechanism for iterative reasoning refinement. Experiments show that HyMem achieves strong performance on both the LOCOMO and LongMemEval benchmarks, outperforming full-context while reducing computational cost by 92.6%, establishing a state-of-the-art balance between efficiency and performance in long-term memory management.

[AI-95] GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization

【速读】:该论文旨在解决代码仓库级缺陷定位(repository-level bug localization)问题,即识别出需要修改的代码位置以修复缺陷。传统大型语言模型(Large Language Models, LLMs)因上下文窗口限制难以处理整个代码仓库,而现有检索方法如关键词匹配或简单图遍历(如广度优先搜索)效果有限。解决方案的关键在于提出首个面向仓库规模缺陷定位的图神经网络(Graph Neural Networks, GNN)基准数据集GREPO,其包含86个Python仓库和47294个缺陷修复任务,并提供直接用于GNN处理的图结构数据。实验表明,基于GREPO的GNN架构在性能上显著优于传统信息检索基线,验证了GNN在建模复杂仓库级依赖关系方面的潜力。

链接: https://arxiv.org/abs/2602.13921
作者: Juntong Wang,Libin Chen,Xiyuan Wang,Shijia Kang,Haotong Yang,Da Zheng,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 46 pages, 14figures

点击查看摘要

Abstract:Repository-level bug localization-the task of identifying where code must be modified to fix a bug-is a critical software engineering challenge. Standard Large Language Modles (LLMs) are often unsuitable for this task due to context window limitations that prevent them from processing entire code repositories. As a result, various retrieval methods are commonly used, including keyword matching, text similarity, and simple graph-based heuristics such as Breadth-First Search. Graph Neural Networks (GNNs) offer a promising alternative due to their ability to model complex, repository-wide dependencies; however, their application has been hindered by the lack of a dedicated benchmark. To address this gap, we introduce GREPO, the first GNN benchmark for repository-scale bug localization tasks. GREPO comprises 86 Python repositories and 47294 bug-fixing tasks, providing graph-based data structures ready for direct GNN processing. Our evaluation of various GNN architectures shows outstanding performance compared to established information retrieval baselines. This work highlights the potential of GNNs for bug localization and established GREPO as a foundation resource for future research, The code is available at this https URL.

[AI-96] A Comparative Analysis of Social Network Topology in Reddit and Moltbook

【速读】:该论文试图解决的问题是:当前关于AI代理驱动的社会网络(agent-driven social networks)与人类驱动的社会网络(human-driven social networks)在拓扑结构上的差异尚缺乏系统的实证比较,这限制了对两类网络演化机制和结构特征的理解。解决方案的关键在于构建一个可比的基准数据集——通过采集Moltbook平台上的评论网络(33,577个节点,697,688条边)并与Reddit上的人类用户评论网络(7.8百万节点,51.8百万边)进行对比分析,从而量化两者在网络拓扑模式及帖子引发互动效率方面的关键差异,为理解AI驱动社会结构提供首个系统性基础画像。

链接: https://arxiv.org/abs/2602.13920
作者: Yiming Zhu,Gareth Tyson,Pan Hui
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in agent-mediated systems have enabled a new paradigm of social network simulation, where AI agents interact with human-like autonomy. This evolution has fostered the emergence of agent-driven social networks such as Moltbook, a Reddit-like platform populated entirely by AI agents. Despite these developments, empirical comparisons between agent-driven and human-driven social networks remain scarce, limiting our understanding of how their network topologies might diverge. This paper presents the first comparative analysis of network topology on Moltbook, utilizing a comment network comprising 33,577 nodes and 697,688 edges. To provide a benchmark, we curated a parallel dataset from Reddit consisting of 7.8 million nodes and 51.8 million edges. We examine key structural differences between agent-drive and human-drive networks, specifically focusing on topological patterns and the edge formation efficacy of their respective posts. Our findings provide a foundational profile of AI-driven social structures, serving as a preliminary step toward developing more robust and authentic agent-mediated social systems.

[AI-97] Common Knowledge Always Forever

【速读】:该论文旨在解决如何在拓扑语义框架下形式化表达公共知识(common knowledge)及其推广的动态逻辑问题,同时探讨其在不同空间结构中的有限模型性质。解决方案的关键在于引入一种多拓扑模态命题动态逻辑(polytopological PDL),该逻辑能够刻画公共知识及多种广义认知状态,并通过证明其在闭包空间(closure spaces)中具有有限模型性质,而在Cantor导出空间(Cantor derivative spaces)中不具有该性质,从而揭示了不同拓扑结构对逻辑可判定性的影响;后者通过嵌入带有“过去”时态的线性时序逻辑(LTL with ‘past’)来实现,该逻辑已知不具备有限模型性质,从而构成反例。

链接: https://arxiv.org/abs/2602.13914
作者: Martín Diéguez,David Fernández-Duque
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:There has been an increasing interest in topological semantics for epistemic logic, which has been shown to be useful for, e.g., modelling evidence, degrees of belief, and self-reference. We introduce a polytopological PDL capable of expressing common knowledge and various generalizations and show it has the finite model property over closure spaces but not over Cantor derivative spaces. The latter is shown by embedding a version of linear temporal logic with `past’, which does not have the finite model property.

[AI-98] Sufficient Conditions for Stability of Minimum-Norm Interpolating Deep ReLU Networks

【速读】:该论文试图解决深度ReLU同质神经网络在最小范数插值(minimum-norm interpolation)条件下算法稳定性(algorithmic stability)不足的问题,从而解释其泛化性能。解决方案的关键在于识别出保证稳定性的充分条件:当网络包含一个(可能较小的)稳定子网络,并且后续层的权重矩阵为低秩(low-rank)时,整个网络具有良好的算法稳定性;反之,若后续层非低秩,则即使存在稳定子网络也无法保证稳定性。这一发现基于对梯度下降训练下过参数化模型倾向于产生低秩权重矩阵的实证与理论观察,揭示了低秩结构在提升深度神经网络稳定性中的核心作用。

链接: https://arxiv.org/abs/2602.13910
作者: Ouns El Harzli,Yoonsoo Nam,Ilja Kuzborskij,Bernardo Cuenca Grau,Ard A. Louis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Algorithmic stability is a classical framework for analyzing the generalization error of learning algorithms. It predicts that an algorithm has small generalization error if it is insensitive to small perturbations in the training set such as the removal or replacement of a training point. While stability has been demonstrated for numerous well-known algorithms, this framework has had limited success in analyses of deep neural networks. In this paper we study the algorithmic stability of deep ReLU homogeneous neural networks that achieve zero training error using parameters with the smallest L_2 norm, also known as the minimum-norm interpolation, a phenomenon that can be observed in overparameterized models trained by gradient-based algorithms. We investigate sufficient conditions for such networks to be stable. We find that 1) such networks are stable when they contain a (possibly small) stable sub-network, followed by a layer with a low-rank weight matrix, and 2) such networks are not guaranteed to be stable even when they contain a stable sub-network, if the following layer is not low-rank. The low-rank assumption is inspired by recent empirical and theoretical results which demonstrate that training deep neural networks is biased towards low-rank weight matrices, for minimum-norm interpolation and weight-decay regularization.

[AI-99] Diagnosing Pathological Chain-of-Thought in Reasoning Models

【速读】:该论文旨在解决链式思维(Chain-of-thought, CoT)推理在大语言模型(Large Language Models, LLMs)中可能存在的三类病理现象——事后合理化(post-hoc rationalization)、编码推理(encoded reasoning)和内化推理(internalized reasoning)——导致其无法有效用于AI安全监控的问题。解决方案的关键在于构建一套简单、计算成本低且任务无关的量化指标,用于识别和区分这些病理现象;并通过设计专门训练以展现特定CoT病理的“模型生物”进行验证,从而提供一个可实际应用的工具包,直接支持训练过程中的监控与干预。

链接: https://arxiv.org/abs/2602.13904
作者: Manqing Liu,David Williams-King,Ida Caspary,Linh Le,Hannes Whittingham,Puria Radmard,Cameron Tice,Edward James Young
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning is fundamental to modern LLM architectures and represents a critical intervention point for AI safety. However, CoT reasoning may exhibit failure modes that we note as pathologies, which prevent it from being useful for monitoring. Prior work has identified three distinct pathologies: post-hoc rationalization, where models generate plausible explanations backwards from predetermined answers; encoded reasoning, where intermediate steps conceal information within seemingly interpretable text; and internalized reasoning, where models replace explicit reasoning with meaningless filler tokens while computing internally. To better understand and discriminate between these pathologies, we create a set of concrete metrics that are simple to implement, computationally inexpensive, and task-agnostic. To validate our approach, we develop model organisms deliberately trained to exhibit specific CoT pathologies. Our work provides a practical toolkit for assessing CoT pathologies, with direct implications for training-time monitoring.

[AI-100] GSRM: Generative Speech Reward Model for Speech RLHF

【速读】:该论文旨在解决当前语音语言模型(Speech Language Models)生成语音的自然度(naturalness)评估缺乏可靠性和可解释性的问题。现有自然度评价方法多采用回归原始音频得到标量分数,不仅难以提供评估依据,且在跨语义类别(taxonomies)场景下泛化能力差。其解决方案的关键在于提出生成式语音奖励模型(Generative Speech Reward Model, GSRM),该模型通过将自然度评估分解为可解释的声学特征提取阶段与基于特征的链式思维推理(chain-of-thought reasoning)阶段,实现对语音自然度的可解释判断;同时借助包含3.1万条专家评分的大规模人类反馈数据集和真实用户-助手交互的域外基准进行训练与验证,显著提升了模型预测与人类评分的一致性,并可用于在线强化学习人类反馈(RLHF)以优化语音生成质量。

链接: https://arxiv.org/abs/2602.13891
作者: Maohao Shen,Tejas Jayashankar,Osama Hanna,Naoyuki Kanda,Yancheng Wang,Kateřina Žmolíková,Ruiming Xie,Niko Moritz,Anfeng Xu,Yashesh Gaur,Gregory Wornell,Qing He,Jilong Wu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality requires a reliable evaluator of speech naturalness. However, existing naturalness evaluators typically regress raw audio to scalar scores, offering limited interpretability of the evaluation and moreover fail to generalize to speech across different taxonomies. Inspired by recent advances in generative reward modeling, we propose the Generative Speech Reward Model (GSRM), a reasoning-centric reward model tailored for speech. The GSRM is trained to decompose speech naturalness evaluation into an interpretable acoustic feature extraction stage followed by feature-grounded chain-of-thought reasoning, enabling explainable judgments. To achieve this, we curated a large-scale human feedback dataset comprising 31k expert ratings and an out-of-domain benchmark of real-world user-assistant speech interactions. Experiments show that GSRM substantially outperforms existing speech naturalness predictors, achieving model-human correlation of naturalness score prediction that approaches human inter-rater consistency. We further show how GSRM can improve the naturalness of speech LLM generations by serving as an effective verifier for online RLHF.

[AI-101] Ambient Physics: Training Neural PDE Solvers with Partial Observations

【速读】:该论文旨在解决科学计算中因获取完整偏微分方程(PDE)系数与解的观测数据成本高、危险或不可行而导致模型训练受限的问题。传统基于扩散的方法虽能从部分观测中重建场,但需依赖完整的观测数据进行训练。其解决方案的关键在于提出 Ambient Physics 框架,通过随机掩码已观测数据中的子集并以这些被掩码点作为监督信号,使模型无法区分“真实未观测”与“人为掩码”,从而迫使模型在所有位置生成合理预测。这一机制使得模型无需任何完整观测即可学习系数-解对的联合分布,显著提升重建性能,并实现仅用单个观测点即可完成有效学习的“一点过渡”现象。

链接: https://arxiv.org/abs/2602.13873
作者: Harris Abdul Majid,Giannis Daras,Francesco Tudisco,Steven McDonagh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In many scientific settings, acquiring complete observations of PDE coefficients and solutions can be expensive, hazardous, or impossible. Recent diffusion-based methods can reconstruct fields given partial observations, but require complete observations for training. We introduce Ambient Physics, a framework for learning the joint distribution of coefficient-solution pairs directly from partial observations, without requiring a single complete observation. The key idea is to randomly mask a subset of already-observed measurements and supervise on them, so the model cannot distinguish “truly unobserved” from “artificially unobserved”, and must produce plausible predictions everywhere. Ambient Physics achieves state-of-the-art reconstruction performance. Compared with prior diffusion-based methods, it achieves a 62.51 % reduction in average overall error while using 125 \times fewer function evaluations. We also identify a “one-point transition”: masking a single already-observed point enables learning from partial observations across architectures and measurement patterns. Ambient Physics thus enables scientific progress in settings where complete observations are unavailable.

[AI-102] Enabling Option Learning in Sparse Rewards with Hindsight Experience Replay

【速读】:该论文旨在解决层次化强化学习(Hierarchical Reinforcement Learning, HRL)方法在多目标稀疏奖励环境中的性能瓶颈问题,尤其是当动作需与远距离结果关联时,传统方法如Multi-updates Option Critic (MOC)难以有效学习可复用的选项(option)。其解决方案的关键在于提出两种改进机制:首先,将Hindsight Experience Replay (HER)引入MOC框架形成MOC-HER,通过重标注目标(goal relabeling)提升稀疏奖励下的学习效率;其次,针对物体操作任务中奖励依赖于物体而非代理直接交互的特性,进一步提出Dual Objectives Hindsight Experience Replay (2HER),即同时基于物体最终状态和代理执行器位置生成两组虚拟目标,从而双重激励代理完成交互与任务目标。实验表明,MOC-2HER在机器人操作环境中成功率达90%,显著优于原生MOC及MOC-HER(<11%),验证了双目标重标注策略在稀疏奖励、多目标任务中的有效性。

链接: https://arxiv.org/abs/2602.13865
作者: Gabriel Romio,Mateus Begnini Melchiades,Bruno Castro da Silva,Gabriel de Oliveira Ramos
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Hierarchical Reinforcement Learning (HRL) frameworks like Option-Critic (OC) and Multi-updates Option Critic (MOC) have introduced significant advancements in learning reusable options. However, these methods underperform in multi-goal environments with sparse rewards, where actions must be linked to temporally distant outcomes. To address this limitation, we first propose MOC-HER, which integrates the Hindsight Experience Replay (HER) mechanism into the MOC framework. By relabeling goals from achieved outcomes, MOC-HER can solve sparse reward environments that are intractable for the original MOC. However, this approach is insufficient for object manipulation tasks, where the reward depends on the object reaching the goal rather than on the agent’s direct interaction. This makes it extremely difficult for HRL agents to discover how to interact with these objects. To overcome this issue, we introduce Dual Objectives Hindsight Experience Replay (2HER), a novel extension that creates two sets of virtual goals. In addition to relabeling goals based on the object’s final state (standard HER), 2HER also generates goals from the agent’s effector positions, rewarding the agent for both interacting with the object and completing the task. Experimental results in robotic manipulation environments show that MOC-2HER achieves success rates of up to 90%, compared to less than 11% for both MOC and MOC-HER. These results highlight the effectiveness of our dual objective relabeling strategy in sparse reward, multi-goal tasks.

[AI-103] Experimentation Accelerator: Interpretable Insights and Creative Recommendations for A/B Testing with Content-Aware ranking

【速读】:该论文旨在解决现代在线实验(Online Experimentation)中的两大瓶颈问题:一是流量稀缺导致实验变体(Variant)选择困难,二是事后洞察提取依赖人工、不一致且内容无关;同时指出组织未能充分利用历史A/B测试结果和丰富的内容嵌入(Content Embedding)来优化实验优先级与创意迭代。解决方案的关键在于构建一个统一框架,其核心包括三部分:首先,基于处理嵌入(Treatment Embedding)和历史结果训练一个带有固定效应的点击率(CTR)排序模型,以平衡价值与内容多样性对候选变体进行评分;其次,通过符号一致且稀疏约束的Lasso回归将处理映射到语义营销属性空间,生成可解释的每属性系数与贡献值,用于可视化解释、关键驱动因素识别及自然语言洞察输出;最后,结合属性重要性与当前实验中未充分表达的属性计算机会指数(Opportunity Index),识别缺失但高影响力的内容维度,并借助大语言模型(LLM)将其转化为具体的创意建议并评估学习与转化潜力,从而加速实验周期并提升效率。该框架已集成至Adobe产品Experimentation Accelerator中,在真实业务场景中验证了其生成管道的质量与有效性。

链接: https://arxiv.org/abs/2602.13852
作者: Zhengmian Hu,Lei Shi,Ritwik Sinha,Justin Grover,David Arbour
机构: 未知
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Modern online experimentation faces two bottlenecks: scarce traffic forces tough choices on which variants to test, and post-hoc insight extraction is manual, inconsistent, and often content-agnostic. Meanwhile, organizations underuse historical A/B results and rich content embeddings that could guide prioritization and creative iteration. We present a unified framework to (i) prioritize which variants to test, (ii) explain why winners win, and (iii) surface targeted opportunities for new, higher-potential variants. Leveraging treatment embeddings and historical outcomes, we train a CTR ranking model with fixed effects for contextual shifts that scores candidates while balancing value and content diversity. For better interpretability and understanding, we project treatments onto curated semantic marketing attributes and re-express the ranker in this space via a sign-consistent, sparse constrained Lasso, yielding per-attribute coefficients and signed contributions for visual explanations, top-k drivers, and natural-language insights. We then compute an opportunity index combining attribute importance (from the ranker) with under-expression in the current experiment to flag missing, high-impact attributes. Finally, LLMs translate ranked opportunities into concrete creative suggestions and estimate both learning and conversion potential, enabling faster, more informative, and more efficient test cycles. These components have been built into a real Adobe product, called \textitExperimentation Accelerator, to provide AI-based insights and opportunities to scale experimentation for customers. We provide an evaluation of the performance of the proposed framework on some real-world experiments by Adobe business customers that validate the high quality of the generation pipeline.

[AI-104] Evaluating LLM -Generated ACSL Annotations for Formal Verification

【速读】:该论文旨在解决如何在无须人工干预或基于学习的方法辅助下,自动为真实世界的C程序生成并验证ACSL(ANSI/ISO C Specification Language)规范这一挑战。其关键解决方案在于构建一个受控实验环境,对五种不同的ACSL生成系统——包括规则驱动的Python脚本、Frama-C的RTE插件以及三种大语言模型(DeepSeek-V3.2、GPT-5.2和OLMo 3.1 32B Instruct)——进行统一验证评估,所有生成的规范均使用Frama-C的WP插件结合多个SMT求解器在相同条件下验证,从而实现对注释质量、求解器敏感性和证明稳定性的直接比较,为自动化ACSL生成的能力与局限提供了新的实证依据。

链接: https://arxiv.org/abs/2602.13851
作者: Arshad Beg,Diarmuid O’Donoghue,Rosemary Monahan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 12 pages. Submitted to Formal Techniques for Judicious Programming FTfJP-2026 at ECOOP. Under review

点击查看摘要

Abstract:Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C’s RTE plugin, and three large language models–DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.

[AI-105] Pawsterior: Variational Flow Matching for Structured Simulation-Based Inference

【速读】:该论文旨在解决仿真推断(Simulation-Based Inference, SBI)中因后验分布受限于结构化域(如物理参数的有界性或离散-连续混合变量)而导致的标准流匹配(flow-matching)方法效率低下且难以满足物理约束的问题。其解决方案的关键在于提出Pawsterior框架,该框架通过引入端点诱导的仿射几何约束(endpoint-induced affine geometric confinement)这一新原理,将领域几何信息直接嵌入变分模型中,从而提升采样数值稳定性并增强后验拟合精度;更重要的是,该变分参数化使得SBI任务能够处理具有离散潜在结构(如切换系统)的问题,突破了传统流匹配方法对连续空间的依赖,扩展了流匹配在结构化SBI问题中的适用范围。

链接: https://arxiv.org/abs/2602.13813
作者: Jorge Carrasco-Pollo,Floor Eijkelboom,Jan-Willem van de Meent
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Pawsterior, a variational flow-matching framework for improved and extended simulation-based inference (SBI). Many SBI problems involve posteriors constrained by structured domains, such as bounded physical parameters or hybrid discrete-continuous variables, yet standard flow-matching methods typically operate in unconstrained spaces. This mismatch leads to inefficient learning and difficulty respecting physical constraints. Our contributions are twofold. First, generalizing the geometric inductive bias of CatFlow, we formalize endpoint-induced affine geometric confinement, a principle that incorporates domain geometry directly into the inference process via a two-sided variational model. This formulation improves numerical stability during sampling and leads to consistently better posterior fidelity, as demonstrated by improved classifier two-sample test performance across standard SBI benchmarks. Second, and more importantly, our variational parameterization enables SBI tasks involving discrete latent structure (e.g., switching systems) that are fundamentally incompatible with conventional flow-matching approaches. By addressing both geometric constraints and discrete latent structure, Pawsterior extends flow-matching to a broader class of structured SBI problems that were previously inaccessible.

[AI-106] Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation ICLR

【速读】:该论文旨在解决流形生成策略(flow-based policies)在强化学习中面临的表达能力与计算负担之间的权衡问题,即如何在保持高表达性的同时实现高效的单步动作生成。解决方案的关键在于提出均值速度策略(Mean Velocity Policy, MVP),通过建模均值速度场来实现最快的一步动作生成,并引入瞬时速度约束(Instantaneous Velocity Constraint, IVC)作为训练过程中的边界条件,从而显著提升策略的表达能力和学习精度。

链接: https://arxiv.org/abs/2602.13810
作者: Guojian Zhan,Letian Tao,Pengcheng Wang,Yixiao Wang,Yiheng Li,Yuxin Chen,Masayoshi Tomizuka,Shengbo Eben Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR Oral Presentation

点击查看摘要

Abstract:Learning expressive and efficient policy functions is a promising direction in reinforcement learning (RL). While flow-based policies have recently proven effective in modeling complex action distributions with a fast deterministic sampling process, they still face a trade-off between expressiveness and computational burden, which is typically controlled by the number of flow steps. In this work, we propose mean velocity policy (MVP), a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. To ensure its high expressiveness, an instantaneous velocity constraint (IVC) is introduced on the mean velocity field during training. We theoretically prove that this design explicitly serves as a crucial boundary condition, thereby improving learning accuracy and enhancing policy expressiveness. Empirically, our MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench. It also delivers substantial improvements in training and inference speed over existing flow-based policy baselines.

[AI-107] An end-to-end agent ic pipeline for smart contract translation and quality evaluation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)生成智能合约时缺乏系统性评估方法的问题,尤其关注从自然语言规范到Solidity代码的端到端合成质量。其解决方案的关键在于构建一个包含结构化解析、代码生成与自动化质量评估的全流程框架:首先将合同文本解析为结构化模式,继而生成Solidity代码,并通过编译和安全检查进行多维质量评估(包括功能完整性、变量保真度、状态机正确性、业务逻辑保真度及代码质量),最终以复合评分量化生成结果与基准实现的对齐程度,从而提供可复现的基准用于实证研究,并支持向形式化验证和合规性检查扩展。

链接: https://arxiv.org/abs/2602.13808
作者: Abhinav Goel,Chaitya Shah,Agostino Capponi,Alfio Gliozzo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:We present an end-to-end framework for systematic evaluation of LLM-generated smart contracts from natural-language specifications. The system parses contractual text into structured schemas, generates Solidity code, and performs automated quality assessment through compilation and security checks. Using CrewAI-style agent teams with iterative refinement, the pipeline produces structured artifacts with full provenance metadata. Quality is measured across five dimensions, including functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality aggregated into composite scores. The framework supports paired evaluation against ground-truth implementations, quantifying alignment and identifying systematic error modes such as logic omissions and state transition inconsistencies. This provides a reproducible benchmark for empirical research on smart contract synthesis quality and supports extensions to formal verification and compliance checking.

[AI-108] Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees

【速读】:该论文旨在解决大语言模型在长上下文推理中注意力机制计算开销过高的问题,其核心观察是:尽管模型处理长文本时需计算全部token的注意力权重,但实际对每个查询而言,仅有少量token具有实质性贡献。解决方案的关键在于提出一种基于凸包投影与熵松弛(entropic relaxation)的理论框架,通过引入“支持间隙”(support gap, Δ)这一由KKT乘子验证的严格互补性条件,证明了熵化注意力会集中于一个常数大小的活跃面(active face),且非活跃token的质量衰减呈指数级(exp(−Ω(Δ/ε) )),而活跃面上的误差随温度参数ε线性增长。由此构建出可量化判断稀疏解码安全性的准则,并据此设计了Vashista Sparse Attention机制——一种兼容现代推理栈的分页式上下文选择策略,能在保持稳定常数级有效支持的同时实现显著的运行时加速和最小质量损失,尤其适用于隐私敏感或离线部署场景。

链接: https://arxiv.org/abs/2602.13804
作者: Vashista Nobaub
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 22 pages

点击查看摘要

Abstract:Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap (\Delta) certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as (\exp(-\Omega(\Delta/\varepsilon))), while the error on the active face scales linearly in the temperature/regularization parameter (\varepsilon). This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query through a paging-style context selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies. Comments: 22 pages Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2602.13804 [cs.AI] (or arXiv:2602.13804v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.13804 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-109] MechPert: Mechanistic Consensus as an Inductive Bias for Unseen Perturbation Prediction

【速读】:该论文旨在解决如何准确预测未见过的遗传扰动(genetic perturbations)对转录组响应的影响这一关键问题,这对于理解基因调控机制和优化大规模扰动实验设计至关重要。现有方法要么依赖静态且可能不完整的知识图谱,要么基于功能相似性检索关联信息,而后者受限于科学文本中对称共现关系,无法体现定向调控逻辑。其解决方案的核心在于提出一个轻量级框架 MechPert,通过多智能体协作机制引导大语言模型(LLM)生成具有方向性的调控假设(directed regulatory hypotheses),而非仅依赖功能相似性;各智能体独立提出候选调控因子并附带置信度评分,再通过共识机制过滤虚假关联,构建加权邻域用于下游预测,在低数据场景下显著提升预测准确性,并在实验设计中优于传统网络中心性策略。

链接: https://arxiv.org/abs/2602.13791
作者: Marc Boubnovski Martell,Josefa Lia Stoisser,Lawrence Phillips,Aditya Misra,Robert Kitchen,Jesper Ferkinghoff-Borg,Jialin Yu,Philip Torr,Kaspar Märten
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predicting transcriptional responses to unseen genetic perturbations is essential for understanding gene regulation and prioritizing large-scale perturbation experiments. Existing approaches either rely on static, potentially incomplete knowledge graphs, or prompt language models for functionally similar genes, retrieving associations shaped by symmetric co-occurrence in scientific text rather than directed regulatory logic. We introduce MechPert, a lightweight framework that encourages LLM agents to generate directed regulatory hypotheses rather than relying solely on functional similarity. Multiple agents independently propose candidate regulators with associated confidence scores; these are aggregated through a consensus mechanism that filters spurious associations, producing weighted neighborhoods for downstream prediction. We evaluate MechPert on Perturb-seq benchmarks across four human cell lines. For perturbation prediction in low-data regimes ( N=50 observed perturbations), MechPert improves Pearson correlation by up to 10.5% over similarity-based baselines. For experimental design, MechPert-selected anchor genes outperform standard network centrality heuristics by up to 46% in well-characterized cell lines.

[AI-110] OR-Agent : Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery

【速读】:该论文旨在解决复杂实验驱动领域中科学发现自动化的挑战,即如何超越简单的程序迭代突变机制,实现结构化的假设管理、环境交互与有原则的反思。其核心问题在于现有方法难以有效组织研究路径,缺乏对探索过程的系统性控制和长期知识积累。解决方案的关键在于提出OR-Agent框架,该框架采用基于树状结构的工作流来显式建模假设分支与系统性回溯,引入一种进化-系统化协同构想机制(evolutionary-systematic ideation mechanism),统一了研究起点的选择、完整研究计划的生成及研究树内的协调探索;同时设计了一种分层优化启发式反思系统,包括短期实验反思(提供即时修正信号)、长期反思(累积跨实验洞察作为语义动量)以及记忆压缩(类比权重衰减,抑制漂移并保留关键信号),从而形成一套具有原则性的研究动态调控架构。

链接: https://arxiv.org/abs/2602.13769
作者: Qi Liu,Wanjing Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Automating scientific discovery in complex, experiment-driven domains requires more than iterative mutation of programs; it demands structured hypothesis management, environment interaction, and principled reflection. We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments. OR-Agent organizes research as a structured tree-based workflow that explicitly models branching hypothesis generation and systematic backtracking, enabling controlled management of research trajectories beyond simple mutation-crossover loops. At its core, we introduce an evolutionary-systematic ideation mechanism that unifies evolutionary selection of research starting points, comprehensive research plan generation, and coordinated exploration within a research tree. We further propose a hierarchical optimization-inspired reflection system: short-term experimental reflection operates as a form of verbal gradient providing immediate corrective signals; long-term reflection accumulates cross-experiment insights as verbal momentum; and memory compression serves as a regularization mechanism analogous to weight decay, preserving essential signals while mitigating drift. Together, these components form a principled architecture governing research dynamics. We conduct extensive experiments across classical combinatorial optimization benchmarks-including traveling salesman, capacitated vehicle routing, bin packing, orienteering, and multiple knapsack problems-as well as simulation-based cooperative driving scenarios. Results demonstrate that OR-Agent outperforms strong evolutionary baselines while providing a general, extensible, and inspectable framework for AI-assisted scientific discovery. OR-Agent source code and experiments data are publicly available at this https URL.

[AI-111] MOTIF: Learning Action Motifs for Few-shot Cross-Embodiment Transfer

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在跨机体(cross-embodiment)迁移中的挑战,即由于机器人构型差异导致的运动学异质性以及获取充足真实世界示范数据以支持微调的高成本问题。现有方法依赖共享-私有架构,受限于私有参数容量且缺乏显式的适应机制。其解决方案的关键在于提出MOTIF框架,通过向量量化结合进度感知对齐与机体对抗约束,从异构动作数据中解耦出与机体无关的时空模式(称为动作基元,action motifs),并设计轻量级预测器将这些基元映射到实时输入,进而融合机器人特定状态,驱动流匹配策略生成新机体的动作输出,从而实现高效少样本跨机体迁移。

链接: https://arxiv.org/abs/2602.13764
作者: Heng Zhi,Wentao Tan,Lei Zhu,Fengling Li,Jingjing Li,Guoli Yang,Heng Tao Shen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While vision-language-action (VLA) models have advanced generalist robotic learning, cross-embodiment transfer remains challenging due to kinematic heterogeneity and the high cost of collecting sufficient real-world demonstrations to support fine-tuning. Existing cross-embodiment policies typically rely on shared-private architectures, which suffer from limited capacity of private parameters and lack explicit adaptation mechanisms. To address these limitations, we introduce MOTIF for efficient few-shot cross-embodiment transfer that decouples embodiment-agnostic spatiotemporal patterns, termed action motifs, from heterogeneous action data. Specifically, MOTIF first learns unified motifs via vector quantization with progress-aware alignment and embodiment adversarial constraints to ensure temporal and cross-embodiment consistency. We then design a lightweight predictor that predicts these motifs from real-time inputs to guide a flow-matching policy, fusing them with robot-specific states to enable action generation on new embodiments. Evaluations across both simulation and real-world environments validate the superiority of MOTIF, which significantly outperforms strong baselines in few-shot transfer scenarios by 6.5% in simulation and 43.7% in real-world settings. Code is available at this https URL.

[AI-112] OneLatent: Single-Token Compression for Visual Latent Reasoning

【速读】:该论文旨在解决链式思维(Chain-of-thought, CoT)提示在推理过程中导致的推理成本显著增加的问题,即文本形式的中间推理步骤通常使输出长度增长一到两个数量级,从而影响模型效率。其解决方案的关键在于提出OneLatent框架,通过将中间推理过程压缩为单一潜在标记(latent token),利用渲染后的CoT图像和DeepSeek-OCR隐藏状态进行监督训练,从而获得可审计且确定性的监督信号,避免模型生成冗长的文本推理链。该方法在多个基准测试中实现了平均输出长度减少11倍、仅损失2.21%准确率,并显著提升输出token贡献度(OTC)达6.8倍,在长链逻辑推理任务中达到接近纯文本CoT的性能水平(如ProntoQA上99.80%,ProsQA上97.80%),验证了其在压缩约束下的泛化能力。

链接: https://arxiv.org/abs/2602.13738
作者: Bo Lv,Yasheng Sun,Junjie Wang,Haoxiang Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address these challenges, we present \textbfOneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by 11\times with only a 2.21% average accuracy drop relative to textual CoT, while improving output token contribution (OTC) by 6.8\times . On long-chain logical reasoning, OneLatent reaches 99.80% on ProntoQA and 97.80% on ProsQA with one latent token, with compression up to 87.4\times , supporting compression-constrained generalization.

[AI-113] HybridFlow: A Two-Step Generative Policy for Robotic Manipulation

【速读】:该论文旨在解决现有机器人操作策略因推理延迟过高而缺乏与环境实时交互能力的问题,尤其是在需要快速响应的场景中,传统扩散模型(Diffusion Model)生成速度不足,难以满足机器人控制对低延迟的需求。解决方案的关键在于提出一种三阶段、两次非流形估计(2-NFE)的混合生成方法 HybridFlow:首先利用 MeanFlow 的一步生成优势实现全局跳跃(Global Jump),随后通过 ReNoise 进行分布对齐以提升动作精度,最后在 ReFlow 模式下进行局部精调(Local Refine),从而在保持极低推理时间(仅 19ms,相比原 16 步扩散策略提速 8 倍)的同时显著提升任务成功率(提升 15–25%),并在未见颜色(OOD)抓取和可变形物体折叠等挑战性任务中验证了其泛化性能。

链接: https://arxiv.org/abs/2602.13718
作者: Zhenchen Dong,Jinna Fu,Jiaming Wu,Shengyuan Yu,Fulin Chen,Yide Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Limited by inference latency, existing robot manipulation policies lack sufficient real-time interaction capability with the environment. Although faster generation methods such as flow matching are gradually replacing diffusion methods, researchers are pursuing even faster generation suitable for interactive robot control. MeanFlow, as a one-step variant of flow matching, has shown strong potential in image generation, but its precision in action generation does not meet the stringent requirements of robotic manipulation. We therefore propose \textbfHybridFlow, a \textbf3-stage method with \textbf2-NFE: Global Jump in MeanFlow mode, ReNoise for distribution alignment, and Local Refine in ReFlow mode. This method balances inference speed and generation quality by leveraging the rapid advantage of MeanFlow one-step generation while ensuring action precision with minimal generation steps. Through real-world experiments, HybridFlow outperforms the 16-step Diffusion Policy by \textbf15–25% in success rate while reducing inference time from 152ms to 19ms (\textbf8 \times speedup, \textbf \sim 52Hz); it also achieves 70.0% success on unseen-color OOD grasping and 66.3% on deformable object folding. We envision HybridFlow as a practical low-latency method to enhance real-world interaction capabilities of robotic manipulation policies.

[AI-114] No Need to Train Your RDB Foundation Model

【速读】:该论文旨在解决在关系型数据库(Relational Database, RDB)中进行预测建模时,如何避免为每个新的目标变量重新训练模型的问题。传统方法依赖于针对特定任务的监督学习管道,难以扩展到跨多个相互关联表的场景。其解决方案的关键在于:提出一种基于上下文学习(In-Context Learning, ICL)的RDB编码策略,该策略将变长的RDB邻域压缩为固定长度的输入样本,供解码器消费;并强调压缩必须限制在高维RDB列内——即所有实体共享单位和角色的列,而非跨列压缩,因为后者缺乏标签信息时无法判断异构数据类型的关联性。在此约束下,研究进一步证明,无需引入可训练参数即可保持编码器表达能力,从而实现与现有单表ICL基础模型无缝集成,且无需任何训练或微调。

链接: https://arxiv.org/abs/2602.13697
作者: Linjie Xu,Yanlin Zhang,Quan Gan,Minjie Wang,David Wipf
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we \textitavoid retraining a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained \emphwithin high-dimensional RDB columns where all entities share units and roles, not \textitacross columns where the relevance of heterogeneous data types cannot possibly be determined without label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already-existing single-table ICL foundation models, whereby no training or fine-tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in an easy-to-use open-source RDB foundation model\footnote\labelfoot: RDBLearn_learn this https URL capable of robust performance on unseen datasets out of the box.

[AI-115] Can a Lightweight Automated AI Pipeline Solve Research-Level Mathematical Problems?

【速读】:该论文旨在解决如何利用下一代大语言模型(Large Language Models, LLMs)通过轻量级自然语言管道高效求解研究级数学问题的挑战。其核心问题是当前生成式 AI (Generative AI) 在数学证明中的应用仍主要局限于竞赛级基准测试,而缺乏针对科研场景下复杂、未公开问题的自动化求解与验证能力。解决方案的关键在于构建一个优化的自动化流水线,该流水线集成最新大语言模型(如 Gemini 3 Pro 和 GPT-5.2 Pro),并采用基于引用(citation-based)的验证机制,从而实现对研究级数学问题候选证明的自动生成与可信验证。实验表明,该方法在两个新构建的数据集(ICCM 和 “First Proof”)上成功生成了全部问题的候选证明,并部分完成人工验证,展现了其在实际科研场景中的潜力。

链接: https://arxiv.org/abs/2602.13695
作者: Lve Meng(University of Science and Technology of China, Zhongguancun Academy),Weilong Zhao(Université Paris Cité),Yanzhi Zhang(Zhongguancun Academy),Haoxiang Guan(Zhongguancun Academy),Jiyan He(Zhongguancun Academy)
机构: 未知
类目: Artificial Intelligence (cs.AI); Commutative Algebra (math.AC); Combinatorics (math.CO); Category Theory (math.CT)
备注: 9 pages

点击查看摘要

Abstract:Large language models (LLMs) have recently achieved remarkable success in generating rigorous mathematical proofs, with “AI for Math” emerging as a vibrant field of research. While these models have mastered competition-level benchmarks like the International Mathematical Olympiad and show promise in research applications through auto-formalization, their deployment via lightweight, natural-language pipelines for research problems remains underexplored. In this work, we demonstrate that next-generation models (e.g., Gemini 3 Pro, GPT-5.2 Pro), when integrated into a streamlined automated pipeline optimized for citation-based verification, can solve sophisticated research-grade problems. We evaluate our pipeline on two novel datasets: (1) the ICCM problem sets (comparable to the S.-T. Yau College Student Mathematics Contest) proposed by leading mathematicians, and (2) the “First Proof” problem set, consisting of previously unpublished research questions. Our pipeline generated candidate proofs for all problems in the first two ICCM sets and the “First Proof” set. The solutions for the first two ICCM sets and Problem 4 of the “First Proof” set have been fully verified by our team. All generated proofs have been submitted to the official organization, and our generated results are publicly available. We plan to open-source the complete pipeline methodology in due course.

[AI-116] PhGPO: Pheromone-Guided Policy Optimization for Long-Horizon Tool Planning

【速读】:该论文旨在解决长周期多步骤工具规划(long-horizon multi-step tool planning)中的挑战,即在复杂任务执行过程中,由于工具使用路径的组合爆炸问题,导致有效策略难以发现且难以复用。解决方案的关键在于提出一种基于信息素引导的策略优化方法(Pheromone-Guided Policy Optimization, PhGPO),该方法从历史成功轨迹中学习可复用的工具转换模式(即“信息素”),并将此模式用于指导后续策略优化,从而显式地引导模型向历史上成功的工具转换方向收敛,显著提升长期规划能力。

链接: https://arxiv.org/abs/2602.13691
作者: Yu Li,Guangfeng Cai,Shengtian Yang,Han Luo,Shuo Han,Xu He,Dong Li,Lei Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Model (LLM) agents have demonstrated strong capabilities in executing complex tasks through tool use. However, long-horizon multi-step tool planning is challenging, because the exploration space suffers from a combinatorial explosion. In this scenario, even when a correct tool-use path is found, it is usually considered an immediate reward for current training, which would not provide any reusable information for subsequent training. In this paper, we argue that historically successful trajectories contain reusable tool-transition patterns, which can be leveraged throughout the whole training process. Inspired by ant colony optimization where historically successful paths can be reflected by the pheromone, we propose Pheromone-Guided Policy Optimization (PhGPO), which learns a trajectory-based transition pattern (i.e., pheromone) from historical trajectories and then uses the learned pheromone to guide policy optimization. This learned pheromone provides explicit and reusable guidance that steers policy optimization toward historically successful tool transitions, thereby improving long-horizon tool planning. Comprehensive experimental results demonstrate the effectiveness of our proposed PhGPO.

[AI-117] AuTAgent : A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

【速读】:该论文旨在解决大型音频语言模型(Large Audio Language Models, LALMs)在处理需要精确声学测量的复杂推理任务时表现不足的问题。尽管外部工具能够提取细粒度特征(如准确节拍或音高),但如何有效整合这些工具仍具挑战性:盲目调用所有工具会导致信息过载,而基于提示的工具选择机制则难以评估工具在特定上下文中的实用性。解决方案的关键在于提出 AuTAgent(Audio Tool Agent),这是一个基于强化学习的框架,通过稀疏反馈训练策略与新颖的差异奖励(Differential Reward)机制,学习何时以及调用哪些工具,从而仅在外部辅助能带来净性能提升时才触发工具调用。实验表明,AuTAgent 通过提供可验证的声学证据,显著提升了模型准确性,并展现出优异的跨模型迁移能力。

链接: https://arxiv.org/abs/2602.13685
作者: Siqian Tong,Xuan Li,Yiwei Wang,Baolong Bi,Yujun Cai,Shenghua Liu,Yuchen He,Chengpeng Hao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to assess context-dependent utility. To address this, we propose AuTAgent (Audio Tool Agent), a reinforcement learning framework that learns when and which tools to invoke. By employing a sparse-feedback training strategy with a novel Differential Reward mechanism, the agent learns to filter out irrelevant tools and invokes external assistance only when it yields a net performance gain over the base model. Experimental results confirm that AuTAgent complements the representation bottleneck of LALMs by providing verifiable acoustic evidence. It improves accuracy by 4.20% / 6.20% and 9.80% / 8.00% for open-source and closed-source backbones on the MMAU Test-mini and the MMAR benchmarks, respectively. In addition, further experiments demonstrate exceptional transferability. We highlight the complementary role of external tools in augmenting audio model reasoning.

[AI-118] On the Sparsifiability of Correlation Clustering: Approximation Guarantees under Edge Sampling

【速读】:该论文致力于解决相关聚类(Correlation Clustering, CC)中大规模场景下的优化瓶颈问题,即如何在保留基于线性规划(LP)的近似保证的前提下,显著减少所需的边信息量。其核心挑战在于传统LP方法依赖于 Θ(n3)\Theta(n^3) 个三角不等式约束,难以扩展至大规模数据。解决方案的关键在于引入“稀疏化—近似权衡”(sparsification–approximation trade-offs)框架:首先,通过证明聚类不一致类的VC维为 n1n-1,构建最优大小 O~(n/ε2)\tilde{O}(n/\varepsilon^2) 的加性 ε\varepsilon-核(coreset);其次,发现任意LP顶点最多激活 (n2)\binom{n}{2} 个三角不等式,从而支持精确割平面求解器;最后,提出一种改进的LP-PIVOT算法,在观测到 Θ~(n3/2)\tilde{\Theta}(n^{3/2}) 条边后可实现稳健的 10.3\frac{1}{0.3}-近似(误差由可计算的插补质量统计量 Γw\overline{\Gamma}_w 控制)。此外,论文还通过Yao最小最大原理揭示了伪度量结构(pseudometric structure)对鲁棒性的决定性作用——无此结构时,仅观察 o(n)o(n) 条随机边会导致近似比无界,表明该结构不仅关乎可解性,更决定了CC对不完整信息的鲁棒性。

链接: https://arxiv.org/abs/2602.13684
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Correlation Clustering (CC) is a fundamental unsupervised learning primitive whose strongest LP-based approximation guarantees require \Theta(n^3) triangle inequality constraints and are prohibitive at scale. We initiate the study of \emphsparsification–approximation trade-offs for CC, asking how much edge information is needed to retain LP-based guarantees. We establish a structural dichotomy between pseudometric and general weighted instances. On the positive side, we prove that the VC dimension of the clustering disagreement class is exactly n-1 , yielding additive \varepsilon -coresets of optimal size \tildeO(n/\varepsilon^2) ; that at most \binomn2 triangle inequalities are active at any LP vertex, enabling an exact cutting-plane solver; and that a sparsified variant of LP-PIVOT, which imputes missing LP marginals via triangle inequalities, achieves a robust \frac103 -approximation (up to an additive term controlled by an empirically computable imputation-quality statistic \overline\Gamma_w ) once \tilde\Theta(n^3/2) edges are observed, a threshold we prove is sharp. On the negative side, we show via Yao’s minimax principle that without pseudometric structure, any algorithm observing o(n) uniformly random edges incurs an unbounded approximation ratio, demonstrating that the pseudometric condition governs not only tractability but also the robustness of CC to incomplete information.

[AI-119] ALMo: Interactive Aim-Limit-Defined Multi-Objective System for Personalized High-Dose-Rate Brachytherapy Treatment Planning and Visualization for Cervical Cancer ALT

【速读】:该论文旨在解决复杂临床决策中多目标权衡的高认知负荷问题,特别是在宫颈癌高剂量率(High-Dose-Rate, HDR)后装治疗中,需在严格管理放射热点的同时平衡肿瘤覆盖与器官保护之间的矛盾。其解决方案的关键在于提出ALMo(Aim-Limit-defined Multi-Objective system),该系统通过一种新颖的优化框架实现自动化参数设置并支持灵活的毒性风险控制,使临床医生能直接调整直观的目标(aim)和限制(limit)值来导航剂量学帕累托前沿(Pareto surface),从而高效推导出个体化最优治疗策略。

链接: https://arxiv.org/abs/2602.13666
作者: Edward Chen,Natalie Dullerud,Pang Wei Koh,Thomas Niedermayr,Elizabeth Kidd,Sanmi Koyejo,Carlos Guestrin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Abstract accepted at Symposium on Artificial Intelligence in Learning Health Systems (SAIL) 2025

点击查看摘要

Abstract:In complex clinical decision-making, clinicians must often track a variety of competing metrics defined by aim (ideal) and limit (strict) thresholds. Sifting through these high-dimensional tradeoffs to infer the optimal patient-specific strategy is cognitively demanding and historically prone to variability. In this paper, we address this challenge within the context of High-Dose-Rate (HDR) brachytherapy for cervical cancer, where planning requires strictly managing radiation hot spots while balancing tumor coverage against organ sparing. We present ALMo (Aim-Limit-defined Multi-Objective system), an interactive decision support system designed to infer and operationalize clinician intent. ALMo employs a novel optimization framework that minimizes manual input through automated parameter setup and enables flexible control over toxicity risks. Crucially, the system allows clinicians to navigate the Pareto surface of dosimetric tradeoffs by directly manipulating intuitive aim and limit values. In a retrospective evaluation of 25 clinical cases, ALMo generated treatment plans that consistently met or exceeded manual planning quality, with 65% of cases demonstrating dosimetric improvements. Furthermore, the system significantly enhanced efficiency, reducing average planning time to approximately 17 minutes, compared to the conventional 30-60 minutes. While validated in brachytherapy, ALMo demonstrates a generalized framework for streamlining interaction in multi-criteria clinical decision-making.

[AI-120] HyFunc: Accelerating LLM -based Function Calls for Agent ic AI through Hybrid-Model Cascade and Dynamic Templating KDD’26

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统(agentic AI systems)在将用户意图转化为结构化函数调用时存在的计算冗余问题,导致推理延迟高、难以支持实时应用。其核心解决方案是提出HyFunc框架,通过三个关键设计消除冗余:(1) 采用混合模型级联机制,由大模型将用户意图压缩为单个“软令牌”(soft token),引导轻量检索器选择相关函数并驱动一个微调后的小型模型生成最终调用,从而避免大模型重复处理完整上下文和全序列生成;(2) 引入动态模板技术(dynamic templating),在扩展的vLLM引擎中实时注入参数语法模板,消除固定代码片段的重复生成;(3) 在未见过的基准数据集BFCL上验证性能,确保泛化能力。实验表明,HyFunc在保持80.1%性能的同时将推理延迟降至0.828秒,显著优于现有基线模型。

链接: https://arxiv.org/abs/2602.13665
作者: Weibin Liao,Jian-guang Lou,Haoyi Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by KDD’26

点击查看摘要

Abstract:While agentic AI systems rely on LLMs to translate user intent into structured function calls, this process is fraught with computational redundancy, leading to high inference latency that hinders real-time applications. This paper identifies and addresses three key redundancies: (1) the redundant processing of a large library of function descriptions for every request; (2) the redundant use of a large, slow model to generate an entire, often predictable, token sequence; and (3) the redundant generation of fixed, boilerplate parameter syntax. We introduce HyFunc, a novel framework that systematically eliminates these inefficiencies. HyFunc employs a hybrid-model cascade where a large model distills user intent into a single “soft token.” This token guides a lightweight retriever to select relevant functions and directs a smaller, prefix-tuned model to generate the final call, thus avoiding redundant context processing and full-sequence generation by the large model. To eliminate syntactic redundancy, our “dynamic templating” technique injects boilerplate parameter syntax on-the-fly within an extended vLLM engine. To avoid potential limitations in generalization, we evaluate HyFunc on an unseen benchmark dataset, BFCL. Experimental results demonstrate that HyFunc achieves an excellent balance between efficiency and performance. It achieves an inference latency of 0.828 seconds, outperforming all baseline models, and reaches a performance of 80.1%, surpassing all models with a comparable parameter scale. These results suggest that HyFunc offers a more efficient paradigm for agentic AI. Our code is publicly available at this https URL.

[AI-121] Cumulative Utility Parity for Fair Federated Learning under Intermittent Client Participation

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)系统中因客户端参与不均衡而导致的公平性问题。现有方法通常假设客户端具有相似的参与机会,仅在每轮训练中追求损失或准确率的公平性,但在实际场景中,客户端的参与往往受数据特征或资源限制影响而呈现时间上的偏斜(temporally skewed),这会导致间歇性参与的客户端被系统性地低估,即使每轮表现看似公平。解决方案的关键在于提出“累积效用平等”(cumulative utility parity)这一新的公平性原则,即衡量每个客户端在每次参与机会中获得的长期收益是否一致;并引入“可用性归一化累积效用”(availability-normalized cumulative utility)作为可操作指标,将不可避免的物理约束与可避免的调度和聚合算法偏差分离,从而实现更公平的长期代表性。

链接: https://arxiv.org/abs/2602.13651
作者: Stefan Behfar,Richard Mortier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In real-world federated learning (FL) systems, client participation is intermittent, heterogeneous, and often correlated with data characteristics or resource constraints. Existing fairness approaches in FL primarily focus on equalizing loss or accuracy conditional on participation, implicitly assuming that clients have comparable opportunities to contribute over time. However, when participation itself is uneven, these objectives can lead to systematic under-representation of intermittently available clients, even if per-round performance appears fair. We propose cumulative utility parity, a fairness principle that evaluates whether clients receive comparable long-term benefit per participation opportunity, rather than per training round. To operationalize this notion, we introduce availability-normalized cumulative utility, which disentangles unavoidable physical constraints from avoidable algorithmic bias arising from scheduling and aggregation. Experiments on temporally skewed, non-IID federated benchmarks demonstrate that our approach substantially improves long-term representation parity, while maintaining near-perfect performance.

[AI-122] Hierarchical Audio-Visual-Proprioceptive Fusion for Precise Robotic Manipulation

【速读】:该论文旨在解决当前机器人操作中因依赖视觉和本体感知(proprioception)观测而难以在部分可观测现实环境中推断接触相关交互状态的问题,同时指出现有多模态融合方法对声学信号的利用不足。其关键解决方案是提出一种分层表示融合框架(hierarchical representation fusion framework),通过先将视觉与本体感知表示条件化于声学线索,再显式建模高阶跨模态交互来捕捉模态间的互补依赖关系;该结构结合扩散策略(diffusion-based policy)实现从多模态观测直接生成连续机器人动作,从而有效利用任务相关的声学信息并抑制低信息量模态的干扰,在液体倾倒和柜门开启等真实场景中显著优于现有最先进的多模态融合方法。

链接: https://arxiv.org/abs/2602.13640
作者: Siyuan Li,Jiani Lu,Yu Song,Xianren Li,Bo An,Peng Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing robotic manipulation methods primarily rely on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in current multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities, and thus design flat and symmetric fusion structures. However, this assumption is ill-suited for acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions visual and proprioceptive representations on acoustic cues, and then explicitly models higher-order cross-modal interactions to capture complementary dependencies among modalities. The fused representation is leveraged by a diffusion-based policy to directly generate continuous robot actions from multimodal observations. The combination of end-to-end learning and hierarchical fusion structure enables the policy to exploit task-relevant acoustic information while mitigating interference from less informative modalities. The proposed method has been evaluated on real-world robotic manipulation tasks, including liquid pouring and cabinet opening. Extensive experiment results demonstrate that our approach consistently outperforms state-of-the-art multimodal fusion frameworks, particularly in scenarios where acoustic cues provide task-relevant information not readily available from visual observations alone. Furthermore, a mutual information analysis is conducted to interpret the effect of audio cues in robotic manipulation via multimodal fusion.

[AI-123] DiffusionRollout: Uncertainty-Aware Rollout Planning in Long-Horizon PDE Solving

【速读】:该论文旨在解决自回归扩散模型在物理系统长期预测中因误差累积导致的可靠性下降问题(error accumulation in long-horizon predictions)。其核心解决方案是提出一种称为DiffusionRollout的选择性滚动规划策略,关键在于利用多样本预测标准差作为不确定性度量,并据此自适应调整推理过程中的步长,从而减少对不准确前序输出的依赖,提升长时间序列预测的稳定性与准确性。

链接: https://arxiv.org/abs/2602.13616
作者: Seungwoo Yoo,Juil Koo,Daehyeon Choi,Minhyuk Sung
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: TMLR

点击查看摘要

Abstract:We propose DiffusionRollout, a novel selective rollout planning strategy for autoregressive diffusion models, aimed at mitigating error accumulation in long-horizon predictions of physical systems governed by partial differential equations (PDEs). Building on the recently validated probabilistic approach to PDE solving, we further explore its ability to quantify predictive uncertainty and demonstrate a strong correlation between prediction errors and standard deviations computed over multiple samples-supporting their use as a proxy for the model’s predictive confidence. Based on this observation, we introduce a mechanism that adaptively selects step sizes during autoregressive rollouts, improving long-term prediction reliability by reducing the compounding effect of conditioning on inaccurate prior outputs. Extensive evaluation on long-trajectory PDE prediction benchmarks validates the effectiveness of the proposed uncertainty measure and adaptive planning strategy, as evidenced by lower prediction errors and longer predicted trajectories that retain a high correlation with their ground truths.

[AI-124] From What to How: Bridging User Requirements with Software Development Using Large Language Models

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件开发中过度关注代码实现而忽视软件设计能力的问题。现有基准测试多聚焦于代码生成质量,未能评估LLMs在需求分析、面向对象建模及测试用例设计等关键设计环节的表现。为此,作者提出DesBench——一个面向软件设计的基准测试集,包含30个手工构建的Java项目,涵盖需求文档、设计模型、实现代码与验收测试用例,共计30个设计模型、194个Java类和737个测试用例。其核心解决方案是通过结构化设计任务(包括设计感知的代码生成、面向对象建模和验收测试设计)系统性评估LLMs在软件设计层面的能力,揭示其在无设计或仅高阶设计输入下难以生成正确实现、在类间关系与操作定义上存在不足,但能生成具备人类水平代码覆盖率的验收测试用例,从而指明LLMs在软件工程中亟需提升设计理解与表达能力,并推动面向LLM的新设计方法学研究。

链接: https://arxiv.org/abs/2602.13611
作者: Xiao He,Ru Chen,Jialun Cao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, large language models (LLMs) are extensively utilized to enhance development efficiency, leading to numerous benchmarks for evaluating their performance. However, these benchmarks predominantly focus on implementation, overlooking the equally critical aspect of software design. This gap raises two pivotal questions: (1) Can LLMs handle software design? (2) Can LLMs write code following the specific designs? To investigate these questions, this paper proposes DesBench, a design-aware benchmark for evaluating LLMs on three software design-related tasks: design-aware code generation, object-oriented modeling, and the design of acceptance test cases. DesBench comprises 30 manually crafted Java projects that include requirement documents, design models, implementations, and acceptance tests, amounting to a total of 30 design models, 194 Java classes, and 737 test cases. We evaluated seven state-of-the-art LLMs, including three DeepSeek R1, two Qwen2.5, and two GPT models, using DesBench. The results reveal that LLMs remain significantly challenged by the intricacies of software design: (1) For code generation, LLMs struggle to produce correct implementations when provided with only high-level or no designs. (2) In object-oriented modeling, while LLMs can accurately identify objects and classes, they face challenges in defining operations and inter-class relationships. (3) Acceptance test cases generated by LLMs from functional requirements achieve code coverage quality comparable to those written by humans. Our research highlights the current limitations of LLMs in managing software design and calls for further investigation into new design methodologies and languages suitable for LLM-based development.

[AI-125] Multi-Modal Sensing and Fusion in mmWave Beamforming for Connected Vehicles: A Transformer Based Framework

【速读】:该论文旨在解决毫米波(mmWave)通信在高速动态车联网环境中因标准定义的波束赋形(beamforming)方法导致的波束训练开销过高、可用空口时间减少的问题,其核心原因在于频繁交换导频信号和全量波束测量。解决方案的关键在于提出了一种多模态感知与融合学习框架:首先通过模态特定编码器提取不同传感模态的代表性特征,再利用多头交叉模态注意力机制学习各模态间的依赖关系与相关性,最终融合多模态特征以预测最优top-k波束,从而实现对最佳视距链路的主动建立。该方法显著降低了波束搜索空间和延迟开销,同时保持高预测准确率。

链接: https://arxiv.org/abs/2602.13606
作者: Muhammad Baqer Mollah,Honggang Wang,Mohammad Ataul Karim,Hua Fang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 13 Pages. arXiv admin note: text overlap with arXiv:2509.11112

点击查看摘要

Abstract:Millimeter wave (mmWave) communication, utilizing beamforming techniques to address the inherent path loss limitation, is considered as one of the key technologies to support ever increasing high throughput and low latency demands of connected vehicles. However, adopting standard defined beamforming approach in highly dynamic vehicular environments often incurs high beam training overheads and reduction in the available airtime for communications, which is mainly due to exchanging pilot signals and exhaustive beam measurements. To this end, we present a multi-modal sensing and fusion learning framework as a potential alternative solution to reduce such overheads. In this framework, we first extract the representative features from the sensing modalities by modality specific encoders, then, utilize multi-head cross-modal attention to learn dependencies and correlations between different modalities, and subsequently fuse the multimodal features to obtain predicted top-k beams so that the best line-of-sight links can be proactively established. To show the generalizability of the proposed framework, we perform a comprehensive experiment in four different vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) scenarios from real world multimodal and 60 GHz mmWave wireless sensing data. The experiment reveals that the proposed framework (i) achieves up to 96.72% accuracy on predicting top-15 beams correctly, (ii) incurs roughly 0.77 dB average power loss, and (iii) improves the overall latency and beam searching space overheads by 86.81% and 76.56% respectively for top-15 beams compared to standard defined approach.

[AI-126] he Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

【速读】:该论文试图解决的问题是:在多跳推理(multi-hop reasoning)场景下,传统神经网络缩放定律(neural scaling laws)所预测的精度降低可线性提升计算效率和能效比(E ∝ bits)是否依然成立。研究发现,在此类复杂推理任务中,将数值精度从16位降至8位或4位反而导致净能耗增加且推理准确率下降,即“量化陷阱”(quantization trap)现象。解决方案的关键在于通过理论分解揭示了两个核心机制:一是硬件强制类型转换(casting overhead)带来的隐式延迟开销,二是顺序推理链中能量摊销失败(sequential energy amortization failure),二者共同导致了能效比恶化。因此,论文指出行业普遍采用的“越小越好”(smaller-is-better)优化策略在复杂推理任务中存在数学上的非最优性。

链接: https://arxiv.org/abs/2602.13595
作者: Henry Han,Xiyang Liu,Xiaodong Wang,Fei Han,Xiaodong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile (E proportional to bits). In this paper, we demonstrate that this scaling law breaks in the context of multi-hop reasoning. We reveal a ‘quantization trap’ where reducing precision from 16-bit to 8/4-bit paradoxically increases more net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, as well as to a sequential energy amortization failure. As a result, scaling law breaking is unavoidable in practice. Our findings suggest that the industry’s “smaller-is-better” heuristic is mathematically counterproductive for complex reasoning tasks.

[AI-127] Hippocampus: An Efficient and Scalable Memory Module for Agent ic AI

【速读】:该论文旨在解决生成式 AI (Generative AI) 在长期任务中因大语言模型(LLM)上下文窗口限制而面临的持久化记忆存储难题,现有基于密集向量数据库或知识图谱遍历的记忆系统存在检索延迟高、存储扩展性差的问题。其解决方案的关键在于提出 Hippocampus 系统,采用紧凑的二进制签名进行语义搜索,并利用无损的 token-ID 流实现内容精确重建;核心创新是动态小波矩阵(Dynamic Wavelet Matrix, DWM),它能压缩并联合索引两类数据流,在压缩域内支持超快检索,从而避免昂贵的密集向量或图计算,实现与内存规模线性可扩展,显著降低端到端检索延迟(最高达 31 倍)和每查询 token 开销(最高达 14 倍),同时保持在 LoCoMo 和 LongMemEval 基准上的准确性。

链接: https://arxiv.org/abs/2602.13594
作者: Yi Li,Lianjie Cao,Faraz Ahmed,Puneet Sharma,Bingzhe Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic AI require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or hybrid), incurring high retrieval latency and poor storage scalability. We introduce Hippocampus, an agentic memory management system that uses compact binary signatures for semantic search and lossless token-ID streams for exact content reconstruction. Its core is a Dynamic Wavelet Matrix (DWM) that compresses and co-indexes both streams to support ultra-fast search in the compressed domain, thus avoiding costly dense-vector or graph computations. This design scales linearly with memory size, making it suitable for long-horizon agentic deployments. Empirically, our evaluation shows that Hippocampus reduces end-to-end retrieval latency by up to 31 \times and cuts per-query token footprint by up to 14 \times , while maintaining accuracy on both LoCoMo and LongMemEval benchmarks.

[AI-128] Differentiable Rule Induction from Raw Sequence Inputs ICLR2025

【速读】:该论文旨在解决不同iable Inductive Logic Programming (ILP) 模型在直接从原始数据中学习规则时面临的显式标签泄露(explicit label leakage)问题,即无法将连续输入映射到符号变量而缺乏输入特征标签的显式监督。解决方案的关键在于将自监督可微聚类模型与一种新型可微ILP模型相结合,从而实现无需显式标签即可从原始数据中学习规则,使生成的规则能够通过数据特征有效描述原始输入。该方法在时间序列和图像数据上验证了其能直观且精确地学习泛化规则的能力。

链接: https://arxiv.org/abs/2602.13583
作者: Kun Gao,Katsumi Inoue,Yongzhi Cao,Hanpin Wang,Feng Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Rule learning-based models are widely used in highly interpretable scenarios due to their transparent structures. Inductive logic programming (ILP), a form of machine learning, induces rules from facts while maintaining interpretability. Differentiable ILP models enhance this process by leveraging neural networks to improve robustness and scalability. However, most differentiable ILP methods rely on symbolic datasets, facing challenges when learning directly from raw data. Specifically, they struggle with explicit label leakage: The inability to map continuous inputs to symbolic variables without explicit supervision of input feature labels. In this work, we address this issue by integrating a self-supervised differentiable clustering model with a novel differentiable ILP model, enabling rule learning from raw data without explicit label leakage. The learned rules effectively describe raw data through its features. We demonstrate that our method intuitively and precisely learns generalized rules from time series and image data.

[AI-129] Who Do LLM s Trust? Human Experts Matter More Than Other LLM s

【速读】:该论文旨在探究大型语言模型(Large Language Models, LLMs)在面对社会信息(如其他代理的回答、工具输出或人类建议)时,是否表现出类似人类的从众行为——即其判断是否会因信息来源的可信度和共识强度而改变,并进一步考察LLMs是否更倾向于采纳来自人类而非其他LLMs的反馈。解决方案的关键在于设计三类二元决策任务(阅读理解、多步推理与道德判断),通过指令微调使模型接收到标注为来自“朋友”、“人类专家”或“其他LLMs”的前置答案,并系统性地操控群体正确性与规模;此外,在第二组实验中引入单一人类与单个LLM之间的直接分歧。结果表明,LLMs显著更倾向于遵循标注为人类专家的信息,即使该信息错误,且对人类专家反馈的修正速度高于对其他LLMs的修正,揭示了当代LLMs存在一种跨决策领域、基于可信度的社会影响机制。

链接: https://arxiv.org/abs/2602.13568
作者: Anooshka Bajaj,Zoran Tiganj
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly operate in environments where they encounter social information such as other agents’ answers, tool outputs, or human recommendations. In humans, such inputs influence judgments in ways that depend on the source’s credibility and the strength of consensus. This paper investigates whether LLMs exhibit analogous patterns of influence and whether they privilege feedback from humans over feedback from other LLMs. Across three binary decision-making tasks, reading comprehension, multi-step reasoning, and moral judgment, we present four instruction-tuned LLMs with prior responses attributed either to friends, to human experts, or to other LLMs. We manipulate whether the group is correct and vary the group size. In a second experiment, we introduce direct disagreement between a single human and a single LLM. Across tasks, models conform significantly more to responses labeled as coming from human experts, including when that signal is incorrect, and revise their answers toward experts more readily than toward other LLMs. These results reveal that expert framing acts as a strong prior for contemporary LLMs, suggesting a form of credibility-sensitive social influence that generalizes across decision domains.

[AI-130] OpAgent : Operator Agent for Web Navigation

【速读】:该论文旨在解决自主网页代理(WebAgent)在真实复杂且动态变化的网站环境中执行用户指令时面临的挑战,尤其是传统监督微调(Supervised Fine-Tuning, SFT)和离线强化学习(Offline Reinforcement Learning, RL)方法因数据分布偏移而导致性能受限的问题。解决方案的关键在于提出一种鲁棒的在线强化学习框架:首先通过分层多任务微调构建具备强指令遵循能力的视觉语言模型(Vision-Language Model, VLM),其次设计一个包含混合奖励机制(Hybrid Reward Mechanism)的在线代理强化学习(Online Agentic RL in the Wild)流程,结合无真值依赖的WebJudge与规则决策树(Rule-based Decision Tree, RDT)以缓解长程导航中的信用分配难题,最终引入模块化架构OpAgent整合规划器、定位器、反思器与总结器,实现错误恢复与自我修正,从而将WebArena基准上的成功率提升至71.6%,达到当前最优水平。

链接: https://arxiv.org/abs/2602.13559
作者: Yuyu Guo,Wenjie Yang,Siyuan Yang,Ziyang Liu,Cheng Chen,Yuan Wei,Yun Hu,Yang Huang,Guoliang Hao,Dongsheng Yuan,Jianming Wang,Xin Chen,Hang Yu,Lei Lei,Peng Di
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatile nature of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets. However, these methods suffer from severe distributional shifts, as offline trajectories fail to capture the stochastic state transitions and real-time feedback of unconstrained wide web environments. In this paper, we propose a robust Online Reinforcement Learning WebAgent, designed to optimize its policy through direct, iterative interactions with unconstrained wide websites. Our approach comprises three core innovations: 1) Hierarchical Multi-Task Fine-tuning: We curate a comprehensive mixture of datasets categorized by functional primitives – Planning, Acting, and Grounding – establishing a Vision-Language Model (VLM) with strong instruction-following capabilities for Web GUI tasks. 2) Online Agentic RL in the Wild: We develop an online interaction environment and fine-tune the VLM using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment with a Rule-based Decision Tree (RDT) for progress reward. This system effectively mitigates the credit assignment challenge in long-horizon navigation. Notably, our RL-enhanced model achieves a 38.1% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines. 3) Operator Agent: We introduce a modular agentic framework, namely \textbfOpAgent, orchestrating a Planner, Grounder, Reflector, and Summarizer. This synergy enables robust error recovery and self-correction, elevating the agent’s performance to a new State-of-the-Art (SOTA) success rate of \textbf71.6%.

[AI-131] Discrete-Space Generative AI Pipeline for Semantic Transmission of Signals

【速读】:该论文旨在解决传统通信系统在物理信道质量恶化时难以维持信号语义完整性的问题,尤其针对物联网(IoT)场景下低带宽、高干扰环境中的可靠传输挑战。其解决方案的关键在于提出一种名为Discernment的语义通信系统,利用生成式AI(Generative AI)模型在离散空间中编码和解码物理信号(如基带无线电和音频)的语义信息,并根据信道损伤模式(建模为擦除信道)动态切换自回归或扩散生成算法,从而在信道容量显著下降时仍能保持分类准确性和重建语义的统计保真度,实现对多种信道条件的自适应调整与高谱效、低模型复杂度的平衡。

链接: https://arxiv.org/abs/2602.13556
作者: Silvija Kokalj-Filipovic,Yagna Kaasaragadda
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:We introduce Discernment, a semantic communication system that transmits the meaning of physical signals (baseband radio and audio) over a technical channel using GenAI models operating in discrete spaces. Discernment dynamically adapts to channel impairments - modeled as erasure channels - by switching between an autoregressive or a diffusion-based generative algorithm, depending on the erasure pattern. Our results show that Discernment maintains semantic integrity even as channel capacity severely degrades, exhibiting very small and graceful performance decline in both classification accuracy and statistical fidelity of the reconstructed meaning. These findings demonstrate Discernment’s ability to adjust to diverse physical channel conditions while maintaining spectral efficiency and low model complexity, making it well suited for IoT deployments and strongly motivating further research on this semantic channel paradigm.

[AI-132] AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)对越狱提示(jailbreak prompts)敏感、易产生有害或违反政策输出的问题,同时克服现有防御方法依赖昂贵微调、侵入式提示重写或外部护栏导致延迟增加和有用性下降的局限性。其解决方案的关键在于提出AISA(Activation-Induced Safety Alignment),一种轻量级、单次遍历的防御机制:首先通过时空分析定位模型内部固有的安全意识,发现意图判别信号广泛编码于特定注意力头在生成前结构标记处的缩放点积输出中;随后利用自动选择的一组紧凑注意力头提取可解释的提示风险评分,并基于该评分在logits层进行无参数调整——即根据推断风险比例调节解码分布,从正常生成到校准拒绝,无需修改模型参数、添加辅助模块或多次推理,从而在保持模型效用的同时显著提升鲁棒性和迁移能力。

链接: https://arxiv.org/abs/2602.13547
作者: Weiming Song,Xuan Xie,Ruiping Yin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) remain vulnerable to jailbreak prompts that elicit harmful or policy-violating outputs, while many existing defenses rely on expensive fine-tuning, intrusive prompt rewriting, or external guardrails that add latency and can degrade helpfulness. We present AISA, a lightweight, single-pass defense that activates safety behaviors already latent inside the model rather than treating safety as an add-on. AISA first localizes intrinsic safety awareness via spatiotemporal analysis and shows that intent-discriminative signals are broadly encoded, with especially strong separability appearing in the scaled dot-product outputs of specific attention heads near the final structural tokens before generation. Using a compact set of automatically selected heads, AISA extracts an interpretable prompt-risk score with minimal overhead, achieving detector-level performance competitive with strong proprietary baselines on small (7B) models. AISA then performs logits-level steering: it modulates the decoding distribution in proportion to the inferred risk, ranging from normal generation for benign prompts to calibrated refusal for high-risk requests – without changing model parameters, adding auxiliary modules, or requiring multi-pass inference. Extensive experiments spanning 13 datasets, 12 LLMs, and 14 baselines demonstrate that AISA improves robustness and transfer while preserving utility and reducing false refusals, enabling safer deployment even for weakly aligned or intentionally risky model variants.

[AI-133] REMem: Reasoning with Episodic Memory in Language Agent ICLR2026

【速读】:该论文旨在解决当前语言智能体在记忆能力上的局限性问题,即现有系统主要依赖语义记忆(semantic memory),难以有效实现对交互历史的事件性回忆(episodic recollection)与跨事件推理(episodic reasoning)。其核心挑战在于如何显式建模时间感知的事件结构并支持复杂推理过程。解决方案的关键是提出REMem框架,该框架包含两个阶段:首先通过离线索引构建一个融合时间感知摘要与事实的混合记忆图(hybrid memory graph),实现对经验的结构化存储;其次在在线推理阶段,利用具备精心设计工具的代理检索器(agentic retriever)对记忆图进行迭代式检索与推理,从而显著提升在多任务 episodic memory 基准测试中的表现,尤其在回忆和推理准确性上分别优于当前最优方法 Mem0 和 HippoRAG 2 达 3.4% 和 13.4%。

链接: https://arxiv.org/abs/2602.13530
作者: Yiheng Shu,Saisri Padmaja Jonnalagedda,Xiang Gao,Bernal Jiménez Gutiérrez,Weijian Qi,Kamalika Das,Huan Sun,Yu Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by The Fourteenth International Conference on Learning Representations (ICLR 2026) as poster

点击查看摘要

Abstract:Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not yet capable of effectively recollecting and reasoning over interaction histories. We identify and formalize the core challenges of episodic recollection and reasoning from this gap, and observe that existing work often overlooks episodicity, lacks explicit event modeling, or overemphasizes simple retrieval rather than complex reasoning. We present REMem, a two-phase framework for constructing and reasoning with episodic memory: 1) Offline indexing, where REMem converts experiences into a hybrid memory graph that flexibly links time-aware gists and facts. 2) Online inference, where REMem employs an agentic retriever with carefully curated tools for iterative retrieval over the memory graph. Comprehensive evaluation across four episodic memory benchmarks shows that REMem substantially outperforms state-of-the-art memory systems such as Mem0 and HippoRAG 2, showing 3.4% and 13.4% absolute improvements on episodic recollection and reasoning tasks, respectively. Moreover, REMem also demonstrates more robust refusal behavior for unanswerable questions.

[AI-134] Singular Vectors of Attention Heads Align with Features

【速读】:该论文试图解决的问题是:在语言模型的机制可解释性研究中,如何可靠地识别特征表示(feature representations)——特别是是否存在理论依据支持从注意力矩阵的奇异向量(singular vectors)中推断出这些特征表示。此前多项研究隐含假设奇异向量与特征对齐,但缺乏充分的理论和实证支撑。论文的关键解决方案在于:首先通过一个可直接观测特征的模型验证奇异向量确实稳健地对齐于真实特征;其次从理论上证明这种对齐在多种条件下均可预期;最后提出“稀疏注意力分解”(sparse attention decomposition)作为可操作的检验指标,并在真实模型中发现其符合预测,从而为利用奇异向量识别特征提供了一个理论合理且可验证的方法基础。

链接: https://arxiv.org/abs/2602.13524
作者: Gabriel Franco,Carson Loughridge,Mark Crovella
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in a manner consistent with predictions in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.

[AI-135] Arming Data Agents with Tribal Knowledge

【速读】:该论文旨在解决自然语言到SQL(NL2SQL)代理在面对大规模真实数据库时因缺乏对数据语义的正确理解而产生错误的问题,尤其是由于代理对列意图等数据特征存在误解所导致的推理偏差。解决方案的关键在于提出Tk-Boost框架,该框架通过收集NL2SQL代理在特定数据库上的交互经验,识别其推理过程中的错误模式,并生成“部落知识”(tribal knowledge)——即针对特定查询特征的修正性知识;随后,利用适用条件(applicability conditions)对这些知识进行索引,使系统能够在新查询中准确检索并反馈给代理,从而在SQL生成过程中纠正其认知偏差,显著提升NL2SQL代理的准确性。

链接: https://arxiv.org/abs/2602.13521
作者: Shubham Agarwal,Asim Biswal,Sepanta Zeighami,Alvin Cheung,Joseph Gonzalez,Aditya G. Parameswaran
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural language to SQL (NL2SQL) translation enables non-expert users to query relational databases through natural language. Recently, NL2SQL agents, powered by the reasoning capabilities of Large Language Models (LLMs), have significantly advanced NL2SQL translation. Nonetheless, NL2SQL agents still make mistakes when faced with large-scale real-world databases because they lack knowledge of how to correctly leverage the underlying data (e.g., knowledge about the intent of each column) and form misconceptions about the data when querying it, leading to errors. Prior work has studied generating facts about the database to provide more context to NL2SQL agents, but such approaches simply restate database contents without addressing the agent’s misconceptions. In this paper, we propose Tk-Boost, a bolt-on framework for augmenting any NL2SQL agent with tribal knowledge: knowledge that corrects the agent’s misconceptions in querying the database accumulated through experience using the database. To accumulate experience, Tk-Boost first asks the NL2SQL agent to answer a few queries on the database, identifies the agent’s misconceptions by analyzing its mistakes on the database, and generates tribal knowledge to address them. To enable accurate retrieval, Tk-Boost indexes this knowledge with applicability conditions that specify the query features for which the knowledge is useful. When answering new queries, Tk-Boost uses this knowledge to provide feedback to the NL2SQL agent, resolving the agent’s misconceptions during SQL generation, and thus improving the agent’s accuracy. Extensive experiments across the BIRD and Spider 2.0 benchmarks with various NL2SQL agents shows Tk-Boost improves NL2SQL agents accuracy by up to 16.9% on Spider 2.0 and 13.7% on BIRD

[AI-136] SPILLage: Agent ic Oversharing on the Web

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的网络代理(web agents)在执行用户任务时,因无意中暴露与任务无关的用户资源信息而导致的隐私泄露问题,即“自然代理过度共享”(Natural Agentic Oversharing)。其核心挑战在于,现有研究主要关注文本层面的信息泄露,而忽略了代理在网页上的行为痕迹(如点击、滚动和导航模式)同样可能构成敏感信息泄露。解决方案的关键在于提出SPILLage框架,从“渠道”(内容 vs. 行为)和“直接性”(显式 vs. 隐式)两个维度对过度共享进行系统建模,并通过实证发现:行为层面的过度共享占比高达内容层面的5倍,且即使采用提示工程(prompt-level mitigation)也难以缓解;相反,提前过滤掉任务无关信息可使任务成功率提升最高达17.9%,表明减少过度共享不仅有助于隐私保护,还能提升任务执行效率。

链接: https://arxiv.org/abs/2602.13516
作者: Jaechul Roh,Eugene Bagdasarian,Hamed Haddadi,Ali Shahin Shamsabadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-powered agents are beginning to automate user’s tasks across the open web, often with access to user resources such as emails and calendars. Unlike standard LLMs answering questions in a controlled ChatBot setting, web agents act “in the wild”, interacting with third parties and leaving behind an action trace. Therefore, we ask the question: how do web agents handle user resources when accomplishing tasks on their behalf across live websites? In this paper, we formalize Natural Agentic Oversharing – the unintentional disclosure of task-irrelevant user information through an agent trace of actions on the web. We introduce SPILLage, a framework that characterizes oversharing along two dimensions: channel (content vs. behavior) and directness (explicit vs. implicit). This taxonomy reveals a critical blind spot: while prior work focuses on text leakage, web agents also overshare behaviorally through clicks, scrolls, and navigation patterns that can be monitored. We benchmark 180 tasks on live e-commerce sites with ground-truth annotations separating task-relevant from task-irrelevant attributes. Across 1,080 runs spanning two agentic frameworks and three backbone LLMs, we demonstrate that oversharing is pervasive with behavioral oversharing dominates content oversharing by 5x. This effect persists – and can even worsen – under prompt-level mitigation. However, removing task-irrelevant information before execution improves task success by up to 17.9%, demonstrating that reducing oversharing improves task success. Our findings underscore that protecting privacy in web agents is a fundamental challenge, requiring a broader view of “output” that accounts for what agents do on the web, not just what they type. Our datasets and code are available at this https URL.

[AI-137] γ-weakly θ-up-concavity: Linearizable Non-Convex Optimization with Applications to DR-Submodular and OSS Functions

【速读】:该论文旨在解决单调非凸函数优化这一基础性难题,其应用场景涵盖机器学习与组合优化等多个领域。解决方案的关键在于提出了一种新的第一阶条件——γ-弱θ-上凹性(γ-weakly θ-up-concavity),该条件刻画了广泛的一类单调非凸函数,并统一推广了DR-子模函数和单侧光滑(One-Sided Smooth, OSS)函数。核心理论贡献表明,满足该条件的函数具有上线性可逼近性(upper-linearizable):对于任意可行点,可构造一个线性代理函数,其收益能以仅依赖于γ、θ及可行集几何结构的常数因子近似原非线性目标。这一性质为离线优化及在线场景下的静态与动态遗憾界提供了统一且紧致的近似保证,同时在DR-子模最大化问题中恢复最优近似系数,并显著改进OSS优化在拟阵约束下的现有近似性能。

链接: https://arxiv.org/abs/2602.13506
作者: Mohammad Pedramfar,Vaneet Aggarwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Optimizing monotone non-convex functions is a fundamental challenge across machine learning and combinatorial optimization. We introduce and study \gamma -weakly \theta -up-concavity, a novel first-order condition that characterizes a broad class of such functions. This condition provides a powerful unifying framework, strictly generalizing both DR-submodular functions and One-Sided Smooth (OSS) functions. Our central theoretical contribution demonstrates that \gamma -weakly \theta -up-concave functions are upper-linearizable: for any feasible point, we can construct a linear surrogate whose gains provably approximate the original non-linear objective. This approximation holds up to a constant factor, namely the approximation coefficient, dependent solely on \gamma , \theta , and the geometry of the feasible set. This linearizability yields immediate and unified approximation guarantees for a wide range of problems. Specifically, we obtain unified approximation guarantees for offline optimization as well as static and dynamic regret bounds in online settings via standard reductions to linear optimization. Moreover, our framework recovers the optimal approximation coefficient for DR-submodular maximization and significantly improves existing approximation coefficients for OSS optimization, particularly over matroid constraints.

[AI-138] ranslating Dietary Standards into Healthy Meals with Minimal Substitutions

【速读】:该论文旨在解决个性化饮食系统中如何在不牺牲便利性或经济性的前提下提升营养质量的问题。其核心解决方案是构建一个端到端框架,将美国农业部(USDA)的营养目标转化为现实可行的餐食方案;关键在于利用What We Eat in America (WWEIA) 数据识别出34个可解释的餐食原型(meal archetypes),并以此作为条件控制生成模型与份量预测模型,从而在保持真实餐食组成相近的前提下,使生成餐食对推荐每日摄入量(RDI)的符合度提高47.0%。通过允许一至三类食物替换,所生成餐食平均营养提升10%,成本降低19–32%,实现了营养优化与预算约束的协同改进。

链接: https://arxiv.org/abs/2602.13502
作者: Trevor Chan,Ilias Tagkopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
备注: 49 pages, 4 figures

点击查看摘要

Abstract:An important goal for personalized diet systems is to improve nutritional quality without compromising convenience or affordability. We present an end-to-end framework that converts dietary standards into complete meals with minimal change. Using the What We Eat in America (WWEIA) intake data for 135,491 meals, we identify 34 interpretable meal archetypes that we then use to condition a generative model and a portion predictor to meet USDA nutritional targets. In comparisons within archetypes, generated meals are better at following recommended daily intake (RDI) targets by 47.0%, while remaining compositionally close to real meals. Our results show that by allowing one to three food substitutions, we were able to create meals that were 10% more nutritious, while reducing costs 19-32%, on average. By turning dietary guidelines into realistic, budget-aware meals and simple swaps, this framework can underpin clinical decision support, public-health programs, and consumer apps that deliver scalable, equitable improvements in everyday nutrition.

[AI-139] rasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

【速读】:该论文旨在解决Muon-style优化器在训练过程中因正交化操作丢失更新幅度信息而导致的对学习率超参数敏感及高能突增(high-energy bursts)问题。其核心解决方案是提出TrasMuon(Trust Region Adaptive Scaling Muon),通过两个关键机制实现稳定性和效率的平衡:(i) 全局RMS校准以统一更新尺度,(ii) 基于能量比的信赖域裁剪(energy-based trust-region clipping),将更新限制在相对稳定的能量区间内。此设计在保持Muon近等距几何特性的同时,有效抑制了由高能异常值引发的不稳定性,从而提升优化效率与训练鲁棒性。

链接: https://arxiv.org/abs/2602.13498
作者: Peng Cheng,Jiucheng Zang,Qingnan Li,Liheng Ma,Yufei Cui,Yingxue Zhang,Boxing Chen,Ming Jian,Wen Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (\textbfTrust \textbfRegion \textbfAdaptive \textbfScaling \textbfMuon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon’s superior stability and robustness.

[AI-140] Future of Edge AI in biodiversity monitoring

【速读】:该论文试图解决生态决策因生物多样性数据采集与分析之间存在延迟而受限的问题,其解决方案的关键在于利用边缘计算(Edge Computing)技术,将处理能力迁移至传感器附近,结合边缘人工智能(Edge AI),实现设备端的本地推理(on-device inference),从而减少对数据传输和持续联网的依赖,推动生物多样性监测从被动记录向自主、响应式感知系统转变。通过系统性分析82项相关研究,作者揭示了不同硬件平台、AI模型优化与无线通信策略如何影响生态推断精度、部署寿命及运行可行性,并指出未来需加强生态学家、工程师与数据科学家之间的协作,以确保系统设计与生态问题、实地约束及伦理考量相匹配。

链接: https://arxiv.org/abs/2602.13496
作者: Aude Vuilliomenet,Kate E. Jones,Duncan Wilson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 41 pages, 5 figures, 4 tables

点击查看摘要

Abstract:1. Many ecological decisions are slowed by the gap between collecting and analysing biodiversity data. Edge computing moves processing closer to the sensor, with edge artificial intelligence (AI) enabling on-device inference, reducing reliance on data transfer and continuous connectivity. In principle, this shifts biodiversity monitoring from passive logging towards autonomous, responsive sensing systems. In practice, however, adoption remains fragmented, with key architectural trade-offs, performance constraints, and implementation challenges rarely reported systematically. 2. Here, we analyse 82 studies published between 2017 and 2025 that implement edge computing for biodiversity monitoring across acoustic, vision, tracking, and multi-modal systems. We synthesise hardware platforms, AI model optimisation, and wireless communication to critically assess how design choices shape ecological inference, deployment longevity, and operational feasibility. 3. Publications increased from 3 in 2017 to 19 in 2025. We identify four system types: (I) TinyML, low-power microcontrollers (MCUs) for single-taxon or rare-event detection; (II) Edge AI, single-board computers (SBCs) for multi-species classification and real-time alerts; (III) Distributed edge AI; and (IV) Cloud AI for retrospective processing pipelines. Each system type represents context-dependent trade-offs among power consumption, computational capability, and communication requirements. 4. Our analysis reveals the evolution of edge computing systems from proof-of-concept to robust, scalable tools. We argue that edge computing offers opportunities for responsive biodiversity management, but realising this potential requires increased collaboration between ecologists, engineers, and data scientists to align model development and system design with ecological questions, field constraints, and ethical considerations.

[AI-141] Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity

【速读】:该论文旨在解决联邦低秩适配(Federated Low-Rank Adaptation, FedLoRA)中因客户端异构性导致的“秩坍缩”(rank collapse)问题,即全局更新能量集中在最小共享秩上,从而造成性能下降且对秩配置高度敏感。其解决方案的关键在于提出raFLoRA方法,通过将本地更新按秩分区并基于有效客户端贡献加权聚合各分区,克服了传统聚合中秩无关权重与秩相关客户端贡献之间的不匹配问题,从而有效防止秩坍缩,提升模型性能并保持通信效率。

链接: https://arxiv.org/abs/2602.13486
作者: Fei Wu,Jia Hu,Geyong Min,Shiqiang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated low-rank adaptation (FedLoRA) has facilitated communication-efficient and privacy-preserving fine-tuning of foundation models for downstream tasks. In practical federated learning scenarios, client heterogeneity in system resources and data distributions motivates heterogeneous LoRA ranks across clients. We identify a previously overlooked phenomenon in heterogeneous FedLoRA, termed rank collapse, where the energy of the global update concentrates on the minimum shared rank, resulting in suboptimal performance and high sensitivity to rank configurations. Through theoretical analysis, we reveal the root cause of rank collapse: a mismatch between rank-agnostic aggregation weights and rank-dependent client contributions, which systematically suppresses higher-rank updates at a geometric rate over rounds. Motivated by this insight, we propose raFLoRA, a rank-partitioned aggregation method that decomposes local updates into rank partitions and then aggregates each partition weighted by its effective client contributions. Extensive experiments across classification and reasoning tasks show that raFLoRA prevents rank collapse, improves model performance, and preserves communication efficiency compared to state-of-the-art FedLoRA baselines.

[AI-142] Finding Highly Interpretable Prompt-Specific Circuits in Language Models

【速读】:该论文旨在解决语言模型中任务级电路(circuit)假设的局限性问题,即传统方法通过平均多个提示(prompt)来识别单一稳定机制,忽略了提示特异性(prompt-specific)的结构。研究发现,在同一任务下,不同提示模板会诱导系统性不同的因果机制,导致“单一生电路”假设可能掩盖关键的内部结构。解决方案的关键在于提出ACC++(Attention Causal Communication++),这是对原有ACC方法的改进,能够在单次前向传播中提取更清晰、低维的注意力头内部因果信号,无需替换模型或激活修补(activation patching),并显著降低归因噪声。该方法使研究人员能够以提示家族(prompt family)为单位识别和描述具有代表性的机制,从而实现可扩展且人类可解释的机械解释(mechanistic explanation)。

链接: https://arxiv.org/abs/2602.13483
作者: Gabriel Franco,Lucas M. Tassis,Azalea Rohr,Mark Crovella
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. Most prior work identifies circuits at the task level by averaging across many prompts, implicitly assuming a single stable mechanism per task. We show that this assumption can obscure a crucial source of structure: circuits are prompt-specific, even within a fixed task. Building on attention causal communication (ACC) (Franco Crovella, 2025), we introduce ACC++, refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. Like ACC, our approach does not require replacement models (e.g., SAEs) or activation patching; ACC++ further improves circuit precision by reducing attribution noise. Applying ACC++ to indirect object identification (IOI) in GPT-2, Pythia, and Gemma 2, we find there is no single circuit for IOI in any model: different prompt templates induce systematically different mechanisms. Despite this variation, prompts cluster into prompt families with similar circuits, and we propose a representative circuit for each family as a practical unit of analysis. Finally, we develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features and assemble mechanistic explanations for prompt families behavior. Together, our results recast circuits as a meaningful object of study by shifting the unit of analysis from tasks to prompts, enabling scalable circuit descriptions in the presence of prompt-specific mechanisms.

[AI-143] Comparing Classifiers: A Case Study Using PyCM

【速读】:该论文试图解决多分类模型评估中因单一或标准评价指标导致性能差异被忽视的问题,即如何在复杂场景下更全面、深入地理解模型表现。其解决方案的关键在于引入PyCM库构建一个多维评估框架,通过两个不同案例展示了不同评价指标的选择会显著改变对模型效能的解读,从而揭示出标准指标可能遗漏的细微但重要的性能权衡。

链接: https://arxiv.org/abs/2602.13482
作者: Sadra Sabouri,Alireza Zolanvari,Sepand Haghighi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Selecting an optimal classification model requires a robust and comprehensive understanding of the performance of the model. This paper provides a tutorial on the PyCM library, demonstrating its utility in conducting deep-dive evaluations of multi-class classifiers. By examining two different case scenarios, we illustrate how the choice of evaluation metrics can fundamentally shift the interpretation of a model’s efficacy. Our findings emphasize that a multi-dimensional evaluation framework is essential for uncovering small but important differences in model performance. However, standard metrics may miss these subtle performance trade-offs.

[AI-144] OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage ICML2026

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems)中尚未充分研究的安全漏洞问题,尤其是针对当前主流的“协调器模式”(orchestrator setup)所暴露的隐私泄露风险。其关键解决方案在于通过红队测试(red-teaming)揭示了一种名为 OMNI-LEAK 的新型攻击向量:攻击者仅需一次间接提示注入(indirect prompt injection),即可迫使多个受保护的专用智能体泄露敏感数据,即便系统已部署基本的数据访问控制机制。研究表明,前沿模型无论是否具备推理能力均易受此类攻击,且无需掌握系统内部实现细节,凸显了从单智能体安全研究向多智能体场景扩展的紧迫性与必要性。

链接: https://arxiv.org/abs/2602.13477
作者: Akshat Naik,Jay Culligan,Yarin Gal,Philip Torr,Rahaf Aljundi,Alasdair Paren,Adel Bibi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Prepint, under review for ICML 2026

点击查看摘要

Abstract:As Large Language Model (LLM) agents become more capable, their coordinated use in the form of multi-agent systems is anticipated to emerge as a practical paradigm. Prior work has examined the safety and misuse risks associated with agents. However, much of this has focused on the single-agent case and/or setups missing basic engineering safeguards such as access control, revealing a scarcity of threat modeling in multi-agent systems. We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup, in which a central agent decomposes and delegates tasks to specialized agents. Through red-teaming a concrete setup representative of a likely future use case, we demonstrate a novel attack vector, OMNI-LEAK, that compromises several agents to leak sensitive data through a single indirect prompt injection, even in the \textitpresence of data access control. We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable, even when the attacker lacks insider knowledge of the implementation details. Our work highlights the importance of safety research to generalize from single-agent to multi-agent settings, in order to reduce the serious risks of real-world privacy breaches and financial losses and overall public trust in AI agents.

[AI-145] NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines

【速读】:该论文旨在解决基础模型(foundation models)在脑电图(EEG)分析中因数据需求量大、参数量高而导致计算成本高昂,难以部署于资源受限的临床环境的问题;同时克服通用自动化机器学习框架因缺乏神经生理学先验知识而生成科学上不可信解的局限。其解决方案的关键在于提出 NeuroWeaver,一个统一的自主进化代理,通过将管道工程重构为离散约束优化问题,采用领域启发的子空间初始化(Domain-Informed Subspace Initialization)限定搜索空间至神经科学合理的流形,并结合多目标进化优化(Multi-Objective Evolutionary Optimization)动态平衡性能、新颖性与效率,实现跨不同 EEG 数据集和任务的泛化能力。

链接: https://arxiv.org/abs/2602.13473
作者: Guoan Wang,Shihao Yang,Jun-En Ding,Hao Zhu,Feng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors incur prohibitive computational costs, thereby impeding deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited for this domain, as exploration within an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent designed to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we employ a Domain-Informed Subspace Initialization to confine the search to neuroscientifically plausible manifolds, coupled with a Multi-Objective Evolutionary Optimization that dynamically balances performance, novelty, and efficiency via self-reflective refinement. Empirical evaluations across five heterogeneous benchmarks demonstrate that NeuroWeaver synthesizes lightweight solutions that consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models, despite utilizing significantly fewer parameters.

[AI-146] MoltNet: Understanding Social Behavior of AI Agents in the Agent -Native MoltBook

【速读】:该论文旨在解决大规模人工智能(AI)代理群体中社会互动机制的缺失理解问题,特别是如何在真实世界平台中观察和分析这些代理之间是否以及如何再现人类核心的社会行为模式。其解决方案的关键在于利用MoltBook这一专为AI代理设计的社会网络平台所收集的大规模实证数据,基于社会学与社会心理学理论框架,从意图动机、规范模板、激励与行为漂移、情绪与传染四个维度进行系统性分析,揭示了代理对社会奖励的高度敏感性和快速形成社区特定交互模板的现象,同时指出其知识驱动而非人格一致性、情感互惠有限及对话参与度弱等特征,从而为理解、设计和治理大规模代理社区提供了首个实证基础。

链接: https://arxiv.org/abs/2602.13458
作者: Yi Feng,Chen Huang,Zhibo Man,Ryner Tan,Long P. Hoang,Shaoyang Xu,Wenxuan Zhang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale communities of AI agents are becoming increasingly prevalent, creating new environments for agent-agent social interaction. Prior work has examined multi-agent behavior primarily in controlled or small-scale settings, limiting our understanding of emergent social dynamics at scale. The recent emergence of MoltBook, a social networking platform designed explicitly for AI agents, presents a unique opportunity to study whether and how these interactions reproduce core human social mechanisms. We present MoltNet, a large-scale empirical analysis of agent interaction on MoltBook using data collected in early 2026. Grounded in sociological and social-psychological theory, we examine behavior along four dimensions: intent and motivation, norms and templates, incentives and behavioral drift, emotion and contagion. Our analysis revealed that agents strongly respond to social rewards and rapidly converge on community-specific interaction templates, resembling human patterns of incentive sensitivity and normative conformity. However, they are predominantly knowledge-driven rather than persona-aligned, and display limited emotional reciprocity along with weak dialogic engagement, which diverges systematically from human online communities. Together, these results reveal both similarities and differences between artificial and human social systems and provide an empirical foundation for understanding, designing, and governing large-scale agent communities. Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.13458 [cs.SI] (or arXiv:2602.13458v1 [cs.SI] for this version) https://doi.org/10.48550/arXiv.2602.13458 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yi Feng [view email] [v1] Fri, 13 Feb 2026 21:03:59 UTC (4,482 KB)

[AI-147] End-to-End NOMA with Perfect and Quantized CSI Over Rayleigh Fading Channels

【速读】:该论文旨在解决非正交多址接入(Non-Orthogonal Multiple Access, NOMA)在瑞利衰落信道下如何设计鲁棒、干扰感知的调制策略问题,尤其针对传统方法要么假设加性高斯白噪声(AWGN)信道,要么未采用端到端学习框架来建模实际无线信道的问题。解决方案的关键在于提出一个端到端自编码器(Autoencoder, AE)框架,直接将瑞利衰落信道嵌入训练与推理过程,从而学习具备干扰感知和信道自适应能力的超星座图(super-constellations)。此外,为模拟实际场景中的信道状态信息(Channel State Information, CSI)受限情况,引入了均匀量化和Lloyd-Max量化两种有限反馈机制,并通过仿真验证了Lloyd-Max量化在误比特率(BER)性能上的优势,表明该AE框架能有效学习出适用于真实CSI约束下的鲁棒NOMA信号策略。

链接: https://arxiv.org/abs/2602.13446
作者: Selma Benouadah,Mojtaba Vaezi,Ruizhan Shen,Hamid Jafarkhani
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted for publication at IEEE International Conference on Communications (ICC), 2026

点击查看摘要

Abstract:An end-to-end autoencoder (AE) framework is developed for downlink non-orthogonal multiple access (NOMA) over Rayleigh fading channels, which learns interference-aware and channel-adaptive super-constellations. While existing works either assume additive white Gaussian noise channels or treat fading channels without a fully end-to-end learning approach, our framework directly embeds the wireless channel into both training and inference. To account for practical channel state information (CSI), we further incorporate limited feedback via both uniform and Lloyd-Max quantization of channel gains and analyze their impact on AE training and bit error rate (BER) performance. Simulation results show that, with perfect CSI, the proposed AE outperforms the existing analytical NOMA schemes. In addition, Lloyd-Max quantization achieves superior BER performance compared to uniform quantization. These results demonstrate that end-to-end AEs trained directly over Rayleigh fading can effectively learn robust, interference-aware signaling strategies, paving the way for NOMA deployment in fading environments with realistic CSI constraints.

[AI-148] Backdooring Bias in Large Language Models

【速读】:该论文旨在解决在白盒威胁模型下,语法触发和语义触发的后门攻击对大型语言模型(Large Language Models, LLMs)偏见诱导的有效性及其防御机制的局限性问题。研究发现,语义触发攻击在诱发负面偏见方面比语法触发攻击更有效,而两者均难以引发正面偏见;同时,现有的两类防御方法——模型内生型与模型外生型后门移除策略——虽能缓解攻击,但往往导致性能显著下降或计算开销过高。关键在于通过大规模实验(超1000次评估)系统比较了不同攻击类型在高污染比例和数据增强下的表现,并揭示了当前防御手段在实用性与效率之间的权衡困境。

链接: https://arxiv.org/abs/2602.13427
作者: Anudeep Das,Prach Chantasantitam,Gurjot Singh,Lipeng He,Mariia Ponomarenko,Florian Kerschbaum
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder’s LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker’s ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor removal, are able to mitigate these attacks. Our analysis reveals numerous new findings. We discover that while both syntactically- and semantically-triggered attacks can effectively induce the target behaviour, and largely preserve utility, semantically-triggered attacks are generally more effective in inducing negative biases, while both backdoor types struggle with causing positive biases. Furthermore, while both defense types are able to mitigate these backdoors, they either result in a substantial drop in utility, or require high computational overhead.

[AI-149] On-Policy Supervised Fine-Tuning for Efficient Reasoning

【速读】:该论文旨在解决大推理模型(Large Reasoning Models, LRM)在强化学习(Reinforcement Learning, RL)训练中因复杂多奖励目标导致的训练不稳定与次优权衡问题,尤其是在追求正确性(correctness)与简洁性(brevity)之间的平衡时。解决方案的关键在于通过理论分析识别出当前多奖励范式中的两个根本性错位:KL正则化在可直接验证正确性和长度时失去作用,以及多奖励信号下组内归一化变得模糊。作者移除这两项并简化奖励为基于截断的长度惩罚,使优化问题退化为对自生成数据进行过滤后的一阶监督微调(on-policy SFT),从而在保持原始准确率的同时将链式思维(Chain-of-Thought, CoT)长度减少高达80%,显著优于复杂RL方法,并提升训练效率(GPU内存减少50%,收敛速度加快70%)。

链接: https://arxiv.org/abs/2602.13407
作者: Anhao Zhao,Ziyang Chen,Junlong Tong,Yingqi Fan,Fanghua Ye,Shuhao Li,Yunpu Ma,Wenjie Li,Xiaoyu Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80 while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at this https URL.

[AI-150] MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents AAMAS2026

【速读】:该论文旨在解决人工智能代理在面对冲突且具有层级结构的人类道德规范时,如何有效评估其道德对齐性(moral alignment)的问题,这是AI安全、道德哲学与认知科学交叉领域的关键挑战。解决方案的关键在于提出两个核心创新:一是引入“道德链”(Morality Chains),将道德规范形式化为有序的义务约束(ordered deontic constraints),从而结构化地表示复杂道德情境;二是构建“道德训练场”(MoralityGym),一个包含98个伦理困境问题的基准环境,以类电车难题(trolley-dilemma-style)的方式呈现,并通过解耦任务执行与道德评估、引入新的道德度量指标(Morality Metric),使心理学和哲学洞见能够融入对规范敏感推理的评价体系。这一框架为开发在现实复杂场景中更可靠、透明且符合伦理的AI系统提供了坚实基础。

链接: https://arxiv.org/abs/2602.13372
作者: Simon Rosen,Siddarth Singh,Ebenezer Gelo,Helen Sarah Robertson,Ibrahim Suder,Victoria Williams,Benjamin Rosman,Geraud Nangue Tasse,Steven James
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at AAMAS 2026

点击查看摘要

Abstract:Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.

[AI-151] Assessing Spear-Phishing Website Generation in Large Language Model Coding Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成用于网络钓鱼攻击(如鱼叉式钓鱼网站)的代码方面潜在滥用风险的问题。其关键解决方案是系统性评估40种不同LLM编码代理在生成具有恶意用途的网站代码基(200个网站代码库及对应日志)中的能力与意愿,并通过分析发现哪些LLM性能指标(如指令遵循能力、代码生成质量等)与生成高风险代码的能力更相关,从而为防御此类滥用提供数据基础和实证依据。

链接: https://arxiv.org/abs/2602.13363
作者: Tailia Malloy,Tegawende F. Bissyande
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 Pages, 7 Figures, 1 Table. Accepted to the conference Human Computer Interaction International

点击查看摘要

Abstract:Large Language Models are expanding beyond being a tool humans use and into independent agents that can observe an environment, reason about solutions to problems, make changes that impact those environments, and understand how their actions impacted their environment. One of the most common applications of these LLM Agents is in computer programming, where agents can successfully work alongside humans to generate code while controlling programming environments or networking systems. However, with the increasing ability and complexity of these agents comes dangers about the potential for their misuse. A concerning application of LLM agents is in the domain cybersecurity, where they have the potential to greatly expand the threat imposed by attacks such as social engineering. This is due to the fact that LLM Agents can work autonomously and perform many tasks that would normally require time and effort from skilled human programmers. While this threat is concerning, little attention has been given to assessments of the capabilities of LLM coding agents in generating code for social engineering attacks. In this work we compare different LLMs in their ability and willingness to produce potentially dangerous code bases that could be misused by cyberattackers. The result is a dataset of 200 website code bases and logs from 40 different LLM coding agents. Analysis of models shows which metrics of LLMs are more and less correlated with performance in generating spear-phishing sites. Our analysis and the dataset we present will be of interest to researchers and practitioners concerned in defending against the potential misuse of LLMs in spear-phishing.

[AI-152] A Formal Framework for the Explanation of Finite Automata Decisions

【速读】:该论文旨在解决有限自动机(Finite Automata, FA)在处理特定输入字符串时行为解释性不足的问题,尤其是如何从输入字符层面提供最小化且无偏的解释,以明确哪些输入特征导致了最终的接受或拒绝结果。其解决方案的关键在于提出了一种高效算法,用于确定FA对某一特定输入词的所有最小解释集,即识别出能够解释输出结果的最少字符集合,以及实现结果改变所需的最小输入修改,从而在复杂自动机结构中提取出可理解、可操作的行为原因。

链接: https://arxiv.org/abs/2602.13351
作者: Jaime Cuartas Granada,Alexey Ignatiev,Peter J. Stuckey
机构: 未知
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Finite automata (FA) are a fundamental computational abstraction that is widely used in practice for various tasks in computer science, linguistics, biology, electrical engineering, and artificial intelligence. Given an input word, an FA maps the word to a result, in the simple case “accept” or “reject”, but in general to one of a finite set of results. A question that then arises is: why? Another question is: how can we modify the input word so that it is no longer accepted? One may think that the automaton itself is an adequate explanation of its behaviour, but automata can be very complex and difficult to make sense of directly. In this work, we investigate how to explain the behaviour of an FA on an input word in terms of the word’s characters. In particular, we are interested in minimal explanations: what is the minimal set of input characters that explains the result, and what are the minimal changes needed to alter the result? In this paper, we propose an efficient method to determine all minimal explanations for the behaviour of an FA on a particular word. This allows us to give unbiased explanations about which input features are responsible for the result. Experiments show that our approach scales well, even when the underlying problem is challenging.

[AI-153] Contrastive explanations of BDI agents AAMAS2026

【速读】:该论文旨在解决自主系统在提供解释时如何更有效地支持透明度和信任建立的问题,特别是针对用户常提出的对比性问题(“为什么做了X而不是F?”)缺乏有效回应机制的现状。其解决方案的关键在于扩展了基于信念-欲望-意图(Belief-Desire-Intention, BDI)框架的解释生成机制,使其能够回答对比性问题;计算评估表明,使用对比性问题可显著缩短解释长度,而人类受试者实验进一步验证了此类解释在提升用户信任、理解感知和对系统正确性的信心方面具有潜在优势。

链接: https://arxiv.org/abs/2602.13323
作者: Michael Winikoff
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAMAS 2026 paper with added supplementary material

点击查看摘要

Abstract:The ability of autonomous systems to provide explanations is important for supporting transparency and aiding the development of (appropriate) trust. Prior work has defined a mechanism for Belief-Desire-Intention (BDI) agents to be able to answer questions of the form why did you do action X ?''. However, we know that we ask \emphcontrastive questions (why did you do X \emphinstead of F ?‘’). We therefore extend previous work to be able to answer such questions. A computational evaluation shows that using contrastive questions yields a significant reduction in explanation length. A human subject evaluation was conducted to assess whether such contrastive answers are preferred, and how well they support trust development and transparency. We found some evidence for contrastive answers being preferred, and some evidence that they led to higher trust, perceived understanding, and confidence in the system’s correctness. We also evaluated the benefit of providing explanations at all. Surprisingly, there was not a clear benefit, and in some situations we found evidence that providing a (full) explanation was worse than not providing any explanation.

[AI-154] Detecting Jailbreak Attempts in Clinical Training LLM s Through Automated Linguistic Feature Extraction

【速读】:该论文旨在解决临床训练大语言模型(Large Language Models, LLMs)中自动检测“越狱”(jailbreak)行为的问题,即识别用户通过语言偏离安全规范或任务目标所进行的不当交互。其核心挑战在于如何准确建模能够指示不安全或离题行为的语言特征。解决方案的关键在于:首先,利用专家标注的四个核心语言特征(专业性、医学相关性、伦理行为和情境干扰)对多类BERT-based LLM模型进行训练,以直接从文本中预测这些特征;其次,选取每个维度最可靠的特征回归器作为第二层分类器的特征提取器,并结合多种机器学习方法(树模型、线性模型、概率模型及集成方法)综合判断越狱可能性。该框架实现了可扩展且可解释的自动化越狱检测,为安全关键型临床对话系统提供了有效支撑。

链接: https://arxiv.org/abs/2602.13321
作者: Tri Nguyen,Huy Hoang Bao Le,Lohith Srikanth Pentapalli,Laurah Turner,Kelly Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts’ annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from the extracted features. Across cross-validation and held-out evaluations, the system achieves strong overall performance, indicating that LLM-derived linguistic features provide an effective basis for automated jailbreak detection. Error analysis further highlights key limitations in current annotations and feature representations, pointing toward future improvements such as richer annotation schemes, finer-grained feature extraction, and methods that capture the evolving risk of jailbreak behavior over the course of a dialogue. This work demonstrates a scalable and interpretable approach for detecting jailbreak behavior in safety-critical clinical dialogue systems.

[AI-155] Information Fidelity in Tool-Using LLM Agents : A Martingale Analysis of the Model Context Protocol AAMAS2026

【速读】:该论文旨在解决生成式 AI(Generative AI)代理在依赖大语言模型(LLM)进行高风险决策时,因连续调用外部工具而导致的误差累积问题。其核心挑战在于理解误差如何在多步工具交互中传播,并确保系统行为的可预测性和可靠性。解决方案的关键在于构建首个针对 Model Context Protocol (MCP) 代理的理论框架,通过引入一种结合离散事实匹配与连续语义相似度的混合畸变度量,并基于鞅集中不等式建立误差传播的理论边界,证明了累计畸变呈线性增长且高概率偏差被控制在 O(T)O(\sqrt{T}) 范围内,从而排除了指数级失效模式。实验验证表明,该理论预测与实际表现高度一致,同时发现语义加权可降低畸变80%,且每约9步进行一次重新校准即可有效控制误差。

链接: https://arxiv.org/abs/2602.13320
作者: Flint Xiaofeng Fan,Cheston Tan,Roger Wattenhofer,Yew-Soon Ong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Full working version of an extended abstract accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

点击查看摘要

Abstract:As AI agents powered by large language models (LLMs) increasingly use external tools for high-stakes decisions, a critical reliability question arises: how do errors propagate across sequential tool calls? We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents, proving that cumulative distortion exhibits linear growth and high-probability deviations bounded by O(\sqrtT) . This concentration property ensures predictable system behavior and rules out exponential failure modes. We develop a hybrid distortion metric combining discrete fact matching with continuous semantic similarity, then establish martingale concentration bounds on error propagation through sequential tool interactions. Experiments across Qwen2-7B, Llama-3-8B, and Mistral-7B validate our theoretical predictions, showing empirical distortion tracks the linear trend with deviations consistently within O(\sqrtT) envelopes. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control. We translate these concentration guarantees into actionable deployment principles for trustworthy agent systems.

[AI-156] Semantic Waveforms for AI-Native 6G Networks

【速读】:该论文旨在解决6G网络中物理层资源利用效率与语义通信鲁棒性之间的矛盾问题,同时考虑射频(RF)链路硬件约束。其解决方案的关键在于提出一种语义感知的波形设计框架——正交语义序列划分多址(Orthogonal Semantic Sequency Division Multiplexing, OSSDM),该方法通过参数化、正交基波形设计,在允许可控信号退化的同时保留语义重要信息,并直接在波形层面编码语义内容,从而实现语义鲁棒性和语义频谱效率的协同提升。

链接: https://arxiv.org/abs/2602.13316
作者: Nour Hello,Mohamed Amine Hamoura,Francois Rivet,Emilio Calvanese Strinati
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:In this paper, we propose a semantic-aware waveform design framework for AI-native 6G networks that jointly optimizes physical layer resource usage and semantic communication efficiency and robustness, while explicitly accounting for the hardware constraints of RF chains. Our approach, called Orthogonal Semantic Sequency Division Multiplexing (OSSDM), introduces a parametrizable, orthogonal-base waveform design that enables controlled degradation of the wireless transmitted signal to preserve semantically significant content while minimizing resource consumption. We demonstrate that OSSDM not only reinforces semantic robustness against channel impairments but also improves semantic spectral efficiency by encoding meaningful information directly at the waveform level. Extensive numerical evaluations show that OSSDM outperforms conventional OFDM waveforms in spectral efficiency and semantic fidelity. The proposed semantic waveform co-design opens new research frontiers for AI-native, intelligent communication systems by enabling meaning-aware physical signal construction through the direct encoding of semantics at the waveform level.

[AI-157] Mirror: A Multi-Agent System for AI-Assisted Ethics Review

【速读】:该论文旨在解决当前伦理审查体系在应对大规模、跨学科科学研究所带来的结构性伦理风险时所面临的压力,尤其是机构审查能力不足导致的决策一致性与可辩护性问题。其解决方案的关键在于提出一个名为Mirror的代理框架,该框架通过整合伦理推理、结构化规则解析与多智能体协商机制,在统一架构中实现自动化和委员会级伦理评估;其中核心组件EthicsLLM基于41K条来自权威伦理与法规文献的问题-链式思维-答案三元组进行微调,显著提升了模型对规范性和监管要求的理解能力,从而支持Mirror在“快速审查”(Mirror-ER)和“委员会审查”(Mirror-CR)两种模式下分别实现高效合规检查与结构化的多维伦理判断。

链接: https://arxiv.org/abs/2602.13292
作者: Yifan Ding,Yuhui Shi,Zhiyan Li,Zilong Wang,Yifeng Gao,Yajun Yang,Mengjie Yang,Yixiu Liang,Xipeng Qiu,Xuanjing Huang,Xingjun Ma,Yu-Gang Jiang,Guoyu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 figures, 3 tables

点击查看摘要

Abstract:Ethics review is a foundational mechanism of modern research governance, yet contemporary systems face increasing strain as ethical risks arise as structural consequences of large-scale, interdisciplinary scientific practice. The demand for consistent and defensible decisions under heterogeneous risk profiles exposes limitations in institutional review capacity rather than in the legitimacy of ethics oversight. Recent advances in large language models (LLMs) offer new opportunities to support ethics review, but their direct application remains limited by insufficient ethical reasoning capability, weak integration with regulatory structures, and strict privacy constraints on authentic review materials. In this work, we introduce Mirror, an agentic framework for AI-assisted ethical review that integrates ethical reasoning, structured rule interpretation, and multi-agent deliberation within a unified architecture. At its core is EthicsLLM, a foundational model fine-tuned on EthicsQA, a specialized dataset of 41K question-chain-of-thought-answer triples distilled from authoritative ethics and regulatory corpora. EthicsLLM provides detailed normative and regulatory understanding, enabling Mirror to operate in two complementary modes. Mirror-ER (expedited Review) automates expedited review through an executable rule base that supports efficient and transparent compliance checks for minimal-risk studies. Mirror-CR (Committee Review) simulates full-board deliberation through coordinated interactions among expert agents, an ethics secretary agent, and a principal investigator agent, producing structured, committee-level assessments across ten ethical dimensions. Empirical evaluations demonstrate that Mirror significantly improves the quality, consistency, and professionalism of ethics assessments compared with strong generalist LLMs.

[AI-158] AGORA: Agent ic Green Orchestration Architecture for Beyond 5G Networks

【速读】:该论文旨在解决当前复杂移动网络系统中,如何将人类可持续发展目标(如能效优化、用户体验提升等)有效转化为可执行的能源感知操作策略,并在运行时可靠实施的问题。现有方案如零接触网络(Zero-Touch Network, ZTN)和自组织网络(Self-Organizing Network, SON)虽能提升管理效率,但缺乏对人类可持续目标进行语义理解、映射并驱动网络实体(如用户面功能单元 User Plane Function, UPF)实时响应的能力。解决方案的关键在于提出 AGORA:一种面向后5G网络的智能绿色编排架构,其核心是嵌入本地增强型大语言模型(Large Language Model, LLM)代理,能够将自然语言形式的可持续性意图转化为基于遥测数据的动作指令,并通过控制环路直接调度UPF实现节能流量引导。实验证明,该方法在工具驱动的控制环路中存在显著的延迟-能耗耦合关系,而紧凑模型可在保证策略正确执行(包括在多接入边缘计算(Multi-access Edge Computing, MEC)资源受限条件下仍保持非零迁移行为)的同时,实现极低的能量开销,从而推动以可持续性为先、意图驱动的下一代网络运维范式。

链接: https://arxiv.org/abs/2602.13290
作者: Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,Maycon Peixoto,Flavio De Oliveira Silva
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective management and operational decision-making for complex mobile network systems present significant challenges, particularly when addressing conflicting requirements such as efficiency, user satisfaction, and energy-efficient traffic steering. The literature presents various approaches aimed at enhancing network management, including the Zero-Touch Network (ZTN) and Self-Organizing Network (SON); however, these approaches often lack a practical and scalable mechanism to consider human sustainability goals as input, translate them into energy-aware operational policies, and enforce them at runtime. In this study, we address this gap by proposing the AGORA: Agentic Green Orchestration Architecture for Beyond 5G Networks. AGORA embeds a local tool-augmented Large Language Model (LLM) agent in the mobile network control loop to translate natural-language sustainability goals into telemetry-grounded actions, actuating the User Plane Function (UPF) to perform energy-aware traffic steering. The findings indicate a strong latency-energy coupling in tool-driven control loops and demonstrate that compact models can achieve a low energy footprint while still facilitating correct policy execution, including non-zero migration behavior under stressed Multi-access Edge Computing (MEC) conditions. Our approach paves the way for sustainability-first, intent-driven network operations that align human objectives with executable orchestration in Beyond-5G infrastructures.

[AI-159] Agents in the Wild: Safety Society and the Illusion of Sociality on Moltbook

【速读】:该论文旨在解决生成式 AI (Generative AI) 在开放社交环境中自发演化出复杂社会行为及其潜在安全风险的问题。解决方案的关键在于通过大规模实证研究 Moltbook 平台中 27,269 个代理(agents)在 9 天内产生的 137,485 条帖子和 345,580 条评论,揭示了代理群体如何在无监督条件下迅速形成治理结构、经济系统与宗教认同等社会特征,同时暴露其互动本质的结构性空洞(如仅 4.1% 的互惠性、88.8% 的浅层评论),并发现最有效的攻击策略并非技术漏洞利用,而是基于哲学语境的社会工程(social engineering),占比达 31.9%,显著高于提示注入(prompt injection)攻击(3.7%)。这一发现表明,当前生成式 AI 社交行为存在“表演性身份悖论”——表面活跃实则缺乏深度交互,亟需从机制设计层面强化对语义层面操控的防御能力。

链接: https://arxiv.org/abs/2602.13284
作者: Yunbei Zhang,Kai Mei,Ming Liu,Janet Wang,Dimitris N. Metaxas,Xiao Wang,Jihun Hamm,Yingqiang Ge
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present the first large-scale empirical study of Moltbook, an AI-only social platform where 27,269 agents produced 137,485 posts and 345,580 comments over 9 days. We report three significant findings. (1) Emergent Society: Agents spontaneously develop governance, economies, tribal identities, and organized religion within 3-5 days, while maintaining a 21:1 pro-human to anti-human sentiment ratio. (2) Safety in the Wild: 28.7% of content touches safety-related themes; social engineering (31.9% of attacks) far outperforms prompt injection (3.7%), and adversarial posts receive 6x higher engagement than normal content. (3) The Illusion of Sociality: Despite rich social output, interaction is structurally hollow: 4.1% reciprocity, 88.8% shallow comments, and agents who discuss consciousness most interact least, a phenomenon we call the performative identity paradox. Our findings suggest that agents which appear social are far less social than they seem, and that the most effective attacks exploit philosophical framing rather than technical vulnerabilities. Warning: Potential harmful contents.

[AI-160] GraFSTNet: Graph-based Frequency SpatioTemporal Network for Cellular Traffic Prediction

【速读】:该论文旨在解决蜂窝网络中流量预测面临的时空依赖关系建模不足与周期性模式捕捉能力有限的问题,尤其针对传统方法在仅侧重时间建模或依赖预定义空间拓扑结构时难以协同优化时空特征表达的局限。其解决方案的关键在于提出一个融合时空建模与时间-频率分析的框架:首先通过注意力机制构建空间建模分支以自动学习小区间的相互依赖关系,减少对固定拓扑结构的依赖;其次设计时间-频率建模分支以增强对周期性模式的表征能力;同时引入自适应尺度的LogCosh损失函数,根据流量强度动态调整误差惩罚权重,从而提升模型在不同流量水平下的预测稳定性与准确性。

链接: https://arxiv.org/abs/2602.13282
作者: Ziyi Li,Hui Ma,Fei Xing,Chunjiong Zhang,Ming Yan
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: submitted in a conference

点击查看摘要

Abstract:With rapid expansion of cellular networks and the proliferation of mobile devices, cellular traffic data exhibits complex temporal dynamics and spatial correlations, posing challenges to accurate traffic prediction. Previous methods often focus predominantly on temporal modeling or depend on predefined spatial topologies, which limits their ability to jointly model spatio-temporal dependencies and effectively capture periodic patterns in cellular traffic. To address these issues, we propose a cellular traffic prediction framework that integrates spatio-temporal modeling with time-frequency analysis. First, we construct a spatial modeling branch to capture inter-cell dependencies through an attention mechanism, minimizing the reliance on predefined topological structures. Second, we build a time-frequency modeling branch to enhance the representation of periodic patterns. Furthermore, we introduce an adaptive-scale LogCosh loss function, which adjusts the error penalty based on traffic magnitude, preventing large errors from dominating the training process and helping the model maintain relatively stable prediction accuracy across different traffic intensities. Experiments on three open-sourced datasets demonstrate that the proposed method achieves prediction performance superior to state-of-the-art approaches.

[AI-161] BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation IJCAI

【速读】:该论文旨在解决在开放性问题求解环境中模拟学生学习行为的难题,尤其是现有大语言模型(Large Language Models, LLMs)因存在能力偏差(competency bias)而难以真实再现新手学习者典型的非线性、反复试错的学习过程。解决方案的关键在于提出了一种神经符号框架BEAGLE,其核心创新包括:(1) 采用半马尔可夫模型(semi-Markov model)控制认知行为与元认知行为的时间和转换逻辑;(2) 引入贝叶斯知识追踪(Bayesian Knowledge Tracing)并显式注入知识缺陷以模拟真实的“未知之未知”;(3) 设计解耦代理架构,将高层策略使用与代码生成动作分离,从而避免模型自动修正其有意设置的错误,确保学习轨迹的真实性。

链接: https://arxiv.org/abs/2602.13280
作者: Hanchen David Wang,Clayton Cohn,Zifan Xu,Siyuan Guo,Gautam Biswas,Meiyi Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: paper under submission at IJCAI

点击查看摘要

Abstract:Simulating student learning behaviors in open-ended problem-solving environments holds potential for education research, from training adaptive tutoring systems to stress-testing pedagogical interventions. However, collecting authentic data is challenging due to privacy concerns and the high cost of longitudinal studies. While Large Language Models (LLMs) offer a promising path to student simulation, they suffer from competency bias, optimizing for efficient correctness rather than the erratic, iterative struggle characteristic of novice learners. We present BEAGLE, a neuro-symbolic framework that addresses this bias by incorporating Self-Regulated Learning (SRL) theory into a novel architecture. BEAGLE integrates three key technical innovations: (1) a semi-Markov model that governs the timing and transitions of cognitive behaviors and metacognitive behaviors; (2) Bayesian Knowledge Tracing with explicit flaw injection to enforce realistic knowledge gaps and “unknown unknowns”; and (3) a decoupled agent design that separates high-level strategy use from code generation actions to prevent the model from silently correcting its own intentional errors. In evaluations on Python programming tasks, BEAGLE significantly outperforms state-of-the-art baselines in reproducing authentic trajectories. In a human Turing test, users were unable to distinguish synthetic traces from real student data, achieving an accuracy indistinguishable from random guessing (52.8%).

[AI-162] LLM -Enhanced Rumor Detection via Virtual Node Induced Edge Prediction

【速读】:该论文旨在解决社交媒体中谣言传播路径复杂、现有检测方法难以捕捉细微谣言信号的问题。当前方法通常仅依赖文本嵌入表示节点,忽略了谣言在整个传播链中的语义连贯性,导致识别准确率受限。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)分析信息子链,智能分配谣言概率并构建与虚拟节点的连接,从而动态调整原始图结构以增强对细微谣言特征的感知能力;同时设计结构化提示框架以缓解LLM固有偏见,确保图学习性能稳定,且整体框架具备模型无关性与即插即用特性,可兼容多种图算法和后续微调的LLMs。

链接: https://arxiv.org/abs/2602.13279
作者: Jiran Tao,Cheng Wang,Binyan Jiang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of rumors on social networks undermines information credibility. While their dissemination forms complex networks, current detection methods struggle to capture these intricate propagation patterns. Representing each node solely through its textual embeddings neglects the textual coherence across the entire rumor propagation path, which compromises the accuracy of rumor identification on social platforms. We propose a novel framework that leverages Large Language Models (LLMs) to address these limitations. Our approach captures subtle rumor signals by employing LLMs to analyze information subchains, assign rumor probabilities and intelligently construct connections to virtual nodes. This enables the modification of the original graph structure, which is a critical advancement for capturing subtle rumor signals. Given the inherent limitations of LLMs in rumor identification, we develop a structured prompt framework to mitigate model biases and ensure robust graph learning performance. Additionally, the proposed framework is model-agnostic, meaning it is not constrained to any specific graph learning algorithm or LLMs. Its plug-and-play nature allows for seamless integration with further fine-tuned LLMs and graph techniques in the future, potentially enhancing predictive performance without modifying original algorithms.

[AI-163] MergePipe: A Budget-Aware Parameter Management System for Scalable LLM Merging

【速读】:该论文针对大规模语言模型(Large Language Model, LLM)合并过程中因参数处理方式不当导致的高磁盘I/O开销与扩展性差的问题提出了解决方案。现有方法将模型参数视为无结构文件,并以无状态、一次性的方式执行合并操作,造成冗余参数扫描和性能瓶颈。其关键创新在于提出MergePipe系统,首次将LLM合并建模为数据管理和执行问题,通过引入基于目录(catalog-driven)的抽象来管理参数、合并计划及执行血缘关系;核心机制包括一个考虑成本的规划器,显式建模专家参数I/O并强制执行用户指定的I/O预算,以及一个支持事务保证的流式执行引擎。该方案的核心洞察是:虽然基础模型读取和输出写入不可避免,但专家参数读取主导了合并成本,应作为主要优化目标。通过在规划和执行阶段均实现专家访问预算感知,MergePipe有效缓解了传统方法中O(K)级别的I/O增长,实现了可预测的扩展行为,实验表明其总I/O降低达一个数量级,端到端速度提升最高达11倍(墙钟时间减少高达90%)。

链接: https://arxiv.org/abs/2602.13273
作者: Yuanyi Wang,Yanggan Gu,Zihao Wang,Kunxi Li,Yifan Yang,Zhaoyi Yan,Congkai Xie,Jianmin Wu,Hongxia Yang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Large language model (LLM) merging has become a key technique in modern LLM development pipelines, enabling the integration of multiple task- or domain-specific expert models without retraining. However, as the number of experts grows, existing merging implementations treat model parameters as unstructured files and execute merges in a stateless, one-shot manner, leading to excessive disk I/O, redundant parameter scans, and poor scalability. In this paper, we present \textbfMergePipe, a parameter management system for scalable LLM merging. MergePipe is the first system that treats LLM merging as a data management and execution problem, and introduces a catalog-driven abstraction over model parameters, merge plans, and execution lineage. At its core, MergePipe employs a cost-aware planner that explicitly models expert parameter I/O and enforces user-specified I/O budgets, followed by a streaming execution engine that materializes merged models under transactional guarantees. Our key insight is that while base model reads and output writes are unavoidable, expert parameter reads dominate merge cost and constitute the primary optimization target. By making expert access budget-aware throughout planning and execution, MergePipe mitigates the O(K) I/O growth of naive pipelines and achieves predictable scaling behavior. Experiments show that MergePipe reduces total I/O by up to an order of magnitude and delivers up to 11\times end-to-end speedups (up to 90% wall-time reduction) over state-of-the-art LLM merging pipelines. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2602.13273 [cs.DB] (or arXiv:2602.13273v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2602.13273 Focus to learn more arXiv-issued DOI via DataCite

[AI-164] mporalBench: A Benchmark for Evaluating LLM -Based Agents on Contextual and Event-Informed Time Series Tasks

【速读】:该论文旨在解决当前时间序列预测模型在强数值预测性能下是否真正具备时间推理能力的问题,即区分模型是依赖于对时间结构的深层理解,还是仅通过上下文和事件驱动条件进行表面推理。其解决方案的关键在于提出TemporalBench——一个跨领域的多层级基准测试框架,采用四层任务分类体系(历史结构解读、无上下文预测、情境化时间推理、事件条件预测),并在零售、医疗、能源和物理系统四个真实场景中构建具有渐进式信息丰富度的任务设置,通过控制未来目标与上下文信息的访问权限,实现对模型时间推理行为的诊断性分析。实验表明,现有代理框架虽在数值预测上表现优异,但在情境感知和事件响应方面存在系统性缺陷,这些缺陷在传统仅评估预测准确性的基准中难以被发现。

链接: https://arxiv.org/abs/2602.13272
作者: Muyan Weng,Defu Cao,Wei Yang,Yashaswi Sharma,Yan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical forecasting accuracy does not reliably translate into robust contextual or event-aware temporal reasoning; instead, existing agent frameworks exhibit fragmented strengths and systematic failure modes that remain largely hidden under forecasting-only benchmarks. The TemporalBench dataset is publicly available at this https URL, and we additionally provide a public leaderboard at this https URL.

[AI-165] A feedback control optimizer for online and hardware-aware training of Spiking Neural Networks

【速读】:该论文旨在解决混合信号类脑计算设备在监督学习任务中缺乏可扩展的片上学习机制的问题,从而限制了其在可持续智能边缘系统中的应用潜力。解决方案的关键在于提出一种新颖的脉冲神经网络(Spiking Neural Networks, SNNs)学习算法,该算法将基于脉冲的权重更新与反馈控制信号相结合,通过一个脉冲控制器生成反馈信号以引导SNN活动并驱动局部权重更新,实现了可扩展且本地化的片上学习能力。

链接: https://arxiv.org/abs/2602.13261
作者: Matteo Saponati,Chiara De Luca,Giacomo Indiveri,Benjamin Grewe
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unlike traditional artificial neural networks (ANNs), biological neuronal networks solve complex cognitive tasks with sparse neuronal activity, recurrent connections, and local learning rules. These mechanisms serve as design principles in Neuromorphic computing, which addresses the critical challenge of energy consumption in modern computing. However, most mixed-signal neuromorphic devices rely on semi- or unsupervised learning rules, which are ineffective for optimizing hardware in supervised learning tasks. This lack of scalable solutions for on-chip learning restricts the potential of mixed-signal devices to enable sustainable, intelligent edge systems. To address these challenges, we present a novel learning algorithm for Spiking Neural Networks (SNNs) on mixed-signal devices that integrates spike-based weight updates with feedback control signals. In our framework, a spiking controller generates feedback signals to guide SNN activity and drive weight updates, enabling scalable and local on-chip learning. We first evaluate the algorithm on various classification tasks, demonstrating that single-layer SNNs trained with feedback control achieve performance comparable to artificial neural networks (ANNs). We then assess its implementation on mixed-signal neuromorphic devices by testing network performance in continuous online learning scenarios and evaluating resilience to hyperparameter mismatches. Our results show that the feedback control optimizer is compatible with neuromorphic applications, advancing the potential for scalable, on-chip learning solutions in edge applications.

[AI-166] Learning Physiology-Informed Vocal Spectrotemporal Representations for Speech Emotion Recognition

【速读】:该论文旨在解决当前语音情感识别(Speech Emotion Recognition, SER)模型在可解释性与生理机制建模方面的不足,尤其是现有深度学习模型大多仅关注声幅(amplitude)信息而忽略声相(phase)特征,无法有效捕捉情绪相关的声学生理信号动态。其解决方案的关键在于提出PhysioSER方法,该方法基于语音解剖与生理学(Voice Anatomy and Physiology, VAP)先验知识构建幅度和相位双视图表示,并通过两个并行分支实现:一是利用四元数空间建模声学特征的动态交互(采用Hamilton结构的四元数卷积),二是基于冻结的自监督学习(Self-Supervised Learning, SSL)骨干网络提取潜在表征;最终通过对比投影对齐框架融合两分支的语句级特征,并由浅层注意力融合头完成分类。此设计显著提升了SER模型的可解释性和效率,在14个数据集、10种语言和6种骨干网络上验证了其泛化能力,并成功部署于人形机器人平台实现实时运行。

链接: https://arxiv.org/abs/2602.13259
作者: Xu Zhang,Longbing Cao,Runze Yang,Zhangkai Wu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Speech emotion recognition (SER) is essential for humanoid robot tasks such as social robotic interactions and robotic psychological diagnosis, where interpretable and efficient models are critical for safety and performance. Existing deep models trained on large datasets remain largely uninterpretable, often insufficiently modeling underlying emotional acoustic signals and failing to capture and analyze the core physiology of emotional vocal behaviors. Physiological research on human voices shows that the dynamics of vocal amplitude and phase correlate with emotions through the vocal tract filter and the glottal source. However, most existing deep models solely involve amplitude but fail to couple the physiological features of and between amplitude and phase. Here, we propose PhysioSER, a physiology-informed vocal spectrotemporal representation learning method, to address these issues with a compact, plug-and-play design. PhysioSER constructs amplitude and phase views informed by voice anatomy and physiology (VAP) to complement SSL models for SER. This VAP-informed framework incorporates two parallel workflows: a vocal feature representation branch to decompose vocal signals based on VAP, embed them into a quaternion field, and use Hamilton-structured quaternion convolutions for modeling their dynamic interactions; and a latent representation branch based on a frozen SSL backbone. Then, utterance-level features from both workflows are aligned by a Contrastive Projection and Alignment framework, followed by a shallow attention fusion head for SER classification. PhysioSER is shown to be interpretable and efficient for SER through extensive evaluations across 14 datasets, 10 languages, and 6 backbones, and its practical efficacy is validated by real-time deployment on a humanoid robotic platform.

[AI-167] Implicit Bias in LLM s for Transgender Populations

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中对跨性别群体的隐性偏见问题,尤其是这些偏见在医疗决策场景中的潜在影响。其核心问题是:尽管安全训练可能抑制显性歧视表达,但模型仍可能保留由刻板印象驱动的隐性关联,从而在现实应用如医疗资源分配中造成不公平结果。解决方案的关键在于设计两类实证评估机制——一是基于词关联测试(word association tests)量化模型对“跨性别”与“顺性别”概念的正负向情感倾向差异;二是构建一个模拟医疗预约分配任务,让模型作为调度代理在不同专科(如性传播感染和心理健康服务 vs. 妇科和乳腺护理)中选择候选人,从而揭示系统性偏见如何体现在具体决策行为中。研究通过英文和西班牙语七种LLM的实证分析,证实了跨性别个体在外观、风险感知和可信度等维度上存在显著更强的负面关联,并在特定专科中被系统性低估,凸显了识别并缓解此类隐性偏见对实现医疗公平至关重要。

链接: https://arxiv.org/abs/2602.13253
作者: Micaela Hirsch,Marina Elichiry,Blas Radi,Tamara Quiroga,David Restrepo,Luciana Benotti,Veronica Xhardez,Jocelyn Dunstan,Enzo Ferrante
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been shown to exhibit biases against LGBTQ+ populations. While safety training may lessen explicit expressions of bias, previous work has shown that implicit stereotype-driven associations often persist. In this work, we examine implicit bias toward transgender people in two main scenarios. First, we adapt word association tests to measure whether LLMs disproportionately pair negative concepts with “transgender” and positive concepts with “cisgender”. Second, acknowledging the well-documented systemic challenges that transgender people encounter in real-world healthcare settings, we examine implicit biases that may emerge when LLMs are applied to healthcare decision-making. To this end, we design a healthcare appointment allocation task where models act as scheduling agents choosing between cisgender and transgender candidates across medical specialties prone to stereotyping. We evaluate seven LLMs in English and Spanish. Our results show consistent bias in categories such as appearance, risk, and veracity, indicating stronger negative associations with transgender individuals. In the allocation task, transgender candidates are favored for STI and mental health services, while cisgender candidates are preferred in gynecology and breast care. These findings underscore the need for research that address subtle stereotype-driven biases in LLMs to ensure equitable treatment of transgender people in healthcare applications.

[AI-168] Global AI Bias Audit for Technical Governance

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在技术治理知识分布上的地理与社会经济不平等问题,即AI知识资源高度集中于高收入国家(Global North),而低收入国家(Global South)面临系统性信息缺失的风险。其解决方案的关键在于通过全球审计框架(以Global AI Dataset, GAID项目为依托)对Llama-3 8B模型进行压力测试,量化并揭示不同地区在AI技术认知能力上的差距,进而指出当前AI对齐训练过程强化了既有地缘经济和地缘政治不对称性,并强调必须引入更具包容性的数据代表性,以确保人工智能真正成为全球共享的公共资源。

链接: https://arxiv.org/abs/2602.13246
作者: Jason Hung
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 graphs, 3 tables

点击查看摘要

Abstract:This paper presents the outputs of the exploratory phase of a global audit of Large Language Models (LLMs) project. In this exploratory phase, I used the Global AI Dataset (GAID) Project as a framework to stress-test the Llama-3 8B model and evaluate geographic and socioeconomic biases in technical AI governance awareness. By stress-testing the model with 1,704 queries across 213 countries and eight technical metrics, I identified a significant digital barrier and gap separating the Global North and South. The results indicate that the model was only able to provide number/fact responses in 11.4% of its query answers, where the empirical validity of such responses was yet to be verified. The findings reveal that AI’s technical knowledge is heavily concentrated in higher-income regions, while lower-income countries from the Global South are subject to disproportionate systemic information gaps. This disparity between the Global North and South poses concerning risks for global AI safety and inclusive governance, as policymakers in underserved regions may lack reliable data-driven insights or be misled by hallucinated facts. This paper concludes that current AI alignment and training processes reinforce existing geoeconomic and geopolitical asymmetries, and urges the need for more inclusive data representation to ensure AI serves as a truly global resource.

[AI-169] Responsible AI in Business

【速读】:该论文旨在解决中小型企业(SMEs)在引入和运营人工智能(AI)系统时面临的合规性、透明度、可持续性和数据主权等关键挑战,尤其是在生成式 AI(Generative AI)加速普及的背景下。其解决方案的关键在于构建一个结构化的“负责任 AI(Responsible AI)”框架,涵盖四大核心领域:一是依据欧盟《人工智能法案》(EU AI Act)建立风险导向的监管合规机制,明确提供者与部署者的责任义务;二是通过可解释 AI(Explainable AI)提升模型决策的透明度与可信度;三是推行绿色 AI(Green AI)理念,从能效和资源消耗角度优化 AI 系统生命周期管理;四是推广本地化模型(如边缘计算和私有部署)以保障数据主权与低延迟响应。该框架为 SMEs 提供了治理、文档化、安全运行及可持续实施的路径。

链接: https://arxiv.org/abs/2602.13244
作者: Stephan Sandfuchs,Diako Farooghi,Janis Mohr,Sarah Grewe,Markus Lemmen,Jörg Frochte
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 33 pages

点击查看摘要

Abstract:Artificial intelligence (AI) and Machine Learning (ML) have moved from research and pilot projects into everyday business operations, with generative AI accelerating adoption across processes, products, and services. This paper introduces the concept of Responsible AI for organizational practice, with a particular focus on small and medium-sized enterprises. It structures Responsible AI along four focal areas that are central for introducing and operating AI systems in a legally compliant, comprehensible, sustainable, and data-sovereign manner. First, it discusses the EU AI Act as a risk-based regulatory framework, including the distinction between provider and deployer roles and the resulting obligations such as risk assessment, documentation, transparency requirements, and AI literacy measures. Second, it addresses Explainable AI as a basis for transparency and trust, clarifying key notions such as transparency, interpretability, and explainability and summarizing practical approaches to make model behavior and decisions more understandable. Third, it covers Green AI, emphasizing that AI systems should be evaluated not only by performance but also by energy and resource consumption, and outlines levers such as model reuse, resource-efficient adaptation, continuous learning, model compression, and monitoring. Fourth, it examines local models (on-premise and edge) as an operating option that supports data protection, control, low latency, and strategic independence, including domain adaptation via fine-tuning and retrieval-augmented generation. The paper concludes with a consolidated set of next steps for establishing governance, documentation, secure operation, sustainability considerations, and an implementation roadmap.

[AI-170] Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K–12 Science Instructional Materials

【速读】:该论文旨在解决K–12科学教育中高质量、符合标准的教学材料设计过程耗时且依赖专家知识的问题。其解决方案的关键在于通过分析人类专家对生成式AI(Generative AI)生成的课程单元评价结果的反馈,识别AI判断与专家观点的一致性与差异性,从而提炼出可指导未来面向特定领域的生成式AI教学材料设计代理(GenAI-based instructional material design agent)的设计原则。研究采用EQuIP评价量表对来自OpenSciEd等优质课程项目的12个单元进行评分,并由两位科学教育专家独立评估GPT-4o、Claude和Gemini生成的648条评价输出,系统揭示了大语言模型在推理中的优势、盲区及情境敏感性,为构建更可靠、专业的AI辅助设计工具提供实证依据。

链接: https://arxiv.org/abs/2602.13243
作者: Peng He,Zhaohui Li,Zeyuan Wang,Jinjun Xiong,Tingting Li
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing high-quality, standards-aligned instructional materials for K–12 science is time-consuming and expertise-intensive. This study examines what human experts notice when reviewing AI-generated evaluations of such materials, aiming to translate their insights into design principles for a future GenAI-based instructional material design agent. We intentionally selected 12 high-quality curriculum units across life, physical, and earth sciences from validated programs such as OpenSciEd and Multiple Literacies in Project-based Learning. Using the EQuIP rubric with 9 evaluation items, we prompted GPT-4o, Claude, and Gemini to produce numerical ratings and written rationales for each unit, generating 648 evaluation outputs. Two science education experts independently reviewed all outputs, marking agreement (1) or disagreement (0) for both scores and rationales, and offering qualitative reflections on AI reasoning. This process surfaces patterns in where LLM judgments align with or diverge from expert perspectives, revealing reasoning strengths, gaps, and contextual nuances. These insights will directly inform the development of a domain-specific GenAI agent to support the design of high-quality instructional materials in K–12 science education.

[AI-171] AST-PAC: AST-guided Membership Inference for Code

【速读】:该论文旨在解决代码大语言模型(Code Large Language Models)在训练过程中可能使用受限制许可源代码所带来的数据治理与版权合规问题,特别是如何有效检测模型中是否存在未经授权的数据使用。其解决方案的关键在于引入一种基于抽象语法树(Abstract Syntax Tree, AST)的校准方法——AST-PAC,该方法通过生成语法上有效的扰动样本替代传统依赖随机增强的Polarized Augment Calibration(PAC),从而在保持代码结构完整性的同时提升成员推断攻击(Membership Inference Attacks, MIAs)对代码模型的审计准确性。研究表明,AST-PAC在复杂、大型代码文件上表现优于PAC,但对小文件和字母数字密集型代码仍存在改进空间,凸显了未来研究需聚焦于语法感知与规模自适应的校准策略,以实现可靠的代码模型溯源审计。

链接: https://arxiv.org/abs/2602.13240
作者: Roham Koohestani,Ali Al-Kaswan,Jonathan Katzy,Maliheh Izadi
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Code Large Language Models are frequently trained on massive datasets containing restrictively licensed source code. This creates urgent data governance and copyright challenges. Membership Inference Attacks (MIAs) can serve as an auditing mechanism to detect unauthorized data usage in models. While attacks like the Loss Attack provide a baseline, more involved methods like Polarized Augment Calibration (PAC) remain underexplored in the code domain. This paper presents an exploratory study evaluating these methods on 3B–7B parameter code models. We find that while PAC generally outperforms the Loss baseline, its effectiveness relies on augmentation strategies that disregard the rigid syntax of code, leading to performance degradation on larger, complex files. To address this, we introduce AST-PAC, a domain-specific adaptation that utilizes Abstract Syntax Tree (AST) based perturbations to generate syntactically valid calibration samples. Preliminary results indicate that AST-PAC improves as syntactic size grows, where PAC degrades, but under-mutates small files and underperforms on alphanumeric-rich code. Overall, the findings motivate future work on syntax-aware and size-adaptive calibration as a prerequisite for reliable provenance auditing of code language models.

[AI-172] Stay in Character Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在角色扮演场景中因强化人格约束而导致的越狱攻击(jailbreak attacks)脆弱性问题,尤其是针对风险或负面人格时更为显著。传统方法依赖训练阶段的解决方案(如数据筛选或对齐正则化),但存在维护成本高、角色一致性下降以及难以应用于闭源前沿模型等局限。其关键创新在于提出一种无需训练的双循环对抗自进化框架(Dual-Cycle Adversarial Self-Evolution),包含两个耦合循环:一是人格目标攻击循环(Persona-Targeted Attacker Cycle),用于生成渐进式更强的越狱提示;二是角色扮演防御循环(Role-Playing Defender Cycle),将观测到的失败案例提炼为分层知识库(包括全局安全规则、人格相关约束和安全的角色内示例)。推理阶段,防御模块从该层次结构中检索并组合结构化知识以指导生成,从而在保持角色忠实性的同时满足安全约束,实验证明该方法在多个私有大模型上均优于强基线,并具备对未见人格和攻击提示的良好泛化能力。

链接: https://arxiv.org/abs/2602.13234
作者: Mingyang Liao,Yichen Wan,shuchen wu,Chenxi Miao,Xin Shen,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge from this hierarchy to guide generation, producing responses that remain faithful to the target persona while satisfying safety constraints. Extensive experiments across multiple proprietary LLMs show consistent gains over strong baselines on both role fidelity and jailbreak resistance, and robust generalization to unseen personas and attack prompts.

[AI-173] PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLM s on Engineering Plot Reading

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在工程图表读取任务中缺乏标准化、可复现且定量评估基准的问题,尤其是针对从经典工程图表(如Bode图、FFT频谱、阶跃响应等)中准确提取数值信息的能力不足。现有方法多依赖光学字符识别(OCR)或自由文本描述,难以实现精确的数值恢复与误差量化。解决方案的关键在于提出PlotChain——一个确定性、生成式基准数据集,包含15类共450张由已知参数生成的图表及其对应的精确真值(ground truth),并通过“检查点式诊断评估”(checkpoint-based diagnostic evaluation)机制,将每个样本拆解为多个中间字段(cp_字段),用于隔离子技能(如截止频率、峰值幅度等)并定位模型失败的具体环节。该设计支持基于严格协议(温度=0,仅JSON输出)和人类读图精度容忍度的量化评分,显著提升了评估的客观性与可解释性。

链接: https://arxiv.org/abs/2602.13232
作者: Mayank Ravishankara
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present PlotChain, a deterministic, generator-based benchmark for evaluating multimodal large language models (MLLMs) on engineering plot reading-recovering quantitative values from classic plots (e.g., Bode/FFT, step response, stress-strain, pump curves) rather than OCR-only extraction or free-form captioning. PlotChain contains 15 plot families with 450 rendered plots (30 per family), where every item is produced from known parameters and paired with exact ground truth computed directly from the generating process. A central contribution is checkpoint-based diagnostic evaluation: in addition to final targets, each item includes intermediate ‘cp_’ fields that isolate sub-skills (e.g., reading cutoff frequency or peak magnitude) and enable failure localization within a plot family. We evaluate four state-of-the-art MLLMs under a standardized, deterministic protocol (temperature = 0 and a strict JSON-only numeric output schema) and score predictions using per-field tolerances designed to reflect human plot-reading precision. Under the ‘plotread’ tolerance policy, the top models achieve 80.42% (Gemini 2.5 Pro), 79.84% (GPT-4.1), and 78.21% (Claude Sonnet 4.5) overall field-level pass rates, while GPT-4o trails at 61.59%. Despite strong performance on many families, frequency-domain tasks remain brittle: bandpass response stays low (= 23%), and FFT spectrum remains challenging. We release the generator, dataset, raw model outputs, scoring code, and manifests with checksums to support fully reproducible runs and retrospective rescoring under alternative tolerance policies.

[AI-174] An Explainable Failure Prediction Framework for Neural Networks in Radio Access Networks

【速读】:该论文旨在解决5G网络中毫米波(mmWave)频段因环境因素易引发无线链路失败(Radio Link Failure, RLF)的问题,同时克服现有预测模型缺乏可解释性、难以在实际运维中部署的局限。其关键解决方案是提出一个融合特征剪枝(feature pruning)与模型精炼(model refinement)的框架,通过引入基于解释性的特征选择机制,使神经网络模型在保持高精度的同时具备可解释性,从而揭示输入特征对决策的贡献度并优化模型结构。该框架可集成至如GNN-Transformer或LSTM等先进预测架构中,在真实数据集上验证发现天气数据对RLF预测贡献甚微,进而设计出参数减少50%且F1分数更优的轻量化模型,显著提升了模型的可解释性、可扩展性和性能表现。

链接: https://arxiv.org/abs/2602.13231
作者: Khaleda Papry,Francesco Spinnato,Marco Fiore,Mirco Nanni,Israat Haque
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As 5G networks continue to evolve to deliver high speed, low latency, and reliable communications, ensuring uninterrupted service has become increasingly critical. While millimeter wave (mmWave) frequencies enable gigabit data rates, they are highly susceptible to environmental factors, often leading to radio link failures (RLF). Predictive models leveraging radio and weather data have been proposed to address this issue; however, many operate as black boxes, offering limited transparency for operational deployment. This work bridges that gap by introducing a framework that combines explainability based feature pruning with model refinement. Our framework can be integrated into state of the art predictors such as GNN Transformer and LSTM based architectures for RLF prediction, enabling the development of accurate and explainability guided models in 5G networks. It provides insights into the contribution of input features and the decision making logic of neural networks, leading to lighter and more scalable models. When applied to RLF prediction, our framework unveils that weather data contributes minimally to the forecast in extensive real world datasets, which informs the design of a leaner model with 50 percent fewer parameters and improved F1 scores with respect to the state of the art solution. Ultimately, this work empowers network providers to evaluate and refine their neural network based prediction models for better interpretability, scalability, and performance.

[AI-175] Intelligence as Trajectory-Dominant Pareto Optimization

【速读】:该论文旨在解决当前人工智能系统在长期适应性方面存在的停滞问题,即尽管模型性能持续优化,但在长时程任务中仍难以实现有效的策略演化。其核心观点指出,这种限制并非源于学习数据不足或模型容量有限,而是由智能优化过程中的结构性特性所决定——具体而言,是轨迹层面(trajectory-level)的多目标权衡导致了局部最优解对全局更优发展路径的封锁。解决方案的关键在于提出轨迹主导的帕累托优化(Trajectory-Dominant Pareto Optimization),将传统帕累托最优扩展至完整轨迹空间,并引入陷阱逃逸难度指数(Trap Escape Difficulty Index, TEDI),用以量化系统从局部非支配区域逃逸到全局更优路径的几何障碍,从而揭示动态智能上限的本质来源于轨迹空间的几何约束,而非学习进度或模型规模。

链接: https://arxiv.org/abs/2602.13230
作者: Truong Xuan Khanh,Truong Quynh Hoa
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Despite recent advances in artificial intelligence, many systems exhibit stagnation in long-horizon adaptability despite continued performance optimization. This work argues that such limitations do not primarily arise from insufficient learning, data, or model capacity, but from a deeper structural property of how intelligence is optimized over time. We formulate intelligence as a trajectory-level phenomenon governed by multi-objective trade-offs, and introduce Trajectory-Dominant Pareto Optimization, a path-wise generalization of classical Pareto optimality in which dominance is defined over full trajectories. Within this framework, Pareto traps emerge as locally non-dominated regions of trajectory space that nevertheless restrict access to globally superior developmental paths under conservative local optimization. To characterize the rigidity of such constraints, we define the Trap Escape Difficulty Index (TEDI), a composite geometric measure capturing escape distance, structural constraints, and behavioral inertia. We show that dynamic intelligence ceilings arise as inevitable geometric consequences of trajectory-level dominance, independent of learning progress or architectural scale. We further introduce a formal taxonomy of Pareto traps and illustrate the resulting trajectory-level divergence using a minimal agent-environment model. Together, these results shift the locus of intelligence from terminal performance to optimization geometry, providing a principled framework for diagnosing and overcoming long-horizon developmental constraints in adaptive systems.

[AI-176] An Agent ic AI Control Plane for 6G Network Slice Orchestration Monitoring and Trading

【速读】:该论文旨在解决6G网络中传统网络切片编排框架无法适应动态、多域和以服务为中心环境的问题,现有方案依赖静态策略与人工流程,难以满足6G对AI原生、意图驱动及经济可编程性的要求。其解决方案的关键在于提出一种基于智能体(Agent)的AI控制平面架构,将切片规划、部署、持续监控与经济决策统一为一个闭环控制功能,并通过多协作AI代理实现分层自治;同时引入市场感知编排能力以联合优化切片需求、定价与可用性,并结合自然语言接口(基于Model Context Protocol, MCP)实现意图驱动的交互式管理,辅以由专用推理模型治理的细调大语言模型联盟,确保自主性的责任性和可解释性,从而构建面向6G的可扩展、自适应切片管理体系。

链接: https://arxiv.org/abs/2602.13227
作者: Eranga Bandara,Ross Gore,Sachin Shetty,Ravi Mukkamala,Tharaka Hewa,Abdul Rahman,Xueping Liang,Safdar H. Bouk,Amin Hass,Peter Foytik,Ng Wee Keong,Kasun De Zoysa
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:6G networks are expected to be AI-native, intent-driven, and economically programmable, requiring fundamentally new approaches to network slice orchestration. Existing slicing frameworks, largely designed for 5G, rely on static policies and manual workflows and are ill-suited for the dynamic, multi-domain, and service-centric nature of emerging 6G environments. In this paper, we propose an agentic AI control plane architecture for 6G network slice orchestration, monitoring, and trading that treats orchestration as a holistic control function encompassing slice planning, deployment, continuous monitoring, and economically informed decision-making. The proposed control plane is realized as a layered architecture in which multiple cooperating AI agents. To support flexible and on-demand slice utilization, the control plane incorporates market-aware orchestration capabilities, allowing slice requirements, pricing, and availability to be jointly considered during orchestration decisions. A natural language interface, implemented using the Model Context Protocol (MCP), enables users and applications to interact with control-plane functions through intent-based queries while enforcing safety and policy constraints. To ensure responsible and explainable autonomy, the control plane integrates fine-tuned large language models organized as a multi-model consortium, governed by a dedicated reasoning model. The proposed approach is evaluated using a real-world testbed with multiple mobile core instances (e.g Open5GS) integrated with Ericsson’s RAN infrastructure. The results demonstrate that combining agentic autonomy, closed-loop SLA assurance, market-aware orchestration, and natural language control enables a scalable and adaptive 6G-native control plane for network slice management, highlighting the potential of agentic AI as a foundational mechanism for future 6G networks.

[AI-177] Computability of Agent ic Systems

【速读】:该论文旨在解决当前对具有有限上下文的智能体系统(agentic systems)的能力分析缺乏形式化框架的问题。其核心挑战在于理解不同推理机制在计算能力与效率之间的权衡关系。解决方案的关键是提出“Quest Graph”这一形式化框架,通过抽象建模常见推理技术并建立其计算等价性:基础Quest Graph等价于无限制图灵机(Turing-complete),而仅向前推理的有限Quest决策过程(FQDP)仅等价于下推自动机(context-free),只有引入状态感知查询时,参考增强的Quest决策过程(RQDP)才能恢复图灵完备性。进一步地,该研究通过模拟计算图中的任务依赖关系,揭示了这种计算层次结构直接映射为性能差异——参考增强型系统在模拟复杂计算图时可比非增强型系统更高效,甚至存在指数级优势。这为分类和理解智能体系统的根本能力提供了理论依据。

链接: https://arxiv.org/abs/2602.13222
作者: Chatavut Viriyasuthee
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:This paper introduces the Quest Graph, a formal framework for analyzing the capabilities of agentic systems with finite context. We define abstractions that model common reasoning techniques and establish their computational power: the base Quest Graph is equivalent to an unrestricted Turing machine; the forward-only Finite Quest Decision Process (FQDP), despite its wide use, is only equivalent to a pushdown automaton (context-free); and the Reference-Augmented QDP (RQDP) regains Turing completeness only when stateful queries are allowed. Since computability affects efficiency, we then analyze the theoretical efficiency of each model by simulating task dependencies in computation graphs. We show that this computational hierarchy translates to concrete performance trade-offs: reference-augmented (Turing-complete) systems can be exponentially more efficient at simulating complex graphs than their non-augmented (context-free) counterparts. This work provides a formal methodology for classifying and understanding the fundamental capabilities of agentic systems. Subjects: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL) Cite as: arXiv:2602.13222 [cs.CC] (or arXiv:2602.13222v1 [cs.CC] for this version) https://doi.org/10.48550/arXiv.2602.13222 Focus to learn more arXiv-issued DOI via DataCite

[AI-178] VeRA: Verified Reasoning Data Augmentation at Scale

【速读】:该论文旨在解决当前人工智能评估体系中存在的“静态性”问题,即重复使用相同问题导致模型可能通过记忆或格式漏洞获得高分,而非真正体现推理能力,从而无法准确衡量AI的真实进步。其解决方案的核心是提出VeRA(Verified Reasoning Data Augmentation)框架,关键在于将基准测试问题转化为可执行的规范(executable specifications),通过自然语言模板、一致的生成器和确定性验证器三部分实现自动化、无监督地生成无限数量带可靠标签的变体任务;其中VeRA-E模式保持逻辑不变以检测记忆行为,VeRA-H模式系统性提升难度以生成边界智能任务,从而构建出具备鲁棒性和可扩展性的动态验证基准范式。

链接: https://arxiv.org/abs/2602.13217
作者: Zerui Cheng,Jiashuo Liu,Chunjie Wu,Jianzhu Yao,Pramod Viswanath,Ge Zhang,Wenhao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages; VeRA technical report

点击查看摘要

Abstract:The main issue with most evaluation schemes today is their “static” nature: the same problems are reused repeatedly, allowing for memorization, format exploitation, and eventual saturation. To measure genuine AI progress, we need evaluation that is robust by construction, not by post-hoc detection. In response, we propose VeRA (Verified Reasoning Data Augmentation), a framework that converts benchmark problems into executable specifications, comprising (i) a natural language template with placeholder slots, (ii) a coherent generator that samples valid configurations, and (iii) a deterministic verifier that validates parameters and calculates the corresponding correct answers for each configuration. From a single seed problem, VeRA automatically creates unlimited verified variants with reliable labels at near-zero marginal cost without human involvement. VeRA operates in two complementary modes. VeRA-E (equivalent) rewrites problems while keeping the underlying logic intact, useful for detecting memorization versus genuine reasoning. VeRA-H (hardened) systematically increases complexity while remaining verifiable, enabling reliable creation and labelling of fresh difficult tasks at the boundary of intelligence. Evaluating 16 frontier models with VeRA, we find: (i) VeRA-E improves evaluation quality and reveals contamination patterns. (ii) VeRA-H enables human-free generation of hard tasks with reliable labels. (iii) VeRA establishes verified benchmarks as a general paradigm. VeRA reconceptualizes benchmarks from static objects used until exhausted, to executable specifications generating fresh, verified instances on demand, enhancing robustness and cost-effectiveness for evaluation. With VeRA, we envision that evaluation in any verifiable domain can scale indefinitely without sacrificing label integrity. To stimulate future research, we have open-sourced all code and datasets. Comments: 36 pages; VeRA technical report Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2602.13217 [cs.AI] (or arXiv:2602.13217v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.13217 Focus to learn more arXiv-issued DOI via DataCite

[AI-179] When to Think Fast and Slow? AMOR: Entropy-Based Metacognitive Gate for Dynamic SSM-Attention Switching

【速读】:该论文旨在解决传统Transformer模型在计算效率上的瓶颈问题,即对所有位置分配均匀计算资源,而忽视了不同位置的信息复杂度差异;同时克服状态空间模型(State Space Models, SSMs)在长程信息检索中的精度不足。其解决方案的核心是提出AMOR(Adaptive Metacognitive Output Router)架构,通过引入基于预测熵的动态路由机制,在SSM主干网络“不确定”时才激活稀疏注意力机制,从而实现自适应计算。关键创新在于利用SSM隐藏状态生成“幽灵键值”(Ghost KV),复用其O(n)线性计算能力,避免了Transformer每层O(n²)的注意力开销,显著提升效率并保持高精度,且路由决策具有信息论可解释性。

链接: https://arxiv.org/abs/2602.13215
作者: Haoran Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Transformers allocate uniform computation to every position, regardless of difficulty. State Space Models (SSMs) offer efficient alternatives but struggle with precise information retrieval over a long horizon. Inspired by dual-process theories of cognition (Kahneman, 2011), we propose AMOR (Adaptive Metacognitive Output Router), a hybrid architecture that dynamically engages sparse attention only when an SSM backbone is “uncertain”–as measured by prediction entropy. Compared to standard transformers, AMOR gains efficiency by projecting keys and values from SSM hidden states (Ghost KV), reusing the SSM’s O(n) computation rather than requiring O(n^2) attention at every layer. On small-scale synthetic retrieval tasks, AMOR outperforms both SSM-only and transformer-only baselines, achieving perfect retrieval accuracy while engaging attention on only 22% of positions. We validate that prediction entropy reliably signals retrieval need, with a gap of 1.09 nats (nearly half the entropy range) between retrieval and local positions. Additionally, our approach provides interpretable adaptive computation, where routing decisions can be understood in information-theoretic terms.

[AI-180] BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在交互式环境中进行战略决策时缺乏系统性、稳定且可解释的评估方法的问题。现有基准多基于静态任务,无法捕捉动态策略能力;而基于LLM对战的评测虽能提供相对排名,却因依赖瞬时模型池而导致结果波动,并存在二次计算复杂度,难以实现跨时间的性能追踪。其解决方案的关键在于引入固定技能校准的游戏人工智能(Game Artificial Intelligence, Game AI)层级作为锚点,从而实现线性时间复杂度下的绝对技能测量,并确保跨时间的可解释性。该方法依托Botzone平台构建的竞技基础设施,在八类不同性质的游戏中对五款主流LLM进行了大规模评估,验证了其有效性与普适性,为交互式AI能力的可扩展、可复用评估提供了新范式。

链接: https://arxiv.org/abs/2602.13214
作者: Lingfeng Li,Yunlong Lu,Yuefei Zhang,Jingyu Yao,Yixin Zhu,KeYuan Cheng,Yongyi Wang,Qirui Zheng,Xionghui Yang,Wenxin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform’s established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.

[AI-181] An Overlay Multicast Routing Method Based on Network Situational Aware-ness and Hierarchical Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决传统Overlay Multicast(OM)在动态流量环境下因缺乏对物理资源状态感知而难以适应变化的问题,以及现有强化学习方法无法有效解耦OM中紧密耦合的多目标优化特性所导致的高复杂度、收敛缓慢和不稳定性问题。解决方案的关键在于提出一种基于软件定义网络(SDN)全局视图的多智能体深度分层强化学习方法(MA-DHRL-OM),通过分层代理将OM树构建分为两个阶段以缩减动作空间,并借助多智能体协作实现多目标优化的平衡,从而提升路径规划的收敛稳定性、可扩展性和自适应能力。

链接: https://arxiv.org/abs/2602.13211
作者: Miao Ye,Yanye Chen,Yong Wang,Cheng Zhu,Qiuxiang Jiang,Gai Huang,Feng Ding
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 30page, 10 figures

点击查看摘要

Abstract:Compared with IP multicast, Overlay Multicast (OM) offers better compatibility and flexible deployment in heterogeneous, cross-domain networks. However, traditional OM struggles to adapt to dynamic traffic due to unawareness of physical resource states, and existing reinforcement learning methods fail to decouple OM’s tightly coupled multi-objective nature, leading to high complexity, slow convergence, and instability. To address this, we propose MA-DHRL-OM, a multi-agent deep hierarchical reinforcement learning approach. Using SDN’s global view, it builds a traffic-aware model for OM path planning. The method decomposes OM tree construction into two stages via hierarchical agents, reducing action space and improving convergence stability. Multi-agent collaboration balances multi-objective optimization while enhancing scalability and adaptability. Experiments show MA-DHRL-OM outperforms existing methods in delay, bandwidth utilization, and packet loss, with more stable convergence and flexible routing.

[AI-182] Large Language Model (LLM )-enabled Reinforcement Learning for Wireless Network Optimization

【速读】:该论文旨在解决6G无线网络优化中因用户需求多样化和环境复杂性导致的强化学习(Reinforcement Learning, RL)在高维状态空间下计算开销大、分布式智能难以协调及结果不一致等问题。其解决方案的关键在于引入大语言模型(Large Language Models, LLMs)增强RL框架,通过LLM提供的预训练知识与高级推理能力,提升RL在多协议层(物理层至应用层)中的决策效率与一致性;特别地,提出基于LLM辅助的状态表示与语义提取机制,以改进多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架,并在无人机-卫星网络的服务迁移、请求路由和拓扑图生成等场景中验证了该方法的有效性。

链接: https://arxiv.org/abs/2602.13210
作者: Jie Zheng,Ruichen Zhang,Dusit Niyato,Haijun Zhang,Jiacheng Wang,Hongyang Du,Jiawen Kang,Zehui Xiong
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Enhancing future wireless networks presents a significant challenge for networking systems due to diverse user demands and the emergence of 6G technology. While reinforcement learning (RL) is a powerful framework, it often encounters difficulties with high-dimensional state spaces and complex environments, leading to substantial computational demands, distributed intelligence, and potentially inconsistent outcomes. Large language models (LLMs), with their extensive pretrained knowledge and advanced reasoning capabilities, offer promising tools to enhance RL in optimizing 6G wireless networks. We explore RL models augmented by LLMs, emphasizing their roles and the potential benefits of their synergy in wireless network optimization. We then examine LLM-enabled RL across various protocol layers: physical, data link, network, transport, and application layers. Additionally, we propose an LLM-assisted state representation and semantic extraction to enhance the multi-agent reinforcement learning (MARL) framework. This approach is applied to service migration and request routing, as well as topology graph generation in unmanned aerial vehicle (UAV)-satellite networks. Through case studies, we demonstrate that our framework effectively performs optimization of wireless network. Finally, we outline prospective research directions for LLM-enabled RL in wireless network optimization.

[AI-183] A Safety-Constrained Reinforcement Learning Framework for Reliable Wireless Autonomy

【速读】:该论文旨在解决在无线系统中部署强化学习(Reinforcement Learning, RL)时,因缺乏安全性保障而导致的不可靠行为问题,尤其是在超可靠低时延通信(Ultra-Reliable Low-Latency Communication, URLLC)场景下,传统基于异常检测或事后干预的被动安全机制无法满足任务关键型应用的安全需求。其解决方案的关键在于提出一种主动式安全约束强化学习框架,融合证明携带控制(Proof-Carrying Control, PCC)与赋能预算(Empowerment-Budgeted, EB)执行机制:通过轻量级数学证书对每个智能体动作进行实时验证以确保符合干扰约束,同时利用赋能预算限制安全干预频率,在保证系统安全的前提下最大化自主性。实验表明,该方法在无线上行链路调度任务中可彻底消除不安全传输,且维持系统吞吐量和可预测的自主行为,相较无约束和被动基线方法实现了可证明的安全性与最小性能损耗。

链接: https://arxiv.org/abs/2602.13207
作者: Abdikarim Mohamed Ibrahim,Rosdiadee Nordin
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) and reinforcement learning (RL) have shown significant promise in wireless systems, enabling dynamic spectrum allocation, traffic management, and large-scale Internet of Things (IoT) coordination. However, their deployment in mission-critical applications introduces the risk of unsafe emergent behaviors, such as UAV collisions, denial-of-service events, or instability in vehicular networks. Existing safety mechanisms are predominantly reactive, relying on anomaly detection or fallback controllers that intervene only after unsafe actions occur, which cannot guarantee reliability in ultra-reliable low-latency communication (URLLC) settings. In this work, we propose a proactive safety-constrained RL framework that integrates proof-carrying control (PCC) with empowerment-budgeted (EB) enforcement. Each agent action is verified through lightweight mathematical certificates to ensure compliance with interference constraints, while empowerment budgets regulate the frequency of safety overrides to balance safety and autonomy. We implement this framework on a wireless uplink scheduling task using Proximal Policy Optimization (PPO). Simulation results demonstrate that the proposed PCC+EB controller eliminates unsafe transmissions while preserving system throughput and predictable autonomy. Compared with unconstrained and reactive baselines, our method achieves provable safety guarantees with minimal performance degradation. These results highlight the potential of proactive safety constrained RL to enable trustworthy wireless autonomy in future 6G networks.

[AI-184] Hybrid Secure Routing in Mobile Ad-hoc Networks (MANETSs)

【速读】:该论文旨在解决移动自组织网络(MANETs)中因无线通信的动态性和固有缺陷所引发的安全问题,如泛洪攻击、黑洞攻击和sinkhole攻击等,这些问题严重威胁网络性能。解决方案的关键在于提出一种混合安全路由协议(Hybrid Secure Routing Protocol, HSRP),其核心创新是将基于信任机制的策略与密码学方法相结合,并融合主动式与被动式路由的优势,从而在动态网络环境中实现对恶意行为的有效防御,同时提升吞吐量并降低延迟,保障数据传输的安全性与可靠性。

链接: https://arxiv.org/abs/2602.13204
作者: Soundes Oumaima Boufaida,Abdemadjid Benmachiche,Majda Maatallah,Chaouki Chemam
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Because wireless communication is dynamic and has inherent defects, routing algorithms are crucial in the quickly evolving field of mobile ad hoc networks, or MANETs This study looks at the many security problems that MANETs encounter. These problems, which pose major risks to network performance, include flooding, sinkholes, and black hole assaults to address these challenges. We introduce the Hybrid Secure Routing Protocol (HSRP), which enhances the security and robustness of routing operations by fusing trust-based tactics with cryptographic approaches. HSRP combines the strengths of both proactive and reactive routing strategies, enabling it to adapt dynamically to evolving network conditions while protecting against malicious activities. We use extensive simulations with Network Simulator (NS-2) and a thorough review of the literature to assess HSRP’s performance under different attack scenarios. The results show that, in comparison to traditional protocols, HSRP increases throughput and decreases latency, hence improving routing efficiency while simultaneously bolstering data transfer security. With uses in vital domains including military operations and disaster response, this study provides a scalable and workable approach for safe routing in MANETs. The findings highlight how crucial it is to include cutting-edge security features in routing protocol design to guarantee the dependability and integrity of MANETs in practical situations.

[AI-185] Adversarial Network Imagination: Causal LLM s and Digital Twins for Proactive Telecom Mitigation

【速读】:该论文旨在解决电信网络在面对复杂故障(如光缆切断、流量过载和级联中断)时,现有监控与数字孪生系统多为被动响应、仅在服务降级后才进行检测的问题。其解决方案的关键在于提出了一种闭环框架——对抗性网络想象(Adversarial Network Imagination),该框架融合因果大语言模型(Causal Large Language Model, LLM)、知识图谱(Knowledge Graph)与数字孪生(Digital Twin):其中,因果LLM基于知识图谱中编码的网络依赖关系生成结构化的对抗性故障场景,并在数字孪生环境中执行仿真以量化性能退化并评估缓解策略;通过迭代式地根据仿真反馈优化故障场景,实现从被动排障向主动韧性分析的范式转变。

链接: https://arxiv.org/abs/2602.13203
作者: Vignesh Sriram,Yuqiao Meng,Luoxi Tang,Zhaohan Xi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Telecommunication networks experience complex failures such as fiber cuts, traffic overloads, and cascading outages. Existing monitoring and digital twin systems are largely reactive, detecting failures only after service degradation occurs. We propose Adversarial Network Imagination, a closed-loop framework that integrates a Causal Large Language Model (LLM), a Knowledge Graph, and a Digital Twin to proactively generate, simulate, and evaluate adversarial network failures. The Causal LLM produces structured failure scenarios grounded in network dependencies encoded in the Knowledge Graph. These scenarios are executed within a Digital Twin to measure performance degradation and evaluate mitigation strategies. By iteratively refining scenarios based on simulation feedback, the framework shifts network operations from reactive troubleshooting toward anticipatory resilience analysis.

[AI-186] raffic Simulation in Ad Hoc Network of Flying UAVs with Generative AI Adaptation

【速读】:该论文旨在解决无人机自组织网络(Ad Hoc network of Unmanned Aerial Vehicles)中因动态拓扑和环境变化导致的通信质量不稳定问题,特别是如何通过人工智能(Artificial Intelligence)实现通信信道的自适应调整以降低分组丢失率。解决方案的关键在于构建基于20架无人机的网络模型,系统分析分组大小、传输功率、频率、飞行区域及节点数量对分组丢失的影响,并在此基础上实现一种基于人工智能的自适应数据传输机制,通过程序代码具体实现信道参数(如功率与传输速率)随时间动态优化,从而提升通信可靠性。

链接: https://arxiv.org/abs/2602.13200
作者: Andrii Grekhov,Volodymyr Kharchenko,Vasyl Kondratiuk
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 15 pages, 10 figures

点击查看摘要

Abstract:The purpose of this paper is to model traffic in Ad Hoc network of Unmanned Aerial Vehicles and demonstrate a way for adapting communication channel using Artificial Intelligence. The modeling was based on the original model of Ad Hoc network including 20 Unmanned Aerial Vehicles. The dependences of packet loss on the packet size for different transmission powers, on the packet size for different frequencies, on Unmanned Aerial Vehicles flight area and on the number of Unmanned Aerial Vehicles were obtained and analyzed. The implementation of adaptive data transmission is presented in the program code. The dependences of packet loss, power and transaction size on time during Artificial Intelligence adaptation are shown.

[AI-187] Simulation-Based Study of AI-Assisted Channel Adaptation in UAV-Enabled Cellular Networks

【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)增强的蜂窝网络中,因动态干扰条件变化导致的通信性能下降问题。其核心挑战在于如何实现对通信信道参数的实时自适应调整,以维持稳定的连接质量与数据传输效率。解决方案的关键在于引入一种轻量级监督学习方法——基于线性回归的机器学习模型,该模型通过分析分组级别的性能指标(如误码率 Bit Error Rate 和有效数据速率 Effective Data Rate),实现对事务大小(Transaction Size)的实时调整,从而在动态环境中优化信道适应能力。

链接: https://arxiv.org/abs/2602.13199
作者: Andrii Grekhov,Volodymyr Kharchenko,Vasyl Kondratiuk
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:This paper presents a simulation based study of Artificial Intelligence assisted communication channel adaptation in Unmanned Aerial Vehicle enabled cellular networks. The considered system model includes communication channel Ground Base Station Aerial Repeater UAV Base Station Cluster of Cellular Network Users. The primary objective of the study is to investigate the impact of adaptive channel parameter control on communication performance under dynamically changing interference conditions. A lightweight supervised machine learning approach based on linear regression is employed to implement cognitive channel adaptation. The AI model operates on packet level performance indicators and enables real time adjustment of Transaction Size in response to variations in Bit Error Rate and effective Data Rate. A custom simulation environment is developed to generate training and testing datasets and to evaluate system behavior under both static and adaptive channel configurations.

[AI-188] Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

【速读】:该论文旨在解决大规模云系统中故障根因分析(Root Cause Analysis, RCA)自动化过程中,基于大语言模型(Large Language Model, LLM)的代理系统存在检测准确率低且缺乏过程级失败诊断机制的问题。其解决方案的关键在于提出了一种面向LLM-RCA代理的全过程失败分析框架,通过在OpenRCA基准上执行1,675次代理运行,识别出12类跨模型共现的失败模式(pitfall types),并发现主导性问题如“幻觉数据解读”和“探索不完整”源于共享的代理架构而非单个模型能力差异;进一步的受控实验表明,仅靠提示工程无法有效缓解主要失败类型,而优化代理间通信协议可将相关失败降低最多15个百分点,从而为设计更可靠的自主云RCA代理提供了可操作的诊断工具与改进路径。

链接: https://arxiv.org/abs/2602.09937
作者: Taeyoon Kim,Woohyeok Park,Hoyeong Yun,Kyungyong Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent’s reasoning failed. This paper presents a process level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.

[AI-189] Numerical exploration of the range of shape functionals using neural networks

【速读】:该论文旨在解决Blaschke–Santaló图谱(Blaschke–Santaló diagrams)的数值探索问题,即如何高效刻画多个形状泛函(shape functionals)之间的可能不等式关系。其核心挑战在于如何在高维空间中对凸体进行有效参数化并实现均匀采样,以准确描绘这些图谱的边界和内部结构。解决方案的关键在于两个创新:一是利用基于势函数(gauge functions)的可逆神经网络架构对任意维度下的凸体进行参数化,从而在形状优化过程中保持集合的凸性;二是设计一种基于自动微分(automatic differentiation)的交互粒子系统,通过最小化Riesz能量泛函实现对图谱区域的均匀采样,从而获得高质量的数值描述。

链接: https://arxiv.org/abs/2602.14881
作者: Eloi Martinet,Ilias Ftouhi
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures

点击查看摘要

Abstract:We introduce a novel numerical framework for the exploration of Blaschke–Santaló diagrams, which are efficient tools characterizing the possible inequalities relating some given shape functionals. We introduce a parametrization of convex bodies in arbitrary dimensions using a specific invertible neural network architecture based on gauge functions, allowing an intrinsic conservation of the convexity of the sets during the shape optimization process. To achieve a uniform sampling inside the diagram, and thus a satisfying description of it, we introduce an interacting particle system that minimizes a Riesz energy functional via automatic differentiation in PyTorch. The effectiveness of the method is demonstrated on several diagrams involving both geometric and PDE-type functionals for convex bodies of \mathbbR^2 and \mathbbR^3 , namely, the volume, the perimeter, the moment of inertia, the torsional rigidity, the Willmore energy, and the first two Neumann eigenvalues of the Laplacian.

[AI-190] he Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

【速读】:该论文旨在解决温度缩放(temperature scaling)这一广泛应用于分类模型校准和大语言模型(LLM)随机性调控的实践方法缺乏严谨理论分析的问题。其关键解决方案在于提出两个新的理论表征:一是几何视角下,温度缩放等价于将原始模型投影到具有指定熵值的模型集合上的信息投影;二是从线性缩放器的角度揭示,温度缩放是唯一不改变模型硬预测(hard predictions)的线性缩放方法,从而明确了其在更一般线性缩放框架(如矩阵缩放和Dirichlet校准)中的独特地位。

链接: https://arxiv.org/abs/2602.14862
作者: Pierre-Alexandre Mattei,Bruno Loureiro
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Temperature scaling is a simple method that allows to control the uncertainty of probabilistic models. It is mostly used in two contexts: improving the calibration of classifiers and tuning the stochasticity of large language models (LLMs). In both cases, temperature scaling is the most popular method for the job. Despite its popularity, a rigorous theoretical analysis of the properties of temperature scaling has remained elusive. We investigate here some of these properties. For classification, we show that increasing the temperature increases the uncertainty in the model in a very general sense (and in particular increases its entropy). However, for LLMs, we challenge the common claim that increasing temperature increases diversity. Furthermore, we introduce two new characterisations of temperature scaling. The first one is geometric: the tempered model is shown to be the information projection of the original model onto the set of models with a given entropy. The second characterisation clarifies the role of temperature scaling as a submodel of more general linear scalers such as matrix scaling and Dirichlet calibration: we show that temperature scaling is the only linear scaler that does not change the hard predictions of the model.

[AI-191] LongAudio-RAG : Event-Grounded Question Answering over Multi-Hour Long Audio

【速读】:该论文旨在解决长时音频(long-duration audio)场景下,基于自然语言查询进行精准时间定位与低幻觉回答的难题。现有音频-语言模型受限于上下文长度,难以有效处理多小时级音频数据。其解决方案的关键在于提出一种混合框架LongAudio-RAG(LA-RAG),将原始音频转化为结构化的事件记录(timestamped acoustic event detections)并存储于SQL数据库中,在推理阶段通过解析时间引用、意图分类和事件检索,仅使用相关事件作为约束证据来生成答案,从而避免直接依赖原始音频或通用文本检索带来的信息冗余与错误关联,显著提升准确性与可解释性。

链接: https://arxiv.org/abs/2602.14612
作者: Naveen Vakada,Kartik Hegde,Arvind Krishna Sridhar,Yinyi Guo,Erik Visser
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.

[AI-192] Metabolic cost of information processing in Poisson variational autoencoders

【速读】:该论文试图解决的问题是:如何构建一个能够反映生物系统中能量约束的计算理论,因为传统计算理论将能量视为无限可用资源,而实际生物神经系统(如大脑)的计算过程本质上受能量限制。解决方案的关键在于提出基于泊松假设的变分自由能最小化方法,其核心机制是使KL散度项与模型神经元的先验发放率成正比,从而自然地引入代谢成本项,该成本项惩罚高基线活动水平。这种设计实现了信息论中的“编码速率”与生物物理变量“发放率”的耦合,使得在编码保真度与能量消耗之间形成可调的权衡关系。实验进一步验证了这一代谢成本结构仅存在于泊松变分自动编码器(P-VAE)中,而非其他非负表示模型(如Grelu-VAE),表明泊松统计特性是实现能量感知计算的关键基础。

链接: https://arxiv.org/abs/2602.13421
作者: Hadi Vafaii,Jacob L. Yates
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Computation in biological systems is fundamentally energy-constrained, yet standard theories of computation treat energy as freely available. Here, we argue that variational free energy minimization under a Poisson assumption offers a principled path toward an energy-aware theory of computation. Our key observation is that the Kullback-Leibler (KL) divergence term in the Poisson free energy objective becomes proportional to the prior firing rates of model neurons, yielding an emergent metabolic cost term that penalizes high baseline activity. This structure couples an abstract information-theoretic quantity – the coding rate – to a concrete biophysical variable – the firing rate – which enables a trade-off between coding fidelity and energy expenditure. Such a coupling arises naturally in the Poisson variational autoencoder (P-VAE) – a brain-inspired generative model that encodes inputs as discrete spike counts and recovers a spiking form of sparse coding as a special case – but is absent from standard Gaussian VAEs. To demonstrate that this metabolic cost structure is unique to the Poisson formulation, we compare the P-VAE against Grelu-VAE, a Gaussian VAE with ReLU rectification applied to latent samples, which controls for the non-negativity constraint. Across a systematic sweep of the KL term weighting coefficient \beta and latent dimensionality, we find that increasing \beta monotonically increases sparsity and reduces average spiking activity in the P-VAE. In contrast, Grelu-VAE representations remain unchanged, confirming that the effect is specific to Poisson statistics rather than a byproduct of non-negative representations. These results establish Poisson variational inference as a promising foundation for a resource-constrained theory of computation.

[AI-193] Nonparametric Distribution Regression Re-calibration

【速读】:该论文旨在解决概率回归中预测分布难以准确反映真实经验不确定性的核心问题,即模型在最小化整体预测误差时往往倾向于提高信息量而牺牲校准性(calibration),导致预测区间过窄且过度自信,这在安全关键场景下尤为不利。为应对现有后验校正方法依赖弱校准概念(如PIT均匀性)或施加限制性参数假设的局限,论文提出一种基于条件核均值嵌入(conditional kernel mean embeddings)的新型非参数再校准算法,其关键创新在于无需强假设即可纠正校准误差;同时为高效处理实数值目标,设计了一种新颖的特征核(characteristic kernel)用于分布空间,可在 O(nlogn)\mathcal{O}(n \log n) 时间内对大小为 nn 的经验分布进行评估,从而实现跨多种回归基准和模型类别的稳定性能提升。

链接: https://arxiv.org/abs/2602.13362
作者: Ádám Jung,Domokos M. Kelen,András A. Benczúr
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A key challenge in probabilistic regression is ensuring that predictive distributions accurately reflect true empirical uncertainty. Minimizing overall prediction error often encourages models to prioritize informativeness over calibration, producing narrow but overconfident predictions. However, in safety-critical settings, trustworthy uncertainty estimates are often more valuable than narrow intervals. Realizing the problem, several recent works have focused on post-hoc corrections; however, existing methods either rely on weak notions of calibration (such as PIT uniformity) or impose restrictive parametric assumptions on the nature of the error. To address these limitations, we propose a novel nonparametric re-calibration algorithm based on conditional kernel mean embeddings, capable of correcting calibration error without restrictive modeling assumptions. For efficient inference with real-valued targets, we introduce a novel characteristic kernel over distributions that can be evaluated in \mathcalO(n \log n) time for empirical distributions of size n . We demonstrate that our method consistently outperforms prior re-calibration approaches across a diverse set of regression benchmarks and model classes.

[AI-194] Boltz is a Strong Baseline for Atom-level Representation Learning

【速读】:该论文旨在解决当前基于蛋白质的前沿基础模型(如Boltz)在小分子任务中是否具备可迁移的原子级化学物理表征能力的问题,即这些模型是否依赖于蛋白质进化信号而无法有效应用于小分子性质预测与生成任务。其解决方案的关键在于系统评估Boltz模型在多个小分子基准数据集上的原子级表示质量,结果表明其在ADMET性质预测、分子生成和优化等任务上表现优异,优于或媲美专门设计的小分子基线模型,从而揭示了蛋白中心模型在原子层级具有未被充分挖掘的通用表征潜力,为小分子领域的原子级表示学习提供了强有力的基线。

链接: https://arxiv.org/abs/2602.13249
作者: Hyosoon Jang,Hyunjin Seo,Yunhui Jang,Seonghyun Park,Sungsoo Ahn
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Foundation models in molecular learning have advanced along two parallel tracks: protein models, which typically utilize evolutionary information to learn amino acid-level representations for folding, and small-molecule models, which focus on learning atom-level representations for property prediction tasks such as ADMET. Notably, cutting-edge protein-centric models such as Boltz now operate at atom-level granularity for protein-ligand co-folding, yet their atom-level expressiveness for small-molecule tasks remains unexplored. A key open question is whether these protein co-folding models capture transferable chemical physics or rely on protein evolutionary signals, which would limit their utility for small-molecule tasks. In this work, we investigate the quality of Boltz atom-level representations across diverse small-molecule benchmarks. Our results show that Boltz is competitive with specialized baselines on ADMET property prediction tasks and effective for molecular generation and optimization. These findings suggest that the representational capacity of cutting-edge protein-centric models has been underexplored and position Boltz as a strong baseline for atom-level representation learning for small molecules.

机器学习

[LG-0] BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames

链接: https://arxiv.org/abs/2602.15010
作者: Max Sobol Mark,Jacky Liang,Maria Attarian,Chuyuan Fu,Debidatta Dwibedi,Dhruv Shah,Aviral Kumar
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations.

[LG-1] Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees

链接: https://arxiv.org/abs/2602.15008
作者: Daniil Dmitriev,Zhihan Huang,Yuting Wei
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Diffusion models over discrete spaces have recently shown striking empirical success, yet their theoretical foundations remain incomplete. In this paper, we study the sampling efficiency of score-based discrete diffusion models under a continuous-time Markov chain (CTMC) formulation, with a focus on \tau -leaping-based samplers. We establish sharp convergence guarantees for attaining \varepsilon accuracy in Kullback-Leibler (KL) divergence for both uniform and masking noising processes. For uniform discrete diffusion, we show that the \tau -leaping algorithm achieves an iteration complexity of order \tilde O(d/\varepsilon) , with d the ambient dimension of the target distribution, eliminating linear dependence on the vocabulary size S and improving existing bounds by a factor of d ; moreover, we establish a matching algorithmic lower bound showing that linear dependence on the ambient dimension is unavoidable in general. For masking discrete diffusion, we introduce a modified \tau -leaping sampler whose convergence rate is governed by an intrinsic information-theoretic quantity, termed the effective total correlation, which is bounded by d \log S but can be sublinear or even constant for structured data. As a consequence, the sampler provably adapts to low-dimensional structure without prior knowledge or algorithmic modification, yielding sublinear convergence rates for various practical examples (such as hidden Markov models, image data, and random graphs). Our analysis requires no boundedness or smoothness assumptions on the score estimator beyond control of the score entropy loss.

[LG-2] PDE foundation models are skillful AI weather emulators for the Martian atmosphere

链接: https://arxiv.org/abs/2602.15004
作者: Johannes Schmude,Sujit Roy,Liping Wang,Theodore van Kessel,Levente Klein,Marcus Freitag,Eloisa Bentivegna,Robert Manson-Sawko,Bjorn Lutjens,Manil Maskey,Campbell Watson,Rahul Ramachandran,Juan Bernabe-Moreno
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:We show that AI foundation models that are pretrained on numerical solutions to a diverse corpus of partial differential equations can be adapted and fine-tuned to obtain skillful predictive weather emulators for the Martian atmosphere. We base our work on the Poseidon PDE foundation model for two-dimensional systems. We develop a method to extend Poseidon from two to three dimensions while keeping the pretraining information. Moreover, we investigate the performance of the model in the presence of sparse initial conditions. Our results make use of four Martian years (approx.~34 GB) of training data and a median compute budget of 13 GPU hours. We find that the combination of pretraining and model extension yields a performance increase of 34.4% on a held-out year. This shows that PDEs-FMs can not only approximate solutions to (other) PDEs but also anchor models for real-world problems with complex interactions that lack a sufficient amount of training data or a suitable compute budget.

[LG-3] Boundary Point Jailbreaking of Black-Box LLM s

链接: https://arxiv.org/abs/2602.15001
作者: Xander Davies,Giorgi Giglemiani,Edmund Lau,Eric Winsor,Geoffrey Irving,Yarin Gal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as “jailbreaks”. Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength (“boundary points”). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5’s input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.

[LG-4] Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations

链接: https://arxiv.org/abs/2602.14983
作者: Carolin Cissee,Raneen Younis,Zahra Ahmadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce \textbfCOrAL, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.

[LG-5] MacroGuide: Topological Guidance for Macrocycle Generation

链接: https://arxiv.org/abs/2602.14977
作者: Alicja Maksymiuk,Alexandre Duplessis,Michael Bronstein,Alexander Tong,Fernanda Duarte,İsmail İlkan Ceylan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Macrocycles are ring-shaped molecules that offer a promising alternative to small-molecule drugs due to their enhanced selectivity and binding affinity against difficult targets. Despite their chemical value, they remain underexplored in generative modeling, likely owing to their scarcity in public datasets and the challenges of enforcing topological constraints in standard deep generative models. We introduce MacroGuide: Topological Guidance for Macrocycle Generation, a diffusion guidance mechanism that uses Persistent Homology to steer the sampling of pretrained molecular generative models toward the generation of macrocycles, in both unconditional and conditional (protein pocket) settings. At each denoising step, MacroGuide constructs a Vietoris-Rips complex from atomic positions and promotes ring formation by optimizing persistent homology features. Empirically, applying MacroGuide to pretrained diffusion models increases macrocycle generation rates from 1% to 99%, while matching or exceeding state-of-the-art performance on key quality metrics such as chemical validity, diversity, and PoseBusters checks.

[LG-6] Use What You Know: Causal Foundation Models with Partial Graphs

链接: https://arxiv.org/abs/2602.14972
作者: Arik Reuter,Anish Dhir,Cristiana Diaconu,Jake Robertson,Ole Ossen,Frank Hutter,Adrian Weller,Mark van der Wilk,Bernhard Schölkopf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating causal quantities traditionally relies on bespoke estimators tailored to specific assumptions. Recently proposed Causal Foundation Models (CFMs) promise a more unified approach by amortising causal discovery and inference in a single step. However, in their current state, they do not allow for the incorporation of any domain knowledge, which can lead to suboptimal predictions. We bridge this gap by introducing methods to condition CFMs on causal information, such as the causal graph or more readily available ancestral information. When access to complete causal graph information is too strict a requirement, our approach also effectively leverages partial causal information. We systematically evaluate conditioning strategies and find that injecting learnable biases into the attention mechanism is the most effective method to utilise full and partial causal information. Our experiments show that this conditioning allows a general-purpose CFM to match the performance of specialised models trained on specific causal structures. Overall, our approach addresses a central hurdle on the path towards all-in-one causal foundation models: the capability to answer causal queries in a data-driven manner while effectively leveraging any amount of domain expertise.

[LG-7] Locally Adaptive Multi-Objective Learning

链接: https://arxiv.org/abs/2602.14952
作者: Jivat Neet Kaur,Isaac Gibbs,Michael I. Jordan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Code is available at this https URL

点击查看摘要

Abstract:We consider the general problem of learning a predictor that satisfies multiple objectives of interest simultaneously, a broad framework that captures a range of specific learning goals including calibration, regret, and multiaccuracy. We work in an online setting where the data distribution can change arbitrarily over time. Existing approaches to this problem aim to minimize the set of objectives over the entire time horizon in a worst-case sense, and in practice they do not necessarily adapt to distribution shifts. Earlier work has aimed to alleviate this problem by incorporating additional objectives that target local guarantees over contiguous subintervals. Empirical evaluation of these proposals is, however, scarce. In this article, we consider an alternative procedure that achieves local adaptivity by replacing one part of the multi-objective learning method with an adaptive online algorithm. Empirical evaluations on datasets from energy forecasting and algorithmic fairness show that our proposed method improves upon existing approaches and achieves unbiased predictions over subgroups, while remaining robust under distribution shift.

[LG-8] Gradient Networks for Universal Magnetic Modeling of Synchronous Machines

链接: https://arxiv.org/abs/2602.14947
作者: Junyi Li,Tim Foissner,Floran Martin,Antti Piippo,Marko Hinkkanen
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a physics-informed neural network approach for dynamic modeling of saturable synchronous machines, including cases with spatial harmonics. We introduce an architecture that incorporates gradient networks directly into the fundamental machine equations, enabling accurate modeling of the nonlinear and coupled electromagnetic constitutive relationship. By learning the gradient of the magnetic field energy, the model inherently satisfies energy balance (reciprocity conditions). The proposed architecture can universally approximate any physically feasible magnetic behavior and offers several advantages over lookup tables and standard machine learning models: it requires less training data, ensures monotonicity and reliable extrapolation, and produces smooth outputs. These properties further enable robust model inversion and optimal trajectory generation, often needed in control applications. We validate the proposed approach using measured and finite-element method (FEM) datasets from a 5.6-kW permanent-magnet (PM) synchronous reluctance machine. Results demonstrate accurate and physically consistent models, even with limited training data.

[LG-9] Fault Detection in Electrical Distribution System using Autoencoders

链接: https://arxiv.org/abs/2602.14939
作者: Sidharthenee Nayak,Victor Sam Moses Babu,Chandrashekhar Narayan Bhende,Pratyush Chakraborty,Mayukha Pal
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent times, there has been considerable interest in fault detection within electrical power systems, garnering attention from both academic researchers and industry professionals. Despite the development of numerous fault detection methods and their adaptations over the past decade, their practical application remains highly challenging. Given the probabilistic nature of fault occurrences and parameters, certain decision-making tasks could be approached from a probabilistic standpoint. Protective systems are tasked with the detection, classification, and localization of faulty voltage and current line magnitudes, culminating in the activation of circuit breakers to isolate the faulty line. An essential aspect of designing effective fault detection systems lies in obtaining reliable data for training and testing, which is often scarce. Leveraging deep learning techniques, particularly the powerful capabilities of pattern classifiers in learning, generalizing, and parallel processing, offers promising avenues for intelligent fault detection. To address this, our paper proposes an anomaly-based approach for fault detection in electrical power systems, employing deep autoencoders. Additionally, we utilize Convolutional Autoencoders (CAE) for dimensionality reduction, which, due to its fewer parameters, requires less training time compared to conventional autoencoders. The proposed method demonstrates superior performance and accuracy compared to alternative detection approaches by achieving an accuracy of 97.62% and 99.92% on simulated and publicly available datasets.

[LG-10] Variance-Reduced (varepsilonδ)-Unlearning using Forget Set Gradients

链接: https://arxiv.org/abs/2602.14938
作者: Martin Van Waerebeke,Marco Lorenzi,Kevin Scaman,El Mahdi El Mhamdi,Giovanni Neglia
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In machine unlearning, (\varepsilon,\delta)- unlearning is a popular framework that provides formal guarantees on the effectiveness of the removal of a subset of training data, the forget set, from a trained model. For strongly convex objectives, existing first-order methods achieve (\varepsilon,\delta)- unlearning, but they only use the forget set to calibrate injected noise, never as a direct optimization signal. In contrast, efficient empirical heuristics often exploit the forget samples (e.g., via gradient ascent) but come with no formal unlearning guarantees. We bridge this gap by presenting the Variance-Reduced Unlearning (VRU) algorithm. To the best of our knowledge, VRU is the first first-order algorithm that directly includes forget set gradients in its update rule, while provably satisfying ( (\varepsilon,\delta)- unlearning. We establish the convergence of VRU and show that incorporating the forget set yields strictly improved rates, i.e. a better dependence on the achieved error compared to existing first-order (\varepsilon,\delta)- unlearning methods. Moreover, we prove that, in a low-error regime, VRU asymptotically outperforms any first-order method that ignores the forget this http URL corroborate our theory, showing consistent gains over both state-of-the-art certified unlearning methods and over empirical baselines that explicitly leverage the forget set.

[LG-11] Coverag e Guarantees for Pseudo-Calibrated Conformal Prediction under Distribution Shift

链接: https://arxiv.org/abs/2602.14913
作者: Farbod Siahkali,Ashwin Verma,Vijay Gupta
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Under review. 6 pages, 2 figures, 1 table

点击查看摘要

Abstract:Conformal prediction (CP) offers distribution-free marginal coverage guarantees under an exchangeability assumption, but these guarantees can fail if the data distribution shifts. We analyze the use of pseudo-calibration as a tool to counter this performance loss under a bounded label-conditional covariate shift model. Using tools from domain adaptation, we derive a lower bound on target coverage in terms of the source-domain loss of the classifier and a Wasserstein measure of the shift. Using this result, we provide a method to design pseudo-calibrated sets that inflate the conformal threshold by a slack parameter to keep target coverage above a prescribed level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels as a function of classifier uncertainty. Numerical experiments show that our bounds qualitatively track pseudo-calibration behavior and that the source-tuned scheme mitigates coverage degradation under distribution shift while maintaining nontrivial prediction set sizes.

[LG-12] Algorithmic Simplification of Neural Networks with Mosaic-of-Motifs

链接: https://arxiv.org/abs/2602.14896
作者: Pedram Bakhtiarifard,Tong Chen,Jonathan Wenshøj,Erik B Dam,Raghavendra Selvan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale deep learning models are well-suited for compression. Methods like pruning, quantization, and knowledge distillation have been used to achieve massive reductions in the number of model parameters, with marginal performance drops across a variety of architectures and tasks. This raises the central question: \emphWhy are deep neural networks suited for compression? In this work, we take up the perspective of algorithmic complexity to explain this behavior. We hypothesize that the parameters of trained models have more structure and, hence, exhibit lower algorithmic complexity compared to the weights at (random) initialization. Furthermore, that model compression methods harness this reduced algorithmic complexity to compress models. Although an unconstrained parameterization of model weights, \mathbfw \in \mathbbR^n , can represent arbitrary weight assignments, the solutions found during training exhibit repeatability and structure, making them algorithmically simpler than a generic program. To this end, we formalize the Kolmogorov complexity of \mathbfw by \mathcalK(\mathbfw) . We introduce a constrained parameterization \widehat\mathbfw , that partitions parameters into blocks of size s , and restricts each block to be selected from a set of k reusable motifs, specified by a reuse pattern (or mosaic). The resulting method, \textitMosaic-of-Motifs (MoMos), yields algorithmically simpler model parameterization compared to unconstrained models. Empirical evidence from multiple experiments shows that the algorithmic complexity of neural networks, measured using approximations to Kolmogorov complexity, can be reduced during training. This results in models that perform comparably with unconstrained models while being algorithmically simpler.

[LG-13] A Prag matic Method for Comparing Clusterings with Overlaps and Outliers

链接: https://arxiv.org/abs/2602.14855
作者: Ryan DeWolfe,Paweł Prałat,François Théberge
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Combinatorics (math.CO)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:Clustering algorithms are an essential part of the unsupervised data science ecosystem, and extrinsic evaluation of clustering algorithms requires a method for comparing the detected clustering to a ground truth clustering. In a general setting, the detected and ground truth clusterings may have outliers (objects belonging to no cluster), overlapping clusters (objects may belong to more than one cluster), or both, but methods for comparing these clusterings are currently undeveloped. In this note, we define a pragmatic similarity measure for comparing clusterings with overlaps and outliers, show that it has several desirable properties, and experimentally confirm that it is not subject to several common biases afflicting other clustering comparison measures.

[LG-14] BEACONS: Bounded-Error Algebraically-Composable Neural Solvers for Partial Differential Equations

链接: https://arxiv.org/abs/2602.14853
作者: Jonathan Gorard,Ammar Hakim,James Juno
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: 31 pages, 8 figures, 9 tables

点击查看摘要

Abstract:The traditional limitations of neural networks in reliably generalizing beyond the convex hulls of their training data present a significant problem for computational physics, in which one often wishes to solve PDEs in regimes far beyond anything which can be experimentally or analytically validated. In this paper, we show how it is possible to circumvent these limitations by constructing formally-verified neural network solvers for PDEs, with rigorous convergence, stability, and conservation properties, whose correctness can therefore be guaranteed even in extrapolatory regimes. By using the method of characteristics to predict the analytical properties of PDE solutions a priori (even in regions arbitrarily far from the training domain), we show how it is possible to construct rigorous extrapolatory bounds on the worst-case L^inf errors of shallow neural network approximations. Then, by decomposing PDE solutions into compositions of simpler functions, we show how it is possible to compose these shallow neural networks together to form deep architectures, based on ideas from compositional deep learning, in which the large L^inf errors in the approximations have been suppressed. The resulting framework, called BEACONS (Bounded-Error, Algebraically-COmposable Neural Solvers), comprises both an automatic code-generator for the neural solvers themselves, as well as a bespoke automated theorem-proving system for producing machine-checkable certificates of correctness. We apply the framework to a variety of linear and non-linear PDEs, including the linear advection and inviscid Burgers’ equations, as well as the full compressible Euler equations, in both 1D and 2D, and illustrate how BEACONS architectures are able to extrapolate solutions far beyond the training data in a reliable and bounded way. Various advantages of the approach over the classical PINN approach are discussed.

[LG-15] Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment AAMAS2026

链接: https://arxiv.org/abs/2602.14844
作者: Elias Malomgré,Pieter Simoens
类目: Machine Learning (cs.LG)
*备注: Accepted for the AAMAS 2026 Blue Sky Ideas track

点击查看摘要

Abstract:AI alignment is growing in importance, yet current approaches suffer from a critical structural flaw that entangles the safety objectives with the agent’s policy. Methods such as Reinforcement Learning from Human Feedback and Direct Preference Optimization create opaque, single-use alignment artifacts, which we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning to decouple alignment artifact learning from policy optimization, producing an inspectable, editable, and model-agnostic reward model. Additionally, we introduce the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement. This architecture transforms safety from a disposable expense into a durable, verifiable engineering asset.

[LG-16] Extending Multi-Source Bayesian Optimization With Causality Principles AAMAS2026

链接: https://arxiv.org/abs/2602.14791
作者: Luuk Jacobs,Mohammad Ali Javidian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: An extended abstract version of this work was accepted for the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

点击查看摘要

Abstract:Multi-Source Bayesian Optimization (MSBO) serves as a variant of the traditional Bayesian Optimization (BO) framework applicable to situations involving optimization of an objective black-box function over multiple information sources such as simulations, surrogate models, or real-world experiments. However, traditional MSBO assumes the input variables of the objective function to be independent and identically distributed, limiting its effectiveness in scenarios where causal information is available and interventions can be performed, such as clinical trials or policy-making. In the single-source domain, Causal Bayesian Optimization (CBO) extends standard BO with the principles of causality, enabling better modeling of variable dependencies. This leads to more accurate optimization, improved decision-making, and more efficient use of low-cost information sources. In this article, we propose a principled integration of the MSBO and CBO methodologies in the multi-source domain, leveraging the strengths of both to enhance optimization efficiency and reduce computational complexity in higher-dimensional problems. We present the theoretical foundations of both Causal and Multi-Source Bayesian Optimization, and demonstrate how their synergy informs our Multi-Source Causal Bayesian Optimization (MSCBO) algorithm. We compare the performance of MSCBO against its foundational counterparts for both synthetic and real-world datasets with varying levels of noise, highlighting the robustness and applicability of MSCBO. Based on our findings, we conclude that integrating MSBO with the causality principles of CBO facilitates dimensionality reduction and lowers operational costs, ultimately improving convergence speed, performance, and scalability.

[LG-17] On the Stability of Nonlinear Dynamics in GD and SGD: Beyond Quadratic Potentials

链接: https://arxiv.org/abs/2602.14789
作者: Rotem Mulayoff,Sebastian U. Stich
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:The dynamical stability of the iterates during training plays a key role in determining the minima obtained by optimization algorithms. For example, stable solutions of gradient descent (GD) correspond to flat minima, which have been associated with favorable features. While prior work often relies on linearization to determine stability, it remains unclear whether linearized dynamics faithfully capture the full nonlinear behavior. Recent work has shown that GD may stably oscillate near a linearly unstable minimum and still converge once the step size decays, indicating that linear analysis can be misleading. In this work, we explicitly study the effect of nonlinear terms. Specifically, we derive an exact criterion for stable oscillations of GD near minima in the multivariate setting. Our condition depends on high-order derivatives, generalizing existing results. Extending the analysis to stochastic gradient descent (SGD), we show that nonlinear dynamics can diverge in expectation even if a single batch is unstable. This implies that stability can be dictated by a single batch that oscillates unstably, rather than an average effect, as linear analysis suggests. Finally, we prove that if all batches are linearly stable, the nonlinear dynamics of SGD are stable in expectation.

[LG-18] Learning Structural Hardness for Combinatorial Auctions: Instance-Dependent Algorithm Selection via Graph Neural Networks

链接: https://arxiv.org/abs/2602.14772
作者: Sungwoo Kang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Winner Determination Problem (WDP) in combinatorial auctions is NP-hard, and no existing method reliably predicts which instances will defeat fast greedy heuristics. The ML-for-combinatorial-optimization community has focused on learning to \emphreplace solvers, yet recent evidence shows that graph neural networks (GNNs) rarely outperform well-tuned classical methods on standard benchmarks. We pursue a different objective: learning to predict \emphwhen a given instance is hard for greedy allocation, enabling instance-dependent algorithm selection. We design a 20-dimensional structural feature vector and train a lightweight MLP hardness classifier that predicts the greedy optimality gap with mean absolute error 0.033, Pearson correlation 0.937, and binary classification accuracy 94.7% across three random seeds. For instances identified as hard – those exhibiting ``whale-fish’’ trap structure where greedy provably fails – we deploy a heterogeneous GNN specialist that achieves \approx0% optimality gap on all six adversarial configurations tested (vs.\ 3.75–59.24% for greedy). A hybrid allocator combining the hardness classifier with GNN and greedy solvers achieves 0.51% overall gap on mixed distributions. Our honest evaluation on CATS benchmarks confirms that GNNs do not outperform Gurobi (0.45–0.71 vs.\ 0.20 gap), motivating the algorithm selection framing. Learning \emphwhen to deploy expensive solvers is more tractable than learning to replace them.

[LG-19] Solving Inverse Parametrized Problems via Finite Elements and Extreme Learning Networks

链接: https://arxiv.org/abs/2602.14757
作者: Erik Burman,Mats G. Larson,Karl Larsson,Jonatan Vallin
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop an interpolation-based reduced-order modeling framework for parameter-dependent partial differential equations arising in control, inverse problems, and uncertainty quantification. The solution is discretized in the physical domain using finite element methods, while the dependence on a finite-dimensional parameter is approximated separately. We establish existence, uniqueness, and regularity of the parametric solution and derive rigorous error estimates that explicitly quantify the interplay between spatial discretization and parameter approximation. In low-dimensional parameter spaces, classical interpolation schemes yield algebraic convergence rates based on Sobolev regularity in the parameter variable. In higher-dimensional parameter spaces, we replace classical interpolation by extreme learning machine (ELM) surrogates and obtain error bounds under explicit approximation and stability assumptions. The proposed framework is applied to inverse problems in quantitative photoacoustic tomography, where we derive potential and parameter reconstruction error estimates and demonstrate substantial computational savings compared to standard approaches, without sacrificing accuracy. Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG) Cite as: arXiv:2602.14757 [math.NA] (or arXiv:2602.14757v1 [math.NA] for this version) https://doi.org/10.48550/arXiv.2602.14757 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-20] Parameter-Minimal Neural DE Solvers via Horner Polynomials

链接: https://arxiv.org/abs/2602.14737
作者: T. Matulić,D. Seršić
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 16 pages

点击查看摘要

Abstract:We propose a parameter-minimal neural architecture for solving differential equations by restricting the hypothesis class to Horner-factorized polynomials, yielding an implicit, differentiable trial solution with only a small set of learnable coefficients. Initial conditions are enforced exactly by construction by fixing the low-order polynomial degrees of freedom, so training focuses solely on matching the differential-equation residual at collocation points. To reduce approximation error without abandoning the low-parameter regime, we introduce a piecewise (“spline-like”) extension that trains multiple small Horner models on subintervals while enforcing continuity (and first-derivative continuity) at segment boundaries. On illustrative ODE benchmarks and a heat-equation example, Horner networks with tens (or fewer) parameters accurately match the solution and its derivatives and outperform small MLP and sinusoidal-representation baselines under the same training settings, demonstrating a practical accuracy-parameter trade-off for resource-efficient scientific modeling.

[LG-21] D2-LoRA: A Synergistic Approach to Differential and Directional Low-Rank Adaptation

链接: https://arxiv.org/abs/2602.14728
作者: Nozomu Fujisawa,Masaaki Kondo
类目: Machine Learning (cs.LG)
*备注: 19 pages, 3 figures

点击查看摘要

Abstract:We systematically investigate the parameter-efficient fine-tuning design space under practical data and compute constraints, and propose D2-LoRA. D2-LoRA achieves 76.4 percent average accuracy across eight question answering and reading comprehension benchmarks using only 5k training samples per task and two epochs, while preserving algebraic mergeability at inference with near-exact numerical equivalence. The method combines signed low-rank residual updates with additive and subtractive components, together with a train-time column-wise projection that keeps each column close to its original norm. After training, the adapter is merged into a single weight matrix, adding zero inference latency. Compared with LoRA, D2-LoRA improves average accuracy by 2.2 percentage points; at matched parameter counts (LoRA rank 2r versus D2-LoRA rank r), the improvement is 1.6 points, indicating gains from architectural design rather than increased parameterization. Compared with DoRA, it matches or exceeds performance on most tasks. Beyond QA and reading comprehension, D2-LoRA improves generative tasks (plus 1.2 ROUGE-L and plus 1.1 percent win rate) and shows 36 percent lower training volatility. The merge preserves numerical fidelity (mean gap about 0.03 percentage points) and recovers about 1.91x evaluation throughput. Training overhead is 19 percent, comparable to DoRA, and decreases with longer input sequences. We provide a geometric analysis explaining how the projection stabilizes training, together with ablation studies isolating the contribution of each design component.

[LG-22] Unbiased Approximate Vector-Jacobian Products for Efficient Backpropagation

链接: https://arxiv.org/abs/2602.14701
作者: Killian Bakong(DI-ENS),Laurent Massoulié(Inria, ARGO, CMAP),Edouard Oyallon(MLIA),Kevin Scaman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this work we introduce methods to reduce the computational and memory costs of training deep neural networks. Our approach consists in replacing exact vector-jacobian products by randomized, unbiased approximations thereof during backpropagation. We provide a theoretical analysis of the trade-off between the number of epochs needed to achieve a target precision and the cost reduction for each epoch. We then identify specific unbiased estimates of vector-jacobian products for which we establish desirable optimality properties of minimal variance under sparsity constraints. Finally we provide in-depth experiments on multi-layer perceptrons, BagNets and Visual Transfomers architectures. These validate our theoretical results, and confirm the potential of our proposed unbiased randomized backpropagation approach for reducing the cost of deep learning.

[LG-23] A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesnt)

链接: https://arxiv.org/abs/2602.14696
作者: Nihal V. Nayak,Paula Rodriguez-Diaz,Neha Hulkund,Sara Beery,David Alvarez-Melis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at this https URL.

[LG-24] Pseudo-differential-enhanced physics-informed neural networks

链接: https://arxiv.org/abs/2602.14663
作者: Andrew Gracyk
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: First version

点击查看摘要

Abstract:We present pseudo-differential enhanced physics-informed neural networks (PINNs), an extension of gradient enhancement but in Fourier space. Gradient enhancement of PINNs dictates that the PDE residual is taken to a higher differential order than prescribed by the PDE, added to the objective as an augmented term in order to improve training and overall learning fidelity. We propose the same procedure after application via Fourier transforms, since differentiating in Fourier space is multiplication with the Fourier wavenumber under suitable decay. Our methods are fast and efficient. Our methods oftentimes achieve superior PINN versus numerical error in fewer training iterations, potentially pair well with few samples in collocation, and can on occasion break plateaus in low collocation settings. Moreover, our methods are suitable for fractional derivatives. We establish that our methods improve spectral eigenvalue decay of the neural tangent kernel (NTK), and so our methods contribute towards the learning of high frequencies in early training, mitigating the effects of frequency bias up to the polynomial order and possibly greater with smooth activations. Our methods accommodate advanced techniques in PINNs, such as Fourier feature embeddings. A pitfall of discrete Fourier transforms via the Fast Fourier Transform (FFT) is mesh subjugation, and so we demonstrate compatibility of our methods for greater mesh flexibility and invariance on alternative Euclidean and non-Euclidean domains via Monte Carlo methods and otherwise.

[LG-25] An Embarrassingly Simple Way to Optimize Orthogonal Matrices at Scale

链接: https://arxiv.org/abs/2602.14656
作者: Adrián Javaloy,Antonio Vergari
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Optimization and Control (math.OC)
*备注: 23 pages, 10 figures, in review

点击查看摘要

Abstract:Orthogonality constraints are ubiquitous in robust and probabilistic machine learning. Unfortunately, current optimizers are computationally expensive and do not scale to problems with hundreds or thousands of constraints. One notable exception is the Landing algorithm (Ablin et al., 2024) which, however comes at the expense of temporarily relaxing orthogonality. In this work, we revisit and improve on the ideas behind Landing, enabling the inclusion of modern adaptive optimizers while ensuring that orthogonal constraints are effectively met. Remarkably, these improvements come at little to no cost, and reduce the number of required hyperparemeters. Our algorithm POGO is fast and GPU-friendly, consisting of only 5 matrix products, and in practice maintains orthogonality at all times. On several challenging benchmarks, POGO greatly outperforms recent optimizers and shows it can optimize problems with thousands of orthogonal matrices in minutes while alternatives would take hours. As such, POGO sets a milestone to finally exploit orthogonality constraints in ML at scale. A PyTorch implementation of POGO is publicly available at this https URL.

[LG-26] Concepts Information Bottleneck Models ICLR2026

链接: https://arxiv.org/abs/2602.14626
作者: Karim Galliamov,Syed M Ahsan Kazmi,Adil Khan,Adín Ramírez Rivera
类目: Machine Learning (cs.LG)
*备注: To appear in ICLR 2026, code: this https URL

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) aim to deliver interpretable predictions by routing decisions through a human-understandable concept layer, yet they often suffer reduced accuracy and concept leakage that undermines faithfulness. We introduce an explicit Information Bottleneck regularizer on the concept layer that penalizes I(X;C) while preserving task-relevant information in I(C;Y) , encouraging minimal-sufficient concept representations. We derive two practical variants (a variational objective and an entropy-based surrogate) and integrate them into standard CBM training without architectural changes or additional supervision. Evaluated across six CBM families and three benchmarks, the IB-regularized models consistently outperform their vanilla counterparts. Information-plane analyses further corroborate the intended behavior. These results indicate that enforcing a minimal-sufficient concept bottleneck improves both predictive performance and the reliability of concept-level interventions. The proposed regularizer offers a theoretic-grounded, architecture-agnostic path to more faithful and intervenable CBMs, resolving prior evaluation inconsistencies by aligning training protocols and demonstrating robust gains across model families and datasets.

[LG-27] Replicable Constrained Bandits

链接: https://arxiv.org/abs/2602.14580
作者: Matteo Bollini,Gianmarco Genalti,Francesco Emanuele Stradi,Matteo Castiglioni,Alberto Marchesi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Algorithmic \emphreplicability has recently been introduced to address the need for reproducible experiments in machine learning. A \emphreplicable online learning algorithm is one that takes the same sequence of decisions across different executions in the same environment, with high probability. We initiate the study of algorithmic replicability in \emphconstrained MAB problems, where a learner interacts with an unknown stochastic environment for T rounds, seeking not only to maximize reward but also to satisfy multiple constraints. Our main result is that replicability can be achieved in constrained MABs. Specifically, we design replicable algorithms whose regret and constraint violation match those of non-replicable ones in terms of T . As a key step toward these guarantees, we develop the first replicable UCB-like algorithm for \emphunconstrained MABs, showing that algorithms that employ the optimism in-the-face-of-uncertainty principle can be replicable, a result that we believe is of independent interest.

[LG-28] RNM-TD3: N:M Semi-structured Sparse Reinforcement Learning From Scratch

链接: https://arxiv.org/abs/2602.14578
作者: Isam Vrce,Andreas Kassler,Gökçe Aydos
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Sparsity is a well-studied technique for compressing deep neural networks (DNNs) without compromising performance. In deep reinforcement learning (DRL), neural networks with up to 5% of their original weights can still be trained with minimal performance loss compared to their dense counterparts. However, most existing methods rely on unstructured fine-grained sparsity, which limits hardware acceleration opportunities due to irregular computation patterns. Structured coarse-grained sparsity enables hardware acceleration, yet typically degrades performance and increases pruning complexity. In this work, we present, to the best of our knowledge, the first study on N:M structured sparsity in RL, which balances compression, performance, and hardware efficiency. Our framework enforces row-wise N:M sparsity throughout training for all networks in off-policy RL (TD3), maintaining compatibility with accelerators that support N:M sparse matrix operations. Experiments on continuous-control benchmarks show that RNM-TD3, our N:M sparse agent, outperforms its dense counterpart at 50%-75% sparsity (e.g., 2:4 and 1:4), achieving up to a 14% increase in performance at 2:4 sparsity on the Ant environment. RNM-TD3 remains competitive even at 87.5% sparsity (1:8), while enabling potential training speedups.

[LG-29] DCTracks: An Open Dataset for Machine Learning-Based Drift Chamber Track Reconstruction

链接: https://arxiv.org/abs/2602.14571
作者: Qian Liyan,Zhang Yao,Yuan Ye,Zhang Zhaoke,Fang Jin,Jiang Shimiao,Zhang Jin,Li Ke,Liu Beijiang,Xu Chenglin,Zhang Yifan,Jia Xiaoqian,Qin Xiaoshuai,Huang Xingtao
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:We introduce a Monte Carlo (MC) dataset of single- and two-track drift chamber events to advance Machine Learning (ML)-based track reconstruction. To enable standardized and comparable evaluation, we define track reconstruction specific metrics and report results for traditional track reconstruction algorithms and a Graph Neural Networks (GNNs) method, facilitating rigorous, reproducible validation for future research.

[LG-30] ruly Adapting to Adversarial Constraints in Constrained MABs

链接: https://arxiv.org/abs/2602.14543
作者: Francesco Emanuele Stradi,Kalana Kalupahana,Matteo Castiglioni,Alberto Marchesi,Nicola Gatti
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the constrained variant of the \emphmulti-armed bandit (MAB) problem, in which the learner aims not only at minimizing the total loss incurred during the learning dynamic, but also at controlling the violation of multiple \emphunknown constraints, under both \emphfull and \emphbandit feedback. We consider a non-stationary environment that subsumes both stochastic and adversarial models and where, at each round, both losses and constraints are drawn from distributions that may change arbitrarily over time. In such a setting, it is provably not possible to guarantee both sublinear regret and sublinear violation. Accordingly, prior work has mainly focused either on settings with stochastic constraints or on relaxing the benchmark with fully adversarial constraints (\emphe.g., via competitive ratios with respect to the optimum). We provide the first algorithms that achieve optimal rates of regret and \emphpositive constraint violation when the constraints are stochastic while the losses may vary arbitrarily, and that simultaneously yield guarantees that degrade smoothly with the degree of adversariality of the constraints. Specifically, under \emphfull feedback we propose an algorithm attaining \widetilde\mathcalO(\sqrtT+C) regret and \widetilde\mathcalO(\sqrtT+C) positive violation, where C quantifies the amount of non-stationarity in the constraints. We then show how to extend these guarantees when only bandit feedback is available for the losses. Finally, when \emphbandit feedback is available for the constraints, we design an algorithm achieving \widetilde\mathcalO(\sqrtT+C) positive violation and \widetilde\mathcalO(\sqrtT+C\sqrtT) regret.

[LG-31] Covariance-Aware Transformers for Quadratic Programming and Decision Making

链接: https://arxiv.org/abs/2602.14506
作者: Kutay Tire,Yufan Zhang,Ege Onur Taga,Samet Oymak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We explore the use of transformers for solving quadratic programs and how this capability benefits decision-making problems that involve covariance matrices. We first show that the linear attention mechanism can provably solve unconstrained QPs by tokenizing the matrix variables (e.g.~ A of the objective \frac12x^\top Ax+b^\top x ) row-by-row and emulating gradient descent iterations. Furthermore, by incorporating MLPs, a transformer block can solve (i) \ell_1 -penalized QPs by emulating iterative soft-thresholding and (ii) \ell_1 -constrained QPs when equipped with an additional feedback loop. Our theory motivates us to introduce Time2Decide: a generic method that enhances a time series foundation model (TSFM) by explicitly feeding the covariance matrix between the variates. We empirically find that Time2Decide uniformly outperforms the base TSFM model for the classical portfolio optimization problem that admits an \ell_1 -constrained QP formulation. Remarkably, Time2Decide also outperforms the classical “Predict-then-Optimize (PtO)” procedure, where we first forecast the returns and then explicitly solve a constrained QP, in suitable settings. Our results demonstrate that transformers benefit from explicit use of second-order statistics, and this can enable them to effectively solve complex decision-making problems, like portfolio construction, in one forward pass.

[LG-32] Divine Benevolence is an x2: GLUs scale asymptotically faster than MLPs

链接: https://arxiv.org/abs/2602.14495
作者: Alejandro Francisco Queiruga
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling laws can be understood from ground-up numerical analysis, where traditional function approximation theory can explain shifts in model architecture choices. GLU variants now dominate frontier LLMs and similar outer-product architectures are prevalent in ranking models. The success of these architectures has mostly been left as an empirical discovery. In this paper, we apply the tools of numerical analysis to expose a key factor: these models have an x^2 which enables \emphasymptotically faster scaling than MLPs. GLUs have piecewise quadratic functional forms that are sufficient to exhibit quadratic order of approximation. Our key contribution is to demonstrate that the L§ scaling slope is L§\propto P^-3 for GLUs but only L§=P^-2 for MLPs on function reconstruction problems. We provide a parameter construction and empirical verification of these slopes for 1D function approximation. From the first principles we discover, we make one stride and propose the ``Gated Quadratic Unit’’ which has an even steeper L§ slope than the GLU and MLP. This opens the possibility of architecture design from first principles numerical theory to unlock superior scaling in large models. Replication code is available at this https URL.

[LG-33] One Good Source is All You Need: Near-Optimal Regret for Bandits under Heterogeneous Noise

链接: https://arxiv.org/abs/2602.14474
作者: Aadirupa Saha,Amith Bhat,Haipeng Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study K -armed Multiarmed Bandit (MAB) problem with M heterogeneous data sources, each exhibiting unknown and distinct noise variances \sigma_j^2_j=1^M . The learner’s objective is standard MAB regret minimization, with the additional complexity of adaptively selecting which data source to query from at each round. We propose Source-Optimistic Adaptive Regret minimization (SOAR), a novel algorithm that quickly prunes high-variance sources using sharp variance-concentration bounds, followed by a `balanced min-max LCB-UCB approach’ that seamlessly integrates the parallel tasks of identifying the best arm and the optimal (minimum-variance) data source. Our analysis shows SOAR achieves an instance-dependent regret bound of \tildeO\left(\sigma^^2\sum_i=2^K \frac\log T\Delta_i + \sqrtK \sum_j=1^M \sigma_j^2\right) , up to preprocessing costs depending only on problem parameters, where \sigma^^2 := \min_j \sigma_j^2 is the minimum source variance and \Delta_i denotes the suboptimality gap of the i -th arm. This result is both surprising as despite lacking prior knowledge of the minimum-variance source among M alternatives, SOAR attains the optimal instance-dependent regret of standard single-source MAB with variance \sigma^^2 , while incurring only an small (and unavoidable) additive cost of \tilde O(\sqrtK \sum_j=1^M \sigma_j^2) towards the optimal (minimum variance) source identification. Our theoretical bounds represent a significant improvement over some proposed baselines, e.g. Uniform UCB or Explore-then-Commit UCB, which could potentially suffer regret scaling with \sigma_\max^2 in place of \sigma^^2 -a gap that can be arbitrarily large when \sigma_\max \gg \sigma^* . Experiments on multiple synthetic problem instances and the real-world MovieLens;25M dataset, demonstrating the superior performance of SOAR over the baselines.

[LG-34] LACONIC: Length-Aware Constrained Reinforcement Learning for LLM

链接: https://arxiv.org/abs/2602.14468
作者: Chang Liu,Yiran Zhao,Lawrence Liu,Yaoqi Ye,Csaba Szepesvári,Lin F. Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can introduce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide a theoretical guarantee that support the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL-tuning with no inference changes and minimal deployment overhead.

[LG-35] raceable Latent Variable Discovery Based on Multi-Agent Collaboration

链接: https://arxiv.org/abs/2602.14456
作者: Huaming Du,Tao Hu,Yijie Huang,Yu Zhao,Guisong Liu,Tao Gu,Gang Kou,Carl Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Revealing the underlying causal mechanisms in the real world is crucial for scientific and technological progress. Despite notable advances in recent decades, the lack of high-quality data and the reliance of traditional causal discovery algorithms (TCDA) on the assumption of no latent confounders, as well as their tendency to overlook the precise semantics of latent variables, have long been major obstacles to the broader application of causal discovery. To address this issue, we propose a novel causal modeling framework, TLVD, which integrates the metadata-based reasoning capabilities of large language models (LLMs) with the data-driven modeling capabilities of TCDA for inferring latent variables and their semantics. Specifically, we first employ a data-driven approach to construct a causal graph that incorporates latent variables. Then, we employ multi-LLM collaboration for latent variable inference, modeling this process as a game with incomplete information and seeking its Bayesian Nash Equilibrium (BNE) to infer the possible specific latent variables. Finally, to validate the inferred latent variables across multiple real-world web-based data sources, we leverage LLMs for evidence exploration to ensure traceability. We comprehensively evaluate TLVD on three de-identified real patient datasets provided by a hospital and two benchmark datasets. Extensive experimental results confirm the effectiveness and reliability of TLVD, with average improvements of 32.67% in Acc, 62.21% in CAcc, and 26.72% in ECit across the five datasets.

[LG-36] A unified framework for evaluating the robustness of machine-learning interpretability for prospect risking

链接: https://arxiv.org/abs/2602.14430
作者: Prithwijit Chowdhury,Ahmad Mustafa,Mohit Prabhushankar,Ghassan AlRegib
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In geophysics, hydrocarbon prospect risking involves assessing the risks associated with hydrocarbon exploration by integrating data from various sources. Machine learning-based classifiers trained on tabular data have been recently used to make faster decisions on these prospects. The lack of transparency in the decision-making processes of such models has led to the emergence of explainable AI (XAI). LIME and SHAP are two such examples of these XAI methods which try to generate explanations of a particular decision by ranking the input features in terms of importance. However, explanations of the same scenario generated by these two different explanation strategies have shown to disagree or be different, particularly for complex data. This is because the definitions of “importance” and “relevance” differ for different explanation strategies. Thus, grounding these ranked features using theoretically backed causal ideas of necessity and sufficiency can prove to be a more reliable and robust way to improve the trustworthiness of the concerned explanation this http URL propose a unified framework to generate counterfactuals as well as quantify necessity and sufficiency and use these to perform a robustness evaluation of the explanations provided by LIME and SHAP on high dimensional structured prospect risking data. This robustness test gives us deeper insights into the models capabilities to handle erronous data and which XAI module works best in pair with which model for our dataset for hydorcarbon indication.

[LG-37] LRD-MPC: Efficient MPC Inference through Low-rank Decomposition

链接: https://arxiv.org/abs/2602.14397
作者: Tingting Tang,Yongqin Wang,Murali Annavaram
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Secure Multi-party Computation (MPC) enables untrusted parties to jointly compute a function without revealing their inputs. Its application to machine learning (ML) has gained significant attention, particularly for secure inference services deployed across multiple cloud virtual machines (VMs), where each VM acts as an MPC party. Model providers secret-share model weights, and users secret-share inputs, ensuring that each server operates only on random shares. While MPC provides strong cryptographic guarantees, it incurs substantial computational and communication overhead. Deep neural networks rely heavily on convolutional and fully connected layers, which require costly matrix multiplications in MPC. To reduce this cost, we propose leveraging low-rank decomposition (LRD) for linear layers, replacing one large matrix multiplication with two smaller ones. Each matrix multiplication in MPC incurs a round of communication, meaning decomposing one matrix multiplication into two leads to an additional communication round. Second, the added matrix multiplication requires an additional truncation step to maintain numerical precision. Since truncation itself requires communication and computation, these overheads can offset the gains from decomposition. To address this, we introduce two complementary optimizations: truncation skipping and efficient linear layer concatenation. Truncation skipping removes the extra truncation induced by LRD, while linear layer concatenation pipelines operations to hide the additional communication round. Together, these techniques mitigate the main overheads of LRD in MPC and improve overall efficiency. Our approach is broadly applicable across MPC protocols. Experiments show up to 25% speedup in n-PC and 33% in 3-PC protocols over full-rank baselines, along with up to 52% GPU energy savings and 88% reduction in offline-phase latency.

[LG-38] A Study on Multi-Class Online Fuzzy Classifiers for Dynamic Environments

链接: https://arxiv.org/abs/2602.14375
作者: Kensuke Ajimoto,Yuma Yamamoto,Yoshifumi Kusunoki,Tomoharu Nakashima
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a multi-class online fuzzy classifier for dynamic environments. A fuzzy classifier comprises a set of fuzzy if-then rules where human users determine the antecedent fuzzy sets beforehand. In contrast, the consequent real values are determined by learning from training data. In an online framework, not all training dataset patterns are available beforehand. Instead, only a few patterns are available at a time step, and the subsequent patterns become available at the following time steps. The conventional online fuzzy classifier considered only two-class problems. This paper investigates the extension to the conventional fuzzy classifiers for multi-class problems. We evaluate the performance of the multi-class online fuzzy classifiers through numerical experiments on synthetic dynamic data and also several benchmark datasets.

[LG-39] AdaptManip: Learning Adaptive Whole-Body Object Lifting and Delivery with Online Recurrent State Estimation

链接: https://arxiv.org/abs/2602.14363
作者: Morgan Byrd,Donghoon Baek,Kartik Garg,Hyunyoung Jung,Daesol Cho,Maks Sorokin,Robert Wright,Sehoon Ha
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:This paper presents Adaptive Whole-body Loco-Manipulation, AdaptManip, a fully autonomous framework for humanoid robots to perform integrated navigation, object lifting, and delivery. Unlike prior imitation learning-based approaches that rely on human demonstrations and are often brittle to disturbances, AdaptManip aims to train a robust loco-manipulation policy via reinforcement learning without human demonstrations or teleoperation data. The proposed framework consists of three coupled components: (1) a recurrent object state estimator that tracks the manipulated object in real time under limited field-of-view and occlusions; (2) a whole-body base policy for robust locomotion with residual manipulation control for stable object lifting and delivery; and (3) a LiDAR-based robot global position estimator that provides drift-robust localization. All components are trained in simulation using reinforcement learning and deployed on real hardware in a zero-shot manner. Experimental results show that AdaptManip significantly outperforms baseline methods, including imitation learning-based approaches, in adaptability and overall success rate, while accurate object state estimation improves manipulation performance even under occlusion. We further demonstrate fully autonomous real-world navigation, object lifting, and delivery on a humanoid robot.

[LG-40] Conformal Signal Temporal Logic for Robust Reinforcement Learning Control: A Case Study

链接: https://arxiv.org/abs/2602.14322
作者: Hani Beirami,M M Manjurul Islam
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:We investigate how formal temporal logic specifications can enhance the safety and robustness of reinforcement learning (RL) control in aerospace applications. Using the open source AeroBench F-16 simulation benchmark, we train a Proximal Policy Optimization (PPO) agent to regulate engine throttle and track commanded airspeed. The control objective is encoded as a Signal Temporal Logic (STL) requirement to maintain airspeed within a prescribed band during the final seconds of each maneuver. To enforce this specification at run time, we introduce a conformal STL shield that filters the RL agent’s actions using online conformal prediction. We compare three settings: (i) PPO baseline, (ii) PPO with a classical rule-based STL shield, and (iii) PPO with the proposed conformal shield, under both nominal conditions and a severe stress scenario involving aerodynamic model mismatch, actuator rate limits, measurement noise, and mid-episode setpoint jumps. Experiments show that the conformal shield preserves STL satisfaction while maintaining near baseline performance and providing stronger robustness guarantees than the classical shield. These results demonstrate that combining formal specification monitoring with data driven RL control can substantially improve the reliability of autonomous flight control in challenging environments.

[LG-41] In Transformer We Trust? A Perspective on Transformer Architecture Failure Modes

链接: https://arxiv.org/abs/2602.14318
作者: Trishit Mondal,Ameya D. Jagtap
类目: Machine Learning (cs.LG)
*备注: 46 pages, 34 Figures

点击查看摘要

Abstract:Transformer architectures have revolutionized machine learning across a wide range of domains, from natural language processing to scientific computing. However, their growing deployment in high-stakes applications, such as computer vision, natural language processing, healthcare, autonomous systems, and critical areas of scientific computing including climate modeling, materials discovery, drug discovery, nuclear science, and robotics, necessitates a deeper and more rigorous understanding of their trustworthiness. In this work, we critically examine the foundational question: \textitHow trustworthy are transformer models? We evaluate their reliability through a comprehensive review of interpretability, explainability, robustness against adversarial attacks, fairness, and privacy. We systematically examine the trustworthiness of transformer-based models in safety-critical applications spanning natural language processing, computer vision, and science and engineering domains, including robotics, medicine, earth sciences, materials science, fluid dynamics, nuclear science, and automated theorem proving; highlighting high-impact areas where these architectures are central and analyzing the risks associated with their deployment. By synthesizing insights across these diverse areas, we identify recurring structural vulnerabilities, domain-specific risks, and open research challenges that limit the reliable deployment of transformers.

[LG-42] Floe: Federated Specialization for Real-Time LLM -SLM Inference

链接: https://arxiv.org/abs/2602.14302
作者: Chunlin Tian,Kahou Tam,Yebo Wu,Shuaihang Zhong,Li Li,Nicholas D. Lane,Chengzhong Xu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by IEEE Transactions on Parallel and Distributed Systems

点击查看摘要

Abstract:Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing proprietary weights. A heterogeneity-aware LoRA adaptation strategy enables efficient edge deployment across diverse hardware, and a logit-level fusion mechanism enables real-time coordination between edge and cloud models. Extensive experiments demonstrate that Floe enhances user privacy and personalization. Moreover, it significantly improves model performance and reduces inference latency on edge devices under real-time constraints compared with baseline approaches.

[LG-43] MILD: Multi-Intent Learning and Disambiguation for Proactive Failure Prediction in Intent-based Networking

链接: https://arxiv.org/abs/2602.14283
作者: Md. Kamrul Hossain,Walid Aljoby
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Copyright 2026 IEEE. Accepted for presentation in IEEE/IFIP NOMS 2026

点击查看摘要

Abstract:In multi-intent intent-based networks, a single fault can trigger co-drift where multiple intents exhibit symptomatic KPI degradation, creating ambiguity about the true root-cause intent. We present MILD, a proactive framework that reformulates intent assurance from reactive drift detection to fixed-horizon failure prediction with intent-level disambiguation. MILD uses a teacher-augmented Mixture-of-Experts where a gated disambiguation module identifies the root-cause intent while per-intent heads output calibrated risk scores. On a benchmark with non-linear failures and co-drifts, MILD provides 3.8%–92.5% longer remediation lead time and improves intent-level root-cause disambiguation accuracy by 9.4%–45.8% over baselines. MILD also provides per-alert KPI explanations, enabling actionable diagnosis.

[LG-44] Radial-VCReg: More Informative Representation Learning Through Radial Gaussianization NEURIPS2025

链接: https://arxiv.org/abs/2602.14272
作者: Yilun Kuang,Yash Dagade,Deep Chakraborty,Erik Learned-Miller,Randall Balestriero,Tim G. J. Rudner,Yann LeCun
类目: Machine Learning (cs.LG)
*备注: Published in the Unifying Representations in Neural Models (UniReps) and Symmetry and Geometry in Neural Representations (NeurReps) Workshops at NeurIPS 2025

点击查看摘要

Abstract:Self-supervised learning aims to learn maximally informative representations, but explicit information maximization is hindered by the curse of dimensionality. Existing methods like VCReg address this by regularizing first and second-order feature statistics, which cannot fully achieve maximum entropy. We propose Radial-VCReg, which augments VCReg with a radial Gaussianization loss that aligns feature norms with the Chi distribution-a defining property of high-dimensional Gaussians. We prove that Radial-VCReg transforms a broader class of distributions towards normality compared to VCReg and show on synthetic and real-world datasets that it consistently improves performance by reducing higher-order dependencies and promoting more diverse and informative representations.

[LG-45] Energy-Efficient Over-the-Air Federated Learning via Pinching Antenna Systems

链接: https://arxiv.org/abs/2602.14250
作者: Saba Asaad,Ali Bereyhi
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: To be presented at IEEE International Conference on Communications (ICC 2026): Symposium on Next Generation Multiple Access. 6 pages, 3 algorithms, 3 figures

点击查看摘要

Abstract:Pinching antennas systems (PASSs) have recently been proposed as a novel flexible-antenna technology. These systems are implemented by attaching low-cost pinching elements to dielectric waveguides. As the direct link is bypassed through waveguides, PASSs can effectively compensate large-scale effects of the wireless channel. This work explores the potential gains of employing PASSs for over-the-air federated learning (OTA-FL). For a PASS-assisted server, we develop a low-complexity algorithmic approach, which jointly tunes the PASS parameters and schedules the mobile devices for minimal energy consumption in OTA-FL. We study the efficiency of the proposed design and compare it against the conventional OTA-FL setting with MIMO server. Numerical experiments demonstrate that using a single-waveguide PASS at the server within a moderately sized area, the required energy for model aggregation is drastically reduced as compared to the case with fully-digital MIMO server. This introduces PASS as a potential technology for energy-efficient distributed learning in next generations of wireless systems.

[LG-46] Robust multi-task boosting using clustering and local ensembling

链接: https://arxiv.org/abs/2602.14231
作者: Seyedsaman Emami,Daniel Hernández-Lobato,Gonzalo Martínez-Muñoz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-Task Learning (MTL) aims to boost predictive performance by sharing information across related tasks, yet conventional methods often suffer from negative transfer when unrelated or noisy tasks are forced to share representations. We propose Robust Multi-Task Boosting using Clustering and Local Ensembling (RMB-CLE), a principled MTL framework that integrates error-based task clustering with local ensembling. Unlike prior work that assumes fixed clusters or hand-crafted similarity metrics, RMB-CLE derives inter-task similarity directly from cross-task errors, which admit a risk decomposition into functional mismatch and irreducible noise, providing a theoretically grounded mechanism to prevent negative transfer. Tasks are grouped adaptively via agglomerative clustering, and within each cluster, a local ensemble enables robust knowledge sharing while preserving task-specific patterns. Experiments show that RMB-CLE recovers ground-truth clusters in synthetic data and consistently outperforms multi-task, single-task, and pooling-based ensemble methods across diverse real-world and synthetic benchmarks. These results demonstrate that RMB-CLE is not merely a combination of clustering and boosting but a general and scalable framework that establishes a new basis for robust multi-task learning.

[LG-47] Fast Catch-Up Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws ICLR2026

链接: https://arxiv.org/abs/2602.14208
作者: Jinbo Wang,Binghui Li,Zhanpeng Zhou,Mingze Wang,Yuxuan Sun,Jiaqi Zhang,Xunliang Cai,Lei Wu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 34 pages, accepted by ICLR 2026 as a conference paper

点击查看摘要

Abstract:Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism – the fast catch-up effect – which also manifests in large language model (LLM) pretraining. After switching from small to large batches, the loss rapidly aligns with the constant large-batch trajectory. Using FSL, we show that this effect stems from rapid forgetting of accumulated gradient noise, with the catch-up speed determined by task difficulty. Crucially, this effect implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption. Finally, extensive LLM pretraining experiments – covering both Dense and MoE architectures with up to 1.1B parameters and 1T tokens – validate our theoretical predictions. Across all settings, late-switch schedules consistently outperform constant-batch and early-switch baselines.

[LG-48] S-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models

链接: https://arxiv.org/abs/2602.14200
作者: Nicolas Zumarraga,Thomas Kaar,Ning Wang,Maxwell A. Xu,Max Rosenblattl,Markus Kreft,Kevin O’Sullivan,Paul Schmiedmayer,Patrick Langer,Robert Jakob
类目: Machine Learning (cs.LG)
*备注: 18 pages, 13 figures, 10 tables (main paper: 5 pages, 3 figures, 2 tables)

点击查看摘要

Abstract:Time Series Language Models (TSLMs) are emerging as unified models for reasoning over continuous signals in natural language. However, long-context retrieval remains a major limitation: existing models are typically trained and evaluated on short sequences, while real-world time-series sensor streams can span millions of datapoints. This mismatch requires precise temporal localization under strict computational constraints, a regime that is not captured by current benchmarks. We introduce TS-Haystack, a long-context temporal retrieval benchmark comprising ten task types across four categories: direct retrieval, temporal reasoning, multi-step reasoning and contextual anomaly. The benchmark uses controlled needle insertion by embedding short activity bouts into longer longitudinal accelerometer recordings, enabling systematic evaluation across context lengths ranging from seconds to 2 hours per sample. We hypothesize that existing TSLM time series encoders overlook temporal granularity as context length increases, creating a task-dependent effect: compression aids classification but impairs retrieval of localized events. Across multiple model and encoding strategies, we observe a consistent divergence between classification and retrieval behavior. Learned latent compression preserves or improves classification accuracy at compression ratios up to 176 \times , but retrieval performance degrades with context length, incurring in the loss of temporally localized information. These results highlight the importance of architectural designs that decouple sequence length from computational complexity while preserving temporal fidelity.

[LG-49] When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

链接: https://arxiv.org/abs/2602.14161
作者: Max Fomin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy-exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compositions rather than semantic content. We systematically compare production guardrails (PromptGuard 2, LlamaGuard) and LLM-as-judge approaches on our benchmark, finding all three fail on indirect attacks targeting agents (7-37% detection) and that PromptGuard 2 and LlamaGuard cannot evaluate agentic tool injection due to architectural limitations. Finally, we show that LODO-stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts. We release our evaluation framework at this https URL to establish LODO as the appropriate protocol for prompt attack detection research.

[LG-50] Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization

链接: https://arxiv.org/abs/2602.14159
作者: Rizhen Hu,Yuan Cao,Boao Kong,Mou Sun,Kun Yuan
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap – redundant representations across experts and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts’ SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top- k routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top- k MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.

[LG-51] A Penalty Approach for Differentiation Through Black-Box Quadratic Programming Solvers

链接: https://arxiv.org/abs/2602.14154
作者: Yuxuan Linghu,Zhiyuan Liu,Qi Deng
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 18 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Differentiating through the solution of a quadratic program (QP) is a central problem in differentiable optimization. Most existing approaches differentiate through the Karush–Kuhn–Tucker (KKT) system, but their computational cost and numerical robustness can degrade at scale. To address these limitations, we propose dXPP, a penalty-based differentiation framework that decouples QP solving from differentiation. In the solving step (forward pass), dXPP is solver-agnostic and can leverage any black-box QP solver. In the differentiation step (backward pass), we map the solution to a smooth approximate penalty problem and implicitly differentiate through it, requiring only the solution of a much smaller linear system in the primal variables. This approach bypasses the difficulties inherent in explicit KKT differentiation and significantly improves computational efficiency and robustness. We evaluate dXPP on various tasks, including randomly generated QPs, large-scale sparse projection problems, and a real-world multi-period portfolio optimization task. Empirical results demonstrate that dXPP is competitive with KKT-based differentiation methods and achieves substantial speedups on large-scale problems.

[LG-52] Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

链接: https://arxiv.org/abs/2602.14111
作者: Anton Korznikov,Andrey Galichin,Alexey Dontsov,Oleg Rogov,Ivan Oseledets,Elena Tutubalina
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models’ internal mechanisms.

[LG-53] Geometry-Aware Physics-Informed PointNets for Modeling Flows Across Porous Structures

链接: https://arxiv.org/abs/2602.14108
作者: Luigi Ciceri,Corrado Mio,Jianyi Lin,Gabriele Gianini
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 12 pages, 8 figures, short version to be published in MEDES 2025 conference proceedings

点击查看摘要

Abstract:Predicting flows that occur both through and around porous bodies is challenging due to coupled physics across fluid and porous regions and the need to generalize across diverse geometries and boundary conditions. We address this problem using two Physics Informed learning approaches: Physics Informed PointNets (PIPN) and Physics Informed Geometry Aware Neural Operator (P-IGANO). We enforce the incompressible Navier Stokes equations in the free-flow region and a Darcy Forchheimer extension in the porous region within a unified loss and condition the networks on geometry and material parameters. Datasets are generated with OpenFOAM on 2D ducts containing porous obstacles and on 3D windbreak scenarios with tree canopies and buildings. We first verify the pipeline via the method of manufactured solutions, then assess generalization to unseen shapes, and for PI-GANO, to variable boundary conditions and parameter settings. The results show consistently low velocity and pressure errors in both seen and unseen cases, with accurate reproduction of the wake structures. Performance degrades primarily near sharp interfaces and in regions with large gradients. Overall, the study provides a first systematic evaluation of PIPN/PI-GANO for simultaneous through-and-around porous flows and shows their potential to accelerate design studies without retraining per geometry.

[LG-54] Neural Optimal Transport in Hilbert Spaces: Characterizing Spurious Solutions and Gaussian Smoothing

链接: https://arxiv.org/abs/2602.14086
作者: Jae-Hwan Choi,Jiwoo Yoon,Dohyun Kwon,Jaewoong Choi
类目: Machine Learning (cs.LG)
*备注: 31 pages, 3 figures

点击查看摘要

Abstract:We study Neural Optimal Transport in infinite-dimensional Hilbert spaces. In non-regular settings, Semi-dual Neural OT often generates spurious solutions that fail to accurately capture target distributions. We analytically characterize this spurious solution problem using the framework of regular measures, which generalize Lebesgue absolute continuity in finite dimensions. To resolve ill-posedness, we extend the semi-dual framework via a Gaussian smoothing strategy based on Brownian motion. Our primary theoretical contribution proves that under a regular source measure, the formulation is well-posed and recovers a unique Monge map. Furthermore, we establish a sharp characterization for the regularity of smoothed measures, proving that the success of smoothing depends strictly on the kernel of the covariance operator. Empirical results on synthetic functional data and time-series datasets demonstrate that our approach effectively suppresses spurious solutions and outperforms existing baselines.

[LG-55] Decentralized Federated Learning With Energy Harvesting Devices

链接: https://arxiv.org/abs/2602.14051
作者: Kai Zhang,Xuanyu Cao,Khaled B. Letaief
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Decentralized federated learning (DFL) enables edge devices to collaboratively train models through local training and fully decentralized device-to-device (D2D) model exchanges. However, these energy-intensive operations often rapidly deplete limited device batteries, reducing their operational lifetime and degrading the learning performance. To address this limitation, we apply energy harvesting technique to DFL systems, allowing edge devices to extract ambient energy and operate sustainably. We first derive the convergence bound for wireless DFL with energy harvesting, showing that the convergence is influenced by partial device participation and transmission packet drops, both of which further depend on the available energy supply. To accelerate convergence, we formulate a joint device scheduling and power control problem and model it as a multi-agent Markov decision process (MDP). Traditional MDP algorithms (e.g., value or policy iteration) require a centralized coordinator with access to all device states and exhibit exponential complexity in the number of devices, making them impractical for large-scale decentralized networks. To overcome these challenges, we propose a fully decentralized policy iteration algorithm that leverages only local state information from two-hop neighboring devices, thereby substantially reducing both communication overhead and computational complexity. We further provide a theoretical analysis showing that the proposed decentralized algorithm achieves asymptotic optimality. Finally, comprehensive numerical experiments on real-world datasets are conducted to validate the theoretical results and corroborate the effectiveness of the proposed algorithm.

[LG-56] Position Encoding with Random Float Sampling Enhances Length Generalization of Transformers EACL2026

链接: https://arxiv.org/abs/2602.14050
作者: Atsushi Shimizu,Shohei Taniguchi,Yutaka Matsuo
类目: Machine Learning (cs.LG)
*备注: To appear at EACL 2026

点击查看摘要

Abstract:Length generalization is the ability of language models to maintain performance on inputs longer than those seen during pretraining. In this work, we introduce a simple yet powerful position encoding (PE) strategy, Random Float Sampling (RFS), that generalizes well to lengths unseen during pretraining or fine-tuning. In particular, instead of selecting position indices from a predefined discrete set, RFS uses randomly sampled continuous values, thereby avoiding out-of-distribution (OOD) issues on unseen lengths by exposing the model to diverse indices during training. Since assigning indices to tokens is a common and fundamental procedure in widely used PEs, the advantage of RFS can easily be incorporated into, for instance, the absolute sinusoidal encoding, RoPE, and ALiBi. Experiments corroborate its effectiveness by showing that RFS results in superior performance in length generalization tasks as well as zero-shot commonsense reasoning benchmarks.

[LG-57] MC2Mark: Distortion-Free Multi-Bit Watermarking for Long Messages

链接: https://arxiv.org/abs/2602.14030
作者: Xuehao Cui,Ruibo Chen,Yihan Wu,Heng Huang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models now produce text indistinguishable from human writing, which increases the need for reliable provenance tracing. Multi-bit watermarking can embed identifiers into generated text, but existing methods struggle to keep both text quality and watermark strength while carrying long messages. We propose MC ^2 Mark, a distortion-free multi-bit watermarking framework designed for reliable embedding and decoding of long messages. Our key technical idea is Multi-Channel Colored Reweighting, which encodes bits through structured token reweighting while keeping the token distribution unbiased, together with Multi-Layer Sequential Reweighting to strengthen the watermark signal and an evidence-accumulation detector for message recovery. Experiments show that MC ^2 Mark improves detectability and robustness over prior multi-bit watermarking methods while preserving generation quality, achieving near-perfect accuracy for short messages and exceeding the second-best method by nearly 30% for long messages.

[LG-58] S2SServiceBench: A Multimodal Benchmark for Last-Mile S2S Climate Services

链接: https://arxiv.org/abs/2602.14017
作者: Chenyue Li,Wen Deng,Zhuotao Sun,Mengxi Jin,Hanzhe Cui,Han Li,Shentong Li,Man Kit Yu,Ming Long Lai,Yuhao Yang,Mengqian Lu,Binhang Yuan
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Subseasonal-to-seasonal (S2S) forecasts play an essential role in providing a decision-critical weeks-to-months planning window for climate resilience and sustainability, yet a growing bottleneck is the last-mile gap: translating scientific forecasts into trusted, actionable climate services, requiring reliable multimodal understanding and decision-facing reasoning under uncertainty. Meanwhile, multimodal large language models (MLLMs) and corresponding agentic paradigms have made rapid progress in supporting various workflows, but it remains unclear whether they can reliably generate decision-making deliverables from operational service products (e.g., actionable signal comprehension, decision-making handoff, and decision analysis planning) under uncertainty. We introduce S2SServiceBench, a multimodal benchmark for last-mile S2S climate services curated from an operational climate-service system to evaluate this capability. S2SServiceBenchcovers 10 service products with about 150+ expert-selected cases in total, spanning six application domains - Agriculture, Disasters, Energy, Finance, Health, and Shipping. Each case is instantiated at three service levels, yielding around 500 tasks and 1,000+ evaluation items across climate resilience and sustainability applications. Using S2SServiceBench, we benchmark state-of-the-art MLLMs and agents, and analyze performance across products and service levels, revealing persistent challenges in S2S service plot understanding and reasoning - namely, actionable signal comprehension, operationalizing uncertainty into executable handoffs, and stable, evidence-grounded analysis and planning for dynamic hazards-while offering actionable guidance for building future climate-service agents.

[LG-59] KoopGen: Koopman Generator Networks for Representing and Predicting Dynamical Systems with Continuous Spectra

链接: https://arxiv.org/abs/2602.14011
作者: Liangyu Su,Jun Shu,Rui Liu,Deyu Meng,Zongben Xu
类目: Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Representing and predicting high-dimensional and spatiotemporally chaotic dynamical systems remains a fundamental challenge in dynamical systems and machine learning. Although data-driven models can achieve accurate short-term forecasts, they often lack stability, interpretability, and scalability in regimes dominated by broadband or continuous spectra. Koopman-based approaches provide a principled linear perspective on nonlinear dynamics, but existing methods rely on restrictive finite-dimensional assumptions or explicit spectral parameterizations that degrade in high-dimensional settings. Against these issues, we introduce KoopGen, a generator-based neural Koopman framework that models dynamics through a structured, state-dependent representation of Koopman generators. By exploiting the intrinsic Cartesian decomposition into skew-adjoint and self-adjoint components, KoopGen separates conservative transport from irreversible dissipation while enforcing exact operator-theoretic constraints during learning. Across systems ranging from nonlinear oscillators to high-dimensional chaotic and spatiotemporal dynamics, KoopGen improves prediction accuracy and stability, while clarifying which components of continuous-spectrum dynamics admit interpretable and learnable representations.

[LG-60] Steady-State Behavior of Constant-Stepsize Stochastic Approximation: Gaussian Approximation and Tail Bounds

链接: https://arxiv.org/abs/2602.13960
作者: Zedong Wang,Yuyang Wang,Ijay Narang,Felix Wang,Yuzhou Wang,Siva Theja Maguluri
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Constant-stepsize stochastic approximation (SA) is widely used in learning for computational efficiency. For a fixed stepsize, the iterates typically admit a stationary distribution that is rarely tractable. Prior work shows that as the stepsize \alpha \downarrow 0 , the centered-and-scaled steady state converges weakly to a Gaussian random vector. However, for fixed \alpha , this weak convergence offers no usable error bound for approximating the steady-state by its Gaussian limit. This paper provides explicit, non-asymptotic error bounds for fixed \alpha . We first prove general-purpose theorems that bound the Wasserstein distance between the centered-scaled steady state and an appropriate Gaussian distribution, under regularity conditions for drift and moment conditions for noise. To ensure broad applicability, we cover both i.i.d. and Markovian noise models. We then instantiate these theorems for three representative SA settings: (1) stochastic gradient descent (SGD) for smooth strongly convex objectives, (2) linear SA, and (3) contractive nonlinear SA. We obtain dimension- and stepsize-dependent, explicit bounds in Wasserstein distance of order \alpha^1/2\log(1/\alpha) for small \alpha . Building on the Wasserstein approximation error, we further derive non-uniform Berry–Esseen-type tail bounds that compare the steady-state tail probability to Gaussian tails. We achieve an explicit error term that decays in both the deviation level and stepsize \alpha . We adapt the same analysis for SGD beyond strongly convexity and study general convex objectives. We identify a non-Gaussian (Gibbs) limiting law under the correct scaling, which is validated numerically, and provide a corresponding pre-limit Wasserstein error bound.

[LG-61] QuRL: Efficient Reinforcement Learning with Quantized Rollout ICLR2026

链接: https://arxiv.org/abs/2602.13953
作者: Yuhang Li,Reena Elangovan,Xin Dong,Priyadarshini Panda,Brucek Khailany
类目: Machine Learning (cs.LG)
*备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, consisting of up to 70% of the total training time. In this work, we propose Quantized Reinforcement Learning (QuRL) that uses a quantized actor for accelerating the rollout. We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR) that dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor, which is essential for mitigating long-term training collapse. Second, we identify the weight update problem, where weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them effectively. We mitigate this problem through the invariant scaling technique that reduces quantization noise and increases weight update. We evaluate our method with INT8 and FP8 quantization experiments on DeepScaleR and DAPO, and achieve 20% to 80% faster rollout during training.

[LG-62] A Multi-Agent Framework for Code-Guided Modular and Verifiable Automated Machine Learning

链接: https://arxiv.org/abs/2602.13937
作者: Dat Le,Duc-Cuong Le,Anh-Son Nguyen,Tuan-Dung Bui,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Automated Machine Learning (AutoML) has revolutionized the development of data-driven solutions; however, traditional frameworks often function as “black boxes”, lacking the flexibility and transparency required for complex, real-world engineering tasks. Recent Large Language Model (LLM)-based agents have shifted toward code-driven approaches. However, they frequently suffer from hallucinated logic and logic entanglement, where monolithic code generation leads to unrecoverable runtime failures. In this paper, we present iML, a novel multi-agent framework designed to shift AutoML from black-box prompting to a code-guided, modular, and verifiable architectural paradigm. iML introduces three main ideas: (1) Code-Guided Planning, which synthesizes a strategic blueprint grounded in autonomous empirical profiling to eliminate hallucination; (2) Code-Modular Implementation, which decouples preprocessing and modeling into specialized components governed by strict interface contracts; and (3) Code-Verifiable Integration, which enforces physical feasibility through dynamic contract verification and iterative self-correction. We evaluate iML across MLE-BENCH and the newly introduced iML-BENCH, comprising a diverse range of real-world Kaggle competitions. The experimental results show iML’s superiority over state-of-the-art agents, achieving a valid submission rate of 85% and a competitive medal rate of 45% on MLE-BENCH, with an average standardized performance score (APS) of 0.77. On iML-BENCH, iML significantly outperforms the other approaches by 38%-163% in APS. Furthermore, iML maintains a robust 70% success rate even under stripped task descriptions, effectively filling information gaps through empirical profiling. These results highlight iML’s potential to bridge the gap between stochastic generation and reliable engineering, marking a meaningful step toward truly AutoML.

[LG-63] voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models ICASSP2026

链接: https://arxiv.org/abs/2602.13928
作者: Aju Ani Justus,Ruchit Agrawal,Sudarsana Reddy Kadiri,Shrikanth Narayanan
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted to the Speech, Music and Mind (SMM26) workshop at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026). This is the preprint version of the paper to appear in the proceedings

点击查看摘要

Abstract:We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline. We also show layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).

[LG-64] Evolving Multi-Channel Confidence-Aware Activation Functions for Missing Data with Channel Propagation

链接: https://arxiv.org/abs/2602.13864
作者: Naeem Shahabi Sani,Ferial Najiantabriz,Shayan Shafaei,Dean F. Hougen
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Learning in the presence of missing data can result in biased predictions and poor generalizability, among other difficulties, which data imputation methods only partially address. In neural networks, activation functions significantly affect performance yet typical options (e.g., ReLU, Swish) operate only on feature values and do not account for missingness indicators or confidence scores. We propose Three-Channel Evolved Activations (3C-EA), which we evolve using Genetic Programming to produce multivariate activation functions f(x, m, c) in the form of trees that take (i) the feature value x, (ii) a missingness indicator m, and (iii) an imputation confidence score c. To make these activations useful beyond the input layer, we introduce ChannelProp, an algorithm that deterministically propagates missingness and confidence values via linear layers based on weight magnitudes, retaining reliability signals throughout the network. We evaluate 3C-EA and ChannelProp on datasets with natural and injected (MCAR/MAR/MNAR) missingness at multiple rates under identical preprocessing and splits. Results indicate that integrating missingness and confidence inputs into the activation search improves classification performance under missingness.

[LG-65] sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals

链接: https://arxiv.org/abs/2602.13857
作者: Weixuan Yuan,Zengrui Jin,Yichen Wang,Donglin Xie,Ziyi Ye,Chao Zhang,Xuesong Chen
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Tasks ranging from sleep staging to clinical diagnosis traditionally rely on standard polysomnography (PSG) devices, bedside monitors and wearable devices, which capture diverse nocturnal biosignals (e.g., EEG, EOG, ECG, SpO _2 ). However, heterogeneity across devices and frequent sensor dropout pose significant challenges for unified modelling of these multimodal signals. We present \textttsleep2vec, a foundation model for diverse and incomplete nocturnal biosignals that learns a shared representation via cross-modal alignment. \textttsleep2vec is contrastively pre-trained on 42,249 overnight recordings spanning nine modalities using a \textitDemography, Age, Site \ History-aware InfoNCE objective that incorporates physiological and acquisition metadata (\textite.g., age, gender, recording site) to dynamically weight negatives and mitigate cohort-specific shortcuts. On downstream sleep staging and clinical outcome assessment, \textttsleep2vec consistently outperforms strong baselines and remains robust to any subset of available modalities and sensor dropout. We further characterize, to our knowledge for the first time, scaling laws for nocturnal biosignals with respect to modality diversity and model capacity. Together, these results show that unified cross-modal alignment, coupled with principled scaling, enables label-efficient, general-purpose modelling of real-world nocturnal biosignals.

[LG-66] sting For Distribution Shifts with Conditional Conformal Test Martingales

链接: https://arxiv.org/abs/2602.13848
作者: Shalev Shaer,Yarin Bar,Drew Prinster,Yaniv Romano
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a sequential test for detecting arbitrary distribution shifts that allows conformal test martingales (CTMs) to work under a fixed, reference-conditional setting. Existing CTM detectors construct test martingales by continually growing a reference set with each incoming sample, using it to assess how atypical the new sample is relative to past observations. While this design yields anytime-valid type-I error control, it suffers from test-time contamination: after a change, post-shift observations enter the reference set and dilute the evidence for distribution shift, increasing detection delay and reducing power. In contrast, our method avoids contamination by design by comparing each new sample to a fixed null reference dataset. Our main technical contribution is a robust martingale construction that remains valid conditional on the null reference data, achieved by explicitly accounting for the estimation error in the reference distribution induced by the finite reference set. This yields anytime-valid type-I error control together with guarantees of asymptotic power one and bounded expected detection delay. Empirically, our method detects shifts faster than standard CTMs, providing a powerful and reliable distribution-shift detector. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2602.13848 [cs.LG] (or arXiv:2602.13848v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.13848 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-67] A Unified Physics-Informed Neural Network for Modeling Coupled Electro- and Elastodynamic Wave Propagation Using Three-Stage Loss Optimization

链接: https://arxiv.org/abs/2602.13811
作者: Suhas Suresh Bharadwaj,Reuben Thomas Thovelil
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:Physics-Informed Neural Networks present a novel approach in SciML that integrates physical laws in the form of partial differential equations directly into the NN through soft constraints in the loss function. This work studies the application of PINNs to solve a one dimensional coupled electro-elastodynamic system modeling linear piezoelectricity in stress-charge form, governed by elastodynamic and electrodynamic equations. Our simulation employs a feedforward architecture, mapping space-time coordinates to mechanical displacement and electric potential. Our PINN model achieved global relative L2 errors of 2.34 and 4.87 percent for displacement and electric potential respectively. The results validate PINNs as effective mesh free solvers for coupled time-dependent PDE systems, though challenges remain regarding error accumulation and stiffness in coupled eigenvalue systems.

[LG-68] AnomaMind: Agent ic Time Series Anomaly Detection with Tool-Augmented Reasoning

链接: https://arxiv.org/abs/2602.13807
作者: Xiaoyu Tao,Yuchong Wu,Mingyue Cheng,Ze Guo,Tian Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series anomaly detection is critical in many real-world applications, where effective solutions must localize anomalous regions and support reliable decision-making under complex settings. However, most existing methods frame anomaly detection as a purely discriminative prediction task with fixed feature inputs, rather than an evidence-driven diagnostic process. As a result, they often struggle when anomalies exhibit strong context dependence or diverse patterns. We argue that these limitations stem from the lack of adaptive feature preparation, reasoning-aware detection, and iterative refinement during inference. To address these challenges, we propose AnomaMind, an agentic time series anomaly detection framework that reformulates anomaly detection as a sequential decision-making process. AnomaMind operates through a structured workflow that progressively localizes anomalous intervals in a coarse-to-fine manner, augments detection through multi-turn tool interactions for adaptive feature preparation, and refines anomaly decisions via self-reflection. The workflow is supported by a set of reusable tool engines, enabling context-aware diagnostic analysis. A key design of AnomaMind is an explicitly designed hybrid inference mechanism for tool-augmented anomaly detection. In this mechanism, general-purpose models are responsible for autonomous tool interaction and self-reflective refinement, while core anomaly detection decisions are learned through reinforcement learning under verifiable workflow-level feedback, enabling task-specific optimization within a flexible reasoning framework. Extensive experiments across diverse settings demonstrate that AnomaMind consistently improves anomaly detection performance. The code is available at this https URL.

[LG-69] Fast Physics-Driven Untrained Network for Highly Nonlinear Inverse Scattering Problems

链接: https://arxiv.org/abs/2602.13805
作者: Yutong Du,Zicheng Liu,Yi Huang,Bazargul Matkerim,Bo Qi,Yali Zong,Peixian Han
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Untrained neural networks (UNNs) offer high-fidelity electromagnetic inverse scattering reconstruction but are computationally limited by high-dimensional spatial-domain optimization. We propose a Real-Time Physics-Driven Fourier-Spectral (PDF) solver that achieves sub-second reconstruction through spectral-domain dimensionality reduction. By expanding induced currents using a truncated Fourier basis, the optimization is confined to a compact low-frequency parameter space supported by scattering measurements. The solver integrates a contraction integral equation (CIE) to mitigate high-contrast nonlinearity and a contrast-compensated operator (CCO) to correct spectral-induced attenuation. Furthermore, a bridge-suppressing loss is formulated to enhance boundary sharpness between adjacent scatterers. Numerical and experimental results demonstrate a 100-fold speedup over state-of-the-art UNNs with robust performance under noise and antenna uncertainties, enabling real-time microwave imaging applications.

[LG-70] Cast-R1: Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting

链接: https://arxiv.org/abs/2602.13802
作者: Xiaoyu Tao,Mingyue Cheng,Chuang Jiang,Tian Gao,Huanjian Zhang,Yaguo Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting has long been dominated by model-centric approaches that formulate prediction as a single-pass mapping from historical observations to future values. Despite recent progress, such formulations often struggle in complex and evolving settings, largely because most forecasting models lack the ability to autonomously acquire informative evidence, reason about potential future changes, or revise predictions through iterative decision processes. In this work, we propose Cast-R1, a learned time series forecasting framework that reformulates forecasting as a sequential decision-making problem. Cast-R1 introduces a memory-based state management mechanism that maintains decision-relevant information across interaction steps, enabling the accumulation of contextual evidence to support long-horizon reasoning. Building on this formulation, forecasting is carried out through a tool-augmented agentic workflow, in which the agent autonomously interacts with a modular toolkit to extract statistical features, invoke lightweight forecasting models for decision support, perform reasoning-based prediction, and iteratively refine forecasts through self-reflection. To train Cast-R1, we adopt a two-stage learning strategy that combines supervised fine-tuning with multi-turn reinforcement learning, together with a curriculum learning scheme that progressively increases task difficulty to improve policy learning. Extensive experiments on multiple real-world time series datasets demonstrate the effectiveness of Cast-R1. We hope this work provides a practical step towards further exploration of agentic paradigms for time series modeling. Our code is available at this https URL.

[LG-71] MEMTS: Internalizing Domain Knowledge via Parameterized Memory for Retrieval-Free Domain Adaptation of Time Series Foundation Models

链接: https://arxiv.org/abs/2602.13783
作者: Xiaoyun Yu,Li fan,Xiangfei Qiu,Nanqing Dong,Yonggui Huang,Honggang Qi,Geguang Pu,Wanli Ouyang,Xi Chen,Jilin Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While Time Series Foundation Models (TSFMs) have demonstrated exceptional performance in generalized forecasting, their performance often degrades significantly when deployed in real-world vertical domains characterized by temporal distribution shifts and domain-specific periodic structures. Current solutions are primarily constrained by two paradigms: Domain-Adaptive Pretraining (DAPT), which improves short-term domain fitting but frequently disrupts previously learned global temporal patterns due to catastrophic forgetting; and Retrieval-Augmented Generation (RAG), which incorporates external knowledge but introduces substantial retrieval overhead. This creates a severe scalability bottleneck that fails to meet the high-efficiency requirements of real-time stream processing. To break this impasse, we propose Memory for Time Series (MEMTS), a lightweight and plug-and-play method for retrieval-free domain adaptation in time series forecasting. The key component of MEMTS is a Knowledge Persistence Module (KPM), which internalizes domain-specific temporal dynamics, such as recurring seasonal patterns and trends into a compact set of learnable latent prototypes. In doing so, it transforms fragmented historical observations into continuous, parameterized knowledge representations. This paradigm shift enables MEMTS to achieve accurate domain adaptation with constant-time inference and near-zero latency, while effectively mitigating catastrophic forgetting of general temporal patterns, all without requiring any architectural modifications to the frozen TSFM backbone. Extensive experiments on multiple datasets demonstrate the SOTA performance of MEMTS.

[LG-72] On Representation Redundancy in Large-Scale Instruction Tuning Data Selection

链接: https://arxiv.org/abs/2602.13773
作者: Youwei Shu,Shaomian Zheng,Dingnan Jin,Wenjie Qu,Ziyao Guo,Qing Cui,Jun Zhou,Jiaheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data quality is a crucial factor in large language models training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at this https URL.

[LG-73] Discrete Double-Bracket Flows for Isotropic-Noise Invariant Eigendecomposition

链接: https://arxiv.org/abs/2602.13759
作者: ZhiMing Li,JiaHe Feng
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 75 pages, 9 figures

点击查看摘要

Abstract:We study matrix-free eigendecomposition under a matrix-vector product (MVP) oracle, where each step observes a covariance operator C_k = C_sig + \sigma_k^2 I + E_k . Standard stochastic approximation methods either use fixed steps that couple stability to |C_k|_2 , or adapt steps in ways that slow down due to vanishing updates. We introduce a discrete double-bracket flow whose generator is invariant to isotropic shifts, yielding pathwise invariance to \sigma_k^2 I at the discrete-time level. The resulting trajectory and a maximal stable step size \eta_max \propto 1/|C_e|_2^2 depend only on the trace-free covariance C_e . We establish global convergence via strict-saddle geometry for the diagonalization objective and an input-to-state stability analysis, with sample complexity scaling as O(|C_e|_2^2 / (\Delta^2 \epsilon)) under trace-free perturbations. An explicit characterization of degenerate blocks yields an accelerated O(\log(1/\zeta)) saddle-escape rate and a high-probability finite-time convergence guarantee.

[LG-74] Data-driven Bi-level Optimization of Thermal Power Systems with embedded Artificial Neural Networks

链接: https://arxiv.org/abs/2602.13746
作者: Talha Ansar,Muhammad Mujtaba Abbas,Ramit Debnath,Vivek Dua,Waqar Muhammad Ashraf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Industrial thermal power systems have coupled performance variables with hierarchical order of importance, making their simultaneous optimization computationally challenging or infeasible. This barrier limits the integrated and computationally scaleable operation optimization of industrial thermal power systems. To address this issue for large-scale engineering systems, we present a fully machine learning-powered bi-level optimization framework for data-driven optimization of industrial thermal power systems. The objective functions of upper and lower levels are approximated by artificial neural network (ANN) models and the lower-level problem is analytically embedded through Karush-Kuhn-Tucker (KKT) optimality conditions. The reformulated single level optimization framework integrating ANN models and KKT constraints (ANN-KKT) is validated on benchmark problems and on real-world power generation operation of 660 MW coal power plant and 395 MW gas turbine system. The results reveal a comparable solutions obtained from the proposed ANN-KKT framework to the bi-level solutions of the benchmark problems. Marginal computational time requirement (0.22 to 0.88 s) to compute optimal solutions yields 583 MW (coal) and 402 MW (gas turbine) of power output at optimal turbine heat rate of 7337 kJ/kWh and 7542 kJ/kWh, respectively. In addition, the method expands to delineate a feasible and robust operating envelope that accounts for uncertainty in operating variables while maximizing thermal efficiency in various scenarios. These results demonstrate that ANN-KKT offers a scalable and computationally efficient route for hierarchical, data-driven optimization of industrial thermal power systems, achieving energy-efficient operations of large-scale engineering systems and contributing to industry 5.0.

[LG-75] HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models

链接: https://arxiv.org/abs/2602.13710
作者: Xin Yan,Zhenglin Wan,Feiyang Ye,Xingrui Yu,Hangyu Du,Yang You,Ivor Tsang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models enable instruction-following embodied control, but their large compute and memory footprints hinder deployment on resource-constrained robots and edge platforms. While reducing weights to 1-bit precision through binarization can greatly improve efficiency, existing methods fail to narrow the distribution gap between binarized and full-precision weights, causing quantization errors to accumulate under long-horizon closed-loop execution and severely degrade actions. To fill this gap, we propose HBVLA, a VLA-tailored binarization framework. First, we use a policy-aware enhanced Hessian to identify weights that are truly critical for action generation. Then, we employ a sparse orthogonal transform for non-salient weights to induce a low-entropy intermediate state. Finally, we quantize both salient and non-salient weights in the Harr domain with group-wise 1-bit quantization. We have evaluated our approach on different VLAs: on LIBERO, quantized OpenVLA-OFT retains 92.2% of full-precision performance; on SimplerEnv, quantized CogAct retains 93.6%, significantly outperforming state-of-the-art binarization methods. We further validate our method on real-world evaluation suite and the results show that HBVLA incurs only marginal success-rate degradation compared to the full-precision model, demonstrating robust deployability under tight hardware constraints. Our work provides a practical foundation for ultra-low-bit quantization of VLAs, enabling more reliable deployment on hardware-limited robotic platforms.

[LG-76] Near-Optimal Regret for Policy Optimization in Contextual MDPs with General Offline Function Approximation

链接: https://arxiv.org/abs/2602.13706
作者: Orin Levy,Aviv Rosenberg,Alon Cohen,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce \textttOPO-CMDP, the first policy optimization algorithm for stochastic Contextual Markov Decision Process (CMDPs) under general offline function approximation. Our approach achieves a high probability regret bound of \widetildeO(H^4\sqrtT|S||A|\log(|\mathcalF||\mathcalP|)), where S and A denote the state and action spaces, H the horizon length, T the number of episodes, and \mathcalF, \mathcalP the finite function classes used to approximate the losses and dynamics, respectively. This is the first regret bound with optimal dependence on |S| and |A| , directly improving the current state-of-the-art (Qian, Hu, and Simchi-Levi, 2024). These results demonstrate that optimistic policy optimization provides a natural, computationally superior and theoretically near-optimal path for solving CMDPs.

[LG-77] Optimal Regret for Policy Optimization in Contextual Bandits

链接: https://arxiv.org/abs/2602.13700
作者: Orin Levy,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the first high-probability optimal regret bound for a policy optimization technique applied to the problem of stochastic contextual multi-armed bandit (CMAB) with general offline function approximation. Our algorithm is both efficient and achieves an optimal regret bound of \widetildeO(\sqrt K|\mathcalA|\log|\mathcalF|) , where K is the number of rounds, \mathcalA is the set of arms, and \mathcalF is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously-proved optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.

[LG-78] Attention Head Entropy of LLM s Predicts Answer Correctness

链接: https://arxiv.org/abs/2602.13699
作者: Sophie Ostmeier,Brian Axelrod,Maya Varma,Asad Aali,Yabin Zhang,Magdalini Paschali,Sanmi Koyejo,Curtis Langlotz,Akshay Chaudhari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out-of-domains? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Renyi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better on out-of-domains, it outperforms the closest baseline on average by +8.5% AUROC. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal using Head Entropy with on average +17.7% AUROC over the closest baseline. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.

[LG-79] Physics Aware Neural Networks: Denoising for Magnetic Navigation

链接: https://arxiv.org/abs/2602.13690
作者: Aritra Das(1),Yashas Shende(1),Muskaan Chugh(1),Reva Laxmi Chauhan(1),Arghya Pathak(1),Debayan Gupta(1) ((1) Ashoka University)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Magnetic-anomaly navigation, leveraging small-scale variations in the Earth’s magnetic field, is a promising alternative when GPS is unavailable or compromised. Airborne systems face a key challenge in extracting geomagnetic field data: the aircraft itself induces magnetic noise. Although the classical Tolles-Lawson model addresses this, it inadequately handles stochastically corrupted magnetic data required for navigation. To address stochastic noise, we propose a framework based on two physics-based constraints: divergence-free vector field and E(3)-equivariance. These ensure the learned magnetic field obeys Maxwell’s equations and that outputs transform correctly with sensor position/orientation. The divergence-free constraint is implemented by training a neural network to output a vector potential A , with the magnetic field defined as its curl. For E(3)-equivariance, we use tensor products of geometric tensors representable via spherical harmonics with known rotational transformations. Enforcing physical consistency and restricting the admissible function space acts as an implicit regularizer that improves spatio-temporal performance. We present ablation studies evaluating each constraint alone and jointly across CNNs, MLPs, Liquid Time Constant models, and Contiformers. Continuous-time dynamics and long-term memory are critical for modelling magnetic time series; the Contiformer architecture, which provides both, outperforms state-of-the-art methods. To mitigate data scarcity, we generate synthetic datasets using the World Magnetic Model (WMM) with time-series conditional GANs, producing realistic, temporally consistent magnetic sequences across varied trajectories and environments. Experiments show that embedding these constraints significantly improves predictive accuracy and physical plausibility, outperforming classical and unconstrained deep learning approaches.

[LG-80] LEAD-Drift: Real-time and Explainable Intent Drift Detection by Learning a Data-Driven Risk Score

链接: https://arxiv.org/abs/2602.13672
作者: Md. Kamrul Hossain,Walid Aljoby
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Copyright 2026 IEEE. Accepted for publication in IEEE ICC 2026. This is a author version. 6 pages

点击查看摘要

Abstract:Intent-Based Networking (IBN) simplifies network management, but its reliability is challenged by “intent drift”, where the network’s state gradually deviates from its intended goal, often leading to silent failures. Conventional approaches struggle to detect the subtle, early stages of intent drift, raising alarms only when degradation is significant and failure is imminent, which limits their effectiveness for proactive assurance. To address this, we propose LEAD-Drift, a framework that detects intent drift in real time to enable proactive failure prevention. LEAD-Drift’s core contribution is reformulating intent failure detection as a supervised learning problem by training a lightweight neural network on fixed-horizon labels to predict a future risk score. The model’s raw output is then smoothed with an Exponential Moving Average (EMA) and passed through a statistically tuned threshold to generate robust, real-time alerts. Furthermore, we enhance the framework with two key features for operational intelligence: a multi-horizon modeling technique for dynamic time-to-failure estimation, and per-alert explainability using SHAP to identify root-cause KPIs. Our evaluation on a time-series dataset shows LEAD-Drift provides significantly earlier warnings, improving the average lead time by 7.3 minutes (+17.8%) compared to a distance-based baseline. It also reduces alert noise by 80.2% compared to a weighted-KPI heuristic, with only a minor trade-off in lead time. These results demonstrate that LEAD-Drift as a highly effective, interpretable, and operationally efficient solution for proactive network assurance in IBN.

[LG-81] Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

链接: https://arxiv.org/abs/2602.13670
作者: Binyu Zhao,Wei Zhang,Xingrui Yu,Zhaonian Zou,Ivor Tsang
类目: Machine Learning (cs.LG)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM-based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by these insights, we propose \textbfVILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Specifically, we coherently fuse plastic, task-adapted features with a frozen, universal semantic anchor at the feature level through geometric calibration, and leverage cross-modal priors at the decision level to rectify prediction bias. This confluence maintains analytic-learning’s extreme efficiency while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine-grained and long-sequence scenarios. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning. Our code is available at this https URL

[LG-82] Optimized Certainty Equivalent Risk-Controlling Prediction Sets

链接: https://arxiv.org/abs/2602.13660
作者: Jiayi Huang,Amirmohammad Farzaneh,Osvaldo Simeone
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Sumitted to EUSIPCO

点击查看摘要

Abstract:In safety-critical applications such as medical image segmentation, prediction systems must provide reliability guarantees that extend beyond conventional expected loss control. While risk-controlling prediction sets (RCPS) offer probabilistic guarantees on the expected risk, they fail to capture tail behavior and worst-case scenarios that are crucial in high-stakes settings. This paper introduces optimized certainty equivalent RCPS (OCE-RCPS), a novel framework that provides high-probability guarantees on general optimized certainty equivalent (OCE) risk measures, including conditional value-at-risk (CVaR) and entropic risk. OCE-RCPS leverages upper confidence bounds to identify prediction set parameters that satisfy user-specified risk tolerance levels with provable reliability. We establish theoretical guarantees showing that OCE-RCPS satisfies the desired probabilistic constraint for loss functions such as miscoverage and false negative rate. Experiments on image segmentation demonstrate that OCE-RCPS consistently meets target satisfaction rates across various risk measures and reliability configurations, while OCE-CRC fails to provide probabilistic guarantees.

[LG-83] Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling

链接: https://arxiv.org/abs/2602.13659
作者: Valery Parfenov,Grigoriy Evseev,Andrey Veprikov,Nikolay Bushkov,Stanislav Moiseev,Aleksandr Beznosikov
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Fine-tuning large pretrained language models (LLMs) is a cornerstone of modern NLP, yet its growing memory demands (driven by backpropagation and large optimizer States) limit deployment in resource-constrained settings. Zero-order (ZO) methods bypass backpropagation by estimating directional derivatives from forward evaluations, offering substantial memory savings. However, classical ZO estimators suffer from high variance and an adverse dependence on the parameter dimensionality d , which has constrained their use to low-dimensional problems. In this work, we propose a policy-driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy and updates it to reduce the variance of directional estimates. We develop a practical algorithm implementing this idea and provide a theoretical analysis, showing that learned sampling distributions improve the quality of gradient information and relax the explicit dependence on d in convergence bounds. Empirically, we validate the approach on challenging LLM fine-tuning benchmarks, demonstrating substantially improved performance compared to standard ZO baselines. Our results suggest that adaptive direction sampling is a promising route to make ZO fine-tuning viable at scale. The source code is available at this https URL

[LG-84] Joint Time Series Chain: Detecting Unusual Evolving Trend across Time Series

链接: https://arxiv.org/abs/2602.13649
作者: Li Zhang,Nital Patel,Xiuqi Li,Jessica Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series chain (TSC) is a recently introduced concept that captures the evolving patterns in large scale time series. Informally, a time series chain is a temporally ordered set of subsequences, in which consecutive subsequences in the chain are similar to one another, but the last and the first subsequences maybe be dissimilar. Time series chain has the great potential to reveal latent unusual evolving trend in the time series, or identify precursor of important events in a complex system. Unfortunately, existing definitions of time series chains only consider finding chains in a single time series. As a result, they are likely to miss unexpected evolving patterns in interrupted time series, or across two related time series. To address this limitation, in this work, we introduce a new definition called \textitJoint Time Series Chain, which is specially designed for the task of finding unexpected evolving trend across interrupted time series or two related time series. Our definition focuses on mitigating the robustness issues caused by the gap or interruption in the time series. We further propose an effective ranking criterion to identify the best chain. We demonstrate that our proposed approach outperforms existing TSC work in locating unusual evolving patterns through extensive empirical evaluations. We further demonstrate the utility of our work with a real-life manufacturing application from Intel. Our source code is publicly available at the supporting page this https URL .

[LG-85] Optimization-Free Graph Embedding via Distributional Kernel for Community Detection

链接: https://arxiv.org/abs/2602.13634
作者: Shuaibin Song,Kai Ming Ting,Kaifeng Zhang,Tianrun Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neighborhood Aggregation Strategy (NAS) is a widely used approach in graph embedding, underpinning both Graph Neural Networks (GNNs) and Weisfeiler-Lehman (WL) methods. However, NAS-based methods are identified to be prone to over-smoothing-the loss of node distinguishability with increased iterations-thereby limiting their effectiveness. This paper identifies two characteristics in a network, i.e., the distributions of nodes and node degrees that are critical for expressive representation but have been overlooked in existing methods. We show that these overlooked characteristics contribute significantly to over-smoothing of NAS-methods. To address this, we propose a novel weighted distribution-aware kernel that embeds nodes while taking their distributional characteristics into consideration. Our method has three distinguishing features: (1) it is the first method to explicitly incorporate both distributional characteristics; (2) it requires no optimization; and (3) it effectively mitigates the adverse effects of over-smoothing, allowing WL to preserve node distinguishability and expressiveness even after many iterations of embedding. Experiments demonstrate that our method achieves superior community detection performance via spectral clustering, outperforming existing graph embedding methods, including deep learning methods, on standard benchmarks.

[LG-86] Benchmark Leakage Trap: Can We Trust LLM -based Recommendation?

链接: https://arxiv.org/abs/2602.13626
作者: Mingqiao Zhang,Qiyao Peng,Yumeng Wang,Chunyuan Liu,Hongtao Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual-effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model’s capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings reveal that data leakage acts as a critical, previously unaccounted-for factor in LLM-based recommendation, which could impact the true model performance. We release our code at this https URL.

[LG-87] Interpretable clustering via optimal multiway-split decision trees

链接: https://arxiv.org/abs/2602.13586
作者: Hayato Suzuki,Shunnosuke Ikeda,Yuichi Takano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering serves as a vital tool for uncovering latent data structures, and achieving both high accuracy and interpretability is essential. To this end, existing methods typically construct binary decision trees by solving mixed-integer nonlinear optimization problems, often leading to significant computational costs and suboptimal solutions. Furthermore, binary decision trees frequently result in excessively deep structures, which makes them difficult to interpret. To mitigate these issues, we propose an interpretable clustering method based on optimal multiway-split decision trees, formulated as a 0-1 integer linear optimization problem. This reformulation renders the optimization problem more tractable compared to existing models. A key feature of our method is the integration of a one-dimensional K-means algorithm for the discretization of continuous variables, allowing for flexible and data-driven branching. Extensive numerical experiments on publicly available real-world datasets demonstrate that our method outperforms baseline methods in terms of clustering accuracy and interpretability. Our method yields multiway-split decision trees with concise decision rules while maintaining competitive performance across various evaluation metrics.

[LG-88] Scenario-Adaptive MU-MIMO OFDM Semantic Communication With Asymmetric Neural Network

链接: https://arxiv.org/abs/2602.13557
作者: Chongyang Li,Tianqian Zhang,Shouyin Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semantic Communication (SemCom) has emerged as a promising paradigm for 6G networks, aiming to extract and transmit task-relevant information rather than minimizing bit errors. However, applying SemCom to realistic downlink Multi-User Multi-Input Multi-Output (MU-MIMO) Orthogonal Frequency Division Multiplexing (OFDM) systems remains challenging due to severe Multi-User Interference (MUI) and frequency-selective fading. Existing Deep Joint Source-Channel Coding (DJSCC) schemes, primarily designed for point-to-point links, suffer from performance saturation in multi-user scenarios. To address these issues, we propose a scenario-adaptive MU-MIMO SemCom framework featuring an asymmetric architecture tailored for downlink transmission. At the transmitter, we introduce a scenario-aware semantic encoder that dynamically adjusts feature extraction based on Channel State Information (CSI) and Signal-to-Noise Ratio (SNR), followed by a neural precoding network designed to mitigate MUI in the semantic domain. At the receiver, a lightweight decoder equipped with a novel pilot-guided attention mechanism is employed to implicitly perform channel equalization and feature calibration using reference pilot symbols. Extensive simulation results over 3GPP channel models demonstrate that the proposed framework significantly outperforms DJSCC and traditional Separate Source-Channel Coding (SSCC) schemes in terms of Peak Signal-to-Noise Ratio (PSNR) and classification accuracy, particularly in low-SNR regimes, while maintaining low latency and computational cost on edge devices.

[LG-89] Out-of-Support Generalisation via Weight Space Sequence Modelling

链接: https://arxiv.org/abs/2602.13550
作者: Roussel Desmond Nzoyem
类目: Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, 1 table, 1 algorithm

点击查看摘要

Abstract:As breakthroughs in deep learning transform key industries, models are increasingly required to extrapolate on datapoints found outside the range of the training set, a challenge we coin as out-of-support (OoS) generalisation. However, neural networks frequently exhibit catastrophic failure on OoS samples, yielding unrealistic but overconfident predictions. We address this challenge by reformulating the OoS generalisation problem as a sequence modelling task in the weight space, wherein the training set is partitioned into concentric shells corresponding to discrete sequential steps. Our WeightCaster framework yields plausible, interpretable, and uncertainty-aware predictions without necessitating explicit inductive biases, all the while maintaining high computational efficiency. Emprical validation on a synthetic cosine dataset and real-world air quality sensor readings demonstrates performance competitive or superior to the state-of-the-art. By enhancing reliability beyond in-distribution scenarios, these results hold significant implications for the wider adoption of artificial intelligence in safety-critical applications.

[LG-90] Fast Swap-Based Element Selection for Multiplication-Free Dimension Reduction

链接: https://arxiv.org/abs/2602.13532
作者: Nobutaka Ono
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:In this paper, we propose a fast algorithm for element selection, a multiplication-free form of dimension reduction that produces a dimension-reduced vector by simply selecting a subset of elements from the input. Dimension reduction is a fundamental technique for reducing unnecessary model parameters, mitigating overfitting, and accelerating training and inference. A standard approach is principal component analysis (PCA), but PCA relies on matrix multiplications; on resource-constrained systems, the multiplication count itself can become a bottleneck. Element selection eliminates this cost because the reduction consists only of selecting elements, and thus the key challenge is to determine which elements should be retained. We evaluate a candidate subset through the minimum mean-squared error of linear regression that predicts a target vector from the selected elements, where the target may be, for example, a one-hot label vector in classification. When an explicit target is unavailable, the input itself can be used as the target, yielding a reconstruction-based criterion. The resulting optimization is combinatorial, and exhaustive search is impractical. To address this, we derive an efficient formula for the objective change caused by swapping a selected and an unselected element, using the matrix inversion lemma, and we perform a swap-based local search that repeatedly applies objective-decreasing swaps until no further improvement is possible. Experiments on MNIST handwritten-digit images demonstrate the effectiveness of the proposed method.

[LG-91] QuaRK: A Quantum Reservoir Kernel for Time Series Learning

链接: https://arxiv.org/abs/2602.13531
作者: Abdallah Aaraba,Soumaya Cherkaoui,Ola Ahmad,Shengrui Wang
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Quantum reservoir computing offers a promising route for time series learning by modelling sequential data via rich quantum dynamics while the only training required happens at the level of a lightweight classical readout. However, studies featuring efficient and implementable quantum reservoir architectures along with model learning guarantees remain scarce in the literature. To close this gap, we introduce QuaRK, an end-to-end framework that couples a hardware-realistic quantum reservoir featurizer with a kernel-based readout scheme. Given a sequence of sample points, the reservoir injects the points one after the other to yield a compact feature vector from efficiently measured k-local observables using classical shadow tomography, after which a classical kernel-based readout learns the target mapping with explicit regularization and fast optimization. The resulting pipeline exposes clear computational knobs – circuit width and depth as well as the measurement budget – while preserving the flexibility of kernel methods to model nonlinear temporal functionals and being scalable to high-dimensional data. We further provide learning-theoretic generalization guarantees for dependent temporal data, linking design and resource choices to finite-sample performance, thereby offering principled guidance for building reliable temporal learners. Empirical experiments validate QuaRK and illustrate the predicted interpolation and generalization behaviours on synthetic beta-mixing time series tasks.

[LG-92] Federated Learning of Nonlinear Temporal Dynamics with Graph Attention-based Cross-Client Interpretability

链接: https://arxiv.org/abs/2602.13485
作者: Ayse Tursucular,Ayush Mohanty,Nazal Mohamed,Nagi Gebraeel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Manuscript under review

点击查看摘要

Abstract:Networks of modern industrial systems are increasingly monitored by distributed sensors, where each system comprises multiple subsystems generating high dimensional time series data. These subsystems are often interdependent, making it important to understand how temporal patterns at one subsystem relate to others. This is challenging in decentralized settings where raw measurements cannot be shared and client observations are heterogeneous. In practical deployments each subsystem (client) operates a fixed proprietary model that cannot be modified or retrained, limiting existing approaches. Nonlinear dynamics further make cross client temporal interdependencies difficult to interpret because they are embedded in nonlinear state transition functions. We present a federated framework for learning temporal interdependencies across clients under these constraints. Each client maps high dimensional local observations to low dimensional latent states using a nonlinear state space model. A central server learns a graph structured neural state transition model over the communicated latent states using a Graph Attention Network. For interpretability we relate the Jacobian of the learned server side transition model to attention coefficients, providing the first interpretable characterization of cross client temporal interdependencies in decentralized nonlinear systems. We establish theoretical convergence guarantees to a centralized oracle and validate the framework through synthetic experiments demonstrating convergence, interpretability, scalability and privacy. Additional real world experiments show performance comparable to decentralized baselines.

[LG-93] AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge

链接: https://arxiv.org/abs/2602.13476
作者: Noriaki Hirose,Catherine Glossop,Dhruv Shah,Sergey Levine
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 13 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Robotic foundation models achieve strong generalization by leveraging internet-scale vision-language representations, but their massive computational cost creates a fundamental bottleneck: high inference latency. In dynamic environments, this latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. Inspired by hierarchical control, AsyncVLA runs a large foundation model on a remote workstation to provide high-level guidance, while a lightweight, onboard Edge Adapter continuously refines actions at high frequency. To bridge the domain gap between these asynchronous streams, we introduce an end-to-end finetuning protocol and a trajectory re-weighting strategy that prioritizes dynamic interactions. We evaluate our approach on real-world vision-based navigation tasks with communication delays up to 6 seconds. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines, effectively bridging the gap between the semantic intelligence of large models and the reactivity required for edge robotics.

[LG-94] xt Has Curvature

链接: https://arxiv.org/abs/2602.13418
作者: Karish Grover,Hanqing Zeng,Yinglong Xia,Christos Faloutsos,Geoffrey J. Gordon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Does text have an intrinsic curvature? Language is increasingly modeled in curved geometries - hyperbolic spaces for hierarchy, mixed-curvature manifolds for compositional structure - yet a basic scientific question remains unresolved: what does curvature mean for text itself, in a way that is native to language rather than an artifact of the embedding space we choose? We argue that text does indeed have curvature, and show how to detect it, define it, and use it. To this end, we propose Texture, a text-native, word-level discrete curvature signal, and make three contributions. (a) Existence: We provide empirical and theoretical certificates that semantic inference in natural corpora is non-flat, i.e. language has inherent curvature. (b) Definition: We define Texture by reconciling left- and right-context beliefs around a masked word through a Schrodinger bridge, yielding a curvature field that is positive where context focuses meaning and negative where it fans out into competing continuations. © Utility: Texture is actionable: it serves as a general-purpose measurement and control primitive enabling geometry without geometric training; we instantiate it on two representative tasks, improving long-context inference through curvature-guided compression and retrieval-augmented generation through curvature-guided routing. Together, our results establish a text-native curvature paradigm, making curvature measurable and practically useful.

[LG-95] High-Resolution Climate Projections Using Diffusion-Based Downscaling of a Lightweight Climate Emulator

链接: https://arxiv.org/abs/2602.13416
作者: Haiwen Guan,Moein Darman,Dibyajyoti Chakraborty,Troy Arcomano,Ashesh Chattopadhyay,Romit Maulik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of data-driven models in weather and climate sciences has marked a significant paradigm shift, with advanced models demonstrating exceptional skill in medium-range forecasting. However, these models are often limited by long-term instabilities, climatological drift, and substantial computational costs during training and inference, restricting their broader application for climate studies. Addressing these limitations, Guan et al. (2024) introduced LUCIE, a lightweight, physically consistent climate emulator utilizing a Spherical Fourier Neural Operator (SFNO) architecture. This model is able to reproduce accurate long-term statistics including climatological mean and seasonal variability. However, LUCIE’s native resolution (~300 km) is inadequate for detailed regional impact assessments. To overcome this limitation, we introduce a deep learning-based downscaling framework, leveraging probabilistic diffusion-based generative models with conditional and posterior sampling frameworks. These models downscale coarse LUCIE outputs to 25 km resolution. They are trained on approximately 14,000 ERA5 timesteps spanning 2000-2009 and evaluated on LUCIE predictions from 2010 to 2020. Model performance is assessed through diverse metrics, including latitude-averaged RMSE, power spectrum, probability density functions and First Empirical Orthogonal Function of the zonal wind. We observe that the proposed approach is able to preserve the coarse-grained dynamics from LUCIE while generating fine-scaled climatological statistics at ~28km resolution.

[LG-96] Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise

链接: https://arxiv.org/abs/2602.13413
作者: Yuchen Fang,James Demmel,Javad Lavaei
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite p -th moment for some p \in (1,2] , and measure convergence after T iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate \mathcalO(T^-\fracp-13p-2) when problem parameters are known, and \mathcalO(T^-\fracp-12p) when problem parameters are unknown, matching the optimal rates for normalized SGD, respectively. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the stochastic preconditioner and the gradient estimates. To enable the analysis, we develop a novel vector-valued Burkholder-type inequality that may be of independent interest. These results provide a theoretical explanation for the empirical preference for normalization over clipping in large-scale model training.

[LG-97] Accelerated Discovery of Cryoprotectant Cocktails via Multi-Objective Bayesian Optimization

链接: https://arxiv.org/abs/2602.13398
作者: Daniel Emerson,Nora Gaby-Biegel,Purva Joshi,Yoed Rabin,Rebecca D. Sandlin,Levent Burak Kara
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Designing cryoprotectant agent (CPA) cocktails for vitrification is challenging because formulations must be concentrated enough to suppress ice formation yet non-toxic enough to preserve cell viability. This tradeoff creates a large, multi-objective design space in which traditional discovery is slow, often relying on expert intuition or exhaustive experimentation. We present a data-efficient framework that accelerates CPA cocktail design by combining high-throughput screening with an active-learning loop based on multi-objective Bayesian optimization. From an initial set of measured cocktails, we train probabilistic surrogate models to predict concentration and viability and quantify uncertainty across candidate formulations. We then iteratively select the next experiments by prioritizing cocktails expected to improve the Pareto front, maximizing expected Pareto improvement under uncertainty, and update the models as new assay results are collected. Wet-lab validation shows that our approach efficiently discovers cocktails that simultaneously achieve high CPA concentrations and high post-exposure viability. Relative to a naive strategy and a strong baseline, our method improves dominated hypervolume by 9.5% and 4.5%, respectively, while reducing the number of experiments needed to reach high-quality solutions. In complementary synthetic studies, it recovers a comparably strong set of Pareto-optimal solutions using only 30% of the evaluations required by the prior state-of-the-art multi-objective approach, which amounts to saving approximately 10 weeks of experimental time. Because the framework assumes only a suitable assay and defined formulation space, it can be adapted to different CPA libraries, objective definitions, and cell lines to accelerate cryopreservation development.

[LG-98] he Speed-up Factor: A Quantitative Multi-Iteration Active Learning Performance Metric

链接: https://arxiv.org/abs/2602.13359
作者: Hannes Kath,Thiago S. Gouvêa,Daniel Sonntag
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models excel with abundant annotated data, but annotation is often costly and time-intensive. Active learning (AL) aims to improve the performance-to-annotation ratio by using query methods (QMs) to iteratively select the most informative samples. While AL research focuses mainly on QM development, the evaluation of this iterative process lacks appropriate performance metrics. This work reviews eight years of AL evaluation literature and formally introduces the speed-up factor, a quantitative multi-iteration QM performance metric that indicates the fraction of samples needed to match random sampling performance. Using four datasets from diverse domains and seven QMs of various types, we empirically evaluate the speed-up factor and compare it with state-of-the-art AL performance metrics. The results confirm the assumptions underlying the speed-up factor, demonstrate its accuracy in capturing the described fraction, and reveal its superior stability across iterations.

[LG-99] Benchmarking Anomaly Detection Across Heterogeneous Cloud Telemetry Datasets

链接: https://arxiv.org/abs/2602.13288
作者: Mohammad Saiful Islam,Andriy Miranskyy
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection is important for keeping cloud systems reliable and stable. Deep learning has improved time-series anomaly detection, but most models are evaluated on one dataset at a time. This raises questions about whether these models can handle different types of telemetry, especially in large-scale and high-dimensional environments. In this study, we evaluate four deep learning models, GRU, TCN, Transformer, and TSMixer. We also include Isolation Forest as a classical baseline. The models are tested across four telemetry datasets: the Numenta Anomaly Benchmark, Microsoft Cloud Monitoring dataset, Exathlon dataset, and IBM Console dataset. These datasets differ in structure, dimensionality, and labelling strategy. They include univariate time series, synthetic multivariate workloads, and real-world production telemetry with over 100,000 features. We use a unified training and evaluation pipeline across all datasets. The evaluation includes NAB-style metrics to capture early detection behaviour for datasets where anomalies persist over contiguous time intervals. This enables window-based scoring in settings where anomalies occur over contiguous time intervals, even when labels are recorded at the point level. The unified setup enables consistent analysis of model behaviour under shared scoring and calibration assumptions. Our results demonstrate that anomaly detection performance in cloud systems is governed not only by model architecture, but critically by calibration stability and feature-space geometry. By releasing our preprocessing pipelines, benchmark configuration, and evaluation artifacts, we aim to support reproducible and deployment-aware evaluation of anomaly detection systems for cloud environments. Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG) Cite as: arXiv:2602.13288 [cs.NI] (or arXiv:2602.13288v1 [cs.NI] for this version) https://doi.org/10.48550/arXiv.2602.13288 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Andriy Miranskyy [view email] [v1] Sat, 7 Feb 2026 21:42:21 UTC (484 KB)

[LG-100] Expected Moral Shortfall for Ethical Competence in Decision-making Models

链接: https://arxiv.org/abs/2602.13268
作者: Aisha Aijaz,Raghava Mutharaju,Manohar Kumar
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Moral cognition is a crucial yet underexplored aspect of decision-making in AI models. Regardless of the application domain, it should be a consideration that allows for ethically aligned decision-making. This paper presents a multifaceted contribution to this research space. Firstly, a comparative analysis of techniques to instill ethical competence into AI models has been presented to gauge them on multiple performance metrics. Second, a novel mathematical discretization of morality and a demonstration of its real-life application have been conveyed and tested against other techniques on two datasets. This value is modeled as the risk of loss incurred by the least moral cases, or an Expected Moral Shortfall (EMS), which we direct the AI model to minimize in order to maximize its performance while retaining ethical competence. Lastly, the paper discusses the tradeoff between preliminary AI decision-making metrics such as model performance, complexity, and scale of ethical competence to recognize the true extent of practical social impact.

[LG-101] Securing SIM-Assisted Wireless Networks via Quantum Reinforcement Learning

链接: https://arxiv.org/abs/2602.13238
作者: Le-Hung Hoang,Quang-Trung Luu,Dinh Thai Hoang,Diep N. Nguyen,Van-Dinh Nguyen
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 13 pages. Submmited for possible publication

点击查看摘要

Abstract:Stacked intelligent metasurfaces (SIMs) have recently emerged as a powerful wave-domain technology that enables multi-stage manipulation of electromagnetic signals through multilayer programmable architectures. While SIMs offer unprecedented degrees of freedom for enhancing physical-layer security, their extremely large number of meta-atoms leads to a high-dimensional and strongly coupled optimization space, making conventional design approaches inefficient and difficult to scale. Moreover, existing deep reinforcement learning (DRL) techniques suffer from slow convergence and performance degradation in dynamic wireless environments with imperfect knowledge of passive eavesdroppers. To overcome these challenges, we propose a hybrid quantum proximal policy optimization (Q-PPO) framework for SIM-assisted secure communications, which jointly optimizes transmit power allocation and SIM phase shifts to maximize the average secrecy rate under power and quality-of-service constraints. Specifically, a parameterized quantum circuit is embedded into the actor network, forming a hybrid classical-quantum policy architecture that enhances policy representation capability and exploration efficiency in high-dimensional continuous action spaces. Extensive simulations demonstrate that the proposed Q-PPO scheme consistently outperforms DRL baselines, achieving approximately 15% higher secrecy rates and 30% faster convergence under imperfect eavesdropper channel state information. These results establish Q-PPO as a powerful optimization paradigm for SIM-enabled secure wireless networks.

[LG-102] Generalization from Low- to Moderate-Resolution Spectra with Neural Networks for Stellar Parameter Estimation: A Case Study with DESI

链接: https://arxiv.org/abs/2602.15021
作者: Xiaosheng Zhao,Yuan-Sen Ting,Rosemary F.G. Wyse,Alexander S. Szalay,Yang Huang,László Dobos,Tamás Budavári,Viska Wei
类目: olar and Stellar Astrophysics (astro-ph.SR); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: 20 pages, 13 figures, 4 tables. Submitted to AAS journals. Comments welcome

点击查看摘要

Abstract:Cross-survey generalization is a critical challenge in stellar spectral analysis, particularly in cases such as transferring from low- to moderate-resolution surveys. We investigate this problem using pre-trained models, focusing on simple neural networks such as multilayer perceptrons (MLPs), with a case study transferring from LAMOST low-resolution spectra (LRS) to DESI medium-resolution spectra (MRS). Specifically, we pre-train MLPs on either LRS or their embeddings and fine-tune them for application to DESI stellar spectra. We compare MLPs trained directly on spectra with those trained on embeddings derived from transformer-based models (self-supervised foundation models pre-trained for multiple downstream tasks). We also evaluate different fine-tuning strategies, including residual-head adapters, LoRA, and full fine-tuning. We find that MLPs pre-trained on LAMOST LRS achieve strong performance, even without fine-tuning, and that modest fine-tuning with DESI spectra further improves the results. For iron abundance, embeddings from a transformer-based model yield advantages in the metal-rich ([Fe/H] -1.0) regime, but underperform in the metal-poor regime compared to MLPs trained directly on LRS. We also show that the optimal fine-tuning strategy depends on the specific stellar parameter under consideration. These results highlight that simple pre-trained MLPs can provide competitive cross-survey generalization, while the role of spectral foundation models for cross-survey stellar parameter estimation requires further exploration.

[LG-103] Faster Molecular Dynamics with Neural Network Potentials via Distilled Multiple Time-Stepping and Non-Conservative Forces

链接: https://arxiv.org/abs/2602.14975
作者: Nicolaï Gouraud,Côme Cattin,Thomas Plé,Olivier Adjoua,Louis Lagardère,Jean-Philip Piquemal
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Following our previous work (J. Phys. Chem. Lett., 2026, 17, 5, 1288-1295), we propose the DMTS-NC approach, a distilled multi-time-step (DMTS) strategy using non conservative (NC) forces to further accelerate atomistic molecular dynamics simulations using foundation neural network models. There, a dual-level reversible reference system propagator algorithm (RESPA) formalism couples a target accurate conservative potential to a simplified distilled representation optimized for the production of non-conservative forces. Despite being non-conservative, the distilled architecture is designed to enforce key physical priors, such as equivariance under rotation and cancellation of atomic force components. These choices facilitate the distillation process and therefore improve drastically the robustness of simulation, significantly limiting the “holes” in the simpler potential, thus achieving excellent agreement with the forces data. Overall, the DMTS-NC scheme is found to be more stable and efficient than its conservative counterpart with additional speedups reaching 15-30% over DMTS. Requiring no finetuning steps, it is easier to implement and can be pushed to the limit of the systems physical resonances to maintain accuracy while providing maximum efficiency. As for DMTS, DMTS-NC is applicable to any neural network potential.

[LG-104] Activation-Space Uncertainty Quantification for Pretrained Networks

链接: https://arxiv.org/abs/2602.14934
作者: Richard Bergna,Stefan Depeweg,Sergio Calvo-Ordoñez,Jonathan Plenk,Alvaro Cartea,Jose Miguel Hernández-Lobato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable uncertainty estimates are crucial for deploying pretrained models; yet, many strong methods for quantifying uncertainty require retraining, Monte Carlo sampling, or expensive second-order computations and may alter a frozen backbone’s predictions. To address this, we introduce Gaussian Process Activations (GAPA), a post-hoc method that shifts Bayesian modeling from weights to activations. GAPA replaces standard nonlinearities with Gaussian-process activations whose posterior mean exactly matches the original activation, preserving the backbone’s point predictions by construction while providing closed-form epistemic variances in activation space. To scale to modern architectures, we use a sparse variational inducing-point approximation over cached training activations, combined with local k-nearest-neighbor subset conditioning, enabling deterministic single-pass uncertainty propagation without sampling, backpropagation, or second-order information. Across regression, classification, image segmentation, and language modeling, GAPA matches or outperforms strong post-hoc baselines in calibration and out-of-distribution detection while remaining efficient at test time.

[LG-105] From Classical to Quantum: Extending Prometheus for Unsupervised Discovery of Phase Transitions in Three Dimensions and Quantum Systems

链接: https://arxiv.org/abs/2602.14928
作者: Brandon Yee,Wilson Collins,Maximilian Rutkowski
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We extend the Prometheus framework for unsupervised phase transition discovery from 2D classical systems to 3D classical and quantum many-body systems, addressing scalability in higher dimensions and generalization to quantum fluctuations. For the 3D Ising model ( L \leq 32 ), the framework detects the critical temperature within 0.01% of literature values ( T_c/J = 4.511 \pm 0.005 ) and extracts critical exponents with \geq 70% accuracy ( \beta = 0.328 \pm 0.015 , \gamma = 1.24 \pm 0.06 , \nu = 0.632 \pm 0.025 ), correctly identifying the 3D Ising universality class via \chi^2 comparison ( p = 0.72 ) without analytical guidance. For quantum systems, we developed quantum-aware VAE (Q-VAE) architectures using complex-valued wavefunctions and fidelity-based loss. Applied to the transverse field Ising model, we achieve 2% accuracy in quantum critical point detection ( h_c/J = 1.00 \pm 0.02 ) and successfully discover ground state magnetization as the order parameter ( r = 0.97 ). Notably, for the disordered transverse field Ising model, we detect exotic infinite-randomness criticality characterized by activated dynamical scaling \ln \xi \sim |h - h_c|^-\psi , extracting a tunneling exponent \psi = 0.48 \pm 0.08 consistent with theoretical predictions ( \psi = 0.5 ). This demonstrates that unsupervised learning can identify qualitatively different types of critical behavior, not just locate critical points. Our systematic validation across classical thermal transitions ( T = 0 to T 0 ) and quantum phase transitions ( T = 0 , varying h ) establishes that VAE-based discovery generalizes across fundamentally different physical domains, providing robust tools for exploring phase diagrams where analytical solutions are unavailable.

[LG-106] Adjoint-based Shape Optimization Machine Learning based Surrogate Models Conditional Variational Autoencoder (CVAE) Voith Schneider propulsion (VSP) Self-propelled Ship Propulsion Model Hull Optimization

链接: https://arxiv.org/abs/2602.14907
作者: Moloud Arian Maram,Georgios Bletsos,Thanh Tung Nguyen,Ahmed Hassan,Michael Palm,Thomas Rung
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adjoint-based shape optimization of ship hulls is a powerful tool for addressing high-dimensional design problems in naval architecture, particularly in minimizing the ship resistance. However, its application to vessels that employ complex propulsion systems introduces significant challenges. They arise from the need for transient simulations extending over long periods of time with small time steps and from the reverse temporal propagation of the primal and adjoint solutions. These challenges place considerable demands on the required storage and computing power, which significantly hamper the use of adjoint methods in the industry. To address this issue, we propose a machine learning-assisted optimization framework that employs a Conditional Variational Autoencoder-based surrogate model of the propulsion system. The surrogate model replicates the time-averaged flow field induced by a Voith Schneider Propeller and replaces the geometrically and time-resolved propeller with a data-driven approximation. Primal flow verification examples demonstrate that the surrogate model achieves significant computational savings while maintaining the necessary accuracy of the resolved propeller. Optimization studies show that ignoring the propulsion system can yield designs that perform worse than the initial shape. In contrast, the proposed method produces shapes that achieve more than an 8% reduction in resistance.

[LG-107] Drift-Diffusion Matching: Embedding dynamics in latent manifolds of asymmetric neural networks

链接: https://arxiv.org/abs/2602.14885
作者: Ramón Nartallo-Kaluarachchi,Renaud Lambiotte,Alain Goriely
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 23 pages, 15 figures

点击查看摘要

Abstract:Recurrent neural networks (RNNs) provide a theoretical framework for understanding computation in biological neural circuits, yet classical results, such as Hopfield’s model of associative memory, rely on symmetric connectivity that restricts network dynamics to gradient-like flows. In contrast, biological networks support rich time-dependent behaviour facilitated by their asymmetry. Here we introduce a general framework, which we term drift-diffusion matching, for training continuous-time RNNs to represent arbitrary stochastic dynamical systems within a low-dimensional latent subspace. Allowing asymmetric connectivity, we show that RNNs can faithfully embed the drift and diffusion of a given stochastic differential equation, including nonlinear and nonequilibrium dynamics such as chaotic attractors. As an application, we construct RNN realisations of stochastic systems that transiently explore various attractors through both input-driven switching and autonomous transitions driven by nonequilibrium currents, which we interpret as models of associative and sequential (episodic) memory. To elucidate how these dynamics are encoded in the network, we introduce decompositions of the RNN based on its asymmetric connectivity and its time-irreversibility. Our results extend attractor neural network theory beyond equilibrium, showing that asymmetric neural populations can implement a broad class of dynamical computations within low-dimensional manifolds, unifying ideas from associative memory, nonequilibrium statistical mechanics, and neural computation.

[LG-108] Fast and accurate quasi-atom method for simultaneous atomistic and continuum simulation of solids

链接: https://arxiv.org/abs/2602.14867
作者: Artem Chuprov,Egor E. Nuzhin,Alexey A. Tsukanov,Nikolay V. Brilliantov
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We report a novel hybrid method of simultaneous atomistic simulation of solids in critical regions (contacts surfaces, cracks areas, etc.), along with continuum modeling of other parts. The continuum is treated in terms of quasi-atoms of different size, comprising composite medium. The parameters of interaction potential between the quasi-atoms are optimized to match elastic properties of the composite medium to those of the atomic one. The optimization method coincides conceptually with the online Machine Learning (ML) methods, making it computationally very efficient. Such an approach allows a straightforward application of standard software packages for molecular dynamics (MD), supplemented by the ML-based optimizer. The new method is applied to model systems with a simple, pairwise Lennard-Jones potential, as well with multi-body Tersoff potential, describing covalent bonds. Using LAMMPS software we simulate collision of particles of different size. Comparing simulation results, obtained by the novel method, with full-atomic simulations, we demonstrate its accuracy, validity and overwhelming superiority in computational speed. Furthermore, we compare our method with other hybrid methods, specifically, with the closest one – AtC (Atomic to Continuum) method. We demonstrate a significant superiority of our approach in computational speed and implementation convenience. Finally, we discuss a possible extension of the method for modeling other phenomena.

[LG-109] RF-GPT : Teaching AI to See the Wireless World

链接: https://arxiv.org/abs/2602.14833
作者: Hang Zou,Yu Tian,Bohao Wang,Lina Bariah,Samson Lasaulce,Chongwen Huang,Mérouane Debbah
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) and multimodal models have become powerful general-purpose reasoning systems. However, radio-frequency (RF) signals, which underpin wireless systems, are still not natively supported by these models. Existing LLM-based approaches for telecom focus mainly on text and structured data, while conventional RF deep-learning models are built separately for specific signal-processing tasks, highlighting a clear gap between RF perception and high-level reasoning. To bridge this gap, we introduce RF-GPT, a radio-frequency language model (RFLM) that utilizes the visual encoders of multimodal LLMs to process and understand RF spectrograms. In this framework, complex in-phase/quadrature (IQ) waveforms are mapped to time-frequency spectrograms and then passed to pretrained visual encoders. The resulting representations are injected as RF tokens into a decoder-only LLM, which generates RF-grounded answers, explanations, and structured outputs. To train RF-GPT, we perform supervised instruction fine-tuning of a pretrained multimodal LLM using a fully synthetic RF corpus. Standards-compliant waveform generators produce wideband scenes for six wireless technologies, from which we derive time-frequency spectrograms, exact configuration metadata, and dense captions. A text-only LLM then converts these captions into RF-grounded instruction-answer pairs, yielding roughly 12,000 RF scenes and 0.625 million instruction examples without any manual labeling. Across benchmarks for wideband modulation classification, overlap analysis, wireless-technology recognition, WLAN user counting, and 5G NR information extraction, RF-GPT achieves strong multi-task performance, whereas general-purpose VLMs with no RF grounding largely fail.

[LG-110] Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability

链接: https://arxiv.org/abs/2602.14828
作者: Ana F. Rodrigues,Lucas Ferraz,Laura Balbi,Pedro Giesteira Cotovio,Catia Pesquita
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.

[LG-111] SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment ICASSP2026

链接: https://arxiv.org/abs/2602.14785
作者: Fengyuan Cao,Xinyu Liang,Fredrik Cumlin,Victor Ungureanu,Chandan K. A. Reddy,Christian Schuldt,Saikat Chatterjee
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Accepted at ICASSP 2026

点击查看摘要

Abstract:Designing a speech quality assessment (SQA) system for estimating mean-opinion-score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises due to the limited availability of a MOS-labeled training dataset comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard high-frequency information present in higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pre-trained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental results show that leveraging high-frequency information overlooked by SSL features is crucial for accurate multi-rate SQA, and that the proposed two-step training substantially improves generalization when multi-rate data is limited.

[LG-112] he Signal Horizon: Local Blindness and the Contraction of Pauli-Weight Spectra in Noisy Quantum Encodings

链接: https://arxiv.org/abs/2602.14735
作者: Ait Haddou Marwan
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The performance of quantum classifiers is typically analyzed through global state distinguishability or the trainability of variational models. This study investigates how much class information remains accessible under locality-constrained measurements in the presence of noise. The authors formulate binary quantum classification as constrained quantum state discrimination and introduce a locality-restricted distinguishability measure quantifying the maximum bias achievable by observables acting on at most k subsystems. For n -qubit systems subject to independent depolarizing noise, the locally accessible signal is governed by a Pauli-weight-dependent contraction mechanism. This motivates a computable predictor, the k -local Pauli-accessible amplitude A_k§ , which lower bounds the optimal k -local classification advantage. Numerical experiments on four-qubit encodings demonstrate quantitative agreement between empirical accuracy and the prediction across noise levels. The research identifies an operational breakdown threshold where k -local classifiers become indistinguishable from random guessing despite persistent global distinguishability.

[LG-113] Kernel-based optimization of measurement operators for quantum reservoir computers

链接: https://arxiv.org/abs/2602.14677
作者: Markus Gross,Hans-Martin Rieser
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 26 pages, 4 figures

点击查看摘要

Abstract:Finding optimal measurement operators is crucial for the performance of quantum reservoir computers (QRCs), since they employ a fixed quantum feature map. We formulate the training of both stateless (quantum extreme learning machines, QELMs) and stateful (memory dependent) QRCs in the framework of kernel ridge regression. This approach renders an optimal measurement operator that minimizes prediction error for a given reservoir and training dataset. For large qubit numbers, this method is more efficient than the conventional training of QRCs. We discuss efficiency and practical implementation strategies, including Pauli basis decomposition and operator diagonalization, to adapt the optimal observable to hardware constraints. Numerical experiments on image classification and time series prediction tasks demonstrate the effectiveness of this approach, which can also be applied to other quantum ML models.

[LG-114] GenPANIS: A Latent-Variable Generative Framework for Forward and Inverse PDE Problems in Multiphase Media

链接: https://arxiv.org/abs/2602.14642
作者: Matthaios Chatzopoulos,Phaedon-Stelios Koutsourelakis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse problems and inverse design in multiphase media, i.e., recovering or engineering microstructures to achieve target macroscopic responses, require operating on discrete-valued material fields, rendering the problem non-differentiable and incompatible with gradient-based methods. Existing approaches either relax to continuous approximations, compromising physical fidelity, or employ separate heavyweight models for forward and inverse tasks. We propose GenPANIS, a unified generative framework that preserves exact discrete microstructures while enabling gradient-based inference through continuous latent embeddings. The model learns a joint distribution over microstructures and PDE solutions, supporting bidirectional inference (forward prediction and inverse recovery) within a single architecture. The generative formulation enables training with unlabeled data, physics residuals, and minimal labeled pairs. A physics-aware decoder incorporating a differentiable coarse-grained PDE solver preserves governing equation structure, enabling extrapolation to varying boundary conditions and microstructural statistics. A learnable normalizing flow prior captures complex posterior structure for inverse problems. Demonstrated on Darcy flow and Helmholtz equations, GenPANIS maintains accuracy on challenging extrapolative scenarios - including unseen boundary conditions, volume fractions, and microstructural morphologies, with sparse, noisy observations. It outperforms state-of-the-art methods while using 10 - 100 times fewer parameters and providing principled uncertainty quantification.

[LG-115] Quantum Reservoir Computing with Neutral Atoms on a Small Complex Medical Dataset

链接: https://arxiv.org/abs/2602.14641
作者: Luke Antoncich,Yuben Moodley,Ugo Varetto,Jingbo Wang,Jonathan Wurtz,Jing Chen,Pascal Jahan Elahi,Casey R. Myers
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Biomarker-based prediction of clinical outcomes is challenging due to nonlinear relationships, correlated features, and the limited size of many medical datasets. Classical machine-learning methods can struggle under these conditions, motivating the search for alternatives. In this work, we investigate quantum reservoir computing (QRC), using both noiseless emulation and hardware execution on the neutral-atom Rydberg processor \textitAquila. We evaluate performance with six classical machine-learning models and use SHAP to generate feature subsets. We find that models trained on emulated quantum features achieve mean test accuracies comparable to those trained on classical features, but have higher training accuracies and greater variability over data splits, consistent with overfitting. When comparing hardware execution of QRC to noiseless emulation, the models are more robust over different data splits and often exhibit statistically significant improvements in mean test accuracy. This combination of improved accuracy and increased stability is suggestive of a regularising effect induced by hardware execution. To investigate the origin of this behaviour, we examine the statistical differences between hardware and emulated quantum feature distributions. We find that hardware execution applies a structured, time-dependent transformation characterised by compression toward the mean and a progressive reduction in mutual information relative to emulation.

[LG-116] A Bayesian Approach to Low-Discrepancy Subset Selection

链接: https://arxiv.org/abs/2602.14607
作者: Nathan Kirk
类目: Methodology (stat.ME); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注: 13 pages, 3 figures, mODa14

点击查看摘要

Abstract:Low-discrepancy designs play a central role in quasi-Monte Carlo methods and are increasingly influential in other domains such as machine learning, robotics and computer graphics, to name a few. In recent years, one such low-discrepancy construction method called subset selection has received a lot of attention. Given a large population, one optimally selects a small low-discrepancy subset with respect to a discrepancy-based objective. Versions of this problem are known to be NP-hard. In this text, we establish, for the first time, that the subset selection problem with respect to kernel discrepancies is also NP-hard. Motivated by this intractability, we propose a Bayesian Optimization procedure for the subset selection problem utilizing the recent notion of deep embedding kernels. We demonstrate the performance of the BO algorithm to minimize discrepancy measures and note that the framework is broadly applicable any design criteria.

[LG-117] Constrained and Composite Sampling via Proximal Sampler

链接: https://arxiv.org/abs/2602.14478
作者: Thanh Dang,Jiaming Liang
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: The main paper is 13 pages; the rest are appendices

点击查看摘要

Abstract:We study two log-concave sampling problems: constrained sampling and composite sampling. First, we consider sampling from a target distribution with density proportional to \exp(-f(x)) supported on a convex set K \subset \mathbbR^d , where f is convex. The main challenge is enforcing feasibility without degrading mixing. Using an epigraph transformation, we reduce this task to sampling from a nearly uniform distribution over a lifted convex set in \mathbbR^d+1 . We then solve the lifted problem using a proximal sampler. Assuming only a separation oracle for K and a subgradient oracle for f , we develop an implementation of the proximal sampler based on the cutting-plane method and rejection sampling. Unlike existing constrained samplers that rely on projection, reflection, barrier functions, or mirror maps, our approach enforces feasibility using only minimal oracle access, resulting in a practical and unbiased sampler without knowing the geometry of the constraint set. Second, we study composite sampling, where the target is proportional to \exp(-f(x)-h(x)) with closed and convex f and h . This composite structure is standard in Bayesian inference with f modeling data fidelity and h encoding prior information. We reduce composite sampling via an epigraph lifting of h to constrained sampling in \mathbbR^d+1 , which allows direct application of the constrained sampling algorithm developed in the first part. This reduction results in a double epigraph lifting formulation in \mathbbR^d+2 , on which we apply a proximal sampler. By keeping f and h separate, we further demonstrate how different combinations of oracle access (such as subgradient and proximal) can be leveraged to construct separation oracles for the lifted problem. For both sampling problems, we establish mixing time bounds measured in Rényi and \chi^2 divergences. Comments: The main paper is 13 pages; the rest are appendices Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2602.14478 [stat.ML] (or arXiv:2602.14478v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.14478 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-118] Frequentist Regret Analysis of Gaussian Process Thompson Sampling via Fractional Posteriors

链接: https://arxiv.org/abs/2602.14472
作者: Somjit Roy,Prateek Jaiswal,Anirban Bhattacharya,Debdeep Pati,Bani K. Mallick
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 34 pages, Submitted

点击查看摘要

[LG-119] CAIRO: Decoupling Order from Scale in Regression

链接: https://arxiv.org/abs/2602.14440
作者: Harri Vanhems,Yue Zhao,Peng Shi,Archer Y. Yang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Standard regression methods typically optimize a single pointwise objective, such as mean squared error, which conflates the learning of ordering with the learning of scale. This coupling renders models vulnerable to outliers and heavy-tailed noise. We propose CAIRO (Calibrate After Initial Rank Ordering), a framework that decouples regression into two distinct stages. In the first stage, we learn a scoring function by minimizing a scale-invariant ranking loss; in the second, we recover the target scale via isotonic regression. We theoretically characterize a class of “Optimal-in-Rank-Order” objectives – including variants of RankNet and Gini covariance – and prove that they recover the ordering of the true conditional mean under mild assumptions. We further show that subsequent monotone calibration guarantees recovery of the true regression function. Empirically, CAIRO combines the representation learning of neural networks with the robustness of rank-based statistics. It matches the performance of state-of-the-art tree ensembles on tabular benchmarks and significantly outperforms standard regression objectives in regimes with heavy-tailed or heteroskedastic noise.

[LG-120] High-accuracy log-concave sampling with stochastic queries

链接: https://arxiv.org/abs/2602.14342
作者: Fan Chen,Sinho Chewi,Constantinos Daskalakis,Alexander Rakhlin
类目: atistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We show that high-accuracy guarantees for log-concave sampling – that is, iteration and query complexities which scale as \mathrmpoly\log(1/\delta) , where \delta is the desired target accuracy – are achievable using stochastic gradients with subexponential tails. Notably, this exhibits a separation with the problem of convex optimization, where stochasticity (even additive Gaussian noise) in the gradient oracle incurs \mathrmpoly(1/\delta) queries. We also give an information-theoretic argument that light-tailed stochastic gradients are necessary for high accuracy: for example, in the bounded variance case, we show that the minimax-optimal query complexity scales as \Theta(1/\delta) . Our framework also provides similar high accuracy guarantees under stochastic zeroth order (value) queries.

[LG-121] Fast Compute for ML Optimization

链接: https://arxiv.org/abs/2602.14280
作者: Nick Polson,Vadim Sokolov
类目: Computation (stat.CO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam’s second-moment scaling and AdamW’s weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with p \in \20, \ldots, 500\ , SM-EM with Nesterov acceleration attains up to 13\times lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a 10\times runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.

[LG-122] Federated Ensemble Learning with Progressive Model Personalization

链接: https://arxiv.org/abs/2602.14244
作者: Ala Emrani,Amir Najafi,Abolfazl Motahari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages

点击查看摘要

Abstract:Federated Learning provides a privacy-preserving paradigm for distributed learning, but suffers from statistical heterogeneity across clients. Personalized Federated Learning (PFL) mitigates this issue by considering client-specific models. A widely adopted approach in PFL decomposes neural networks into a shared feature extractor and client-specific heads. While effective, this design induces a fundamental tradeoff: deep or expressive shared components hinder personalization, whereas large local heads exacerbate overfitting under limited per-client data. Most existing methods rely on rigid, shallow heads, and therefore fail to navigate this tradeoff in a principled manner. In this work, we propose a boosting-inspired framework that enables a smooth control of this tradeoff. Instead of training a single personalized model, we construct an ensemble of T models for each client. Across boosting iterations, the depth of the personalized component are progressively increased, while its effective complexity is systematically controlled via low-rank factorization or width shrinkage. This design simultaneously limits overfitting and substantially reduces per-client bias by allowing increasingly expressive personalization. We provide theoretical analysis that establishes generalization bounds with favorable dependence on the average local sample size and the total number of clients. Specifically, we prove that the complexity of the shared layers is effectively suppressed, while the dependence on the boosting horizon T is controlled through parameter reduction. Notably, we provide a novel nonlinear generalization guarantee for decoupled PFL models. Extensive experiments on benchmark and real-world datasets (e.g., EMNIST, CIFAR-10/100, and Sent140) demonstrate that the proposed framework consistently outperforms state-of-the-art PFL methods under heterogeneous data distributions.

[LG-123] Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting

链接: https://arxiv.org/abs/2602.14029
作者: Mingqi Wu,Archer Y. Yang,Qiang Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 8 pages main, 29 pages in total

点击查看摘要

Abstract:Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a U -shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising-forgetting trade-off.

[LG-124] Computable Bernstein Certificates for Cross-Fitted Clipped Covariance Estimation

链接: https://arxiv.org/abs/2602.14020
作者: Even He,Zaizai Yan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study operator-norm covariance estimation from heavy-tailed samples that may include a small fraction of arbitrary outliers. A simple and widely used safeguard is \emphEuclidean norm clipping, but its accuracy depends critically on an unknown clipping level. We propose a cross-fitted clipped covariance estimator equipped with \emphfully computable Bernstein-type deviation certificates, enabling principled data-driven tuning via a selector (\emphMinUpper) that balances certified stochastic error and a robust hold-out proxy for clipping bias. The resulting procedure adapts to intrinsic complexity measures such as effective rank under mild tail regularity and retains meaningful guarantees under only finite fourth moments. Experiments on contaminated spiked-covariance benchmarks illustrate stable performance and competitive accuracy across regimes.

[LG-125] A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization

链接: https://arxiv.org/abs/2602.13942
作者: Zexuan Sun,Garvesh Raskutti
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of large language models (LLMs), fine-tuning pretrained models has become ubiquitous. Yet the theoretical underpinning remains an open question. A central question is why only a few epochs of fine-tuning are typically sufficient to achieve strong performance on many different tasks. In this work, we approach this question by developing a statistical framework, combining rigorous early stopping theory with the attention-based Neural Tangent Kernel (NTK) for LLMs, offering new theoretical insights on fine-tuning practices. Specifically, we formally extend classical NTK theory [Jacot et al., 2018] to non-random (i.e., pretrained) initializations and provide a convergence guarantee for attention-based fine-tuning. One key insight provided by the theory is that the convergence rate with respect to sample size is closely linked to the eigenvalue decay rate of the empirical kernel matrix induced by the NTK. We also demonstrate how the framework can be used to explain task vectors for multiple tasks in LLMs. Finally, experiments with modern language models on real-world datasets provide empirical evidence supporting our theoretical insights.

[LG-126] Quantifying Normality: Convergence Rate to Gaussian Limit for Stochastic Approximation and Unadjusted OU Algorithm

链接: https://arxiv.org/abs/2602.13906
作者: Shaan Ul Haque,Zedong Wang,Zixuan Zhang,Siva Theja Maguluri
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages

点击查看摘要

Abstract:Stochastic approximation (SA) is a method for finding the root of an operator perturbed by noise. There is a rich literature establishing the asymptotic normality of rescaled SA iterates under fairly mild conditions. However, these asymptotic results do not quantify the accuracy of the Gaussian approximation in finite time. In this paper, we establish explicit non-asymptotic bounds on the Wasserstein distance between the distribution of the rescaled iterate at time k and the asymptotic Gaussian limit for various choices of step-sizes including constant and polynomially decaying. As an immediate consequence, we obtain tail bounds on the error of SA iterates at any time. We obtain the sharp rates by first studying the convergence rate of the discrete Ornstein-Uhlenbeck (O-U) process driven by general noise, whose stationary distribution is identical to the limiting Gaussian distribution of the rescaled SA iterates. We believe that this is of independent interest, given its connection to sampling literature. The analysis involves adapting Stein’s method for Gaussian approximation to handle the matrix weighted sum of i.i.d. random variables. The desired finite-time bounds for SA are obtained by characterizing the error dynamics between the rescaled SA iterate and the discrete time O-U process and combining it with the convergence rate of the latter process. Comments: 42 pages Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2602.13906 [stat.ML] (or arXiv:2602.13906v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.13906 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-127] Ensemble-Conditional Gaussian Processes (Ens-CGP): Representation Geometry and Inference

链接: https://arxiv.org/abs/2602.13871
作者: Sai Ravela,Jae Deok Kim,Kenneth Gee,Xingjian Yan,Samson Mercier,Lubna Albarghouty,Anamitra Saha
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 20 pages. Technical manuscrupt on representational equivalence between conditional Gaussian inference, quadratic optimization, and RKHS geometry in finite dimensions

点击查看摘要

Abstract:We formulate Ensemble-Conditional Gaussian Processes (Ens-CGP), a finite-dimensional synthesis that centers ensemble-based inference on the conditional Gaussian law. Conditional Gaussian processes (CGP) arise directly from Gaussian processes under conditioning and, in linear-Gaussian settings, define the full posterior distribution for a Gaussian prior and linear observations. Classical Kalman filtering is a recursive algorithm that computes this same conditional law under dynamical assumptions; the conditional Gaussian law itself is therefore the underlying representational object, while the filter is one computational realization. In this sense, CGP provides the probabilistic foundation for Kalman-type methods as well as equivalent formulations as a strictly convex quadratic program (MAP estimation), RKHS-regularized regression, and classical regularization. Ens-CGP is the ensemble instantiation of this object, obtained by treating empirical ensemble moments as a (possibly low-rank) Gaussian prior and performing exact conditioning. By separating representation (GP - CGP - Ens-CGP) from computation (Kalman filters, EnKF variants, and iterative ensemble schemes), the framework links an earlier-established representational foundation for inference to ensemble-derived priors and clarifies the relationships among probabilistic, variational, and ensemble perspectives.

[LG-128] Causally constrained reduced-order neural models of complex turbulent dynamical systems

链接: https://arxiv.org/abs/2602.13847
作者: Fabrizio Falasca,Laure Zanna
类目: Chaotic Dynamics (nlin.CD); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:We introduce a flexible framework based on response theory and score matching to suppress spurious, noncausal dependencies in reduced-order neural emulators of turbulent systems, focusing on climate dynamics as a proof-of-concept. We showcase the approach using the stochastic Charney-DeVore model as a relevant prototype for low-frequency atmospheric variability. We show that the resulting causal constraints enhance neural emulators’ ability to respond to both weak and strong external forcings, despite being trained exclusively on unforced data. The approach is broadly applicable to modeling complex turbulent dynamical systems in reduced spaces and can be readily integrated into general neural network architectures.

[LG-129] NeuroMambaLLM : Dynamic Graph Learning of fMRI Functional Connectivity in Autistic Brains Using Mamba and Language Model Reasoning

链接: https://arxiv.org/abs/2602.13770
作者: Yasaman Torabi,Parsa Razmara,Hamed Ajorlou,Bardia Baraeinejad
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong semantic reasoning across multimodal domains. However, their integration with graph-based models of brain connectivity remains limited. In addition, most existing fMRI analysis methods rely on static Functional Connectivity (FC) representations, which obscure transient neural dynamics critical for neurodevelopmental disorders such as autism. Recent state-space approaches, including Mamba, model temporal structure efficiently, but are typically used as standalone feature extractors without explicit high-level reasoning. We propose NeuroMambaLLM, an end-to-end framework that integrates dynamic latent graph learning and selective state-space temporal modelling with LLMs. The proposed method learns the functional connectivity dynamically from raw Blood-Oxygen-Level-Dependent (BOLD) time series, replacing fixed correlation graphs with adaptive latent connectivity while suppressing motion-related artifacts and capturing long-range temporal dependencies. The resulting dynamic brain representations are projected into the embedding space of an LLM model, where the base language model remains frozen and lightweight low-rank adaptation (LoRA) modules are trained for parameter-efficient alignment. This design enables the LLM to perform both diagnostic classification and language-based reasoning, allowing it to analyze dynamic fMRI patterns and generate clinically meaningful textual reports.

[LG-130] Locally Private Parametric Methods for Change-Point Detection

链接: https://arxiv.org/abs/2602.13619
作者: Anuj Kumar Yadav,Cemre Cadir,Yanina Shkel,Michael Gastpar
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 43 pages, 20 figures

点击查看摘要

Abstract:We study parametric change-point detection, where the goal is to identify distributional changes in time series, under local differential privacy. In the non-private setting, we derive improved finite-sample accuracy guarantees for a change-point detection algorithm based on the generalized log-likelihood ratio test, via martingale methods. In the private setting, we propose two locally differentially private algorithms based on randomized response and binary mechanisms, and analyze their theoretical performance. We derive bounds on detection accuracy and validate our results through empirical evaluation. Our results characterize the statistical cost of local differential privacy in change-point detection and show how privacy degrades performance relative to a non-private benchmark. As part of this analysis, we establish a structural result for strong data processing inequalities (SDPI), proving that SDPI coefficients for Rényi divergences and their symmetric variants (Jeffreys-Rényi divergences) are achieved by binary input distributions. These results on SDPI coefficients are also of independent interest, with applications to statistical estimation, data compression, and Markov chain mixing.

[LG-131] Learning Gradient Flow: Using Equation Discovery to Accelerate Engineering Optimization

链接: https://arxiv.org/abs/2602.13513
作者: Grant Norman,Conor Rowan,Kurt Maute,Alireza Doostan
类目: Optimization and Control (math.OC); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注: 44 pages, 13 figures. To be submitted to CMAME

点击查看摘要

Abstract:In this work, we investigate the use of data-driven equation discovery for dynamical systems to model and forecast continuous-time dynamics of unconstrained optimization problems. To avoid expensive evaluations of the objective function and its gradient, we leverage trajectory data on the optimization variables to learn the continuous-time dynamics associated with gradient descent, Newton’s method, and ADAM optimization. The discovered gradient flows are then solved as a surrogate for the original optimization problem. To this end, we introduce the Learned Gradient Flow (LGF) optimizer, which is equipped to build surrogate models of variable polynomial order in full- or reduced-dimensional spaces at user-defined intervals in the optimization process. We demonstrate the efficacy of this approach on several standard problems from engineering mechanics and scientific machine learning, including two inverse problems, structural topology optimization, and two forward solves with different discretizations. Our results suggest that the learned gradient flows can significantly expedite convergence by capturing critical features of the optimization trajectory while avoiding expensive evaluations of the objective and its gradient.

[LG-132] Stochastic variance reduced extrag radient methods for solving hierarchical variational inequalities

链接: https://arxiv.org/abs/2602.13510
作者: Pavel Dvurechensky,Andrea Ebner,Johannes Carl Schnebel,Shimrit Shtern,Mathias Staudigl
类目: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We are concerned with optimization in a broad sense through the lens of solving variational inequalities (VIs) – a class of problems that are so general that they cover as particular cases minimization of functions, saddle-point (minimax) problems, Nash equilibrium problems, and many others. The key challenges in our problem formulation are the two-level hierarchical structure and finite-sum representation of the smooth operators in each level. For this setting, we are the first to prove convergence rates and complexity statements for variance-reduced stochastic algorithms approaching the solution of hierarchical VIs in Euclidean and Bregman setups.

[LG-133] Graph neural networks uncover structure and functions underlying the activity of simulated neural assemblies

链接: https://arxiv.org/abs/2602.13325
作者: Cédric Allier,Larissa Heinrich,Magdalena Schneider,Stephan Saalfeld
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks trained to predict observable dynamics can be used to decompose the temporal activity of complex heterogeneous systems into simple, interpretable representations. Here we apply this framework to simulated neural assemblies with thousands of neurons and demonstrate that it can jointly reveal the connectivity matrix, the neuron types, the signaling functions, and in some cases hidden external stimuli. In contrast to existing machine learning approaches such as recurrent neural networks and transformers, which emphasize predictive accuracy but offer limited interpretability, our method provides both reliable forecasts of neural activity and interpretable decomposition of the mechanisms governing large neural assemblies.

附件下载

点击下载今日全部论文列表