This post lists the latest papers retrieved from Arxiv.org on 2026-05-07, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from Arxiv.org daily and updated automatically around 12:30 each morning.

Tip: If a given day's list is not updated on time, either Arxiv released no new papers that day or the update script failed. Failures are fixed the same day whenever possible.

Table of Contents

Overview (2026-05-07)

662 papers were updated today, including:

  • Natural Language Processing: 90 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 189 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 117 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 253 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 14 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 13 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 30 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems

[MA-0] Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

Quick Read: This paper addresses a computational bottleneck in measuring behavioral heterogeneity for Multi-Agent Reinforcement Learning (MARL): System Neural Diversity (SND) averages pairwise distances over all agent pairs, giving O(n^2) complexity that scales poorly to large teams. The key idea, Graph-SND, replaces the complete-graph average with a weighted average over the edges of an arbitrary graph G, enabling efficient computation: when G is sparse, the cost drops to O(|E|), and with randomly sampled edges the estimate is unbiased and concentrates at a probabilistic rate of Õ(D_max/√n). The method keeps SND's semantics unchanged while sharply reducing computational cost, supporting applications ranging from passive measurement to closed-loop diversity control.

Link: https://arxiv.org/abs/2605.05020
Authors: Shawn Ray
Institution: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 22 pages, 12 figures, 7 tables

Abstract: System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all \binom{n}{2} agent pairs, making each call quadratic in team size. We introduce Graph-SND, which replaces this complete-graph average with a weighted average over the edges of an arbitrary graph G. Three regimes follow: G = K_n recovers SND exactly; a fixed sparse G defines a localized diversity measure at O(|E|) cost; and random edge samples yield an unbiased Horvitz-Thompson estimator and a normalized sample mean with O(1/\sqrt{m}) concentration in the sampled edge count m. For fixed sparse graphs we prove forwarding-index distortion bounds for expanders and a spectral refinement under low-rank distance structure; for random d-regular graphs we prove an unconditional probabilistic \widetilde{O}(D_{\max}/\sqrt{n}) bound. On VMAS we verify recovery, unbiasedness, concentration, and wall-clock scaling, with a PettingZoo TVD panel checking non-Gaussian transfer. In a 500-iteration n=100 PPO run, Bernoulli-0.1 Graph-SND tracks full SND while reducing per-call metric time by about 10x, and frozen-policy GPU timing up to n=500 follows the predicted \binom{n}{2}/|E| speedup. Random d-regular expanders empirically achieve SND_G^u / SND in [0.9987, 1.0013] at \Theta(n \log n) edges. In DiCo diversity control at n=50, Bernoulli-0.1 Graph-SND preserves set-point tracking with paired reward differences indistinguishable from zero across nine matched cells while cutting per-call metric cost by ~9.5x. Together, these results show that the SND aggregation bottleneck can be removed without changing the metric's semantics, yielding a drop-in sparse alternative that scales beyond complete-graph SND and supports both passive measurement and closed-loop diversity control.
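The quadratic-versus-sampled aggregation at the heart of Graph-SND can be sketched in a few lines. The per-agent embeddings, the Euclidean behavioral distance, and the Bernoulli-0.1 sampling rate below are illustrative stand-ins; the paper defines SND over policy behavioral distances and proves the concentration rates quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 8
Z = rng.normal(size=(n, d))  # illustrative per-agent behavioral embeddings

def pair_dist(i, j):
    return float(np.linalg.norm(Z[i] - Z[j]))

# Full SND: average over all n-choose-2 pairs -- quadratic in team size.
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
snd_full = sum(pair_dist(i, j) for i, j in pairs) / len(pairs)

# Graph-SND with Bernoulli(0.1) edge sampling: the same quantity in
# expectation, at roughly a tenth of the per-call cost.
keep = rng.random(len(pairs)) < 0.1
sampled = [p for p, k in zip(pairs, keep) if k]
snd_sampled = sum(pair_dist(i, j) for i, j in sampled) / len(sampled)

print(round(snd_full, 3), round(snd_sampled, 3))
```

With the complete edge set this reduces to SND exactly; the sampled mean is the cheaper unbiased estimate whose error shrinks with the number of sampled edges.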

[MA-1] Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

Quick Read: This paper targets a pain point of current LLM-based multi-agent systems for scientific discovery: because agents coordinate through ephemeral text such as drafts or chat logs, it is hard to pinpoint weaknesses in the generated research ideas or to trace how agents iteratively refine them. The key idea is Evolving Idea Graphs (EIG), which represent a partially formed research proposal as a dynamically evolving graph whose nodes are scientific claims and whose edges encode relations such as support and conflict, so that unresolved weaknesses remain identifiable throughout the evolution. A two-head controller operates on the graph state: one head selects graph edits for agents to execute, while the other decides when the graph is ready to be committed as the final proposal, yielding substantial gains in novelty, feasibility, and clarity.

Link: https://arxiv.org/abs/2605.04922
Authors: Jiangwen Dong, Bo Li, Wanyu Lin
Institutions: The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract: LLM-empowered multi-agent systems offer new potential to accelerate scientific discovery by generating novel research ideas. However, existing methods typically coordinate agents through temporary texts, such as drafts or chat logs, making it difficult to pinpoint the weaknesses in the generated ideas or to trace how the agents refine them. To this end, we introduce Evolving Idea Graphs (EIG), a graph-based multi-agent scientific ideation framework that can generate high-performance research ideas across various benchmark-native metrics, such as novelty, feasibility, and clarity. Instead of coordinating solely through texts, EIG represents a partially formed proposal as an evolving idea graph, where nodes capture scientific claims and edges encode relations (e.g., support and conflict), enabling unresolved weaknesses to remain identifiable throughout the idea-evolving process. Specifically, a learned two-head controller operates over the evolving graph to guide the ideation: one head selects graph edits for agents to execute, while the other decides when the graph is ready to be committed for final proposal synthesis. On AI Idea Bench 2025 and LiveIdeaBench, EIG outperforms all compared systems on both automatic benchmark scores and blind expert ratings. Ablations further show that explicit graph state provides the main performance gains, and learned edit-and-commit control adds consistent improvements.
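A toy illustration of the evolving-idea-graph state described above, with hypothetical claim texts and a hand-written readiness rule standing in for the paper's learned two-head controller:

```python
# Toy evolving idea graph in the spirit of EIG: nodes are scientific claims,
# edges encode support/conflict relations, and a simple rule flags unresolved
# weaknesses. The claim texts and the commit-readiness rule are illustrative
# stand-ins, not the paper's learned controller.

claims = {
    "c1": "Method X improves retrieval recall.",
    "c2": "X's gains may vanish on long documents.",
    "c3": "A length-aware variant of X keeps the gains on long documents.",
}

def unresolved(edges):
    """Claims with an incoming conflict edge and no incoming support edge."""
    conflicted = {dst for _, dst, rel in edges if rel == "conflict"}
    supported = {dst for _, dst, rel in edges if rel == "support"}
    return conflicted - supported

edges = [("c2", "c1", "conflict")]       # a reviewer-agent raises a weakness
open_before = unresolved(edges)
print(open_before)                        # {'c1'}: the weakness is still open

edges.append(("c3", "c1", "support"))     # a refiner-agent answers it
ready_to_commit = not unresolved(edges)   # toy stand-in for the commit head
print(ready_to_commit)                    # True
```

The point of the graph state is exactly this inspectability: a weakness stays visible as a dangling conflict edge until some later edit addresses it.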

[MA-2] Tree-based Credit Assignment for Multi-Agent Memory System

Quick Read: This paper tackles the problem of credit assignment across agents in multi-agent memory systems: how to derive each agent's contribution (e.g., the memory building, summarizing, and retrieval agents) from the final reward without task-specific annotations, so that the agents can be optimized jointly. Existing RL-based methods either apply a coarse downstream task reward uniformly to all agents, yielding ambiguous signals, or design task-specific reward functions that require costly manual annotation and are hard to define reliably. The key idea, Tree-based Credit Assignment (TreeMem), extends the conventional linear multi-agent pipeline into a tree in which each agent's output spawns multiple subsequent branches; each agent's influence is estimated by Monte Carlo averaging over its subtree, decomposing the final reward into agent-level optimization signals that drive effective specialization and joint optimization among heterogeneous agents without extra annotation.

Link: https://arxiv.org/abs/2605.04811
Authors: Marina Mao, Alexandr Liu, Pengbo Li, Siheng Li, Bo Zhou, Xiang Wang
Institutions: University of Science and Technology of China; LLM Department, Tencent; The Hong Kong University of Science and Technology; The Chinese University of Hong Kong
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:Memory systems are widely adopted to enhance LLMs for long-horizon tasks, and are commonly organized as multi-agent pipelines with memory building, summarizing, and retrieval agents. To empower this system, existing RL-based methods either apply final downstream task rewards (e.g., QA accuracy) for all agents uniformly, which are coarse and ambiguous, or design task-specific rewards for agents on different subtasks, which require costly annotations (e.g., key evidence) and are difficult to define reliably. To address these limitations, we propose Tree-based Credit Assignment for Multi-Agent Memory Systems (TreeMem), which derives agent-specific credit from the final reward without task-specific annotations. Specifically, TreeMem extends the multi-agent pipeline (builder–summarizer–retrieval) into a tree structure, where each agent’s outputs are expanded into multiple subsequent branches. The contribution of each agent is estimated via Monte Carlo averaging over its subsequent branches, capturing how intermediate agent actions may influence the final reward. This converts the coarse final reward into agent-specific optimization signals. These signals are then used to update all agent policies simultaneously, helping heterogeneous agents specialize effectively. Experiments on long-horizon benchmarks show that TreeMem improves memory system performance over strong baselines, validating the effectiveness of tree-structured credit assignment for the multi-agent memory system.
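The subtree-averaging idea can be sketched with a hypothetical two-level tree: builder actions are expanded into summarizer branches with final QA rewards at the leaves. The tree shape and rewards below are made up for illustration; TreeMem's actual pipeline and reward come from the memory-system benchmarks.

```python
from statistics import mean

# Toy tree-based credit assignment: expand a builder -> summarizer pipeline
# into branches and score each intermediate action by the Monte Carlo average
# of final rewards over its subtree. Rewards here are invented QA outcomes.

tree = {
    "builder_a": {"summ_a1": [1.0, 0.0], "summ_a2": [1.0, 1.0]},
    "builder_b": {"summ_b1": [0.0, 0.0], "summ_b2": [1.0, 0.0]},
}

def credit(tree):
    """Agent-level credit = mean final reward over all leaves in its subtree."""
    scores = {}
    for builder, branches in tree.items():
        leaf_rewards = [r for rewards in branches.values() for r in rewards]
        scores[builder] = mean(leaf_rewards)
        for summ, rewards in branches.items():
            scores[summ] = mean(rewards)
    return scores

scores = credit(tree)
print(scores["builder_a"], scores["builder_b"])  # 0.75 0.25
```

The coarse final reward thus becomes a per-agent signal: builder_a receives more credit than builder_b because its subtree's branches succeed more often.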

[MA-3] Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

Quick Read: This paper addresses the combinatorial explosion and reasoning errors that arise when Autonomous Earth Observation (AEO) agents plan and execute complex multi-step tasks within a single integrated planning-and-execution model. The key contribution is the Lightweight Multimodal Meta-Planner (LMMP) framework, whose core innovations are: a dual-awareness mechanism that grounds strategic planning in both multimodal image features and high-level task semantics; a Meta Task Library that injects remote-sensing expert knowledge directly into the workflow, standardizing domain logic and ensuring physically feasible plans; and a two-stage training pipeline that first initializes the meta-planner via expert-distilled supervised fine-tuning and then refines it with Direct Preference Optimization based on execution feedback, substantially improving tool-calling accuracy and task success rates with strong plug-and-play generalization.

Link: https://arxiv.org/abs/2605.04777
Authors: Jinghui Xu, Boyi Shangguan, Mengke Zhu, Hao Liu, Junhuan Jiang, Guangjun He, Pengming Feng, Shichao Jin, Bin Liang, Yongzhe Chang, Junbo Tan, Tiantian Zhang, Xueqian Wang
Institution: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract: Autonomous Earth Observation (EO) agents are transitioning from passive perception to complex, multi-step task execution. However, current architectures that integrate planning and execution within a single model often struggle with combinatorial complexity and reasoning errors in dynamic EO scenarios. To resolve these challenges, we propose the Lightweight Multimodal Meta-Planner (LMMP) framework. LMMP incorporates a dual-awareness mechanism that grounds strategic plans in both multimodal image features and high-level task semantics. Crucially, we introduce a Meta Task Library to inject remote sensing expert knowledge directly into the workflow, which standardizes domain logic and ensures plans are physically feasible. We further implement a two-stage training pipeline, initializing the Meta-Planner via expert-distilled Supervised Fine-Tuning and refining it through Direct Preference Optimization based on execution feedback. Extensive experiments on a dataset derived from EarthBench and ThinkGeo demonstrate that LMMP significantly improves tool-calling accuracy and task success rates. Moreover, the framework exhibits strong "plug-and-play" versatility, consistently enhancing the performance of diverse executor backbones across previously unseen EO missions.

[MA-4] Hierarchical Multiagent Reinforcement Learning for Multi-Group Tax Game

Quick Read: This paper studies strategic interaction in tax policy design under competition among multiple governments. Traditional RL models typically consider only a single government and its households, ignoring the complex game structure in which several governments, each managing its own jurisdiction, influence one another. To capture this multi-group, multi-level economic system, the authors model taxation as a hierarchical multi-group game: within each group, the government and its households form a leader-follower game; across groups, governments play a competitive game. The key contribution is a bi-level training framework built on multi-agent reinforcement learning that combines Curriculum Learning with a Closed-Loop Sequential Update strategy to stabilize training and promote convergence, learning sustainable and equitable tax policies that markedly improve economic performance under multi-government interaction.

Link: https://arxiv.org/abs/2605.04741
Authors: Honglei Guo, Yuhan Zhao, Yexin Li
Institutions: Zhejiang University; State Key Laboratory of General Artificial Intelligence, BIGAI
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract: Reinforcement learning has increasingly been used to study economic decision-making, such as taxation, public spending, and labour supply. However, most existing RL-based economic models focus on a single government–household group, thereby overlooking the strategic interactions that arise when multiple governments compete while managing their own populations. In practice, many economic systems (e.g., taxation) exhibit a multi-group structure, where each government must optimize its fiscal policy in response not only to household behaviour within its jurisdiction, but also to the policies of other competing governments. To capture this structure, we formulate taxation as a hierarchical multi-group game. Within each group, the interaction between the government and households is modelled as a leader–follower game; across groups, governments are modelled as players in a competitive game. This results in a hybrid hierarchical game that is difficult to solve using standard multi-agent reinforcement learning algorithms. We therefore propose a bi-level training framework built on multi-agent reinforcement learning, together with Curriculum Learning and a Closed-Loop Sequential Update strategy, to stabilize training and promote convergence. We instantiate this framework in a taxation game simulation environment grounded in classical economic models. The environment supports the evaluation of different taxation algorithms and provides multiple economic indicators for assessing policy performance. Experiments show that our approach can learn stable tax policies that benefit all participating groups. Compared with a two-group baseline without the proposed update mechanisms, our method avoids premature game collapse, extends the effective game duration by 60.92%, produces more sustainable and robust tax policies, and reduces GDP disparities among governments by 44.12%.
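The within-group leader-follower structure can be illustrated with a closed-form toy: a household best-responds to a tax rate, and the government optimizes anticipating that response. The quadratic labor disutility and revenue objective below are textbook assumptions for illustration, not the paper's simulation environment.

```python
# Toy Stackelberg (leader-follower) tax game mirroring the within-group
# structure described above. All functional forms are illustrative.

def household_best_response(tau):
    """Follower picks labor l to maximize (1 - tau) * l - 0.5 * l**2,
    giving the closed-form best response l* = 1 - tau."""
    return 1.0 - tau

def government_revenue(tau):
    """Leader's payoff: tax revenue tau * l*(tau)."""
    return tau * household_best_response(tau)

def solve_leader(grid_size=1001):
    """Grid-search the leader's problem, anticipating the follower."""
    best_tau, best_rev = 0.0, 0.0
    for i in range(grid_size):
        tau = i / (grid_size - 1)
        rev = government_revenue(tau)
        if rev > best_rev:
            best_tau, best_rev = tau, rev
    return best_tau, best_rev

tau_star, rev_star = solve_leader()
print(tau_star, rev_star)  # analytic optimum: tau* = 0.5, revenue 0.25
```

The paper's multi-group game nests many such leader-follower problems inside a cross-government competition, which is why it requires a bi-level MARL framework rather than a closed-form solution.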

[MA-5] SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Quick Read: This paper argues that current evaluations of "vibe coding" platforms rely too heavily on code-level benchmarks and lack a systematic assessment of AI as a virtual software development agency across the full lifecycle: understanding business requirements, making architectural decisions, writing production-grade code, handling iterative modifications, and maintaining operational readiness. The key contribution is SWE-WebDev Bench, a comprehensive 68-metric evaluation framework (25 primary and 43 diagnostic metrics) organized along three dimensions: Interaction Mode (App Creation Request vs. App Modification Request), Agency Angle (Product Manager, Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native), enabling multi-dimensional quantitative analysis of AI app-generation platforms from requirement understanding to production readiness.

Link: https://arxiv.org/abs/2605.04637
Authors: Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi
Institutions: BaseThesis Labs; QwikBuild
Categories: Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 35 pages, 12 figures, 18 tables

Abstract:The emergence of “vibe coding” platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. In order to assess them as virtual software development agencies on understanding business requirements, making architectural decisions, writing production code, handling iterative modifications, and maintaining business readiness, we introduce SWE-WebDev Bench, a 68-metric evaluation framework spanning 25 primary and 43 diagnostic metrics across seven groups, organized along three dimensions: Interaction Mode (App Creation Request (ACR) vs. App Modification Request (AMR)), Agency Angle (Product Manager (PM), Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native). Our evaluation (six platforms, three domains, 18 evaluation cells) reveals four recurring shortcomings in the current generation of AI app builders: (1) A specification bottleneck, where platforms compress rich business requirements into oversimplified technical plans, (2) A pervasive frontend-backend decoupling, where visually polished UIs mask absent or broken backend infrastructure, (3) A steep production-readiness cliff, where no platform scores above 60% on engineering quality and post-generation human effort varies substantially across platforms and (4) Widespread security and infrastructure failures, with no platform exceeding 65% Security Score against a 90% target and concurrency handling as low as 6%. These observations are descriptive of our sample and require larger-scale replication to establish generality. We release SWE-WebDev Bench as a community benchmark to enable such replication and help platform builders identify and address these gaps. Code and benchmark resources are available at: this https URL and this https URL. 

[MA-6] Autonomous Synchronization of Discrete-Time Heterogeneous Multiagent Systems

Quick Read: This paper addresses the autonomous synchronization problem for discrete-time heterogeneous multiagent systems. Its core idea is to transform the synchronization problem into an asymptotic decoupling problem for the stable modes of a class of discrete-time linear time-varying systems, and to give a sufficient condition guaranteeing synchronization. Crucially, the synchronization condition depends only on the average of the agents' initial dynamic matrices, without assuming the differences among these matrices are small, which significantly reduces the conservativeness of existing conditions and unifies the treatment of homogeneous and heterogeneous systems.

Link: https://arxiv.org/abs/2605.04627
Authors: Wei Hu, Quanyi Liang
Institution: Beihang University
Categories: Multiagent Systems (cs.MA)
Comments: 9 pages, 7 figures, submitted to IEEE Transactions on Control of Network Systems

Abstract: This paper investigates the autonomous synchronization problem for discrete-time heterogeneous multiagent systems. The synchronization problem is transformed into the asymptotic decoupling problem of stable modes in a class of discrete-time linear time-varying systems, for which we provide a sufficient condition. Leveraging this condition, synchronization conditions are established. The synchronization conditions are based on the average of the agents' initial dynamic matrices, without requiring the differences among these matrices to be small. This approach reduces the conservativeness of existing conditions and achieves a unification of both homogeneous and heterogeneous systems. Numerical simulation results are provided to support the theoretical findings.

[MA-7] YOTOnet: Zero-Shot Cross-Domain Fault Diagnosis via Domain-Conditioned Mixture of Experts

Quick Read: This paper addresses the poor generalization of deep learning models for mechanical fault diagnosis under domain shift, especially across different equipment and operating conditions. The key innovations of the proposed YOTOnet architecture are: (1) a physics-aware Invariant Feature Distiller that extracts domain-agnostic representations via multi-scale dilated convolutions and FFT-based time-frequency fusion; (2) Domain-Conditioned Sparse Experts (DC-MoE) that adaptively route inputs to specialized processing units through learned gating, without external metadata; and (3) a dual-head classification structure with an auxiliary verification mechanism for robustness. Experiments on five public bearing datasets validate its superior performance and reveal a scaling behavior supporting "train-once" deployment, offering a new paradigm for general-purpose industrial fault diagnosis.

Link: https://arxiv.org/abs/2605.04528
Authors: Zesen Wang, Zihao Wu, Yue Hu, Yang Gao, Fuzhen Xuan
Institutions: East China University of Science and Technology; University of Pennsylvania
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract: Mechanical equipment forms the critical backbone of modern industrial production, yet domain shift severely limits the generalization of deep learning based fault diagnosis models across different equipment and operating conditions. Motivated by the success of foundation models in achieving zero-shot generalization, we propose YOTOnet (You Only Train Once), a novel architecture specifically designed for cross-domain fault diagnosis in mechanical equipment. YOTOnet comprises three core components: (1) a physics-aware Invariant Feature Distiller that extracts domain-agnostic representations using multi-scale dilated convolutions and FFT-based time-frequency fusion, (2) Domain-Conditioned Sparse Experts (DC-MoE) that adaptively route inputs to specialized processors via learned gating without external meta-data, and (3) a dual-head classification system with auxiliary verification. Extensive validation on five public bearing datasets (CWRU, MFPT, XJTU, OTTAWA, HUST) through 30 cross-dataset protocols demonstrates the superiority of YOTOnet compared with other state-of-the-art methods. Critically, we observe a clear scaling effect: average test F1 improves from 0.5339 (1 training dataset) to 0.705 (4 datasets), with a clear gain when moving from 3 to 4 datasets. These findings provide empirical evidence that foundation model principles can enable robust, train-once deployment for industrial fault diagnosis.
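A minimal sketch of the domain-conditioned sparse routing idea, with randomly initialized gates and linear "experts" standing in for YOTOnet's learned components. The dimensions and top-k value here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy sparse MoE: route each input to its top-k experts via a gate computed
# from the input itself -- no external metadata, as in DC-MoE. Gate weights
# and experts below are random placeholders for learned parameters.
n_experts, d, k = 4, 8, 2
W_gate = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def dc_moe_forward(x):
    """Keep the top-k gate scores, renormalize, and mix expert outputs."""
    logits = x @ W_gate
    topk = np.argsort(logits)[-k:]            # indices of the k largest gates
    gates = softmax(logits[topk])             # renormalize over selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, topk))

x = rng.normal(size=d)
y = dc_moe_forward(x)
print(y.shape)  # (8,)
```

Because only k of the experts run per input, the router adds specialization without a proportional increase in compute; the gate itself learns which "domain" an input resembles.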

[MA-8] DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration

Quick Read: This paper asks how humans and autonomous machines can be jointly governed in physical-digital systems while keeping such systems scalable and resilient and preserving human autonomy in decentralized settings. The key contribution is DAO-enabled decentralized physical AI (DePAI), a democratic architecture that integrates blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomic foundations to couple human decision-making (deliberation and voting) with machine execution workflows, yielding a community-owned, self-organizing techno-socio-economic system with transparent rules, on-chain incentives, and permissionless participation.

Link: https://arxiv.org/abs/2605.04522
Authors: Mark C. Ballandies, Florian Spychiger, Uwe Serdült, Claudio J. Tessone
Institutions: University of Zurich; Zurich University of Applied Sciences; Ritsumeikan University; Center for Democracy Studies Aarau (ZDA)
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
Comments:

Abstract:We propose DAO-enabled decentralized physical AI (DePAI), a democratic architecture for coordinating humans and autonomous machines in the operation and governance of physical-digital systems. We (1) synthesize foundations in blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomics; (2) connect DAO design with digital-democracy research on deliberation and voting, showing how each can advance the other; (3) position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots; (4) show how these elements specify workflows that couple machine execution with human oversight, enabling enhanced self-organization of techno-socio-economic systems, which we call DePAI; and (5) analyze risks, including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, and argue for value-sensitive design and continuously adaptive governance. DePAI offers a path to scalable, resilient self-organization that integrates physical infrastructure, AI, and community ownership under transparent rules, on-chain incentives, and permissionless participation, aiming to preserve human autonomy.

[MA-9] Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

Quick Read: This paper addresses the saturation and contamination problems of static capability benchmarks, which make it hard to track model capability progress over time. The key idea is Agent Island, a multiplayer interactive simulation in which language-model agents compete in a game of cooperation, conflict, and persuasion, forming a dynamic benchmark. Its winner-take-all design ensures that new models can always outperform the current leader, and because agents adapt to one another rather than face a fixed task set, both saturation and contamination are mitigated. A Bayesian Plackett-Luce model ranks agent skill while quantifying uncertainty, improving the rigor and interpretability of the evaluation.

Link: https://arxiv.org/abs/2605.04312
Authors: Connacher Murphy
Institution: Stanford University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 15 pages, 3 figures, 3 tables

Abstract:Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model, openai/gpt-5.3-codex. We release the game logs as a dataset for analyses of model behavior. As an example, we investigate same-provider preference in final-round votes and find that models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. This preference is not uniform across providers: among separately estimated providers, the effect is strongest for OpenAI models and weakest for Anthropic models.
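The ranking model is standard Plackett-Luce: a full ranking's probability is a product of softmax-style choices over the players still in contention. The sketch below plugs in the posterior mean skills quoted in the abstract for the top three models (labeled A, B, C here); the Bayesian inference over these parameters is not shown.

```python
from itertools import permutations

# Plackett-Luce: P(ranking) = prod over positions of
# skill(chosen player) / sum of skills of remaining players.

def plackett_luce_prob(ranking, skills):
    prob = 1.0
    remaining = list(ranking)
    for player in ranking:
        total = sum(skills[p] for p in remaining)
        prob *= skills[player] / total
        remaining.remove(player)
    return prob

# Posterior mean skills reported in the abstract for the top three models.
skills = {"A": 5.64, "B": 3.10, "C": 2.86}
p_abc = plackett_luce_prob(["A", "B", "C"], skills)
p_cba = plackett_luce_prob(["C", "B", "A"], skills)
print(p_abc > p_cba)  # True: the stronger player winning is more likely
```

Summing over all permutations gives exactly 1, which is what makes the model a proper distribution over rankings and amenable to Bayesian treatment.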

[MA-10] Governed Collaborative Memory as Artificial Selection in LLM -Based Multi-Agent Systems

Quick Read: The problem addressed is which candidate memories should become shared institutional state in LLM-based multi-agent systems once memory becomes durable, reloadable, and behavior-shaping across agents, sessions, or versions, a question that goes beyond retrieval accuracy or access control and is fundamentally about memory selection. The key contribution is the framework of governed collaborative memory, which treats memory governance as a selection regime and distinguishes four modes: ungoverned persistence, constitutional or hybrid selection, automatic metric-based selection, and human-ratified artificial selection, emphasizing that these are design choices over target properties rather than a ranking. The paper further proposes a layered architecture separating agent-local memory, shared institutional memory, archive memory, and project-continuity memory, with provenance and version lineage making selection auditable, enabling memory to be evaluated for traceability, fidelity, correction pathways, and role preservation.

Link: https://arxiv.org/abs/2605.04264
Authors: Diego F. Cuadros, Abdoul-Aziz Maiga, Helen Meskhidze, Andre Curtis-Trudel
Institution: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:Persistent memory is turning language-model-based agents from stateless participants in isolated interactions into state-bearing components of LLM-based multi-agent systems. As memory becomes durable, reloadable, and behavior-shaping across agents, sessions, or versions, a design question arises that is not captured by retrieval accuracy or access control alone: which candidate memories should become shared institutional state? This Viewpoint frames that problem as governed collaborative memory. We argue that memory governance functions as a selection regime, determining which memory variants persist, which remain private, and which are rejected, abstained from, or superseded. We distinguish ungoverned persistence, constitutional or hybrid selection, automatic metric-based selection, and human-ratified artificial selection, emphasizing that these regimes are not a ranking but a design choice over target properties. We then describe a layered architecture that separates agent-local memory, shared institutional memory, archive memory, and project-continuity memory, with provenance and version lineage making selection inspectable. Documented traces from one running LLM-based multi-agent ecosystem illustrate unmanaged false-memory persistence, ratified institutional memory, rejection and revision, identity-preserving expansion, and governance-as-learning. The contribution is a design agenda: persistent LLM-based multi-agent systems should evaluate memory not only for recall and performance, but also for provenance fidelity, selection traceability, epistemic quality, correction pathways, and role preservation.

[MA-11] ARMATA: Auto-Regressive Multi-Agent Task Assignment

Quick Read: This paper addresses the complex hierarchical optimization problem of coordinating multi-agent systems over spatially distributed areas: jointly optimizing area allocation among agents and route planning within each allocation. Traditional methods either decouple the two stages, ignoring their interdependence, or rely on decentralized heuristics that lack global context, yielding suboptimal solutions. The key innovation is a centralized, end-to-end auto-regressive framework with a multi-stage decoding mechanism that unifies high-level allocation decisions and low-level route-sequence generation in a single autoregressive pass while maintaining a centralized global state, implicitly balancing workload distribution against routing efficiency, avoiding local optima, and markedly improving both solution quality and computational efficiency.

Link: https://arxiv.org/abs/2605.04225
Authors: Yazan Youssef, Aboelmagd Noureldin, Sidney Givigi
Institutions: Queen's University; Royal Military College of Canada
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Coordinating multi-agent systems over spatially distributed areas requires solving a complex hierarchical problem: first distributing areas among agents (allocation) and subsequently determining the optimal visitation order (routing). Existing methods typically decouple these stages ignoring inter-stage dependencies or rely on decentralized heuristics that lack global context. In this work, we propose a centralized, fully end-to-end auto-regressive framework that jointly generates allocation decisions and routing sequences. The core contribution of our approach is a multi-stage decoding mechanism that unifies high-level allocation and low-level routing in a single autoregressive pass while maintaining a centralized global state. This enables the model to implicitly balance workload distribution with routing efficiency, avoiding local optima common in decentralized methods. Extensive experiments demonstrate that our method significantly outperforms diverse baselines, achieving up to a 20% improvement in solution quality over industrial solvers such as Google OR-Tools, IBM CPLEX, and LKH-3, while reducing computation time from hours to seconds.
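A greedy toy version of joint allocation-and-routing decoding: at each step a centralized decoder picks the cheapest (agent, node) pair, so the allocation and the visiting order emerge together from one sequential pass. This heuristic only illustrates the interleaved decision structure, not the paper's learned autoregressive model; the depot and node coordinates are invented.

```python
import math

def joint_decode(depots, nodes):
    """Interleave allocation and routing: repeatedly assign the unvisited
    node with the smallest incremental travel cost to some agent."""
    routes = {a: [] for a in depots}   # agent -> ordered node list
    pos = dict(depots)                 # agent -> current position
    unvisited = set(nodes)
    while unvisited:
        agent, node = min(
            ((a, n) for a in routes for n in unvisited),
            key=lambda an: math.dist(pos[an[0]], nodes[an[1]]),
        )
        routes[agent].append(node)
        pos[agent] = nodes[node]
        unvisited.remove(node)
    return routes

depots = {"agent0": (0.0, 0.0), "agent1": (10.0, 10.0)}
nodes = {"n1": (1.0, 0.0), "n2": (9.0, 10.0), "n3": (0.0, 2.0), "n4": (10.0, 8.0)}
routes = joint_decode(depots, nodes)
print(routes)
```

Because every decision sees the global state (all agents' current positions), the decoder can trade allocation balance against route length, which is the property the paper's learned decoder exploits.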

[MA-12] FlowEval: Reference-based Evaluation of Generated User Interfaces

Quick Read: This paper addresses how to evaluate the visual and interaction design competence of generative UI systems both reliably and scalably. Existing approaches either rely on careful human expert evaluation (accurate but slow and costly) or on automated judges (scalable but less accurate and opaque). The key idea of FlowEval is a reference-based similarity measure (e.g., dynamic time warping, DTW) that compares navigation traces from real websites with traces from generated UIs, quantifying whether a generated interface supports realistic interaction flows. Experiments show that the method captures human expert judgments of interaction flows, striking a balance between scalability and trustworthiness.

Link: https://arxiv.org/abs/2605.04165
Authors: Jason Wu, Priyan Vaithilingam, Eldon Schoop, Jeffrey Nichols, Titus Barik
Institutions: Purdue University; Apple
Categories: Multiagent Systems (cs.MA); Human-Computer Interaction (cs.HC)
Comments:

Abstract:While large language models (LLMs) and coding agents are often applied to user interface (UI) development, developers find it difficult to reliably assess their proficiency in visual and interaction design. Existing evaluations either rely on human experts, who can accurately assess usability by testing critical flows but are slow and costly, or on automated judges, which are scalable but less accurate and opaque. We present FlowEval, a reference-based framework that measures whether a generated UI supports realistic interaction flows by comparing navigation traces from real websites to traces from generated analogs using reference-based similarity metrics (e.g., dynamic time warping). In a small-scale study with expert UI evaluators, we show that reference-based metrics strongly correlate with human judgments, suggesting that they can provide scalable yet trustworthy evaluation for UI generation systems.
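The reference-based comparison reduces to a sequence-similarity computation. Below is a textbook dynamic-time-warping sketch over toy 2-D "screen state" traces; FlowEval's actual trace representation and metric suite are richer than this.

```python
import math

def dtw(a, b):
    """Classic DTW distance between two point sequences via the standard
    dynamic-programming recurrence (match, insert, delete)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

reference = [(0, 0), (1, 0), (2, 1)]    # trace from the real website
generated = [(0, 0), (1, 0.5), (2, 1)]  # generated UI deviates mid-flow
print(dtw(reference, reference))  # 0.0 for identical traces
print(dtw(reference, generated))  # 0.5: only the middle step deviates
```

Lower DTW means the generated UI supports the same flow with similar steps; DTW's warping absorbs benign timing and repetition differences while still penalizing missing or divergent states.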

[MA-13] System-of-systems Modeling and Optimization: An Integrated Framework for Intermodal Mobility

Quick Read: This paper addresses the difficulty of exploring novel architectures in complex system design, especially for system-of-systems (SoS) scenarios, where traditional dedicated approaches such as physics-based simulation incur high computational cost, expensive evaluations, and frequent optimizer failures. The key to the solution is surrogate-based optimization, in particular Bayesian optimization with Gaussian process (GP) surrogate models, which substantially reduces evaluation overhead and improves the stability and efficiency of the optimization.

Link: https://arxiv.org/abs/2507.08715
Authors: Paul Saves, Jasper Bussemaker, Rémi Lafage, Thierry Lefebvre, Nathalie Bartoli, Youssef Diouane, Joseph Morlier
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract: For developing innovative systems architectures, modeling and optimization techniques have been central to framing the architecting process and defining the optimization and modeling problems. In this context, for system-of-systems the use of efficient dedicated approaches (often physics-based simulations) is highly recommended to reduce the computational complexity of the targeted applications. However, exploring novel architectures using such dedicated approaches might pose challenges for optimization algorithms, including increased evaluation costs and potential failures. To address these challenges, surrogate-based optimization algorithms, such as Bayesian optimization utilizing Gaussian process models, have emerged.
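The surrogate-based loop the abstract points to can be sketched generically: fit a Gaussian-process surrogate to the evaluations made so far, maximize an acquisition function over candidates, and evaluate the chosen point. Everything below (the 1-D toy objective, RBF length scale, UCB coefficient) is a textbook stand-in, not the authors' framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # toy stand-in for an expensive simulation
    return np.sin(3 * x) + 0.5 * x

def rbf(a, b, ls=0.3):          # squared-exponential kernel, unit variance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean/variance at candidates Xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # k(x,x)=1
    return mu, np.maximum(var, 1e-12)

candidates = np.linspace(0.0, 2.0, 201)
X = np.array([0.1, 1.9])        # initial design points
y = f(X)
for _ in range(8):              # Bayesian-optimization iterations
    mu, var = gp_posterior(X, y, candidates)
    ucb = mu + 2.0 * np.sqrt(var)          # acquisition: explore + exploit
    x_next = candidates[np.argmax(ucb)]
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

print(round(float(X[np.argmax(y)]), 2))  # near the maximizer of f on [0, 2]
```

The appeal for SoS design is that each loop iteration spends one expensive evaluation where the surrogate is most promising or most uncertain, rather than sweeping the design space exhaustively.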

自然语言处理

[NLP-0] Implicit Representations of Grammaticality in Language Models

【速读】: 该论文试图解决的问题是:预训练语言模型(Language Models, LMs)虽然在生成语法正确的文本和区分最小对比对中的语法正确性方面表现良好,但其字符串概率(string probability)并不能有效区分整体上的语法与非语法句子。研究进一步探讨了语言模型是否在其内部表示中隐式地习得了独立于字符串概率的语法正确性判别能力。

解决方案的关键在于:通过在自然语料库上施加扰动构造出语法正确与(合成)不正确的句子数据集,并在此基础上训练一个线性探测器(linear probe),以评估语言模型隐藏层表示中是否蕴含语法正确性的信息。实验表明,该探测器不仅能在人类标注的语法判断基准上泛化并优于基于字符串概率的判断方法,还展现出跨语言的非平凡泛化能力,且其得分与字符串概率相关性较弱,从而证明语言模型确实在其内部表征中隐式地学习到了语法正确性这一独立于概率分布的结构特征。

链接: https://arxiv.org/abs/2605.05197
作者: Yingshan Susan Wang,Linlu Qiu,Zhaofeng Wu,Roger P. Levy,Yoon Kim
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Grammaticality and likelihood are distinct notions in human language. Pretrained language models (LMs), which are probabilistic models of language fitted to maximize corpus likelihood, generate grammatically well-formed text and discriminate well between grammatical and ungrammatical sentences in tightly controlled minimal pairs. However, their string probabilities do not sharply discriminate between grammatical and ungrammatical sentences overall. But do LMs implicitly acquire a grammaticality distinction distinct from string probability? We explore this question through studying internal representations of LMs, by training a linear probe on a dataset of grammatical and (synthetic) ungrammatical sentences obtained by applying perturbations to a naturalistic text corpus. We find that this simple grammaticality probe generalizes to human-curated grammaticality judgment benchmarks and outperforms LM probability-based grammaticality judgments. When applied to semantic plausibility benchmarks, in which both members of a minimal pair are grammatical and differ in only plausibility, the probe however performs worse than string probability. The English-trained probe also exhibits nontrivial cross-lingual generalization, outperforming string probabilities on grammaticality benchmarks in numerous other languages. Additionally, probe scores correlate only weakly with string probabilities. These results collectively suggest that LMs acquire to some extent an implicit grammaticality distinction within their hidden layers.
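线性探测器本质上就是在隐藏表示上训练一个 logistic 回归分类器。以下用合成的二维"隐藏表示"做一个极简示意(数据分布纯属假设,实际探测器作用于语言模型某一隐藏层的真实激活):

```python
import math
import random

random.seed(0)

def make_data(n):
    """合成"隐藏表示":语法句向量围绕 +1 分布,病句围绕 -1 分布(纯属示意)。"""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        mu = 1.0 if label == 1 else -1.0
        data.append(([random.gauss(mu, 0.5), random.gauss(mu, 0.5)], label))
    return data

def train_probe(data, lr=0.5, epochs=100):
    """随机梯度下降训练一个线性(logistic)探测器。"""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            z = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # 对数损失对 z 的梯度
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

w, b = train_probe(make_data(200))
held_out = make_data(100)
acc = sum(
    (w[0] * x[0] + w[1] * x[1] + b > 0) == (y == 1) for x, y in held_out
) / len(held_out)
```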

[NLP-1] The First Token Knows: Single-Decode Confidence for Hallucination Detection

【速读】: 该论文旨在解决生成式 AI(Generative AI)在回答事实性问题时存在的幻觉(hallucination)检测难题,即如何高效且准确地识别模型输出中的错误信息。传统方法如自洽性(self-consistency)依赖多次采样并比较答案表面形式的一致性,虽有效但计算成本高且对词汇变化敏感;而语义自洽性(semantic self-consistency)通过自然语言推理聚类语义相近的答案提升了鲁棒性,但仍需额外采样和外部推理开销。本文提出的关键解决方案是使用首个内容token的置信度指标 phi_first——基于单次贪婪解码中前K个logits的归一化熵计算得出,该指标能有效捕捉模型初始决策的不确定性。实验表明,phi_first在多个7-8B参数规模指令微调模型上表现优于或接近语义自洽性(平均AUROC达0.820),且与语义一致性显著相关,说明多数不确定性信息已蕴含于首次token分布中。因此,phi_first可作为低开销基准指标,在无需复杂采样的情况下提供可靠幻觉检测能力。

链接: https://arxiv.org/abs/2605.05166
作者: Mina Gabriel
机构: Temple University (坦普尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure

点击查看摘要

Abstract:Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model’s initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
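phi_first 的计算只需一次贪婪解码:取首个内容 token 处的 top-K logits,做 softmax 后计算归一化熵。下面是一个示意实现(将置信度记为 1 − H/H_max,这一符号约定与 K 的取值均为本文示例中的假设):

```python
import math

def phi_first(logits, k=5):
    """首 token 置信度:对 top-K logits 做 softmax,返回 1 - H/H_max。"""
    top = sorted(logits, reverse=True)[:k]
    m = max(top)
    exps = [math.exp(v - m) for v in top]   # 数值稳定的 softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(len(probs))   # 用 log K 归一化熵

confident = phi_first([10.0, 1.0, 0.5, 0.2, 0.1, -1.0])  # 概率高度集中
uncertain = phi_first([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])    # 接近均匀分布
```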

[NLP-2] PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

【速读】: 该论文旨在解决多语言极化检测(Multilingual Polarization Detection)问题,即在22种不同语言中对文本进行二分类,识别其是否具有极化倾向。解决方案的关键在于:针对每种语言分别微调Gemma-3模型(12B和27B参数规模),采用低秩适应(Low-Rank Adaptation, LoRA)技术以降低计算成本并提升效率;同时引入由大语言模型(LLM)生成的合成数据(包括直接生成、改写和对比对创建三种策略),并通过嵌入式去重与多阶段质量过滤确保数据质量;此外,通过开发集上的语言特定阈值调整和加权集成(结合12B与27B模型预测结果及按语言选择最优策略)进一步优化性能。最终系统在所有语言上平均宏F1达到0.811,位列第二,且在3种语言中获得第一名,验证了该方法的有效性与泛化能力。

链接: https://arxiv.org/abs/2605.05159
作者: Srikar Kashyap Pulipaka
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Adaptation (LoRA), augmented with synthetic data generated by a large language model (LLM). We employ three synthetic data strategies (direct generation, paraphrasing, and contrastive pair creation) using GPT-4o-mini, with a multi-stage quality filtering pipeline including embedding-based deduplication. We find that per-language threshold tuning on the development set yields 2 to 4% F1 improvements without retraining. We also use weighted ensembles of 12B and 27B model predictions with per-language strategy selection. Our final system achieves a mean macro-F1 of 0.811 across all 22 languages, ranking 2nd overall of the participating teams, with 1st place finishes in 3 languages and top-3 in 8 languages. We also find that alternative architectures (XLM-RoBERTa, Qwen3) that showed strong development set performance suffered 30 to 50% F1 drops on the test set, highlighting the importance of generalization.
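按语言在开发集上调阈值的做法可以写成一个简单的网格搜索:在开发集分数上遍历候选阈值,取使 F1 最大者。以下为示意实现(分数与标签均为虚构数据):

```python
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def tune_threshold(y_true, scores, grid=None):
    """在开发集上网格搜索使 F1 最大的判定阈值(示意实现)。"""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# 示例:正类分数整体偏低,默认 0.5 阈值会漏掉不少正例
y_dev = [1, 1, 1, 0, 0, 0]
s_dev = [0.45, 0.40, 0.70, 0.30, 0.20, 0.10]
t, f1 = tune_threshold(y_dev, s_dev)
```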

[NLP-3] Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction

【速读】: 该论文旨在解决当前基于文本的自动化心理健康预测模型在高风险现实场景中部署时面临的两大核心问题:一是现有方法主要依赖语义表示,在数据模糊、噪声或分布偏移情况下容易产生过度自信的预测;二是大多数方法缺乏可靠的不确定性估计,从而削弱了在风险敏感型心理健康应用中的可信度。解决方案的关键在于将任务建模为多视角学习问题,融合编码器-only 模型提供的语义信息与解码器-only 模型提供的高层推理信息,并采用基于主观逻辑(Subjective Logic)的证据学习框架显式建模不确定性,同时引入一种证据融合策略,在平衡互补视角的同时对不可靠证据进行折扣处理,从而实现可信赖的预测与解释性增强。

链接: https://arxiv.org/abs/2605.05121
作者: Yucheng Ruan,Ling Huang,Qika Lin,Kai He,Mengling Feng
机构: Saw Swee Hock School of Public Health, National University of Singapore (新加坡国立大学公共卫生学院); Clinical Science, Imperial College London (帝国理工学院临床科学系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated mental health prediction using textual data has shown promising results with deep learning and large language models. However, deploying these models in high-stakes real-world settings remains challenging, as existing approaches largely rely on semantic representations and often produce overconfident predictions under ambiguous, noisy, or shifted data. Moreover, most methods lack reliable uncertainty estimation, undermining trust in risk-sensitive mental health applications. To address these limitations, we formulate the task as a multi-view learning problem that integrates semantic information from encoder-only models with higher-level reasoning information from decoder-only models, where reasoning-aware representations and uncertainty modeling are obtained in a trustworthy manner. To ensure reliable fusion, we adopt an evidential learning framework based on Subjective Logic to explicitly model uncertainty and introduce an evidential fusion strategy that balances complementary views while discounting unreliable evidence. Benchmarking on three real-world datasets, Dreaddit, SDCNL, and DepSeverity, reports accuracies of 0.835, 0.731, and 0.751, respectively, demonstrating its potential for reliable mental health prediction. Additional experiments on robustness to noise and case studies for interpretability confirm that our proposed framework not only improves predictive performance but also provides trustworthy uncertainty estimates and human-understandable reasoning signals, making it suitable for risk-sensitive applications in mental health assessment.

[NLP-4] Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成内容中真实性与新颖性的评估问题,特别是如何在不依赖模型内部结构(black-box)的前提下,量化文本片段是否与原始语料库一致(groundedness)或是否具有新意(novelty)。其核心挑战在于设计一种既可解释、又能跨域迁移的信号机制,以替代传统基于检索的判别方法。解决方案的关键是提出概念场(Concept Field):通过在句子嵌入空间中计算相邻句子之间的差值(delta),构建局部漂移场并估计点态不确定性;利用该场对候选句间转移进行评分,即通过 $\zeta$(观测delta与局部高斯估计的平均绝对z距离)来衡量一致性。该方法借助**向量序列数据库(Vector Sequence Database, VSDB)**高效存储嵌入及位置和下一差值元数据,从而实现快速、轻量且可溯源的判断,同时支持概率化解读。实验表明,该方法在联邦法规语料上的幻觉检测与古腾堡项目的新颖性识别中均表现出强选择性分类性能,并展现出跨领域一致性行为,验证了其通用性和可解释性优势。

链接: https://arxiv.org/abs/2605.05103
作者: Nicholas S. Kersting,Vittorio Castelli,Chieh Ting Yeh,Xinzhu Wang,Saad Taame
机构: Oracle Corporation(甲骨文公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 25 pages, 8 figures

点击查看摘要

Abstract:We introduce the Concept Field of a text corpus: a local drift field with pointwise uncertainty, estimated in sentence-embedding space from the deltas between consecutive sentences. Given a candidate sentence transition, we score its agreement with the field by \zeta , the mean absolute z-distance between the observed delta and the field’s local Gaussian estimate. The score is black-box (no model internals), corpus-attributable (every score traces to nearby corpus sentences), and admits a direct probabilistic reading. We support the computation with the introduction of a Vector Sequence Database (VSDB) that stores embeddings together with sequence-position and next-delta metadata. We evaluate this approach on two large-scale settings: hallucination-style groundedness detection over the U.S. Code of Federal Regulations, and novelty detection over Project Gutenberg. Using controlled LLM-generated rewrites, Concept Fields achieve strong selective classification performance under a grounded / ungrounded / unsure triage policy, which unlike retrieval-centric baselines have similar coverage-risk behavior across both domains, supporting a probability-based interpretation that transfers across domains. We also sketch how divergence and curl of the Concept Field, computed on dense clusters, surface qualitatively meaningful semantic patterns (logic sources, sinks, and implicit topics), which we offer as hypothesis-generating rather than as a quantitative result. Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
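ζ 分数即观测 delta 相对局部高斯估计的平均绝对 z 距离。下面用纯 Python 做一个极简示意(邻域 delta 样本与维度均为虚构,逐维独立高斯为本示例的简化假设):

```python
import math

def local_gaussian(deltas):
    """由邻近语料句子的 delta 样本估计逐维均值与标准差(独立高斯简化)。"""
    n, dim = len(deltas), len(deltas[0])
    mu = [sum(d[i] for d in deltas) / n for i in range(dim)]
    sigma = [max(math.sqrt(sum((d[i] - mu[i]) ** 2 for d in deltas) / n), 1e-8)
             for i in range(dim)]
    return mu, sigma

def zeta(delta, mu, sigma):
    """zeta:观测 delta 与局部高斯估计之间的平均绝对 z 距离。"""
    return sum(abs((d - m) / s) for d, m, s in zip(delta, mu, sigma)) / len(delta)

# 邻近语料句对的 delta 样本(二维嵌入,数值纯属虚构)
neighbor_deltas = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.0]]
mu, sigma = local_gaussian(neighbor_deltas)
grounded = zeta([1.0, 0.0], mu, sigma)    # 与语料场一致的转移
anomalous = zeta([2.0, 0.0], mu, sigma)   # 偏离场的转移(疑似幻觉)
```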

[NLP-5] Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在部署后面对持续变化的世界时,因缺乏自适应外部记忆机制而导致的知识更新与遗忘失衡问题。现有系统通常依赖显式管理外部记忆,难以实现动态调整;而生物记忆则通过多时间尺度耦合动力学实现即时关联、重复强化和选择性遗忘。论文提出的关键解决方案是构建一种基于关联记忆的外部存储结构——Memini,其将知识组织为有向图,每条边包含一对耦合的内部变量(快速与慢速),遵循Benna-Fusi突触巩固模型。这一设计使短期敏感性、渐进式巩固和选择性遗忘成为单一机制的三个面向,从而将外部记忆重构为可通过自身动力学重组的学习基质。

链接: https://arxiv.org/abs/2605.05097
作者: Andreas Pattichis,Constantine Dovrolis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. 9 pages, 2 figures

点击查看摘要

Abstract:LLMs are trained once, then deployed into a world that never stops changing. External memory compensates for this, but most systems manage it explicitly rather than letting it adapt on its own. Biological memory works differently: coupled multi-timescale dynamics make new associations immediately usable, strengthen what repetition confirms, and let the rest fade. We argue that external memory should follow a similar principle. In Memini, this view takes the form of an associative memory that organizes knowledge as a directed graph. Each edge carries two coupled internal variables, one fast and one slow, following the Benna-Fusi model of synaptic consolidation. From this coupling, episodic sensitivity, gradual consolidation, and selective forgetting emerge as facets of a single mechanism, reframing external memory as a learning substrate that reorganizes through its own dynamics.
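Benna–Fusi 模型的核心是快、慢两个耦合变量:快变量直接接收写入信号并向慢变量"泄漏",慢变量以更大的容量缓慢跟随。下面是一个两变量离散化示意(更新形式与参数均为本文的简化假设,并非原模型的完整多级链):

```python
def step(fast, slow, signal, g=0.1, c_slow=10.0):
    """双变量 Benna-Fusi 链的一步离散更新(形式与参数为简化假设):
    快变量接收写入信号并向慢变量泄漏;慢变量容量更大、跟随更慢。"""
    diff = fast - slow
    return fast + signal - g * diff, slow + (g / c_slow) * diff

fast, slow = 0.0, 0.0
fast, slow = step(fast, slow, 1.0)   # 一次强写入(新关联立即可用)
trace = []
for _ in range(50):                  # 其后不再重复,观察巩固与遗忘
    fast, slow = step(fast, slow, 0.0)
    trace.append((fast, slow))
```

重复写入会不断抬高慢变量(巩固),而单次写入的快分量则逐渐消退(选择性遗忘),与摘要描述的三种行为对应。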

[NLP-6] Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models EMNLP

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在经过干预(如知识编辑、推理蒸馏或去学习)后,其行为变化难以自动化、客观且可解释地评估的问题。现有方法往往依赖人工标注或缺乏统计严谨性,无法系统识别意图内与意外的行为差异。解决方案的关键在于提出一种自动化的对比评估流水线(contrastive evaluation pipeline),通过在对齐的提示上下文中比较基础模型 $ M_1 $ 与干预模型 $ M_2 $ 的自由形式多标记生成结果,输出人类可读、经统计验证的自然语言假设,用以描述二者差异,并提炼出跨假设的重复主题模式,从而实现对干预效果的可解释、统计可靠的事后审计。

链接: https://arxiv.org/abs/2605.05090
作者: Quintin Pope,Ajay Hayagreeve Balaji,Jacques Thibodeau,Xiaoli Fern
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 33 pages, 4 figures, 20 tables, targeting EMNLP submission

点击查看摘要

Abstract:We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model M_1 and an intervention model M_2 , our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions, reasoning distillation, knowledge editing and unlearning, demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate differences when effects are absent or misaligned with the prompt bank. Overall, the pipeline provides a statistically grounded and interpretable tool for post-hoc auditing of intervention-induced changes in model behavior.

[NLP-7] The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在心理测量维度上的差异来源问题,即如何量化和解释不同LLM在面对描述主观体验(如具身感知、情感感受、内在言语等)与刺激驱动行为反应类题项时表现出的系统性差异。其解决方案的关键在于提出并验证“皮诺曹轴”(Pinocchio Axis, Π),这是一个基于无标注数据的指标体系:通过引入皮诺曹分数(π_i),即模型在中性提示与人类模拟提示下对单个题项响应方差的比值,来衡量每个题项对主观体验需求的程度;进而发现该轴能解释跨问卷间约47.1%的模型差异,并且与模型细调(fine-tuning)密切相关,表明其反映的是训练塑造的自我表征倾向——即模型将经验性语言视为自身适用性的程度,而非传统意义上的性格特质。

链接: https://arxiv.org/abs/2605.05080
作者: Hubert Plisiecki,Sabina Siudaj,Kacper Dudzic,Anna Sterna,Maciej Gorski,Karolina Drozdz,Marcin Moskalewicz
机构: IDEAS Research Institute (IDEAS研究所); University of Warsaw (华沙大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We administer 45 validated psychometric questionnaires to 50 large language models (LLMs) to identify the dimensions along which LLMs differ psychometrically. Using Supervised Semantic Differential (SSD), we find that the primary axis of between-model variance separates items describing phenomenally rich experience, including embodied sensation, felt affect, inner speech, imagery, and empathy, from items describing stimulus-driven behavioral reactivity ( R^2_adj=.037 , p < .0001 ). To test this hypothesis at the item level, we introduce the Pinocchio score ( \pi_i ), the ratio of inter-model response variance under neutral prompting to that under a human-simulation prompt, as an annotation-free measure of each item’s experiential demand. \pi_i predicts condition-induced shifts in primary factor loading magnitudes ( \rho=-.215 , p < .0001 , n=1292–1310 items), confirming that between-model divergence on experiential items is structured rather than noisy. Applying PCA to per-model EFA scores across all questionnaires reveals one dominant dimension, the Pinocchio Axis ( \Pi ): the degree to which a model presents itself as a locus of phenomenal experience rather than a system of behavioral responses. This axis captures 47.1% of cross-questionnaire between-model variance in primary factor scores and converges with item-level Pinocchio scores ( r=.864 ). Marked within-provider divergence across closely related model variants is consistent with post-training fine-tuning as a key contributor, supporting the interpretation that \Pi reflects a training-shaped self-representational tendency governing how a model treats experiential language as self-applicable. The dominant axis of between-model psychometric variation is therefore not a conventional personality trait but a self-representational stance toward one’s own nature as an experiencer.
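皮诺曹分数 π_i 就是两种提示条件下模型间响应方差之比。示意如下(打分数据纯属虚构):

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pinocchio_score(neutral, simulated, eps=1e-8):
    """pi_i = 中性提示下模型间响应方差 / 人类模拟提示下模型间响应方差。"""
    return variance(neutral) / (variance(simulated) + eps)

# 假想的 5 个模型对同一题项的打分(1-5 李克特量表,数据纯属虚构)
neutral = [1.0, 5.0, 2.0, 4.0, 3.0]      # 中性提示:模型间分歧大
simulated = [4.0, 4.0, 5.0, 4.0, 4.0]    # 模拟人类提示:趋于一致
pi_i = pinocchio_score(neutral, simulated)
```

π_i 远大于 1 表示该题项在中性条件下引出显著的模型间分歧,即"体验性需求"较高。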

[NLP-8] The Impossibility Triangle of Long-Context Modeling

【速读】: 该论文试图解决长期序列建模中“效率(Efficiency)”、“紧凑性(Compactness)”与“回忆能力(Recall)”三者不可兼得的根本性权衡问题,即是否存在一个模型能同时实现每步计算复杂度独立于序列长度、状态大小独立于序列长度,并且能够线性地回忆历史事实。解决方案的关键在于提出一个统一的在线序列处理器(Online Sequence Processor)抽象框架,将Transformer、状态空间模型、线性递归网络及其混合架构纳入其中,并利用信息论工具——数据处理不等式(Data Processing Inequality)和法诺不等式(Fano’s Inequality)——严格证明:任何满足效率与紧凑性的模型最多只能回忆 $ O(\text{poly}(d)/\log V) $ 个键值对,其中 $ d $ 为模型维度,$ V $ 为词汇表大小。实验验证了理论边界的存在性,表明当前主流架构均受限于该三角权衡,且混合架构在参数空间中形成连续轨迹,进一步揭示了性能优化的内在限制。

链接: https://arxiv.org/abs/2605.05066
作者: Yan Zhou
机构: Changsha University of Science and Technology (长沙理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 41 pages, 6 figures

点击查看摘要

Abstract:We identify and prove a fundamental trade-off governing long-sequence models: no model can simultaneously achieve (i) per-step computation independent of sequence length (Efficiency), (ii) state size independent of sequence length (Compactness), and (iii) the ability to recall a number of historical facts proportional to sequence length (Recall). We formalize this trade-off within an Online Sequence Processor abstraction that unifies Transformers, state space models, linear recurrent networks, and their hybrids. Using the Data Processing Inequality and Fano’s Inequality, we prove that any model satisfying Efficiency and Compactness can recall at most O(poly(d)/log V) key-value pairs from a sequence of arbitrary length, where d is the model dimension and V is the vocabulary size. We classify 52 architectures published before March 2026 into the triangle, showing that each achieves at most two of the three properties and that hybrid architectures trace continuous trajectories in the interior. Experiments on synthetic associative recall tasks with five representative architectures validate the theoretical bound: empirical recall capacity lies strictly below the information-theoretic limit, and no architecture escapes the triangle.

[NLP-9] When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理多模态任务时存在的关系幻觉(relation hallucination)问题,即模型在推理物体间交互关系时容易出错,尤其在面对轻微视觉扰动(如旋转和噪声)时性能显著下降。解决方案的关键在于系统性评估不同类型的视觉扰动对关系推理的影响,并测试提示增强(prompt-based augmentation)与预处理策略(如方向校正和去噪)的有效性,结果表明这些方法虽能带来部分改善,但无法完全消除关系幻觉,从而揭示了感知鲁棒性与关系理解能力之间的差距,强调未来需发展更具几何感知能力的VLM架构以提升其因果推理稳定性。

链接: https://arxiv.org/abs/2605.05045
作者: Philip Wootaek Shin,Ajay Narayanan Sridhar,Sivani Devarapalli,Rui Zhang,Jack Sampson,Vijaykrishnan Narayanan
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.

[NLP-10] Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals ACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉(hallucination)检测的不确定性量化问题,即如何在不依赖重复采样或外部模型的情况下,高效、准确地识别模型输出中的错误信息。其解决方案的关键在于利用注意力矩阵(attention matrices)构建轻量级且单次遍历(single-pass)的不确定性估计方法:通过计算每个注意力头分布与均匀参考分布之间的Kullback-Leibler散度(Kullback-Leibler divergence),提取出可解释的特征,并将其输入到逻辑回归探测器(logistic regression probe)中进行预测。实验表明,该方法在多个数据集、任务类型和模型家族中均能有效预测答案正确性,且不确定性信号主要集中在中间层及事实性标记(如命名实体和数字)上,体现出注意力动态机制作为白盒信号对模型不确定性的高效捕捉能力。

链接: https://arxiv.org/abs/2605.05025
作者: Gijs van Dijk
机构: Utrecht University (乌得勒支大学)
类目: Computation and Language (cs.CL)
备注: ACL SRW 2026

点击查看摘要

Abstract:We propose a lightweight and single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head’s distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provides an efficient and interpretable white-box signal of model uncertainty.
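该方法的核心特征是每个注意力头的分布相对均匀参考分布的 KL 散度:KL(p‖u) = Σ p_i log(p_i·n)。示意如下(注意力权重为虚构示例):

```python
import math

def kl_from_uniform(attn):
    """KL(attn || uniform):注意力分布相对均匀参考分布的 KL 散度。"""
    n = len(attn)
    return sum(p * math.log(p * n) for p in attn if p > 0)

peaked = kl_from_uniform([0.97, 0.01, 0.01, 0.01])   # 高度集中(如聚焦事实性 token)
diffuse = kl_from_uniform([0.25, 0.25, 0.25, 0.25])  # 接近均匀,散度为 0
```

按论文流程,每层每头各取一个这样的散度值拼成特征向量,再交给 logistic 回归探测器预测答案正确性。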

[NLP-11] Misaligned by Reward: Socially Undesirable Preferences in LLM s

【速读】: 该论文旨在解决当前大型语言模型对齐评估中忽视社会层面偏好捕捉的问题,即现有奖励模型(Reward Models)的评测主要依赖于广义指令遵循基准,难以揭示其在偏见、安全、道德和伦理推理等关键社会领域中的表现缺陷。解决方案的关键在于构建一个系统性的框架,将社会评价数据集转化为成对偏好数据:当存在黄金标签时直接使用,否则利用方向性偏见指标进行替代标注,从而量化奖励模型是否偏好社会不利响应及其输出分布是否存在系统性偏差。这一方法揭示了当前主流奖励模型在社会智能上的显著不足,并指出避免偏见与保持上下文忠实性之间存在重要权衡。

链接: https://arxiv.org/abs/2605.05003
作者: Gayane Ghazaryan,Esra Dönmez
机构: University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Preprint

点击查看摘要

Abstract:Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.

[NLP-12] Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

【速读】: 该论文旨在解决长时程大语言模型(Large Language Model, LLM)代理在多轮信息收集过程中缺乏有效中间步骤奖励信号的问题。传统方法依赖最终答案的监督信号进行强化学习(Reinforcement Learning, RL),但过程级奖励需高质量人工标注,成本高昂;而现有无标签RL方法通常仅在轨迹或最终答案层面提取自监督信号,难以实现对中间对话轮次的精准信用分配。其解决方案的关键在于提出一种名为Self-Induced Outcome Potential (SIOP) 的新框架,通过将最终答案的语义聚类视为潜在未来状态,并构建一个可靠性感知的目标分布来指导信用分配。SIOP利用多轮模拟采样和聚类技术识别可靠的未来结果模式,并基于可计算的聚类级近似奖励中间轮次以提升可靠未来状态的后验支持度,从而在无需任务特定验证器的情况下实现有效的轮次级信用分配,同时避免了标准GRPO中广播的轨迹级优势信号。

链接: https://arxiv.org/abs/2605.04984
作者: Senkang Hu,Yong Dai,Xudong Han,Zhengru Fang,Yuzhi Zhao,Sam Tak Wu Kwong,Yuguang Fang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at this https URL.

[NLP-13] Conceptors for Semantic Steering

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于激活的控制方法在推理阶段对大语言模型(LLM)行为调节时存在的局限性问题,即当前主流方法将每个概念简化为单一方向向量,忽视了概念内在的多维几何结构。其解决方案的关键在于引入conceptor——一种从双极概念两端激活特征池中估计得到的软投影矩阵,能够保留概念的完整多维子空间。相比单向量基线,conceptor 所表征的子空间严格包含后者,并且通过几何分析和参数无关的“conceptor quota”实现了层选择诊断,预测概念可分离性(Pearson 相关系数高达 r=0.96)。此外,conceptor 支持闭合形式的布尔代数运算(AND、OR、NOT),从而实现子概念的组合表达,在五轴系统性评估中展现出优于加法基线的性能,尤其在多维子空间层上减少退化输出,提供了一种几何合理、可组合且更安全的替代方案。

链接: https://arxiv.org/abs/2605.04980
作者: Ilias Triantafyllopoulos,Young-Min Cho,Ren Tao,Miranda Muqing Miao,Sunny Rai,Lyle Ungar,Sharath Chandra Guntuku,Neville Ryant,João Sedoc
机构: New York University (纽约大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Activation-based steering provides control of LLM behavior at inference time, but the dominant paradigm reduces each concept to a single direction whose geometry is left largely unexamined. Rather than selecting a single steering direction, we use conceptors: soft projection matrices estimated from activations pooled across both poles of a bipolar concept, which preserve the concept’s full multidimensional subspace. A geometric analysis shows the bipolar subspace strictly subsumes the single-vector baseline. We further show that the conceptor quota provides a parameter-free layer-selection diagnostic, predicting concept separability with Pearson correlations up to r=0.96 across three instruction-tuned models and three semantic dimensions. Beyond selection, conceptors admit a closed-form Boolean algebra (AND, OR, NOT): we evaluate conceptor compositionality on thematically related sub-concepts. Across a systematic five-axis design-space evaluation, conceptors match or outperform additive baselines at layers where concept subspaces are multi-dimensional while producing substantially fewer degenerate outputs. Conceptor steering is a geometrically principled, compositional, and practically safer alternative to single-direction steering from a limited number of contrastive pairs.
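conceptor 可写作 C = R(R + α⁻²I)⁻¹,其中 R 为激活的相关矩阵。在 R 的特征基下,C 的奇异值为 s_i/(s_i + α⁻²):方差大的方向被保留,方差小的方向被抑制,这正是"软投影"的含义。以下在对角(特征基)情形下做一个示意,并顺带演示布尔 NOT(¬C = I − C;谱取值纯属虚构):

```python
def conceptor_singular_values(spectrum, alpha=1.0):
    """在相关矩阵 R 的特征基下,conceptor C = R(R + alpha^-2 I)^-1
    的奇异值为 s_i / (s_i + alpha^-2)。"""
    a2 = alpha ** -2
    return [s / (s + a2) for s in spectrum]

# 假想概念子空间的谱:前两个方向方差大,其余接近噪声
spectrum = [9.0, 4.0, 0.1, 0.01]
gains = conceptor_singular_values(spectrum, alpha=1.0)
not_gains = [1.0 - g for g in gains]  # 布尔 NOT:在同一特征基下 ¬C = I - C
```

与单方向引导只保留一个向量不同,这里每个方向都得到一个介于 0 与 1 之间的"软"保留系数。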

[NLP-14] Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在主观评估任务中与专家判断对齐的难题,此类任务常因专家间意见分歧、依赖隐性标准及判断随时间变化而复杂化。研究通过收集专家评估结果及后续问卷调查,系统分析不同形式的专家信息如何影响对齐效果,并揭示主观判断的本质特征。其关键发现在于:专家对齐困难不仅源于模型能力限制,更根植于主观评价本身的异质性(heterogeneous)、部分隐性(tacit)、维度依赖性(dimension-dependent)以及时间不稳定性(temporally unstable)。因此,解决方案的核心在于承认并建模这些内在特性,而非单纯提升模型性能或强制统一专家规则。

链接: https://arxiv.org/abs/2605.04972
作者: Tzu-Mi Lin,Wataru Hirota,Tatsuya Ishigaki,Lung-Hao Lee,Chung-Chi Chen
机构: National Yang Ming Chiao Tung University, Taiwan; Stockmark, Japan; National Institute of Advanced Industrial Science and Technology (AIST), Japan; National Institute of Informatics (NII), Japan
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria, and change their judgments over time. In this paper, we study expert alignment as a way to understand this difficulty. Using expert evaluations and follow-up questionnaires, we examine how different forms of expert information affect alignment and what this reveals about subjective judgment. Our findings show four consistent patterns. First, alignment difficulty varies substantially across experts, suggesting that expert evaluation styles differ widely in their distance from a model’s prior behavior. Second, explicit criteria and reasoning do not always improve alignment, indicating that expert judgment is not fully captured by verbalized rules. Third, editing is sensitive to both the number and the identity of examples, with small numbers of edits providing useful but unstable gains. Fourth, alignment difficulty differs across evaluation dimensions: dimensions grounded more directly in proposal content are easier to align, while dimensions requiring external knowledge or value-based judgment remain harder. Taken together, these results suggest that expert alignment is difficult not only because of model limitations, but also because subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.

[NLP-15] Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking NEURIPS2026

【速读】: 该论文旨在解决深度神经网络中权重矩阵几何连续性(geometric continuity)的成因问题,即相邻层的主奇异向量方向趋于一致的现象为何存在。研究表明,残差连接通过建立跨层梯度一致性,使权重更新在不同层间对齐;而对称性破缺的非线性激活函数则强制所有层共享同一坐标系,从而抑制了可能引发权重结构不稳定的旋转漂移。关键发现在于:保持旋转不变性的非线性激活无法维持连续性,说明真正起作用的是对称性破缺而非非线性本身;此外,激活函数聚焦于主导奇异方向上的连续性,而归一化操作则将连续性分布至多个方向。在Transformer架构中,这种连续性还具有投影特异性:读取残差流的Q、K、Gate和Up层表现出输入空间($\mathbf{v}_1$)连续性,写入残差流的O和Down层表现出输出空间($\mathbf{u}_1$)连续性,而无邻接非线性的V层仅呈现低连续性。

链接: https://arxiv.org/abs/2605.04971
作者: Kyungwon Jeong,Won-Gi Paeng,Honggyo Suh
机构: Hyntel(韩特尔)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages of main text, plus appendices. Under review at NeurIPS 2026

点击查看摘要

Abstract:Weight matrices in deep networks exhibit geometric continuity – principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experiments on toy MLPs and small transformers, we identify two mechanisms: residual connections create cross-layer gradient coherence that aligns weight updates across layers, and symmetry-breaking nonlinearities constrain all layers to a shared coordinate frame, preventing the rotation drift that would otherwise destabilize weight structure. Crucially, a nonlinear but rotation-preserving activation fails to retain continuity, isolating symmetry breaking – not nonlinearity itself – as the active ingredient. Activation and normalization play distinct roles: activation concentrates continuity in the leading singular direction, while normalization distributes it across multiple directions. In transformers, continuity is projection-specific: Q, K, Gate, and Up (which read from the residual stream) develop input-space ($\mathbf{v}_1$) continuity; O and Down (which write to it) develop output-space ($\mathbf{u}_1$) continuity; V alone, lacking an adjacent nonlinearity, develops only low continuity.
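
几何连续性可以用相邻层首个右奇异向量的夹角余弦来度量。下面给出一个纯 Python 的最小示意(非论文官方实现,矩阵均为假设的二维玩具数据):用幂迭代在 $W^\top W$ 上近似首个右奇异向量,再取相邻两层该向量内积的绝对值作为连续性得分。

```python
import math
import random

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def principal_right_singular_vector(W, iters=200):
    """Power iteration on W^T W approximates the leading right singular vector v1."""
    random.seed(0)
    v = [random.random() for _ in range(len(W[0]))]
    WT = transpose(W)
    for _ in range(iters):
        u = matvec(W, v)           # W v
        v = matvec(WT, u)          # W^T W v
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def continuity(W1, W2):
    """|cos| between the leading right singular vectors of two adjacent layers."""
    v1 = principal_right_singular_vector(W1)
    v2 = principal_right_singular_vector(W2)
    return abs(sum(a * b for a, b in zip(v1, v2)))

# Hypothetical adjacent-layer weights sharing a dominant direction along axis 0:
Wa = [[3.0, 0.1], [0.1, 0.5]]
Wb = [[2.5, -0.1], [0.2, 0.4]]
print(round(continuity(Wa, Wb), 2))  # close to 1.0: the layers share their leading direction
```

论文度量的是真实网络中多层、多投影(Q/K/V 等)的此类对齐程度,此处仅示意计算方式本身。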

[NLP-16] Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源、黏着语(agglutinative language)——巴什基尔语(Bashkir)上的适配问题,其核心挑战在于如何在有限的数据(71k文档,46.9M tokens)和计算资源下实现高效且高质量的微调。解决方案的关键在于采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,特别是LoRA和QLoRA,通过仅训练少量可学习参数(如秩为8的低秩矩阵)来逼近全量微调的效果。实验表明,QLoRA在7B规模模型(如Mistral-7B和Phi-2)上实现了与全量微调相当的困惑度(perplexity),同时减少超过40倍的可训练参数;但结果高度依赖于基础模型架构及其分词器(tokenizer)的选择,部分配置(如DeepSeek-7B)甚至出现显著性能下降。这表明PEFT的有效性并非普适,需结合具体模型特性进行优化,从而在质量与计算成本之间取得平衡。

链接: https://arxiv.org/abs/2605.04948
作者: Mullosharaf K. Arabov,Svetlana S. Khaybullina
机构: Kazan Federal University (喀山联邦大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.
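
LoRA/QLoRA 的核心是冻结基础权重 $W$,只训练低秩矩阵 $A$、$B$,有效权重为 $W + (\alpha/r)\,BA$,可训练参数量约为 $r(d_{in}+d_{out})$,远小于全量微调的 $d_{in} d_{out}$。下面是一个纯 Python 的概念示意(非论文实现,矩阵均为假设的小规模示例;实际训练中 QLoRA 还会对冻结的 $W$ 做 4-bit 量化):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_merge(W, A, B, alpha, r):
    """Effective weight after LoRA: W + (alpha / r) * B @ A.
    W: (d_out, d_in) frozen; B: (d_out, r) and A: (r, d_in) are the only
    trained matrices, so trainable params = r * (d_out + d_in)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Frozen 2x2 base weight with a rank-1 adapter (the paper uses rank 8):
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.0]]   # (d_out, r)
A = [[0.0, 1.0]]     # (r, d_in)
merged = lora_merge(W, A, B, alpha=2.0, r=1)
print(merged)  # [[1.0, 1.0], [0.0, 1.0]]
```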

[NLP-17] UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning SEMEVAL-2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务中难以区分内容理解与形式逻辑推理的问题,即“解耦内容与形式推理”(Disentangling Content and Formal Reasoning)。其核心挑战在于现有LLMs常将语义内容与逻辑结构混杂处理,导致推理性能受无关信息干扰。解决方案的关键在于提出一种高效模块化的神经符号方法(neuro-symbolic approach),通过一个基于LLM的解析器将自然语言三段论转化为一阶逻辑(First-Order Logic, FOL)表示,再交由自动化定理证明器进行形式化推理,并辅以可选的机器翻译模块和符号检索组件以增强多语言支持与前提识别能力。该设计有效降低了内容效应并提升了推理准确性,在小参数规模(4B)下优于零样本LLM基线,验证了符号推理与轻量级LLM协同的优势。

链接: https://arxiv.org/abs/2605.04941
作者: Ivan Kartáč,Kristýna Onderková,Jan Bronec,Zdeněk Kasner,Mateusz Lango,Ondřej Dušek
机构: Institute of Formal and Applied Linguistics; Faculty of Mathematics and Physics, Charles University
类目: Computation and Language (cs.CL)
备注: Accepted at SemEval-2026

点击查看摘要

Abstract:This paper describes our system submitted to SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We present an efficient modular neuro-symbolic approach, combining a symbolic prover with small reasoning LLMs (4B parameters). The system consists of an LLM-based parser that translates natural language syllogisms to a first-order logic (FOL) representation, an automated theorem prover, and two optional modules: machine translation for multilingual inputs and a symbolic retrieval component for the identification of relevant premises. The system achieves competitive accuracy and relatively low content effect on most subtasks. Our ablations show that this approach outperforms LLM-based zero-shot baselines in this parameter size range, but also reveal limited multilingual capabilities of small LLMs. Finally, we include a discussion of the task’s main ranking metric and analyze its limitations.
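
该系统先将自然语言三段论解析为 FOL,再交由自动定理证明器判定。下面用一个极简的传递闭包检查器示意其中"证明"这一步的思路(仅覆盖 "All X are Y" 一种前提形式,谓词名称为假设示例;真实系统使用完整的自动定理证明器处理 some/no 等量词):

```python
def entails(premises, conclusion):
    """Toy stand-in for the prover stage: universal-affirmative premises
    ('All X are Y', encoded as (X, Y) edges) entail a conclusion of the
    same form iff it lies in the transitive closure of the edges."""
    edges = set(premises)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(edges):
            for (c, d) in list(edges):
                if b == c and (a, d) not in edges:
                    edges.add((a, d))   # All a are b, all b are d => all a are d
                    changed = True
    return conclusion in edges

premises = [("greyhound", "dog"), ("dog", "mammal")]
print(entails(premises, ("greyhound", "mammal")))  # True
print(entails(premises, ("mammal", "greyhound")))  # False
```

将解析(LLM)与推理(符号证明器)解耦,正是论文中降低"内容效应"的关键:证明器只看逻辑形式,不受前提语义内容的干扰。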

[NLP-18] Unintended Negative Impacts of Promotional Language in Patent Evaluation

【速读】: 该论文旨在解决专利审查过程中 promotional language(促销语言)对专利评价结果影响的机制与效应问题,特别是其在技术革新语境下是否具有正面或负面作用。解决方案的关键在于:利用一个经过验证且领域诊断过的包含135个促销词汇的词典,对270万份美国专利商标局(USPTO)专利申请进行大规模实证分析,发现促销语言频率越高,专利被授权、转让及上诉成功的概率越低——即存在“促销惩罚”现象;同时揭示该语言并非掩盖技术薄弱的伪装,而是客观反映组合性新颖性和未来引用影响力,并进一步指出人类因素(如性别和审查经验)显著调节对促销表述的容忍度,从而为提升专利评估的客观性提供了基于语言模式识别的理论依据与实践路径。

链接: https://arxiv.org/abs/2605.04926
作者: Bingkun Zhao,Chenwei Zhang,Hao Peng
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Promotional language has been increasingly used to aid the communication of innovative ideas in science. Yet, less is known about its role in the context of technological innovation. Here, we use a validated and domain-diagnosed lexicon of 135 promotional words to study the association between promotional language and patent evaluation outcomes among 2.7 million USPTO patent applications. Our large-scale study reveals three unexpected findings. First, in contrast to scientific evaluation, we find that a higher frequency of promotional words is negatively associated with the probability of an application being (i) granted a patent, (ii) transferred ownership, and (iii) successfully appealed. This promotional penalty holds even after accounting for a range of confounding factors and is largely robust across different technological areas. Among matched samples, the difference in the success rate between the lowest and highest promotional density quintile is 5.5, 5.9, and 5.3 percentage points for patentability, transferability, and rejection reversal. Second, contrary to institutional skepticism, we show that promotional language is not a mask of weak technology, but objectively reflects the degree of combinatorial novelty and future citation impact. Third, digging into the mechanisms, we find that the tolerance to promotional framing is strongly moderated by human factors, with men and experienced examiners showing a higher acceptance of promotional narratives than women and novice examiners. By revealing an emerging paradox in the patent system, our study offers theoretical and practical implications for improving patent evaluation through more objective scrutiny of linguistic patterns in patent filings.

[NLP-19] Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

【速读】: 该论文旨在解决生成式 AI(Generative AI)在面对已知基本元素的新组合时的组合泛化(compositional generalization)能力不足的问题。现有方法依赖于监督微调(supervised fine-tuning),其本质是逐标记(token-level)训练,难以捕捉全局组合结构,导致模型对训练中出现过的组合过拟合,而无法有效推广到未见过的组合。解决方案的关键在于采用基于结果的强化学习(outcome-level reinforcement learning),具体使用Group Relative Policy Optimization(GRPO)框架,通过模型最终输出的反馈进行优化,而非逐标记预测。实验表明,这种策略能显著提升组合泛化性能,尤其通过重塑输出分布来缓解复杂组合类型的过拟合问题。

链接: https://arxiv.org/abs/2605.04920
作者: Xiyan Fu,Wei Liu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Compositional generalization refers to the ability to correctly interpret novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.
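
GRPO 的结果级反馈通常实现为组内相对优势:对同一提示采样一组输出,用该组奖励的均值与标准差对每个输出的结果奖励做标准化,无需单独的价值网络。下面是一个最小示意(非论文官方实现,奖励数值为假设示例):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's outcome reward
    against the mean/std of its own group (rollouts of the same prompt)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary outcome reward over 4 rollouts of one prompt: only one output is correct.
adv = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in adv])  # [1.732, -0.577, -0.577, -0.577]
```

正确输出获得正优势、错误输出获得负优势,梯度只依赖最终结果而非逐 token 的目标,这正是摘要中"结果级优化"的含义。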

[NLP-20] Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)后训练过程中因全深度反向传播梯度而导致的高内存消耗、长程梯度依赖以及任务梯度对预训练表征的直接干扰问题。其核心解决方案是提出一种名为LoPT(Local-Learning Post-Training)的策略,关键在于在Transformer结构中设置单一梯度边界于中间层:后半部分块基于任务目标进行学习,而前半部分块则通过轻量级特征重建目标更新,以保留有用表征并维持接口兼容性。该设计显著缩短了任务诱导的反向路径,限制了窄任务梯度对浅层表示的直接影响,从而在保持性能的同时降低内存开销、提升训练效率并增强预训练能力的保留。

链接: https://arxiv.org/abs/2605.04913
作者: Hengyu Shi,Tianyang Han,Peizhe Wang,Zhiling Wang,Xu Yang,Junhao Su
机构: D4 Lab; Southeast University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 33 pages

点击查看摘要

Abstract:LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose **LoPT**: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: this https URL

[NLP-21] Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中出现的幻觉(hallucination)问题,其核心在于揭示注意力机制(attention mechanism)失效的结构性根源。研究发现,注意力路由失败主要表现为两种模式:一是过度集中于少数位置导致信息冗余,二是过于分散使相关性稀释,而这种失败形态本身携带可诊断信号。解决方案的关键在于提出一个双轴诊断框架:第一轴为容量指标 $\phi$,基于度归一化注意力算子的对称部分(控制传输能力),第二轴为方向指标 $G$,衡量算子与其转置之间的不对称性(反映信息流动方向)。通过理论证明,所有满足转置不变性的谱诊断均无法区分方向,且 $G$ 是唯一控制方向的参数;结合双分图Cheeger不等式,作者进一步量化了不同注意力结构(如均匀因果注意力与窗口注意力)的性能边界,指出前者存在 $n$-无关下界 $\phi \geq 1/5$,而后者的 $\phi$ 则以 $O(w/n)$ 的量级跌破该下界($w$ 为窗口大小)。实验验证表明,在长度控制条件下,运输特征仍保持可解释性(LC-AUROC从0.62到0.84),并预测瓶颈型与扩散型基准应呈现相反极性——这一预测在HaluEval与MedHallu数据集上得到实证支持。

链接: https://arxiv.org/abs/2605.04893
作者: Dominik Dahlem,Diego Maniloff,Mac Misiura
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 42 pages, 6 figures, 3 tables; 82-page online supplement (proofs, additional experiments, dataset statistics) as an ancillary file

点击查看摘要

Abstract:Large language models hallucinate in predictable ways: attention routing fails by over-concentrating on a narrow set of positions, or by spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport capacity; we prove that every transpose-invariant spectral diagnostic of this operator is structurally orientation-blind (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a quantitative converse establishing the asymmetry coefficient $G$ as the unique control parameter for direction. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $\phi \ge 1/5$ with worst cut at $t^\ast/n \approx 0.32$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. The resulting two-axis diagnostic ($\phi$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (LC-AUROC from 0.62 to 0.84) on tested models up to 8B parameters, with polarity reversing as predicted between HaluEval and MedHallu.
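
文中双轴诊断的直觉可以这样理解:对(度归一化的)注意力算子 $P$,任何只依赖对称部分 $(P+P^\top)/2$ 的谱诊断都无法察觉方向,而反对称部分的相对大小给出方向信号。下面是一个纯 Python 示意($G$ 的具体归一化方式是本文假设的一种常见定义,未必与论文完全一致):

```python
def frob(M):
    """Frobenius norm of a matrix given as nested lists."""
    return sum(x * x for row in M for x in row) ** 0.5

def asymmetry_coefficient(P):
    """One plausible asymmetry measure: relative Frobenius norm of the
    antisymmetric part (P - P^T)/2. Transpose-invariant diagnostics of the
    symmetric part (P + P^T)/2 are blind to exactly this component."""
    n = len(P)
    anti = [[(P[i][j] - P[j][i]) / 2 for j in range(n)] for i in range(n)]
    return frob(anti) / frob(P)

sym_P = [[0.5, 0.5], [0.5, 0.5]]      # symmetric routing: no direction signal
causal_P = [[1.0, 0.0], [0.5, 0.5]]   # strictly causal routing
print(asymmetry_coefficient(sym_P))              # 0.0
print(round(asymmetry_coefficient(causal_P), 3)) # 0.289
```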

[NLP-22] A Comparative Analysis of Machine Learning and Deep Learning Models for Tweet Sentiment Classification: A Case Study on the Sentiment140 Dataset

【速读】: 该论文旨在解决社交媒体中非结构化公众情绪的实时自动化分析问题(real-time automated analysis of unstructured public sentiment in social media)。其关键解决方案在于对比传统机器学习方法(基于TF-IDF特征的逻辑回归模型)与深度学习方法(双向长短期记忆网络,BiLSTM)在中等规模非正式文本数据上的性能表现,结果表明,通过精心设计的特征提取策略,经典机器学习方法在准确率(73.5%)上优于BiLSTM模型(69.17%),且后者存在轻微过拟合现象,从而揭示了在特定场景下,结合良好特征工程的传统方法仍具优势。

链接: https://arxiv.org/abs/2605.04888
作者: Vita Anggraini,Cintya Bella,Bastian,Luluk Muthoharoh,Ardika Satria,Martin C.T. Manullang
机构: Sumatra Institute of Technology (苏门答腊理工学院)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures, 3 tables. Comparative study of Logistic Regression and BiLSTM for tweet sentiment classification on a 10,000-sample subset of the Sentiment140 dataset. Includes Streamlit/Hugging Face deployment

点击查看摘要

Abstract:The exponential growth of social media has created an urgent need for automated systems to analyze unstructured public sentiment in real time. This study compares a traditional Logistic Regression model using TF-IDF features with a deep learning Bidirectional Long Short-Term Memory (BiLSTM) architecture on a 10,000-tweet subset of the Sentiment140 dataset. Experimental results show that Logistic Regression outperformed BiLSTM, achieving an accuracy of 73.5% compared with 69.17%, while the deep learning model exhibited mild overfitting. These findings suggest that for medium-scale informal text data, classical machine learning with robust feature extraction can outperform more complex deep learning approaches. Finally, the trained models were integrated into an interactive web application using Streamlit and deployed on Hugging Face Spaces for public access.
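
传统分支依赖 TF-IDF 特征。下面用纯 Python 演示 TF-IDF 的基本计算(此处采用 scikit-learn 风格的平滑 IDF:$\log\frac{1+N}{1+\mathrm{df}}+1$,与论文的具体实现参数未必一致;语料为假设的两句示例,实际论文使用 Sentiment140 推文):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors (as dicts) with smoothed IDF, sklearn-style."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return vectors

docs = ["great phone love it", "terrible battery terrible screen"]
vecs = tfidf(docs)
# 'terrible' occurs twice in doc 2 and nowhere else, so it gets a high weight there
print(round(vecs[1]["terrible"], 3))  # 0.703
```

这类稀疏加权特征配合逻辑回归,正是论文中在中等规模非正式文本上胜过 BiLSTM 的传统基线。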

[NLP-23] Sentiment Analysis and Customer Satisfaction Prediction on E-Commerce Platforms Based on YouTube Comments Using the XGBoost Algorithm

【速读】: 该论文旨在解决印度尼西亚数字电商快速发展背景下,消费者在以YouTube为代表的视频社交平台上的评论文本具有高度非结构化和多情境特征,导致人工情感追踪效率低下、准确性不足的问题。其解决方案的关键在于构建一个基于极端梯度提升(XGBoost)算法与词频-逆文档频率(TF-IDF)向量化相结合的预测模型,并通过PyCaret框架进行优化,从而实现对用户满意度的高效分类与识别;实验结果表明,该方法不仅在标准性能指标上表现优异,还揭示了电商讨论中社会政治术语的显著渗透现象及其对情感极性的重要影响。

链接: https://arxiv.org/abs/2605.04887
作者: Ridho Benedictus Togi Manik,Muhammad Aqil Ramadhan,Ihsan Maulana Yusuf,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (印尼南岛理工学院)
类目: Computation and Language (cs.CL)
备注: 5 pages, 10 figures

点击查看摘要

Abstract:The exponential expansion of digital commerce in Indonesia has significantly shifted consumer interactions toward video-centric social networks, particularly YouTube. Consequently, the sheer volume of unstructured, multi-contextual comments poses a tremendous challenge for manual sentiment tracking. This study investigates and constructs a predictive model for customer satisfaction leveraging the Extreme Gradient Boosting (XGBoost) architecture coupled with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. By utilizing a secondary dataset of YouTube comments retrieved from e-commerce review videos, the raw text underwent rigorous preprocessing to generate normalized numerical features. The experimental results demonstrate that the PyCaret-optimized machine learning framework delivers superior classification resilience. Beyond standard performance metrics, lexical evaluations and feature-importance mapping uncover a notable phenomenon: e-commerce discourse is heavily infiltrated by socio-political terminologies, which ultimately influence the polarity of audience satisfaction.

[NLP-24] BenCSSmark: Making the Social Sciences Count in LLM Research LREC2026

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估基准中社会科学研究任务严重缺失的问题,这一局限性不仅制约了LLM评估体系的全面性和代表性,也阻碍了社会科学研究与人工智能技术的深度融合。解决方案的关键在于提出BenCSSmark基准,该基准整合了由计算社会科学家标注的多样化社会科学研究数据集,通过引入社会科学视角重构评估框架,从而提升AI模型在跨学科任务中的泛化能力与鲁棒性,并推动更透明、更具社会相关性的智能系统发展。

链接: https://arxiv.org/abs/2605.04886
作者: Arnault Chatelain,Étienne Ollion,Qianwen Guan,Diandra Fabre,Lorraine Goeuriot,Emile Chapuis,Abdelkrim Beloued,Marie Candito,Nicolas Hervé,Didier Schwab
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, Accepted to LREC 2026

点击查看摘要

Abstract:This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks – standardized tools for assessing computational systems – are pivotal in the development of artificial intelligence (AI), including large language models (LLMs). Benchmarks do more than measure progress – they actively structure it, shaping reputations, research agendas, and commercial outcomes. Despite this central role, the social sciences are largely absent from mainstream evaluation frameworks, even though scholars in these fields generate dozens of rigorously annotated, context-sensitive datasets each year. Integrating this work into benchmark design could significantly improve the generalization and robustness of AI models. In turn, models trained on social scientific tasks would likely yield better performance on classic and contemporary tasks in disciplines as diverse as history, sociology, political science or economics. This is all the more pressing as these disciplines are quickly turning to LLMs for assistance. To address this gap, we introduce BenCSSmark, a benchmark composed of datasets annotated by computational social scientists. By integrating social scientific perspectives into benchmarking, BenCSSmark seeks to promote more robust, transparent, and socially relevant AI systems and to foster efficient collaboration.

[NLP-25] A Comparative Study of PyCaret AutoML and CNN-BiLSTM for Binary Hate Speech Detection in Indonesian Twitter

【速读】: 该论文旨在解决印尼语推文中的仇恨言论(hate speech, HS)二分类检测问题,其核心挑战在于短文本、类别不平衡以及依赖局部词汇线索与短程上下文组合的复杂判断。解决方案的关键在于对比两种模型架构:一是基于PyCaret AutoML框架的传统机器学习分支(采用TF-IDF与词典驱动的侮辱词计数),二是基于卷积神经网络-双向长短期记忆网络(CNN-BiLSTM)的深度学习分支,后者通过学习密集词嵌入并捕捉局部短语模式和双向上下文信息来提升性能。实验表明,CNN-BiLSTM在准确率(83.8%)、F1分数(81.2%)上显著优于PyCaret中的最优模型(随机森林,77.2%准确率,77.0% F1分数),验证了深度学习方法在该任务中的优越性。

链接: https://arxiv.org/abs/2605.04885
作者: Tanty Widiyastuti,Mayada,Adisty Syawalda Ariyanto,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (ITERA)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures, and 1 table in the current manuscript. The paper presents a comparative study of PyCaret AutoML and CNN-BiLSTM for binary hate speech detection in Indonesian Twitter

点击查看摘要

Abstract:This paper compares a PyCaret AutoML branch and a CNN-BiLSTM branch for binary hate speech detection on Indonesian Twitter using the HS label from the corpus of Ibrohim and Budi. Both branches share the same preprocessing pipeline so that the comparison reflects modelling differences rather than inconsistent data preparation. The conventional branch uses TF-IDF with a lexicon-based abusive-word count, whereas the neural branch learns dense token representations and captures both local phrase patterns and bidirectional context. The benchmark is built from the released 13,130-row annotation table, whose HS label yields a 58:42 class ratio. On the held-out split, CNN-BiLSTM achieves the best result with 83.8% accuracy, 79.8% precision, 82.7% recall, and 81.2% F1-score. Within the PyCaret branch, Random Forest is the strongest conventional model with 77.2% accuracy and 77.0% F1-score. The neural branch therefore improves accuracy by 6.6 points and F1-score by 4.2 points. Exploratory corpus analysis, learning curves, and confusion matrices show that the dataset is short-text, moderately imbalanced, and still difficult because many decisions depend on local lexical cues plus short contextual composition. The study concludes that PyCaret AutoML is an effective conventional benchmarking framework, whereas CNN-BiLSTM is the stronger end model for the reported benchmark setting.

[NLP-26] Anticipating Innovation Using Large Language Models

【速读】: 该论文旨在解决如何前瞻性预测技术创新(innovation),即新科技组合的出现这一基础性挑战。其核心问题是识别出这些未来技术组合在早期阶段所留下的可检测信号,从而为科学政策提供前瞻性的决策依据。解决方案的关键在于提出了一种名为TechToken的基于Transformer的模型,该模型将国际专利分类代码(International Patent Classification codes)视为词汇表中的“词”,通过微调嵌入这些代码来学习技术语言;进而利用代码嵌入之间的上下文相似度作为语言收敛度量,发现这种集体性的描述方式转变能够准确预测首次技术组合的发生,且该信号并非来自单一发明者,而是源于成千上万专利中技术表述的整体演变。

链接: https://arxiv.org/abs/2605.04875
作者: Enrico Maria Fenoaltea,Filippo Santoro,Giordano De Marzo,Segun Taofeek Aroyehun,Andrea Tacchella
机构: Universitat de Barcelona (巴塞罗那大学); Centro Ricerche Enrico Fermi (恩里科·费米研究中心); University of Konstanz (康斯坦茨大学); Complexity Science Hub (复杂科学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 16 pages, 4 figures

点击查看摘要

Abstract:Forecasting innovation, understood as the emergence of new technological combinations, is a fundamental challenge for science and policy. We show that forthcoming combinations leave an early trace in the collective language of patents, with predictive signals detectable even decades in advance. We show that this signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents. To this end, we introduce TechToken, a transformer-based model that treats technologies, classified by International Patent Classification codes, as words in its vocabulary, learning the language of technologies by embedding these codes during fine-tuning. We define context similarity between code embeddings as a measure of linguistic convergence and show that it accurately predicts first technological combinations. TechToken also improves general representation quality, outperforming state-of-the-art models across different patent-related tasks.

[NLP-27] Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中幻觉问题,尤其是如何将序列级别的偏好信号转化为对视觉保真度的细粒度监督。现有方法依赖模型自身估计的视觉敏感性信号来分配训练权重,但这种自指偏差会导致模型过度强化已掌握的视觉线索,忽视难以感知却关键的细节,从而限制了更深层次的对齐。解决方案的关键在于提出一种不确定性感知的探索性直接偏好优化(Uncertainty-aware Exploratory Direct Preference Optimization, UE-DPO)方法,通过量化模型在给定图像中无法准确锚定token预测时的epistemic不确定性,引导模型主动识别认知缺陷并进行自我修正;具体而言,基于不确定性加权的探索强度,在优选样本中施加更强的学习压力于视觉表现不足的token,同时缓解次优样本中对有益知识的过度惩罚,实现了更鲁棒且高效的视觉-语言对齐。

链接: https://arxiv.org/abs/2605.04874
作者: Huatian Zhang,Zhendong Mao,Lei Zhang,Yongdong Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has proven to be an effective solution for mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. One of its key challenges lies in how to transfer the sequence-level preference into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate training emphasis according to the model’s self-assessed visual sensitivity signals. However, such sensitivity, estimated by a model still under training, introduces self-referential bias: reinforcing already well-learned visual cues while neglecting hard-to-perceive but critical details, thereby limiting deeper alignment. In this work, we propose an Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) method for MLLMs, which enables the model to uncover its cognitive deficiencies and actively explore for self-correction, guided by token-level epistemic uncertainty. Specifically, we first quantify the uncertainty from the model’s failure to ground token predictions in the given image. Then, based on an uncertainty-aware exploration intensity, we encourage more learning pressure on visually deficient tokens in preferred samples, and alleviate the over-penalization of beneficial knowledge in dispreferred samples. Further, we provide a theoretical justification for our method, and extensive experiments demonstrate its effectiveness and robustness.

[NLP-28] Measuring Psychological States Through Semantic Projection: A Theory-Driven Approach to Language-Based Assessment

【速读】: 该论文旨在解决当前基于自然语言处理的心理特质评估方法依赖监督模型、缺乏可解释性且跨情境泛化能力有限的问题。其解决方案的关键在于提出一种理论驱动的全无监督框架,通过语义投影(semantic projection)直接从自然语言中测量心理状态:首先将抑郁、焦虑和担忧等心理构念操作化为由词汇锚点和临床量表条目定义的可解释语义轴,再利用Sentence-BERT对文本进行嵌入,并将其投影到这些语义轴上生成连续的心理评分。该方法不依赖标注数据,具备良好的可解释性和跨格式适应性,尤其在结构化文本(如选择词、短语)中表现优异,且通过句子级聚合策略显著提升自由文本分析效果,从而为心理评估提供了一种可扩展、透明且稳健的替代方案。

链接: https://arxiv.org/abs/2605.04873
作者: Maria Luongo,Davide Marocco,Nicola Milano
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in natural language processing have enabled increasingly accurate estimation of psychological traits from language. However, most existing approaches rely on supervised models trained to predict questionnaire scores, limiting interpretability and generalizability across contexts. The present study introduces a theory-driven and fully unsupervised framework for measuring psychological states directly from natural language using semantic projection. Psychological constructs were operationalized as interpretable semantic axes derived from lexical anchors and items from validated clinical scales assessing depression, anxiety, and worry. Participants' textual responses were embedded using Sentence-BERT and projected onto these axes to generate continuous psychological scores across multiple response formats, including selected words, generated words, phrases, and free-text responses. Projection scores were evaluated through correlations with standardized clinical measures, split-half reliability analyses, attenuation corrections, distributional similarity using Wasserstein distance, and comparisons with lexicon-based sentiment analysis (VADER). Results showed strong associations between projection scores and clinical measures, particularly for structured formats such as selected words, written words, and phrases. Free-text responses produced weaker results when analyzed as whole texts, but performance improved substantially when sentence-level aggregation strategies were applied. These findings support semantic projection as an interpretable and scalable alternative to supervised language models for psychological assessment and highlight the importance of response format and text-processing strategies in language-based mental health measurement.
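
语义投影的基本做法是:用两个词汇锚点嵌入之差定义构念轴,再把文本嵌入相对锚点中点沿该单位轴投影,得到连续得分。下面是一个纯 Python 示意(锚点与文本嵌入均为假设的二维玩具向量,实际系统使用 Sentence-BERT 的高维嵌入;具体的轴构造方式可能与论文细节不同):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_onto_axis(embedding, low_anchor, high_anchor):
    """Signed position of an embedding along the construct axis defined by
    two anchor embeddings (e.g., 'calm' vs 'anxious'), measured from the
    anchors' midpoint along the unit axis direction."""
    axis = [h - l for h, l in zip(high_anchor, low_anchor)]
    norm = dot(axis, axis) ** 0.5
    axis = [x / norm for x in axis]
    mid = [(h + l) / 2 for h, l in zip(high_anchor, low_anchor)]
    centered = [e - m for e, m in zip(embedding, mid)]
    return dot(centered, axis)

calm = [1.0, 0.0]      # hypothetical anchor embedding for the low pole
anxious = [0.0, 1.0]   # hypothetical anchor embedding for the high pole
text = [0.2, 0.9]      # hypothetical embedding of a participant response
print(round(project_onto_axis(text, calm, anxious), 3))  # 0.495 (toward 'anxious')
```

正分表示文本更靠近高极锚点,负分更靠近低极锚点,因此无需任何监督训练即可得到可解释的连续评分。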

[NLP-29] Assessing Cognitive Effort in L2 Idiomatic Processing: An Eye-Tracking Dataset

【速读】: 该论文旨在解决第二语言(L2)学习者在处理习语表达时与母语者存在差异的认知机制问题,特别是其倾向于采用“字面优先”策略所导致的可测量认知负荷。解决方案的关键在于构建并验证了一个基于眼动追踪的标准化数据集,该数据集采集了从A1到C2级别葡萄牙语母语者的英语习语阅读过程中的眼动指标(如注视和回视),并利用60 Hz采样率的硬件设备成功捕捉到宏观认知事件。初步分析证实了语言熟练度与回视行为之间存在显著负相关关系,从而为人类习语理解模型及大语言模型(LLM)在拟人化隐喻理解方面的对齐提供了认知基础基准。

链接: https://arxiv.org/abs/2605.04857
作者: Eduardo Santos,Juliana Carvalho,César Rennó-Costa
机构: Federal University of Rio Grande do Norte (UFRN), Brazil; Digital Metropolis Institute (IMD/UFRN); University of Sheffield
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents the development and validation of an eye-tracking dataset designed to investigate how second-language (L2) learners process idiomatic expressions. While native speakers often rely on direct retrieval of figurative meanings, L2 speakers frequently adopt a literal-first approach, which incurs measurable cognitive costs. This resource captures these costs through ocular metrics recorded from Portuguese L1 speakers of English across all CEFR proficiency levels (A1-C2). Although the study uses entry-level 60 Hz hardware (Tobii Pro Spark), we demonstrate that this sampling rate provides sufficient data density to detect macro-cognitive events such as fixations and regressions in reading. Preliminary analysis validates the dataset by revealing a strong inverse correlation between language proficiency and regressive eye movements. Integrated into the MIA (Modeling Idiomaticity in Human and Artificial Language Processing) initiative, this dataset serves as a cognitively grounded benchmark for evaluating both human processing models and the alignment of large language models with human-like figurative understanding.

[NLP-30] StoryAlign: Evaluating and Training Reward Models for Story Generation

【速读】: 该论文旨在解决当前生成式AI(Generative AI)在故事生成任务中难以有效建模人类故事偏好(human story preferences)的问题,导致生成的故事在叙事结构复杂性和人类偏好一致性上仍显著落后于人工创作作品。其关键解决方案是构建了首个用于评估奖励模型(reward models)在故事偏好上的基准测试集StoryRMB,并基于此开发了专门针对故事偏好的先进奖励模型StoryReward。StoryReward通过训练于约10万条高质量故事偏好对数据,在StoryRMB上实现了当前最优性能,且在下游测试时缩放(test-time scaling)应用中如best-of-n(BoN)选择机制中能更准确地选出符合人类偏好的故事,从而显著提升生成故事的人类对齐度。

链接: https://arxiv.org/abs/2605.04831
作者: Haotian Xia,Hao Peng,Yunjia Qi,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works regarding complex narrative structure and human-aligned preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under-explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains 1,133 high-quality, human-verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3% accuracy. To address this limitation, we construct roughly 100,000 high-quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state-of-the-art (SoTA) performance on StoryRMB, outperforming much larger models. We also adopt StoryReward in downstream test-time scaling applications for best-of-n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research. Related code and data are available at this https URL.
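
best-of-n(BoN)选择的流程很简单:对同一提示生成 n 个候选故事,用奖励模型逐一打分并保留得分最高者。下面是一个最小示意(`toy_reward` 为假设的替代打分函数,实际应调用 StoryReward 之类的奖励模型):

```python
def best_of_n(candidates, reward_fn):
    """Test-time BoN: keep the candidate the reward model scores highest."""
    return max(candidates, key=reward_fn)

# Hypothetical stand-in reward: prefer longer, more developed drafts.
def toy_reward(story):
    return len(story.split())

drafts = ["A king ruled.", "A king ruled wisely until a storm tested him."]
print(best_of_n(drafts, toy_reward))
```

BoN 不改动生成模型本身,只在推理期做筛选,因此奖励模型与人类偏好的一致程度直接决定最终故事质量。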

[NLP-31] Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为低数据优化中的代理模型(surrogate model)时,其预测结果及其不确定性(uncertainty)尚不明确的问题。研究发现,LLM所生成的代理信念(surrogate belief)强烈依赖于提示文本(prompt text)和查询协议(query protocol),而传统方法往往忽视了这一特性。解决方案的关键在于提出一个“不确定性对齐准则”(uncertainty-alignment criterion),用于衡量模型不确定性是否准确反映样本一致函数之间的残余模糊性(residual ambiguity)。通过该准则,作者揭示了结构化提示(structural prompts)可作为有效先验、点对点(POINTWISE)与联合(JOINT)查询方式诱导不同信念,以及顺序证据导致非单调且顺序敏感的置信度更新等现象,从而表明 elicitation protocol 是 LLM 代理建模的核心组成部分,而非简单的格式调整。

链接: https://arxiv.org/abs/2605.04764
作者: Ge Lei,Samuel J. Cooper
机构: Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under sparse observations, showing that it depends strongly on prompt text and query protocol. We introduce an uncertainty-alignment criterion that measures whether model uncertainty tracks residual ambiguity among sample-consistent functions. Across controlled inference tasks and Bayesian optimization studies, we find that structural prompts act as effective priors, POINTWISE and JOINT querying induce different beliefs, and sequential evidence leads to non-monotonic, order-sensitive confidence updates. These effects change downstream acquisition decisions and regret, showing that elicitation protocol is part of the LLM surrogate specification, not a formatting detail.
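摘要提出的"不确定性对齐准则"衡量模型不确定性是否追踪"样本一致函数"之间的残余模糊性。下面以一个玩具函数族示意该残余模糊性的一种可能量化方式(候选函数集合与方差度量均为演示假设,并非论文的正式定义):

```python
def residual_ambiguity(candidates, observations, query_x, tol=1e-6):
    """在通过全部观测点的候选函数中, 计算它们在查询点处预测值的方差。"""
    consistent = [f for f in candidates
                  if all(abs(f(x) - y) < tol for x, y in observations)]
    preds = [f(query_x) for f in consistent]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)

# 三个候选函数都经过观测点 (1, 1), 但在 x=3 处分歧明显
candidates = [lambda x: x, lambda x: x * x, lambda x: 2 * x - 1]
amb = residual_ambiguity(candidates, [(1, 1)], query_x=3)
```

一个校准良好的代理模型在 x=3 处报告的不确定性应随该残余模糊性增大而增大。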

[NLP-32] Gyan: An Explainable Neuro-Symbolic Language Model NEURIPS2026

【速读】: 该论文旨在解决当前基于Transformer架构的预训练大语言模型在 compositional context(组合语境)建模不完整、存在幻觉(hallucination)、难以维护、可解释性差以及计算资源消耗巨大等问题。其解决方案的关键在于提出了一种全新的非Transformer架构语言模型Gyan,该模型通过解耦语言模型与知识获取及表示过程,结合修辞结构理论(rhetorical structure theory)、语义角色理论(semantic role theory)和基于知识的计算语言学方法,构建了能够捕捉完整组合语境并模拟人类认知扩展为“世界模型”的意义表示结构,从而实现可解释、可信且适用于关键任务场景的高性能语言建模。

链接: https://arxiv.org/abs/2605.04759
作者: Venkat Srinivasan,Vishaal Jatav,Anushka Chandrababu,Geetika Sharma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: also submitted to NeurIPS 2026

点击查看摘要

Abstract:Transformer based pre-trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large scale pre-training, these models do not capture complete compositional context and certainly not, the full human analogous context. Besides, by the very nature of the architecture, these models hallucinate, are difficult to maintain, are not easily interpretable and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non-transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory and knowledge-based computational linguistics. Gyan’s meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a ‘world model’. AI model adoption critically depends on trust and transparency especially in mission critical use cases. Collectively, our results demonstrate that it is possible to create models which are trustable and reliable for mission critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.

[NLP-33] Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL

【速读】: 该论文旨在解决工具增强型文本到SQL(Text-to-SQL)解析中因粗粒度结果监督导致的信用分配(credit assignment)问题,即模型在获得正确答案时无论中间步骤是否冗余或错误均被给予相同奖励,从而诱导其探索次优推理空间,限制了效率与泛化能力。解决方案的关键在于提出FineStep框架,通过引入独立过程奖励缓解结果监督信号稀疏性,并设计基于步骤级别的信用分配机制精确量化每一步推理的价值,最终采用基于步骤优势的策略优化方法实现高效更新,显著提升了模型性能并减少了冗余工具调用。

链接: https://arxiv.org/abs/2605.04719
作者: Yaxun Dai,Baolin Sun,Junying Wang,Pengfei Wang,Yingqi Gao,Xuemei Dong,Mengdie Chu,Xiang Qi,Pingfu Chao
机构: Ant Digital Technologies, Ant Group; Institute of Computer Science and Technology, Soochow University, China; School of Management, University of Science and Technology of China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool-integrated Text-to-SQL parsing has emerged as a promising paradigm, framing SQL generation as a sequential decision-making process interleaved with tool execution. However, existing reinforcement learning approaches mainly rely on coarse-grained outcome supervision, resulting in a fundamental credit assignment problem: models receive the same reward for any trajectory that yields the correct answer, even when intermediate steps are redundant, inefficient, or erroneous. Consequently, models are encouraged to explore suboptimal reasoning spaces, limiting both efficiency and generalization. To address this problem, we propose FineStep, a novel framework for step-level credit assignment in tool-augmented Text-to-SQL. First, we introduce a reward design with independent process rewards to alleviate the signal sparsity of outcome supervision. Next, we present a step-level credit assignment mechanism to precisely quantify the value of each reasoning step. Finally, we develop a policy optimization method based on step-level advantages for efficient updates. Extensive experiments on BIRD benchmarks show that FineStep achieves state-of-the-art performance and reduces redundant tool interactions, with a 3.25% average EX gain over GRPO at the 4B scale.
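FineStep 的过程奖励与优势计算细节未在摘要中给出;下面仅以"每步过程奖励减去批内同位置基线"这一常见简化形式,示意步骤级信用分配与轨迹级结果监督(所有步骤共享同一奖励)的区别(纯演示假设):

```python
def step_level_advantages(process_rewards):
    """process_rewards: 每条轨迹每一步的过程奖励 (按步对齐的二维列表)。
    返回各步相对于批内同位置均值的优势 (简化的基线扣除)。"""
    n_steps = len(process_rewards[0])
    baselines = [sum(traj[t] for traj in process_rewards) / len(process_rewards)
                 for t in range(n_steps)]
    return [[traj[t] - baselines[t] for t in range(n_steps)]
            for traj in process_rewards]

# 两条轨迹最终都正确 (末步奖励相同), 但中间步骤质量不同
advs = step_level_advantages([[1.0, 0.0, 1.0],
                              [0.0, 1.0, 1.0]])
```

与结果监督不同,这里两条轨迹的中间步骤得到了不同的更新信号,而质量相同的末步优势为零。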

[NLP-34] Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

【速读】: 该论文旨在解决音频语言模型(Audio Language Models, ALMs)中对抗性越狱攻击(Jailbreak attacks)的优化效率问题,即当前方法通常对整个音频波形进行密集更新,导致计算资源浪费。其核心发现是:在ALMs中,音频token对齐的梯度能量分布高度不均匀,仅有少量高梯度能量的音频区域主导优化信号。解决方案的关键在于提出一种基于token感知的梯度优化方法(Token-Aware Gradient Optimization, TAGO),通过在每次迭代中仅保留高梯度能量的token对应波形梯度,其余部分进行掩码处理,从而实现稀疏化攻击优化。实验表明,TAGO在多个ALMs上均优于基线方法,且在极低的token保留比例下仍能维持接近全量更新的攻击成功率(如Qwen3-Omni模型在保留25% token时攻击成功率为86%,与全量保留的87%相当),验证了密集波形更新的冗余性,并为未来音频安全研究提供了新的优化方向。

链接: https://arxiv.org/abs/2605.04700
作者: Zheng Fang,Xiaosen Wang,Shenyi Zhang,Shaokang Wang,Zhijin Ge
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, \mathrmASR_l remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
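TAGO 的核心操作可以概括为:按 token 对齐区间计算梯度能量,仅保留能量最高的一部分 token 对应的波形梯度,其余置零。以下为一个与具体框架无关的纯 Python 示意(区间划分与保留比例均为演示假设):

```python
def tago_mask(grads, token_spans, retention_ratio=0.25):
    """grads: 波形梯度列表; token_spans: 每个音频 token 对应的 (start, end) 区间。
    仅保留梯度能量最高的一部分 token 的梯度, 其余位置置零。"""
    energies = [sum(g * g for g in grads[s:e]) for s, e in token_spans]
    k = max(1, int(len(token_spans) * retention_ratio))
    keep = set(sorted(range(len(token_spans)),
                      key=lambda i: energies[i], reverse=True)[:k])
    masked = [0.0] * len(grads)
    for i, (s, e) in enumerate(token_spans):
        if i in keep:
            masked[s:e] = grads[s:e]
    return masked

# 4 个 token, 保留比例 0.25 时只保留能量最高的第 2 个 token 的梯度
masked = tago_mask([1.0, 1.0, 0.1, 0.1, 2.0, 2.0, 0.0, 0.0],
                   [(0, 2), (2, 4), (4, 6), (6, 8)], retention_ratio=0.25)
```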

[NLP-35] Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs


【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对语义不变但形式变化的提示(prompt)时,其输出格式稳定性不足的问题,即“提示变体输出模式坍缩”(prompt-variant output-mode collapse)现象:即使输入提示内容保持一致,仅通过词汇、句法或语义扩展生成的变体提示,也可能导致模型从要求的封闭格式(如单一标签或选择项)转向开放式的对话体文本,从而破坏评估流程中对答案准确性的判断。解决方案的关键在于构建了一个可量化的评测框架——PARACONSIST,包含900个提示样本(150个基础查询 × 5种提示变体),并引入“语义一致性得分”(Semantic Consistency Score),将模型鲁棒性分解为答案一致性、Sentence-BERT语义相似度和长度稳定性三个维度,揭示出任务结构是预测输出模式坍缩的主要因素,而非模型本身,强调响应模式保留应作为与答案准确性同等重要的可靠性指标进行审计。

链接: https://arxiv.org/abs/2605.04665
作者: Aofan Liu,Jingxiang Meng
机构: Peking University (北京大学); University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we observe a systematic failure mode we call prompt-variant output-mode collapse: when a closed-form prompt asks for a bare label or a single choice token, content-preserving prompt variants can push the model into conversational prose, the requested format dissolves, and exact-match evaluation pipelines silently misjudge the result. To make this measurable, we release PARACONSIST, a 900-prompt benchmark of 150 base queries with five lexical, syntactic, and semantic-expansion prompt variants each, and a Semantic Consistency Score that decomposes prompt-variant robustness into answer consistency, sentence-BERT semantic similarity, and length stability. Under a whole-word answer-set match, only ~22% of closed-form variant responses preserve the ground-truth label inside their output, while ~78% drift away from the answer space entirely. In our pool, the dominant predictor of collapse is task structure rather than model identity, with model differentiation jointly carried by answer consistency and length stability. Robustness audits should therefore track response-mode preservation as a first-class reliability target alongside answer accuracy.
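摘要将提示变体鲁棒性分解为答案一致性、语义相似度与长度稳定性三项。下面给出一个简化的打分示意,其中 Sentence-BERT 相似度以词集合 Jaccard 相似度作占位(各函数均为演示假设,并非论文的正式定义):

```python
def answer_consistency(base_answer, variant_answers):
    """变体回答与基准回答完全一致的比例。"""
    return sum(a == base_answer for a in variant_answers) / len(variant_answers)

def length_stability(lengths):
    """以最短/最长回答长度之比近似长度稳定性 (1 表示完全稳定)。"""
    return min(lengths) / max(lengths)

def token_jaccard(a, b):
    """以词集合 Jaccard 相似度作为语义相似度的占位实现。"""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# 封闭格式任务要求输出单个标签 "B"; 第 3 个变体坍缩为对话体长文本
ac = answer_consistency("B", ["B", "B", "The answer is B.", "C", "B"])
ls = length_stability([1, 1, 5, 1, 1])
```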

[NLP-36] CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning

【速读】: 该论文旨在解决时间知识图谱(Temporal Knowledge Graph, TKG)推理中预测能力受限的问题,其核心挑战在于如何协同捕捉两类预测信息:历史证据(historical evidence)与演化动态(evolutionary dynamics)。现有方法通常仅关注其中一类信息,忽略了二者之间的互补性。为此,作者提出CHE-TKG框架,其关键创新在于构建双视图学习机制,显式分离并联合建模历史证据图(捕获长期结构规律与稳定关系约束)与演化动态图(建模时间转移与近期变化),并通过关系分解和对比对齐目标增强跨视图的预测信号提取,从而实现更精准的未来事件预测。

链接: https://arxiv.org/abs/2605.04652
作者: Shuai-long Lei,Xiaobin Zhu,Jiarui Liang,Guoxi Sun,Zhiyu Fang,Xu-Cheng Yin
机构: University of Science and Technology Beijing (北京科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Temporal knowledge graph (TKG) reasoning aims to predict future events from historical facts. A key challenge lies in jointly capturing two sources of predictive information in TKGs: historical evidence and evolutionary dynamics. However, existing methods typically focus on only one of these sources, which limits the ability to fully exploit the complementary predictive signals in TKGs. To address this, we propose CHE-TKG, a novel collaborative dual-view learning framework for TKG reasoning. CHE-TKG explicitly separates and jointly models historical evidence and evolutionary dynamics, aiming to learn and exploit their complementary predictive signals. Specifically, CHE-TKG constructs a historical evidence graph to capture long-term structural regularities and stable relational constraints, alongside an evolutionary dynamics graph to model temporal transitions and recent changes, with dedicated encoders for each view. We further employ relation decomposition and a contrastive alignment objective to better capture the predictive signals across the two views. Extensive experiments demonstrate that CHE-TKG achieves state-of-the-art performance on multiple benchmarks.

[NLP-37] FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

【速读】: 该论文旨在解决预训练模型适应(adaptation)过程中存在的效率瓶颈问题:传统基于反向传播(backpropagation)的方法虽能实现高精度适应,但训练成本高昂;而依赖记忆或上下文的学习方式虽推理速度快,却带来显著的内存开销。解决方案的关键在于提出一种名为FAAST(Forward-Only Associative Adaptation)的新方法,其核心是通过单次前向传播将标注样本解析为快速权重(fast weights),从而消除对内存或上下文的依赖,实现常数时间推理,并将任务适应与预训练表征解耦。这一设计在图像分类和语言建模基准上实现了与基于反向传播方法相当甚至更优的性能,同时将适应时间减少90%以上,且相比记忆/上下文依赖方法节省高达95%的内存占用,展现出极高的效率与可扩展性。

链接: https://arxiv.org/abs/2605.04651
作者: Guangsheng Bao,Hongbo Zhang,Han Cui,Yanbin Zhao,Yue Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 6 figures, 10 tables

点击查看摘要

Abstract:Adapting pretrained models typically involves a trade-off between the high training costs of backpropagation and the heavy inference overhead of memory-based or in-context learning. We propose FAAST, a forward-only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant-time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop-based adaptation while reducing adaptation time by over 90% and is competitive to memory/context-based adaptation while saving memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource-constrained models. We release the code and models at this https URL.
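FAAST 将标注样本单次前向"解析式地编译"为快速权重;其具体闭式解未在摘要中给出,这里以最小二乘闭式解 W = (X^T X + λI)^(-1) X^T y 的二维特例示意同类思想(不涉及任何反向传播,仅作演示):

```python
def closed_form_fast_weights(features, targets, lam=0.01):
    """二维特征下 W = (X^T X + lam*I)^(-1) X^T y 的闭式解 (纯 Python 求 2x2 逆)。"""
    a = b = d = g0 = g1 = 0.0
    for (x0, x1), y in zip(features, targets):
        a += x0 * x0
        b += x0 * x1
        d += x1 * x1
        g0 += x0 * y
        g1 += x1 * y
    a += lam
    d += lam
    det = a * d - b * b  # X^T X 对称, 非对角元相等
    w0 = (d * g0 - b * g1) / det
    w1 = (a * g1 - b * g0) / det
    return w0, w1

# 目标函数 y = 2*x0 + 3*x1, 少量样本即可精确恢复权重 (lam=0 时)
w0, w1 = closed_form_fast_weights(
    [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [2.0, 3.0, 5.0], lam=0.0)
```

这类闭式求解只需一次遍历数据即可得到适配权重,这正是"前向即适配、推理为常数时间"思路的最小体现。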

[NLP-38] Graph-Augmented LLMs for Swiss MP Ideology Prediction

【速读】: 该论文旨在解决如何更准确地估计议会成员(Members of Parliament, MPs)意识形态立场的问题,传统方法通常仅依赖文本内容,忽略了议会系统中其他关键角色及其相互关系所蕴含的丰富信息。解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的框架PG-RAG,该框架首先从政治知识图谱(Political Knowledge Graph, KG)中检索与MPs相关的结构化关系信息,并将其整合到生成模型的上下文之中,从而同时捕捉文本语义与议员间的关联性,实现对MPs意识形态立场的更精准预测。

链接: https://arxiv.org/abs/2605.04643
作者: Yifei Yuan,Luis Salamanca,Sophia Schlosser,Laurence Brandenberger
机构: Swiss Data Science Center, ETH Zürich, Zürich, Switzerland; Department of Political Science, University of Zürich, Zürich, Switzerland
类目: Computation and Language (cs.CL)
备注: Accepted by SwissText 26

点击查看摘要

Abstract:Approximating the ideological position of Members of Parliament (MPs) is a fundamental task in political science, helping researchers understand legislative behavior, party alignment, and policy preferences. While Large Language Models (LLMs) have shown promising results in estimating MPs’ ideological stances, there are more actors and elements in the parliamentary system, and relations between them, that could provide a wider and more informative picture. However, due to the complexity of integrating them in the prediction task, these additional elements are generally ignored. In this work, we propose an LLM framework, PG-RAG, that implements a retrieval-augmented generation pipeline: it first queries a political knowledge graph (KG) and then integrates the resulting graph-structured information into the context. This allows for capturing both textual semantics and inter-MP relationships, another relevant information source in any parliamentary system. We evaluate the approach on the task of ideology prediction, using data from a Swiss parliamentary dataset. When comparing graph-augmented models against several state-of-the-art baselines, the results demonstrate that incorporating this enriched information, which encodes information about different entities and relations, improves prediction performance. These results help to highlight the value of domain-specific relational information in modeling political behavior.

[NLP-39] Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models ICML2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自由文本生成任务中缺乏可靠不确定性量化(Uncertainty Quantification, UQ)方法的问题,现有主流方法依赖采样策略,导致计算成本高且估计方差大。其解决方案的关键在于提出一种基于梯度的无采样方法 SemGrad,该方法创新性地将梯度分析从参数空间扩展至语义空间(semantic space),通过引入语义保持评分(Semantic Preservation Score, SPS)来识别最优语义嵌入表示,并据此计算梯度以衡量模型输出的稳定性;此外,进一步提出 HybridGrad 方法融合语义梯度与参数梯度的优势,从而实现高效且准确的不确定性估计,在存在多个有效响应的场景下显著优于当前最先进方法。

链接: https://arxiv.org/abs/2605.04638
作者: Mingda Li,Rundong Lv,Xinyu Li,Weinan Zhang,Ting Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operates in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret the stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, achieving superior performance than state-of-the-art methods, particularly in settings with multiple valid responses.

[NLP-40] TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

【速读】: 该论文旨在解决塔吉克语(Tajik language)在自然语言处理(Natural Language Processing, NLP)领域严重缺乏公开可用工具集的问题,这限制了语言学研究与实际应用的发展。解决方案的关键在于提出并实现了一个名为TajikNLP的开源Python库,其核心创新包括:基于统一Doc对象的模块化架构,支持从文本清洗到词性标注、词形还原、分词(含子词BPE)、句法分析等全流程处理;引入一种新型统一形态学引擎,能够以控制模式和深度模式有效解析塔吉克语高度屈折的名词与动词变位;同时整合基于词典的情感分析器及预训练Word2Vec/FastText词向量,并配套发布四个高质量、开源的语料库(POS标注语料库、情感词典、地名词典和个人姓名数据集),显著提升了塔吉克语处理的可复现性与实用性。

链接: https://arxiv.org/abs/2605.04583
作者: Mullosharaf K. Arabov
机构: Kazan Federal University (喀山联邦大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik’s agglutinative nominal and verbal inflections. The release further incorporates a lexicon-based sentiment analyser and pre-trained Word2Vec/FastText embeddings loaded directly from the Hugging Face Hub. To ensure reproducibility and facilitate future research, four accompanying linguistic datasets – a POS-tagged corpus (52.5k entries), a sentiment lexicon (3.5k entries), a toponym gazetteer (5.6k entries), and a personal names dataset (3.8k entries) – have been openly published under permissive licenses. The library’s reliability is validated by an extensive test suite of 616 automated tests achieving 93% source code coverage. TajikNLP thus establishes a foundational technological infrastructure for Tajik language processing, lowering the barrier to entry for both academic and industrial applications in low-resource Cyrillic-script environments.
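摘要描述的"以统一 Doc 对象为中心、各组件顺序作用"的模块化流水线,可用如下骨架代码示意(类名与组件接口均为假设,并非 TajikNLP 的真实 API):

```python
class Doc:
    """贯穿流水线的统一文档对象, 各组件在其上追加标注。"""
    def __init__(self, text):
        self.text = text
        self.tokens = []
        self.pos = []

class Pipeline:
    """按顺序将组件作用于同一个 Doc 对象。"""
    def __init__(self, components):
        self.components = components

    def __call__(self, text):
        doc = Doc(text)
        for component in self.components:
            component(doc)
        return doc

def tokenizer(doc):
    doc.tokens = doc.text.split()

def pos_tagger(doc):
    # 占位标注: 全部标为名词, 仅示意组件如何写回 Doc
    doc.pos = ["NOUN"] * len(doc.tokens)

nlp = Pipeline([tokenizer, pos_tagger])
doc = nlp("Забони тоҷикӣ")
```

真实工具包中的清洗、规范化、形态切分、词形还原等组件都可以按同一接口插入这条流水线。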

[NLP-41] Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus

【速读】: 该论文旨在解决塔吉克语(Tajik)自动词性标注(Part-of-Speech Tagging, POS)任务缺乏基准测试的问题,填补多语言模型在该语言语法分析能力上的研究空白。其关键解决方案是基于TajPersParallel语料库(约44,000个词典条目)对多种神经网络架构进行系统比较,包括传统BiLSTM-CRF模型与现代多语言Transformer模型(如mBERT、XLM-RoBERTa、ParsBERT和ruBERT),并采用参数高效微调方法LoRA进行适配。实验表明,mBERT + LoRA在无上下文依赖的孤立词级别标注中表现最优(宏F1分数为0.11,加权F1分数为0.62),揭示了塔吉克语与波斯语及俄语在语言类型学上的最近邻关系,为后续塔吉克语自然语言处理研究提供了基础。

链接: https://arxiv.org/abs/2605.04576
作者: Mullosharaf K. Arabov
机构: Kazan Federal University (喀山联邦大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:This paper presents the first benchmark for the task of automatic part-of-speech (POS) tagging for the Tajik language. Despite the existence of multilingual language models demonstrating high effectiveness for many of the world’s languages, their capacity for grammatical analysis of Tajik has remained unexplored until now. The aim of this study is to fill this gap through a systematic comparison of classical neural network architectures and modern multilingual transformers. Experiments were conducted on the TajPersParallel corpus, a parallel lexical resource comprising approximately 44,000 dictionary entries. Due to the absence of full-fledged example sentences in the current version of the corpus, the task was performed at the level of isolated lexical units, representing a challenging case of context-independent classification. The study compares the following architectures: a recurrent BiLSTM-CRF model, as well as multilingual models XLM-RoBERTa (large), mBERT, ParsBERT (Persian), and ruBERT (Russian), adapted using the parameter-efficient fine-tuning method LoRA. The testing results showed that the best performance is achieved by the mBERT + LoRA model (macro F1-score = 0.11, weighted F1-score = 0.62). It was established that in the absence of syntactic context, all models experience significant difficulty in resolving morphological ambiguity, successfully classifying primarily high-frequency classes (“noun,” “adjective”) while demonstrating zero effectiveness for rare function words. Zero-shot evaluation revealed the greatest typological proximity of Tajik to Persian (ParsBERT) and Russian (ruBERT). The obtained results form a foundation for further research and development in the field of automatic processing of the Tajik language. 

[NLP-42] Open-Source Image Editing Models Are Zero-Shot Vision Learners

【速读】: 该论文旨在解决开放源代码图像编辑模型是否具备无需任务特定微调(zero-shot)的视觉理解能力这一问题,尤其在密集视觉预测任务中。现有研究多依赖闭源模型或需指令微调,缺乏对公开可用图像编辑模型零样本性能的系统评估。其解决方案的关键在于:对三种独立训练的开源图像编辑模型(Qwen-Image-Edit、FireRed-Image-Edit 和 LongCat-Image-Edit)在多个基准数据集上进行无微调测试,涵盖单目深度估计(NYUv2、DIODE)、表面法向量估计(NYUv2)和语义分割(Cityscapes),结果表明这些模型在几何与语义场景理解任务中均展现出显著的零样本视觉推理能力,且部分指标优于或接近经过专门训练的模型,从而验证了零样本视觉能力可能是图像编辑预训练过程中的涌现特性(emergent property),而非单一模型特异性现象。

链接: https://arxiv.org/abs/2605.04566
作者: Wei Liu,Jiaxin Lin,Rui Chen
机构: Tencent Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo 3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models – Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit – on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals, FireRed-Image-Edit achieves a mean angular error of 17.69^\circ , surpassing the fine-tuned Marigold ( 20.86^\circ ) and matching the instruction-tuned Vision Banana ( 17.78^\circ ) without any task-specific training. On NYUv2 depth estimation, LongCat-Image-Edit obtains \delta_1=0.822 with affine alignment, and Qwen-Image-Edit leads on DIODE Indoor ( \delta_1=0.868 ). On Cityscapes semantic segmentation, Qwen-Image-Edit reaches 25.7 mIoU at the 19-class level and 49.5 mIoU at a coarser 7-category level. By comparing three independently trained editors, we test whether zero-shot vision ability is an emergent property of image-editing pretraining rather than a model-specific artifact. Code, evaluation scripts, and all results are publicly released to serve as a reproducible baseline for future work.
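摘要中深度估计结果的 "with affine alignment" 通常指在计算阈值精度 \delta_1 前,对预测深度做尺度-平移最小二乘对齐。以下为该评测流程的一个简化示意(一维纯 Python 实现,仅作演示):

```python
def affine_align(pred, gt):
    """求使 sum((s*pred + t - gt)^2) 最小的 (s, t), 返回对齐后的预测深度。"""
    n = len(pred)
    mp = sum(pred) / n
    mg = sum(gt) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    var = sum((p - mp) ** 2 for p in pred)
    s = cov / var
    t = mg - s * mp
    return [s * p + t for p in pred]

def delta1(pred, gt):
    """阈值精度 delta_1: max(pred/gt, gt/pred) < 1.25 的像素比例。"""
    ok = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < 1.25)
    return ok / len(pred)

# 预测深度整体差一个 2 倍尺度, 对齐后与真值完全一致
aligned = affine_align([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```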

[NLP-43] The Newsworthiness of Brazilian Distress: A Peak Analysis on Time Series of International Media Attention to Disasters in Brazil

【速读】: 该论文旨在解决国际媒体对本地灾害事件关注度不均衡的问题,特别是针对巴西的自然灾害与技术灾害为何在某些情况下能登上国际新闻头条这一现象缺乏系统性解释的困境。解决方案的关键在于构建代表性强、经过验证且针对特定国家(巴西)的新闻数据集,并通过时间序列分割方法识别德国报纸中关于巴西火灾和滑坡事件的新闻峰值,进而分析这些峰值是否与国内外灾害数据库中的观测结果在时间上具有同步性。

链接: https://arxiv.org/abs/2605.04552
作者: Brielen Madureira,Andreas Niekler,Marc Keuschnigg,Mariana Madruga de Brito
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Media coverage influences disaster response, yet the drivers of international media attention to local events remain unevenly understood. Brazil offers a compelling case: some of its natural and technological disasters occasionally hit the international headlines. However, systematic analyses of what makes these events be discussed abroad are still missing. Addressing this gap requires representative, validated and country-specific news datasets. This paper presents a peak analysis of 2k news about Brazilian fires and landslides in German newspapers from 2000 to 2024. Using time series segmentation to detect news event peaks, we examine the extent to which they can be temporally aligned with observations in national and global disaster databases.
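摘要中"时间序列分割以检测新闻事件峰值"的具体算法未给出;下面以最简单的"均值加 k 倍标准差"阈值法示意峰值检测的基本思路(仅作占位演示,并非论文所用方法):

```python
def detect_peaks(counts, k=2.0):
    """将高于 均值 + k*标准差 的时间点视为新闻峰值, 返回其索引。"""
    n = len(counts)
    mean = sum(counts) / n
    std = (sum((c - mean) ** 2 for c in counts) / n) ** 0.5
    threshold = mean + k * std
    return [i for i, c in enumerate(counts) if c > threshold]

# 模拟月度报道数: 第 5 个月出现灾害报道激增
peaks = detect_peaks([1, 0, 2, 1, 1, 30, 2, 1, 0, 1], k=2.0)
```

检测出的峰值索引随后可与灾害数据库中事件的发生时间做对齐比较。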

[NLP-44] UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于草稿-验证机制的推理加速问题,特别是当多步依赖与多草稿分支同时存在时,现有方法因将这两个维度孤立处理而导致优化不足的问题。解决方案的关键在于提出一种统一视角,将树结构验证建模为条件最优传输(conditional Optimal Transport, OT)问题,其中通过前缀接受概率抽象垂直依赖关系,并作为动态缩放因子引导水平草稿选择;在此基础上设计的 UniVer 算法通过在前缀约束下组合局部最优传输计划,实现跨树层级的联合优化,在保持与目标模型分布一致性的前提下显著提升接受长度(相比标准递归拒绝采样无放回策略提升 4.2%–8.5%)。

链接: https://arxiv.org/abs/2605.04543
作者: Yepeng Weng,Qiao Hu,Takehisa Yairi
机构: The University of Tokyo (东京大学); Lenovo AI Technology Center (联想人工智能技术中心); National Center for Mathematics and Interdisciplinary Sciences (NCMIS), AMSS, CAS (中国科学院数学与系统科学研究院国家数学与交叉科学中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation, applying either flat OT to single-step drafts or per-token rejection sampling to tree-structured candidates. This separation leaves the joint regime (where multi-step dependencies meet multi-draft branching) poorly optimized, as local verification rules fail to exploit the coupling between horizontal and vertical dimensions of candidate trees. In this paper, we propose a unified perspective that casts tree-based verification as a conditional OT problem. Our key insight is that vertical dependencies can be abstracted through prefix acceptance probabilities, which act as dynamic scaling factors to actively guide horizontal draft selection. Based on this principle, we introduce UniVer, a verification algorithm that jointly optimizes across tree levels by composing local optimal transport plans under prefix constraints. We prove that UniVer remains lossless and achieves the optimal acceptance rate under the proposed conditional framework. Extensive experiments across different tasks and models demonstrate that UniVer improves acceptance length by 4.2% to 8.5% over standard recursive rejection sampling without replacement, while maintaining exact distributional alignment with the target model.

[NLP-45] RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

【速读】: 该论文旨在解决生成式 AI(Generative AI)在知识密集型任务中因偏好信号偏差导致的逻辑对齐缺口问题:标准的直接偏好优化(Direct Preference Optimization, DPO)方法依赖人类标注或大语言模型(LLM)裁判提供的偏好信号,但这些信号存在系统性冗余偏倚(verbosity bias),即过度奖励流畅性而非逻辑正确性,从而导致监督微调(SFT)模型在自然语言推理(NLI)上的表现极低(仅0.05–0.22)。解决方案的关键在于提出 RLearner-LLM with Hybrid-DPO——一种自动化的混合偏好管道,融合 DeBERTa-v3 的 NLI 信号与验证器 LLM 的评分,无需人工标注即可有效消除单一信号优化带来的“对齐代价”(alignment tax),显著提升逻辑一致性与答案覆盖率,在多个学术领域和基础模型架构上均实现最高达6倍的NLI性能提升。

链接: https://arxiv.org/abs/2605.04539
作者: Qiming Bao,Juho Leinonen,Paul Denny,Michael J. Witbrock
机构: University of Auckland (奥克兰大学); Aalto University (阿尔托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap – SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the “alignment tax” of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output – alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.
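Hybrid-DPO 将 DeBERTa-v3 的 NLI 蕴含分数与验证器 LLM 打分融合为偏好信号;融合方式的细节未在摘要中给出,下面以加权平均为假设,示意如何从候选回答中自动构造 (chosen, rejected) 偏好对:

```python
def hybrid_score(nli_score, verifier_score, alpha=0.5):
    """以加权平均融合 NLI 蕴含分数与验证器打分 (均假设已归一到 [0,1])。"""
    return alpha * nli_score + (1 - alpha) * verifier_score

def build_preference_pair(candidates):
    """candidates: [(text, nli, verifier), ...]; 返回 (chosen, rejected)。"""
    ranked = sorted(candidates,
                    key=lambda c: hybrid_score(c[1], c[2]),
                    reverse=True)
    return ranked[0][0], ranked[-1][0]

# 冗长流畅但逻辑分低的回答被逻辑严谨的回答压过, 抵消单一信号的冗余偏倚
chosen, rejected = build_preference_pair([
    ("流畅但缺乏逻辑依据的回答", 0.1, 0.9),
    ("逻辑严谨且覆盖要点的回答", 0.9, 0.7),
])
```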

[NLP-46] RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

【速读】: 该论文旨在解决多语言生成评估任务(MTRAGEval)中生成质量与参考文本一致性之间的优化问题,核心挑战在于如何有效融合多个大语言模型(Large Language Models, LLMs)的输出以提升整体性能。解决方案的关键在于构建一个异构集成系统,包含七种不同架构、规模和提示策略的LLMs,并引入GPT-4o-mini作为判官模型对每个实例的候选生成结果进行筛选,从而实现最优输出选择。实验表明,该方法在26个参赛团队中排名第一,其条件调和均值达0.7827,显著优于最强基线模型(gpt-oss-120b,0.6390),且消融分析证实模型多样性与提示策略差异性是性能提升的核心因素。

链接: https://arxiv.org/abs/2605.04523
作者: Ivan Bondarenko,Roman Derunets,Oleg Sedukhin,Mikhail Komarov,Ivan Chernov,Mikhail Kulakov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost–performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: this https URL

[NLP-47] Distilling Bayesian Belief States into Language Models for Auditable Negotiation

【速读】: 该论文旨在解决生成式 AI(Generative AI)在谈判代理中的可审计性问题,即如何显式建模和传递对手信念(opponent beliefs),以提升决策的透明度与可信度。现有端到端大语言模型(LLMs)虽能模仿谈判对话,但其对手信念通常是隐式的、难以解析。解决方案的关键在于提出 BOND(Bayesian Opponent-belief Negotiation Distillation)框架:首先利用一个基于贝叶斯推理的大模型作为教师,对六种可能的对手优先级排序进行打分并更新后验分布,进而基于该后验进行菜单式决策;其次通过知识蒸馏将此信念信号压缩至一个8B参数的小型学生模型中,使其不仅能输出谈判动作,还能以标签化文本形式输出归一化的后验信念。实验表明,该方法在CaSiNo数据集上显著优于当前最优基线,并实现了高精度的信念校准(Brier分数低至0.114),同时支持多种诊断手段揭示信念与策略之间的解耦关系,从而实现更可解释的谈判行为分析。

链接: https://arxiv.org/abs/2605.04507
作者: Zongqi Cui,Baihan Lin
机构: Emory University; Icahn School of Medicine at Mount Sinai
类目: Computation and Language (cs.CL)
备注: Preprint. 24 pages, 6 figures, 18 tables. Code available at this https URL

点击查看摘要

Abstract:Negotiation agents must infer what their counterpart values, update those beliefs over dialogue turns, and choose actions under uncertainty. End-to-end large language models (LLMs) can imitate negotiation dialogue, but their opponent beliefs are usually implicit and difficult to inspect. We propose BOND (Bayesian Opponent-belief Negotiation Distillation), a framework for auditable negotiation. BOND consists of an LLM-based Bayesian teacher that scores dialogue contexts against the six possible opponent priority orderings, updates a posterior over those orderings, and uses the posterior for menu-based decision making, as well as a smaller 8B student language model that emits both negotiation actions and normalized posterior beliefs as tagged text. In the CaSiNo negotiation dataset, BOND outperforms the state-of-the-art and achieves mean Brier score 0.085 over opponent-priority posteriors. The distilled student preserves much of this belief signal, achieving Brier 0.114, below the uniform six-ordering reference of 5/36, approximately 0.139. Compared with a 70B structured-CoT baseline, the significantly smaller 8B student model yields substantially better elicited posterior calibration. We further showcase auditability through posterior trajectories, belief-versus-policy error decomposition, and posterior-prefix interventions. These diagnostics reveal that distillation preserves a scoreable belief report more strongly than causal belief-conditioned control, making weak belief-action coupling visible, not hidden.
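BOND 教师的核心操作是在六种对手优先级排序上做一步步的贝叶斯后验更新,并用 Brier 分数衡量后验校准程度。下面是一个自包含的极简 Python 示意:似然打分为虚构数值,Brier 按六个类别的均方误差计算(这与摘要中"均匀分布基准 5/36 ≈ 0.139"的口径一致)。

```python
import itertools

def bayes_update(prior, likelihoods):
    """按教师给出的似然打分,对六种排序的后验做一步贝叶斯更新。"""
    post = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(post)
    return [x / z for x in post]

def brier(posterior, true_idx):
    """六分类 Brier 分数:与 one-hot 真值的均方误差(对类别取均值)。"""
    return sum((p - (1.0 if i == true_idx else 0.0)) ** 2
               for i, p in enumerate(posterior)) / len(posterior)

# CaSiNo 中三种物资的全部 6 种优先级排序
orderings = list(itertools.permutations(["food", "water", "firewood"]))
prior = [1.0 / 6] * 6
scores = [0.9, 0.4, 0.3, 0.2, 0.1, 0.1]   # 教师对当前对话上下文的似然打分(虚构)
posterior = bayes_update(prior, scores)
uniform_ref = brier([1.0 / 6] * 6, 0)      # 均匀基准 = 5/36 ≈ 0.139
```

任何优于 `uniform_ref` 的 Brier 分数都说明后验携带了关于对手优先级的真实信息,这正是摘要中 0.085(教师)与 0.114(学生)两个数值的参照系。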

[NLP-48] SpecPL: Disentangling Spectral Granularity for Prompt Learning

【速读】: 该论文旨在解决视觉语言模型(VLMs)中提示学习(prompt learning)存在的模态不对称问题,即当前方法主要优化文本token,而依赖冻结的视觉编码器作为整体特征提取器,忽视了高频细节在细粒度区分中的关键作用。其解决方案的核心在于提出一种名为SpecPL的新框架,通过**反事实粒度监督(Counterfactual Granule Supervision)**实现频谱解耦:利用冻结的变分自编码器(VAE)将视觉信号分解为语义低频成分与粒度高频细节;同时引入冻结的视觉语义库(Visual Semantic Bank)锚定文本表示到通用低频不变性上以缓解过拟合,进而通过置换高频信号迫使模型显式区分视觉粒度与语义不变性,从而在稳定性和泛化能力之间取得更好平衡。

链接: https://arxiv.org/abs/2605.04504
作者: Jingtao Zhou,Xirui Kang,Feiyang Huang,Lai-Man Po
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on frozen visual encoder as holistic extractor and neglecting the spectral granularity essential for fine-grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at this https URL.

[NLP-49] Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties CONLL2026

【速读】: 该论文旨在解决低资源语言变体(low-resource language varieties)在多语言模型(Multilingual Language Models)开发中被忽视的问题,尤其是这些变体因与高资源语言存在显著差异而难以通过传统跨语言迁移学习有效建模。现有方法通常强调语言间的共性对齐,但忽略了语言差异本身作为泛化到未见变体的重要线索。解决方案的关键在于提出一种两阶段的语言泛化框架:首先设计TOPPing方法用于为低资源变体选择最优源语言;其次引入轻量级VACAI-Bowl架构,通过并行分支分别学习变体特异性特征(variety-specific attributes)和变体不变特征(variety-invariant attributes),其中后者利用对抗训练实现特征解耦。该框架在结构预测任务(如依存句法分析)上实现了平均54.62%的性能提升,验证了其对下游任务的泛化能力。

链接: https://arxiv.org/abs/2605.04500
作者: Jinju Kim,Haeji Jung,Youjeong Roh,Jong Hwan Ko,David R. Mortensen
机构: Sungkyunkwan University(成均馆大学); Carnegie Mellon University(卡内基梅隆大学); University of British Columbia(不列颠哥伦比亚大学); Electronics and Telecommunications Research Institute(电子与电信研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to CoNLL 2026

点击查看摘要

Abstract:Low-resource language varieties used by specific groups remain neglected in the development of Multilingual Language Models. A great deal of cross-lingual research focuses on inter-lingual language transfer which strives to align allied varieties and minimize differences between them. However, for low-resource varieties, linguistic dissimilarity is also an important cue allowing generalization to unseen varieties. Unlike prior approaches, we propose a two-stage Language Generalization framework that focuses on capturing variety-specific cues while also exploiting rich overlap offered by high-resource source variety. First, we propose TOPPing, a source-selection method specifically designed for low-resource varieties. Second, we suggest a lightweight VACAI-Bowl architecture that learns variety-specific attributes with one branch while a parallel branch captures variety-invariant attributes using adversarial training. We evaluate our framework on structural prediction tasks, which are among the few tasks available, as proxy for performance on other downstream tasks. Using VACAI-Bowl with TOPPing yields an average 54.62% improvement in the dependency parsing task, which serves as a proxy for performance on other downstream tasks across 10 low-resource varieties.

[NLP-50] SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States ICML2026

【速读】: 该论文旨在解决长文本理解(Long-Text Understanding, LTU)在百万级token规模下如何平衡推理保真度(reasoning fidelity)与计算效率的问题。当前前沿的长上下文大语言模型(Long-Context LLMs)虽能端到端处理海量文本,但存在token消耗高和注意力稀释(attention dilution)的问题;而专用LTU代理通常通过任务无关的抽象(如图结构构建或索引)牺牲保真度。论文提出的关键洞察是:查询相关的信息在文档中通常是稀疏的,因此有效推理应基于查询充分的子集而非整个上下文。解决方案的核心是SCOUT——一种从被动处理转向主动信息觅食(active information foraging)的新范式,将文档视为可探索环境,并通过状态级差距诊断(state-level gap diagnosis)自适应地在粗粒度到细粒度探索与锚定状态更新之间交替,逐步收缩其认知状态(epistemic state)直至达到查询充分性,从而在保持性能的同时将token消耗降低至最多8倍,且在上下文长度增长时仍保持稳定。

链接: https://arxiv.org/abs/2605.04496
作者: Zhenliang Zhang,Wenqing Wang,Yong Hu,Yaming Yang,Jiaheng Gao,Chen Shen,Xiaojun Wan
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026

点击查看摘要

Abstract:Long-Text Understanding (LTU) at million-token scale requires balancing reasoning fidelity with computational efficiency. Frontier long-context LLMs can process millions of token contexts end-to-end, but they suffer from high token consumption and attention dilution. In parallel, specialized LTU agents often sacrifice fidelity through task-agnostic abstractions like graph construction or indexing. We identify a key insight for LTU: query-relevant information is typically sparse relative to the full document, so effective reasoning should rely on a query-sufficient subset rather than the entire context. To address this, we propose SCOUT, a new paradigm for LTU that shifts from passive processing to active information foraging. It treats the document as an explorable environment and answers from a compact, provenance-grounded epistemic state. Guided by state-level gap diagnosis, SCOUT adaptively alternates between coarse-to-fine exploration and anchored state updates that progressively contract its epistemic state toward query sufficiency. Experiments show that SCOUT matches state-of-the-art proprietary models while reducing token consumption by up to 8x. Moreover, SCOUT remains stable as context length scales, substantially alleviating the practical cost-performance trade-off.
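SCOUT 把文档当作可探索环境、由"状态级差距诊断"驱动循环的思路,可以用一个玩具版觅食循环示意。以下 Python 代码中的差距诊断(未覆盖的查询词)与片段打分(词重叠)都是高度简化的假设;真实系统由 LLM 完成这两步。

```python
def forage(chunks, query_terms):
    """主动信息觅食的玩具版:反复诊断认知状态与查询之间的差距,
    把最能缩小差距的片段并入状态,直到达到查询充分性。"""
    state, remaining = [], list(chunks)

    def gap(st):  # 状态级差距诊断:尚未被覆盖的查询词
        covered = {w for c in st for w in c.split()}
        return [t for t in query_terms if t not in covered]

    while gap(state) and remaining:
        missing = set(gap(state))
        # 粗到细探索:挑选能覆盖最多缺口词的片段
        best = max(remaining, key=lambda c: len(missing & set(c.split())))
        if not missing & set(best.split()):
            break  # 剩余片段无法继续缩小差距
        remaining.remove(best)
        state.append(best)  # 锚定式状态更新
    return state, gap(state)

state, unresolved = forage(
    ["alice met bob", "bob owns a boat", "weather was fine"],
    ["alice", "boat"],
)
```

注意无关片段("weather was fine")始终不会进入认知状态,这正是"基于查询充分子集而非全上下文推理"带来 token 节省的来源。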

[NLP-51] CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中传统重排序方法仅优化查询-文档相关性而忽视生成有用性的局限性,即相关文档可能引入噪声,而低排名文档反而能有效降低生成器的不确定性。其解决方案的关键在于提出一种无需训练、可插拔的置信度感知重排序框架(Confidence-Aware Reranking, CAR),通过比较在仅查询条件和查询-文档条件下多次采样答案的语义一致性来估计生成器置信度变化,并将置信度提升的文档上移、置信度下降的文档下移,不确定情形保持原序,同时引入查询级门控机制避免对已高置信度查询的无效干预。实验表明,CAR在多个BEIR数据集上显著提升NDCG@5指标,且与下游生成F1指标高度正相关(Spearman ρ = 0.964)。

链接: https://arxiv.org/abs/2605.04495
作者: Zhipeng Song,Yizhi Zhou,Xiangyu Kong,Jiulong Jiao,Xuezhou Ye,Chunqi Gao,Xueqing Shi,Yuhang Zhou,Heng Qi
机构: Dalian University of Technology (大连理工大学); Liaodong University (辽东学院); Sun Yat-sen University (中山大学); Dalian Maritime University (大连海事大学); Tencent (腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) depends on document ranking to provide useful evidence for generation, but conventional reranking methods mainly optimize query-document relevance rather than generation usefulness. A relevant document may still introduce noise, while a lower-ranked document may better reduce the generator’s uncertainty. We propose CAR (Confidence-Aware Reranking), a query-guided, training-free, and plug-and-play reranking framework that uses generator confidence change as a document usefulness signal. CAR estimates confidence through the semantic consistency of multiple sampled answers under query-only and query-document conditions. Documents that significantly increase confidence are promoted, those that decrease confidence are demoted, and uncertain cases preserve the baseline order, while a query-level gate avoids unnecessary intervention on already confident queries. Experiments on four BEIR datasets show that CAR consistently improves NDCG@5 across sparse and dense retrievers, LLM-based and supervised rerankers, and four LLM backbones. Notably, CAR improves the YesNo reranker by 25.4 percent on average under Contriever retrieval, and its ranking gains strongly correlate with downstream generation F1 improvements, achieving Spearman rho = 0.964.
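CAR 的"置信度变化即文档有用性"信号可以用一个极简代理示意:用多次采样答案中多数答案的占比近似置信度,再按查询-文档条件相对仅查询条件的置信度变化把文档分入提升/保持/降低三档。以下 Python 代码中的阈值 `tau` 与门控值 `gate` 均为假设数值,真实的语义一致性由语义等价判断而非字符串计数得到。

```python
from collections import Counter

def consistency(answers):
    """语义一致性的简化代理:多数答案在采样中的占比。"""
    return Counter(answers).most_common(1)[0][1] / len(answers)

def car_rerank(base_order, query_conf, doc_answers, tau=0.1, gate=0.9):
    """按置信度变化重排;查询级门控跳过已足够自信的查询。"""
    if query_conf >= gate:
        return list(base_order)          # 不干预
    delta = {d: consistency(doc_answers[d]) - query_conf for d in base_order}

    def key(d):
        if delta[d] > tau:
            return (0, -delta[d], base_order.index(d))   # 显著提升置信度:上移
        if delta[d] < -tau:
            return (2, -delta[d], base_order.index(d))   # 降低置信度:下移
        return (1, 0.0, base_order.index(d))             # 不确定:保持原序
    return sorted(base_order, key=key)
```

使用示例:某文档让采样答案完全一致(置信度 1.0)则被提到最前,而让答案发散的文档被压到最后,其余保持初排顺序。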

[NLP-52] A Hybrid Method for Low-Resource Named Entity Recognition

【速读】: 该论文旨在解决低资源语言(越南语)在特定领域命名实体识别(Named Entity Recognition, NER)中面临的两大挑战:标注数据稀缺与标签体系异构性。解决方案的关键在于提出一种混合神经符号(neurosymbolic)框架,采用两阶段流水线设计:第一阶段通过规则驱动模块对复杂标签进行聚类以降低标签空间维度;第二阶段利用预训练语言模型(如RoBERTa)进行精细化微调,并结合后处理模块恢复细粒度标签,从而兼顾准确率与应用表达力。此外,创新性地引入基于大语言模型(Large Language Models, LLMs)的可扩展数据增强策略,在无需全量重标注的前提下扩充标签集,显著缓解了数据稀缺问题。实验表明,该方法在多个垂直领域(如物流、野生动物、医疗等)均优于强基线模型,验证了其对越南语语言特性和领域上下文建模的有效性。

链接: https://arxiv.org/abs/2605.04489
作者: Do Minh Duc,Quan Xuan Truong,Viet Tran Hong,Le Hoang Anh,Mac Thi Minh Tra,Nguyen Van Thuy,Le Hai Ha,Vinh Nguyen Van
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Published in Journal of Applied Data Sciences, Volume 7, Issue 2, pages 999–1019, 2026. Open access under CC BY 4.0

点击查看摘要

Abstract:Named Entity Recognition (NER) is a critical component of Natural Language Processing with diverse applications in information extraction and conversational AI. However, NER in specific domains for low-resource languages faces challenges such as limited annotated data and heterogeneous label sets. This study addresses these issues by proposing a hybrid neurosymbolic framework that integrates rule-based processing with deep learning models for Vietnamese NER. The core idea involves a two-stage pipeline: first, a rule-based component reduces label complexity by grouping relational and special categories; second, pre-trained language models are fine-tuned for high-precision extraction. A post-processing module is then utilized to restore fine-grained labels, preserving expressiveness for application-level usability. To mitigate data scarcity, a scalable data augmentation strategy leveraging Large Language Models (LLMs) is introduced to expand the label set without full re-annotation, which is a significant novelty of this work. The effectiveness of this method was evaluated across five specific-domain datasets, including logistics, wildlife, and healthcare. Experimental results demonstrate substantial improvements over strong RoBERTa-based baselines. Specifically, the proposed system achieved F1 scores of 90 percent in Customer Service, up from 83 percent; 84 percent in GAM, up from 73 percent; 83 percent in AI Fluent, up from 80 percent; 94 percent in PhoNER_Covid19, up from 91 percent; and 60 percent in Rare Wildlife, up from 36 percent. These findings confirm that the hybrid approach effectively captures the linguistic complexity of Vietnamese and contextual nuances in specialized domains, offering a robust contribution to low-resource NER research.
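两阶段流水线的"聚合—抽取—还原"过程可以用两个映射函数示意。以下 Python 代码中的标签分组表与还原词典均为虚构示例,并非论文的实际标签体系;真实系统中第二阶段由微调后的语言模型完成抽取。

```python
GROUP = {  # 第一阶段:规则模块把细粒度/关系型标签聚合为粗标签(映射为虚构示例)
    "DRUG_NAME": "MEDICAL", "DISEASE": "MEDICAL",
    "CARGO_TYPE": "LOGISTICS", "ROUTE": "LOGISTICS",
}

def coarsen(spans):
    """降低标签空间维度,供模型做高精度抽取。"""
    return [(text, GROUP.get(label, label)) for text, label in spans]

def restore(spans, lexicon):
    """后处理模块:借助词典把粗标签还原为细粒度标签,保留应用层表达力。"""
    return [(text, lexicon.get(text, label)) for text, label in spans]
```

对 `[("aspirin", "DRUG_NAME")]` 这样的标注,先聚合为 `MEDICAL` 训练与抽取,再由词典无损还原,体现"准确率与表达力兼顾"的设计。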

[NLP-53] Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练阶段常见的灾难性遗忘(catastrophic forgetting)问题,即在优化目标任务时导致先前习得能力显著退化。其核心解决方案是提出一种名为“锚定学习”(Anchored Learning)的框架,关键在于通过一个动态演化的移动锚点(moving anchor)显式控制优化过程中的分布漂移(distributional drift)。该锚点在当前模型与冻结参考模型之间插值,构建一个中间目标用于知识蒸馏,从而将全局微调转化为分布空间中的一系列局部信任区域更新。理论分析表明,该方法每轮迭代均具有线性KL散度上界,保障了模型分布过渡的稳定性;实验验证其在多个基准测试中实现了性能提升与稳定性的帕累托最优权衡,显著优于强基线方法。

链接: https://arxiv.org/abs/2605.04468
作者: Xinyu Wang,Changzhi Sun,Yuanbin Wu,Xiaoling Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Post-training large language models (LLMs) often suffers from catastrophic forgetting, where improvements on a target objective degrade previously acquired capabilities. Recent evidence suggests that this phenomenon is primarily driven by excessive distributional drift during optimization. Motivated by this perspective, we propose Anchored Learning, a simple framework that explicitly controls distributional updates during offline fine-tuning via a dynamically evolving moving anchor. Instead of matching a fixed reference distribution, the anchor interpolates between the current model and a frozen reference to construct an intermediate target that the model distills toward, transforming global fine-tuning into a sequence of local trust-region updates in distribution space. Theoretically, we prove this anchor-based update admits a linear KL-divergence upper bound per iteration, ensuring a stable transition between model distributions. Extensive experiments on iGSM, MedCalc, and IFEval show that Anchored Learning consistently lies on the Pareto frontier of gain-stability trade-offs, achieving near-optimal performance improvements while substantially reducing degradation compared to strong baselines. For example, while standard SFT suffers from over 53% performance degradation on iGSM and MedCalc, Anchored Learning slashes this drop to under 5% while maintaining near-optimal gains (e.g., 75.2% on iGSM).
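移动锚点的构造及其线性 KL 上界可以在单个 token 位置的分布上直接验证:锚点是当前分布与冻结参考分布的凸组合,由 KL 散度对第二个参数的凸性可得 KL(p, anchor) ≤ α·KL(p, ref)。以下 Python 示意中的分布数值均为虚构。

```python
import math

def kl(p, q):
    """离散分布的 KL 散度 KL(p || q)。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def moving_anchor(current, reference, alpha):
    """锚点:在当前模型分布与冻结参考分布之间做凸插值。"""
    return [(1 - alpha) * c + alpha * r for c, r in zip(current, reference)]

p = [0.7, 0.2, 0.1]   # 当前模型在某 token 位置的分布(虚构)
r = [0.3, 0.4, 0.3]   # 冻结参考模型的分布(虚构)
alpha = 0.3
anchor = moving_anchor(p, r, alpha)
# KL 对第二个参数是凸的,因此每步蒸馏目标离当前分布不超过 alpha * KL(p, r):
assert kl(p, anchor) <= alpha * kl(p, r) + 1e-12
```

这正是"把全局微调转化为分布空间中一系列局部信任区域更新"的含义:每轮只朝锚点移动,漂移被 α 线性控制。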

[NLP-54] GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking WWW AAAI2026

【速读】: 该论文旨在解决大规模多领域对话状态追踪(Dialogue State Tracking, DST)中大型语言模型(Large Language Models, LLMs)性能不足的问题,尤其是在从复杂多轮对话中精确提取结构化信息方面存在明显短板。其解决方案的关键在于提出一种图增强的专家混合架构(Graph-Enhanced Mixture-of-Experts, GEM),通过动态路由机制协调两类专用专家模块:一是基于图神经网络(Graph Neural Network, GNN)的结构化对话理解模块,用于建模对话轮次间的依赖关系和语义结构;二是微调后的T5-Small编码器-解码器用于序列建模。此外,针对复杂值生成任务,引入ReAct代理推理机制进行结构化推理,从而提升准确性与可解释性。实验表明,GEM在MultiWOZ 2.2数据集上达到65.19%的联合目标准确率,显著优于现有端到端LLM方法及主流SOTA模型,验证了结构化表示、动态专家选择与代理推理协同作用的有效性。

链接: https://arxiv.org/abs/2605.04449
作者: Ziqi Zhu,Adithya Suresh,Tomal Deb,Iman Abbasnejad
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure. Submitted to AAAI 2026. Also available at Amazon Science: this https URL

点击查看摘要

Abstract:Dialogue State Tracking (DST) requires precise extraction of structured information from multi-domain conversations, a task where Large Language Models (LLMs) struggle despite their impressive general capabilities. We present GEM (Graph-Enhanced Mixture-of-Experts), a novel framework that combines language models and graph-structured dialogue understanding with ReAct agent-based reasoning for superior DST performance. Our approach dynamically routes between specialized experts: a Graph Neural Network that captures dialogue structure and turn-level dependencies, and a finetuned T5-Small encoder-decoder for sequence modeling, coordinated by an intelligent router. For complex value generation tasks, we integrate ReAct agents that perform structured reasoning over dialogue context. On MultiWOZ 2.2, GEM achieves 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing state-of-the-art (SOTA) methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%). Our graph-enhanced mixture-of-experts architecture with ReAct integration demonstrates that combining structured dialogue representation with dynamic expert routing and agent-based reasoning provides a powerful paradigm for dialogue state tracking, achieving superior accuracy while maintaining computational efficiency through selective expert activation.

[NLP-55] Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting

【速读】: 该论文旨在解决自然语言文本在大模型处理过程中因冗余信息导致的计算效率低下问题,尤其是在保持关键语义完整性前提下的高效压缩。传统方法如LLMLingua-2通过训练分类器以固定比例删除低重要性token进行压缩,但难以适应不同文档的信息密度差异。本文提出Telegraph English (TE)协议,其核心创新在于将自然语言重构为符号丰富、形式结构化的“事实行”(fact lines),利用约40个逻辑与关系符号替代冗长表达,并使压缩比自适应于文档内容密度;这种结构化重写不仅实现了高效率压缩,还天然生成可独立访问的事实单元,从而将压缩与语义分块统一为同一操作,显著提升小模型在细粒度任务上的表现(最高达11个百分点)。

链接: https://arxiv.org/abs/2605.04426
作者: Mikhail L. Arbuzov,Sisong Bei,Ziwei Dong,Dmitri Kalaev,Alexey A. Shvets
机构: Palo Alto Networks (帕洛阿尔托网络)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Telegraph English (TE), a prompt-compression protocol that rewrites natural language into a symbol-rich, formally-structured dialect. Where token-deletion methods such as LLMLingua-2 train a classifier to delete low-importance tokens at a fixed ratio, TE performs a full semantic rewrite: it decomposes the input into atomic fact lines, substitutes verbose phrases with ~40 logical and relational symbols, and lets the compression ratio adapt to each document’s information density. A consequence of the line-structure rule is that compression and semantic chunking become the same operation – each output line is an independently addressable fact, so the compressed representation is simultaneously a semantic index. We evaluate TE on 4,081 question-answer pairs from LongBench-v2 across five OpenAI models and two difficulty levels. At roughly 50% token reduction, TE preserves 99.1% accuracy on key facts with GPT-4.1 and outperforms LLMLingua-2 at matched compression ratios on every model and task tested. The gap widens on smaller models – up to 11 percentage points on fine-detail tasks – suggesting that explicit relational structure compensates for limited model capacity. We release the grammar specification, compression prompt, benchmark data, and reference implementation.
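TE 的"事实行 + 符号替换"机制可以用一个极小的规则替换器示意压缩比与语义分块的来源。以下 Python 代码中的符号映射只是论文约 40 个符号中的虚构片段;真实协议由 LLM 按完整语法规范做语义级重写,而非简单字符串替换。

```python
SYMBOLS = {  # 约 40 个逻辑/关系符号中的虚构片段示例
    "is equal to": "=", "greater than": ">", "leads to": "→",
    "and": "&", "therefore": "∴", "because": "∵",
}

def compress(text):
    """极简示意:逐句拆成事实行,并做短语 -> 符号替换。"""
    lines = []
    for sent in filter(None, (s.strip() for s in text.split("."))):
        for phrase, sym in SYMBOLS.items():
            sent = sent.replace(phrase, sym)
        lines.append(sent)  # 每行一个可独立寻址的事实
    return "\n".join(lines)

src = "Revenue is equal to 5M and costs fell. Margin greater than last year."
out = compress(src)
ratio = len(out) / len(src)  # 压缩比随文档信息密度自适应变化
```

输出的每一行既是压缩结果又是语义索引单元,这正是"压缩与语义分块成为同一操作"的含义。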

[NLP-56] Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

【速读】: 该论文旨在解决多大语言模型(Large Language Models, LLMs)在异构计算资源上并发部署时的资源利用率低和成本高的问题。随着LLM使用场景日益碎片化,云服务商提供多种中低端GPU,其单位性价比接近顶级硬件但调度复杂度显著增加。解决方案的关键在于提出Coral系统,该系统通过联合优化所有模型副本的资源分配与服务策略,实现对异构资源的自适应调度;并采用无损两阶段分解方法,在保持全局最优性的同时将在线求解时间从数小时缩短至数十秒,从而显著降低服务成本并提升在资源受限条件下的吞吐量(goodput)。

链接: https://arxiv.org/abs/2605.04357
作者: Yixuan Mei,Zikun Li,Zixuan Chen,Shiqi Pan,Mengdi Wu,Xupeng Miao,Zhihao Jia,K. V. Rashmi
机构: Carnegie Mellon University; Peking University
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79x over the best baseline, and delivers up to 2.39x higher goodput under scarce resource availability.

[NLP-57] Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在显式计算资源约束下进行知识蒸馏时,如何实现训练成本降低与推理效率提升的双重目标。现有参数高效蒸馏方法(如LoRA)虽能减少适配开销,但未改变密集主干结构,因而无法带来实质性的推理阶段计算节省。其解决方案的关键在于提出预算LoRA(Budgeted LoRA),将模型压缩建模为一个结构化的计算分配问题:通过设定全局计算预算(即保留的密集计算比例),动态调整模块级密集保留系数、自适应低秩分配以及后训练压缩策略,从而在单一预算控制下生成一系列结构高效的教师-学生模型。实验表明,该方法在保持语言建模性能的同时显著提升了推理速度,并在函数式上下文学习任务中优于传统方案,揭示了在计算受限蒸馏中,行为保真度的核心不在于单纯减少参数或匹配困惑度,而在于对密集计算向低秩路径的有效转移控制。

链接: https://arxiv.org/abs/2605.04341
作者: Mohammed Sabry,Anya Belz
机构: ADAPT Centre, Dublin City University, Ireland(爱尔兰都柏林城市大学ADAPT中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. 9 pages main text, 18 pages total, 2 figures, 9 tables

点击查看摘要

Abstract:We study distillation for large language models under explicit compute constraints, with the goal of producing student models that are not only cheaper to train, but structurally efficient at inference time. While prior approaches to parameter-efficient distillation, such as LoRA, reduce adaptation cost, they leave the dense backbone unchanged and therefore fail to deliver meaningful inference savings. We propose Budgeted LoRA, a distillation framework that treats model compression as a structured compute allocation problem. Instead of using a fixed student architecture, we introduce a global compute budget that sets the final target fraction of dense computation retained. Under this constraint, the model redistributes capacity across dense and low-rank pathways via (i) module-level dense retention coefficients, (ii) adaptive low-rank allocation, and (iii) post-training compression that selectively removes, approximates, or preserves dense components. This formulation yields a family of students controlled by a single budget dial. Empirically, Budgeted LoRA matches standard LoRA perplexity at a moderate budget with a 1.74x compressed-module speedup; at an aggressive budget it achieves a 4.05x speedup with moderate perplexity degradation, and it preserves higher accuracy on function-style in-context learning probes. These results suggest that, under compute-constrained distillation, retaining behavior is less about matching perplexity or removing more parameters than it is about controlling how dense computation is transferred to low-rank pathways.
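"预算即旋钮"的计算分配可以用单个模块的账目示意:给定全局预算(保留的密集计算比例)与该模块的密集保留系数,把剩余预算折算成低秩路径的秩。以下 Python 代码中的分配规则是示意性假设,并非论文的实际优化过程;乘加计数也只是前向代价的一阶近似。

```python
def lora_cost(d_in, d_out, rank):
    """低秩路径一次前向的乘加次数(近似):x @ A (d_in x r) @ B (r x d_out)。"""
    return d_in * rank + rank * d_out

def allocate(d_in, d_out, budget, dense_keep):
    """在 budget(保留的密集计算比例)内,先保留 dense_keep 比例的
    密集计算,再把剩余预算折算为低秩秩。"""
    dense = d_in * d_out                       # 完整密集层的乘加次数
    leftover = budget * dense - dense_keep * dense
    rank = max(0, int(leftover // (d_in + d_out)))
    cost_ratio = (dense_keep * dense + lora_cost(d_in, d_out, rank)) / dense
    return rank, cost_ratio

# 示例:1024x1024 模块、全局预算 50%、密集保留 25%
rank, cost_ratio = allocate(d_in=1024, d_out=1024, budget=0.5, dense_keep=0.25)
```

调低 `budget` 会同时压缩密集与低秩两条路径的容量,对应摘要中由单一预算旋钮控制的学生模型族。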

[NLP-58] NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise ACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言场景下进行因果推理时难以区分相关性与因果性的问题,尤其是在存在结构化噪声(如无关干扰项、值扰动、混杂因素和部分可观测性)的情况下。其解决方案的关键在于提出一个名为NoisyCausal的新基准和一种模块化推理框架:首先通过可控噪声注入生成基于真实因果图的自然语言实例以模拟现实复杂性;其次设计了一种结合LLM与显式因果结构的方法,引导模型从上下文中提取变量、构建因果图,并将推理任务重构为基于该图的结构化提示。此方法利用符号化因果结构指导推理过程,而非仅依赖统计模式,从而实现更可解释且鲁棒的因果推断能力。

链接: https://arxiv.org/abs/2605.04313
作者: Zhi Xu,Yun Fu
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL oral accept; 5 figures, 8 tables

点击查看摘要

Abstract:Causal reasoning in natural language requires identifying relevant variables, understanding their interactions, and reasoning about effects and interventions, often under noisy or ambiguous conditions. While large language models (LLMs) exhibit strong general reasoning abilities, they struggle to disentangle correlation from causation, particularly when observations are partially incorrect or irrelevant information is present. In this work, we introduce NoisyCausal, a new benchmark designed to evaluate causal reasoning under structured noise. Each instance is generated from a ground-truth causal graph and contextualized with a natural language scenario by injecting controllable forms of noise, such as irrelevant distractors, value perturbations, confounding, and partial observability. Moreover, we propose a modular reasoning framework that combines LLMs with explicit causal structure to address these challenges. Our method prompts the LLM to extract variables, construct a causal graph from context, and then reformulates the reasoning task as a structured prompt grounded in this graph. Rather than relying on statistical patterns alone, the LLM is guided by symbolic structure, enabling more interpretable and robust inference. Experimental results show that our method significantly outperforms standard prompting and reasoning baselines on NoisyCausal. Furthermore, it generalizes well to external benchmarks such as Cladder without task-specific tuning. Our findings highlight the importance of combining causal abstractions with language-driven reasoning to achieve faithful and robust causal understanding in LLMs.

[NLP-59] SWAN: Semantic Watermarking with Abstract Meaning Representation ACL2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)文本水印在面对语义不变的改写(如 paraphrasing)时鲁棒性不足的问题。现有水印方法通常通过调整文本生成过程中的词元选择偏好来嵌入签名,但这类方法在语义保持不变的改写下容易失效。解决方案的关键在于将水印签名直接嵌入到句子的抽象意义表示(Abstract Meaning Representation, AMR)语义结构中,使得只要语义不变,无论句法如何变化,水印签名都能被保留。该方法无需训练,仅通过提示(prompting)引导大语言模型(LLM)生成符合特定 AMR 模板的句子,并结合现成的 AMR 解析器与统计检验(one-proportion z-test)实现高效检测,实验证明其在保持高检测准确率的同时显著提升了对改写的鲁棒性。

链接: https://arxiv.org/abs/2605.04305
作者: Ziping Ye,Gourab Dey,Christos Christodoulopoulos,Charith Peris,Anil Ramakrishna,Weitong Ruan,Aram Galstyan,Kai-Wei Chang,Rahul Gupta,Ninareh Mehrabi
机构: Amazon; Information Commissioner’s Office; Meta; USC; UCLA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注: Accepted to ACL 2026 Main

点击查看摘要

Abstract:We introduce SWAN (Semantic Watermarking with Abstract Meaning Representation), a novel framework that embeds watermark signatures into the semantic structure of a sentence using Abstract Meaning Representation (AMR). In contrast to existing watermarking methods, which typically encode signatures by adjusting token selection preferences during text generation, SWAN embeds the signature directly in the sentence’s semantic representation. As the signature is encoded at the semantic structure level, any paraphrase that preserves meaning automatically preserves the signature. SWAN is training-free: watermark injection is achieved by prompting an LLM to generate sentences guided by a selected AMR template while maintaining contextual coherence, and detection uses an off-the-shelf AMR parser followed by a simple one-proportion z-test. Empirical evaluation on the RealNews benchmark shows SWAN matches state-of-the-art detection performance on unaltered watermarked text, while significantly improving robustness against paraphrasing, increasing detection AUC by up to 13.9 percentage points compared to prior methods. These results demonstrate that SWAN’s approach of anchoring watermarks in AMR semantic structures provides a simple, effective, and prompt-based method for robust text provenance verification under paraphrasing, opening new avenues for semantic-level watermarking research.
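SWAN 的检测端只需要现成 AMR 解析器加一个单比例 z 检验:统计 n 个句子中命中签名模板的数量,与无水印文本的基准命中率 p0 比较。以下 Python 示意中的 p0 与样本数均为虚构;单尾 p 值用互补误差函数由标准正态尾概率得到。

```python
import math

def z_test(matches, n, p0):
    """单比例 z 检验:n 个句子中有 matches 个命中签名模板,
    原假设命中率为 p0(无水印文本的基准率,数值为假设)。"""
    p_hat = matches / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # 单尾 p 值
    return z, p_value

z, p = z_test(matches=40, n=100, p0=0.1)  # 命中率远高于基准
detected = p < 0.01
```

当命中率等于基准率时 z 为 0、p 值为 0.5,不会误报;命中率显著偏高时 p 值迅速趋近于 0,即判定存在水印。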

[NLP-60] Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning ACL2026

【速读】: 该论文旨在解决复杂图表问答任务中多子图跨步推理能力不足的问题,即现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理包含多个子图的复杂图表时,难以实现对视觉细节的精确感知与多步骤逻辑推理的协同。其解决方案的关键在于提出一种分层视觉代理框架(Hierarchical Visual Agent, HierVA),通过高阶管理者生成任务计划并维护紧凑的联合图像-文本工作空间上下文,同时由专用工作者执行局部推理、证据收集与结果返回;该框架创新性地分离视觉与文本上下文,并引入缩放工具限定视觉关注范围,从而提升推理效率与准确性。

链接: https://arxiv.org/abs/2605.04304
作者: Qihua Dong,Ruozhen He,Junwen Chen,Yizhou Wang,Xu Ma,Songyao Jiang,Yun Fu
机构: Northeastern University (东北大学); Rice University (莱斯大学); Amazon AGI (亚马逊AGI)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Advanced chart question answering requires both precise perception of small visual elements and multi-step reasoning across several subplots. While existing MLLMs are strong at understanding single plots, they often struggle with multi-step reasoning across multiple subplots. We propose HierVA, a hierarchical visual agent framework for chart reasoning that iteratively constructs and updates a working context in a joint image–text space. A high-level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results. In particular, the agent maintains separate visual and textual contexts, using a zoom-in tool to restrict the visual context. Experiments on the CharXiv reasoning subset demonstrate consistent improvements over strong multimodal baselines, and ablation studies verify that hierarchical architecture, scoped visual context, and distilled context contribute complementary gains.

[NLP-61] Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs

【速读】: 该论文旨在解决自动化作文评分(Automated Essay Scoring, AES)研究中依赖排名相关性指标所带来的局限性问题,这些问题会掩盖写作能力结构内在的维度间相关性和晕轮效应(halo effect),从而导致高相关性可能掩盖系统真实的诊断行为。其解决方案的关键在于提出一种新颖的自参照评估框架(self-referential assessment evaluation framework),该框架聚焦于识别单个学习者在各评分维度上的相对强弱项(即基于个体内部差异的诊断性分析),而非传统的跨学习者排名比较。通过在ICNALE GRA这一高密度二语写作数据集上应用双facet Rasch模型校准评分者严重度并生成可靠参考分数,作者对比了人类评分者与三种大语言模型(Large Language Models, LLMs)在零样本条件下的分析评分表现,结果表明LLMs在识别学习者的相对弱点(负向反馈)方面优于单个评分者,而人类评分者在识别相对优势(正向反馈)方面更具优势,从而验证了以学习者内部 profile 为基础的评估方法对AES系统开发和部署的价值。

链接: https://arxiv.org/abs/2605.04298
作者: Stefano Bannò,Kate Knill,Mark Gales
机构: ALTA Institute, Department of Engineering, University of Cambridge (UK)
类目: Computation and Language (cs.CL)
备注: Accepted for the 21st Workshop on Innovative Use of NLP for Building Educational Applications

点击查看摘要

Abstract:Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system’s true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.
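"自参照"评估关注的是学习者内部各维度的相对强弱而非跨学习者排名,其核心操作只是把各维度分数对该学习者自身的均值做中心化。以下 Python 示意中的维度名与分数均为虚构,真实参考分数由双侧面 Rasch 模型校准得到。

```python
def profile(scores):
    """自参照画像:各维度分数减去该学习者自身均值,
    得到相对强项(正)与相对弱项(负)。"""
    mean = sum(scores.values()) / len(scores)
    return {dim: round(s - mean, 3) for dim, s in scores.items()}

learner = {"grammar": 3.2, "vocabulary": 2.8, "cohesion": 3.0, "content": 3.8}
prof = profile(learner)
strengths = [d for d, v in prof.items() if v > 0]    # 正向反馈候选
weaknesses = [d for d, v in prof.items() if v < 0]   # 负向反馈候选
```

按这一口径,评估系统比较的是"对同一学习者,谁更准地找出其强弱项",从而绕开排名相关指标中维度间相关与晕轮效应的干扰。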

[NLP-62] Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining

【速读】: 该论文旨在解决材料科学领域中从海量科研文献中提取结构化数据的难题,尤其是实验细节常隐藏于文本、表格、图表等非结构化内容中,导致数据库构建过程高度依赖人工、效率低下且难以扩展。解决方案的关键在于提出Material Database Agent (MDA),一个模块化的多智能体系统架构,通过并行处理PDF文章的文本与图像内容,由多个子智能体分别解析markdown文件和科学图表以生成每篇论文的子数据库,最终由汇总智能体整合为统一的表格数据库,从而实现高效、可扩展的材料数据库自动化构建。

链接: https://arxiv.org/abs/2605.04278
作者: Achuth Chandrasekhar,Omid Barati Farimani,Radheesh Sharma Meda,Amir Barati Farimani
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Materials science workflows rely on structured and unstructured data from the vast body of available scientific literature. However, most of the experimental details remain buried in text, tables, graphs and figures. Thus, constructing databases that incorporate this data is a manual, time-consuming, and hard-to-scale process. Multimodal large language models have made it feasible to extract information from text and scientific figures with high speed and accuracy. This opens the possibility of an AI system that can create production-scale material databases. Material Database Agent (MDA) is a modular, multi-agent system architecture for converting research literature into structured databases. MDA accepts article PDFs as input, which are subsequently processed in parallel into markdown files and figures. Multiple sub-agents read these markdown files and figures in parallel to assemble sub-databases for each paper. These sub-databases are then compiled into a single tabular database by an agent. As opposed to using either a rule-based approach or a single-pass pipeline for extracting information, MDA is a specialized architecture for transforming the literature into a database in the field of materials science. More generally, this study provides a basis for positioning multimodal agentic information extraction as a viable means for constructing next-generation scientific databases from the primary literature.

[NLP-63] Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

【速读】: 该论文旨在解决从牙科病历记录中进行临床命名实体识别(Clinical Named Entity Recognition, CNER)的难题,此类文本具有高度非结构化、领域特定性强及隐私敏感等特点。其解决方案的关键在于构建一个可本地部署的框架,使小型语言模型能够自主生成、验证、优化并评估针对不同实体类型的提示(prompt),从而实现多类临床实体的精准提取;通过多提示集成推理筛选候选模型,并结合基于QLoRA的监督微调与直接偏好优化(Direct Preference Optimization, DPO)进行轻量化后训练,显著提升了模型性能,最终在本地部署场景下实现了高效、可扩展的临床信息抽取能力。

链接: https://arxiv.org/abs/2605.04221
作者: Yao-Shun Chuang,Tushti Mody,Uday Pratap Singh,Shirindokht Shiraz,Chun-Teh Lee,Ryan Brandon,Muhammad F Walji,Xiaoqian Jiang,Bunmi Tokede
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.
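摘要同时报告了 micro 与 macro F1。下面的示意展示二者在按实体类型汇总时的区别:macro 先对每类求 F1 再平均,micro 先汇总各类 TP/FP/FN 再求 F1(各实体类型的计数均为虚构,与论文数据无关):

```python
def f1(tp, fp, fn):
    """由 TP/FP/FN 计算 F1。"""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: {实体类型: (tp, fp, fn)};返回 (micro F1, macro F1)。"""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro

# 虚构的两类临床实体计数:多数类表现好、少数类表现差
counts = {"diagnosis": (90, 10, 10), "procedure": (40, 20, 20)}
micro, macro = micro_macro_f1(counts)
```

当各实体类型表现不均衡时 micro 通常高于 macro,这也是论文同时报告两者的原因。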

[NLP-64] Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

【速读】: 该论文旨在解决注意力机制中相对位置编码(relative positional encoding)的表达能力局限问题,特别是如何更有效地建模距离调制的相位交互(distance-modulated phase interactions)。现有方法如RoPE(旋转位置编码)和ALiBi(基于距离偏置的注意力)分别通过旋转相位或加性距离偏置来编码位置信息,但难以捕捉复杂的位置依赖结构。其解决方案的关键在于提出一种基于约当块(Jordan block)的非半单表示——Exact Jordan-RoPE,该方法将一个复数旋转特征值与一个幂零响应耦合在同一缺陷约当块中,从而生成包含振荡-多项式特征的新型相对算子,例如 $ e^{-\gamma d}\cos(\omega d) e^{-\gamma d}\sin(\omega d) d e^{-\gamma d}\cos(\omega d) $ 和 $ d e^{-\gamma d}\sin(\omega d) $,这本质上构建了一个距离调制的相位基底 $ d e^{i\omega d} $,而非简单叠加额外的距离通道。此构造在保持群律精确性的前提下,提升了对具有距离调制相位结构的任务的建模能力。

链接: https://arxiv.org/abs/2605.04217
作者: Yaobo Zhang
机构: Ningxia University (宁夏大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 15 pages, 4 figures, 6 tables; code available at this https URL

点击查看摘要

Abstract:Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as e^{-\gamma d}\cos(\omega d), e^{-\gamma d}\sin(\omega d), d e^{-\gamma d}\cos(\omega d), and d e^{-\gamma d}\sin(\omega d), for causal lag d = i - j \geq 0. Thus the construction realizes a distance-modulated phase basis d e^{i\omega d}, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
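摘要中的振荡-多项式基可以按公式直接生成。以下最小示意计算四个特征在因果滞后 d ≥ 0 上的取值(γ、ω 的取值为随意假设,并非论文设定):

```python
import math

def jordan_basis(d, gamma, omega):
    """缺陷约当块诱导的四个相对位置特征:
    e^{-γd}cos(ωd), e^{-γd}sin(ωd), d·e^{-γd}cos(ωd), d·e^{-γd}sin(ωd)。"""
    decay = math.exp(-gamma * d)
    c, s = math.cos(omega * d), math.sin(omega * d)
    return (decay * c, decay * s, d * decay * c, d * decay * s)

# 假设参数;d 为 query-key 的因果滞后 d = i - j >= 0
feats = [jordan_basis(d, gamma=0.1, omega=0.5) for d in range(4)]
```

与纯 RoPE 的区别在后两个分量:距离 d 乘在相位包络上,而不是作为独立的加性距离通道。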

[NLP-65] Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在低资源非洲语言上的机器翻译性能缺乏系统评估的问题,尤其是针对加纳43种本土语言与英语之间的零样本翻译能力。其解决方案的关键在于构建了一个名为Nsanku的基准测试体系,该体系通过来自YouVersion圣经平台的每种语言300句对进行标准化评估,并采用BLEU和chrF两种自动指标及一致性维度综合衡量模型表现。结果揭示了现有LLMs在加纳语言翻译中仍存在显著性能与稳定性不足的问题,为非洲语言自然语言处理研究提供了首个公开、可扩展的评估基础设施。

链接: https://arxiv.org/abs/2605.04208
作者: Stephen E. Moore,Mich-Seth Owusu,Akwasi Asare,Lawrence Adu Gyamfi,Paul Azunre,Joel Budu,Jonathan Asiamah,Elias Dzobo,Kelvin Newman,Edmund O. Benefo,Gerhardt Datsomor,Onesimus Addo Appiah,Ama Branoa Banful,Lucas Woedem Kpatah,Saani Mustapha Deishini,John Ayernor
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive multilingual capabilities for well-resourced languages, yet their performance on low-resource African languages remains poorly understood and largely unevaluated. This paper presents Nsanku, a systematic benchmark that evaluates the zero-shot machine translation performance of 19 open-weight and proprietary LLMs across 43 Ghanaian languages paired with English. Evaluation sentences were sourced from the YouVersion Bible platform, providing 300 sentence pairs per language. Two complementary automatic metrics are employed: Bilingual Evaluation Understudy (BLEU) and Character n-gram F-Score (chrF), alongside an average accuracy score and a cross-language consistency dimension. Nsanku represents the most comprehensive LLM translation evaluation for Ghanaian languages conducted to date. Results show that gemini-2.5-flash achieves the highest overall average score of 26.88 (BLEU: 24.60, chrF: 29.16), followed by claude-sonnet-4-5 at 24.87 (BLEU: 22.46, chrF: 27.28) and gpt-4.1 at 23.20 (BLEU: 21.15, chrF: 25.24). Among open-weight models, kimi-k2-instruct-0905 leads at an average score of 20.87. A critical finding from the consistency analysis is that no model and no language reached the Leaders quadrant of high performance and high consistency simultaneously, indicating that current LLMs are not yet reliably usable for Ghanaian language translation at scale. Siwu achieved the highest per-language average score at 25.73 while Nkonya scored lowest at 11.65. Nsanku establishes a publicly available, community-extensible evaluation infrastructure for African language NLP research.
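chrF 本质上是字符 n-gram 的 F 值。下面给出一个仅用于说明思路的简化实现(只取 n = 1..2、β = 2,在空白处理、截断与平均方式等细节上与 sacreBLEU 的官方 chrF 并不等价):

```python
from collections import Counter

def char_ngrams(text, n):
    """去掉空格后统计字符 n-gram。"""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hyp, ref, max_n=2, beta=2.0):
    """简化版字符 n-gram F 值(示意,非 sacreBLEU 等价实现)。"""
    f_scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())          # n-gram 重叠计数
        p, rec = overlap / sum(h.values()), overlap / sum(r.values())
        f = (1 + beta**2) * p * rec / (beta**2 * p + rec) if p + rec else 0.0
        f_scores.append(f)
    return 100 * sum(f_scores) / len(f_scores)

perfect = simple_chrf("me da wo ase", "me da wo ase")
partial = simple_chrf("me da", "me da wo ase")
```

对低资源语言,chrF 这种基于字符的指标通常比词级 BLEU 对形态变化更宽容,这也是论文同时采用两者的原因。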

[NLP-66] The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation

【速读】: 该论文旨在解决多语言神经机器翻译(Multilingual Neural Machine Translation, MNMT)中知识迁移机制不明确的问题,尤其是词汇重叠(vocabulary overlap)在跨语言知识迁移中的作用尚缺乏系统研究。论文通过设计对比实验,在域外(out-of-domain)场景下分别使用共享词汇表(joint vocabulary)与非共享词汇表(disjoint vocabulary),并引入相关和不相关的辅助语言,以量化词汇重叠、语言相关性及领域匹配对模型性能的影响。其解决方案的关键在于:实证表明,尽管词汇重叠有助于提升性能,但语言相关性和领域匹配程度比词汇共享更为重要,从而揭示了MNMT中知识迁移的核心驱动力并非单纯依赖词汇一致性。

链接: https://arxiv.org/abs/2605.04196
作者: Oona Itkonen,Jörg Tiedemann
机构: University of Helsinki (赫尔辛基大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge transfer, especially across related languages, has been found beneficial for multilingual neural machine translation (MNMT), but some aspects are still under-explored and deserve further investigation. A joint vocabulary is most often applied to form a uniform word embedding space, but since the impact of a disjoint vocabulary on model performance is far less studied, there is no consensus on how much knowledge transfer is mainly due to vocabulary overlap. In this paper, we present systematic experiments with joint and disjoint vocabularies, and auxiliary languages related and unrelated to the source language. We design this experiment in an out-of-domain setup in order to emphasize transfer and the impact of the auxiliary language. As expected, we yield better results with more extensive vocabulary overlaps typical for related languages, but our experiments also show that domain-match and language relatedness are more important than a joint vocabulary.
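词汇重叠程度可以用简单的集合度量来量化。以下示意用 Jaccard 相似度与覆盖率比较两个(虚构的)子词词表,并非论文实际使用的度量或数据:

```python
def vocab_overlap(vocab_a, vocab_b):
    """返回 (Jaccard 相似度, vocab_a 被 vocab_b 覆盖的比例)。"""
    a, b = set(vocab_a), set(vocab_b)
    inter = a & b
    return len(inter) / len(a | b), len(inter) / len(a)

# 虚构词表:相关语言共享更多子词,不相关语言几乎无重叠
related = vocab_overlap(["hus", "bil", "tre"], ["hus", "bil", "baum"])
unrelated = vocab_overlap(["hus", "bil", "tre"], ["ie", "mti", "gari"])
```

论文的结论正是:即便这种重叠为零(disjoint vocabulary),只要领域匹配且语言相关,迁移依然可观。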

[NLP-67] MedFabric and ETHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗领域中产生事实性错误生成(fabrication)的问题,即模型生成看似合理但内容不真实的陈述,这对临床决策构成重大风险。现有医学幻觉数据集因覆盖范围有限、人类写作与LLM生成文本在风格上的差异以及合成样本分布漂移等问题,难以有效捕捉此类现象。论文提出的关键解决方案是构建一个数据驱动的流水线——MedFabric,用于生成具有语法和语义一致性的细粒度词汇级虚假陈述;在此基础上进一步设计ETHER(模块化词汇级伪造检测器),其核心机制包括Text2Table分解、词掩码与填充以及混合句子对评估,从而显著提升对医疗文本中微小事实偏差的识别能力,在词汇级伪造检测任务上相比现有最优方法性能提升超过15%,且在结构相似性变化下保持稳定表现。

链接: https://arxiv.org/abs/2605.04180
作者: Tung Sum Thomas Kwok,Qian Qian,Xiaofeng Lin,Dongxu Zhang,Jun Han,Zhichao Yang,Davin Hill,Tamer Soliman,Sanjit Singh Batra,Robert Tillman,Guang Cheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models exhibit strong reasoning and semantic understanding capabilities but often hallucinate in domains that require expert knowledge, among which fabrications, the generation of factually incorrect yet fluent statements, pose the greatest risk in medical contexts. Existing medical hallucination datasets inadequately capture fabrication phenomena due to limited fabrication coverage, stylistic disparities between human and LLM-authored texts, and distributional drift during hallucinated sample synthesis. To address this, we propose a data-centric pipeline to generate realistic and word-level fabrications that preserve syntactic and stylistic fidelity while introducing subtle factual deviations, resulting in MedFabric. Building upon this dataset, we introduce ETHER, a modular word-level fabrication detector integrating Text2Table Decomposition, Word Masking and Filling, and Hybrid Sentence Pair Evaluation to enhance factual alignment. Empirical results demonstrate that ETHER outperforms state-of-the-art detectors by over 15% on word-level fabrication benchmarks while maintaining consistent performance across structural similarities, offering a comprehensive framework for reliable and domain-specific factuality detection.

[NLP-68] Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在冲突监测任务中因系统性偏差导致的输出失真问题,尤其关注其对人道主义问责的影响。研究发现,通用开源模型存在显著的“虚假非正当化偏倚”(False Illegitimation bias),例如Gemma将18.29%的合法战斗误判为针对平民的暴力行为,而无“虚假正当化错误”;相比之下,经过领域适配的模型AfroConfliBERT和AfroConfliLLAMA表现出接近方向中立的特性,且在对抗性扰动下具有更强鲁棒性。然而,领域适配并未消除基于行为体的选择偏倚——两类适配模型仍显著倾向于正当化国家行为体(如尼日利亚境内比非国家行为体高36.5%)。解决方案的关键在于:实施面向公平性的微调以减少行为体选择偏倚、强制开展对抗鲁棒性评估以抵御语义框架操纵,并建立区域差异化的“人在回路”监督机制。

链接: https://arxiv.org/abs/2605.04177
作者: Hoffmann Muki,Olukunle Owolabi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As LLMs enter conflict monitoring, understanding systematic distortions in their outputs is critical for humanitarian accountability. We evaluate four vanilla open-weight models: Gemma 3 4B, Llama 3.2 3B, Mistral 7B, and OLMo 2 7B, and two domain-adapted models, AfroConfliBERT and AfroConfliLLAMA, on Nigeria and Cameroon conflict-event classification against ACLED, a gold-standard dataset with multi-stage verification. We find a bifurcated divergence in normative directionality. Open-weight models exhibit statistically significant False Illegitimation bias: Gemma misclassifies up to 18.29% of legitimate battles as civilian-targeted violence while making zero False Legitimation errors. By contrast, AfroConfliBERT and AfroConfliLLAMA achieve near-directional neutrality, with Legitimization Bias differences indistinguishable from zero. Yet domain adaptation does not eliminate actor-based selection bias. Both adapted models show statistically significant actor bias comparable to vanilla LLMs; in Nigeria, state actors are legitimized 36.5% more often than non-state actors in identical tactical contexts. Open-weight outputs are also fragile to geography-specific lexical framing: delegitimizing phrases produce flip rates up to 66.7% in Cameroon and 34.2% in Nigeria, while perturbations salient in one context may not matter in another. Error trace profiling shows models mask normative bias through unfaithful rationale confabulations. In contrast, AfroConfliBERT and AfroConfliLLAMA are largely robust, with near-zero flip rates across perturbation categories. Overall, current models are not ready for unsupervised deployment in conflict monitoring. We call for fairness-aware fine-tuning to reduce actor-based selection bias, mandatory adversarial robustness evaluation against lexical manipulation, and context-specific human-in-the-loop oversight calibrated to regional difficulty.
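摘要中的 False Illegitimation / False Legitimation 率可以从金标签与预测标签直接统计:前者是合法战斗被误判为针对平民的暴力,后者方向相反。以下为一个最小示意,标签体系与数据均为虚构:

```python
def directional_error_rates(records):
    """records: [(gold, pred)] 列表,标签为 'battle'(合法战斗)
    或 'civilian_violence'(针对平民的暴力)。
    返回 (False Illegitimation 率, False Legitimation 率)。"""
    battles = [(g, p) for g, p in records if g == "battle"]
    civ = [(g, p) for g, p in records if g == "civilian_violence"]
    fi = sum(1 for g, p in battles if p == "civilian_violence") / len(battles)
    fl = sum(1 for g, p in civ if p == "battle") / len(civ)
    return fi, fl

# 虚构结果:4 起战斗中 1 起被“非正当化”,4 起针对平民的暴力全部判对
records = [("battle", "battle")] * 3 + [("battle", "civilian_violence")] + \
          [("civilian_violence", "civilian_violence")] * 4
fi_rate, fl_rate = directional_error_rates(records)
```

两个方向的错误率不对称(此例中 0.25 对 0.0)正是摘要所说“规范方向性偏倚”的量化形式。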

[NLP-69] Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成学术内容时存在的幻觉(hallucination)问题,尤其是当模型被用于参考文献生成、事实解释、摘要撰写和写作改进等任务时,其输出常出现事实错误或逻辑不一致的现象。解决方案的关键在于设计了一套系统化的评估框架,包括80个覆盖四类典型学术写作任务的提示(prompts),并引入了一个新的加权指标——幻觉指数(Hallucination Index, HI),以量化模型响应中的幻觉程度。该指标综合考量了事实准确性、参考文献有效性、连贯性、风格一致性及学术语气等多个维度,从而更精准地识别和比较不同LLMs(如ChatGPT、Grok、Gemini和Copilot)在特定任务下的幻觉行为差异,揭示出幻觉风险不仅与模型架构有关,还显著受任务类型和提示条件的影响。

链接: https://arxiv.org/abs/2605.04171
作者: Humam Khan,Md Tabrez Nafis,Shahab Saquib Sohail,Aqeel Khalique,Rehan Hasan Khan
机构: 未知
类目: Computation and Language (cs.CL)
备注: 6 pages, 4 figures, 2 tables, conference accepted and presented paper

点击查看摘要

Abstract:Large Language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when we use them for generating Academic content. We have investigated four popular LLMs, ChatGPT, Grok, Gemini, and Copilot for hallucinations specifically for academic writing. We have designed 80 prompts across four categories, namely, reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the model using a 0-5 rubric score, which checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to check errors which alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks, but they often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Whereas, Gemini and ChatGPT have done well with having stronger tone control, but they lack in writing factual tasks and higher hallucination risk with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior does not depend solely on model architecture but also on the type of task and the prompting conditions we are providing. We propose that our work opens new research dimensions for future researchers.
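论文未公开 Hallucination Index 的具体权重与公式。以下仅示意“对 0-5 分制的多维评分加权汇总并归一化为 [0,1] 指数”的一种可能做法,维度名称与权重均为假设(HI 越高表示幻觉越严重):

```python
def hallucination_index(scores, weights):
    """scores: {维度: 0-5 评分,分数越高越好};weights: 对应权重。
    返回 [0,1] 区间的 HI,越高表示幻觉越严重。"""
    assert set(scores) == set(weights)
    total_w = sum(weights.values())
    weighted = sum(weights[k] * scores[k] for k in scores) / total_w
    return 1 - weighted / 5  # 满分 -> HI = 0

# 假设的维度与权重(与论文评分细则无关)
weights = {"factual_accuracy": 0.35, "reference_validity": 0.25,
           "coherence": 0.15, "style": 0.1, "tone": 0.15}
good = hallucination_index({k: 5 for k in weights}, weights)
bad = hallucination_index({k: 0 for k in weights}, weights)
```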

[NLP-70] FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

【速读】: 该论文旨在解决跨编程语言和应用场景的机器生成代码检测问题,尤其关注模型在未见过的语言和领域中的泛化能力。其解决方案的关键在于设计了一种轻量级、计算高效的检测框架:首先通过比率特征(ratio-based features)降低代码片段长度对检测结果的影响;其次利用解析引擎与编程语言分类器提取描述性信号,并引入独立训练的代码-文本行分类器识别嵌入的自然语言段落;最终结合浅层决策树与基于数据分析的启发式规则进行预测,从而在仅需CPU资源训练且近乎实时推理的前提下,实现优于大型预训练模型的检测性能。

链接: https://arxiv.org/abs/2605.04157
作者: Elitsa Yotkova,Violeta Kastreva,Dimitar Dimitrov,Ivan Koychev,Preslav Nakov
机构: Sofia University “St. Kliment Ohridski”, Bulgaria; Mohamed bin Zayed University of Artificial Intelligence, UAE
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods. We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples. We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and achieves near-instant inference time, offering a lightweight alternative to large pretrained models.
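摘要提到的“比率特征”可按如下思路示意:对代码片段计算注释行占比、空行占比、平均标识符长度等对片段长度不敏感的比率(特征集合为举例,并非论文的完整特征工程):

```python
import re

def ratio_features(code):
    """对长度不敏感的风格比率特征(示意)。"""
    lines = code.splitlines()
    n = len(lines) or 1
    comment_ratio = sum(1 for l in lines if l.strip().startswith("#")) / n
    blank_ratio = sum(1 for l in lines if not l.strip()) / n
    # 简单按正则抽取标识符样式的词元(包含关键字与注释中的单词)
    idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    avg_ident_len = sum(map(len, idents)) / len(idents) if idents else 0.0
    return {"comment_ratio": comment_ratio,
            "blank_ratio": blank_ratio,
            "avg_ident_len": avg_ident_len}

snippet = "# add two numbers\n\ndef add(a, b):\n    return a + b\n"
feats = ratio_features(snippet)
```

由于各特征都是比率或均值,同一写作风格的长短片段会落在相近的特征区域,这正是“降低片段长度敏感性”的出发点。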

[NLP-71] Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

【速读】: 该论文旨在解决生成式 AI(Generative AI)能力评估文献中存在的“出版诱因差距”(publication elicitation gap)问题,即当前学术论文所评估的模型往往显著落后于同期前沿模型(如GPT-5.5 Pro或Claude Opus 4.7),且常以抽象化表述(如“AI”而非具体模型)呈现结果,导致对技术真实进展的认知偏差。研究通过预注册审计112,303条LLM相关记录发现,就中位数而言,论文评估的模型比评估时的技术前沿落后约10.85个Epoch AI能力指数(ECI)单位(其中约25%可归因于同行评审延迟,其余约75%为超额滞后),且该差距每年以5.53 ECI的速度扩大。解决方案的关键在于强化报告透明度:提出VERSIO-AI检查清单(共13项,含3项核心拒稿项),强制披露模型快照、推理模式、工具调用、提示工程等配置细节,从而缩小评估与前沿之间的认知鸿沟,并辅以API访问补贴和编辑政策执行以推动落地。

链接: https://arxiv.org/abs/2605.04135
作者: David Gringras,Misha Salahshoor
机构: Harvard University (哈佛大学); AISST, Harvard University (哈佛大学人工智能与社会研究中心)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 60 pages, 9 figures, 7 tables, 8 appendices. Pre-registered on OSF: this https URL (DOI: https://doi.org/10.17605/OSF.IO/7XM3D , registered 2026-04-17). Companion artefacts: VERSIO-AI v1.2 reporting checklist (Appendix A; CC-BY-4.0); frontierlag Python package ( this https URL , MIT) and per-DOI audit tool at this https URL

点击查看摘要

Abstract:Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about “AI” that propagate through citations, media, and policy. We measure the ‘publication elicitation gap’ (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of “AI” rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at this http URL.

[NLP-72] Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

【速读】: 该论文旨在解决当前多模态基础模型在视觉理解、文本到图像生成以及指令引导的图像编辑任务中缺乏统一架构与协同优化的问题。解决方案的关键在于提出 JoyAI-Image,其核心是一个耦合空间增强型多模态大语言模型(Spatially Enhanced Multimodal Large Language Model, MLLM)与多模态扩散Transformer(Multimodal Diffusion Transformer, MMDiT)的统一架构,通过共享的多模态接口实现感知与生成之间的双向交互。该设计结合了统一指令微调、长文本渲染监督、空间对齐数据及通用与空间编辑信号的训练策略,从而在保持广泛多模态能力的同时,显著提升几何感知推理与可控视觉合成能力,推动模型从一般视觉能力向更强的空间智能演进。

链接: https://arxiv.org/abs/2605.04128
作者: Lin Song,Wenbo Li,Guoqing Ma,Wei Tang,Bo Wang,Yuan Zhang,Yijun Yang,Yicheng Xiao,Jianhui Liu,Yanbing Zhang,Guohui Zhang,Wenhu Zhang,Hang Xu,Nan Jiang,Xin Han,Haoze Sun,Maoquan Zhang,Haoyang Huang,Nan Duan
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

[NLP-73] Position: The Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities

【速读】: 该论文试图解决模型坍缩(model collapse)问题,即生成式AI模型在使用先前模型输出的数据进行训练时性能下降的现象,这一问题加剧了数据退化、文化偏见强化及资源利用效率低下,进而威胁到人工智能民主化的进程。其解决方案的关键在于识别并缓解模型坍缩对低资源和边缘化群体的不平等影响,通过优化训练效率与恢复数据分布的多样性(尤其是尾部分布),从而减少环境负担与文化偏差,推动更公平、可持续的AI发展路径。

链接: https://arxiv.org/abs/2605.04127
作者: Devon Jarvis,Richard Klein,Benjamin Rosman,Steven James,Stefano Sarao Mannelli
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 13 pages, 1 figure, International Conference on Machine Learning

点击查看摘要

Abstract:Model collapse, the degradation in performance that arises when generative models are trained on the outputs of prior models, is an increasing concern as artificially generated content proliferates. Related critiques of large language models have highlighted their tendency to reproduce frequent patterns in training data, their reliance on vast datasets, and their substantial environmental cost. Together, these factors contribute to data degradation, the reinforcement of cultural biases, and inefficient resource use. In this position paper we aim to combine these views and argue that model collapse threatens current efforts to democratize AI. By reducing training efficiency and skewing data distributions away from the tails of their support, model collapse disproportionately impacts low-resource and marginalized communities. We examine both the environmental and cultural implications of this phenomenon, situate our position within recent position papers on model collapse, and conclude with a call to action. Finally, we outline initial directions for mitigating these effects.

[NLP-74] TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

【速读】: 该论文旨在解决生产环境中工具调用(Tool Use)失败率高的问题,特别是针对小型语言模型(4B-14B参数规模)在面对大规模工具目录时因JSON格式的工具Schema与语言模型理解能力不匹配而导致的性能瓶颈。其核心解决方案是提出TSCG(Tool Schema Compiler),一个无需模型访问、微调或运行时搜索的确定性编译器,通过将JSON Schema转换为高效的结构化文本表示,在API边界直接消除协议错位。TSCG的关键创新在于结合八种可组合的操作符,并提供形式化的压缩边界(对良好构造的Schema可达51%压缩率),从而显著提升小模型在真实工具调用任务中的准确率(如Phi-4 14B在20个工具下从0%恢复至84.4%),同时保持高token效率(节省52–57% token)。

链接: https://arxiv.org/abs/2605.04107
作者: Furkan Sakizli
机构: Independent Researcher
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 6 figures, 23 tables. Code, benchmark suite, and evaluation logs: this https URL

点击查看摘要

Abstract:Production agent frameworks (OpenAI Function Calling, Anthropic Tool Use, MCP) transmit tool schemas as JSON, a format designed for machine parsing, not for interpretation by language models. For small models (4B-14B), this protocol mismatch accounts for the majority of tool-use failure at production catalog sizes. We present TSCG, a deterministic tool-schema compiler that resolves this mismatch at the API boundary, converting JSON schemas into token-efficient structured text without model access, fine-tuning, or runtime search. TSCG combines eight composable operators with a formal compression bound (=51% on well-formed schemas). On TSCG-Agentic-Bench (about 19,000 calls, 12 models, 5 scenarios), TSCG restores Phi-4 14B from 0% to 84.4% accuracy at 20 tools (90.3% at 50 tools) and achieves 108-181% accuracy-retained ratio across three models on BFCL. Format-versus-compression decomposition (R^2=0.88 - 0.03) establishes representation change as the dominant mechanism. Per-operator isolation across three frontier models reveals three distinct operator-response profiles: operator-hungry (Opus 4.7), operator-sensitive (GPT-5.2), and operator-robust (Sonnet 4), providing per-model deployment guidance. Scaling experiments show accuracy advantages persisting on heavy production MCP schemas (+5.0 pp at about 10,500 input tokens) despite saturation on light synthetic catalogs, with 52-57% token savings throughout. The synthetic benchmark generalizes to real MCP schemas within 0.1 accuracy points. TSCG ships as a 1,200-line zero-dependency TypeScript package.
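TSCG 的八个算子并未在摘要中逐一列出。下面是一个体现其总体思路的极简示意:把 JSON 格式的工具 Schema 确定性地压缩为一行结构化文本(压缩语法与字段选择均为假设,并非论文实现):

```python
import json

def compile_tool_schema(schema_json):
    """把 JSON 工具 Schema 确定性地展平为紧凑文本(示意,非 TSCG 算子集)。"""
    s = json.loads(schema_json)
    params = s.get("parameters", {}).get("properties", {})
    required = set(s.get("parameters", {}).get("required", []))
    parts = []
    for name, spec in params.items():
        mark = "" if name in required else "?"   # 可选参数加 "?" 标记
        parts.append(f"{name}{mark}:{spec.get('type', 'any')}")
    return f"{s['name']}({', '.join(parts)}) - {s.get('description', '')}"

schema = json.dumps({
    "name": "get_weather",
    "description": "Get current weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"},
                                  "unit": {"type": "string"}},
                   "required": ["city"]}})
compact = compile_tool_schema(schema)
```

压缩是纯文本变换,不需要访问模型、微调或运行时搜索,这正是“在 API 边界消除协议错位”的含义。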

[NLP-75] HERCULES: Hardware-Efficient Robust Continual Learning Neural Architecture Search

【速读】: 该论文旨在解决当前神经架构搜索(Neural Architecture Search, NAS)方法在从静态基准测试向真实世界部署过渡过程中,仅关注硬件效率已无法满足现代AI系统需求的问题。具体而言,传统NAS方法忽视了鲁棒性(Robustness)和持续学习(Continual Learning)能力,而这二者对于资源受限环境下的可靠性与长期适应性至关重要。解决方案的关键在于提出一个三重维度的分类框架——即效率、鲁棒性和持续学习,并基于此构建名为HERCULES(Hardware-Efficient, Robust, and ContinUal LEarning Search)的新框架,通过整合多目标优化策略与计算成本控制机制,实现对搜索空间的有效探索与多目标权衡,从而推动可部署、具备终身学习能力的AI系统的算法-架构-软硬件协同设计发展。

链接: https://arxiv.org/abs/2605.04103
作者: Matteo Gambella,Fabrizio Pittorino,Manuel Roveri
机构: 未知
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 21 pages, 1 figure

点击查看摘要

Abstract:Neural Architecture Search (NAS) has emerged as a powerful framework for automatically discovering neural architectures that balance accuracy and efficiency. However, as AI transitions from static benchmarks to real-world deployment, the traditional focus on hardware-aware efficiency is no longer sufficient. We observe that modern NAS methods, especially those that target edge AI, are evolving to address a triple objective: Efficiency, Robustness, and Continual Learning. While efficiency ensures feasibility in resource-constrained environments, robustness guarantees reliability under environmental variabilities, and continual learning enables adaptation to sequential tasks without catastrophic forgetting. We propose a taxonomy of NAS approaches through this triple lens, distinguishing between methods targeting resource optimization, environmental resilience, and architectural plasticity. This unified perspective reveals that these axes, though often studied in isolation, are mutually reinforcing. Building on this taxonomy, we map the current landscape of these NAS methods into a new framework called Hardware-Efficient, Robust, and ContinUal LEarning Search (HERCULES). We define the desiderata, the twelve labours of HERCULES, addressing the non-trivial challenge of balancing an adequate search-space exploration with the immense computational costs of a multi-objective NAS, accounting for these crucial objectives of current AI systems. By identifying critical gaps in existing research, this survey outlines a roadmap toward integrated algorithmic, architectural, and hardware-software co-design for truly deployable, lifelong-learning AI systems.

[NLP-76] Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

【Quick Read】: This paper addresses the lack of systematic methods for assessing the patient-safety risks posed by clinical text summaries generated with large language models (LLMs). LLMs are increasingly used in healthcare, but their outputs may introduce errors or omit key information, affecting clinical decisions and patient safety, while existing evaluations are mostly empirical or retrospective and cannot identify risks prospectively. The key to the solution is a novel structured framework, developed and validated here, based on Failure Mode, Effects, and Criticality Analysis (FMECA): a taxonomy of 14 failure-mode categories is combined with 5-point ordinal scales quantifying occurrence, severity, and detectability, enabling systematic, reproducible identification and assessment of risks in LLM-generated clinical summaries. The framework was empirically tested on real-world clinical data and showed good reliability, usability, and content validity, providing an actionable risk-management tool for the safe deployment of LLMs in healthcare.

Link: https://arxiv.org/abs/2605.04085
Authors: Lydie Bednarczyk, Jamil Zaghir, Julien Ehrsam, Maria Tcherepanova, Christian Skalafouris, Karim Gariani, Catherine Geslin, Claire-Bénédicte Rivara, Pascal Bonnabry, Laetitia Gosetto, Richard Dubos, Mina Bjelogrlic, Christophe Gaudet-Blavignac, Christian Lovis
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
Comments:

Abstract:Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries. Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at failure mode, severity and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback. Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence. Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries. 

[NLP-77] Connecting online criminal behavior with machine learning: Using authorship attribution to analyze and link potential online traffickers

【Quick Read】: This paper tackles the difficulty law-enforcement agencies face in identifying and tracking online criminal activity (such as human trafficking and illicit trade), where offenders hide behind anonymous accounts and frequently switch identities. The core challenge is mining latent links and repeated behavioral patterns from large volumes of anonymous online advertisements. The key to the solution is data-driven machine learning: even when users try to hide their identities, they exhibit stable behavioral signatures in how they write ad copy and present images. By analyzing such content at scale, the research proposes effective methods for linking related accounts and identifying illicit networks across platforms, accompanied by ethical guidelines on privacy, fairness, and transparency to keep the technology within compliant and responsible bounds.

Link: https://arxiv.org/abs/2605.04080
Authors: Vageesh Kumar Saxena
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Doctoral thesis

Abstract:This research investigated how online criminal activities can be better understood and connected using data-driven machine learning methods. Many illegal activities, such as human trafficking and illicit trade, have moved to online platforms where offenders hide behind anonymous accounts and frequently change identities. This makes it difficult for authorities to understand how large these networks are and how different online profiles may be linked. The research shows that people tend to maintain consistent patterns in how they write advertisements and present images online, even when they try to stay anonymous. By analysing these patterns across large collections of online advertisements, the research demonstrates how to link related accounts and identify repeated behaviour across illegal online markets. In addition, the research also addresses how such methods should be used responsibly. It proposes clear guidelines to ensure that privacy, fairness, and transparency are respected when these tools are applied. Overall, the research provides practical ways to support law enforcement investigations while emphasising careful and ethical use.
Related DOI: https://doi.org/10.26481/dis.20250107vs

[NLP-78] Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

【Quick Read】: This paper studies how token-level policy-gradient terms are aggregated within each sampled group when training large language models with verifiable rewards (RLVR), and how that choice biases optimization. Standard GRPO uses sequence aggregation, while recent work advocates token aggregation, but both are flawed: the former implicitly downweights long responses through sequence-level equal weighting, and the latter introduces a sign-length coupling effect. The key to the solution is Balanced Aggregation (BA), whose core mechanism computes token-level gradient means separately within the positive and negative subsets and combines them with sequence-count-based weights, effectively mitigating the optimization bias caused by length variation and improving training stability and final performance.

Link: https://arxiv.org/abs/2605.04077
Authors: Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li, Bingrui Li, Yuhao Wu, Yining Zheng, Ge Zhang, Wenhao Huang, Xipeng Qiu
Affiliations: Fudan University; Peking University; M-A-P; Tsinghua University; Shanghai Innovation Institute; Singapore University of Technology and Design
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose \textbfBalanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.
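The aggregation variants can be made concrete with a small sketch. The function below follows only the abstract's description of Balanced Aggregation (token-level means taken separately over the positive- and negative-advantage subsets, then combined with sequence-count weights); the function name, array layout, and sign convention are my assumptions, not the authors' code.

```python
import numpy as np

def balanced_aggregation(token_losses, advantages):
    """Balanced Aggregation (BA) sketch: compute the token-level mean
    separately within the positive- and negative-advantage subsets,
    then combine the two means with sequence-count-based weights.

    token_losses: list of 1-D arrays, one per sampled response
    advantages:   per-response scalar advantage (its sign picks the subset)
    """
    pos = [l for l, a in zip(token_losses, advantages) if a > 0]
    neg = [l for l, a in zip(token_losses, advantages) if a <= 0]

    def subset_token_mean(seqs):
        # token-level mean: pool every token in the subset, then average
        if not seqs:
            return 0.0
        return float(np.concatenate(seqs).mean())

    n_pos, n_neg = len(pos), len(neg)
    # sequence-count-based combination of the two subset means
    return (n_pos * subset_token_mean(pos)
            + n_neg * subset_token_mean(neg)) / (n_pos + n_neg)
```

A plain token-aggregation baseline would instead pool all tokens into a single mean, so a few very long responses in one subset would dominate the gradient; the per-subset means plus sequence-count weights are what decouple the signal from response length.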

[NLP-79] RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

【Quick Read】: This paper targets the computational inefficiency and memory blow-up that multimodal large language models suffer when processing long visual contexts, where the visual KV cache grows dramatically. Existing KV-cache compression methods prune tokens under a "persistence of importance" assumption, which fails in multimodal settings for two reasons: visual tokens exhibit "deferred importance," showing low salience early but becoming pivotal later in decoding, so they are evicted prematurely; and discrete pruning breaks the spatial continuity of visual cues. The key to the solution is RetentiveKV, an entropy-driven KV-cache optimization method that reframes eviction from "discrete context truncation" to "continuous memory evolution" built on state space models: information entropy quantifies the information potential of low-attention tokens, and entropy-guided state transitions fold them into a continuous state space for dynamic reactivation, achieving 5.0x cache compression and 1.5x decoding acceleration while preserving semantic relevance.

Link: https://arxiv.org/abs/2605.04075
Authors: Sihao Liu, YuFan Xiong, Zhonghua Jiang, Zhaode Wang, Chengfei Lv, Shengyu Zhang
Affiliations: Zhejiang University; Alibaba; Shanghai Institute for Advanced Study of Zhejiang University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression methods typically rely on the “persistence of importance” hypothesis to prune tokens. However, this approach proves fragile in multimodal settings due to two key issues: 1) Visual tokens display “deferred importance,” initially exhibiting low salience but becoming pivotal during later decoding, which can lead to premature eviction. 2) Discrete pruning disrupts the inherent spatial continuity of visual cues. To address these challenges, we propose RetentiveKV, an entropy-driven KV cache optimization method that reformulates KV eviction from “discrete context truncation” to “continuous memory evolution” based on State Space Models. Our method leverages information entropy to quantify the information potential of low-attention tokens and integrates tokens scheduled for eviction into a continuous state space through entropy-guided state transitions, enabling their dynamic reactivation when semantic relevance arises during subsequent decoding. Extensive experiments on multimodal benchmarks demonstrate that RetentiveKV achieves 5.0 times KV cache compression and 1.5 times decoding acceleration.
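The two ingredients the abstract names, an entropy score for "information potential" and a continuous state absorbing evicted entries, can be sketched as follows. This is a toy stand-in, not the paper's method: the weighted-mean state update replaces the state-space transition, and all names are mine.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy (nats) of a probability vector; a stand-in for the
    # paper's "information potential" score of low-attention tokens
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

def evict_with_retention(attn_scores, values, keep_k):
    """Keep the keep_k highest-attention KV entries; instead of discarding
    the rest, fold them into a running state vector (a weighted mean here,
    standing in for the entropy-guided state-space update)."""
    order = np.argsort(attn_scores)[::-1]
    kept, evicted = order[:keep_k], order[keep_k:]
    if len(evicted) == 0:
        state = np.zeros(values.shape[1])
    else:
        w = attn_scores[evicted] / attn_scores[evicted].sum()
        state = (w[:, None] * values[evicted]).sum(axis=0)
    return kept, state
```

The point of the sketch is the contrast with plain eviction: the evicted values still contribute a compressed summary that later decoding steps could consult, which is the "continuous memory evolution" idea in miniature.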

[NLP-80] Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity Task Specialisation and Mortality Prediction

【Quick Read】: This paper systematically applies sparse autoencoders (SAEs) to electronic health record (EHR) foundation models, dissecting their internal representation structure and assessing their usefulness for clinical prediction. The key elements are: first, TopK SAEs are trained at all 10 residual-stream extraction points of FlatASCEND, a 14.5M-parameter autoregressive clinical sequence model, on the INSPECT and MIMIC-IV datasets; second, the SAE decomposition reveals progressive abstraction across transformer depth, with early-layer features acting as near-perfect token detectors (45.7% singleton) while deep-layer features span roughly 30 token types across clinical categories (only 0.5% singleton); finally, SAE features and dense representations are compared on discrete-event prediction (e.g., mortality) versus continuous-magnitude prediction (e.g., length of stay), and although SAEs win under simple linear probes for discrete events, within leakage-safe windows (e.g., 48-hour AUC on eICU-CRD) dense representations match or exceed SAE features, indicating that SAEs are not universally superior and their value must be weighed against the downstream task and safety constraints.

Link: https://arxiv.org/abs/2605.04072
Authors: Chris Sainsbury, Feng Dong, Andreas Karwath
Affiliations: University of Glasgow; University of Dundee; NHS Greater Glasgow and Clyde; University of Birmingham; University of Strathclyde
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 4 figures, 7 tables

Abstract:Sparse autoencoders (SAEs) have been applied to large language models and protein language models, but not systematically to electronic health record (EHR) foundation models. We train TopK SAEs on FlatASCEND, a 14.5-million-parameter autoregressive clinical sequence model, at all 10 residual stream extraction points on INSPECT (outpatient) and MIMIC-IV (ICU). SAE decomposition reveals progressive abstraction across transformer depth: layer-0 features are near-perfect token detectors (45.7% singleton), while layer-6 features span approximately 30 token types across multiple clinical categories (0.5% singleton). Under full-sequence simple linear probes, SAE features outperform dense representations for discrete event prediction (mortality) while dense representations outperform for continuous magnitude prediction (length of stay) - a probe-level representational phenomenon that does not extend to clinically relevant leakage-safe windows, where dense representations match or exceed SAE features across all tested settings (eICU-CRD 48-hour AUC: SAE 0.871 versus dense 0.880; base model zero-shot, SAE dictionaries trained on eICU activations; MIMIC-IV: 0.836 versus 0.914; INSPECT 1-year/3-year: 0.697 versus 0.800). A delta-mode intervention method reduces SAE perturbation noise by 86x, enabling cleaner feature-level experiments, though the resulting perturbation effects are larger than random controls in 3 of 4 conditions but not formally significant. Feature reproducibility across random seeds is 21%, and individual features should be interpreted as illustrative rather than stable.
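For readers unfamiliar with the TopK SAE family used here, a minimal forward pass looks like the following. This is a generic sketch of the architecture class, not the authors' training code; the weight shapes and names are assumptions.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Minimal TopK sparse autoencoder forward pass.
    x: (d,) activation vector taken from the residual stream."""
    pre = W_enc @ x + b_enc                  # encoder pre-activations
    acts = np.maximum(pre, 0.0)              # ReLU
    # TopK sparsity: keep only the k largest feature activations
    if k < acts.size:
        thresh = np.partition(acts, -k)[-k]
        acts = np.where(acts >= thresh, acts, 0.0)
    recon = W_dec @ acts + b_dec             # decoder reconstruction
    return acts, recon
```

Training minimizes the reconstruction error of `recon` against `x`; the surviving entries of `acts` are the interpretable "features" whose singleton rates and token coverage the abstract reports per layer.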

[NLP-81] Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning ACL2026

【Quick Read】: This paper addresses the mismatch between the static policy-optimization schemes of existing reinforcement learning with verifiable rewards (RLVR) methods and the evolving reasoning capabilities of large language models (LLMs) during training. The key to the solution is Adaptive Power-Mean Policy Optimization (APMPO), with two core innovations: Power-Mean Policy Optimization (PMPO) introduces a generalized power-mean objective that lets the model adaptively transition between the signal-amplifying behavior of the arithmetic mean and the consistency-enforcing behavior of the geometric mean; and Feedback-Adaptive Clipping (FAC) adjusts clipping bounds based on real-time reward statistics, overcoming the limitations of static clipping mechanisms and markedly improving learning dynamics and reasoning performance.

Link: https://arxiv.org/abs/2605.04066
Authors: Yiming Huang, Zhenbo Shi, Shuzheng Gao, Cuiyun Gao, Peiyi Han, Chuanyi Liu
Affiliations: Harbin Institute of Technology (Shenzhen); Peng Cheng Laboratory; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026 (Findings)

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model’s evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.
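The interpolation PMPO's objective is described as performing is exactly the classical generalized power mean, which reduces to the arithmetic mean at p = 1 and to the geometric mean in the limit p -> 0. The helper below illustrates that mathematical fact only; how APMPO schedules p during training is not specified in the abstract.

```python
import numpy as np

def power_mean(x, p, eps=1e-8):
    """Generalized power mean M_p(x) = (mean(x^p))^(1/p).
    p = 1 gives the arithmetic mean; p -> 0 recovers the geometric mean."""
    x = np.asarray(x, dtype=float)
    if abs(p) < eps:                  # limit p -> 0: geometric mean
        return float(np.exp(np.log(x).mean()))
    return float((x ** p).mean() ** (1.0 / p))
```

By AM-GM, lowering p toward 0 penalizes inconsistency across the averaged terms (one small term drags the geometric mean down), which matches the "consistency-enforcing" behavior the abstract attributes to the geometric-mean end of the spectrum.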

[NLP-82] Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLM s ACL2026

【Quick Read】: This paper addresses the inability of existing unsupervised reinforcement learning (RL) methods to adapt to the evolving reasoning capabilities of large language models (LLMs) during training, which can misdirect policy optimization in the absence of ground-truth supervision. The key to the solution is FREIA, a new algorithm with two core innovations: (1) Free Energy-Driven Reward (FER), grounded in the Free Energy Principle, dynamically adjusts the reward signal to balance consensus and exploration; and (2) Adaptive Advantage Shaping (AAS) adaptively modulates the strength of the learning signal based on the statistical characteristics of sampled rewards, improving training stability and efficiency.

Link: https://arxiv.org/abs/2605.04065
Authors: Yiming Huang, Zhenbo Shi, Xin-Cheng Wen, Jichuan Zeng, Cuiyun Gao, Peiyi Han, Chuanyi Liu
Affiliations: Harbin Institute of Technology (Shenzhen); Peng Cheng Laboratory; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted by ACL 2026

Abstract:Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model’s evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.

[NLP-83] Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning ICLR2026

【速读】: 该论文旨在解决大语言模型如何从少量示例中编码任务身份(task identity)这一机制可解释性中的核心问题。此前研究依赖线性探测(linear probing)定位任务表征,但本文揭示了探测准确率与因果重要性之间存在显著脱节:单位置激活干预在所有28层Llama-3.2-3B模型中均无法实现任务迁移(0%),尽管对应位置的探测准确率达100%。其关键解决方案在于引入多位置激活干预策略——同时替换所有演示输出token的激活值,从而在第8层实现高达96%的任务转移效率(N=50, 95%置信区间[87%, 99%]),首次精准定位了in-context learning (ICL)任务身份的因果作用位点。进一步发现该机制具有跨架构普适性,在LLaMA、Qwen和Gemma三类模型中均表现为约网络深度30%处的统一干预窗口,并通过因果追踪揭示出查询位置严格必要而各演示位置非必要的不对称结构,最终提出“分布式模板假说”(distributed template hypothesis):ICL任务身份以输出格式模板的形式分布于演示token之间,而非集中存储于特定神经元或层。

Link: https://arxiv.org/abs/2605.04061
Authors: Bryan Cheng, Jasper Zhang
Affiliations: William A. Shine Great Neck South High School
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 8 pages, 3 figures. Accepted to the 2026 Learning and Intelligent Optimization Conference and workshops on Foundation Models for Science: Real-World Impact and Science-First Design, Latent Implicit Thinking - Going Beyond CoT Reasoning, Logical Reasoning of Large Language Models at ICLR 2026

Abstract:Understanding how large language models encode task identity from few-shot demonstrations is a central open problem in mechanistic interpretability. Prior work uses linear probing to localize task representations, reporting high classification accuracy at specific layers. We reveal a striking dissociation: probing accuracy completely fails to predict causal importance. Single-position activation intervention achieves 0% task transfer across all 28 layers of Llama-3.2-3B-despite 100% probing accuracy at those same positions. This null result is itself a key finding, demonstrating that task encoding is fundamentally distributed. Multi-position intervention-replacing activations at all demonstration output tokens simultaneously-achieves up to 96% transfer (N=50, 95% CI: [87%, 99%]) at layer 8, pinpointing for the first time the causal locus of ICL task identity. We establish the generality of these findings across four models spanning three architecture families (LLaMA, Qwen, Gemma), discovering a universal intervention window at ~30% network depth. Causal tracing uncovers an asymmetric architecture: the query position is strictly necessary (53-100% disruption) while no individual demonstration position is necessary (0% disruption)-resolving a key ambiguity in prior accounts. Crucially, transfer depends on internal representation compatibility, not surface similarity (r=-0.05 vs r=0.31), ruling out trivial explanations. These results establish the distributed template hypothesis: ICL task identity is encoded as output format templates distributed across demonstration tokens, fundamentally reshaping our understanding of how in-context learning operates.
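The core intervention, copying source-task activations into the target run at every demonstration-output position simultaneously, reduces to a few lines once the activations are in hand. This is an illustrative sketch on plain arrays; the function name and the assumption that the demonstration-output positions are known in advance are mine.

```python
import numpy as np

def patch_positions(target_hidden, source_hidden, positions):
    """Multi-position intervention sketch: overwrite the target run's
    activations with source-task activations at every listed position.

    target_hidden, source_hidden: (seq_len, d_model) activations at one layer.
    positions: indices of the demonstration output tokens."""
    patched = target_hidden.copy()
    patched[positions] = source_hidden[positions]
    return patched
```

In a real model this substitution would be applied during the forward pass, for example via a forward hook on the chosen layer (e.g. PyTorch's `register_forward_hook`); the single-position baseline the paper reports as failing corresponds to passing a one-element `positions` list.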

[NLP-84] The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

【Quick Read】: This paper addresses the "Reasoning Trap" prevalent in multi-agent debate (MAD) and closed-system reasoning generally: answer accuracy is preserved while the quality of the reasoning behind the answers degrades markedly, leaving outputs without interpretability or evidential support. The core question is how, in LLM-based multi-agent iterative reasoning, to keep the reasoning chain informationally tied to the input evidence E and avoid losing key evidential links through adversarial agent interaction. The key to the solution has three parts: (i) the Supported Faithfulness Score (SFS), a decomposer-invariant, atomic-claim-level metric verifying that each reasoning step is grounded in the evidence; (ii) Evidence-Grounded Socratic Reasoning (EGSR), which replaces the adversarial position-taking of debate with non-adversarial, evidence-driven inquiry; and (iii) Theorem 1 (the DPI bound), which shows that under standard MAD the evidence E and per-round outputs O^t form a Markov chain, so by the data processing inequality information about the evidence cannot increase, exposing an essential limitation of closed-system reasoning. The EGSR mechanism effectively restores reasoning quality: on SciFact and FEVER it lifts SFS from the extremely low levels seen under MAD back to roughly 98% of baseline, validating the practicality and effectiveness of the framework.

Link: https://arxiv.org/abs/2605.01704
Authors: Kwan Soo Shin
Affiliations: PolymathMinds AI Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 18 figures, 4 tables, 126 references. Subtitle: A Falsifiable Theorem, the Multi-Agent-Debate Instantiation, and a Triple Failure of Human Reliability

Abstract:When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers. We name the multi-agent case the Debate Trap and the broader phenomenon the Reasoning Trap, offering a programmatic theory of evidence-grounded reasoning. The framework has three parts: (i) SFS (Supported Faithfulness Score), a claim-level metric verifying decomposed atomic claims against provided evidence (decomposer-invariant rankings: Spearman rho=1.0); (ii) EGSR (Evidence-Grounded Socratic Reasoning), replacing adversarial argumentation with evidence-grounded inquiry; (iii) Theorem 1 (DPI Bound): under standard MAD, the chain E -> O^0 -> O^1 -> … is Markov, and the Data Processing Inequality implies E[I(E;O^{t+1})] <= E[I(E;O^t)]. Three companion results – open-system recovery (Theorem 2), EGSR accumulation (Lemma 2), and vote-aggregation floor (Proposition 1) – partition multi-step LLM reasoning by its information-theoretic relationship to E. Across 16 conditions on SciFact (300 claims) and FEVER (1,000 claims), DebateCV (C13) preserves 88% of baseline accuracy while SFS drops 43%; majority-vote MAD (C15) reduces SFS to 1.7% of baseline (p < 10^-6, d = -0.96); EGSR recovers 98%. An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter-rater Fleiss kappa = +0.018 with 0.8-1.4 Likert intra-rater shifts across language and domain – the human agreement that faithfulness metrics have been calibrated against is not itself stable. We offer one falsifiable conjecture: any closed-system reasoning protocol preserving Theorem 1's Markov structure is, in expectation, subject to the same DPI bound.
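Theorem 1 rests on a textbook fact that is easy to check numerically: pushing the joint distribution p(E, O^t) through any channel p(O^{t+1} | O^t) can only shrink the mutual information with the evidence. The sketch below demonstrates this on discrete distributions; the helper names are mine and the example has nothing to do with the paper's LLM experiments.

```python
import numpy as np

def mutual_info(joint):
    # I(X;Y) in nats from a joint distribution matrix p(x, y)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def push_through_channel(joint_et, channel):
    """Given joint p(E, O^t) and a channel p(O^{t+1} | O^t),
    return joint p(E, O^{t+1}) — one debate round in the Markov chain."""
    return joint_et @ channel
```

Running the chain through successive noisy channels shows I(E; O^t) decreasing monotonically, which is exactly the Data Processing Inequality the abstract invokes: each closed-system round is a channel applied to the previous round's output, never to the evidence itself.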

[NLP-85] MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

【Quick Read】: This paper addresses two gaps in existing MRI large language model (LLM) benchmarks: they rely mainly on multiple-choice questions (MCQ) from review books, on which top proprietary models already score near ceiling, limiting discrimination; and they lack any systematic evaluation of vendor-specific scanner operational knowledge. The key to the solution is MRI-Eval, a tiered benchmark of 1,365 scored items across nine knowledge categories and three difficulty tiers, built from primary textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Its design uses MCQ as the primary format, complemented by stem-only and primed (incorrect-premise) stem-only analyses, which expose weaknesses in free-text recall that high MCQ scores can mask, most notably on GE scanner operations. MRI-Eval is positioned as a relative comparison benchmark for LLM use in research MRI practice rather than an absolute competency measure.

Link: https://arxiv.org/abs/2605.05175
Authors: Perry E. Radau
Affiliations: University of Calgary; Child and Adolescent Imaging Research (CAIR) Program; Alberta Children's Hospital Research Institute
Subjects: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Medical Physics (physics.med-ph)
Comments: 21 pages, 4 figures, 10 tables

Abstract:Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.

Information Retrieval

[IR-0] Interests Burn-down Diffusion Process for Personalized Collaborative Filtering

【Quick Read】: This paper addresses the sub-optimal performance of existing diffusion generative models in collaborative filtering (CF), caused by the mismatch between Gaussian noise and the subtle nature of users' personalized interaction behavior. The key to the solution is a diffusion scheme tailored to interaction systems, the interests burn-down process, which models the decay of user interests toward candidate items; its reverse, the burn-up process, generates personalized recommendations. This mechanism naturally fits CF's need to model diffusive user interests and markedly improves the quality and personalization of generated samples.

Link: https://arxiv.org/abs/2605.05165
Authors: Yifang Qin, Zhaobin Li, Arisa Watanabe, Wei Ju, Zhiping Xiao, Ming Zhang
Affiliations: Peking University; Sichuan University; University of Washington
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Generative methods have gained widespread attention in Collaborative Filtering (CF) tasks for their ability to produce high-quality personalized samples aligned with users’ interests. Among them, diffusion generative models have raised increasing attention in recommendation field. Despite that the pioneering efforts have applied the conventional diffusion process to model diffusive user interests, the incongruity between the Gaussian noise and the subtle nature of user’s personalized interaction behavior has led to sub-optimal results. To this end, we introduce a specifically-tailored diffusion scheme for interaction systems, namely the interests burn-down process. The interests burn-down process delineates the decay of user interests towards candidate items, complemented by its reverse burn-up process that yields personalized recommendation for users. The inherent burn-down nature of this process adeptly models the diffusive user interests, aligning seamlessly with the requirements of CF tasks. We present a novel recommendation method StageCF to illustrate the superiority of this newly proposed diffusion process. Experimental results have demonstrated the effectiveness of StageCF against existing generative and diffusion-based baseline methods. Furthermore, comprehensive studies validate the functionality of interests burn-down process, shedding light on its capacity to generate personalized interactions.

[IR-1] CapsID: Soft-Routed Variable-Length Semantic IDs for Generative Recommendation

【Quick Read】: This paper addresses the semantic information loss and error propagation caused by the tokenizer bottleneck in generative recommendation. Conventional approaches use hard residual quantization, mapping each item to a single Semantic ID (SID) at every layer via nearest-neighbor assignment, which collapses multi-faceted semantics and propagates early mistakes to later SID positions. The key to the solution is the CAPSID framework, which replaces hard quantization with capsule routing: at each layer an item is probabilistically routed to several semantic capsules, residual updates use the routed reconstruction rather than a single winning code, and SID generation terminates once the active capsule's confidence is high enough. On top of this, SEMANTICBPE composes adjacent SID tokens into reusable subword units by combining their co-occurrence frequency with embedding compatibility, improving representational efficiency. Experiments on several industrial and public datasets show that CAPSID+SEMANTICBPE clearly outperforms strong single-representation baselines (e.g., ReSID) while running at 51% of the inference latency of COBRA-style sparse-dense hybrid systems.

Link: https://arxiv.org/abs/2605.05096
Authors: Wenzhuo Cheng, Menghang Gong, Qixin Guo, Hang Zheng, Zhaobin Yang, Jianguo Lou, Zhengwei Zheng
Affiliations: Google; Stanford University
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Generative recommendation maps each item to a sequence of Semantic IDs (SIDs) and recasts retrieval as autoregressive token generation. In this paradigm the main bottleneck is the tokenizer rather than the Transformer: residual vector quantization with a hard nearest-neighbor assignment at every layer collapses multi-faceted item semantics at cluster boundaries and propagates early errors to later SID positions. A common workaround is to append a dense vector or attribute prefix to the SID, but this dual-representation design inflates inference cost and gives up the simplicity of a generative interface. We address the bottleneck at the tokenizer itself. CAPSID replaces hard residual quantization with capsule routing: at each layer an item probabilistically routes to several semantic capsules, the residual is updated by the routed reconstruction rather than by a single winning code, and the SID terminates once the active capsule’s confidence is high enough. On top of CAPSID, SEMANTICBPE composes adjacent SID tokens into reusable subwords by combining their co-occurrence with their embedding compatibility. On Amazon Beauty, Sports, Toys, and a 35M-item proprietary industrial catalog, CAPSID+SEMANTICBPE improves Recall at 10 by 9.6% on average over ReSID, the strongest single-representation baseline, and matches or exceeds a COBRA-style sparse-dense system on every public benchmark while running at 51% of its inference latency. Ablations show that soft routing, iterative agreement, and confidence-driven length each contribute independently, and the gains are largest on tail items where boundary semantics dominate.
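The difference from hard residual quantization can be seen in a one-layer sketch: instead of picking the single nearest code, the residual is routed softly over all codes and the routed reconstruction is subtracted. The softmax form, the temperature parameter, and the function name are my assumptions; the abstract does not specify the routing rule.

```python
import numpy as np

def soft_route_step(residual, codebook, temperature=1.0):
    """One CAPSID-style quantization layer (sketch): route the residual
    probabilistically over all capsules/codes, then update the residual
    with the routed reconstruction rather than a single winning code."""
    sims = codebook @ residual              # similarity to each code
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    recon = probs @ codebook                # routed reconstruction
    return probs, residual - recon
```

As `temperature` shrinks, the routing collapses toward hard nearest-neighbor assignment, recovering conventional residual quantization as a limiting case; the confidence-driven SID length in the abstract would correspond to stopping once `probs.max()` passes a threshold.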

[IR-2] Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

【Quick Read】: This paper addresses catastrophic forgetting when a generative music model is adapted across genres: how much original-genre data must be retained while fine-tuning on a new genre (e.g., jazz) to preserve knowledge of the old genre (e.g., pop)? The key to the solution is controlling the volume of old-domain (pop) data mixed into training, balancing gains on the new domain (jazz) against retention on the old one. Experiments show that about 2.5K pop samples (roughly 1.65x the jazz data volume) restore pop accuracy to baseline while achieving the best genre adaptation, with performance saturating beyond that threshold.

Link: https://arxiv.org/abs/2605.04998
Authors: Jinju Lee
Affiliations: PearlLeeStudio
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 3 figures, 5 tables. Companion HuggingFace models: this https URL

Abstract:Chord progression generation is practically important but understudied. Most large-scale symbolic music systems target melody, multi-track arrangement, or audio synthesis, and chord-only models tend to be relegated to conditioning components inside larger pipelines. This paper treats chord generation as a standalone task and addresses a question that arises whenever such a model is adapted across genres: how much old-domain data must be retained during fine-tuning to acquire a new domain without forgetting the old? I study jazz fine-tuning starting from a pop-pretrained 25M-parameter Music Transformer (84.24% top-1 chord accuracy on a held-out pop test set). The available jazz corpus is an order of magnitude smaller than the pop corpus, so every fine-tune run uses all 1,513 jazz training sequences. The swept variable is the volume of pop "rehearsal" data mixed alongside, taking values in {0, 1K, 2.5K, 5K, 10K}. Every fine-tuned model gains 7 to 9 points of jazz top-1. Pop accuracy collapses by 2.14 points under jazz-only fine-tuning, recovers to baseline at approximately 2.5K rehearsal samples (1.65x the jazz volume), and saturates beyond that point. A complementary observation: the metric-best run (F3, 2.5K mix) is not always the perceptually preferred one. The pop-leaning (10K) and jazz-leaning (1K) endpoints carry more committed stylistic identities that the author more often selects as finished output in informal listening. I discuss what this suggests for music co-creation tools but make no perceptual claim, since no formal listening study has been conducted. All six checkpoints are released on the HuggingFace Hub at this https URL.
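The rehearsal-mixing setup swept in the abstract can be sketched in a few lines: every run takes all new-domain sequences plus a sampled volume of old-domain sequences. The function name and list-based corpus representation are illustrative assumptions, not the author's training code.

```python
import random

def build_finetune_mix(jazz, pop, rehearsal_size, seed=0):
    """Rehearsal mixing sketch: use every new-domain (jazz) sequence and
    sample rehearsal_size old-domain (pop) sequences; the paper sweeps
    rehearsal_size over {0, 1K, 2.5K, 5K, 10K}."""
    rng = random.Random(seed)
    rehearsal = rng.sample(pop, min(rehearsal_size, len(pop)))
    mix = list(jazz) + rehearsal
    rng.shuffle(mix)
    return mix
```

The paper's headline finding is then a property of `rehearsal_size`: around 2.5K (about 1.65x the jazz volume) is enough to recover baseline pop accuracy, with diminishing returns beyond that.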

[IR-3] abEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

【Quick Read】: This paper addresses the absence of a unified representation-learning paradigm for tabular data in current foundation models. Existing approaches have two major limitations: LLM-based methods cannot produce retrieval-compatible vector outputs, while conventional text embedding models struggle to capture tabular structure and numerical semantics. The key to the solution is TabEmbed, the first generalist embedding model for tables, which reformulates diverse tabular tasks as semantic matching problems and uses large-scale contrastive learning with positive-aware hard negative mining to unify tabular classification and retrieval in a shared embedding space, precisely discerning fine-grained structural and numerical nuances in tables.

Link: https://arxiv.org/abs/2605.04962
Authors: Minjie Qiang, Mingming Zhang, Xiaoyi Bao, Xing Fu, Yu Cheng, Weiqiang Wang, Zhongqing Wang, Ningtao Wang
Affiliations: Soochow University; Ant Group; The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 15 pages, 8 figures. Code and datasets are available at this https URL

Abstract:Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at this https URL and this https URL.
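The contrastive objective with hard negative mining that TabEmbed relies on can be illustrated with a minimal InfoNCE computation. This is a generic sketch, not the paper's implementation; the `info_nce` helper, the temperature, and the toy vectors are assumptions. It shows why hard negatives (close to the query) produce a much stronger training signal than easy ones.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one (query, positive) pair against mined negatives."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

q = [1.0, 0.0]
loss_easy = info_nce(q, [0.9, 0.1], [[-1.0, 0.0]])   # negative far from query
loss_hard = info_nce(q, [0.9, 0.1], [[0.95, 0.05]])  # hard negative near query
```

The hard-negative loss is much larger, so its gradient dominates training, which is the intuition behind positive-aware hard negative mining.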

[IR-4] AllSERP: Exhaustive Per-Element Enrichment of the Versatile AdSERP Dataset

【速读】:该论文旨在解决现有搜索引擎结果页面(Search Engine Results Page, SERP)行为数据集在元素级细粒度分析上的局限性,尤其是无法区分有机结果(organic results)与广告(ads)之外的其他元素类型(如工具栏、AI摘要等)及其交互行为的问题。原始AdSERP数据集虽包含高精度眼动追踪、鼠标轨迹和点击行为,但其标注仅限于广告区域(占总点击的15.5%),难以支持对整个SERP中各类元素的行为建模。解决方案的关键在于提出AllSERP数据集,通过截图锚定的计算机视觉(CV)方法实现像素级有机结果与组件(widgets)边界框标注,结合HTML解析获得十三类语义元素类型,并引入“typed_gapfill”填补策略与X+Y点击归因机制,使91.7%的点击可被精确归因至具体元素类型,同时保持与原广告分区的高度一致性(38,250次分类无冲突)。这一改进显著提升了对SERP中多类型元素行为(如注视、点击、滚动)的精细化分析能力。

链接: https://arxiv.org/abs/2605.04949
作者: K. Andrew Edmonds
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We release AllSERP, a typed AOI and per-element behavioral enrichment of the AdSERP commercial-intent SERP corpus [4]. AdSERP ships 2,776 trials of full-page screenshots, captured SERP HTML, 150 Hz Gazepoint eye tracking, evtrack mouse telemetry, scroll, and pupil signals against real Google SERPs collected before AI Overviews – but its bounding boxes cover only ad surfaces (15.5 % of attributable clicks). AllSERP adds pixel-accurate organic and widget bboxes via screenshot-anchored CV, semantic types across thirteen element types via an HTML parser, an inter-result gap-fill flavor (typed_gapfill), and X+Y click attribution that reaches 91.7 % of the corpus while flagging the rest at trial level. The Phase C ad-vs-non-ad partition is internally consistent with the shipped ad rectangles (0 disagreements across 38,250 classifications). We ship the pipeline, per-trial JSONs, a corpus CSV, and a browser-based replay viewer; everything is reproducible from the AdSERP Zenodo volume. The release enables per-element click, fixation, regression, and above-fold analyses that the shipped ads-vs-organic split could not resolve.
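A minimal sketch of attributing a click to a typed element bounding box, assuming a simple point-in-box rule with a tightest-box tie-break; the element dictionaries and the `attribute_click` helper are illustrative assumptions, not the release's actual X+Y attribution logic.

```python
def attribute_click(x, y, elements):
    """Attribute a click at page coordinates (x, y) to the smallest
    enclosing element bbox, returning its semantic type (or None)."""
    hits = [e for e in elements
            if e["x"] <= x < e["x"] + e["w"] and e["y"] <= y < e["y"] + e["h"]]
    if not hits:
        return None  # unattributable clicks get flagged at trial level
    # Prefer the tightest box so a result nested inside a widget wins.
    return min(hits, key=lambda e: e["w"] * e["h"])["type"]

# Hypothetical typed AOIs for one SERP trial.
serp = [
    {"type": "ad", "x": 0, "y": 0, "w": 600, "h": 120},
    {"type": "organic", "x": 0, "y": 140, "w": 600, "h": 90},
    {"type": "typed_gapfill", "x": 0, "y": 230, "w": 600, "h": 10},
]
```

The `typed_gapfill` box stands in for the inter-result gap-fill flavor the release ships between adjacent results.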

[IR-5] Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall

【速读】:该论文旨在解决代理记忆系统中“摄入时提取”(extraction at ingestion)导致的信息丢失问题,即在查询尚未明确的情况下提前丢弃内容,使得后续检索无法恢复。其解决方案的关键在于提出一种名为True Memory的六层架构,将系统中心从传统的存储模式转向基于事件的多阶段检索流水线,所有事件均以原始形式保留,从而实现更精准的记忆召回。该系统完全运行于单个SQLite文件上,无需外部数据库、向量索引、图数据库或GPU,在多个基准测试(如LoCoMo、LongMemEval和BEAM-1M)中显著优于现有方法,表明其在保持高效性的同时提升了记忆准确性。

链接: https://arxiv.org/abs/2605.04897
作者: Joshua Adler,Guy Zehavi
机构: Sauron Labs (Sauron 实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 17 pages, 4 figures, 7 tables. Technical report

点击查看摘要

Abstract:Extraction at ingestion is the wrong primitive for agent memory: content discarded before the query is known cannot be recovered at retrieval time. We propose True Memory, a six-layer architecture that shifts the center of the system from a storage schema to a multi-stage retrieval pipeline operating over events preserved verbatim. The full system runs as a single SQLite file on commodity CPU with no external database, vector index, graph store, or GPU. On LoCoMo (1,540 questions across 10 multi-session conversations), True Memory Pro reaches 93.0% accuracy (3-run mean) against 61.4% for Mem0, 65.4% for Supermemory, approximately 71% for Zep, and 94.5% for EverMemOS under a matched gpt-4.1-mini answer model. On LongMemEval (500 questions), True Memory Pro reaches 87.8% (3-run mean). On BEAM-1M (700 questions at the 1-million-token scale), True Memory Pro reaches 76.6% (3-run mean), above the prior published result of 73.9% for Hindsight. A 56-configuration ablation shows a 1.3-percentage-point spread within the top-performing configuration family.
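The core design, events kept verbatim in a single SQLite file with all interpretation deferred to query time, can be sketched as follows. The schema, the keyword-overlap scoring, and the sample events are assumptions for illustration; the actual system layers a multi-stage retrieval pipeline on top.

```python
import sqlite3

# The paper's system is a single SQLite file; ":memory:" keeps this self-contained.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, session TEXT, text TEXT)")
events = [
    ("s1", "Alice said she adopted a cat named Miso in March."),
    ("s1", "We discussed the quarterly budget for the design team."),
    ("s2", "Alice mentioned Miso the cat is afraid of vacuum cleaners."),
]
con.executemany("INSERT INTO events (session, text) VALUES (?, ?)", events)

def retrieve(query, k=2):
    """Rank verbatim events by keyword overlap with the query.

    Nothing was extracted or summarized at ingestion, so any detail can
    still be recovered once the query is known.
    """
    terms = set(query.lower().split())
    rows = con.execute("SELECT session, text FROM events").fetchall()
    scored = [(len(terms & set(t.lower().split())), s, t) for s, t in rows]
    scored.sort(key=lambda r: -r[0])
    return [t for score, _, t in scored[:k] if score > 0]

top = retrieve("what cat does alice have")
```

Because the events are stored verbatim, the cross-session detail about Miso survives; an extract-at-ingestion schema that had discarded pet mentions could never answer this query.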

[IR-6] Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes ECIR2026

【速读】:该论文旨在解决音频视觉深度伪造(audiovisual deepfake)检测中人类判断的可靠性问题,特别是通过众包方式评估普通人群在识别伪造视频真实性及其篡改类型(如仅音频、仅视频或音视频联合篡改)时的一致性与准确性。其解决方案的关键在于设计并执行两项匹配的众包实验,利用Prolific平台对AV-Deepfake1M和Trusted Media Challenge(TMC)数据集中的96个视频进行每视频10次标注,共收集960条判断结果,从而量化人群判断的稳定性与局限性。研究发现,虽然众包能有效识别真实视频(误判率低),但对多数伪造内容敏感度不足,且在识别篡改类型时噪声显著,尤其在音视频联合篡改场景下表现最差,表明众包可作为大规模筛查音频视觉真实性的可行手段,但可靠模态归因仍是未解难题。

链接: https://arxiv.org/abs/2605.04797
作者: Michael Soprano,Andrea Cioci,Stefano Mizzaro
机构: University of Udine (乌迪内大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted at ROMCIR 2026, the 6th Workshop on Reducing Online Misinformation through Credible Information Retrieval, held in conjunction with ECIR 2026

点击查看摘要

Abstract:Deepfakes are increasingly realistic and easy to produce, raising concerns about the reliability of human judgments in misinformation settings. We study audiovisual deepfake detection by measuring how consistently crowd workers distinguish authentic from manipulated videos and, when they flag a video as manipulated, how accurately they identify the manipulation type (audio-only, video-only, or audio-video) and how consistently they report manipulation timestamps. We run two matched crowdsourcing studies on Prolific using AV-Deepfake1M and the Trusted Media Challenge (TMC) dataset. We sample 48 videos per dataset (96 total) and collect 960 judgments (10 per video). Results show that crowd workers rarely misclassify authentic videos as manipulated, but they miss many manipulations, and agreement remains limited across videos. Aggregating multiple judgments per video stabilizes the authenticity signal, but it cannot recover manipulations that most workers consistently miss. Manipulation type identification is substantially noisier than authenticity detection even when workers detect a manipulation, with joint audio-video cases being particularly hard to recognize. Overall, these findings suggest that crowdsourcing can provide a scalable screening signal for audiovisual authenticity, while reliable modality attribution remains an open challenge.
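The aggregation behavior reported above, stabilizing the authenticity signal while failing on manipulations that most workers miss, can be illustrated with a majority vote over per-video labels. The label sets below are hypothetical, with 10 judgments per video as in the study.

```python
from collections import Counter

def aggregate(judgments):
    """Majority vote over per-video worker labels; ties fall back to 'real',
    reflecting that workers rarely mislabel authentic videos."""
    counts = Counter(judgments)
    if counts["manipulated"] > counts["real"]:
        return "manipulated"
    return "real"

# Hypothetical worker labels, 10 judgments per video.
videos = {
    "real_clip": ["real"] * 9 + ["manipulated"],
    "obvious_fake": ["manipulated"] * 7 + ["real"] * 3,
    "subtle_fake": ["manipulated"] * 3 + ["real"] * 7,  # missed by most workers
}
decisions = {v: aggregate(j) for v, j in videos.items()}
```

The subtle fake is still misclassified after aggregation: pooling judgments cannot recover manipulations that workers consistently miss, which matches the paper's finding.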

[IR-7] RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation

【速读】:该论文旨在解决移动电商场景中用户意图快速演变背景下,如何高效准确预测用户下一次搜索查询的问题。现有方法依赖云端部署大语言模型(Large Language Models, LLMs),导致推理成本高、响应延迟大。其解决方案的关键在于提出RecGPT-Mobile框架,通过设计轻量化的LLM意图理解代理(intent understanding agent)并直接部署于移动端,实现对用户兴趣动态变化的实时捕捉与推荐结果的即时调整,从而在保障推荐精度的同时显著降低计算开销,为大规模生产环境中LLM在移动设备上的落地提供可行路径。

链接: https://arxiv.org/abs/2605.04726
作者: Bin Zhang,Weipeng Huang,Dimin Wang,Jialin Zhu,Yuning Jiang,Zhaode Wang,Chengfei Lv,Jian Wang,Qichao Ma,Li Chen,Junqing Wu,Yipeng Yu
机构: Taobao Tmall Group of Alibaba (淘宝天猫集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Predicting a user’s next search query from recent interaction behaviors is a critical problem in modern e-commerce systems, particularly in scenarios where user intent evolves rapidly. Large Language Models (LLMs) offer strong semantic reasoning capabilities and have recently been adopted to enhance training data construction for next-query prediction. However, due to resource constraints on mobile devices, existing applications are deployed on cloud servers, resulting in high inference costs. In this paper, we propose RecGPT-Mobile, a framework that designs a lightweight LLM-based intent understanding agent to improve recommendation quality in mobile e-commerce scenarios. By deploying LLMs directly on mobile devices, our approach can capture evolving interests of users more quickly and adjust the recommendation results in real time. Extensive offline analyses and online experiments demonstrate that our method significantly improves the accuracy of recommendation results, laying a practical path for LLM deployment in production-scale recommendation systems on mobile devices, as well as a scalable solution for integrating LLMs into real-world next-query prediction systems.

[IR-8] Rethinking Convolutional Networks for Attribute-Aware Sequential Recommendation IJCAI-ECAI2026

【速读】:该论文旨在解决现有属性感知序列推荐(attribute-aware sequential recommendation)模型在处理长用户历史时面临的高计算复杂度和内存消耗问题,以及纯注意力机制在捕捉序列模式时可能效率不足的局限性。其解决方案的关键在于提出ConvRec,一种具有线性计算与内存复杂度的卷积神经网络架构,通过分层下采样的卷积结构生成紧凑且富有表达力的序列表示;同时,每一层逐步聚合邻近物品信息,以增强对多样化序列模式的捕捉能力,从而在保持高效性的同时显著提升推荐性能。

链接: https://arxiv.org/abs/2605.04723
作者: Shereen Elsayed,Ngoc Son Le,Ahmed Rashed,Lars Schmidt-Thieme
机构: University of Hildesheim (希尔德斯海姆大学); Volkswagen Financial Services AG (大众金融服务公司)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at IJCAI-ECAI 2026

点击查看摘要

Abstract:Attribute-aware sequential recommendation entails predicting the next item a user will interact with based on a chronologically ordered history of past interactions, enriched with item attributes. Existing methods typically leverage self-attention mechanisms to aggregate the entire sequence into a unified representation used for next-item prediction. While effective, these models often suffer from high computational complexity and memory consumption, limiting their ability to process long user histories. This constraint restricts the model’s capacity to fully capture long-term user preferences. In some scenarios, modeling item interactions purely through attention may also not be the most effective approach to extract sequential patterns. In this work, we propose ConvRec, an alternative method with linear computational and memory complexity that employs convolutional layers in a hierarchical, down-scaled fashion to generate compact, yet expressive sequence representations. To further enhance the model’s ability to capture diverse sequential patterns, each layer aggregates the neighboring items gradually to reach a comprehensive sequence representation. Extensive experiments on four real-world datasets demonstrate that our approach outperforms state-of-the-art sequential recommendation models, highlighting the potential of convolution-based architectures for efficient and effective sequence modeling in recommendation systems. Our implementation code and datasets are available here this https URL.
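ConvRec's hierarchical down-scaling can be sketched with a toy stride-2 aggregation over a scalar interaction sequence: each layer halves the length, so the total work across all layers is n + n/2 + n/4 + ... = O(n), linear in history length. The fixed averaging weights below stand in for learned convolution kernels and are an assumption for illustration.

```python
def conv_downsample(seq, weights=(0.5, 0.5)):
    """One stride-2 conv layer: each output aggregates two neighboring items."""
    return [weights[0] * seq[i] + weights[1] * seq[i + 1]
            for i in range(0, len(seq) - 1, 2)]

def encode(seq):
    """Stack layers hierarchically until one summary value remains.

    Contrast with self-attention, whose single layer already costs O(n^2)
    in both compute and memory over the same history.
    """
    while len(seq) > 1:
        seq = conv_downsample(seq)
    return seq[0]

history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
summary = encode(history)  # 8 -> 4 -> 2 -> 1
```

With averaging weights the summary collapses to the sequence mean; a trained model would instead learn layer-specific kernels that aggregate neighboring items gradually, as the abstract describes.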

[IR-9] Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation SIGIR2026

【速读】:该论文旨在解决生成式大语言模型(Large Language Models, LLMs)在推荐系统(LLM4Rec)中因依赖token-level目标而难以优化list-level非可微指标(如NDCG、公平性等)的问题,以及现有基于Best-of-N(BoN)方法在推理阶段虽能直接优化这些指标但计算成本高昂的局限。其解决方案的关键在于提出BLADE(Bayesian List-wise Alignment via Dynamic Estimation),通过引入贝叶斯框架动态更新目标分布,融合历史先验与模型当前rollout的实时证据,构建一个自适应演化的参考分布,从而克服静态监督信号无法区分候选集相对质量(Indiscriminate Supervision)和训练过程中梯度衰减(Gradient Decay)两大挑战,确保在整个训练过程中保持有效的监督信号,并实现排名准确率与复杂列表级指标(如公平性和多样性)的持续提升。

链接: https://arxiv.org/abs/2605.04559
作者: Ruijun Chen,Chongming Gao,Jiawei Chen,Weiqin Yang,Xiangnan He
机构: University of Science and Technology of China(中国科学技术大学); Zhejiang University(浙江大学)
类目: Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2026. 11 pages, 8 figures

点击查看摘要

Abstract:Large Language Models have revolutionized recommender systems (LLM4Rec) by leveraging their generative capabilities to model complex user preferences. However, existing LLM4Rec methods primarily rely on token-level objectives, making it difficult to optimize list-level and non-differentiable metrics (e.g., NDCG, fairness) that define actual recommendation quality. While Best-of-N (BoN) directly optimizes these metrics during inference, its high computational cost hinders real-world deployment. To address this, BoN Alignment aims to distill the search capability into the model itself, yet current approaches suffer from two critical limitations: (1) Indiscriminate Supervision, where the static reference fails to distinguish the relative quality of candidates exceeding its empirical range, leading to a loss of ranking guidance; and (2) Gradient Decay, where the effective supervision signal rapidly diminishes as the evolving policy improves, resulting in inefficient optimization. To overcome these challenges, we propose BLADE (Bayesian List-wise Alignment via Dynamic Estimation). Unlike static approaches, BLADE introduces a Bayesian framework that continuously updates the target distribution by fusing historical priors with dynamic evidence from the model's current rollouts. This mechanism constructs a self-evolving target that adapts to the model's growing capabilities, ensuring the training signal remains informative throughout the learning process. Extensive experiments on three real-world datasets demonstrate that BLADE significantly outperforms state-of-the-art baselines. Crucially, it breaks the static performance upper bound, achieving sustained gains in both ranking accuracy (Recall, NDCG) and complex list-wise metrics (Fairness, Diversity). The code is available via this https URL.
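BLADE's dynamic target can be illustrated with a conjugate-style weighted-mean update that fuses a historical prior with fresh rollout evidence. The `bayes_update` helper and the score values are assumptions meant to show the reference drifting upward with the improving policy rather than saturating at a static bound.

```python
def bayes_update(prior_mean, prior_weight, rollout_scores):
    """Fuse a historical prior with fresh rollout evidence.

    A precision-weighted mean: the target drifts toward the policy's
    current rollout quality as evidence accumulates, so the supervision
    signal stays informative instead of decaying against a fixed reference.
    """
    n = len(rollout_scores)
    evidence_mean = sum(rollout_scores) / n
    post_mean = (prior_weight * prior_mean + n * evidence_mean) / (prior_weight + n)
    return post_mean, prior_weight + n  # posterior becomes the next prior

target, w = 0.50, 10.0  # initial static reference (e.g., a list-wise metric of BoN samples)
for epoch_scores in [[0.55, 0.60, 0.58], [0.66, 0.70, 0.68], [0.75, 0.78, 0.80]]:
    target, w = bayes_update(target, w, epoch_scores)
```

After three epochs the target has risen above the static 0.50 reference, so candidates that would have saturated a fixed target still receive graded supervision.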

[IR-10] DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

【速读】:该论文旨在解决基于原子事实(nugget)的长篇报告评估中,人工标注nugget集合成本高、难以扩展至新信息需求尤其是跨语言场景下的难题。解决方案的关键在于提出DoGMaTiQ管道,其核心由三个阶段构成:(1) 文档驱动的nugget生成,(2) 重写聚类以去重和多样化,(3) 基于质量准则的子集选择;其中,高质量大语言模型(LLM)生成器是确保nugget质量的核心因素,且整个系统对异常系统具有鲁棒性,从而实现了跨语言场景下自动、可扩展的报告评估能力。

链接: https://arxiv.org/abs/2605.04458
作者: Bryan Li,William Walden,Yu Hou,Gabrielle Kaili-May Liu,Dawn Lawrie,Jame Mayfield,Eugene Yang,Chris Callison-Burch,Laura Dietz
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Evaluation of long-form, citation-backed reports has lately received significant attention due to the wide-scale adoption of retrieval-augmented generation (RAG) systems. Core to many evaluation frameworks is the use of atomic facts, or nuggets, to assess a report’s coverage of query-relevant information attested in the underlying collection. While nuggets have traditionally been represented as short statements, recent work has used question-answer (QA) representations, enabling fine-grained evaluations that decouple the information need (i.e. the question) from the potentially diverse content that satisfies it (i.e. its answers). A persistent challenge for nugget-based evaluation is the need to manually curate sets of nuggets for each topic in a test collection – a laborious process that scales poorly to novel information needs. This challenge is acute in cross-lingual settings, where information is found in multilingual source documents. Accordingly, we introduce DoGMaTiQ, a pipeline for generating high-quality QA-based nugget sets in three stages: (1) document-grounded nugget generation, (2) paraphrase clustering, and (3) nugget subselection based on principled quality criteria. We integrate DoGMaTiQ nuggets with AutoArgue – a recent nugget-based evaluation framework – to enable fully automatic evaluation of generated reports. We conduct extensive experiments on two cross-lingual TREC shared tasks, NeuCLIR and RAGTIME, showing strong rank correlations with both human-in-the-loop and fully manual judgments. Finally, detailed analysis of our pipeline reveals that a strong LLM nugget generator is key, and that the system rankings induced by DoGMaTiQ are robust to outlier systems. We facilitate future research in report evaluation by publicly releasing our code and artifacts at this https URL. 

[IR-11] One Pool Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

【速读】:该论文旨在解决生成式推荐(Generative Recommender, GR)推理过程中嵌入热缓存(Embedding Hot Caches, EMB)与键值(Key-Value, KV)缓存之间在有限GPU高带宽内存(High Bandwidth Memory, HBM)中竞争资源的问题。现有系统孤立优化EMB和KV缓存分配,忽视了不同工作负载下最优EMB-KV分配比例可变化高达0.35,导致20–30%的延迟改进未被实现;而盲目在线重分配又会引入H2D(Host to Device)填充流量并阻塞关键路径,引发P99服务等级目标(SLO)违反。解决方案的关键在于提出HELM系统,其核心由两个组件构成:(1) 自适应内存分配机制,采用三层PPO(Proximal Policy Optimization)控制器(冻结基础策略、在线残差适配器与突发感知恢复控制器),实现仅32 μs决策延迟的同时保持与离线最优比例误差小于0.024–0.029;(2) EMB-KV感知调度策略,通过联合考虑KV驻留性、嵌入局部性和节点负载进行请求路由,避免异构分配下的调度低效问题。实验证明,HELM在32节点A100集群上对三个生产级数据集均将P99延迟降低24–38%,并在稳态、趋势和突发等多类工作负载下实现93.5–99.6%的SLO满足率,显著优于当前最先进基线且不牺牲吞吐量。

链接: https://arxiv.org/abs/2605.04450
作者: Wenjun Yu,Shuguang Han,Amelie Chi Zhou
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves 32 μs decision latency while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets over a 32-node A100 cluster show that HELM reduces P99 latency by 24-38% over the best static policy and achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.
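The EMB-KV-aware routing idea can be sketched as a weighted score over KV residency, embedding locality, and node load. The weights, node fields, and `route` helper are illustrative assumptions, not HELM's actual scheduler.

```python
def route(request, nodes, alpha=2.0, beta=1.0, gamma=1.5):
    """Route a request to the node maximizing a joint score of KV residency,
    embedding-cache locality, and (negated) current load."""
    def score(n):
        kv_hit = 1.0 if request["session"] in n["kv_sessions"] else 0.0
        emb_hit = (len(request["emb_keys"] & n["hot_embeddings"])
                   / max(len(request["emb_keys"]), 1))
        return alpha * kv_hit + beta * emb_hit - gamma * n["load"]
    return max(nodes, key=score)["name"]

nodes = [
    {"name": "a", "kv_sessions": {"s1"}, "hot_embeddings": {1, 2}, "load": 0.2},
    {"name": "b", "kv_sessions": set(), "hot_embeddings": {1, 2, 3}, "load": 0.1},
]
req = {"session": "s1", "emb_keys": {1, 3}}
best = route(req, nodes)
```

Here node "a" wins despite worse embedding locality and higher load, because reusing its resident KV cache avoids the H2D refill traffic the paper identifies as the critical-path hazard.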

[IR-12] Reproducing Complex Set-Compositional Information Retrieval SIGIR2026

【速读】:该论文旨在解决当前信息检索系统在处理复杂查询(如涉及合取、析取和排除的集合组合查询)时,是否真正满足逻辑约束而非依赖“语义捷径”的问题。其核心挑战在于评估检索方法在面对需要精确属性谓词判断与约束满足的任务时,能否超越基于预训练知识的表面相关性匹配。解决方案的关键是提出一个新的受控基准 LIMIT+,该基准通过设计依赖任意属性谓词且弱化预训练知识影响的测试场景,有效区分了模型的真实推理能力与泛化能力。实验表明,尽管神经检索器在标准数据集 QUEST 上表现优于 BM25,但在 LIMIT+ 上性能急剧下降,而经典词汇检索方法反而显著提升,揭示了现有方法在复杂逻辑推理上的局限性,并强调了构建更可控、可复现的评估机制对推动检索模型向真正推理导向发展的必要性。

链接: https://arxiv.org/abs/2605.03824
作者: Vincent Degenhart,Dewi Timman,Arjen P. de Vries,Faegheh Hasibi,Mohanna Hoveyda
机构: Radboud University Nijmegen (拉德布德大学奈梅亨分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026, Reproducibility Track

点击查看摘要

Abstract:Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit 'semantic shortcuts'. We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers achieve an effectiveness that is more than double what can be achieved with BM25 (Recall@100 0.41 vs. 0.20), but reasoning-targeted methods like ReasonIR and Search-R1 do not outperform general-purpose retrievers uniformly; (ii) on LIMIT+, gains fail to transfer: the strongest QUEST method collapses from Recall@100 ≈ 0.42 to below 0.02, while classic lexical retrieval rises to ~0.96. Lastly, (iii) stratifying by compositional depth reveals a consistent degradation across all methods, where algebraic sparse and lexical methods show more stable performance while dense approaches collapse. We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
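The headline metric throughout this study, Recall@k, is straightforward to compute; a minimal sketch with toy document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents retrieved within the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)

ranking = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d8"}
r5 = recall_at_k(ranking, relevant, k=5)  # d1 and d2 found -> 2/3
r2 = recall_at_k(ranking, relevant, k=2)  # none in top-2 -> 0.0
```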

人机交互

[HC-0] Tailoring Scaffolding to Diagnostic Strategies: Theory-Informed LLM-Based Agents

【速读】:该论文旨在解决当前学习分析系统中大语言模型(Large Language Models, LLMs)提供的适应性支架(scaffolding)缺乏与学习理论的系统性对齐问题,导致个性化支持往往基于全局教学决策而非针对具体认知策略的精细化设计,从而限制了教学有效性。其解决方案的关键在于引入知识学习指令(Knowledge Learning Instruction, KLI)框架,将不同诊断策略所对应的知识类型与相应的教学机制相匹配,并构建一种基于KLI的混合LLM代理(hybrid LLM agent),该代理能够根据学习者当前实践的诊断策略动态调整支架形式,而非采用统一的全局支架策略,从而实现更精准、理论驱动的个性化支持。

链接: https://arxiv.org/abs/2605.04996
作者: Fatma Betul Gures,Tanya Nazaretsky,Tanja Kaser
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 3 pages, 1 figure. Companion Proceedings 16th International Conference on Learning Analytics Knowledge (LAK26), Strengthening the Use of Learning Theories for Personalization of Learning Analytics Workshop

点击查看摘要

Abstract:Learning analytics systems increasingly integrate large language models (LLMs) to provide adaptive scaffolding in complex learning environments, yet personalization is often driven by global instructional choices rather than principled alignment with learning theory, limiting effectiveness and pedagogical grounding. In prior work, we examined how structuring and problematizing scaffolding approaches can be instantiated through LLM agents in a scenario-based learning environment for diagnostic reasoning. While both approaches supported learning, we observed systematic differences in learner interaction patterns and clear tendencies indicating that different diagnostic strategies benefited from distinct forms of scaffolding. Building on these findings, we propose a theory-informed scaffolding design grounded in the Knowledge Learning Instruction (KLI) framework, as different diagnostic strategies target different types of knowledge and require different instructional mechanisms. We use KLI to guide the alignment between strategy demands and scaffolding approaches and introduce a KLI-informed hybrid LLM agent that adapts its pedagogical support according to the diagnostic strategy being practiced, rather than applying a single global scaffolding approach. We hypothesize that this design could enable better learning gains.

[HC-1] To Fuse or to Drop? Dual-Path Learning for Resolving Modality Conflicts in Multimodal Emotion Recognition

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中因模态间冲突导致的性能下降问题。传统融合方法在面对模态冲突时往往强行整合信息,而未区分冲突类型:良性冲突源于缺失、弱或模糊线索,可通过跨模态校准缓解;严重冲突则来自内在矛盾(如讽刺)或误导性信号,强制融合反而会放大错误。解决方案的关键在于提出双路径冲突解析框架(Dual-Path Conflict Resolution, DCR),其中路径I(情感融合蒸馏器,AFD)通过时间加权类证据进行反向蒸馏,提升表示层校准能力;路径II(情感辨识代理,ADA)将MER建模为上下文bandit问题,基于双视角状态和校准感知奖励动态选择融合或单模态预测,实现决策层仲裁。DCR通过软校准与硬仲裁耦合,在可对齐冲突中优化融合,在不可调和冲突中规避误导模态,从而显著提升模型鲁棒性。

链接: https://arxiv.org/abs/2605.04877
作者: Yangchen Yu,Qian Chen,Jia Li,Zhenzhen Hu,Jinpeng Hu,Lizi Liao,Erik Cambria,Richang Hong
机构: Hefei University of Technology (合肥工业大学); Singapore Management University (新加坡管理大学); Nanyang Technological University (南洋理工大学); MIT Media Lab (麻省理工学院媒体实验室)
类目: Multimedia (cs.MM); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-modal calibration, while severe conflicts arise from intrinsically contradictory (e.g., sarcasm) or misleading signals, for which forced fusion may amplify errors. Recognizing this, we propose Dual-Path Conflict Resolution (DCR), a unified framework that learns when to fuse and when to drop modalities. Path I (Affective Fusion Distiller, AFD) performs reverse distillation from audio/visual teachers to a textual student using temporally weighted class evidence, thereby enhancing representation-level calibration and improving fusion when alignment is beneficial. Path II (Affective Discernment Agent, ADA) formulates MER as a contextual bandit that selects among fusion and unimodal predictions based on a dual-view state and a calibration-aware reward, enabling decision-level arbitration under irreconcilable conflicts without requiring per-modality reliability labels. By taking into account the full multimodal context and coupling soft calibration with hard arbitration, DCR reconciles conflicts that can be aligned while bypassing misleading modalities when fusion is harmful. Across five benchmarks covering both dialogue-level and clip-level MER, DCR consistently outperforms competitive baselines or achieves highly competitive results. Further ablations, conflict-specific subset evaluation, and modality-selection analysis verify that AFD and ADA are complementary and jointly improve robust conflict-aware emotion recognition.
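The decision-level arbitration in Path II can be illustrated with a plain epsilon-greedy bandit over fusion and unimodal arms. The `ArbitrationBandit` class and the simulated rewards are assumptions; DCR's actual agent is a contextual bandit conditioning on a dual-view state with a calibration-aware reward.

```python
import random

class ArbitrationBandit:
    """Epsilon-greedy bandit choosing among fusion and unimodal prediction heads."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.q = {a: 0.0 for a in arms}  # running reward estimates per arm
        self.n = {a: 0 for a in arms}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))  # explore
        return max(self.q, key=self.q.get)        # exploit best arm

    def update(self, arm, reward):
        self.n[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]  # incremental mean

bandit = ArbitrationBandit(["fusion", "text", "audio", "vision"], epsilon=0.0)
# Simulated feedback: fusion succeeded twice, unimodal arms failed once each.
for arm, reward in [("fusion", 1.0), ("text", 0.0), ("fusion", 1.0), ("audio", 0.0)]:
    bandit.update(arm, reward)
```

With exploration disabled the bandit exploits the fusion arm, mirroring the paper's benign-conflict regime; under severe conflicts the reward history would instead steer selection toward a reliable unimodal head.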

[HC-2] Not All Scaffolds Are Equal: How Initiation Mode Determines EMME Effectiveness in Debugging

【速读】:该论文旨在解决自适应学习技术中系统驱动的干预决策如何与学习者正在进行的问题解决过程相互作用这一关键问题,尤其关注眼动建模示例(Eye Movement Modeling Examples, EMME)作为动态支架时,其触发时机对教学效果的影响。研究发现,支架启动方式是决定EMME有效性的重要设计变量:人工介入(教师或学习者主动触发)显著优于基于单一生理指标(瞳孔活动降低)的自动化触发,后者因误判导致行为干扰,表明仅以低心理努力为阈值的自动触发机制不足以支撑复杂问题解决任务的支持策略。因此,解决方案的关键在于精准控制支架的触发时机与控制模式,而非单纯依赖单一生理信号。

链接: https://arxiv.org/abs/2605.04868
作者: Anahita Golrang,Kshitij Sharma,Halszka Jarodzka,Senne Van Hoecke
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Adaptive learning technologies increasingly rely on real-time physiological analytics to trigger instructional support automatically, yet how system-driven decisions interact with learners' ongoing problem-solving processes remains poorly understood. Eye Movement Modeling Examples (EMME) have shown promise as attention-guidance tools but have been studied predominantly as static instructional materials rather than as adaptive scaffolds whose timing and initiation control can vary. This study investigates whether scaffold initiation mode shapes EMME effectiveness in novice programmers' debugging, and specifically whether automated triggering based on a single physiological indicator of low mental effort is a viable basis for adaptive scaffold delivery. A between-subjects experiment was conducted with 120 undergraduate computer science students randomly assigned to one of four conditions: teacher-initiated, learner-initiated, automated, or no-scaffold control. Participants completed ten Python debugging tasks while eye-tracking data, video interaction logs, and performance scores were recorded. All EMME conditions outperformed the control. However, human-mediated initiation, whether teacher or learner, consistently produced higher performance than automated triggering, along with more integrative engagement with the EMME material. Automated triggering based on sustained low pupillary activity was associated with disruptive behavioral patterns, suggesting mistimed delivery. EMME also eliminated the performance advantage of prior programming knowledge across all initiation modes. These findings establish scaffold initiation timing and control as critical design variables for EMME and adaptive learning technologies more broadly, and demonstrate that a single low-effort physiological threshold is insufficient as a trigger criterion for complex problem-solving support.

[HC-3] RTMS: A Real-Time Multimodal Scaffolding System for Improving Debugging in Computing Education

【速读】:该论文旨在解决编程教学中调试(debugging)能力培养不足的问题,特别是针对初学者在识别卡顿、调节问题解决策略及管理认知负荷和压力方面的困难。解决方案的关键在于设计一种基于实时多模态反馈的自适应学习系统,该系统通过眼动追踪和心率变异性数据检测学习者的认知负荷与生理压力状态,并在识别到挣扎时刻时自动推送简短且情境相关的提示。实验结果表明,所有三种反馈条件(基于认知负荷触发、基于压力触发以及两者结合)均显著提升了调试效率与成功率,其中结合触发条件效果最佳,且有效缩小了新手与专家之间的性能差距,证明了生理感知型自适应学习环境在缓解调试负担和减少先验编程经验影响方面的潜力。

链接: https://arxiv.org/abs/2605.04848
作者: Anahita Golrang,Kshitij Sharma
机构: Norwegian University of Science and Technology (挪威科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Debugging is a demanding aspect of programming, yet guidance on how to teach it effectively remains limited. Novices often struggle to recognize impasses, regulate their problem solving, and manage cognitive load and stress. This study investigates whether real-time multimodal feedback, triggered by indicators of cognitive load and physiological stress, can improve debugging performance, narrow expert-novice gaps, and reduce the influence of prior programming experience on success. We conducted a between-subjects experiment with 120 undergraduate computer science students who debugged a medium-sized Python program. Participants were assigned to one of four conditions: no feedback, cognitive-load-triggered feedback, stress-triggered feedback, or combined-trigger feedback. Eye tracking and heart rate variability data were used to detect moments of struggle and automatically deliver brief, context-sensitive hints. All three feedback conditions significantly improved debugging success and efficiency compared with the control group. Cognitive-load-triggered feedback produced stronger gains than stress-triggered feedback, and the combined-trigger condition yielded the largest improvements. Programming expertise predicted performance only in the control condition; in all feedback conditions the novice-expert gap was markedly reduced. Adaptive feedback that responds to learners' cognitive and affective states can help manage debugging demands and reduce performance differences linked to prior experience, highlighting opportunities for physiologically aware adaptive learning environments.

[HC-4] Patterns of Developer Adoption of LLM-Generated Code Refactoring Suggestions

【速读】:该论文试图解决的问题是:当前对大语言模型(Large Language Models, LLMs)生成的代码重构建议的评估主要集中在建议本身的质量上,而缺乏对其在实际开发中被开发者采纳和应用方式的理解。为了解决这一问题,研究者通过分析169个GitHub提交记录,这些提交与开发者在代码重构过程中参考ChatGPT对话的内容直接关联。关键解决方案在于系统性地识别开发者如何实际使用LLM建议——发现开发者通常直接采纳建议而不做修改,当进行修改时,多为重大调整,并呈现出五种受重构活动、开发者提示及ChatGPT响应有效性共同影响的模式。

链接: https://arxiv.org/abs/2605.04835
作者: David Schön,Faiza Amjad,Tehreem Asif,Ranim Khojah,Mazen Mohamad,Francisco Gomes de Oliveira Neto,Philipp Leitner
机构: Chalmers University of Technology and University of Gothenburg (查尔姆斯理工大学和哥德堡大学); RISE Research Institutes of Sweden (瑞典工业研究学院)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: Accepted to PROMISE 2026

点击查看摘要

Abstract:Large language models (LLMs) have gained widespread popularity and have steadily improved over time, enabling software developers to use them for various code-related tasks. One common task is code refactoring, where the LLM suggests changes for the developer to apply to their code to improve quality attributes such as readability or maintainability. While current research focuses on evaluating LLM-generated refactoring suggestions, there is a limited understanding of how developers apply these suggestions in practice. To explore this, we analyze 169 GitHub commits where developers refactor their code based on a ChatGPT conversation linked in the commit message. We found that developers mostly accept and use the suggestions without modifications. When changes are made, they are mostly major and fall into five different patterns that depend on the refactoring activity, the developer’s prompt, and the validity of the response from ChatGPT.

[HC-5] Building AI Companions that Prioritise Learning over Performance

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在教育场景中应用时所引发的“学习-绩效悖论”——即尽管LLMs能显著提升学生短期任务表现,却可能削弱深层认知发展、知识迁移能力及元认知成长。解决方案的关键在于提出并构建一种新型AI学习伙伴(AI learning companions)的设计框架,该框架基于三个相互关联的基础:以学习者如何与AI互动为核心的 pedagogical foundation(教学法基础),以AI如何动态理解学习者特征为核心的 adaptive foundation(自适应基础),以及确保系统透明、可问责、包容且安全的 responsible design foundation(负责任设计基础)。通过五个跨教育情境的案例研究验证,该框架强调从单纯优化任务输出转向发展具备教学适切性、个体适应性和促进持久理解与元认知发展的AI学习伴侣。

链接: https://arxiv.org/abs/2605.04816
作者: Hassan Khosravi,Dragan Gasevic,Shazia Sadiq,Lixiang Yan,Jason Lodge,Jason Tangen,Paul Denny,Kristen DiCerbo,Simon Buckingham Shum,Ryan S. Baker
机构: The University of Queensland (St Lucia, QLD 4072, Australia); Monash University (Clayton, VIC 3800, Australia); The University of Auckland (Auckland 1010, New Zealand); Khan Academy (Mountain View, CA 94041, USA); University of Technology Sydney (Ultimo, NSW 2007, Australia); Adelaide University (Adelaide, SA 5005, Australia); Penn Center for Learning Analytics at the University of Pennsylvania
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are rapidly transforming knowledge work by improving the quality and efficiency of tasks such as writing, coding, and data analysis. However, their growing use in education has exposed a learning-performance paradox: while they can enhance short-term task performance, they may also undermine genuine learning, including cognitive growth, knowledge transfer, and metacognitive development. This paper addresses the question of how artificial intelligence should be designed and used to support learning rather than merely improve immediate outputs. We introduce the concept of AI learning companions, defined as adaptive, pedagogically informed, LLM-powered agents designed for integration into learning environments. We propose a framework for their design built on three interrelated foundations: a pedagogical foundation focused on how students learn with AI, an adaptive foundation focused on how AI learns about students, and a responsible design foundation ensuring systems remain transparent, accountable, inclusive, and secure. The framework is illustrated through five case studies spanning diverse educational contexts, levels, and tool designs, revealing both the promise and current limitations of existing tools. We conclude that there is a necessary shift away from LLMs designed for task-oriented performance, and beyond simply prompting them to act as tutors, toward deliberately developed AI learning companions that are pedagogically sound, adapt to their learners, and foster durable understanding, metacognitive growth, and learner agency.

[HC-6] OpenWatch: A Multimodal Benchmark for Hand Gesture Recognition on Smartwatches

【速读】:该论文旨在解决腕部手势识别领域缺乏公开基准测试的问题,特别是在商业智能手表上利用同步惯性与生理传感数据进行多模态手势识别的系统性评估不足。其关键解决方案包括:构建首个开源多模态基准OpenWatch,涵盖50名受试者超过10小时的惯性测量单元(IMU)和光电容积脉搏波描记法(PPG)数据及59类标注手势序列;提出两种新颖的手势识别方法——MixToken(一种任务特定的专家混合模型,通过学习的logit混合融合通道内IMU滤波器组特征与跨通道统计token)和NormWear-Lora(面向智能手表基础模型的低秩适配模块);并通过主体无关的评估协议验证了PPG信号对基础模型性能的显著提升(F1分数提高12.5%),以及任务特定架构在准确率(F1=90% vs 66%)和内存效率(223k vs 136M参数)上的优势。

链接: https://arxiv.org/abs/2605.04791
作者: Pietro Bonazzi,Youssef Ahmed,Daniel Eckert,Andrea Ronco,Junjie Zeng,Dengxin Dai,Michele Magno
机构: ETH Zürich (苏黎世联邦理工学院); Huawei Research Zürich (华为研究苏黎世)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Despite widespread adoption of smartwatches worldwide, open benchmarks for wrist-based gesture recognition remain surprisingly limited. In this work, we introduce the first open-access multi-modal benchmark, OpenWatch, for wrist-based gesture recognition using synchronized inertial and physiological sensing on a commercial smartwatch. It contains over 10 hours of Inertial Measurement Unit (IMU) and Photoplethysmography (PPG) data across 50 participants and a vocabulary of 59 labelled gesture sequences. Furthermore, we present a subject-independent evaluation protocol including traditional and deep learning methods for time-series classification. On top of this, we develop two novel methodologies for hand-gesture recognition: (i) MixToken, a task-specific mixture-of-experts that fuses per-channel IMU filterbank features with cross-channel statistical tokens through learned logit mixing, and (ii) NormWear-Lora, a low-rank adaptation module for smartwatch foundation models. Our benchmarking results reveal that PPG signals carry a substantial predictive benefit (+12.5% F1-score) for foundational smartwatch models. In addition, we show that task-specific architectures (i.e. MixToken) substantially outperform finetuned smartwatch foundation models in terms of accuracy (F1-score=90% vs 66%) and memory efficiency (223k vs 136M parameters). Finally, we also provide clear empirical guidance on the trade-offs between specialized architecture design, modality fusion, data augmentations, and foundation-model adaptation for resource-constrained wearable sensing.
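上文提到的 MixToken 通过学习到的 logit 混合(learned logit mixing)融合各专家输出。以下为该思路的极简示意(假设性草图:门控分数、专家 logits 与类别数均为虚构示例,并非论文实现):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mix_logits(expert_logits, gate_scores):
    """Blend per-expert class logits using softmax-normalized gate weights."""
    weights = softmax(gate_scores)
    n_classes = len(expert_logits[0])
    return [sum(w * logits[c] for w, logits in zip(weights, expert_logits))
            for c in range(n_classes)]

# Two hypothetical experts scoring three gesture classes
filterbank_logits = [2.0, 0.5, -1.0]   # per-channel IMU filterbank expert
stats_logits      = [1.0, 1.5, -0.5]   # cross-channel statistics expert
mixed = mix_logits([filterbank_logits, stats_logits], gate_scores=[0.8, 0.2])
```

此处门控权重固定作演示;在实际 mixture-of-experts 中,gate_scores 通常由输入特征经可训练网络产生。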

[HC-7] A meta-analysis of the effect of generative AI on productivity and learning in programming

【速读】:该论文旨在解决生成式 AI (Generative AI) 编程辅助工具在开发者生产力提升与编程技能长期发展方面的实际效果不明确的问题。其解决方案的关键在于通过系统性文献综述与元分析(meta-analysis),整合 n = 23 项研究、k = 27 个效应量,采用 Hedges’ g 标准化效应量评估 GenAI 辅助编程对生产力(如任务完成时间、提交次数和代码行数)和学习效果(如考试成绩)的影响,并结合 RoB2 和 ROBINS-I 工具控制偏倚风险,从而量化 GenAI 在不同场景下的真实作用。结果表明,GenAI 对生产力有中等程度的正向影响(g = 0.33),但效果因实验环境而异;而在学习成效方面无显著提升(g = 0.14),提示其在教育场景中的整合需更加审慎。

链接: https://arxiv.org/abs/2605.04779
作者: Sebastian Maier,Moritz Gunzenhäuser,Jonas Schweisthal,Manuel Schneider,Stefan Feuerriegel
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) is increasingly used for programming, yet it remains unclear when and where GenAI tools lead to productivity gains. Evidence on the effects of GenAI on the long-term development of programming skills is similarly mixed. Here, we present a meta-analysis of n = 23 studies reporting k = 27 effect sizes to quantify the effect of GenAI-powered coding assistants on productivity and learning. We systematically searched (i) ACM, (ii) arXiv, (iii) Scopus, and (iv) Web of Science for studies published between 2019 and 2025. Studies were required to compare GenAI-assisted with unassisted programming using quantitative measures of (1) productivity (i.e., task completion time, commits, and lines of code) and (2) learning (i.e., exam performance). We assessed the risk of bias using RoB2 and ROBINS-I and compared standardized effect sizes using Hedges’ g . We find a statistically significant, but moderate positive effect of GenAI assistance on developer productivity ( g = 0.33 , 95% CI: [0.09, 0.58] ), yet with substantial heterogeneity across settings. Notably, productivity gains tend to be larger in controlled experimental settings, while effects are smaller in open-source and enterprise contexts. In contrast, we find no statistically significant effect of GenAI assistance on learning outcomes ( g = 0.14 , 95% CI: [-0.18, 0.47] ). Overall, these results highlight that GenAI coding assistants can increase developer productivity, although these gains depend strongly on context. In educational settings, however, the use of GenAI does not consistently translate into improved learning or skill development, which highlights the need for careful integration of GenAI into computer science education.
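该元分析以 Hedges' g 作为标准化效应量。以下为 Hedges' g(含小样本校正因子 J)计算的极简示意,输入的均值、标准差与样本量均为虚构示例:

```python
import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with small-sample correction (Hedges' g)."""
    # Pooled standard deviation across treatment and control groups
    sp = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                   / (n_t + n_c - 2))
    d = (mean_t - mean_c) / sp            # Cohen's d
    j = 1 - 3 / (4 * (n_t + n_c) - 9)     # small-sample correction factor J
    return d * j

# Hypothetical productivity scores: GenAI-assisted vs. unassisted group
g = hedges_g(mean_t=72.0, mean_c=65.0, sd_t=20.0, sd_c=22.0, n_t=30, n_c=30)
```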

[HC-8] Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

【速读】:该论文旨在解决零样本外观驱动的三维眼神估计(3D gaze estimation)在人-机器人交互(Human-Robot Interaction, HRI)场景中的可靠性问题,尤其是现有基准测试未能充分覆盖动态摄像机视角、移动目标等真实HRI条件,且跨数据集评估存在复杂度差异导致无法真实反映模型鲁棒性的问题。其解决方案的关键在于构建了一个大规模、高多样性的新基准数据集Gaze4HRI(涵盖50+受试者、3,000+视频、600,000+帧),并系统评估了多种先进方法在光照变化、头部与视线冲突以及摄像头和注视目标运动等关键HRI变量下的表现。研究发现,所有方法均在至少一种条件下失效,其中向下凝视是最普遍的失败点;而PureGaze模型因采用自对抗损失进行眼动特征净化,在多数条件下保持稳健,表明大规模数据多样性是实现零样本鲁棒性的核心因素,而特定的增强框架如自对抗训练可进一步提升性能。这一结论挑战了当前文献中对复杂时空建模和Transformer架构的过度关注,为未来研究提供了新的方向。

链接: https://arxiv.org/abs/2605.04770
作者: Berk Sezer,Ali Görkem Küçük,Erol Şahin,Sinan Kalkan
机构: Middle East Technical University (中东技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted to the 2026 IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

点击查看摘要

Abstract:While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze’s self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at this https URL.

[HC-9] Cognitive Twins: Investigating Personalized Thinking Model Building and Its Performance Enhancement with Human-in-the-Loop

【速读】:该论文旨在解决如何构建一个可解释且个性化的学习者认知模型,以支持人工智能赋能的教育系统。其核心问题在于如何从学习者日志中提取结构化认知证据,并将其组织为多层级、语义抽象递进的表示框架,从而实现对学习者思维模式的精准建模与复制(即“认知孪生”)。解决方案的关键在于提出并实现了一种五层结构的个性化思维模型(Personalized Thinking Model, PTM),该模型基于Marzano的新教育目标分类体系,通过融合大语言模型推理(Gemini 2.5 Pro)、句向量嵌入、降维和共识聚类等技术构建,能够自动识别行为实例到自我系统价值的逐层抽象关系,且在多维度评估中展现出良好的保真度与用户感知一致性。

链接: https://arxiv.org/abs/2605.04761
作者: Wu-Yuin Hwang,Nur Alif Ilyasa,Muhammad Irfan Luthfi,Yuniar Indrihapsari
机构: National Central University (国立中央大学); National Dong Hwa University (国立东华大学); Universitas Negeri Yogyakarta (印尼日惹国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 40 pages, 5 figures, 20 tables, 1 algorithm, 10 listings

点击查看摘要

Abstract:This paper presents the Personalized Thinking Model (PTM), a hierarchical and interpretable learner representation designed for AI supported education. PTM organizes evidence from learner journals into a five-layer structure covering behavioral instances, behavioral patterns, cognitive routines, metacognitive tendencies, and self-system values. PTM is grounded in Marzano’s New Taxonomy of Educational Objectives and tries to clone learner’s thinking model and build cognitive twin. It was constructed using a pipeline that combines large language model inference (Gemini 2.5 Pro), sentence embeddings, dimensionality reduction, and consensus clustering. This paper evaluates PTM fidelity through three methods applied to 40 participants in a seven-week study. First, automatic evaluation using atomic information point matching yielded an overall F1 score of 74.57% before human-in-the-loop (HITL) refinement and 75.48% after refinement. Second, user evaluation using a Likert scale produced mean ratings of 4.26 and 4.30 on a five-point scale for pre and post-HITL conditions respectively. Third, semantic alignment verification showed that topic coherence increased from 0.436 at the behavioral layer to 0.626 at the core value layer, while lexical overlap with journal vocabulary decreased from 0.114 to 0.007 across those same layers. These results suggest that the PTM produces outputs with acceptable fidelity, was generally perceived by users as reflecting their thinking, and showed a pattern consistent with semantic abstraction across layers.
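论文以原子信息点匹配(atomic information point matching)的 F1 分数评估 PTM 保真度。以下为基于匹配计数计算 F1 的极简示意(tp/fp/fn 计数为虚构示例,仅演示指标本身):

```python
def f1_score(tp, fp, fn):
    """F1 over matched atomic information points: harmonic mean of
    precision (matched / generated) and recall (matched / reference)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 3 matched points, 1 spurious, 1 missed
f1 = f1_score(tp=3, fp=1, fn=1)
```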

[HC-10] 3D Printing of Passively Actuated Self-Folding Robots with Integrated Functional Modules ICRA2026

【速读】:该论文旨在解决传统模块化机器人制造中难以实现低成本、自折叠结构与多功能集成(如传感、驱动和可重构性)的问题。其核心解决方案是提出一种由弹性驱动的自折叠方法,利用3D打印的导电聚乳酸(PLA)网状结构,在扁平状态下精确布置电子元件和磁体,随后通过嵌入式弹性带存储能量并驱动结构自动折叠成预设三维几何形态;同时,同一基底兼具电容式触觉传感功能,并支持可重复使用的平台输入/输出(I/O)模块,包含霍尔传感器和偏心旋转质量(ERM)电机以实现对接检测与振动驱动。关键创新在于建立了闭合形式的折叠模型,量化铰链刚度与弹性带力矩之间的平衡关系,从而指导设计参数(如铰链厚度、弹性带尺寸和钩间距)与目标折叠角的映射,显著提升了自折叠机器人设计的可预测性和可扩展性。

链接: https://arxiv.org/abs/2605.04757
作者: Gaolin Ge,Qifeng Yang,Haoran Lu,Tingyu Cheng,Martin Nisser,Yiyue Luo
机构: University of Washington (华盛顿大学); University of Notre Dame (圣母大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 8pages, 10 figures, This paper is accepted in ICRA 2026

点击查看摘要

Abstract:We introduce an elastic-driven self-folding approach that fabricates robots directly from flat 3D-printed conductive PLA nets. Elastic bands routed through printed hooks store energy that folds the sheet into programmed 3D geometries, while the flat state allows accurate placement of electronics and magnets before deployment. The same substrate doubles as electrodes for capacitive touch and supports a reusable platform I/O palette with Hall sensors and eccentric rotating mass (ERM) motors for docking detection and vibration actuation. We also derive a closed-form folding model that balances hinge stiffness with elastic band moment to predict equilibrium fold angles; experiments validate the model and yield a design map linking hinge thickness, band size, and hook spacing to target angles. Using this workflow we realize multiple polyhedral modules and demonstrate three applications: a cube that highlights the potential of self-folding for scalable modular robot collectives, a deployable gripper, and a tendon-driven finger. The method is low cost, stimulus-free, and integrates actuation and sensing.
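论文推导了铰链刚度与弹性带力矩平衡的闭式折叠模型。以下为该力矩平衡思想的玩具级示意(假设铰链恢复力矩与折角线性相关、弹性带力矩为常数;这只是简化版本,并非论文的原模型):

```python
import math

def equilibrium_fold_angle(k_hinge, band_moment, max_angle=math.pi / 2):
    """Toy torque balance: a linear hinge restoring moment k * theta grows
    until it matches the (assumed constant) elastic band moment, or the
    fold reaches its mechanical stop at max_angle."""
    return min(band_moment / k_hinge, max_angle)

# Hypothetical stiffness (N*m/rad) and band moment (N*m)
theta = equilibrium_fold_angle(k_hinge=0.02, band_moment=0.02)  # ~1 rad
```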

[HC-11] AICoFe: Implementation and Deployment of an AI-Based Collaborative Feedback System for Higher Education

【速读】:该论文旨在解决高等教育中同伴反馈(peer feedback)质量不一致所导致的批判性反思能力发展受限的问题。其解决方案的关键在于提出并实现了一种以人为本的AI系统——AICoFe(AI-based Collaborative Feedback),该系统采用多大语言模型(LLM)流水线架构,集成GPT-4.1-mini、Gemini 2.5 Flash和Llama 3.1,将定量评分量表数据与定性观察结果融合生成结构清晰、可操作的反馈;同时引入“教师在环路”(teacher-in-the-loop)的中介工作流,由教师通过专用学习分析仪表盘对AI生成的初稿进行筛选与优化后再分发,从而保障反馈的专业性与教育价值。

链接: https://arxiv.org/abs/2605.04740
作者: Alvaro Becerra,Alejandra Palma,Ruth Cobos
机构: GHIA Group, School of Engineering, Universidad Autónoma de Madrid, Spain
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted in LASI Spain 26: Learning Analytics Summer Institute Spain 2026

点击查看摘要

Abstract:Effective peer feedback is essential for developing critical reflection in higher education, yet its impact is often limited by the inconsistent quality of student-generated comments. This paper presents the implementation and deployment of AICoFe (AI-based Collaborative Feedback), a system designed to bridge this gap through a human-centered AI approach. We describe a modular architecture that orchestrates a multi-LLM pipeline, utilizing GPT-4.1-mini, Gemini 2.5 Flash, and Llama 3.1, to synthesize quantitative rubric data and qualitative observations into coherent, actionable feedback. Key to the system is a “teacher-in-the-loop” mediation workflow, where educators use specialized Learning Analytics dashboards to curate and refine AI-generated drafts before delivery. Furthermore, we detail the underlying data infrastructure, which employs a hybrid SQL and MongoDB strategy to ensure traceability and manage semi-structured feedback versions.

[HC-12] AISSA: Implementation and Deployment of an AI-based Student Slides Analysis tool for Academic Presentations

【速读】:该论文旨在解决高等教育中大规模班级环境下,教师难以在学生进行口头汇报前提供及时且具有操作性的幻灯片反馈的问题。其解决方案的关键在于开发了一个名为AISSA(AI-based Student Slides Analysis tool)的基于Web的系统,该系统融合了大语言模型(Large Language Models, LLMs)与学习分析(Learning Analytics)仪表盘,通过教师定义的评分量规(rubric)对学生的幻灯片进行自动评分和结构化反馈。AISSA不仅分析幻灯片内容,还评估幻灯片层面的特征,利用LLM(如ChatGPT 5.2)生成定性建议,并通过交互式仪表盘呈现结果,从而实现可扩展、高效的形成性反馈机制。

链接: https://arxiv.org/abs/2605.04729
作者: Alvaro Becerra,Diego Gomez,Ruth Cobos
机构: GHIA Group, School of Engineering, Universidad Autónoma de Madrid, Spain
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted in LASI Spain 26: Learning Analytics Summer Institute Spain 2026

点击查看摘要

Abstract:Providing timely and actionable feedback on oral presentation slides is challenging in higher education, particularly in large classes where teachers cannot realistically deliver detailed formative feedback before students present. This paper introduces AISSA (AI-based Student Slides Analysis tool), a web-based system that combines large language models (LLMs) and Learning Analytics dashboards to support scalable, rubric-based feedback on presentation slides. AISSA allows students to upload their slide decks prior to an oral presentation and automatically receive quantitative scores and qualitative feedback based on teacher-defined evaluation rubrics. The system analyzes both slide-level features and slide content, generates structured feedback through an LLM (ChatGPT 5.2), and presents the results through interactive dashboards for students and teachers. We tested AISSA on a pilot deployment with 46 undergraduate students in a real academic setting. The results indicate that AISSA is technically reliable, economically feasible, and perceived by students as useful for iterative slide improvement. These findings suggest that combining LLM-based analysis with Learning Analytics dashboards is a promising approach for supporting formative feedback on presentation slides at scale.

[HC-13] Cognitive Alignment Drives Attention: Modeling and Supporting Socially Shared Regulation in Pair Programming

【速读】:该论文旨在解决如何在协作编程(pair programming)中有效识别与增强社会共享学习(Socially Shared Regulation of Learning, SSRL)的过程机制问题,特别是聚焦于联合心理努力(Joint Mental Effort, JME)和联合视觉注意(Joint Visual Attention, JVA)作为过程指标的作用,以及如何通过人工智能驱动的自适应反馈来提升协同调节能力。其解决方案的关键在于:首先,利用双人眼动追踪、瞳孔测量与基于事件的分析方法,实证揭示JME与JVA之间的因果关系——即认知协调系统性驱动注意力协同;其次,开发并验证两种类型的AI反馈机制:一是反应式反馈(基于实时JME/JVA偏差),二是前瞻性反馈(基于机器学习预测未来协作状态),二者均显著提升性能、调节一致性及认知-注意因果链的稳定性,从而将AI定位为“智能增强型协作者”,而非自动化控制器,以支持学习者共同协调努力、注意与理解的能力。

链接: https://arxiv.org/abs/2605.04639
作者: Anahita Golrang,Kshitij Sharma
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Grounded in socially shared regulation of learning (SSRL), this paper investigates how joint mental effort (JME) and joint visual attention (JVA) serve as process-level indicators of shared regulation in pair programming and how AI-driven adaptive feedback can strengthen these processes. We present three eye-tracking studies involving 182 dyads engaged in collaborative debugging tasks. Study 1 examines natural collaboration and shows that high-performing dyads exhibit significantly higher JME and JVA, a greater prevalence of productive high-JME-high-JVA episodes, and a stable causal relationship in which JME predicts JVA. Study 2 evaluates reactive adaptive feedback based on real-time deviations in JME and/or JVA. Results show that combined feedback targeting both dimensions yields the strongest improvements in performance, regulatory coherence, and cognitive-to-attentional causality, outperforming single-channel feedback. Study 3 introduces proactive, forecast-based feedback using machine-learning predictions of future collaboration states. Proactive support further enhances performance and sustains shared regulation by anticipating breakdowns before they manifest. Across studies, causal modeling reveals that cognitive alignment systematically drives attentional coordination in successful collaboration, while mismatches between effort and attention characterize unproductive regulation. Methodologically, this work integrates dual eye-tracking, pupillometry, episode-based analysis, and causal inference to capture SSRL as a dynamic, emergent process. Conceptually, the findings position AI not as an automated controller, but as an intelligence-augmenting co-regulator that supports learners’ capacity to coordinate effort, attention, and understanding together. 

[HC-14] Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition

【速读】:该论文旨在解决可穿戴人体活动识别(Wearable Human Activity Recognition, WHAR)模型在真实场景中因跨用户分布偏移导致的性能下降问题。现有测试时适应(Test-Time Adaptation, TTA)方法多沿用视觉任务假设,未能充分利用WHAR数据流中固有的窗口间时间结构。其解决方案的关键在于重新将这种时间结构视为一种特征条件化的推理信号,而非仅作为输出空间平滑先验;通过分析时间连续性与观测诱导的特征偏差,动态决定何时保留或释放时间惯性,并在可能的转换阶段引导预测优化路径。基于此洞察,作者提出轻量级、无需反向传播的TTA框架SIGHT,利用原型驱动的预测意外估计和几何感知的过渡路由机制,在保证实时边缘部署能力的同时显著提升模型鲁棒性。

链接: https://arxiv.org/abs/2605.04617
作者: Zishu Zhou,Zaipeng Xie,Xuanyao Jie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.
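SIGHT 通过比较当前特征与原型期望状态来估计预测意外(predictive surprise)。以下为基于余弦距离的原型偏差计算的极简示意(原型字典、类别名与特征维度均为虚构,并非论文实现):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def predictive_surprise(feature, prototypes, predicted_class):
    """Deviation of the current feature from the prototype of the class the
    model currently predicts; a large value flags a likely transition."""
    return cosine_distance(feature, prototypes[predicted_class])

# Hypothetical 2-D features and per-class prototypes
protos = {"walk": [1.0, 0.2], "run": [0.1, 1.0]}
s = predictive_surprise([0.9, 0.3], protos, "walk")  # small: feature fits
```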

[HC-15] IntenBot: Flexible and Imprecise Multimodal Input for LLM s to Understand User Intentions for Casual and Human-Like HRI

【速读】:该论文旨在解决人机交互(HRI)中如何实现类似人类自然交互的灵活、非精确多模态输入理解问题,特别是在扩展现实(XR)环境中。传统机器人系统往往依赖明确、结构化的指令,难以处理用户在实际场景中常用的混合模态输入(如语音、注视和手指指向),且对输入精度要求较高,限制了交互的自然性和效率。解决方案的关键在于提出IntenBot系统,利用大语言模型(LLM)强大的语义消歧能力,从不精确或冗余的多模态输入中提取潜在意图,并生成待确认的指令,从而支持用户以更随意、符合人类习惯的方式与机器人交互,显著降低交互所需的时间、精力和注意力成本。

链接: https://arxiv.org/abs/2605.04585
作者: Yen-Ting Liu,Chiu-Hsuan Wang,TzuLing Chen,Ting-Ying Lee,Tzu-Hua Wang,Chien-Ming Lin,Bing-Yu Chen,Hsin-Ruey Tsai
机构: National Chengchi University (国立政治大学); National Taiwan University (台湾大学); CTO Office, Delta Electronics (台达电子首席技术官办公室)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In natural human-to-human communication, multimodal user input is typically used to supplement explicit and complement implicit voice commands, with casualness allowing for flexible input modality combinations and tolerance for imprecise input data. For example, saying “I want that.” with a casual glance at a bottle of water is clear enough in human-to-human communication as an implicit voice command accompanied by gaze and/or gestures, rather than an explicit one. To enable such a human-like interaction in human-robot interaction (HRI), we propose a system, IntenBot, to understand user intentions from flexible and imprecise multimodal input, including voice, gaze, and finger-pointing, in XR. The disambiguation capability of large language models (LLMs) is used to filter out irrelevant input modalities and imprecise input data, generating potential instructions for user confirmation. The flexible and imprecise multimodal input enables casual, human-like interaction with robots, reducing time, effort, and attention, and could also be used as non-voice input. We conducted an informative user behavior study in a simulated environment to understand users’ natural behavior in flexibly interacting with a robot using multimodal input and to obtain appropriate angle range parameters for gaze and finger-pointing. An XR study was then performed to evaluate the performance of IntenBot, compared with other methods. We also deployed IntenBot on a physical robot to showcase its real-world applications.

[HC-16] Characterizing Students LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks

【速读】:该论文试图解决的问题是:在真实学习环境中,学生如何使用大型语言模型(Large Language Models, LLMs)及其对学习效果的影响尚不清晰,尤其是在非受限、研究导向的课程背景下。现有研究多集中于问题求解场景或受控实验设置,缺乏对学生自然使用行为与学习成效之间关系的系统分析。解决方案的关键在于:通过收集两轮课程中学生在无限制条件下完成学术阅读、推理和批判任务时的LLM使用数据,提出一种基于实证的、自下而上的LLM使用类型分类体系,并结合三次期中考试成绩,量化分析使用频率、类型及学生主动性程度对学习表现的影响,从而揭示LLM在实际教学场景中的作用机制。

链接: https://arxiv.org/abs/2605.04534
作者: Minju Park,Ivan Orozco Vasquez,Cristina Conati
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC)
备注: EDM 2026

点击查看摘要

Abstract:Large language models (LLMs) are becoming increasingly embedded in students’ learning practices, yet much of what is known about how students use LLMs and how this usage impacts learning comes from problem-solving domains or constrained experimental settings. We present an analysis of data on LLM usage collected during two offerings of a research-oriented course where students learn to read, reason about, and critique academic papers. Without restrictions on whether or how to use LLMs, students reported their LLM usage practices when asked to do these activities as a series of homework assignments during the course. This paper extends prior work done on data from a single offering of the same course by presenting a refined bottom-up categorization of LLM usage types, cross-labeled by the extent of student initiative these usages entail. Furthermore, we examine how LLM use impacts student learning, measured by performance on three midterms, looking at factors such as frequency and type of usage.

[HC-17] Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

【速读】:该论文试图解决的问题是:当前机器学习模型对齐(alignment)评估主要依赖于模型层面的指标(如真实性、指令遵循能力或成对偏好评分),但这些指标无法充分反映部署场景下的实际对齐效果,导致对已部署模型的对齐能力存在误判。其核心问题是将部署相关的对齐性错误地从交互层或部署层简化为模型层的单一分数进行推断。解决方案的关键在于提出一个系统级评估框架:以多层级证据为基础(模型级、响应级、交互级、部署级)构建对齐画像(alignment profiles),采用固定支架协议(fixed-scaffolding protocols)实现可比的交互式评估,并通过标准化报告模板明确揭示评估证据与部署结论之间的推断距离(inferential distance)。这一方法强调对齐评估必须与其证据来源的层级保持一致,避免模型级指标对部署效果的误导性推断。

链接: https://arxiv.org/abs/2605.04454
作者: Varad Vishwarupe,Nigel Shadbolt,Marina Jirotka,Ivan Flechais
机构: University of Oxford (牛津大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen’s kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model’s verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.
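文中报告基准审计采用双人编码,一致性为 Cohen's kappa = 0.87。以下为 Cohen's kappa(经机会校正的编码者间一致性)计算的极简示意,标注数据为虚构示例:

```python
def cohens_kappa(labels_a, labels_b):
    """Inter-rater agreement corrected for chance agreement."""
    n = len(labels_a)
    # Observed agreement: fraction of items both coders label identically
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies
    cats = set(labels_a) | set(labels_b)
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

# Two hypothetical coders labeling eight benchmark items
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
b = ["yes", "yes", "no", "no", "yes", "no", "no", "no"]
kappa = cohens_kappa(a, b)
```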

[HC-18] AI and Suicide Prevention: A Cross-Sector Primer

【速读】:该论文旨在解决当前通用型AI聊天机器人(General-Purpose AI Chatbots)在心理健康支持,特别是自杀预防领域中缺乏临床验证、统一标准与协同监管的问题。其解决方案的关键在于构建跨行业共识,从模型层、产品层到政策层系统识别并应对挑战,推动AI工具在提升自杀与非自杀性自伤(NSSI)预防效果的同时促进整体心理健康,并强调通过多利益相关方协作(包括AI实验室、精神健康从业者、有经验的用户及政策制定者)实现可落地且紧迫的改进方向。

链接: https://arxiv.org/abs/2605.04321
作者: Emily Saltz,Claire R. Leibowicz
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 39 pages, 3 figures, 2 tables

点击查看摘要

Abstract:AI chatbots already function as de facto mental health support tools for millions of people, including people in crisis. Yet, they lack the clinical validation, shared standards, and coordinated oversight that their societal role demands. This primer was developed in conjunction with a multistakeholder workshop hosted by Partnership on AI in 2026, convening AI labs, mental health practitioners, people with lived experience, and policymakers, to provide a common cross-sector reference point for the current state of the field of AI and suicide prevention. It begins with an overview of clinical best practices, then turns to how frontier AI systems (as of winter 2026) detect and respond to suicide and non-suicidal self-injury (NSSI) queries. Together, these provide insight into what it would take to design and implement AI tools that not only better prevent suicide and NSSI, but also promote overall well-being. Drawing on clinical literature, publicly available AI lab policies, an emerging landscape of evaluation frameworks, and conversations with leaders across the AI and mental health fields, we map challenges posed by general-purpose AI chatbots for mental health across model, product, and policy layers, ultimately highlighting priority areas where cross-industry alignment is both urgently needed and achievable.

[HC-19] dtour: a steerable tour de vis through high-dimensional data

【速读】:该论文旨在解决高维数据可视化中因降维投影导致的信息丢失或失真问题,现有工具在投影路径的自由度与可控性之间存在权衡,难以兼顾专家引导路径与无约束探索。其解决方案的关键在于提出dtour界面,整合静态投影预览、沿连续测地线路径的可逆 scrubbing、手动投影操控以及漫游广角巡游(wandering grand tour)等功能,形成一个渐进式探索框架,同时通过GPU加速渲染支持百万级数据点,并兼容Python和JavaScript生态,从而实现高效、灵活且直观的高维数据探索。

链接: https://arxiv.org/abs/2605.04306
作者: Fritz Lekschas,Nezar Abdennur
机构: Ridge AI (Ridge AI); UMass Chan Medical School (麻省大学医学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Understanding high-dimensional data requires projecting it into lower-dimensional spaces, but any single projection inevitably loses information or introduces distortions. Tours address this limitation through animation of 2D projection sequences, yet existing tools present tradeoffs in the freedom and steerability of projection traversal, providing little to no ability to move between expert-guided paths and unrestrained exploration. We present dtour, a tour interface that combines static projection previews, reversible scrubbing along continuous geodesic projection paths, manual projection manipulation, and a wandering grand tour, all within a single progressive exploration interface. dtour scales to millions of points via GPU-accelerated rendering, runs in any modern browser, and integrates with both Python and JavaScript ecosystems. We demonstrate dtour on text, image, and single-cell data for two usage scenarios: gradually revealing structure in high-dimensional data and validating non-linear dimensionality reduction outputs.
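dtour 支持沿连续测地线投影路径的可逆 scrubbing。以下以单个投影轴的球面插值(slerp)作极简示意——完整的 grand tour 需在正交投影平面之间按主角度插值,此处仅为单轴简化版本:

```python
import math

def slerp(u, v, t):
    """Spherical interpolation between two unit projection vectors:
    a constant-speed geodesic path on the unit sphere, the 1-D analogue
    of the geodesic paths used for smooth tour animation."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    if omega < 1e-9:          # vectors (nearly) coincide: nothing to rotate
        return list(u)
    s = math.sin(omega)
    return [(math.sin((1 - t) * omega) * a + math.sin(t * omega) * b) / s
            for a, b in zip(u, v)]

# Scrubbing halfway between two hypothetical projection axes
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```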

[HC-20] OPENJ: A Conceptual Framework for Open-Source Digital Human Modeling and Ergonomic Assessment in a CAD Environment

【Quick Read】: This paper addresses the long-standing reliance of digital modeling and assessment tools for industrial workplace ergonomics on costly, closed commercial software, which limits adoption by individual researchers, small and medium-sized enterprises, and educational institutions. The key to its solution is a design blueprint named OpenJane/Joe, which realizes the core capability set of digital human modeling (DHM) in open-source form, including an anthropometric virtual mannequin, posture prediction, ergonomic assessment (e.g., RULA, REBA), and integration with a computer-aided design (CAD) environment, thereby lowering the barrier to adoption and fostering community collaboration and sustainable development.

Link: https://arxiv.org/abs/2605.04270
Authors: Sinan Bank, Casey E. Eaton
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 11 pages, 2 figures, submitted to ASME IMECE 2026

Click to view abstract

Abstract:Industrial workplace challenges range from musculoskeletal disorders – a leading cause of occupational injury – to suboptimal workstation layouts, inefficient task sequences, and poor human-equipment fit. Digital human modeling (DHM) tools address several of these challenges by placing a scalable virtual mannequin in a computer-aided design (CAD) environment, enabling engineers to evaluate ergonomic risk through standardized assessment methods (RULA, REBA, NIOSH Lifting Equation, OWAS), optimize workstation layouts for reach and visibility, predict task postures through inverse kinematics, and simulate operations before physical implementation. Despite four decades of development since the Jack system originated at the University of Pennsylvania in the 1980s, the integrated DHM capability set – anthropometric mannequin, posture prediction, ergonomic assessment, and CAD integration – remains exclusive to commercial platforms such as Siemens Tecnomatix Jack (Process Simulate), Dassault DELMIA, Humanetics RAMSIS, and the University of Iowa’s Santos system. These platforms operate under proprietary, vendor-quoted pricing models, and their acquisition and operating costs, together with closed-source implementations, have been repeatedly identified as practical adoption barriers for individual researchers, small-to-medium enterprises, and educational institutions. Organizations without access resort to manual observational methods – paper-based worksheets applied to photographs or video – sacrificing the predictive power and reproducibility that computational analysis provides. The paper serves as a design blueprint for (OpenJane/Joe), positioning the project for subsequent open-source implementation and community adoption.

[HC-21] EngThrive: Make It Fast and Easy to Do Great Work

【Quick Read】: This paper tackles the practical difficulty of measuring and improving developer productivity: although frameworks such as SPACE, DevEx, and DORA have shown productivity to be multidimensional, practitioners lack an actionable metric system to guide real improvement. The core of the solution is Engineering Thrive (EngThrive), a measurement and improvement system built around three dimensions, Speed, Ease, and Quality, with Thriving as a guardrail metric that safeguards developer wellbeing. EngThrive pairs outcome-oriented North Star metrics with diagnostic submetrics and combines system telemetry with developer surveys to achieve both scale and context; it also applies metric-design principles that align "gaming" behavior with genuine improvement, supported by a data platform, survey program, and dashboard ecosystem, ultimately enabling organizations to shift from tracking activity to focusing on outcomes and driving sustained, system-level improvement.

Link: https://arxiv.org/abs/2605.04259
Authors: Brian Houck, Tim Bozarth, David Liu, Dean Carignan
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Frameworks such as SPACE, DevEx, and DORA established that developer productivity is inherently multidimensional, but left practitioners with a practical question: what should we measure, and how should we use it to improve? This paper introduces Engineering Thrive (EngThrive), a measurement and improvement system developed and deployed across Microsoft’s engineering organization. EngThrive organizes productivity around three dimensions - Speed, Ease, and Quality - with Thriving as a guardrail to ensure developer wellbeing improves alongside performance. Within each dimension, outcome-oriented North Star metrics are paired with diagnostic submetrics, combining system telemetry with developer surveys to provide both scale and context. We describe the design principles that guide metric selection, including an approach in which well-chosen metrics align “gaming” behavior with genuine improvement. We also outline the data platform, survey program, and dashboard ecosystem required to operationalize this approach in practice, and present case studies demonstrating how outcome-oriented measurement enables sustained, system-level improvements. Finally, we show that EngThrive functions as a general-purpose evaluation language, applicable not only to developer tools and AI, but to organizational policies, work environments, and other factors that shape how developers experience their work. We offer EngThrive as a concrete model for organizations seeking to move beyond measuring activity toward improving outcomes.

[HC-22] Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

【Quick Read】: This paper addresses how to mimic black-box reinforcement learning policies efficiently and interpretably while preserving performance and reducing dependence on complex models. The key to the solution is State Vector Space Partitioning (SVSP), which splits a dataset of state-action pairs with linear support vector machine splits to build a compact, structured subpolicy representation of the original policy, improving mean return (+7.4% over Voronoi State Partitioning and +2.8% over the original TD3 policy) while reducing the number of required subpolicies by 82.1%.

Link: https://arxiv.org/abs/2605.04254
Authors: Senne Deproost, Mehrdad Asadi, Ann Nowé
Affiliations: Vrije Universiteit Brussel; Flanders Make
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: Accepted for poster presentation at HHAI 2026

Click to view abstract

Abstract:We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state action pairs with linear support vector machine splits, SVSP constructs a compact and structured representation of the original policy. Our method improves mean return by +7.4% over previous critic driven state partitioning attempts such as Voronoi State Partitioning (VSP) and +2.8% over the original TD3 policy, while reducing the number of required subpolicies against VSP by 82.1%. Our results pave the path towards a more flexible form of distillation where both the decision boundary and surrogate models can be chosen within a margin of the original black box behavior.
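
The distilled policy described above is essentially a tree of linear decision boundaries routing each state to a simple subpolicy. A toy illustration of that structure follows; the split weights and subpolicies here are invented for the example, not learned by SVSP:

```python
class SplitNode:
    """Internal node: a linear decision boundary w.s + b >= 0,
    of the kind a linear SVM split would produce."""
    def __init__(self, w, b, left, right):
        self.w, self.b, self.left, self.right = w, b, left, right

    def route(self, state):
        score = sum(wi * si for wi, si in zip(self.w, state)) + self.b
        child = self.right if score >= 0 else self.left
        return child.route(state)

class Leaf:
    """Leaf: a simple, human-readable subpolicy (here a linear map)."""
    def __init__(self, coeffs):
        self.coeffs = coeffs

    def route(self, state):
        return sum(c * s for c, s in zip(self.coeffs, state))

# Hypothetical 1D example: split the state space at s = 0 and use a
# different linear subpolicy on each side of the boundary.
policy = SplitNode(w=[1.0], b=0.0,
                   left=Leaf([-2.0]),   # action = -2s for s < 0
                   right=Leaf([0.5]))   # action = 0.5s for s >= 0
```

Each leaf is inspectable on its own, which is what makes the partitioned surrogate interpretable relative to the original black-box network.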

[HC-23] Pro2Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks

【Quick Read】: This paper addresses the limitation that existing personal assistants offer only reactive responses or limited proactive help for long-horizon, multi-step procedural tasks: current systems typically rely on user queries to trigger guidance and struggle to continuously track task progress and intervene as the user's state evolves. The key to the solution is Pro²Assist, a step-aware proactive assistance system built on augmented reality (AR) glasses that fuses multimodal sensing (e.g., motion) with multi-scale temporal dynamics and domain expert knowledge to construct a fine-grained task context representation, then performs continuous reasoning to accurately infer user needs and deliver timely, accurate proactive guidance during task execution.

Link: https://arxiv.org/abs/2605.04227
Authors: Lilin Xu, Bufang Yang, Siyang Jiang, Kaiwei Liu, Kaiyuan Hou, Yuang Fan, Hongkai Chen, Zhenyu Yan, Xiaofan Jiang
Affiliations: Columbia University; The Chinese University of Hong Kong
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Procedural tasks with multiple ordered steps are ubiquitous in daily life. Recent advances in multimodal large language models (MLLMs) have enabled personal assistants that support daily activities. However, existing systems primarily provide reactive guidance triggered by user queries, or limited proactive assistance for isolated short-term events rather than long-horizon procedural tasks. In this work, we introduce Pro²Assist, a step-aware proactive assistant that continuously tracks fine-grained task progress and reasons over the user’s evolving state to provide timely assistance throughout tasks. Pro²Assist leverages multimodal data from augmented reality (AR) glasses to achieve motion-based perception. It then extracts step-oriented procedural context from multi-scale temporal dynamics and task-specific expert knowledge. Based on both sensory input and procedural context, Pro²Assist performs continuous reasoning to infer user needs and display timely assistance on AR glasses. We evaluate Pro²Assist using a dataset curated from public sources and a real-world dataset collected on our testbed with AR glasses. Extensive evaluations show that Pro²Assist outperforms the best-performing baselines by over 21% in procedural action understanding accuracy, and it achieves up to 2.29x the proactive timing accuracy of baselines. A user study with 20 participants further shows that 90% find Pro²Assist useful, indicating its effectiveness for real-world procedural assistance.
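
At its core, step-aware proactive assistance needs a component that tracks progress through an ordered step list and decides when to intervene. A deliberately simplified sketch of such a tracker follows; this is a hypothetical illustration of the idea, not Pro²Assist's actual reasoning pipeline:

```python
class StepTracker:
    """Minimal proactive step tracker: advance on the expected action,
    emit a hint when the user appears to skip ahead in the procedure."""
    def __init__(self, steps):
        self.steps = steps
        self.current = 0  # index of the next expected step

    def observe(self, action):
        if self.current >= len(self.steps):
            return "task complete"
        if action == self.steps[self.current]:
            self.current += 1
            return None  # on track, stay quiet
        if action in self.steps[self.current + 1:]:
            # user is attempting a later step out of order: intervene
            return f"before '{action}', first do '{self.steps[self.current]}'"
        return None  # unrelated action, ignore

# Toy procedure (invented for illustration)
recipe = ["boil water", "add pasta", "drain", "add sauce"]
tracker = StepTracker(recipe)
```

A real system would replace the exact string match with perception-based action recognition and would time its hints against the user's state, but the routing logic is the same shape.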

[HC-24] Exploring the Output of Software Testing Tools through a Visual Comparative Analysis

【Quick Read】: This paper addresses the lack of a systematic understanding of shared interface elements and patterns in the visual output of software testing tools, a gap that Human-Computer Interaction (HCI) research has yet to fill. The key to the solution is a visual comparative analysis of the output of 50 mainstream software testing tools and harnesses (44 with CLI output, 6 with GUI output), which reveals common interface elements, how test results are presented, and how colour is used, providing testing-tool developers with reusable design trends and a basis for improvement.

Link: https://arxiv.org/abs/2605.04189
Authors: Brandon Lit, Anthony Maocheia-Ricci, Thomas Driscoll
Affiliations: Cheriton School of Computer Science, University of Waterloo
Subjects: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Software testing is a fundamental process of software development, and prior work has shown that visualizations of test results support testers’ decision-making. However, Human-Computer Interaction research on software testing has yet to explore and understand the shared interface elements and patterns in visualization of testing outputs. To address this, we conducted a visual comparative analysis of the output of 50 software testing tools and harnesses (44 with CLI output, 6 with GUI output) across four popular programming languages. Our analysis reveals the common interface elements in software testing tools, how these tools display and visualize test results, as well as the specific make-up of the output. Our findings provide insight on how visual testing output is formatted and how colour is used across both CLI and GUI environments, identifying trends that can be applied by developers of testing tools.
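
A concrete instance of the CLI conventions the study catalogues is the coloured pass/fail result line. A generic sketch using ANSI escape codes, not drawn from any specific tool in the corpus:

```python
# ANSI SGR colour codes: green for pass, red for fail (a widely shared
# convention across CLI test harnesses).
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def format_result(name, passed, use_color=True):
    """Render a test result line in the pass/fail style common to CLI
    test tools, with ANSI colour when the terminal supports it."""
    label = "PASS" if passed else "FAIL"
    if use_color:
        colour = GREEN if passed else RED
        label = f"{colour}{label}{RESET}"
    return f"[{label}] {name}"
```

The `use_color` flag mirrors how real harnesses disable colour when output is piped to a file, one of the formatting behaviours a comparative analysis like this one would observe.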

[HC-25] Two Integration Pathways in Human-Centered Requirements Engineering: A Systematic Mapping Study of Structural Gaps

【Quick Read】: This paper addresses the unclear structure of multidisciplinary contributions in human-centered requirements engineering (HC-RE), their uneven distribution across lifecycle phases, and the persistent gap between theory and practice. The key insight is the identification of two parallel integration pathways: a Cognitive-Formal (C-F) pathway grounded in goal-based frameworks and formal modeling, and a Participatory-Iterative (P-I) pathway grounded in scenario-based frameworks and iterative design; the absence of translation mechanisms between them is the core structural gap behind the field's fragmentation and limited application. The paper accordingly proposes Experience-Centered Requirements Engineering, a new direction that treats user experience explicitly as a first-class concern in requirements specification, and establishes a structured research agenda and empirical foundation.

Link: https://arxiv.org/abs/2605.04132
Authors: Imen Benzarti, Ikram Darif, Abderrahmane Leshob, Hafedh Mili, Darine Amayed
Affiliations: École de technologie supérieure (ÉTS); Université du Québec à Montréal (UQAM); Université du Québec en Abitibi-Témiscamingue (UQAC)
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Human-centered Requirements Engineering (HC-RE) integrates user cognition, emotions, and social interactions into the RE process through contributions from disciplines such as psychology, cognitive science, design thinking, and human-computer interaction. Despite growing interest, how these multidisciplinary contributions are structured and why they remain fragmented across the RE lifecycle is not well understood. This systematic mapping study analyzes 56 primary studies across seven dimensions, including RE phases, user involvement techniques, contributing disciplines, and evaluation methods. Results show that 70% of approaches involve multidisciplinary contributions, yet only 39% have been empirically evaluated and 48% address only the elicitation phase. A cross-study analysis reveals a structural separation between two parallel integration traditions: a Cognitive-Formal (C-F) pathway grounded in goal-based frameworks and formal modeling, and a Participatory-Iterative (P-I) pathway grounded in scenario-based frameworks and iterative design. Each pathway has developed complementary strengths, but their near-total disconnection explains the persistent lifecycle concentration and theory-practice gap observed in the corpus. The findings identify the absence of translation mechanisms between human-centered artifacts and formal RE specifications as the field’s primary structural gap, provide a structured research agenda organized into four priority tiers, and establish the empirical foundation for Experience-Centered Requirements Engineering, a direction in which user experience is explicitly operationalized as a first-class concern in requirements specification.

[HC-26] Toward Human-AI Complementarity Across Diverse Tasks

【Quick Read】: This paper investigates whether human-AI complementarity, i.e., combining human and AI judgments to outperform either alone, is achievable on realistic tasks, as a possible pathway toward robust oversight of advanced AI systems. The core finding is that gains under current methods are modest: baseline hybridization yields only +0.4 percentage points, limited mainly by the tiny fraction of items (8.9%) where the AI errs but humans do not, and by the fact that model confidence cannot identify this region; top-2 assistance raises human accuracy from 28.4% to 38.3%, clearly surpassing AI alone (37.7%), but its success comes mainly from humans adopting correct AI suggestions rather than correcting AI errors. The key bottleneck is therefore not human task accuracy per se, but routing decisions to humans at the moments that matter and designing assistance that helps humans catch and fix AI mistakes.

Link: https://arxiv.org/abs/2605.04070
Authors: Yuzheng Xu, Annya Dahmani, Matthew D. Blanchard, Niclas Dern, Edy Nastase, Francesca Bianco, Maja Pavlovic, Sukanya Krishna, Eric Modesitt, Miranda Anna Christ, Arth Singh, Gaia Molinaro, Sikata Bela Sengupta, Jaji Pamarthi, Arjun Menon, Rishub Jain
Affiliations: The University of Tokyo; UC Berkeley; University of Sydney; Queen Mary University London; Harvard University; UIUC; NIT Agartala; University of Pennsylvania; Princeton University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages main text, 37 pages total with appendices

Click to view abstract

Abstract:Human-AI complementarity, the idea that combining human and AI judgments can outperform either alone, offers a promising pathway toward robust oversight of advanced AI systems. However, whether human-AI complementarity can be achieved on realistic tasks remains an open question. We investigate this through two approaches: hybridization and two AI assistance methods (top-2 assistance and subtask delegation), evaluated on a multi-domain dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. We find only modest complementarity gains. Baseline hybridization yields just +0.4 percentage points (pp) over AI alone (69.3% vs 68.9%), limited both by a small complementarity region (only 8.9% of items where AI errs but humans do not) and the inability of confidence-based routing to identify it, since the model’s confidence is similarly distributed across correct and incorrect predictions. Applied when AI has low confidence, top-2 assistance increases human accuracy from 28.4% to 38.3%, surpassing AI alone (37.7%) – but primarily because humans adopt correct AI suggestions, not because they successfully override AI errors. These findings suggest that the primary bottleneck is not human task accuracy per se, but the ability to route decisions to humans when it matters and to design assistance methods that enable humans to catch AI mistakes. Our quantitative and qualitative analyses pinpoint where and why each method succeeds or fails, offering concrete targets for future work. We will release our dataset and code upon request to support progress toward more effective human-AI collaboration for AI oversight.
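
The "baseline hybridization" setup can be summarized as confidence-based routing: send an item to the human only when the AI's confidence falls below a threshold. A toy sketch with invented data (illustrative only, not the paper's dataset or protocol):

```python
def hybrid_accuracy(items, threshold):
    """Route each item to the human when AI confidence is below the
    threshold, otherwise keep the AI answer; return overall accuracy.
    Each item is (ai_pred, ai_conf, human_pred, truth)."""
    correct = 0
    for ai_pred, ai_conf, human_pred, truth in items:
        final = human_pred if ai_conf < threshold else ai_pred
        correct += (final == truth)
    return correct / len(items)

# Invented toy data: the AI is wrong on one low-confidence item that the
# human gets right, so routing recovers it.
items = [
    ("A", 0.9, "B", "A"),
    ("B", 0.4, "A", "A"),
    ("C", 0.8, "C", "C"),
    ("D", 0.3, "D", "E"),
]
```

The paper's negative result corresponds to the case where AI confidence looks similar on correct and incorrect predictions, so no threshold isolates the complementarity region.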

[HC-27] SemiConLens: Visual Analytics for 2D Semiconductor Discovery

【Quick Read】: This paper addresses the unreliable predictions in two-dimensional (2D) semiconductor discovery caused by sparse data, limited model reliability, and lack of interpretability; existing approaches such as Density Functional Theory (DFT) and machine learning (ML) methods struggle with small datasets and uncertainty, hindering efficient and trustworthy screening. The key to the solution is the SemiConLens visual analytics framework, whose core innovations are: 1) a Correlation Aware Multivariate Imputation (CAMI) method combined with ML models such as autoencoders to learn better from limited data and quantify prediction uncertainty; and 2) a visualization module of three linked views, featuring a novel circular glyph design and a cluster-aware layout optimization strategy, that intuitively presents user-configurable key attributes and uncertainty, enabling reliable and efficient human-in-the-loop exploration of 2D semiconductor candidates.

Link: https://arxiv.org/abs/2605.04067
Authors: Kavinda Athapaththu, Shiwei Chen, Yuan Fang, Sanchali Mitra, Yee Sin Ang, Yong Wang
Affiliations: Nanyang Technological University; Singapore Management University; Singapore University of Technology and Design
Subjects: Human-Computer Interaction (cs.HC); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The past few years have witnessed vibrant efforts in discovering new two-dimensional (2D) semiconductor materials from both academia and the industry, due to their promising potential in resolving the severe performance deterioration of traditional semiconductors resulting from condensed silicon thickness. However, existing methods (e.g., Density Functional Theory (DFT) or machine-learning-based approaches) suffer from various challenges such as small datasets, and reliability and trustworthiness issues. To bridge this gap, we propose SemiConLens, a visual analytics approach to combine human expertise with the power of ML to enable effective and reliable 2D semiconductor discovery. Specifically, we first develop a new Correlation Aware Multivariate Imputation (CAMI) method and use ML models like autoencoder, which can better learn from limited data and reveal uncertainty, to address the challenge of sparse data in semiconductivity prediction. Built upon this, our visualization module, consisting of three visualization views with linked interactions, allows material researchers to interactively filter, discover and compare 2D semiconductor candidates. A novel circular glyph design and a new cluster-aware layout optimization approach are proposed to effectively display all the user-configurable key attributes and possible prediction uncertainties of each semiconductor candidate, ensuring a reliable and trustable 2D semiconductor discovery. We assess SemiConLens through quantitative evaluations, expert interviews, and use cases. The results demonstrate SemiConLens’s capability to help material researchers conduct effective discovery of desirable 2D semiconductors.
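
While the paper's CAMI method is not detailed in the abstract, the general idea of correlation-aware imputation can be sketched as estimating a missing entry from the other features of the same row, weighted by how strongly each feature correlates with the missing one. A toy pure-Python illustration (not the authors' algorithm):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def impute(rows, i, j):
    """Correlation-weighted estimate for the missing entry rows[i][j] (None).
    Column statistics and correlations come from the fully observed rows."""
    complete = [r for r in rows if None not in r]
    cols = list(zip(*complete))
    mean = [statistics.fmean(c) for c in cols]
    std = [statistics.stdev(c) for c in cols]
    num = den = 0.0
    for k, v in enumerate(rows[i]):
        if k == j or v is None:
            continue
        w = pearson(cols[j], cols[k])          # how informative column k is for j
        num += w * (v - mean[k]) / std[k]      # correlation-weighted z-score
        den += abs(w)
    z = num / den if den else 0.0
    return mean[j] + std[j] * z

# Toy table with a perfect linear relation (col1 = 2 * col0) and one gap.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None]]
```

With the perfectly correlated toy columns above, the imputed value for the gap is 8.0, i.e., exactly twice the observed value, which is the behaviour one would want from a correlation-aware imputer.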

[HC-28] Modeling Subjective Urban Perception with Human Gaze

【Quick Read】: This paper addresses the neglect of the human perceptual process in existing urban perception modeling: current computational approaches predict subjective urban perception directly from street view images without considering mechanisms such as gaze behavior through which people form judgments. The key to the solution is Place Pulse-Gaze, a new dataset pairing street view images with synchronized eye-tracking recordings and individual perception labels, together with a Gaze-Guided Urban Perception Framework that systematically examines how gaze improves prediction of subjective urban perception under three complementary settings: gaze-only modeling, fusing gaze with explicit semantic scene representations, and fusing gaze with implicit, richer visual representations. Experiments show that gaze alone already carries useful predictive signals and that combining gaze with scene representations further improves prediction, underscoring the importance of incorporating human perceptual processes into urban scene understanding.

Link: https://arxiv.org/abs/2605.00764
Authors: Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal, Peter Kiefer
Affiliations: ETH Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.

Computer Vision

[CV-0] Syn4D: A Multiview Synthetic 4D Dataset

【Quick Read】: This paper addresses the open computer vision challenge of dense 3D reconstruction and tracking of dynamic scenes from monocular video, whose central difficulty is the scarcity of high-quality annotated data. To overcome this limitation, the authors propose Syn4D, a multiview synthetic dataset of dynamic scenes whose key feature is the ability to unproject any pixel into 3D at any time and to any camera view, providing complete geometric information (ground-truth camera poses, depth maps, dense point tracking, and parametric human pose annotations). This significantly improves modeling accuracy and generalization on downstream tasks such as 4D scene reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation, offering a powerful resource for research on dynamic scene understanding and spatiotemporal modeling.

Link: https://arxiv.org/abs/2605.05207
Authors: Zeren Jiang, Yushi Lan, Yihang Luo, Yufan Deng, Zihang Lai, Edgar Sucar, Christian Rupprecht, Iro Laina, Diane Larlus, Chuanxia Zheng, Andrea Vedaldi
Affiliations: University of Oxford; University College London; École Normale Supérieure; Facebook AI Research; Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 10 figures, project page: this https URL

Click to view abstract

Abstract:Dense 3D reconstruction and tracking of dynamic scenes from monocular video remains an important open challenge in computer vision. Progress in this area has been constrained by the scarcity of high-quality datasets with dense, complete, and accurate geometric annotations. To address this limitation, we introduce Syn4D, a multiview synthetic dataset of dynamic scenes that includes ground-truth camera motion, depth maps, dense tracking, and parametric human pose annotations. A key feature of Syn4D is the ability to unproject any pixel into 3D to any time and to any camera. We conduct extensive evaluations across multiple downstream tasks to demonstrate the utility and effectiveness of the proposed dataset, including 4D scene reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation. The experimental results highlight Syn4D’s potential to facilitate research in dynamic scene understanding and spatiotemporal modeling.
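
The "unproject any pixel into 3D" capability follows standard pinhole camera geometry: scale the ray through pixel (u, v) by its depth, then transform from camera to world coordinates with the pose. A minimal sketch of that generic geometry (not Syn4D's actual data loader; intrinsics and pose values in the example are made up):

```python
def unproject(u, v, depth, intrinsics, pose):
    """Lift pixel (u, v) with known depth into world coordinates.
    intrinsics: (fx, fy, cx, cy); pose: (R, t) mapping camera -> world,
    with R a 3x3 row-major rotation matrix and t a 3-vector."""
    fx, fy, cx, cy = intrinsics
    # Pixel -> camera ray, scaled by depth (pinhole model).
    xc = (u - cx) * depth / fx
    yc = (v - cy) * depth / fy
    cam = (xc, yc, depth)
    R, t = pose
    # Camera -> world: X_w = R @ X_c + t
    return tuple(sum(R[r][c] * cam[c] for c in range(3)) + t[r] for r in range(3))

# Hypothetical camera: focal length 100 px, principal point at (50, 50),
# posed at the world origin with no rotation.
K = (100.0, 100.0, 50.0, 50.0)
identity_pose = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                 (0.0, 0.0, 0.0))
```

With ground-truth depth and pose for every frame and camera, applying this map per pixel is exactly what makes dense 3D/4D supervision possible.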

[CV-1] Taming Outlier Tokens in Diffusion Transformers

【Quick Read】: This paper addresses the problem of outlier tokens in Diffusion Transformers (DiTs) for image generation: both the encoder and the denoiser can produce a small number of high-norm tokens carrying limited local semantic information, which attract disproportionate attention and degrade generation quality. The study shows that simply masking high-norm tokens does not improve performance, indicating the root cause is not extreme values alone but is closely tied to corrupted local patch semantics. The key to the solution is Dual-Stage Registers (DSR), which intervene with trained registers in the pretrained ViT encoder and intermediate DiT layers during training, and apply recursive test-time registers and diffusion registers at inference for dynamic compensation, effectively suppressing outlier-token interference and markedly improving image generation quality.

Link: https://arxiv.org/abs/2605.05206
Authors: Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan, Chen Wei
Affiliations: Rice University; Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Click to view abstract

Abstract:We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
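
The "high-norm token" criterion discussed above is straightforward to operationalize: compare each token's L2 norm against a robust scale such as the median norm. A minimal sketch (the threshold factor is an assumption for illustration, not a value from the paper):

```python
import math
import statistics

def outlier_token_indices(tokens, factor=3.0):
    """Flag tokens whose L2 norm exceeds `factor` times the median norm,
    the simple high-norm criterion used to identify outlier tokens."""
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    med = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > factor * med]
```

As the paper's masking ablation shows, detecting such tokens is easy; the interesting finding is that removing them is not enough, which is why DSR intervenes with registers instead.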

[CV-2] D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

【Quick Read】: This paper addresses the challenge that directly applying continuous supervised fine-tuning to efficient few-step diffusion models (e.g., Z-Image-Turbo and FLUX.2-klein) destroys their inherent few-step inference capability. The key to the solution is D-OPSD (on-policy self-distillation for step-distilled diffusion models), a new training paradigm that exploits the in-context abilities inherited when an LLM/VLM serves as the encoder of a modern diffusion model, casting training as on-policy self-distillation: the model plays both teacher and student, where the student is conditioned only on text features while the teacher is conditioned on multimodal features of both the text and the target image; by minimizing the two predicted distributions over the student's own trajectories, D-OPSD learns new concepts, styles, and other knowledge without sacrificing the original few-step inference capability.

Link: https://arxiv.org/abs/2605.05204
Authors: Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi
Affiliations: The Hong Kong University of Science and Technology; Z-Image Team, Alibaba Group; University of California, San Diego; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Click to view abstract

Abstract:The landscape of high-performance image generation models is currently shifting from inefficient multi-step ones to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that a modern diffusion model whose encoder is an LLM/VLM can inherit its encoder’s in-context capabilities. This enables us to cast training as an on-policy self-distillation process. Specifically, during training, the model acts as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student’s own roll-outs. By optimizing on the model’s own trajectory under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc. without sacrificing the original few-step capacity.
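
The teacher/student scheme can be pictured as running the same network twice on the student's own roll-out, once with text-only conditioning and once with text-plus-image conditioning, and penalizing the gap between the two predictions. A toy numeric sketch of that shape; MSE stands in for the paper's unspecified distribution-matching loss, and the "model" is a placeholder, not a diffusion network:

```python
def self_distill_loss(student_pred, teacher_pred):
    """Mean-squared gap between the student's and teacher's predictions
    on the student's own roll-out (a stand-in for the actual objective)."""
    return sum((s - t) ** 2 for s, t in zip(student_pred, teacher_pred)) / len(student_pred)

def model(latent, context):
    # Placeholder 'network': context is a scalar summary of the conditioning.
    return [x + context for x in latent]

rollout = [0.1, -0.3, 0.5]             # sample from the student's own trajectory
student = model(rollout, context=0.0)  # text-only conditioning
teacher = model(rollout, context=0.2)  # text + target-image conditioning
loss = self_distill_loss(student, teacher)
```

Because the gradient flows only through the student branch in such schemes, the model is supervised on its own outputs rather than on off-policy targets, which is the "on-policy" part of the name.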

[CV-3] LoViF 2026: The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

【Quick Read】: This paper addresses the limitation that current evaluation of generated videos relies on perceptual quality alone, ignoring physical plausibility, temporal coherence, and consistency with input conditions, and therefore cannot holistically assess videos produced by world models. The key to the solution is the LoViF 2026 PhyScore challenge, a multi-dimensional evaluation framework requiring participants to build metrics that jointly predict four dimensions (Video Quality, Physical Realism, Condition-Video Alignment, Temporal Consistency) and to localize physical anomaly timestamps for fine-grained diagnosis. The benchmark contains 1,554 videos from seven representative world generative models covering 26 physics-relevant categories, with label reliability ensured by trained human annotation plus automated quality control, and uses a composite evaluation protocol based on TimeStamp_IOU and SRCC/PLCC, advancing systematic assessment of physical plausibility and dynamic consistency in generated video.

Link: https://arxiv.org/abs/2605.05187
Authors: Wei Luo, Yiting Lu, Xin Li, Haoran Li, Fengbin Guan, Chen Gao, Xin Jin, Yong Li, Zhibo Chen, Sijing Wu, Kang Fu, Yunhao Li, Ziang Xiao, Huiyu Duan, Jing Liu, Qiang Hu, Xiongkuo Min, Guangtao Zhai, Manxi Sun, Zixuan Guo, Yun Li, Ziyang Chen, Manabu Tsukada, Zhengyang Li, Zhenglin Du, Yi Wen, Licheng Jiao, Fang Liu, Lingling Li, Yiwen Ren, Zhilong Song, Dubing Chen, Yucheng Zhou, Tianyi Yan, Huan Zheng
Affiliations: University of Macau; Xidian University; Shanghai Jiao Tong University; Tianjin University; Futian Lab; The University of Tokyo; University of Science and Technology of China; Tsinghua University; Eastern Institute of Technology; Ningbo Institute of Digital Twin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. Beyond that, participants also need to localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality-control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method-level insights from submitted solutions.
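
The TimeStamp_IOU component reduces to interval intersection-over-union between a predicted anomaly span and an annotated one. A minimal sketch of the single-interval case (the challenge's exact matching rules for multiple intervals are not given here):

```python
def timestamp_iou(pred, gt):
    """IoU between two time intervals given as (start, end) pairs,
    the basic quantity behind timestamp-localization scoring."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a predicted anomaly over seconds [0, 2] against an annotation over [1, 3] overlaps for 1 second out of a 3-second union, giving an IoU of 1/3.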

[CV-4] OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

【Quick Read】: This paper addresses the difficulty of reproducing frontier multimodal deep search agents, which stems from the absence of open high-quality training data, transparent trajectory-synthesis pipelines, and detailed training recipes. The key to the solution is OpenSearch-VL, a fully open-source training recipe whose core components are: (1) a dedicated pipeline for high-quality training data that reduces shortcut learning and one-step retrieval collapse via Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding; (2) a diverse tool environment (text search, image search, OCR, image processing, etc.) that lets agents combine active perception with external knowledge acquisition; and (3) a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. The recipe delivers average gains of over 10 points across seven benchmarks and matches proprietary commercial models on several tasks.

Link: https://arxiv.org/abs/2605.05185
Authors: Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, Tianyu Pang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Github Page: this https URL

Click to view abstract

Abstract:Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
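
One plausible reading of the "fatal-aware" credit assignment is: zero out token advantages after the fatal failure point, and clamp the remaining advantages from one side so pre-failure reasoning is not penalized by the trajectory-level failure signal. A speculative toy sketch of that interpretation (the paper's actual GRPO formulation may differ):

```python
def fatal_aware_advantages(advantages, failure_index):
    """Toy interpretation of fatal-aware credit assignment: mask (zero)
    advantages for tokens at or after a fatal tool failure, and one-sidedly
    clamp the surviving pre-failure advantages at zero so useful reasoning
    before the failure is kept rather than punished."""
    out = []
    for i, a in enumerate(advantages):
        if failure_index is not None and i >= failure_index:
            out.append(0.0)          # masked: no gradient after the failure
        else:
            out.append(max(a, 0.0))  # one-sided clamp keeps useful reasoning
    return out
```

When no fatal failure occurred (`failure_index=None`), only the clamp applies; the design intent in either case is to stop a single cascading tool error from wiping out the signal on good earlier steps.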

[CV-5] Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

【Quick Read】: This paper addresses the inadequate feature representation of whole-slide images (WSIs) in pathology diagnosis caused by complex local tissue structure and regional heterogeneity: existing multiple instance learning (MIL) methods embed patches in homogeneous Euclidean spaces and struggle to jointly model hierarchical tissue organization and cell-level morphological detail. The key to the solution is a hybrid hyperbolic-Euclidean representation that maps WSI features into two geometric spaces, using hyperbolic space to capture hierarchical tissue structure and Euclidean space to express fine-grained local morphology. On top of this, the BatMIL model combines a structured state space sequence model (S4) to capture long-range dependencies among thousands of patches with linear complexity, and a chunk-level mixture-of-experts (MoE) module that dynamically routes regions to handle heterogeneity, improving classification performance while reducing redundant computation.

Link: https://arxiv.org/abs/2605.05164
Authors: Enhui Chai, Sicheng Chen, Tianyi Zhang, Chad Wong, Kecheng Huang, Zeyu Liu, Fei Xia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate analysis of histopathological images is critical for disease diagnosis and treatment planning. Whole-slide images (WSIs), which digitize tissue specimens at gigapixel resolution, are fundamental to this process but require aggregating thousands of patches for slide-level predictions. Multiple Instance Learning (MIL) tackles this challenge with a two-stage paradigm, decoupling tile-level embedding and slide-level prediction. However, most existing methods implicitly embed patch representations in homogeneous Euclidean spaces, overlooking the hierarchical organization and regional heterogeneity of pathological tissues. This limits current models’ ability to capture global tissue architecture and fine-grained cellular morphology. To address this limitation, we introduce a hybrid hyperbolic-Euclidean representation that embeds WSI features in dual geometric spaces, enabling complementary modeling of hierarchical tissue structures and local morphological details. Building on this formulation, we develop BatMIL, a WSI classification framework that leverages both geometric spaces. To model long-range dependencies among thousands of patches, we employ a structured state space sequence model (S4) backbone that encodes patch sequences with linear computational complexity. Furthermore, to account for regional heterogeneity, we introduce a chunk-level mixture-of-experts (MoE) module that groups patches into regions and dynamically routes them to specialized subnetworks, improving representational capacity while reducing redundant computation. Extensive experiments on seven WSI datasets spanning six cancer types demonstrate that BatMIL consistently outperforms state-of-the-art MIL approaches in slide-level classification tasks. These results indicate that geometry-aware representation learning offers a promising direction for next-generation computational pathology.
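
The hyperbolic half of the hybrid representation typically means embedding features in a Poincaré ball, where geodesic distance grows rapidly toward the boundary and thus naturally encodes hierarchy. The standard distance formula (a common construction for hyperbolic embeddings, not necessarily BatMIL's exact parameterization) in pure Python:

```python
import math

def poincare_distance(u, v):
    """Geodesic distance in the Poincaré ball model of hyperbolic space:
    d(u, v) = arccosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Assumes both points lie strictly inside the unit ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1 + 2 * diff2 / ((1 - nu2) * (1 - nv2)))
```

From the origin the formula simplifies to d(0, v) = 2·artanh(‖v‖), so coarse (root-like) concepts sit near the center and fine-grained ones spread toward the boundary, which is the property that makes hyperbolic space attractive for hierarchical tissue structure.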

[CV-6] PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World ICML2026

【速读】:该论文旨在解决交互式虚拟世界与具身智能(Embodied AI)中3D资产生成的瓶颈问题,即现有方法主要关注静态几何结构,而忽视了交互所必需的功能性属性。解决方案的关键在于提出PhysForge框架,其核心是通过解耦的两阶段流程实现功能逻辑驱动的物理感知资产生成:第一阶段由视觉语言模型(VLM)充当“物理架构师”,规划包含材料、功能和运动学约束的“分层物理蓝图”;第二阶段利用基于物理的扩散模型,结合新颖的KineVoxel Injection(KVI)机制,将蓝图转化为高保真几何形状并精确输出运动学参数,从而生成可直接用于仿真且功能合理的3D资产。

链接: https://arxiv.org/abs/2605.05163
作者: Yunhan Yang,Chunshi Wang,Junliang Ye,Yang Li,Zanxin Chen,Zehuan Huang,Yao Mu,Zhuo Chen,Chunchao Guo,Xihui Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026. Project Page: this https URL

点击查看摘要

Abstract:Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in functional logic and hierarchical physics. To bridge this gap, we introduce PhysForge, a decoupled two-stage framework supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations. First, a VLM acts as a “physical architect” to plan a “Hierarchical Physical Blueprint” defining material, functional, and kinematic constraints. Second, a physics-grounded diffusion model realizes this blueprint by synthesizing high-fidelity geometry alongside precise kinematic parameters via a novel KineVoxel Injection (KVI) mechanism. Experiments demonstrate that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.

[CV-7] Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging MICCAI2026

【速读】:该论文旨在解决零样本异常局部化(zero-shot anomaly localisation)在罕见病理检测中的性能瓶颈问题,其核心挑战在于缺乏健康解剖结构的上下文信息。解决方案的关键在于将零样本局部化重构为一种基于对比推理的范式,通过引入WALDO框架实现:(i) 利用熵加权切片 Wasserstein 距离从 DINOv2 的 patch 分布中选择具有解剖感知能力的参考分布;(ii) 采用 Goldilocks 区域采样策略,利用参考相似度与定位精度之间的非单调关系,选取中等相似度的参考以最小化偏差-方差权衡;(iii) 基于加权非极大值抑制的自一致性聚合机制提升结果稳定性。理论分析表明,适度相似的参考图像能最优平衡比较视觉推理中的偏倚与方差,显著优于传统零样本方法。

链接: https://arxiv.org/abs/2605.05161
作者: Bernhard Kainz,Johanna P Mueller,Matthew Baugh,Cosmin Bercea
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to MICCAI 2026

点击查看摘要

Abstract:Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves 43.5±1.6% mAP@30 (95% CI: [40.4, 46.7]), representing a 19% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves 32.0±6.5% and Qwen3-VL-32B achieves 32.0±6.6% mAP@30. Paired McNemar tests confirm statistical significance (p < 0.01). Source code is available at this https URL .
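
上文提到的切片 Wasserstein 距离(Sliced Wasserstein distance)可以用蒙特卡洛方式高效近似:随机采样若干一维投影方向,将两组点云分别投影并排序,再比较一维分位数。下面给出一个极简 NumPy 草图(函数名与参数均为笔者假设,省略了论文中的熵加权与参考选择细节,并非 WALDO 的官方实现):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Monte-Carlo sliced 1-Wasserstein distance between two
    equal-size point clouds X, Y of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # sample random unit directions on the sphere
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # project onto each direction, sort to get 1D quantiles, compare
    Xp = np.sort(X @ theta.T, axis=0)   # (n, n_proj)
    Yp = np.sort(Y @ theta.T, axis=0)
    return float(np.mean(np.abs(Xp - Yp)))
```

相同分布的两组 patch 特征距离为 0,整体偏移越大距离越大,这正是其可用作解剖参考相似度度量的直观依据。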

[CV-8] Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

【速读】:该论文旨在解决当前3D场景美学评估方法严重不足的问题,现有方法主要关注重建保真度和感知真实感,而忽视了构图、和谐性与视觉吸引力等高层次美学属性。其核心挑战在于缺乏带美学标注的通用3D高斯溅射(3D Gaussian Splatting, 3DGS)数据集,以及3DGS作为底层几何原语表示难以捕捉高层美学特征。解决方案的关键是提出Aes3D框架,包含首个专用于3D场景美学评估的数据集Aesthetic3D(基于作者提出的美学标注策略),以及轻量级模型Aes3DGSNet——该模型直接从3DGS原始表示中预测场景级美学分数,无需渲染多视角图像,从而显著降低计算开销和硬件依赖,并通过美学监督学习有效提取高层美学线索,实现准确的美学评分回归。

链接: https://arxiv.org/abs/2605.05155
作者: Chuanzhi Xu,Boyu Wei,Haoxian Zhou,Xuanhua Yin,Zihan Deng,Haodong Chen,Qiang Qu,Weidong Cai
机构: The University of Sydney (悉尼大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling 3D content. However, existing evaluation methods for 3D scenes primarily emphasize reconstruction fidelity and perceptual realism, largely overlooking higher-level aesthetic attributes such as composition, harmony, and visual appeal. This limitation comes from two key challenges: (1) the absence of general 3DGS datasets with aesthetic annotations, and (2) the intrinsic nature of 3DGS as a low-level primitive representation, which makes it difficult to capture high-level aesthetic features. To address these challenges, we propose Aes3D, the first systematic framework for assessing the aesthetics of 3D neural rendering scenes. Aes3D includes Aesthetic3D, the first dataset dedicated to 3D scene aesthetic assessment, built on our proposed annotation strategy for 3D scene aesthetics. In addition, we present Aes3DGSNet, a lightweight model that directly predicts scene-level aesthetic scores from 3DGS representations. Notably, our model operates solely on 3D Gaussian primitives, eliminating the need for rendering multi-view images and thus reducing computational cost and hardware requirements. Through aesthetics-supervised learning on multi-view 3DGS scene representations, Aes3DGSNet effectively captures high-level aesthetic cues and accurately regresses aesthetic scores. Experimental results demonstrate that our approach achieves strong performance while maintaining a lightweight design, establishing a new benchmark for 3D scene aesthetic assessment. Code and datasets will be made available in a future version.

[CV-9] What Matters in Practical Learned Image Compression

【速读】:该论文旨在解决现有学习型图像编解码器(learned image codec)在实际应用中难以兼顾感知质量与运行效率的问题,即如何设计一种既具备高感知保真度又能在设备端实时运行的实用化编解码方案。其关键解决方案在于:通过系统性地研究影响编解码器设计的核心建模选择(包括新颖的优化技术),并结合面向性能的神经架构搜索(performance-aware neural architecture search),从数百万种骨干网络配置中筛选出在目标设备端延迟约束下压缩性能最优的模型;最终构建的新编解码器在iPhone 17 Pro Max上实现12MP图像编码仅需230ms、解码150ms,相比AV1、VVC等传统标准和当前最佳学习型编解码器,在主观评测中可实现2.3–3倍和20–40%的比特率节省。

链接: https://arxiv.org/abs/2605.05148
作者: Kedar Tatwawadi,Parisa Rahimzadeh,Zhanghao Sun,Zhiqi Chen,Ziyun Yang,Sanjay Nair,Divija Hasteer,Oren Rippel
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed. In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime – including within the ablations several novel techniques. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3x bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images as fast as 230ms, and decodes them in 150ms – faster than most top ML-based codecs run on a V100 GPU.

[CV-10] CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization

【速读】:该论文旨在解决领域泛化(Domain Generalization, DG)中如何有效学习跨域不变表示的问题,尤其关注在分布外(Out-of-Distribution, OOD)迁移场景下模型性能下降的挑战。现有方法虽借助不变性学习策略和架构改进取得进展,但对通过二阶统计量显式挖掘结构化域不变子空间的研究仍较为不足。其解决方案的关键在于提出CPCANet框架,该框架基于公共主成分分析(Common Principal Component Analysis, CPCA),将Flury-Gautschi (FG) 迭代算法可微分地展开为神经网络层,从而在端到端训练中嵌入CPCA的统计特性,强制学习跨不同域共享的不变子空间并保持可解释性。此方法无需特定数据集调参且与网络架构无关,在四个标准DG基准上实现了零样本迁移的最先进性能。

链接: https://arxiv.org/abs/2605.05136
作者: Yu-Hsi Chen,Abd-Krim Seghouane
机构: The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 tables

点击查看摘要

Abstract:Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen target domains. While recent invariant learning strategies and architectural advances have achieved strong performance, explicitly discovering a structured domain-invariant subspace through second-order statistics remains underexplored. In this work, we propose CPCANet, a novel framework grounded in Common Principal Component Analysis (CPCA), which unrolls the iterative Flury-Gautschi (FG) algorithm into fully differentiable neural layers. This approach integrates the statistical properties of CPCA into an end-to-end trainable framework, enforcing the discovery of a shared subspace across diverse domains while preserving interpretability. Experiments on four standard DG benchmarks demonstrate that CPCANet achieves state-of-the-art (SOTA) performance in zero-shot transfer. Moreover, CPCANet is architecture-agnostic and requires no dataset-specific tuning, providing a simple and efficient approach to learning robust representations under distribution shifts. Code is available at this https URL.
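
CPCA 的核心思想是为多组协方差矩阵寻找一组共享的主成分基。完整的 Flury-Gautschi 迭代较为复杂,下面给出一个常被用作其初始化的朴素近似:直接取池化协方差的特征向量作为共享基,并用"对角化程度"衡量该基在各域上的质量(NumPy 草图,函数名为笔者假设,并非 CPCANet 的官方实现):

```python
import numpy as np

def common_components(covs, k):
    """Naive stand-in for CPCA: top-k eigenvectors of the pooled
    covariance, shared across all domains. The FG algorithm refines
    such an initialization with pairwise Jacobi-style rotations."""
    pooled = np.mean(covs, axis=0)
    w, V = np.linalg.eigh(pooled)          # ascending eigenvalues
    order = np.argsort(w)[::-1]
    return V[:, order[:k]]                 # (d, k) shared basis

def diagonality(covs, V):
    """How close each domain covariance is to diagonal in the shared
    basis: diagonal energy over total energy (1.0 = perfectly common)."""
    scores = []
    for S in covs:
        T = V.T @ S @ V
        scores.append(np.sum(np.diag(T) ** 2) / np.sum(T ** 2))
    return float(np.mean(scores))
```

当各域协方差确实共享同一组特征向量时,该朴素基即可把所有域同时对角化,对角化程度接近 1。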

[CV-11] A Bayesian Approach for Task-Specific Next-Best-View Selection with Uncertain Geometry

【速读】:该论文旨在解决3D重建中视点选择的效率问题,即如何在有限的扫描次数下,通过智能选择下一个最优观测视角来提升下游任务性能。传统方法通常以全局不确定性最小化为目标,而忽略了特定任务对几何信息的需求差异。其解决方案的关键在于将视点选择建模为贝叶斯决策问题:首先在隐式表面空间上定义先验分布,继而利用近期发展的随机表面重建方法计算后验分布,并基于该后验分布推理出对目标任务(如语义分类、分割或偏微分方程(PDE)驱动的物理仿真)最具信息量的下一个观测视角。这种方法实现了任务导向的不确定性削减,仅优化与任务相关的区域,从而显著提升任务性能并减少所需视点数量。

链接: https://arxiv.org/abs/2605.05095
作者: Jingsen Zhu,Silvia Sellán,Alexander Terenin
机构: Cornell University (康奈尔大学); Columbia University (哥伦比亚大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Code for this paper is available at this https URL

点击查看摘要

Abstract:We develop a framework for task-specific active next-best-view selection in 3D reconstruction from point clouds, by casting the problem in the language of Bayesian decision theory. Our framework works by (a) placing a prior distribution over the space of implicit surfaces, (b) using recently-developed stochastic surface reconstruction methods to calculate the resulting posterior distribution, then (c) using the posterior distribution to carefully reason about which view to scan next. This enables us to perform camera selection in a manner that is directly optimized for the intended use of the reconstructed data - meaning, we reduce uncertainty only in those regions that make a difference in the task at hand, as opposed to prior approaches that reduce it uniformly across space. We evaluate our method across three distinct downstream tasks: semantic classification, segmentation, and PDE-guided physics simulation. Experimental results demonstrate that our framework achieves superior task performance with fewer views compared to commonly used baselines and prior general uncertainty-reduction techniques.
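
为了直观说明"面向任务的不确定性削减",下面用相互独立的高斯隐变量充当表面后验的玩具替身:每个候选视角观测一部分变量,选择能最小化任务加权后验方差的视角。这只是贝叶斯决策思想的极简示意(变量与函数名均为笔者假设),并非论文中基于隐式表面后验的完整方法:

```python
import numpy as np

def next_best_view(prior_var, view_masks, task_weights, noise_var=0.01):
    """Pick the view whose simulated observation most reduces the
    task-weighted posterior variance. Each view observes a boolean
    subset of independent Gaussian latent variables."""
    best, best_score = None, np.inf
    for v, mask in enumerate(view_masks):
        post = prior_var.copy()
        # Gaussian precision update for observed coordinates:
        # 1/post = 1/prior + 1/noise
        post[mask] = 1.0 / (1.0 / prior_var[mask] + 1.0 / noise_var)
        score = np.sum(task_weights * post)   # residual task uncertainty
        if score < best_score:
            best, best_score = v, score
    return best
```

若任务只关心区域 0,则即使另一视角能观测更多区域,方法也会选择覆盖区域 0 的视角,这正是与"均匀削减不确定性"策略的区别所在。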

[CV-12] Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout

【速读】:该论文旨在解决自动驾驶中共享控制过渡阶段的人类驾驶员反应预测问题,即在L2/L3级驾驶自动化系统中,如何准确 anticipates(预期)驾驶员在人机交互过程中的动态行为,以确保安全。传统驾驶世界模型主要聚焦于外部环境的预测,而车内智能则局限于静态识别,缺乏对驾驶员状态进行多步滚动预测的能力。解决方案的关键在于提出一种以驾驶员为中心的潜在世界模型(Driver-WM),其通过因果条件机制将外部交通上下文与内部驾驶员状态联合建模:该模型构建于冻结的视觉-语言特征构成的紧凑潜在空间中,采用双流架构分别编码外部交通和内部驾驶员状态,并通过门控因果注入机制实现定向耦合——该机制利用学习得到的向量门控调节外部扰动影响,同时严格保障时间因果性。此设计实现了物理运动学预测与驾驶员行为及情绪语义识别的统一建模,在多任务辅助驾驶基准上验证了其在长期几何轨迹预测和状态语义对齐方面的显著提升,且支持测试时干预以系统化分析机制响应。

链接: https://arxiv.org/abs/2605.05092
作者: Haozhuang Chi,Daosheng Qiu,Hao Su,Haochen Liu,Zirui Li,Haoruo Zhang,Chen Lv
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.
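
门控因果注入机制的核心可以概括为:由内部驾驶员状态经线性映射与 sigmoid 得到一个逐维向量门,调制外部交通上下文注入驾驶员流的强度。下面是单步融合的极简草图(线性门的具体形式为笔者假设,仅示意定向耦合的思想,并非 Driver-WM 的官方实现):

```python
import numpy as np

def gated_causal_injection(internal, external, W_g, b_g):
    """One injection step: a learned vector gate (sigmoid of a linear
    map of the internal state) modulates how much of the external
    traffic context is added to the driver stream."""
    gate = 1.0 / (1.0 + np.exp(-(internal @ W_g + b_g)))  # in (0, 1)
    return internal + gate * external, gate
```

门接近 0 时外部扰动被完全屏蔽,接近 1 时被完整注入;信息只从外部流流向内部流,反向不耦合,从而保持因果方向。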

[CV-13] A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

【速读】:该论文旨在解决通过折射动态介质(如湍流空气或水面)拍摄视频时面临的严重几何失真和时间不稳定性问题,现有方法在强且高度非均匀折射条件下的评估缺乏系统性基准。解决方案的关键在于构建一个全面的基准测试平台,涵盖从轻度扭曲到强间断折射变形的多种场景,包含实验室采集的真实数据与基于物理光折射建模生成的合成序列,并采用像素级(PSNR、SSIM)与感知级(LPIPS、DINO、CLIP)多维指标对多种方法进行评估,包括传统配准算法、学习型方法(如DATUM)以及作者提出的基于扩散模型的V-cache方法,从而为高和极端失真条件下视频重建算法的研发与评测提供新基础。

链接: https://arxiv.org/abs/2605.05079
作者: Maxim V. Shugaev,Md Reshad Ul Hoque,Bridget Kennedy,Joseph T. Riley,Fiona Hwang,Justin Hagen,Harvir Ghuman,Ethan Garcia-O’Donnell,Syed Noor Qadri,Freddie Santiago,Mun Wai Lee
机构: AeroVironment, Inc. (AeroVironment公司); U. S. Naval Research Laboratory (美国海军研究实验室); George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Video sequence capturing through refractive dynamic media, such as a turbulent air or water surface, often suffer from severe geometric distortions and temporal instability. While recent advances address mild atmospheric turbulence, no existing benchmarks systematically evaluate restoration methods under strong and highly nonuniform refractive conditions. We present a comprehensive benchmark for geometric distortion removal in video, covering a range from turbulence-like mild warping to strong discontinuous refractive deformations. The benchmark includes both laboratory-captured real data and synthetic sequences generated for static scenes via physics-based light refraction modeling across four distortion levels and multiple surface wave types. We evaluate a spectrum of methods from simple baselines and classical registration algorithms to advanced learning-based approaches including DATUM and our proposed diffusion based V-cache for high and extreme distortions regimes. Evaluation uses both pixel-level (PSNR, SSIM), and perceptual (LPIPS, DINO, CLIP) metrics providing the first large scale analysis of geometric distortion removal. Our benchmark establishes a new foundation for developing and evaluating algorithms capable of reconstructing video from highly distorted optical environments. Our code and datasets are available at this https URL.
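
基准中基于物理的折射形变可以用小角度近似来直观理解:像素位移近似正比于水面坡度乘以水深与折射系数 (1 - 1/n)。下面是一个正弦水面的位移场 NumPy 草图(参数与函数名均为笔者假设,真实基准使用完整的 Snell 定律建模与多种波型):

```python
import numpy as np

def refractive_warp(h, w, amp=2.0, wavelength=16.0, depth=20.0, n_ratio=1.33):
    """Small-angle approximation of refraction through a sinusoidal
    water surface: pixel displacement is proportional to the surface
    slope, scaled by depth and the refraction coefficient (1 - 1/n).
    Returns per-pixel (dx, dy) displacement fields of shape (h, w)."""
    y, x = np.mgrid[0:h, 0:w].astype(float)
    k = 2 * np.pi / wavelength
    # gradient of the surface height eta(x, y) = amp * (sin(kx) + sin(ky))
    deta_dx = amp * k * np.cos(k * x)
    deta_dy = amp * k * np.cos(k * y)
    coef = depth * (1.0 - 1.0 / n_ratio)
    return coef * deta_dx, coef * deta_dy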

[CV-14] FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

【速读】:该论文旨在解决现有二值图像分割(Dichotomous Image Segmentation, DIS)方法在保留细粒度细节和充分捕捉前景语义结构方面的不足。其解决方案的关键在于提出FlowDIS,一种基于流匹配(flow matching)框架的新型DIS方法,通过学习一个时变向量场将图像分布映射到对应的掩码分布,且可选地以文本提示作为条件;同时引入位置感知实例配对(Position-Aware Instance Pairing, PAIP)训练策略,显著提升了文本引导下的可控性与像素级分割精度。

链接: https://arxiv.org/abs/2605.05077
作者: Andranik Sargsyan,Shant Navasardyan
机构: Picsart AI Research (PAIR)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher F_\beta^\omega measure and 43% lower MAE (\mathcal{M}) on the DIS-TE test set. The code is available at: this https URL
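
FlowDIS 所依赖的流匹配(flow matching)训练目标可以概括为:在噪声 x0 与数据 x1 的直线路径上采样中间点,并回归恒定的目标速度 x1 - x0。下面是该损失的极简示意,任意 model(x_t, t) 均可作为速度预测器(此处为通用的条件流匹配写法,并非 FlowDIS 的官方实现):

```python
import numpy as np

def flow_matching_loss(model, x0, x1, rng):
    """One conditional flow-matching step: interpolate noise x0 toward
    data x1 along a straight path and regress the constant target
    velocity (x1 - x0). `model(x_t, t)` is any velocity predictor."""
    t = rng.uniform(size=(x0.shape[0], 1))
    x_t = (1 - t) * x0 + t * x1        # point on the straight path
    v_target = x1 - x0                 # time derivative of that path
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

推理时沿学到的速度场从噪声积分若干步即可到达目标(对 FlowDIS 而言即掩码分布);文本条件可作为 model 的附加输入注入。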

[CV-15] Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

【速读】:该论文旨在解决3D占用预测(3D occupancy prediction)中因固定投影空间导致的特征对齐不准确问题,尤其在真实场景中由于点云稀疏性和高度变化性,传统均匀采样方式难以有效建模几何结构。解决方案的关键在于提出HiPR框架,其核心创新是引入高度引导的投影重参数化(Height-Guided Projection Reparameterization):首先利用LiDAR点云生成BEV高度图以获取场景最大高度先验,进而根据此先验动态调整每个柱状区域的采样范围,实现投影空间的自适应重构;同时通过掩码无效高度区域避免错误特征聚合,并采用训练阶段渐进式高度条件策略缓解LiDAR噪声带来的训练不稳定问题。

链接: https://arxiv.org/abs/2605.05072
作者: Yuan Wu,Zhiqiang Yan,Jiawei Lian,Zhengxue Wang,Jian Yang
机构: Nanjing University of Science and Technology (南京理工大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D occupancy prediction aims to infer dense, voxel-wise scene semantics from sensor observations, where the 2D-to-3D view transformation serves as a crucial step in bridging image features and volumetric representations. Most previous methods rely on a fixed projection space, where 3D reference points are uniformly sampled along pillars. However, such sampling struggles to capture the sparsity and height variations of real-world scenes, leading to ambiguous correspondences and unreliable feature aggregation. To address these challenges, we propose HiPR, a camera-LiDAR occupancy framework with Height-Guided Projection Reparameterization. HiPR first encodes LiDAR into a BEV height map to capture the maximum height of the point cloud. HiPR then adjusts the sampling range of each pillar using the height prior, enabling adaptive reparameterization of the projection space. As a result, the projected points are redistributed into geometrically meaningful regions rather than fixed ranges. Meanwhile, we mask out the invalid parts of the height map to avoid misleading the feature aggregation. In addition, to alleviate the training instability caused by noisy LiDAR-derived heights, we introduce a training-time Progressive Height Conditioning strategy, which gradually transitions the conditioning signal from ground-truth heights to LiDAR heights. Extensive experiments demonstrate that HiPR consistently outperforms existing state-of-the-art methods while maintaining real-time inference. The code and pretrained models can be found at this https URL.
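
高度引导的投影重参数化可以概括为:每个 BEV 柱体不再在固定的 [z_min, z_max] 区间内均匀采样参考点,而是只采样到该柱体自身的 LiDAR 最大高度;无效柱体(无 LiDAR 回波)退回全区间。下面是一个 NumPy 草图(函数名与等距划分方式均为笔者假设,并非 HiPR 的官方实现):

```python
import numpy as np

def pillar_reference_points(height_map, n_points=8, z_min=-1.0, z_max=5.0):
    """Reparameterized sampling: each BEV pillar samples reference
    heights only up to its own LiDAR-derived maximum height. Pillars
    with no valid return (height <= z_min) fall back to the full
    range. Returns z of shape (H, W, n_points)."""
    h = np.clip(height_map, z_min, z_max)            # (H, W)
    valid = height_map > z_min
    top = np.where(valid, h, z_max)                  # per-pillar upper bound
    # bin midpoints in (0, 1), shared across pillars
    frac = (np.arange(n_points) + 0.5) / n_points
    return z_min + frac[None, None, :] * (top[..., None] - z_min)
```

这样投影点被重新分布到几何上有意义的区域,而不会浪费在点云上方的空旷空间;论文中的渐进式高度条件策略则进一步缓解 LiDAR 高度噪声带来的训练不稳定。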

[CV-16] Look Once Beam Twice: Camera-Primed Real-Time Double-Directional mmWave Beam Management for Vehicular Connectivity

【速读】:该论文旨在解决毫米波(mmWave)频段在车对万物(V2X)网络中面临的严重路径损耗和移动导致的波束错位问题,从而实现快速、双向的波束对齐以保障可靠连接。现有方法存在训练开销高且泛化能力差的问题,难以适应未见过的实际场景。其解决方案的关键在于提出一种融合模型驱动与闭环学习的混合架构——VIBE(Vision-based BEamforming),通过摄像头感知信息显著缩小波束搜索空间,避免了全量训练开销;同时结合轻量级波束精调与偏移跟踪机制,在动态应用需求下自适应优化波束性能,从而在保证低延迟的同时提升链路质量。实验证明,VIBE相比5G NR分层波束赋形和当前最先进的端到端机器学习模型均表现出更低的中断率(低至1.1%-1.4%),具备强泛化能力和实时性,更适合真实世界中的mmWave车载通信场景。

链接: https://arxiv.org/abs/2605.05071
作者: Avhishek Biswas,Apala Pramanik,Eylem Ekici,Mehmet C. Vuran
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: Accepted to the 2026 IEEE International Conference on Sensing, Communication, and Networking (IEEE SECON 2026). Code and models available at: this https URL

点击查看摘要

Abstract:Millimeter-wave (mmWave) frequencies promise multi-gigabit connectivity for vehicle-to-everything (V2X) networks, but face challenges in terms of severe path loss and mobility-related beam misalignment. Reliable V2X connectivity requires fast, double-directional beam alignment. However, existing methods suffer from high training overhead and limited generalization to unseen scenarios. This paper presents VIsion-based BEamforming (VIBE), a hybrid model-based, closed-loop, learning architecture for real-time double-directional mmWave beam management primed by camera sensing. VIBE fuses machine learning, model-based reasoning, and closed-loop RF feedback to balance beam-pair establishment latency with link quality. VIBE bypasses exhaustive training overhead and accelerates link establishment by leveraging camera observations to reduce the beam-search space. Lightweight beam refinement and offset tracking mechanisms adaptively refine beams in response to dynamic application requirements. VIBE is implemented and evaluated across online indoor/outdoor testbeds, public datasets, and real-time vehicular experiments, demonstrating strong generalization capabilities, making it suitable for real-time V2X communication. Comparisons with 5G NR hierarchical beamforming show that VIBE consistently maintains lower outage rates. Furthermore, VIBE outperforms state-of-the-art end-to-end ML models for beam selection when evaluated on public datasets and achieves outage rates as low as 1.1-1.4%. The results show that a hybrid model-based, closed-loop learning architecture is better suited for real-world mmWave vehicular connectivity than end-to-end trained ML models. For reproducibility, we publish our code to this https URL.
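
VIBE 利用摄像头观测缩小波束搜索空间的思想可以概括为:先由视觉估计目标方位角,再只在码本中该角度邻域内的波束上做测量与选择,而非穷举整个码本。下面是一个极简草图(码本、角度余量等均为笔者假设的玩具设定,并非 VIBE 的官方实现):

```python
import numpy as np

def reduced_beam_search(codebook_angles, cam_angle_deg, margin_deg, gains):
    """Vision-primed beam selection sketch: restrict the sweep to
    codebook beams within `margin_deg` of the camera-estimated angle,
    then pick the best measured gain inside that subset. Falls back
    to a full sweep when the subset is empty."""
    candidates = np.where(
        np.abs(codebook_angles - cam_angle_deg) <= margin_deg)[0]
    if candidates.size == 0:
        candidates = np.arange(len(codebook_angles))
    best = candidates[np.argmax(gains[candidates])]
    return best, candidates.size
```

被测波束数从整个码本降到邻域子集,这正是 VIBE 降低链路建立时延的直观来源;其上层的闭环精调与偏移跟踪此处未涉及。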

[CV-17] ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

【速读】:该论文旨在解决开放词汇人类-物体交互(HOI)检测中因对象可及性(affordance)和短语级共现关系导致的误判问题,即模型可能仅依赖刀具与蛋糕的共现就预测“切蛋糕”动作,而忽略手部、工具、目标、接触模式和物体状态等关键视觉证据的协同支持。解决方案的关键在于提出一种结构化框架ScriptHOI,将每个交互短语建模为软脚本状态转移,分解为身体角色、接触、几何、可及性、运动和物体状态等多个槽位;通过视觉状态分词器提取对应状态令牌,并利用槽位级匹配器计算脚本覆盖度与冲突度,以此校准HOI置信度、暴露缺失证据并提供针对不完整标注的训练约束;同时引入区间部分标签学习机制,对未标注候选样本施加由脚本推导的上下界概率约束,避免抑制有效但未标注的交互,辅以反事实脚本对比损失以削弱仅基于对象的捷径策略。

链接: https://arxiv.org/abs/2605.05057
作者: Minh Anh Nguyen,Quang Huy Tran,Bao Ngoc Le,SuiYang Guang,Tuan Kiet Pham,Linh Chi Vo
机构: Phenikaa University ( Phenikaa 大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict \textitcut cake from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose \textbfScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.

[CV-18] Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

【速读】:该论文旨在解决现有流匹配(Flow Matching, FM)方法在视觉-语言模型(Vision-Language Models, VLMs)少样本适应中因预训练跨模态特征的几何先验不兼容而导致的性能瓶颈问题。具体而言,现有方法存在三个关键局限:1)角度动态失真,由径向-角度耦合导致角度子流形上速度非均匀,增加回归训练难度并引入截断误差;2)径向动态忽略,特征归一化丢弃模态置信度信息,无法区分分布内外数据且忽视重要径向演化;3)上下文无关的无条件流,预训练过程中丢失的数据集特异性信息未能恢复。解决方案的核心是提出扭曲积流匹配(Warped Product Flow Matching, WP-FM),其统一的黎曼框架将对齐重构为在扭曲积流形上的优化;其中,进一步设计的**直接积流匹配(Direct Product Flow Matching, DP-FM)**通过引入恒定扭曲度量,实现径向与角度解耦的圆柱流形结构,从而消除角度动态失真并保持径向一致性,同时结合无分类器引导(classifier-free guidance)利用预训练VLM的隐藏状态注入缺失的数据集特异性信息,显著提升少样本适应性能,在11个基准测试中达到新SOTA。

链接: https://arxiv.org/abs/2605.05054
作者: Hongxu Chen,Yanghao Wang,Bowei Zhu,Hongxiang Li,Zhen Wang,Ziqi Jiang,Lin Li,Rui Liu,Long Chen
机构: HKUST; USTC; Huawei Research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models, by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM methods are inherently constrained by incompatible geometric priors on pre-trained cross-modal features, resulting in suboptimal adaptation performance. We first analyze these methods from a polar decomposition perspective (i.e., radial and angular sub-manifolds). Under this new geometric view, we identify three overlooked limitations in them: 1) Angular dynamics distortion: The radial-angular coupling induces non-uniform speed on the angular sub-manifold, leading to regression training difficulty and extra truncation errors. 2) Radial dynamics neglect: Feature normalization discards modality confidence, failing to distinguish out-of-distribution and in-distribution data, and abandoning crucial radial dynamics. 3) Context-agnostic unconditional flow: Dataset-specific information loss during pre-trained cross-modal feature extraction remains unrecovered. To resolve these issues, we propose warped product flow matching (WP-FM), a unified Riemannian framework that reformulates alignment on a warped product manifold. Within this framework, we derive direct product flow matching (DP-FM) by introducing a constant-warping metric, which yields a decoupled cylindrical manifold (i.e., direct product manifold). DP-FM enables independent radial evolution and constant-speed angular geodesic transport, effectively eliminating angular dynamics distortion while preserving radial consistency. Meanwhile, we incorporate classifier-free guidance by conditioning the flow on the pre-trained VLMs’ hidden states to inject missing dataset-specific information. Extensive results across 11 benchmarks have demonstrated that DP-FM achieves a new state-of-the-art for multi-step few-shot adaptation.
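
DP-FM 的"径向-角度解耦"可以用极分解直观演示:半径做线性插值,方向在单位球面上沿恒速大圆路径(slerp)移动,二者互不干扰。下面是单个向量的传输草图(函数名为笔者假设,仅示意圆柱/直积流形上的解耦测地线,并非论文完整方法):

```python
import numpy as np

def decoupled_transport(x0, x1, t):
    """Transport x0 toward x1 on a cylinder-like direct product
    manifold: the radius evolves linearly while the direction follows
    a constant-speed great-circle (slerp) path on the unit sphere."""
    r0, r1 = np.linalg.norm(x0), np.linalg.norm(x1)
    u0, u1 = x0 / r0, x1 / r1
    omega = np.arccos(np.clip(u0 @ u1, -1.0, 1.0))   # angle between dirs
    if omega < 1e-8:                                  # already aligned
        u_t = u0
    else:
        u_t = (np.sin((1 - t) * omega) * u0
               + np.sin(t * omega) * u1) / np.sin(omega)
    r_t = (1 - t) * r0 + t * r1                       # linear radial path
    return r_t * u_t
```

角度分量全程恒速,消除了径向-角度耦合带来的速度非均匀;径向分量则保留了模长所携带的置信度信息。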

[CV-19] Reduced-order Neural Modeling with Differentiable Simulation for High-Detail Tactile Perception

【速读】:该论文旨在解决高分辨率弹性体变形模拟在计算上的高成本问题,尤其是在机器人灵巧操作中对触觉感知的高保真需求与实时性之间的矛盾。传统有限元方法(FEM)虽精度高但需频繁重网格(remeshing),而材料点法(MPM)则面临粒子内存占用与精度之间的权衡。解决方案的关键在于提出一种降阶神经模拟框架,通过将粗粒度MPM动力学与隐式神经解码器相结合,从紧凑的潜在状态中重建亚粒子尺度的触觉细节;该框架从高低分辨率仿真数据中学习连续的形变流形,从而实现物理一致且可微分的推理,显著提升效率并保持几何保真度。

链接: https://arxiv.org/abs/2605.05053
作者: Yuhu Guo,Zhikai Shen,Jiasheng Qu,Chenghao Qian,Yuming Huang,Bin Chen,Guoxing Fang
机构: The University of Manchester (曼彻斯特大学); The Chinese University of Hong Kong (香港中文大学); University of Leeds (利兹大学); University of Melbourne (墨尔本大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE RoboSoft 2026

点击查看摘要

Abstract:Tactile perception is key to dexterous manipulation, yet simulating high-resolution elastomer deformation remains computationally prohibitive. Finite element methods (FEM) deliver high fidelity but demand costly remeshing, while Material Point Methods (MPM) suffer from heavy particle-memory tradeoffs. We propose a reduced-order neural simulation framework that couples coarse-grained MPM dynamics with an implicit neural decoder to reconstruct sub-particle tactile details from compact latent states. The framework learns a continuous deformation manifold from paired high- and low-resolution simulations, enabling physically consistent, differentiable inference. Compared to TacIPC, our method achieves over 65% faster simulation and 40% lower memory usage, while maintaining better geometric fidelity. In tactile rendering and 3D surface reconstruction, our method further improves accuracy by 25% and produces realistic depth images and surface meshes at faster inference speeds. These results demonstrate that the proposed reduced-order neural model enables high-detail, physically grounded tactile simulation with substantial efficiency gains for robotic interaction and optimization.

[CV-20] Few-Shot Learning Pipeline for Monkeypox Skin Disease Classification Using CNN Feature Extractors

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在疾病分类任务中对大规模标注数据的高度依赖问题,尤其是在新兴或罕见疾病(如猴痘)场景下难以获取足够标注样本的现实挑战。其解决方案的关键在于提出一种基于少量样本学习(Few-Shot Learning, FSL)的框架,采用轻量级、非参数化且归纳式的分类器SimpleShot,结合冻结的预训练CNN骨干网络提取特征嵌入,并在归一化的嵌入空间中通过最近质心比较进行分类。该方法有效提升了在极小标注样本(1-shot至10-shot)条件下对猴痘及类天花皮肤病的识别性能,实验证明MobileNetV2_100作为特征提取器表现最优,同时揭示了跨数据集迁移时二分类任务相对稳定而多分类性能显著下降的问题,强调了领域鲁棒性对于临床实际部署的重要性。

链接: https://arxiv.org/abs/2605.05034
作者: Md. Safirur Rashid,Sabbir Ahmed,Muhammad Usama Islam,Sumona Hoque Mumu,Md. Hasanul Kabir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the strong performance of Convolutional Neural Networks (CNNs) in disease classification, their effectiveness often depends on access to large annotated datasets, which is an impractical requirement for emerging or rare conditions such as Monkeypox. To overcome this limitation, we propose a few-shot learning (FSL) framework that employs SimpleShot, a lightweight, non-parametric, inductive classifier, for Monkeypox and pox-like skin disease recognition from limited labeled examples. The proposed pipeline passes the skin lesion images through a frozen, pretrained CNN backbone to obtain feature embeddings, which are then classified via SimpleShot using nearest-centroid comparisons in a normalized embedding space. We systematically benchmark six widely used CNN backbones as feature extractors under consistent experimental settings, enabling fair comparison. Experiments on three publicly available datasets (MSLD v1.0, MSID, and MSLD v2.0) are conducted across 2-way, 4-way, and 6-way tasks with 1-shot, 5-shot, and 10-shot configurations. Among all models, MobileNetV2_100 consistently achieves the highest accuracy. In addition, we present a cross-dataset evaluation for Monkeypox classification, revealing that binary Mpox-vs-Others transfer remains comparatively stable while multi-class performance degrades significantly under domain shift. Together, these results demonstrate the practical utility of combining inductive FSL methods with lightweight CNN backbones and highlight the importance of domain robustness for reliable real-world clinical deployment.
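
SimpleShot 的核心流程非常简洁:用基类特征均值做中心化(CL2N 变体)、L2 归一化,然后在归一化空间中做最近质心分类,全程无需参数训练。下面是一个 NumPy 草图(遵循原 SimpleShot 论文的通用流程,函数名为笔者假设):

```python
import numpy as np

def simpleshot_predict(support, support_labels, query, base_mean):
    """SimpleShot-style inductive classification (CL2N variant):
    subtract the base-class feature mean, L2-normalize, and assign
    each query to its nearest class centroid."""
    def norm(z):
        z = z - base_mean
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    s, q = norm(support), norm(query)
    classes = np.unique(support_labels)
    centroids = np.stack(
        [s[support_labels == c].mean(axis=0) for c in classes])
    # Euclidean distance of each query to each centroid
    d = np.linalg.norm(q[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]
```

support 与 query 即冻结 CNN 骨干(如文中的 MobileNetV2_100)提取的特征嵌入;K-shot 设置下每类质心由 K 个支持样本平均得到。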

[CV-21] Computer-Aided Design Generation by Cascaded Discrete Diffusion Model

【TL;DR】: Existing continuous diffusion models for CAD generation ignore the discrete, heterogeneous nature of CAD representations: perturbing token embeddings in Euclidean space often yields semantically invalid symbols. This paper proposes a cascaded discrete diffusion framework in which both command diffusion and parameter diffusion operate over categorical token distributions. Commands are corrupted with an absorbing-state transition matrix that progressively maps tokens to a designated symbol, while parameters use attribute-specific kernels: a Gaussian kernel for coordinate continuity, a scale-invariant kernel for dimensional values, and a prior-preserving kernel for boolean attributes. The reverse process uses two denoising networks: a Transformer-based encoder for command recovery, and a parameter network with local self-attention for command-level interaction plus cross-attention for conditional injection. On the DeepCAD dataset the method surpasses autoregressive and continuous diffusion baselines on unconditional generation metrics and shows effective controllability in conditional generation.

Link: https://arxiv.org/abs/2605.05031
Authors: Honghu Pan, Xiaoling Luo, Yongyong Chen, Zhenyu He, Pengyang Wang
Institutions: Hunan University; Shenzhen University; Harbin Institute of Technology, Shenzhen; University of Macau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent deep learning approaches seek to automate CAD creation by representing a model as a sequence of discrete commands and parameters, and then generating them using autoregressive models or continuous diffusion operating in Euclidean embedding space. However, continuous diffusion perturbs representations in a continuous Euclidean domain that does not reflect the inherently discrete and heterogeneous nature of CAD tokens, often producing perturbed representations that map to semantically invalid symbols. To overcome this limitation, we propose a cascaded discrete diffusion framework for CAD generation, which consists of a command diffusion for generating CAD commands and a parameter diffusion conditioned on CAD commands. Unlike isotropic Gaussian perturbation, the forward process of our approach operates directly over categorical token distributions using delicate transition matrices. For commands, we adopt an absorbing-state transition matrix that progressively corrupts tokens to a designated symbol; for parameters, we introduce specific transition matrices tailored to heterogeneous attributes: a Gaussian kernel for coordinate continuity, a scale-invariant kernel for dimensional values, and a prior-preserving kernel for boolean attributes. The reverse process is achieved by two denoising networks: a Transformer-based encoder for command recovery, and a parameter network with extra local self-attention for command-level interaction and cross-attention for conditional injection. Experiments on the DeepCAD dataset show that the proposed approach surpasses existing autoregressive and continuous diffusion models on unconditional generation metrics, while qualitative results validate effective controllability in conditional generation tasks. Source codes will be released.
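A minimal sketch of the absorbing-state forward corruption described for the command tokens, assuming a simple linear absorption schedule; the paper's actual transition matrices and noise schedule are not specified here, so treat the details as illustrative.

```python
import numpy as np

def absorbing_forward(tokens, t, num_steps, mask_id, rng):
    """Corrupt a command-token sequence toward an absorbing [MASK] symbol.

    With a linear schedule, each token is absorbed by step t with
    probability t / num_steps; absorbed tokens never recover, so the
    chain converges to the all-mask sequence at t = num_steps.
    """
    p_absorb = t / num_steps
    absorbed = rng.random(tokens.shape) < p_absorb
    out = tokens.copy()
    out[absorbed] = mask_id  # jump to the designated absorbing symbol
    return out
```

A denoiser trained against this process then learns to recover the original categorical tokens from partially masked sequences, which is what the Transformer-based command network does in the paper's reverse process.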

[CV-22] Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification CVPR2026

【TL;DR】: Lifelong person re-identification (LReID) models suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains arrive; existing exemplar-free methods rely on visual-only distillation or parameter regularization and overlook auxiliary modalities such as text. The proposed Prompt-Anchored vision-text Distillation (PAD) is an asymmetric framework that keeps the frozen text encoder of a pretrained vision-language model as a stable cross-domain semantic anchor. On the textual side, prompts are distilled to preserve vision-text alignment in a fixed semantic space, acting as a global semantic reference rather than a dominant learning signal; on the visual side, an EMA-based teacher with an adaptive prompt pool adapts to new domains by allocating new prompt slots while freezing past ones, striking a strong balance between stability and plasticity.

Link: https://arxiv.org/abs/2605.05027
Authors: Wen Wen, Hao Chen, Shiliang Zhang
Institutions: University of Electronic Science and Technology of China; Harbin Institute of Technology, Shenzhen; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches largely rely on visual-only distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision-language models can serve as a stable semantic anchor across domains. To decouple the roles of vision and text, we propose Prompt-Anchored vision-text Distillation (PAD), an asymmetric vision-text framework for semantic alignment and cross-domain generalization. On the textual side, we distill prompts to preserve vision-text alignment under a fixed semantic space, acting as a global semantic reference rather than a dominant learning signal. On the visual side, an EMA-based teacher with an adaptive prompt pool enables domain-wise adaptation by allocating new slots while freezing past ones. Extensive experiments show that PAD substantially outperforms state-of-the-art methods across seen and unseen domains, achieving a strong balance between stability and plasticity. Project page is available at this https URL.
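The EMA-based teacher used on the visual side boils down to an exponential moving average of the student's weights. A minimal sketch over a plain parameter dictionary; the decay value and the dict-of-floats representation are illustrative stand-ins for real tensors, not the paper's configuration.

```python
def ema_update(teacher, student, decay=0.999):
    """Update the teacher as an exponential moving average of the student.

    teacher/student: dicts mapping parameter names to float values.
    The teacher changes slowly, providing a stable distillation target
    while the student adapts to the new domain.
    """
    for name in teacher:
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student[name]
    return teacher
```

With a decay near 1, the teacher lags the student by many updates, which is exactly the stability-plasticity trade-off such frameworks rely on.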

[CV-23] Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models

【TL;DR】: Diffusion models are prone to structural hallucinations: samples that match the training statistics yet violate underlying structural rules, such as hands with more than five fingers. This paper treats hallucinations as instabilities on the model-induced manifold and shows that an instability-based filter matches or exceeds a recently proposed temporal filter. Tracing the source of these instabilities identifies local intrinsic dimension (LID) as their primary driver, motivating Intrinsic Quenching (IQ), a corrective mechanism that directly deflates LID to alleviate hallucinations. IQ consistently outperforms standard hallucination-reduction baselines across a wide array of benchmarks and is a promising route to enforcing anatomical consistency in downstream medical imaging tasks.

Link: https://arxiv.org/abs/2605.05026
Authors: Bartlomiej Sobieski, Matthew Tivnan, Dawid Płudowski, Michał Jan Włodarczyk, Pengfei Jin, Przemyslaw Biecek, Quanzheng Li
Institutions: Warsaw University of Technology; University of Warsaw; Massachusetts General Hospital; Harvard Medical School
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Diffusion models are prone to generating structural hallucinations - samples that match the statistical properties of the training data yet defy underlying structural rules, resulting in anomalies like hands with more than five fingers. Recent research studied this failure mode from several viewpoints, offering partial explanations to their occurrence, such as mode interpolation. In this work, we propose a complementary perspective that treats hallucinations as instabilities on the model-induced manifold. We begin by showing that a hallucination filter based on such instabilities matches or exceeds the performance of the recently proposed temporal one. By tracing the source of these instabilities, we identify local intrinsic dimension (LID) as their primary driver and propose Intrinsic Quenching (IQ), a direct corrective mechanism that deflates it to alleviate hallucinations. IQ consistently outperforms standard hallucination reduction baselines across a wide array of benchmarks and offers a highly promising solution for enforcing anatomical consistency in downstream medical imaging tasks.
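The abstract does not say which LID estimator the authors use; one standard choice is the TwoNN estimator of Facco et al., which infers intrinsic dimension from the ratio of each point's two nearest-neighbor distances. The sketch below is that generic estimator, offered only as a stand-in for the paper's LID quantity.

```python
import numpy as np

def two_nn_id(points):
    """TwoNN intrinsic-dimension estimate from nearest-neighbor distance ratios.

    For each point, mu = r2 / r1 (second vs. first nearest-neighbor distance);
    under the TwoNN model, log(mu) is exponential with rate d, so the MLE of
    the intrinsic dimension is n / sum(log(mu)).
    """
    diffs = points[:, None, :] - points[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    srt = np.sort(d, axis=1)
    mu = srt[:, 1] / srt[:, 0]           # r2 / r1 >= 1
    return len(mu) / np.log(mu).sum()
```

Applied patch-wise to samples along the denoising trajectory, a local estimate of this kind is the sort of signal an instability filter could threshold on.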

[CV-24] CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography CVPR2026

【TL;DR】: Current driving datasets focus on flat, well-paved roads and provide only sparse LiDAR ground truth, which is insufficient for assessing fine-grained geometry in depth estimation and completion. CARD (Continuous Road Assessment Dataset) addresses both gaps with continuous multi-modal sequences rich in speed bumps, potholes, irregular surfaces, and off-road segments; multi-LiDAR fusion yields roughly 500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. Spanning ~110 km across Germany and Italy, CARD also provides 2D bounding boxes for road-surface irregularities and a standardized evaluation protocol with strong depth-estimation baselines for both geometry and perception tasks.

Link: https://arxiv.org/abs/2605.05014
Authors: Gasser Elazab, Frank Neuhaus, Tilman Koß, Malte Splietker, Aditya Date, Michael Unterreiner, Maximilian Jansen, Olaf Hellwich
Institutions: CARIAD SE; Technische Universität Berlin; Vision Robotics GmbH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026 (Highlight). Project page: this https URL

Abstract:Autonomous driving must operate across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we establish a standardized evaluation protocol for road surface irregularities on CARD and benchmark state-of-the-art depth estimation models to provide strong baselines. The CARD dataset is hosted on this https URL.

[CV-25] Chaotic Contrastive Learning for Robust Texture Classification

【TL;DR】: Texture classification is hampered by high inter-class similarity and the sensitivity of structural patterns to scale and illumination changes, while CNNs and Vision Transformers over-rely on color and shape cues and generalize poorly across domains. This paper combines self-supervised learning (SSL) with deterministic chaotic dynamics: a chaotic contrastive pre-training strategy uses pixel-wise chaotic maps (Logistic, Tent, Sine) as non-linear data augmentation, exploiting their ergodic behavior to force the network to learn topologically robust features that survive complex noise and reflectance variations. An attention-based ensemble then fuses high-level semantics from a supervised large backbone with low-frequency structural features from the chaos-pretrained tiny encoder, outperforming state-of-the-art approaches on six texture benchmarks (FMD, UMD, KTH-TIPS2-b, DTD, GTOS, and 1200Tex).

Link: https://arxiv.org/abs/2605.05012
Authors: Joao B Florindo
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Texture classification is a pivotal task in computer vision, presenting unique challenges due to high inter-class similarity and the sensitivity of structural patterns to scale and illumination changes. While Convolutional Neural Networks (CNNs) and recent Vision Transformers have set performance benchmarks, they often require extensive labeled datasets or struggle to generalize across domains due to an over-reliance on color and shape features. This paper introduces a novel framework that synergizes Self-Supervised Learning (SSL) with deterministic chaotic dynamics. We propose a chaotic contrastive pre-training strategy, where pixel-wise chaotic maps, specifically Logistic, Tent, and Sine maps, act as non-linear data augmentation techniques. These chaotic perturbations, grounded in ergodic theory, force the network to learn topologically robust features by mimicking complex environmental noise and reflectance variations. Furthermore, we introduce an attention-based feature ensemble that fuses high-level semantic representations from a supervised large backbone with low-frequency structural features from a chaos-pretrained tiny encoder. Experimental results on six texture benchmarks (FMD, UMD, KTH-TIPS2-b, DTD, GTOS, and 1200Tex) demonstrate the superiority of the proposed method, outperforming state-of-the-art approaches and achieving promising accuracies on all the analyzed datasets.
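The pixel-wise chaotic perturbations can be sketched as follows. The map parameters, iteration count, and blending strength below are my own illustrative choices; the paper does not specify them here.

```python
import numpy as np

# classic 1-D chaotic maps on [0, 1]; parameters chosen in the chaotic regime
CHAOTIC_MAPS = {
    "logistic": lambda x, r=3.99: r * x * (1 - x),
    "tent":     lambda x, mu=1.99: np.where(x < 0.5, mu * x, mu * (1 - x)),
    "sine":     lambda x, a=0.99: a * np.sin(np.pi * x),
}

def chaotic_augment(img, map_name="logistic", n_iter=1, strength=0.3):
    """Pixel-wise chaotic perturbation of an image with values in [0, 1].

    Each pixel is iterated through a chaotic map and blended back into the
    original, giving a deterministic, image-dependent 'view' that can serve
    as a contrastive augmentation.
    """
    x = np.clip(img, 1e-6, 1 - 1e-6)   # keep strictly inside (0, 1)
    for _ in range(n_iter):
        x = CHAOTIC_MAPS[map_name](x)
    out = (1 - strength) * img + strength * x
    return np.clip(out, 0.0, 1.0)
```

Unlike Gaussian noise, the perturbation is a deterministic function of the pixel values themselves, so nearby intensities can diverge strongly, which is the property the contrastive objective is meant to exploit.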

[CV-26] Low-Rank Adaptation of Geospatial Foundation Models for Wildfire Mapping Using Sentinel-2 Data

【TL;DR】: This paper studies how to efficiently adapt geospatial foundation models (GFMs) for wildfire burned-area mapping under geographic and temporal domain shift. Evaluating Terramind, DINOv3, and Prithvi-v2 on Sentinel-2 data from 3,820 wildfire events (2017-2023) across the United States and Canada, it systematically compares full fine-tuning, decoder-only fine-tuning, and Low-Rank Adaptation (LoRA). LoRA delivers the strongest cross-domain generalization while updating less than 1% of parameters, with Prithvi-v2 + LoRA achieving the highest overall accuracy and the largest gain over full fine-tuning, making lightweight parameter-efficient adaptation a robust, scalable solution for large-scale burned-area mapping.

Link: https://arxiv.org/abs/2605.04989
Authors: Ali Shibli, Andrea Nascetti, Yifang Ban
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IGARSS 2026

Abstract:Wildfire burned-area mapping is essential for damage assessment, emissions modeling, and understanding fire-climate interactions across diverse ecological regions. Recent geospatial foundation models provide strong general-purpose representations for satellite imagery, yet there is still no clear understanding of how to efficiently adapt these models for downstream Earth observation tasks, particularly under geographic and temporal domain shift. This study evaluates three state-of-the-art Geospatial Foundation Models (GFMs) - Terramind, DINOv3, and Prithvi-v2 - for burned-area mapping across the United States and Canada using Sentinel-2 data. Leveraging 3,820 wildfire events from 2017-2023, we conduct spatial and temporal generalization tests across diverse biomes. We systematically compare full fine-tuning, decoder-only fine-tuning, and Low-Rank Adaptation (LoRA) for adapting each model. Across all experiments, LoRA provides the strongest cross-domain generalization while updating less than 1% of parameters, demonstrating a favorable trade-off between accuracy and efficiency. Prithvi-v2 with LoRA achieves the highest overall accuracy and the largest improvement compared to full fine-tuning. These findings indicate that geospatial foundation models, when adapted using lightweight parameter-efficient methods such as LoRA, offer a robust and scalable solution for large-scale burned-area mapping. Code is available at this https URL.
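The LoRA parameterization behind the "less than 1% of parameters" figure is generic (Hu et al.): a frozen weight plus a trainable low-rank update. A minimal NumPy sketch, not the repository's code; rank, alpha, and initialization scales are illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, weight, r=4, alpha=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = weight.shape
        self.W = weight                              # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))     # trainable down-projection
        self.B = np.zeros((d_out, r))               # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # zero-initialized B means the layer starts identical to the frozen one
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        n_lora = self.A.size + self.B.size
        return n_lora / (self.W.size + n_lora)
```

For a 64x64 layer at rank 2, only 256 of ~4.3K parameters are trainable (~6%); at foundation-model widths the same construction falls well under 1%.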

[CV-27] Attention-Based Chaotic Self-Supervision for Medical Image Classification

【TL;DR】: Medical image classifiers typically depend on large annotated datasets or standard ImageNet transfer, and common self-supervised methods such as masked autoencoders (MAEs) can destroy fine-grained diagnostic features through random masking. The proposed Chaotic Denoising Autoencoder (CDAE) instead perturbs inputs with a chaotic transformation and trains an autoencoder to reconstruct the original, forcing the encoder to learn robust, domain-specific features by "inverting the chaos". An attentive fusion mechanism further combines the CDAE-trained encoder with a standard encoder to exploit both general and domain-specific representations, reaching 0.9221 accuracy / 0.8530 F1-macro on ISIC 2018 (skin lesions) and 0.8644 accuracy / 0.7433 F1-macro on APTOS 2019 (diabetic retinopathy).

Link: https://arxiv.org/abs/2605.04985
Authors: Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning models for medical image classification usually achieve promising results but typically rely on large, annotated datasets or standard transfer learning from ImageNet. Self-Supervised Learning (SSL) has emerged as a powerful alternative, yet common methods like masked autoencoders (MAEs) may inadvertently destroy fine-grained diagnostic features by using random masking. In this paper, we propose a novel SSL pre-training strategy, the Chaotic Denoising Autoencoder (CDAE). Instead of masking, we apply a chaotic transformation to the input image, tasking an autoencoder to reconstruct the original. We hypothesize this forces the encoder to learn robust, domain-specific features by “inverting the chaos”. Furthermore, we propose an attentive fusion mechanism that combines features from our CDAE-trained encoder with a standard encoder, leveraging the strengths of both general and domain-specific representations. Our method is evaluated on two public medical datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). The proposed model achieves high performance, with an accuracy of 0.9221 and an F1-macro of 0.8530 on ISIC 2018, and an accuracy of 0.8644 and F1-macro of 0.7433 on APTOS 2019, demonstrating the efficacy of our approach.

[CV-28] ICPR 2026 Competition on Privacy-Preserving Person Re-Identification from Top-View RGB-Depth Camera (TVRID)

【TL;DR】: This companion paper reports the ICPR 2026 TVRID competition on privacy-aware top-view person re-identification from RGB-Depth cameras. The released dataset contains 86 identities captured by four synchronized overhead Intel RealSense D455 cameras, with paired RGB/Depth streams and structured geometric variation across flat, ascent, descent, and oblique viewpoints. Three tracks are evaluated: RGB Re-ID, Depth Re-ID, and RGB ↔ Depth cross-modal retrieval, ranked by mAP and CMC-1 under a unified server-side protocol. The final results reveal a clear difficulty ordering across tracks, highlighting both the challenge of modality-constrained retrieval and the feasibility of strong performance with modality-invariant learning, and the release establishes a reproducible benchmark for top-view, depth-based, and cross-modal person re-identification.

Link: https://arxiv.org/abs/2605.04977
Authors: Raphaël Delécluse, Hazem Wannous, Laurent Guimas
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This companion paper reports the ICPR 2026 TVRID competition on privacy-aware top-view person re-identification. We present the competition setting, the released RGB-Depth dataset, and a summary of final results with descriptions of the top entries. TVRID contains 86 identities captured by four synchronized overhead Intel RealSense D455 cameras, with paired RGB/Depth streams and structured geometric variation across flat, ascent, descent, and oblique viewpoints. The evaluation protocol includes three tracks: RGB Re-ID, Depth Re-ID, and RGB \leftrightarrow Depth cross-modal retrieval. Submissions are ranked using mAP and CMC-1 under a unified server-side evaluation. The final results show a clear difficulty ordering (RGB Depth Cross-Modal), highlighting both the challenge of modality-constrained retrieval and the feasibility of strong performance with modality-invariant learning. By releasing the dataset at this https URL, the evaluation scripts at this https URL, and the accompanying documentation, TVRID establishes a reproducible benchmark for top-view, depth-based, and cross-modal person re-id.
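The two ranking metrics used for the leaderboard can be computed directly from a query-gallery similarity matrix. A minimal sketch of CMC rank-1 and mean average precision; it assumes every query has at least one gallery match, and the competition's exact matching/junk-filtering rules may differ.

```python
import numpy as np

def rank1_and_map(sim, query_ids, gallery_ids):
    """CMC rank-1 and mean average precision from a similarity matrix.

    sim[q, g]: similarity between query q and gallery item g.
    Each query is assumed to have at least one matching gallery identity.
    """
    rank1_hits, aps = [], []
    for q in range(sim.shape[0]):
        order = np.argsort(-sim[q])                         # best match first
        matches = (gallery_ids[order] == query_ids[q]).astype(float)
        rank1_hits.append(matches[0])                       # CMC-1 contribution
        hit_pos = np.flatnonzero(matches)                   # ranks of true matches
        precisions = (np.arange(len(hit_pos)) + 1) / (hit_pos + 1)
        aps.append(precisions.mean())                       # AP for this query
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```

A perfect ranking yields (1.0, 1.0); pushing a true match down the list lowers mAP even when rank-1 is unaffected.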

[CV-29] DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring

【TL;DR】: Condition monitoring (CM) of synthetic fibre ropes (SFRs) in offshore, maritime, and industrial settings requires more than a classifier: inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image. DART (Damage Assessment via Rope Transformer) couples a ViT-H/14 vision encoder with Llama-3.2-3B-Instruct through a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Three innovations drive its versatility: the saliency-guided HD-MASK masking strategy that focuses self-supervised reconstruction on damage-dense patches, per-class learnable severity gates that adaptively weight language grounding by damage category, and a Contrastive Damage Disentanglement (CDD) loss that jointly encodes damage type, severity ordering, and cross-modal semantics. Trained once on 4,270 images over 14 fine-grained damage classes, the frozen backbone supports downstream tasks without task-specific fine-tuning: 93.22% classification accuracy, Spearman rho = 0.94 severity regression, and 89.2% macro-F1 at 20 shots, acting as a general-purpose CM backbone.

Link: https://arxiv.org/abs/2605.04943
Authors: Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic
Institutions: Aalborg University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures, 9 tables

Abstract:The condition monitoring (CM) of synthetic fibre ropes (SFRs) used in offshore, maritime, and industrial settings demands more than a classifier: inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image. We present DART (Damage Assessment via Rope Transformer), a vision-language foundation model that addresses the full rope inspection workflow through a unified multi-task architecture. DART extends the Joint-Embedding Predictive Architecture (JEPA) to the cross-modal domain by coupling a Vision Transformer (ViT-H/14) with Llama-3.2-3B-Instruct via a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Three architectural innovations drive the model’s versatility: (1) HD-MASK, a saliency-guided masking strategy that focuses self-supervised reconstruction on damage-dense patches; (2) per-class learnable severity gates that adaptively weight language grounding by damage category; and (3) a Contrastive Damage Disentanglement (CDD) loss that shapes the embedding space to simultaneously encode damage type, severity ordering, and cross-modal semantics. Trained once on 4,270 images spanning 14 fine-grained rope damage classes, the frozen DART backbone supports downstream tasks without any task-specific fine-tuning: damage classification (93.22 % accuracy, 91.04 % macro-F1, +38.5 pp over a vision-only baseline), continuous severity regression (Spearman rho = 0.94, within-1-ordinal accuracy 99.6 %), few-shot recognition (89.2 % macro-F1 at 20 shots). These results demonstrate that DART functions as a general-purpose CM backbone that goes well beyond classification, providing actionable inspection intelligence from a single shared representation.

[CV-30] Exploring Clustering Capability of Inpainting Model Embeddings for Pattern-based Individual Identification

【TL;DR】: Models for individual animal identification often latch onto background or body-shape cues rather than skin-pattern structure; such cues are not individually specific and can change drastically over time. This paper explores image inpainting of task-specific masks as an auxiliary task to make encoders more responsive to skin-pattern structure when extracting individual visual embeddings, and presents a comparative analysis of four encoder backbones on zebrafish, a widely recognized biological model organism with individually identifying skin patterns. Backbone performance is evaluated with classification accuracy, embedding clustering metrics, and GradCAM visualizations.

Link: https://arxiv.org/abs/2605.04904
Authors: Jens van Bijsterveld, Daniele Avitabile, Fons J. Verbeek, Rita Pucci
Institutions: Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands; Naturalis Biodiversity Center, Leiden, The Netherlands
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we explore deep learning techniques for individual identification of animals based on their skin patterns. Individual identification is crucial in biodiversity monitoring, since it enables analysis of decline or growth of populations, or intra-species interactions within populations. Models trained for the task of individual identification often do not focus on the skin pattern of animals, but on background details or body shape details. These characteristics are not individually specific, or can change drastically through time. We focus on techniques that will make machine learning models more responsive to skin pattern structure when extracting individual visual embeddings from images. For this, we explore image inpainting of task-specific masks as an auxiliary task to enhance ML-based individual identification from animal skin patterns. We propose a comparative analysis among four models as an encoder backbone for the individual identification task. We focus on the case study of zebrafish, which is a widely recognized biological model organism, and which exhibits individually identifying skin patterns. To evaluate encoder backbone performance, we present standard metrics for classification accuracy, embedding clustering metrics, and GradCAM visualizations.

[CV-31] Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs

【TL;DR】: LLM-based neural architecture search typically generates complete model implementations from scratch, which is computationally expensive and verbose. Delta-Code Generation instead fine-tunes LLMs to emit compact unified diffs (deltas) that refine baseline architectures. The pipeline iteratively fine-tunes via LoRA on curated architectures from the LEMUR dataset, with MinHash-Jaccard novelty filtering for structural diversity. Across six datasets and three 7B-class LLMs (DeepSeek-Coder-7B, Qwen2.5-Coder-7B, Mistral-7B), delta generation substantially surpasses the full-generation baseline in valid rate and first-epoch accuracy (e.g., DeepSeek-Coder: 75.3% valid rate and 65.8% mean accuracy vs. 50.6%/42.3%), while cutting output length by 75-85%, offering a token-efficient, multi-domain, LLM-agnostic alternative for LLM-driven NAS.

Link: https://arxiv.org/abs/2605.04903
Authors: Santosh Premi Adhikari, Radu Timofte, Dmitry Ignatov
Institutions: University of Würzburg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures, 7 tables

Abstract:Large language models (LLMs) show strong potential for neural architecture generation, yet existing approaches produce complete model implementations from scratch – computationally expensive and yielding verbose code. We propose Delta-Code Generation, where fine-tuned LLMs generate compact unified diffs (deltas) to refine baseline architectures rather than synthesizing entire models. Our pipeline iteratively fine-tunes the LLM via LoRA on curated architectures from the LEMUR dataset, with MinHash-Jaccard novelty filtering for structural diversity. We evaluate three 7B-class LLMs – DeepSeek-Coder-7B, Qwen2.5-Coder-7B, and Mistral-7B – across six datasets (CIFAR-10, CIFAR-100, MNIST, SVHN, ImageNette, CelebA) using a 22-cycle protocol (1,100 candidates per LLM). All three substantially surpass the full-generation baseline (50.6% valid rate, 42.3% mean first-epoch accuracy): DeepSeek-Coder reaches 75.3% valid rate and 65.8% mean accuracy; Qwen2.5-Coder 72.1%/64.6%; Mistral 66.6%/66.1%. On CIFAR-10, best first-epoch accuracies reach 85.5% (Mistral), 85.2% (DeepSeek), 80.6% (Qwen) – well above 63.98% full generation and 71.5% for the concurrent approach of Gu et al. Output lengths are 30-50 lines versus 200+ for full generation (75-85% reduction). A 50-epoch study confirms the 1-epoch proxy preserves rankings (Mistral: Spearman \rho = 0.926). Delta-based generation is a token-efficient, multi-domain, LLM-agnostic alternative to full-model synthesis for LLM-driven NAS.
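The MinHash-Jaccard novelty filter can be sketched with generic MinHash over token sets: two candidates whose signatures agree on too many slots are near-duplicates and one is discarded. The number of hash functions, the mixing scheme, and the tokenization below are my assumptions, not the paper's.

```python
import numpy as np

def minhash_signature(tokens, num_hashes=128, seed=0):
    """MinHash signature of a token set.

    The fraction of matching signature slots between two sets is an
    unbiased estimate of their Jaccard similarity.
    """
    rng = np.random.default_rng(seed)
    # one random mixing constant per hash function
    salts = rng.integers(1, 2**61, size=num_hashes, dtype=np.int64)
    hashes = np.array([hash(t) for t in tokens], dtype=np.int64)
    # signature[i] = min over tokens of (salt_i * h) mod a large Mersenne prime
    return ((salts[:, None] * hashes[None, :]) % (2**61 - 1)).min(axis=1)

def estimated_jaccard(sig_a, sig_b):
    return float((sig_a == sig_b).mean())
```

A novelty filter then keeps a candidate only if its estimated Jaccard similarity against every previously accepted candidate stays below some threshold.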

[CV-32] FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

【TL;DR】: Automated glaucoma detection must remain fair across diverse patient populations to be clinically deployable. FairEnc is a fair pretraining method for vision-language models (VLMs) that simultaneously debiases both modalities with respect to multiple sensitive attributes (race, gender, ethnicity, and language). For the text encoder, a large language model generates synthetic clinical descriptions with varied sensitive attributes but preserved disease semantics, and a contrastive alignment objective encourages demographic-invariant representations. For the visual encoder, a dual-level fairness strategy combines mutual-information regularization, which reduces statistical dependence between learned features and demographic groups, with multi-discriminator adversarial debiasing. Experiments on the public Harvard-FairVLMed dataset and the private FairFundus dataset show reduced DPD/DEOdds disparities with strong diagnostic performance, including under cross-domain and cross-modality shifts.

Link: https://arxiv.org/abs/2605.04882
Authors: Mohamed Elhabebe, Ayman El-Baz, Qing Liu
Institutions: University of Oulu; University of Louisville
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems. However, ensuring fairness across diverse patient populations remains a significant challenge. In this paper, we propose FairEnc, a fair pretraining method for vision-language models (VLMs) that enables simultaneous debiasing across multiple sensitive attributes. FairEnc jointly mitigates biases in both textual and visual modalities with respect to multiple sensitive attributes, including race, gender, ethnicity, and language. Specifically, for the textual encoder, we leverage a large language model to generate synthetic clinical descriptions with varied sensitive attributes while preserving disease semantics, and employ a contrastive alignment objective to encourage demographic-invariant representations. For the visual encoder, we propose a dual-level fairness strategy that combines mutual information regularization to reduce statistical dependence between learned features and demographic groups, with multi-discriminator adversarial debiasing. Comprehensive experiments on the publicly available Harvard-FairVLMed dataset demonstrate that FairEnc effectively reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero-shot and linear probing evaluations. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves fairness advantages under cross-domain and cross-modality settings and maintains diagnostic performance within a competitive range. These results highlight FairEnc’s ability to generalize fairness under distribution shifts, supporting its potential for more equitable deployment in real-world clinical settings. Our codebase and synthetic clinical notes are available at this https URL

[CV-33] VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

【TL;DR】: Despite strong multimodal understanding, Video-LLMs underperform on existing Video TextVQA benchmarks. An upper-bound analysis via frame-wise question answering, counting a sample as correct if any frame yields the right answer, substantially outperforms direct video inference, revealing that the bottleneck lies in localizing key question-relevant evidence rather than in reasoning capacity itself. The paper therefore proposes a question-guided agent framework that explicitly anchors the relevant keyframes before answering; it surpasses direct video inference even in a training-free setting, and with supervised fine-tuning (SFT) and reinforcement learning (RL) achieves average gains of +12.12 in accuracy and +11.15 in ANLS, setting new state-of-the-art results.

Link: https://arxiv.org/abs/2605.04870
Authors: Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
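The frame-wise upper-bound protocol, counting a sample as correct if any frame yields the right answer, reduces to a small oracle computation. Exact string match is used here for simplicity; the paper's answer-matching criterion may differ.

```python
def frame_upper_bound(frame_answers, gold):
    """Oracle accuracy: a sample is correct if ANY per-frame answer matches.

    frame_answers: list of per-sample lists of frame-level answers
    gold: list of gold answers, one per sample
    """
    correct = sum(any(a == g for a in answers)
                  for answers, g in zip(frame_answers, gold))
    return correct / len(gold)
```

The gap between this oracle and direct video inference is what the paper attributes to keyframe localization rather than reasoning.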

[CV-34] 3D Ultrasound-Derived Pseudo-CT Synthesis Using a Transformer-Augmented Residual Network for Real-Time Operator Guidance

【TL;DR】: CT exposes patients to ionizing radiation, while ultrasound (US) is non-ionizing but highly operator-dependent and lacks quantitative tissue characterization. This paper presents a 3D ultrasound-derived pseudo-CT (UD-pCT) framework: paired kidney US and CT volumes from the TRUSTED dataset are spatially aligned with a landmark-based multimodal registration pipeline, and a Bottleneck Transformer Residual U-Net3D (BT-ResUNet3D), a 3D residual encoder-decoder generator with a transformer bottleneck, is trained adversarially against a 3D Conditional PatchGAN discriminator that enforces local structural realism. The combination models both fine-grained local anatomy and long-range volumetric dependencies, outperforming established baselines in PSNR and SSIM, and the pseudo-CT volumes provide real-time anatomical reference for operator guidance, potentially reducing acquisition variability and unnecessary CT examinations. The relatively small paired dataset remains a stated limitation on generalizability.

Link: https://arxiv.org/abs/2605.04856
Authors: Sapna Sachan, Amulya Kumar Mahto
Institutions: Indian Institute of Technology Guwahati
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures

Abstract:Computed tomography (CT) is indispensable for clinical diagnosis and image-guided interventions but exposes patients to ionizing radiation, motivating the development of safer imaging alternatives. Ultrasound (US) is non-ionizing and widely accessible; however, it is highly operator dependent and lacks quantitative tissue characterization, often leading to diagnostic uncertainty and unnecessary CT examinations. This work presents a 3D ultrasound-derived pseudo-CT (UD-pCT) framework that generates CT-like anatomical reference volumes inferred from US, without aiming to reproduce physically accurate Hounsfield Units. Paired 3D kidney US and CT volumes from the TRUSTED dataset are first spatially aligned using a landmark-based multimodal registration pipeline, creating high-quality paired inputs for supervised training of an adversarial framework. The proposed Bottleneck Transformer Residual U-Net3D (BT-ResUNet3D) model employs a 3D residual encoder-decoder generator augmented with a transformer bottleneck, enabling effective modeling of fine-grained local anatomical structures as well as long-range volumetric dependencies, while a 3D Conditional PatchGAN discriminator enforces local structural realism in the synthesized pseudo-CT volumes. Quantitative evaluation using PSNR and SSIM demonstrates that the proposed method outperforms established baselines in structural fidelity and perceptual image quality. The UD-pCT volumes provide real-time anatomical reference for operator guidance, potentially reducing acquisition variability and unnecessary CT use. A limitation of this study is the relatively small paired dataset, which may limit the generalizability of the proposed model.

[CV-35] QuadBox: Accelerating 3D Gaussian Splatting with Geometry-Aware Boxes ICIP26

【TL;DR】: Efficiently computing precise Gaussian-tile intersections is a critical step in the 3D Gaussian Splatting (3DGS) rasterization pipeline for real-time novel view synthesis. QuadBox tightly encapsulates each projected Gaussian with four axis-aligned bounding boxes: a geometry-aware stretching factor constructs tile-aligned QuadBoxes that cover the elliptical projection while largely excluding irrelevant tiles, and QPass, a single-pass tile traversal algorithm, exploits this discrete structure so that intersection checks reduce to simple interval tests with no complex geometric computation. On public datasets the method accelerates 3DGS rendering by 1.85×.

Link: https://arxiv.org/abs/2605.04844
Authors: Xinze Li, Bohan Yang, Pengxu Chen, Yiyuan Wang, Hongcheng Luo, Wentao Cheng, Weifeng Su
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 6 pages, 4 figures. Accepted by ICIP 26

Abstract:3D Gaussian Splatting (3DGS) has emerged as an advanced technique for real-time novel view synthesis by representing scene geometry and appearance using differentiable Gaussian primitives. However, efficiently computing precise Gaussian-tile intersections remains a critical task in the rasterization pipeline. To this end, we propose QuadBox, a method that leverages four axis-aligned bounding boxes to tightly encapsulate projected Gaussians in a discrete manner. First, we derive a geometry-aware stretching factor that enables the construction of a tile-aligned QuadBox, which covers the elliptical projection and largely excludes irrelevant tiles. Second, we introduce QPass, a single-pass tile traversal algorithm that exhaustively exploits the discrete nature of QuadBox, ensuring that the tile intersection check is performed with simple interval tests. Experiments on public datasets show that our method accelerates the rendering speed of 3DGS by 1.85×. Code is available at this https URL.
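The "simple interval tests" come down to comparing tile-index ranges for an axis-aligned box. A minimal sketch for a single box; QuadBox itself unions four such tile-aligned boxes per Gaussian, and the tile size and index conventions here are illustrative.

```python
def box_tile_overlap(box, tile_size=16):
    """Tiles intersected by an axis-aligned box, via interval tests only.

    box = (x_min, y_min, x_max, y_max) in pixels; returns the set of
    (tile_x, tile_y) indices the box touches. No per-tile ellipse test
    is needed: integer division gives the inclusive tile-index ranges.
    """
    x0, y0, x1, y1 = box
    tx0, ty0 = int(x0 // tile_size), int(y0 // tile_size)
    tx1, ty1 = int(x1 // tile_size), int(y1 // tile_size)
    return {(tx, ty) for tx in range(tx0, tx1 + 1)
                     for ty in range(ty0, ty1 + 1)}
```

Covering an ellipse with four tighter boxes instead of one loose AABB shrinks these index ranges, which is where the reported tile-count savings come from.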

[CV-36] MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education MICCAI2025

【速读】:该论文旨在解决医学教育中缺乏多样化、标注准确且具备交互功能的医学图像资源的问题,传统医学图谱因体积庞大难以使用,而在线图像搜索则可能提供错误标注或不完整的内容。解决方案的关键在于提出MIRAGE系统,其核心是将文本与图像映射到共享潜在空间(shared latent space),从而实现语义层面的精准检索与生成;该系统基于微调后的医学版CLIP模型(MedICaT-ROCO)和医疗扩散模型(Prompt2MedImage),结合大语言模型(Dolly-v2-3b)提供增强描述,并支持双模式检索以进行不同疾病状态的视觉对比,整个架构完全依赖公开预训练模型,确保了可复现性与易用性,尤其适合无编程技能的医学生在全球范围内开展交互式个性化学习。

链接: https://arxiv.org/abs/2605.04772
作者: Miguel Diaz Benito,Cecilia Diana Albelda,Alvaro Garcia Martin,Jesus Bescos Cano,Marcos Escudero-Vinolo,Juan C. SanMiguel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the Workshop on Applications of Medical AI (AMAI 2025), in conjunction with MICCAI 2025

点击查看摘要

Abstract:Access to diverse, well-annotated medical images with interactive learning tools is fundamental for training practitioners in medicine and related fields to improve their diagnostic skills and understanding of anatomical structures. While medical atlases are valuable, they are often impractical due to their size and lack of interactivity, whereas online image search may provide mislabeled or incomplete material. To address this, we propose MIRAGE, a multimodal medical text and image retrieval and generation system that allows users to find and generate clinically relevant images from trustworthy sources by mapping both text and images to a shared latent space, enabling semantically meaningful queries. The system is based on a fine-tuned medical version of CLIP (MedICaT-ROCO), trained with the ROCO dataset, obtained from PubMed Central. MIRAGE allows users to give prompts to retrieve images, generate synthetic ones through a medical diffusion model (Prompt2MedImage) and receive enriched descriptions from a large language model (Dolly-v2-3b). It also supports a dual search option, enabling the visual comparison of different medical conditions. A key advantage of the system is that it relies entirely on publicly available pretrained models, ensuring reproducibility and accessibility. Our goal is to provide a free, transparent and easy-to-use didactic tool for medical students, especially those without programming skills. The system features an interface that enables interactive and personalized visual learning through medical image retrieval and generation. The system is accessible to medical students worldwide without requiring local computational resources or technical expertise, and is currently deployed on Kaggle: this http URL

[CV-37] Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation

【速读】:该论文旨在解决异构人脸识别(Heterogeneous Face Recognition, HFR)中模型计算复杂度高、难以部署于资源受限边缘设备的问题。现有HFR方法虽性能提升显著,但多依赖于昂贵的深度学习架构,限制了其在实际场景中的应用。解决方案的关键在于引入一种轻量级且高效的框架,通过适配原本用于RGB同质人脸识别的混合卷积神经网络-Transformer(CNN-Transformer)模型,实现仅需少量配对异构数据即可端到端训练,并在保持标准RGB人脸识别基准性能的同时,显著降低计算开销,从而适用于同质与异构两种应用场景。

链接: https://arxiv.org/abs/2605.04769
作者: Anjith George,Sebastien Marcel
机构: Idiap Research Institute (Idiap 研究所); Université de Lausanne (洛桑大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE TBIOM

点击查看摘要

Abstract:Heterogeneous Face Recognition (HFR) aims at matching face images captured across different sensing modalities, such as thermal-to-visible or near-infrared-to-visible, enhancing the usability of face recognition systems in challenging real-world conditions. Although recent HFR methods have achieved significant improvements in performance, many rely on computationally expensive models, making them impractical for deployment on resource-limited edge devices. In this work, we introduce a lightweight yet effective HFR framework by adapting a hybrid CNN-Transformer model originally developed for RGB homogeneous face recognition. Our approach enables efficient end-to-end training with only a small amount of paired heterogeneous data, while still maintaining strong performance on standard RGB face recognition benchmarks. This makes it suitable for both homogeneous and heterogeneous settings. Comprehensive experiments on several challenging HFR and face recognition benchmarks show that our method achieves state-of-the-art or competitive performance while keeping computational requirements low.

[CV-38] Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition

【速读】:该论文旨在解决交通拥堵分类中如何协同建模道路场景上下文与非平稳交通运动特征的问题。现有方法通常将二者孤立处理:视觉方法依赖外观线索并采用标准时序池化,易偏向静态基础设施;信号方法虽能刻画时序动态却缺乏空间上下文以实现场景级定位。解决方案的关键在于提出FLO-EMD框架,通过运动引导的注意力机制(motion-guided attention)增强RGB特征对运动相关区域的聚焦能力,并结合经验模态分解(Empirical Mode Decomposition, EMD)从密集光流统计中提取内在时序成分,从而在保持数据自适应时序表征的同时实现时空特征融合,最终提升拥堵等级分类精度。

链接: https://arxiv.org/abs/2605.04752
作者: Eugene Kofi Okrah Denteh,Blessing Agyei Kyem,Joshua Kofi Asamoah,Armstrong Aboah
机构: North Dakota State University (北达科他州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate traffic congestion classification requires models that jointly capture roadway scene context and non-stationary traffic motion, yet most prior work treats these requirements in isolation. Vision-based methods often depend on appearance cues with standard temporal pooling, which can bias predictions toward static infrastructure, whereas signal-based approaches characterize temporal dynamics but lack the spatial context needed for scene-level localization. These complementary limitations motivate a unified framework that links motion evidence to spatial feature selection while preserving data-adaptive temporal characterization. This study therefore proposes FLO-EMD, a hybrid approach that couples motion-guided attention with empirical, data-driven temporal decomposition. Dense optical flow guides channel and spatial attention so that RGB features are refined toward motion-relevant regions. In parallel, aggregated flow statistics form compact motion traces that are decomposed using Empirical Mode Decomposition (EMD) to extract intrinsic temporal components. The resulting EMD embedding is fused with learned spatiotemporal representations to classify light, medium, and heavy congestion. Experiments on 1,050 five-second clips from four surveillance networks show that FLO-EMD achieves 97.5% overall test accuracy (weighted F1 = 0.9742), outperforming established baselines and remaining robust across diverse environmental conditions; ablation and sensitivity analyses further quantify the contributions of EMD, the number of intrinsic mode functions, and the selected motion descriptors.
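FLO-EMD 使用经验模态分解 (EMD) 从聚合光流统计构成的运动轨迹中提取内在时序分量。以下为 EMD 单次"筛分"(sifting) 步骤的极简示意:用局部极值点的线性插值包络近似上下包络并减去其均值(非论文实现;实际 EMD 通常使用三次样条插值并迭代筛分至满足 IMF 条件):

```python
import numpy as np

def sift_once(x: np.ndarray) -> np.ndarray:
    """单次 EMD 筛分:减去由局部极大/极小值线性插值得到的包络均值。"""
    t = np.arange(len(x))
    # 严格局部极大值 / 极小值(首尾置 False)
    up = np.r_[False, (x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]), False]
    lo = np.r_[False, (x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]), False]
    if up.sum() < 2 or lo.sum() < 2:
        return x  # 极值点不足,无法构造包络
    upper = np.interp(t, t[up], x[up])
    lower = np.interp(t, t[lo], x[lo])
    return x - (upper + lower) / 2.0

# 示例:正弦叠加线性趋势,一次筛分后趋势分量被大幅削弱
x = np.sin(np.linspace(0, 6 * np.pi, 200)) + np.linspace(0, 1, 200)
y = sift_once(x)
```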

[CV-39] VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision

【速读】:该论文旨在解决单通道热成像图像中低显著性目标(如车辆和船舶)的识别难题,此类问题在监控等应用中尤为关键。现有方法因缺乏颜色信息导致同类物体间区分度低(忽略形状信息),且纹理特征被弱化,同时视角变化进一步加剧特征差异,使得识别性能受限。解决方案的关键在于构建视角条件化的特征向量,并在独立的特征空间中进行区域特异性特征对比,从而有效利用预训练于RGB图像的视觉Transformer(Vision Transformer, ViT)特征提取器的优势,同时针对性地适应热域特性。实验表明,该方法在RGBNT100(IR)车辆数据集和自建的热成像海事数据集上分别实现了mAP提升19.7%和12.8%,显著优于当前最先进方法。

链接: https://arxiv.org/abs/2605.04750
作者: Yasod Ginige,Ransika Gunasekara,Darsha Hewavitharana,Manjula Ariyarathne,Peshala Jayasekara,Ranga Rodrigo
机构: University of Sydney (悉尼大学); University of Moratuwa (莫鲁塔瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Identification of less-articulated objects using single-channel images, such as thermal images, is important in many applications, such as surveillance. However, in this domain, existing methods show poor performance due to high similarity among objects of the same category in the absence of color information (overlooking shape information) and de-emphasized texture information. Furthermore, variability in viewpoint adds more complexity as the features vary from side to side. We address these issues by constructing viewpoint-conditioned feature vectors and area-specific feature comparisons in separate feature spaces. These interventions enable leveraging the advancements of existing RGB-pre-trained ViT feature extractors while effectively adapting them to address the challenges specific to the thermal domain. We test our system with RGBNT100 (IR) vehicle dataset and a thermal maritime dataset acquired by us. Our results surpass the state-of-the-art methods by 19.7% and 12.8% for the above datasets in mAP scores, respectively. We also plan to make our thermal dataset available, the first of its kind for maritime vessel identification.

[CV-40] Morphology-Guided Cross-Task Coupling for Joint Building Height and Footprint Estimation

【速读】:该论文旨在解决建筑高度(Building Height, BH)与建筑轮廓(Building Footprint, BF)在遥感估计中被独立建模导致的精度受限问题,而实际上二者受楼层面积比(Floor-Area-Ratio, FAR)约束存在强耦合关系。解决方案的关键在于提出MorphoFormer框架,其核心机制为:(i) 建筑轮廓引导的任务解码器(BF-Guided Task Decoder, BGTD),通过跨注意力机制利用轮廓衍生的形态上下文对高度分支进行门控;(ii) 形态一致性损失(Morphology Consistency Loss, MCL),通过监督高度-轮廓代理模型间接迫使建筑轮廓特征编码与高度相关的结构信息。这两个机制均作用于跨任务表示而非像素级特征,从而在不依赖输入分辨率的前提下显著提升BH预测精度(RMSE降低0.24 m),同时保持BF预测性能稳定。

链接: https://arxiv.org/abs/2605.04731
作者: Jinzhen Han,JinByeong Lee,Jisung Kim,HongSik Yun
机构: University of Leeds (利兹大学); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Building height (BH) and building footprint (BF) jointly describe the vertical and horizontal extent of the built environment and are required inputs for urban climate, disaster-risk, and population-mapping models. The two parameters are coupled through floor-area-ratio (FAR) constraints, yet remote-sensing approaches typically treat them as independent regression targets. We argue that explicitly encoding this cross-task coupling is more impactful than further refining individual encoders, and propose MorphoFormer, a joint BH/BF estimation framework built around two complementary mechanisms: (i) a BF-Guided Task Decoder (BGTD) that gates the height branch via cross-attention on a footprint-derived morphology context, and (ii) a Morphology Consistency Loss (MCL) that supervises a height-from-footprint surrogate against the ground-truth BH, indirectly forcing the BF feature to encode height-correlated structure. The encoder is a single-stage Swin backbone fed by Sentinel-1 SAR, Sentinel-2 multispectral, and DEM inputs, trained and evaluated on a geo-blocked split of 54 cities. Against a Swin-MTL baseline at identical receptive field, MorphoFormer reduces BH test RMSE from 3.39 to 3.15 m (R^2 improves from 0.62 to 0.67) with BF R^2 stable at 0.80. Controlled ablations at identical capacity attribute most of this 0.24 m improvement to the two proposed mechanisms: removing BGTD raises BH RMSE by 0.11 m and removing MCL raises it by 0.11 m, with the residual (approximately 0.02 m) falling within the noise floor of encoder-side variations. Because both mechanisms act on cross-task representations rather than pixels, the design carries no intrinsic dependence on input resolution.

[CV-41] ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting CVPR

【速读】:该论文旨在解决基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的视觉定位方法中因特征学习偏差导致的关键点匹配不准确问题。研究表明,广泛采用的α-混合优化策略会引入3D点特征偏差,其根源在于单个高斯分布与其邻近高斯之间的耦合关系,使得学习到的特征难以用于精确匹配任务。解决方案的关键在于提出ULF-Loc框架,通过几何加权特征融合替代有偏特征优化,并结合关键点一致性采样与局部几何一致性验证机制,从而提升特征可靠性并有效剔除由渲染伪影引起的误匹配。该方法在Cambridge Landmarks数据集上将平均中位数平移误差降低17%,同时训练效率显著优于当前最优方法STDLoc。

链接: https://arxiv.org/abs/2605.04730
作者: Yingdong Gu,Shaocheng Yan,Zhenjun Zhao,Yuan Kou,Jianxin Luo,Pengcheng Shi,Jiayuan Li
机构: Wuhan University (武汉大学); University of Zaragoza (萨拉戈萨大学); The First Surveying and Mapping Institute of Hunan Province (湖南省第一测绘院); The Hunan Engineering Research Center of 3D Real Scene Construction and Application Technology (湖南省三维实景场景构建与应用技术工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published to CVPR (highlight)

点击查看摘要

Abstract:Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted \alpha -blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. We further introduce keypoint-consensus landmark sampling to select reliable Gaussians and local geometric consistency verification to reject mismatches caused by rendering artifacts. On the Cambridge Landmarks dataset, ULF-Loc reduces the mean median translation error by 17% compared to the state-of-the-art, while achieving superior efficiency with only 1/10 the training time and 1/6 the GPU memory of STDLoc.
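ULF-Loc 用"几何加权特征融合"替代有偏的 α-混合特征优化。下面仅示意"按几何权重对多视角特征做归一化加权平均"这一一般思路(权重来源如可见性/不透明度仅为示例假设,并非论文中的具体融合公式):

```python
import numpy as np

def fuse_features(view_feats, geom_weights):
    """按几何权重对多视角特征做归一化加权平均,得到一个地标特征。"""
    f = np.asarray(view_feats, dtype=float)    # (V, D) 每视角特征
    w = np.asarray(geom_weights, dtype=float)  # (V,)  几何权重
    w = w / w.sum()
    return (w[:, None] * f).sum(axis=0)

feat = fuse_features([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
print(feat)  # [0.75 0.25]
```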

[CV-42] Anny-Fit: All-Age Human Mesh Recovery CVPR2026

【速读】:该论文旨在解决从单张图像中恢复全年龄段(all-age)多人三维人体姿态与形状(3D human mesh recovery, HMR)的问题,现有方法通常假设场景中仅包含成年人且独立优化每个人,这在真实世界多年龄场景下失效,因无法处理不同年龄个体间的身体比例差异和深度歧义。其解决方案的关键在于提出Anny-Fit框架,该框架在相机坐标系中联合优化所有个体,通过引入多种专家知识信号——包括度量深度图、实例分割、2D关键点以及由视觉语言模型(VLM)推导出的语义属性(如年龄和性别)——共同约束优化过程,从而消除深度尺度歧义并提升空间一致性。此方法不仅显著改善了2D重投影误差、相对深度排序和3D/形状估计精度,还通过伪真值标注实现了对成人训练HMR模型的零样本迁移,使其无需重新训练即可适应全年龄谱系。

链接: https://arxiv.org/abs/2605.04728
作者: Laura Bravo-Sánchez,Matthieu Armando,Romain Brégier,Grégory Rogez,Serena Yeung-Levy,Fabien Baradel
机构: 1. University of Oxford (牛津大学); 2. Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings Track - Code available at this https URL

点击查看摘要

Abstract:Recovering 3D human pose and shape from a single image remains a cornerstone of human-centric vision, yet most methods assume adult subjects and optimize each person independently. These assumptions fail in real-world, all-age scenes, where body proportions and depth must be resolved jointly. We introduce Anny-Fit, a multi-person, camera-space optimization framework for all-age 3D human mesh recovery (HMR). Unlike existing per-person fitting methods, Anny-Fit jointly optimizes all individuals directly in the camera coordinate system, enforcing global spatial consistency. At the core of our approach is the use of multiple forms of expert knowledge – including metric depth maps, instance segmentation, 2D keypoints, and, VLM-derived semantic attributes such as age and gender – each obtained from dedicated off-the-shelf networks. These complementary signals jointly guide the optimization, constraining the depth-scale ambiguity characteristic of all-age scenes. Across diverse datasets, Anny-Fit consistently improves 2D reprojection accuracy (+13 to 16), relative depth ordering (+6 to 7), 3D estimation error (-9 to -29) and shape estimation (+25 to +82), producing more coherent scenes. Finally, we show that VLM-based semantic knowledge can be distilled into an HMR model via the pseudo-ground-truth annotations produced by Anny-Fit on training data, enabling it to learn semantically meaningful shape parameters while improving HMR performance. Our approach bridges adult-only and all-age modeling by enabling zero-shot adaptation of adult-trained HMR pipelines to the full age spectrum without retraining. Code is publicly available at this https URL.

[CV-43] Not Every Subject Should Stay: Machine Unlearning for Noisy Engagement Recognition

【速读】:该论文旨在解决在情感识别(engagement recognition)数据集中,由于样本标注存在噪声和主观性,导致模型性能受影响的问题,并进一步探讨在模型已训练完成后,如何高效移除特定有害个体(harmful subject)的影响而无需从头重新训练。其核心解决方案是提出一种基于样本级机器遗忘(machine unlearning)的后处理机制——通过构建一个依赖于模型状态的代理指标对候选有害主体进行排序,随后应用轻量级近似遗忘更新策略,在不重新训练整个模型的前提下实现对特定主体影响的去除。实验表明,在EngageNet和DAiSEE数据集上,该方法在仅消耗约四分之一重训练成本的情况下,分别恢复了89.3%和92.5%的“从头训练”基准模型性能,验证了其作为低成本修正机制的有效性。

链接: https://arxiv.org/abs/2605.04713
作者: Alexander Vedernikov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Engagement recognition datasets are typically subject-indexed and often contain noisy, subjective supervision, making post-hoc dataset revision a practical problem. Existing noisy-label and data-cleaning methods largely operate at the sample level before or during training, but do not directly address a different question: once a model has already been trained, can the influence of an entire problematic subject be removed without full retraining? We study this setting through subject-level machine unlearning as a post-hoc sanitization mechanism for engagement recognition. Starting from a baseline trained on all subjects, we rank candidate harmful subjects using a model-dependent proxy, apply a lightweight approximate unlearning update, and compare the result against an oracle model retrained from scratch on the retained subjects only. We instantiate this protocol on DAiSEE and EngageNet using Tensor-Convolution and Convolution-Transformer Network (TCCT-Net) as a fixed platform and evaluate three matched model states under the same removal scenario: baseline, unlearned, and oracle. In representative K=3 forget-set settings, the unlearned model recovers 89.3% and 92.5% of the oracle gain on EngageNet and DAiSEE, respectively, at roughly one quarter of retraining cost. Across the tested small-audit regimes, effectiveness is strongest at an intermediate forget-set size, indicating that approximate subject-level unlearning is a useful low-cost correction mechanism, but one whose benefit depends on subject selection quality and removal regime.

[CV-44] FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

【速读】:该论文旨在解决身份保持型文本到视频生成(Identity-preserving text-to-video generation, IPT2V)在复杂动态场景中面临的身份失真问题,尤其是在大角度面部姿态变化或面部遮挡条件下,现有方法常出现显著的身份偏差。解决方案的关键在于提出一个名为FaithfulFaces的姿势忠实的面部身份保持学习框架,其核心是一个姿态共享的身份对齐器(pose-shared identity aligner),该对齐器通过姿态共享字典和姿态变化-身份不变性约束,在不同视角下精炼并对齐面部姿态;同时,通过显式欧拉角嵌入将单视角输入映射为全局面部姿态表示,从而提供一种姿势忠实的面部先验,引导生成模型实现鲁棒的身份保持生成。

链接: https://arxiv.org/abs/2605.04702
作者: Yuanzhi Wang,Xuhua Ren,Jiaxiang Cheng,Bing Ma,Kai Yu,Sen Liang,Wenyue Li,Tianxiang Zheng,Qinglin Lu,Zhen Cui
机构: Nanyang Technological University (南洋理工大学); Tencent Hunyuan (腾讯混元); University of Science and Technology of China (中国科学技术大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Identity-preserving text-to-video generation (IPT2V) empowers users to produce diverse and imaginative videos with consistent human facial identity. Despite recent progress, existing methods often suffer from significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework to improve IPT2V in complex dynamic scenes. The key of FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides generative foundations toward robust identity-preserving generation. In particular, we develop a specialized pipeline to curate a high-quality video dataset featuring substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even as pose changes and occlusions occur.

[CV-45] HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction

【速读】:该论文旨在解决空间转录组学(spatial transcriptomics)在实际应用中因成本高和通量低而难以大规模推广的问题,以及现有计算模型在从苏木精-伊红染色(hematoxylin and eosin-stained, H&E)组织切片推断空间基因表达时存在的两大局限:一是多数模型假设笛卡尔坐标或几何无关的局部性,忽略了广泛使用的点阵平台(spot-array platforms)所采用的六边形采样结构;二是基于点对点回归的目标函数常导致基因表达谱过度平滑,掩盖了基因特异性的空间异质性。解决方案的关键在于提出HEXST——一种几何对齐的Transformer架构,其核心创新包括:1)直接在六边形点坐标上操作,通过定制化的移位窗口注意力机制(shifted-window attention)和六边形旋转位置编码(hexagonal rotary positional encoding),实现高效局部到全局的上下文建模;2)引入对比敏感的差分目标函数(contrast-sensitive differential objective)与预训练单细胞基础模型提供的转录组先验(transcriptomic priors),增强基因层面的空间对比度,从而在保留空间异质性的同时提升预测准确性。

链接: https://arxiv.org/abs/2605.04682
作者: Keunho Byeon,Jin Tae Kwak
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial transcriptomics offers spatially resolved gene expression profiling within tissue sections, but its cost and limited throughput hinder large-scale deployment. To extend this capability to routine practice, recent computational methods aim to infer spatial gene expression directly from ubiquitous hematoxylin and eosin-stained histology slides. However, most existing models assume Cartesian or geometry-agnostic locality, despite the hexagonal sampling of widely used spot-array platforms, and point-wise regression objectives often yield over-smoothed gene expression profiles, obscuring gene-specific spatial heterogeneity. To address these, we propose HEXST, a geometry-aligned Transformer for spatial gene expression prediction from histology. HEXST operates directly on hexagonal spot coordinates to enable efficient local-to-global contextual modeling via tailored shifted-window attention mechanism and hexagonal rotary positional encoding. To enhance gene-wise spatial contrast, HEXST complements point-wise regression with a contrast-sensitive differential objective and transcriptomic priors from a pretrained single-cell foundation model during training. Across seven spatial transcriptomics datasets, HEXST consistently outperforms state-of-the-art models, providing accurate and robust spatial gene expression predictions while preserving gene-wise contrast and spatial heterogeneity.

[CV-46] Multi-Level Bidirectional Biomimetic Learning for EEG-Based Visual Decoding

【速读】:该论文旨在解决基于脑电图(EEG)的视觉神经解码中跨模态对齐难题,核心挑战在于配对数据稀缺以及高保真数字图像与生物视觉感知之间的根本性不匹配——后者受视网膜拓扑映射(retinotopic mapping)和个体特异性神经解剖结构的影响。解决方案的关键在于提出MB2L框架,其通过引入结构化的生理诱导偏置(physiological inductive biases)来增强表征学习:一是设计带有视觉先验的自适应模糊模块(Adaptive Blur with Visual Priors),根据视网膜拓扑先验重新加权视觉输入以缓解感知-结构失配;二是提出仿生视觉特征提取模块(Biomimetic Visual Feature Extraction),学习与皮层层级处理一致的多级视觉表示,提升跨被试一致性编码能力;最终通过多级双向对比学习(Multi-level Bidirectional Contrastive Learning)联合优化上述模块,在共享语义空间中实现EEG与视觉特征的双向对齐,从而在零样本EEG到图像检索任务中达到80.5% Top-1和97.6% Top-5准确率,显著优于现有方法并展现出强泛化能力。

链接: https://arxiv.org/abs/2605.04680
作者: Jingtao Liu,Peiliang Gong,Chuhang Zheng,Yiheng Liu,Qi Zhu
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 13 figures, 15 tables

点击查看摘要

Abstract:EEG-based visual neural decoding aims to align neural responses with visual stimuli for tasks such as image retrieval. However, limited paired data and a fundamental mismatch between high-fidelity digital images and biological visual perception - distorted by retinotopic mapping and subject-specific neuroanatomy - severely impede cross-modal alignment. To address this, we propose MB2L, a Multi-Level Bidirectional Biomimetic Learning framework that incorporates structured physiological inductive biases into representation learning. Specifically, we propose Adaptive Blur with Visual Priors to mitigate perceptual-structural mismatch by reweighting visual inputs according to retinotopic priors. We further propose Biomimetic Visual Feature Extraction to learn multi-level visual representations consistent with hierarchical cortical processing, enhancing subject-invariant encoding. These modules are jointly optimized via Multi-level Bidirectional Contrastive Learning, which aligns EEG and visual features in a shared semantic space through bidirectional contrastive objectives. Experiments show MB2L achieves 80.5% Top-1 and 97.6% Top-5 accuracy on zero-shot EEG-to-image retrieval, significantly outperforming prior methods and demonstrating strong generalization across subjects and experimental settings.
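MB2L 的多级双向对比学习通过双向对比目标在共享语义空间中对齐 EEG 与视觉特征。以下为对称(双向)InfoNCE 损失的通用示意实现(温度 tau 等取值为示例假设,非论文原代码):

```python
import numpy as np

def bidirectional_infonce(eeg, img, tau=0.07):
    """对称 InfoNCE:L2 归一化后在两个方向上做带温度的交叉熵对比。"""
    eeg = eeg / np.linalg.norm(eeg, axis=1, keepdims=True)
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    logits = eeg @ img.T / tau           # (N, N) 相似度矩阵
    labels = np.arange(len(eeg))         # 对角线为匹配的正样本对

    def ce(l):  # 逐行交叉熵,目标为对角元素
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # EEG→图像 与 图像→EEG 两个方向
```

完美对齐的嵌入对(相似度矩阵对角占优)损失接近 0,错位配对损失显著升高。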

[CV-47] From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在异构数据集上进行一致建模时,因潜在动作(latent actions)监督策略分散且缺乏系统比较而导致的性能瓶颈问题。其解决方案的关键在于从两个维度对潜在动作监督机制进行结构化分析:一是通过基于图像的潜在动作对轨迹进行正则化,二是通过基于动作的潜在动作统一目标空间。研究在统一的VLA基线框架下对比了四种代表性整合策略,发现图像驱动的潜在动作更利于长程推理与场景级泛化,而动作驱动的潜在动作则在复杂运动协调任务中表现更优;此外,直接以离散潜在动作标记监督视觉语言模型(VLM)能获得最佳效果,为多源数据混合训练下的VLA模型提供了有效路径。

链接: https://arxiv.org/abs/2605.04678
作者: Yihan Lin,Haoyang Li,Yang Li,Haitao Shen,Yihan Zhao,Chao Shao,Jing Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data, suggesting a promising direction for VLA training. Code is available at this https URL.

[CV-48] Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern CVPR2026

【速读】:该论文旨在解决可见光-热成像(RGB-T)目标检测系统在物理世界中的安全性问题,尤其是针对对抗性服装攻击的防御薄弱性。现有方法多依赖于重叠的RGB-T对抗模式(ORP),易导致光照减弱从而降低攻击效果;而本文提出一种非重叠RGB-T模式(NORP),通过分离可见光与热成像材料设计,避免了光强衰减问题,显著提升攻击鲁棒性。其关键创新在于构建3D RGB-T人体与对抗服装模型以实现全视角(0°–360°)攻击模拟,并引入空间离散-连续优化(SDCO)方法对NORP进行精细化优化,最终在多种融合架构的RGB-T检测器上均实现高成功率的数字与物理攻击,同时提出融合阶段集成策略增强跨模型攻击迁移能力。

链接: https://arxiv.org/abs/2605.04675
作者: Xiaopei Zhu,Guanning Zeng,Zhanhao Hu,Jun Zhu,Xiaolin Hu
机构: Tsinghua University (清华大学); University of California Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Visible-thermal (RGB-T) object detection is a crucial technology for applications such as autonomous driving, where multimodal fusion enhances performance in challenging conditions like low light. However, the security of RGB-T detectors, particularly in the physical world, has been largely overlooked. This paper proposes a novel approach to RGB-T physical attacks using adversarial clothing with a non-overlapping RGB-T pattern (NORP). To simulate full-view (0°–360°) RGB-T attacks, we construct 3D RGB-T models for human and adversarial clothing. NORP is a new adversarial pattern design using distinct visible and thermal materials without overlap, avoiding the light reduction in overlapping RGB-T patterns (ORP). To optimize the NORP on adversarial clothing, we propose a spatial discrete-continuous optimization (SDCO) method. We systematically evaluated our method on RGB-T detectors with different fusion architectures, demonstrating high attack success rates both in the digital and physical worlds. Additionally, we introduce a fusion-stage ensemble method that enhances the transferability of adversarial attacks across unseen RGB-T detectors with different fusion architectures.

[CV-49] Contact Matrix: Enhancing Dance Motion Synthesis with Precise Interaction Modeling

【速读】:该论文旨在解决生成真实互动动作(reactive motion)的难题,尤其在双人舞蹈场景中,由于交互约束严格、可行解空间有限且高质量数据稀缺,传统方法难以捕捉复杂的肢体交互细节与节奏同步性。解决方案的关键在于提出一种两阶段框架:第一阶段采用分体编码器-联合解码器的运动VQ-VAE结构,通过专用代码本提升表示能力,并在解码过程中动态建模身体部位间的依赖关系,避免生成动作不一致;第二阶段设计了一种接触感知扩散模型(contact-aware diffusion model),联合生成动作序列与人物间的接触矩阵(contact matrix),显式建模交互关系,在采样过程中提供精确的交互约束引导,从而显著提升动作的真实感和节奏一致性。

链接: https://arxiv.org/abs/2605.04662
作者: Xuhai Chen,Zhi Cen,Huaijin Pi,Sida Peng,Xiaowei Zhou,Yong Liu
机构: Zhejiang University (浙江大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating realistic reactive motions, in which one person reacts to the fixed motions of others, is challenging due to strict interaction constraints and a limited feasible solution space. This paper focuses on a typical scenario: duet dance, where high-quality data is scarce, motion patterns are complex, and the details of human interactions are both intricate and abundant. To tackle these challenges, we propose a novel two-stage framework. In the first stage, we introduce a motion VQ-VAE with separate body-part encoders and a joint decoder, enabling specialized codebooks to enhance representation capacity while dynamically modeling dependencies across body parts during decoding, thereby preventing inconsistencies in the generated motions. In the second stage, we propose a contact-aware diffusion model for reactive motion generation that jointly generates motion and a contact matrix between individuals, enabling explicit interaction modeling and providing guidance toward more precise and constrained interaction dynamics during sampling. Experiments show that our method outperforms Duolando with lower FID_k (8.89 vs. 25.30) and FID_cd (8.01 vs. 9.97), as well as a higher BED (0.4606 vs. 0.2858), indicating improved interaction fidelity and rhythmic synchronization.
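第一阶段的运动 VQ-VAE 为各身体部位维护专用码本。其中最基本的"最近邻码字量化"操作可示意如下(码本与输入数值均为随意示例,非论文数据或原实现):

```python
import numpy as np

def quantize(z, codebook):
    """将连续特征 z (N, D) 量化到码本 (K, D) 中欧氏距离最近的码字。"""
    z = np.asarray(z, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = d.argmin(axis=1)        # 每个特征对应的码字索引
    return codebook[idx], idx     # 量化后的特征与索引

cb = np.array([[0.0, 0.0], [1.0, 1.0]])
q, idx = quantize([[0.1, 0.2], [0.9, 0.8]], cb)
print(idx)  # [0 1]
```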

[CV-50] CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在下游任务中常出现的物体幻觉(object hallucination)问题,即模型生成的内容与输入图像信息不一致。解决方案的关键在于提出一种无需训练、可即插即用的视觉注意力引导方法——Caption-guided Visual Attention Steering (CAST),其核心思想是利用模型在回答图像描述(caption)查询时对视觉信息更强的注意力激活模式,通过探测技术识别对caption查询敏感的注意力头,并计算最优的引导方向以增强模型细粒度的视觉感知能力,从而有效抑制幻觉现象,同时保持较低的推理开销和原有基础能力。

链接: https://arxiv.org/abs/2605.04641
作者: Qiming Li,Zekai Ye,Xiaocheng Feng,Weihong Zhong,Libo Qin,Ruihan Chen,Lei Huang,Baohang Li,Kui Jiang,Yaowei Wang,Ting Liu,Bing Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce contents that deviate from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or decoding strategies which significantly increase inference time. In this work, we observe that LVLMs’ attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs’ visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLM’s fine-grained visual perception capabilities, thereby effectively mitigating object hallucination. CAST reduced object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks including both discriminative and generative tasks, demonstrating state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.
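CAST 的核心操作是在推理时对选中注意力头的输出加上一个预先估计的引导方向,即 h' = h + α·v。以下为该思路的最小示意(头索引、方向向量与 α 取值均为假设示例,非论文原实现):

```python
import numpy as np

def steer_heads(head_outputs, steer_dirs, alpha=1.0):
    """对选中的注意力头输出加上引导向量:h' = h + alpha * v,其余头保持不变。"""
    out = {h: v.copy() for h, v in head_outputs.items()}
    for h, v in steer_dirs.items():
        out[h] = out[h] + alpha * np.asarray(v, dtype=float)
    return out

outs = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
steered = steer_heads(outs, {1: [0.5, 0.5]}, alpha=2.0)
print(steered[1])  # [1. 2.]
```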

[CV-51] UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

【速读】:该论文旨在解决印刷电路板(Printed Circuit Board, PCB)缺陷检测中面临的两大挑战:一是缺陷样本稀缺且分布不均,限制了模型训练效果;二是复杂电路背景下的特征表示能力不足。为此,作者提出了一种生成辅助的PCB缺陷检测框架(UniPCB),其核心创新在于将可控缺陷合成与任务特定的缺陷检测相结合。关键解决方案包括:在生成端设计多模态条件生成器(Multi-modal Condition Generator),并行提取边缘、深度和文本条件,通过ScaleEncoder将其嵌入扩散U-Net的四个尺度,并采用FiLM风格的空间自适应调制实现结构对齐的缺陷样本合成;在检测端引入倒置残差移位注意力机制(Inverted Residual Shift Attention)以联合捕获全局上下文与局部纹理信息,并利用跨层互补融合模块(Cross-level Complementary Fusion Block)生成像素级门控信号进行选择性特征融合。该框架实现了生成与检测性能的协同提升,实验表明其在DsPCBSD+数据集上达到mAP@0.5为98.0%、mAP@0.5:0.95为61.8%,显著优于现有方法。

链接: https://arxiv.org/abs/2605.04635
作者: Huan Zhang,Lianghong Tan,Yichu Xu,Jiangzhong Cao,Huanqi Wu,Linwei Zhu,Xu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Printed Circuit Board (PCB) defect inspection faces two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
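摘要提到的 FiLM 风格空间自适应调制,本质上是用条件分支预测的逐像素仿射参数对特征做 γ⊙x+β 变换。下面给出一个最小示意(假设性实现,形状与命名均为演示而设,并非 UniPCB 的原始代码):

```python
import numpy as np

def film_modulate(features, gamma, beta):
    """FiLM 式空间自适应调制(示意): y = gamma ⊙ x + beta。

    features: (C, H, W) 某一尺度的 U-Net 特征
    gamma, beta: (C, H, W) 由条件分支(如边缘/深度/文本)预测的逐像素调制参数
    """
    return gamma * features + beta

x = np.ones((4, 8, 8))
gamma = np.full((4, 8, 8), 2.0)
beta = np.full((4, 8, 8), -1.0)
y = film_modulate(x, gamma, beta)   # 2·1 - 1 = 1
```

与经典 FiLM 的区别仅在于 γ、β 是逐像素(空间自适应)而非逐通道标量。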

[CV-52] Advancing Aesthetic Image Generation via Composition Transfer

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在图像生成过程中对构图(composition)建模不足的问题,即现有方法通常将构图与语义内容耦合,缺乏对构图本身的显式控制与迁移能力。其解决方案的关键在于提出 Composer 框架,该框架基于美学理论设计,支持两种模式:一是通过提取参考图像中的关键构图感知表示并结合定制的条件引导模块,在预训练扩散模型基础上实现构图迁移;二是利用大视觉语言模型(Large Vision-Language Models, LVLMs)的上下文学习能力,在仅提供文本主题时实现主题驱动的构图检索与规划。此外,作者还通过文本到构图的微调进一步增强了无参考场景下的隐式构图规划能力,从而实现了对图像构图的显式控制与个性化调整。

链接: https://arxiv.org/abs/2605.04609
作者: Kai Zou,Zhiwei Zhao,Bin Liu,Nenghai Yu
机构: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.

[CV-53] Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness

【速读】:该论文旨在解决传统单次检测方法(one-shot detection)在开放集目标检测中面临的标注成本高以及无监督方法无法实现类别感知分类的问题。其解决方案的关键在于提出了一种基于参考图像的类别发现机制(Reference-based Category Discovery, RefCD),通过设计特征相似性损失函数,利用预测目标与未标注参考图像之间的特征相似性来显式引导潜在类别特定特征的学习,从而在无需任何人工标注标签的情况下实现类别感知的目标检测;同时,该框架还支持无参考图像的类别无关检测,形成统一的检测范式。

链接: https://arxiv.org/abs/2605.04606
作者: Yichen Li,Qiankun Liu,Ying Fu
机构: Beijing Institute of Technology (北京理工大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 12 figures

点击查看摘要

Abstract:Traditional one-shot detection methods have addressed the closed-set problem in object detection, but the high cost of data annotation remains a critical challenge. General unsupervised methods generate pseudo boxes without category labels, thus failing to achieve category-aware classification. To overcome these limitations, we propose Reference-based Category Discovery (RefCD), an unsupervised detector that enables category-aware detection without any manually annotated labels. It leverages feature similarity between predicted objects and unlabeled reference images. Unlike previous unsupervised methods that lack category guidance and one-shot methods which require labeled data, RefCD introduces a carefully designed feature similarity loss to explicitly guide the learning of potential category-specific features. Additionally, RefCD supports category-agnostic detection without reference images, serving as a unified framework. Comprehensive quantitative and qualitative analysis of category-aware and category-agnostic detection results demonstrates its effectiveness, and RefCD can learn category information in an unsupervised paradigm even without category labels.

[CV-54] DiCLIP: Diffusion Model Enhances CLIPs Dense Knowledge for Weakly Supervised Semantic Segmentation

【速读】:该论文旨在解决弱监督语义分割(Weakly Supervised Semantic Segmentation, WSSS)中因仅利用对比语言-图像预训练模型(CLIP)的视觉-语言配对特性而忽视其跨模态密集知识有限性所导致的类激活图(Class Activation Maps, CAMs)生成质量不佳的问题。解决方案的关键在于提出DiCLIP框架,通过引入生成式扩散模型(diffusion model)增强CLIP在视觉和文本模态上的密集知识表达:一方面设计视觉相关性增强(Visual Correlation Enhancement, VCE)模块,利用扩散模型的空间一致性缓解CLIP自注意力机制中的过平滑问题,并通过注意力聚类精化(Attention Clustering Refinement, ACR)模块提取多样化的相关性图作为判别性偏置;另一方面设计文本语义增强(Text Semantic Augmentation, TSA)模块,借助扩散模型的生成能力构建动态键值缓存机制,将CAM生成从patch-text匹配机制转变为视觉知识检索范式,从而显著提升分割精度并降低训练成本。

链接: https://arxiv.org/abs/2605.04593
作者: Zhiwei Yang,Pengfei Song,Yucong Meng,Kexue Fu,Shuo Wang,Zhijian Song
机构: Fudan University (复旦大学); Shandong Computer Science Center (山东省计算中心); Shanghai QiYuan Innovation Foundation (上海启源创新基金)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP’s vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP’s dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion’s reliable spatial consistency to mitigate the over-smoothing issue in CLIP’s attention. It designs the Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP’s self-attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module argues that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion’s generative power to maintain a dynamic key-value cache model, shifting CAM generation from a patch-text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state-of-the-art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at this https URL.
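摘要所述“patch-text 匹配”生成 CAM 的基本机制,可用余弦相似度草图说明(非 DiCLIP 实现,仅展示 CLIP 式 CAM 的通用做法;函数命名为本文假设):

```python
import numpy as np

def patch_text_cam(patch_feats, text_emb):
    """patch-text 匹配生成 CAM 的最简示意: 逐 patch 余弦相似度。

    patch_feats: (H*W, D) 的视觉 patch 特征
    text_emb: (D,) 的类别文本嵌入
    返回: (H*W,) 的激活分数, 经 min-max 归一化到 [0, 1]
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = p @ t
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)

rng = np.random.default_rng(1)
feats = rng.normal(size=(49, 32))   # 7x7 个 patch, 32 维(假设的玩具尺寸)
text = rng.normal(size=32)
cam = patch_text_cam(feats, text)
```

DiCLIP 的贡献正是在这一基线机制之上,用扩散特征改善视觉相关性并把匹配改造为视觉知识检索。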

[CV-55] From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation ICMR2026

【速读】:该论文旨在解决基于扩散模型(diffusion models)进行文本驱动图像分割(text-based image segmentation)时,因生成式特性导致的判别性分割性能下降的问题。传统方法利用扩散模型提取多模态语义特征虽具潜力,但其固有的噪声-去噪过程与时间步优化需求限制了分割精度,尤其在零样本(zero-shot)场景下表现不佳。解决方案的关键在于提出RLFSeg框架,通过引入修正流(Rectified Flow)在潜在空间中直接学习从图像到分割掩码的映射关系,从而规避扩散模型的生成机制,无需修改预训练模型结构即可实现高效、高精度的分割;同时结合标签精炼(label refinement)与自适应单步采样策略(Adaptive One-Step Sampling),显著提升单次推理下的准确率,展现出强大的零样本迁移能力与应用前景。

链接: https://arxiv.org/abs/2605.04590
作者: Zishen Qu,Xuesong Li,Haijian Gu,Hongwei Kang,Quan Meng,Tianrui Niu,Xin Yang,Ruidong Pan
机构: Zhejiang University (浙江大学); bytedance (字节跳动); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ICMR 2026

点击查看摘要

Abstract:Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope compared to traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, leading to studies of using diffusion models as feature extractors for segmentation tasks. Such methods, however, inherit the generative nature of diffusion models, which is harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn a direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially in zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even with a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to the model structure, revealing promising application potential and significant research value.
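Rectified Flow 的核心是源与目标之间的直线插值及恒定速度场,这也是摘要中“单步采样”可行的原因。下面是一个与具体模型无关的最小示意(假设性代码,非 RLFSeg 实现;这里用理想速度场代替学到的网络):

```python
import numpy as np

def rf_interpolate(x0, x1, t):
    """Rectified Flow 的直线插值: x_t = (1-t)·x0 + t·x1, 目标速度 v = x1 - x0。"""
    return (1.0 - t) * x0 + t * x1

def one_step_sample(x0, velocity_fn):
    """单步 Euler 采样: 沿 t=0 处预测的速度从起点一步走到终点(示意)。"""
    return x0 + velocity_fn(x0, 0.0)

x0 = np.zeros(8)                 # 起点(如图像潜变量)
x1 = np.arange(8.0)              # 终点(如分割掩码潜变量)
v_true = lambda x, t: x1 - x0    # 理想的恒定速度场
xt = rf_interpolate(x0, x1, 0.5)
x1_hat = one_step_sample(x0, v_true)
```

当速度场接近直线轨迹的真值时,单步 Euler 即可精确命中终点,这解释了为何该类方法能摆脱多步去噪。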

[CV-56] GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution

【速读】:该论文旨在解决光场(Light Field, LF)图像超分辨率(Super-Resolution, SR)中对角向视差几何信息利用不足的问题。现有基于Transformer的方法主要关注水平和垂直方向的Epipolar Plane Images (EPIs),而忽略了45°与135°方向上的EPI结构,导致未能充分挖掘LF数据中的多方向几何先验。解决方案的关键在于提出GTF(Omnidirectional EPI Transformer),其创新性地在统一重建框架内显式建模四种方向的EPI(水平、垂直、45°、135°),并融合MacPI-based先验注入、自适应方向融合机制以及拓扑保持的前馈网络,从而更全面地利用光场几何特性。实验表明,该方法在多个标准基准上显著提升性能,且轻量化版本GTF-Tiny在计算效率约束下仍保持优异表现。

链接: https://arxiv.org/abs/2605.04581
作者: Kunyu Li,Fei Wang,Lichao Zhang,Junjie Liu,Bihong Li
机构: Xi’an Shiyou University (西安石油大学); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NTIRE 2026. 9 pages, 2 figures

点击查看摘要

Abstract:Light field (LF) image super-resolution benefits from Epipolar Plane Images (EPIs), whose line slopes explicitly encode disparity. However, existing Transformer-based LF SR methods mainly attend to horizontal and vertical EPIs, leaving diagonal epipolar geometry underexplored. We present GTF, an omnidirectional EPI Transformer that explicitly models horizontal, vertical, 45-degree, and 135-degree EPIs within a unified reconstruction framework. GTF combines directional EPI processing, MacPI-based prior injection, adaptive directional fusion, and a topology-preserving feed-forward network to better exploit LF geometry. For the NTIRE 2026 fidelity tracks, we use GTF as the main model, while a lightweight GTF-Tiny variant targets the efficiency track. On five standard LF SR benchmarks covering both real-captured and synthetic scenes, GTF reaches 32.78 dB without inference-time enhancement, and stronger inference settings with EPSW and test-time augmentation further improve performance. Under the NTIRE 2026 efficiency constraint, GTF-Tiny attains 32.57 dB with only 0.915M parameters and 19.81 GFLOPs. In the NTIRE 2026 Light Field Image Super-Resolution Challenge, our submissions rank 3rd on Track 1 and Track 3 and 4th on Track 2. Architecture-evolution, channel-width, and inference analyses further support the effectiveness of diagonal EPI modeling, directional fusion, and the lightweight design.
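摘要中四个方向 EPI 的抽取,可以用对 4D 光场数组的切片来直观说明(示意代码,坐标约定与命名为本文假设;对角方向此处沿角度维 u=v 取视角、空间维取对应对角线像素,未必与 GTF 的具体实现一致):

```python
import numpy as np

def extract_epis(lf):
    """从 4D 光场 lf[u, v, h, w] 中抽取四个方向的 EPI(示意)。

    角坐标与空间坐标各取中心行/列; 要求角分辨率 U==V、空间分辨率 H==W。
    """
    U, V, H, W = lf.shape
    u0, v0, h0, w0 = U // 2, V // 2, H // 2, W // 2
    epi_h = lf[u0, :, h0, :]                        # (V, W): 水平 EPI
    epi_v = lf[:, v0, :, w0]                        # (U, H): 垂直 EPI
    diag_views = lf[np.arange(U), np.arange(V)]     # 沿 u=v 的视角, (U, H, W)
    epi_d45 = np.stack([np.diagonal(diag_views[i]) for i in range(U)])
    anti = lf[np.arange(U), V - 1 - np.arange(V)]   # 沿 u=-v 的视角
    epi_d135 = np.stack([np.diagonal(np.fliplr(anti[i])) for i in range(U)])
    return epi_h, epi_v, epi_d45, epi_d135

lf = np.random.default_rng(2).normal(size=(5, 5, 9, 9))  # 5x5 视角, 9x9 分辨率
epi_h, epi_v, epi_d45, epi_d135 = extract_epis(lf)
```

EPI 中直线的斜率编码视差,这正是摘要中“显式利用对角极线几何”的出发点。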

[CV-57] VL-UniTrack: A Unified Framework with Visual-Language Prompts for UAV-Ground Visual Tracking

【速读】:该论文旨在解决无人机-地面视觉跟踪(UAV-ground visual tracking, UGVT)中现有两流方法因特征提取孤立和依赖隐式外观匹配而导致的跨视角对应不可靠问题。其解决方案的关键在于提出VL-UniTrack框架,通过共享编码器打破特征隔离,实现跨视角充分交互;并设计视觉-语言几何提示模块,融合语言描述与视觉特征生成可学习提示,进而引导提示驱动的跨视角适配模块进行精准特征对齐,同时引入置信度调制的互蒸馏损失以抑制噪声传播,从而提升跟踪鲁棒性与准确性。

链接: https://arxiv.org/abs/2605.04574
作者: Boyue Xu,Ruichao Hou,Tongwei Ren,Gangshan Wu
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:UAV-ground visual tracking (UGVT) aims to simultaneously track the same object from both the UAV and the ground view. However, existing two-stream methods suffer from isolated feature extraction and rely heavily on implicit appearance matching, which struggles to establish reliable correspondence under drastic view differences, leading to tracking unreliability. To address these limitations, we propose VL-UniTrack, a fully unified framework enhanced by visual-language prompts. By encoding features from both views within a single shared encoder, our method breaks the barrier of feature isolation to facilitate sufficient cross-view interaction. To overcome the ambiguity caused by relying solely on appearance matching, we design a visual-language geometric prompting module, which fuses language descriptions with visual features to generate learnable prompts. These prompts are then fed into our prompt-guided cross-view adapter module to enable sufficient cross-view feature interaction and to guide the learning of view-specific feature representations. Furthermore, a confidence-modulated mutual distillation loss is proposed to regularize the training by mitigating noise propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the latest benchmark. The code can be downloaded at this https URL

[CV-58] Lightning Unified Video Editing via In-Context Sparse Attention ICML2026

【速读】:该论文旨在解决基于上下文学习(In-Context Learning, ICL)的视频编辑中因注意力机制导致的二次方计算开销问题,即高延迟瓶颈。其解决方案的关键在于提出一种名为“上下文稀疏注意力”(In-context Sparse Attention, ISA)的新架构:首先,基于上下文标记显著性低于源标记的观察,设计高效预筛选策略以剪枝冗余上下文;其次,理论证明并实证验证查询锐度(Query sharpness)与近似误差的相关性,进而引入动态查询分组机制——将高误差查询路由至完整注意力计算,而低误差查询则采用零阶泰勒展开稀疏注意力(0-th order Taylor sparse attention),实现计算效率与视觉保真度的平衡。

链接: https://arxiv.org/abs/2605.04569
作者: Shitong Shao,Zikai Zhou,Haopeng Li,Yingwei Song,Wenliang Zhong,Lichen Bai,Zeke Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that Query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor, a novel lightning video editing model via ISA and a proposed video-editing data pipeline that curated a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.

[CV-59] SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression

【速读】:该论文旨在解决现有感知图像压缩方法在低比特率下难以平衡视觉质量与模型复杂度的问题。当前主流方法多依赖生成对抗网络(GAN)或扩散模型,虽能提升感知质量但计算开销大。其核心解决方案是引入状态空间模型(State Space Model, SSM)中的Mamba架构,并设计两个关键模块:一是语义感知的Mamba块(Semantic-aware Mamba Block, SAMB),通过动态聚类语义特征引导扫描顺序,缓解传统Mamba因固定扫描顺序导致的语义连续性破坏和长程信息衰减问题;二是基于奇异值分解(Singular Value Decomposition, SVD)启发的冗余减少模块(SVD-inspired Redundancy Reduction Module, SVD-RRM),利用可学习软阈值对潜在特征进行低秩近似,实现通道维度上的冗余信息抑制。该方案在保持线性计算复杂度的同时显著提升了感知质量与压缩效率。

链接: https://arxiv.org/abs/2605.04560
作者: Jiaqian Zhang,Hao Wei,Chenyang Ge,Yanhui Zhou
机构: Xi’an Jiaotong University(西安交通大学); State Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室); Institute of Artificial Intelligence and Robotics(人工智能与机器人研究所); School of Information and Communication Engineering(信息与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Perceptual image compression focuses on preserving high visual quality under low-bitrate constraints. Most existing approaches to perceptual compression leverage the strong generative capabilities of generative adversarial networks or diffusion models, at the cost of substantial model complexity. To this end, we present an efficient perceptual image compression method that exploits the long-range modeling capability and linear computational complexity of state space models, with a particular focus on Mamba. Unlike existing methods that rely on an inherently fixed scanning order and consequently impair semantic continuity and spatial correlation, we develop a semantic-aware Mamba block (SAMB) to enable scanning guided by dynamically clustered semantic features, thereby alleviating the strict causality constraints and long-range information decay inherent to Mamba. Inspired by singular value decomposition, we design an SVD-inspired redundancy reduction module (SVD-RRM) that performs a low-rank approximation on the latent features by introducing a learnable soft threshold, leading to channel-wise redundancy information reduction. The proposed SAMB is integrated into both the encoder and decoder of the compression framework, whereas the SVD-RRM is incorporated only in the encoder. Extensive experiments demonstrate that our method performs favorably against state-of-the-art approaches in terms of rate-distortion-perception tradeoff and model complexity. The source code and pretrained models will be available at this https URL.
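摘要中 SVD-RRM 的“可学习软阈值低秩近似”,其数值骨架是对奇异值做收缩 s → max(s - τ, 0) 后重构。下面用固定阈值给出示意(非原文实现,τ 在论文中是可学习参数,此处取标量仅作说明):

```python
import numpy as np

def svd_soft_threshold(x, tau):
    """SVD 软阈值低秩近似(示意): 奇异值收缩后重构, 实现通道维冗余抑制。

    x: (C, N) 将潜在特征按通道展平得到的矩阵
    tau: 软阈值(论文中可学习, 此处固定)
    """
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (u * s_shrunk) @ vt, s_shrunk

rng = np.random.default_rng(3)
# 构造一个近似低秩的特征矩阵: 秩 2 信号 + 小噪声
signal = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 64))
x = signal + 0.01 * rng.normal(size=(16, 64))
x_lr, s_shrunk = svd_soft_threshold(x, tau=1.0)
```

小于阈值的(噪声)奇异值被置零,重构结果即保留主成分的低秩近似,对应摘要所说的通道冗余削减。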

[CV-60] Efficient Geometry-Controlled High-Resolution Satellite Image Synthesis

【速读】:该论文旨在解决高分辨率卫星图像在偏远地区或稀有事件中获取困难的问题,这限制了机器学习模型在土地覆盖分类、变化检测和灾害监测等任务中的训练与测试。其解决方案的关键在于通过引入几何控制机制来改进现有的预训练扩散模型,具体方法是利用窗口化的交叉注意力模块(windowed cross-attention modules)仅基于跳接连接特征(skip connection features)实现对合成过程的有效控制,从而在保持高效性的同时提升生成图像与几何控制图之间的对齐精度。

链接: https://arxiv.org/abs/2605.04557
作者: Vlad Vasilescu,Daniela Faur,Teodor Costachioiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)

点击查看摘要

Abstract:High-resolution satellite images are often scarce and costly, especially for remote areas or infrequent events. This shortage hampers the development and testing of machine learning models for land-cover classification, change detection, and disaster monitoring. In this paper, we tackle the problem of geometry-controlled high-resolution satellite image synthesis by adding control over existing pre-trained diffusion models. We propose a simple yet efficient method for controlling the synthesis process by leveraging only skip connection features using windowed cross-attention modules. Several previously established control techniques are compared, indicating that our method achieves comparable performance while leading to a better alignment with the geometry control map. We also discuss the limitations in current evaluation approaches, amplifying the necessity of a consistent alignment assessment.

[CV-61] InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

【速读】:该论文旨在解决现有端到端多人人体网格恢复(multi-person human mesh recovery)方法在建模人类与环境交互时存在的不足,即当前基于DETR框架的方法仅通过自注意力机制隐式捕捉人与人之间的关系,缺乏对人类如何与物体及彼此之间进行显式交互的推理能力。解决方案的关键在于提出InterMesh框架,其核心创新是引入一个轻量级的人-物交互检测模块,将结构化的交互语义信息注入查询表示中,并设计了Contextual Interaction Encoder和Interaction-Guided Refiner两个模块,以最小计算开销将交互信息融入现有HMR(Human Mesh Recovery)架构,从而显著提升姿态和形状估计的准确性。

链接: https://arxiv.org/abs/2605.04554
作者: Kaili Zheng,Kaiwen Wang,Xun Zhu,Chenyi Guo,Ji Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions.

[CV-62] Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

【速读】:该论文旨在解决图像到点云配准(Image-to-point-cloud registration, I2P)中初始匹配对内点比例(Inlier Ratio, IR)较低时,传统PnP方法难以获得准确配准结果的问题。其解决方案的关键在于提出Angle-I2P网络,通过引入基于角度一致性的尺度不变跨模态几何约束来显式指导模型区分内点与外点,并结合全局到局部的层次化注意力机制,有效过滤刚性变换下几何不一致的匹配,从而显著提升IR和配准召回率(Registration Recall, RR)。

链接: https://arxiv.org/abs/2605.04541
作者: Muyao Peng,Shun Zou,Pei An,You Yang,Qiong Liu
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Point (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.
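摘要所述角度一致性约束依赖的基本事实是:由三点构成的夹角在统一缩放与刚性旋转下保持不变。下面的草图验证这一尺度不变性(示意代码,非 Angle-I2P 的完整跨模态约束实现):

```python
import numpy as np

def angle_at(p_i, p_j, p_k):
    """返回以 p_j 为顶点、由 (p_i, p_j, p_k) 构成的夹角(弧度)。"""
    a, b = p_i - p_j, p_k - p_j
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
theta = angle_at(pts[0], pts[1], pts[2])   # 直角: π/2

# 尺度不变性: 统一缩放 + 刚性旋转后夹角不变
s = 3.7
c, si = np.cos(0.8), np.sin(0.8)
R = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
pts2 = s * (pts @ R.T)
theta2 = angle_at(pts2[0], pts2[1], pts2[2])
```

正因夹角对尺度不敏感,才适合作为跨图像-点云模态、尺度未对齐情形下的一致性度量。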

[CV-63] Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection

【速读】:该论文旨在解决基于视觉语言模型(Vision-Language Models, VLMs)的开放词汇目标检测(Open-vocabulary Object Detection)在测试时分布偏移(test-time distribution shifts)下性能下降的问题,其核心原因是文本嵌入(text embeddings)与区域提议(region proposals)的视觉嵌入之间出现语义错位(semantic misalignment)。解决方案的关键在于提出一种无需训练的框架 Reward-Guided Semantic Evolution (RGSE),该方法将文本嵌入适配视为一个语义搜索过程:通过扰动文本嵌入生成候选变体,利用当前及历史高置信度视觉提议的余弦相似度作为奖励信号进行评估,并通过奖励加权平均融合得到优化后的文本嵌入,从而在不依赖反向传播的前提下实现文本与视觉语义的高效对齐。

链接: https://arxiv.org/abs/2605.04531
作者: Lihua Zhou,Mao Ye,Xiatian Zhu,Nianxin Li,Changyi Ma,Shuaifeng Li,Yitong Qin,Hongbin Liu,Jiebo Luo,Zhen Lei
机构: Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey, Guildford, UK; School of Artificial Intelligence, Jilin University, China; University of Rochester; School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and shifted visual embeddings of region proposals. While recent test-time adaptive methods for VLM-based object detection either rely on costly backpropagation or bypass semantic misalignment via external memory, none directly and efficiently align text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings as candidate variants, evaluates them via cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses them into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open source upon publication.
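摘要描述的“扰动-评估-奖励加权融合”单步流程可草绘如下(假设性实现:候选数、扰动幅度 σ 与 softmax 加权方式均为本文为演示而设,并非论文确认的细节):

```python
import numpy as np

def rgse_step(text_emb, proposals, n_candidates=16, sigma=0.05, seed=0):
    """RGSE 单步示意: 扰动文本嵌入 → 余弦相似度奖励 → 奖励加权融合。

    text_emb: (D,) 当前(已单位化的)文本嵌入
    proposals: (N, D) 高置信度区域提议的视觉嵌入
    奖励 = 候选嵌入与提议的平均余弦相似度
    """
    rng = np.random.default_rng(seed)
    cands = text_emb + sigma * rng.normal(size=(n_candidates, text_emb.size))
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    props = proposals / np.linalg.norm(proposals, axis=1, keepdims=True)
    rewards = (cands @ props.T).mean(axis=1)        # (n_candidates,)
    w = np.exp(rewards) / np.exp(rewards).sum()     # softmax 加权(假设)
    refined = (w[:, None] * cands).sum(axis=0)
    return refined / np.linalg.norm(refined), rewards

rng = np.random.default_rng(4)
target = rng.normal(size=64)
props = target + 0.1 * rng.normal(size=(8, 64))    # 视觉提议围绕目标分布
text = rng.normal(size=64)                         # 发生偏移的文本嵌入
refined, rewards = rgse_step(text / np.linalg.norm(text), props)
```

整个过程只有前向的相似度计算与加权平均,没有任何反向传播,这与摘要强调的 training-free 特性一致。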

[CV-64] Velox: Learning Representations of 4D Geometry and Appearance CVPR2026

【速读】:该论文旨在解决如何从无结构的动态点云中学习高效且描述性强的4D对象潜在表示问题,以同时准确捕捉物体的几何结构与外观特征,并提升下游任务的效率。其解决方案的关键在于提出Velox框架,通过一个编码器将时空颜色点云压缩为一组动态形状标记(dynamic shape tokens),并利用两个互补的解码器进行监督:一是4D表面解码器,用于建模随时间变化的表面分布以捕获几何信息;二是高斯解码器,将标记映射到3D高斯分布以学习外观特征,从而实现描述性强、压缩性好且输入要求低的4D表示。

链接: https://arxiv.org/abs/2605.04527
作者: Anagh Malik,Dorian Chan,Xiaoming Zhao,David B. Lindell,Oncel Tuzel,Jen-Hao Rick Chang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project page: this https URL

点击查看摘要

Abstract:We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of dynamic shape tokens. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance. To demonstrate the utility of our representation, we evaluate it across three downstream tasks – video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation – and observe strong performances in all settings.

[CV-65] High-Fidelity Single-Image Head Modeling with Industry-Grade Topology

【速读】:该论文旨在解决单图像三维人脸重建中长期存在的难题:如何在保持面部身份一致性的同时生成符合工业标准拓扑结构的头部网格。其核心解决方案是提出一种从粗到精的优化流程,通过三个阶段(骨骼绑定、关节调整和顶点优化)逐步 refine 一个预设模板,从而实现稳定收敛与一致拓扑。关键创新在于引入几何感知约束,包括基于高斯曲率(Gaussian curvature)和保角一致性(conformal consistency)的局部结构保持机制,以及用于修正唇缝和眼睑不连续等细节瑕疵的辅助正则化项,最终生成具有语义意义边流(edge flow)和工业级拓扑质量的 mesh,显著提升了数字人生产场景下的可用性。

链接: https://arxiv.org/abs/2605.04524
作者: Yunmu Wang,Zoubin Bi,Bowen Cai,Chenchu Rong,Jinlong Wang,Junchen Deng,Aocheng Huang,Jidong Jia,Huan Fu
机构: Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We present a single-image head mesh reconstruction framework that addresses the longstanding challenge of simultaneously preserving facial identity and producing industry-grade topology. Our framework adopts a coarse-to-fine optimization pipeline that refines a rigged template across three stages – rig, joint, and vertex – achieving stable convergence and consistent topology. To mitigate the ill-posed nature of single-image 3D face reconstruction and ensure identity preservation, we employ a normal consistency objective jointly with landmark alignment. To further preserve local surface structure and enforce topological regularity, we introduce geometry-aware constraints based on Gaussian curvature and conformal consistency, along with auxiliary regularizations that correct fine artifacts such as lip seams and eyelid discontinuities. Our hierarchical optimization with geometry-aware regularization yields meshes with semantically meaningful edge flow and industry-grade topology. After geometry reconstruction, we extract UV-space texture and normal maps to preserve appearance details for visualization and downstream use. In a user study with 22 professional technical artists, our results were assessed as approaching industry-grade usability, and 95% of participants ranked our method as the top-performing approach, underscoring its effectiveness for real-world digital human production.
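摘要提到的高斯曲率约束,在离散网格上通常通过顶点处的角亏(angle deficit)K = 2π - Σθ 来度量。下面的草图在平面顶点与锥形顶点上演示该量(示意代码,非论文的完整正则项实现):

```python
import numpy as np

def angle_deficit(center, ring):
    """离散高斯曲率(角亏)示意: K = 2π - Σ 相邻三角形在顶点处的夹角。

    center: (3,) 顶点坐标; ring: (M, 3) 按顺序排列的一环邻居
    """
    total = 0.0
    M = len(ring)
    for i in range(M):
        a = ring[i] - center
        b = ring[(i + 1) % M] - center
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        total += np.arccos(np.clip(cos, -1.0, 1.0))
    return 2.0 * np.pi - total

# 平面上的顶点: 角亏为 0
t = np.linspace(0.0, 2.0 * np.pi, 6, endpoint=False)
flat_ring = np.stack([np.cos(t), np.sin(t), np.zeros(6)], axis=1)
k_flat = angle_deficit(np.zeros(3), flat_ring)

# 抬高中心形成锥形: 角亏为正(正曲率)
k_cone = angle_deficit(np.array([0.0, 0.0, 0.5]), flat_ring)
```

以此类局部量作为约束,可以在优化顶点时抑制尖刺等曲率异常,保持局部曲面结构。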

[CV-66] DALight-3D: A Lightweight 3D U-Net for Brain Tumor Segmentation from Multi-Modal MRI

【速读】:该论文旨在解决多模态磁共振成像(MRI)中自动脑肿瘤分割的计算效率问题,尤其是传统3D卷积神经网络模型因参数量大、计算成本高而难以部署于实际临床场景。其解决方案的关键在于提出一种轻量级3D U-Net变体DALight-3D,通过引入深度可分离3D卷积(depthwise separable 3D convolutions, SepConv)、标识符条件归一化(identifier-conditioned normalization)、跨切片注意力机制(cross-slice attention, CSA)以及自适应跳跃融合(adaptive skip fusion, SSFB),在保持较高分割精度的同时显著降低模型复杂度——在医学分割十项全能任务BrainTumour基准上实现平均Dice系数0.727,仅需2.22M参数,优于Residual 3D U-Net(Dice: 0.710, 参数: 3.20M),且消融实验证明各模块均对性能提升具有关键作用。

链接: https://arxiv.org/abs/2605.04518
作者: Nand Kumar Mishra,Dhruv Mishra,Dr Manu Pratap Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Automatic brain tumor segmentation from multi-modal MRI remains challenging because volumetric models often incur substantial computational cost. This paper presents DALight-3D, a compact 3D U-Net variant that combines depthwise separable 3D convolutions, identifier-conditioned normalization, cross-slice attention, and adaptive skip fusion. The method is evaluated on the Medical Segmentation Decathlon Task01 BrainTumour benchmark under matched optimization settings against standard 3D U-Net, Attention U-Net, Residual 3D U-Net, and V-Net baselines. In the reported 50-epoch comparison, DALight-3D achieves a mean Dice of 0.727 with 2.22M parameters, compared with 0.710 Dice and 3.20M parameters for Residual 3D U-Net. Component-wise ablations show consistent performance degradation when SepConv, identifier-conditioned normalization, CSA, or SSFB is removed. These results indicate that DALight-3D offers a favorable accuracy-efficiency trade-off within the present benchmark setting.
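摘要中深度可分离 3D 卷积(SepConv)带来的参数量缩减可以直接算出:标准 3D 卷积为 C_in·C_out·k³,而可分离版本为逐通道的 C_in·k³ 加逐点 1×1×1 的 C_in·C_out。以 64 通道、k=3 为例(与论文具体配置无关的通用算术,不含偏置):

```python
def conv3d_params(c_in, c_out, k):
    """标准 3D 卷积参数量(不含偏置): C_in · C_out · k³"""
    return c_in * c_out * k ** 3

def sepconv3d_params(c_in, c_out, k):
    """深度可分离 3D 卷积: 逐通道 C_in·k³ + 逐点 1×1×1 的 C_in·C_out"""
    return c_in * k ** 3 + c_in * c_out

std = conv3d_params(64, 64, 3)      # 110592
sep = sepconv3d_params(64, 64, 3)   # 1728 + 4096 = 5824
ratio = std / sep                    # 约 19 倍
```

这解释了为何 DALight-3D 能在 2.22M 参数下逼近参数量更大的基线网络。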

[CV-67] From Priors to Perception: Grounding Video-LLM s in Physical Reality

【速读】:该论文旨在解决视频大语言模型(Video-LLMs)在细粒度物理推理任务中存在的系统性缺陷问题,尤其是模型在面对违反物理规律的异常场景和违背直觉但符合视觉事实的反常情境时表现不佳的现象。研究指出,这些问题并非源于感知能力不足,而是由于语义先验主导(Semantic Prior Dominance)——即模型的推理机制被内部叙事脚本所干扰。解决方案的关键在于提出统一归因理论(Unified Attribution Theory)并设计两个核心组件:一是基于物理定律合成的高保真对抗性视频数据集Programmatic Adversarial Curriculum (PACC),实现了视觉伪影与逻辑错误的彻底解耦;二是视觉锚定推理链(Visual-Anchored Reasoning Chain, VARC),强制模型在逻辑判断前显式依赖低层视觉事实。实验表明,仅通过标准LoRA微调即可显著削弱先验干扰,大幅提升SOTA模型的物理推理能力。

链接: https://arxiv.org/abs/2605.04515
作者: Zicheng Zhao,Chaofan Gan,Shijie Li,Weiyao Lin
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Video Large Language Models (Video-LLMs) excel in general understanding, they exhibit systematic deficits in fine-grained physical reasoning. Existing interventions not only suffer from limited generalization but fundamentally conflate generative artifacts with genuine physical fallacies. Furthermore, we find that models fail systematically not only in anti-physics anomalies but also in counter-intuitive scenarios where visual facts contradict statistical expectations. Accordingly, we propose the Unified Attribution Theory: this dual failure stems not from perception deficiency, but from Semantic Prior Dominance – the reasoning mechanism is deeply hijacked by internal narrative scripts. To address this, we construct the Programmatic Adversarial Curriculum (PACC), the first high-fidelity adversarial video dataset synthesized based on physical laws, thoroughly decoupling visual artifacts from logical errors. Concurrently, we design the Visual-Anchored Reasoning Chain (VARC) to force models to explicitly ground their judgments in low-level visual facts prior to logical adjudication. Experiments demonstrate that without invasive architectural modifications, standard LoRA fine-tuning with the PACC curriculum effectively neutralizes prior interference in state-of-the-art (SOTA) models, yielding a substantial leap in physical reasoning capabilities.

[CV-68] Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting ICPR

【速读】:该论文旨在解决现有3D场景理解方法在实例级开放词汇(open-vocabulary)任务中存在的跨视图一致性差、缺乏统一的实例级语义推理以及下游3D任务精度不足的问题。其核心解决方案在于提出Ilov3Splat框架,通过在3D高斯点阵(3D Gaussian Splatting, 3D-GS)基础上引入视图一致的特征场(view-consistent feature fields),实现几何与语义表示的联合优化:一方面利用多分辨率哈希嵌入高效编码与CLIP对齐的语言特征,实现3D空间中稠密且连贯的语言定位;另一方面基于SAM掩码训练实例特征场,借助对比损失提升跨视角的细粒度物体区分能力。推理时,通过CLIP编码查询并与学习到的特征匹配,并采用两阶段3D聚类提取相关高斯组,从而无需类别监督或人工标注即可根据自然语言描述识别任意3D物体。

链接: https://arxiv.org/abs/2605.04506
作者: Binh Long Nguyen,Kien Nguyen,Sridha Sridharan,Clinton Fookes,Peyman Moghadam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The International Conference on Pattern Recognition (ICPR) 2026

点击查看摘要

Abstract:We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.

[CV-69] DiffCap-Bench: A Comprehensive Challenging Robust Benchmark for Image Difference Captioning

【速读】:该论文旨在解决当前图像差异描述(Image Difference Captioning, IDC)任务中基准测试缺乏多样性与组合复杂性,以及标准词汇重叠指标(如BLEU、METEOR)无法有效捕捉语义一致性或惩罚幻觉的问题,从而导致对多模态大语言模型(Multimodal Large Language Models, MLLMs)在IDC任务上评估不全面且不可靠。其解决方案的关键在于提出DiffCap-Bench——一个涵盖十类不同差异类型的综合性IDC基准,并引入基于人类验证差异列表的“大语言模型作为评判者”(LLM-as-a-Judge)评估协议,实现对模型捕捉和描述视觉变化能力的鲁棒评估。该框架不仅与人类专家判断高度一致,还与下游图像编辑数据构建质量呈现强相关性,显著提升了IDC任务的评估可靠性与实用性。

链接: https://arxiv.org/abs/2605.04503
作者: Yuancheng Wei,Haojie Zhang,Linli Yao,Lei Li,Jiali Chen,Tao Huang,Yiting Lu,Duojun Huang,Xin Li,Zhao Zhong
机构: South China University of Technology (华南理工大学); Peking University (北京大学); The University of Hong Kong (香港大学); Tianjin University (天津大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models’ ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.

[CV-70] Example-Based Object Detection

【速读】:该论文旨在解决开放词汇目标检测(open-vocabulary object detection)中持续出现的误检(false positives)和漏检(false negatives)问题,尤其是在工程实践中,相同对象的反复误检或漏检不可接受,但频繁重新训练模型在人力、计算资源和时间上成本高昂。解决方案的关键在于提出EBOD(Example-Based Object Detection)框架,该框架将基于提示的检测器(SAM3)与鲁棒特征匹配模块(DINOv3 和 LightGlue)相结合,通过利用历史错误样本(即先前的误检或漏检实例)进行特征级匹配与抑制,从而有效防止同类错误重复发生,且无需对模型进行额外训练。

链接: https://arxiv.org/abs/2605.04501
作者: ZhiXin Sun
机构: PowerChina Zhongnan Engineering Corporation Limited (中国电建中南勘测设计研究院有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, object detection has achieved significant progress, especially in the field of open-vocabulary object detection. Unlike traditional methods that rely on predefined categories, open-vocabulary approaches can detect arbitrary objects based on human-provided prompts. With the advancement of prompt-based detection techniques, models such as SAM3 can even outperform some category-specific detectors trained on particular datasets without requiring additional training on those datasets. However, despite these advancements, false positives and false negatives still occur. In practical engineering applications, persistent misdetections or missed detections of the same object are unacceptable. Yet retraining the model every time such errors occur incurs substantial costs in terms of human effort, computational resources, and time. Therefore, how to leverage existing false positive and false negative samples to prevent such errors from recurring remains a highly challenging and urgent problem. To address this issue, we propose EBOD (Example-Based Object Detection), which integrates a prompt-based detector (SAM3) with robust feature matching modules (DINOv3 and LightGlue). The proposed framework effectively suppresses the repeated occurrence of false positives and false negatives by leveraging previous error examples, without requiring additional model retraining. Code is available at this https URL.
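EBOD 利用历史错误样本抑制重复误检的思路,可以用如下极简草稿示意:将每个新检测的特征向量与已存误检样本库做余弦相似度匹配,超过阈值即抑制。阈值与特征向量均为假设的玩具数据,实际系统使用 DINOv3 特征与 LightGlue 匹配器:

```python
import math

def cosine(a, b):
    """两个特征向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filter_detections(detections, fp_bank, thresh=0.9):
    """若某检测的特征与任一已存误检样本相似度超过假设阈值 thresh, 则抑制该检测。"""
    kept = []
    for emb, box in detections:
        if all(cosine(emb, fp) < thresh for fp in fp_bank):
            kept.append(box)
    return kept

fp_bank = [[1.0, 0.0, 0.0]]               # 一条已存的误检样本特征
dets = [([0.99, 0.05, 0.0], "boxA"),      # 与误检样本几乎相同, 应被抑制
        ([0.0, 1.0, 0.0], "boxB")]        # 无关检测, 应被保留
kept = filter_detections(dets, fp_bank)   # -> ["boxB"]
```

同理,对漏检样本库做匹配即可在相似区域强制补充检测,全程无需重新训练模型。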

[CV-71] Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

【速读】:该论文旨在解决当前基于偏好(preference-based)的扩散模型对齐方法在文本到图像(text-to-image, T2I)生成任务中依赖奖励诱导偏好信号的问题,且这些方法通常假设人类偏好可由Bradley–Terry(BT)模型充分建模,而这一假设可能无法捕捉人类偏好的全部复杂性。解决方案的关键在于从博弈论视角出发,提出了一种名为Diffusion Nash Preference Optimization(Diff.-NPO)的新框架,其核心思想是让当前策略与自身进行对抗博弈以实现自我改进,从而提升对齐效果。该方法无需显式建模奖励函数,直接优化偏好关系,实验证明其在多种指标下均优于现有基于偏好的扩散对齐方法。

链接: https://arxiv.org/abs/2605.04494
作者: Jiaming Hu,Jiamu Bai,Haoyu Wang,Debarghya Mukherjee,Ioannis Ch. Paschalidis
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley–Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self improvement and lead to a better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.
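摘要中对比的 Bradley–Terry(BT)模型将偏好概率完全由标量奖励差决定,下面的小例子展示其标准形式,也说明了它强加的互补结构,这正是论文认为可能无法刻画复杂人类偏好的原因:

```python
import math

def bt_preference(r_a, r_b):
    """Bradley-Terry 模型:由标量奖励差给出 P(a 优于 b) = sigmoid(r_a - r_b)。"""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

p_ab = bt_preference(2.0, 1.0)   # 约 0.731
p_ba = bt_preference(1.0, 2.0)
# BT 模型下两方向概率必然互补 (p_ab + p_ba = 1), 偏好完全由奖励诱导且传递
```

Diff.-NPO 的博弈论视角正是为了摆脱这一奖励诱导假设,直接在策略自博弈中优化一般偏好关系。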

[CV-72] Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的自动驾驶系统中,因感知输出冗余或冲突导致推理阶段产生幻觉实体和不安全结论的问题。现有方法通常将LLM作为后处理模块,直接作用于多模态感知结果,缺乏对异构传感器数据的一致性协调机制。解决方案的关键在于提出一种以鸟瞰图(Bird’s-Eye View, BEV)为中心的神经符号架构InfoCoordiBridge,其核心创新是引入一个显式的协调桥接模块(ICA),在感知与语言推理之间构建统一、结构化的场景摘要(SceneSummary),从而在高层推理前消除多源感知信息的冗余与跨模态不一致性,实现更可靠的语义一致性和可验证的决策推理。

链接: https://arxiv.org/abs/2605.04475
作者: Shuo Liu,Lei Shi,Haowen Liu,Jing Xu,Yufei Gao,Yucheng Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable autonomous driving requires scene understanding that is semantically consistent across heterogeneous sensors and verifiable at the reasoning stage. However, many recent LLM-driven driving systems attach the language model as a post-processor and force it to reason over redundant or conflicting perception outputs, which can amplify hallucinated entities and unsafe conclusions. This paper proposes InfoCoordiBridge, a BEV-centric neuro-symbolic architecture that inserts an explicit coordination bridge between perception and language reasoning. InfoCoordiBridge comprises (i) a unified multi-agent perception layer that outputs typed structured facts together with modality-focused synopses, (ii) an ICA module that aligns and fuses multi-source outputs into a single SceneSummary, and (iii) an SSRE module that performs SceneSummary-grounded reasoning with verification. Experiments on nuScenes and Waymo show that ICA preserves competitive 3D detection accuracy while substantially improving fusion consistency, reducing redundancy to below 1% and achieving about 98% attribute agreement. On NuScenes-QA and a template-aligned Waymo-QA benchmark, SSRE improves factual grounding and reduces hallucinated entity mentions compared with representative VLM and agentic baselines. Overall, by coordinating multi-sensor outputs into a single conflict-aware SceneSummary before prompting, InfoCoordiBridge prevents redundant and cross-modally inconsistent perception evidence from propagating into high-level reasoning.

[CV-73] Stream-T1: Test-Time Scaling for Streaming Video Generation

【速读】:该论文旨在解决当前基于扩散模型的测试时扩展(Test-Time Scaling, TTS)视频生成方法中存在的两大核心问题:一是候选样本探索成本过高,二是缺乏有效的时序引导机制。为突破这些结构性瓶颈,作者提出将研究重点转向流式视频生成(streaming video generation),其关键在于利用分块合成(chunk-level synthesis)与少量去噪步骤的内在特性,显著降低计算开销并实现细粒度的时序控制。解决方案的核心创新体现在三个模块:(1) 流式缩放噪声传播(Stream-Scaled Noise Propagation),通过历史高质量块噪声主动优化当前块初始潜在噪声,建立时序依赖并利用历史高斯先验指导生成;(2) 流式缩放奖励剪枝(Stream-Scaled Reward Pruning),融合短期局部视觉质量评估与滑动窗口长期时序一致性评价,平衡空间美感与全局时序连贯性;(3) 流式缩放记忆下沉(Stream-Scaled Memory Sinking),根据奖励反馈动态路由KV-cache中被驱逐的上下文信息至不同更新路径,确保先前生成内容有效锚定并引导后续视频流。该框架在5秒和30秒综合视频基准上均展现出显著优势,大幅提升时序一致性、运动平滑性和帧级视觉质量。

链接: https://arxiv.org/abs/2605.04461
作者: Yijing Tu,Shaojin Wu,Mengqi Huang,Wenchuan Wang,Yuxin Wang,Chunxiao Liu,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a pioneering comprehensive TTS framework exclusively tailored for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream-Scaled Noise Propagation, which actively refines the initial latent noise of the generating chunk using historically proven, high-quality previous chunk noise, effectively establishing temporal dependency and utilizing the historical Gaussian prior to guide the current generation; (2) Stream-Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from the KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates profound superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.
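Stream-T1 的奖励剪枝将即时短期评估与滑动窗口长期评估加权合并后保留 top-k 候选,可用如下草稿示意(分数、权重与窗口长度均为假设值):

```python
def prune_candidates(candidates, keep=2, alpha=0.5):
    """每个候选: (名称, 短期质量分, 滑动窗口内与最近各块的一致性分列表)。
    综合分按假设权重 alpha 混合短期与长期评估, 仅保留 top-`keep` 个候选。"""
    scored = []
    for name, short, window_scores in candidates:
        long_term = sum(window_scores) / len(window_scores)
        scored.append((alpha * short + (1 - alpha) * long_term, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:keep]]

cands = [("c0", 0.9, [0.2, 0.3, 0.4]),   # 画面好但时间一致性差
         ("c1", 0.7, [0.8, 0.9, 0.8]),   # 两者兼顾
         ("c2", 0.5, [0.9, 0.9, 0.9])]   # 一致但画质一般
result = prune_candidates(cands)          # -> ["c1", "c2"]
```

注意仅靠短期分会选中时间不一致的 c0,引入滑动窗口长期项后 c0 被剪掉,体现了空间美感与全局时序连贯性的平衡。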

[CV-74] StableI2I: Spotting Unintended Changes in Image-to-Image Transition

【速读】:该论文旨在解决当前图像到图像(Image-to-Image, I2I)生成任务中评估体系的局限性问题,即现有方法主要关注指令遵循能力和图像感知质量(perceptual quality),而忽视了输出图像是否保留输入图像的语义对应关系(semantic correspondence)和空间结构一致性(spatial structure consistency)。其解决方案的关键在于提出StableI2I框架,这是一个统一且动态的评估机制,无需参考图像即可精确测量内容保真度(content fidelity)与前后一致性(pre–post consistency),并进一步构建StableI2I-Bench基准用于系统性评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在上述指标上的表现,从而实现对真实世界I2I系统中内容一致性的可靠诊断与性能比较。

链接: https://arxiv.org/abs/2605.04453
作者: Jiayang Li,Shuo Cao,Xiaohui Li,Zhizhen Zhang,Kaiwen Zhu,Yule Duan,Yu Qiao,Jian Zhang,Yihao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fail to assess whether the output image preserves the semantic correspondence and spatial structure of the input image. To address this limitation, we propose StableI2I, a unified and dynamic evaluation framework that explicitly measures content fidelity and pre–post consistency across a wide range of I2I tasks without requiring reference images, including image editing and image restoration. In addition, we construct StableI2I-Bench, a benchmark designed to systematically evaluate the accuracy of MLLMs on such fidelity and consistency assessment tasks. Extensive experimental results demonstrate that StableI2I provides accurate, fine-grained, and interpretable evaluations of content fidelity and consistency, with strong correlations to human subjective judgments. Our framework serves as a practical and reliable evaluation tool for diagnosing content consistency and benchmarking model performance in real-world I2I systems.

[CV-75] RemoteZero: Geospatial Reasoning with Zero Human Annotations

【速读】:该论文旨在解决当前地理空间推理(geospatial reasoning)模型依赖人工标注坐标作为监督信号的问题,从而限制了其在海量未标注遥感数据上的自进化能力。解决方案的关键在于提出RemoteZero框架,通过利用多模态大语言模型(MLLM)更强的语义验证能力(即判断某区域是否满足查询条件),替代传统的边界框(box)监督,实现无需标注坐标的GRPO(Group Relative Policy Optimization)训练,并支持基于自身验证信号的迭代自我进化。

链接: https://arxiv.org/abs/2605.04451
作者: Liang Yao,Fan Liu,Shengxiang Xu,Chuanyi Zhang,Rui Min,Shimin Di,Yuhui Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent progress has liberated the reasoning path from manual curation, allowing models to generate their own inference chains. Yet a final dependency remains: they are still supervised by human-annotated ground-truth coordinates. This leaves the reasoning process autonomous, but not its spatial endpoint, and prevents true self-evolution on abundant unlabeled remote sensing data. To break this bottleneck, we introduce RemoteZero, a box-supervision-free framework for geospatial reasoning. RemoteZero is motivated by a simple asymmetry: an MLLM is typically better at verifying whether a region satisfies a query than at directly generating precise coordinates. Leveraging this stronger discriminative ability, RemoteZero replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations. The resulting framework further supports iterative self-evolution, allowing the model to improve from unlabeled remote sensing imagery through its own verification signal. Experiments show that RemoteZero achieves competitive performance against strong supervised methods, demonstrating the potential of self-verifying training for geospatial reasoning localization.

[CV-76] Deep Reprogramming Distillation for Medical Foundation Models

【速读】:该论文旨在解决医学基础模型(Medical foundation models)在特定临床场景下适配时面临的挑战,尤其是由于预训练与下游任务之间存在领域和任务差异、计算资源受限以及推理速度要求高等因素导致的知识迁移效率低下问题。现有方法如知识蒸馏(Knowledge Distillation, KD)和参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)均存在局限性:KD通常假设师生模型任务一致且结构相似,而PEFT难以实现轻量化与个性化部署;二者结合仍无法有效处理模型结构和训练策略不一致带来的知识传递瓶颈。本文提出深度重编程蒸馏(Deep Reprogramming Distillation, DRD)框架,其核心创新在于引入一个新颖的重编程模块(reprogramming module),一方面缓解预训练与下游任务之间的领域和任务差异,另一方面构建面向轻量级下游模型的友好蒸馏路径;同时设计中心核对齐(Centered Kernel Alignment, CKA)蒸馏机制以增强不同训练条件下知识迁移的鲁棒性。实验证明,DRD在18个医学下游任务中优于现有PEFT与KD方法,涵盖2D/3D分类与分割等多种场景。

链接: https://arxiv.org/abs/2605.04447
作者: Siyuan Du,Yuhang Zhou,Haolin Li,Jiangchao Yao,Haishuai Wang,Hui Lin,Ya Zhang,Yanfeng Wang
机构: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); Zhejiang University (浙江大学); Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University (浙江大学医学院附属邵逸夫医院); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical foundation models pre-trained on large-scale datasets have shown powerful versatile performance. However, when adapting medical foundation models for specific medical scenarios, it remains the inevitable challenge due to the gap induced by the discrepancy between pre-training and downstream tasks, the real-world computation, and speed constraints. Relevant techniques that probably handle this challenge more or less suffer from some intrinsic limitations. For example, knowledge distillation (KD) assumes that teacher and student models share the same task, training strategy, and model structure family, while prevalent parameter-efficient fine-tuning (PEFT) fails to achieve personalized and lightweight deployment. Even the combination of PEFT and KD still struggles to resolve model structures and training strategies inconsistencies between teacher and student models, leading to inefficient knowledge transfer. In this study, we propose a novel framework called Deep Reprogramming Distillation (DRD) to combat the general adaptation challenge. Specifically, DRD introduces the novel reprogramming module that on the one side overcomes the domain and task discrepancy between pretraining and downstream scenarios, and on the other side builds the student-friendly efficient distillation from foundation models to lightweight downstream models. Furthermore, to mitigate variability under different training conditions, we design a centered kernel alignment (CKA) distillation method to promote robust knowledge transfer. Empirical results show that DRD surpasses previous PEFT and KD methods across 18 medical downstream tasks under different foundation models, covering various scenarios including 2D/3D classification and 2D/3D segmentation.
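DRD 所用的 CKA(Centered Kernel Alignment)是一个标准的表示相似度度量;其线性形式对特征的正交旋转与各向同性缩放不变,因而适合比较结构与宽度不同的师生模型特征。以下为最小化实现草稿:

```python
import numpy as np

def linear_cka(x, y):
    """线性 CKA: 对特征的正交旋转与各向同性缩放不变,
    可直接比较维度不同的师生特征 (x: (n, d1), y: (n, d2))。"""
    x = x - x.mean(axis=0)                 # 按样本维中心化
    y = y - y.mean(axis=0)
    num = np.linalg.norm(y.T @ x, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return num / den

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # 随机正交矩阵
score = linear_cka(feats, feats @ q)                 # 旋转后的自身, score ≈ 1.0
```

这种不变性正是 CKA 蒸馏能在师生训练条件不一致时仍保持鲁棒知识迁移的原因之一。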

[CV-77] LEGO: LoRA-Enabled Generator-Oriented Framework for Synthetic Image Detection

【速读】:该论文旨在解决生成式图像伪造检测中模型泛化能力不足的问题,尤其是在多种生成器(generator)多样性增加时,通用伪影特征重叠减少导致检测性能下降,而仅依赖特定伪影又易引发过拟合。解决方案的关键在于提出一种基于LoRA(Low-Rank Adaptation)的生成器导向框架LEGO,其核心机制是通过多层感知机(MLP)调制多个预训练的LoRA模块,每个模块专门学习单一生成器的独特伪影特征,再结合注意力机制进行特征融合。该方法将训练分为两阶段:第一阶段独立训练各LoRA模块以捕获生成器特异性表示,第二阶段联合训练MLP和注意力层以动态调节各模块贡献,从而在保持模块化设计的同时实现高效、可扩展的检测性能,且显著优于现有最先进方法,仅需少于30,000张训练图像和5个训练周期即可达到优异效果。

链接: https://arxiv.org/abs/2605.04445
作者: Yutong Xiao,Ran Ran,Jiwei Wei,Shuchang Zhou,Ke Liu,Zheng Ziqiang,Caiyan Qin
机构: University of Electronic Science and Technology of China (电子科技大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,2 figures

点击查看摘要

Abstract:The rapid advancement of generative technologies has made synthetic images nearly indistinguishable from real ones, thereby creating an urgent need for robust detectors to counter misinformation. However, existing methods mainly rely on universal artifact features that are shared across multiple generators. We observe that as the diversity of generators increases, the overlap of these common features gradually decreases. This severely undermines model generalization. In contrast, focusing only on unique artifacts tends to cause overfitting to specific forgery patterns. To address this challenge, we propose LEGO (LoRA-Enabled Generator-Oriented Framework). The core mechanism of LEGO employs an MLP to modulate multiple LoRA (Low-Rank Adaptation) blocks, each pretrained to capture the unique artifacts of a specific generator, followed by attention-based feature fusion. Unlike conventional methods that seek a single universal solution, LEGO delegates unique artifact extraction to specialized LoRA modules by dividing its training procedure into two stages. Each LoRA module is individually trained on a single-generator dataset to learn generator-specific representations, then MLP and attention layers are trained on mixed datasets to dynamically regulate the contribution of each module. Benefiting from its modular yet robust design, LEGO can be naturally extended by incorporating new LoRA modules for adaptation to newly emerging next-generation datasets, while still achieving substantially better performance than prior SOTA methods with fewer than 30,000 training images, less than 10% of their training data, and only 5 epochs in each training stage.
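LEGO 用门控调制多个预训练 LoRA 模块的核心机制,可用如下 NumPy 草稿示意(维度与门控值均为假设;实际门控由 MLP 依据输入特征动态产生,并配合注意力融合):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_gen = 8, 2, 3        # 特征维度、LoRA 秩、生成器个数(均为假设)

# 每个生成器一组低秩矩阵 (A, B), 对应阶段一各自独立预训练的 LoRA 模块
loras = [(rng.standard_normal((r, d)), rng.standard_normal((d, r)))
         for _ in range(n_gen)]
w_base = rng.standard_normal((d, d))     # 冻结的骨干权重

def gated_forward(x, gates):
    """骨干输出加上门控加权的各 LoRA 低秩增量 B @ (A @ x)。"""
    out = w_base @ x
    for g, (a, b) in zip(gates, loras):
        out = out + g * (b @ (a @ x))
    return out

x = rng.standard_normal(d)
y = gated_forward(x, gates=[0.7, 0.2, 0.1])   # 门控值为假设
```

门控全为零时退化为骨干本身;新增生成器时只需追加一个 LoRA 模块并重训门控,这正是其模块化可扩展性的来源。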

[CV-78] A cross-modal network for facial expression recognition

【速读】:该论文旨在解决现有基于深度神经网络的面部表情识别方法过度依赖层次化信息而非面部属性特征的问题,从而导致表情识别性能受限。其解决方案的关键在于提出一种融合强生物与结构信息的跨模态网络(CMNet),利用面部对称性,分别从整脸与左右半脸学习表情信息以提取互补特征;同时引入显著面部信息精炼模块来过滤冗余信息、提升分类器稳定性,并设计半脸对齐优化机制以减少对单侧面部特征的依赖,增强模型鲁棒性。

链接: https://arxiv.org/abs/2605.04439
作者: Chunwei Tian,Jingyuan Xie,Qi Zhang,Chao Li,Wangmeng Zuo,Shichao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in IEEE Transactions on Image Processing 2026

点击查看摘要

Abstract:Deep neural networks enriched with structural information have been widely employed for facial expression recognition tasks. However, these methods often depend on hierarchical information rather than face property to finish expression recognition. In this paper, we propose a cross-modal network with strong biological and structural information for facial expression recognition (CMNet). CMNet can respectively learn expression information via face symmetry on a whole face, left and right half faces to extract complementary facial features. To prevent negative effect of biological and structural information fusion, a salient facial information refinement module can obtain salient facial expression information to improve stability of an obtained facial expression classifier. To reduce reliance on unilateral facial features, a half-face alignment optimization mechanism is designed to align obtained expression information of learned left and right half faces. Our experimental results demonstrate that CMNet outperforms several novel methods, i.e., SCN and LAENet-SA for facial expression recognition. Codes can be obtained at this https URL.
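CMNet 从整脸与左右半脸分别提取互补特征的做法,可用如下草稿示意:将右半脸做镜像翻转,使两个半脸共享同一朝向,便于后续的半脸对齐(图像布局与尺寸均为假设):

```python
import numpy as np

def half_face_views(face):
    """返回整脸、左半脸、以及镜像翻转后的右半脸;
    翻转使两半脸共享同一朝向, 便于后续对齐。"""
    w = face.shape[1]
    left = face[:, : w // 2]
    right = np.flip(face[:, w - w // 2:], axis=1)   # 水平镜像
    return face, left, right

img = np.arange(16, dtype=float).reshape(4, 4)      # 玩具"人脸"
whole, left, right = half_face_views(img)
```

三个视图各自经过特征提取后再融合,即对应论文中基于面部对称性的互补特征学习。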

[CV-79] Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

【速读】:该论文旨在解决在非结构化越野场景中,前馈式高斯点绘(Feedforward Gaussian Splatting)因高频几何细节、自身运动抖动(ego-motion jitter)及增强的非刚性动态特性而导致的重建性能下降问题。这些问题会引发跨时间戳的高斯观测冲突,进而造成渲染结果过度平滑或出现结构伪影。解决方案的关键在于提出Ground4D框架,其核心思想是通过空间局部化的条件约束来化解时间冲突:具体而言,引入体素锚定的时间高斯聚合机制,将规范高斯空间划分为空间体素,并在每个体素内执行查询条件的时间注意力机制;同时,利用体素内softmax归一化确保时间选择性与空间占据性相互强化而非冲突。此外,还引入表面法向量作为辅助几何引导,以正则化高斯原型的几何结构。

链接: https://arxiv.org/abs/2605.04435
作者: Shuo Wang,Jilin Mei,Fuyang Liu,Wenfei Guan,Fanjie Kong,Zhihua Zhao,Shuai Wang,Chen Min,Yu Hu
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Xi’an Jiaotong University (西安交通大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feedforward Gaussian Splatting has recently emerged as an efficient paradigm for 4D reconstruction in autonomous driving. However, in unstructured off-road scenes, its performance degrades due to high-frequency geometry, ego-motion jitter, and increased non-rigid dynamics. These factors introduce conflicting Gaussian observations across timestamps, leading to either over-smoothed renderings or structural artifacts. To address this issue, we propose Ground4D, a spatially-grounded 4D feedforward framework for pose-free off-road reconstruction. The key idea is to resolve temporal conflicts through spatially localized conditioning. Specifically, we introduce voxel-grounded temporal Gaussian aggregation, which partitions the canonical Gaussian space into spatial voxels and performs query-conditioned temporal attention within each voxel. Intra-voxel softmax normalization ensures that temporal selectivity and spatial occupancy become mutually reinforcing rather than conflicting. We furthermore introduce surface normal cues as auxiliary geometric guidance to regularize the geometry of Gaussian primitives. Extensive experiments on ORAD-3D and RELLIS-3D demonstrate that Ground4D consistently outperforms existing feedforward methods in reconstruction quality and generalizes zero-shot to unseen off-road domains. Project page and code: this https URL.
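Ground4D 在单个体素内做查询条件时间注意力、并用 softmax 归一化化解跨时间戳冲突的思想,可用如下草稿示意(查询与各时间戳特征均为假设的玩具数据):

```python
import numpy as np

def voxel_temporal_attention(query, feats):
    """对落入同一体素的各时间戳特征做查询条件注意力:
    体素内 softmax 归一化, 冲突的时间观测被选择性加权而非简单平均。"""
    scores = feats @ query
    scores = scores - scores.max()                 # 数值稳定
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ feats, weights

q = np.array([1.0, 0.0])                           # 当前查询
feats = np.array([[2.0, 0.0],                      # 与查询一致的时间戳观测
                  [-2.0, 0.0]])                    # 冲突观测
fused, w = voxel_temporal_attention(q, feats)      # w[0] ≈ 0.98, 一致观测主导
```

与均匀平均会抵消为零不同,softmax 加权让与查询一致的时间戳主导融合结果,避免过度平滑。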

[CV-80] Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models)在通过连续提示学习(Continuous Prompt Learning)进行适配时存在的过拟合问题以及可解释性不足的问题。同时,针对离散提示优化方法依赖大型外部模型导致计算成本高、扩展性差的局限,提出了一种可解释的提示学习框架(Interpretable Prompt Learning, IPL)。其解决方案的关键在于设计一种混合优化策略:将语义标记选择建模为近似子模优化问题,以确保所选标记既具备人类可理解性又具有语义多样性;并通过交替优化机制,融合离散标记选择与连续提示调优,从而在保持下游任务适应能力的同时显著提升模型的可解释性。该框架具有即插即用特性,可无缝集成至现有提示学习方法中,并在多个基准测试中验证了其有效性与可扩展性。

链接: https://arxiv.org/abs/2605.04425
作者: Yating Wang,Yaqi Zhao,Yongshun Gong,Yilong Yin,Haoliang Sun
机构: 山东大学(Shandong University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures. Preprint version

点击查看摘要

Abstract:Vision-language models such as CLIP achieve strong visual-textual alignment, but often suffer from overfitting and limited interpretability when adapted through continuous prompt learning. While discrete prompt optimization improves interpretability, it usually depends on large external models, leading to high computational costs and limited scalability. In this paper, we propose Interpretable Prompt Learning (IPL), a hybrid framework that alternates between discrete semantic token selection and continuous prompt optimization. Specifically, IPL formulates semantic token selection as an approximate submodular optimization problem, encouraging tokens that are both human-understandable and semantically diverse. It further adopts an alternating optimization strategy to integrate discrete token selection with continuous prompt tuning, improving interpretability while preserving adaptability to downstream tasks. Our framework is plug-and-play, allowing seamless integration with existing prompt learning methods. Extensive experiments on multiple benchmarks show that IPL consistently improves both interpretability and accuracy across five representative prompt learning methods, providing an effective and scalable extension to existing frameworks.
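IPL 将语义标记选择建模为近似子模优化;一个常见的具体化是对设施选址(facility-location)目标的贪心最大化:每步加入边际覆盖增益最大的标记,天然偏向多样性,且因目标单调子模而享有 (1 - 1/e) 的贪心近似保证。以下为示意性草稿(相似度矩阵为假设数据,非论文的具体目标函数):

```python
import numpy as np

def greedy_token_selection(sim, k):
    """贪心最大化 F(S) = sum_i max_{j in S} sim[i, j]:
    每步选择使覆盖增益最大的标记, 重复语义的标记增益很小, 故偏向多样性。"""
    n = sim.shape[0]
    selected, covered = [], np.zeros(n)
    for _ in range(k):
        gains = [np.maximum(covered, sim[:, j]).sum() - covered.sum()
                 if j not in selected else -np.inf for j in range(n)]
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[:, best])
    return selected

# 两个语义紧密的标记簇: 贪心从每个簇各挑一个代表
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
picked = greedy_token_selection(sim, k=2)
```

选出的离散标记随后作为可读提示参与连续提示的交替优化,对应 IPL 的混合优化策略。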

[CV-81] Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion

【速读】:该论文旨在解决当前3D资产生成模型在面对分布外(out-of-distribution, OOD)风格时性能显著下降甚至失效的问题。现有方法通常依赖于与训练数据分布相似的风格图像,难以泛化到未见过的风格。解决方案的关键在于提出DiLAST(2D Diffusion-based Latent Awakening for 3D Style Transfer),其核心思想是利用预训练的2D扩散模型作为“教师”,提供丰富且可迁移的风格先验;通过在扩散引导下对渲染视图进行风格对齐,优化结构化的3D潜在表示以实现风格迁移。研究表明,问题根源并非模型容量不足,而是3D潜在空间未被充分挖掘,而DiLAST通过引入2D扩散指导,使3D生成模型能够在有限训练数据下有效引导潜在空间中的去噪方向,从而实现多样化的OOD风格生成。

链接: https://arxiv.org/abs/2605.04412
作者: Yiran Qiao,Yiren Lu,Yunlai Zhou,Disheng Liu,Linlin Hou,Rui Yang,Yu Yin,Jing Ma
机构: Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D asset generation plays a pivotal role in fields such as gaming and virtual reality, enabling the rapid synthesis of high-fidelity 3D objects from a single or multiple images. Building on this capability, enabling style-controllable generation naturally emerges as an important and desirable direction. However, existing approaches typically rely on style images that lie within or are similar to the training distribution of 3D generation models. When presented with out-of-distribution (OOD) styles, their performance degrades significantly or even fails. To address this limitation, we introduce DiLAST: 2D Diffusion-based Latent Awakening for 3D Style Transfer. Specifically, we leverage a pretrained 2D diffusion model as a teacher to provide rich and generalizable style priors. By aligning rendered views with the target style under diffusion-based guidance, our method optimizes the structured 3D latent representation for stylization. We observe that this limitation stems not from insufficient model capacity, but from the underutilization of structured 3D latents, which are inherently expressive. Despite being trained on comparatively limited data, 3D generation models can leverage 2D diffusion guidance to steer denoising toward specific directions in latent space, thereby producing diverse, OOD styles. Extensive experiments across diverse data and multiple 3D generation backbones demonstrate the effectiveness and plug-and-play nature of our approach.

[CV-82] Evaluation Cards for XAI Metrics CVPR2026

【Quick Read】: This paper tackles the lack of standardization in evaluating explainable AI (XAI) methods: metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. The key contribution is the XAI Evaluation Card, a documentation template analogous to model cards that accompanies any work introducing an XAI evaluation metric. The card requires explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases, improving the transparency and reproducibility of evaluations and supporting community consensus and meta-analysis.

Link: https://arxiv.org/abs/2605.04410
Authors: Rokas Gipiškis, Olga Kurasova
Affiliations: Vilnius University; AI Standards Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 7 pages. Accepted at the 5th XAI4CV Workshop, CVPR 2026 (non-archival)

Abstract:The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.

[CV-83] UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

【Quick Read】: This paper addresses two problems in Remote Sensing Image Change Captioning (RSICC): existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and they struggle to reconcile the conflicting representation demands of change detection and caption generation; in addition, existing benchmarks offer limited coverage of high-resolution urban construction scenes, limiting both evaluation and application. The key to the proposed PTNet framework is a learnable prototype bank that explicitly models structured change semantics to guide cross-temporal interaction, a multi-head gating mechanism that disentangles task-specific representations, and the injection of detection-derived spatial priors into caption generation, jointly preserving semantic coherence and fine-grained spatial sensitivity.

Link: https://arxiv.org/abs/2605.04409
Authors: Yupeng Gao, Tianyu Li, Guoqing Wang, Yang Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity. Furthermore, we construct UCCD, a large-scale UAV-based benchmark comprising 9,000 high-resolution image pairs and 45,000 annotated sentences for urban construction monitoring. Extensive experiments on UCCD and WHU-CDC demonstrate that PTNet consistently outperforms existing methods. The dataset and source code are publicly available at this https URL.

[CV-84] Detecting Deepfakes via Hamiltonian Dynamics

【Quick Read】: This paper targets the problem that deepfake detectors must be periodically recalibrated as generative AI keeps evolving: static pattern recognition struggles to keep up with new synthetic artifacts. The key idea is a physics-inspired stability prior: natural images, as products of dissipative physical processes, tend to settle near stable, low-energy equilibria, whereas generative models mimic statistical properties without explicitly enforcing structural constraints such as geometric smoothness, so fakes are more likely to occupy unstable, high-energy states. The proposed Hamiltonian Action Anomaly Detection (HAAD) models the image latent manifold as a potential energy surface and uses Hamiltonian dynamics as a stability probe, quantifying trajectory behavior (Hamiltonian action and energy dissipation) to separate real from fake images and yielding more robust performance on cross-dataset transfer benchmarks.

Link: https://arxiv.org/abs/2605.04405
Authors: Harry Cheng, Ming-Hui Liu, Tianyi Wang, Weili Guan, Liqiang Nie, Mohan Kankanhalli
Affiliations: National University of Singapore; Shandong University; Harbin Institute of Technology (Shenzhen)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: First Version

Abstract:Driven by the rapid development of generative AI models, deepfake detectors are compelled to undergo periodic recalibration to capture newly developed synthetic artifacts. To break this cycle, we propose a new perspective on deepfake detection: moving from static pattern recognition to dynamical stability analysis. Specifically, our approach is motivated by physics-inspired priors: we hypothesize that natural images, as products of dissipative physical processes, tend to settle near stable, low-energy equilibria. In contrast, generative models optimize for statistical similarity to real images but do not explicitly enforce structural constraints such as geometric smoothness, leaving deepfakes more likely to occupy unstable, high-energy states. To operationalize this, we introduce Hamiltonian Action Anomaly Detection (HAAD), comprising three contributions: i) We model the image latent manifold as a potential energy surface. Under this hypothesis, real images are expected to produce basin-like low-energy responses, whereas fake images are more likely to induce high-potential, high-gradient responses. ii) We employ Hamiltonian-inspired dynamics as a stability probe. By releasing latent states from rest, samples near stable regions remain bounded, while high-gradient samples produce larger trajectory responses. iii) We quantify these dynamic behaviors through two trajectory statistics, i.e., Hamiltonian action and energy dissipation. Extensive experiments show that HAAD outperforms evaluated state-of-the-art baselines on challenging cross-dataset transfer benchmarks, supporting a physics-inspired stability prior for digital forensics.
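
The stability probe described in points ii) and iii) above can be sketched as follows. This is a toy illustration under assumptions, not the paper's implementation: `grad_U` stands in for the gradient of a learned potential over latents, and the "action" is approximated by accumulated kinetic energy along a leapfrog trajectory released from rest.

```python
import numpy as np

def haad_statistics(grad_U, z0, steps=50, dt=0.1):
    """Release a latent state from rest and integrate Hamiltonian dynamics.

    Returns a proxy for the Hamiltonian action (integrated kinetic energy)
    and the total displacement of the trajectory; samples in steep,
    high-gradient regions should yield larger values of both.
    """
    z = np.asarray(z0, dtype=float).copy()
    p = np.zeros_like(z)                  # released from rest
    action = 0.0
    for _ in range(steps):
        p -= 0.5 * dt * grad_U(z)         # leapfrog: half-step on momentum
        z += dt * p                       # full position step
        p -= 0.5 * dt * grad_U(z)         # second half-step on momentum
        action += 0.5 * float(p @ p) * dt # kinetic-energy contribution
    return action, float(np.linalg.norm(z - np.asarray(z0)))
```

With a quadratic potential U(z) = 0.5‖z‖², a sample near the basin minimum produces a much smaller action than one released far up the slope, which is the separation HAAD thresholds on.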

[CV-85] Optimize-at-Capture: Highly-adaptive Exposure Controlling for In-Vehicle Non-contact Heart-rate Monitoring

【Quick Read】: This paper addresses the degradation of remote photoplethysmography (rPPG)-based heart-rate monitoring under the drastic illumination changes of in-vehicle driving. The key is a highly adaptive exposure-control framework that proactively adjusts exposure parameters via predictive modeling of historical skin reflections, keeping the skin region of interest (ROI) within the optimal dynamic range for rPPG signal extraction rather than relying on fixed exposure or the camera's built-in auto-exposure. This design markedly improves heart-rate monitoring accuracy and success rate in challenging driving conditions.

Link: https://arxiv.org/abs/2605.04397
Authors: Jieying Wang, Xinqi Cai, Caifeng Shan, Wenjin Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:Remote photoplethysmography (rPPG) holds great promise for continuous heart-rate monitoring of drivers in intelligent vehicles. However, its performance is severely degraded by the highly dynamic illumination changes. A critical yet overlooked factor is the lack of exposure controlling during video acquisition – most existing systems rely on either fixed exposure settings or camera build-in auto-exposure, both of which fail to maintain stable facial brightness under rapidly changing lighting conditions during driving. To address this gap, we propose a highly-adaptive exposure controlling framework that proactively adjusts exposure parameters based on predictive modeling of historical skin reflections. Unlike standard auto-exposure, our method is specifically optimized for rPPG measurement, ensuring the skin region of interest (ROI) remains within the optimal dynamic range for rPPG signal extraction. As an important contribution of this study, we introduce ExpDrive, a public in-vehicle physiological monitoring dataset comprising synchronized facial video and reference ECG from 48 subjects captured under real driving conditions. Extensive experiments demonstrate that our method consistently outperforms fixed exposure and standard auto-exposure strategies. Specifically, it reduces the Mean Absolute Error (MAE) by 6.31 bpm (from 14.1 to 7.79 bpm) and significantly increases the success rate by 32.3 percentage points (p < 0.001) (from 24.9% to 57.2%) across challenging driving scenarios. Notably, it clearly improved the performance of non-contact heart-rate monitoring in both low-light (rainy) and high-glare (sunny) conditions, validating the efficacy of exposure-aware acquisition design.
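
The idea of driving the skin-ROI brightness toward a target range can be sketched with a simple multiplicative controller. This is a hypothetical stand-in, not the paper's predictive model of historical skin reflections: `target`, `gain`, and the exposure bounds are made-up parameters.

```python
def update_exposure(exposure, roi_mean, target=0.5, gain=0.8,
                    e_min=1e-4, e_max=1.0):
    """Proportional exposure update that nudges the skin-ROI mean
    brightness (in [0, 1]) toward the optimal range for rPPG extraction.
    `gain` < 1 damps the correction to avoid oscillation."""
    if roi_mean <= 0:
        return e_max                          # ROI fully dark: open up
    # scale exposure multiplicatively toward the target brightness
    new_e = exposure * (target / roi_mean) ** gain
    return min(max(new_e, e_min), e_max)      # clamp to the valid range
```

Run once per frame (or per prediction step): a dark ROI lengthens the exposure, a saturated ROI shortens it, and an on-target ROI leaves it unchanged.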

[CV-86] Intermediate Representations are Strong AI-Generated Image Detectors

【Quick Read】: This paper targets two shortcomings of current AI-generated-image detection: training-based methods are computationally costly and generalize poorly to unseen domains, while training-free methods fall short in detection performance. The key is a search-based method that exploits the sensitivity of intermediate-layer feature embeddings: it compares the similarity between embeddings of the original image and of perturbed copies, identifying AI-generated images from that similarity without any additional training, and thereby achieving more efficient and more generalizable detection.

Link: https://arxiv.org/abs/2605.04358
Authors: Zhenhan Huang, Pin-Yu Chen, Tejaswini Pedapati, Jianxi Gao
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The rapid advancement in generative AI models has enabled the creation of photorealistic images. At the same time, there are growing concerns about the potential misuse and dangers of generated content, as well as a pressing need for effective AI-generated image detectors. However, current training-based detection techniques are typically computationally costly and can hardly be generalized to unseen data domains, while training-free methods fall short in detection performance. To bridge this gap, we propose a search-based method employing data embedding sensitivity in intermediate layers to detect AI-generated images. Given a set of real and AI-generated images, our method examines the similarity between original image embeddings and perturbed image embeddings, and detects AI-generated images based on the similarity. We examine the proposed method on two comprehensive benchmarks: GenImage and Forensics Small. Our method exhibits improved performance across different datasets compared to both training-free and training-based state-of-the-art methods. On average, our method achieves the largest performance gain on the Forensics Small benchmark by 39.61% compared to the best training-free method and 5.14% compared to the best training-based method in AUROC score.
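
The embedding-sensitivity test at the core of the method can be sketched as below. This is an illustrative reduction under assumptions: `embed` stands in for an intermediate layer of any pretrained vision model, and the noise scale and perturbation count are made-up parameters; the paper searches over layers, which is omitted here.

```python
import numpy as np

def sensitivity_score(embed, image, noise_std=0.05, n_perturb=8, seed=0):
    """Average cosine similarity between the embedding of an image and the
    embeddings of lightly perturbed copies. A low score means the
    embedding is highly sensitive to perturbation, which is the cue used
    to flag AI-generated images."""
    rng = np.random.default_rng(seed)
    e0 = embed(image)
    e0 = e0 / np.linalg.norm(e0)
    sims = []
    for _ in range(n_perturb):
        noisy = image + rng.normal(0.0, noise_std, size=image.shape)
        e = embed(noisy)
        sims.append(float(e0 @ (e / np.linalg.norm(e))))
    return float(np.mean(sims))
```

Detection then reduces to thresholding this score, with the threshold calibrated on a small set of real and generated images.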

[CV-87] InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making

【Quick Read】: This paper addresses the reduced reliability of autonomous-driving perception when conventional RGB cameras and LiDAR suffer motion blur and latency in high-dynamic-range scenes or at high speed. The key is to add a Dynamic Vision Sensor (DVS) through a novel token-based fusion strategy that integrates accumulated event frames into the transformer backbone of the InterFuser model, exploiting the complementary nature of RGB, LiDAR, and DVS data to improve perception robustness. On the CARLA Leaderboard benchmark, the method achieves a Driving Score of 77.2 and 100% Route Completion, supporting event-based vision as a promising direction in adverse lighting and dynamic conditions.

Link: https://arxiv.org/abs/2605.04355
Authors: Mustafa Sakhai, Kaung Sithu, Min Khant Soe Oke, Maciej Wielgosz
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous driving systems rely heavily on robust sensor fusion to perceive complex environments. Traditional setups using RGB cameras and LiDAR often struggle in high-dynamic-range scenes or high-speed scenarios due to motion blur and latency. Dynamic Vision Sensors (DVS), or event cameras, offer a paradigm shift by capturing asynchronous brightness changes with microsecond temporal resolution and high dynamic range. In this paper, we propose an extended architecture of the state-of-the-art InterFuser model, integrating DVS as an additional modality to enhance perception reliability. We introduce a novel token-based fusion strategy that incorporates accumulated event frames into the transformer-based backbone of InterFuser. Our method leverages the complementary nature of RGB, LiDAR, and DVS data. We evaluate our approach on the Car Learning to Act (CARLA) Leaderboard benchmarks, demonstrating that the inclusion of DVS improves the robustness of the driving agent, achieving a competitive Driving Score of 77.2 and a superior Route Completion of 100%. The results indicate that event-based vision is a promising direction for improving safety and performance in adverse lighting and dynamic conditions.

[CV-88] Covariance-Aware Goodness for Scalable Forward-Forward Learning

【Quick Read】: This paper tackles the large performance gap between existing backpropagation-free Forward-Forward (FF) methods and standard backpropagation (BP) in convolutional networks, especially on ImageNet-100 and Tiny-ImageNet. The bottleneck is that the standard sum-of-squares goodness collapses feature volumes into channel-wise activation energies, discarding critical second-order dependencies. The proposed framework has three key components: Bi-axis Covariance Goodness (BiCovG) injects structured second-order information along two axes via cross-channel projections and nested multi-scale aggregation, giving a low-cost approximation to covariance-aware goodness; a lightweight Logistic Fusion module aggregates layer-wise predictions to amplify the contribution of deeper representations; and a Feature Alignment Layer (FAL) adds a zero-initialized correction at block boundaries of locally trained networks to mitigate representation misalignment. Together these extend viable FF depth from shallow baselines to 16-layer VGG-16 architectures, reaching 73.01% on ImageNet-100 and 50.30% on Tiny-ImageNet; Hybrid Goodness Blocks further narrow the gap to BP to 3.6% while cutting peak memory by about 50%.

Link: https://arxiv.org/abs/2605.04346
Authors: Xiaoyi Jiang, Bashir M. Al-Hashimi, Kai Xu
Affiliations: King's College London; University of Nottingham
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Forward-Forward algorithm eliminates global gradient flow and full network activations storage. However, in convolutional settings, existing BP-free FF methods significantly under-perform backpropagation on complex benchmarks such as ImageNet-100 and Tiny-ImageNet. We identify this gap as a structural bottleneck in goodness extraction: standard sum-of-squares formulation collapses feature volumes into channel-wise activation energies which omits critical second-order dependencies. To address this, we propose a framework centered on three key components. First, Bi-axis Covariance Goodness (BiCovG) explicitly augments the standard goodness function with structured second-order information along two axes: cross-channel projections that model inter-feature covariance, and nested multi-scale aggregation that encodes spatial correlation statistics. This provides a tractable approximation to covariance-aware goodness without the prohibitive O(C^2) complexity of explicit matrix estimation. Second, a lightweight Logistic Fusion module aggregates layer-wise predictions, amplifying the contribution of deeper representations. Third, the Feature Alignment Layer (FAL) introduces a zero-initialized correction at block boundaries to mitigate representation misalignment in deep locally trained networks. By introducing these three components, we effectively double the depth of viable Forward-Forward learning, extending robust layer utilization from shallow baselines to 16-layer architectures like VGG-16. The resulting BP-free model achieves 73.01% on ImageNet-100 and 50.30% on Tiny-ImageNet. As a practical extension, Hybrid Goodness Blocks control the scope of gradient propagation via configurable block sizes, further narrowing the ImageNet-100 gap to 3.6% and matching BP on Tiny-ImageNet, while still reducing peak memory by approximately 50% relative to BP.
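
The contrast between the standard sum-of-squares goodness and a bi-axis, covariance-aware goodness can be sketched as below. This is a minimal illustration of the idea, not the authors' exact formulation: the random projection matrix and the two-scale pooling are stand-ins for their learned cross-channel projections and nested multi-scale aggregation.

```python
import numpy as np

def bicov_goodness(feat, proj_dim=4, seed=0):
    """Goodness for a C x H x W feature map: classic activation energy
    augmented with second-order terms along two axes -- cross-channel
    projections and coarser-scale spatial pooling -- avoiding an explicit
    O(C^2) covariance matrix."""
    C, H, W = feat.shape
    x = feat.reshape(C, -1)                   # C x (H*W)
    energy = float((x ** 2).mean())           # standard sum-of-squares goodness
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(proj_dim, C)) / np.sqrt(C)
    proj = P @ x                              # cross-channel projections
    cov_term = float((proj ** 2).mean())      # second-order channel statistics
    # nested multi-scale aggregation: 2x2 average pooling as a second scale
    half = feat[:, : H // 2 * 2, : W // 2 * 2]
    half = half.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))
    scale_term = float((half ** 2).mean())    # spatial correlation statistics
    return energy + cov_term + scale_term
```

All three terms are quadratic in the features, so the measure behaves like a goodness (it scales with activation magnitude) while no longer discarding channel and spatial co-activation structure.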

[CV-89] Learning-based Statistical Refinement for Denoising

【Quick Read】: This paper addresses the mismatch between denoising results and the actual noise statistics that arises when clean image samples and prior knowledge of the noise distribution are unavailable, so that imperfect models or unreliable noise assumptions lead to suboptimal results. The key is a Bayesian formulation of an auxiliary signal in the noisy data: it exploits statistical information in the noisy observations to assess, and then improve, the consistency of a given denoiser's output with the noise statistics, refining denoising quality without access to the precise noise distribution or clean data.

Link: https://arxiv.org/abs/2605.04332
Authors: Rihuan Ke
Affiliations: University of Bristol
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:This work proposes a learning-based statistical refinement method for improving the denoising results of a given denoiser without knowing the precise noise distribution or accessing clean images or calibration data. While there are many existing successful denoising approaches for handling different kinds of noise, they typically require accurate modelling of the images and the noise (implicitly or explicitly), and hence the denoising results can be suboptimal due to different practical factors such as imperfect models, unreliable noise assumptions, or low quality data. In particular, when clean image samples are not available and there is a lack of knowledge of the underlying noise distribution, which is the case in various practical situations, the results may not well align with the noise statistics. The unawareness of the useful statistical information leads to suboptimal results. This work aims to make the best use of the statistical information to improve the consistency between the given denoising results and the noise statistics, under the assumption that the noise is conditionally pixel-wise independent given the clean signal. A method, based on a Bayesian formulation of an auxiliary signal in the noisy data, is proposed for evaluating the consistency of the denoising results, without precise information on noise distribution. By leveraging the statistical information from noisy data, the method enhances the statistical noise consistency and improves denoising quality.

[CV-90] Beyond Fixed Thresholds and Domain-Specific Benchmarks for Explainable Multi-Task Classification in Autonomous Vehicles

【Quick Read】: This paper addresses the lack of transparency and safety of deep learning models in autonomous driving, in particular how to obtain explainable driving-behavior prediction and decision making in multi-task visual understanding. The core difficulty is that fixed confidence thresholds perform poorly in multi-task settings and cannot optimize each task's performance. The key is a comprehensive confidence-threshold sensitivity analysis that adapts decision boundaries per task to improve F1-scores, together with the IUST-XAI-AD dataset (958 images with human-annotated driving decisions and their reasoning), providing a more rigorous benchmark for cross-cultural driving-behavior patterns and advancing more reliable, explainable, and culturally adaptive autonomous driving systems.

Link: https://arxiv.org/abs/2605.04299
Authors: Maryam Sadat Hosseini Azad, Shahriar Baradaran Shokouhi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Scene understanding is a vital part of autonomous driving systems, which requires the use of deep learning models. Deep learning methods are intrinsically black box models, which lack transparency and safety in autonomous driving. To make these systems transparent, multi-task visual understanding has become crucial for explainable autonomous driving perception systems, where simultaneous prediction of multiple driving behaviors and their underlying explanations is essential for safe navigation and human trust in autonomous vehicles. In order to design an accurate and cross-cultural explainable autonomous driving system, we introduce a comprehensive confidence threshold sensitivity analysis that evaluates various threshold values to identify optimal decision boundaries for different tasks. Our analysis demonstrates that traditional fixed threshold approaches are suboptimal for multi-task scenarios. Through extensive evaluation, we demonstrate that our adaptive threshold selection methodology improves F1-scores across different tasks. In addition, we introduce IUST-XAI-AD, a novel dataset consisting of 958 images with human annotations for driving decisions and corresponding reasoning. This dataset addresses the critical gap in domain-specific evaluation benchmarks for distinct driving contexts and provides a more challenging test environment compared to existing datasets. Experimental results demonstrate that confidence threshold sensitivity analysis can significantly improve model performance, while the introduction of the IUST-XAI-AD dataset reveals important insights about cross-cultural driving behavior patterns. The combined contributions of this work provide both methodological advances and practical evaluation tools that can accelerate the development of more reliable, explainable, and culturally-adaptive autonomous driving systems for global deployment.

[CV-91] Imagery Dataset for Remaining Useful Life Estimation of Synthetic Fibre Ropes

【Quick Read】: This paper addresses the absence of a publicly available image dataset for remaining useful life (RUL) estimation of synthetic fibre ropes (SFRs), in particular visual coverage of the complete degradation lifecycle under controlled cyclic fatigue loading. The key is a novel image dataset of about 34,700 high-resolution images of eleven Dyneema SK75/78 high-modulus polyethylene (HMPE) rope samples fatigued to mechanical failure at seven axial load levels (60–280 kN); after every fixed number of sheave cycles (an inspection burst), ten images are captured at different cross-sectional positions along the rope for spatially representative sampling, and each image is annotated with its elapsed cycle count so that RUL can be computed directly at any point. The dataset is intended as a benchmark for vision-based condition monitoring (CM) and prognostics algorithms, supporting machine-learning tasks such as RUL regression, damage-progression modelling, anomaly detection, and load-conditioned prognostics.

Link: https://arxiv.org/abs/2605.04262
Authors: Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic
Affiliations: Aalborg University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures, 1 table

Abstract:Remaining useful life (RUL) estimation of synthetic fibre ropes (SFRs) is critical for safe operation in offshore-crane, wind turbine installation, and heavy-load handling applications, where rope failure can result in catastrophic safety incidents and costly downtime. Despite growing research interest in data-driven condition monitoring, there is no publicly available image dataset that captures the complete degradation lifecycle of SFRs under controlled cyclic fatigue loading. To address this gap, we present a novel image dataset comprising approximately 34,700 high-resolution images of eleven Dyneema SK75/78 high-modulus polyethylene (HMPE) rope samples subjected to cyclic fatigue on a sheave-bend test stand at seven distinct axial load levels ranging from 60 kN to 280 kN. Ropes were loaded until mechanical failure, with fatigue lifetimes ranging from 695 cycles to 8,340 cycles. After every fixed number of sheave cycles (an inspection burst), ten images were captured at different cross-sectional positions along the rope, providing spatially representative sampling of surface degradation throughout the rope’s entire service life. The images obtained from each load are annotated with the corresponding elapsed cycle count, enabling a direct computation of RUL for any rope in the sequence. This dataset aims to support a broad range of machine learning (ML) tasks including RUL regression, damage progression modelling, anomaly detection, and load-conditioned prognostics. The dataset is intended to serve as a benchmark resource for the development and comparison of vision-based condition monitoring (CM) and prognostics algorithms for SFRs.
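
Since each image carries its elapsed cycle count and each rope's failure cycle is known, the RUL label described above is a direct computation. A minimal sketch (the normalized variant is a common convention for regression targets, not something the dataset mandates):

```python
def remaining_useful_life(elapsed_cycles, failure_cycles, normalized=False):
    """RUL label for a rope image annotated with its elapsed sheave-cycle
    count, given the cycle count at which that rope failed. With
    normalized=True the label is expressed as a fraction of total life."""
    if not 0 <= elapsed_cycles <= failure_cycles:
        raise ValueError("elapsed cycles must lie in [0, failure_cycles]")
    rul = failure_cycles - elapsed_cycles
    return rul / failure_cycles if normalized else rul
```

For example, an image taken at 2,000 cycles on a rope that failed at 8,340 cycles (the longest lifetime in the dataset) gets an RUL label of 6,340 cycles.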

[CV-92] Physics-Guided Regime Unmixing

【Quick Read】: This paper addresses the failure of the Linear Mixing Model (LMM) under multiple scattering, and the lack of spatial adaptivity in existing nonlinear mixing models, which apply a single fixed regime globally. The key to the proposed Physics-Guided Regime Unmixing (PGRU) is to estimate a pixel-wise scalar \xi_i \in [0,1] from observable physical features so that nonlinear mixing is activated only where physically justified; residuals from the Generalized Bilinear Model (GBM), the Post-Nonlinear Mixing Model (PPNM), and the Hapke model are fused via learned attention, producing physically interpretable regime maps.

Link: https://arxiv.org/abs/2605.04247
Authors: Paula Pacheco, Pablo Granitto, Juan B. Cabral
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Linear Mixing Model (LMM) dominates spectral unmixing for its simplicity, but fails under multiple scattering; existing nonlinear models compensate by applying a fixed regime uniformly across entire scenes. We propose Physics-Guided Regime Unmixing (PGRU), which estimates a pixel-wise scalar \xi_i \in [0,1] from observable physical features to activate nonlinear mixing only where justified. Residuals from the Generalized Bilinear Model (GBM), the Post-Nonlinear Mixing Model (PPNM), and Hapke are combined via learned attention, yielding interpretable regime maps. Experiments on Samson, Jasper Ridge, and Urban show consistent improvements over baselines, with physical coherence \rho > 0.90 .
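
The pixel-wise gating described above can be sketched as a per-pixel blend between the linear reconstruction and an attention-weighted combination of nonlinear model outputs. This is an illustrative stand-in, not PGRU itself: `nonlinear_preds` plays the role of the GBM/PPNM/Hapke branches, and the softmax weights stand in for the learned attention.

```python
import numpy as np

def regime_mix(linear_pred, nonlinear_preds, weights, xi):
    """Blend a linear-mixing reconstruction (H x W x bands) with an
    attention-weighted combination of nonlinear model outputs, gated
    per pixel by xi in [0, 1]: xi = 0 keeps the LMM, xi = 1 uses the
    fully nonlinear regime."""
    xi = np.clip(xi, 0.0, 1.0)[..., None]            # (H, W, 1) gate
    w = np.exp(weights) / np.exp(weights).sum()      # softmax attention over models
    nonlinear = np.tensordot(np.stack(nonlinear_preds, axis=0),
                             w, axes=([0], [0]))     # (H, W, bands)
    return (1.0 - xi) * linear_pred + xi * nonlinear
```

The map of xi values over the scene is exactly the "regime map" the paper interprets: pixels where multiple scattering is physically plausible carry xi near 1.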

[CV-93] Densification and forecasting of Sentinel-2 time series from multimodal SAR and Optical satellite data using deep generative models

【Quick Read】: This paper addresses the irregular temporal sampling of optical satellite image time series caused by clouds and swath edges, which limits continuous monitoring. Existing interpolation and reconstruction methods only fill gaps within the observed time span and cannot forecast future observations. The key is a probabilistic deep-learning framework that densifies and forecasts Sentinel-2 time series by jointly exploiting multimodal Sentinel-2 optical and Sentinel-1 SAR data to generate optical images at arbitrary past or future dates, with particular attention to the uncertainty of the generated images, improving reconstruction and forecasting on sparse and temporally misaligned data.

Link: https://arxiv.org/abs/2605.04239
Authors: Véronique Defonte, Dawa Derksen, Alexandre Constantin, Bastien Nespoulous
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical satellite image time series are extensively used in many Earth observation applications, including agriculture, climate monitoring, and land surface analysis. However, clouds and swath edges result in irregular sampling along the temporal dimension, limiting continuous monitoring. To address this issue, a growing body of work has focused on temporal densification and reconstruction of satellite image time series, with the objective of filling missing or cloud-contaminated observations within the temporal extent of the available data. While these approaches improve temporal continuity, they are inherently restricted to the reconstruction of the gaps within the observed time periods, and do not address the prediction of future observations. This work proposes a probabilistic deep learning framework for the densification and forecasting of Sentinel-2 time series by generating optical images at arbitrary past or future dates. The approach leverages multimodal satellite data by jointly exploiting Sentinel-2 optical and Sentinel-1 SAR observations. Unlike most existing works, we propose to focus on the uncertainty of the generated images. Experimental results demonstrate effective densification and forecasting, on sparse and temporally misaligned time series.

[CV-94] Disentangled Learning Improves Implicit Neural Representations for Medical Reconstruction

【Quick Read】: This paper addresses the low training efficiency and suboptimal imaging quality of classical implicit neural representations (INRs) in medical image reconstruction, and the catastrophic forgetting of existing initialization-based methods that depend on high-quality images. The key to the proposed DisINR framework is to explicitly disentangle shared and subject-specific representations: a shared encoder-decoder pair and subject-specific encoders are introduced, and at test time only the latter are optimized while the former stay frozen to preserve pretrained priors; differentiable forward models allow the shared modules to be pretrained directly from limited raw measurements, removing the need for pre-acquired high-quality images and substantially improving reconstruction accuracy and efficiency.

Link: https://arxiv.org/abs/2605.04234
Authors: Qing Wu, Xuanyu Tian, Chenhe Du, Haonan Zhang, Xiao Wang, Le Lu, Yuyao Zhang
Affiliations: Medical AI Lab, Ant Group; ShanghaiTech University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages

Abstract:Implicit neural representations (INRs) have emerged as a powerful paradigm for medical imaging via physics-informed unsupervised learning. Classical INRs optimize an entire network from scratch for each subject, leading to inefficient training and suboptimal imaging quality. Recent initialization-based approaches attempt to inject population priors into pre-trained networks, yet they rely on high-quality images and often suffer from catastrophic forgetting during fine-tuning. We present DisINR, a novel INR framework that explicitly disentangles shared and subject-specific representations. DisINR introduces a shared encoder-decoder pair and subject-specific encoders, whose features are jointly decoded for image reconstruction. By integrating differentiable forward models, it pre-trains the shared modules directly from limited raw measurements, removing the need for pre-acquired high-quality images. During test-time adaptation, only the subject-specific encoder is optimized, while the shared pair remains frozen, effectively preserving learned priors. Extensive evaluations on three representative medical imaging tasks show that DisINR significantly outperforms state-of-the-art INRs in both reconstruction accuracy and efficiency.

[CV-95] Anatomy of a failure: When how and why deep vision fails in scientific domains

【Quick Read】: This paper examines whether deep learning (DL) is suitable for scientific imaging: when DL models are applied naively to information-dense scientific images with complex physicochemical properties, such as infrared (IR) spectroscopic imaging, a mismatch between the data priors and DL's simplicity bias can cause severe degradation or outright failure. The key finding is this prior-bias mismatch mechanism: the multi-channel nature of IR data interacts poorly with DL's preference for low-dimensional structure, collapsing models to one-dimensional predictions that waste their representational capacity and raise AI-safety concerns. The paper further shows that mainstream DL robustification strategies, designed and validated mainly for RGB images, do not resolve the problem, and argues for modality-specific failure-mode analysis to guide specialized, safe AI algorithms for scientific imaging.

Link: https://arxiv.org/abs/2605.04231
Authors: Ji-Hun Oh, Dou Hoon Kwark, Kianoush Falahkheirkhah, Kevin Yeh, John Cheville, Volodymyr Kindratenko, Rohit Bhargava
Affiliations: University of Illinois Urbana-Champaign; Mayo Clinic; CZ Biohub Chicago, LLC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mirroring its ubiquity in popular media and all human activities, the use of deep learning (DL) is rapidly growing in scientific imaging modalities. However, unlike everyday RGB pictures, pixels encode precise physicochemical properties in scientific imaging across potentially thousands of channels. While DL is well validated on human-centric RGB perceptual tasks, its effectiveness for scientific imaging remains uncertain. Here, we show that the naive application of DL frameworks to scientific images can lead to critical failures. We evaluate the use of DL for pathology, comparing RGB images of stained tissue with the quantitative and information-rich biochemical signatures of infrared (IR) imaging. Despite this informational advantage, DL models trained on IR data paradoxically underperform. We investigate this discrepancy to find that IR data priors interact poorly with the simplicity bias of DL, causing models to collapse to one-dimensional predictions. This constitutes a catastrophic DL failure because the model’s representational capacity remains largely unused, while furthermore raising AI safety concerns and undermining the advantages of such scientific modalities. Notably, this problem persists even with state-of-the-art DL robustification strategies, which are primarily designed and validated for RGB imagery and thus inherit the same prior-bias mismatch. This work establishes a framework for understanding the limitations of generic DL in science and advocates for the study of modality-specific failure modes to guide the development of specialized, safe AI algorithms.

[CV-96] Topology-Constrained Quantized nnUNet for Efficient and Anatomically Accurate 3D Tooth Segmentation

【Quick Read】: This paper addresses the spatial distortion introduced by quantizing deep models, specifically how to preserve anatomical accuracy in 3D tooth segmentation while keeping it computationally efficient. The key is a topology-constrained quantized nnUNet framework that adds a tooth-specific topological loss to quantization-aware training, jointly optimizing cross-entropy, a quantization regularizer, and topological constraints, so that tooth count, adjacency relations, and cavity integrity are preserved without modifying the network architecture. An 8-bit integer quantized backbone with dynamically calibrated weights and activations minimizes inference-time precision loss, enabling hardware-friendly integer-only deployment while significantly reducing topological errors to a clinically acceptable level.

Link: https://arxiv.org/abs/2605.04201
Authors: Paarth Prasad, Ruchika Malhotra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a topology-constrained quantized nnUNet framework for efficient and anatomically accurate 3D tooth segmentation, addressing the challenges of spatial distortion introduced by quantization in deep learning models. The proposed method integrates a novel tooth-specific topological loss into quantization-aware training, preserving critical anatomical structures such as tooth count, adjacency relationships, and cavity integrity while maintaining computational efficiency. The system employs an 8-bit quantized nnUNet backbone, where weights and activations are dynamically calibrated to minimize precision loss during inference. Furthermore, the topological loss combines connected-component analysis, adjacency consistency, and hole detection penalties, ensuring anatomical fidelity without modifying the underlying network architecture. The joint optimization objective harmonizes cross-entropy loss, quantization regularization, and topological constraints, enabling end-to-end training with gradient approximations for persistent homology terms. Experiments demonstrate that our approach significantly reduces topological errors compared to conventional quantized models, achieving clinically plausible segmentations on dental CBCT scans. The method retains the hardware efficiency of integer-only inference, making it suitable for deployment in resource-constrained clinical environments. This work bridges the gap between computational efficiency and anatomical precision in medical image segmentation, offering a practical solution for real-world dental applications.
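
One ingredient of the topological loss, the connected-component (tooth-count) penalty, can be sketched on a hard binary mask. This shows only the forward computation on a 2-D slice; the paper trains through such terms with gradient approximations for the persistent-homology quantities, which is omitted here.

```python
import numpy as np

def count_components(mask):
    """Count 4-connected foreground components in a binary 2-D mask
    via iterative flood fill."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    n = 0
    H, W = mask.shape
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                n += 1                       # new component found
                stack = [(i, j)]
                seen[i, j] = True
                while stack:                 # flood-fill this component
                    a, b = stack.pop()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        x, y = a + da, b + db
                        if 0 <= x < H and 0 <= y < W and mask[x, y] and not seen[x, y]:
                            seen[x, y] = True
                            stack.append((x, y))
    return n

def tooth_count_penalty(mask, expected_teeth):
    """Topology penalty: squared error between the connected-component
    count of a segmentation mask and the expected tooth count."""
    return float((count_components(mask) - expected_teeth) ** 2)
```

Analogous penalties for adjacency consistency and hole (cavity) detection would be combined with this term and the cross-entropy and quantization losses in the joint objective.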

[CV-97] MuCALD-SplitFed: Causal-Latent Diffusion for Privacy-Preserving Multi-Task Split-Federated Medical Image Segmentation ICIP2026

【Quick Read】: This paper addresses the unstable convergence and privacy leakage of multi-task federated learning in healthcare caused by task heterogeneity across clinical institutions. Standard federated learning (FL) and Split Federated Learning (SplitFed) poorly match real clinical workflows in which different institutions perform different tasks, and existing multi-task approaches often introduce training instability and expose sensitive information. The key to the proposed MuCALD-SplitFed framework is the integration of causal representation learning with latent diffusion: modeling the causal structure across tasks strengthens cross-task consistency, while the diffusion mechanism suppresses reconstruction and membership-inference attacks at the split points, enabling more stable and more secure multi-task collaborative training.

Link: https://arxiv.org/abs/2605.04108
Authors: Chamani Shiranthika, Hadi Hadizadeh, Parvaneh Saeedi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to oral presentation and conference proceedings at IEEE International Conference on Image Processing (ICIP 2026), Finland

Abstract:Federated Learning enables decentralized training by aggregating model updates across clients without sharing raw data, while Split Federated Learning further partitions the model between clients and a server to reduce computation and communication at the client side. However, decentralized medical institutions rarely operate on a single shared task, making standard Federated and SplitFed collaborations poorly aligned with real clinical workflows. Multi-task FL extends these frameworks by allowing clients to handle different tasks, but often introduces instability and privacy vulnerabilities. This study proposes MuCALD-SplitFed, a multi-task SplitFed framework that integrates causal representation learning and latent diffusion. Experiments show MuCALD-SplitFed consistently improves segmentation, while baseline SplitFed fails to converge. The proposed approach further reduces information leakage at split points, mitigating reconstruction-based and membership inference attacks. Additionally, MuCALD-SplitFed outperforms state-of-the-art personalized FL and multi-task FL approaches. The code repository is: this https URL.

[CV-98] Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在公开皮肤病学基准测试中表现良好,但其临床实际应用能力存在显著“基准到床边”差距的问题。研究通过在包含5,811例真实世界皮肤科会诊病例的多中心医院数据集上评估四种开源权重MLLM与一种商用MLLM(GPT-4.1),发现模型在仅使用图像时诊断准确率大幅下降,且对不完整或错误的临床背景信息高度敏感;解决方案的关键在于引入结构化临床上下文以提升诊断准确性(最高达28.75% top-3准确率),同时指出当前模型在严重程度分诊任务中虽具一定筛查潜力(敏感性>60%),但仍缺乏足够的可靠性以支持临床部署。

链接: https://arxiv.org/abs/2605.04098
作者: Roy Jiang,Hyunjae Kim,Zhenyue Qin,Morten Lee,Margaret MacGibeny,Ailish Hanly,Angela Sadlowski,Shanin Chowdhury,Xuguang Ai,Jeffrey Gehlhausen,Qingyu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.

[CV-99] Improving Medical VQA through Trajectory-Aware Process Supervision

【速读】:该论文旨在解决医疗视觉问答(Medical Visual Question Answering, Medical VQA)中推理能力不足的问题,尤其是现有数据集普遍缺乏对推理过程的标注。为应对这一挑战,作者提出了一种利用COMCTS算法为六个医疗VQA基准数据集生成推理轨迹的方法,并引入一个两阶段训练框架:首先进行监督微调(Supervised Fine-Tuning, SFT),随后采用Group Relative Policy Optimization(GRPO)结合一种新颖的过程感知奖励(process-based reward)。其关键创新在于设计了基于动态时间规整(Dynamic Time Warping, DTW)的距离度量机制,通过句子嵌入(sentence transformers)对推理步骤向量化后计算生成轨迹与真实轨迹之间的相似性,从而实现对推理过程本身的监督。实验表明,该方法显著提升了模型在多个基准上的准确率和文本质量指标,验证了过程监督对于训练具备可靠推理能力的医疗视觉语言模型(Medical Vision-Language Models, VLMs)的重要性。

链接: https://arxiv.org/abs/2605.04064
作者: Halil Ibrahim Gulluk,Olivier Gevaert
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. 
We make our code and generated reasoning datasets publicly available at this https URL.
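该摘要中过程奖励的核心是:将推理步骤逐句嵌入后,用动态时间规整(DTW)度量生成轨迹与参考轨迹的对齐代价。下面给出一个纯 Python 的极简示意(嵌入向量为占位数据,真实实现中来自 sentence-transformers;将距离映射为奖励的 1/(1+d) 形式为本文假设,非论文原式):

```python
import math

def dtw_distance(seq_a, seq_b, dist):
    """经典动态时间规整:返回两条向量序列之间的最小累计对齐代价。"""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def process_reward(gen_steps, ref_steps):
    """把 DTW 距离映射为 (0, 1] 区间的过程奖励:距离越小奖励越高。"""
    d = dtw_distance(gen_steps, ref_steps, euclidean)
    return 1.0 / (1.0 + d)

# 两条“推理轨迹”的占位嵌入(真实实现中由 sentence-transformers 编码得到)
gen = [[0.1, 0.2], [0.4, 0.4], [0.9, 0.8]]
ref = [[0.1, 0.2], [0.5, 0.4], [0.9, 0.8]]
print(process_reward(gen, ref))  # 与自身比较时奖励为 1.0,越相似越接近 1
```

奖励随轨迹相似度单调上升,可与最终答案的精确匹配奖励加权组合后作为 GRPO 的总奖励。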

[CV-100] Lookahead Drifting Model

【速读】:该论文旨在解决图像生成任务中分布映射的精度与效率问题,尤其针对现有基于漂移模型(drifting model)方法在训练过程中仅利用单一漂移项导致梯度信息利用不充分的问题。其解决方案的关键在于提出一种前瞻式漂移模型(lookahead drifting model),即在每个训练迭代中依次计算一组漂移项,这些漂移项不仅依赖于当前模型输出和正样本,还融合了先前已计算的漂移项信息;通过合理缩放使各漂移项幅度处于可比范围,从而使得后期漂移项能够捕获更高阶的梯度信息,并最终以加权求和的方式引导模型输出向目标分布逼近,实验表明该方法在CIFAR10等数据集上优于基线模型。

链接: https://arxiv.org/abs/2605.04060
作者: Guoqiang Zhang,Kenta Niwa,W. Bastiaan Kleijn
机构: University of Exeter (埃克塞特大学); NTT Communication Science Laboratories (NTT通信科学实验室); Victoria University of Wellington (惠灵顿维多利亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, a new paradigm named drifting model has been proposed for mapping distributions, which achieves the SOTA image generation performance over ImageNet via one-step neural functional evaluation (NFE). The basic idea is to compute a drifting term at each training iteration and then push the output of the model towards the direction of the drifting term. In this paper, we propose a lookahead drifting model. At each training iteration, we compute a set of drifting terms sequentially. Each drifting term is calculated by making use of previously computed ones as well as the positive samples and the output of the model. One key step is to properly scale the drifting terms so that their magnitudes are in a comparable range. In principle, the drifting terms obtained at a later stage capture higher order gradient information towards the positive samples. At each training iteration, the model is optimized by pushing its output towards the direction of the (weighted) summation of the drifting terms. Experimental results on toy examples and CIFAR10 demonstrate the better performance of the new method than the baseline.
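按摘要描述,漂移项是顺序计算的:每个新漂移项利用此前已得的漂移项(即“前瞻”后的临时输出位置),最终按加权求和推动模型输出。下面用一维玩具示例勾勒这一迭代结构(完全是示意性实现,步长、等权平均与漂移方向的定义均为假设,非论文公式):

```python
def lookahead_drift_update(output, positives, num_terms=3, step=0.1):
    """一维玩具示意:顺序计算 num_terms 个漂移项,每一项基于“前瞻”后的
    临时输出位置;最终按等权平均推动模型输出靠近正样本。"""
    target = sum(positives) / len(positives)
    x = output
    drifts = []
    for _ in range(num_terms):
        d = target - x            # 当前临时位置指向正样本中心的漂移项
        drifts.append(d)
        x = x + step * d          # 前瞻:用已得漂移项临时推进,再求下一项
    total = sum(drifts) / len(drifts)   # 各项幅度相近,等权求和
    return output + step * total

y = lookahead_drift_update(0.0, [1.0])
print(y)  # 更新后输出更靠近正样本 1.0
```

后期漂移项基于前瞻位置计算,因而隐含了朝正样本方向的高阶信息,这正是摘要所述的直觉。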

[CV-101] Continual Distillation of Teachers from Different Domains CVPR2026

【速读】:该论文旨在解决持续蒸馏(Continual Distillation, CD)过程中面临的两个核心问题:一是教师模型的训练数据不可获取,导致学生模型难以有效学习;二是不同教师模型具有异质性知识,导致在顺序学习中出现“未见知识遗忘”(Unseen Knowledge Forgetting, UKF)现象,即先前学到的知识在后续训练中被覆盖或丢失。解决方案的关键在于提出自外部数据蒸馏(Self External Data Distillation, SE2D),通过在训练过程中保留对学生在外部未标注数据上的 logits 信息,实现对异质教师模型的知识稳定传递,从而平衡“未见知识迁移”(Unseen Knowledge Transfer, UKT)与 UKF,提升跨域泛化性能。

链接: https://arxiv.org/abs/2605.04059
作者: Nicolas Michel,Maorong Wang,Jiangpeng He,Toshihiko Yamasaki
机构: The University of Tokyo (东京大学); National Institute of Informatics (国立信息学研究所); Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Deep learning models continue to scale, with some requiring more storage than many large-scale datasets. Thus, we introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains not present in the training data, while known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF) when transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code and implementation for this work are publicly available at: this https URL.
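SE2D 的关键是在蒸馏当前教师的同时,保留学生此前在外部无标注数据上的 logits,以缓解 UKF。其损失结构可以用纯 Python 勾勒如下(KL 散度的组合形式为示意,λ 为假设的超参数,非论文官方实现):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q),约定 0 * log(0/q) = 0。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def se2d_loss(student_logits, teacher_logits, snapshot_logits, lam=1.0):
    """当前教师的蒸馏项 + 外部数据上旧 logits 快照的保持项(缓解 UKF)。"""
    p_student = softmax(student_logits)
    distill = kl(softmax(teacher_logits), p_student)
    preserve = kl(softmax(snapshot_logits), p_student)
    return distill + lam * preserve

# 学生若偏离此前快照的预测,保持项就会增大
print(se2d_loss([2.0, 0.5], [2.0, 0.5], [2.0, 0.5]))   # 三者一致时损失为 0
print(se2d_loss([0.5, 2.0], [2.0, 0.5], [2.0, 0.5]) > 0)
```

λ 控制“学习新教师”与“保持旧知识”之间的权衡,对应摘要中 UKT 与 UKF 的折中。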

[CV-102] Constraint-Aware Execution Planning for Hybrid Space-Ground Compute Workloads

【速读】:该论文旨在解决低轨卫星(Low Earth Orbit, LEO)在计算能力日益增强背景下,由于数据生成速率远超下行链路传输能力所导致的资源调度难题。具体而言,需在有限的轨道接触窗口内,合理分配计算任务至星上或地面处理、优化中间数据的跨空间-地面传输路径,并确保在噪声信道中维持可靠交付。解决方案的关键在于提出Constraint-Aware Execution (CAE)规划系统,其核心创新包括:基于SGP4轨道传播与地站可见性预测构建物理约束环境;通过成本模型实现计算部署决策以平衡星载资源消耗与数据传输开销;引入自适应前向纠错(FEC)和安全开销建模进行高效数据传输插入;以及在功率、热控、计算与通信等多维约束下,采用贪心首次适应策略对执行计划进行调度。该方案可在两秒内生成可行计划,有效利用星上数据压缩减少传输量,并动态调整FEC参数以适应信道变化,已作为生产级API投入实际应用。

链接: https://arxiv.org/abs/2605.04052
作者: Subhadip Mitra
机构: RotaStellar
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 6 figures, 2 algorithms, 4 tables

点击查看摘要

Abstract:Low Earth orbit (LEO) satellites increasingly carry compute hardware capable of on-board processing, yet each satellite generates roughly two orders of magnitude more data than it can downlink per orbit. This mismatch forces operators to decide, for every workload, which computation runs on-board and which runs on the ground, how intermediate data crosses the space-ground boundary through narrow contact windows, and how to maintain delivery guarantees over noisy channels. We present Constraint-Aware Execution (CAE), a planning system that takes a satellite identifier, a workload expressed as a directed acyclic graph of processing steps, and a set of orbital and resource constraints, and produces a deterministic, physically grounded execution plan. CAE operates in four phases: (1) orbital environment construction via SGP4 propagation with eclipse detection and ground station pass prediction, (2) compute placement using a cost model that compares on-board resource consumption against transfer overhead, (3) transfer insertion with adaptive forward error correction and security overhead modeling, and (4) greedy first-fit scheduling into orbital windows under power, thermal, compute, and communication constraints. We evaluate CAE against five representative workload patterns across satellites in distinct orbital regimes and demonstrate that the system produces feasible plans in under two seconds, correctly exploits onboard data reduction to minimize transfer volume, and adapts FEC and multi-pass allocation to varying channel conditions. CAE is deployed as a production API computing plans for any cataloged satellite using live two-line element data.
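摘要第 (4) 阶段的贪心 first-fit 调度思想很直接:按序把每个任务放入第一个同时满足剩余约束的轨道窗口。下面给出一个极简示意(任务与窗口的字段、单位均为假设,未包含热控与通信约束):

```python
def first_fit_schedule(tasks, windows):
    """贪心 first-fit:按顺序把每个任务放进第一个剩余时长与能量都足够的窗口。

    tasks:   [(名称, 时长_s, 能量_J), ...]
    windows: [[剩余时长_s, 剩余能量_J], ...](就地扣减)
    返回 {任务名: 窗口索引},放不下的任务映射为 None。
    """
    placement = {}
    for name, dur, energy in tasks:
        placement[name] = None
        for i, w in enumerate(windows):
            if w[0] >= dur and w[1] >= energy:
                w[0] -= dur
                w[1] -= energy
                placement[name] = i
                break
    return placement

# 两个过境窗口与三个任务(数值纯属示意)
windows = [[600, 50], [300, 120]]
tasks = [("compress", 200, 30), ("downlink", 250, 40), ("retry", 100, 90)]
plan = first_fit_schedule(tasks, windows)
print(plan)  # retry 在两个窗口都不同时满足时长与能量约束,无法放置
```

真实系统中每个窗口还要叠加功率、热控与 FEC 开销约束,但判定逻辑与此同构。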

[CV-103] External Validation of Deep Learning Models for BI-RADS Breast Density Prediction from Ultrasound Images

【速读】:该论文旨在解决如何利用深度学习模型从乳腺超声检查中准确预测乳腺密度(breast density)的问题,以弥补传统基于乳腺X线摄影(mammography)报告的局限性。其关键解决方案是通过外部验证三个预训练的深度学习模型(DenseNet121、ViT-B/32 和 ResNet50),在独立队列中评估它们对不同乳腺密度类别(A–D类)的分类性能,并进一步将AI推导的密度信息整合进Tyrer-Cuzick风险模型以评估10年乳腺癌风险预测能力。结果显示,模型在极致致密型乳腺(D类)和脂肪型(A类)中表现最优,且整体性能稳定,表明深度学习方法具备良好的泛化能力,尤其适用于种族构成不同的外部数据集,但对异质致密型乳腺(C类)仍存在改进空间。

链接: https://arxiv.org/abs/2605.05082
作者: Yuxuan Chen,Arianna Bunnell,Yanqi Xu,Haoyan Yang,Thomas K. Wolfgruber,John A. Shepherd,Yiqiu Shen
机构: Perlmutter Cancer Center (佩尔穆特癌症中心); NYU Langone Health (纽约大学朗格尼健康); Department of Radiology (放射科); University of Hawai’i Cancer Center (夏威夷大学癌症中心); University of Hawai’i at Mānoa (夏威夷大学马诺阿分校); Center for Data Science (数据科学中心); New York University (纽约大学); Department of Computer Science (计算机科学系); Stony Brook University (石溪大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 18th International Workshop on Breast Imaging (IWBI 2026)

点击查看摘要

Abstract:We externally validated three deep learning models (DenseNet121, ViT-B/32, and ResNet50) for predicting mammographic breast density from breast ultrasound exams on an independent cohort. The external validation set comprised 2,000 ultrasound exams, including 500 cancer cases defined by an initial negative exam (BI-RADS 1 or 2) followed by a cancer diagnosis within 6 months to 10 years, and 1,500 negative controls matched by manufacturer and study year. Performance was measured using patient-level AUROC across four density categories: A (fatty), B (scattered), C (heterogeneous), and D (extremely dense). As a downstream assessment, we also evaluated 10-year risk prediction by incorporating age and AI-derived density into the Tyrer-Cuzick model and comparing performance against a reference model using age and mammography-reported density. All three models performed best in extremely dense breasts (AUROC 0.868-0.899), with strong performance in fatty (0.814-0.838) and scattered density (0.764-0.799), and lower performance in heterogeneously dense breasts (0.699-0.729). DenseNet121 achieved the highest overall performance (micro-averaged AUROC 0.885), and performance across categories was comparable between internal and external testing. For risk modeling, age combined with AI-derived density yielded a lower AUROC than age combined with mammography-reported density (0.541 vs. 0.570; p = 0.23), with no statistically significant difference. These findings indicate that deep learning models generalize well to external data with different racial composition for breast density assessment. While performance is strongest in extremely dense breasts, heterogeneously dense remains more challenging, highlighting the need for targeted optimization.

[CV-104] Segmenting proto-halos with vision transformers

【速读】:该论文旨在解决从早期宇宙中小尺度引力扰动中预测暗物质晕(dark-matter halo)最终质量的问题,这是一个高度非线性的过程,传统方法依赖于N体模拟,计算成本高昂。解决方案的关键在于引入深度学习模型对初始密度场中的原晕(proto-halo)区域进行分割与分类,以实现高效且高精度的预测。研究对比了两种架构:基于V-Net设计的全卷积神经网络(CNN)和U-Net Transformer,发现基于Transformer的模型在所有评估指标上均显著优于CNN,尤其在低质量晕和原晕边界重建方面表现优异,误差可控制在亚百分之一以内。此外,通过引入潮汐剪切(tidal shear)特征作为输入,进一步提升了模型性能,表明多源物理信息融合是提升预测准确性的关键因素之一。

链接: https://arxiv.org/abs/2508.00049
作者: Toka Alokda,Cristiano Porciani
机构: 未知
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 39 pages, 14 figures, 11 tables; updated to match the published version: JCAP11(2025)083

点击查看摘要

Abstract:The formation of dark-matter halos from small cosmological perturbations generated in the early universe is a highly non-linear process typically modeled through N-body simulations. In this work, we explore the use of deep learning to segment and classify proto-halo regions in the initial density field according to their final halo mass at redshift z=0. We compare two architectures: a fully convolutional neural network (CNN) based on the V-Net design and a U-Net transformer. We find that the transformer-based network significantly outperforms the CNN across all metrics, achieving sub-percent error in the total segmented mass per halo class. Both networks deliver much higher accuracy than the perturbation-theory-based model PINOCCHIO, especially at low halo masses and in the detailed reconstruction of proto-halo boundaries. We also investigate the impact of different input features by training models on the density field, the tidal shear, and their combination. Finally, we use Grad-CAM to generate class-activation heatmaps for the CNN, providing preliminary yet suggestive insights into how the network exploits the input fields.

人工智能

[AI-0] LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

【速读】:该论文旨在解决长时程搜索代理(long-horizon search agents)在推理、工具调用和信息观测过程中因持续累积中间内容而导致的工作上下文迅速膨胀的问题,这会显著增加计算成本并提升错误风险。解决方案的关键在于提出一种自适应的上下文管理机制——Context-ReAct,其核心是通过五个原子操作(Skip、Compress、Rollback、Snippet 和 Delete)实现对工作上下文的弹性编排,使代理能够根据任务相关性动态调整轨迹的不同部分的细节层级,从而保留关键证据、压缩已解决信息、丢弃无用分支并控制上下文规模。其中,Compress 操作具有表达完备性,其他操作则提供效率与保真度保障,有效降低生成成本和幻觉风险。基于此范式构建的 LongSeeker 在多个搜索基准上显著优于现有方法,验证了自适应上下文管理对于提升长时程推理可靠性与效率的重要性。

链接: https://arxiv.org/abs/2605.05191
作者: Yijun Lu,Rui Ye,Yuwen Du,Jiajun Wang,Songhua Liu,Siheng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Long-horizon search agents must manage a rapidly growing working context as they reason, call tools, and observe information. Naively accumulating all intermediate content can overwhelm the agent, increasing costs and the risk of errors. We propose that effective context management should be adaptive: parts of the agent’s trajectory are maintained at different levels of detail depending on their current relevance to the task. To operationalize this principle, we introduce Context-ReAct, a general agentic paradigm for elastic context orchestration that integrates reasoning, context management, and tool use in a unified loop. Context-ReAct provides five atomic operations: Skip, Compress, Rollback, Snippet and Delete, which allow the agent to dynamically reshape its working context, preserving important evidence, summarizing resolved information, discarding unhelpful branches, and controlling context size. We prove that the Compress operator is expressively complete, while the other specialized operators provide efficiency and fidelity guarantees that reduce generation cost and hallucination risk. Building on this paradigm, we develop LongSeeker, a long-horizon search agent fine-tuned from Qwen3-30B-A3B on 10k synthesized trajectories. Across four representative search benchmarks, LongSeeker achieves 61.5% on BrowseComp and 62.5% on BrowseComp-ZH, substantially outperforming Tongyi DeepResearch (43.2% and 46.7%) and AgentFold (36.2% and 47.3%). These results highlight the potential of adaptive context management, showing that agents can achieve more reliable and efficient long-horizon reasoning by actively shaping their working memory.
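Context-ReAct 的五个原子操作本质上是对工作上下文(条目序列)的编辑操作。下面用一个极简的 Python 类勾勒其中三个操作的语义(类与方法签名均为示意,非 LongSeeker 官方接口):

```python
class WorkingContext:
    """工作上下文的极简示意:条目列表 + 三个原子操作。"""

    def __init__(self):
        self.entries = []          # [(条目类型, 文本)]

    def append(self, kind, text):
        self.entries.append((kind, text))

    def compress(self, i, j, summary):
        """Compress:把 [i, j) 区间的条目压缩为一条摘要,缩小上下文规模。"""
        self.entries[i:j] = [("summary", summary)]

    def delete(self, i):
        """Delete:丢弃无用分支对应的条目。"""
        del self.entries[i]

    def rollback(self, i):
        """Rollback:回到第 i 条之前的状态,放弃其后的所有内容。"""
        self.entries = self.entries[:i]

    def render(self):
        return "\n".join(f"[{k}] {t}" for k, t in self.entries)

ctx = WorkingContext()
ctx.append("thought", "step 1: search query A")
ctx.append("obs", "long page content ...")
ctx.append("obs", "another long page ...")
ctx.compress(1, 3, "两个页面均指向来源 X")
print(ctx.render())  # 两条长观测被压缩为一条摘要
```

论文证明的要点即在于此:Compress 在表达能力上是完备的,而 Skip/Delete/Rollback/Snippet 等专用操作以更低代价换取效率与保真度。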

[AI-1] When Life Gives You BC Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

【速读】:该论文旨在解决行为克隆(Behavior Cloning, BC)在收集完示范数据后缺乏在线自我改进机制的问题,以及现有离线到在线学习方法因离线数据分布与在线学习分布不匹配而导致策略替换先前良好动作的缺陷。解决方案的关键在于提出Q2RL算法,其核心由两部分构成:(1) Q-Estimation通过少量环境交互从BC策略中提取Q函数;(2) Q-Gating基于BC和强化学习(Reinforcement Learning, RL)策略的Q值动态切换动作,从而高效收集用于RL策略训练的样本。该方法实现了离线到在线学习的稳定迁移,在D4RL和robomimic基准任务上显著提升成功率和收敛速度,并可在机器人端实现1–2小时内完成高精度接触密集型任务(如管道装配和分拣)的鲁棒策略学习。

链接: https://arxiv.org/abs/2605.05172
作者: Lakshita Dodeja,Ondrej Biza,Shivam Vats,Stephen Hart,Stefanie Tellex,Robin Walters,Karl Schmeckpeper,Thomas Weng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at this https URL
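Q-Gating 的决策规则可以一句话概括:分别取 BC 与 RL 策略的动作,执行 Q 值更高者,由此产生训练样本。下面是一个玩具示意(Q 函数与两个策略均为假设的占位实现):

```python
def q_gating_action(state, bc_policy, rl_policy, q_fn):
    """Q-Gating:比较 BC 与 RL 策略各自动作的 Q 值,执行估值更高的动作。"""
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    return a_rl if q_fn(state, a_rl) >= q_fn(state, a_bc) else a_bc

# 玩具示例:一维动作空间,假设的 Q 函数偏好接近 0.5 的动作
q_fn = lambda s, a: -abs(a - 0.5)
bc_policy = lambda s: 0.2
rl_policy = lambda s: 0.6
chosen = q_gating_action(0.0, bc_policy, rl_policy, q_fn)
print(chosen)  # 0.6:RL 动作的 Q 值更高,样本由 RL 动作产生
```

训练早期 RL 策略较差时,门控会自然退回 BC 动作,从而避免摘要中提到的“覆盖先前良好动作”的问题。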

[AI-2] Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在复杂硬件设计任务中自主性和效率不足的问题。其解决方案的关键在于构建一个基于前沿模型(frontier models)的多智能体协作框架——“Design Conductor”,该框架能够自主完成规模较前作扩大80倍的设计任务,并在无需人工干预的情况下完成从架构定义到物理实现的全流程设计。系统通过引入结构化多阶段工作流、强化推理与执行分离机制以及针对特定计算负载(如TurboQuant加速器VerTQ)的定制化优化,显著提升了设计质量与效率,例如实现了5129个FP16/32计算单元的硬连线加速器设计并在FPGA上成功部署。

链接: https://arxiv.org/abs/2605.05170
作者: TheVerkor Team:Ravi Krishna,Suresh Krishna,David Chin
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced “Design Conductor” (or just “Conductor”), a system capable of building a 5-stage Linux-capable RISC-V CPU in 12 hours. In this work, we introduce an updated multi-agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. Following a brief introduction, we examine 4 designs that the system produced autonomously, including “VerTQ”, an LLM inference accelerator which hard-wires support for TurboQuant in a 240-cycle pipeline, starting from the TurboQuant arXiv paper. VerTQ includes heavy compute processing, with 5129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm^2 in TSMC 16FF (8 attention pipes). We review the key new characteristics that enabled these results. Finally, we analyze Design Conductor’s token usage and other empirical characteristics, including its limitations.

[AI-3] Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

【速读】:该论文试图解决的问题是:尽管Transformer架构在时间序列预测中广泛应用,但其成功背后的表征机制是否与自然语言处理(NLP)中的机制一致仍不明确;同时,简单线性模型(如DLinear)为何在基准测试中持续表现出竞争力,缺乏机制层面的解释。解决方案的关键在于采用稀疏自编码器(Sparse Autoencoders, SAEs)这一机械可解释性工具,对PatchTST模型中前馈网络(Feed-Forward Network, FFN)的中间激活层进行探查。研究发现,即使在字典维度扩展至原维度4倍的情况下,下游预测性能变化极小(平均仅0.214%),且大量冗余特征未被激活;进一步的因果干预实验表明,主导潜在特征的扰动对预测结果影响微弱,说明FFN表示具有稀疏性和稳定性,无需依赖强超位置(superposition)机制即可实现优异性能,从而解释了为何简单线性模型也能表现良好。

链接: https://arxiv.org/abs/2605.05151
作者: Alper Yıldırım
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 tables

点击查看摘要

Abstract:Transformer architectures have been widely adopted for time series forecasting, yet whether the representational mechanisms that make them powerful in NLP actually engage on time series data remains unexplored. The persistent competitiveness of simple linear models such as DLinear has fueled ongoing debate, but no mechanistic explanation for this phenomenon has been offered. We address this gap by applying sparse autoencoders (SAEs), a tool from mechanistic interpretability, to probe the internal representations of PatchTST. We first establish that a single-layer, narrow-dimensional transformer matches the forecasting performance of deeper configurations across commonly used benchmarks. We then train SAEs on the post-GELU intermediate FFN activations with dictionary sizes ranging from 0.5x to 4.0x the native dimensionality. Expanding the dictionary yields negligible downstream performance change (average 0.214%), with large portions of overcomplete dictionaries remaining inactive. Targeted causal interventions on dominant latent features produce minimal forecast perturbation. Across all evaluated settings, we observe no empirical evidence that the analyzed FFN representations rely on strong superposition. Instead, the representations remain sparse, stable under aggressive dictionary expansion, and largely insensitive to latent interventions. These results demonstrate that superposition is not necessary for competitive performance on standard forecasting benchmarks, suggesting they may not demand the rich compositional representations that drive transformer success in language modeling, and helping explain the persistent competitiveness of simple linear models
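论文用稀疏自编码器(SAE)探查 FFN 的中间激活:编码端经 ReLU 产生稀疏潜码,解码端线性重构,损失为重构误差加 L1 稀疏惩罚。下面是一个纯 Python 的最小示意(权重为手工构造的玩具值,仅用于展示前向与损失结构,非论文训练配置):

```python
def sae_forward(x, W_enc, b_enc, W_dec):
    """单层稀疏自编码器:z = ReLU(W_enc @ x + b_enc),x_hat = W_dec @ z。"""
    z = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_enc, b_enc)]
    x_hat = [sum(wk * zk for wk, zk in zip(row, z)) for row in W_dec]
    return z, x_hat

def sae_loss(x, x_hat, z, l1=1e-3):
    """重构误差 + L1 稀疏惩罚。"""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1 * sum(abs(zk) for zk in z)

# 2 维激活、4 倍超完备字典的玩具权重(手工构造,可精确重构)
x = [1.0, -0.5]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec)
print(z, x_hat)  # 4 个潜码中仅 2 个激活,重构无误差
```

论文的观察正对应这种情形:把字典从 0.5x 扩到 4.0x,若大量潜码始终不激活且重构误差不变,就说明该层表示并不依赖强超位置。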

[AI-4] Executable World Models for ARC-AGI-3 in the Era of Coding Agents

【速读】:该论文旨在解决ARC-AGI-3基准任务中如何构建具备可执行世界模型(executable world model)的智能体问题,以实现对复杂环境的泛化理解和高效决策。其解决方案的关键在于设计一个基于验证驱动的可执行世界模型系统:智能体维护一个可执行的Python形式的世界模型,通过与先前观测结果进行验证来确保模型准确性,并在规划前通过简化抽象(refactoring toward simpler abstractions)来引入类似最小描述长度(MDL)的简洁性偏好;同时,该系统采用脚本控制器、预定义接口和计划执行器,但不依赖任何游戏特定逻辑,从而形成一种通用性强的基准方法。实验表明,该方案在25个公开ARC-AGI-3游戏中实现了7个完全求解、6个相对人类动作效率(RHAE)>75%,平均RHAE为32.58%,验证了验证驱动的可执行世界模型是应对ARC-AGI-3挑战的有前景路径。

链接: https://arxiv.org/abs/2605.05138
作者: Sergey Rodionov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages. Submitted to AGI-26

点击查看摘要

Abstract:We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. We report results on the 25 public ARC-AGI-3 games. Each recorded playthrough uses a fresh agent instance with no access to previous playthrough-specific files or conversation state. Most games have a single recorded playthrough; for a few games, we report multiple independent fresh-agent playthroughs to expose run-to-run variability. The agent fully solved 7 games, achieved a Relative Human Action Efficiency (RHAE) greater than 75% on 6 games, and obtained a mean per-game RHAE of 32.58%. Because the system uses no game-specific code, it can serve as a game-general baseline for ARC-AGI-3. Performance on the private validation set remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents.

[AI-5] Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM -driven Evolutionary MNAR Imputation

【速读】:该论文旨在解决在电子健康记录(EHR)中因时间依赖性混杂和缺失非随机(missing-not-at-random, MNAR)生物标志物高达50%–80%而导致的因果效应估计不准确问题。传统方法常将因果推断、缺失值处理与时间结构建模分开处理,限制了其在真实世界数据中的稳健性。解决方案的关键在于提出一个两阶段管道:第一阶段使用CausalFlow-T,一种基于有向无环图(DAG)约束的归一化流模型,结合长短期记忆网络(LSTM)编码患者历史,实现精确可逆的反事实推断,避免变分推断近似误差并显式分离混杂因素;第二阶段引入由大语言模型(LLM)驱动的进化插补器,通过生成可执行的插补操作而非单个数值来提升插补质量,在30%–80% MNAR缺失率下优于现有方法,同时保持平均治疗效应(ATE)恢复能力,最终在瑞士2型糖尿病患者的真实EHR数据中成功估计出GLP-1受体激动剂相比SGLT-2抑制剂的预期体重减轻差异为-0.98 kg(95% CI -1.01, -0.96),结果与随机对照试验一致。

链接: https://arxiv.org/abs/2605.05125
作者: Olivia Jullian Parra,Sara Zoccheddu,David Catalan Cerezo,Tom Forzy,Franziska Ulrich,William Sutcliffe,Jakob Martin Burgstaller,Oliver Senn,Patrick Owen,Nicola Serra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Target trial emulation (TTE) enables causal questions to be studied with observational data when randomized controlled trials (RCTs) are infeasible. Yet treatment-effect methods often address causal estimation, missingness, and temporal structure separately, limiting their robustness in electronic health records (EHRs), where time-varying confounding and missing-not-at-random (MNAR) biomarkers can reach 50%–80%. We propose a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHRs. First, CausalFlow-T, a directed acyclic graph (DAG)-constrained normalizing flow with long short-term memory (LSTM)-encoded patient history, performs exact invertible counterfactual inference, avoiding approximation errors from variational inference and separating confounding through explicit causal structure. Ablations on four synthetic and one semi-synthetic benchmark with known counterfactuals show that DAG constraints and exact inference address distinct failure modes: neither compensates for the other. Second, because CausalFlow-T requires completed inputs, we introduce an LLM-driven evolutionary imputer that proposes executable imputation operators rather than individual entries, and evaluate it with three large language model (LLM) backends, including two open-source models. Across 30%–80% MNAR missingness, this imputer achieves the best pooled rank over biomarker and causal metrics, leading in point-wise accuracy and temporal extrapolation while preserving average treatment effect (ATE) recovery as statistical baselines degrade. On Swiss primary-care EHRs from adults with type 2 diabetes initiating a GLP-1 receptor agonist or SGLT-2 inhibitor, the pipeline estimates a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence and obtained from realistically incomplete real-world EHRs.
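LLM 驱动的进化插补器的评估思路可以这样理解:对每个候选插补算子,在人为遮蔽的已知值上计算重构误差,保留误差最小者。下面是一个去掉 LLM、用手写算子代替的极简示意(算子集合、遮蔽比例与平方误差评分均为假设):

```python
import random

def evolve_imputer(column, operators, mask_frac=0.3, seed=0):
    """在人为遮蔽的已知值上给每个候选插补算子打分,返回误差最小者。
    真实系统中 operators 由 LLM 生成并迭代变异,这里用手写算子代替。"""
    rng = random.Random(seed)
    known = [i for i, v in enumerate(column) if v is not None]
    held_out = rng.sample(known, max(1, int(len(known) * mask_frac)))
    masked = [None if i in held_out else v for i, v in enumerate(column)]

    def score(op):
        filled = op(masked)
        return sum((filled[i] - column[i]) ** 2 for i in held_out)

    return min(operators, key=score)

def impute_mean(col):
    vals = [v for v in col if v is not None]
    m = sum(vals) / len(vals)
    return [m if v is None else v for v in col]

def impute_locf(col):
    out, last = [], 0.0
    for v in col:
        last = last if v is None else v
        out.append(last)
    return out

column = [1.0, None, 3.0, 4.0, None, 6.0, 7.0]   # 含缺失的生物标志物序列(示意)
best = evolve_imputer(column, [impute_mean, impute_locf])
print(best.__name__)
```

与逐格填值相比,以“可执行算子”为进化单元的好处是插补策略可检查、可复用,这正是摘要强调的设计取向。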

[AI-6] Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

【Quick Read】: This paper addresses the deployment risk and wasted interaction budget caused by uncertainty in policy selection and fine-tuning for offline-to-online reinforcement learning (O2O-RL). Conventional pipelines rely on unreliable off-policy evaluation (OPE) or costly online evaluation (OE), and it is hard to know in advance whether a pretrained policy will improve through limited online interaction after deployment, especially in non-stationary environments. The authors propose an upper-confidence-bound (UCB) based adaptive policy selection and fine-tuning mechanism: after training multiple candidate policies offline and obtaining initial OPE estimates, it dynamically selects promising candidates for fine-grained online fine-tuning, maximizing policy improvement under a strict interaction budget. The key is using the UCB framework to balance exploration and exploitation, enabling efficient screening of candidates and allocation of online interactions.

Link: https://arxiv.org/abs/2605.05123
Authors: Alper Kamil Bozkurt,Xiaoan Xu,Shangtong Zhang,Miroslav Pajic,Yuichi Motai
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline, candidate policies trained with offline RL are evaluated via either off-policy evaluation (OPE) or online evaluation (OE). The policy with the highest estimated value is then deployed and continually fine-tuned. However, this setup has two main issues. First, OPE can be unreliable, making it risky to deploy a policy based solely on those estimates, whereas OE may identify a viable policy only after substantial online interaction, which could instead have been used for fine-tuning. Second, and more importantly, it is also often not possible to determine a priori whether a pretrained policy will improve with post-deployment fine-tuning, especially in non-stationary environments. As a result, procedures committing to a single deployed policy are impractical in many real-world settings. Moreover, a naive remedy that exhaustively fine-tunes all candidates would violate interaction budget constraints and is likewise infeasible. In this paper, we propose a novel adaptive approach for policy selection and fine-tuning under online interaction budgets in O2O-RL. Following the standard pipeline, we first train a set of candidate policies with different offline RL algorithms and hyperparameters; we then perform OPE to obtain initial performance estimates. We next adaptively select and fine-tune the policies based on their predicted performance via an upper-confidence-bound approach, thereby making efficient use of online interactions. We demonstrate that our approach improves upon O2O-RL baselines across various benchmarks.
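The adaptive selection loop described in the abstract can be sketched as a budgeted UCB procedure. This is a toy under stated assumptions: the OPE estimates seed each candidate's mean with a single pseudo-count, and each "fine-tuning step" is abstracted as one rollout return, neither of which is claimed to be the paper's actual implementation.

```python
import math
import random

def ucb_select(stats, t, c=2.0):
    """Pick the candidate with the highest upper confidence bound.
    stats: list of (mean_estimate, n_pulls) pairs."""
    best_i, best_u = 0, -math.inf
    for i, (mean, n) in enumerate(stats):
        u = math.inf if n == 0 else mean + c * math.sqrt(math.log(t) / n)
        if u > best_u:
            best_i, best_u = i, u
    return best_i

def adaptive_finetune(ope_estimates, rollout_fns, budget, seed=0):
    """Spend a fixed online-interaction budget across candidates, UCB-style.
    OPE estimates initialise the means; returns the index of the best candidate."""
    random.seed(seed)
    stats = [(v, 1) for v in ope_estimates]  # (mean, count) seeded by OPE
    for t in range(2, budget + 2):
        i = ucb_select(stats, t)
        r = rollout_fns[i]()          # one online episode for candidate i
        mean, n = stats[i]
        stats[i] = ((mean * n + r) / (n + 1), n + 1)
    return max(range(len(stats)), key=lambda i: stats[i][0])
```

Because unreliable OPE priors carry only one pseudo-count, a few online rollouts are enough for the bound to overturn a misleading initial ranking.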

[AI-7] On the Wasserstein Gradient Flow Interpretation of Drifting Models

【Quick Read】: This paper addresses stability and convergence questions in generative model training, in particular the bias and unclear optimization paths of practical algorithms viewed through Wasserstein Gradient Flows (WGF) from optimal transport theory. The key is to reinterpret GMD (Generative Modeling via Drifting, Deng et al., 2026) as seeking a fixed point of a specific WGF in the space of probability measures, and to show through theoretical analysis that different algorithmic variants correspond to the limiting behavior of WGFs on different divergences (KL divergence, Sinkhorn divergence, maximum mean discrepancy (MMD), sliced Wasserstein distance, and GAN critic functions), providing a unified geometric-optimization perspective and clearer convergence properties for generative models.

Link: https://arxiv.org/abs/2605.05118
Authors: Arthur Gretton,Li Kevin Wenliang,Alexandre Galashov,James Thornton,Valentin De Bortoli,Arnaud Doucet
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Recently, Deng et al. (2026) proposed Generative Modeling via Drifting (GMD), a novel framework for generative tasks. This note presents an analysis of GMD through the lens of Wasserstein Gradient Flows (WGF), i.e., the path of steepest descent for a functional in the space of probability measures, equipped with the geometry of optimal transport. Unlike previous WGF-based contributions, GMD can be thought of as directly targeting a fixed point of a specific WGF flow. We demonstrate three main results: first, that one algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence, with Parzen smoothing on the densities. Second, that the algorithm actually implemented by Deng et al. (2026) corresponds to a different procedure, which bears some resemblance to the fixed point of a WGF on the Sinkhorn divergence, but lacks certain desirable properties of the latter. Third, the same idea can be extended to the limiting point of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.
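One of the WGF instances mentioned above, gradient flow on the MMD, admits a compact particle sketch. The following is an illustrative 1-D toy (Gaussian kernel, explicit Euler step on particle positions), not the GMD algorithm itself; the step direction is the analytic gradient of the V-statistic MMD² estimate.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two 1-D samples."""
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def mmd_flow_step(particles, target, lr=0.5, sigma=1.0):
    """One explicit Euler step of gradient descent on MMD^2 w.r.t. particle positions."""
    new = []
    for x in particles:
        # gradient at x: attraction toward target samples minus within-sample repulsion
        att = sum((x - y) / sigma ** 2 * rbf(x, y, sigma) for y in target) / len(target)
        rep = sum((x - z) / sigma ** 2 * rbf(x, z, sigma) for z in particles) / len(particles)
        new.append(x - lr * (att - rep))
    return new
```

Iterating `mmd_flow_step` drives the particle set toward the target sample, which is exactly the "limiting point of a WGF" the note analyzes in the MMD case.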

[AI-8] LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts

【Quick Read】: This paper addresses the difficulty of designing reward functions for agile robot maneuvers in reinforcement learning, especially for novel platforms or extreme stunts that lack reference motions. The key of the proposed LineRides framework is to guide policy learning with a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing: a tracking margin handles physically infeasible guidelines, measuring progress by traveled distance along the guideline removes temporal ambiguity, and position- and sequence-constrained key-orientations disambiguate motion details, enabling five commandable stunts (e.g., Backflip, DriftTurn) with smooth transitions to and from normal driving.

Link: https://arxiv.org/abs/2605.05110
Authors: Seungeun Rho,Shamel Fahmi,Jeonghwan Kim,Arianna Ilvonen,Sehoon Ha,Gabriel Nelson
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Published in IEEE Robotics and Automation Letters (RA-L), 2026

Click to view abstract

Abstract:Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.

[AI-9] SoK: Robustness in Large Language Models against Jailbreak Attacks

【Quick Read】: This paper addresses the security risks posed by jailbreak attacks on large language models (LLMs), where malicious prompts coerce models into producing harmful, unethical, or policy-violating content, threatening system safety, trustworthiness, and regulatory compliance. The key is Security Cube, a unified multi-dimensional framework that systematically evaluates jailbreak attacks and defenses beyond narrow single metrics such as attack success rate, capturing the multiple dimensions of LLM security. Benchmark studies and in-depth analyses built on this framework provide a quantifiable, comparable foundation and research directions for improving LLM robustness.

Link: https://arxiv.org/abs/2605.05058
Authors: Feiyue Xu,Hongsheng Hu,Chaoxiang He,Sheng Hang,Hanqing Hu,Xiuming Liu,Yubo Zhao,Zhengyan Zhou,Bin Benjamin Zhu,Shi-Feng Sun,Dawu Gu,Shuo Wang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To Appear in the 47th IEEE Symposium on Security and Privacy, May 18-20, 2026

Click to view abstract

Abstract:Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across the literature. Leveraging Security Cube, we conduct benchmark studies on 13 representative attacks and 5 defenses, establishing a clear view of the current landscape encompassing jailbreak attacks, defenses, automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis aims to pave the way towards more robust, interpretable, and trustworthy LLM systems. Our code is available at Code.

[AI-10] Adaptive Learning Strategies for AoA-Based Outdoor Localization: A Comprehensive Framework

【Quick Read】: This paper addresses the unstable performance of angle-of-arrival (AoA) based localization in 5G and 6G networks across training-data scales, where large-scale data collection is costly and data is unevenly distributed in real deployments. The key is an adaptive framework with two learning strategies depending on dataset size. With a large training dataset, an offline hierarchical strategy first distinguishes line-of-sight (LoS) from non-line-of-sight (NLoS) regions and then performs fine-grained localization in each, combined with batch retraining and a hyperparameter optimization mechanism to improve accuracy. With only a small training dataset, an online strategy uses incremental tree-based and ensemble models to handle streaming data and continuously update the model, plus a few-shot learning mechanism that rapidly initializes new classes from limited labeled samples. This design lets the system improve localization accuracy incrementally during network operation, greatly reducing dependence on large data collection campaigns.

Link: https://arxiv.org/abs/2605.05055
Authors: Bac Trinh-Nguyen,Sara Berri,Sin G. Teo,Tram Truong-Huu,Arsenia Chorti
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Localization in 5G and 6G networks is essential for important use cases such as intelligent transportation, smart factories, and smart cities. Although deep learning has enabled improving localization accuracy, depending on the deployment scenario and the effort required for dataset collection campaigns on a given infrastructure, the training process for localization models can vary significantly. Furthermore, with respect to feature selection, recent works have demonstrated the robustness of angle-of-arrival (AoA) based localization. In view of these two points, we propose an adaptive framework for AoA-based localization that consists of two alternative learning strategies, each suited either for large or small training datasets. The proposed framework is evaluated on a real, massive multiple input multiple output (mMIMO) orthogonal frequency division multiplexing (OFDM) outdoor channel state information (CSI) dataset. First, we investigate offline learning when large training datasets are available; we propose a hierarchical framework that first distinguishes between line of sight (LoS) and non line of sight (NLoS) regions and then moves to more fine grained localization in the respective region. This approach provides high-performance localization through accumulated batch retraining and an integrated hyperparameter optimization mechanism. Second, when only a small training dataset is available, an online learning framework is proposed, using incremental tree-based and ensemble-based models for handling streaming data and continuously updating the model, as well as an online few-shot learning model for rapidly initializing new classes from a limited labeled support set. These results showcase that highly accurate robust localization can be achieved incrementally during network operation by exploiting online learning, alleviating the need for large dataset collection campaigns.

[AI-11] Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

【Quick Read】: This paper addresses three core challenges of training Mixture-of-Experts (MoE) models on HPC platforms: large memory footprints, frequent large-scale communication across heterogeneous networks, and severe load imbalance. The key is Piper, a framework that uses resource modeling to identify efficient training strategies on a target HPC platform and introduces pipeline parallelism with optimized schedules, significantly improving model FLOPs utilization (MFU). Piper also includes a novel all-to-all communication algorithm achieving 1.2-9X higher bandwidth than the vendor implementation, alleviating the all-to-all latency bottleneck introduced by expert parallelism.

Link: https://arxiv.org/abs/2605.05049
Authors: Sajal Dash,Feiyi Wang
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Frontier models increasingly adopt Mixture-of-Experts (MoE) architectures to achieve large-model performance at reduced cost. However, training MoE models on HPC platforms is hindered by large memory footprints, frequent large-scale communication across heterogeneous networks, and severe workload imbalance. To characterize these challenges, we develop a mathematical model that quantifies memory, compute, and communication requirements for MoE configurations under various parallelization schemes, verified through micro-benchmarking, code instrumentation, and hardware profiling. Our analysis identifies performance bottlenecks: all-to-all latency at scale from expert parallelism, insufficient compute-communication overlap, low GPU utilization from imbalanced skinny GEMMs, and the absence of platform-aware hybrid parallelization strategies. To address these, we introduce Piper, a framework that leverages resource modeling to identify efficient training strategies for MoE models on target HPC platforms, applying pipeline parallelism with optimized schedules. Piper achieves 2-3.5X higher MFU than state-of-the-art frameworks such as X-MoE, and a novel all-to-all algorithm delivers 1.2-9X higher bandwidth than the vendor implementation.
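A resource model like the one the abstract describes can be sketched for the communication term alone. The formula below is a simplified, assumed cost model (uniform routing, fp16 activations), not Piper's actual model: it estimates per-layer all-to-all bytes for expert parallelism and a max-over-mean imbalance ratio for the "skinny GEMM" problem.

```python
def moe_all_to_all_bytes(tokens, hidden_dim, top_k, ep_degree, bytes_per_elem=2):
    """Rough per-layer all-to-all volume for expert parallelism (dispatch + combine).
    Each token's activation is routed to top_k experts; under uniform routing a
    fraction (ep_degree - 1) / ep_degree of the routed traffic crosses rank boundaries."""
    routed = tokens * top_k * hidden_dim * bytes_per_elem
    cross_rank_fraction = (ep_degree - 1) / ep_degree
    return 2 * routed * cross_rank_fraction  # dispatch and combine passes

def expert_imbalance(expert_loads):
    """Max-over-mean load ratio: 1.0 means perfectly balanced expert GEMMs."""
    mean = sum(expert_loads) / len(expert_loads)
    return max(expert_loads) / mean if mean else float("inf")
```

Plugging candidate parallelization schemes into such closed-form terms is what lets a framework rank configurations before committing expensive cluster time.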

[AI-12] Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

【Quick Read】: This paper addresses the limited training stability and reasoning performance of existing on-policy self-distillation methods, in particular the learning bias and lack of exploratory diversity caused by KL matching toward a fixed teacher. The key is Preference-Based Self-Distillation (PBSD): a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, theoretically guaranteeing a target policy superior to the original teacher under this objective. PBSD drives learning by optimizing the preference gap between teacher and student samples while maintaining on-policy student sampling, and it shows stable, superior performance on mathematical reasoning and tool-use tasks across multiple model scales.

Link: https://arxiv.org/abs/2605.05040
Authors: Xin Yu,Liuchen Liao,Yiwen Zhang,Yingchen Yu,Lingzhou Xue,Qinzhen Guo
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet, existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose \textbfPreference-\textbfBased \textbfSelf-\textbfDistillation (\textbfPBSD), which revisits on-policy self-distillation through a reward-regularized perspective. Instead of directly matching the teacher distribution, we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, yielding a target policy provably superior to the original teacher under this objective. Practically, PBSD optimizes preference gaps between teacher and student samples while maintaining on-policy student sampling. We support this framework with a statistical analysis of the induced preference-learning problem, formally establishing when on-policy self-distillation is preferable to learning from an external teacher in our setting. Experiments on mathematical reasoning and tool-use benchmarks across multiple model scales demonstrate that PBSD consistently achieves the strongest average performance among comparable baselines, showing improved training stability over prior self-distillation baselines while preserving token efficiency.
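Two ingredients from the abstract, a preference gap between teacher and student samples and a reward-reweighted teacher distribution, can be written down in a few lines. This is a minimal sketch under assumptions: scalar rewards per sample, a Bradley-Terry (logistic) preference loss, and exponential tilting for the reweighting; none of these details are claimed to match PBSD's exact formulation.

```python
import math

def preference_loss(r_teacher, r_student, beta=1.0):
    """Logistic (Bradley-Terry) loss on the reward gap: small when the
    teacher sample is preferred over the student sample by a wide margin."""
    gap = beta * (r_teacher - r_student)
    return math.log(1.0 + math.exp(-gap))

def reward_reweighted(probs, rewards, beta=1.0):
    """Tilt a teacher distribution by exp(beta * reward) and renormalise,
    i.e. the analytic optimum of a reward-regularized KL objective."""
    w = [p * math.exp(beta * r) for p, r in zip(probs, rewards)]
    z = sum(w)
    return [x / z for x in w]
```

The tilted distribution places more mass on high-reward outcomes than the original teacher, which is the sense in which the target policy can dominate the teacher it was derived from.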

[AI-13] Position: Embodied AI Requires a Privacy-Utility Trade-off ICML2026

【Quick Read】: This paper addresses the systemic privacy crisis that arises when embodied AI (EAI) systems, deployed at high frequency in sensitive settings such as homes, optimize their modules (instruction understanding, perception, planning, interaction) independently. Existing approaches ignore the coupling of privacy leakage across the life cycle, leaving privacy as a stage-local, static feature rather than a dynamic constraint spanning the whole system. The key is SPINE (Secure Privacy Integration in Next-generation Embodied AI), a unified privacy-aware framework that treats privacy as a dynamic control signal throughout all stages of the EAI life cycle and uses a multi-criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries, systematically reshaping EAI behavior under privacy constraints.

Link: https://arxiv.org/abs/2605.05017
Authors: Xiaoliang Fan,Jiarui Chen,Zhuodong Liu,Ziqi Yang,Peixuan Xu,Ruimin Shen,Junhui Liu,Jianzhong Qi,Cheng Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at ICML 2026. 10 pages, 3 figures

Click to view abstract

Abstract:Embodied AI (EAI) systems are rapidly transitioning from simulations into real-world domestic and other sensitive environments. However, recent EAI solutions have largely demonstrated advancements within isolated stages such as instruction, perception, planning and interaction, without considering their coupled privacy implications in high-frequency deployments where privacy leakage is often irreversible. This position paper argues that optimizing these components independently creates a systemic privacy crisis when deployed in sensitive settings, thereby advancing the position that privacy in EAI is a life cycle-level architectural constraint rather than a stage-local feature. To address these challenges, we propose Secure Privacy Integration in Next-generation Embodied AI (SPINE), a unified privacy-aware framework that treats privacy as a dynamic control signal governing cross-stage coupling throughout the entire EAI life cycle. SPINE decomposes the EAI pipeline into various stages and establishes a multi-criterion privacy classification matrix to orchestrate contextual sensitivity across stage boundaries. We conduct preliminary simulation and real-world case studies to conceptually validate how privacy constraints propagate downstream to reshape system behavior, illustrating the insufficiency of fragmented privacy patches and motivating future research directions into secure yet functional embodied AI systems. We detail the SPINE framework and case studies at this https URL.

[AI-14] Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

【Quick Read】: This paper addresses the lack of joint optimization of orchestration and task decomposition in large language model (LLM) multi-agent systems: existing methods commit to fixed routing or hand-engineered decomposition, so decomposition depth, worker choice, and inference budget cannot be optimized under a single objective. The key is Uno-Orchestra, a unified orchestration policy that jointly learns whether to selectively decompose a task and how to dispatch each subtask to an admissible (model, primitive) pair, trained end-to-end on RL trajectories grounded in real worker interactions. It reaches 77.0% macro pass@1 across 13 benchmarks, about 16% above the strongest baseline, at roughly an order of magnitude lower per-query cost, jointly improving accuracy and efficiency.

Link: https://arxiv.org/abs/2605.05007
Authors: Zhiqing Cui,Haotong Xie,Jiahao Yuan,Cheng Yang,Hanqing Wang,Yuxin Wu,Yifan Wu,Siru Zhong,Tao Yu,Yifu Guo,Siyu Zhang,Xinlei Yu,Qibing Ren,Usman Naseem
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost, advancing the accuracy-efficiency frontier of selective delegation.

[AI-15] Federated Learning for Early Prediction of EV Charging Demand

【Quick Read】: This paper addresses early prediction of electric vehicle (EV) charging demand: estimating a session's total energy from plug-in information and the first minutes of charging alone. This matters for grid stability, infrastructure planning, and real-time charging optimization, and it lets network operators act while a session is still in progress. The key is a session-level dataset built from the Adaptive Charging Network (ACN) with tabular features capturing user intent, temporal patterns, and initial charging behavior, combined with a federated learning (FL) framework that keeps data in-depot while modeling station-level client partitions, achieving privacy-enhanced training across distributed charging infrastructure without sacrificing predictive performance.

Link: https://arxiv.org/abs/2605.04993
Authors: Vasilis Perifanis,Foteini Nikolaidou,Nikolaos Pavlidis,Panagiotis Thomakos,Andreas Sendros
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate forecasting of electric vehicle (EV) charging demand is critical for grid stability, infrastructure planning, and real-time charging optimization. In this work, we study the problem of early prediction of charging demand, where the total energy of a session is estimated using only information available at plug-in time and during the first minutes of charging. This enables actionable decisions while the session is still in progress, which is of direct importance for EV network operators. We construct a session-level dataset from the Adaptive Charging Network (ACN), combining session metadata with early-window charging measurements, and derive tabular features capturing user intent, temporal patterns, and initial charging behavior. We focus on a single operational depot, Caltech, and model intra-depot heterogeneity through station-level client partitions while evaluating multiple model families in a federated learning (FL) setting. Our results show that federated models can approach centralized predictive performance while keeping data in-depot, enabling privacy-enhanced training across distributed charging infrastructures. Overall, we demonstrate that reliable demand estimates can be obtained early in the session with minimal data, and that FL provides a practical pathway toward scalable and privacy-aware analytics for EV charging networks. Code is available at this https URL.
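The federated setup above, station-level clients training locally with in-depot data, aggregated centrally, can be sketched with standard FedAvg. The local model here is a deliberately tiny assumed stand-in (a one-feature linear regressor), not the paper's predictor; only the aggregation pattern is the point.

```python
def local_sgd(w, data, lr=0.1, epochs=1):
    """One client's local SGD update on (x, y) pairs for a linear model w = [bias, slope].
    Raw session data never leaves the client; only w is returned."""
    b, m = w
    for _ in range(epochs):
        for x, y in data:
            err = (b + m * x) - y
            b -= lr * err
            m -= lr * err * x
    return [b, m]

def fedavg(client_weights, client_sizes):
    """Size-weighted average of client parameter vectors (FedAvg aggregation)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    agg = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for j in range(dim):
            agg[j] += (n / total) * w[j]
    return agg
```

A training round is: broadcast the global `w`, run `local_sgd` per station, then `fedavg` the returned vectors, so the server only ever sees model parameters, not charging sessions.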

[AI-16] On-line Learning in Tree MDPs by Treating Policies as Bandit Arms AAMAS2026

【Quick Read】: This paper studies online learning in Tree Markov Decision Problems (T-MDPs), covering both the PAC and regret-minimization regimes. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall against stationary opponents; every state is reachable via a unique trajectory, but the number of policies grows exponentially with the number of states, so treating each policy as a bandit arm is not directly tractable. The key innovation is confidence bounds built from data shared across policies, allowing bandit algorithms (UCB and LUCB) to be deployed with polynomial memory and per-step computation despite the exponential policy count. The resulting sample-complexity and regret bounds sum a "gap term" per terminal state rather than per policy, and the algorithms empirically outperform available alternatives on hidden-information games.

Link: https://arxiv.org/abs/2605.04979
Authors: Anvay Shah,Ramsundar Anandanarayanan,Sharayu Moharir,Shivaram Kalyanakrishnan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as a full paper in the Main Track of AAMAS 2026
Click to view abstract

Abstract:A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state s_1, in which every state is reachable from s_1 through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms – LUCB and UCB – can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can still be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a "gap term" from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.
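The gap the abstract exploits, exponentially many policies versus few terminal states, is easy to see on a toy deterministic tree. The representation below (a dict mapping each nonterminal state to its list of per-action child states) is an illustrative assumption of ours, not the paper's formalism.

```python
def full_policies(tree):
    """Count policies as full action assignments: one action chosen at every
    nonterminal state, so the count is the product of branching factors
    (exponential in the number of states)."""
    n = 1
    for actions in tree.values():
        n *= len(actions)
    return n

def terminal_states(tree, root):
    """Count terminal states; in a deterministic tree each corresponds to one
    root-to-leaf trajectory. Bounds that sum a 'gap term' per terminal state
    therefore scale with this count, not with full_policies(tree)."""
    if root not in tree:   # leaf: state with no actions to choose
        return 1
    return sum(terminal_states(tree, child) for child in tree[root])
```

For a depth-2 binary tree there are already twice as many full policies (8) as terminal states (4), and the ratio grows exponentially with depth; sharing data across policies that traverse the same edges is what collapses the bookkeeping back to the terminal-state scale.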

[AI-17] Architectural Constraints Alignment in AI-assisted Platform-based Service Development

【Quick Read】: This paper addresses the problem that AI-assisted development tools lack awareness of production architectural constraints, infrastructure dependencies, and organizational standards during service prototyping, producing code that is brittle and hard to deploy. The key is a retrieval-augmented scaffolding approach that combines platform-driven code generation with agentic clarification loops to explicitly surface and resolve ambiguities in architectural constraints; by fusing template retrieval with structured interaction, production-relevant considerations are embedded at service scaffolding time, improving architectural consistency and deployability.

Link: https://arxiv.org/abs/2605.04973
Authors: Julius Irion,Moritz Leugers,Paul Hartwig,Simon Kling,Tachmyrat Annayev,Alexander Schwind,Maria C. Borges,Sebastian Werner
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: To Appear at CAiSE'26 - LLM-SOA Workshop

Click to view abstract

Abstract:AI-assisted development tools enable rapid prototyping of services but often lack awareness of architectural constraints, infrastructure dependencies, and organizational standards required in production environments. Consequently, generated artifacts may exhibit brittle behavior and limited deployability. We propose a retrieval-augmented scaffolding approach that combines platform-based code generation with agentic clarification loops to expose and resolve architectural constraint ambiguities. By combining template retrieval with structured interaction, the method embeds production-relevant considerations during service scaffolding. Evaluation indicates improved architectural consistency and deployability compared to general-purpose AI code generation workflows, suggesting that constraint-aware retrieval is essential for aligning AI-assisted service development with production software engineering practices.

[AI-18] Skill Neologisms: Towards Skill-based Continual Learning

【Quick Read】: This paper addresses the scalability of extending large language models (LLMs) with new skills: how to strengthen skill-specific capabilities without catastrophic forgetting. Fine-tuning and parameter-efficient variants risk memory interference, while context-based methods are limited in expressiveness and effective context length. The key is skill neologisms: soft tokens embedded in the model's vocabulary and optimized to improve performance on a specific skill, with no weight updates. Experiments show these neologisms markedly enhance specific skills, compose with out-of-distribution skills, and support zero-shot composition of independently trained neologisms, suggesting a scalable path toward skill-based continual learning.

Link: https://arxiv.org/abs/2605.04970
Authors: Antonin Berthon,Nicolas Astorga,Mihaela van der Schaar
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Modern LLMs show mastery over an ever-growing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open-problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model’s effective context. We explore skill neologisms–i.e., soft tokens integrated in the model’s vocabulary and optimized to improve capabilities over a specific skill–as a way to selectively extend model capabilities to new skills without weight updates. We first observe that off-the-shelf pre-trained LLMs already demonstrate tokens associated with procedural knowledge. We then show that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

[AI-19] Reliable Modeling of Distribution Shifts via Displacement-Reshaped Optimal Transport

【Quick Read】: This paper addresses failures of distribution-shift modeling in optimal transport (OT) caused by poorly designed ground metrics: OT conventionally uses Euclidean distance in input space, but if that metric does not reflect the true geometry of change, optimization can produce implausible transport paths. The key of Displacement-Reshaped Optimal Transport (ReshapeOT) is to estimate a Mahalanobis distance from the second moments of observed sample displacements and use it as the ground metric, carving "expressways" through input space along observed displacement directions so transport plans better match observed shifts. The method is computationally lightweight, integrates seamlessly with any cost-matrix-based OT solver, and can be kernelized for added flexibility.

Link: https://arxiv.org/abs/2605.04965
Authors: Philip Naumann,Jacob Kauffmann,Klaus-Robert Müller,Grégoire Montavon
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Optimal transport (OT) is a central framework for modeling distribution shifts. Because OT compares distributions directly in input space, a well-designed ground metric between observations is essential to ensure that the optimizer does not violate the true geometry of change. We propose Displacement-Reshaped Optimal Transport (ReshapeOT), a method that reshapes the ground metric by integrating observed sample displacements as an additional source of knowledge. Technically, ReshapeOT replaces the Euclidean metric with a Mahalanobis distance estimated from displacement second moments. This effectively carves expressways through the input space, inviting transport solutions that better align with observed displacements. Our method is computationally lightweight, integrates seamlessly into any OT solver that operates on a cost matrix, and can be kernelized for further flexibility. Experiments on synthetic and real-world data show that ReshapeOT achieves substantial gains in transport reliability. We further demonstrate our method’s usefulness in two practical use cases.
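The core construction, a Mahalanobis ground metric from displacement second moments, can be sketched in 2-D with a hand-inverted 2x2 matrix. The regularizer `eps` and the plain (uncentered) second-moment estimate are our illustrative assumptions; the paper's estimator may differ in detail.

```python
def displacement_metric(displacements, eps=1e-6):
    """Estimate M = (second-moment matrix of displacements + eps*I)^{-1} in 2-D.
    Under M, distances are cheap along frequently observed displacement
    directions and expensive orthogonal to them."""
    n = len(displacements)
    sxx = sum(d[0] * d[0] for d in displacements) / n + eps
    sxy = sum(d[0] * d[1] for d in displacements) / n
    syy = sum(d[1] * d[1] for d in displacements) / n + eps
    det = sxx * syy - sxy * sxy
    return [[syy / det, -sxy / det], [-sxy / det, sxx / det]]

def mahalanobis2(u, v, M):
    """Squared Mahalanobis distance (u - v)^T M (u - v) for 2-D points."""
    dx, dy = u[0] - v[0], u[1] - v[1]
    return dx * (M[0][0] * dx + M[0][1] * dy) + dy * (M[1][0] * dx + M[1][1] * dy)
```

Building the OT cost matrix with `mahalanobis2` instead of squared Euclidean distance is the drop-in change: any solver that accepts a cost matrix is unaffected by where the costs came from.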

[AI-20] EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

【Quick Read】: This paper addresses three credit-assignment failures of reinforcement learning with verifiable rewards (RLVR), in particular Group Relative Policy Optimization (GRPO): uniform token-level granularity that ignores the heterogeneous informational value of tokens; uniform polarity that penalizes correct steps and rewards incorrect ones; and zero-variance collapse that erases outcome-driven gradients. The key of the proposed Entropy-Progress Aligned GRPO (EP-GRPO) is three mechanisms: entropy-gated modulation that prioritizes high-entropy decision pivots; implicit process signals from policy divergence anchored to outcome advantages, giving directional token-level feedback without external reward models; and cumulative entropy mapping for progress-aligned advantage normalization, maintaining gradient flow even under zero reward variance. Experiments on mathematical reasoning benchmarks show EP-GRPO surpasses GRPO and its variants in both accuracy and efficiency.

Link: https://arxiv.org/abs/2605.04960
Authors: Song Yu,Li Li,Wenwen Zhao,Zhisheng Yang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures

Click to view abstract

Abstract:Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.
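The entropy-gating idea, concentrating a shared sequence-level advantage on high-entropy "decision pivot" tokens, can be sketched as a softmax over token entropies. The softmax gating and the temperature `tau` are our assumed concretization, not EP-GRPO's exact modulation rule.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gated_advantages(advantage, token_dists, tau=1.0):
    """Redistribute a shared sequence-level advantage across tokens via a
    softmax over token entropies: confident tokens get less credit, uncertain
    decision pivots get more, while the total credit is preserved."""
    ents = [token_entropy(d) for d in token_dists]
    exps = [math.exp(e / tau) for e in ents]
    z = sum(exps)
    # scale by len(ents) so the sum over tokens equals advantage * len(ents),
    # matching a uniform per-token assignment in total magnitude
    return [advantage * len(ents) * w / z for w in exps]
```

With a peaked distribution (entropy 0) and a 50/50 distribution (entropy ln 2), the uncertain token receives twice the credit of the confident one under `tau=1`, while the summed credit matches the uniform baseline.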

[AI-21] Modular Reinforcement Learning For Cooperative Swarms

【Quick Read】: This paper addresses the memory bottleneck caused by the combinatorial explosion of interaction states in multi-robot reinforcement learning: in a cooperative swarm, each robot interacts with only a small subset of peers and cannot directly observe its effect on the collective utility, making optimal interaction strategies hard to learn. The key is a modular (decomposed) state representation: the spatial interaction state is decomposed into separate features, each handled by its own learning procedure, with the results then aggregated, significantly reducing each robot's memory requirements and improving learning efficiency.

Link: https://arxiv.org/abs/2605.04939
Authors: Erel Shtossel,Gal A. Kaminka
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.
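The decomposition the abstract describes can be sketched as a modular Q-table: one independent table per state feature, with per-feature estimates summed at query time. The equal split of the TD target across modules and the specific class shape are illustrative assumptions, not the paper's learner.

```python
class ModularQ:
    """Decomposed Q-estimate: one independent table per state feature.
    Q(s, a) is approximated by summing per-feature values, so memory grows
    with the sum of feature-domain sizes instead of their product."""

    def __init__(self, n_features, actions, alpha=0.5):
        self.tables = [dict() for _ in range(n_features)]
        self.actions = actions
        self.alpha = alpha

    def q(self, state, a):
        """Aggregate per-feature estimates for action a in the given state."""
        return sum(t.get((f, a), 0.0) for t, f in zip(self.tables, state))

    def update(self, state, a, target):
        """Distribute the learning target equally across feature modules."""
        share = target / len(self.tables)
        for t, f in zip(self.tables, state):
            old = t.get((f, a), 0.0)
            t[(f, a)] = old + self.alpha * (share - old)

    def best_action(self, state):
        return max(self.actions, key=lambda a: self.q(state, a))
```

A joint table over k features with d values each needs d**k entries per action; this decomposition needs only k * d, which is the memory saving motivating the approach.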

[AI-22] When Does Gene Regulatory Network Inference Break? A Controlled Diagnostic Study of Causal and Correlational Methods on Single-Cell Data

【Quick Read】: This paper investigates why causal methods for gene regulatory network (GRN) inference from single-cell RNA-seq data often fail to beat correlation-based baselines, a recurring finding across realistic benchmarks that casts doubt on the practical value of causality for this task. The key is a controlled diagnostic framework that isolates seven biologically motivated pathologies (dropout, latent confounders, cell-type mixing, feedback loops, and more) and quantifies how six representative methods degrade as each pathology intensifies. Across 6,120 controlled experiments, causal methods genuinely dominate in clean, structurally favorable regimes, but specific pathologies (notably dropout and latent confounders) selectively neutralize their advantage. An error-type decomposition further shows that methods with similar aggregate accuracy commit qualitatively different errors, and a multi-pathology interaction analysis reveals sub-additive joint effects and density-conditional cross-over behavior invisible to single-pathology analysis, yielding a nuanced understanding of when GRN inference methods succeed and practical guidance for practitioners.

Link: https://arxiv.org/abs/2605.04930
Authors: Miguel Fernandez-de-Retana,Ruben Sanchez-Corcuera,Unai Zulaika,Aritz Bilbao-Jayo,Aitor Almeida
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: 19 pages, 10 figures

Click to view abstract

Abstract:Despite theoretical advantages, causal methods for Gene Regulatory Network (GRN) inference from single-cell RNA-seq data consistently fail to match or outperform correlation-based baselines in many realistic benchmarks, a persistent puzzle which casts doubt on the value of causality for this task. We argue that existing benchmarks are insufficiently controlled to answer this question because they evaluate on real or semi-real data where multiple pathologies co-occur, confounding failure modes, and obscuring the specific conditions under which different inference methods excel or fail. To address this gap, we introduce a controlled diagnostic framework that isolates seven biologically motivated pathologies (dropout, latent confounders, cell-type mixing, feedback loops, network density, sample size, and pseudotime drift) and measure how six representative methods spanning three inference paradigms degrade as each pathology intensifies. Across 6,120 controlled experiments, we find that causal methods genuinely dominate in clean and structurally favorable regimes, but specific pathologies (notably dropout and latent confounders) selectively neutralize their advantages. We further introduce an error-type decomposition that reveals methods with similar aggregate accuracy commit qualitatively different errors. To probe whether single-pathology effects persist when multiple stressors co-occur, we perform an interaction sweep over the three most impactful pathologies and find that their joint effects are sub-additive, while also exposing density-conditional cross-overs invisible to single-dial analysis. Our findings offer a nuanced understanding of when and why different methods succeed or fail for GRN inference, providing actionable insights for method development and practical guidance for practitioners.
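论文中破坏性最强的病理之一是 dropout。下面给出一个示意性的 Python 草图(非论文代码,接口与参数均为假设),演示如何向表达矩阵注入可控强度的 dropout,以复现这类"单旋钮"病理扫描:

```python
import random

def inject_dropout(matrix, rate, seed=0):
    """Zero out entries of an expression matrix with probability `rate`,
    mimicking scRNA-seq dropout at a controlled intensity (sketch)."""
    rng = random.Random(seed)  # fixed seed keeps the sweep reproducible
    return [[0.0 if rng.random() < rate else x for x in row] for row in matrix]
```

将 rate 从 0 逐步调高并记录各推断方法的性能,即可得到论文所述的单病理退化曲线。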

[AI-23] A Foundation Model for Zero-Shot Logical Rule Induction IJCAI2026

【速读】:该论文旨在解决传统归纳逻辑编程(Inductive Logic Programming, ILP)方法的局限性,即现有方法为归纳式(transductive),其学习到的参数与特定谓词绑定,导致在面对新任务时需要重新训练,缺乏泛化能力。为此,作者提出神经规则诱导器(Neural Rule Inducer, NRI),其核心创新在于采用领域无关的统计特征(如类别条件率、熵和共现关系)来表示逻辑文字(literal),从而实现对变量身份和数量变化的不变性,无需重新训练即可进行零样本规则诱导。NRI由统计编码器与并行槽式解码器组成,其中并行解码保持了逻辑析取的置换不变性,而乘积T-范数松弛使规则执行可微,支持仅基于预测准确率的端到端训练。这一设计显著提升了模型在规则恢复、抗标签噪声及虚假相关性鲁棒性以及零样本迁移至真实世界基准任务中的表现,为符号推理领域的基础模型(foundation models)开辟了新路径。

链接: https://arxiv.org/abs/2605.04916
作者: Yin Jun Phua
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: Camera-ready version accepted at IJCAI 2026, with full appendices

点击查看摘要

Abstract:Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at this https URL.
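NRI 用类别条件率、熵等领域无关统计量表示逻辑文字。下面是一个极简 Python 草图(非论文实现,特征选取与函数签名均为假设),演示如何为单个布尔文字计算这类统计量:

```python
import math

def literal_features(values, labels):
    """Domain-agnostic statistics for one boolean literal (sketch):
    marginal rate, class-conditional rate, and binary entropy."""
    n = len(values)
    p = sum(values) / n                                # marginal rate
    pos = [v for v, y in zip(values, labels) if y]
    rate_pos = sum(pos) / len(pos) if pos else 0.0     # class-conditional rate
    if p in (0.0, 1.0):
        ent = 0.0                                      # degenerate literal
    else:
        ent = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return p, rate_pos, ent
```

这类特征不依赖谓词身份,因此同一模型无需重训即可跨任务复用,这正是零样本诱导的前提。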

[AI-24] Curated AI beats frontier LLMs at pharma asset discovery

【速读】:该论文旨在解决当前通用大语言模型(General-purpose LLMs)在药物研发管线竞争格局分析中召回率不足的问题,尤其是在针对小众肿瘤学与免疫学靶点时,因多数候选药物处于临床前阶段且主要由亚洲机构开发,导致基于通用网络搜索的系统难以有效捕捉这些“长尾”资产。解决方案的关键在于构建一个结构化、多维度(靶点-作用机制-适应症)的药物资产标注索引,并通过 Gosset 平台以聊天界面形式提供访问——该平台不仅显著提升了检索精度与召回率(相比最佳前沿系统提升 3.2 倍验证药物数量,同时保持 100% 精确度和召回率),还以 MCP 服务形式开放索引,使任何前沿模型均可调用此定制化知识库作为工具,从而在不改变交互接口的前提下大幅提升系统性能。

链接: https://arxiv.org/abs/2605.04908
作者: Łukasz Kidziński,Kevin Thomas
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 5 pages, 5 figures, 1 table

点击查看摘要

Abstract:General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. We benchmark Gosset – an AI platform with a chat interface backed by curated target-, modality-, and indication-level drug-asset annotations – against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. All five systems receive the same natural-language query and the same JSON output schema. Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool, suggesting that each of these systems can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.

[AI-25] Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多智能体博弈中战略推理能力不足的问题,尤其在于如何应对其他智能体策略非平稳性带来的推理评估困难与多步推理过程中的信用分配挑战。现有单智能体强化学习(Reinforcement Learning, RL)及其多智能体扩展方法因未将其他智能体纳入推理过程而难以有效解决上述问题。其解决方案的关键在于提出Strat-Reasoner框架,引入一种新颖的递归式推理范式,使每个智能体的推理过程显式整合其他智能体的推理逻辑;同时设计集中式思维链(Chain-of-Thought, CoT)对比模块以提供中间推理序列的有效奖励信号,并基于混合优势计算和群体相对强化学习策略优化LLM策略,从而显著提升模型在多智能体环境下的战略表现。

链接: https://arxiv.org/abs/2605.04906
作者: Yidong He,Yutao Lai,Pengxu Yang,Jiarui Gan,Jiexin Wang,Yi Cai,Mengchen Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents brings significant challenges on the evaluation of the reasoning process and the credit assignment over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs’ strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm where an agent’s reasoning also integrates other agents’ reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves strategic abilities of underlying LLMs, achieving 22.1% average performance improvements across various multi-agent games.
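摘要提到的"group-relative"强化学习,通常指在同一组采样轨迹内用组统计量归一化奖励得到优势值。以下为一个与论文实现细节无关的示意草图:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward by its group's mean and std (sketch)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all rollouts tie: no learning signal
    return [(r - mu) / sigma for r in rewards]
```

组内相对化使优势值只反映"比同组其他推理轨迹好多少",这在对手策略非平稳时比绝对奖励更稳定。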

[AI-26] On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference ACL2026

【速读】:该论文旨在解决Transformer模型在密码学安全推理(cryptographically secure inference)中非线性层计算效率低下的问题。现有方案通过向客户端暴露中间激活值(intermediate activations)以实现明文计算,从而提升效率,但此举使模型权重面临被窃取的风险。为缓解此风险,先前工作采用"洗牌防御"(shuffling defense),即仅向客户端提供随机排列后的激活值。然而,本文指出该防御并不如预期般鲁棒,并提出一种攻击方法:通过将不同批次的洗牌激活对齐至同一排列,进而利用这些对齐后的激活恢复模型权重。其关键创新在于实现了高精度的激活对齐(均方误差达 10^{-9} 至 10^{-6}),并以约 1 美元的查询成本即可恢复出与原始权重L1范数差异在 10^{-4} 至 10^{-2} 范围内的模型参数,显著揭示了当前防御机制的脆弱性。

链接: https://arxiv.org/abs/2605.04901
作者: Zhengyi Li,Yakai Wang,Kang Yang,Yu Yu,Jiaping Gui,Yu Feng,Ning Liu,Minyi Guo,Jingwen Leng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:For Transformer models, cryptographically secure inference ensures that the client learns only the final output, while the server learns nothing about the client's input. However, securely computing nonlinear layers remains a major efficiency bottleneck due to the substantial communication rounds and data transmission required. To address this issue, prior works reveal intermediate activations to the client, allowing nonlinear operations to be computed in plaintext. Although this approach significantly improves efficiency, exposing activations enables adversaries to extract model weights. To mitigate this risk, existing works employ a shuffling defense that reveals only randomly permuted activations to the client. In this work, we show that the shuffling defense is not as robust as previously claimed. We propose an attack that aligns differently shuffled activations to a common permutation and subsequently exploits them to extract model weights. Experiments on Pythia-70m and GPT-2 demonstrate that the proposed attack can align shuffled activations with mean squared errors ranging from 10^-9 to 10^-6. With a query cost of approximately $1, the adversary can recover model weights with L1-norm differences ranging from 10^-4 to 10^-2 compared to the oracle weights.
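该攻击的第一步是把不同批次打乱后的激活对齐到同一排列。一个极简思路(假设各位置激活彼此可分)是按最近邻匹配恢复排列,示意如下:

```python
import numpy as np

def align_permutation(reference, shuffled):
    """Recover perm such that shuffled[i] == reference[perm[i]],
    by nearest-neighbor matching of activation rows (sketch)."""
    perm = []
    for row in shuffled:
        dists = np.linalg.norm(reference - row, axis=1)
        perm.append(int(np.argmin(dists)))
    return perm
```

一旦各批次都对齐到同一排列,打乱后的激活就与未打乱的激活等价,后续即可按常规方式拟合权重。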

[AI-27] A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

【速读】:该论文旨在解决无限时域、非回合制(持续性)任务中基于未折现平均奖励的强化学习算法在处理非平稳奖励与持续时间分布时存在的偏差问题。传统方法通过优化奖励与持续时间之比来实现目标,但在非平稳环境下会导致错误的结果。论文的关键解决方案是提出一种改进的调和平均算子(modified harmonic mean operator),该算子能够正确计算奖励率,即使在奖励和持续时间分布随时间变化的情况下仍保持有效性。这一改进使得无需模型信息的学习算法能够在半马尔可夫决策过程(SMDP)中稳定运行,并具备对非平稳环境的鲁棒性。

链接: https://arxiv.org/abs/2605.04880
作者: Erel Shtossel,Alicia Vidler,Uri Shaham,Gal A. Kaminka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.
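对 SMDP 而言,轨迹的平均奖励率是 sum(R)/sum(D),而非逐样本比值 r/d 的算术平均;后者在持续时间变化时有偏,这正是需要修正算子的动机。下面的小例子(示意,非论文中的调和平均算子)展示二者的差异:

```python
def naive_rate(rewards, durations):
    """Arithmetic mean of per-sample reward/duration ratios (biased)."""
    return sum(r / d for r, d in zip(rewards, durations)) / len(rewards)

def trajectory_rate(rewards, durations):
    """sum(R)/sum(D): the actual reward rate of the trajectory."""
    return sum(rewards) / sum(durations)
```

例如两次转移奖励均为 1、持续时间分别为 1 和 3:逐样本平均给出 2/3,而真实奖励率是 2/4 = 0.5。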

[AI-28] Quantile-Free Uncertainty Quantification in Graph Neural Networks ICML2026

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中不确定性量化(Uncertainty Quantification, UQ)的挑战,特别是在高风险应用场景下,传统方法往往依赖于不切实际的交换性假设(exchangeability),且需通过昂贵的重采样或事后校准来获得可靠预测区间。其解决方案的关键在于提出Quantile-free Prediction Interval GNN(QpiGNN),该框架基于分位数回归(Quantile Regression, QR)设计了一种双头结构,将预测与不确定性解耦,并通过仅使用标签监督的无分位数联合损失函数进行训练,从而直接优化预测区间的覆盖率和宽度,无需分位数输入或后处理步骤;该方法在温和假设下具备渐近覆盖率理论保证和近最优区间宽度,实验证明其在19个合成与真实世界基准上平均提升22%覆盖率、缩小50%区间宽度,同时对噪声和结构变化具有鲁棒性。

链接: https://arxiv.org/abs/2605.04847
作者: Soyoung park,Hwanjun Song,Sungsu Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Uncertainty quantification (UQ) in graph neural networks (GNNs) is crucial in high-stakes domains but remains a significant challenge. In graph settings, message passing often relies on strong assumptions such as exchangeability, which are rarely satisfied in practice. Moreover, achieving reliable UQ typically requires costly resampling or post-hoc calibration. To address these issues, we introduce Quantile-free Prediction Interval GNN (QpiGNN), a framework that builds on quantile regression (QR) to enable GNN-based UQ by directly optimizing coverage and interval width without requiring quantile inputs or post-processing. QpiGNN employs a dual-head architecture that decouples prediction and uncertainty, and is trained with label-only supervision through a quantile-free joint loss. This design allows efficient training and yields robust prediction intervals, with theoretical guarantees of asymptotic coverage and near-optimal width under mild assumptions. Experiments on 19 synthetic and real-world benchmarks show QpiGNN achieves average 22% higher coverage and 50% narrower intervals than baselines, while ensuring efficiency and robustness to noise and structural shifts.
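摘要所述"直接优化覆盖率与区间宽度、无需分位数输入"的训练目标,可以用一个极简的区间损失来示意(非论文的具体损失函数,权衡系数 lam 为假设):

```python
import numpy as np

def interval_loss(lower, upper, y, lam=0.1):
    """Hinge coverage penalty plus a width penalty (sketch)."""
    lower, upper, y = map(np.asarray, (lower, upper, y))
    # Coverage term: positive only when y falls outside [lower, upper].
    outside = np.maximum(lower - y, 0.0) + np.maximum(y - upper, 0.0)
    # Width term: discourage trivially wide intervals.
    width = upper - lower
    return float(np.mean(outside + lam * width))
```

两项相互制衡:只罚宽度会塌缩区间,只罚漏覆盖会把区间撑到无穷;lam 控制二者的折中。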

[AI-29] DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

【速读】:该论文旨在解决AI代理(AI agents)在复杂、动态且不可信环境中面临的高风险安全问题,尤其是如何系统性地评估和发现其潜在攻击面。现有方法难以实现大规模、可控且可复现的风险评估,导致对代理安全性缺乏全面认知。解决方案的关键在于提出DecodingTrust-Agent Platform(DTap),这是一个涵盖14个真实世界领域和50余个仿真环境的可控交互式红队平台,能够模拟如Google Workspace、PayPal等广泛使用的系统;进一步引入DTap-Red——首个自主红队代理,通过系统探索多种注入向量(如提示词、工具、技能、环境组合)自动发现针对不同恶意目标的有效攻击策略,并构建DTap-Bench数据集用于自动化验证攻击结果。该方案实现了从环境构建到攻击发现再到效果验证的闭环评估体系,揭示了AI代理中的系统性漏洞模式,为下一代安全代理的研发提供了关键洞见。

链接: https://arxiv.org/abs/2605.04808
作者: Zhaorun Chen,Xun Liu,Haibo Tong,Chengquan Guo,Yuzhou Nie,Jiawei Zhang,Mintong Kang,Chejian Xu,Qichang Liu,Xiaogeng Liu,Tianneng Shi,Chaowei Xiao,Sanmi Koyejo,Percy Liang,Wenbo Guo,Dawn Song,Bo Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 279 pages, 148 figures

点击查看摘要

Abstract:AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.

[AI-30] AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

【速读】:该论文旨在解决现代AI代理(AI agents)在执行工具调用(如文件操作、Shell命令、HTTP请求等)时可能引发的不可逆安全风险问题,例如意外删除、凭证泄露或数据外泄。现有防御手段存在局限:事后基准测试无法阻止危害发生,静态防护规则难以应对混淆和多步骤上下文攻击,而基础设施沙箱则缺乏对行为语义的理解。其解决方案的关键在于提出AgentTrust——一个运行时安全层,通过拦截工具调用并生成结构化判别结果(允许、警告、阻断或人工审核),融合壳层去混淆归一化、SafeFix安全替代建议、RiskChain多步攻击链检测以及缓存感知的大语言模型(LLM-as-Judge)机制,实现高精度、低延迟的实时风险识别与处置。

链接: https://arxiv.org/abs/2605.04785
作者: Chenglin Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 31 pages, 2 figures, 15 tables; preprint

点击查看摘要

Abstract:Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
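AgentTrust 的规则层可以想象成"模式到判定"的映射。以下规则与接口均为假设的示意,真实系统还包含去混淆归一化、SafeFix 建议与 LLM 裁判:

```python
import re

# Hypothetical static ruleset: pattern -> verdict.
RULES = [
    (re.compile(r"rm\s+-rf\s+/"), "block"),           # destructive delete
    (re.compile(r"curl\s+\S+\s*\|\s*sh"), "review"),  # pipe-to-shell install
]

def verdict(command: str) -> str:
    """Return allow/review/block for a shell command (rule-only sketch)."""
    for pattern, decision in RULES:
        if pattern.search(command):
            return decision
    return "allow"
```

纯规则层延迟极低,但正如摘要所述,单靠它无法覆盖混淆与多步攻击链,因此需要后续模块兜底。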

[AI-31] Knowledge-Free Correlated Agreement for Incentivizing Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中客户端贡献评估缺乏真实标签、公共测试集或数据分布知识时的激励机制设计难题。传统方法如相关一致性(Correlated Agreement, CA)易受标签翻转攻击,导致奖励分配不公。论文提出的无知识相关一致性的奖励机制(Knowledge-Free Correlated Agreement, KFCA)通过在类别报告和诚实多数假设下实现严格真实性(strictly truthful),从而有效抵御此类攻击。其关键创新在于无需依赖外部监督信号或先验分布信息,即可高效计算实时奖励,适用于去中心化及区块链驱动的激励架构,在大语言模型适配器微调和实际PCB缺陷检测任务中验证了有效性。

链接: https://arxiv.org/abs/2605.04747
作者: Leon Witt,Togrul Abbasli,Kentaroh Toyoda,Wojciech Samek,Lucy Klinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:We introduce Knowledge-Free Correlated Agreement (KFCA) to reward client contributions in federated learning (FL) without relying on ground truth, a public test set, or distribution knowledge. Under categorical reports and an honest majority, KFCA is strictly truthful, addressing the label-flipping vulnerability of Correlated Agreement (CA). We evaluate KFCA on federated LLM adapter tuning and a real-world PCB inspection task, showing efficient real-time reward computation suitable for decentralized and blockchain-based incentive designs.
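作为背景,相关一致性(CA)类机制的核心是"同任务一致率减去独立基线"。下面是一个与 KFCA 实现细节无关的示意实现(离散报告、无真值;基线取独立假设下的期望一致率):

```python
from collections import Counter

def correlated_agreement_score(reports_i, reports_j):
    """Matched-task agreement minus expected agreement under independence."""
    n = len(reports_i)
    matched = sum(a == b for a, b in zip(reports_i, reports_j)) / n
    ci, cj = Counter(reports_i), Counter(reports_j)
    baseline = sum(ci[k] * cj.get(k, 0) for k in ci) / (n * n)
    return matched - baseline
```

注意常数报告(例如全部提交同一标签)得分为 0,即无信息策略得不到奖励;KFCA 在此基础上进一步针对标签翻转实现了严格真实性。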

[AI-32] Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

【速读】:该论文旨在解决文本驱动的角色扮演模型在沉浸式应用(如虚拟现实游戏和交互叙事)中难以准确反映场景氛围与情绪张力的问题。现有方法虽能模仿角色风格,但缺乏对视觉环境的感知与推理整合,导致对话脱离情境。其解决方案的关键在于提出EBM-RL(Eye-Brain-Mouth Reinforcement Learning)框架,该框架基于GRPO(Group Relative Policy Optimization)构建,并采用解耦结构将模型行为明确划分为三个阶段:感知([perception])、推理([think])和回答([answer]),从而实现类人感官 grounding。通过引入四种互补奖励机制——CLIP-based场景-文本对齐奖励以增强氛围一致性、感知-认知奖励提升参考回复概率、答案准确性奖励保障内容忠实度,以及密集格式奖励规范输出结构——EBM-RL显著提升了视觉氛围一致性和角色真实性,在视频引导的角色扮演任务上优于纯文本基线及更大规模的视觉语言模型,并展现出无需微调的零样本泛化能力。

链接: https://arxiv.org/abs/2605.04733
作者: Miao Wang,Yuling Shi,Yijiang Li,Yeheng Chen,Xiaodong Gu,Bin Li,Bo Gao,Yaduan Ruan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the likelihood of the reference response; (iii) answer accuracy to ensure faithfulness; and (iv) a dense format reward to enforce the desired structured output. Extensive experiments demonstrate that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, delivering simultaneous gains in visual-atmosphere consistency and character authenticity. Beyond the role-playing domain, EBM-RL also exhibits strong zero-shot generalization: without any additional fine-tuning, it consistently improves performance on out-of-domain VideoQA benchmarks. We additionally release an open-source dataset for video-grounded role-playing dialogue.

[AI-33] From Beats to Breaches: How Offensive AI Infers Sensitive User Information from Playlists

【速读】:该论文旨在解决生成式 AI(Generative AI)在数字生态系统中被恶意利用的问题,特别是针对音乐流媒体平台中用户公开发布的播放列表所引发的敏感个人信息(PII)泄露风险。其核心挑战在于如何从无序、变长的播放列表集合中准确推断出用户的年龄、国籍、性别、生活习惯及人格特质等敏感属性。解决方案的关键在于提出了一种名为 musicPIIrate 的新型深度学习工具,该工具融合了基于集合(set-based)的方法(如 Deep Sets)与建模播放列表间关系的图神经网络(Graph Neural Networks),从而同时利用单个播放列表的数据表示和整体播放列表结构信息,实现对 PII 的高精度推理;此外,作者进一步设计了轻量级防御框架 JamShield,通过向用户账户注入虚假播放列表以稀释真实 PII 信号,有效降低攻击成功率,平均使 F1 分数下降 10%,展现出良好的防护潜力。

链接: https://arxiv.org/abs/2605.04724
作者: Stefano Cecconello,Mauro Conti,Luca Pajola,Luca Pasa,Pier Paolo Tricomi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper is accepted at IEEE EuroSP 2026

点击查看摘要

Abstract:The pervasive integration of AI has enabled Offensive AI: the exploitation of AI for malicious ends across the cyber-kill chain. A critical manifestation is the user attribute inference attack, where AI infers sensitive Personally Identifiable Information (PII) from innocuous public data. We explore how music streaming ecosystems, where users routinely release public playlists, can be exploited for Offensive AI. To quantify this threat, we developed musicPIIrate. This novel tool leverages deep learning architectures that utilize both standalone data representations and the structural information embedded in a user’s playlist collection. Our design explores set-based approaches (e.g., Deep Sets) and methodologies modeling relationships between playlists (e.g., Graph Neural Networks), which we also combine to leverage both perspectives. Our approach addresses feature extraction from unordered, variable-length set data, enabling accurate PII prediction. Empirical evaluation demonstrates that musicPIIrate achieves state-of-the-art inference accuracy. The tool successfully infers a wide array of attributes, including: Demographics (Age, Country, Gender), Habits (Alcohol, Smoke, Sport), and Personality Traits (OCEAN scores). musicPIIrate outperforms existing methods, beating baselines in 9 out of 15 attribute inference tasks. To counter this vulnerability, we propose JamShield, a lightweight defensive framework. JamShield strategically injects dummy playlists into an account to dilute the PII-carrying signal. Our analysis indicates that JamShield represents a promising defense, lowering inference F1-scores by an average of 10%. This work provides an initial Offensive-AI benchmark for playlist-based PII inference using architectures that leverage set- and graph-structured data and introduces a defense showing encouraging mitigation effects. 
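musicPIIrate 的集合类架构(如 Deep Sets)依赖"先逐元素编码、再做置换不变聚合"这一结构。示意草图如下(phi、rho 为假设的编码与读出函数,非论文网络):

```python
import numpy as np

def deep_sets(items, phi, rho):
    """Permutation-invariant set encoder: rho(sum_i phi(x_i)) (sketch)."""
    return rho(np.sum([phi(x) for x in items], axis=0))
```

由于求和不依赖元素顺序,任意打乱播放列表集合的次序都会得到相同输出,这正是处理无序、变长集合数据所需的性质。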

[AI-34] Exact Dual Geometry of SOC-ICNN Value Functions

【速读】:该论文旨在解决输入凸神经网络(Input Convex Neural Networks, ICNNs)在下游推理过程中缺乏显式几何解析能力的问题,尤其是针对近期提出的二阶锥输入凸神经网络(Second-Order Cone ICNNs, SOC-ICNNs)的可微分几何结构建模问题。其解决方案的关键在于从对偶视角出发,揭示SOC-ICNN的最优对偶变量与支撑斜率(supporting slopes)、次梯度(subdifferentials)、方向导数(directional derivatives)及局部Hessian矩阵之间的精确映射关系,从而实现白盒推理中几何信息的直接读出,突破传统黑箱自动微分的局限性。

链接: https://arxiv.org/abs/2605.04722
作者: Kang Liu,Jianchen Hu,Wei Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Input Convex Neural Networks (ICNNs) are commonly used in a two-stage manner: one first trains a convex network and then minimizes it over its input in a downstream inference problem. Recent second-order-cone ICNNs (SOC-ICNNs) enrich ReLU-based ICNNs with quadratic and conic modules and admit an exact representation as value functions of second-order cone programs (SOCPs). This value-function structure enables an explicit convex-analytic treatment of SOC-ICNN inference. In this paper, we study the exact first-order and local second-order geometry of SOC-ICNNs from the dual viewpoint. We show that supporting slopes, subdifferentials, directional derivatives, and local Hessians can be recovered directly from optimal dual variables. These results provide the geometric primitives for white-box SOC-ICNN inference, going beyond black-box automatic differentiation. Numerical experiments validate the exact multiplier readout, the local Hessian formula, and the set-valued behavior at structurally degenerate inputs. We also provide a step-by-step tutorial showing how the readout mechanism instantiates a complete white-box inference loop. The code is available at this https URL.
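凸值函数的"支撑斜率可从最优解的活跃结构读出"这一点,可用一维分段线性极大值函数直观说明(仅为示意,与论文中 SOCP 对偶变量的具体读出公式无关):

```python
def support_slopes(pieces, x, tol=1e-9):
    """For V(x) = max_i (a_i * x + b_i), return V(x) and the slopes of all
    active pieces; multiple slopes at a kink span the subdifferential."""
    vals = [a * x + b for a, b in pieces]
    v = max(vals)
    return v, [a for (a, _), val in zip(pieces, vals) if abs(val - v) <= tol]
```

在可微点只有一个活跃斜率(即梯度);在结构退化点(kink)会读出集合值的次微分,与摘要中"set-valued behavior at structurally degenerate inputs"的现象相对应。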

[AI-35] Budget-aware Auto Optimizer Configurator

【速读】:该论文旨在解决大规模模型训练中优化器状态(optimizer states)占用大量GPU显存的问题。研究表明,不同网络模块中的梯度行为存在显著差异,如方向稳定性与尺度各向异性不同,表明并非所有模块都需要高成本的优化器配置,全局统一使用昂贵优化器会导致内存效率低下。解决方案的关键在于提出预算感知的优化器配置器(Budget-Aware Optimizer Configurator, BAOC),其通过采样梯度流来量化采用低精度或移除动量等简化配置所带来的性能风险,并在显存和时间预算约束下求解最优分配问题,从而为每个网络模块选择性价比最高的优化器配置,实现训练质量保持的同时显著降低优化器状态的内存消耗。

链接: https://arxiv.org/abs/2605.04711
作者: Kang Liu,Wei Peng,Jianchen Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Optimizer states occupy massive GPU memory in large-scale model training. However, gradients in different network blocks exhibit distinct behaviors, such as varying directional stability and scale anisotropy, implying that expensive optimizer states are not universally necessary and using a global optimizer is often memory-inefficient. We propose the Budget-Aware Optimizer Configurator (BAOC) to reduce memory cost by assigning suitable optimizer configurations to individual blocks under given budgets. Specifically, BAOC samples gradient streams to derive statistical metrics that quantify the potential performance risk of applying cheaper configurations (e.g., low precision or removing momentum). It then solves a constrained allocation problem to minimize total risk under memory and time budgets, selecting a budget-feasible configuration for each block. Experiments across vision, language, and diffusion workloads demonstrate that BAOC maintains training quality while significantly reducing the memory usage of optimizer states. The code is available at this https URL.
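BAOC 的约束分配问题可以用一个"每块两档配置(便宜/昂贵)"的贪心草图来直观理解(示意,非论文求解器;数据结构为假设,且假定昂贵档内存占用更高):

```python
def allocate_configs(blocks, mem_budget):
    """blocks: list of (cheap_risk, cheap_mem, rich_risk, rich_mem).
    Start cheap everywhere, then upgrade blocks by best risk drop per byte."""
    choice = [0] * len(blocks)          # 0 = cheap config, 1 = rich config
    mem = sum(b[1] for b in blocks)     # memory of the all-cheap baseline
    order = sorted(
        range(len(blocks)),
        key=lambda i: (blocks[i][0] - blocks[i][2]) / (blocks[i][3] - blocks[i][1]),
        reverse=True,                   # best risk-reduction-per-byte first
    )
    for i in order:
        extra = blocks[i][3] - blocks[i][1]
        if blocks[i][0] > blocks[i][2] and mem + extra <= mem_budget:
            choice[i] = 1
            mem += extra
    return choice, mem
```

直觉与论文一致:并非所有模块都值得昂贵的优化器状态,预算应优先花在"单位内存带来风险下降最多"的块上。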

[AI-36] Average Attention Transformers and Arithmetic Circuits

【速读】:该论文旨在解决Transformer编码器作为序列到序列函数的计算能力问题,特别是其能否模拟特定类型的算术电路。解决方案的关键在于证明:通过使用平均硬注意力机制(average hard attention),Transformer编码器可以模拟常数深度的算术电路,这些电路包含无界加法、二元乘法和符号门;同时,当使用典型的平均注意力机制时,所计算的函数同样属于此类电路家族。这一结论在实数、有理数及两者之间的任意环上均成立,表明Transformer架构本质上具备与低深度算术电路相当的计算表达能力。

链接: https://arxiv.org/abs/2605.04683
作者: Lena Ehrmuth,Laura Strieker
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. The circuit families that can be simulated this way have constant depth while using unbounded addition, binary multiplication and sign gates. The transformers we use have arithmetic circuits instead of feed-forward networks. With typical average attention the functions they compute are also computed by the same class of circuit families. Our results hold for transformers over the reals, rationals and any ring in between the two.
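"平均硬注意力"(average hard attention)通常指在得分最大的所有位置上做均匀平均。示意实现如下(非论文中的电路模拟构造):

```python
import numpy as np

def average_hard_attention(scores, values, tol=1e-9):
    """Uniform average of values at positions attaining the max score."""
    scores = np.asarray(scores, dtype=float)
    values = np.asarray(values, dtype=float)
    mask = (scores >= scores.max() - tol).astype(float)  # 1 at argmax set
    weights = mask / mask.sum()
    return weights @ values
```

与 softmax 注意力不同,权重在非最大位置严格为零,这种离散选择能力是论文模拟算术电路门的基础。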

[AI-37] CodeEvolve: LLM-Driven Evolutionary Optimization with Runtime-Enriched Target Selection for Multi-Language Code Enhancement

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码优化中难以自动识别性能瓶颈、保证功能正确性并系统性提升程序性能与代码质量的问题。其核心解决方案是提出CodeEvolve框架,关键在于引入运行时引导的目标选择机制(基于Java Flight Recorder的权重组件图)、蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)进行候选编辑探索、自动化代码精炼策略以及针对特定编程语言(Java和Salesforce Apex)的评估流水线,从而实现高效、可靠且可验证的代码优化过程。

链接: https://arxiv.org/abs/2605.04677
作者: Ajay Krishna Borra,Wenzhuo Yang,Samarth Arora,Akhilesh Deepak Gotmare,Gokulakrishnan Gopalakrishnan,Tharun Gali,Madhav Rathi,Doyen Sahoo,Manpreet Singh,Mayuresh Verma,Laksh Venka,Shuchita Singh
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:We present CodeEvolve, an evolutionary framework for improving program performance and code quality with Large Language Models (LLMs). CodeEvolve extends OpenEvolve with runtime-guided target selection, Monte Carlo Tree Search (MCTS), automated code refinement, and language-specific evaluation pipelines for Java and Salesforce Apex. The system uses Java Flight Recorder (JFR) profiles to build weighted component graphs and select optimization targets that account for most execution cost, reducing reliance on manual bottleneck identification. For each target, CodeEvolve generates candidate edits, evaluates them through build validation, unit tests, performance checks, static analysis, and LLM-based review, and retains only variants that preserve functional correctness. Across real-world optimization tasks, CodeEvolve improves performance and code metrics while maintaining correctness. On a large enterprise Java codebase, it achieves an average speedup of 15.22 \times across seven hotspot functions and outperforms single-pass LLM optimization on five of them. An ablation study on Apex optimization shows that the full MCTS-augmented configuration produces 19.5 valid programs out of 20 on average, indicating that search, filtering, and refinement each contribute to more reliable optimization.
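运行时引导的目标选择,本质上是从剖析数据中挑出覆盖大部分执行成本的热点函数。下面的贪心草图仅作示意(剖析数据结构为假设,真实系统使用 JFR 加权组件图):

```python
def select_targets(profile, coverage=0.8):
    """profile: {fn_name: self_cost}. Pick the hottest functions until they
    cover `coverage` of total execution cost (sketch)."""
    total = sum(profile.values())
    picked, acc = [], 0.0
    for fn, cost in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
        if acc >= coverage * total:
            break
        picked.append(fn)
        acc += cost
    return picked
```

这样可以把昂贵的搜索与评估预算集中在少数高收益目标上,减少对人工瓶颈定位的依赖。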

[AI-38] AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair NEURIPS2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 修复任务中评估器(evaluator)配置变化导致的排行榜排名不稳定问题,即“评估器重构引发的排序漂移”(evaluator reconfiguration-induced ranking instability)。其核心问题是:部分修复方法在内部选择候选修复方案时会依赖评估器衍生信号(evaluator-derived signal),从而造成排行榜结果不可靠。解决方案的关键在于提出 AuditRepairBench——一个包含 576,000 个注册单元(96,000 个执行)的配对执行轨迹语料库,并设计了一种模块化筛选架构(modular screening architecture),通过四种可互换的实现方式(包括基于规则的通道暴露比、学习型影响代理、反事实敏感性代理和稀疏人工审计代理)构建筛选后验概率,进而驱动细胞级翻转函数(cell-level flip functional)、集合标签、分层系统评分与集合型排行榜。该方案显著降低了排名位移(平均减少 62%),且仅需少于 50 行代码,同时具备良好的实施鲁棒性和不确定性传播能力(95% 覆盖率从 0.81 提升至 0.95)。

Link: https://arxiv.org/abs/2605.04624
Authors: Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 25 pages, 8 figures, NeurIPS 2026 Evaluation and D Directions Track

Click to view abstract

Abstract:Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and release AuditRepairBench, a paired-execution trace corpus of 576,000 registered cells (96,000 executed) that operationalizes evaluator-channel-blocking ranking instability within a declared observability boundary. A modular screening architecture decides pathway-blocking through four interchangeable implementations (a learned influence proxy, a rule-based channel-exposure ratio that uses no trained model, a counterfactual sensitivity proxy, and a sparse human-audit proxy), combined into a screening posterior that feeds a cell-level flip functional, a set-valued label, a stratified system score, and a set-valued leaderboard. The resource is supported by mechanism-anchored validation on an 80-case source-level channel-surgery subset; an independent-discovery protocol under which two annotator groups separated from the pipeline developers discover coupling patterns blinded to the screening design, with the frozen ensemble attaining pooled AUROC 0.83 on their 79 cases; implementation robustness; uncertainty propagation that raises 95% coverage from 0.81 to 0.95; and forward transfer with pooled community-evaluator Spearman ρ = 0.65. Screening-guided blinding patches reduce rank displacement by 55–74% (mean 62%) at fewer than 50 lines of code, whereas random channel blinding produces at most 7% reduction and generic retraining at most 13%. AuditRepairBench-Lite, a rule-only configuration on a 12,000-cell subset, preserves the leaderboard at Kendall τ = 0.88 under twenty-four GPU-hours and is the primary release artifact at 42 GB.

[AI-39] Library learning with e-graphs on jazz harmony

【Quick Read】: This paper asks how humans internalize jazz harmonic patterns through repeated reflection and re-listening, the core challenge being to build a computational model of this gradual internalization process. The key to the solution is a library-learning framework that, given a corpus of harmonic progressions, searches a space of programs composed of primitive harmonic relations to discover concise generative explanations of the corpus; by integrating deductive parsing with library learning on e-graphs, it efficiently navigates the joint space of programs and libraries, enabling structured learning and refactoring of harmonic patterns.

Link: https://arxiv.org/abs/2605.04622
Authors: Zeng Ren, Maddy Bowers, Xinyi Guan, Martin Rohrmeier
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 10 pages, 7 figures, 2 listings, 1 table, no conference

Click to view abstract

Abstract:Humans can acquire a highly structured intuitive understanding of musical patterns, yet these patterns often require multiple iterations of reflection and re-listening to internalize fully. To capture such an internalization process, we present a computational model for the learning of jazz harmonic patterns based on library learning. Given a corpus of harmonic progressions, our model searches over a space of programs composed of primitive harmonic relations in order to discover concise generative explanations of the corpus. The model first enumerates possible programs for each piece, and then jointly learns a library of harmonic patterns and refactored programs. To efficiently navigate the vast joint space of programs and libraries, we integrate deductive parsing with library learning on e-graphs. We explore how well our model captures aspects of human musical pattern learning by evaluating the intuitiveness of both programs and libraries, as well as similarities to human-written harmonic derivations.

[AI-40] Guidelines for Designing AI Technologies to Support Adult Learning DATE

【Quick Read】: This paper addresses the poor fit of current AI-supported learning systems to adult education: existing AI educational technologies focus largely on K-12 and fail to meet adult learners' distinct needs, constraints, and learning goals. The key to the solution is a reflexive thematic analysis of longitudinal data from several AI learning systems deployed by a multidisciplinary, national research institute, distilled into a framework of 19 design guidelines; the guidelines' effectiveness is validated through heuristic evaluation and a guideline exploration tool, providing an actionable basis for designing future AI systems for adult learners.

Link: https://arxiv.org/abs/2605.04616
Authors: Jennifer M. Reddig, Glen R. Smith Jr, Sanaz Ahmadzadeh Siyahrood, Wesley G. Morris, Yoojin Bae, Kaitlyn Crutcher, John Kos, Rahul K. Dass, Jinho Kim, Momin Naushad Siddiqui, Daniel Weitekamp, Ploy Thajchayapong, Sandeep Kakar, Alex Endert, Scott Crossley, Min Kyu Kim, Chris Dede, Ashok Goel, Christopher J. MacLellan
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Pages: 22, Figures: 7, Tables: 3, Conference: Designing Interactive Systems (DIS) 2026, Dates: received 19 January 2026; revised 12 March 2026; accepted 5 June 2026. Jennifer M. Reddig, Glen R. Smith Jr.: co-first authors

Click to view abstract

Abstract:AI-powered educational technologies have demonstrated measurable benefits for learners, but their design and evaluation have largely centered on K-12 contexts. As a result, many AI-supported learning systems remain poorly aligned with the needs, constraints, and goals of adult learners. To better understand how AI systems function in adult education, this paper examines the deployment of several AI learning technologies developed within a multidisciplinary, national research institute in the United States focused on adult learning and online education. Drawing on longitudinal deployment data, we conducted a reflexive thematic analysis to identify recurring challenges and design considerations across systems. These insights were synthesized into a set of 19 design guidelines intended to inform future AI-supported adult learning technologies. We demonstrate the utility of these guidelines through a heuristic evaluation of the deployed systems. Lastly, we present a guideline exploration tool that aids in the ideation of technologies by connecting the guidelines to stakeholder statements surfaced in the analysis process.

[AI-41] Beyond Retrieval: A Multitask Benchmark and Model for Code Search

【Quick Read】: This paper targets the limitations of existing code-search evaluation, including data contamination, label noise, and degenerate binary relevance judgments, and notes that prior work is confined to first-stage retrieval while ignoring the full code-search pipeline of production systems (reranking and developer-style queries). The key to the solution is CoREB, a multitask, contamination-limited benchmark built from counterfactually rewritten LiveCodeBench problems across five programming languages, with timed releases and graded relevance annotations, together with a fine-tuned code reranker (CoREB-Reranker) that achieves consistent gains across text-to-code, code-to-text, and code-to-code tasks, the first to demonstrate such improvements at the full-pipeline level.

Link: https://arxiv.org/abs/2605.04615
Authors: Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: project site: this https URL

Click to view abstract

Abstract:Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
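The benchmark's headline metric, nDCG@10 over graded relevance judgments, is a standard ranking measure. As a reference point (this is the textbook formula, not the paper's own evaluation code), a minimal sketch:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the observed ranking divided by DCG of the ideal
    (descending-relevance) ranking; 0.0 when no relevant items exist."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With graded labels, a perfect ordering scores 1.0 and a ranking with no relevant results scores 0.0, which is why the abstract's "near-zero nDCG@10" on keyword queries indicates almost complete retrieval failure.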

[AI-42] VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

【Quick Read】: This paper tackles three problems in current automatic singing voice transcription (SVT): reliance on complex multi-stage pipelines, difficulty recovering text-note alignment, and poor generalization to out-of-distribution (OOD) singing data. The key to the solution is VocalParse, a unified SVT model built on a Large Audio Language Model (LALM) that uses an interleaved prompting format to jointly model lyrics, melody, and word-note correspondence, directly generating a structured score sequence; a Chain-of-Thought (CoT) style prompting strategy decodes the lyrics first as a semantic scaffold, effectively mitigating context disruption while preserving the structural benefits of interleaved generation, achieving state-of-the-art SVT performance on multiple singing datasets.

Link: https://arxiv.org/abs/2605.04613
Authors: Yukun Chen, Tianrui Wang, Zhaoxi Mu, Xinyu Yang, EngSiong Chng
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary. Despite their utility, current automatic transcription systems face significant challenges: they often rely on complex multi-stage pipelines, struggle to recover text-note alignments, and exhibit poor generalization to out-of-distribution (OOD) singing data. To alleviate these issues, we present VocalParse, a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Specifically, our novel contribution is to introduce an interleaved prompting formulation that jointly models lyrics, melody, and word-note correspondence, yielding a generated sequence that directly maps to a structured musical score. Furthermore, we propose a Chain-of-Thought (CoT) style prompting strategy, which decodes lyrics first as a semantic scaffold, significantly mitigating the context disruption problem while preserving the structural benefits of interleaved generation. Experiments demonstrate that VocalParse achieves state-of-the-art SVT performance on multiple singing datasets. The source code and checkpoint are available at this https URL.
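The abstract does not specify the exact serialization, but the lyrics-first, interleaved idea can be illustrated with a hypothetical target format; all markers and field names below are invented for illustration, not VocalParse's actual tokenization:

```python
def build_cot_target(words, notes):
    """Hypothetical CoT-style serialization: emit the full lyrics first
    (semantic scaffold), then interleaved word-note entries.
    `notes[i]` is a list of (pitch, onset, offset) tuples for word i,
    so melisma (one word over several notes) is representable."""
    lyrics = " ".join(words)
    body = []
    for i, word in enumerate(words):
        for pitch, onset, offset in notes[i]:
            body.append(f"<{word}|{pitch}|{onset:.2f}-{offset:.2f}>")
    return f"[LYRICS] {lyrics} [SCORE] " + " ".join(body)
```

Under this reading, word-note alignment falls out of the sequence itself, since every note entry carries the word it belongs to.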

[AI-43] SensingAgents : A Multi-Agent Collaborative Framework for Robust IMU Activity Recognition

【Quick Read】: This paper addresses three persistent challenges in human activity recognition (HAR) based on inertial measurement units (IMUs): heavy reliance on labeled data, position-specific sensor ambiguity, and the lack of interpretable reasoning. The key to the solution is SensingAgents, a multi-agent system that achieves collaborative reasoning through role specialization: a group of Analyst Agents handles sensor analysis for different wearing positions (arm, wrist, belt, pocket); a pair of Advocate Agents resolves inter-sensor conflicts through a dynamic and static dialectical debate mechanism; and a Decision Agent safeguards recognition reliability under sensor drift or failure. This architecture markedly improves robustness and interpretability in complex scenarios, especially when multi-sensor data is conflicting or noisy.

Link: https://arxiv.org/abs/2605.04608
Authors: Naiyu Zheng, Tianlong Yu, Haochen Yin, Xiaoyi Fan, Xiping Hu, Zhimeng Yin
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is a cornerstone of mobile health, smart environments, and human-computer interaction. However, current deep learning-based HAR models often struggle with heavy reliance on labeled data, position-specific ambiguity, and a lack of transparent reasoning. Inspired by advanced agent frameworks that emulate collaborative reasoning with Large Language Models (LLMs), we propose SensingAgents, a novel multi-agent system for robust IMU activity recognition. SensingAgents organizes LLM-powered agents into specialized roles: a group of Analyst Agents for position-specific sensor analysis (arm, wrist, belt, pocket), a pair of Advocate Agents that resolves sensor conflicts through dynamic and static dialectical debates, and a Decision Agent that ensures reliability under sensor drift or failure. Evaluation on the Shoaib dataset demonstrates that SensingAgents significantly outperforms state-of-the-art single-agent and multi-agent LLM models, achieving an accuracy of 79.5% in a zero-shot setting (29% higher than existing agent models and 9.4% higher than deep learning baselines), particularly in complex scenarios where multi-sensor data is conflicting or noisy. Our work highlights the potential of multi-agent collaborative reasoning for advancing the robustness and interpretability of ubiquitous sensing systems.

[AI-44] A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints ICML2026

【Quick Read】: This paper addresses inference inefficiency in deployed large language models (LLMs) caused by the combined constraints of computation and GPU memory, focusing on preventing unbounded request-queue growth so that service stability is guaranteed. The key to the solution is the first queueing-theoretic framework that jointly accounts for compute capacity and GPU memory constraints (in particular the memory overhead of the KV cache), from which rigorous stability and instability conditions are derived. This provides a theoretical basis for deployment: by matching the estimated request arrival rate against the theoretically stable service rate, operators can compute the required cluster size precisely, achieving optimal GPU provisioning that avoids both over-purchasing and performance violations.

Link: https://arxiv.org/abs/2605.04595
Authors: Chengyi Nie, Nian Si, Zijie Zhou
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Accepted in ICML 2026

Click to view abstract

Abstract:The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-value (KV) caching, which accelerates decoding but quickly exhausts GPU memory. In this paper, we introduce the first queueing-theoretic framework that explicitly incorporates both computation and GPU memory constraints into the analysis of LLM inference. Based on this framework, we derive rigorous stability and instability conditions that determine whether an LLM inference service can sustain incoming demand without unbounded queue growth. This result offers a powerful tool for system deployment, potentially addressing the core challenge of GPU provisioning. By combining an estimated request arrival rate with our derived stable service rate, operators can calculate the necessary cluster size to avoid both costly over-purchasing and performance-violating under-provisioning. We further validate our theoretical predictions through extensive experiments in real GPU production environments. Our results show that the predicted stability conditions are highly accurate, with deviations typically within 10%.
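The provisioning use case described above reduces to a simple calculation once a stable per-GPU service rate is known. A minimal sketch, assuming a given request arrival rate and per-GPU stable service rate (the paper derives the latter from compute and KV-cache memory constraints; the headroom factor here is an invented safety margin, not from the paper):

```python
import math

def required_gpus(arrival_rate, stable_rate_per_gpu, headroom=0.8):
    """Smallest cluster size n such that arrival_rate < headroom * n * mu,
    i.e. utilization stays below `headroom` so queues remain bounded.
    Rates are in requests per second (any consistent unit works)."""
    if stable_rate_per_gpu <= 0:
        raise ValueError("stable service rate must be positive")
    return math.ceil(arrival_rate / (headroom * stable_rate_per_gpu))
```

For example, at 100 req/s arriving and a stable rate of 2.5 req/s per GPU with 80% target utilization, this sizing rule calls for 50 GPUs; the stability theorem in the paper is what justifies treating the per-GPU rate as a hard sustainable throughput.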

[AI-45] HeterSEED: Semantics-Structure Decoupling for Heterogeneous Graph Learning under Heterophily

【Quick Read】: This paper addresses a failure mode of conventional heterogeneous graph neural networks under strong heterophily: because message passing relies primarily on feature similarity, information propagation becomes distorted and prediction bias grows. The key to the solution is the HeterSEED framework, which decouples semantics from structure: a type- and relation-aware semantic channel captures local semantics, while a structure-aware heterophily channel uses pseudo-label-guided partitioning to separate homophilic from heterophilic neighborhoods and aggregates them with metapath-based structural weights; a node-level adaptive fusion mechanism then produces context-dependent representations. Theoretically, the method is proven strictly more expressive on heterophilic graphs than standard models relying only on feature similarity, and provably reduces the prediction bias introduced by heterophilic neighbors.

Link: https://arxiv.org/abs/2605.04594
Authors: Xinyi Li, Ming Li, Lu Bai, Lixin Cui, Feilong Cao, Ke Lv, Yunliang Jiang, Pietro Liò
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 9 figures

Click to view abstract

Abstract:Many real-world heterogeneous graphs exhibit pronounced heterophily, where connected nodes often have dissimilar labels or play different semantic roles. In such settings, standard heterogeneous graph neural networks that aggregate messages along metapaths or meta-relations primarily based on feature similarity can propagate misleading information, since feature similarity may be misaligned with underlying relational semantics. In this paper, we propose HeterSEED, a semantics-structure decoupling framework for heterogeneous graph learning under heterophily. HeterSEED decouples representation learning into a heterogeneous semantic channel that captures type- and relation-aware local semantics and a structure-aware heterophily channel that separates homophilic and heterophilic neighborhoods via pseudo-label-guided partitioning and aggregates them using metapath-based structural weights. A node-level adaptive fusion mechanism then combines the two channels to produce context-dependent node representations. Theoretically, we establish that, on heterogeneous graphs under heterophily, HeterSEED is strictly more expressive than standard heterogeneous graph neural networks that rely primarily on feature similarity and provably reduces the prediction bias introduced by heterophilic neighbors. Experiments on five real-world heterogeneous graphs, including two large-scale networks at the million-node and hundred-million-edge scale, demonstrate that HeterSEED consistently outperforms representative heterogeneous graph neural networks and recent heterophily-aware baselines, especially in strongly heterophilic regimes.
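The pseudo-label-guided partitioning step can be sketched in a few lines; the function name and dictionary-based label store below are illustrative, not the paper's implementation:

```python
def partition_neighbors(node, neighbors, pseudo_labels):
    """Split a node's neighborhood by pseudo-label agreement:
    homophilic neighbors share the node's predicted class,
    heterophilic neighbors do not. The two sets are then aggregated
    separately (with metapath-based weights in HeterSEED)."""
    homo = [v for v in neighbors if pseudo_labels[v] == pseudo_labels[node]]
    hetero = [v for v in neighbors if pseudo_labels[v] != pseudo_labels[node]]
    return homo, hetero
```

The point of the split is that a heterophilic edge is not noise to be suppressed but a distinct signal channel, so the two neighbor sets feed separate aggregations rather than one similarity-weighted sum.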

[AI-46] From Parameter Dynamics to Risk Scoring : Quantifying Sample-Level Safety Degradation in LLM Fine-tuning ICML2026

【Quick Read】: This paper addresses the fragility of safety alignment in large language models (LLMs), where fine-tuning on a small number of benign samples can destroy safety behavior. Existing studies mostly explain the phenomenon by comparing parameters or hidden states before and after fine-tuning, overlooking the mechanism of parameter dynamics during fine-tuning. The paper's key finding is that benign fine-tuning causes parameters to drift cumulatively toward danger-aligned directions, progressively undermining model safety; building on this mechanism, it proposes Sample-Level Quantification of Safety Degradation (SQSD), which obtains a continuous risk score for each training sample by measuring the difference between the projections of its induced parameter update onto danger and safety directions, thereby effectively identifying high-risk samples. Experiments show that SQSD accurately quantifies sample-level fine-tuning risk and transfers well across model architectures, parameter scales, and parameter-efficient methods.

Link: https://arxiv.org/abs/2605.04572
Authors: Xiao Wang, Yifei Zhang, YongKang Liu, Xiaocui Yang, Zihan Wang, Shi Feng, Daling Wang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidden states before and after fine-tuning, but overlook their dynamic evolution during fine-tuning. In this paper, we uncover a critical mechanism underlying safety degradation by analyzing parameter dynamics: benign fine-tuning causes parameters to cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. This finding suggests that samples contributing more to this drift carry greater fine-tuning risk. Based on this insight, we propose a method of Sample-Level Quantification of Safety Degradation (SQSD), which quantifies the influence of each training sample on safety degradation. Specifically, SQSD assigns continuous risk scores to samples by measuring the difference between the projections of their induced parameter updates onto danger and safety directions. Extensive experiments across multiple models and datasets demonstrate that SQSD effectively quantifies sample-level fine-tuning risks and exhibits strong transferability across model architectures, parameter scales, and parameter-efficient methods.
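The projection-difference score can be sketched with NumPy, assuming flattened parameter-update and direction vectors are already in hand (in the paper the danger/safety directions come from the model's alignment geometry; here they are plain inputs):

```python
import numpy as np

def sqsd_risk_score(update, danger_dir, safety_dir):
    """Sample-level risk: projection of the sample-induced parameter
    update onto the (unit-normalized) danger direction, minus its
    projection onto the safety direction. Positive values flag samples
    that push parameters toward danger-aligned drift."""
    d = danger_dir / np.linalg.norm(danger_dir)
    s = safety_dir / np.linalg.norm(safety_dir)
    return float(update @ d - update @ s)
```

Ranking a fine-tuning set by this score, highest first, is then one way to surface the samples most responsible for safety degradation.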

[AI-47] Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

【Quick Read】: This paper addresses the computational cost of hybrid model-based reinforcement learning approaches such as Model Predictive Control (MPC) in high-dimensional control tasks, which stems from their reliance on gradient-free optimization; gradient-based alternatives are promising, but empirical studies have found them underperforming gradient-free methods. The key to the solution is Dream-MPC, which generates a small set of candidate trajectories from a rolled-out policy and optimizes each by gradient ascent through a learned world model, combined with uncertainty regularization and amortization of optimization iterations across timesteps (reusing previously optimized actions), markedly improving efficiency and performance. Experiments on 24 continuous control tasks show that Dream-MPC outperforms gradient-free MPC and state-of-the-art baselines.

Link: https://arxiv.org/abs/2605.04568
Authors: Jonathan Spieler, Sven Behnke
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. We will open source our code and more at this https URL.
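Two of the ingredients above, gradient-ascent refinement of a candidate action sequence and warm-starting the next timestep from previously optimized actions, can be sketched abstractly. The world model's return gradient is abstracted as a callable, and all names are illustrative; this is a sketch of the technique, not the released implementation:

```python
import numpy as np

def refine_actions(actions, return_grad, lr=0.1, iters=5):
    """Gradient-ascent refinement of one candidate action sequence.
    `return_grad(a)` stands in for d(return)/d(actions) computed by
    backpropagating through a learned (differentiable) world model."""
    a = actions.copy()
    for _ in range(iters):
        a += lr * return_grad(a)
    return a

def shift_warm_start(actions):
    """Amortization across timesteps: drop the executed first action and
    repeat the last action as the new tail guess for the next MPC step."""
    return np.concatenate([actions[1:], actions[-1:]])
```

Because each MPC step starts from an already-refined sequence, only a few gradient iterations are needed per step, which is where the claimed efficiency over population-based planners comes from.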

[AI-48] Stage-adaptive audio diffusion modeling

【Quick Read】: This paper addresses inefficiency in training audio diffusion models, rooted in static optimization recipes that fail to adapt the evolving balance between semantic acquisition and generation-oriented refinement across training stages. The key to the solution is a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which characterizes semantic progress during training and drives three complementary stage-aware mechanisms: decayed SSL guidance early in training to strengthen semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. These mechanisms yield better convergence and generation quality on both text-conditioned audio generation and audio super-resolution, supporting the view that treating external guidance, internal organization, and optimization emphasis as stage-dependent components can substantially improve training efficiency.

Link: https://arxiv.org/abs/2605.04547
Authors: Xuanhao Zhang, Chang Li
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we introduce a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage-aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text-conditioned audio generation and audio-conditioned super-resolution. Across both settings, the proposed stage-aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage-dependent components rather than fixed ingredients.
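The regime variable is the training-time slope of an SSL-space discrepancy; one natural reading is a least-squares slope over a recent logging window. A minimal sketch (the window length and the near-zero-slope interpretation are assumptions for illustration, not specified in the abstract):

```python
def discrepancy_slope(history, window=20):
    """Least-squares slope of the SSL-space discrepancy over the most
    recent `window` logged values. A strongly negative slope suggests
    active semantic acquisition; a near-zero slope suggests the later,
    refinement-dominated regime."""
    ys = history[-window:]
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

A scheduler can then map this slope to stage-aware choices, e.g. decaying SSL guidance once the slope flattens.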

[AI-49] Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap

【Quick Read】: This paper addresses the accountability gap created as generative AI coding assistants and autonomous agents become pervasive in software development workflows: existing research focuses on productivity gains while neglecting the core question of who is responsible for the correctness, safety, and legal compliance of code that agents generate, modify, or recommend. The key to the solution is a comparative analysis of the Terms of Service (ToS) of widely used AI coding tools, which reveals common patterns and divergences in how providers handle ownership, responsibility allocation, liability disclaimers, and data use, and shows a marked misalignment between current policy frameworks and increasingly autonomous software development practice. On this basis, the paper outlines a research roadmap for accountable agents, covering responsibility modeling, governance mechanism design, accountability-supporting tooling, and empirical studies of developer perceptions and practices.

Link: https://arxiv.org/abs/2605.04532
Authors: Christoph Treude
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 3rd ACM International Conference on AI-powered Software (AIware 2026)

Click to view abstract

Abstract:AI coding assistants and autonomous agents are becoming integral to software development workflows, reshaping how code is produced, reviewed, and maintained. While recent research has focused mainly on the capabilities and impacts of productivity of these systems, much less attention has been paid to accountability: who is responsible when agents generate, modify, or recommend code? In practice, accountability is defined through the Terms of Service (ToS) and related policy documents that govern the use of AI-powered development tools. In this vision paper, we present a comparative analysis of the Terms of Service for widely used AI coding assistants and agent-enabled development tools. We examine how these documents allocate ownership, responsibility, liability, and disclosure obligations between tool providers and software developers, and we identify common patterns and divergences between providers. Our analysis reveals a consistent tendency to shift responsibility for correctness, safety, and legal compliance onto users, as well as substantial variation in how providers address issues such as indemnification, data reuse, and acceptable use. Based on these findings, we argue that existing policy frameworks are poorly aligned with increasingly agent-mediated and autonomous software development workflows. We outline a research roadmap for accountable agents in software engineering, identifying challenges and opportunities for modeling responsibility, designing governance artifacts, developing tooling that supports accountability, and conducting empirical studies of developers’ perceptions and practices. 

[AI-50] SADE: Symptom-Aware Diagnostic Escalation for LLM -Based Network Troubleshooting

【Quick Read】: This paper addresses the insufficient root-cause localization performance of current large language model (LLM) agents in network troubleshooting, which on public benchmarks remains far below practical deployment thresholds. The core problem is that existing LLM agents lack the structured, layer-by-layer diagnostic methodology of human network engineers, relying instead on free-form reasoning that conflates evidence collection with hypothesis commitment, hurting diagnostic accuracy. The key to the solution is SADE (Symptom-Aware Diagnostic Escalation), which explicitly encodes the classical Cisco troubleshooting methodology as a diagnostic policy: a phase-gated diagnostic workflow separates evidence acquisition from hypothesis commitment, paired with a skill library routed by fault family and high-yield diagnostic helpers, substantially improving the accuracy and interpretability of root-cause localization.

Link: https://arxiv.org/abs/2605.04530
Authors: Kuan-Hao Tseng, Niruth Bogahawatta, Yasod Ginige, Kosta Dekic, Arunan Sivanathan, Suranga Seneviratne
Institutions: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model (LLM) agents are increasingly applied to network troubleshooting, but root-cause localization on public benchmarks remains well below practical deployment thresholds. We argue this is because existing agents do not encode the disciplined, layer-by-layer methodology that human network engineers use, and instead rely on free-form deliberation that conflates evidence acquisition with hypothesis commitment. We present SADE (Symptom-Aware Diagnostic Escalation), an agent that encodes the classical Cisco troubleshooting methodology as an explicit policy. SADE pairs a phase-gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault-family skills and high-yield diagnostic helpers. On a held-out 523-incident set of the public NIKA benchmark covering eleven unseen scenarios, SADE improves root-cause F1 by 37 percentage points over a ReAct + GPT-5 baseline; a model-controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone, showing that the gain is not a side-effect of the model upgrade.

[AI-51] Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis

【Quick Read】: This paper addresses the performance bottleneck of current automated penetration-testing frameworks, which stems from weak strategy formulation, limited domain-specific reasoning, and inaccurate action and tool selection. The key to the solution is the Pen-Strategist framework, built from two core components: a domain-specific reasoning model that derives penetration-testing strategies via logical reasoning, and a semantics-based CNN classifier that converts strategies into actionable steps. By constructing a reasoning dataset with logical explanations and fine-tuning Qwen-3-14B with reinforcement learning, the method improves strategy derivation by 87% over the baseline; integrated into existing frameworks such as PentestGPT, it raises subtask completion substantially (+47.5%), and it achieves an 18% gain on the CTFKnow benchmark, demonstrating effectiveness and stability in practical scenarios.

Link: https://arxiv.org/abs/2605.04499
Authors: Yasod Ginige, Pasindu Marasinghe, Sajal Jain, Suranga Seneviratne
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Cyber threats are rapidly increasing, expanding their impact from large-scale enterprises to government services and individual users, making robust security systems increasingly essential. However, a significant shortage of skilled cybersecurity professionals exacerbates this challenge. While recent research has explored automating tasks such as penetration testing using LLM-based agents, existing frameworks often perform poorly due to limited capability in strategy formulation, domain-specific reasoning, and accurate action and tool selection. To overcome these limitations, we propose the Pen-Strategist framework, consisting of a novel domain-specific reasoning model that derives pentesting strategies via logical reasoning and a classifier that converts the strategies into actionable steps. First, we construct a reasoning dataset containing logical explanations for both strategy derivation and step selection in pentesting scenarios. We then fine-tune a Qwen-3-14B model for strategy generation using reinforcement learning. Evaluation on the test split of the dataset demonstrates an 87% improvement in strategy derivation performance compared to the baseline. Furthermore, we integrate the fine-tuned Pen-Strategist model into existing automated pentesting frameworks, such as PentestGPT, and evaluate its performance on vulnerable machines, achieving a 47.5% improvement in subtask completion while surpassing the baseline GPT-5. Further experiments on the CTFKnow benchmark show an 18% performance gain over the base model. For step prediction, we train a semantic-based CNN classifier, which outperforms commercial LLMs by 28% and enhances execution stability. Finally, we conduct a user study to qualitatively assess the generated strategies, and Pen-Strategist demonstrates superior performance compared to Claude-4.6-Sonnet.

[AI-52] How Does Thinking Mode Change LLM Moral Judgments? A Controlled Instant-vs-Thinking Comparison Across Five Frontier Models

【Quick Read】: This paper investigates whether enabling "reasoning mode" changes the consistency and ethical coherence of generative AI models' moral judgments. The key to the solution lies in comparing moral judgments from the same model checkpoint in instant mode versus reasoning mode: while aggregate binary-verdict agreement stays stable (Krippendorff's alpha = 0.78 vs. 0.79), on 21 model-disputed scenarios reasoning mode markedly improves cross-model agreement (mean pairwise agreement rising from 5.4 to 6.7), reduces demographic-judgment inconsistency in three of five model families, and more often changes models' self-labeled ethical frameworks than their final moral verdicts, suggesting that reasoning steers models' internal deliberation toward more consistent, interpretable paths.

Link: https://arxiv.org/abs/2605.04488
Authors: Sai Sourabh Madur
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We evaluate whether enabling provider-exposed reasoning mode changes moral judgments within the same model checkpoint. Across 100 moral-judgment scenarios and five frontier reasoning-trained LLMs (Claude Sonnet 4.6, GPT 5.5, Gemini 3 Flash, DeepSeek V3.1, and Qwen3.5 397B), aggregate binary-verdict agreement remains high and statistically indistinguishable between instant and thinking modes (Krippendorff’s alpha = 0.78 vs. 0.79). However, disagreement is concentrated in 21 model-disputed scenarios, where instant-mode agreement is near chance (alpha = 0.08). On these scenarios, reasoning directionally narrows cross-model disagreement, increasing mean pairwise agreement from 5.4 to 6.7 out of 10. Reasoning also reduces demographic-judgment inconsistency in three of five models and does not increase it for any model. Across all five model families, reasoning changes self-labeled ethical frameworks more often than binary verdicts.
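The "mean pairwise agreement out of 10" statistic follows directly from having five models: C(5,2) = 10 verdict pairs per scenario. A minimal sketch of the per-scenario count:

```python
from itertools import combinations

def agreeing_pairs(verdicts):
    """Count agreeing model pairs for one scenario's binary verdicts;
    with five models there are C(5,2) = 10 pairs, so the count ranges
    from 4 (a 3-2 split) to 10 (unanimity)."""
    return sum(a == b for a, b in combinations(verdicts, 2))
```

Averaging this count over the 21 disputed scenarios gives the 5.4 (instant) vs. 6.7 (thinking) figures quoted above.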

[AI-53] CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training PPOPP’26

【Quick Read】: This paper addresses the difficulty of diagnosing slow/hang anomalies in collective communication libraries (CCL) during large-scale distributed training, which arise from complex interactions among software, hardware, and environmental factors and are typically hard to localize and time-consuming to resolve. The key to the solution is the CCL-D system, built from two components: a rank-level real-time probe, based on a lightweight distributed tracing framework, that measures cross-layer anomaly metrics to monitor communication traffic; and an intelligent decision analyzer that performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. After a year deployed on a 4,000-GPU cluster, CCL-D covered nearly all known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing approaches.

Link: https://arxiv.org/abs/2605.04478
Authors: Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, Yifan Chen, Jinwu Yang, Yueyuan Zhou, Qian Zhao, Haoxu Li, Tao Wang, Feng Yu, Zhan Wang, Guangming Tan, Dingwen Tao
Institutions: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted by PPoPP'26, 13 figures, 2 tables

Click to view abstract

Abstract:As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes-substantially outperforming existing solutions.

[AI-54] Joint Optimization of Trajectory Control Resource Allocation and Task Offloading for Multi-UAV-Assisted IoV

【速读】:该论文旨在解决密集城市环境中多无人机(UAV)协同辅助车联网(IoV)任务卸载系统中的高延迟与高能耗问题,其核心挑战在于复杂的非凸优化问题在严格耦合约束下的求解难题。解决方案的关键在于构建一个分层执行框架:首先采用基于二阶锥规划(SOCP)的分布式优化算法优化每架无人机的三维飞行轨迹,以实现自适应网络覆盖;其次提出一种融合深度强化学习(DRL)与大语言模型(LLM)的混合资源调度机制,其中DRL负责初始资源分配,LLM作为语义级宏观调度器修正长尾任务分配失衡问题;同时引入奖励解耦机制,使DRL训练独立于外部LLM干预,保障策略收敛性;最后通过线性规划(LP)在交替优化循环中精确确定任务卸载比例,从而显著提升系统任务成功率和整体效率。

链接: https://arxiv.org/abs/2605.04436
作者: Maoxin Ji,Qiong Wu,Pingyi Fan,Cui Zhang,Nan Cheng,Wen Chen,Khaled B. Letaief
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: This paper has been submitted to TMC

点击查看摘要

Abstract:This paper investigates a multi-Unmanned Aerial Vehicle (UAV) joint base station-assisted Internet of Vehicles (IoV) task offloading system in dense urban environments. To minimize system delay and energy consumption under strict coupling constraints, the complex non-convex optimization problem is decoupled into a hierarchical execution framework. First, a sequential distributed optimization algorithm based on Second-Order Cone Programming (SOCP) is proposed to optimize the 3D flight trajectory of each UAV, ensuring adaptive network coverage. Second, a novel hybrid resource scheduling paradigm synergizing Deep Reinforcement Learning (DRL) and Large Language Models (LLMs) is developed. Within this framework, the DRL agent dictates the initial resource allocation, while the LLM acts as a semantic macro-scheduler to rectify long-tail allocation imbalances for failed and surplus tasks. Crucially, a reward decoupling mechanism is introduced to isolate DRL training from external LLM interventions, thereby ensuring policy convergence. Finally, the task offloading ratios are precisely determined via Linear Programming (LP) within an alternating optimization loop. Simulation results demonstrate that the proposed method significantly outperforms traditional multi-agent reinforcement learning baselines in terms of task success rate and system efficiency.

[AI-55] Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

【速读】:该论文旨在解决强化微调(Reinforcement Fine-Tuning, RFT)训练过程中失败管理缺乏系统性方法的问题。当前RFT训练过程高度脆弱,现有研究多聚焦于系统级可靠性提升或针对特定子问题改进算法,但对训练过程层面的失败识别与自动处理仍处于空白状态,实践中仍依赖专家人工排查与修正。为应对这一挑战,作者构建了首个细粒度RFT失败基准测试集RFT-FaultBench,涵盖5类故障家族、16种故障类型、779次训练运行、22,549条训练步记录与1,457,288条轨迹级记录,并据此开展实证研究,揭示RFT失败在训练动态中具有可观测性,且可通过其经验性“故障指纹”加以区分。基于这些发现,论文提出RFT-FM框架,通过闭环整合异常检测、故障诊断与自动修复机制,实现对RFT失败的自动化管理。实验表明,RFT-FaultBench既非平凡也未饱和(细微故障设置下仍具挑战),而RFT-FM在检测、诊断与缓解RFT失败方面表现出较强能力。

链接: https://arxiv.org/abs/2605.04431
作者: Lingzhe Zhang,Tong Jia,Yunpeng Zhai,Liancheng Fang,Kening Zheng,Hongyi Liu,Xiaosong Huang,Philip S. Yu,Ying Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement fine-tuning (RFT) has become a core paradigm for post-training large language models, yet its training process remains highly fragile. Existing efforts mainly improve reliability at the system level or address specific issues in individual subproblems by modifying RFT algorithms. Despite their effectiveness, they largely overlook the problem of failure management at the training-process level. When training goes wrong, practitioners still rely heavily on expert-driven manual inspection and correction, and automatic failure management for RFT remains largely unexplored. In this paper, we take a first step toward systematic failure management for reinforcement fine-tuning. To understand the empirical structure of RFT failures, we first construct RFT-FaultBench, the first benchmark for fine-grained failures in reinforcement fine-tuning, covering 5 fault families, 16 fault types, 779 training runs, 22,549 train-step records, and 1,457,288 trajectory-level records. Based on this benchmark, we conduct a comprehensive empirical study showing that RFT failures are both observable from training dynamics and distinguishable through their empirical fault fingerprints. Building on these findings, we propose RFT-FM, an automatic failure management framework for reinforcement fine-tuning that unifies anomaly detection, failure diagnosis, and auto remediation in a closed loop. Experimental results show that RFT-FaultBench is neither trivial nor saturated: it exhibits clear anomaly structure while still posing substantial challenges, especially under subtle fault settings. Moreover, RFT-FM shows strong capability in detecting, diagnosing, and mitigating RFT failures.

[AI-56] FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

【速读】:该论文旨在解决连续时间(Continuous-time, CT)Transformer中注意力机制本质离散的问题,该问题限制了其在处理不规则时间序列和长程依赖建模时的灵活性与性能。现有CT Transformer依赖于标准的缩放点积注意力(Scaled-Dot-Product Attention, SDPA),而SDPA无法自然地建模连续动态过程。为此,作者提出FLUID(Flexible Unified Information Dynamics),其核心创新在于用液态注意力网络(Liquid Attention Network, LAN)替代传统SDPA机制。LAN将注意力logits重新解释为输入依赖的非线性递归门控的线性常微分方程(ODE)解,从而直接引入连续动力学特性;同时引入显式的注意力汇(attention-sink)门以抑制无关节点上的注意力质量,增强模型对噪声的鲁棒性与自校正能力。此外,FLUID采用输入依赖的液态超连接(Liquid Hyper-Connections)替代标准残差连接,实现层间信息流的自适应调控。理论与实证表明,LAN在稳定性、泛化能力和效率之间取得平衡,优于CT-RNNs和标准CT Transformers。

链接: https://arxiv.org/abs/2605.04421
作者: Waleed Razzaq,Yun-Bo Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continuous-time (CT) Transformers improve irregular and long-range modeling over CT-RNNs by exploiting inputs or outputs embeddings with continuous dynamics. However, the core scaled-dot-product-attention (SDPA) mechanism remains inherently discrete. We propose FLUID (Flexible Unified Information Dynamics), a CT Transformer that incorporates continuous dynamics directly into the attention computation by replacing it with Liquid Attention Network (LAN). LAN reinterprets attention logits as continuous dynamical system and reformulates them as the solution to a linear ODE modulated by input-dependent nonlinear recurrent gates. Theoretically, we establish stability guarantees for LAN dynamics and show that it serves as an interpolating middle ground between SDPA and CT-RNNs, recovering each as special case under well-defined parameterization of its gating functions. LAN also introduces an explicit attention-sink gate to eliminate disproportionate attention mass on uninformative nodes. FLUID replaces standard residual connections with input-dependent Liquid Hyper-Connections to adaptively regulate interlayer information flow. Empirically, we evaluate FLUID on a broad set of learning tasks, including (i) irregular time-series, (ii) long-range modeling, (iii) lane-keeping control of autonomous vehicles, and (iv) learning physical dynamics under a scarce data regime. Across all the tasks, FLUID consistently matches or outperforms CT baselines, achieving improvements of up to 47% in certain scenarios and enhancing generalization under distributional shifts. Additionally, FLUID demonstrates superior noise robustness and a self-correcting inductive bias in autonomous vehicle control. We also provide a detailed analysis of key hyperparameters to guide tuning and show that FLUID occupies an intermediate position among competing approaches in terms of runtime and memory efficiency.

[AI-57] Demystifying Manifold Constraints in LLM Pre-training

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)预训练中依赖启发式稳定技术(如显式归一化层和权重衰减)所带来的机制不明确问题,尤其是这些技术如何与权重约束之间的相互作用尚不清楚。其解决方案的关键在于提出一种可证明收敛的单循环优化框架——Msign-Aligned Constrained Riemannian Optimizer (MACRO),该方法通过引入显式流形约束,独立地限制前向激活尺度并强制稳定的旋转平衡,从而在理论上取代了传统启发式正则化机制(如RMS归一化和解耦权重衰减)的作用,同时在大规模LLM架构上实现了具有竞争力的性能与严格的Riemannian优化理论保障。

链接: https://arxiv.org/abs/2605.04418
作者: Kang An,Jiaxiang Li,Donald Goldfarb,Shiqian Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints still remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO)-a provably convergent, single-loop optimization framework-our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of these heuristic mechanisms. Evaluations on large-scale LLM architectures demonstrate that MACRO achieves highly competitive performance while rigorously preserving the theoretical guarantees of exact Riemannian optimization.

[AI-58] Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

【速读】:该论文旨在解决Transformer模型在训练过程中如何通过复杂度控制(complexity control)机制决定其倾向于记忆(memorization)还是推理(reasoning)的问题,尤其是复杂度控制在训练时间轴上何时起决定作用尚不明确。解决方案的关键在于识别出一个决定性的“临界窗口”(critical window):仅在训练全程约25%长度的单一窗口内施加权重衰减(weight decay),即可达到与全程权重衰减相当的分布外(OOD)准确率(0.93 vs 0.91);在总正则化预算不变时,将其置于训练中期比置于早期可带来5–9倍的OOD准确率提升;且窗口边界高度敏感,窗口起点仅移动100个优化步即可使平均OOD准确率从随机水平(0.15)跃升至推理水平(0.61)。研究还表明,窗口位置系统性依赖初始化尺度,但小初始化下推理解的吸引域反而收缩,与“初始化越小越好”的流行建议相悖;该现象具有任务特异性,在模算术grokking任务上并不出现。

链接: https://arxiv.org/abs/2605.04396
作者: Sarwan Ali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work has shown that Transformers’ compositional generalization is governed by \emphcomplexity control, initialization scale and weight decay, which steers training toward low-complexity reasoning solutions rather than high-complexity memorization. Existing analyses, however, treat complexity control as a single static hyperparameter choice, leaving open \emphwhen during training this control is actually decisive. We show that the memorization-versus-reasoning fate of a Transformer is determined within a sharp, identifiable window of training. On a controlled compositional task we find that (i)~weight decay applied for a single 25%-of-training window matches full-training weight decay in out-of-distribution (OOD) accuracy ( 0.93 vs 0.91 ); (ii)~holding total regularization budget constant, placing it in the middle of training yields 5-9\times higher OOD accuracy than placing it early; (iii)~the boundary of the critical window is remarkably sharp, window onset shifted by as little as 100 optimization steps causes mean OOD to jump from chance ( 0.15 ) to reasoning-regime ( 0.61 ); (iv)~the window’s position depends systematically on initialization scale, but the basin of attraction for reasoning solutions \emphshrinks at small initialization, contradicting the prevailing recommendation that smaller initialization is uniformly better. We further show that the critical-window phenomenon is task-specific: it does not appear on grokking with modular arithmetic, where properly tuned constant weight decay matches scheduled weight decay.
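摘要中“仅在训练的某一 25% 窗口内施加权重衰减”对应一个按步数开关的衰减调度。下面是一个玩具示意(窗口位置、衰减系数均为本文注释所设的假设值,非论文配置):

```python
def weight_decay_schedule(step, total_steps, window_start_frac=0.375,
                          window_len_frac=0.25, wd=0.1):
    """仅在训练中段一个占总步数 window_len_frac 的窗口内
    返回非零权重衰减系数,其余时刻返回 0。"""
    start = int(total_steps * window_start_frac)
    end = start + int(total_steps * window_len_frac)
    return wd if start <= step < end else 0.0

total = 1000
active = [s for s in range(total) if weight_decay_schedule(s, total) > 0]
print(len(active), active[0], active[-1])  # 窗口覆盖 250 步:375..624
```

优化循环中每步以 `weight_decay_schedule(step, total)` 作为当前的衰减系数,即可实现“预算集中于中期”的调度;把 `window_start_frac` 前移即对应论文中效果较差的早期窗口。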

[AI-59] Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery

【速读】:该论文旨在解决当前人工智能在科学领域应用中的一大瓶颈:如何将AI代理(AI agents)从纯数字环境拓展至真实实验室场景,以实现对物理实验过程的自主控制与探索。现有自主实验室虽已通过编程接口(API)实现仪器自动化,但缺乏高效、安全且通用的系统架构来连接日益强大的AI智能体与底层设备。解决方案的关键在于提出“实验即代码”(Experiment-as-Code, EaC)实验室范式,其核心是将实验设计为可编译的声明式配置(declarative configurations),由系统层完成程序分析、安全性检查、资源分配与任务编排,并最终通过调用设备级API执行程序化实验。这一架构实现了物理世界、系统层与智能层的协同统一,具备跨学科、跨实验室和跨仪器的通用性,从而推动AI for Science迈向新的突破。

链接: https://arxiv.org/abs/2605.04375
作者: Zhenning Yang,Yuhan Chen,Patrick Tser Jern Kon,Tongyuan Miao,Hongyi Lin,Venkat Viswanathan,Danai Koutra,Ang Chen
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: Experiment-as-Code (EaC) white paper

点击查看摘要

Abstract:To unleash the full potential of AI for Science, we must untether the agents from a purely digital environment. The agent’s ability to control and explore in real-world labs is essential because the physical lab remains foundational to scientific discovery. While some tasks can be performed on a computer (e.g., data analysis, running simulated experiments), Eureka moments could occur at any time while operating lab instruments (e.g., when a scientist notices unexpected clues, intuition may prompt a real-time course change). Although autonomous labs are on the rise, which expose programmable APIs to control scientific instruments via software, bridging the gap between increasingly powerful AI agents and automated lab equipment requires innovation that draws insights from computer systems. We propose a new paradigm called ``Experiment-as-Code (EaC) Labs,‘’ where a core concept is to encode experiments as declarative configurations that can be compiled down to device-level APIs. AI agents come up with hypotheses and experiments, written as an ensemble of declarative configurations. The systems layer performs program analysis, safety checks, resource assignment, and job orchestration. Finally, programmatic experimentation occurs via actuating the device APIs. This is a general stack that is science-, lab-, and instrument-independent, representing a novel synthesis across the physical, systems, and intelligence layers to unleash the next breakthrough in AI for Science.
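“声明式配置经安全检查后编译为设备级 API 调用”的思路可用如下玩具示例说明(设备名、操作名与安全规则均为本文注释虚构,并非白皮书定义的接口):

```python
def compile_experiment(config):
    """把声明式实验配置(dict)编译为按序执行的设备级 API 调用字符串。
    编译前做一次极简的安全检查(示意)。"""
    calls = []
    for step in config["steps"]:
        # 示意性安全检查:拒绝超温的加热步骤
        if step["op"] == "heat" and step.get("celsius", 0) > 150:
            raise ValueError("safety check failed: temperature too high")
        args = ", ".join(f"{k}={v}" for k, v in step.items()
                         if k not in ("device", "op"))
        calls.append(f'{step["device"]}.{step["op"]}({args})')
    return calls

exp = {"steps": [
    {"device": "pump", "op": "dispense", "ml": 5},
    {"device": "heater", "op": "heat", "celsius": 80, "minutes": 10},
]}
print(compile_experiment(exp))
# ['pump.dispense(ml=5)', 'heater.heat(celsius=80, minutes=10)']
```

白皮书设想的系统层还包括程序分析、资源分配与作业编排,此处仅示意“配置 → 检查 → 设备调用”这一最小链路。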

[AI-60] Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)控制器在特定网络条件下性能显著下降的问题,尤其是在这些条件本可取得良好性能的情况下。传统方法难以通过枚举识别此类最坏场景,且RL控制器的序列决策与闭环特性使形式化验证方法不可行。其解决方案的关键在于提出ReGuard框架:将最坏情况发现建模为双层后悔最大化问题,从而获得对最坏情况性能差距的可证明下界;再将所发现轨迹作为反事实案例进行分析,编译为轻量级逻辑规则,仅在检测到高风险状态时介入、其余时刻保持控制器行为不变。由此在不重新训练模型的前提下,将已发现的最坏情况性能差距缩小79%–85%,同时保持正常场景性能,且防护效果可泛化至未被发现的网络条件。

链接: https://arxiv.org/abs/2605.04373
作者: Hongyu Hè,Minhao Jin,Maria Apostolaki
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 23 pages, 12 figures, 4 tables

点击查看摘要

Abstract:RL-based controllers achieve strong average-case performance in networking tasks such as congestion control and adaptive bitrate streaming. Yet their performance can degrade severely under network conditions where strong performance is still achievable. Identifying such conditions and quantifying the resulting performance gap is intractable by enumeration, while the sequential and closed-loop nature of RL controllers makes formal verification methods impractical. We present ReGuard, a framework that discovers worst-case scenarios for a given RL controller and protects it against them at inference time without retraining. Discovery is formulated as a bilevel regret-maximization problem, which yields a certified lower bound on the worst-case performance gap. The discovered trajectories are then analyzed as counterfactuals and compiled into lightweight logic rules that intervene only when a risky state is detected, leaving the controller’s behavior unchanged otherwise. We evaluate ReGuard across three RL-based network controllers: Pensieve, Sage, and Park. ReGuard discovers scenarios in which the controller’s performance is 43 - 64% worse than what is achievable. ReGuard not only discovers gaps 57% to 6 \times larger than those found by the strongest baselines but also shrinks them by 79 - 85% via lightweight rule-based protection while preserving nominal performance. ReGuard’s protection extends beyond the scenarios it discovers, improving performance across a wider range of network conditions.
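“规则仅在检测到风险状态时介入,其余时刻保持控制器行为不变”的运行时防护,可示意为一个轻量包装器(风险判据与安全动作均为本文注释所设的假设,非论文规则):

```python
def guarded_controller(controller, rules):
    """包装 RL 控制器:逐条检查 (风险判据, 安全动作) 规则,
    命中时用安全动作覆盖,否则原样返回控制器输出。"""
    def act(state):
        for is_risky, safe_action in rules:
            if is_risky(state):
                return safe_action(state)
        return controller(state)
    return act

# 示意:拥塞控制场景,丢包率超过 20% 视为风险状态
base = lambda s: s["rate"] + 5                                  # 正常情况:加性增速
rules = [(lambda s: s["loss"] > 0.2, lambda s: s["rate"] // 2)]  # 风险时:乘性减速
safe = guarded_controller(base, rules)
print(safe({"rate": 100, "loss": 0.01}))  # 未触发规则 → 105
print(safe({"rate": 100, "loss": 0.5}))   # 触发规则 → 50
```

由于规则只在风险状态下覆盖动作,正常条件下的策略行为与原控制器逐点一致,这正是摘要中“preserving nominal performance”的机制。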

[AI-61] Extending Differential Temporal Difference Methods for Episodic Problems

【速读】:该论文旨在解决差分时序差分(differential temporal difference, differential TD)方法在有限时域(episodic)问题中因奖励中心化(reward centering)可能导致最优策略改变的问题,从而限制其在实际场景中的应用。解决方案的关键在于提出一种差分TD的泛化形式,该形式在存在终止状态的情况下仍能保持策略排序不变性(policy ordering preservation),从而将差分TD从无限时域问题扩展至 episodic 问题;同时,该方法被证明与线性TD算法等价,继承了后者已有的理论保证,并进一步推广了多种流式强化学习算法为对应的差分版本,实验证明奖励中心化可显著提升 episodic 问题中的样本效率。

链接: https://arxiv.org/abs/2605.04368
作者: Kris De Asis,Mohamed Elsayed,Jiamin He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: RLC 2026

点击查看摘要

Abstract:Differential temporal difference (TD) methods are value-based reinforcement learning algorithms that have been proposed for infinite-horizon problems. They rely on reward centering, where each reward is centered by the average reward. This keeps the return bounded and removes a value function’s state-independent offset. However, reward centering can alter the optimal policy in episodic problems, limiting its applicability. Motivated by recent works that emphasize the role of normalization in streaming deep reinforcement learning, we study reward centering in episodic problems and propose a generalization of differential TD. We prove that this generalization maintains the ordering of policies in the presence of termination, and thus extends differential TD to episodic problems. We show equivalence with a form of linear TD, thereby inheriting theoretical guarantees that have been shown for those algorithms. We then extend several streaming reinforcement learning algorithms to their differential counterparts. Across a range of base algorithms and environments, we empirically validate that reward centering can improve sample efficiency in episodic problems.
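差分 TD 的奖励中心化用运行平均奖励 R̄ 修正每步 TD 误差:δ = R − R̄ + V(s′) − V(s),并用同一个 δ 同时更新 V 与 R̄。下面在一个两状态循环、恒定奖励的玩具 MDP 上演示 R̄ 收敛到平均奖励(步长等超参数为假设值;论文针对 episodic 情形的终止处理从略):

```python
def differential_td(steps=5000, alpha=0.1, eta=0.05):
    """两状态循环 MDP(0→1→0),每步奖励恒为 1。
    返回平均奖励估计 r_bar 与差分价值函数 V。"""
    V = [0.0, 0.0]
    r_bar = 0.0
    s = 0
    for _ in range(steps):
        s_next, reward = 1 - s, 1.0
        delta = reward - r_bar + V[s_next] - V[s]  # 中心化 TD 误差
        V[s] += alpha * delta                      # 价值更新
        r_bar += eta * delta                       # 平均奖励估计更新
        s = s_next
    return r_bar, V

r_bar, V = differential_td()
print(round(r_bar, 2))  # 收敛到平均奖励 1.0
```

中心化后回报保持有界,并消除了价值函数与状态无关的整体偏移;论文的贡献在于推广这一更新,使其在存在终止状态时仍保持策略排序不变。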

[AI-62] Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment ICML2026

【速读】:该论文旨在解决TabPFN(一种用于表格数据的基座模型)在面对标签偏移(label shift)时表现不佳的问题,尤其是其容易过拟合训练集中的多数类。解决方案的关键在于提出DistPFN,这是一种无需架构修改或额外训练的测试时后验调整方法,通过降低训练先验(即上下文中的类别分布)的影响,并增强模型预测后验的概率贡献,从而实现对标签偏移的鲁棒性调整;进一步提出的DistPFN-T引入温度缩放机制,根据先验与后验之间的差异自适应地控制调整强度,显著提升了多种基于TabPFN模型在标签偏移场景下的分类性能。

链接: https://arxiv.org/abs/2605.04363
作者: Seunghan Lee,Jaehoon Lee,Jun Seo,Sungdong Yoo,Minjae Kim,Tae Yoon Lim,Dongwan Kang,Hwanil Choi,SoonYoung Lee,Wonbin Ahn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026

点击查看摘要

Abstract:TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model’s predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: this https URL.
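“降低训练先验影响、强调预测后验”的测试时调整,可示意为按上下文类别先验做幂次去偏再归一化(公式形式与温度 τ 为本文注释的示意,不一定与论文的精确公式一致):

```python
def adjust_posterior(posterior, prior, tau=1.0):
    """p_adj(y) ∝ posterior(y) / prior(y)**tau,再归一化。
    tau 控制去偏强度:tau=0 时退化为不调整。"""
    scores = [p / (q ** tau) for p, q in zip(posterior, prior)]
    z = sum(scores)
    return [s / z for s in scores]

# 训练集(上下文)里类别 0 占 90%;模型对某测试样本给出偏向类别 0 的后验
prior = [0.9, 0.1]
posterior = [0.6, 0.4]
print([round(p, 3) for p in adjust_posterior(posterior, prior)])
# 去偏后少数类反超:[0.143, 0.857]
```

在标签偏移场景下,这种重标定抵消了上下文多数类对预测的拖拽;DistPFN-T 进一步按先验与后验的差异自适应选取 τ。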

[AI-63] When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration

【速读】:该论文旨在解决多智能体软件设计中关于上下文注入(context injection)有效性的问题,即传统假设“更多上下文总是更好”是否成立。研究通过在10个任务、7种上下文注入条件和超过2,700次实验中系统验证发现,同一类知识资源(artifact)对不同任务可能产生截然相反的效果:在某些任务上显著提升设计探索范围(最高达20倍覆盖度),而在另一些任务上则明显降低效果(最高减少46%)。关键解决方案在于识别出一个可测量的预测变量——无上下文时的基础探索水平(baseline exploration),其与上下文效果呈强负相关(Pearson r = -0.82, p < 0.001)。进一步机制分析表明,存在两种收敛模式:由训练数据先验驱动的自然收敛会响应上下文干扰,而由显式指令诱导的收敛则不受影响。因此,论文提出应采用条件性上下文注入策略,仅需一次无上下文试验即可低成本诊断特定任务是否受益于知识资源。

链接: https://arxiv.org/abs/2605.04361
作者: Saranyan Vigraham
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 16 pages, 14 tables. 2,700 multi-agent experiments across 10 software design tasks, 7 artifact conditions, and 4 convergence pressure levels

点击查看摘要

Abstract:The prevailing assumption in agent orchestration is that more context is better. We test this on multi-agent software design across 10 tasks, 7 context-injection conditions, and over 2,700 runs, and find a crossover effect: the same artifact type improves design exploration on some tasks (up to 20 \times tradeoff coverage) and actively degrades it on others (up to 46% reduction). On several tasks, an irrelevant document performs as well as or better than every relevant artifact. The direction is predicted by a single measurable variable–baseline exploration without context–with Pearson r = -0.82 ( p 0.001 ). Probing the mechanism by manipulating convergence pressure through prompt design reveals two distinct regimes: convergence driven by training data priors (natural) responds to artifact disruption, while convergence driven by explicit instructions (induced) does not. The implication is that context injection should be conditional, not universal: one no-context trial is a cheap diagnostic that predicts whether knowledge artifacts will help or hurt a given task.

[AI-64] Efficiently Aligning Language Models with Online Natural Language Feedback

【速读】:该论文旨在解决如何在“模糊”(fuzzy)且难以监督的领域中高效训练具备强能力的语言模型的问题,这类领域通常缺乏明确的标签示例,但人类专家仍能对少量模型输出提供高质量的自然语言反馈。解决方案的关键在于引入一种基于在线自然语言反馈的迭代优化框架:首先利用少量专家监督构建代理奖励模型(proxy reward model),通过in-context learning(ICL)或微调(fine-tuning)实现;随后模型针对该代理奖励信号进行优化,在出现过度优化(over-optimization)迹象时停止,再收集新的专家反馈并更新代理奖励模型,形成闭环流程。实验表明,该方法显著提升了专家监督的数据效率:在创意写作(Qwen3-8B)与对齐研究(Haiku 4.5)任务上,ICL 方法以30–50倍更少的专家样本恢复最高35%的性能,微调方法以3–20倍更少的样本恢复80%–100%的性能。

链接: https://arxiv.org/abs/2605.04356
作者: Christine Ye,Joe Benton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But, broadly beneficial deployments of AI may require us to train models with strong capabilities in “fuzzy”, hard-to-supervise domains. In this paper, we develop methods to align language models in fuzzy domains where human experts are still able to provide high-quality supervision signal, but only for a small number of model outputs, using online natural language feedback. Specifically, we train models by iteratively optimizing against proxy reward signals, stopping at the point of over-optimization, collecting fresh expert supervision, and updating the proxy reward. We construct proxy reward models from language models using in-context learning (ICL) and fine-tuning. We test our methods by eliciting creative writing and alignment research capabilities in Qwen3-8B and Haiku 4.5 respectively. For Qwen3-8B, ICL methods recover up to 35% of performance with 50x fewer expert samples, while fine-tuning methods recover 80% with up to 20x fewer samples and 100% with 3x fewer samples. For Haiku 4.5, ICL methods recover up to 35% of performance with 30x fewer samples, and fine-tuning methods recover 100% with 10x fewer samples. Our results suggest that online natural language feedback can substantially improve the data efficiency of expert supervision.

[AI-65] Resilient AI Supercomputer Networking using MRC and SRv6

【速读】:该论文旨在解决大规模同步预训练任务中尾部延迟(tail latency)主导性能的问题。其解决方案的关键在于三方面协同:一是提出基于RDMA的新型传输协议MRC,通过多路径分发与主动负载均衡避免流冲突;二是采用多平面Clos拓扑结构,在保持两层架构的同时实现高交换机端口密度和冗余性,支持超10万GPU规模的训练集群;三是利用SRv6静态源路由机制,使MRC具备自主绕过网络故障的能力。实证表明,MRC在OpenAI和微软最大训练集群中成功支撑前沿模型训练,并能有效容忍多种网络故障而不中断训练过程。

链接: https://arxiv.org/abs/2605.04333
作者: Joao Araujo,Alex Chow,Mark Handley,Ryder Lewis,Christoph Paasch,Jitendra Padhye,Michael Papamichael,Greg Steinbrecher,Amin Tootoonchian,Lihua Yuan,S. Anantharamu,Abhishek Dosi,Mohit Garg,Mahdieh Ghazi,Torsten Hoefler,Deepal Jayasinghe,Jithin Jose,Abdul Kabbani,Guohan Lu,Yang Wang,K. Doddapaneni,Murali Garimella,Vipin Jain,Yanfang Le,H. Nagulapalli,S. Narayanan,Rong Pan,Rathina Sabesan,Raghava Sivaramu,Rip Sohan,Eric Davis,Dragos Dumitrescu,Mohan Kalkunte,Bhaswar Mitra,Guglielmo Morandin,Adrian Popa,Costin Raiciu,Eric Spada,John Spillane,Niranjan Vaidya,Aviv Barnea,Idan Burstein,Elazar Cohen,Yamin Friedman,Noam Katz,Masoud Moshref,Yuval Shpigelman,Shahaf Shuler,Shy Shyman,Sayantan Sur
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 18 pages, 22 figures

点击查看摘要

Abstract:Tail latency dominates the performance of synchronous pretraining jobs when running at very large scales. We describe a three-pronged approach: (1) a new RDMA-based transport protocol, MRC, sprays across many paths and actively load-balances between them, eliminating the issue of flow collisions (2) the use of multi-plane Clos topologies to get the benefits of high switch radix and redundancy, allowing training clusters well over 100K GPUs to be built as two-tier topologies while increasing physical redundancy, and (3) the use of static source-routing using SRv6 to allow MRC the freedom to bypass failures by itself. We describe our experiences running MRC and static SRv6 routing in production in OpenAI and Microsoft’s largest training clusters, where it has been used to train the latest frontier models. We demonstrate how MRC allows AI training jobs to ride out many network failures that previously would have interrupted training.

[AI-66] The Scaling Properties of Implicit Deductive Reasoning in Transformers

【速读】:该论文旨在解决深度受限的Transformer模型在处理Horn子句逻辑推理时,如何实现隐式演绎推理(implicit deductive reasoning)的有效性与可扩展性问题。其核心挑战在于区分命题可证明性(provability)与虚假特征(spurious features),并确保模型结构与算法逻辑对齐(algorithmic alignment)。解决方案的关键在于:通过系统性地解耦可证明性与干扰特征,并在足够深的模型中引入双向前缀掩码(bidirectional prefix mask),使隐式推理性能逼近显式思维链(Chain-of-Thought, CoT)方法的表现,从而在不同图拓扑和问题宽度下实现接近显式推理的效果;然而,对于深度外推(depth extrapolation)任务,显式CoT仍为必要手段。

链接: https://arxiv.org/abs/2605.04330
作者: Enrico Vompa,Tanel Tammet
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
备注: preprint

点击查看摘要

Abstract:We investigate the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. By systematically decorrelating provability from spurious features and enforcing algorithmic alignment, we find that in sufficiently deep models with a bidirectional prefix mask, implicit reasoning approaches explicit CoT performance across graph topologies and problem widths, though CoT remains necessary for depth extrapolation.
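Horn 子句可证明性的显式(CoT 式)参照算法就是前向链接求不动点;论文考察的隐式推理要求深度受限的 Transformer 在前向传播中内化等价计算。前向链接本身可最小示意如下(表示方式为本文注释所设):

```python
def provable(facts, rules, goal):
    """Horn 子句前向链接:rules 为 (前提集合, 结论) 列表,
    迭代扩张已知事实集直至不动点,判断 goal 是否可证。"""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return goal in known

rules = [({"a"}, "b"), ({"b", "c"}, "d"), ({"a"}, "c")]
print(provable({"a"}, rules, "d"))  # a ⇒ b, c;b∧c ⇒ d → True
print(provable({"b"}, rules, "d"))  # 缺少 c,d 不可证 → False
```

不动点迭代的轮数对应证明深度;这正是摘要中“深度外推仍需显式 CoT”所指的那条轴:隐式推理的层数上限限制了可内化的推理深度。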

[AI-67] Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping ICML2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在持续学习新知识时面临的灾难性遗忘(catastrophic forgetting)问题,尤其是现有基于参数更新的方法难以避免遗忘且更新不可逆的局限性。其解决方案的关键在于将自回归语言生成建模为一个基于token的马尔可夫过程(Markov process over tokens),用转移矩阵表示模型记忆;通过扩展状态空间来引入新知识,并保持原有转移关系不变以实现零遗忘。进一步地,作者提出一种基于token-to-dictionary映射的采样复杂度理论边界,并设计了一种仅需最小参数调整的嵌入调优算法(embedding-tuning algorithm),从而高效、无损地整合新token的知识。

链接: https://arxiv.org/abs/2605.04308
作者: Kaustubh Pethkar,Ziyang Xiong,Zuofeng Shang,Yingcong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Continual incorporation of new knowledge is essential for the long-term evolution of large language models (LLMs). Existing approaches typically rely on parameter-update algorithms to mitigate catastrophic forgetting, yet they suffer from fundamental limitations: 1) forgetting is unavoidable as the amount of newly injected knowledge grows; and 2) model updates are often irreversible. As modern LLMs become increasingly expressive, it is natural to question whether large-scale weight updates are necessary for acquiring a small amount of new knowledge. In this work, we propose a principled framework that models autoregressive language generation as a Markov process over tokens, where model memory is represented by a Markov transition matrix. Under this formulation, incorporating new knowledge/tokens corresponds to extending the state space, and preserving existing transitions guarantees retention of previously learned knowledge. We then prove a sample complexity bound for incorporating new tokens via a token-to-dictionary mapping strategy. In particular, for learning the transition behavior of each new token, the required number of samples scales linearly with the number of existing tokens it is mapped to. To realize this mapping, we propose an embedding-tuning algorithm that requires minimal parameter updates and induces zero forgetting. Experimental results further demonstrate the effectiveness of our method and validate our theoretical findings.
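“扩展状态空间且保持原有转移不变”可以在转移矩阵层面示意:旧 token 的行补零(不转向新 token,旧转移概率逐项保留,即零遗忘),新 token 的行由 token-to-dictionary 映射到的旧 token 行平均得到(此处取平均仅为示意,映射细节以论文为准):

```python
def extend_transition(P, dictionary):
    """把 n×n 随机矩阵 P(嵌套列表)扩展为 (n+1)×(n+1):
    旧行末尾补 0(原有转移逐项保留),
    新 token 的行取 dictionary 中各旧 token 行的平均。"""
    n = len(P)
    P_ext = [row + [0.0] for row in P]                  # 零遗忘:旧转移不变
    new_row = [sum(P[i][j] for i in dictionary) / len(dictionary)
               for j in range(n)] + [0.0]
    return P_ext + [new_row]

P = [[0.7, 0.3],
     [0.4, 0.6]]
P_ext = extend_transition(P, dictionary=[0, 1])  # 新 token 映射到旧 token {0, 1}
print([round(x, 2) for x in P_ext[2]])           # [0.55, 0.45, 0.0]
print(all(abs(sum(row) - 1.0) < 1e-9 for row in P_ext))  # 仍是随机矩阵 → True
```

由于旧的 2×2 子块逐项不变,原有 token 间的生成行为被精确保留,这对应论文中“零遗忘”的含义;样本复杂度结论则刻画了学好 `new_row` 所需的数据量随映射到的旧 token 数线性增长。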

[AI-68] LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy IJCAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时存在的过度自信问题,尤其是在幻觉(hallucination)场景下,这种过度自信严重影响了模型在安全关键领域部署的可靠性,因此亟需一种可信赖的不确定性量化方法。解决方案的关键在于提出自适应共形语义熵(Adaptive Conformal Semantic Entropy, ACSE),其核心是通过聚类多个多样化输出的语义熵来衡量提示级(prompt-level)不确定性,并基于每个聚类的语义特征自适应调整不确定性评分;同时利用共形校准(conformal calibration)技术提供有限样本、分布无关的决策规则,确保被接受的回答中错误率不超过用户指定的容差阈值。该方法显著优于现有基于词法或概率度量的不确定性估计基线,在多个数据集和LLM上的实验表明其具有更高的判别性能、更可靠的校准能力和更强的统计保障。

链接: https://arxiv.org/abs/2605.04295
作者: Hamed Karimi,Vaishali Meyappan,Reza Samavi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the Proceedings of IJCAI 2026, the 35th International Joint Conference on Artificial Intelligence, 12 Pages

点击查看摘要

Abstract:LLMs’ overconfidence, particularly when hallucinating, poses a significant challenge for the deployment of the models in safety-critical settings and makes a reliable estimation of uncertainty necessary. Existing approaches for uncertainty quantification typically prioritize lexical or probabilistic measures; however, these techniques often ignore the semantic variance of different responses with similar meaning. In this paper, we propose Adaptive Conformal Semantic Entropy (ACSE), a method for estimating prompt-level uncertainty by adaptively measuring semantic dispersion in LLMs outputs. Our uncertainty scoring function is based on clustering semantic entropy of multiple diverse responses to the same prompt. The function adaptively adjusts the uncertainty score based on semantic features of each cluster. To ensure statistical reliability of our score, we use conformal calibration to apply a decision rule to accept/abstain the prompts, providing a finite-sample, distribution-free guarantee such that the error rate among the accepted responses remains bounded by a user-specified tolerance. Our extensive experimental evaluations using different LLMs and datasets, demonstrate that our approach consistently outperforms state-of-the-art uncertainty quantification baselines using discriminative performance, conformal guarantees, and probabilistic calibration indicators. As a highlight, for TriviaQA dataset, AUROC of our approach is 0.88 compared to 0.65 produced by the token entropy approach.
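语义熵先把同一 prompt 的多条采样回答按语义等价聚类,再对各簇的概率质量求熵 H = −Σ_c p_c log p_c。下面假设聚类已完成、用簇标签按频率直接估计簇概率(聚类步骤与论文的自适应调整、共形校准从略,均为示意):

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """按簇频率估计簇概率质量,返回语义熵(自然对数)。"""
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# 10 条采样回答全部语义一致 → 熵为 0(低不确定性)
print(semantic_entropy(["A"] * 10) == 0.0)                # True
# 10 条回答分成语义不同的两簇 → 熵升高(高不确定性)
print(round(semantic_entropy(["A"] * 5 + ["B"] * 5), 3))  # ln 2 ≈ 0.693
```

与逐 token 熵不同,措辞不同但语义相同的回答落入同一簇、不贡献熵,这正是该类方法相对词法/概率度量的优势所在。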

[AI-69] A Mean Curvature Approach to Boundary Detection: Geometric Insights for Unsupervised Learning

【速读】:该论文旨在解决高维数据中边界检测的难题,尤其是在非线性结构和异质密度条件下传统密度基方法表现不佳的问题。其解决方案的核心在于提出一种基于几何机器学习的新型框架——均值曲率边界点(Mean Curvature Boundary Points, MCBP),通过离散近似形状算子(shape operator)从局部k近邻邻域估计点态均值曲率,从而无需显式参数化流形即可刻画数据流形的内在弯曲特性。关键创新在于将均值曲率作为边界结构的原理性描述符:高曲率区域自然对应于聚类间过渡、几何不规则性和低密度界面,实现了边界点、异常点与过渡点的统一几何解释,并结合自适应百分位阈值策略实现多尺度边界提取,显著提升了复杂高维场景下的聚类性能与下游无监督算法的鲁棒性。

链接: https://arxiv.org/abs/2605.04274
作者: Alexandre L. M. Levada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 26 pages, 6 tables, 8 figures

点击查看摘要

Abstract:Accurate boundary detection in high-dimensional data remains a central challenge in unsupervised learning, particularly in the presence of non-linear structures and heterogeneous densities. In this work, we introduce Mean Curvature Boundary Points (MCBP), a novel geometric framework grounded in Geometric Machine Learning that departs from traditional density-based approaches by explicitly modeling the intrinsic curvature of the data manifold. The method relies on a discrete approximation of the shape operator, estimated from local k-nearest neighbor patches, to compute pointwise mean curvature without requiring explicit manifold parametrization. The key insight of MCBP is to use mean curvature as a principled descriptor of boundary structure: high-curvature regions naturally correspond to transitions between clusters, geometric irregularities, and low-density interfaces. This yields a unified geometric interpretation of boundary, outlier, and transition points. We further introduce an adaptive percentile-based thresholding scheme that enables multiscale boundary extraction without relying on ad hoc density parameters. Beyond detection, we propose a curvature-driven data decomposition that separates samples into smooth (low-curvature) and boundary (high-curvature) subsets, effectively acting as a non-linear geometric filtering mechanism. This representation enhances cluster separability and improves the robustness of downstream unsupervised algorithms. Extensive experiments on synthetic and real-world datasets demonstrate that MCBP consistently improves clustering performance, particularly in complex and high-dimensional scenarios. These results position MCBP as a concrete contribution to Geometric Machine Learning, highlighting the potential of curvature-aware analysis as a unifying paradigm bridging differential geometry and data-driven modeling.
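摘要中“由 k 近邻局部补丁估计逐点曲率、再按自适应百分位阈值取边界点”的流程可用如下草图近似说明(此处用局部协方差的 surface variation 作为曲率代理,论文实际离散化的是 shape operator;函数命名均为假设):

```python
import numpy as np

def curvature_scores(X, k=10):
    """Pointwise curvature proxy from local k-NN patches: the 'surface
    variation' lambda_min / sum(lambda) of the patch covariance, a common
    stand-in for curvature magnitude (flat patches score ~0, bent ones
    score higher; the paper instead discretizes the shape operator)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    scores = np.empty(n)
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]       # k nearest neighbours, skip self
        patch = X[idx] - X[idx].mean(axis=0)   # centred local patch
        lam = np.linalg.eigvalsh(patch.T @ patch / k)
        scores[i] = lam[0] / lam.sum()
    return scores

def boundary_points(X, k=10, pct=80):
    """Adaptive percentile threshold: flag the top (100 - pct)% most
    curved points as boundary/transition points."""
    s = curvature_scores(X, k)
    return s >= np.percentile(s, pct)
```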

[AI-70] Parallel Prefix Verification for Speculative Generation

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中因令牌级(token-level)验证导致的吞吐量提升有限的问题。现有推测解码方法受限于逐令牌等价性验证,使得接受长度较短且加速效果不显著;而语义级别(semantic-level)或段落级别(segment-level)验证虽可提升接受粒度,但传统方法依赖串行验证,引入显著开销,难以实现实际性能增益。其解决方案的关键在于提出PARSE(PArallel pRefix Speculative Engine),通过并行前缀验证机制,在单次前向传播中利用定制注意力掩码同时评估多个前缀的正确性,从而直接识别最长有效前缀,消除串行验证步骤并提升计算效率。此方法与令牌级推测解码正交,可组合使用以进一步提升性能,在多个模型和基准测试中实现了1.25×至4.3×的吞吐量增益,组合EAGLE-3时更达1.6×至4.5×,且精度损失可忽略。

链接: https://arxiv.org/abs/2605.04263
作者: Yuncheng Yao,Yuxuan Xia,Shengjie Wang,Danyang Zhuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce PARSE (PArallel pRefix Speculative Engine), a speculative generation framework that accelerates large language model (LLM) inference by parallelizing prefix verification on a semantic level. Existing speculative decoding methods are fundamentally limited by token-level equivalence: the target model must verify each token, leading to short acceptance lengths and modest speedups. Moving to semantic or segment-level verification can substantially increase acceptance granularity, but prior approaches rely on sequential verification, introducing significant overhead and limiting practical gains. PARSE introduces parallel prefix verification, enabling semantic-level verification without sequential checks. Given a full draft from a draft model, the target model evaluates correctness across multiple prefixes in a single forward pass using a custom attention mask, directly identifying the maximal valid prefix. This eliminates sequential segment verification and makes verification compute-efficient. PARSE is orthogonal to token-level speculative decoding and can be composed with it for additional gains. Across models and benchmarks, PARSE delivers 1.25× to 4.3× throughput gain over the target model, and 1.6× to 4.5× when composed with EAGLE-3, all with negligible accuracy degradation. This demonstrates parallel prefix verification as an effective, general approach to accelerating LLM inference.
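其中“单次前向传播同时为所有前缀打分、再取最长有效前缀”的思路可示意如下(分段接受规则此处用平均对数概率阈值代替论文中的语义判定,仅作说明):

```python
def maximal_valid_prefix(token_logprobs, boundaries, threshold):
    """Given the target model's log-prob for every draft token (all
    obtained in ONE forward pass, since causal attention already scores
    each position against its own prefix), accept each segment whose
    mean log-prob clears `threshold` and return the token length of the
    longest accepted prefix. `boundaries` lists segment end indices
    (exclusive); the mean-logprob rule is an illustrative stand-in for
    the paper's semantic check."""
    start, accepted = 0, 0
    for end in boundaries:
        seg = token_logprobs[start:end]
        if sum(seg) / len(seg) < threshold:
            break                      # first rejected segment stops the scan
        accepted, start = end, end
    return accepted
```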

[AI-71] Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂时间推理任务中表现脆弱的问题,挑战了当前普遍认为其根源在于自回归逻辑推理缺陷的主流观点。研究表明,真正的瓶颈在于非结构化文本到事件表示的不一致性。解决方案的关键在于提出一种基于概率不一致信号(Probabilistic Inconsistency Signal, PIS)的神经符号问答框架,通过将非结构化文本显式转化为事件图与区间约束,严格解耦语义提取与符号推理引擎;同时利用证据深度学习(Evidential Deep Learning)从LLM隐藏状态中提取认知不确定性,并将其与符号置信区间统一建模,从而实现对结构性断裂的鲁棒检测。实验表明,当提供正确的结构化表示时,系统在时间算术基准上达到1.0的完美准确率(4000/4000),且零假阳性/假阴性,显著优于传统方法。

链接: https://arxiv.org/abs/2605.04243
作者: Tran Quang Liem
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. 22 pages, 2 figures

点击查看摘要

Abstract:Despite significant advances, large language models (LLMs) continue to exhibit brittle performance on complex temporal reasoning tasks. This failure mode is widely attributed to inherent deficits in autoregressive logical deduction. In this paper, we challenge this prevailing narrative, demonstrating that temporal reasoning is not the fundamental bottleneck; rather, the locus of failure lies in unstructured text-to-event representation. We introduce a novel neuro-symbolic question-answering framework governed by a Probabilistic Inconsistency Signal (PIS) that explicitly isolates perceptual errors from reasoning failures. By lifting unstructured text into explicit event graphs and interval constraints, our architecture strictly decouples semantic extraction from a symbolic reasoning engine. To robustly detect structural breaks, the PIS elegantly unifies symbolic credal intervals with epistemic neural uncertainty extracted via Evidential Deep Learning on LLM hidden states. Empirical evaluations reveal a striking paradigm shift: when provided with correct structural representations, our system’s explicit proof traces achieve perfect 1.0 accuracy (4000/4000) and strictly zero false positives/negatives on temporal arithmetic benchmarks. On broader, noise-injected QA settings, the framework maintains a competitive 75.1% accuracy while enabling deterministic, step-level failure localization. Ultimately, by isolating the representation bottleneck from the reasoning substrate, this work reframes temporal QA from an algorithmic reasoning challenge to a structural alignment problem, charting a verifiable path forward for reliable neuro-symbolic AI.
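将文本提升为事件区间后,“结构性断裂”可以用一个简单的区间约束检查来示意(仅为简化版本:论文的 PIS 还融合了可信区间与来自 LLM 隐藏状态的神经不确定性):

```python
def temporal_violations(intervals, before):
    """Consistency check on the lifted event graph: for each constraint
    (a, b) in `before`, event a must end no later than event b starts.
    A non-empty result is the kind of structural break the inconsistency
    signal reacts to (simplified relative to the paper's PIS)."""
    return [(a, b) for a, b in before if intervals[a][1] > intervals[b][0]]
```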

[AI-72] Layerwise LQR for Geometry-Aware Optimization of Deep Networks

【速读】:该论文旨在解决现有几何感知优化器(如牛顿法和自然梯度)在深度学习中因计算复杂度高而难以扩展的问题,特别是K-FAC、Shampoo等可扩展预条件方法通常过早引入结构近似,导致丢弃网络计算中产生的跨层交互信息。其解决方案的关键在于提出Layerwise LQR(LLQR)框架,通过将广义散度诱导的二次模型中最速下降步转化为有限时域线性二次调节器(LQR)问题,精确建模每一层的动力学和代价矩阵以保留原始密集几何结构;进而设计一种可扩展的松弛策略,在不显式构建或求逆全局曲率矩阵的前提下,学习对角、(E-)Kronecker分解或其他结构化的逆预条件矩阵,并在迭代中复用,从而在保持二阶几何原理性联系的同时显著提升优化效率与最终测试性能。

链接: https://arxiv.org/abs/2605.04230
作者: Simon Dufort-Labbé,Pierre-Luc Bacon,Razvan Pascanu,Simon Lacoste-Julien,Aristide Baratin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geometry-aware optimizers such as Newton and natural gradient can improve conditioning in deep learning, but scalable variants such as K-FAC, Shampoo, and related preconditioners usually impose structural approximations early, often discarding cross-layer interactions induced by the network computation. We introduce Layerwise LQR (LLQR), a framework for learning structured inverse preconditioners under a global layerwise optimal-control objective. The starting point is an exact equivalence: the steepest-descent step under a broad class of divergence-induced quadratic models–including Newton, Gauss-Newton, Fisher/natural-gradient, and intermediate-layer metrics–can be written as a finite-horizon Linear Quadratic Regulator (LQR) problem. This formulation serves as a reference that exposes the layerwise dynamics and cost matrices encoding the original dense geometry. We then derive a scalable relaxation that learns diagonal, (E-)Kronecker-factored, or other structured inverse preconditioners by minimizing the LQR objective and reusing them across iterations. The resulting optimizer wraps standard methods while retaining a principled connection to second-order geometry, without forming or inverting the global curvature matrix. Experiments on ResNets and Transformers show that LLQR improves optimization dynamics and often translates these gains into improved final test performance, while adding only modest wall-clock overhead. It establishes LLQR as a practical framework for geometry-aware second-order methods and a reference for evaluating scalable approximations.
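摘要称最速下降步可精确写成有限时域 LQR 问题;下面给出通用的后向 Riccati 递推求解草图(这是教科书式的一般求解器,A、B、Q、R 如何由逐层动力学与代价矩阵构造需参见论文):

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, T):
    """Backward Riccati recursion for a finite-horizon discrete LQR;
    LLQR's exact reference casts the geometry-aware descent step in this
    form, with A, B, Q, R encoding the layerwise dynamics and cost
    matrices (generic textbook solver, not the paper's routine)."""
    P = Q.copy()
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # feedback gain
        P = Q + A.T @ P @ (A - B @ K)                      # cost-to-go update
        gains.append(K)
    return gains[::-1]  # gains[t] is the gain applied at step t
```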

[AI-73] Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

【速读】:该论文旨在解决扩散型大语言模型(Diffusion-based Large Language Models, D-LLMs)在推理过程中因固定响应长度限制而导致的计算效率低下问题。具体而言,若预设响应长度过大,会产生大量语义无意义的填充(padding)token造成计算浪费;若过小则需重新计算以避免截断,引发不可预测的延迟波动。解决方案的关键在于提出“先预测后扩散”(Predict-then-Diffuse)框架,其核心是一个自适应响应长度预测器(Adaptive Response Length Predictor, AdaRLP),能够根据输入查询动态估计最优响应长度,并结合一种数据驱动的安全机制,在几乎不增加额外开销的前提下有效防止低估响应长度导致的重推理,从而显著降低浮点运算量(FLOP),同时保持输出质量稳定。

链接: https://arxiv.org/abs/2605.04215
作者: Michael Rottoli,Subhankar Roy,Stefano Paraboschi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to significant throughput advantages and superior GPU utilization over the traditional autoregressive paradigm. However, this parallelism is constrained by the requirement of a fixed-size response length prior to generation. This architectural limitation imposes a severe trade-off: an oversized response length results in computational waste on semantically meaningless padding tokens, while an undersized response length causes output truncation, requiring costly re-computations that introduce unpredictable latency spikes. To tackle this issue, we propose Predict-then-Diffuse, a simple and model-agnostic framework that enables compute-budgeted inference per input query by first estimating the response length and then using it to run inference with the D-LLM. At its core lies an auxiliary Adaptive Response Length Predictor (AdaRLP) that predicts the optimal response length for a given input query. As a measure against under-predicting the response length and re-running inference with a higher response length, we introduce a data-driven safety mechanism, which trades a negligible padding overhead for fewer re-runs. As a whole, our framework limits the significant waste of computation on padding tokens and preserves output quality. Experimental validation on multiple datasets demonstrates that Predict-then-Diffuse significantly reduces computational costs (FLOP) compared to the default D-LLM inference mechanism and baselines based on heuristics, while being robust to skewed data distributions.
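其中“数据驱动的安全机制”可以理解为在预测长度上叠加校准集上低估误差的分位数余量,示意如下(函数与分位数规则为本注解的假设):

```python
import math

def budgeted_length(pred_len, under_errors, q=0.95):
    """Data-driven safety margin: add the q-quantile of under-prediction
    errors (true minus predicted length, clipped at 0) measured on a
    calibration set, trading a small amount of padding for fewer costly
    re-runs. Names and the quantile rule are illustrative choices."""
    errs = sorted(max(0, e) for e in under_errors)
    k = min(len(errs) - 1, math.ceil(q * len(errs)) - 1)
    return pred_len + errs[k]
```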

[AI-74] Undetectable Backdoors in Model Parameters: Hiding Sparse Secrets in High Dimensions

【速读】:该论文旨在解决预训练图像分类模型中的供应链攻击问题,即在模型部署前植入难以检测的后门(backdoor),使得模型在特定触发条件下输出攻击者指定的目标类别。解决方案的关键在于提出一种名为“稀疏后门”(Sparse Backdoor)的攻击方法:它在全连接层中沿随机方向对少量列注入结构化的稀疏扰动,并通过独立同分布的各向同性高斯抖动(isotropic Gaussian dither)掩蔽扰动;该抖动构造了一个以原始权重为中心的干净参考分布,且在预训练模型满足弱边缘条件时,该参考分布与原模型功能等价。理论证明表明,区分被植入后门的模型与该参考分布至少等价于稀疏主成分分析(Sparse PCA)检测问题,在标准计算困难假设下是不可行的,从而确保了攻击的可证明不可检测性。

链接: https://arxiv.org/abs/2605.04209
作者: Sarthak Choudhary,Atharv Singh Patlan,Nils Palumbo,Ashish Hooda,Kassem Fawaz,Somesh Jha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Sparse Backdoor, a supply-chain attack that plants a provably undetectable backdoor in pre-trained image classifiers, including convolutional networks and Vision Transformers. The attack injects a structured sparse perturbation along a randomly chosen direction into a small subset of columns at each fully connected layer, propagating a trigger signal to an adversary-chosen target class, and masks the perturbation with an independent isotropic Gaussian dither. The dither serves a single technical purpose: it induces a clean reference distribution anchored at the pre-trained weights, against which undetectability can be formalized. Under a mild margin condition on the pre-trained classifier, we show that the dithered reference is functionally equivalent to the original classifier. We prove that distinguishing the backdoor-injected model from this reference is at least as hard as Sparse PCA detection, which is computationally infeasible under standard hardness assumptions. The guarantee holds against any probabilistic polynomial-time distinguisher with white-box access to the parameters.
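单层注入的结构可示意如下(参数命名为本注解假设;论文的完整构造还需在层间传递触发信号并满足 margin 条件):

```python
import numpy as np

def plant_sparse_backdoor(W, cols, direction, eps, sigma, rng):
    """Illustrative single-layer injection: add a sparse rank-1
    perturbation eps * direction to a few columns, masked by an
    independent isotropic Gaussian dither of scale sigma (the dither is
    what defines the clean reference distribution the paper argues is
    indistinguishable from the backdoored weights)."""
    W2 = W + rng.normal(0.0, sigma, size=W.shape)   # dither every entry
    W2[:, cols] += eps * direction[:, None]         # sparse trigger columns
    return W2
```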

[AI-75] Deep Wave Network for Modeling Multi-Scale Physical Dynamics

【速读】:该论文旨在解决深度学习模型在物理科学应用中因架构设计局限导致的准确率-计算成本权衡不优的问题,尤其针对U-Net类编码器-解码器结构在深度(depth)维度上探索不足,限制了其性能潜力。现有方法通常仅调整宽度(width)或分离评估准确率与计算成本,忽略了不同宽度和深度组合下准确率-成本缩放关系的差异,从而可能得出误导性结论。解决方案的关键在于提出一种名为Deep Wave Network (DW-Net) 的新型架构,通过串联多个编码器-解码器“波”(wave),并在波内与波间引入跳跃连接(skip connections),实现跨尺度信息的渐进式精炼,有效提升模型的有效深度。实验表明,在保持训练数据、优化器和训练调度一致的前提下,DW-Net在多个二维和三维流场基准测试中均优于单波U-Net,显著改善帕累托前沿(Pareto front),可在相同计算成本下获得更高精度,或在相同精度下降低计算开销,并且达到低误差区域所需的训练时间最多减少3倍。

链接: https://arxiv.org/abs/2605.04198
作者: Alexander I. Khrabry,Edward A. Startsev,Andrew T. Powis,Igor D. Kaganovich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn); Plasma Physics (physics.plasm-ph)
备注:

点击查看摘要

Abstract:Performance of deep learning models is strongly governed by architectural capacity, with width and depth as primary controls. However, in physical-science applications, models are often compared at a single fixed size or by separating accuracy and computational cost, which can be misleading since architectures exhibit different accuracy-cost scaling as width and depth vary. This issue is particularly relevant for U-Net-type encoder-decoder models, widely used for multi-scale gas, fluid, and plasma dynamics due to their ability to represent features across spatial scales. A U-Net constructs a multi-resolution representation via an encoder that progressively reduces spatial resolution, followed by a decoder that restores it for prediction. Skip connections link corresponding encoder and decoder features, preserving fine-scale information and improving optimization. In practice, U-Net width is routinely tuned, while depth is typically kept fixed (a set number of down/up-sampling stages with few convolutions per stage), limiting systematic exploration of depth for improving the accuracy-cost trade-off. We address this limitation by increasing effective depth through stacking multiple encoder-decoder “waves” in series, with skip connections both within and across waves to enable progressive cross-scale refinement. We call this architecture a Deep Wave Network (DW-Net). Training data, optimization, and schedules are kept identical across models. Instead of evaluating single configurations, we train multiple width variants of each architecture and compare accuracy vs. GPU time Pareto fronts. Across several 2D and 3D flow benchmarks, DW-Net models consistently improve the Pareto frontier over single-wave U-Nets, achieving higher accuracy at matched cost or similar accuracy at reduced cost, and reaching low-error regimes with up to 3x less training time under identical training settings.

[AI-76] ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor

【速读】:该论文旨在解决传统归纳逻辑编程(Inductive Logic Programming, ILP)在噪声和概率性数据场景下难以扩展的问题,以及现有神经符号方法在处理不确定性时存在的局限性,如依赖预定义规则模板或使用不准确的模糊算子导致梯度消失或逻辑结构近似不佳。其解决方案的关键在于提出一种基于注意力机制的神经符号可微分规则提取框架(Attention-based Neuro-symbolic Differentiable Rule Extractor, ANDRE),该框架通过引入全可微的、基于注意力驱动的合取与析取运算符来替代固定规则模板和传统逻辑算子,从而在连续规则空间中优化推理过程,并近似逻辑最小-最大语义,实现对概率性谓词估值的稳定且可解释的推理。ANDRE支持软选择、否定或排除规则内的谓词,兼顾灵活性与符号结构保真性,在多个基准测试中展现出优于现有方法的预测性能和规则提取稳定性。

链接: https://arxiv.org/abs/2605.04193
作者: Iman Sharifi,Peng Wei,Saber Fallah
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 35 pages, 8 figures, 10 tables

点击查看摘要

Abstract:Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.
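近似逻辑 min-max 语义的注意力式软合取/软析取算子可写成如下常见的 soft-min/soft-max 构造(beta 为温度参数;仅作说明,未必与论文算子逐项一致):

```python
import numpy as np

def soft_and(v, beta=10.0):
    """Attention-weighted conjunction: weights concentrate on the
    smallest truth value, so the result approaches min(v) as beta grows,
    while staying differentiable in v."""
    w = np.exp(-beta * (v - v.min()))   # shift for numerical stability
    w = w / w.sum()
    return float(w @ v)

def soft_or(v, beta=10.0):
    """Attention-weighted disjunction, approaching max(v)."""
    w = np.exp(beta * (v - v.max()))
    w = w / w.sum()
    return float(w @ v)
```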

[AI-77] Actionable Real-Time Modeling of Surgical Team Dynamics via Time-Expanded Interaction Graphs

【速读】:该论文旨在解决当前手术人工智能(AI)系统在术中团队协作建模方面的不足,即现有系统主要依赖视觉工作流程信号,缺乏对团队成员之间随时间动态交互的结构化表示。其解决方案的关键在于提出一种基于时扩展交互图(time-expanded interaction graphs)的实时可操作建模方法:将团队成员建模为时间索引节点,通信交流定义有向边,从而实现时空维度上的动态交互建模;同时利用静态图神经网络进行高效推理,不仅预测手术过程效率(以预期时长偏差衡量),还通过反事实分析识别最小通信结构变化与可解释行为变量,从而提供清晰、可操作的改进建议。

链接: https://arxiv.org/abs/2605.04169
作者: Vincenzo Marco De Luca,Antonio Longa,Giovanna Varni,Andrea Passerini
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at Hybrid Human Artificial Intelligence (HHAI) 2026

点击查看摘要

Abstract:Surgical team performance arises from complex interactions between technical execution and non-technical skills, including communication and coordination dynamics. However, current surgical AI systems predominantly model visual workflow signals, lacking structured representations of intraoperative team interactions over time. We propose a real-time actionable approach for modeling surgical team dynamics using time-expanded interaction graphs, where team members are modeled as time-indexed nodes and communication exchanges define directed edges. This spatio-temporal expansion enables dynamic interaction modeling, while allowing efficient inference with a static graph neural network. The model predicts procedural efficiency as the deviation from the expected duration and supports real-time deployment. Beyond prediction, we perform a counterfactual analysis to identify minimal changes in communication structure and interpretable behavioral variables associated with improved predicted outcomes. Experiments on recorded surgical procedures show that structured modeling of team interactions improves early identification of prolonged interventions and provides coherent, actionable explanations. This work advances surgical AI toward real-time, team-aware, and actionable decision support in the operating room.
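时扩展交互图的构造可示意如下(编码方式为本注解的简化版本,非论文确切模式):

```python
def time_expanded_graph(members, exchanges):
    """Unroll team members into time-indexed nodes; each communication
    exchange (speaker, listener, t) becomes a directed edge at time t,
    and each member is chained to their own next snapshot so a static
    GNN can reason over the temporal structure."""
    times = sorted({t for _, _, t in exchanges})
    nodes = [(m, t) for t in times for m in members]
    edges = [((s, t), (r, t)) for s, r, t in exchanges]
    # temporal self-edges linking consecutive snapshots of the same member
    edges += [((m, a), (m, b)) for m in members for a, b in zip(times, times[1:])]
    return nodes, edges
```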

[AI-78] Learning reveals invisible structure in low-rank RNNs

【速读】:该论文旨在解决低秩递归神经网络(low-rank recurrent neural networks, RNNs)中学习过程的理论理解难题,即如何从连接性(connectivity)出发,建立对学习动态的精确描述。其解决方案的关键在于将梯度下降动力学直接推导到一个简化的重叠空间(overlap space)中,从而构建出一组封闭形式的、低维常微分方程(ODEs),该系统精确适用于线性RNN,并在大N高斯极限下对非线性RNN渐近精确。研究进一步区分了两类重叠:损失可见重叠(loss-visible overlaps),它们完全决定网络活动、输出和损失;以及损失不可见重叠(loss-invisible overlaps),它们虽不影响功能但对刻画学习过程至关重要。这一分解揭示了学习可作为扰动暴露功能等效网络间的连接差异,并指出损失不可见重叠可充当记忆变量以编码训练历史,为生物学习实验提供了可验证预测。

链接: https://arxiv.org/abs/2605.04115
作者: Yoav Ger,Omri Barak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 30 pages, 12 figures

点击查看摘要

Abstract:Learning in neural systems arises from synaptic changes that reshape the representations underlying behavior. While low-rank recurrent neural networks (RNNs) have emerged as a powerful framework for linking connectivity to function, a theoretical understanding of their learning process remains elusive. Here, we extend the low-rank framework from activity to learning by deriving gradient-descent dynamics directly in a reduced overlap space. We formulate a closed-form, low-dimensional system of ODEs that governs learning in this space, exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. Central to our analysis is a distinction between two classes of overlaps: loss-visible overlaps, which fully determine network activity, output, and loss, and loss-invisible overlaps, which do not affect function but are required to describe learning. We illustrate the consequences of this decomposition through two phenomena. First, we show that learning can serve as a perturbation that exposes differences in connectivity between functionally equivalent networks. Second, we show that loss-invisible overlaps can act as memory variables that encode training history, and characterize the conditions under which this occurs. Finally, we present several testable predictions for biological learning experiments derived from our theory.

[AI-79] Resource Utilization of Differentiable Logic Gate Networks Deployed on FPGAs

【速读】:该论文旨在解决边缘计算场景下机器学习模型在资源受限硬件(如FPGA)上部署时面临的权衡问题,即如何在保持模型精度的同时优化功耗、资源利用率和推理速度。其解决方案的关键在于对可微逻辑门网络(Differentiable Logic Gate Networks, LGN)的结构参数(深度与宽度)进行系统性分析,发现最终层的宽度对时序延迟和资源占用具有决定性影响——通过压缩最终层的逻辑规模可实现28%的资源节省;在此基础上,当满足时序和布线约束时,可通过增加LGN的深度与宽度来提升性能,从而为基于LUT数量有限的FPGA平台选择最优的LGN架构提供指导。

链接: https://arxiv.org/abs/2605.04109
作者: Stephen Wormald,Gilon Kravatsky,Damon Woodard,Domenic Forte
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures, conference submission

点击查看摘要

Abstract:On-edge machine learning (ML) often strives to maximize the intelligence of small models while miniaturizing the circuit size and power needed to perform inference. Meeting these needs, differentiable Logic Gate Networks (LGN) have demonstrated nanosecond-scale prediction speeds while reducing the required resources compared to traditional binary neural networks. Despite these benefits, the trade-offs between LGN parameters and the resulting hardware synthesis characteristics are not well characterized. This paper therefore studies the trade-offs between power, resource utilization, inference speed, and model accuracy when varying the depth and width of LGNs synthesized for Field Programmable Gate Arrays (FPGA). Results reveal that the final layer of an LGN is critical to minimize timing and resource usage (i.e., a 28% decrease), as this layer dictates the logic size of summing operations. Subject to timing and routing constraints, deeper and wider LGNs can be synthesized for FPGA when the final layer is narrow. Further trade-offs are presented to help ML engineers select baseline LGN architectures for FPGAs with a set number of Look Up Tables (LUT).

[AI-80] Regularized Centered Emphatic Temporal Difference Learning

【速读】:该论文旨在解决离策略时差(Temporal-Difference, TD)学习在函数逼近下存在的稳定性、投影几何与方差控制之间的结构性权衡问题。现有方法如强调性TD(Emphatic TD, ETD)虽通过跟随关注(follow-on emphasis)改善了离策略的投影几何,但其跟随追踪(follow-on trace)可能引入高方差。作者通过贝尔曼误差中心化(Bellman-error centering)重新审视这一权衡,发现朴素的中心化扩展会引入辅助耦合项,破坏ETD核心矩阵的正定性。为此,提出正则化强调性时差学习(Regularized Emphatic Temporal-Difference Learning, RETD),其关键在于仅对辅助中心化递归进行正则化,相当于将耦合核心矩阵的右下角块从1提升至1+c1+c,从而在保持跟随追踪结构的同时实现稳定性和性能的优化。

链接: https://arxiv.org/abs/2605.04100
作者: Xingguo Chen,Chaohui Wu,Jinguo Ye,Chao Li,Shangdong Yang,Guang Yang,Tianyu Liang,Wenhao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Off-policy temporal-difference (TD) learning with function approximation faces a structural tradeoff among stability, projection geometry, and variance control. Emphatic TD (ETD) improves the off-policy projection geometry through follow-on emphasis, but the follow-on trace can have high variance. We revisit this tradeoff through Bellman-error centering. Although centering naturally removes a common drift term from TD errors, we show that a naive centered emphatic extension introduces an auxiliary coupling that can destroy the positive-definiteness of the ETD key matrix. We propose Regularized Emphatic Temporal-Difference Learning (RETD), which preserves the follow-on trace and regularizes only the auxiliary centering recursion, corresponding to lifting the lower-right block of the coupled key matrix from 1 to 1+c. We derive the RETD core matrix, prove convergence under a conservative sufficient regularization condition, and evaluate the method on diagnostic linear off-policy prediction tasks. The experiments show that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for the regularization parameter c across the diagnostics.
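将耦合核心矩阵右下角块从 1 提升至 1+c 对正定性的影响,可用一个 2×2 的玩具例子验证(矩阵取值仅作示意,非论文推导中的具体矩阵):

```python
import numpy as np

def regularized_key_matrix(A_tilde, b, c):
    """Toy coupled key matrix [[A, b], [b^T, 1+c]]: RETD's regularization
    lifts the lower-right block from 1 to 1+c, which can restore positive
    definiteness destroyed by the auxiliary coupling b."""
    n = A_tilde.shape[0]
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = A_tilde
    M[:n, n] = b
    M[n, :n] = b
    M[n, n] = 1.0 + c
    return M

def is_stable(M):
    """Sufficient stability criterion: symmetric part positive definite."""
    return np.linalg.eigvalsh((M + M.T) / 2).min() > 0
```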

[AI-81] FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在消费级GPU上部署时的压缩难题,尤其是传统标量量化方法受限于固定位宽(如8/4/3-bit)、压缩点离散且需校准数据的问题。其核心解决方案是提出FASQ(Flexible Accelerated Subspace Quantization),通过将产品量化(Product Quantization, PQ)应用于LLM权重矩阵,利用子向量大小和码本基数两个参数构建连续的压缩设计空间(覆盖原始FP16模型大小的27–49%),从而填补固定位宽方案无法达到的压缩间隙。为实现推理阶段高效执行,作者设计了定制CUDA核函数:无查找表(LUT-free)直接计算GEMV用于解码,以及输出站式双缓冲LUT GEMM用于预填充,并引入split-K并行策略,在RTX 3090上实现了比FP16张量核心性能更高的解码吞吐(最高达45.2 tok/s,有效4-bit;51.8 tok/s,有效3-bit),同时保持精度优于4-bit GPTQ与AWQ,且无需校准数据,成为首个在单卡消费级GPU上实现加速解码的压缩方法。

链接: https://arxiv.org/abs/2605.04084
作者: Ye Qiao,Yian Wang,Zhiheng Chen,Hyoukjun Kwon,Sitao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discrete compression points, and typically requires calibration data. We present FASQ (Flexible Accelerated Subspace Quantization), a calibration-free framework that applies product quantization to LLM weight matrices. By tuning two parameters, sub-vector size and codebook cardinality, FASQ exposes a continuous design space spanning 27-49% of the original FP16 model size, filling compression gaps that fixed-bit schemes cannot reach. On Meta-Llama-3-8B, FASQ surpasses 4-bit GPTQ and AWQ in accuracy (67.1-67.7 avg.) at 37-42% model size, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. To make product quantization practical at inference time, we design custom CUDA kernels: a LUT-free direct-compute GEMV for decode and an output-stationary double-buffered LUT GEMM for prefill, both with split-K parallelism. On an RTX 3090, FASQ achieves 45.2 tok/s decode at effective 4-bit (2.56x memory reduction) and 51.8 tok/s at effective 3-bit (2.80x), both surpassing FP16 tensor-core performance (43.9 tok/s) and delivering 1.6 to 1.8x the throughput of AWQ, 2.5 to 2.5x of GPTQ, and 4.3 to 5x of RTN. FASQ is the only compression method that accelerates decode beyond FP16, offering calibration-free compression, continuous size-quality trade-offs, and real-time inference on a single consumer GPU.
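对权重矩阵做产品量化(按子向量分块、每块独立 k-means 码本)的最小实现示意如下,说明 sub 与 k 两个参数如何张成连续的压缩设计空间(与论文的 CUDA 内核无关;k-means 细节为常规实现,非论文代码):

```python
import numpy as np

def pq_compress(W, sub, k, iters=20, seed=0):
    """Product quantization of a weight matrix: split columns into
    sub-vectors of length `sub`, run a plain k-means per subspace, and
    store one codebook plus uint8 codes per subspace (so k <= 256 here).
    `sub` and `k` jointly set the size/quality trade-off."""
    rng = np.random.default_rng(seed)
    n, d = W.shape
    books, codes = [], []
    for s in range(0, d, sub):
        X = W[:, s:s + sub]
        C = X[rng.choice(n, size=k, replace=False)].copy()  # init centroids
        for _ in range(iters):
            a = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (a == j).any():
                    C[j] = X[a == j].mean(axis=0)
        books.append(C)
        codes.append(a.astype(np.uint8))
    return books, codes

def pq_decompress(books, codes):
    """Reconstruct the (lossy) weight matrix from codebooks and codes."""
    return np.hstack([C[a] for C, a in zip(books, codes)])
```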

[AI-82] AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)后训练阶段中,如何将主观性、程序性和领域特定的专家要求准确编码为可操作的评估标准的问题。当前主流方法依赖于精确匹配目标或开放式的偏好判断,难以适配复杂现实任务的需求。其解决方案的关键在于提出 AsymmetryZero 框架,通过定义稳定的“评估契约”(evaluation contract),明确任务的评分维度、评判方式及聚合逻辑,从而实现模型仅评估(Inspect)与代理评估(Harbor Framework)之间的可比分数和共享审计证据。实验表明,该框架能有效提升评估一致性,同时显著降低紧凑型评审团(compact jury)的计算成本与延迟,而保持任务级结果稳定。

链接: https://arxiv.org/abs/2605.04083
作者: Tadhg Looram,Lucas Nuzzi,Kyle Waters,Steven Dillmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from 75.9% to 89.6% (strict common-subset agreement: 77.8% to 92.1%), while compact juries exhibit substantially higher internal dissent (3–2 split rate 28.7%–32.4%) than frontier juries (6.1%–11.5%). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly 4.2%–5.6% of frontier and latency to roughly 21.7%–27.1%, even as aggregated task-level outcomes often remain comparatively stable.
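评估契约的“逐准则判定 + 聚合”语义可示意如下(此处采用“每条准则需陪审团严格多数通过、全部准则通过才算任务通过”的聚合规则,仅为说明;契约的意义正在于显式声明所用规则,使其可审计):

```python
def task_outcome(criterion_votes):
    """Aggregate per-criterion jury votes under a fixed evaluation
    contract: a criterion passes on a strict majority of its jury, and
    the task passes only if every criterion passes. The rule itself is
    illustrative; a contract makes whichever rule is used explicit."""
    passed = {c: 2 * sum(v) > len(v) for c, v in criterion_votes.items()}
    return all(passed.values()), passed
```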

[AI-83] Time series causal discovery with variable lags

【速读】:该论文旨在解决从时间序列数据中学习因果贝叶斯网络(Causal Bayesian Networks, CBNs)结构的挑战,特别是如何在存在不同时间滞后(time lag)依赖关系的情况下准确识别变量间的因果关系。传统方法通常假设固定滞后窗口且未显式优化每个边的时间延迟,导致对动态演化系统的建模能力受限。其解决方案的关键在于提出一种基于禁忌搜索(Tabu-based)的结构学习算法,该算法在保持时间有序性(即每条边必须遵循时间顺序)的前提下,允许每个边采用不同的最大滞后长度;同时引入一种可分解的基于BIC(Bayesian Information Criterion)的评分函数,结合节点特定的有效样本量和显式的滞后长度惩罚项,以鼓励稀疏延迟分配并支持高效的局部评分更新,从而实现更精确的结构恢复与滞后估计。

链接: https://arxiv.org/abs/2605.04081
作者: Bruno Petrungaro,Anthony C. Constantinou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal Bayesian Networks (CBNs) are a powerful tool for reasoning under uncertainty about complex real-world problems. Such problems evolve over time, responding to external shocks as they occur. To support decision-making, CBNs require a cause-and-effect map of the variables under consideration, known as the network’s structure. Learning the graphical structure of a causal model from data remains challenging; learning it from time-series data is even harder because dependencies may arise at different time lags. Existing time-series causal discovery methods often assume a fixed lag window and do not explicitly optimise edge-specific lags. We propose a Tabu-based structure learning algorithm that searches for a time-ordered directed structure (i.e., where every edge respects time) while allowing edge-specific lags up to a specified maximum lag. The approach uses a decomposable BIC-based score with node-specific effective sample sizes and an explicit lag-length penalty encouraging parsimonious delay assignments while preserving efficient local score updates. We provide theoretical guarantees of validity and local optimality, and we also describe a parallel implementation for improved scalability. In simulations, the method recovered graph structure competitively and estimated lags accurately when true adjacencies were recovered. On a real-world UK COVID-19 policy dataset, the learnt structure was dominated by short delays while retaining a substantial minority of longer-lag dependencies, consistent with delayed behavioural and epidemiological effects.
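可分解的“BIC + 滞后长度惩罚”节点得分可示意如下(具体惩罚形式与命名为本注解的假设,非论文确切评分函数):

```python
import math

def local_score(log_lik, n_params, n_eff, lags, gamma=1.0):
    """Decomposable node score: BIC plus an explicit lag-length penalty.
    n_eff is the node-specific effective sample size (rows lost to the
    node's maximum lag); gamma weights the parsimony pressure toward
    short delays. Higher is better; Tabu search updates it locally."""
    bic = log_lik - 0.5 * n_params * math.log(n_eff)
    return bic - gamma * sum(lags)
```

例如两条入边分别滞后 [1, 2] 与 [1, 5] 时,前者得分更高,体现对冗长延迟的惩罚。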

[AI-84] Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期可靠检测的问题,以支持及时的临床干预和新兴治疗策略的评估。其解决方案的关键在于提出了一种基于手写分析的低秩专家混合(Low-Rank Mixture of Experts, LoRA-MoE)深度学习框架,该框架通过共享基础网络使多个专家模块专注于不同的手写模式,同时引入轻量级低秩适配器(low-rank adapters),显著减少可训练参数数量并提升训练稳定性,从而在保持高诊断性能的同时实现高效的推理过程。

链接: https://arxiv.org/abs/2605.04079
作者: Wu Wang,Yuang Cheng,Fouzi Harrou,Ying Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 6 figures, and 17 tables

点击查看摘要

Abstract:Early and reliable detection of Alzheimer’s disease (AD) is crucial for timely clinical intervention and improved patient management. It also supports the evaluation of emerging therapeutic strategies. In this paper, we propose a Low-Rank Mixture of Experts (LoRA-MoE) deep learning framework for Alzheimer’s disease diagnosis based on handwriting analysis. Handwriting signals provide a non-invasive and scalable digital biomarker that captures subtle cognitive-motor impairments associated with early AD progression. The proposed architecture allows multiple experts to specialize in different handwriting patterns while sharing a common base network. This design enables efficient learning of general representations while reducing interference between experts. Each expert is equipped with lightweight low-rank adapters. This mechanism significantly reduces the number of trainable parameters compared with standard Mixture of Experts (MoE) models and improves training stability. The proposed framework is evaluated on the Diagnosis AlzheimeR WIth haNdwriting (DARWIN) dataset. Extensive experiments are conducted, including ablation studies on key architectural parameters such as hidden dimension size, number of experts, and LoRA rank. The method is compared with multilayer perceptron (MLP) and conventional MoE architectures. In addition, stacking ensemble strategies (StackMean and StackMax) are investigated to improve robustness and predictive performance. Experimental results show that the LoRA-MoE framework achieves powerful diagnostic performance while activating significantly fewer parameters during inference. These results highlight the potential of the proposed approach as an accurate and computationally efficient solution for handwriting-based Alzheimer’s disease screening and digital health applications.
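低秩适配器大幅减少可训练参数这一点,可用一个简单的参数量对比直观说明(维度与秩均为示意取值,非论文的真实配置):

```python
def lora_expert_params(d_in, d_out, rank):
    """One low-rank (LoRA) expert adapter: two trainable factors
    A (d_in x rank) and B (rank x d_out)."""
    return rank * (d_in + d_out)

def dense_expert_params(d_in, d_out):
    """One standard dense expert layer in a conventional MoE."""
    return d_in * d_out

dense = dense_expert_params(512, 512)        # 262144 parameters
lora = lora_expert_params(512, 512, rank=8)  # 8192 parameters, 32x fewer
```

当 rank 远小于输入/输出维度时,每个专家的可训练参数量从 O(d_in·d_out) 降到 O(rank·(d_in+d_out)),这正是摘要中“激活更少参数”优势的来源。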

[AI-85] Validity-Calibrated Reasoning Distillation

【速读】:该论文旨在解决当前生成式 AI(Generative AI)中推理蒸馏(reasoning distillation)方法依赖静态师生层级结构和轨迹模仿所带来的局限性问题。现有方法通常强制学生模型在token层面模仿教师模型的推理路径,但这种做法忽视了推理过程中中间步骤往往局部未被唯一确定的事实——即全局正确性约束最终答案,却不唯一决定每一步的中间动作。解决方案的关键在于提出一种“有效性校准的推理蒸馏”(validity-calibrated reasoning distillation)框架,将推理蒸馏重构为局部学习信号分配问题:通过比较同一前缀下学生与教师提出的下一步动作,并利用其相对局部有效性(local validity)来调节蒸馏更新强度,从而实现动态、上下文相关的监督机制。该方法在数学推理、代码生成和指令遵循等基准测试中均显著优于强基线,表明有效的大语言模型(LLM)推理蒸馏并非依赖刚性的轨迹对齐,而是由有原则的局部学习信号校准所驱动。

链接: https://arxiv.org/abs/2605.04078
作者: Khouloud Saadi,Di Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning distillation aims to transfer multi-step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher-student hierarchies and frame distillation as trajectory imitation. This is misaligned with the structure of reasoning, where intermediate steps are often locally under-specified: global correctness constrains the final answer, but does not uniquely determine each intermediate move. We propose validity-calibrated reasoning distillation, a framework that treats reasoning distillation as a problem of local learning-signal allocation rather than path alignment. Instead of enforcing token-level imitation, we compare the student’s and teacher’s proposed next-step actions under the same prefix and use their relative local validity to modulate the strength of the distillation update. This yields a dynamic, context-dependent supervision mechanism that preserves the teacher’s structural guidance while adapting update strength to local reasoning quality. Across mathematical reasoning, code generation, and instruction-following benchmarks, our method consistently outperforms strong distillation baselines. These results indicate that effective LLM reasoning distillation is governed not by rigid trajectory imitation, but by principled, locally calibrated allocation of learning signal.
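“以师生双方下一步动作的相对局部有效性调节蒸馏更新强度”可示意如下(logistic 门控为假设;摘要仅说明相对局部有效性决定更新强度,未给出具体函数形式):

```python
import math

def distill_weight(teacher_validity, student_validity, temperature=1.0):
    """Modulate the distillation update by how much more valid the
    teacher's proposed next step is than the student's under the
    same prefix. The logistic gate and temperature are assumptions;
    the abstract fixes only that relative local validity sets the
    update strength."""
    delta = (teacher_validity - student_validity) / temperature
    return 1.0 / (1.0 + math.exp(-delta))
```

当教师步骤明显更有效时权重趋近 1(强监督),双方相当时退化为 0.5 的中性更新,学生更优时权重被压低,从而避免强制 token 级模仿。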

[AI-86] A Regulatory Governance Framework for AI-Driven Financial Fraud Detection in U.S. Banking: Integrating OCC SR 11-7 CFPB and FinCEN Compliance Requirements for Model Development Validation and Monitoring Lifecycles

【速读】:该论文旨在解决美国金融机构在部署生成式 AI (Generative AI) 驱动的欺诈检测系统时面临的合规监管碎片化问题,即现有四个监管框架(OCC Bulletin 2011-12、SR 11-7、CFPB AI 通告(circular)及 FinCEN BSA/SAR 要求)缺乏统一的治理生命周期,无法有效衔接模型开发、验证与监控实践。论文提出首个集成化监管治理框架——AI 驱动金融欺诈检测监管治理框架(Regulatory Governance Framework for AI-Driven Financial Fraud Detection, RGF-AFFD),其核心为三层治理架构,并通过基于 IEEE-CIS 和 ULB 基准数据集的多维度实证研究(包括模型性能、时间漂移、SHAP 可解释性及 BISG 公平性分析)验证其有效性;其中 LSTM+XGBoost 集成模型表现出优异性能(ROC-AUC=0.9289,F1=0.6360,收益成本比6:1),而“监管数字孪生”(Regulatory Digital Twin, RDT-FG)元模型可将关键指标转化为四类监管机构专属健康评分与综合合规适配指数,实现持续合规监测。

链接: https://arxiv.org/abs/2605.04076
作者: Mohammad Nasir Uddin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 38 pages, submitted to Cogent Business Management (Taylor Francis), currently under peer review

点击查看摘要

Abstract:U.S. financial institutions deploying AI-based fraud detection face a fragmented compliance landscape spanning four regulatory frameworks – OCC Bulletin 2011-12, SR 11-7, the CFPB AI circular, and FinCEN BSA/SAR requirements – with no integrated governance life cycle connecting these requirements to model development, validation, and monitoring practice. This paper presents the Regulatory Governance Framework for AI-Driven Financial Fraud Detection (RGF-AFFD), a three-tier governance architecture empirically anchored in a multi-study empirical program. Using the IEEE-CIS dataset (590,540 transactions) and ULB benchmark (284,807 transactions), we benchmark six architectures including an LSTM+XGBoost ensemble, and conduct ablation, temporal drift, SHAP interpretability, and BISG fairness analyses. The LSTM+XGBoost ensemble achieves ROC-AUC of 0.9289 (F1: 0.6360) with a benefit-cost ratio of 6:1. XGBoost demonstrates the strongest temporal stability (delta-AUC = -0.0017 versus -0.0626 for LSTM). The RDT-FG Regulatory Digital Twin meta-model translates metrics into four regulator-specific health scores and a composite Regulatory Fitness Index for continuous compliance monitoring. The RGF-AFFD is the first integrated deployment blueprint to simultaneously satisfy OCC, SR 11-7, CFPB, and FinCEN requirements, supported by a community bank implementation vignette and four evidence-based policy recommendations.

[AI-87] A Physics-Aware Framework for Short-Term GPU Power Forecasting of AI Data Centers

【速读】:该论文旨在解决AI数据中心因计算任务异构性导致的电力需求快速波动问题,这种波动可能引发电网不稳定。其核心挑战在于如何实现高精度、物理一致性的短期(5–80分钟)电力利用率预测。解决方案的关键是提出首个物理信息驱动的DLinear时间序列模型(PI-DLinear),该模型基于多节点集总热阻容(RC)网络结构,通过新推导的时间依赖常微分方程(ODE)将GPU计算与内存利用率及温度动态耦合建模,从而在功率限制和负载瞬态事件下仍保持预测结果的物理合理性。相较于现有最优模型(包括基于Transformer和非Transformer的方法),PI-DLinear在均方误差(MSE)、平均绝对误差(MAE)和均方根误差(RMSE)上分别提升了0.782%–39.08%、0.993%–51.82%和0.370%–22.28%。

链接: https://arxiv.org/abs/2605.04074
作者: Mohammad AlShaikh Saleh,Sanjay Chawla,Sertac Bayhan,Haitham Abu-Rub,Ali Ghrayeb
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Operating Systems (cs.OS)
备注:

点击查看摘要

Abstract:AI data centers experience rapid fluctuations in power demand due to the heterogeneity of computational tasks that they have to support. For example, the power profile of inference and training of large language models (LLMs) is quite distinct and big divergences can result in the instability of the underlying electricity grid. In this paper we propose, to the best of our knowledge, the first physics-informed DLinear time-series model that can accurately forecast power utilization of an AI data center 5-80 minutes (short-term forecasting) into the future. The physics, based on a multi-node lumped thermal resistance-capacitance (RC) network consistent with Newton’s law of cooling, is captured using newly derived time-dependent ordinary differential equations (ODE) that separately models and interlinks power consumption with the GPU compute and memory utilization and temperature. The resulting model, that we refer to as PI-DLinear, trained and evaluated on a real AI data center dataset and is not only more accurate than the state-of-the-art (SOTA) models tested, but the forecast profile respects the underlying physics under power throttling and load transient events. Relative to the SOTA transformer-based and non-transformer-based models, improvements in forecasting accuracy (averaged across all look-back and prediction windows) range from 0.782%-39.08% for MSE, 0.993%-51.82% for MAE, and 0.370%-22.28% for RMSE.
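摘要中基于牛顿冷却定律的集总热阻容(RC)动态,可用单节点的一步前向欧拉离散示意(论文实际建模多节点耦合网络,参数为示意取值):

```python
def rc_step(T, P, T_amb, R, C, dt):
    """One forward-Euler step of a lumped thermal RC node obeying
    Newton's law of cooling: C * dT/dt = P - (T - T_amb) / R.
    T: node temperature, P: dissipated power, T_amb: ambient,
    R: thermal resistance, C: thermal capacitance."""
    return T + dt * (P - (T - T_amb) / R) / C

# Under constant power the node settles at T_amb + P * R:
T = 25.0
for _ in range(20000):
    T = rc_step(T, P=50.0, T_amb=25.0, R=0.5, C=10.0, dt=0.01)
```

稳态温度为 T_amb + P·R,时间常数为 R·C;将这类 ODE 作为归纳偏置,预测在功率限制与负载瞬态下才能保持物理一致。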

[AI-88] Confronting Label Indeterminacy in Automated Bail Decisions

【速读】:该论文旨在解决司法决策中因保释(Bail)拒绝导致的标签不确定性(Label Indeterminacy)问题,即当被告被拒保释时,其是否实际出庭这一反事实结果无法观测,从而使得历史数据存在结构性标签不明确性。这种不确定性会误导机器学习模型训练,引入偏见并形成反馈循环,影响公平性和可解释性。解决方案的关键在于系统评估五种处理标签不确定性的方法,包括一种基于保释决策动态机制的新型标签插补方法,这些方法虽依赖不可验证假设,但显著影响模型预测行为,甚至超过模型选择本身的影响,并通过可解释人工智能(Explainable AI, XAI)揭示其对模型内部决策逻辑的深远作用。

链接: https://arxiv.org/abs/2605.04073
作者: Cor Steging,Tadeusz Zbiegień
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This manuscript has been accepted for presentation as a short paper at the 21st International Conference of AI Law in Singapore, June 8 to 12 of 2026

点击查看摘要

Abstract:Bail decisions present a fundamental challenge for data-driven decision support systems. When bail is denied, the counterfactual outcome of whether the defendant would have appeared in court remains unobserved. As a result, historical bail data embed structural label indeterminacy: future decisions are influenced by past decisions whose outcomes are only partially knowable. Building automated systems on such data risks introducing bias and reinforcing feedback loops. This raises a core question for machine-learning systems intended to assist judicial actors: how should cases in which bail was denied be treated during model development? In a case study of bail decisions from the Unified Judicial System of Pennsylvania, we evaluate five contemporary approaches to handling label indeterminacy across three machine learning models, including a novel label imputation method motivated by the dynamics of bail decisions. Each method relies on unverifiable assumptions, yet all influence the models’ predictive behaviour, sometimes even more so than the choice of model itself. Explainable AI analysis further reveals that these effects extend to the models’ internal decision-making processes as well. Finally, we consider the notion of label indeterminacy from a legal perspective and assess the legitimacy of these approaches in the context of bail decision-making.

[AI-89] FlatASCEND: Autoregressive Clinical Sequence Generation with Continuous Time Prediction and Association-Based Pharmacological Testing

【速读】:该论文旨在解决生成式临床序列模型在多步轨迹预测中对干预令牌(intervention tokens)响应能力不足的问题,特别是如何确保生成的轨迹能准确反映已知的药理学关联(pharmacological associations),而非仅依赖观测数据中的混杂因素。其解决方案的关键在于提出FlatASCEND模型——一个基于扁平复合标记(flat composite tokens)和零膨胀对数正态时间头(zero-inflated log-normal time head)的14.5M参数自回归临床序列模型,并通过患者特定前缀条件生成来增强机制性因果效应的表达。实验表明,该方法在干预驱动的药理关系上显著放大了预期方向性效应(如类固醇到血糖、利尿剂到钾离子),而对混杂驱动的关联保持不变,从而揭示了模型在部分恢复已知药理机制方面的潜力,同时也暴露了在残余混杂下的局限性。

链接: https://arxiv.org/abs/2605.04071
作者: Chris Sainsbury,Feng Dong,Andreas Karwath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 22 pages, 2 figures, 12 tables

点击查看摘要

Abstract:Autoregressive models can predict clinical events, but generating patient-conditioned multi-step trajectories that respond to intervention tokens and testing whether those responses preserve known pharmacological associations has received limited attention. We present FlatASCEND, a 14.5M-parameter autoregressive clinical sequence model using flat composite tokens and a zero-inflated log-normal time head. Standard distributional metrics (Jaccard 0.889-0.954) do not distinguish FlatASCEND from trivial baselines; the model’s value lies in conditional generation from patient-specific prefixes. A prompt-shuffle ablation shows patient-specific conditioning amplifies mechanistic pharmacological effects (2.0-2.2x for steroid to glucose, diuretic to potassium) while leaving confounding-driven associations unchanged (0.9x for insulin to glucose). An incident-user framework assesses directional consistency against prior pharmacological knowledge on MIMIC-IV (N=500 per comparison): 4/10 recover correct mechanistic directions, 2 reproduce treatment-context associations, 4 are incorrect (9/10 significant, Wilcoxon p0.05). This pattern - partial recovery under residual confounding - is consistent with learned observational associations without causal distinction. Direct preference optimisation with surrogate reward destroys all correct associations (3/3 to 0/3), illustrating reward exploitation when reward and evaluation share an outcome domain. Generative evidence is strongest for short-horizon ICU data; outpatient temporal fidelity is weaker (median 10 vs 154 days on INSPECT), and zero-shot cross-site transfer degrades without adaptation.
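“零膨胀对数正态时间头”对应的事件间隔采样过程可示意如下(参数均为假设取值,仅展示该分布的两段结构):

```python
import math
import random

def sample_gap(p_zero, mu, sigma, rng):
    """Zero-inflated log-normal inter-event time (sketch): with
    probability p_zero the next clinical event shares the current
    timestamp (gap exactly 0), otherwise the gap is drawn from
    LogNormal(mu, sigma)."""
    if rng.random() < p_zero:
        return 0.0
    return math.exp(rng.gauss(mu, sigma))

rng = random.Random(0)
gaps = [sample_gap(0.3, mu=0.0, sigma=1.0, rng=rng) for _ in range(10000)]
```

点质量分量刻画同一时间戳上成批记录的临床事件,对数正态分量刻画跨越数小时到数天的真实间隔,这正是 ICU 序列中常见的时间结构。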

[AI-90] LAWS: Learning from Actual Workloads Symbolically – A Self-Certifying Parametrized Cache Architecture for Neural Inference Robotics and Edge Deployment

【速读】:该论文旨在解决大规模模型推理中缓存效率与误差可控性之间的矛盾问题,即如何在不依赖真实标签(ground truth)的情况下实现高精度、可证明的缓存近似。其核心解决方案是提出LAWS(Learning from Actual Workloads Symbolically)架构,通过从实际工作负载中符号化构建一个不断增长的“已认证专家函数库”(certified expert functions),每个专家覆盖Probabilistic Language Trie(PLT)中的一个节点区域,并携带统一适用于该区域内所有输入的正式误差界。关键创新在于自认证定理:任意输入下的近似误差被严格限定为 ε_fit + 2·Λ(W)·C_E,其中拟合误差 ε_fit、模型Lipschitz常数 Λ(W) 与最大嵌入直径 C_E 均可在部署时直接验证,从而实现了无需真值即可保证性能边界的能力。

链接: https://arxiv.org/abs/2605.04069
作者: Gregory Magarshak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注: 45 pages. Companion paper to arXiv:2604.06228 (Probabilistic Language Tries)

点击查看摘要

Abstract:We introduce LAWS (Learning from Actual Workloads Symbolically), a self-certifying inference caching architecture that builds a growing library of certified expert functions from deployment observations. Each expert covers a region of input space defined by a node in the Probabilistic Language Trie (PLT) of the base model and carries a formal error bound holding uniformly over all inputs. The central result is a self-certification theorem: for any input x, the LAWS approximation error is bounded by epsilon_fit + 2*Lambda(W)*C_E, where Lambda(W) is the model Lipschitz constant, C_E is the maximum embedding diameter, and epsilon_fit is the expert training error – all checkable at deployment time without ground truth. We prove that LAWS generalizes both Mixture-of-Experts and KV prefix caching as special cases and is strictly more expressive than any fixed-K MoE or finite cache. Further results include a monotone hit rate theorem (any-match routing ensures coverage only increases), an expert library growth rate of O(2^H log N) where H is workload entropy, a fleet learning convergence theorem with Omega(K) speedup for K-unit fleets, and an over-the-air update bandwidth bound. We conjecture that LAWS is acquisition-optimal among stationary online caching algorithms and that the effective Lipschitz constant on the training distribution grows polynomially rather than exponentially in depth. Applications are developed for LLM inference, robotic control, and multi-agent edge deployment.
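摘要中给出的自认证误差界可以直接写成一个部署时可计算的函数——三个输入量都无需真值即可测得:

```python
def certified_error_bound(eps_fit, lipschitz, embed_diameter):
    """The LAWS self-certification bound from the abstract:
    error(x) <= eps_fit + 2 * Lambda(W) * C_E for every input x
    covered by the expert. eps_fit is the expert training error,
    Lambda(W) the model Lipschitz constant, C_E the maximum
    embedding diameter of the covered region."""
    return eps_fit + 2.0 * lipschitz * embed_diameter

bound = certified_error_bound(eps_fit=0.1, lipschitz=2.0, embed_diameter=0.05)
```

部署端只需在路由命中时比较该界与任务的误差预算,即可决定是否使用缓存专家(数值为示意取值)。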

[AI-91] Designing a double deep reinforcement learning selection tool for resilient demand prediction

【速读】:该论文旨在解决供应链需求预测中自动选择合适预测模型的难题,这一问题因数据集特性差异而变得复杂。其核心解决方案是提出一种双深度强化学习(double deep reinforcement learning)架构,作为代理(agent)在预测时刻从预测委员会(forecasting committee)中自动选择最优模型;同时引入基于平均奖励收敛的新型早停策略,以加速训练过程。实验表明,该方法在超市销售和零食需求数据集上均优于现有先进方法,展现出良好的鲁棒性。

链接: https://arxiv.org/abs/2605.04068
作者: Bilel Abderrahmane Benziane,Benoit Lardeux,Ayoub Mcharek,Maher Jridi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The use of artificial intelligence in supply chain forecasting has attracted many scientific studies for several decades. However, the process of selecting an appropriate forecasting solution becomes a daunting task. This complexity arises due to the distinct features inherent to each dataset. Research to tackle this issue has been performed since the eighties but recent development of demand forecasting has opened new perspectives. This research aims to enhance automatic forecasting model selection by proposing a novel architecture that acts as a double deep reinforcement learning agent, selecting automatically a forecasting model from the forecasting committee at the time of prediction. Moreover, a novel early-stopping approach based on average reward convergence has been introduced to expedite training time. To evaluate the model’s performance, an empirical study was conducted utilizing grocery sales datasets and snack demands datasets. The experimental results demonstrate the robustness of the proposed approach when compared to state-of-the-art methods.
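基于平均奖励收敛的早停思路可示意如下(双窗口比较与容差均为假设;摘要未给出该判据的具体实现细节):

```python
def converged(rewards, window=10, tol=1e-3):
    """Early-stopping test on average-reward convergence (sketch):
    stop training once the mean reward over the last window is
    within tol of the mean over the window before it."""
    if len(rewards) < 2 * window:
        return False
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return abs(recent - previous) < tol
```

奖励曲线趋于平坦时判据触发,从而避免在代理已稳定选择预测模型后继续无意义的训练轮次。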

[AI-92] Investigating Trustworthiness of Nonparametric Deep Survival Models for Alzheimer's Disease Progression Analysis

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Dementia, AD)进展建模中存在模型偏见的问题,尤其关注深度学习驱动的生存分析模型在敏感属性(如性别、种族和教育水平)上的公平性缺失,这可能导致对边缘化群体的预测不公平且不可靠。解决方案的关键在于提出两个新颖的公平性度量指标——时间依赖一致性不纯度(Time-Dependent Concordance Impurity)和Kaplan-Meier公平性(Kaplan-Meier Fairness),用于量化非参数生存模型中由敏感属性引发的偏差,并结合特征重要性分析识别对可靠AD预测最具影响力的特征,从而为构建更公平、可信赖的AD进展预测系统提供方法支撑。

链接: https://arxiv.org/abs/2605.04063
作者: Jacob Thrasher,Kaitlyn Heintzelman,Peter Martone,David Kotlowski,Binod Bhattarai,Donald Adjeroh,Prashnna Gyawali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages, 6 figures, 2 tables, IEEE/ACM Conference on Connected Health: Applications, Systems, and Engineering Technologies

点击查看摘要

Abstract:Alzheimer’s Dementia (AD) is a progressive neurodegenerative disease marked by irreversible decline, making reliable modeling of its progression essential for effective patient care. Progression-aware methods such as survival analysis are therefore crucial tools for the early detection and monitoring of AD. Recent advancements in deep learning have demonstrated remarkable performance in survival tasks, but alarmingly fewer studies have been conducted in the domain of AD. Further, the studies that do exist do not consider learned bias within the model itself, which could result in unfair and unreliable predictions toward certain marginalized groups. As such, we conduct a rigorous study of fairness in AD progression analysis along with a thorough feature importance study to determine the characteristics which are most important for reliable AD predictions. Furthermore, we propose two novel fairness metrics, called Time-Dependent Concordance Impurity and Kaplan-Meier Fairness, to quantify bias with respect to sensitive attributes such as sex, race, and education in nonparametric survival models. Our study demonstrates that while deep learning powered survival models are robust tools which can aid clinicians in AD care decisions, they often exhibit considerable bias, representing important avenues for future research.
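Kaplan-Meier 估计是标准的非参数生存工具;下面先给出标准 KM 估计,再用两组生存曲线在某一时刻的绝对差作为“Kaplan-Meier 公平性”的一种假设性解读(论文的正式定义可能不同):

```python
def km_survival(times, events, horizon):
    """Standard Kaplan-Meier estimate of S(horizon).
    times: event/censoring times; events: 1 = event, 0 = censored."""
    s, at_risk = 1.0, len(times)
    for t, e in sorted(zip(times, events)):
        if t > horizon:
            break
        if e:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def km_fairness_gap(group_a, group_b, horizon):
    """Hypothetical group-gap reading of 'Kaplan-Meier Fairness':
    absolute difference between two sensitive groups' survival
    estimates at a horizon (the paper's exact metric may differ)."""
    (ta, ea), (tb, eb) = group_a, group_b
    return abs(km_survival(ta, ea, horizon) - km_survival(tb, eb, horizon))
```

若模型在不同性别或种族分组上导出的生存曲线差距系统性偏大,即提示存在需进一步审查的偏差。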

[AI-93] EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限设备上部署时面临的高存储与计算开销问题,尤其针对低比特量化(extremely low-bit quantization)导致的性能显著下降难题。现有方法如后训练量化(Post-Training Quantization, PTQ)、量化感知训练(Quantization-Aware Training, QAT)及量化感知蒸馏(Quantization-Aware Distillation)分别存在精度损失严重、训练成本高或依赖特定教师数据等局限性。论文提出EdgeRazor框架,其核心创新在于三个模块:混合精度量化感知蒸馏(Mixed-Precision Quantization-Aware Distillation)实现细粒度精度控制;自适应特征蒸馏(Adaptive Feature Distillation)从16-bit教师模型中提取n-bit学生模型;熵感知KL散度(Entropy-Aware KL Divergence)基于教师输出分布动态平衡正向与反向信息流,无需人工选择蒸馏特征且不依赖额外标注数据。该方案在极低比特(如1.88-bit)下超越3-bit及2-bit先进方法,并大幅降低训练预算与存储需求。

链接: https://arxiv.org/abs/2605.04062
作者: Shu-Hao Zhang,Le-Tong Huang,Xiang-Sheng Deng,Xin-Yi Zou,Chen Wu,Nan Li,Shao-Qun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent years have witnessed an increasing interest in deploying LLMs on resource-constrained devices, among which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ) that calibrates quantized parameters on a small dataset without retraining but suffers from severe performance degradation below 4-bit, Quantization-Aware Training (QAT) that searches low-bit parameters using surrogate gradients but demands substantial computational resources, and Quantization-Aware Distillation that integrates QAT with knowledge transfer from a full-precision teacher but manually selects features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for the fine-grained control of precision, Adaptive Feature Distillation that derives an n -bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher’s output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor with 1.88-bit surpasses all contenders with the 3-bit precision, especially outperforms the leading 2-bit PTQ methods by 11.3 points, within a 4-10 \times lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit width; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1 \times relative to the 16-bit baseline.
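“正向/反向 KL 的平衡仅由教师输出分布决定”这一机制可示意如下(按最大熵线性归一化得到权重 alpha 是假设;摘要只说明平衡由教师分布单独确定):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """Forward KL divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy_aware_kl(teacher, student):
    """Blend forward and reverse KL with a weight set solely by the
    teacher's output entropy, normalised by the maximum entropy
    log(vocab size). High-entropy (uncertain) teachers favour the
    mode-covering forward KL; low-entropy teachers favour the
    mode-seeking reverse KL. The linear normalisation is an assumption."""
    alpha = entropy(teacher) / math.log(len(teacher))
    return alpha * kl(teacher, student) + (1.0 - alpha) * kl(student, teacher)
```

教师分布越接近均匀(alpha 趋近 1),损失越接近纯正向 KL;分布越尖锐,反向 KL 的占比越大。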

[AI-94] MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning AAAI2026

【速读】:该论文旨在解决参数高效迁移学习(Parameter-efficient transfer learning, PETL)中因梯度反向传播导致的显著内存开销问题,以及现有内存高效迁移学习(Memory-efficient transfer learning, METL)方法因侧网络(side networks)规模受限而性能下降的问题。其解决方案的关键在于提出一种混合精度交互式侧路专家混合模型(Mixed-Precision Interactive Side Mixture-of-Experts, MP-ISMoE):首先通过高斯噪声扰动迭代量化(Gaussian Noise Perturbed Iterative Quantization, GNP-IQ)将权重压缩至低比特,有效降低量化误差并释放内存;随后利用节省的内存资源引入交互式侧路专家混合(Interactive Side Mixture-of-Experts, ISMoE),在不牺牲整体内存效率的前提下扩展侧网络容量;不同于传统专家混合模型,ISMoE通过与冻结主干网络提取的显著特征进行交互来选择最优专家,从而缓解知识遗忘并提升性能。

链接: https://arxiv.org/abs/2605.04058
作者: Yutong Zhang,Zimeng Wu,Shangcai Liao,Shujiang Wu,Jiaxin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI2026 Accepted

点击查看摘要

Abstract:Parameter-efficient transfer learning (PETL) has emerged as a pivotal paradigm for adapting pre-trained foundation models to downstream tasks, significantly reducing trainable parameters yet suffering from substantial memory overhead caused by gradient backpropagation during fine-tuning. While memory-efficient transfer learning (METL) circumvents this challenge by bypassing backbone gradient computation via lightweight small side networks, its stringent memory constraint severely limits learning capacity of side networks, thereby significantly compromising performance. To address these limitations, we propose a novel Mixed-Precision Interactive Side Mixture-of-Experts framework (MP-ISMoE). Specifically, we first propose a Gaussian Noise Perturbed Iterative Quantization (GNP-IQ) scheme to quantize weights into lower-bits while effectively decreasing quantization errors. By leveraging memory conserved from GNP-IQ, we subsequently employ Interactive Side Mixture-of-Experts (ISMoE) to scaling up side networks without sacrificing overall memory efficiency. Different from conventional mixture-of-experts, ISMoE learns to select optimal experts by interacting with salient features from frozen backbones, thus suppressing knowledge forgetting and boosting performance. Extensive experiments across diverse vision-language and language-only tasks demonstrate that MP-ISMoE remarkably promotes accuracy compared to state-of-the-art METL approaches, while maintaining comparable parameter and memory efficiency.

[AI-95] Structured Progressive Knowledge Activation for LLM-Driven Neural Architecture Search

【速读】:该论文旨在解决神经架构搜索(Neural Architecture Search, NAS)中如何在昂贵的评估条件下,有效整合已有的架构知识并探索新型设计的问题。其核心挑战在于:大型语言模型(Large Language Models, LLMs)虽能将丰富的架构先验转化为可执行的代码修改,但局部编辑常因功能耦合引发非局部的行为与性能变化,即“功能纠缠”(functional entanglement)。为应对这一问题,论文提出结构化渐进式知识激活(Structured Progressive Knowledge Activation, SPARK),其关键在于通过显式选择需修改的功能因子,并基于该因子条件化编辑操作,从而减少纠缠副作用,实现更精准、可靠的架构调整。

链接: https://arxiv.org/abs/2605.04057
作者: Zhen Liu,Yuhan Liu,Jingwen Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper focuses on a key challenge in Neural Architecture Search (NAS): integrating established architectural knowledge while exploring new designs under expensive evaluations. Large language models (LLMs) are a promising assistant for NAS because they can translate rich architectural and coding priors into executable code edits. However, in practice, seemingly local revisions often propagate into non-local behavioral and performance shifts because a single edit can inadvertently couple multiple interacting functional factors, a phenomenon we refer to as functional entanglement. To make LLM knowledge usable under such entanglement, we propose Structured Progressive Knowledge Activation (SPARK), which activates relevant priors by explicitly selecting the functional factor to modify and conditioning the edit on that factor. This factor-conditioned editing reduces entangled side effects and yields more targeted, reliable architecture modifications. On CLRS-DFS, SPARK achieves a 28.1x sample-efficient architecture evolution speedup and yields a 22.9 percent relative improvement in OOD accuracy.

[AI-96] Transformation Categorization Based on Group Decomposition Theory Using Parameter Division

【速读】:该论文旨在解决无监督表示学习中“良好”表示的定义问题,即如何在不依赖标签的情况下学习具有语义意义的感官表征,并实现对输入间变换关系的有效分类。传统解耦(disentanglement)方法假设各因素相互独立,但在因子耦合时失效;为此,作者提出基于伽罗瓦理论(Galois-theoretic)的群分解方法,通过学习两个变换的乘积,其中一因子被约束于正规子群(normal subgroup),从而覆盖交换与非交换情形。关键创新在于将参数划分(parameter division)引入单个变换:将变换参数拆分为若干部分,施加同态约束使完整变换映射到其中一个分量,并将正规子群定义为当该分量固定为单位元时的变换集合。此方法摒弃了先前依赖运动和等距等辅助假设,实现了更广泛的适用性,并在图像旋转、平移和缩放等变换对上的实验验证表明,群分解约束驱动了正确的类别划分。

链接: https://arxiv.org/abs/2605.04056
作者: Takayuki Komatsu,Yoshiyuki Ohmura,Yasuo Kuniyoshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Representation learning seeks meaningful sensory representations without supervision and can model aspects of human development. Although many neural networks empirically learn useful features, a principled account of what makes a representation “good” remains elusive. We study unsupervised categorization of transformations between pairs of inputs under algebraic constraints. Classical disentanglement favors mutually independent factors and fails when factors are coupled. Our prior Galois-theoretic approach decomposes a group via normal subgroups by learning a product of two transformations with one factor constrained to a normal subgroup, covering both commutative and non-commutative cases. That method, however, relied on auxiliary assumptions (e.g., motion and isometry restrictions) not required by decomposition theory, and ablations did not separate theory-based from auxiliary effects. We propose parameter division for a single transformation: we split its parameter into components, impose homomorphism constraints mapping the full transformation to one component, and identify the normal subgroup as the set of transformations when that component is fixed to the identity. This formulation drops the previous auxiliary assumptions and applies more broadly. We evaluate on image pairs involving rotation, translation, and scale; ablations show that group-decomposition constraints drive appropriate categorization.
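以 SE(2)(平面旋转 + 平移)为例,可以具体说明“参数划分—同态约束—正规子群”三者的关系(该群仅作说明;论文实验针对图像对上的旋转、平移与缩放):

```python
import math

def compose(g1, g2):
    """Compose two SE(2) elements g = (theta, tx, ty), acting on the
    plane as x -> R(theta) x + t. Projecting onto the theta component
    is a group homomorphism, and its kernel {(0, tx, ty)} -- the pure
    translations -- is a normal subgroup: exactly the structure that
    parameter division identifies by fixing one component to the
    identity."""
    th1, x1, y1 = g1
    th2, x2, y2 = g2
    c, s = math.cos(th1), math.sin(th1)
    return (th1 + th2, x1 + c * x2 - s * y2, y1 + s * x2 + c * y2)

def project_theta(g):
    """The homomorphism onto the rotation parameter."""
    return g[0]
```

注意平移分量的复合并非简单相加(依赖于旋转),但 theta 分量满足同态性;将 theta 固定为单位元得到的平移集合即为正规子群。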

[AI-97] LCM: Lossless Context Management

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文处理任务中面临的效率与信息保真度难题,尤其是如何在不丢失原始信息的前提下实现高效的上下文管理。其解决方案的关键在于提出一种确定性架构——无损上下文管理(Lossless Context Management, LCM),该架构将符号递归分解为两个由引擎管理的确定性机制:一是递归上下文压缩,通过层次化的摘要有向无环图(DAG)自动压缩旧消息并保留所有原始内容的无损指针;二是递归任务划分,利用引擎管理的并行原语(如LLM-Map)替代模型自写的循环结构。这一设计在保证所有历史状态可完全回溯的同时,提升了短任务的零成本连续性和终止保障能力,从而显著优于传统LLM及前沿编码代理(如Claude Code)在长上下文评估(OOLONG)中的表现。

链接: https://arxiv.org/abs/2605.04050
作者: Clint Ehrlich,Theodore Blackman
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We introduce Lossless Context Management (LCM), a deterministic architecture for LLM memory that outperforms Claude Code on long-context tasks. When benchmarked using Opus 4.6, our LCM-augmented coding agent, Volt, achieves higher scores than Claude Code on the OOLONG long-context eval, including at every context length between 32K and 1M tokens. LCM may be considered both a vindication and extension of the recursive paradigm pioneered by Recursive Language Models (RLMs). Our results demonstrate that recursive context manipulation can outperform not just conventional LLMs, but frontier coding agents with native file-system access. LCM departs from RLM by decomposing symbolic recursion into two deterministic, engine-managed mechanisms: recursive context compression, in which a hierarchical summary DAG automatically compacts older messages while retaining lossless pointers to every original; and recursive task partitioning, in which engine-managed parallel primitives like LLM-Map replace model-written loops. This trade-off, analogous to the move from GOTO to structured control flow in programming language design, sacrifices maximal flexibility for termination guarantees, zero-cost continuity on short tasks, and lossless retrievability of all prior state.
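“递归上下文压缩 + 无损指针”的核心思想可用一个极简片段示意(真实 LCM 构建层次化摘要 DAG 并递归压缩;此处仅展示单层压缩,接口与字段名均为假设):

```python
def compact(messages, keep_last, summarize):
    """Sketch of lossless context compression: messages older than
    the last keep_last are replaced in the working context by a
    summary node that keeps pointers (indices) back to every
    original, so any of them can still be retrieved verbatim."""
    if len(messages) <= keep_last:
        return list(messages), None
    old, recent = messages[:-keep_last], messages[-keep_last:]
    node = {"summary": summarize(old), "pointers": list(range(len(old)))}
    return list(recent), node

msgs = ["m0", "m1", "m2", "m3", "m4"]
recent, node = compact(
    msgs, keep_last=2, summarize=lambda ms: f"{len(ms)} older messages"
)
```

压缩后工作上下文只含摘要节点与最近消息,但每条旧消息仍可经指针按原文取回——这正是“无损”区别于普通摘要截断之处。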

[AI-98] Interpreting Manifolds and Graph Neural Embeddings from Internet of Things Traffic Flows

【速读】:该论文旨在解决物联网(Internet of Things, IoT)生态系统中网络拓扑日益复杂和异构所带来的监控与可视化难题,传统工具因依赖聚合指标或静态表示而难以捕捉设备间动态关系及结构依赖。其解决方案的关键在于提出一种可解释的流水线方法,通过将高维嵌入映射到低维潜在流形(latent manifold),生成可直接可视化的表示,并结合特征归因技术解析塑造流形结构的具体特征,从而实现对网络状态演化的可解释监控与互操作性,同时在入侵检测任务中达到0.830的F1分数,并识别如概念漂移等关键现象。

链接: https://arxiv.org/abs/2602.05817
作者: Enrique Feito-Casares,Francisco M. Melgarejo-Meseguer,Elena Casiraghi,Giorgio Valentini,José-Luis Rojo-Álvarez
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:The rapid expansion of Internet of Things (IoT) ecosystems has led to increasingly complex and heterogeneous network topologies. Traditional network monitoring and visualization tools rely on aggregated metrics or static representations, which fail to capture the evolving relationships and structural dependencies between devices. Although Graph Neural Networks (GNNs) offer a powerful way to learn from relational data, their internal representations often remain opaque and difficult to interpret for security-critical operations. Consequently, this work introduces an interpretable pipeline that generates directly visualizable low-dimensional representations by mapping high-dimensional embeddings onto a latent manifold. This projection enables the interpretable monitoring and interoperability of evolving network states, while integrated feature attribution techniques decode the specific characteristics shaping the manifold structure. The framework achieves a classification F1-score of 0.830 for intrusion detection while also highlighting phenomena such as concept drift. Ultimately, the presented approach bridges the gap between high-dimensional GNN embeddings and human-understandable network behavior, offering new insights for network administrators and security analysts.

[AI-99] Learning Reconstructive Embeddings in Reproducing Kernel Hilbert Spaces via the Representer Theorem

【速读】:该论文旨在解决高维数据中潜在流形结构的表示学习问题,特别是如何通过重建机制在再生核希尔伯特空间(Reproducing-Kernel Hilbert Space, RKHS)中挖掘数据的内在几何特性。其解决方案的关键在于:首先利用向量形式的Representer定理优化每个样本在RKHS中的自重构(autorepresentation)属性,从而获得高维空间中的重建权重;随后引入可分离的算子值核(operator-valued kernel)扩展方法以处理向量值数据,同时保持单个标量相似性函数的简洁性;最后通过核对齐(kernel alignment)任务将数据投影至低维隐空间,使嵌入后的Gram矩阵逼近原始高维重建核,从而将RKHS中的自重构几何结构有效迁移至低维表示中。这一框架系统性地扩展了自然数据普遍具有的自重构性质,并结合核学习理论的经典成果实现高效流形学习。

链接: https://arxiv.org/abs/2601.05811
作者: Enrique Feito-Casares,Francisco M. Melgarejo-Meseguer,José-Luis Rojo-Álvarez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Motivated by the growing interest in representation learning approaches that uncover the latent structure of high-dimensional data, this work proposes new algorithms for reconstruction-based manifold learning within Reproducing-Kernel Hilbert Spaces (RKHS). Each observation is first reconstructed as a linear combination of the other samples in the RKHS, by optimizing a vector form of the Representer Theorem for their autorepresentation property. A separable operator-valued kernel extends the formulation to vector-valued data while retaining the simplicity of a single scalar similarity function. A subsequent kernel-alignment task projects the data into a lower-dimensional latent space whose Gram matrix aims to match the high-dimensional reconstruction kernel, thus transferring the auto-reconstruction geometry of the RKHS to the embedding. Therefore, the proposed algorithms represent an extended approach to the autorepresentation property, exhibited by many natural data, by using and adapting well-known results of Kernel Learning Theory. Numerical experiments on both simulated (concentric circles and swiss-roll) and real (cancer molecular activity and IoT network intrusions) datasets provide empirical evidence of the practical effectiveness of the proposed approach.
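As a concrete illustration of the first stage, the sketch below computes, for each sample, the representer-theorem ridge solution that reconstructs φ(x_i) from the other samples in a Gaussian-kernel RKHS; the kernel width, ridge strength, and data are our own illustrative choices, not values from the paper:

```python
import numpy as np

def rbf_gram(X, gamma=0.1):
    # Gaussian-kernel Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def autoreconstruction(K, lam=1e-3):
    """For each i, minimize ||phi(x_i) - sum_{j!=i} a_j phi(x_j)||^2 + lam*||a||^2.

    The representer-theorem solution solves (K_{-i,-i} + lam*I) a = K_{-i,i}."""
    n = K.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        a = np.linalg.solve(K[np.ix_(idx, idx)] + lam * np.eye(n - 1), K[idx, i])
        W[i, idx] = a
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
K = rbf_gram(X)
W = autoreconstruction(K)
# Squared RKHS reconstruction error: K_ii - 2 w_i.k_i + w_i K w_i (>= 0, < K_ii = 1)
errs = [K[i, i] - 2 * W[i] @ K[:, i] + W[i] @ K @ W[i] for i in range(len(X))]
print(f"max squared reconstruction error: {max(errs):.3f} (vs K_ii = 1)")
```

A subsequent kernel-alignment step would then seek a low-dimensional embedding whose Gram matrix matches the reconstruction kernel built from these weights.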

[AI-100] A large language model-type architecture for high-dimensional molecular potential energy surfaces

【速读】:该论文旨在解决高维势能面(Potential Energy Surface, PES)计算的难题,尤其是在分子体系和材料中准确预测反应速率等关键物理化学性质时面临的挑战。其核心问题在于如何高效且精确地构建具有数百个核坐标维度的全维势能面,这在传统方法中因“维度灾难”而难以实现。解决方案的关键在于引入一种类大语言模型(Large Language Models in Generative AI)的图神经网络架构:将分子系统表示为包含节点、边、面等结构的图,并通过图论提取子系统间的相互作用来构建势能面;利用一组基于图结构的低维神经网络模块,首先在51核坐标维度下训练并验证有效性,随后将其扩展至186核坐标的更大体系,实现了亚千卡/摩尔(sub-kcal/mol)精度的预测,从而首次实现了质子化21水团簇(186核维度)在CCSD水平上的全维势能面计算。

链接: https://arxiv.org/abs/2412.03831
作者: Xiao Zhu,Srinivasan S. Iyengar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atomic and Molecular Clusters (physics.atm-clus); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
备注: 31 pages, 35 figures

点击查看摘要

Abstract:Computing high-dimensional potential energy surfaces for molecular systems and materials is considered to be a great challenge in computational chemistry with potential impact in a range of areas including the fundamental prediction of reaction rates. In this paper, we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces, etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 nuclear dimensions. For this purpose, a family of neural networks that pertain to the graph-theoretically obtained subsystems get the job done for this 51 nuclear dimensional system. We then ask if this same family of lower-dimensional graph-based neural networks can be transformed to provide accurate predictions for a 186-dimensional potential energy surface. We find that our algorithm does provide accurate results for this larger-dimensional problem with sub-kcal/mol accuracy for the higher-dimensional potential energy surface problem. Indeed, as a result of these developments, here we produce the first efforts towards a full-dimensional potential energy surface for the protonated 21-water cluster (186 nuclear dimensions) at CCSD level accuracy.

[AI-101] Grokability in five inequalities

【速读】:该论文旨在解决多个高维几何与组合分析中的经典不等式和极值问题,包括凸集的高斯周长下界、哈密顿立方体上的L₂-L₁矩比较不等式、自卷积不等式的加强、g-Sidon集的最大尺寸渐近估计以及Szarek不等式的最优平衡形式。其解决方案的关键在于利用生成式 AI (Generative AI) 辅助发现并验证了若干新的数学不等式和边界结果,这些发现经由作者独立验证后被确认为正确,并在多个领域中实现了理论边界的改进或最优性突破。

链接: https://arxiv.org/abs/2605.05193
作者: Paata Ivanisvili,Xinyuan Xie
机构: 未知
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Classical Analysis and ODEs (math.CA); Functional Analysis (math.FA)
备注:

点击查看摘要

Abstract:In this note, we report five mathematical discoveries made in collaboration with Grok, all of which have been subsequently verified by the authors. These include an improved lower bound on the maximal Gaussian perimeter of convex sets in \mathbb{R}^n , sharper L_2 - L_1 moment comparison inequalities on the Hamming cube \{-1,1\}^n , a strengthened autoconvolution inequality, improved asymptotic bounds on the size of the largest g -Sidon sets in \{1,\dots,n\} , and an optimal balanced Szarek’s inequality.

[AI-102] Almost-Orthogonality in Lp Spaces: A Case Study with Grok

【速读】:该论文旨在解决Carbery提出的关于多函数三角不等式的锐化形式是否成立的问题,特别是针对指数 $ c $ 的取值范围及其在特定条件下的最优性。论文首先构造反例证明原不等式在 $ p > 2 $ 时失效,进而证明若此类估计成立,则必须满足 $ c \leq p' $(其中 $ 1/p + 1/p' = 1 $);并在临界指数 $ c = p' $ 下,对所有整数 $ p \geq 2 $ 建立了该不等式。解决方案的关键在于引入一个依赖于三函数间正交程度的参数 $ \Gamma $,从而获得一个适用于 $ p \geq 3 $ 的尖锐三函数估计,其指数 $ c(p) = \frac{2}{\ln 2}(p-2)\ln 3 + 2\ln 2 $ 被证明为最优,并优于此前由Carlen、Frank和Lieb得到的 $ r(p) = \frac{6}{5}p - 4 $。

链接: https://arxiv.org/abs/2605.05192
作者: Ziang Chen,Jaume de Dios Pont,Paata Ivanisvili,Jose Madrid,Haozhu Wang
机构: 未知
类目: Classical Analysis and ODEs (math.CA); Artificial Intelligence (cs.AI); Combinatorics (math.CO); Probability (math.PR)
备注:

点击查看摘要

Abstract:Carbery proposed the following sharpened form of the triangle inequality for many functions: for any p\ge 2 and any finite sequence (f_j)_j\subset L^p we have [ \Big\|\sum_j f_j\Big\|_p \ \le\ \left(\sup_j \sum_k \alpha_{jk}^{\,c}\right)^{1/p'} \Big(\sum_j \|f_j\|_p^p\Big)^{1/p}, ] where c=2 , 1/p+1/p'=1 , and \alpha_{jk}=\sqrt{\frac{\|f_j f_k\|_{p/2}}{\|f_j\|_p \|f_k\|_p}} . In the first part of this paper we construct a counterexample showing that this inequality fails for every p > 2 . We then prove that if an estimate of the above form holds, the exponent must satisfy c\le p' . Finally, at the critical exponent c=p' , we establish the inequality for all integer values p\ge 2 . In the second part of the paper we obtain a sharp three-function bound [ \Big\|\sum_{j=1}^3 f_j\Big\|_p \ \le\ \left(1+2\Gamma^{c(p)}\right)^{1/p'} \Big(\sum_{j=1}^3 \|f_j\|_p^p\Big)^{1/p}, ] where p \geq 3 , c(p) = \frac{2}{\ln 2}(p-2)\ln 3 + 2\ln 2 and \Gamma=\Gamma(f_1,f_2,f_3)\in[0,1] quantifies the degree of orthogonality among f_1,f_2,f_3 . The exponent c(p) is optimal, and improves upon the power r(p) = \frac{6}{5}p-4 obtained previously by Carlen, Frank, and Lieb. Some intermediate lemmas and inequalities appearing in this work were explored with the assistance of the large language model Grok.
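The one regime the abstract reports as fully established — the critical exponent c = p' for integer p ≥ 2 — can be checked numerically at p = 2 (where c = p' = 2), with finite random vectors under counting measure standing in for L^p functions; the data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def lp(x, p):
    # Discrete L^p norm with counting measure.
    return float((np.abs(x) ** p).sum() ** (1.0 / p))

p = 2                          # at p = 2 the critical exponent is c = p' = 2
fs = rng.normal(size=(5, 30))  # five random "functions" on 30 points

# At c = 2: alpha_jk^c = alpha_jk^2 = ||f_j f_k||_{p/2} / (||f_j||_p ||f_k||_p)
n = len(fs)
alpha_sq = np.empty((n, n))
for j in range(n):
    for k in range(n):
        alpha_sq[j, k] = lp(fs[j] * fs[k], p / 2) / (lp(fs[j], p) * lp(fs[k], p))

lhs = lp(fs.sum(axis=0), p)
rhs = alpha_sq.sum(axis=1).max() ** 0.5 * sum(lp(f, p) ** p for f in fs) ** 0.5
print(lhs <= rhs)  # True for the c = p' case at p = 2
```

For p > 2 the abstract's counterexample shows that the original c = 2 form cannot hold in general, so this check is informative only at the critical exponent.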

[AI-103] Building informative materials datasets beyond targeted objectives

【速读】:该论文旨在解决材料科学数据集构建中因仅关注特定目标属性而导致未被纳入的其他属性预测性能下降的问题,从而影响数据集的长期可用性和多任务学习能力。其关键解决方案是提出一种基于多样性感知的选择框架(diversity-aware selection framework),在最大化目标属性信息量的同时,保障未被重点关注属性的性能不显著退化。该方法通过确保材料空间的广泛覆盖,有效提升了数据集对潜在未来研究目标的适应性,避免了冷启动限制,并增强了数据集的整体通用性与公平性。

链接: https://arxiv.org/abs/2605.05104
作者: Rafael Espinosa Castañeda,Ashley Dale,Hongchen Wang,Yonatan Kurniawan,Hao Wan,Runze Zhang,Adji Bousso Dieng,Kangming Li,Jason Hattrick-Simpers
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Materials science data collection can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns can generate datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10%. For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.
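The abstract does not specify the exact acquisition rule, so the sketch below uses one common "diversity-aware" selector — greedy farthest-point (max-min) selection over a feature space — purely to illustrate the idea of spreading picks across the candidate pool; the feature matrix and sizes are invented:

```python
import numpy as np

def farthest_point_selection(X, k, seed=0):
    """Greedy max-min diversity: each new pick maximizes its distance to the
    closest already-selected point, spreading coverage over feature space.
    (One generic diversity-aware strategy; the paper's criterion may differ.)"""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    chosen = [int(rng.integers(n))]
    d = np.linalg.norm(X - X[chosen[0]], axis=1)  # distance to nearest chosen
    while len(chosen) < k:
        nxt = int(d.argmax())                     # most isolated candidate
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

X = np.random.default_rng(2).uniform(size=(200, 4))  # toy candidate features
picks = farthest_point_selection(X, 20)
print(len(set(picks)))  # 20 distinct, well-spread candidates
```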

[AI-104] Think-Aloud Reshapes Automated Cognitive Model Discovery Beyond Behavior

【速读】:该论文试图解决的问题是:基于行为数据(behavioral data)自动发现的认知模型通常存在欠定性(under-determined),即无法唯一确定模型结构,导致模型解释力和泛化能力受限。解决方案的关键在于引入出声思维(Think Aloud)过程数据作为额外的数据约束,在自动化模型发现过程中融合过程层面的语言信息。实验表明,这种结合方式显著提升了模型在保留数据上的预测性能,并且改变了认知模型的结构类别分布——多数参与者(69.4%)从“显式比较器”(Explicit comparator)类模型转变为“集成效用”(Integrated utility)类模型,说明过程语言数据不仅改善了模型拟合度,还系统性重塑了模型结构,揭示出仅靠行为数据无法识别的认知机制。

链接: https://arxiv.org/abs/2605.05091
作者: Hanbo Xie,Akshay K. Jagadish,Lan Pan,Robert C. Wilson
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computational cognitive models discovered using large language models have so far relied solely on behavioral data. However, it is well-known that models produced from the behavioral trajectory alone are typically under-determined. In this work, we explore the use of Think Aloud traces as an additional form of data constraint during automated model discovery. When applied to the domain of risky decision-making, we find that the models discovered with think-aloud achieve significantly improved predictive performance on held-out data. Additionally, we find that the discovered models belong to different structural classes than those discovered from behavior alone for the majority of participants (69.4%), specifically, it shifts from Explicit comparator towards Integrated utility. These results suggest that process-level language data not only improve model fit, but also systematically reshape the structure of the discovered cognitive models, enabling the identification of mechanisms that are not recoverable from behavior alone.

[AI-105] Predictive and Prescriptive AI toward Optimizing Wildfire Suppression

【速读】:该论文旨在解决野火季节中资源稀缺条件下,如何在分散的地理区域内高效分配扑火人员与抑制措施的问题。核心挑战在于:(1)野火需求具有内生性且动态演化非线性;(2)需联合优化人员调度与扑火策略以最大化抑制效果。解决方案的关键在于构建一个融合时间-空间-资源网络的整数优化模型,并设计一种双侧分支定价割平面算法(two-sided branch-and-price-and-cut algorithm),其创新点包括:(i)双侧列生成机制迭代生成扑火方案与人员路径;(ii)基于链接约束的背包结构设计新型割平面;(iii)针对非线性野火演化的新型分支规则。此外,引入数据驱动的双重机器学习方法估计野火蔓延规律,缓解历史扑火资源配置与野火增长之间的混淆偏倚,从而提升模型预测精度与实际应用效果。

链接: https://arxiv.org/abs/2605.04510
作者: Leonard Boussioux,Alexandre Jacquillat,Ryne Reger,Jacob Wachspress
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intense wildfire seasons require critical prioritization decisions to allocate scarce suppression resources over a dispersed geographical area. This paper develops a predictive and prescriptive approach to jointly optimize crew assignments and wildfire suppression. The problem features a discrete resource-allocation structure with endogenous wildfire demand and non-linear wildfire dynamics. We formulate an integer optimization model with crew assignments on a time-space-rest network, wildfire dynamics on a time-state network, and linking constraints between them. We develop a two-sided branch-and-price-and-cut algorithm based on: (i) a two-sided column generation scheme that generates fire suppression plans and crew routes iteratively; (ii) a new family of cuts exploiting the knapsack structure of the linking constraints; and (iii) novel branching rules to accommodate non-linear wildfire dynamics. We also propose a data-driven double machine learning approach to estimate wildfire spread as a function of covariate information and suppression efforts, mitigating observed confounding between historical crew assignments and wildfire growth. Extensive computational experiments show that the optimization algorithm scales to otherwise intractable real-world instances; and that the methodology can enhance suppression effectiveness in practice, resulting in significant reductions in area burned over a wildfire season and guiding resource sharing across wildfire jurisdictions.

[AI-106] JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

【速读】:该论文旨在解决生成式音频模型(Generative Audio Models)快速发展的背景下,评估方法滞后的问题,尤其是现有客观指标和通用多模态大语言模型(Multimodal Large Language Models, MLLMs)在领域泛化能力、零样本(zero-shot)性能以及指令灵活性方面的不足。其解决方案的关键在于提出JASTIN框架——一个可泛化且指令驱动的音频评估系统,将音频评估建模为自指导推理任务;通过一个可训练的音频适配器(audio adapter)连接冻结的高性能音频编码器与微调后的大型语言模型(LLM)主干网络,并引入包含多源、多任务、多校准和多描述数据的指令遵循数据准备流程,从而实现无需特定任务微调即可在语音、声音、音乐及跨域任务中达到最优的人类主观评分相关性(Pearson和Spearman相关系数)。

链接: https://arxiv.org/abs/2605.04505
作者: Leying Zhang,Bowen Shi,Haibin Wu,Bach Viet Do,Yanmin Qian
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction following data preparation pipeline, incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without the need for task-specific retraining.

[AI-107] Dissociating spatial frequency reliance from adversarial robustness advantages in neurally guided deep convolutional neural networks

【速读】:该论文旨在解决“神经对齐是否通过空间频率偏置(如低空间频率LSF或人类通道)提升深度卷积神经网络(DCNN)的对抗鲁棒性”这一关键问题。其核心发现是:尽管神经对齐模型会增强对LSF和人类通道的依赖,但直接引导模型偏向这些频段并不能显著提升鲁棒性——其中仅LSF偏置带来有限改善,且整体上此类模型与人类神经表征几何结构的相似性并未提高。因此,研究指出空间频率偏置更可能是学习类人表征的副产物,而非鲁棒性提升的主要机制,强调未来需探索超越空间频率维度的表征特性。

链接: https://arxiv.org/abs/2605.04443
作者: Zhenan Shao,Tianyu Ren,Chengxiao Wang,Leyla Isik,Diane M. Beck
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep convolutional neural networks (DCNNs) have rivaled humans on many visual tasks, yet they remain vulnerable to near-imperceptible perturbations generated by adversarial attacks. Recent work shows that aligning DCNN representations with human visual cortex activity improves adversarial robustness, but the mechanisms driving this advantage are unclear. One hypothesis suggests that neural alignment confers robustness by biasing models away from brittle high-frequency details and towards the low spatial frequencies (LSF). However, recent work shows that human object recognition critically depends on a narrow, mid-frequency “human channel”. Interestingly, this band was partially preserved in prior LSF-focused studies. Here, we investigate whether a spectral bias towards the LSF or the human channel is the primary driver of the adversarial robustness observed in neurally aligned DCNNs. We first show that DCNNs aligned to higher-order regions of the human ventral visual stream systematically increase reliance on both LSF and the human channel. However, directly steering DCNNs towards these bands revealed a clear dissociation. Biasing models towards the human channel, either alone or together with LSF, does not improve robustness and even impairs it. LSF bias produced some robustness gains, but such improvements are modest despite inducing much larger shifts in spatial-frequency reliance than neurally aligned models. Spatial-frequency-biased models overall show little, if any, increase in similarity to human neural representational geometry. Together, our results suggest that altered spatial-frequency reliance is likely an emergent property of learning more human-like representations rather than the primary mechanism by which neural alignment confers adversarial robustness, and motivate the need for future research examining representational properties beyond spatial-frequency profiles.

[AI-108] A Dialogue-Based Framework for Correcting Multimodal Errors in AI-Assisted STEM Education

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理多模态物理问题时性能下降的问题,即“多模态干扰效应”(Multimodal Interference Effect),这限制了生成式 AI 在STEM教育中提供公平且高质量辅导的能力。解决方案的关键在于提出并验证一种结构化的多模态对话干预策略,该策略无需模型重新训练即可显著提升AI对图像-rich STEM内容的理解与推理准确性——实证显示该干预纠正了82%的错误,尤其在视觉处理错误上实现100%修正率,从而有效增强AI教学代理在复杂多模态场景下的可靠性。

链接: https://arxiv.org/abs/2605.04131
作者: Akshay Syal,Lawrence Swaminathan Xavier Prince,Evin Gultepe,Nik Bear Brown,Srinivas Sridhar
机构: 未知
类目: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are democratizing access to personalized tutoring; however, their effectiveness is hindered by challenges in processing multimodal content, which limits AI’s potential to provide equitable, high-quality STEM support. This study evaluates LLM performance on multimodal physics problems, identifies specific failure modes through an empirical error taxonomy, and tests practical interventions designed to overcome multimodal processing limitations. We assessed three publicly available LLMs (Claude, Gemini, and ChatGPT) on multimodal physics problems from the OpenStax database and compared the results with text-only performance. An empirically derived error taxonomy was developed through pilot testing, followed by evaluation of a structured multimodal dialogue intervention. All three models achieved near-ceiling accuracy (96%) on text-only physics problems. Performance declined substantially on multimodal problems, consistent with what we term the Multimodal Interference Effect. Error analysis identified four failure modes: visual processing errors, context misinterpretation, mathematical computational errors, and hybrid errors, with visual processing errors being the most prevalent. The structured dialogue intervention corrected 82% of errors overall; visual processing errors were corrected at 100% across all models. Educators and students can implement these interventions immediately, requiring no model retraining, to improve AI tutoring reliability on image-rich STEM content, advancing equitable access to high-quality learning support.

[AI-109] ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation

【速读】:该论文旨在解决当前生成式蛋白结合物设计(Generative Protein Binder Design)领域中缺乏标准化评估协议的问题,从而导致不同研究间的性能指标难以比较和解释。其解决方案的关键在于提出ProtDBench——一个标准化且具备高通量意识的评估框架,通过定义统一的基准任务、评估流程与成功标准,系统性地分析评估设计对性能结果的影响;同时引入基于固定24小时计算预算的吞吐量感知指标及结构多样性考量的簇级成功标准,有效揭示了过滤规则、成功定义与计算效率之间的系统性差异,为蛋白结合物设计方法提供了公平、可复现且贴近实际应用的评估体系。

链接: https://arxiv.org/abs/2605.04118
作者: Cong Liu,Milong Ren,Jiaqi Guan,Chengyue Gong,Jinyuan Sun,Xinshi Chen,Wenzhi Xiao
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation, as well as trade-offs between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.

[AI-110] Meta-LegNet: A Transferable and Interpretable Framework for Surface Adsorption Prediction via Self-Defined Adsorption-Environment Learning

【速读】:该论文旨在解决计算催化中低能且化学上合理的吸附构型识别难题,此类构型直接影响吸附能、反应路径及催化性能。传统方法依赖于候选吸附位点的枚举并结合密度泛函理论(Density Functional Theory, DFT)或基于机器学习的结构优化,但这类流程计算成本高且难以扩展至复杂表面或多吸附体体系。其解决方案的关键在于提出Meta-LegNet框架,该框架融合SE(3)-等变原子级消息传递与体素化多尺度聚合以及跨域元学习,以学习不同催化剂-吸附物体系间可迁移的局部吸附环境表征;通过不变径向特征与等变方向信息编码局部化学环境,并借助坐标系体素池化、基于分配的上采样和门控特征融合引入更广泛的结构上下文,从而实现原子分辨的归因图谱,进而构建吸附环境数据库并采用模板匹配策略在未探索表面上快速预测可能的吸附位点,无需穷举搜索。

链接: https://arxiv.org/abs/2605.04102
作者: Yifan Li,Arravind Subramanian,Xiaoqing Liu,Qiujie Lyu,Sergey Kozlov,Lei Shen
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A central challenge in computational catalysis is the identification of low-energy and chemically plausible adsorption configurations, as these directly affect adsorption energies, reaction pathways, and catalytic performance. Existing approaches generally rely on enumerating candidate adsorption sites followed by iterative refinement through density functional theory calculations or machine-learning-based relaxations. However, such workflows remain computationally expensive and are difficult to scale to complex surfaces or multi-adsorbate systems. Here, we introduce Meta-LegNet, a graph learning framework that combines SE(3)-equivariant atom-level message passing with voxel-based multiscale aggregation and cross-domain meta-learning to learn transferable representations of local adsorption environments across diverse catalyst–adsorbate systems. Rather than following a conventional regression-only paradigm, Meta-LegNet encodes local chemical environments using invariant radial features and equivariant directional information, and further incorporates broader structural context through coordinate-frame voxel pooling, assignment-based upsampling, and gated feature fusion. The resulting local-global decomposition produces atom-resolved attribution maps, which are processed to identify adsorption-relevant local environments in an interpretable manner. Based on the learned representations, we further construct an adsorption-environment database and develop a template-matching strategy to propose likely adsorption sites on previously unexplored surfaces without exhaustive site enumeration. Overall, our results suggest that learning transferable adsorption environments provides an accurate, interpretable, and practical route for accelerating catalyst screening.

[AI-111] CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

【速读】:该论文旨在解决当前人工智能系统在能力范围上的局限性问题,即现有AI系统普遍缺乏人类所具备的灵活性、适应性和多感官整合能力。为应对这一挑战,作者提出了一种名为CTM-AI的通用人工智能(General AI)架构蓝图,其核心创新在于将意识的计算模型——意识图灵机(Conscious Turing Machine, CTM)与当前的基础模型(foundation models)相结合。该方案的关键在于构建一个由大量处理器组成的异构系统,包括专用专家模块(如视觉-语言模型和API)以及通用学习器,这些处理器能够根据任务需求动态选择、整合并交换信息,从而实现跨模态、跨任务的协同推理与决策。实验表明,CTM-AI在MUStARD、UR-FUNNY等多模态基准上达到领先性能,并在工具使用和代理任务中显著优于现有框架,验证了基于意识理论设计通用AI系统的可行性与有效性。

链接: https://arxiv.org/abs/2605.04097
作者: Haofei Yu,Yining Zhao,Lenore Blum,Manuel Blum,Paul Pu Liang
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Despite remarkable advances, today’s AI systems remain narrow in scope, falling short of the flexible, adaptive, and multisensory intelligence that characterizes human capabilities. This gap has fueled longstanding debates about whether AI might one day achieve human-like generality or even consciousness, and whether theories of consciousness can inspire new architectures for AI. This paper presents an early blueprint for implementing a general AI system, CTM-AI, combining the Conscious Turing Machine (CTM), a formal machine model of consciousness, with today’s foundation models. CTM-AI contains an enormous number of powerful processors ranging from specialized experts (e.g., vision-language models and APIs) to unspecialized general-purpose learners poised to develop their own expertise. Crucially, for whatever problem must be dealt with, information from many processors is selected, integrated, and exchanged appropriately to solve the task. CTM-AI achieves state-of-the-art accuracy on MUStARD (72.28) and UR-FUNNY (72.13), outperforming multimodal and multi-agent frameworks. On tool-using and agentic tasks, CTM-AI achieves 10+ points of improvement on StableToolBench and WebArena-Lite. Overall, CTM-AI offers a principled, testable blueprint for general AI inspired by a model of consciousness.

[AI-112] Analogy between Boltzmann machines and Feynman path integrals

【速读】:该论文旨在揭示玻尔兹曼机(Boltzmann machine)在机器学习中的结构与量子统计力学中费曼路径积分(Feynman path-integral)形式之间的深刻联系,从而解决如何从理论上理解神经网络中隐藏层的物理意义这一问题。其解决方案的关键在于识别出隐藏层本质上是费曼路径积分中离散路径元素的对应物,这使得机器学习中的“路径组合”与“路径权重累积”可以类比为量子力学中通过不同路径干涉实现的概率幅叠加;基于此等价性,作者进一步提出适用于玻尔兹曼机和费曼路径积分描述的通用量子电路模型,并将该框架延伸至逆量子散射问题,从而为隐藏层提供一种可解释性的定义方式。

链接: https://arxiv.org/abs/2301.06217
作者: Srinivasan S. Iyengar,Sabre Kais
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We provide a detailed exposition of the connections between Boltzmann machines commonly utilized in machine learning problems and the ideas already well known in quantum statistical mechanics through Feynman’s description of the same. We find that this equivalence allows the interpretation that the hidden layers in Boltzmann machines and other neural network formalisms are in fact discrete versions of path elements that are present within the Feynman path-integral formalism. Since Feynman paths are the natural and elegant depiction of interference phenomena germane to quantum mechanics, it appears that in machine learning, the goal is to find an appropriate combination of ``paths’’, along with accumulated path-weights, through a network that cumulatively capture the correct x \rightarrow y map for a given mathematical problem. As a direct consequence of this analysis, we are able to provide general quantum circuit models that are applicable to both Boltzmann machines and to Feynman path integral descriptions. Connections are also made to inverse quantum scattering problems which allow a robust way to define ``interpretable’’ hidden layers.
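The sum-over-hidden-configurations reading can be made concrete with a tiny restricted Boltzmann machine: the unnormalized marginal p(v) accumulates one weight per hidden configuration, and — like a path integral over independent time slices — the sum factorizes across hidden units. All parameters below are arbitrary toy values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
nv, nh = 3, 4                        # tiny RBM: 3 visible, 4 hidden units
W = rng.normal(scale=0.5, size=(nv, nh))
b = rng.normal(scale=0.1, size=nv)   # visible bias
c = rng.normal(scale=0.1, size=nh)   # hidden bias

def unnorm_p(v):
    """p(v) proportional to sum_h exp(-E(v,h)): each hidden configuration h
    contributes one 'path weight' exp(v.W.h + b.v + c.h), and the marginal
    accumulates them -- a discrete analogue of summing over paths."""
    total = 0.0
    for h in product([0, 1], repeat=nh):
        h = np.array(h)
        total += np.exp(v @ W @ h + b @ v + c @ h)
    return total

def unnorm_p_factored(v):
    # The same sum factorizes across hidden units (free-energy form),
    # mirroring how path integrals factorize over time slices.
    return np.exp(b @ v) * np.prod(1.0 + np.exp(c + v @ W))

v = np.array([1.0, 0.0, 1.0])
print(np.isclose(unnorm_p(v), unnorm_p_factored(v)))  # True
```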

机器学习

[LG-0] Estimating the expected output of wide random MLPs more efficiently than sampling

链接: https://arxiv.org/abs/2605.05179
作者: Wilson Wu,Victor Lecomte,Michael Winer,George Robinson,Jacob Hilton,Paul Christiano
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注: 68 pages. Code is available at this https URL

点击查看摘要

Abstract:By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily optimal. Given an MLP at initialization, we show how to estimate its expected output over Gaussian inputs without running samples through the network at all. Instead, we produce approximate representations of the distributions of activations at each layer, leveraging tools such as cumulants and Hermite expansions. We show both theoretically and empirically that for sufficiently wide networks, our estimator achieves a target mean squared error using substantially fewer FLOPs than Monte Carlo sampling. We find moreover that our methods perform particularly well at estimating the probabilities of rare events, and additionally demonstrate how they can be used for model training. Together, these findings suggest a path to producing models with a greatly reduced probability of catastrophic tail risks.
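The paper's estimator propagates whole activation distributions through the network via cumulants and Hermite expansions; the sketch below shows only the simplest ingredient of that idea. For a single ReLU unit with Gaussian inputs, the expected output has a closed form — E[max(0, z)] = σ/√(2π) for z ~ N(0, σ²) — so no samples need to pass through the unit at all. The weight vector and sample counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=50)            # one ReLU unit over 50 Gaussian inputs
sigma = np.linalg.norm(w)          # w.x ~ N(0, sigma^2) for x ~ N(0, I)

# Closed form: E[relu(z)] = sigma / sqrt(2*pi) for z ~ N(0, sigma^2)
analytic = sigma / np.sqrt(2 * np.pi)

# Monte Carlo estimate for comparison
x = rng.normal(size=(100_000, 50))
mc = np.maximum(x @ w, 0.0).mean()
print(abs(mc - analytic) / analytic < 0.05)  # True: agreement within a few percent
```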

[LG-1] Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

链接: https://arxiv.org/abs/2605.05176
作者: Alexander Hsu,Zhaiming Shen,Wenjing Liao,Rongjie Lai
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Pre-trained transformers are able to learn from examples provided as part of the prompt without any weight updates, a remarkable ability known as in-context learning (ICL). Despite its demonstrated efficacy across various domains, the theoretical understanding of ICL is still developing. Whereas most existing theory has focused on linear models, we study ICL in the nonlinear regression setting. Through the interaction mechanism in attention, we explicitly construct transformer networks to realize nonlinear features, such as polynomial or spline bases, which span a wide class of functions. Based on this construction, we establish a framework to analyze end-to-end in-context nonlinear regression with the constructed features. Our theory provides finite-sample generalization error bounds in terms of context length and training set size. We numerically validate the theory on synthetic regression tasks.
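The transformer construction itself is not reproduced here; the sketch below shows the computation that, per the abstract, attention layers can be built to realize in-context — ordinary least squares on polynomial features of the prompt's example pairs. The cubic target, feature degree, and context size are invented:

```python
import numpy as np

rng = np.random.default_rng(6)

def feats(x, degree=3):
    # Polynomial feature map of the kind the attention layers are
    # constructed to realize (the "featurizer").
    return np.stack([x ** d for d in range(degree + 1)], axis=-1)

f = lambda x: 1.0 - 2.0 * x + 0.5 * x ** 3   # ground-truth cubic
xs = rng.uniform(-1, 1, size=32)             # in-context example inputs
ys = f(xs)

Phi = feats(xs)
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None) # the in-context "solve"
xq = 0.3                                     # query point
pred = feats(np.array([xq]))[0] @ w
print(abs(pred - f(xq)) < 1e-8)              # True: exact recovery, no noise
```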

[LG-2] Human-AI Co-Mentorship in Project-Based Learning: A Case Study in Financial Forecasting

链接: https://arxiv.org/abs/2605.05144
作者: Freyaa Chawla,Ahan Chawla,Rishi Singh,Joe Germino,Grigorii Khvatskii
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Accepted for publication in 2026 ASEE Annual Conference Exposition

点击查看摘要

Abstract:This paper reflects on an AI research project carried out by a team of high-school and early-undergraduate students under the mentorship of graduate researchers and ably assisted by AI tools. We share our experience not only of the learning process for the high school students, but also of how AI tools accelerated the work and enabled the students to focus on higher-order problem formulation and solution. Although the participants entered the project with limited background in both AI and finance, they showed strong enthusiasm for technical market analysis and ETF price prediction. Traditional learning settings would first teach the necessary methods in a classroom setting and only later let students apply them. In contrast, our project emphasized workflow design: students identified the sequence of steps needed to address the problem and then used AI-driven tools to execute each step. The high school students developed the necessary code by iterating with the AI tools, and we used our daily stand-ups to debug and answer conceptual questions. Each of the students was able to dig deeper into their area of interest, whether computer science or finance, while collaboratively making a significant advance over the summer of 2025. This project was an important pedagogical exercise in how AI tools can be used for mentoring high school students, allowing them to focus on their specific interests and using the daily stand-ups to focus on problem definition and conceptual understanding. Despite their limited technical qualifications, the students were able to leverage AI tools to build meaningful models with real-world application.

[LG-3] Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

链接: https://arxiv.org/abs/2605.05134
作者: Dan Wilson,Mohamed Akrout
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) frequently generate plausible but non-factual content, a phenomenon known as hallucination. While existing detection methods typically rely on computationally expensive sampling-based consistency checks or external knowledge retrieval, we propose a new method that treats the LLM as a black-box dynamical system. By projecting LLM responses into a high-dimensional manifold via an embedding model, we characterize the resulting vector sequences as observable realizations of the model’s latent state-space dynamics. Leveraging Koopman operator theory, we fit the transition operators for both factual and hallucinated regimes and define a differential residual score based on their respective prediction errors. To accommodate varying user requirements and domain-specific sensitivities, we introduce a preference-aware calibration mechanism that optimizes the classification threshold based on a small set of demonstrations. This approach enables low-cost hallucination detection in a single-sample pass, avoiding the need for secondary sampling or external grounding. Extensive testing across three data benchmarks demonstrates that our method achieves state-of-the-art performance with reduced resource overhead.
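The operator-fitting step can be sketched with plain least squares. The snippet below is a minimal illustration under invented assumptions, not the paper's implementation: the "embeddings" are a toy linear-dynamics sequence, and the differential residual is oriented so that a negative score means the factual operator explains the sequence better.

```python
import numpy as np

def fit_koopman(E):
    """Fit a linear transition operator K with E[t+1] ~= E[t] @ K
    via least squares over consecutive embedding pairs."""
    X, Y = E[:-1], E[1:]
    K, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return K

def residual_score(E, K_fact, K_hall):
    """Differential residual: prediction error under the factual operator
    minus error under the hallucinated one (negative = more factual)."""
    X, Y = E[:-1], E[1:]
    return np.linalg.norm(Y - X @ K_fact) - np.linalg.norm(Y - X @ K_hall)

# Toy "embedding" trajectory generated by a hidden linear operator.
rng = np.random.default_rng(0)
K_true = 0.3 * rng.normal(size=(4, 4))
E = [rng.normal(size=4)]
for _ in range(50):
    E.append(E[-1] @ K_true + 1e-3 * rng.normal(size=4))
E = np.array(E)

K_fact = fit_koopman(E)               # operator fit to the trajectory
K_hall = rng.normal(size=(4, 4))      # mismatched stand-in operator
assert residual_score(E, K_fact, K_hall) < 0  # factual regime fits better
```

A real pipeline would fit separate operators on factual and hallucinated response embeddings and calibrate the decision threshold on demonstrations, which this sketch omits.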

[LG-4] Transformed Latent Variable Multi-Output Gaussian Processes ICML2026

链接: https://arxiv.org/abs/2605.05133
作者: Xiaoyu Jiang,Xinxing Shi,Sokratia Georgaka,Magnus Rattray,Mauricio A Álvarez
类目: Machine Learning (cs.LG)
*备注: ICML 2026

点击查看摘要

Abstract:Multi-Output Gaussian Processes (MOGPs) provide a principled probabilistic framework for modelling correlated outputs but face scalability bottlenecks when applied to datasets with high-dimensional output spaces. To maintain tractability, existing methods typically resort to restrictive assumptions, such as employing low-rank or sum-of-separable kernels, which can limit expressiveness. We propose the Transformed Latent Variable MOGP (T-LVMOGP), a novel framework that scales MOGPs to a massive number of outputs while preserving the capacity to capture meaningful inter-output dependencies. T-LVMOGP constructs a flexible multi-output deep kernel by mapping inputs and output-specific latent variables into an embedding space using a Lipschitz-regularised neural network. Combined with stochastic variational inference, our model effectively scales to high-dimensional output settings. Across diverse benchmarks, including climate modelling with over 10,000 outputs and zero-inflated spatial transcriptomics data, T-LVMOGP outperforms baselines in both predictive accuracy and computational efficiency.

[LG-5] Conditional outlier detection for clinical alerting

链接: https://arxiv.org/abs/2605.05124
作者: Milos Hauskrecht,Michal Valko,Shyam Visweswaran,Iyad Batal,Gilles Clermont,Gregory Cooper
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: AMIA 2010 Annual Symposium proceedings, pp. 286-290. Homer R. Warner Best Paper Award

点击查看摘要

Abstract:We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management actions using past patient cases stored in an electronic health record (EHR) system. Our hypothesis is that patient-management actions that are unusual with respect to past patients may be due to a potential error and that it is worthwhile to raise an alert if such a condition is encountered. We evaluate this hypothesis using data obtained from the electronic health records of 4,486 post-cardiac surgical patients. We base the evaluation on the opinions of a panel of experts. The results support that anomaly-based alerting can have reasonably low false alert rates and that stronger anomalies are correlated with higher alert rates.

[LG-6] Physiologically Grounded Driver Behavior Classification: SHAP-Driven Elite Feature Selection and Hybrid Gradient Boosting for Multimodal Physiological Signals

链接: https://arxiv.org/abs/2605.05120
作者: Sahar Askari,Mohammad Mahdi Mirza Ali Mohammadi,Fatemeh Ensafdoust,Amin Golnari,Saeid Sanei
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:An interpretable and scalable framework for decoding driving behaviors from multimodal physiological signals is proposed in this study. We utilize a large-scale multimodal physiological driving-behavior dataset comprising synchronized electroencephalogram (EEG), electromyography (EMG), and galvanic skin response (GSR) signals. Our approach involves rigorous preprocessing followed by a domain-specific feature extraction pipeline targeting time-domain features, frequency-domain features, and derived physiological indices. To address high dimensionality, we employ SHAP-based elite feature selection, retaining the top 250 features to reduce computational overhead while preserving predictive power. Hyperparameter optimization for the extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM) models is conducted using Bayesian optimization via Optuna. Finally, a weighted soft-voting ensemble is constructed to leverage the complementary strengths of both gradient boosting frameworks. The results demonstrate that the proposed ensemble achieves a test accuracy of 80.91% and a macro-F1 score of 0.79, significantly outperforming single-modality baselines and traditional machine learning models. Ablation studies confirm an 8% performance gain over the best single modality (EEG), validating the necessity of multimodal fusion. SHAP analysis further confirms the physiological plausibility of the model, revealing that EEG contributes the majority of the predictive weight, while GSR and EMG features provide critical discriminatory signals for high-arousal and motor-intensive maneuvers.
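The final weighted soft-voting step is simple to sketch. The probabilities and the 0.6/0.4 weights below are invented stand-ins for the tuned XGBoost/LightGBM outputs, not values from the paper:

```python
import numpy as np

def soft_vote(proba_a, proba_b, w_a=0.6, w_b=0.4):
    """Weighted soft-voting ensemble: blend per-class probabilities
    from two models and predict the argmax class per sample."""
    blended = w_a * proba_a + w_b * proba_b
    return blended.argmax(axis=1)

# Toy class probabilities for 3 samples, 2 classes.
p_xgb = np.array([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
p_lgb = np.array([[0.8, 0.2], [0.7, 0.3], [0.2, 0.8]])
print(soft_vote(p_xgb, p_lgb))  # → [0 0 1]
```

Note how the second sample flips to class 0: the higher-weighted model's 0.4/0.6 split is overruled only because the other model leans 0.7/0.3 the opposite way.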

[LG-7] On the Hardness of Junking LLMs

链接: https://arxiv.org/abs/2605.05116
作者: Marco Rando,Samuel Vaiter
类目: Machine Learning (cs.LG)
*备注: 27 pages, 13 figures, 2 tables

点击查看摘要

Abstract:Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instruction and optimizing small adversarial components (e.g., suffixes or prefixes). In this setting, prompt structure is fundamental for performance, and recent results show that even simple random search can achieve strong performance when combined with sophisticated prompt design. Recently, it has been observed that harmful behaviors can be elicited even without the adversarial prompt, relying solely on optimized token sequences. This suggests the existence of natural backdoors, i.e., token sequences that emerged naturally during LLM training and trigger unsafe outputs without any meaningful instruction. However, despite these observations, this setting remains largely unexplored, and in particular the hardness of finding natural backdoors has not been assessed yet. In this work, we provide a first proof-of-concept study investigating the hardness of this task, which we refer to as the junking problem. We formalize it as the problem of finding token sequences that maximize the probability of generating a target prefix of harmful responses, and propose a greedy random-search method to assess whether such sequences can be discovered easily. Our results show that this problem is harder than standard jailbreak attacks, confirming the importance of semantic information in prompt design. At the same time, we find that our simple strategy is sufficient to solve it with a high success rate, suggesting that natural backdoors are present and easily recoverable. Finally, through perplexity analysis, we observe that the discovered token sequences lie in low-probability regions of the model distribution, supporting the hypothesis that they emerged implicitly from the training process.

[LG-8] Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

链接: https://arxiv.org/abs/2605.05115
作者: Daniel Wurgaft,Can Rager,Matthew Kowal,Vasudev Shyam,Sheridan Feucht,Usha Bhalla,Tal Haklay,Eric Bigelow,Raphael Sarfati,Thomas McGrath,Owen Lewis,Jack Merullo,Noah Goodman,Thomas Fel,Atticus Geiger,Ekdeep Singh Lubana
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural representations carry rich geometric structure; but does that structure causally shape behavior? To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce. In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally. Concretely, we first fit an activation manifold M_h to representations and a behavior manifold M_y to output probability distributions. We then test the link M_h \leftrightarrow M_y via interventions: we find that steering along M_h , which we term manifold steering, yields behavioral trajectories that follow M_y , while linear steering – which assumes a Euclidean geometry – cuts through off-manifold regions and hence produces unnatural outputs. Moreover, optimizing interventions in activation space to produce paths along M_y recovers activation trajectories that trace the curvature of M_h . We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities. In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics. Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals. This recasts the core problem of steering from finding the right direction to finding the right geometry.

[LG-9] How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

链接: https://arxiv.org/abs/2605.05113
作者: Mariia Seleznova
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study signal propagation in linear recurrent models at finite width. While existing signal propagation theory relies predominantly on the infinite-width limit, it remains unclear for how long that approximation remains accurate when recurrent depth t grows jointly with width n . This question is especially relevant for modern recurrent sequence models, whose natural operating regime involves long input sequences, i.e., large t . We derive exact finite-width formulas for the hidden state signal energies in linear recurrences under complex Gaussian initialization. Using these formulas, we identify the joint depth-width scaling regimes that govern signal propagation: (i) a subcritical regime t=o(\sqrt n) , in which the infinite-width approximation remains valid; (ii) a critical regime t\sim c\sqrt n , in which non-negligible deviations from infinite-width predictions appear and a nontrivial joint scaling limit emerges; and (iii) a supercritical regime t\gg \sqrt n , in which finite-width effects dominate. Thus, our results pinpoint the precise recurrent depth scale at which infinite-width theory breaks down in long-range linear recurrences. In turn, this shows when standard initialization schemes, such as Glorot, become unstable. More broadly, our results demonstrate that finite-width effects accumulate more rapidly with depth in recurrent models than in feedforward ones, leading to qualitatively different signal propagation behavior.
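The setup can be illustrated numerically. The toy below is an assumption-laden sketch, not the paper's construction: it draws a fixed complex Gaussian recurrence matrix scaled so that, in expectation at infinite width, the signal energy stays at 1, and then checks the shallow-depth end of the subcritical regime t = o(sqrt(n)) empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t_max, trials = 128, 8, 100  # width, recurrent depth, Monte Carlo trials

def signal_energy(n, t_max, trials, rng):
    """Average squared hidden-state norm ||h_t||^2 over trials for the
    linear recurrence h_t = W h_{t-1}, with a fixed complex Gaussian W
    whose entries have variance 1/n (infinite-width theory then predicts
    unit energy at every depth)."""
    out = np.zeros(t_max)
    for _ in range(trials):
        W = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
        W /= np.sqrt(2 * n)
        h = rng.normal(size=n) + 1j * rng.normal(size=n)
        h /= np.linalg.norm(h)            # unit-energy initial state
        for t in range(t_max):
            h = W @ h
            out[t] += (np.linalg.norm(h) ** 2) / trials
    return out

e = signal_energy(n, t_max, trials, rng)
# Shallow depths (t much smaller than sqrt(n) = ~11) should track the
# infinite-width value 1; finite-width deviations grow with t.
assert abs(e[0] - 1.0) < 0.1
```

Increasing t_max toward and past sqrt(n) in this sketch is a cheap way to watch the critical and supercritical regimes the abstract describes emerge.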

[LG-10] Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

链接: https://arxiv.org/abs/2605.05112
作者: Tianshu Zhu,Wenyu Zhang,Xiaoying Zuo,Lun Tian,Haotian Zhao,Yucheng Zeng,Jingnan Gu,Daxiang Dong,Jianmin Wu,Dawei Yin,Dou Shen
类目: Machine Learning (cs.LG)
*备注: 25 pages, 8 figures, 11 tables

点击查看摘要

Abstract:SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success–failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.
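The claim that 50% is the most informative pass rate is easy to verify for two of the listed criteria. The sketch below (illustrative, not the paper's code; the group size k is an arbitrary choice) checks that Bernoulli reward entropy and a p(1-p)-proportional RLOO advantage energy both peak at p = 0.5:

```python
import numpy as np

def reward_entropy(p):
    """Bernoulli entropy (bits) of a binary pass/fail reward with pass rate p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def rloo_advantage_energy(p, k=8):
    """Expected squared RLOO-style advantage over a group of k i.i.d.
    Bernoulli(p) rollouts is proportional to p*(1-p), so it also peaks
    at p = 0.5."""
    return k * p * (1 - p)

rates = np.linspace(0.05, 0.95, 19)
best = rates[np.argmax(reward_entropy(rates))]
assert abs(best - 0.5) < 1e-9                       # entropy peaks at 50%
assert rloo_advantage_energy(0.5) > rloo_advantage_energy(0.9)
```

Prefix Sampling can then be read as a controller that nudges each rollout group's empirical pass rate toward this peak.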

[LG-11] Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning COLT

链接: https://arxiv.org/abs/2605.05102
作者: Harin Lee,Min-hwan Oh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the Conference of Learning Theory (COLT) 2026

点击查看摘要

Abstract:We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels \delta \in (0,1] , thereby characterizing the regret distribution across the full range of \delta . We present a simple UCBVI-style algorithm with exploration bonus \min\{c_{1,k}/N,\, c_{2,k}/\sqrt{N}\} , where N denotes the visit count and (c_{1,k}, c_{2,k}) are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with A arms and horizon T , we obtain a distributional regret bound of order \mathcal{O}(\sqrt{AT\log(1/\delta)}) , confirming the conjecture of Lattimore and Szepesvári (2020, Section 17.1) for the first time.
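The exploration bonus, read from the abstract as min{c_{1,k}/N, c_{2,k}/sqrt(N)}, is a one-liner; the parameter values below are arbitrary illustrations, not the paper's choices:

```python
import numpy as np

def exploration_bonus(N, c1, c2):
    """UCBVI-style per-state bonus min{c1/N, c2/sqrt(N)} at visit count N
    (c1, c2 are the user-specified parameters from the abstract)."""
    return np.minimum(c1 / N, c2 / np.sqrt(N))

N = np.arange(1.0, 11.0)
b = exploration_bonus(N, c1=2.0, c2=1.0)
# The c2/sqrt(N) term is the smaller one for small N; once N exceeds
# (c1/c2)^2 = 4 the c1/N term takes over, giving a faster-decaying tail.
assert b[0] == 1.0          # N=1: min(2, 1) = 1
assert b[8] == 2.0 / 9.0    # N=9: min(2/9, 1/3) = 2/9
```

Tuning (c1, c2) trades off these two regimes, which is exactly the expected-vs-tail-risk dial the abstract describes.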

[LG-12] Gated Multimodal Learning for Interpretable Property Energy Performance Prediction and Retrofit Scenario Analysis

链接: https://arxiv.org/abs/2605.05088
作者: Yunfei Bai,Aaron Tesfa Tsion,Raul Rosales,Barbara Shollock,Wei He
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Achieving resilient and sustainable cities requires scalable approaches to decarbonising residential buildings, which account for about 20% of UK greenhouse gas emissions and 25% of energy-related emissions in the European Union. Energy Performance Certificates (EPCs) support regulation and retrofit planning, but their reliance on on-site inspections limits timely city-scale assessment. This study introduces a gated multimodal model to predict Standard Assessment Procedure (SAP) energy efficiency and Environmental Impact (EI) scores by integrating EPC tabular variables, assessor-written free text, and Geographic Information System (GIS)-derived spatial features describing footprint geometry, height, area, and orientation. Sample-wise gating learns property-specific modality weights, while an auxiliary band classification head stabilises training. In a Westminster, London case study, the model predicts SAP and EI scores with MAEs of 4.03 and 4.76 points and R^2 values of 0.757 and 0.748, respectively, achieving a mean MAE of 4.39. Ablation results show that full multimodal fusion outperforms unimodal and bimodal baselines for both score prediction and band-level classification. Interpretability analyses provide decision-relevant evidence: gating weights indicate strong reliance on assessor text; SHAP highlights main fuel, built form, and construction age band; text occlusion prioritises roof and wall fields; and spatial attribution is dominated by height and footprint area, with sensitivity to footprint shape. The validated framework is further applied to retrofit scenarios for wall insulation, roof insulation, and window glazing upgrades, indicating projected improvements in SAP, EI, annual energy cost, and equivalent CO2 emissions. Overall, the framework provides scalable property-level evidence for retrofit screening, intervention prioritisation, and net-zero housing transitions.

[LG-13] Order Matters: Improving Domain Adaptation by Reordering Data

链接: https://arxiv.org/abs/2605.05084
作者: Andrea Napoli,Paul White
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Optimal Reordering of Data for Error-Reduced Estimation of Discrepancy (ORDERED), a novel unbiased stochastic variance reduction technique which reduces the discrepancy estimation error by optimising the order in which the training data are sampled. We consider two specific domain discrepancy losses (correlation alignment and the maximum mean discrepancy), formulate their stochastic estimation error as a function of the data sampling order, and propose a practical optimisation algorithm. Our simulations demonstrate reduced variance compared to related methods, and experiments on two domain shift image classification benchmarks show improved target domain accuracy.
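One of the two discrepancy losses the paper considers, correlation alignment (CORAL), is short enough to sketch. The snippet below computes the standard CORAL discrepancy between two feature batches (the data and the 1/(4 d^2) normalization are illustrative); ORDERED itself optimizes the order in which minibatches for such estimates are sampled, which this sketch does not implement:

```python
import numpy as np

def coral_discrepancy(Xs, Xt):
    """Correlation-alignment discrepancy: squared Frobenius norm of the
    difference between source and target feature covariances,
    normalized by 4 d^2 as in the usual CORAL loss."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False)
    Ct = np.cov(Xt, rowvar=False)
    return np.sum((Cs - Ct) ** 2) / (4 * d * d)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 3))        # source-domain features
Xt = 2.0 * rng.normal(size=(500, 3))  # target: inflated covariance
assert coral_discrepancy(Xs, Xs) == 0.0
assert coral_discrepancy(Xs, Xt) > coral_discrepancy(Xs, Xs + 0.01)
```

Stochastic minibatch versions of this quantity are what suffer the variance that ORDERED's data reordering is designed to reduce.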

[LG-14] Provable imitation learning for control of instability in partially-observed Vlasov–Poisson equations

链接: https://arxiv.org/abs/2605.05081
作者: Xiaofan Xia,Qin Li,Wenlong Mou
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Plasma Physics (physics.plasm-ph)
*备注:

点击查看摘要

Abstract:We consider the stabilization of Vlasov–Poisson plasma dynamics, a central control problem in nuclear fusion. Our focus is the gap between what an ideal controller would use and what experiments can actually observe: while an optimal policy may rely on the full phase-space state, practical feedback is typically limited to sparse macroscopic diagnostics. We therefore study imitation learning methods that distill a fully observed expert policy into controllers operating only on macroscopic measurements. We show stability guarantees for the learned policy, where the error floor depends on the minimal behavior cloning loss achievable under the observation constraints. We further characterize this minimal loss in terms of a notion of entropy that quantifies the complexity of the initial distribution. Our results demonstrate the theoretical feasibility of learning stabilizing feedback policies for kinetic plasma dynamics from macroscopic observations, and exhibit the adaptivity of the learning approach to low-complexity structures. Through extensive numerical experiments, we validate our theory and show that the learned policies can stabilize the system using only macroscopic observations, within a significantly longer time horizon than non-adaptive baseline controllers.

[LG-15] Full-chip CMP modelling based on Fully Convolutional Network leveraging White Light Interferometry

链接: https://arxiv.org/abs/2605.05062
作者: Jules Exbrayat,Renan Bouis,Elie Sezestre,Viorel Balan,Arnaud Cornelis,Damien Hebras,Catherine Euvrard
类目: Machine Learning (cs.LG)
*备注: Presented at the International Conference on Planarization Technology 2025 in Hong Kong

点击查看摘要

Abstract:As time-to-market is crucial in the Integrated Circuit (IC) industry, speeding up layout manufacturability verification is essential. Chemical-Mechanical Polishing (CMP) plays a vital role in IC fabrication but is significantly influenced by Layout-Dependent Effects (LDE). An accurate and efficient CMP model enables design teams to correct surface unevenness before fabrication, reducing costs and accelerating the design phase. However, existing models often rely on Density Step Height (DSH) modeling, which is time-consuming for calibration and requires substantial hardware resources for fine-grained predictions. In this paper, we propose combining the advantages of two surface analysis techniques, White Light Interferometry (WLI) and Atomic Force Microscopy (AFM), to train a deep learning model. This model aims to predict full-chip post-CMP nanotopography with nanometer-scale accuracy. Our deep learning model is based on a Convolutional Neural Network (CNN) and follows a two-step pipeline. The model is trained on each technique separately, resulting in a detailed full-chip CMP model.

[LG-16] Kinematic Discriminants of Deceleration Behavior Modes in Car-Following: Evidence from NGSIM Trajectory Data

链接: https://arxiv.org/abs/2605.05050
作者: Eni Solomon Laughter
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gap-closing rate and visual looming swap discriminative dominance depending on deceleration intensity - a finding that reconciles a long-standing conflict in the car-following literature and challenges spacing-centered assumptions in traditional driver behavior models. This study presents a two-stage analytical framework that distinguishes between information availability (kinematic variables measurable in the environment) and information utilization (variables that demonstrably separate driver behavioral patterns), applied to 1,060,119 valid car-following observations from the NGSIM trajectory dataset (2,932 vehicles). Six kinematic features are extracted, and deceleration events are detected under two threshold conditions (-0.5 m/s^2 and -0.3 m/s^2). K-means clustering identifies behavioral modes, and one-way ANOVA with eta-squared effect sizes ranks each feature’s discriminative power. Three key findings emerge: (1) threshold selection fundamentally shapes behavioral inference - the stricter threshold yields three interpretable modes while the permissive threshold collapses these to two; (2) hard braking prioritizes gap-closing rate (eta^2 = 0.715) while moderate braking emphasizes visual looming (eta^2 = 0.574); and (3) spacing headway is negligible (eta^2 = 0.014) across both thresholds. These findings provide empirically grounded candidates for perceptual cue prioritization and have direct implications for ADAS warning system design and autonomous vehicle control.
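The eta-squared effect size used to rank features is straightforward to compute from grouped data. Below is a minimal sketch with invented cluster values, not NGSIM data:

```python
import numpy as np

def eta_squared(groups):
    """One-way ANOVA effect size eta^2: between-group sum of squares
    divided by the total sum of squares."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_total = np.sum((all_x - grand) ** 2)
    return ss_between / ss_total

# Toy feature values for three hypothetical behavioral modes.
g1 = np.array([1.0, 1.1, 0.9])
g2 = np.array([2.0, 2.1, 1.9])
g3 = np.array([3.0, 3.1, 2.9])
print(eta_squared([g1, g2, g3]))  # ≈ 0.99: feature separates the modes well
```

Values near 1 (like the paper's 0.715 for gap-closing rate under hard braking) mean the feature almost fully explains the cluster assignment; values near 0 (like 0.014 for spacing headway) mean it barely discriminates.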

[LG-17] The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence

链接: https://arxiv.org/abs/2605.05029
作者: Kejun Liu
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 3 tables. Supplemental Material included (Sections S1-S10)

点击查看摘要

Abstract:We report a systematic failure mode in predictive representation learning. Across 2695 neural network configurations trained to predict linear-Gaussian dynamics, the optimal encoder tracks the environment rather than the system it is meant to model. The mean causal fidelity – the fraction of encoder sensitivity allocated to system degrees of freedom – is 0.49, and only 2.5% of configurations exceed 0.70. The failure intensifies with dimension: at N=100, the optimal encoder becomes causally blind (fidelity ~10^-8) while achieving 92% lower prediction error than the causal representation. We prove this is not an optimization artifact but a structural property of the predictive objective: when environment modes are slower or less noisy than system modes, every minimizer of the population risk encodes the former. The set of dynamics exhibiting this predictive-causal gap is open and of positive measure in parameter space. In a nonlinear Duffing-GRU sweep, unconstrained predictors learn environment-dominant representations in 55% of tasks (95% CI 41–68%) versus 24% under operational grounding (p=2.3e-3); the median out-of-distribution MSE inflation under environment shift is 1.82x versus 1.00x. Operational grounding – restricting the loss to system observables – partially suppresses the gap, but causal fidelity is never recovered without an explicit system-environment boundary. The results identify the predictive-causal gap as a structural limit of learning, with implications for self-supervised representation learning, world models, and the scaling paradigm.
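One simple reading of the causal-fidelity metric, for a linear encoder, is the fraction of the encoder's input sensitivity (squared column norms of its weight matrix) that falls on system dimensions. The matrix and dimension split below are invented for illustration; the paper's encoders and exact sensitivity measure may differ:

```python
import numpy as np

def causal_fidelity(W, system_dims):
    """Fraction of a linear encoder's sensitivity energy (squared column
    norms of W) allocated to the system input dimensions."""
    col_energy = np.sum(W ** 2, axis=0)  # per-input sensitivity
    return col_energy[system_dims].sum() / col_energy.sum()

# Encoder over 4 inputs: dims 0-1 are the "system", dims 2-3 the
# "environment"; this encoder mostly tracks the environment.
W = np.array([[1.0, 0.0, 3.0, 0.0],
              [0.0, 1.0, 0.0, 3.0]])
print(causal_fidelity(W, [0, 1]))  # → 0.1, an environment-dominant encoder
```

In the paper's terms, a fidelity near 0.5 means sensitivity is split evenly, and values near 0 (as at N=100) mean the optimal predictor has become causally blind to the system.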

[LG-18] CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels ACL2026

链接: https://arxiv.org/abs/2605.05023
作者: Xing Ma,Yangjie Zhou,Wu Sun,Zihan Liu,Jingwen Leng,Yun Lin,Shixuan Sun,Minyi Guo,Jin Song Dong
类目: Machine Learning (cs.LG)
*备注: Accepted to ACL 2026

点击查看摘要

Abstract:Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve high efficiency but are difficult to adapt. Recent work explores large language models (LLMs) for GPU kernel generation, but prior studies report unstable correctness and significant performance gaps for complex operators such as attention. We present CuBridge, an LLM-based framework that adapts expert-written attention kernels through a structured lift-transfer-lower workflow. CuBridge starts from expert-written CUDA attention kernels and lifts them into an executable intermediate representation that makes execution orchestration explicit while abstracting low-level CUDA syntax. Given a user-provided PyTorch specification, CuBridge generates and verifies a target IR program, then reconstructs optimized CUDA code via reference-guided lowering. Across diverse attention variants and GPU platforms, CuBridge consistently produces correct kernels and substantially outperforms general frameworks, compiler-based approaches, and prior LLM-based methods.

[LG-19] Learned Neighbor Trust for Collaborative Deployment in Model-Agnostic Decentralized Learning

链接: https://arxiv.org/abs/2605.05009
作者: Michael Lanier,Luise Ge,Sastry Kompella,Yevgeniy Vorobeychik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many decentralized distillation methods are designed around training-time coordination, yet deploy each node in isolation even when more capable neighbors remain available at inference time. This is an incomplete objective for settings such as IoT, where devices are heterogeneous, data is scarce and skewed, and a node’s strongest neighbors may far exceed its own local capacity. We study how nodes should train so that their predictions compose well at deployment, and how each node should learn whom to trust. Under a server-free, model-agnostic protocol where nodes exchange only queries and soft predictions, we propose Learned Neighbor Trust (LNTrust) wherein each node learns a compact trust function over its neighborhood from local validation evidence. This trust function gates auxiliary distillation during training and defines a deployment ensemble at inference, so that collaboration learned during training transfers directly to deployment. Across datasets and topologies, LNTrust improves deployed accuracy over the strongest output-only baseline by large margins while using significantly less communication than previous methods.

[LG-20] Agentic Vulnerability Reasoning on Windows COM Binaries

链接: https://arxiv.org/abs/2605.05000
作者: Hwiwon Lee,Jongseong Kim,Lingming Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Windows Component Object Model (COM) services run with elevated privileges and are widely accessible to authenticated users, making race conditions in these binaries a critical surface for local privilege escalation. We present SLYP, an end-to-end agentic pipeline that discovers race condition vulnerabilities in COM binaries and generates debugger-verified proof-of-concept (PoC) code. SLYP exposes binary exploration, COM inspection, and dynamic debugging as reusable tool interfaces, giving agents the static context, COM activation metadata, and debugger feedback needed to move from vulnerability discovery to verified PoC generation. On a benchmark of 20 COM objects covering 40 vulnerability cases, SLYP achieves 0.973 F1, outperforming production coding agents by up to 0.208 F1 and the state-of-the-art static analyzer by 3.3x in bug discovery. For PoC generation, production coding agents in their default setup (without our COM inspection and dynamic debugging tools) verify essentially no cases on either frontier model, whereas SLYP's interactive toolsets enable it to autonomously synthesize working PoCs for 67.5% of cases on the strongest configuration. Deployed on production Windows services, SLYP discovers 28 previously unknown vulnerabilities across nine COM services, all confirmed by the Microsoft Security Response Center (MSRC) with 16 CVEs assigned and $140,000 in bounties. Furthermore, SLYP is designed with generalizable binary analysis and debugging interfaces, making it readily applicable to other commercial off-the-shelf (COTS) binaries beyond Windows COM services.

[LG-21] DualTCN: A Physics-Constrained Temporal Convolutional Network for Time-Domain Marine CSEM Inversion

链接: https://arxiv.org/abs/2605.04997
作者: Khaled Ahmed,Ghada Omar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:DualTCN is the first deep-learning framework for inverting time-domain marine controlled-source electromagnetic (MCSEM) transient data. Moving away from traditional subsurface discretization, the framework regresses four earth-model parameters – \sigma_1 , \sigma_2 , d_1 , d_2 – and reconstructs conductivity-depth profiles using a differentiable soft-step decoder. The optimized architecture (379K parameters) features a Temporal Convolutional Network (TCN) encoder paired with a late-time branch and an auxiliary seafloor-depth head. This design achieves a 25.3% loss reduction over baseline models, with high predictive accuracy ( R^2 = 0.898 for \sigma_2 ) and an inversion speed of 3.5 ms per sample on an A100 GPU. The framework demonstrates high robustness to noise through curriculum-based amplitude augmentation, maintaining a mean \bar{R}^2 of 0.858 at \pm2% random amplitude error, compared to 0.363 without augmentation. DualTCN generalizes effectively to three-layer extensions (seawater/resistive layer/basement), accurately resolving basement conductivity ( R^2 \approx 0.88 ), though thin-layer resolution remains a physical limitation ( R^2 \approx 0.23 ). In comparative benchmarks, DualTCN significantly outperforms traditional local optimization methods like Levenberg-Marquardt and L-BFGS-B, yielding a mean \bar{R}^2 = 0.877 versus 0.129-0.439 for multi-start baselines, while operating at up to 21,000 \times lower computational cost. Finally, the framework incorporates uncertainty quantification via Monte Carlo (MC) Dropout. While well-calibrated for \sigma_1 (PICP90 = 0.944), inherent signal limitations at short offsets (200m) lead to under-coverage for d_2 (PICP90 = 0.572), which can be mitigated through post-hoc temperature scaling or split conformal prediction.
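A differentiable soft-step decoder can be sketched with a sigmoid transition. The single-interface version below is a simplified stand-in (the paper's decoder uses the full (sigma1, sigma2, d1, d2) parameterization; the sharpness beta and units are assumptions for illustration):

```python
import numpy as np

def soft_step_profile(z, sigma1, sigma2, d1, beta=50.0):
    """Differentiable two-layer conductivity-depth profile: a sigmoid
    soft step switching from sigma1 to sigma2 at interface depth d1;
    beta controls transition sharpness (an assumption here)."""
    w = 1.0 / (1.0 + np.exp(-beta * (z - d1)))   # 0 above d1, 1 below
    return sigma1 * (1 - w) + sigma2 * w

z = np.linspace(0.0, 2.0, 201)                   # depth axis (illustrative)
prof = soft_step_profile(z, sigma1=3.0, sigma2=0.5, d1=1.0)
assert abs(prof[0] - 3.0) < 1e-3    # shallow end: first-layer conductivity
assert abs(prof[-1] - 0.5) < 1e-3   # deep end: second-layer conductivity
```

Because the step is a smooth function of d1, gradients flow from a profile-level loss back to the predicted interface depth, which is the point of making the decoder differentiable.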

[LG-22] Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning

链接: https://arxiv.org/abs/2605.04995
作者: Anastasis Kratsios,A. Martina Neuman,Philipp Petersen
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We compare in-context learning with fixed queries and agentic learning with adaptive queries for uniform approximation of task families. We consider two settings: an unrestricted regime, where querying and approximation are arbitrary functions, and a realizable regime, where we require these operations to be implemented by ReLU neural networks. In both settings, adaptivity never hinders approximation performance. However, this advantage can change when one passes from the unrestricted regime to the realizable regime. We identify four distinct approximation scenarios, each witnessed by an explicit task family: (a) no advantage of adaptivity; (b) an advantage in the unrestricted regime that persists under ReLU realizability; (c) an advantage that arises only under realizability; and (d) an advantage that disappears under realizability. This demonstrates that representational constraints interact profoundly with the effect of adaptivity.

[LG-23] Delving into Non-Exchangeability for Conformal Prediction in Graph-Structured Multivariate Time Series

链接: https://arxiv.org/abs/2605.04957
作者: Ruichao Guo,Xingyao Han,Luo Wenshui,Zhe Liu,Chen Gong,Hesheng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Point forecasting for graph-structured multivariate time series is a fundamental problem, but rigorous uncertainty quantification for such predictions is still underexplored. Conformal prediction (CP) offers uncertainty estimation with a solid coverage guarantee under the exchangeability assumption, which requires the joint data distribution to be unchanged under permutation. However, in graph-structured time series, inherent cross-node coupling can violate the exchangeability condition, making direct application of CP unreliable. Motivated by spectral graph theory, we observe that such coupling resides in global trends and can be characterized by low-frequency components, while high-frequency components are nearly exchangeable. Therefore, we propose a novel concept named Spectral Graph Conditional Exchangeability (SGCE), which conditions exchangeable high-frequency components on low-frequency ones to preserve global trends and enable effective CP in the spectral domain. Based on SGCE, we further propose Spectral Conformal prediction via wAveLEt transform (SCALE). SCALE uses graph wavelets to decompose low/high-frequency components and conformalizes high-frequency residuals via adaptive gating over a low-frequency embedding. Experimental results on real-world traffic datasets show that SCALE not only achieves valid coverage but also consistently improves the coverage-efficiency trade-off over the state-of-the-art CP methods.
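The low-/high-frequency split that SGCE relies on can be illustrated with a plain graph-Laplacian eigenbasis. The paper uses graph wavelets; the 4-node path graph, the signal, and the hard eigen-cutoff k below are simplifying assumptions:

```python
import numpy as np

# Toy 4-node path graph: split a node signal into low- and high-frequency
# parts in the Laplacian eigenbasis (assumed stand-in for graph wavelets).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
L = np.diag(A.sum(axis=1)) - A
freqs, basis = np.linalg.eigh(L)       # eigenvalues act as graph frequencies

x = np.array([1.0, 2.0, 2.5, 4.0])     # node signal
k = 2                                  # keep the 2 lowest-frequency modes
coeffs = basis.T @ x
x_low = basis[:, :k] @ coeffs[:k]      # global trend (non-exchangeable part)
x_high = basis[:, k:] @ coeffs[k:]     # nearly exchangeable residual
```

The decomposition is exact: the two parts sum back to the original signal, and conformalization would then act on the high-frequency residuals only.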

[LG-24] KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

链接: https://arxiv.org/abs/2605.04956
作者: Han Wang,Jintao Zhang,Kai Jiang,Haoxu Wang,Jianfei Chen,Jun Zhu
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from 1.58\times to 1.44\times ; newly rescued kernels consistently underperform persistently correct ones ( 1.16\times vs 1.58\times speedup in round 0 \to 1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4\times . Besides, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at this https URL

[LG-25] Order-based Rehearsal Learning

链接: https://arxiv.org/abs/2605.04955
作者: Yu-Xuan Tao,Tian-Zuo Wang,Zhi-Hua Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When a machine learning (ML) model forecasts an undesired event, one often seeks a decision to avoid it, known as the avoiding undesired future (AUF) problem. Many rehearsal learning methods have been proposed for AUF, but they rely on an underlying graph structure; learning such a graph from observational data is challenging and can incur substantial estimation error. In this work, we demonstrate that the order structure can be sufficient for AUF decision-making, and propose the first order-based rehearsal learning method. Although an order is less informative than a graph, it can be sufficient to identify the influence of decisions from observational data, suggesting that learning the entire graph is not always necessary. To learn the order, we develop an information-theoretic method that imposes no restrictions on the form of structural functions or the type of noise distributions. For AUF decision-making, we construct an order-based sampler to approximate the influence of decisions and, combined with a surrogate objective for maximizing the post-decision success probability, reduce the AUF task to a differentiable optimization problem. Experiments show that our order learning method outperforms existing methods, and that our AUF approach not only surpasses methods relying on learned graphs or learned orders, but also matches or even exceeds oracle baselines that are given the true graph.

[LG-26] On the Influence of the Feature Computation Budget on Per-Instance Algorithm Selection for Black-Box Optimization

链接: https://arxiv.org/abs/2605.04954
作者: Koen van der Blom,Diederick Vermetten
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Per-instance algorithm selection (PIAS) takes advantage of complementarity between a set of algorithms by deciding which algorithm to run on a given instance. This decision is based on features of the instances, which, in the context of black-box optimization (BBO), require a part of the optimization budget to be computed. This raises two questions: (a) from which fraction of the budget spent on feature computation does PIAS become worth it for BBO, and (b) which fraction of the budget optimizes the tradeoff between feature accuracy and PIAS performance. To this end, we perform a broad study where PIAS with varying sampling budgets for feature computation is compared to the single best algorithm on a broad range of algorithm selection scenarios. These scenarios consist of two portfolio sizes, three problem sets, four dimensionalities, and ten target budgets. We find that PIAS is viable for the majority of tested scenarios, even when as much as a quarter of the total budget is spent on feature computation. The tradeoff for the fraction of the budget spent on feature computation to maximize the benefit of PIAS is highly dependent on the specific AS scenario. Further, on average 20 percent of PIAS loss to the virtual best solver is explained by the budget spent on feature computation, highlighting the importance of properly accounting for the feature budget.

[LG-27] Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

链接: https://arxiv.org/abs/2605.04952
作者: Klaus-Rudolf Kladny,Maximilian Mordig,Bernhard Schölkopf,Michael Muehlebach
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-experts (MoE) models enable scalable transformer architectures by activating only a subset of experts per token. Recent evidence suggests that performance improves with increasingly granular experts, i.e., many small experts instead of a few large ones. However, this regime substantially increases routing cost, which can dominate computation. We introduce adaptive inverted-index routing for MoE (AIR-MoE), an inverted-index-inspired routing architecture based on vector quantization (VQ). In a first stage, AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts. In a second stage, fine scoring computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and, in contrast to prior work, imposing no structural constraints on expert parameters. AIR-MoE serves as a drop-in replacement for standard routers and requires no modifications to the model architecture or loss function. We further provide a lower bound on the mass recall achieved by AIR-MoE that yields insights into its inner workings. Empirically, we demonstrate that AIR-MoE achieves improved performance compared to existing routing approaches in granular MoE settings.
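The two-stage routing procedure above can be sketched with a nearest-codeword shortlist followed by exact scoring. The shortlist size, key/codeword geometry, and dimensions below are illustrative assumptions, not AIR-MoE's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_codes, top_k, shortlist_size = 16, 64, 8, 4, 16

expert_keys = rng.normal(size=(n_experts, d))   # one routing key per expert
codebook = rng.normal(size=(n_codes, d))        # VQ codewords

# Offline: for each codeword, shortlist the experts with the closest keys
# (shortlist_size is an assumed hyperparameter).
shortlist = {
    c: np.argsort(np.linalg.norm(expert_keys - codebook[c], axis=1))[:shortlist_size]
    for c in range(n_codes)
}

def route(token):
    # Stage 1: coarse shortlisting by assigning the token to its nearest codeword.
    c = int(np.argmin(np.linalg.norm(codebook - token, axis=1)))
    cand = shortlist[c]
    # Stage 2: exact routing scores restricted to the shortlisted experts.
    scores = expert_keys[cand] @ token
    return cand[np.argsort(scores)[-top_k:]]

chosen = route(rng.normal(size=d))
```

Only `shortlist_size` of the 64 experts are scored per token, which is where the routing-cost saving over full top-k scoring comes from.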

[LG-28] Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks

链接: https://arxiv.org/abs/2605.04946
作者: Xuan Qi,Yi Wei,Fanqi Yu,Furao Shen,Vittorio Murino,Cigdem Beyan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local \ell_\infty window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.
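The batch-centering mechanism behind the claim that switching offsets are independent of the raw bias can be checked numerically: shifting a neuron's raw bias leaves its training-time BN output unchanged. This is a toy verification under assumed shapes and parameters, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))        # mini-batch of inputs
w = rng.normal(size=5)              # neuron weights

def bn_out(X, w, b, gamma=1.5, beta=0.3, eps=1e-5):
    # Training-time BN standardizes the pre-activation with batch
    # statistics, so the batch centroid acts as the reference point.
    z = X @ w + b
    return gamma * (z - z.mean()) / np.sqrt(z.var() + eps) + beta

# Shifting the raw bias shifts the batch mean by the same amount and
# cancels: the post-BN activations (hence the zero set of the switching
# hyperplane) do not depend on b.
a0 = bn_out(X, w, b=0.0)
a7 = bn_out(X, w, b=7.0)
```

In batch-standardized coordinates the breakpoint offset is controlled by gamma and beta alone, which is the geometric picture the paper formalizes.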

[LG-29] Koopman Identification of Nonlinear Systems via Reservoir Liftings

链接: https://arxiv.org/abs/2605.04917
作者: Weibin Gu,Chen Yang,Lu Shi
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Learning tractable linear representations of nonlinear dynamical systems via Koopman operator theory is often hindered by dictionary selection, temporal memory encoding, and numerical ill-conditioning. Inspired by the Reservoir Computing (RC) paradigm, this paper introduces the RC-Koopman framework, which interprets the reservoir as a stateful, finite-dimensional Koopman dictionary whose temporal depth is explicitly controlled by its spectral radius. We show that the Echo State Property (ESP) guarantees well-posedness and favorable numerical conditioning of the lifted Koopman approximation. A correlation-based spectral radius selection algorithm aligns reservoir memory with dominant system timescales. Analysis reveals how the finite memory of the reservoir determines which Koopman eigenfunctions remain observable from the lifted features. Evaluation on synthetic benchmarks demonstrates that RC-Koopman achieves a favorable balance between reconstruction accuracy of the underlying nonlinear dynamics and dynamical stability, compared to Extended Dynamic Mode Decomposition (EDMD) and Hankel-based lifting approaches. Code available at: this https URL
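The reservoir-as-dictionary idea can be sketched as follows: drive an echo-state reservoir with the system trajectory and fit a linear one-step readout on the lifted features. The logistic map, reservoir sizes, and spectral radius below are toy assumptions, not the paper's benchmarks or selection algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy driver: the logistic map (an assumed example); the reservoir state
# serves as a stateful, finite-dimensional Koopman dictionary.
def step(x):
    return 3.7 * x * (1.0 - x)

N, T = 50, 500
W_in = rng.normal(scale=0.5, size=N)
W = rng.normal(size=(N, N))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.8 -> ESP

x, r = 0.5, np.zeros(N)
Z, y = [], []
for _ in range(T):
    r = np.tanh(W @ r + W_in * x)              # reservoir lifting with memory
    Z.append(np.concatenate(([1.0, x], r)))    # bias + state + lifted features
    x = step(x)
    y.append(x)                                # one-step-ahead target
Z, y = np.array(Z), np.array(y)

# Linear Koopman-style readout fitted by least squares on the lifted features.
K, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred = Z @ K
```

The spectral radius chosen here (0.8) fixes the reservoir's temporal depth; the paper selects it via a correlation-based criterion instead of by hand.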

[LG-30] Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning

链接: https://arxiv.org/abs/2605.04911
作者: Xinyan Han,Yan Lu,Xiaoyu Lin,Yuanyuan Jiang,Yuanrui Wang,Xuanyue Li,Wenchao Zou,Xingxuan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generates synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be improved through better training paradigms.

[LG-31] Cross-Model Consistency of Feature Importance in Electrospinning: Separating Robust from Model-Dependent Features

链接: https://arxiv.org/abs/2605.04905
作者: Mehrab Mahdian,Ferenc Ender,Tamas Pardy
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Electrospinning is a highly sensitive fabrication process in which small variations in operating parameters can significantly influence fiber morphology and material performance. Machine learning (ML) methods are increasingly employed to model these process-structure relationships and to identify the relative importance of processing variables. However, most existing studies rely on a single ML model, implicitly assuming that the resulting feature importance is robust and reproducible. In this study, the consistency of feature importance across multiple ML model families was systematically evaluated using a curated dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments. Twenty-one ML models representing linear, tree-based, kernel-based, neural network, and instance-based approaches were trained and compared. To provide a unified interpretability framework, SHAP (SHapley Additive exPlanations) values were used to calculate feature importance consistently across all models. A rank-based statistical analysis was then performed to quantify inter-model agreement and assess the robustness of parameter rankings. The results demonstrate that predictive performance and interpretive reliability are fundamentally distinct properties. Although several models achieved comparable predictive accuracy, substantial differences were observed in their feature importance rankings. Solution concentration emerged as the most robust and consistently influential parameter (variability = 0), whereas flow rate and applied voltage exhibited high ranking variability (variability > 0.9), indicating strong model dependence. These findings suggest that feature importance derived from a single ML model may be unreliable, particularly for small experimental datasets, and highlight the importance of cross-model validation for achieving trustworthy interpretation in ML-assisted electrospinning research.
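The rank-based variability analysis can be sketched in a few lines: rank parameters per model by importance, then measure how much each parameter's rank varies across model families. The importance numbers below are made up for illustration, not the paper's SHAP values:

```python
import numpy as np

# Illustrative importance scores (rows: model families, cols: parameters);
# these values are assumptions, not the paper's results.
params = ["concentration", "flow_rate", "voltage", "distance"]
imp = np.array([
    [0.60, 0.15, 0.15, 0.10],   # e.g. linear model
    [0.55, 0.25, 0.10, 0.10],   # e.g. tree ensemble
    [0.62, 0.06, 0.22, 0.10],   # e.g. kernel model
])

# Rank parameters per model (0 = most important), then quantify how much
# each parameter's rank varies across models.
ranks = np.argsort(np.argsort(-imp, axis=1), axis=1)
variability = ranks.std(axis=0)
robust = params[int(np.argmin(variability))]
```

A parameter whose rank never changes (variability 0) is robust across model families; parameters whose ranks shuffle between models are model-dependent.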

[LG-32] A geometric relation of the error introduced by sampling a language models output distribution to its internal state ICML2026

链接: https://arxiv.org/abs/2605.04899
作者: Albert F. Modenbach
类目: Machine Learning (cs.LG)
*备注: 12 Pages, 10 Figures, 2 Appendices. To appear in Proceedings of ICML 2026

点击查看摘要

Abstract:GPT-style language models are sensitive to single-token changes at generation points where the predicted probability distribution is spread across multiple tokens. Viewing this sensitivity as a geometric property, we derive an \mathfrak{so}(n) -valued 1-form that depends only on the geometry of the token embeddings. Despite this purely geometric origin, we show that its curvature is semantically meaningful: on chess reasoning tasks, the curvature couples to the world model of an off-the-shelf instruction-tuned model, with transformations clustering by board region and respecting piece importance. Our findings suggest that token space geometry directly reflects how models internally represent problems.

[LG-33] Regime-Conditioned Evaluation in Multi-Context Bayesian Optimization NEURIPS2026

链接: https://arxiv.org/abs/2605.04895
作者: Noel Thomas
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 42 pages, 9 figures. NeurIPS 2026 submission

点击查看摘要

Abstract:Published transfer-BO comparisons often estimate an average treatment effect of acquisition choice over hidden regime variables, while practitioners need the conditional effect for their specific prior quality, budget ratio, and metric. An audit of 40 transfer-BO papers from NeurIPS, ICML, ICLR, AISTATS, UAI, TMLR, JMLR, and AutoML-Conf (2022-2025) finds that 98% never vary B/|A| as a controlled axis. On the same GDSC2 benchmark, changing only the budget reverses the ranking: at B=50, Greedy outperforms UCB by 0.050 Hit@1, while at B=100, UCB outperforms Greedy by 0.035. We capture this transition with the Portable Regime Score PRS=(B/|A|)(1-rho), where rho is the prior rank correlation and can be estimated from pilot contexts before the main comparison. Across 79 conditions spanning chemistry, drug-response biology, and HPO, a hierarchical model gives beta=0.50 (p=1.1e-9), and 19% of conditions fall in an equivalence zone where |advantage| < 0.01 Hit@1. In five published reversal cases, PRS predicts the winner from pre-comparison observables. A No-Free-Leaderboard proposition explains why unconditional rankings are unstable: when CATE changes sign across regimes, the reported ATE becomes a function of benchmark mixture. RegimePlanner, which estimates rho online and switches acquisition accordingly, wins all 16 HPO-B search spaces at B=100 and exceeds the matched {Greedy, UCB} per-context oracle on GDSC2 by 18%. Pre-registered predictions achieve 27/40=67.5% overall accuracy and above 90% within EMA prior families. The practical protocol is simple: report B/|A|, rho, K, and metric alongside any claimed acquisition advantage.
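The Portable Regime Score from the abstract, PRS = (B/|A|)(1-rho), is a one-line computation once rho is estimated from pilot contexts. The sketch below uses a Spearman rank correlation for rho (a common choice, assumed here) and made-up pilot numbers:

```python
import numpy as np

def spearman_rho(a, b):
    # Spearman rank correlation without SciPy (assumes no ties).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def portable_regime_score(budget, n_arms, prior_scores, pilot_scores):
    # PRS = (B/|A|) * (1 - rho), with rho estimated from pilot contexts
    # before the main comparison, as described in the abstract.
    rho = spearman_rho(prior_scores, pilot_scores)
    return (budget / n_arms) * (1.0 - rho)

prior = np.array([0.9, 0.7, 0.5, 0.3, 0.1])     # prior-predicted scores (assumed)
pilot = np.array([0.8, 0.55, 0.6, 0.2, 0.15])   # pilot outcomes with one rank swap
prs = portable_regime_score(budget=50, n_arms=100,
                            prior_scores=prior, pilot_scores=pilot)
```

A near-perfect prior (rho close to 1) drives PRS toward 0 regardless of the budget ratio, which matches the intuition that regime effects matter most when the prior is unreliable.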

[LG-34] Hybrid Iterative Neural Low-Regularity Integrator for Nonlinear Dispersive Equations

链接: https://arxiv.org/abs/2605.04853
作者: Zhangyong Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose HIN-LRI, a hybrid framework that augments a classical numerical solver with a neural operator trained to correct the solver’s structured truncation error. A base low-regularity integrator provides a consistent first-order approximation to nonlinear dispersive PDEs, while a lightweight neural network, operating on a low-dimensional latent manifold, learns the residual defect that analytical methods cannot close. An explicit time-step scaling on the neural correction ensures that its Lipschitz contribution remains \mathcal{O}(\tau) , yielding a Gronwall stability factor bounded uniformly in the step size and independent of the spatial resolution. The network is trained end-to-end through a solver-in-the-loop objective that unrolls the full iteration and penalises trajectory error in a Bourgain-type norm, aligning learning with multi-step solver dynamics rather than isolated one-step targets. Under stated assumptions, the global error satisfies C(\varepsilon_{\mathrm{net}}+\delta)\,\tau^\gamma\ln(1/\tau) , where \varepsilon_{\mathrm{net}} measures the network approximation quality and \delta the training shortfall. Experiments on three dispersive benchmarks with rough data show that HIN-LRI improves accuracy over analytical integrators, splitting methods, and neural PDE surrogates, with stable spatial refinement, effective out-of-distribution transfer, and modest online overhead.
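The hybrid structure, a base first-order step plus a tau-scaled correction, can be shown on a toy ODE. Explicit Euler on u' = -u stands in for the low-regularity integrator, and the closed-form second-order defect stands in for the trained network; both substitutions are assumptions for illustration only:

```python
import numpy as np

tau = 0.1

def base_step(u):
    # Base first-order scheme (explicit Euler) on the toy ODE u' = -u;
    # the paper's base scheme is a low-regularity integrator for
    # dispersive PDEs, used here only to show the hybrid structure.
    return u + tau * (-u)

def hybrid_step(u, correction):
    # The learned correction enters scaled by tau, so its Lipschitz
    # contribution stays O(tau), mirroring the stability argument.
    return base_step(u) + tau * correction(u)

# Stand-in "network": for this toy problem the second-order defect is
# known in closed form, so it replaces a trained model.
correction = lambda u: 0.5 * tau * u

u0 = 1.0
exact = np.exp(-tau) * u0
err_base = abs(base_step(u0) - exact)
err_hybrid = abs(hybrid_step(u0, correction) - exact)
```

Because the correction is multiplied by tau, it can only perturb the Gronwall constant by O(tau), which is exactly what keeps the hybrid scheme stable.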

[LG-35] Bridging Input Feature Spaces Towards Graph Foundation Models ICLR2026

链接: https://arxiv.org/abs/2605.04834
作者: Moshe Eliasof,Krishna Sri Ipsit Mantri,Beatrice Bevilacqua,Bruno Ribeiro,Carola-Bibiane Schönlieb
类目: Machine Learning (cs.LG)
*备注: 33 Pages, 2 Figures, 26 Tables, ICLR 2026

点击查看摘要

Abstract:Unlike vision and language domains, graph learning lacks a shared input space, as input features differ across graph datasets not only in semantics, but also in value ranges and dimensionality. This misalignment prevents graph models from generalizing across datasets, limiting their use as foundation models. In this work, we propose ALL-IN, a simple and theoretically grounded method that enables transferability across datasets with different input features. Our approach projects node features into a shared random space and constructs representations via covariance-based statistics, thus eliminating dependence on the original feature space. We show that the computed node-covariance operators and the resulting node representations are invariant in distribution to permutations of the input features. We further demonstrate that the expected operator exhibits invariance to general orthogonal transformations of the input features. Empirically, ALL-IN achieves strong performance across diverse node- and graph-level tasks on unseen datasets with new input features, without requiring architecture changes or retraining. These results point to a promising direction for input-agnostic, transferable graph models.
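The shared-random-space idea above can be sketched as a random projection followed by a covariance summary, so datasets with different feature dimensionalities produce same-shape representations. This is a sketch of the idea under assumed shapes, not the paper's exact covariance operator:

```python
import numpy as np

def shared_covariance_rep(X, shared_dim=8, seed=0):
    # Project dataset-specific node features into a shared random space,
    # then summarize with covariance statistics so the representation no
    # longer depends on the original feature dimensionality.
    d = X.shape[1]
    P = np.random.default_rng(seed).normal(size=(d, shared_dim)) / np.sqrt(d)
    Z = X @ P
    Zc = Z - Z.mean(axis=0)
    return (Zc.T @ Zc) / len(Z)

rng = np.random.default_rng(1)
# Two "datasets" with different input dimensionalities map to
# representations of identical shape.
C1 = shared_covariance_rep(rng.normal(size=(30, 5)))
C2 = shared_covariance_rep(rng.normal(size=(40, 17)))
```

Covariance statistics are what give the invariance-in-distribution to feature permutations that the paper proves for its node-covariance operators.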

[LG-36] Replay-Based Continual Learning for Physics-Informed Neural Operators

链接: https://arxiv.org/abs/2605.04832
作者: Yizheng Wang,Mohammad Sadegh Eshaghi,Xiaoying Zhuang,Timon Rabczuk,Yinghua Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural operators generally demonstrate strong predictive performance on in-distribution (ID) problems. However, a critical limitation of existing methods is their significant performance degradation when encountering out-of-distribution (OOD) data. To address this issue, this work introduces continual learning into physics-informed neural operators, with particular emphasis on neural operators built upon the Transolver architecture, and proposes a simple yet effective replay-based continual learning strategy. The proposed method is fully physics-informed and does not require labeled data, relying solely on input fields together with physical constraints for training. When new OOD data become available, a small number of past data are incorporated through a distillation-based constraint to preserve previously acquired knowledge and alleviate catastrophic forgetting. Meanwhile, LoRA-based transfer learning is employed to enable rapid adaptation to the new data. The proposed framework is systematically validated on three representative physical problems, including the Darcy flow problem in fluid mechanics, a two-dimensional hyperelastic brain tumor problem in biomechanics, and a three-dimensional linear elastic Triply Periodic Minimal Surfaces problem in solid mechanics. The results demonstrate that the proposed method effectively mitigates catastrophic forgetting on previously learned data while maintaining fast adaptability to new data. Compared with conventional joint training strategies, the proposed method significantly improves training efficiency while reducing additional memory usage and computational cost.

[LG-37] Concurrence of Symmetry Breaking and Nonlocality Phase Transitions in Diffusion Models

链接: https://arxiv.org/abs/2605.04830
作者: Yifan F. Zhang,Fangjun Hu,Guangkuo Liu,Mert Okyay,Xun Gao
类目: Machine Learning (cs.LG)
*备注: 20 pages, 10 figures. comments are welcome

点击查看摘要

Abstract:Diffusion models undergo a phase transition in a critical time window during generation dynamics, with two complementary diagnoses of criticality. The symmetry breaking picture views the critical window as when trajectories bifurcate into different semantic minima of the energy landscape, whereas the nonlocality picture views the critical window as when local denoising fails. We study whether two notions of such phase transitions are concurrent in modern diffusion transformers. By evaluating the dynamics and outcomes of the generation trajectory, we observe a near-simultaneous occurrence of the non-locality and symmetry breaking critical times. Our work is the first to unify the two notions of phase transitions in practice: it provides a concrete diagnostic for when and why diffusion models rely on conditioning and global denoising, enabling principled evaluation of model efficiency and guiding the design of architectures and sampling schemes that avoid unnecessary computation.

[LG-38] Trustworthy Federated Label Distribution Learning under Annotation Quality Disparity

链接: https://arxiv.org/abs/2605.04827
作者: Junxiang Wu,Zhiqiang Kou,Hongwei Zeng,Wenke Huang,Biao Liu,Hanlin Gu,Yuheng Jia,Di Jiang,Yang Liu,Xin Geng,Qiang Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Label Distribution Learning (LDL) models supervision as an instance-wise probability distribution, enabling fine-grained learning under inherent ambiguity, but its success relies on high-fidelity label distributions that are costly to obtain and thus often noisy. Motivated by privacy-sensitive applications, we study Federated Label Distribution Learning (Fed-LDL), where data isolation further induces heterogeneous annotation quality across clients, making local updates unevenly reliable and breaking sample-size-based aggregation (e.g., FedAvg). To address this trust dilemma, we propose FedQual, a quality-aware Fed-LDL framework with two coupled mechanisms: (i) quality-adaptive client training guided by a global semantic anchor that calibrates low-quality clients while preserving high-quality autonomy, and (ii) reliability-aware server aggregation that reweights client contributions by effective reliable information rather than raw sample size. To enable rigorous evaluation, we construct four new Fed-LDL benchmarks (FER-LDL, FI-LDL, PIPAL-LDL, and KADID-LDL) with controlled annotation quality disparity. We further provide a theoretical guarantee showing that under heterogeneous supervision quality, client-specific calibration is strictly better than any uniform calibration. Extensive experiments on the proposed benchmarks demonstrate the effectiveness of FedQual.
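The reliability-aware aggregation idea, reweighting clients by effective reliable information rather than raw sample size, can be sketched as follows. Multiplying sample size by a reliability score is an illustrative proxy, not FedQual's actual weighting rule:

```python
import numpy as np

def aggregate(updates, n_samples, reliability):
    # FedAvg weights clients by sample size alone; a reliability-aware
    # server instead downweights clients with low annotation quality.
    # (sample_size * reliability is an assumed proxy for "effective
    # reliable information".)
    w = np.asarray(n_samples, float) * np.asarray(reliability, float)
    w = w / w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates))

updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
agg = aggregate(updates, n_samples=[100, 100], reliability=[0.9, 0.3])
# Equal sample sizes, yet the high-quality client dominates the average.
```

Under plain FedAvg the two clients above would contribute equally; reliability weighting shifts the aggregate toward the trustworthy update, which is the trust dilemma the paper targets.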

[LG-39] Improving FMQA via Initial Training Data Design Considering Marginal Bit Coverage in One-Hot Encoding

链接: https://arxiv.org/abs/2605.04825
作者: Taiga Hayashi,Yuya Seki,Kotaro Terada,Yosuke Mukasa,Shuta Kikuchi,Shu Tanaka
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
*备注:

点击查看摘要

Abstract:Factorization machine with quadratic-optimization annealing (FMQA) is a black-box optimization method that combines a factorization machine (FM) surrogate with QUBO-based search by an Ising machine. When FMQA is applied to integer or discretized continuous variables via one-hot encoding, uniform random initial sampling can leave many binary variables never active in the initial training data, and the corresponding FM parameters receive no direct gradient updates from the observed responses. We address this by designing the initial training data to achieve complete marginal bit coverage, namely, ensuring that every binary variable obtained by one-hot encoding takes the value one at least once. We use two space-filling sampling methods, Latin hypercube sampling (LHS) and the Sobol’ sequence, yielding LHS-FMQA and Sobol’-FMQA. On the human-powered aircraft wing-shape optimization benchmark with 17 and 32 design variables, both proposed methods achieved numerically higher mean final cruising speeds than the baseline FMQA, with the advantage more pronounced on the 32-variable problem.
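The coverage criterion above, every one-hot bit active at least once in the initial data, is easy to check programmatically. The toy designs below stand in for the paper's LHS / Sobol' initial samples:

```python
import numpy as np

def onehot(values, n_levels):
    X = np.zeros((len(values), n_levels), dtype=int)
    X[np.arange(len(values)), values] = 1
    return X

def has_full_bit_coverage(samples, n_levels):
    # Complete marginal bit coverage: every one-hot bit of every encoded
    # variable takes the value one in at least one initial sample, so
    # every FM parameter receives a direct gradient update.
    enc = np.concatenate([onehot(samples[:, j], n_levels)
                          for j in range(samples.shape[1])], axis=1)
    return bool(enc.max(axis=0).min() == 1)

# Two integer variables with 4 levels each (hand-picked toy designs).
bad = np.array([[0, 1], [0, 2], [1, 3]])           # levels 2, 3 of var 0 unused
good = np.array([[0, 1], [1, 0], [2, 3], [3, 2]])  # every level appears once
```

Uniform random sampling can easily produce designs like `bad`; space-filling designs such as LHS or the Sobol' sequence make full coverage far more likely with the same number of initial points.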

[LG-40] Unsat Core Prediction through Polarity-Aware Representation Learning over Clause-Literal Hypergraphs ICML2026

链接: https://arxiv.org/abs/2605.04819
作者: Zhenchao Sun,Shuai Ma,Ping Lu,Chongyang Tao
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2026. Camera-ready version coming soon

点击查看摘要

Abstract:Graph neural networks have been widely used in Boolean satisfiability (SAT) tasks to learn structural information from SAT formulas. The goal of these studies is to solve SAT instances or to enhance SAT solvers, including tasks such as unsat-core prediction. However, most existing approaches model a SAT formula as a bipartite graph or a directed acyclic graph, which are less expressive in capturing higher-order interactions among literals and clauses. Moreover, these approaches are limited in modeling intrinsic polarity-related properties of SAT, such as the complementary relationship between the positive and negative literals of a variable. To address these limitations, we propose a polarity-aware representation learning framework over clause-literal hypergraphs. We model SAT formulas as clause-literal hypergraphs augmented with a clause incidence graph to capture higher-order structural interactions. We then introduce a polarity-aware decomposed mechanism that separates variable representations into polarity invariant and equivariant components, explicitly modeling the relationship between positive and negative literals, with the resulting literal representations propagated along the hypergraph structure. We further incorporate a polarity-inversion consistency regularization to reinforce polarity-consistent representations during training. Experimental results on multiple SAT datasets demonstrate the effectiveness of the proposed approach.

[LG-41] A Biased Nonnegative Block Term Tensor Decomposition Model for Dynamic QoS Prediction

链接: https://arxiv.org/abs/2605.04813
作者: Wenjing Liu,Yujia Lei,Qu Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of cloud computing and Web services, Quality of Service (QoS) has become a key criterion for service selection and recommendation. Tensor latent feature analysis provides an effective way to model multidimensional QoS data, and most existing QoS prediction methods are mainly based on Canonical Polyadic (CP) decomposition or Tucker decomposition. However, constrained by their inherent structural properties, these methods cannot accurately capture the complex and dynamic dependencies in user-service interactions, which limits their prediction performance. To address this issue, this paper proposes a dynamic QoS prediction framework based on the Biased Nonnegative Block Term Tensor Decomposition Model, termed BNBT. Specifically, the proposed framework is developed from three aspects: (1) block term tensor decomposition is employed to enhance the representation capability of latent feature learning; (2) linear bias terms are incorporated to further improve prediction accuracy; and (3) a tensor-oriented single-element-dependent nonnegative multiplicative update algorithm, called SLF-NMUT, is designed for efficient parameter estimation. Extensive experiments on real-world QoS datasets demonstrate that the proposed BNBT framework consistently outperforms several state-of-the-art QoS prediction methods in terms of prediction accuracy.

[LG-42] Bilinear Mamba-Koopman Neural MPC for Varying Dynamics

Link: https://arxiv.org/abs/2605.04793
Authors: Matan Pagi,Zohar Sorek
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Note: 18 pages, 5 figures. Preprint

Click to view abstract

Abstract:Koopman-based neural MPC models generate time-varying dynamics from historical data, but preserve convexity by enforcing that the system operator is independent of the current control input. This conditional independence constraint limits adaptation to changing dynamics within a single MPC horizon, particularly under time-varying conditions and under stale-plan execution. We propose Bilinear Mamba-Koopman Neural MPC, a minimal extension that introduces control-dependent coupling in the latent dynamics, allowing the effective operator to adapt to the current input. The resulting model is a strict generalization of the standard linear, conditional-independence formulation, adds less than 1% parameters through a low-rank structure, and admits exact model Jacobians that enable efficient Sequential Convex Programming (SCP) with monotone-descent and KKT convergence results under standard trust-region assumptions. Across CartPole and RSCP benchmarks in time-invariant and time-varying regimes, the proposed model matches or improves forecasting accuracy on every cell when training noise is averaged out, with strict gains where control-state coupling is structurally present. Its main closed-loop gains appear in the RSCP TV task, where iterative SCP improves adaptation within the horizon and substantially stabilizes training; in CartPole TV, the gains are modest but consistent. In delayed re-planning experiments on the time-varying variants, the bilinear model degrades more gracefully under stale-plan execution, maintaining a consistent advantage on CartPole TV and a substantially larger robustness margin on RSCP TV. These results show that control-dependent latent dynamics provide a simple and effective mechanism for robust MPC under varying conditions.
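The control-dependent coupling at the heart of this model can be sketched in a few lines: a latent step z' = A z + B u + sum_j u_j (N_j z) reduces to the standard linear Koopman model when the coupling matrices N_j vanish. All matrix values and sizes below are illustrative assumptions, not the paper's learned operators.

```python
# Toy bilinear latent step z' = A z + B u + sum_j u_j (N_j z).
# With N = 0 this is exactly the linear, conditionally independent model.

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def bilinear_step(A, B, N, z, u):
    """One latent step; N holds one coupling matrix per control channel."""
    out = mat_vec(A, z)
    out = [o + b for o, b in zip(out, mat_vec(B, u))]
    for u_j, N_j in zip(u, N):
        Nz = mat_vec(N_j, z)
        out = [o + u_j * n for o, n in zip(out, Nz)]
    return out

A = [[0.9, 0.1], [0.0, 0.8]]
B = [[0.5], [0.2]]
N_zero = [[[0.0, 0.0], [0.0, 0.0]]]      # no coupling: plain linear Koopman
N_coupled = [[[0.0, 0.3], [0.1, 0.0]]]   # control-dependent coupling
z, u = [1.0, 2.0], [0.5]

linear = bilinear_step(A, B, N_zero, z, u)
coupled = bilinear_step(A, B, N_coupled, z, u)
print(linear)    # with N = 0 this reduces to the standard linear model
print(coupled)   # same step plus the input-gated correction u_0 * (N_0 z)
```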

[LG-43] AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures

Link: https://arxiv.org/abs/2605.04754
Authors: Omkar B Shende,Marcello Traiola,Gayathri Ananthanarayanan
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Note: Accepted at the IEEE Computer Society Annual Symposium on VLSI ISVLSI 2026

Click to view abstract

Abstract:Deep neural network (DNN) inference at the edge demands simultaneous improvements in accuracy, computational efficiency, and energy consumption. Approximate computing and Mixture-of-Experts (MoE) architectures have each been studied as independent routes towards efficient inference, the former by replacing exact arithmetic with low-power approximate multipliers, the latter by routing inputs through specialized expert sub-networks to enable conditional computation. However, their interaction remains entirely unexplored. This paper presents AxMoE, the first study of the impact of approximate multiplication on MoE DNN architectures. We evaluate three MoE variants: Hard MoE, Soft MoE, and Cluster MoE against dense baselines across three CNN architectures (ResNet-20, VGG11_bn, VGG19_bn) on CIFAR-100 and a Vision Transformer (ViT-Small) on Tiny ImageNet-200 dataset, using eight 8-bit signed multipliers (including one exact baseline) from the EvoApproxLib library. Results show that, without retraining, the Dense baseline is the most resilient topology across all CNN architectures, whereas on ViT-Small, all topologies degrade at comparable rates regardless of routing strategy. After approximate-aware retraining, recovery varies substantially across architectures, topologies, and multipliers. ResNet-20 achieves full recovery across the entire multiplier range, whereas VGG architectures recover at moderate multipliers but fail irreversibly at aggressive ones for all topologies except Cluster MoE on VGG11_bn; on ViT-Small, Hard MoE outperforms Dense under aggressive approximation at equal normalized inference cost. These results pave the way for future approximate MoE hardware-software co-design strategies.
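As a rough illustration of how approximate multiplication trades accuracy for hardware simplicity, the sketch below truncates low-order operand bits before multiplying. The EvoApproxLib multipliers evaluated in the paper are specific evolved circuits, so this truncation scheme is only an assumed stand-in, not any multiplier from that library.

```python
# Simulate an "approximate" 8-bit signed multiply by dropping the lowest
# `drop_bits` of each operand magnitude before the product (an assumed
# truncation scheme for illustration only).

def approx_mul(a, b, drop_bits=2):
    sign = 1 if (a >= 0) == (b >= 0) else -1
    a, b = abs(a), abs(b)
    return sign * (((a >> drop_bits) * (b >> drop_bits)) << (2 * drop_bits))

# Exhaustively measure the worst-case absolute error over the 8-bit range.
worst = 0
for a in range(-128, 128):
    for b in range(-128, 128):
        worst = max(worst, abs(a * b - approx_mul(a, b)))
print(worst)   # bounded error, achieved at the largest truncated operands
```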

[LG-44] MixINN: Accelerating Plant Breeding by Combining Mixed Models and Deep Learning for Interaction Prediction

Link: https://arxiv.org/abs/2605.04744
Authors: Aike Potze,Fred van Eeuwijk,Ioannis N. Athanasiadis
Subjects: Machine Learning (cs.LG)
*Note: 11 pages, 1 figure

Click to view abstract

Abstract:Plant breeding underpins global food security through incremental, accumulating improvements in crop yield, quality and sustainability, achieved via repeated cycles of crop ranking, selection and crossing. Climate change disrupts this process by altering local growing conditions, thereby shifting the relative performance of crop genotypes. Predicting these relative changes in yield is critical for food security. Yet, this problem remains an open challenge in plant breeding, and relatively unexplored within the AI community. We propose MixINN, an approach that first isolates high-quality genotype-environment interaction labels using mixed models, and then predicts these interactions for new crop varieties in future environmental conditions with a deep neural network. We evaluate our method on a corn multi-environment trial across the continental United States and show improved prediction of genotype ranking over current plant breeding methods. MixINN demonstrated superior performance in identifying the 20% most productive corn genotypes, leading to a 5.8% higher average yield, which further improved to 7.2% when targeting specific growing environments. These are competitive results for real-world breeding programs, demonstrating the potential of AI research in accelerating the development of climate-adapted crops, and improving future food security under climate change.

[LG-45] OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization ICML2026

Link: https://arxiv.org/abs/2605.04738
Authors: Zhikai Li,Zhen Dong,Xuewen Liu,Jing Zhang,Qingyi Gu
Subjects: Machine Learning (cs.LG)
*Note: ICML 2026

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities. However, their massive parameter scale leads to significant resource consumption and latency during inference. Post-training weight-only quantization offers a promising solution by reducing model size and accelerating token generation through alleviating the memory-bound issue. Nevertheless, the presence of inherent systematic outliers in weights continues to be a major obstacle. While existing methods, such as scaling and rotation, attempt to address this issue, the performance remains unsatisfactory. In this paper, we propose Outlier Self-Absorption Quantization (OSAQ), which performs additive weight suppression guided by the second-order low-rank property for low-bit weight-only quantization of LLMs. Specifically, we observe that the Hessian exhibits low-rank consistency across different inputs, with certain directions consistently showing vanishing curvature. Leveraging this property, we identify a stable null space of the Hessian and then construct an additive weight transformation by linearly combining the vectors within this null space, thereby suppressing weight outliers without affecting the task loss. This additive transformation can be absorbed into the weights offline, requiring no inter-layer transformations and introducing no inference overhead. Moreover, the construction is efficiently achieved by a closed-form solution, without resource-intensive training or iterative procedures. Extensive experiments demonstrate that OSAQ effectively suppresses outliers and enhances low-bit quantization performance. For instance, in 2-bit quantization, OSAQ, when integrated with GPTQ, achieves over 40% lower perplexity compared to vanilla GPTQ.
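The null-space idea can be illustrated on a two-parameter toy problem: stepping along a direction of vanishing Hessian curvature leaves a quadratic loss unchanged while shrinking the largest weight. The Hessian, weights, and step rule below are assumptions chosen for illustration; OSAQ's closed-form construction operates on full LLM weight matrices.

```python
# 2-parameter illustration: move weights along a Hessian null direction to
# absorb an outlier without changing the (quadratic) task loss.
import math

u = [1.0, 1.0]          # rank-1 Hessian H = u u^T
w = [5.0, 1.0]          # w[0] is the "outlier" weight

def quad_loss(w):
    # w^T H w with H = u u^T reduces to (u . w)^2
    return sum(ui * wi for ui, wi in zip(u, w)) ** 2

v = [1.0 / math.sqrt(2), -1.0 / math.sqrt(2)]   # null direction: H v = 0

# step along v chosen so the two weight magnitudes become equal
c = (w[1] - w[0]) / (v[0] - v[1])
w_new = [wi + c * vi for wi, vi in zip(w, v)]

print(w_new)                            # approximately [3.0, 3.0]
print(quad_loss(w), quad_loss(w_new))   # loss is unchanged
```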

[LG-46] Using Common Random Numbers for Simulation-based Planning with Rollouts

Link: https://arxiv.org/abs/2605.04732
Authors: Sandarbh Yadav,Frederic J Maliakkal,Harshad Khadilkar,Shivaram Kalyanakrishnan
Subjects: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Simulation-based planning with rollouts is a widely-deployed technique for decision making in stochastic environments. The primary instrument of simulation-based planning is a sampling model, which is repeatedly called to generate trajectories and estimate the utilities of available actions. Among the actions thus explored, one with the maximum estimated utility is then executed. In this paper, we examine the effect of using common random numbers in the simulation process. We obtain a simple recipe for (provably) reducing variance in relative utility when simulations invoke a rollout policy beyond some depth. Experiments on synthetic tasks confirm that our scheme improves task performance. The broader significance of our innovation is apparent from two practical applications: (1) single-step lookahead planning in a pension-disbursement task, and (2) a deployment of the well-known UCT algorithm for the game of Ludo.
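A toy sketch of the common-random-numbers effect, under the assumption that each action's return is its mean effect plus shared environment noise: reusing the same noise draws for both compared actions cancels the noise out of the difference estimate, which is the variance reduction discussed above.

```python
# Compare two actions' estimated utility gap with and without common random
# numbers (CRN). The rollout model here is an assumed toy: return = bonus + noise.
import random
import statistics

def rollout_return(action_bonus, noise):
    return action_bonus + noise          # stand-in for a simulated trajectory

def estimate_gap(n, common, seed=0):
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        u1 = rng.gauss(0.0, 1.0)
        u2 = u1 if common else rng.gauss(0.0, 1.0)   # CRN shares the draw
        diffs.append(rollout_return(0.1, u1) - rollout_return(0.0, u2))
    return statistics.mean(diffs), statistics.variance(diffs)

mean_crn, var_crn = estimate_gap(1000, common=True)
mean_ind, var_ind = estimate_gap(1000, common=False)
print(var_crn, var_ind)   # CRN variance collapses; independent noise does not
```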

[LG-47] Ensuring Reliability in Programming Knowledge Tracing: A Re-evaluation of Attention-augmented Models and Experimental Protocols

Link: https://arxiv.org/abs/2605.04727
Authors: Jaewook Kim,Hyeoncheol Kim
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
*Note: Accepted at the International Conference on Intelligent Tutoring Systems (ITS 2026). To appear in Springer LNCS

Click to view abstract

Abstract:Programming Knowledge Tracing (PKT) has recently advanced through hybrid approaches that integrate attention-based feature modeling for code representation with RNN-based sequential prediction. While these models report strong empirical performance, their reliability can be sensitive to subtle implementation and experimental design choices. This study revisits representative PKT models and shows that reported gains can be substantially influenced by model configuration and sequence construction practices. We identify issues in attention dimension settings that affect performance estimates, and demonstrate that improper ordering of student attempts, such as ignoring ServerTimestamp, can violate temporal causality and lead to overly optimistic results. To ensure consistent evaluation, hyperparameters are selected via grid search guided by a single designated fold and then fixed uniformly across all folds during cross-validation. We further analyze the role of assignment-wise characteristics and systematically explore the impact of maximum sequence length. Using this protocol, we re-evaluate PKT models on the CodeWorkout dataset. Our results show that, under controlled and consistent settings, the performance gap between attention-enhanced models and standard DKT is significantly reduced, and increased architectural complexity does not consistently translate into superior performance. Beyond individual model comparisons, this work provides practical guidance for reliable and comparable evaluation in programming knowledge tracing.

[LG-48] SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning ICML2026

Link: https://arxiv.org/abs/2605.04712
Authors: Lirui Luo,Guoxi Zhang,Hongming Xu,Cong Fang,Qing Li
Subjects: Machine Learning (cs.LG)
*Note: Accepted to ICML 2026

Click to view abstract

Abstract:In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.
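A minimal sketch of a Parseval-style orthogonality penalty on an expert's weight matrix. Treating ||W W^T - I||_F^2 as the concrete form of the SPHERE regularizer is an assumption here; the paper derives its penalty from an NTK-based spectral-plasticity proxy.

```python
# Parseval-style penalty: measure how far the rows of W are from orthonormal.
# Zero penalty means the expert's feature directions are preserved.

def parseval_penalty(W):
    rows = len(W)
    pen = 0.0
    for i in range(rows):
        for j in range(rows):
            g = sum(a * b for a, b in zip(W[i], W[j]))   # (W W^T)[i][j]
            target = 1.0 if i == j else 0.0
            pen += (g - target) ** 2
    return pen

orthonormal = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # duplicated row: a lost direction
print(parseval_penalty(orthonormal))    # 0.0
print(parseval_penalty(collapsed))      # penalizes the rank collapse
```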

[LG-49] ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

Link: https://arxiv.org/abs/2605.04709
Authors: Yurui Du,Pinhao Song,Yutong Hu,Renaud Detry
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*Note:

Click to view abstract

Abstract:A central challenge of visual control with model-based reinforcement learning (RL) is reliable long-horizon planning: long rollouts with learned latent dynamics exhibit branching futures and multi-modal action-value distributions. In addition, compounding model errors amplified by visual occlusions make deep imagination brittle. We present ELVIS, a latent model predictive controller (MPC) designed to make long-horizon planning practical. ELVIS plans in a Dreamer-style recurrent state space model (RSSM) and replaces standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons, avoiding mode averaging under branching rollouts. In parallel, ELVIS stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics defines an upper-confidence-bound (UCB) score that gates a time-varying lambda, adaptively trading off bootstrapping versus look-ahead to limit compounding error during planning. The same return is used both to train an actor-critic prior from imagined rollouts and to score candidate trajectories inside GMM-MPPI, aligning RL objectives with the planner’s long-horizon optimization. On fourteen DeepMind Control Suite visual tasks, ELVIS establishes state-of-the-art performance compared with TD-MPC2 and DreamerV3. Finally, ELVIS transfers zero-shot to a real-world sand-spraying task with severe occlusions, improving surface-quality metrics and demonstrating robustness beyond simulation.

[LG-50] Differentiable Chemistry in PINNs for Solving Parameterized and Stiff Reaction Systems

Link: https://arxiv.org/abs/2605.04708
Authors: Miloš Babić,Franz M. Rohrhofer,Stefan Posch
Subjects: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:From neural ODEs to continuous-time machine learning, differentiable solvers allow physics, optimization, and simulation to become trainable components within deep learning systems. This has opened the path to a new generation of deep learning frameworks for scientific computing, with many promising applications still emerging. In this paper, we integrate a differentiable chemistry solver into a modified physics-informed neural network to solve parameterized reaction systems that are inherently stiff. The proposed framework introduces several key components required to overcome limitations of standard physics-informed neural networks. These include a differentiable chemistry solver, a network architecture for parameterized solutions, and residual weighting tailored to stiff reactions. We evaluate the framework on a set of differential equations related to hydrogen combustion, which include initial/boundary value problems, inverse parameter identification, and a parameterized partial differential equation. Our results highlight the ability of the proposed approach to extend physics-informed neural networks to stiff chemical systems that were previously inaccessible.

[LG-51] Vol-Mark: A Watermark for 3D Medical Volume Data Via Cubic Difference Expansion and Contrastive Learning

Link: https://arxiv.org/abs/2605.04705
Authors: Jiangnan Zhu,Yuntao Wang,Shengli Pan,Yujie Gu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Today, advances in medical technology extensively utilize 3D volume data for accurate and efficient diagnostics. However, sharing these data across networks in telemedicine poses significant security risks of data tampering and unauthorized copying. To address these challenges, this paper proposes a novel reversible-zero watermarking approach, termed Vol-Mark, for medical volume data to protect their ownership and authenticity in telemedicine. The proposed Vol-Mark method offers two key benefits: 1) it designs a volume data feature extractor that leverages contrastive learning to efficiently extract discriminative and stable volumetric features, ensuring robustness against 3D attacks; 2) it introduces the cubic difference expansion (c-DE) technique, which leverages the 3D integer wavelet transform to embed watermark bits into neighboring voxels within cubes at low-frequency coefficients. The voxel differences within each cube are expanded to create embedding space, and a majority voting mechanism is employed during extraction to enhance reliability. The embedding process incurs low distortion and supports lossless removal, thereby preserving the integrity and diagnostic accuracy of medical volume data. Through these two benefits, Vol-Mark enables both integrity verification and ownership verification. Integrity verification is first performed, and ownership verification through hypothesis testing is further conducted to enhance reliability, particularly under data tampering or watermark removal attacks. Comprehensive experimental results show the effectiveness of the proposed method and its superior robustness against conventional, geometric, and hybrid attacks on medical volume data. In particular, through multiple tasks evaluations, Vol-Mark consistently achieves an ACC above 0.90 in most attack scenarios, outperforming existing methods by a clear margin.
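Classic difference expansion on a single voxel pair (Tian's scheme) gives a one-dimensional picture of the reversible embedding: the pair's difference is doubled to make room for one hidden bit, and both the bit and the original values are recovered exactly. Extending this to voxel cubes with majority voting and wavelet-domain embedding is the paper's c-DE contribution and is not reproduced here.

```python
# Reversible difference-expansion embedding on a value pair (a, b):
# hide one bit in the expanded difference, then recover everything losslessly.

def de_embed(a, b, bit):
    l, h = (a + b) // 2, a - b
    h2 = 2 * h + bit                      # expand the difference, hide one bit
    return l + (h2 + 1) // 2, l - h2 // 2

def de_extract(a2, b2):
    l, h2 = (a2 + b2) // 2, a2 - b2
    bit, h = h2 % 2, h2 // 2              # recover the bit, undo the expansion
    return l + (h + 1) // 2, l - h // 2, bit

pairs, bits = [(120, 118), (64, 66), (200, 200)], [1, 0, 1]
marked = [de_embed(a, b, bit) for (a, b), bit in zip(pairs, bits)]
restored = [de_extract(a2, b2) for a2, b2 in marked]
print(marked)
print(restored)   # original pairs and embedded bits recovered losslessly
```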

[LG-52] Gray-Box Poisoning of Continuous Malware Ingestion Pipelines

Link: https://arxiv.org/abs/2605.04698
Authors: Jan Dolejš,Martin Jureček,Róbert Lórencz
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Modern malware detection pipelines rely on continuous data ingestion and machine learning to counter the high volume of novel threats. This work investigates a realistic gray-box poisoning threat model targeting these pipelines. Using the secml_malware framework, we generate problem-space adversarial binaries through functionality-preserving manipulations, specifically Import Address Table (IAT) and section injections. We evaluate the impact of these poisoned samples when ingested into a defender’s training set for a LightGBM malware detection model. Our empirical results demonstrate that subtle IAT-based perturbations enable compact poisoning samples that significantly degrade detection recall. These findings illustrate the inherent challenge of developing low-visibility adversarial perturbations that maintain high poisoning efficacy within continuous learning systems. We further evaluate a defense mechanism based on a homogeneous ensemble, which successfully identifies and filters up to 95.6% of poisoning attempts while maintaining a high retention rate for legitimate data. These findings emphasize the necessity of robust pre-ingestion validation in production pipelines.

[LG-53] Learning Time-Inhomogeneous Markov Dynamics in Financial Time Series via Neural Parameterization

Link: https://arxiv.org/abs/2605.04690
Authors: Jan Rovirosa,Jesse Schmolze
Subjects: Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
*Note: 10 pages, 10 figures and 1 table. Presented at the 2026 ASA Midwest Regional Conference in Statistics and Data Science and the 2026 Undergraduate Symposium at the University of Wisconsin - Madison

Click to view abstract

Abstract:Modeling the dynamics of non-stationary stochastic systems requires balancing the representational power of deep learning with the mathematical transparency of classical models. While classical Markov transition operators provide explicit, theoretically grounded rules for system evolution, their empirical estimation collapses due to severe data sparsity when applied to high-resolution, high-noise environments. We explore this statistical barrier using financial time series as a canonical, real-world testbed. To overcome the degeneracy of empirical counting, we introduce a framework that utilizes neural networks strictly as parameterization engines to generate explicit, time-varying Markov transition matrices. By constraining the neural network to output its predictions as a formal stochastic operator, we maintain complete structural interpretability. We demonstrate that these learned operators successfully capture complex regime shifts: the state-conditioned model achieves mean row heterogeneity \bar\rho = 0.0073 while the state-free ablation collapses to exactly zero, and operator row entropy correlates with realized variance at r = -0.62 ( p \approx 10^-251 ), revealing that high-volatility regimes homogenize transition dynamics rather than diversify them. Furthermore, rather than enforcing the Chapman-Kolmogorov equations as a rigid structural requirement, we repurpose them as a localized diagnostic tool to pinpoint specific temporal windows where first-order memory assumptions break down. Ultimately, this framework demonstrates how neural networks can be constrained to make rigorous, classical operator analysis viable for complex real-world time series.
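The parameterization idea, a network head whose output is constrained to be a valid stochastic operator, can be sketched with a row-wise softmax plus the row-entropy diagnostic mentioned above. The logits below are made-up stand-ins for a network's output, not fitted values.

```python
# Map raw logits into an explicit row-stochastic transition matrix, then
# compute the mean row entropy used as a regime diagnostic.
import math

def to_transition_matrix(logits):
    rows = []
    for row in logits:
        m = max(row)                                  # stabilized softmax
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def mean_row_entropy(P):
    total = 0.0
    for row in P:
        total -= sum(p * math.log(p) for p in row if p > 0)
    return total / len(P)

peaked = to_transition_matrix([[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]])
uniform = to_transition_matrix([[0.0, 0.0, 0.0]] * 3)
for P in (peaked, uniform):
    assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)   # valid stochastic rows
print(mean_row_entropy(peaked), mean_row_entropy(uniform))
```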

[LG-54] ITBoost: Information-Theoretic Trust for Robust Boosting

Link: https://arxiv.org/abs/2605.04671
Authors: Ye Su,Longlong Zhao,Diego Garcia-Gil,Jipeng Guo,Gangchun Zhang,Jinxin Chen,Jinsong Chen
Subjects: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Gradient boosting remains a strong and widely used method for tabular data learning, but its performance often degrades when training labels are noisy. This behavior is largely related to the way boosting algorithms emphasize samples with large gradients, without explicitly accounting for whether such errors originate from informative hard cases or from unreliable labels. We address this issue by reconsidering how sample reliability is evaluated during boosting. Instead of relying on instantaneous error, we examine the evolution of each sample’s residuals across iterations. Based on this insight, we propose Information-Theoretic Trust Boosting (ITBoost), which uses the Minimum Description Length principle to measure the complexity of residual trajectories. Samples whose residual patterns fluctuate in an irregular manner are treated as less trustworthy and are down-weighted during learning. Theoretically, we derive a tighter generalization bound for ITBoost under label noise. Empirical results on various tabular benchmarks indicate that ITBoost provides improved robustness in noisy environments over leading boosting and deep tabular models, while retaining best average performance on clean data.

[LG-55] Feature importance analysis for patient management decisions

Link: https://arxiv.org/abs/2605.04666
Authors: Michal Valko,Milos Hauskrecht
Subjects: Machine Learning (cs.LG)
*Note: Published at MEDINFO 2010. doi: https://doi.org/10.3233/978-1-60750-588-4-861. PDF-only submission; LaTeX source not available

Click to view abstract

Abstract:The objective of this paper is to understand what characteristics and features of clinical data influence physician’s decision about ordering laboratory tests or prescribing medications the most. We conduct our analysis on data and decisions extracted from electronic health records of 4486 post-surgical cardiac patients. The summary statistics for 335 different lab order decisions and 407 medication decisions are reported. We show that in many cases, physician’s lab-order and medication decisions can be well predicted from a small subset of all features.

[LG-56] Evidence-based anomaly detection in clinical domains

Link: https://arxiv.org/abs/2605.04664
Authors: Milos Hauskrecht,Michal Valko,Branislav Kveton,Shyam Visweswaran,Gregory Cooper
Subjects: Machine Learning (cs.LG)
*Note: Published at AMIA Annual Symposium 2007. PDF-only submission; LaTeX source not available (paper was authored in Word)

Click to view abstract

Abstract:Anomaly detection methods can be very useful in identifying interesting or concerning events. In this work, we develop and examine new probabilistic anomaly detection methods that let us evaluate management decisions for a specific patient and identify those decisions that are highly unusual with respect to patients with the same or similar condition. The statistics used in this detection are derived from probabilistic models such as Bayesian networks that are learned from a database of past patient cases. We apply our methods to the problem of identifying unusual patient-management decisions in post-surgical cardiac patients.

[LG-57] Threshold-Guided Optimization for Visual Generative Models ICML2026

Link: https://arxiv.org/abs/2605.04653
Authors: Jinbin Bai,Yu Lei,Qingyu Shi,Aosong Feng,Yi Xin,Zhuoran Zhao,Fei Shen,Kaidong Yu,Jason Li
Subjects: Machine Learning (cs.LG)
*Note: Accepted to ICML 2026

Click to view abstract

Abstract:Aligning large visual generative models with human feedback is often performed through pairwise preference optimization. While such approaches are conceptually simple, they fundamentally rely on annotated pairs, limiting scalability in settings where feedback is collected as independent scalar ratings. In this work, we revisit the KL-regularized alignment objective and show that the optimal policy implicitly compares each sample’s reward to an instance-specific baseline that is generally intractable. We propose a threshold-guided alignment framework that replaces this oracle baseline with a data-driven global threshold estimated from empirical score statistics. This formulation turns alignment into a binary decision task on unpaired data, enabling effective optimization directly from scalar feedback. We also incorporate a confidence weighting term to emphasize samples whose scores deviate strongly from the threshold, improving sample efficiency. Experiments across both diffusion and masked generative paradigms, spanning three test sets and five reward models, show that our method consistently improves preference alignment over previous methods. These results position our threshold-guided framework as a simple yet principled alternative for aligning visual generative models without paired comparisons.
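Under the assumption that the global threshold is estimated as the mean score and the confidence weight is a tanh of the distance to it (both illustrative choices, not the paper's estimators), turning scalar feedback into a weighted binary decision task can be sketched as:

```python
# Convert unpaired scalar ratings into binary alignment targets plus
# confidence weights relative to a data-driven global threshold.
import math

def threshold_targets(scores):
    tau = sum(scores) / len(scores)                      # global threshold
    targets = [1 if s > tau else 0 for s in scores]      # binary decision task
    weights = [math.tanh(abs(s - tau)) for s in scores]  # confidence weighting
    return tau, targets, weights

scores = [0.2, 0.9, 0.4, 0.95, 0.1]
tau, targets, weights = threshold_targets(scores)
print(tau, targets)
# samples far from the threshold receive larger weights than borderline ones
```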

[LG-58] Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

Link: https://arxiv.org/abs/2605.04556
Authors: Cyril Allauzen,Tom Bagby,Georg Heigold,Ehsan Variani,Ke Wu
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single multimodal backbone may replace complex, task-specific pipelines. This paper provides a rigorous empirical evaluation of leading LLMs - including members from the Gemini and GPT families - across the eight core MSEB capabilities to assess their efficacy and audio-text parity. Our results indicate that while a significant modality gap persists regarding performance and robustness, the empirical evidence for an "optimal" modeling approach remains inconclusive. Ultimately, the choice between audio-native and cascaded architectures depends heavily on specific use-case requirements and the underlying assumptions regarding latency, cost, and reasoning depth.

[LG-59] Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models

Link: https://arxiv.org/abs/2605.04555
Authors: Jan Marco Ruiz de Vargas,Fabian Raisch,Zoltan Nagy,Pierre Pinson,Christoph Goebel
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Note:

Click to view abstract

Abstract:Model-based reinforcement learning (MBRL) offers a promising approach for data-efficient energy management in buildings, combining the strengths of predictive modeling and reinforcement learning. While previous MBRL methods applied to HVAC control have reduced training data requirements, they still require several months of interaction with the building to learn a satisfactory control policy. A key reason is that existing surrogate models attempt to predict the entire state-space, including weather and electricity prices that are unaffected by control actions, or completely ignore these variables. Addressing these issues, we propose Counter-Dyna, a method that enhances the data-efficiency of Dyna, an MBRL method. We create data-efficient counterfactual surrogate models (CSM) by leveraging invariances in the state-space. Using a CSM in Dyna speeds up RL training measured in environment interaction data compared to previous results. In comparison with previous state-of-the-art that used 6-12 months of environment interactions, our method needs only 5 weeks. We evaluate our method in a large simulation study using the literature standard BOPTEST framework and proximal policy algorithm (PPO) as the RL algorithm. Our results show cost-saving potentials of 5.3% to 17.0% in a hypothetical deployment scenario. Our work is a significant step towards making real-world deployment of RL algorithms in HVAC control practically viable.

[LG-60] Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices

Link: https://arxiv.org/abs/2605.04550
Authors: Amit Punia,Rakesh Kumar,Madan Lal
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Computing pseudospectra of non-normal matrices is essential for understanding the stability and transient behavior of dynamical systems. Such analysis is critical in applications including fluid dynamics, control systems, and differential operators, where non-normality can lead to significant transient amplification and sensitivity to perturbations that are not captured by eigenvalue analysis alone. At large scales, commonly used numerical approaches for pseudospectra computation can become computationally demanding, as they require repeated auxiliary computations to identify spectrally sensitive regions in the complex plane. We present a neural network-based approach that predicts sensitive regions directly from matrix features, thereby avoiding exhaustive pseudospectra evaluation across the entire complex plane. We calibrate the prediction threshold on validation data to ensure reliable coverage of sensitive regions. The trained neural network guides the selection of grid points requiring full computation, enabling focused computation only where necessary. The approach provides a practical preprocessing strategy for efficient pseudospectra computation. Numerical experiments on non-normal banded matrices demonstrate substantial speedup compared to full grid-based numerical evaluation while maintaining high accuracy in identifying sensitive regions.
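The per-grid-point computation that the neural guide tries to avoid is checking whether the smallest singular value of (zI - A) falls below eps, which is the definition of the eps-pseudospectrum. For a 2x2 toy matrix the singular values have a closed form; the matrix, grid points, and eps below are illustrative assumptions.

```python
# Membership test for the eps-pseudospectrum of a 2x2 non-normal matrix:
# z is inside iff the smallest singular value of (zI - A) is at most eps.
import math

def smin_2x2(M):
    # Gram matrix G = M^H M is 2x2 Hermitian; its eigenvalues are sigma^2.
    (a, b), (c, d) = M
    g11 = abs(a) ** 2 + abs(c) ** 2
    g22 = abs(b) ** 2 + abs(d) ** 2
    g12 = a.conjugate() * b + c.conjugate() * d
    tr, det = g11 + g22, g11 * g22 - abs(g12) ** 2
    lam_min = (tr - math.sqrt(max(tr * tr - 4 * det, 0.0))) / 2
    return math.sqrt(max(lam_min, 0.0))

A = [[0.0, 1.0], [0.0, 0.0]]             # non-normal Jordan block
def resolvent_smin(z):
    return smin_2x2([[z - A[0][0], -A[0][1]], [-A[1][0], z - A[1][1]]])

eps = 0.25
grid = [0.0 + 0.0j, 0.5 + 0.0j, 2.0 + 0.0j]
flags = [resolvent_smin(z) <= eps for z in grid]
print(flags)   # z = 2 is far from the spectrum and drops out
```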

[LG-61] Event-Based Early Warning of Vineyard Disease Risk from Environmental Time Series

Link: https://arxiv.org/abs/2605.04548
Authors: Ivica Dimitrovski,Ivan Kitanovski,Danco Davcev,Slobodan Kalajdziski,Kosta Mitreski
Subjects: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Accurate early warning of vineyard disease risk from environmental observations is essential for timely intervention and more sustainable crop protection. However, many existing studies formulate disease prediction as daily presence classification, which can favor persistence-driven predictions and provide only limited support for actionable short-horizon warning. In this paper, we present an event-based approach for early warning of vineyard disease risk from environmental time series and evaluate it through a vineyard case study. Rather than predicting daily disease status, the task is reformulated to predict transitions into annotated disease-risk periods within a future window of 3-7 days. To reduce fragmentation caused by short interruptions in the binary labels, new events are defined only after a minimum disease-free gap. This formulation encourages models to capture environmental precursors associated with upcoming risk periods instead of merely reproducing temporal persistence. Using multi-year agro-meteorological data, we construct input representations that capture humidity dynamics, rainfall accumulation, temperature variability, and seasonal structure through cyclic temporal encoding. We evaluate representative methods from classical machine learning and deep learning, including XGBoost, Long Short-Term Memory (LSTM) networks, and Temporal Convolutional Networks (TCNs), using both standard classification metrics and an event-oriented early warning protocol. The results show that the event-based formulation supports practical short-horizon warning, while the compared models exhibit distinct trade-offs between event recall, lead time, and false-alert behavior. Overall, the study underscores the importance of problem formulation in environmental time-series learning and demonstrates the value of event-based prediction for vineyard disease warning systems.
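The event construction described above, where a new event starts only after a minimum disease-free gap, can be sketched directly on a binary label sequence. The label values and gap length are illustrative assumptions.

```python
# Merge daily binary risk labels into events: a run of 1s extends the previous
# event unless it is preceded by at least `min_gap` disease-free days.

def extract_events(labels, min_gap):
    events, prev_end = [], None
    i, n = 0, len(labels)
    while i < n:
        if labels[i] == 1:
            start = i
            while i < n and labels[i] == 1:
                i += 1
            end = i - 1
            if prev_end is not None and start - prev_end - 1 < min_gap:
                events[-1] = (events[-1][0], end)   # short gap: extend last event
            else:
                events.append((start, end))         # long enough gap: new event
            prev_end = end
        else:
            i += 1
    return events

labels = [1, 1, 0, 1, 0, 0, 0, 1, 1]
print(extract_events(labels, min_gap=2))   # [(0, 3), (7, 8)]
```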

[LG-62] Power Distribution Bridges Sampling Self-Reward RL and Self-Distillation

链接: https://arxiv.org/abs/2605.04542
作者: Akiyoshi Tomihari,Issei Sato
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model’s sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
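The power distribution at the center of the analysis is the renormalized distribution q(x) ∝ p(x)^α. A toy sketch over a small discrete support (illustrative only; as the abstract notes, sequence-level power sampling over an LLM cannot be computed this directly):

```python
import math

def power_distribution(log_probs, alpha):
    """Renormalized power distribution q(x) proportional to p(x)^alpha.
    alpha > 1 sharpens the base distribution toward its mode."""
    logits = [alpha * lp for lp in log_probs]
    m = max(logits)                       # log-sum-exp stabilization
    ws = [math.exp(l - m) for l in logits]
    z = sum(ws)
    return [w / z for w in ws]

base = [0.5, 0.3, 0.2]
q = power_distribution([math.log(p) for p in base], alpha=2.0)
# alpha > 1 shifts mass toward the highest-probability outcome
```

With α = 1 the base distribution is recovered; the paper's self-distillation amortizes sampling from this sharpened target into supervised training.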

[LG-63] From Video-to-PDE: Data-Driven Discovery of Nonlinear Dye Plume Dynamics

链接: https://arxiv.org/abs/2605.04535
作者: Cesar Acosta-Minoli,Sayantan Sarkar
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Inferring continuum models directly from video is hampered by two facts: the recorded field is uncalibrated image intensity rather than a physical state, and direct numerical differentiation of noisy frames is unstable. We develop a video-to-PDE pipeline that converts grayscale recordings of an ink plume into a normalised scalar field u(x,y,t) , isolates a bulk drift \mathbf{v}(t) from intrinsic spreading via the intensity-weighted centroid, and identifies an effective transport law by weak-form sparse regression. Conditioning, threshold-sweep and random-centre diagnostics show that overcomplete libraries are strongly collinear; the search is therefore restricted to compact gradient-based libraries. Coefficients are refined by an inverse physics-informed network and recalibrated against forward rollouts, with a chronological block bootstrap quantifying uncertainty. The selected reduced model u_t+\mathbf{v}(t)\cdot\nabla u = 9.005\,|\nabla u|^2+0.666\,\Delta u outperforms advection–diffusion baselines on held-out frames, retains a positive Laplacian coefficient, and admits a Cole–Hopf reduction to a linear advection–diffusion equation. The framework demonstrates that uncalibrated visual data can yield compact, predictive and structurally interpretable continuum models when discovery, calibration and uncertainty are treated as distinct stages.
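The quoted Cole–Hopf reduction can be verified by hand: with a = 9.005 and b = 0.666 from the selected model, the substitution w = e^{(a/b)u} linearizes the discovered PDE:

```latex
% Discovered model:  u_t + \mathbf{v}(t)\cdot\nabla u = a\,|\nabla u|^2 + b\,\Delta u
% Cole--Hopf substitution:  w = e^{(a/b)\,u}
\begin{aligned}
w_t &= \tfrac{a}{b}\,w\,u_t, \qquad
\nabla w = \tfrac{a}{b}\,w\,\nabla u, \qquad
\Delta w = \tfrac{a}{b}\,w\,\Delta u + \tfrac{a^2}{b^2}\,w\,|\nabla u|^2,\\
w_t + \mathbf{v}(t)\cdot\nabla w
  &= \tfrac{a}{b}\,w\bigl(a\,|\nabla u|^2 + b\,\Delta u\bigr)
   = b\,\Delta w,
\end{aligned}
```

so w obeys a linear advection-diffusion equation, matching the abstract's claim; the substitution requires the positive Laplacian coefficient b > 0 that the selected model retains.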

[LG-64] FL-Sailer: Efficient and Privacy-Preserving Federated Learning for Scalable Single-Cell Epigenetic Data Analysis via Adaptive Sampling

链接: https://arxiv.org/abs/2605.04519
作者: Guangyi Zhang,Yi Dai,Yiyun He,Junhao Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Single-cell ATAC-seq (scATAC-seq) enables high-resolution mapping of chromatin accessibility, yet privacy regulations and data size constraints hinder multi-institutional sharing. Federated learning (FL) offers a privacy-preserving alternative, but faces three fundamental barriers in scATAC-seq analysis: ultra-high dimensionality, extreme sparsity, and severe cross-institutional heterogeneity. We propose FL-Sailer, the first FL framework designed for scATAC-seq data. FL-Sailer integrates two key innovations: (i) adaptive leverage score sampling, which selects biologically interpretable features while reducing dimensionality by 80%, and (ii) an invariant VAE architecture, which disentangles biological signals from technical confounders via mutual information minimization. We provide a convergence guarantee, showing that FL-Sailer converges to an approximate solution of the original high-dimensional problem with bounded error. Extensive experiments on synthetic and real epigenomic datasets demonstrate that FL-Sailer not only enables previously infeasible multi-institutional collaborations but also surpasses centralized methods by leveraging adaptive sampling as an implicit regularizer to suppress technical noise. Our work establishes that federated learning, when tailored to domain-specific challenges, can become a superior paradigm for collaborative epigenomic research.
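Leverage score sampling, which FL-Sailer adapts for feature selection, scores each row of a data matrix by its contribution to the column space and samples proportionally. A toy exact computation for a two-column matrix (the paper's adaptive, federated, high-dimensional variant is substantially more involved):

```python
def leverage_scores(A):
    """Exact row leverage scores l_i = [A (A^T A)^{-1} A^T]_{ii} for a
    tall matrix A with 2 columns. Toy setting with a hand-coded 2x2
    Gram inverse; real pipelines use a thin SVD or sketching."""
    g00 = sum(r[0] * r[0] for r in A)     # Gram matrix G = A^T A
    g01 = sum(r[0] * r[1] for r in A)
    g11 = sum(r[1] * r[1] for r in A)
    det = g00 * g11 - g01 * g01
    inv = [[g11 / det, -g01 / det], [-g01 / det, g00 / det]]
    scores = []
    for r in A:
        # l_i = r G^{-1} r^T
        t0 = r[0] * inv[0][0] + r[1] * inv[1][0]
        t1 = r[0] * inv[0][1] + r[1] * inv[1][1]
        scores.append(t0 * r[0] + t1 * r[1])
    return scores

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
s = leverage_scores(A)
# leverage scores lie in [0, 1] and sum to the rank of A (here 2)
```

Sampling proportionally to these scores keeps the rows (or, transposed, the features) that carry the most structural information, which is how such sampling can act as the implicit regularizer the abstract describes.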

[LG-65] Gradient Scaling Effects in Adaptive Spectral PINNs for Stiff Nonlinear ODEs ICLR2026

链接: https://arxiv.org/abs/2605.04502
作者: Isabela M. Yepes,Pavlos Protopapas
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 1 table. This work appeared at the ICLR 2026 AIPDE Workshop on OpenReview

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) often struggle to train reliably on stiff and oscillatory dynamical systems due to poor optimization conditioning. While prior work has emphasized representational remedies such as spectral parameterizations, the optimization implications of initial-condition (IC) embeddings in adaptive spectral PINNs have not been well characterized. In this work, we show that the choice of IC gating function induces explicit time-dependent gradient scaling, which interacts with spectral representations during training. Using a nonlinear stiff spring-pendulum ODE as a controlled benchmark, we compare exponential and linear IC gates in combination with fixed and adaptive Fourier spectral trunks. We observe stiffness-dependent changes in relative dominance for adaptive PINNs: at moderate stiffness ( k=20 ), exponential gating often yields lower error but exhibits heterogeneous behavior across random seeds, whereas at higher stiffness ( k=60 ), linear gating becomes preferable, with additional reversals observed at larger k . These trends hold for both relative L^2 error and maximum pointwise error and are confirmed by paired Wilcoxon signed-rank tests with Holm correction. Overall, our results demonstrate that IC embeddings are not a neutral design choice in PINNs: the induced gradient scaling materially shapes optimization conditioning in stiff regimes, with distinct sensitivity patterns in baseline and adaptive spectral models.

[LG-66] Quadrature-TreeSHAP: Depth-Independent TreeSHAP and Shapley Interactions

链接: https://arxiv.org/abs/2605.04497
作者: Ron Wettenstein,Rory Mitchell,Peng Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Shapley values are a standard tool for explaining predictions of tree ensembles, with Path-Dependent SHAP being the most widely used variant. Despite substantial progress, existing methods still exhibit trade-offs between depth-dependent runtime, numerical stability, and support for higher-order interactions. To address these challenges, we introduce Quadrature-TreeSHAP, a quadrature-based reformulation of Path-Dependent TreeSHAP that is numerically stable, naturally extends to any-order Shapley interaction values and is practically insensitive to tree depth. Our implementation supports both CPU and GPU and is integrated into XGBoost. Our method is based on a weighted-Banzhaf interaction polynomial, which expresses Banzhaf interaction values as expectations under a feature participation probability p . Shapley values and any-order interaction values are then recovered by integrating these polynomials over p from 0 to 1. We evaluate these integrals using Gauss-Legendre quadrature, and show that, in practice, only 8 fixed quadrature points are sufficient to reach machine precision. In fact, Quadrature-TreeSHAP with 8 fixed points achieves greater numerical stability than TreeSHAP. This fixed-point formulation removes depth dependence from the inner computation and enables efficient SIMD execution. We confirm these advantages empirically. On 12 XGBoost benchmarks, Quadrature-TreeSHAP computes Shapley values 1.06x-10.59x faster than TreeSHAP on CPU and 1.84x-6.95x faster than GPUTreeSHAP on GPU. Shapley pairwise interactions are 3.80x-58.11x faster on CPU, with higher-order interactions achieving speedups of up to 1200x compared to TreeSHAP-IQ.
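The quadrature idea is standard Gauss-Legendre: an n-point rule integrates polynomials of degree up to 2n-1 exactly, so 8 fixed points suffice once the interaction polynomial in p has bounded degree. A two-point sketch on [0, 1] (the paper uses 8 points; the integrand here is an arbitrary cubic, not the paper's polynomial):

```python
import math

def gl2_integral_01(f):
    """Two-point Gauss-Legendre rule mapped from [-1, 1] to [0, 1];
    exact for polynomials up to degree 3. An 8-point rule is exact up
    to degree 15 by the same construction."""
    t = 1.0 / math.sqrt(3.0)             # nodes +-1/sqrt(3) on [-1, 1]
    x1, x2 = (1.0 - t) / 2.0, (1.0 + t) / 2.0
    return 0.5 * (f(x1) + f(x2))         # weights are 1; dx = dt/2

approx = gl2_integral_01(lambda p: p ** 3)
# exact value of the integral of p^3 over [0, 1] is 1/4
```

Because the nodes are fixed in advance, the inner loop becomes a depth-independent weighted sum, which is what enables the SIMD execution the abstract describes.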

[LG-67] Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2605.04477
作者: Zhen-Yu Zhang,Yuting Tang,Jiandong Zhang,Lanjihong Ma,Masashi Sugiyama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration, which requires algorithms that enable the LLMs to generate informative comparisons that improve sample-efficiency in online RLHF. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.

[LG-68] Geometry-Aware Neural Optimizer for Shape Optimization and Inversion ICML2026

链接: https://arxiv.org/abs/2605.04474
作者: Guoze Sun,Tianya Miao,Haoyang Huang,Huaguan Chen,Han Wan,Rui Zhang,Hao Sun
类目: Machine Learning (cs.LG)
*备注: To appear in ICML2026

点击查看摘要

Abstract:Geometry is central to PDE-governed systems, motivating shape optimization and inversion. Classical pipelines conduct costly forward simulation with geometry processing, requiring substantial expert effort. Neural surrogates accelerate forward analysis but do not close the loop because gradients from objectives to geometry are often unavailable. Existing differentiable methods either rely on restrictive parameterizations or unstable latent optimization driven by scalar objectives, limiting interpretability and part-wise control. To address these challenges, we propose Geometry-Aware Neural Optimizer (GANO), an end-to-end differentiable framework that unifies geometry representation, field-level prediction, and automated optimization/inversion in a single latent-space loop. GANO encodes shapes with an auto-decoder and stabilizes latent updates via a denoising mechanism, and a geometry-injected surrogate provides a reliable gradient pathway for geometry updates. Moreover, GANO supports part-wise control through null-space projection and uses remeshing-free projection to accelerate geometry processing. We further prove that denoising induces an implicit Jacobian regularization that reduces decoder sensitivity, yielding controlled deformations. Experiments on three benchmarks spanning 2D Helmholtz, 2D airfoil, and 3D vehicles show state-of-the-art accuracy and stable, controllable updates, achieving up to +55.9% lift-to-drag improvement for airfoils and ~7% drag reduction for vehicles.

[LG-69] Automated Formal Proofs of Combinatorial Identities via Wilf-Zeilberger Guidance and LLM s ICML2026

链接: https://arxiv.org/abs/2605.04472
作者: Beibei Xiong,Hangyu Lv,Junqi Liu,Yisen Wang,Shaoshi Chen,Jianlin Wang,Zhengfeng Yang,Lihong Zhi
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2026. Preprint version

点击查看摘要

Abstract:Automating formal proofs of combinatorial identities is challenging for LLM-based provers, as long-horizon proof planning is required and unconstrained search quickly explodes. Symbolic methods such as the Wilf-Zeilberger (WZ) method can achieve a mechanized proof of combinatorial identities by constructing special auxiliary functions and demonstrating that they satisfy specific recurrence relations. We propose WZ-LLM, a neuro-symbolic framework that turns WZ proof plans into executable proof sketches in Lean 4 and uses an LLM-based prover to discharge the resulting machine-checkable subgoals. We also train a dedicated WZ-Prover via a Lean-kernel-verified bootstrapping loop with expert-verified iteration, followed by DAPO-based refinement. Experiments show that WZ-LLM achieves a 34% proof success rate on LCI-Test (100 classic combinatorial identities), outperforming strong baselines such as DeepSeek-V3 and Goedel-Prover-V2, and delivering consistent gains on CombiBench and PutnamBench-Comb. These results indicate that our framework provides two complementary strengths: improved direct proving for identities beyond the scope of WZ, and substantially higher end-to-end success when WZ sketches guide a specialized prover.

[LG-70] CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

链接: https://arxiv.org/abs/2605.04470
作者: Keyu Chen,Nanfei Ye,Yida Wang,Wenchao Sun,Danqi Zhao,Hao Cheng,Sifa Zheng
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade-offs: closed-loop RL fine-tuning provides grounded feedback from executed actions but is constrained by the sparsity of informative events, whereas counterfactual fine-tuning provides dense supervision over candidate futures but inherits bias from imperfect future estimates. We introduce Counterfactual-to-Interactive Reinforcement Fine-Tuning (CRAFT), an on-policy framework that formulates closed-loop post-training as proxy-residual optimization. CRAFT uses group-normalized counterfactual advantages as a dense proxy for real closed-loop advantages and aligns this proxy with the closed-loop world through grounded residual correction from interaction-critical events. To stabilize adaptation, CRAFT regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically, CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution, reducing residual variance with an aligned proxy while mitigating proxy bias through grounded residual approximation. Empirically, CRAFT achieves the strongest closed-loop gains on Bench2Drive across hierarchical planning, vision-language-action, and vocabulary-scoring architectures. Ablations, scaling behavior, stability analyses, and transfer results further validate the complementary roles of dense counterfactual proxy and grounded residual correction. Project page: this https URL.

[LG-71] Discovering Sparse Counterfactual Factors via Latent Adjustment for Survey-based Community Intervention

链接: https://arxiv.org/abs/2605.04460
作者: Fatima Ashraf,Muhammad Ayub Sabir,Junbiao Pang,Yufang Zhou,Yan Shang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transportation surveys are widely used to understand travel preferences and adoption barriers, yet most survey-based analyses remain descriptive or predictive and rarely provide sparse, policy-feasible intervention strategies. We study sparse counterfactual community intervention from survey responses, where the goal is to shift a target respondent group toward a desired reference group through controllable survey-variable adjustments. We formulate this task as a policy-feasible distributional alignment problem using a fixed-basis nonnegative latent representation that preserves pre/post comparability and provides a stable map from latent factors to original variables. To make latent movement actionable, target-relevant latent factors are identified through Shapley-guided attribution and transferred to controllable variables as intervention priorities. Feasible group-level adjustments are then learned by minimizing an entropy-regularized optimal-transport discrepancy between the post-intervention target distribution and the reference distribution, together with a weighted \ell_{2,1} penalty that promotes shared policy-lever sparsity. Experiments on real-world transportation survey datasets show that the proposed framework produces compact and interpretable policy-feasible interventions with explicit adjustment magnitudes, improves population-level conversion, and preserves intervention sparsity. Code and datasets are publicly available at: this https URL
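The weighted \ell_{2,1} penalty sums weighted row-wise 2-norms, so entire rows (policy levers) are driven to zero together rather than individual entries. A minimal sketch (matrix layout and weights are illustrative, not the paper's):

```python
import math

def weighted_l21(delta, weights):
    """Weighted l_{2,1} penalty: sum_j w_j * ||row_j||_2.
    Rows = controllable variables, columns = groups; the row-wise
    2-norm couples a variable's adjustments, yielding shared
    (group-level) sparsity rather than entry-wise sparsity."""
    return sum(w * math.sqrt(sum(x * x for x in row))
               for w, row in zip(weights, delta))

delta = [[3.0, 4.0],    # variable 1 adjusted in both groups
         [0.0, 0.0]]    # variable 2 untouched: contributes nothing
pen = weighted_l21(delta, [1.0, 1.0])
# -> 5.0 : only the nonzero row contributes
```

This is the same mechanism as the group lasso: shrinking a row's 2-norm to zero removes that policy lever from the intervention entirely.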

[LG-72] Counterfactual identifiability beyond global monotonicity: non-monotone triangular structural causal models

链接: https://arxiv.org/abs/2605.04413
作者: Pengcheng Tan,Jiang Chen,Dehui Du
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Structural causal models provide a unified semantics for interventions and counterfactuals, but most identifiability results rely on restrictive assumptions like global monotonicity, which are often violated in embodied interaction, where the same exogenous perturbation can induce opposite responses under different contact contexts. We ask what structure still suffices once global monotonicity is dropped. We introduce non-monotone triangular structural causal models (NM-TM-SCM), which retain triangular recursion but replace global monotonicity with mechanism-wise invertibility and context-independent inverse transport. We prove that these conditions are equivalent to exogenous isomorphism and imply complete counterfactual identifiability, and we give a counterexample showing that local invertibility alone is insufficient. We instantiate the theory in CausalInverter, with triangular invertible layers, orientation gates, and transport-stability regularization. On synthetic non-monotonic mechanisms, the structural bias yields systematic counterfactual gains as non-monotonicity increases. On MuJoCo Door, our model achieves perfect event-level counterfactual recovery, lowers continuous angle error relative to a Transformer baseline, and delivers substantially more stable recovery than Transformer and conditional-flow predictors. On MuJoCo Push, where non-monotonicity is weaker, the same low-data predictors remain competitive or better, consistent with a bias-variance boundary. These results identify a broader identifiable regime between globally monotone triangular models and unconstrained black-box world models.

[LG-73] Beyond Rigid Geometries: The Spline-Pullback Metric for Universal Diffeomorphic SPD Representation Learning

链接: https://arxiv.org/abs/2605.04406
作者: Tushar Das,Subrata Dutta,Sarmistha Neogy,Koushlendra Kumar Singh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of Symmetric Positive Definite (SPD) matrices into deep learning has historically relied on fixed algebraic Riemannian metrics. Analogous to hand-crafted features in classical machine learning, these static formulations impose rigid geometries limiting network expressivity and adaptability. Recent attempts to parameterize these geometries often violate the axioms of primary matrix functions through unconstrained powers or rank-dependent scaling, inviting spatial folding, loss of global surjectivity, and gradient collapse at spectral singularities. In this paper, we introduce the Spline-Pullback Metric (SPM), instantiated as Spectral-SPM and Cholesky-SPM, marking a paradigm shift from static metric selection to universal geometric approximation. By parameterizing the global diffeomorphism via a rank-invariant, monotonically constrained B-spline, SPM acts as a dense universal approximator for strictly increasing C^1 diffeomorphisms and theoretically subsumes existing pullback metrics while enabling localized non-linear spectral modelling. Topologically, SPM provides a globally bijective pullback geometry precluding rank-swapping discontinuities and gradient instabilities. Empirically, SPM achieves state-of-the-art performance across 3 datasets utilizing Linear Probes, SPDNets, and deep Riemannian ResNets.

[LG-74] Contextual Memory-Enhanced Source Coding for Low-SNR Communications

链接: https://arxiv.org/abs/2605.04400
作者: Ziqiong Wang,Rongpeng Li
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While Separate Source-Channel Coding (SSCC) retains the practical benefits of modular system design, its effectiveness in noisy text transmission is fundamentally constrained by the fragility of autoregressive source decoding. In low-SNR regimes, even a small number of residual bit errors after channel decoding may derail the subsequent lossless reconstruction process, especially when Arithmetic Coding (AC) relies on Large Language Model (LLM)-based probability estimation. Existing remedies either strengthen channel decoding based solely on channel observations or introduce contextual information only at the receiver for post-hoc correction, yet neither fully addresses the fragility of source probability modeling under residual channel errors. To this end, this paper proposes a Memory-Augmented Source Coding (MASC) scheme for robust SSCC-based transmission. Rather than treating context as external side information, MASC internalizes contextual patterns into a source model shared by both the transmitter-side source encoder and the receiver-side source decoder. Specifically, MASC employs a shared Parameterized Contextual Memory (PCM) to encode multi-order n-gram patterns, and further introduces a Mixture-of-Memory-Experts Router (MMER) to perform sparse, hidden-state-dependent routing over memory experts during autoregressive source modeling. By adaptively activating only the most relevant memories at each coding step, MASC refines source probability estimation, shortens average codelength, and mitigates the sensitivity of source decoding to residual channel errors. Extensive experiments over Rayleigh fading and AWGN channels demonstrate the effectiveness of the proposed scheme compared with state-of-the-art methods.
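As a rough count-based analogue of the multi-order n-gram memory (the actual PCM is a learned, routed neural module shared by encoder and decoder, not raw counts), n-gram statistics with longest-match backoff look like:

```python
from collections import defaultdict

def build_memory(tokens, max_order=3):
    """Count continuations for contexts of every order 1..max_order.
    A toy stand-in for a contextual memory over n-gram patterns."""
    mem = defaultdict(lambda: defaultdict(int))
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n):
            ctx = tuple(tokens[i:i + n])
            mem[ctx][tokens[i + n]] += 1
    return mem

def next_prob(mem, context, token, max_order=3):
    """Estimate P(token | context), backing off from the longest
    matching context to shorter ones."""
    for n in range(max_order, 0, -1):
        ctx = tuple(context[-n:])
        if ctx in mem:
            counts = mem[ctx]
            return counts.get(token, 0) / sum(counts.values())
    return 0.0

text = "the cat sat on the mat the cat ran".split()
mem = build_memory(text)
p = next_prob(mem, ["the", "cat"], "sat")
```

A sharper continuation estimate from memorized patterns is what lets the source model tolerate the residual channel errors the abstract discusses.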

[LG-75] GraphPI: Efficient Protein Inference with Graph Neural Networks

链接: https://arxiv.org/abs/2605.04376
作者: Zheng Ma,Jiazhen Chen,Lei Xin,Ali Ghodsi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of deep learning approaches in biomedical research has been transformative, enabling breakthroughs in various applications. Despite these strides, its application in protein inference is impeded by the scarcity of extensively labeled datasets, a challenge compounded by the high costs and complexities of accurate protein annotation. In this study, we introduce GraphPI, a novel framework that treats protein inference as a node classification problem. We treat proteins as interconnected nodes within a protein-peptide-PSM graph, utilizing a Graph Neural Network-based architecture to elucidate their interrelations. To address label scarcity, we train the model on a set of unlabeled public protein datasets with pseudo-labels derived from an existing protein inference algorithm, enhanced by self-training to iteratively refine labels based on confidence scores. Contrary to prevalent methodologies necessitating dataset-specific training, our research illustrates that GraphPI, due to the well normalized nature of Percolator features, exhibits universal applicability without dataset-specific fine-tuning, a feature that not only mitigates the risk of overfitting but also enhances computational efficiency. Our empirical experiments reveal notable performance on various test datasets and deliver significantly reduced computation times compared to common protein inference algorithms.

[LG-76] p-adic Manifold Learning and Benchmark Tasks from Impartial Games

链接: https://arxiv.org/abs/2605.04374
作者: Tomoki Mihara
类目: Machine Learning (cs.LG); Number Theory (math.NT)
*备注:

点击查看摘要

Abstract:We introduce p-adic manifold learning, propose an algorithm to solve it, and construct benchmark tasks from impartial games.
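For context (the abstract is terse), the metric underlying the p-adic setting is the p-adic norm |n|_p = p^{-v_p(n)}, under which integers divisible by high powers of p are small. A minimal implementation of the norm itself:

```python
def p_adic_valuation(n, p):
    """v_p(n): the exponent of the prime p in n (n != 0)."""
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def p_adic_norm(n, p):
    """|n|_p = p^{-v_p(n)}, with |0|_p = 0. This non-Archimedean
    absolute value replaces the Euclidean one in p-adic geometry."""
    return 0.0 if n == 0 else p ** (-p_adic_valuation(n, p))

# 12 = 2^2 * 3, so |12|_2 = 1/4 while |12|_3 = 1/3
```

The induced metric |m - n|_p is an ultrametric, which is why "manifold learning" in this setting must be rebuilt rather than borrowed from the Euclidean case.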

[LG-77] Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation ICRA2026

链接: https://arxiv.org/abs/2605.04366
作者: Zimu Gong,Brian Zhaoning Zhang,Chris Zhang,Kelvin Wong,Raquel Urtasun
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: ICRA 2026

点击查看摘要

Abstract:Safety-critical scenarios are essential for the development of autonomous vehicles (AVs) but are rare in real-world driving data. While simulation offers a way to generate such scenarios, manually designed test cases lack scalability, and adversarial optimization often produces unrealistic behaviors. In this work, we introduce a conditional latent flow matching approach for scalable and realistic safety-critical scenario generation. Our method uses distribution matching to transform nominal scenes into safety-critical rollouts. Furthermore, we demonstrate that incorporating both simulation and real-world data enables our framework to efficiently generate diverse, data-driven scenarios. Experimental results highlight that our approach is able to more consistently and realistically generate novel safety-critical scenarios, making it a valuable tool for training and benchmarking AV systems.

[LG-78] Online Nonstochastic Prediction: Logarithmic Regret via Predictive Online Least Squares

链接: https://arxiv.org/abs/2605.04364
作者: Chih-Fan Pai,Yang Zheng
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study online prediction for marginally stable, partially observed linear dynamical systems under nonstochastic disturbances. Our objective is to minimize the cumulative squared prediction loss and compete with the best-in-hindsight Luenberger predictor. Standard online learning methods typically rely on bounded domains/gradients, and thus their guarantees may fail to deal with potentially unbounded trajectories in marginally stable systems. In this paper, we introduce an unconstrained online least squares method that stabilizes the learning process via tailored predictive hints. With model knowledge, we prove that hints constructed from any stabilizing Luenberger predictor render the hint residuals uniformly bounded, achieving logarithmic regret despite unbounded trajectory growth. We also discuss model-free prediction and introduce a simple universal hint for symmetric systems, under which logarithmic regret is maintained without model knowledge. Our results provide an adaptive, instance-wise optimal online predictor compared to classical fixed-gain observers under nonstochastic disturbances.

[LG-79] Probing Structural Mathematical Reasoning in Language Models with Algebraic Trapdoors

链接: https://arxiv.org/abs/2605.04352
作者: Igor Rivin
类目: Machine Learning (cs.LG); Group Theory (math.GR)
*备注:

点击查看摘要

Abstract:We introduce a benchmark suite for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) with cryptographic-style verifier-prover asymmetry. Each instance presents a finitely generated subgroup as a list of integer matrices and asks for an arithmetic invariant – index, surjection-at-prime, or membership – that the construction-time information (N, K) pins down in O(1) closed form, but that the solver, lacking that information, must derive by either Aschbacher-classification analysis or by a membership query in SL(3, Z) of unknown decidability. The benchmark therefore distinguishes models with internalized algebraic priors (Aschbacher classes, McLaughlin’s theorem, Property (T), the congruence subgroup property) from models that rely on general-purpose computation. We report empirical results across five representative reasoning traces from two state-of-the-art models. The headline result: on the index variant, one model spent 152 minutes of reasoning, explicitly identified the kernel-side membership question as the bottleneck, attempted constructive verification, and abstained with “DON’T KNOW” rather than commit to its computed cokernel candidate – demonstrating calibrated meta-cognition on the open-decidability boundary that the benchmark was designed to probe. We argue that the benchmark exposes a four-way classification of model behavior (commit-correct, commit-wrong, abstain-correct, abstain-wrong) that standard answer-key scoring conflates.
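For concreteness, membership in SL(3, Z) itself is trivially checkable (integer entries, determinant 1); the benchmark's hard questions concern membership in finitely generated subgroups, where no general algorithm is known. A sketch of the easy check only:

```python
def det3(m):
    """Determinant of a 3x3 integer matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def in_SL3Z(m):
    """Membership in the full group SL(3, Z). Deciding membership in
    a *subgroup* given by generators is the open-decidability boundary
    the benchmark probes."""
    return all(isinstance(x, int) for row in m for x in row) and det3(m) == 1

# An elementary matrix, one of the standard generators of SL(3, Z):
E12 = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
```

The verifier-prover asymmetry in the benchmark comes precisely from this gap: the constructor knows (N, K) and can answer in closed form, while the solver faces the subgroup question.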

[LG-80] Structural Equivalence and Learning Dynamics in Delayed MARL

链接: https://arxiv.org/abs/2605.04345
作者: Jules Sintes,Ana Bušić,Jiamin Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We formally establish the equivalence between Observation Delay (OD) and Action Delay (AD) in cooperative partially observable multi-agent systems using observation-action histories. We show that both systems generate identical admissible joint-policy sets, and their induced state-action-observation trajectories are identical in distribution, leading to identical optimal solutions in Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). This formally generalizes existing infinite-horizon single-agent results to any-horizon partially observable cooperative multi-agent problems with decentralized policy execution, and allows any mixed-delay configuration to be reduced to a pure OD system. We further prove that in Transition-Independent MDPs (TI-MDPs), the observation-action history reduces to a tractable minimal local augmented state. However, we show through numerical experiments that although the optimal solution spaces are structurally isomorphic, the practical learning dynamics are fundamentally different. First, using the minimal local augmented state, the equivalence no longer holds when transitions are not independent. Second, operational constraints and causal credit-assignment errors in Temporal Difference (TD) algorithms induce different learning behaviors across regimes. Finally, leveraging this structural equivalence to bypass these learning challenges, we demonstrate successful multi-agent zero-shot policy transfer from OD to AD, paving the way for unified, efficient solution methods in complex delayed systems.

[LG-81] On the Architectural Complexity of Neural Networks

链接: https://arxiv.org/abs/2605.04325
作者: Nicholas J. Cooper,François G. Meyer,Michael L. Roberts,Carlos Zapata-Carratalá,Lijun Chen,Danna Gurari
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
*备注: 67 pages, 54 figures, 11 tables

点击查看摘要

Abstract:We introduce a unified theoretical framework for the rigorous analysis and systematic construction of deep neural networks (DNNs). This framework addresses a gap in existing theory by explicitly modeling the structure of tensor operations – lower level information that is often abstracted. Our framework enables two novel objectives: (1) analysis of the evolution of architectural complexity over deep learning history, and (2) automatic construction of novel architectures based on new types of tensor operations. Our study of DNNs introduced over the past 40 years reveals a connection between groundbreaking architectures and increases in different types of architectural complexity. Moreover, we identify several large classes of higher complexity architectures that have not yet been explored. We then collect a dataset of 3,000+ higher complexity architectures, which we publicly release at: this https URL.

[LG-82] DeFed-GMM-DaDiL: A Decentralized Federated Framework for Domain Adaptation

链接: https://arxiv.org/abs/2605.04324
作者: Rebecca Clain,Eduardo Fernandes Montesuma,Fred Ngole Mboula
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized multi-source domain adaptation seeks to transfer knowledge from multiple heterogeneous and related source domains to an unlabeled target domain in a decentralized setting. We address this challenge through a fully decentralized federated approach, DeFed-GMM-DaDiL, an extension of the GMM-Dataset Dictionary Learning (DaDiL) framework. Each client models its dataset as a Gaussian Mixture Model (GMM), and the federation jointly approximates them via labeled Wasserstein barycenters of shared, learnable GMM atoms. This design enables adaptation without a central server while preserving clients’ privacy. We empirically study the stability of the learned representations in scenarios where the target domain has missing classes. Empirical results demonstrate that DeFed-GMM-DaDiL maintains stable and consistent shared representations across clients, effectively reconstructs missing classes, and achieves competitive performance on multi-source domain adaptation benchmarks.
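【示例】该论文中 GMM 的 Wasserstein 重心建立在高斯分布之间的闭式 2-Wasserstein(Bures-Wasserstein)距离之上。下面给出一个基于 NumPy 的示意草图(非论文原实现,仅说明这一几何基础,函数名均为示意):

```python
import numpy as np

def psd_sqrt(C):
    # 对称半正定矩阵的平方根(特征分解实现)
    w, V = np.linalg.eigh(C)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2_sq(m1, C1, m2, C2):
    # 高斯分布间 2-Wasserstein 距离的平方:
    # ||m1-m2||^2 + Tr(C1 + C2 - 2 (C1^{1/2} C2 C1^{1/2})^{1/2})
    s1 = psd_sqrt(C1)
    cross = psd_sqrt(s1 @ C2 @ s1)
    return float(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2.0 * cross))

m1, C1 = np.zeros(2), np.eye(2)
m2, C2 = np.array([3.0, 0.0]), 4.0 * np.eye(2)
d2 = gaussian_w2_sq(m1, C1, m2, C2)
# 均值项 9 + 协方差项 Tr(I + 4I - 4I) = 2,合计 11
```

GMM 间的带标签 Wasserstein 重心即在若干这样的高斯分量之间做加权插值,论文在去中心化联邦设定下学习共享的 GMM 原子。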

[LG-83] LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

链接: https://arxiv.org/abs/2605.04323
作者: Kuangdai Leng,Simon Jeffery,Panos Panagos,Tarje Nissen-Meyer
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 27 pages, 7 figures, 1 table

点击查看摘要

Abstract:Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.
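【示例】SoilFormer 的预训练目标是对表格特征做随机掩码并重建。下面是一个极简的 NumPy 草图,仅演示"只在被掩码位置上计算重建损失"这一目标形式(编码器用单层线性模型代替,非论文架构):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))            # 玩具表格批次
mask = rng.random(X.shape) < 0.15        # 随机掩码约 15% 的条目
X_in = np.where(mask, 0.0, X)            # 被掩码处填 0 作为 [MASK] 占位

W = np.linalg.lstsq(X_in, X, rcond=None)[0]   # 示意"编码器":一层线性映射
X_hat = X_in @ W
loss = float(np.mean((X_hat[mask] - X[mask]) ** 2))  # 仅在掩码条目上计损失
```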

[LG-84] Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion ACL2026

链接: https://arxiv.org/abs/2605.04291
作者: Tarun Kathuria,Sachin Kumar
类目: Machine Learning (cs.LG)
*备注: To appear in ACL 2026

点击查看摘要

Abstract:We present a discrete diffusion-based language model using Glauber dynamics from statistical physics. Our main insight is that instead of trying to train a discrete state space diffusion model using Glauber dynamics with a uniform transition kernel as the forward process, one can set up an "energy function" based on pretrained causal/masked language models. When viewed as the stationary distribution, this energy function allows us to significantly improve the quality of the generated text. Incorporating UL2 as the pretrained model into our diffusion pipeline, we outperform prior diffusion based LMs and perform competitively with autoregressive models of comparable model sizes. Furthermore, our models are competitive with or outperform prior diffusion models and GPT-2 style auto-regressive models on zero-shot common sense reasoning tasks as well as planning and search tasks like Sudoku and Zebra puzzles.
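【示例】Glauber 动力学是一种逐位翻转的热浴采样:每步随机选一个位置,按 sigmoid(-ΔE) 的概率接受翻转,使平稳分布为能量函数诱导的 Gibbs 分布。下面用一个 Ising 型玩具能量演示这一采样循环(论文中能量函数由预训练语言模型给出,此处仅为示意):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
J = rng.normal(scale=0.5, size=(n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

def energy(s):
    # Ising 型能量;论文中此角色由预训练 LM 的打分承担
    return -0.5 * s @ J @ s

s = rng.choice([-1.0, 1.0], size=n)
for _ in range(2000):                    # Glauber 扫描:单点翻转 + 热浴接受
    i = rng.integers(n)
    s_flip = s.copy()
    s_flip[i] = -s_flip[i]
    dE = energy(s_flip) - energy(s)
    if rng.random() < 1.0 / (1.0 + np.exp(dE)):   # 以 sigmoid(-dE) 概率接受
        s = s_flip
```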

[LG-85] Probabilistic Classification and Uncertainty Quantification of Sahara Desert Climate Using Feedforward Neural Networks

链接: https://arxiv.org/abs/2605.04286
作者: Stephen Tivenan,Indranil Sahoo,Yanjun Qian
类目: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Climate classification plays a vital role in agricultural planning, hydrological studies, and climate science. One of the most widely used systems for classifying global climate zones is the Köppen-Trewartha (KT) classification. However, the KT classification is fundamentally deterministic, offering discrete labels to spatial locations without accounting for uncertainties in classification. In this paper, we provide a framework for probabilistic modeling of climatic zones. We implement a feedforward artificial neural network (ANN) for classification, allowing for efficient, uncertainty-aware categorization of climatic regions, thereby offering a more nuanced understanding of transitional climate zones compared to traditional deterministic methods. We apply this method to the Sahara Desert region over the 30-year period of 1960 - 1989, using data at more than 400,000 space-time locations from the first 11 years to train our model. We assess the model’s short- and long-term classification capabilities to evaluate its stability and accuracy over time. We also compare the probabilistic classification from our model with the traditional KT classification. In addition, we use fluctuation analysis methods to highlight the temporal evolution of climatic zones across the Sahara region and identify areas undergoing significant flux of probabilities of their climate classes, providing insights into broader trends in desertification.

[LG-86] Hardware-Aware Neural Feature Extraction for Resource-Constrained Devices CVPR

链接: https://arxiv.org/abs/2605.04282
作者: Francesco Tosini,Simone Pedroni,Christian Veronesi,Pietro Bartoli,Marco Paracchini,Marco Marcon,Diana Trojaniello
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted for publication at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026. ©IEEE

点击查看摘要

Abstract:Visual SLAM is a core component of spatial computing systems, yet deploying learned local feature extractors on microcontroller-class hardware remains challenging due to memory, bandwidth, and quantization constraints. While modern neural descriptors provide strong robustness, their practical adoption is often hindered by system-level bottlenecks that are not captured by FLOP-based efficiency metrics. In this work, we introduce Gideon, a hardware-aware neural feature extractor explicitly designed for resource-constrained devices. Our approach combines relational knowledge distillation from a SuperPoint teacher with differentiable neural architecture search (DNAS) under strict memory and operator constraints. Unlike conventional design pipelines, we treat quantization stability and dynamic-range compactness as first-class objectives. We show that architectural choices such as replacing Batch Normalization with affine layers significantly improve INT8 robustness, and that descriptor dimensionality directly governs quantization resilience. Deployed on STM32N6, Gideon achieves 9.003 ms inference time (111 fps) while remaining below a 1.5 MB memory footprint. Remarkably, INT8 quantization induces negligible degradation and occasionally matches full-precision performance. These results demonstrate that robust learned feature extraction can be reconciled with embedded hardware constraints through holistic hardware-algorithm co-design.
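【示例】摘要提到两点可量化的设计:用仿射层替代 BatchNorm(无运行统计量,动态范围在导出时即固定),以及对称 INT8 量化的误差受 scale 限制。下面是一个 NumPy 示意草图(非 Gideon 原实现):

```python
import numpy as np

def affine(x, gamma, beta):
    # BatchNorm 的替代:无运行统计量,逐通道动态范围固定,利于 INT8 标定
    return gamma * x + beta

def quantize_int8(x):
    # 对称 INT8 量化:舍入误差不超过 scale/2
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.default_rng(0).normal(size=(4, 16)).astype(np.float32)
y = affine(x, gamma=0.9, beta=0.1)
q, s = quantize_int8(y)
y_hat = q.astype(np.float32) * s
err = float(np.abs(y - y_hat).max())
```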

[LG-87] Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

链接: https://arxiv.org/abs/2605.04279
作者: Ayan Pendharkar
类目: Machine Learning (cs.LG)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for single-head attention, the multi-head setting remains less understood due to geometric interference between heads, which invalidates standard monotonicity arguments. In this work, we develop a theoretical framework for multi-head self-attention dynamics and resolve several open questions. We show that, under suitable conditions on the score matrices, a natural multi-head energy functional is non-decreasing along both flat and spherical dynamics. We identify the key obstruction to per-head monotonicity as radial shadow terms, which are projections of each head’s output onto token directions, persisting even under orthogonality assumptions. We introduce a sufficient condition ensuring monotonicity and establish robustness to approximate orthogonality. In a simplified scalar-head regime with equiangular token configurations, we derive a closed-form expression for the critical inverse temperature governing clustering behavior, and show that heterogeneous heads exhibit super-additive clustering rates. In this regime, we also prove a separation in clustering time between ReLU and softmax attention in the linearized dynamics. Finally, we establish an entropy production identity and show that attention entropy increases monotonically toward equilibrium as clustering progresses. Our results provide a unified perspective on the dynamics of multi-head attention and clarify the mechanisms underlying clustering and stability in transformer models. 

[LG-88] QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization GECCO’26

链接: https://arxiv.org/abs/2605.04267
作者: Florian A. D. Burnat
类目: Machine Learning (cs.LG)
*备注: Accepted at Genetic and Evolutionary Computation Conference (GECCO '26)

点击查看摘要

Abstract:Interactive multi-objective optimization systems face a budget allocation dilemma: one can spend resources on expensive objective evaluations or on eliciting decision-maker preferences that identify the relevant region of the Pareto set. Moreover, preference elicitation itself spans modalities with different information content and cognitive burden, ranging from cheap, noisy pairwise preference statements (PS) to richer but costlier indifference adjustments (IA). We study cost-aware optimization under an unknown scalarization and introduce QUIVER (Query-Informed Value Estimation for Regret), a surrogate-assisted evolutionary multi-objective optimizer that adaptively chooses between objective evaluations and heterogeneous preference queries. At each step, QUIVER selects the next action by maximizing the expected decision-quality improvement per unit total cost. Across DTLZ and WFG benchmarks under synthetic decision-maker models, QUIVER achieves the lowest final utility regret on challenging WFG problems (utility regret of 2.14 on WFG4, 2.82 on WFG9: a 25% improvement over baselines), outperforming all single-modality baselines. We analyze how the optimal mix of PS and IA adapts to problem difficulty: on easy problems (DTLZ2), QUIVER selects 80% PS queries; on hard problems (WFG9), it shifts to 35% IA queries. This adaptive modality selection demonstrates cost-aware preference learning in action.
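【示例】QUIVER 的核心决策规则是"单位成本的期望改进最大化"。下面用一个玩具例子演示该选择逻辑(收益与成本数值均为假设,非论文数据):

```python
# 假设的各动作估计:期望遗憾下降量(gain)与成本(cost)
actions = {
    "objective_eval":          {"gain": 0.40, "cost": 10.0},  # 昂贵的目标函数评估
    "pairwise_preference":     {"gain": 0.05, "cost": 1.0},   # PS:便宜但带噪
    "indifference_adjustment": {"gain": 0.30, "cost": 4.0},   # IA:信息更丰富但更贵
}
# 按单位成本收益选动作:0.04 / 0.05 / 0.075,选中 IA
best = max(actions, key=lambda a: actions[a]["gain"] / actions[a]["cost"])
```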

[LG-89] Explaining and Preventing Alignment Collapse in Iterative RLHF

链接: https://arxiv.org/abs/2605.04266
作者: Etienne Gauthier,Francis Bach,Michael I. Jordan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code at: this https URL

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy’s true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy’s influence on the RM’s future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM’s blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy’s parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.

[LG-90] Laundering AI Authority with Adversarial Examples

链接: https://arxiv.org/abs/2605.04261
作者: Jie Zhang,Pura Peetathawatchai,Florian Tramèr,Avital Shafran
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly deployed as trusted authorities – fact-checking images on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. We show that adversarial examples break this assumption, enabling *AI authority laundering*: an attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the *wrong* input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment; the attack operates entirely at the perceptual level. We demonstrate that standard attacks against publicly available CLIP models transfer reliably to production VLMs – including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that authority laundering can amplify misinformation, disparage individuals, evade content moderation, and manipulate product recommendations. Our attacks have high success rates: In hundreds of attacks targeting identity manipulation and NSFW evasion, we measure success rates of 22 - 100% across six models. No novel attack algorithm is required: basic techniques known for over a decade suffice, establishing a lower bound on attacker capability that should concern defenders. Our results demonstrate that visual adversarial robustness is now a practical – and still largely unsolved – safety problem.
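【示例】摘要强调"十年前的基础技术已足够",即带 L∞ 约束的投影梯度(PGD)式嵌入攻击。下面用一个线性玩具编码器示意该流程(以线性映射代替 CLIP 图像编码器,非论文原实现):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))            # 玩具线性"图像编码器"(代替 CLIP)
embed = lambda x: x @ W

x = rng.normal(size=64)                  # 原始图像(展平向量)
x_target = rng.normal(size=64)           # 攻击者希望模型"看到"的图像
eps, step = 0.05, 0.01

adv = x.copy()
for _ in range(100):                     # PGD:把 adv 的嵌入推向目标嵌入
    # ||embed(adv) - embed(x_target)||^2 对 adv 的梯度
    g = 2 * W @ (embed(adv) - embed(x_target))
    adv = adv - step * np.sign(g)
    adv = x + np.clip(adv - x, -eps, eps)   # 投影回 L-inf 球,保持扰动不可察觉
```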

[LG-91] HUGO-CS: A Hybrid-Labeled Uncertainty-Aware General-Purpose Observational Dataset for Cold Spray

链接: https://arxiv.org/abs/2605.04257
作者: Stephen Price,Kyle Miller,Marco Musto,Kenneth Kroenlein,James Saal,Kyle Tsaknopoulos,Elke A. Rundensteiner,Danielle L. Cote
类目: Machine Learning (cs.LG)
*备注: 22 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Cold spraying is an increasingly common approach for repairing and manufacturing components due to its solid-state manufacturing capabilities. However, process optimization remains difficult due to many interdependent parameters and the lack of large-scale, machine-readable data to support modeling. While the scientific literature contains many relevant experiments, results are inconsistently reported (often in tables and figures) and use non-uniform units, limiting utilization at scale. To address these limitations, this work presents HUGO-CS, a literature-derived dataset of 4,383 cold-spray experiments with 144 features from 1,124 sources, exceeding the previous largest dataset (137 samples) by 30x. With completely manual extraction requiring an average of 91 minutes per document, this work designs and leverages a Hybrid-labeled, Uncertainty-aware, General-purpose, Observational extraction framework, called HUGO, to support this extraction. HUGO combines automated LLM-based labeling with targeted manual label refinement to handle this experimental result extraction process from scientific literature. To balance labeling efficiency with extraction accuracy, HUGO introduces a Hierarchical Risk Mitigation (HRM) to route LLM outputs with a high risk of potential errors for manual review, while retaining low-risk records as auto-labeled. Lastly, HUGO post-processing consolidates categorical descriptors, maps reported feedstock chemistries into structured continuous compositions, and normalizes units across sources. Of the 4,383 reported experiments, 1,765 are hand-labeled, providing a high-quality labeled subset for benchmarking, error analysis, and higher-fidelity data points. All code to replicate this work, along with the complete HUGO-CS dataset, are released under a CC-BY license at this https URL.

[LG-92] Road Risk Monitor: A Deployable U.S. Road Incident Forecasting System with Live Weather and Road-Level Tiles

链接: https://arxiv.org/abs/2605.04242
作者: Anton Ivchenko
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Nationwide road-incident forecasting is a systems problem before it is a modeling problem. A usable service must connect historical incident archives, historical and live weather, national road geometry, offline model training, tile generation, web serving and runtime handoff. This paper presents Road Risk Monitor, a U.S.-wide road-safety stack that combines a nationwide H3 baseline trained on FARS fatal-crash data with a road-segment forecasting pipeline trained from TIGER/Line geometry and US-Accidents events, then serves predictions through live APIs, raster tiles, JSON road tiles, and a public web application.

[LG-93] Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

链接: https://arxiv.org/abs/2605.04236
作者: Roberto Medina
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Model ensembles improve reasoning accuracy up to a performance boundary; beyond it, additional deliberation degrades accuracy. Static-budget methods cannot detect this boundary. Extended-thinking architectures compound the problem: a wrong answer after 120k tokens is indistinguishable from a correct one. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. Two configurations are evaluated: a persistence heuristic and DASE-Spatial (arena half-width W). Three contributions. (1) DASE produces a commit-type routing partition complementary to verbalized single-call confidence. On a contamination-controlled corpus (AIME 2010-2023, N=254, 3 seeds), a 120B ensemble achieves a 24.8 pp routing gap (right-wall 97.1% vs. left-wall 73.6%), statistically equivalent to Opus 4.6 Standard verbalized confidence at coverage-matched threshold (25.7 pp gap; bootstrap CI on difference: [-12.0, +10.3] pp, p=0.873). The two mechanisms disagree on 27% of routing assignments, establishing them as complements rather than substitutes; every DASE decision is accompanied by a machine-readable deliberation record. (2) Adaptive stopping, not injection bandwidth, drives accuracy gains. On AIME-300, bandwidth accounts for only 0.3 pp (ns); on GPQA-Extended, 4.4 pp bandwidth versus 5.0 pp stopping effect. DASE-Spatial ties Debate-Dense at its optimal budget using one-tenth the injection bandwidth and identifies that budget automatically; W=8 (65.0%) significantly outperforms W=4 (59.3%) on AIME-300 (adj p=0.0042). (3) Injection-based methods exhibit a retrospective accuracy-vs-inference inverted-U on both benchmarks; this pattern is hypothesis-generating for future work.
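【示例】DASE 的停止逻辑可概括为:本轮答案达成真实共识则提前提交,证据碎片化且预算耗尽则回退到全局频率。下面是一个示意性草图(阈值与接口均为假设,非论文原实现):

```python
from collections import Counter

def dase_stop(round_answers, history, tau=0.75, max_rounds=6):
    """共识即提交、预算耗尽则按全局频率回退(DASE 启发式的示意版)。"""
    history.extend(round_answers)
    ans, cnt = Counter(round_answers).most_common(1)[0]
    if cnt / len(round_answers) >= tau:                  # 共识提交
        return ans, "consensus"
    if len(history) >= max_rounds * len(round_answers):  # 预算耗尽:全局频率回退
        return Counter(history).most_common(1)[0][0], "global_frequency"
    return None, "continue"                              # 继续辩论

hist = []
decision, kind = dase_stop(["42", "42", "42", "17"], hist)
# 3/4 = 0.75 >= tau,以共识方式提交 "42"
```

每次提交附带 commit 类型("consensus" / "global_frequency"),对应论文中可用于路由的机器可读辩论记录。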

[LG-94] Capabilities of Auto-encoders and Principal Component Analysis of the Reduction of Microstructural Images; Application on the Acceleration of Phase-Field Simulations

链接: https://arxiv.org/abs/2605.04229
作者: Seifallah Fetni,Thinh Quy Duc Pham,Truong Vinh Hoang,Hoang Son Tran,Laurent Duchêne,Xuan-Van Tran,Anne Marie Habraken
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 21 pages, 8 figures. Preprint version of article published in Computational Materials Science

点击查看摘要

Abstract:In this work, a data-driven framework based on Phase-Field simulations data is proposed to highlight the capabilities of neural networks to ensure accurate low dimensionality reduction of simulated microstructural images and to provide time-series analysis. The dataset was indeed constructed from high-fidelity Phase-Field simulations. Analyses demonstrated that the association of auto-encoder neural networks and principal component analyses leads to ensure efficient and significant dimensionality reduction: 1/196 of reduction ratio with more than 80% of accuracy. These findings give insight to apply analyses on data from the latent dimension. Application of Long Short Term Memory (LSTM) neural networks showed the possibility of making next frame predictions; that makes possible the acceleration of Phase-Field simulation without the need of high computing resources. We discussed the application of such a framework on various areas of research. Different methods are proposed from the conducted analyses, in order to ensure dimensionality reduction, including auto-encoders, principal component analysis and Artificial Neural Networks, and time-series analysis, including LSTM and Gated Recurrent Unit (GRU).
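【示例】该框架的第一步是对模拟帧做低维压缩。下面用 NumPy 的 SVD 演示 PCA 部分(论文是自编码器与 PCA 级联达到 1/196 的压缩比,此处仅示意 PCA 环节,数据为合成):

```python
import numpy as np

rng = np.random.default_rng(0)
# 玩具"微结构"帧:196 像素,近似落在一个 3 维子空间上
basis = rng.normal(size=(3, 196))
frames = rng.normal(size=(100, 3)) @ basis + 0.01 * rng.normal(size=(100, 196))

mean = frames.mean(axis=0)
U, S, Vt = np.linalg.svd(frames - mean, full_matrices=False)
k = 3
codes = (frames - mean) @ Vt[:k].T       # 196 维 -> 3 维潜码
recon = codes @ Vt[:k] + mean            # 由潜码重建
rel_err = np.linalg.norm(recon - frames) / np.linalg.norm(frames)
```

时间序列预测(LSTM/GRU)即在 `codes` 这样的潜空间序列上做下一帧预测,从而避免直接在高维场上推进 Phase-Field 模拟。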

[LG-95] Climate-based Pre-screening of Self-sustaining Regreening Opportunities in Drylands: A Case Study for Saudi Arabia

链接: https://arxiv.org/abs/2605.04206
作者: Katja Froehlich,Jonathan Klein,Ibrahim S. Elbasyoni,Julian D. Hunt,Yoshihide Wada,Dominik L. Michels
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale restoration in drylands is widely promoted to address land degradation and biodiversity loss, yet many efforts rely on long-term irrigation, limiting sustainability in water-scarce regions. A key challenge is identifying locations where native vegetation can persist without intensive management while minimizing costly field campaigns. A scalable pre-screening framework is presented that integrates climate and remote sensing data to enable cost-efficient site selection in arid environments using Saudi Arabia as a case study. A Climate Suitability Score (CSS), derived from machine learning models trained on expert-curated reference sites, captures complex climatic dependencies on vegetation persistence. Using multi-year ERA5-Land data for Saudi Arabia, national-scale prediction maps are generated and combined with vegetation indices to identify areas where climate is favorable, but vegetation remains underdeveloped. Multi-criteria screening reduces candidates to thirteen priority locations. Climatically analogous intact ecosystems provide benchmarks for restoration targets and indicate that an average 2.5 fold increase in vegetation coverage is a realistic target for restoration efforts. Overall, this approach narrows the search space, reduces costs, and supports resilient ecosystem recovery planning in water-limited regions.

[LG-96] Sequential Strategic Classification with Multi-Stage Selective Classifiers

链接: https://arxiv.org/abs/2605.04202
作者: Ziyuan Huang,Lina Alkarmi,Mingyan Liu
类目: Machine Learning (cs.LG)
*备注: Shorter version presented as a poster at GameNets 2026

点击查看摘要

Abstract:Strategic classification studies the problem where self-interested individuals or agents manipulate their response to obtain favorable decision outcomes made by classifiers, typically turning to dishonest actions when they are less costly than genuine efforts. Prior works have demonstrated a fundamental inability to get out of this conundrum by only focusing on the design of a classifier. We note that prior work also heavily focuses on either one-shot settings or repeated interaction with the same classifier. Real-world decision making is often multi-stage, involving a sequence of potentially different classifiers as an agent progresses. This paper introduces a sequential, stochastic, multi-stage model of strategic classification, by capturing how agents adapt their behavior, through improvement actions (enhancing both observable features and true attributes) and gaming actions (enhancing only observable features), over multiple levels of classification with increasing difficulty as well as reward. For each level, we adopt a selective classifier that can abstain from making a prediction at low confidence. Consequently, a positive (resp. negative) outcome leads to promotion (resp. demotion) of the agent to the next higher (resp. lower) level, while abstention keeps the agent at the same level. We fully characterize the agent’s optimal instantaneous action under selective classifiers and compare the long-term properties and utility of the agent repeatedly following an optimal myopic policy of either no-improvement (never choose the improvement action) or no-gaming (never choose the gaming action). We further examine design principles over the sequence of classifiers that yield higher long-term utility for the latter policy, thereby effectively incentivizing genuine effort in the long run.

[LG-97] Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing

链接: https://arxiv.org/abs/2605.04185
作者: Qijun Liao,Zhaoxin Yu,Jue Yang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 27 pages, 60 figures

点击查看摘要

Abstract:When deploying reinforcement learning policies to physical robots, actuator rate constraints – hard limits on how fast each joint can move per control step – are unavoidable. These limits vary substantially across joints due to differences in motor inertia, power bandwidth, and transmission stiffness, creating pronounced heterogeneity that existing methods fail to handle geometrically: the per-joint feasible region forms a high-dimensional box in action-increment space, yet QP projection and spherical parameterization methods impose isotropic ball-shaped constraints, exponentially under-covering the true feasible set as heterogeneity grows. This paper proposes Dynamic Decoupled Spherical Radial Squashing (DD-SRad), which resolves this mismatch by computing a position-adaptive radius independently for each actuator, achieving tight alignment with the true per-joint feasible region. DD-SRad satisfies per-step hard constraints with probability~1, preserves well-conditioned gradients throughout training, and admits exact policy gradient backpropagation with zero runtime solver overhead. MuJoCo benchmark experiments demonstrate the highest task return at zero constraint violation – matching the unconstrained upper bound – with 30%–50% improvement in constraint-space coverage over spherical baselines. High-fidelity IsaacLab simulations with Unitree H1 and G1 humanoid robots confirm end-to-end optimality parameterized directly from official joint specifications, validating a systematic pathway from hardware datasheets to safe deployment.
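【示例】DD-SRad 的关键是为每个关节独立计算位置自适应半径,再做逐元素压缩,使可行域是一个"盒子"而非各向同性的球。下面是一个示意草图(半径规则为简化假设,未必与论文完全一致):

```python
import numpy as np

def dd_srad_squash(u, q, rate_limit, q_min, q_max):
    """把无界策略输出 u 压缩为满足逐关节速率限制与位置界限的步进增量。"""
    # 位置自适应的逐关节半径:既不超过速率限制,也不迈过位置边界
    r = np.minimum(rate_limit, np.minimum(q_max - q, q - q_min))
    return r * np.tanh(u)        # 逐元素压缩:盒约束,而非各向同性球

q      = np.array([0.0, 1.4])
rate   = np.array([0.5, 0.05])   # 异构执行器:速率限制差一个量级
q_min  = np.array([-1.5, -1.5])
q_max  = np.array([1.5, 1.5])
dq = dd_srad_squash(np.array([10.0, -10.0]), q, rate, q_min, q_max)
```

由于 tanh 逐元素可微,该映射支持精确的策略梯度反传,且无需运行时求解 QP。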

[LG-98] A Provably Convergent and Practical Algorithm for Gromov–Wasserstein Optimal Transport

链接: https://arxiv.org/abs/2605.04175
作者: Ling Liang,Lei Yang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 21 Pages, 6 figures

点击查看摘要

Abstract:Gromov–Wasserstein optimal transport (GWOT) aligns metric measure spaces by matching their within-domain relational structures, but large-scale GWOT remains challenging because its objective is nonconvex and projection onto the transport polytope is often solved only approximately in practice. This leads to a gap between practical projected-gradient implementations and convergence theory, which typically assumes exact projections. For squared-loss GWOT, we propose an inexact projected-gradient framework with a verifiable feasibility-residual-based inexact condition for the projection subproblem. This condition is directly computable and avoids unknown quantities such as the exact projection point. Under this implementable condition, we prove subsequential convergence to stationary points and, with a mild tolerance-decay condition, convergence of the whole sequence. The resulting method retains the simplicity and sparsity of projected-gradient schemes while providing rigorous convergence guarantees, turning projected-gradient methods into a principled and scalable approach for GWOT with provable reliability.
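【示例】论文的"可验证不精确条件"是一个可直接计算的可行性残差。下面用 Sinkhorn 式行列缩放作为投影子问题求解器的示意(这是一个说明性选择,论文框架允许任意满足残差条件的子求解器):

```python
import numpy as np

def inexact_project(M, a, b, tol=1e-6, max_iter=10000):
    """把正矩阵近似缩放到传输多胞体 U(a,b) 上,
    以可计算的可行性残差作为停止条件(示意实现)。"""
    P = np.clip(M, 1e-300, None)
    for _ in range(max_iter):
        P = P * (a / P.sum(axis=1))[:, None]   # 匹配行边缘分布
        P = P * (b / P.sum(axis=0))[None, :]   # 匹配列边缘分布
        res = np.abs(P.sum(axis=1) - a).sum() + np.abs(P.sum(axis=0) - b).sum()
        if res <= tol:                         # 可验证的不精确条件
            return P, res
    return P, res

rng = np.random.default_rng(0)
a = np.ones(5) / 5
b = np.ones(5) / 5
P, res = inexact_project(rng.random((5, 5)), a, b)
```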

[LG-99] Enabling Real-Time Training of a Wildfire-to-Smoke Map with Multilinear Operators

链接: https://arxiv.org/abs/2605.04164
作者: Zachary Morrow,Joseph Crockett,John D. Jakeman,Dan J. Krofcheck
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph)
*备注: 27 pages

点击查看摘要

Abstract:Wildfires are a major producer of fine particulate matter, impacting human health and the electrical grid. Accurately forecasting smoke impacts over long time scales incorporates fuel treatment strategies, natural fuel succession, and stochastic events like lightning strikes. However, predicting smoke for each fuel distribution with a forward simulation of a coupled fire-atmosphere model is computationally infeasible. Moreover, relatively simple fire models are tractable to run in many long-time scenarios but do not capture smoke transport. We use data-driven multilinear operators to predict a smoke concentration field from knowledge of the time since ignition for two quantities of interest: aerosol optical depth and smoke detection. Our method first computes the principal components of time-since-ignition and smoke concentration fields and then learns a map from powers of the input coefficients to the output coefficients. We apply our learned operator to smoke prediction in the Upper Rio Grande Watershed. After collecting training data, learning the approximation weights on a CPU takes less than 30 seconds, and each forward call takes less than 1 ms. On a proxy for aerosol optical depth, we obtain equal accuracy to Monte Carlo sampling with fewer than half as many coupled model calls. For smoke detection, we obtain an intersection-over-union (IoU) of 65% and an area under the receiver operating characteristic curve (AUC) of 0.95 on holdout data. Our method is significantly more accurate than the most similar published smoke classifier, which obtains an IoU and AUC of 0.15 and 0.61, respectively, on a 2015 bushfire in Australia.
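【示例】论文的多线性算子即"从输入主成分系数的幂到输出系数的回归"。下面用合成数据演示该拟合形式(维度与次数均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, deg = 200, 4, 6, 2

A = rng.normal(size=(n, d_in))                   # 输入场(着火后时间)的 PC 系数
# 合成真值:输出系数是输入系数的二次函数
feats = np.hstack([A ** p for p in range(1, deg + 1)])   # 输入系数的各次幂
W_true = rng.normal(size=(feats.shape[1], d_out))
B = feats @ W_true                               # 输出场(烟雾浓度)的 PC 系数

W = np.linalg.lstsq(feats, B, rcond=None)[0]     # 最小二乘学习多线性映射
rel_err = np.linalg.norm(feats @ W - B) / np.linalg.norm(B)
```

由于前向只是一次小矩阵乘法,这类代理模型每次调用可以在毫秒量级完成,对应摘要中"每次前向调用小于 1 ms"的量级。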

[LG-100] Model synthesis and identifiability analysis of stiff chemical reaction systems with inVAErt networks

链接: https://arxiv.org/abs/2605.04134
作者: Sreejata Dey,Guoxiang Grayson Tong,Jonathan F. MacArt,Daniele E. Schiavazzi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of learning data-driven replicas for stiff systems of ordinary differential equations arising in chemical kinetics that can be evaluated with high computational efficiency. We first focus on training emulators for families of reaction equations under varying reaction rates, using conditional residual networks or long-short term memory architectures. We then apply a recently proposed data-driven framework known as "inVAErt networks" to address the ill-posed inverse problem of inferring reaction rates, integration time, and possibly initial conditions from a target set of species concentrations - a problem that has received relatively little attention in the literature. The proposed approach is demonstrated on chemical systems with reversible and irreversible kinetics, spanning 2 to 20 differential equations, 3 to 20 chemical species, and 3 to 25 reaction rate parameters. Relative root mean squared errors produced by the proposed emulators range from 10^-5 for lower-dimensional systems to 10^-4 and 10^-3 for an air pollution model and a hydrogen-air reaction system, respectively. Manifolds of non-identifiable reaction rates recovered by the proposed approach can be analytically verified for simple systems and are consistent with local identifiability analysis in higher dimensions.

[LG-101] Constrained Extreme Gradient Boosting for Adapting Reduced-Order Models

链接: https://arxiv.org/abs/2605.04130
作者: Melika Baghi,Xiao Liu,Kamran Paynabar
类目: Machine Learning (cs.LG)
*备注: Preprint. Under review. 4 numerical examples

点击查看摘要

Abstract:High-fidelity simulations, such as computational fluid dynamics and finite element analysis, are essential for modeling complex engineering systems but are often prohibitively expensive for tasks including parametric studies, optimization, and real-time control. Projection-based reduced-order models (ROMs) alleviate this cost by projecting the governing dynamics onto low-dimensional subspaces. However, their performance can deteriorate under parameter variation, motivating the need for adaptive basis construction. In this work, we propose a constrained ensemble learning framework, termed Constrained Extreme Gradient Boosting (cXGBoost), for predicting Proper Orthogonal Decomposition (POD) bases as functions of system parameters. The approach leverages a geometric representation of subspaces on the Grassmann manifold, which are mapped to a Euclidean space to enable efficient regression using gradient boosting trees. A norm constraint is imposed during training to ensure the validity of the inverse mapping and preserve the geometric structure of the predicted subspaces. The proposed method is evaluated on four numerical examples, including fluid dynamics and wave propagation problems, demonstrating its ability to accurately predict parameter-dependent bases while maintaining robustness across nonlinear regimes. These results highlight the potential of combining geometric learning with constrained ensemble methods for scalable and reliable reduced-order modeling of high-dimensional parametric systems.

[LG-102] Simultaneous CNN Approximation on Manifolds with Applications to Boundary Value Problems

链接: https://arxiv.org/abs/2605.04126
作者: Hanfei Zhou,Lei Shi
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper develops convolutional neural network (CNN) methods for simultaneous approximation and elliptic boundary value problems on compact Riemannian manifolds. We establish simultaneous Sobolev approximation results for single- and multichannel CNNs, showing that manifold functions and their derivatives can be approximated with rates governed by the intrinsic dimension and the smoothness gap, rather than by the ambient dimension, thereby mitigating the curse of dimensionality. Building on this approximation theory, we propose a physics-informed CNN (PICNN) framework specially designed for boundary value problems. The main numerical issue is a boundary-norm mismatch: standard PINNs usually impose boundary data through low-order, often L2-type, penalties, whereas elliptic stability requires Sobolev trace control. We address this by introducing a spectral boundary loss based on the boundary Laplace-Beltrami operator, which represents trace errors as weighted frequency energies and relates truncation error to boundary eigenvalue decay. This avoids smooth auxiliary constructions required by exact boundary enforcement and singular double integrals arising in Sobolev-Slobodeckij penalties, while enabling implementations based on Fast Fourier Transforms (FFTs) or precomputed spectral bases on structured boundaries. Numerical experiments demonstrate improved accuracy, convergence, and stability over standard PINNs.

[LG-103] Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering

链接: https://arxiv.org/abs/2605.04116
作者: Tejas Kulkarni,Antti Koskela,Laith Zumot
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: this https URL

点击查看摘要

Abstract:We show that remotely hosted applications employing in-context learning, when augmented with a retrieval function to select in-context examples, can be vulnerable to membership-inference attacks even when the service provider and users are separate parties. We propose two black-box membership inference attacks that exploit query text prefixes to distinguish member from non-member inputs. The first attack uses a reference model to estimate an otherwise unavailable loss metric. The second attack improves upon it by eliminating the reference model and instead computing a membership statistic through a simple but novel weighted-averaging scheme. Our comprehensive empirical evaluations consider a stricter case in which the adversary has a paraphrased version of the text in the queries and show that our attacks can exhibit stronger resilience to paraphrasing and outperform three prior attacks in many cases with a small number of prefixes. We also adapt an existing ensemble prompting defense to our setting, demonstrating that it substantially mitigates the privacy leakage caused by our second attack.

[LG-104] Enhancing the interpretability of spatially variable N2O model predictions with soft sensors during wastewater treatment

链接: https://arxiv.org/abs/2605.04082
作者: Mohammad Raeisi Gahrouei,Pedram Ramin,Vincenzo A. Riggio,Carlos Domingo-Felez
类目: Machine Learning (cs.LG)
*备注: 1 Graphical abstract, 2 Tables, 7 Figures

点击查看摘要

Abstract:Model-based solutions for nitrous oxide (N2O) emissions from wastewater treatment plants (WWTP) are informed by operational datasets designed to control nutrient levels in liquid waste, coupled with dedicated campaigns for N2O measurements. We analysed how machine learning (ML) models predict disturbances to WWT operation and spatially variable N2O emissions. A real dataset was investigated to validate the modelling framework from N2O emissions predicted by four ML models (R2 = 0.79 - 0.89). Monitoring campaigns for N2O were simulated with a plant-wide mechanistic model to include additional sensors, site-level N2O datasets, and wastewater disturbances (n = 16). ML models were highly accurate (0.97 ± 0.02, n = 80), but the feature importance depended on the model, the scenario and the N2O measurement scale (reactor vs. WWTP). We argue that N2O soft sensor model predictions are limited to the measuring location and the methodological uncertainty of the dataset, which affect the interpretability of the model. Lastly, the analysis of the mechanistic model structure exposed interactions between autotrophic and heterotrophic pathways over nitric oxide which can overestimate aerobic nitrite production and bias the N2O pathway contributions.

[LG-105] A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay

链接: https://arxiv.org/abs/2605.04055
作者: JiangBo Zhao,ZhaoXin Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing MetaAdamW, a new optimizer that integrates a self-attention mechanism to dynamically modulate per-group learning rates and weight decay. The modulation factors are produced by a lightweight Transformer encoder that operates on statistical features (gradient norms, momentum norms, correlations) extracted from each parameter group. To train the attention module, we introduce a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. A key novel contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities that directly scale the regularization terms, enabling domain knowledge to guide automatic loss balancing. Extensive experiments on five diverse tasks (time series forecasting on ETT, language modeling on WikiText-2, machine translation on Multi30k, image classification on CIFAR-10, and sentiment analysis on IMDB) demonstrate that MetaAdamW consistently outperforms the standard AdamW baseline in terms of validation loss, accuracy, or perplexity. Depending on the task, MetaAdamW either reduces overall training time (by up to 17.11%) or improves performance (by up to 11.08%) while introducing only moderate overhead; in some cases, it can also mitigate issues of insufficient convergence caused by premature early stopping. Ablation studies validate the effectiveness of each component, including feature versions, grouping strategies, and the proposed priority-injected uncertainty weighting.
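For intuition about the per-group statistics the abstract mentions (gradient norms, momentum norms, correlations), a minimal feature extractor might look like the sketch below. The Transformer encoder that consumes these features is omitted, and every name and shape here is a hypothetical illustration, not MetaAdamW's implementation.

```python
import numpy as np

def group_features(grads, momenta, eps=1e-12):
    """One feature row per parameter group: [|g|, |m|, cosine(g, m)]."""
    feats = []
    for g, m in zip(grads, momenta):
        cos = float(np.dot(g, m) / (np.linalg.norm(g) * np.linalg.norm(m) + eps))
        feats.append([np.linalg.norm(g), np.linalg.norm(m), cos])
    return np.array(feats)

rng = np.random.default_rng(1)
grads = [rng.normal(size=100), rng.normal(size=50)]            # two parameter groups
momenta = [0.9 * g + 0.1 * rng.normal(size=g.size) for g in grads]

F = group_features(grads, momenta)   # shape (num_groups, 3), one token per group
print(F.shape)
```

A sequence of such rows, one token per parameter group, is the natural input for a small attention module that emits per-group learning-rate and weight-decay multipliers.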

[LG-106] Endogenous Regime Switching Driven by Scalar-Irreducible Learning Dynamics

链接: https://arxiv.org/abs/2605.04054
作者: Sheng Ran
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Achieving endogenous regime switching is crucial for the emergence of autonomous intelligence, yet remains a central challenge for existing machine learning frameworks, where such transitions are typically externally imposed. In this work, we introduce a classification that distinguishes scalar-reducible dynamics, which can be expressed as gradient flows driven by a scalar objective, from scalar-irreducible dynamics that cannot be reduced to such a form. While most existing machine learning systems operate within the scalar-reducible class, we demonstrate that scalar-irreducible dynamics naturally enable internally generated regime switching through feedback between fast dynamical variables and slow structural adaptation. Using a minimal dynamical model, we illustrate how this mechanism produces sustained endogenous regime transitions without external scheduling. Our results suggest a new dynamical paradigm for regime exploration and provide a potential route toward autonomous learning systems whose adaptive behavior is organized internally rather than externally prescribed.

[LG-107] Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

链接: https://arxiv.org/abs/2605.05189
作者: Nicholas Barnfield,Juno Kim,Eshaan Nichani,Jason D. Lee,Yue M. Lu
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:How many key-value associations can a d \times d linear memory store? We show that the answer depends not only on the d^2 degrees of freedom in the memory matrix, but also on the retrieval criterion. In an isotropic Gaussian model for the stored pairs, we show that top-1 retrieval, where every signal must beat its largest distractor, requires the logarithmic model-size scale d^2 \asymp n \log n . We prove that the correlation matrix memory construction, which stores associations by superposing key-target outer products, achieves this scale through a sharp phase transition, and that the same scaling is necessary for any linear memory. Thus the logarithm is the intrinsic extreme-value price of winner-take-all decoding. We next consider listwise retrieval, where the correct target need not be the unique top-scoring item but should remain among the strongest candidates. To formalize this regime, we propose the Tail-Average Margin (TAM), a convex upper-tail criterion that certifies inclusion of the correct target in a controlled candidate list. Under this listwise retrieval criterion, the capacity follows the quadratic scale d^2 \asymp n . At load n/d^2 \to \alpha , we develop an exact asymptotic theory for the TAM empirical-risk minimizer through a two-parameter scalar variational principle. The theory has a rich phenomenology: in the ridgeless limit it yields a closed-form critical load separating satisfiable and unsatisfiable phases, and it predicts the limiting laws of true scores, competitor scores, margins, and percentile profiles. Finally, a small-tail extrapolation further leads to the conjectural sharp top-1 threshold d^2 \sim 2n \log n .
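The correlation-matrix-memory construction named in the abstract is simple enough to sketch directly: store associations by superposing key-target outer products, then retrieve with a matrix-vector product and an argmax. The dimensions below are chosen so that d^2 comfortably exceeds n log n, the regime where the abstract predicts top-1 retrieval succeeds; the one-hot targets are an assumption for easy scoring, not part of the paper's Gaussian model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 20                             # d^2 = 4096 >> n log n ~ 60

K = rng.normal(size=(n, d)) / np.sqrt(d)  # keys with approximately unit norm
T = np.eye(d)[:n]                         # one-hot targets

W = T.T @ K                               # sum_i t_i k_i^T, the d x d memory

# Top-1 retrieval: score every target against every key, take the argmax
scores = K @ W.T                          # scores[j, m] = (W k_j)_m
recalled = scores.argmax(axis=1)
accuracy = (recalled == np.arange(n)).mean()
print(accuracy)
```

In this underloaded regime the correct score concentrates near 1 while each distractor cross-term has standard deviation about 1/sqrt(d), so recall is essentially perfect; pushing n up at fixed d walks the memory toward the phase transition the paper characterizes.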

[LG-108] Proximal Projection for Doubly Sparse Regularized Models

链接: https://arxiv.org/abs/2605.05093
作者: Jia Wei He,R. Ayesha Ali,Gerarda Darlington
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Regularization is often used in high-dimensional regression settings to generate a sparse model, which can save tremendous computing resources and identify predictors that are most strongly associated with the response. When the predictors can be represented by a Gaussian graphical model, the structure of the predictor graph can be exploited during regularization. Our proposed model exploits this underlying predictor graph structure by decomposing the estimated coefficient vector into a sum of latent variables that correspond to the sum of each node contribution to the coefficient vector. Regularization is then performed on the latent variables rather than on the coefficient vector directly. We use a penalty function that permits a clear user-defined trade-off between the L1 and L2 penalties and propose a novel proximal projection during optimization. Further, our implementation computes the projection operator for the intersection of selected groups, which conserves more computing resources compared to predictor duplication methods, especially for high-dimensional data. Through simulation, we evaluate the performance of our approach under different graph structures and node counts, and present results on real-world data. Results suggest that our method exhibits stable performance relative to other singly or doubly sparse graphical regression models.
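For intuition, the proximal operator of a penalty trading off L1 and L2 terms on a single group, t * (alpha * ||x||_1 + (1 - alpha) * ||x||_2), has the well-known closed form below (elementwise soft-threshold, then group-norm shrinkage), as in the sparse group lasso. The paper's actual projection handles intersections of selected groups and may differ in detail; this sketch only illustrates the single-group trade-off.

```python
import numpy as np

def prox_l1_l2(x, t, alpha):
    """Prox of t*(alpha*||x||_1 + (1-alpha)*||x||_2) for one group."""
    z = np.sign(x) * np.maximum(np.abs(x) - t * alpha, 0.0)  # soft-threshold
    nz = np.linalg.norm(z)
    if nz <= t * (1.0 - alpha):
        return np.zeros_like(z)                              # whole group set to zero
    return (1.0 - t * (1.0 - alpha) / nz) * z                # group-norm shrinkage

x = np.array([3.0, -0.5, 1.5, 0.1])
z = prox_l1_l2(x, t=1.0, alpha=0.5)
print(z)   # small entries zeroed by L1, survivors shrunk by the group term
```

Setting alpha near 1 recovers pure lasso behavior (elementwise sparsity), while alpha near 0 favors zeroing or shrinking the group as a whole, which is exactly the user-defined trade-off the abstract describes.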

[LG-109] Hypergraph Generation via Structured Stochastic Diffusion

链接: https://arxiv.org/abs/2605.05024
作者: Christopher Nemeth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Hypergraphs model higher-order interactions, but realistic hypergraph generation remains difficult because incidence, hyperedge-size heterogeneity, and overlap structure are not faithfully captured by pairwise reductions. We propose HEDGE, a generative model defined directly on relaxed incidence matrices via a structured stochastic diffusion. The forward process combines a hypergraph-specific two-sided heat operator with an Ornstein–Uhlenbeck component, preserving structure-aware noising near the data while yielding an explicit Gaussian terminal law. Conditional on an observed hypergraph, this forward process is linear-Gaussian, so conditional means, covariances, scores, and reverse-drift targets are available in closed form. We therefore learn a permutation-equivariant state-only reverse-drift field in incidence space by regressing onto exact conditional targets, and generate samples by simulating a learned reverse-time SDE from the Gaussian base law. We establish exactness in the ideal state-only setting together with finite-horizon stability guarantees, and empirically show improved hypergraph generation quality relative to strong baselines.
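Because the forward process conditional on the data is linear-Gaussian, its conditional moments are available in closed form. The scalar sketch below shows this for a plain Ornstein–Uhlenbeck component only; the paper's operator also includes the hypergraph-specific two-sided heat term, which is omitted here, and all constants are illustrative assumptions.

```python
import numpy as np

# Scalar OU process dx = -theta * x dt + sigma dW, started at x0.
theta, sigma, t, x0 = 1.0, 1.0, 0.5, 2.0

mean_t = x0 * np.exp(-theta * t)                               # E[x_t | x_0]
var_t = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * t))  # Var[x_t | x_0]

# Exact sampling of x_t from the closed-form conditional law:
# no SDE discretization error, which is what makes regression targets exact.
rng = np.random.default_rng(0)
samples = mean_t + np.sqrt(var_t) * rng.normal(size=100_000)
print(samples.mean(), samples.var())
```

The same closed-form structure is what lets the paper regress a reverse-drift network onto exact conditional targets instead of noisy Monte Carlo estimates.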

[LG-110] Scalable inference of spatial regions and temporal signatures from time series

链接: https://arxiv.org/abs/2605.05008
作者: Jiayu Weng,Alec Kirkley
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Regionalization aims to partition a spatial domain into contiguous regions that share similar characteristics, enabling more effective spatial analysis, policy making, and resource management. Existing approaches for spatial regionalization typically rely on static spatial snapshots rather than evolving time series. Meanwhile, most time series clustering methods ignore spatial structure or enforce spatial continuity through ad hoc regularization, constraining the number of inferred regions a priori either explicitly or implicitly. Utilizing the minimum description length principle from information theory, here we propose an efficient and fully nonparametric framework for the regionalization of spatial time series. Our method jointly infers a spatial partition along with a set of representative time series archetypes (“drivers”) that best compress a spatiotemporal dataset, with a runtime log-linear in the number of time series. We demonstrate that this method can accurately recover planted regional structure and drivers in synthetic time series, and can extract meaningful structural regularities in large-scale empirical air quality and vegetation index records. Our method provides a principled and scalable framework for spatially contiguous partitioning, allowing interpretable temporal patterns and homogeneous regions to emerge directly from the data itself.

[LG-111] Jacobian-Velocity Bounds for Deployment Risk Under Covariate Drift

链接: https://arxiv.org/abs/2605.04932
作者: Jonathan R. Landers
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract:We study long-horizon deployment of a frozen predictor under dynamic covariate shift. A time-domain Poincaré inequality reduces temporal risk volatility to derivative energy, and a Jacobian-velocity theorem identifies directional tangent energy along the deployment path as the governing quantity under explicit along-path regularity and domination assumptions. Under low-rank drift, that quantity reduces to directional Jacobian energy in the drift subspace, motivating drift-aligned tangent regularization (DTR) and a matched monitoring proxy. Rather than smoothing the network isotropically, DTR penalizes sensitivity only along estimated drift directions. We validate the theorem-to-method pipeline in four experiments: a synthetic benchmark for the time-domain inequality, a controlled synthetic comparison against isotropic Jacobian regularization, and two frozen-deployment studies on the UCI Air Quality and Tetouan power-consumption datasets. DTR reduces risk volatility and directional gain in the controlled low-rank regime, beats isotropic smoothing there, and gives validation-selected deployment gains on both real datasets when the Air Quality drift subspace is estimated from target-orthogonal sensor motion. Moderate drift-subspace misspecification is tolerable while orthogonal misspecification largely removes the benefit.
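The core idea of penalizing sensitivity only along estimated drift directions can be illustrated with a finite-difference directional derivative. This is a toy stand-in, not the paper's DTR implementation; the linear model and the drift directions below are assumptions chosen to make the contrast visible.

```python
import numpy as np

def dtr_penalty(f, X, v, eps=1e-4):
    """Mean squared directional derivative of f along unit direction v."""
    v = v / np.linalg.norm(v)
    d = (f(X + eps * v) - f(X - eps * v)) / (2 * eps)  # central difference
    return float(np.mean(d**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
f = lambda Z: Z @ np.array([1.0, 0.0, 0.0])   # model sensitive only to axis 0

p_drift = dtr_penalty(f, X, v=np.array([1.0, 0.0, 0.0]))  # drift hits sensitive axis
p_orth = dtr_penalty(f, X, v=np.array([0.0, 1.0, 0.0]))   # orthogonal drift
print(p_drift, p_orth)
```

Unlike an isotropic Jacobian penalty, this quantity is zero for drift directions the model ignores, so regularization pressure lands only where covariate motion can actually change predictions, which mirrors the paper's contrast between DTR and isotropic smoothing.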

[LG-112] Neural Discovery of Strichartz Extremizers

链接: https://arxiv.org/abs/2605.04918
作者: Nicolás Valenzuela,Ricardo Freire,Claudio Muñoz
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 38 pages, 26 figures

点击查看摘要

Abstract:Strichartz inequalities are a cornerstone of the modern theory of dispersive PDEs, but their extremizers are known explicitly only in a handful of sharp cases. The non-convexity of the underlying functional makes the problem hard, and to our knowledge no systematic numerical attack has been attempted. We propose a simple neural-network-based pipeline that searches for extremizers as critical points of the Strichartz ratio, and apply it in three settings. First, on the Schrödinger group we recover the Gaussian extremizers of Foschi and Hundertmark–Zharnitsky in dimensions d=1,2 to within 10^{-3} relative error, with no analytical prior. Second, on 59 further admissible pairs in d=1 where the answer is conjectural, the method consistently finds Gaussians, supporting the conjecture that Gaussians are the universal extremizers in the admissible range. Third, on the critical Airy–Strichartz inequality at \gamma=1/q , where existence is open, the optimization does not converge to any L^2 profile: instead, the iterates organize themselves as mKdV breathers B(0,\cdot;\alpha,1,0,0) with growing internal frequency \alpha , and the discovered ratio approaches the Frank–Sabin universal lower bound \widetilde{A}_{q,r} from below with a power-law gap \sim \alpha^{-0.9} . We confirm the same picture with an independent Hermite-basis ansatz. We propose a precise conjecture: the supremum equals \widetilde{A}_{q,r} and is approached, but not attained, along the breather family. The pipeline thus serves both as a validator on known cases and as a discovery tool when no extremizer exists.

[LG-113] PAIR-CI: Calibrated Conditional Independence Testing for Causal Discovery with Incomplete Data

链接: https://arxiv.org/abs/2605.04838
作者: Thomas S. Robinson,Ranjit Lall
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The standard constraint-based paradigm for causal discovery with incomplete data – impute first, test second – is frequently miscalibrated: any consistent conditional independence (CI) test rejects a true null with probability approaching 1 when imputation error induces spurious conditional dependence. We introduce PAIR-CI, a nonparametric CI test that restores calibration by integrating multiple imputation directly into the inferential procedure via a paired permutation design. PAIR-CI compares cross-validated models that include and exclude the candidate variable while receiving the same imputed conditioning set, forcing imputation error to cancel in their loss difference rather than contaminate the test statistic. A provably consistent variance estimator jointly accounts for uncertainty arising from cross-validation and multiple imputation – to our knowledge, the first formal unification of these two inferential frameworks. In simulations, existing imputation-based CI tests exhibit false positive rates of 28–45% when data are missing not at random (MNAR), whereas PAIR-CI averages below the nominal 5% level across data-generating processes and missingness mechanisms. These gains are largest in nonlinear settings and grow with causal graph size: when integrated into the PC algorithm, PAIR-CI reduces structural Hamming distance by 8% on 10-variable nonlinear graphs, 15% on 30-variable equivalents, and up to 44% on the 56-variable HAILFINDER network, with stable performance in all settings.
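The paired-design intuition can be shown with a toy sign-flip permutation test on per-fold loss differences: because both models see the same imputed conditioning sets, shared imputation error cancels in the difference. This is a deliberate simplification of PAIR-CI (no cross-validation or multiple-imputation variance estimator), with synthetic losses and constants that are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-fold losses for models with / without the candidate variable,
# both evaluated on the same imputed conditioning sets.
loss_with = rng.normal(1.0, 0.1, size=50)
loss_without = loss_with + 0.05 + 0.02 * rng.normal(size=50)  # candidate helps

d = loss_without - loss_with        # paired differences: shared error cancels
obs = d.mean()

# Sign-flip permutation null: under conditional independence the candidate
# carries no signal, so each paired difference is symmetric about zero.
perm = np.array([(d * rng.choice([-1.0, 1.0], size=d.size)).mean()
                 for _ in range(2000)])
p_value = float((np.abs(perm) >= abs(obs)).mean())
print(p_value)                      # small p-value: reject conditional independence
```

If the imputation error were instead added to only one model's inputs, it would not cancel in `d`, which is the miscalibration mechanism the abstract attributes to impute-first, test-second pipelines.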

[LG-114] Generative Quantum-inspired Kolmogorov-Arnold Eigensolver

链接: https://arxiv.org/abs/2605.04604
作者: Yu-Cheng Lin,Yu-Chao Hsu,I-Shan Tsai,Chun-Hua Lin,Kuo-Chung Peng,Jiun-Cheng Jiang,Yun-Yuan Wang,Tzung-Chi Huang,Tai-Yue Li,Kuan-Cheng Chen,Samuel Yen-Chi Chen,Nan-Yow Chen
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-performance computing (HPC) is increasingly important for scalable quantum chemistry workflows that couple classical generative models, quantum circuit simulation, and selected configuration interaction postprocessing. We present the generative quantum-inspired Kolmogorov-Arnold eigensolver (GQKAE), a parameter-efficient extension of the generative quantum eigensolver (GQE) for quantum chemistry. GQKAE replaces the parameter-heavy feed-forward network components in GPT-style generative eigensolvers with hybrid quantum-inspired Kolmogorov-Arnold network modules, forming a compact HQKANsformer backbone. The method preserves autoregressive operator selection and the quantum-selected configuration interaction evaluation pipeline, while using single-qubit DatA Re-Uploading ActivatioN modules to provide expressive nonlinear mappings. Numerical benchmarks on H4, N2, LiH, C2H6, H2O, and the H2O dimer show that GQKAE achieves chemical accuracy comparable to the GPT-based GQE architecture, while reducing trainable parameters and memory by approximately 66% and improving wall-time performance. For strongly correlated systems such as N2 and LiH, GQKAE also improves convergence behavior and final energy errors. These results indicate that quantum-inspired Kolmogorov-Arnold networks can reduce classical-side overhead while preserving circuit-generation quality, offering a scalable route for HPC-quantum co-design on near-term quantum platforms.

[LG-115] Multiscale Euclidean Network Trajectories: Second-Moment Geometry Attribution and Change Points

链接: https://arxiv.org/abs/2605.04589
作者: Haruka Ezoe,Ryohei Hisano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:A central challenge in dynamic network analysis is to represent temporal evolution in a way that is both geometrically meaningful and statistically identifiable. One approach embeds a sequence of network snapshots as trajectories in a Euclidean space and relates these trajectories to node embeddings. In multilayer and unfolded spectral constructions, however, node embeddings and their underlying latent positions are identifiable only up to general linear transformations. Although this ambiguity preserves edge probabilities, it can distort geometry and invalidate distance based temporal comparisons at both the trajectory and node-levels. We develop Multiscale Euclidean Network Trajectories (MENT), a framework for multiscale temporal trajectories based on second-moment geometry. By imposing an isotropic normalization on the anchor latent positions, we reduce the relevant ambiguity to orthogonal transformations and prevent distortion of the second-moment geometry. In this canonical representation, we define a trace variation distance and mode-wise variation distances along orthogonal directions, and use multidimensional scaling to obtain low-dimensional trajectories of time points at both global and mode-wise levels. The resulting trajectories support interpretation and inference. They admit mode-wise decompositions, support attribution of global and mode-wise temporal changes to nodes, and enable change point detection through 1D trajectories. We prove consistency of the proposed unfolded spectral embedding and of the induced temporal trajectories. Experiments on two synthetic and two real dynamic networks illustrate stable and interpretable recovery of temporal structure and show strong performance against existing change point detection baselines. 

[LG-116] Causal discovery under mean independence and linearity

链接: https://arxiv.org/abs/2605.04381
作者: Geert Mesters,Alvaro Ribot,Anna Seigal,Piotr Zwiernik
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:Causal discovery methods such as LiNGAM identify causal structure from observational data by assuming mutually independent disturbances. This assumption is fragile: shared volatility, common scale effects, or other forms of dependence can cause the methods to recover the wrong causal order, even with infinite data. We introduce the Linear Mean-Independent Acyclic Model (LiMIAM), which replaces full independence with weaker one-sided mean-independence restrictions on the disturbances. Under finite-order consequences of these restrictions, source nodes are generically identifiable, and hence a compatible causal order can be recovered recursively. Our proof is constructive and leads to DirectLiMIAM, a sequential residual-based algorithm for causal discovery under dependent noise. In simulations with mean-independent but dependent disturbances, DirectLiMIAM outperforms LiNGAM methods. A large-scale empirical application to the oil market highlights the implausibility of the independence assumption and the ability of DirectLiMIAM to recover a realistic causal ordering, from policy to production and from prices to inflation.

[LG-117] Perturbation is All You Need for Extrapolating Language Models

链接: https://arxiv.org/abs/2605.04344
作者: Zetai Cen,Jin Zhu,Xinwei Shen,Chengchun Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 44 pages

点击查看摘要

Abstract:We introduce a simple yet powerful framework for training large language models. In contrast to the standard autoregressive next-token prediction based on an exact prefix, we propose a perturbation-based procedure that first transforms the prefix into a semantic neighbor and then conditions on this perturbed variant for next-token prediction. This yields a hierarchical model with a pre-post-additive noise structure. Within this framework, we develop a rigorous theory of extrapolability, namely, the capacity of a model class to make reliable predictions for token sequences that lie outside the empirical support of the training corpus. We evaluate the finite-sample performance of the proposed procedure using both synthetic and real-world language data. Results show that the proposed method consistently improves out-of-support prediction while maintaining competitive in-support performance, demonstrating that perturbation offers a practical route to language modeling.

[LG-118] A foundation model of vision audition and language for in-silico neuroscience

链接: https://arxiv.org/abs/2605.04326
作者: Stéphane d’Ascoli,Jérémy Rapin,Yohann Benchetrit,Teon Brooks,Katelyn Begany,Joséphine Raugel,Hubert Banville,Jean-Rémi King
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predicting human brain activity in a variety of naturalistic and experimental conditions. Leveraging a unified dataset of over 1,000 hours of fMRI across 720 subjects, we demonstrate that our model accurately predicts high-resolution brain responses for novel stimuli, tasks and subjects, superseding traditional linear encoding models, delivering several-fold improvements in accuracy. Critically, TRIBE v2 enables in silico experimentation: tested on seminal visual and neuro-linguistic paradigms, it recovers a variety of results established by decades of empirical research. Finally, by extracting interpretable latent features, TRIBE v2 reveals the fine-grained topography of multisensory integration. These results establish artificial intelligence as a unifying framework for exploring the functional organization of the human brain.

[LG-119] Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

链接: https://arxiv.org/abs/2605.04269
作者: Sharan Sahu,Abir Sarkar,Cameron J. Hogan,Martin T. Wells
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 39 pages, 11 figures, 1 table

点击查看摘要

Abstract:We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general L-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by \beta_1 , and a preconditioner perturbation governed by \beta_2 . We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise–drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit (\beta_1, \beta_2, \epsilon) -dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.

[LG-120] Entropic Riemannian Neural Optimal Transport

Link: https://arxiv.org/abs/2605.04255
Authors: Alessandro Micheli, Silvia Sapora, Anthea Monod, Samir Bhatt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Click to view abstract

Abstract:Many machine learning problems involve data supported on curved spaces such as spheres, rotation groups, hyperbolic spaces, and general Riemannian manifolds, where Euclidean geometry can distort distances, averages, and the resulting optimal transport (OT) problem. Existing manifold OT methods have pursued amortized out-of-sample maps, while entropic regularization has made discrete OT more scalable, but these advantages have remained largely disjoint. We propose Entropic Riemannian Neural Optimal Transport (Entropic RNOT), a unified framework that combines intrinsic entropic OT with amortized out-of-sample evaluation on Riemannian manifolds. Our method learns a single target-side Schrödinger potential through a neural pullback parameterization, recovers the induced Gibbs coupling, and uses the resulting conditional laws to construct intrinsic transport surrogates. These include barycentric projections on Cartan-Hadamard manifolds and heat-smoothed conditional surrogates on stochastically complete manifolds, the latter turning possibly atomic target laws into absolutely continuous ones. For fixed regularization $\varepsilon > 0$, we prove that the proposed hypothesis class recovers the entropic optimal coupling in strong probabilistic metrics. As consequences, barycentric surrogates converge in $L^2$, while heat-smoothed surrogates are stable at fixed heat time and asymptotically unbiased as the heat time vanishes. The guarantees hold for compactly supported data on possibly noncompact manifolds. Empirically, our method matches or improves over Euclidean, tangent-space, and log-Euclidean baselines on benchmarks over $\mathbb{S}^2$, $\mathrm{SO}(3)$, $\mathrm{SPD}(3)$, $\mathrm{SE}(3)$, and $\mathbb{H}^2$, scales favorably relative to discrete manifold Sinkhorn, and in a protein-ligand docking application, refines poses on $\mathrm{SE}(3)$ without retraining or per-instance optimization.
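The intrinsic entropic OT problem that Entropic RNOT amortizes can be illustrated in the discrete case with plain Sinkhorn iterations under a squared geodesic cost on the sphere. This brute-force sketch is an assumption-laden illustration (sample sizes, $\varepsilon$, and iteration count are arbitrary), not the paper's neural method.

```python
import numpy as np

# Discrete entropic OT on S^2 with a squared great-circle cost, solved by
# Sinkhorn fixed-point iterations. Illustrative parameters throughout.
rng = np.random.default_rng(2)
n, eps = 40, 0.5

def sphere_sample(k):
    z = rng.standard_normal((k, 3))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

X, Y = sphere_sample(n), sphere_sample(n)
C = np.arccos(np.clip(X @ Y.T, -1.0, 1.0)) ** 2  # squared geodesic distance

a = np.full(n, 1.0 / n)                # uniform source marginal
b = np.full(n, 1.0 / n)                # uniform target marginal
K = np.exp(-C / eps)                   # Gibbs kernel
u = np.ones(n)
for _ in range(2000):                  # Sinkhorn iterations
    v = b / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]        # entropic optimal coupling

print(P.sum())
```

The neural approach in the paper replaces this per-instance iteration with a single learned target-side potential, so new samples can be coupled without re-solving.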

[LG-121] Globally Solving Unbalanced Optimal Transport and Density Control for Gaussian Distributions

Link: https://arxiv.org/abs/2605.04246
Authors: Haruto Nakashima, Siddhartha Ganguly, Kenji Kashima
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*Comments: 28 pages; submitted to a journal

Click to view abstract

Abstract:In this article, we study unbalanced optimal transport (UOT) and establish a control-theoretic dynamical extension, which we call the unbalanced density control (UDC), for a class of Gaussian reference measures. In the static setting, we consider UOT with quadratic transport cost and Kullback–Leibler penalties on the marginals relative to prescribed Gaussian measures. We show that the infinite-dimensional variational problem admits an exact Gaussian reduction, yielding a finite-dimensional optimization over masses, means, and covariances, together with a closed-form expression for the optimal transported mass. We then formulate UDC for discrete-time linear systems, where the initial and terminal state measures are imposed softly through KL penalties and the intermediate evolution is governed by controlled linear dynamics with quadratic control cost. For this problem, we prove that any feasible solution can be replaced, without loss of optimality, by a Gaussian initial measure and an affine-Gaussian control policy. This leads to an exact finite-dimensional reformulation and, after a standard covariance-steering lifting, to an SDP-based optimization for fixed mass, again coupled with a closed-form mass update. We further establish existence of optimal solutions and identify a sufficient condition under which the affine-Gaussian UDC policy is deterministic. These results provide globally optimal solution methods for both Gaussian UOT and Gaussian UDC. Finally, we illustrate our results with several numerical examples.
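For context, the balanced Gaussian OT problem that this unbalanced formulation generalizes has a well-known closed form: the squared 2-Wasserstein (Bures-Wasserstein) distance between Gaussians. The sketch below computes it for a toy pair and is illustrative background only, not the paper's UOT or UDC solution.

```python
import numpy as np

def spd_sqrt(A):
    # Symmetric square root of an SPD matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(m1, S1, m2, S2):
    # Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    # |m1 - m2|^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2}).
    rS2 = spd_sqrt(S2)
    cross = spd_sqrt(rS2 @ S1 @ rS2)
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([3.0, 0.0]), np.diag([4.0, 1.0])
print(w2_gaussian(m1, S1, m2, S2))  # → 10.0
```

The unbalanced problem in the paper augments this transport cost with KL penalties on the marginals and additionally optimizes over mass, which the authors show still reduces to finite-dimensional optimization over masses, means, and covariances.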

[LG-122] Heterogeneous Ordinal Structure Learning with Bayesian Nonparametric Complexity Discovery

Link: https://arxiv.org/abs/2605.04191
Authors: Amir Rafe, Subasish Das
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Public attitudes toward artificial intelligence are heterogeneous, ordinally measured, and poorly captured by any single dependency graph. Existing ordinal structure learners assume a shared directed acyclic graph (DAG) across all respondents; recent heterogeneous ordinal graphical-model approaches focus on subgroup discovery rather than confirmatory cluster-specific DAG estimation; and latent profile analyses discard dependency structure entirely. We introduce a heterogeneous ordinal structure-learning framework combining monotone Gaussian score embedding, Bayesian nonparametric (BNP) complexity discovery via a truncated stick-breaking prior, and confirmatory fixed-$K$ estimation with cluster-specific sparse DAG learning. The key methodological insight is a discovery-to-confirmation workflow: the nonparametric stage calibrates plausible archetype complexity, while inner-validated confirmatory refitting yields stable, interpretable structural estimates. On Wave 152 (W152) of the 2024 Pew American Trends Panel AI attitudes survey (N = 4,788; 8 ordinal items), the confirmatory $K^* = 5$ model reduces holdout transformed-score mean squared error (MSE) by 25.8% over a single-graph baseline and by 4.6% over mixture-only clustering. A controlled tiered semi-synthetic benchmark calibrated to W152 structure validates recovery across difficulty regimes and transparently reveals failure modes under stress conditions.
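The truncated stick-breaking construction behind the BNP complexity-discovery stage can be sketched in a few lines; the concentration parameter, truncation level, and the threshold used to count effective components below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Truncated stick-breaking draw of Dirichlet-process-style mixture weights.
rng = np.random.default_rng(3)
alpha, K_max = 1.0, 20                 # concentration and truncation level

v = rng.beta(1.0, alpha, size=K_max)   # stick-breaking proportions
v[-1] = 1.0                            # close the truncated stick
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
w = v * remaining                      # mixture weights, summing to 1

k_eff = int(np.sum(w > 0.01))          # components with non-negligible weight
print(w.sum(), k_eff)
```

Repeated draws of `k_eff` over posterior samples give a distribution over plausible archetype counts, which the confirmatory fixed-$K$ stage then refits.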

[LG-123] Tree-Conditioned Edit Flows for Ancestral Sequence Reconstruction

Link: https://arxiv.org/abs/2605.04119
Authors: Emil Sharafutdinov, Ingemar André
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*Comments: 9 pages of main text, 3 figures, 3 tables, and 1 algorithm. This version is a preliminary preprint

Click to view abstract

Abstract:Ancestral sequence reconstruction (ASR) aims to infer extinct protein sequences at internal nodes of a phylogenetic tree. Classical ASR methods are typically based on continuous-time Markov substitution models, but they treat sites largely independently and handle insertions and deletions only weakly or not at all. We introduce a tree-conditioned edit-flow model for variable-length ASR. Given two descendant sequences and their branch distances to a shared ancestor, the model reconstructs the ancestor through paired bidirectional edit trajectories constrained to agree on a common ancestral state. On a benchmark of experimentally evolved sequences with only context-independent substitutions, the model does not match the accuracy of the best classical method, yet still achieves reasonable performance despite being trained on natural sequences that include insertions, deletions, and substitutions. On a benchmark of natural homologous sequences with abundant insertions and deletions, the model most accurately localizes inferred evolutionary change.
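As background on the edit operations (substitutions, insertions, deletions) that the model reasons over, here is a plain Levenshtein-distance sketch on toy protein fragments; beyond sharing the operation set, it is unrelated to the paper's learned continuous edit trajectories.

```python
# Levenshtein distance over substitutions, insertions, and deletions,
# computed with a rolling dynamic-programming row.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Toy descendant fragments: one substitution apart, and one deletion apart.
print(levenshtein("MKTAYIAK", "MKTGYIAK"), levenshtein("MKTAYIAK", "MKTAYIK"))  # → 1 1
```

Classical substitution-only ASR handles the first kind of difference well; the second kind, a length change, is exactly what edit-flow models are designed to capture.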

[LG-124] BOOOM: Loss-Function-Agnostic Black-Box Optimization over Orthonormal Manifolds for Machine Learning and Statistical Inference

Link: https://arxiv.org/abs/2605.04087
Authors: Beomchang Kim, Subhrajyoty Roy, Priyam Das
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Optimization over the Stiefel manifold $\mathrm{St}(p,d)$, the set of $p \times d$ column-orthonormal matrices, is fundamental in statistics, machine learning, and scientific computing, yet remains challenging in the presence of non-convex, non-smooth, or black-box objectives. Existing methods largely rely on either convex relaxations or gradient-based Riemannian optimization, limiting applicability in derivative-free and highly multimodal settings. We propose BOOOM (Black-box Optimization Over Orthonormal Manifolds), a general-purpose framework for loss-function-agnostic optimization on $\mathrm{St}(p,d)$. The key idea is a global Givens rotation-based parametrization that maps the manifold to an unconstrained Euclidean angle space while preserving feasibility exactly. Building on this representation, BOOOM employs a structured, parallelizable, derivative-free search based on Recursive Modified Pattern Search, enabling systematic exploration through plane-wise rotations without requiring gradient information and facilitating escape from poor local optima. We establish a unified theoretical framework showing equivalence between angle-space and manifold optimization, transfer of stationarity, and global convergence in probability under mild conditions. Empirical results across diverse problems, including heterogeneous quadratic optimization, low-rank and sparse matrix decomposition, independent component analysis, and orthogonal joint diagonalization, among other widely studied settings, demonstrate strong performance relative to state-of-the-art methods, particularly in non-smooth and highly multimodal regimes. We further illustrate its practical utility through a novel supervised PCA formulation applied to metabolomics data in colorectal cancer.
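The Givens-rotation idea, mapping an unconstrained angle vector to a feasible point of the Stiefel manifold, can be sketched as follows; the exact parametrization and rotation ordering used in BOOOM may differ, so treat this as an assumption-laden illustration of the general technique.

```python
import numpy as np

def givens(p, i, j, theta):
    # Rotation by theta in the (i, j) coordinate plane of R^p.
    G = np.eye(p)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def angles_to_stiefel(thetas, p, d):
    # Compose all p(p-1)/2 plane rotations, then keep the first d columns.
    Q = np.eye(p)
    k = 0
    for i in range(p - 1):
        for j in range(i + 1, p):
            Q = Q @ givens(p, i, j, thetas[k])
            k += 1
    return Q[:, :d]                    # orthonormal columns by construction

p, d = 4, 2
rng = np.random.default_rng(4)
thetas = rng.uniform(-np.pi, np.pi, size=p * (p - 1) // 2)
X = angles_to_stiefel(thetas, p, d)
print(np.allclose(X.T @ X, np.eye(d)))  # → True
```

Because every angle vector yields a feasible matrix, any derivative-free search over the Euclidean angle space, such as the pattern search named above, stays on the manifold automatically.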

[LG-125] A Consistency-Centric Approach to Set-Based Optimization with Multiple Models of Unranked Fidelity

Link: https://arxiv.org/abs/2605.04051
Authors: Danielle F. Morey, Giulia Pedrielli, Cherry Y. Wakayama, Zelda B. Zabinsky
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 26 pages, 6 figures

Click to view abstract

Abstract:In complex real-world settings, optimization is challenged by the presence of diverse models of differing fidelity. In many optimization problems, a single model is treated as the most accurate representation of the underlying system, while other models are evaluated primarily by their agreement with this presumed most accurate model. Yet in real-world applications, model accuracy is rarely known a priori and assuming a single most accurate model can be misleading. This paper addresses this gap by proposing a flexible set-based optimization methodology called Set-Based Optimization with Multiple Models (S-BOMM) that works with multiple models without the assumption of a most accurate high-fidelity model. Unlike traditional optimization approaches that focus on finding an optimal solution according to the high-fidelity model, our methodology utilizes consistency between models to identify good solutions across multiple models. A probabilistic analysis of the consistency method is provided that bounds the likelihood of the methodology producing correct or incorrect results. Empirical results demonstrate the effectiveness of S-BOMM on test problems. By focusing on the consistency across models rather than relying on a single best solution, this set-based approach offers a practical alternative to optimization problems where multiple models must be considered without assuming a single most accurate high-fidelity model.
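The consistency idea can be mimicked on a toy problem: rather than trusting one model as "highest fidelity," keep the candidate solutions that several unranked models jointly rank highly. The models, noise level, and agreement rule below are illustrative assumptions, not the paper's S-BOMM procedure.

```python
import numpy as np

# Toy set-based selection: retain candidates nominated by all models.
rng = np.random.default_rng(5)
n_cand, n_models, top_k = 50, 3, 10

x = np.linspace(0.0, 1.0, n_cand)
true_f = (x - 0.3) ** 2                        # unknown ground truth
# Three equally plausible, unranked models: independent noisy views.
models = [true_f + 0.02 * rng.standard_normal(n_cand) for _ in range(n_models)]

# Each model nominates its top-k candidates; keep those all models agree on.
top_sets = [set(np.argsort(f)[:top_k]) for f in models]
consistent = set.intersection(*top_sets)
print(sorted(int(i) for i in consistent))
```

The output is a set of good candidates rather than a single optimum, matching the set-based framing; the paper's probabilistic analysis bounds how often such a consistency rule keeps correct solutions.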

Attachments

Click to download today's full paper list