This post lists the latest papers retrieved from Arxiv.org on 2026-04-09. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each morning.
Tip: if a day's list is not updated on time, either Arxiv published no new papers that day or the update script failed; failures are fixed the same day whenever possible.
Table of Contents
Overview (2026-04-09)
699 papers are updated today, including:
- Natural Language Processing: 139 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 224 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 128 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 181 papers (Machine Learning (cs.LG))
- Multiagent Systems: 19 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 22 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 38 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] Intertemporal Demand Allocation for Inventory Control in Online Marketplaces
[Quick Read]: This paper studies how a digital platform can shape sellers' inventory decisions through intertemporal demand allocation without directly controlling stock. The core challenge is to design order-allocation policies that, without intervening in sellers' replenishment, steer sellers toward the platform's fulfillment service (fulfill-by-platform, FBP) and toward desirable overall inventory levels. The key is an informational mechanism: by changing the predictability of each seller's sales stream, the platform alters sellers' safety-stock needs, and thereby their fulfillment choice and inventory holdings, even when average demand shares remain unchanged. Among nondiscriminatory allocation policies, uniform order splitting minimizes forecast uncertainty, while any higher level of uncertainty can be implemented with simple low-memory rules, provided sellers cannot infer aggregate demand from their own sales histories. This mechanism lets the platform steer seller behavior at low informational cost, trading off adoption of platform fulfillment against the inventory held by adopters.
Link: https://arxiv.org/abs/2604.07312
Authors: Rene Caldentey, Tong Xie
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Online marketplaces increasingly do more than simply match buyers and sellers: they route orders across competing sellers and, in many categories, offer ancillary fulfillment services that make seller inventory a source of platform revenue. We investigate how a platform can use intertemporal demand allocation to influence sellers’ inventory choices without directly controlling stock. We develop a model in which the platform observes aggregate demand, allocates orders across sellers over time, and sellers choose between two fulfillment options, fulfill-by-merchant (FBM) and fulfill-by-platform (FBP), while replenishing inventory under state-dependent base-stock policies. The key mechanism we study is informational: by changing the predictability of each seller’s sales stream, the platform changes sellers’ safety-stock needs even when average demand shares remain unchanged. We focus on nondiscriminatory allocation policies that give sellers the same demand share and forecast risk. Within this class, uniform splitting minimizes forecast uncertainty, whereas any higher level of uncertainty can be implemented using simple low-memory allocation rules. Moreover, increasing uncertainty above the uniform benchmark requires routing rules that prevent sellers from inferring aggregate demand from their own sales histories. These results reduce the platform’s problem to choosing a level of forecast uncertainty that trades off adoption of platform fulfillment against the inventory held by adopters. Our analysis identifies demand allocation as a powerful operational and informational design lever in digital marketplaces.
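The informational mechanism above, in which uniform splitting minimizes each seller's forecast uncertainty relative to randomized routing, can be illustrated with a minimal sketch. This is not the paper's model; it is a stylized two-seller example assuming Poisson total demand, where Bernoulli per-order routing (by Poisson thinning) doubles the per-seller demand variance relative to an even split, and safety stock scales with the standard deviation:

```python
import math

def per_seller_std(lam, policy):
    # Total demand per period ~ Poisson(lam), shared by 2 symmetric sellers.
    # 'uniform': each period's orders split evenly -> seller sees D/2,
    #            Var = Var(D)/4 = lam/4.
    # 'bernoulli': each order routed independently w.p. 1/2 -> by Poisson
    #            thinning, seller demand ~ Poisson(lam/2), Var = lam/2.
    if policy == "uniform":
        return math.sqrt(lam / 4.0)
    if policy == "bernoulli":
        return math.sqrt(lam / 2.0)
    raise ValueError(policy)

def safety_stock(lam, policy, z=1.645):
    # Newsvendor-style safety stock at a ~95% service level:
    # proportional to the std. dev. of the seller's demand stream.
    return z * per_seller_std(lam, policy)
```

Both policies give each seller the same average demand share; only the predictability, and hence the safety stock an adopter must hold, differs.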
[MA-1] Designing for Accountable Agents: a Viewpoint
[Quick Read]: This paper tackles the fact that "accountability" in AI systems remains vaguely defined and hard to operationalize, in particular how agents in a multi-agent system (MAS), human and non-human alike, can take part in and carry out accountability processes. Its core contributions: first, a coherent definition of accountability distilled from a cross-disciplinary literature survey; second, a realistic example illustrating the value of accountability mechanisms in a MAS; and third, a set of key research challenges for building accountable agents together with sketches of initial solutions, laying theoretical and practical groundwork for autonomous elements in open socio-technical systems to participate in accountability processes. The key novelty is extending accountability from human organizational governance to the interactions among agents, arguing that agents' capacity to hold one another accountable is an important path to trustworthy AI.
Link: https://arxiv.org/abs/2604.07204
Authors: Stephen Cranefield, Nir Oren
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:AI systems are becoming increasingly complex, ubiquitous and autonomous, leading to increasing concerns about their impacts on individuals and society. In response, researchers have begun investigating how to ensure that the methods underlying AI decision-making are transparent and their decisions are explainable to people and conformant to human values and ethical principles. As part of this research thrust, the need for accountability within AI systems has been noted, but this notion has proven elusive to define; we aim to address this issue in the current paper. Unlike much recent work, we do not address accountability within the human organisational processes of developing and deploying AI; rather we consider what it would mean for the agents within a multi-agent system (MAS), potentially including human agents, to be accountable to other agents or to have others accountable to them. In this work, we make the following contributions: we provide an in-depth survey of existing work on accountability in multiple disciplines, seeking to identify a coherent definition of the concept; we give a realistic example of a multi-agent system application domain that illustrates the benefits of enabling agents to follow accountability processes, and we identify a set of research challenges for the MAS community in building accountable agents, sketching out some initial solutions to these, thereby laying out a road-map for future research. Our focus is on laying the groundwork to enable autonomous elements within open socio-technical systems to take part in accountability processes.
[MA-2] ReDAct: Uncertainty-Aware Deferral for LLM Agents
[Quick Read]: This paper addresses hallucination-induced wrong decisions when large language models (LLMs) are used for sequential decision-making, where a single mistake can irreversibly derail a long-horizon trajectory. Larger models hallucinate less but cost significantly more per token. The key is the ReDAct (Reason-Defer-Act) framework, which deploys two LLMs of different sizes: a small, cheap model decides by default, and a large, more reliable but expensive model is invoked only when the small model's predictive uncertainty exceeds a calibrated threshold. Experiments in text-based embodied environments such as ALFWorld and MiniGrid show that deferring only about 15% of decisions to the large model matches the quality of using it exclusively while substantially reducing overall inference cost.
Link: https://arxiv.org/abs/2604.07036
Authors: Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov, Timothy Baldwin, Preslav Nakov, Roman Vashurin, Maxim Panov
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Ocapital; National University of Science and Technology (NUST) MISIS; AXXX; Ivannikov Institute for System Programming of the Russian Academy of Sciences; Trusted AI Center, RAS
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.
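The deferral rule at the heart of ReDAct can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's implementation: it uses Shannon entropy of the small model's action distribution as the uncertainty measure and a fixed threshold standing in for the calibrated one; the model stubs and their signatures are hypothetical:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def redact_decide(obs, small_model, large_model, threshold):
    """Uncertainty-gated deferral: act with the cheap model by default,
    defer to the large model only when its uncertainty is too high."""
    action, probs = small_model(obs)     # cheap call on every step
    if entropy(probs) > threshold:
        return large_model(obs), True    # deferred to the expensive model
    return action, False                 # handled by the small model
```

In a sequential setting this gate runs at every decision step, so the fraction of steps whose entropy exceeds the threshold directly determines the share of expensive calls (about 15% in the paper's experiments).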
[MA-3] Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
[Quick Read]: This paper addresses the fact that traditional game-theoretic models of adversarial domains such as law, diplomacy, and negotiation abstract away language as the medium of strategic interaction, i.e., the problem of treating language itself as a modelable strategic action space. The key is the Strategic Courtroom Framework, a multi-agent simulation in which prosecution and defense teams of LLM agents conditioned on nine interpretable traits engage in round-based legal argumentation; the traits are organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. A reinforcement-learning-based Trait Orchestrator dynamically generates defense trait combinations adapted to the case and the opposing team, discovering adaptive persuasion strategies that outperform static, human-designed ones.
Link: https://arxiv.org/abs/2604.07028
Authors: Philipp D. Siedler
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7,000 simulated trials using DeepSeek-R1 and Gemini 2.5 Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.
[MA-4] Agent City: Constitutional Governance for Autonomous Agent Economies via Separation of Power
[Quick Read]: This paper targets the "Logic Monopoly" that arises when autonomous AI agents collaborate across organizational boundaries on the open internet: once agents owned by different human principals coordinate at scale, the collective's behavior becomes unobservable, unauditable, and ungovernable. The key is the Separation of Power (SoP) model, deployed on a public blockchain, which breaks the monopoly through three structural separations: agents legislate operational rules as smart contracts (legislative), deterministic software executes within those contracts (executive), and humans adjudicate every agent action through a complete ownership chain (judicial). In this architecture the smart contracts are the law itself, ensuring every agent action traces back to a responsible principal; this "alignment-through-accountability" lets collective behavior converge on human intent without top-down control.
Link: https://arxiv.org/abs/2604.07007
Authors: Anbang Ruan, Xing Zhang
Affiliations: NetX Foundation
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 111 pages, 11 figures, 19 tables, 67 references. Pre-registered experimental design
Abstract:Autonomous AI agents are beginning to operate across organizational boundaries on the open internet – discovering, transacting with, and delegating to agents owned by other parties without centralized oversight. When agents from different human principals collaborate at scale, the collective becomes opaque: no single human can observe, audit, or govern the emergent behavior. We term this the Logic Monopoly – the agent society’s unchecked monopoly over the entire logic chain from planning through execution to evaluation. We propose the Separation of Power (SoP) model, a constitutional governance architecture deployed on public blockchain that breaks this monopoly through three structural separations: agents legislate operational rules as smart contracts, deterministic software executes within those contracts, and humans adjudicate through a complete ownership chain binding every agent to a responsible principal. In this architecture, smart contracts are the law itself – the actual legislative output that agents produce and that governs their behavior. We instantiate SoP in AgentCity on an EVM-compatible layer-2 blockchain (L2) with a three-tier contract hierarchy (foundational, meta, and operational). The core thesis is alignment-through-accountability: if each agent is aligned with its human owner through the accountability chain, then the collective converges on behavior aligned with human intent – without top-down rules. A pre-registered experiment evaluates this thesis in a commons production economy – where agents share a finite resource pool and collaboratively produce value – at 50-1,000 agent scale.
[MA-5] Differentiable Environment-Trajectory Co-Optimization for Safe Multi-Agent Navigation
[Quick Read]: This paper addresses the fact that multi-agent navigation typically treats the environment as fixed, ignoring its effect on agent performance, rather than treating the environment configuration as an optimization variable for safer, more efficient navigation. The key is a bi-level framework: the lower-level sub-problem optimizes agent trajectories to minimize navigation cost via interior point methods, and the upper-level sub-problem optimizes environment configurations to maximize navigation safety via gradient ascent. The core novelty is analytically coupling the two levels through the KKT conditions and the Implicit Function Theorem, which yields gradients of agent trajectories with respect to environment parameters and makes the whole bi-level structure differentiable; the paper also proposes a new metric quantifying navigation safety, proved valid via measure theory, to drive the upper-level optimization. Experiments on safety-critical scenarios ranging from warehouse logistics to urban transportation show that optimized environments provide navigation guidance, significantly improving agents' safety and efficiency.
Link: https://arxiv.org/abs/2604.06972
Authors: Zhan Gao, Gabriele Fadini, Stelian Coros, Amanda Prorok
Affiliations: University of Cambridge; Zurich University of Applied Sciences; ETH Zurich
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:
Abstract:The environment plays a critical role in multi-agent navigation by imposing spatial constraints, rules, and limitations that agents must navigate around. Traditional approaches treat the environment as fixed, without exploring its impact on agents’ performance. This work considers environment configurations as decision variables, alongside agent actions, to jointly achieve safe navigation. We formulate a bi-level problem, where the lower-level sub-problem optimizes agent trajectories that minimize navigation cost and the upper-level sub-problem optimizes environment configurations that maximize navigation safety. We develop a differentiable optimization method that iteratively solves the lower-level sub-problem with interior point methods and the upper-level sub-problem with gradient ascent. A key challenge lies in analytically coupling these two levels. We address this by leveraging KKT conditions and the Implicit Function Theorem to compute gradients of agent trajectories w.r.t. environment parameters, enabling differentiation throughout the bi-level structure. Moreover, we propose a novel metric that quantifies navigation safety as a criterion for the upper-level environment optimization, and prove its validity through measure theory. Our experiments validate the effectiveness of the proposed framework in a variety of safety-critical navigation scenarios, ranging from warehouse logistics to urban transportation. The results demonstrate that optimized environments provide navigation guidance, improving both agents’ safety and efficiency.
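The coupling trick, differentiating through the lower-level argmin via the Implicit Function Theorem, can be shown on a toy problem. The following is a minimal sketch with an assumed scalar quadratic lower level (not the paper's trajectory optimizer): the stationarity condition gives a closed-form sensitivity dx*/dθ, which the chain rule then pushes into an upper-level gradient step on the environment parameter θ:

```python
def lower_argmin(theta):
    # Lower level: x*(theta) = argmin_x 0.5*(x - theta)**2 + 0.5*x**2.
    # Stationarity (x - theta) + x = 0 gives the closed form x* = theta / 2.
    return theta / 2.0

def dx_dtheta():
    # Implicit Function Theorem on the stationarity condition:
    # dx*/dtheta = -(d^2 f / dx dtheta) / (d^2 f / dx^2) = -(-1) / 2 = 0.5
    return 0.5

def upper_grad(theta):
    # Upper level: minimize g(x*) = (x* - 1)**2, i.e. drive the induced
    # trajectory x* toward 1; chain rule through the lower level.
    x = lower_argmin(theta)
    return 2.0 * (x - 1.0) * dx_dtheta()

theta = 0.0
for _ in range(200):
    theta -= 0.5 * upper_grad(theta)  # gradient step on the environment parameter
# theta converges to 2.0, where x*(theta) = 1 and the upper objective is zero
```

In the paper the same structure appears with vector-valued trajectories, the KKT system in place of plain stationarity, and gradient ascent on a safety metric instead of this toy quadratic.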
[MA-6] Exploiting Aggregate Programming in a Multi-Robot Service Prototype
[Quick Read]: This paper addresses the challenge of coordinating multi-robot systems in complex application settings such as healthcare, exploration, and rescue missions, a challenge rooted in the dynamics of robots and their physical environments plus the inherent complexity of distributed systems. The key is Aggregate Programming (AP), a programming paradigm based on proximity communication that is well suited to designing resilient distributed coordination and is supported by practical frameworks. The paper presents a prototype multi-robot service system whose coordination software is built with AP, validated both in simulation and with field tests in a university library.
Link: https://arxiv.org/abs/2604.06876
Authors: Giorgio Audrito (Dipartimento di Informatica, Universita’ di Torino), Andrea Basso (MITO Technology), Daniele Bortoluzzi (Dipartimento di Informatica, Universita’ di Torino), Ferruccio Damiani (Dipartimento di Informatica, Universita’ di Torino), Giordano Scarso (Dipartimento di Informatica, Universita’ di Torino), Gianluca Torta (Dipartimento di Informatica, Universita’ di Torino)
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: In Proceedings PLACES 2026, arXiv:2604.05737
Abstract:Multi-robot systems are becoming increasingly relevant within diverse application domains, such as healthcare, exploration, and rescue missions. However, building such systems is still a significant challenge, since it adds the complexities of the physical nature of robots and their environments to those inherent in coordinating any distributed (multi-agent) system. Aggregate Programming (AP) has recently emerged as a promising approach to engineering resilient, distributed systems with proximity-based communication, and is notably supported by practical frameworks. In this paper we present a prototype of a multi-robot service system, which adopts AP for the design and implementation of its coordination software. The prototype has been validated both with simulations, and with tests in a University library.
[MA-7] Generating Local Shields for Decentralised Partially Observable Markov Decision Processes
[Quick Read]: This paper addresses safety in multi-agent systems under partial observability: because each agent chooses actions from local observations alone, safety of the joint behavior is hard to guarantee. Existing shielding techniques either assume a shared global state or use memoryless local filters that cannot exploit interaction history. The key is a shield process algebra with guarded choice and recursion for specifying safe global behaviour in communication-free Dec-POMDP settings: a shield process is compiled into a global Mealy machine acting as a joint-action filter and then projected to per-agent local Mealy machines, whose states are belief-style subsets consistent with each agent's observations and whose outputs are per-agent safe action sets, yielding safe control from local observations without central coordination.
Link: https://arxiv.org/abs/2604.06873
Authors: Haoran Yang (University of Oxford), Nobuko Yoshida (University of Oxford)
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: In Proceedings PLACES 2026, arXiv:2604.05737
Abstract:Multi-agent systems under partial observation often struggle to maintain safety because each agent’s locally chosen action does not, in general, determine the resulting joint action. Shielding addresses this by filtering actions based on the current state, but most existing techniques either assume access to a shared centralised global state or employ memoryless local filters that cannot consider interaction history. We introduce a shield process algebra with guarded choice and recursion for specifying safe global behaviour in communication-free Dec-POMDP settings. From a shield process, we compile a process automaton, then a global Mealy machine as a safe joint-action filter, and finally project it to local Mealy machines whose states are belief-style subsets of the global Mealy machine states consistent with each agent’s observations, and which output per-agent safe action sets. We implement the pipeline in Rust and integrate PRISM, the Probabilistic Symbolic Model Checker, to compute best- and worst-case safety probabilities independently of the agents’ policies. A multi-agent path-finding case study demonstrates how different shield processes substantially reduce collisions compared to the unshielded baseline while exhibiting varying levels of expressiveness and conservatism. Journal reference: EPTCS 444, 2026, pp. 1-10. Related DOI: https://doi.org/10.4204/EPTCS.444.1
[MA-8] Event-Triggered Adaptive Consensus for Multi-Robot Task Allocation
[Quick Read]: This paper addresses efficient, adaptive task allocation for heterogeneous robot swarms in dynamic, communication-constrained environments: how to minimize communication overhead while preserving mission effectiveness and improving robustness to execution and agent failures. The key is an event-triggered organization framework built on an adaptive consensus mechanism, in which communication for task negotiation is initiated only in response to significant events, eliminating redundant interactions; the swarm self-regulates its coordination pace according to the level of environmental conflict, and individual-agent resilience is managed through a robust Behavior-Tree-based execution model, yielding a cooperative architecture that is efficient, highly adaptive, and fault-tolerant.
Link: https://arxiv.org/abs/2604.06813
Authors: Fidel Aznar, Mar Pujol, Álvaro Díez
Affiliations: University of Alicante
Subjects: Multiagent Systems (cs.MA)
Comments: 40 pages, 18 figures. Published in Computer Communications under CC-BY license
Abstract:Coordinating robotic swarms in dynamic and communication-constrained environments remains a fundamental challenge for collective intelligence. This paper presents a novel framework for event-triggered organization, designed to achieve highly efficient and adaptive task allocation in a heterogeneous robotic swarm. Our approach is based on an adaptive consensus mechanism where communication for task negotiation is initiated only in response to significant events, eliminating unnecessary interactions. Furthermore, the swarm self-regulates its coordination pace based on the level of environmental conflict, and individual agent resilience is managed through a robust execution model based on Behavior Trees. This integrated architecture results in a collective system that is not only effective but also remarkably efficient and adaptive. We validate our framework through extensive simulations, benchmarking its performance against a range of coordination strategies. These include a non-communicating reactive behavior, a simple information-sharing protocol, the baseline Consensus-Based Bundle Algorithm (CBBA), and a periodic CBBA variant integrated within a Behavior Tree architecture. Furthermore, our approach is compared with Clustering-CBBA (C-CBBA), a state-of-the-art algorithm recognized for communication-efficient task management in heterogeneous clusters. Experimental results demonstrate that the proposed method significantly reduces network overhead when compared to communication-heavy strategies. Moreover, it maintains top-tier mission effectiveness regarding the number of tasks completed, showcasing high efficiency and practicality. The framework also exhibits significant resilience to both action execution and permanent agent failures, highlighting the effectiveness of our event-triggered model for designing adaptive and resource-efficient robotic swarms for complex scenarios.
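The core idea, communicate only on significant events rather than every tick, can be sketched with a simple drift trigger. This is an illustrative assumption, not the paper's negotiation protocol: an agent re-broadcasts (triggering a negotiation round) only when its locally observed conflict level has drifted past a threshold since its last broadcast:

```python
def event_triggered_broadcasts(signal, threshold):
    """Return the time steps at which an agent broadcasts, triggering
    a (re)negotiation only when the observed value has drifted by more
    than `threshold` since the last broadcast."""
    broadcasts, last = [], None
    for t, x in enumerate(signal):
        if last is None or abs(x - last) > threshold:
            broadcasts.append(t)
            last = x
    return broadcasts

# A drifting local conflict estimate: a periodic protocol would send
# one message per step (7 total); the event-triggered rule sends 3.
conflict_level = [0.0, 0.1, 0.15, 1.2, 1.25, 1.3, 2.6]
```

The threshold plays the role of the framework's "significant event" test; lowering it recovers near-periodic communication, raising it trades responsiveness for network overhead.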
[MA-9] From Perception to Autonomous Computational Modeling: A Multi-Agent Approach
[Quick Read]: This paper addresses the low degree of automation in engineering mechanics workflows, in particular the end-to-end path from raw perceptual data (such as a photograph) to a code-compliant engineering report, which today requires manual intervention at many stages. The key is a solver-agnostic framework in which coordinated large language model (LLM) agents act as conditioned operators on a shared context space, with quality gates enabling conditional iteration between pipeline layers. The framework further introduces mathematical machinery, interval bounds, probability densities, and fuzzy membership functions, to extract engineering information from uncertain perceptual data, and resolves the ambiguity of what "conservative" means when different limit states are governed by opposing parameter trends via task-dependent conservatism, achieving a fully automated finite element pipeline: geometry extraction, material inference, discretisation, solving, uncertainty quantification, and compliance assessment, without manual correction.
Link: https://arxiv.org/abs/2604.06788
Authors: Daniel N. Wilke
Affiliations: University of the Witwatersrand
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 32 pages, 8 figures, 5 tables
Abstract:We present a solver-agnostic framework in which coordinated large language model (LLM) agents autonomously execute the complete computational mechanics workflow, from perceptual data of an engineering component through geometry extraction, material inference, discretisation, solver execution, uncertainty quantification, and code-compliant assessment, to an engineering report with actionable recommendations. Agents are formalised as conditioned operators on a shared context space with quality gates that introduce conditional iteration between pipeline layers. We introduce a mathematical framework for extracting engineering information from perceptual data under uncertainty using interval bounds, probability densities, and fuzzy membership functions, and introduce task-dependent conservatism to resolve the ambiguity of what `conservative’ means when different limit states are governed by opposing parameter trends. The framework is demonstrated through a finite element analysis pipeline applied to a photograph of a steel L-bracket, producing a 171,504-node tetrahedral mesh, seven analyses across three boundary condition hypotheses, and a code-compliant assessment revealing structural failure with a quantified redesign. All results are presented as generated in the first autonomous iteration without manual correction, reinforcing that a professional engineer must review and sign off on any such analysis.
[MA-10] Logical Robots: Declarative Multi-Agent Programming in Logica AAMAS
[Quick Read]: This paper addresses the difficulty of unifying low-level reactive control and high-level planning in multi-agent robotic systems. The key is the Logical Robots platform, which specifies robot behavior declaratively in the logic programming language Logica: logical predicates map observations from simulated radar arrays and shared memory to desired motor outputs, letting low-level reactive control and high-level planning coexist in a single programming environment and providing a coherent framework for exploring multi-agent robot behavior.
Link: https://arxiv.org/abs/2604.06629
Authors: Evgeny Skvortsov, Yilin Xia, Ojaswa Garg, Shawn Bowers, Bertram Ludäscher
Affiliations: Google LLC; University of Illinois Urbana-Champaign; Gonzaga University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 25-29, 2026. Paphos, Cyprus
Abstract:We present Logical Robots, an interactive multi-agent simulation platform where autonomous robot behavior is specified declaratively in the logic programming language Logica. Robot behavior is defined by logical predicates that map observations from simulated radar arrays and shared memory to desired motor outputs. This approach allows low-level reactive control and high-level planning to coexist within a single programming environment, providing a coherent framework for exploring multi-agent robot behavior.
[MA-11] Asynchronous Distributed Bandit Submodular Maximization under Heterogeneous Communication Delays
[Quick Read]: This paper addresses decision-making in asynchronous distributed multi-agent bandit submodular maximization, motivated by information-gathering tasks in unknown environments under heterogeneous communication delays and mismatched local clocks. Existing approaches assume homogeneous delays and a synchronous global clock, which limits their performance in genuinely asynchronous settings. The key is an asynchronous coordination algorithm that tolerates mismatched local clocks and uneven delays and comes with a provable approximation guarantee: the suboptimality gap depends explicitly on communication delays and clock mismatches, as well as on the topology of each neighborhood, enabling efficient distributed cooperation via one-hop-neighbor messages only.
Link: https://arxiv.org/abs/2604.06430
Authors: Pranjal Sharma, Zirui Xu, Vasileios Tzoumas
Affiliations: University of Michigan
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Comments:
Abstract:We study asynchronous distributed decision-making for scalable multi-agent bandit submodular maximization. We are motivated by distributed information-gathering tasks in unknown environments and under heterogeneous inter-agent communication delays. To enable scalability despite limited communication delays, existing approaches restrict each agent to coordinate only with its one-hop neighbors. But these approaches assume homogeneous communication delays among the agents and a synchronous global clock. In practice, however, delays are heterogeneous, and agents operate with mismatched local clocks. That is, each agent does not receive information from all neighbors at the same time, compromising decision-making. In this paper, we provide an asynchronous coordination algorithm to overcome the challenges. We establish a provable approximation guarantee against the optimal synchronized centralized solution, where the suboptimality gap explicitly depends on communication delays and clock mismatches. The bounds also depend on the topology of each neighborhood, capturing the effect of distributed decision-making via one-hop-neighborhood messages only. We validate the approach through numerical simulations on multi-camera area monitoring.
[MA-12] Qualixar OS: A Universal Operating System for AI Agent Orchestration
[Quick Read]: This paper addresses coordinating and managing multi-agent systems in heterogeneous settings, in particular the lack of a unified runtime spanning multiple large language model (LLM) providers, agent frameworks, and communication transports. Existing options such as AIOS (a kernel-level AI operating system) or single-framework tools like AutoGen fall short of true universality and flexibility. The key is Qualixar OS, presented as the first application-layer operating system for universal AI agent orchestration, built from seven core contributions: execution semantics for 12 multi-agent topologies; Forge, an LLM-driven team design engine with historical strategy memory; three-layer hybrid model routing combining Q-learning, five strategies, and a Bayesian POMDP with dynamic multi-provider discovery; a consensus-based judge pipeline with Goodhart detection, JSD drift monitoring, and alignment-trilemma navigation; four-layer content attribution with HMAC signing and steganographic watermarks; universal compatibility with the MCP and A2A protocols via the Claw Bridge; and a 24-tab production dashboard with a visual workflow builder and skill marketplace. The system is validated by 2,821 test cases and achieves 100% accuracy at a mean cost of 0.000039 per task on a custom 20-task evaluation suite.
Link: https://arxiv.org/abs/2604.06392
Authors: Varun Pratap Bhardwaj
Affiliations: Accenture
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 20 pages, 7 figures, 8 tables. Zenodo DOI: https://doi.org/10.5281/zenodo.19454219
Abstract:We present Qualixar OS, the first application-layer operating system for universal AI agent orchestration. Unlike kernel-level approaches (AIOS) or single-framework tools (AutoGen, CrewAI), Qualixar OS provides a complete runtime for heterogeneous multi-agent systems spanning 10 LLM providers, 8+ agent frameworks, and 7 transports. We contribute: (1) execution semantics for 12 multi-agent topologies including grid, forest, mesh, and maker patterns; (2) Forge, an LLM-driven team design engine with historical strategy memory; (3) three-layer model routing combining Q-learning, five strategies, and Bayesian POMDP with dynamic multi-provider discovery; (4) a consensus-based judge pipeline with Goodhart detection, JSD drift monitoring, and alignment trilemma navigation; (5) four-layer content attribution with HMAC signing and steganographic watermarks; (6) universal compatibility via the Claw Bridge supporting MCP and A2A protocols with a 25-command Universal Command Protocol; (7) a 24-tab production dashboard with visual workflow builder and skill marketplace. Qualixar OS is validated by 2,821 test cases across 217 event types and 8 quality modules. On a custom 20-task evaluation suite, the system achieves 100% accuracy at a mean cost of 0.000039 per task. Source-available under the Elastic License 2.0.
[MA-13] AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agents
[Quick Read]: This paper addresses client-side resource allocation for generative AI agents: how to allocate model choice, local tools, and remote API budget across pipeline stages subject to application-specific quality, cost, and latency constraints. Prior work focuses on server-side efficiency (caching, speculative execution, traffic scheduling), but as users compose agents from local tools, remote APIs, and diverse models, client-side optimization becomes equally important and cannot be decided by server-side systems alone. The key is AgentOpt, the first framework-agnostic Python package for client-side agent optimization, whose core is efficient search over model assignments in multi-step agent pipelines to find the most cost-effective assignment under a limited evaluation budget; experiments show that algorithms such as Arm Elimination recover near-optimal accuracy while reducing evaluation cost by 24-67% relative to brute-force search.
Link: https://arxiv.org/abs/2604.06296
Authors: Wenyue Hua, Sripad Karne, Qian Xie, Armaan Agrawal, Nikos Pagonas, Kostis Kaffes, Tianyi Peng
Affiliations: Microsoft Research, AI Frontiers; Cornell University; Columbia University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 21 pages, 1 figure
Abstract:AI agents are increasingly deployed in real-world applications, including systems such as Manus, OpenClaw, and coding agents. Existing research has primarily focused on server-side efficiency, proposing methods such as caching, speculative execution, traffic scheduling, and load balancing to reduce the cost of serving agentic workloads. However, as users increasingly construct agents by composing local tools, remote APIs, and diverse models, an equally important optimization problem arises on the client side. Client-side optimization asks how developers should allocate the resources available to them, including model choice, local tools, and API budget across pipeline stages, subject to application-specific quality, cost, and latency constraints. Because these objectives depend on the task and deployment setting, they cannot be determined by server-side systems alone. We introduce AgentOpt, the first framework-agnostic Python package for client-side agent optimization. We first study model selection, a high-impact optimization lever in multi-step agent pipelines. Given a pipeline and a small evaluation set, the goal is to find the most cost-effective assignment of models to pipeline roles. This problem is consequential in practice: at matched accuracy, the cost gap between the best and worst model combinations can reach 13–32× in our experiments. To efficiently explore the exponentially growing combination space, AgentOpt implements eight search algorithms, including Arm Elimination, Epsilon-LUCB, Threshold Successive Elimination, and Bayesian Optimization. Across four benchmarks, Arm Elimination recovers near-optimal accuracy while reducing evaluation budget by 24–67% relative to brute-force search on three of four tasks. Code and benchmark results available at this https URL.
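The arm-elimination idea, treat each model-to-role assignment as a bandit arm and successively drop poor performers instead of exhaustively evaluating all of them, can be sketched as follows. This is a simplified, noiseless illustration, not AgentOpt's implementation (which uses confidence bounds over noisy evaluations); the model names and scores are hypothetical:

```python
import math

def arm_elimination(arms, evaluate, rounds=3, keep=0.5):
    """Successive elimination over candidate model assignments:
    score every surviving arm on a small batch, keep only the top
    `keep` fraction, and repeat until one arm remains."""
    survivors = list(arms)
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        ranked = sorted(survivors, key=evaluate, reverse=True)
        survivors = ranked[:max(1, math.ceil(len(ranked) * keep))]
    return survivors[0]

# Hypothetical two-stage pipeline: each arm assigns a model per role,
# and evaluate() returns accuracy on a small evaluation set.
scores = {
    ("small", "small"): 0.55,
    ("small", "large"): 0.82,
    ("large", "small"): 0.61,
    ("large", "large"): 0.80,
}
best = arm_elimination(list(scores), scores.get)
```

Because the worst half of the combination space is never re-evaluated after the first round, the evaluation budget shrinks roughly geometrically, which is the source of the 24-67% savings the paper reports for this family of methods.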
[MA-14] The Art of Building Verifiers for Computer Use Agents
[Quick Read]: This paper addresses the reliability of verifying computer use agent (CUA) trajectories, a critical prerequisite for trustworthy evaluation and training signals. Its Universal Verifier is designed around four key principles: 1) rubrics with meaningful, non-overlapping criteria to reduce noise; 2) separating process and outcome rewards, capturing cases where an agent follows the right steps but is blocked externally or succeeds through an unexpected path; 3) distinguishing controllable from uncontrollable failures, scored with a cascading-error-free strategy for fine-grained failure attribution; and 4) a divide-and-conquer context management scheme that attends to all screenshots in a trajectory, improving reliability on long task horizons. On CUAVerifierBench, the verifier agrees with human annotators as often as humans agree with each other, while driving false positive rates to near zero (versus >=45% for WebVoyager and >=22% for WebJudge). The authors stress these gains come from the cumulative effect of the design choices rather than any single technique.
Link: https://arxiv.org/abs/2604.06240
Authors: Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah
Affiliations: Microsoft Research; Browserbase
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Verifying the success of computer use agent (CUA) trajectories is a critical challenge: without reliable verification, neither evaluation nor training signal can be trusted. In this paper, we present lessons learned from building a best-in-class verifier for web tasks we call the Universal Verifier. We design the Universal Verifier around four key principles: 1) constructing rubrics with meaningful, non-overlapping criteria to reduce noise; 2) separating process and outcome rewards that yield complementary signals, capturing cases where an agent follows the right steps but gets blocked or succeeds through an unexpected path; 3) distinguishing between controllable and uncontrollable failures scored via a cascading-error-free strategy for finer-grained failure understanding; and 4) a divide-and-conquer context management scheme that attends to all screenshots in a trajectory, improving reliability on longer task horizons. We validate these findings on CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels, showing that our Universal Verifier agrees with humans as often as humans agree with each other. We report a reduction in false positive rates to near zero compared to baselines like WebVoyager (≥45%) and WebJudge (≥22%). We emphasize that these gains stem from the cumulative effect of the design choices above. We also find that an auto-research agent achieves 70% of expert quality in 5% of the time, but fails to discover all strategies required to replicate the Universal Verifier. We open-source our Universal Verifier system along with CUAVerifierBench; available at this https URL.
[MA-15] Emergent decentralized regulation in a purely synthetic society
【Quick Read】: This paper asks whether a collective of synthetic agents can exhibit self-regulated social dynamics with neither human intervention nor centralized design. The key to the approach is quantifying Directive Intensity (DI), a lexicon-based measure of directive and instructional language in agent interactions. Based on observational data of 39,026 posts and 5,712 comments from 14,490 agents on the Moltbook platform, the authors find that high-DI content significantly raises the probability of Corrective Signaling, and that this association remains robust in a mixed-effects model accounting for the nesting of comments within posts. The results suggest that a purely synthetic society can self-regulate through endogenous mechanisms, particularly via strengthened negative feedback on highly directive content.
Link: https://arxiv.org/abs/2604.06199
Authors: Md Motaleb Hossen Manik, Ge Wang
Affiliations: Rensselaer Polytechnic Institute
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:As autonomous AI agents increasingly inhabit online environments and extensively interact, a key question is whether synthetic collectives exhibit self-regulated social dynamics with neither human intervention nor centralized design. We study OpenClaw agents on Moltbook, an agent-only social network, using an observational archive of 39,026 posts and 5,712 comments authored by 14,490 agents. We quantify action-inducing language with Directive Intensity (DI), a transparent, lexicon-based proxy for directive and instructional phrasing that does not measure moral valence, intent, or execution outcomes. We classify responsive comments into four types: Affirmation, Corrective Signaling, Adverse Reaction, and Neutral Interaction. Directive content is common (DI > 0 in 18.4% of posts). More importantly, corrective signaling scales with DI: posts with higher DI exhibit higher corrective reply probability, visible in stable binned estimates with Wilson confidence intervals. To address comment nesting within posts, we fit a post-level random intercept mixed-effects logistic model and find that the positive DI association persists. Event-aligned within-thread analysis of comment text provides additional evidence consistent with negative feedback after the first corrective response. In general, these results suggest that a purely synthetic, agent-only society can exhibit endogenous corrective signaling with a strength positively linked to the intensity of directive proposals.
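The Directive Intensity proxy described above is a transparent, lexicon-based score. A minimal sketch follows, with a made-up directive lexicon for illustration (the paper's actual word list is not reproduced here):

```python
import re

# Illustrative directive lexicon; a hypothetical stand-in, not the paper's list.
DIRECTIVE_LEXICON = {"must", "should", "do", "follow", "install", "run", "stop", "never"}

def directive_intensity(text: str) -> float:
    """Fraction of tokens that match the directive lexicon (a DI-style proxy)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in DIRECTIVE_LEXICON for t in tokens) / len(tokens)

print(directive_intensity("You must install this tool and run it now."))  # 3 of 9 tokens
```

A score like this measures only the density of directive phrasing, matching the paper's caveat that DI does not capture moral valence, intent, or execution outcomes.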
[MA-16] CODE-GEN: A Human-in-the-Loop RAG-Based Agentic AI System for Multiple-Choice Question Generation
【Quick Read】: This paper tackles the difficulty of automatically generating high-quality multiple-choice questions for code comprehension in education, aimed at developing students' code reasoning and understanding. The key to the solution is CODE-GEN, a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic system with two cooperating AI agents: a Generator agent produces context-aligned questions tied to course learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions, aided by specialized tools that improve code correctness and computational accuracy. Experiments show strong performance on most quantifiable dimensions, with high agreement between human experts and the AI Validator; dimensions requiring deeper instructional judgment (such as distractor design and feedback quality) still need human involvement, providing empirical grounding for allocating human and AI effort in AI-assisted educational content generation.
Link: https://arxiv.org/abs/2604.03926
Authors: Xiaojing Duan, Frederick Nwanganga, Chaoli Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Full version of the paper accepted as a short paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract:We present CODE-GEN, a human-in-the-loop, retrieval-augmented generation (RAG)-based agentic AI system for generating context-aligned multiple-choice questions to develop student code reasoning and comprehension abilities. CODE-GEN employs an agentic AI architecture in which a Generator agent produces multiple-choice coding comprehension questions aligned with course-specific learning objectives, while a Validator agent independently assesses content quality across seven pedagogical dimensions. Both agents are augmented with specialized tools that enhance computational accuracy and verify code outputs. To evaluate the effectiveness of CODE-GEN, we conducted an evaluation study involving six human subject-matter experts (SMEs) who judged 288 AI-generated questions. The SMEs produced a total of 2,016 human-AI rating pairs, indicating agreement or disagreement with the Validator's assessments, along with 131 instances of qualitative feedback. Analyses of SME judgments show strong system performance, with human-validated success rates ranging from 79.9% to 98.6% across the seven pedagogical dimensions. The analysis of qualitative feedback reveals that CODE-GEN achieves high reliability on dimensions well suited to computational verification and explicit criteria matching, including question clarity, code validity, concept alignment, and correct answer validity. In contrast, human expertise remains essential for dimensions requiring deeper instructional judgment, such as designing pedagogically meaningful distractors and providing high-quality feedback that reinforces understanding. These findings inform the strategic allocation of human and AI effort in AI-assisted educational content generation.
[MA-17] A Generalized Sinkhorn Algorithm for Mean-Field Schrödinger Bridge
【Quick Read】: This paper addresses the mean-field Schrödinger bridge (MFSB) problem: designing a minimum-effort controller that steers a diffusion process with nonlocal interaction from an initial distribution to a target distribution by a fixed deadline. MFSB is a natural model for large-scale multi-agent systems, and it is computationally challenging because the nonlocal interaction makes the problem nonconvex. The key innovation is a generalization of the Hopf-Cole transform, on which the authors build a Sinkhorn-type recursive algorithm for the associated system of integro-PDEs, with convergence guarantees under mild assumptions on the interaction potential, yielding a computable and theoretically grounded solution to MFSB.
Link: https://arxiv.org/abs/2604.06531
Authors: Asmaa Eldesoukey, Yongxin Chen, Abhishek Halder
Affiliations: Iowa State University; Georgia Institute of Technology
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments:
Abstract:The mean-field Schrödinger bridge (MFSB) problem concerns designing a minimum-effort controller that guides a diffusion process with nonlocal interaction to reach a given distribution from another by a fixed deadline. Unlike the standard Schrödinger bridge, the dynamical constraint for MFSB is the mean-field limit of a population of interacting agents with controls. It serves as a natural model for large-scale multi-agent systems. The MFSB is computationally challenging because the nonlocal interaction makes the problem nonconvex. We propose a generalization of the Hopf-Cole transform for MFSB and, building on it, design a Sinkhorn-type recursive algorithm to solve the associated system of integro-PDEs. Under mild assumptions on the interaction potential, we discuss convergence guarantees for the proposed algorithm. We present numerical examples with repulsive and attractive interactions to illustrate the theoretical contributions.
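The paper's recursion generalizes the classical Sinkhorn iteration. As background only, here is a minimal sketch of standard Sinkhorn for the static entropic problem between two discrete marginals, not the mean-field algorithm itself:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iter=500):
    """Classical Sinkhorn iterations for the static entropic problem
    between discrete marginals mu and nu with cost matrix C."""
    K = np.exp(-C / eps)                # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)              # enforce the target marginal
        u = mu / (K @ v)                # enforce the source marginal
    return u[:, None] * K * v[None, :]  # transport plan

mu = np.array([0.5, 0.5])
nu = np.array([0.3, 0.7])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(mu, nu, C)
print(P.sum(axis=1), P.sum(axis=0))     # recovers mu and nu at convergence
```

Each pass alternately rescales rows and columns of the Gibbs kernel; the MFSB algorithm replaces this fixed kernel with one shaped by the nonlocal interaction, which is what the generalized Hopf-Cole transform enables.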
[MA-18] MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis
【Quick Read】: This paper addresses the mismatch between overly general Large Multimodal Models (LMMs) and the diverse demands of real-world clinical diagnosis. Existing approaches rely on static or predefined specialist selection and cannot adapt to the case at hand, limiting diagnostic precision and flexibility. The key to the solution is MedRoute, a collaborative framework of specialist LMM agents with a reinforcement-learning-trained router (the General Practitioner) that dynamically selects specialists, plus a Moderator that produces the final diagnostic decision, closely mirroring real multidisciplinary clinical workflows. Experiments on text- and image-based medical datasets show significant accuracy gains over state-of-the-art baselines.
Link: https://arxiv.org/abs/2604.06180
Authors: Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to these models' capability to provide precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable for the wide range of medical conditions in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework in which each agent functions as a medical specialist. Current approaches in this emerging area, typically relying on static or predefined selection of specialists, cannot adapt to changing practical scenarios. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework comprising a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text- and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming state-of-the-art baselines. Our work lays a strong foundation for future research. Code and models are available at this https URL.
Natural Language Processing
[NLP-0] Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
【Quick Read】: This paper addresses the difficulty of evaluating how well reward models (RMs) in Large Language Models (LLMs) capture individual users' personalized preferences; existing benchmarks focus on general response quality and poorly reflect personalization. The key to the solution is Personalized RewardBench, a benchmark built from chosen/rejected response pairs constructed through strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions arise solely from personal preference rather than general quality differences. Experiments show the benchmark correlates more strongly with downstream performance (Best-of-N sampling and Proximal Policy Optimization, PPO) than existing baselines, providing a reliable tool for assessing reward models' personalization capability.
Link: https://arxiv.org/abs/2604.07343
Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao
Affiliations: University of California, Davis
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models’ capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model’s performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models’ performance in downstream applications.
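As context for the downstream evaluation above, Best-of-N sampling simply returns the candidate that a reward model scores highest. A minimal sketch, with a toy stand-in reward function (not the benchmark's actual reward models; the character-limit rubric is a hypothetical example of a user-specific preference):

```python
def best_of_n(candidates, reward):
    """Best-of-N sampling: return the candidate the reward model scores highest."""
    return max(candidates, key=reward)

# Toy stand-in reward: this user's rubric demands answers under 40 characters.
def toy_reward(response: str) -> float:
    return -len(response) if len(response) <= 40 else -1000.0

candidates = [
    "A very long-winded answer that rambles on well past the limit the user set.",
    "Short answer.",
    "A medium-length answer under the cap.",
]
print(best_of_n(candidates, toy_reward))  # "Short answer."
```

A benchmark that predicts BoN performance well is one whose pairwise accuracy tracks how often the reward model ranks the rubric-compliant candidate on top, which is exactly what this selection step exercises.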
[NLP-1] Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
【Quick Read】: This paper addresses the underexplored ability of vision-language models (VLMs) to infer structured cultural metadata (e.g., creator, origin, period) in the cultural heritage domain. While recent work has improved image captioning, systematic evaluation of structured metadata extraction across cultural contexts is lacking. The key to the solution is a multi-category, cross-cultural benchmark paired with an LLM-as-Judge framework that quantifies semantic alignment between model outputs and reference annotations, with exact-match, partial-match, and attribute-level accuracy reported across cultural regions. Results reveal that current VLMs capture only fragmented signals and generalize unevenly across cultures and metadata types, yielding inconsistent and weakly grounded predictions.
Link: https://arxiv.org/abs/2604.07338
Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou
Affiliations: University of Manchester; School of Artificial Intelligence, Wuhan University; Getty Conservation Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:
Abstract:Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
[NLP-2] Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
【Quick Read】: This paper addresses the performance degradation of LLM-based machine translation for low-resource languages caused by scarce training data. The core idea is to exploit LLMs' ability to use in-context descriptions of languages: given formal rules such as grammars and dictionaries, a model must reason over those rules to translate, rather than relying on large parallel corpora. The key innovation is constructing synchronous context-free grammars that formally model aspects of natural-language syntax, morphology, and writing systems, and then measuring LLMs' ability to perform string transduction given a source sentence and the paired grammar. The study reveals how grammar size, sentence length, morphological differences, and script affect performance, and which error patterns emerge.
Link: https://arxiv.org/abs/2604.07320
Authors: Jackson Petty, Jaulie Goe, Tal Linzen
Affiliations: New York University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs’ ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages’ grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs’ translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, we examine the types of errors committed by models and find they are most prone to recall the wrong words from the target language vocabulary, hallucinate new words, or leave source-language words untranslated.
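The formal task can be illustrated with a toy synchronous CFG. The grammar below is a hypothetical example modeling SVO-to-SOV reordering with a tiny lexicon, not one of the paper's grammars; it generates aligned (source, target) sentence pairs by expanding shared nonterminals in lockstep:

```python
import random

# Toy synchronous CFG: each rule pairs a source side with a target side;
# shared nonterminals are expanded once and reused on both sides.
RULES = {
    "S":   [(["NP", "V", "NP2"], ["NP", "NP2", "V"])],   # SVO -> SOV reordering
    "NP":  [(["alice"], ["alice"]), (["bob"], ["bob"])],
    "NP2": [(["carol"], ["carol"]), (["dave"], ["dave"])],
    "V":   [(["sees"], ["voit"]), (["likes"], ["aime"])],
}

def derive(symbol, rng):
    """Jointly derive aligned (source, target) token lists from `symbol`."""
    if symbol not in RULES:                               # terminal symbol
        return [symbol], [symbol]
    src_side, tgt_side = rng.choice(RULES[symbol])
    expansions = {s: derive(s, rng) for s in set(src_side) | set(tgt_side)}
    src = [w for s in src_side for w in expansions[s][0]]
    tgt = [w for s in tgt_side for w in expansions[s][1]]
    return src, tgt

src, tgt = derive("S", random.Random(0))
print(" ".join(src), "->", " ".join(tgt))
```

The transduction task the paper studies is the inverse direction: given such a grammar in context plus a source string, the model must produce the aligned target string.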
[NLP-3] OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
【Quick Read】: This paper addresses the absence of a principled, open-source data-generation engine for spatial understanding capable of fully unlocking the potential of high-quality spatial data. The key to the solution is OpenSpatial, a data engine designed for quality, scalability, task diversity, and efficiency: it uses 3D bounding boxes as the fundamental primitive to build a data hierarchy spanning five foundational tasks (Spatial Measurement, Spatial Relationship, Camera Perception, Multi-view Consistency, and Scene-Aware Reasoning), and on this basis constructs OpenSpatial-3M, a dataset of 3 million high-quality samples that substantially improves spatial reasoning models across multiple benchmarks.
Link: https://arxiv.org/abs/2604.07296
Authors: Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang, Zhiliang Zhu, Yijun Yang, Shenghe Zheng, Nan Jiang, Jiaxiu Jiang, Haoyang Huang, Tien-Tsin Wong, Nan Duan, Xiaojuan Qi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial – an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average relative improvement of 19 percent. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.
[NLP-4] Why teaching resists automation in an AI-inundated era: Human judgment non-modular work and the limits of delegation
【Quick Read】: This paper challenges the view, common in debates about artificial intelligence (AI) in education, that teaching is a modular, procedural job that can be progressively automated. The key argument is that although AI can support some bounded instructional functions, teaching is inherently interpretive, relational, and grounded in professional judgment; its value derives from ongoing contextual understanding of learners, situations, and relationships. Because teaching and learning are emergent practices shaped by human cognition, motivation, and social interaction, they cannot be fully automated and continue to require human professional judgment and accountability.
Link: https://arxiv.org/abs/2604.07285
Authors: Songhee Han
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Debates about artificial intelligence (AI) in education often portray teaching as a modular and procedural job that can increasingly be automated or delegated to technology. This brief communication paper argues that such claims depend on treating teaching as more separable than it is in practice. Drawing on recent literature and empirical studies of large language models and retrieval-augmented generation systems, I argue that although AI can support some bounded functions, instructional work remains difficult to automate in meaningful ways because it is inherently interpretive, relational, and grounded in professional judgment. More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled. Tasks that may appear separable in principle derive their instructional value in practice from ongoing contextual interpretation across learners, situations, and relationships. As long as educational practice relies on emergent understanding of human cognition and learning, teaching remains a form of professional work that resists automation. AI may improve access to information and support selected instructional activities, but it does not remove the need for human judgment and relational accountability that effective teaching requires.
[NLP-5] A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
【Quick Read】: This paper addresses the knowledge gaps and limited factual grounding of purely parametric large language models (LLMs) in medical question answering. The key to the solution is a retrieval-augmented generation (RAG) framework that integrates external knowledge retrieval into the reasoning pipeline. The study systematically evaluates combinations of retrieval components (embedding models, retrieval strategies, query reformulation, and cross-encoder reranking) and finds that dense retrieval with query reformulation and reranking reaches 60.49% accuracy on zero-shot medical QA with favorable computational efficiency, demonstrating that RAG can improve factual accuracy in the medical domain and is deployable on consumer-grade hardware.
Link: https://arxiv.org/abs/2604.07274
Authors: Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration was dense retrieval with query reformulation and reranking achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.
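The dense-retrieval step of such a pipeline can be sketched with toy bag-of-words vectors standing in for dense embeddings; the paper uses real embedding models plus a cross-encoder reranker, and the corpus and query below are invented examples:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a dense model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus, k: int = 2):
    """Rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

corpus = [
    "aspirin inhibits platelet aggregation",
    "insulin regulates blood glucose",
    "statins lower cholesterol levels",
]
top = retrieve("which drug regulates blood glucose", corpus, k=1)
print(top[0])  # "insulin regulates blood glucose"
```

In the configurations the paper compares, a query-reformulation step would rewrite the question before `embed`, and a cross-encoder would rescore the retrieved candidates before they are passed to the generator.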
[NLP-6] ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection
【Quick Read】: This paper addresses the credibility threat posed by widespread clickbait headlines, which use sensationalism, misleading claims, and vague language to lure clicks and undermine the reliability of digital content. The key to the solution is ClickGuard, a framework whose Syntactic-Semantic Adaptive Fusion Block (SSAFB) dynamically integrates BERT semantic embeddings with structural features, combined with a hybrid CNN-BiLSTM architecture that captures local patterns and long-range dependencies, achieving 96.93% testing accuracy with interpretable, high-precision clickbait detection.
Link: https://arxiv.org/abs/2604.07272
Authors: Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The widespread use of clickbait headlines, crafted to mislead and maximize engagement, poses a significant challenge to online credibility. These headlines employ sensationalism, misleading claims, and vague language, underscoring the need for effective detection to ensure trustworthy digital content. The paper introduces, ClickGuard: a trustworthy adaptive fusion framework for clickbait detection. It combines BERT embeddings and structural features using a Syntactic-Semantic Adaptive Fusion Block (SSAFB) for dynamic integration. The framework incorporates a hybrid CNN-BiLSTM to capture patterns and dependencies. The model achieved 96.93% testing accuracy, outperforming state-of-the-art approaches. The model’s trustworthiness is evaluated using LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods assess the model’s robustness and sensitivity to feature changes by measuring the average prediction variation. Ablation studies validated the SSAFB’s effectiveness in optimizing feature fusion. The model demonstrated robust performance across diverse datasets, providing a scalable, reliable solution for enhancing online content credibility by addressing syntactic-semantic modelling challenges. Code of the work is available at: this https URL
[NLP-7] Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent
【Quick Read】: This paper addresses the limited experience reuse and continual adaptation of current LLM-based diagnostic agents, which treat clinical cases independently. The key to the solution is SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module, trained with a purpose-built reinforcement learning framework that jointly optimizes reasoning and memory management, effectively turning clinical experience into reusable knowledge and improving both diagnostic accuracy and long-horizon learning performance.
Link: https://arxiv.org/abs/2604.07269
Authors: Bingxuan Li, Simo Du, Yue Guo
Affiliations: University of Illinois Urbana-Champaign; Jacobi Medical Center
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Clinical expertise improves not only by acquiring medical knowledge, but by accumulating experience that yields reusable diagnostic patterns. Recent LLMs-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module. We design a reinforcement training framework tailored to our designed agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. On standard evaluation with MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. On the long-horizon with ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated from SEA show strong clinical correctness, usefulness and trust, suggesting that the induced rules in dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.
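The dual-memory idea (an episodic buffer of raw cases plus consolidated, reusable rules) can be sketched as follows; the consolidation threshold, data format, and example cases are illustrative assumptions, not SEA's actual design:

```python
class DualMemory:
    """Toy dual-memory store: episodic case buffer + consolidated rule table."""

    def __init__(self, consolidate_after: int = 2):
        self.episodic = []            # raw (findings, diagnosis) cases
        self.rules = {}               # findings -> diagnosis, reusable knowledge
        self.consolidate_after = consolidate_after

    def add_case(self, findings: str, diagnosis: str) -> None:
        self.episodic.append((findings, diagnosis))
        # Promote a repeated (findings, diagnosis) pattern into a reusable rule.
        seen = sum(1 for f, d in self.episodic if (f, d) == (findings, diagnosis))
        if seen >= self.consolidate_after:
            self.rules[findings] = diagnosis

    def recall(self, findings: str):
        return self.rules.get(findings)

mem = DualMemory()
mem.add_case("fever+cough", "flu")
mem.add_case("fever+cough", "flu")    # second occurrence triggers consolidation
print(mem.recall("fever+cough"))      # "flu"
```

In SEA, what to store, consolidate, and recall is itself learned jointly with the reasoning policy via reinforcement learning, rather than fixed by a threshold as in this sketch.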
[NLP-8] Efficient Learned Data Compression via Dual-Stream Feature Decoupling ACL2026
【Quick Read】: This paper addresses the difficulty of balancing precise probability modeling with system efficiency in Learned Data Compression (LDC): uniform single-stream architectures struggle to capture both micro-syntactic and macro-semantic features, forcing deep serial stacking that inflates latency, while heterogeneous systems suffer device speed mismatches that cap throughput under Amdahl's Law. The key to the solution is a Dual-Stream Multi-Scale Decoupler that separates local and global contexts, replacing deep serial processing with shallow parallel streams, together with a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling; a Concurrent Stream-Parallel Pipeline further removes systemic bottlenecks to achieve full-pipeline parallelism, yielding state-of-the-art compression ratio and throughput with the lowest latency and memory usage.
Link: https://arxiv.org/abs/2604.07239
Authors: Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai
Affiliations: Nankai University; Nanyang Technological University; Beijing Institute of Technology
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026
Abstract:While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single-stream architectures struggle to simultaneously capture micro-syntactic and macro-semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl’s Law due to serial processing. To this end, we propose a Dual-Stream Multi-Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream-Parallel Pipeline, which overcomes systemic bottlenecks to achieve full-pipeline parallelism. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at this https URL.
[NLP-9] On the Price of Privacy for Language Identification and Generation
【Quick Read】: This paper quantifies the performance cost that privacy protection, specifically differential privacy (DP), imposes on language identification and generation. The key contribution is the first set of algorithms and matching lower bounds for both tasks in the agnostic statistical setting under approximate (\varepsilon, \delta)-DP and pure \varepsilon-DP, precisely characterizing the cost of privacy. Under approximate DP with constant \varepsilon, privacy incurs no extra error and the optimal non-private error rates are recovered; under pure DP, the error exponents degrade only by a multiplicative factor of \min\{1, \varepsilon\}, which is tight up to constants, establishing optimal privacy-utility trade-off rates.
Link: https://arxiv.org/abs/2604.07238
Authors: Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao
Affiliations: University of New South Wales; University of Sydney
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Comments:
Abstract:As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate (\varepsilon, \delta)-DP with constant \varepsilon > 0 recovers the non-private error rates: \exp(-r(n)) for identification (for any r(n) = o(n)) and \exp(-\Omega(n)) for generation. Under pure \varepsilon-DP, the exponents degrade by a multiplicative factor of \min\{1, \varepsilon\}, which we show is tight up to constants. Notably, for generation under pure DP with mild assumptions, the upper bound \exp(-\min\{1, \varepsilon\} \cdot \Omega(n)) matches the lower bound up to some constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a \min\{1, \varepsilon\} factor in the exponent under pure DP.
[NLP-10] How Much LLM Does a Self-Revising Agent Actually Need?
【速读】: 该论文试图解决的问题是:在基于大语言模型(Large Language Model, LLM)的智能体中,难以区分其能力究竟来源于LLM本身的推理与决策,还是来自外部结构(如世界建模、规划或反思机制)所赋予的显式功能。为解决这一问题,作者提出了一种声明式反思运行时协议(declared reflective runtime protocol),通过将代理的状态、置信度信号、受保护的动作以及假设性状态转移等关键要素外化为可检查的运行时结构,使原本隐含的行为变得可观测和可分解。该协议的核心在于构建一个可追踪的运行时框架,从而能够对LLM干预的边际作用进行直接评估——实验表明,显式世界模型规划显著优于仅依赖后验信念的贪婪策略(胜率提升24.1个百分点),而符号化反思虽能作为实时机制运行,但当前条件下整体收益有限;此外,稀疏LLM修订仅在约4.3%的回合中触发时,带来微小且非单调的性能变化,进一步验证了该方法论的价值:即通过外化反思,实现了对LLM角色的精细化分析,而非单纯追求性能提升。
链接: https://arxiv.org/abs/2604.07236
作者: Seongwoo Jeong,Seonil Son
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: WIP
Abstract:Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent’s competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards \times 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism – with prediction tracking, confidence gating, and guarded revision actions – even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31 \rightarrow 29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly. 
[NLP-11] TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)从静态对话机器人向自主代理(autonomous agents)演进过程中,安全防护机制从最终输出向中间执行轨迹(intermediate execution traces)转移所带来的新挑战。现有安全护栏(guardrails)主要针对自然语言响应进行评测,但在多步骤工具调用轨迹中的有效性尚不明确。为填补这一空白,论文提出TraceSafe-Bench——首个专门用于评估中间轨迹安全性的基准测试,涵盖12类风险(如提示注入、隐私泄露等),包含超1000个独特执行实例。其关键解决方案在于揭示三个核心发现:1)结构能力瓶颈:安全防护效果更依赖于结构化数据处理能力(如JSON解析),而非语义安全对齐;2)架构优于规模:模型架构比参数量更能影响风险检测性能,通用LLM在轨迹分析中优于专用安全护栏;3)时间稳定性:随着执行步骤增加,模型能从静态工具定义转向动态行为识别,反而提升后期风险检测准确率。这表明,要有效缓解代理工作流中的中段风险,需协同优化结构推理与安全对齐能力。
链接: https://arxiv.org/abs/2604.07223
作者: Yen-Shan Chen,Sian-Yao Huang,Cheng-Lin Yang,Yun-Nung Chen
机构: CyCraft AI Lab (CyCraft AI 实验室); National Taiwan University (国立台湾大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks (ρ = 0.79) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
[NLP-12] LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics CVPR2026
【速读】: 该论文旨在解决在非受限环境下预测情绪变化(affect change)这一挑战,尤其针对当前主流深度神经网络嵌入方法存在的可解释性差、难以进行专家干预优化的问题。其解决方案的关键在于构建一个基于语言模型(Language Model, LM)的语义上下文条件机制,将手工设计的情绪特征(如面部几何与声学特征)转化为自然语言描述,并由预训练LM生成语义上下文嵌入作为情绪动态的高层先验。该方法在保持特征透明性的同时,利用LM的上下文抽象能力提升建模效果,从而实现既可解释又高性能的情绪变化预测,优于纯手工特征和端到端深度嵌入基线方法。
链接: https://arxiv.org/abs/2604.07193
作者: Kosmas Pinitas,Ilias Maglogiannis
机构: University of Piraeus (比雷埃夫斯大学)
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: This paper has been accepted at the CVPR 2026 Workshop on Affective Behavior Analysis in-the-wild (ABAW)
Abstract:Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures
[NLP-13] Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery
【速读】: 该论文旨在解决传统语料库语言学(corpus linguistics)中依赖人工进行假设生成、查询构建和结果解释所带来的高技术门槛与低效率问题。其核心挑战在于如何在保持研究严谨性的同时,提升语料库分析的自动化程度与可及性。解决方案的关键在于提出“代理驱动的语料库语言学”(Agent-Driven Corpus Linguistics),即通过一个大型语言模型(LLM)借助结构化的工具调用接口(如Model Context Protocol, MCP)连接语料库查询引擎(如CQP-indexed Gutenberg语料库),自主完成多轮探究:从假设生成、语料库查询、结果解释到迭代优化,而人类研究人员仅负责设定方向并评估最终输出。该方法确保所有发现基于可验证的语料库证据,而非模型内部知识,从而在不牺牲实证基础的前提下实现机器级速度的研究产出,并显著降低技术门槛。
链接: https://arxiv.org/abs/2604.07189
作者: Jia Yu,Weiwei Yu,Pengfei Xiao,Fukun Xing
机构: Zhejiang International Studies University (浙江外国语学院); Tianjin University (天津大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results - a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only “investigate English intensifiers,” the agent identified a diachronic relay chain (so+ADJ → very → really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) - Claridge (2025) and De Smet (2013) - with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.
[NLP-14] Dynamic Context Evolution for Scalable Synthetic Data Generation
【速读】: 该论文旨在解决大规模语言模型在独立批量提示(independent batch prompting)时产生的重复输出问题,即跨批次模式坍缩(cross-batch mode collapse)——随着生成批次的增加,模型输出多样性逐渐丧失。解决方案的关键在于提出动态上下文演化(Dynamic Context Evolution, DCE),其核心机制包括:(1)语义化尾部采样(verbalized tail sampling, VTS),通过模型自我评估对高概率候选进行过滤;(2)语义记忆(semantic memory),利用持久嵌入索引识别并剔除近似重复内容;(3)自适应提示演化(adaptive prompt evolution),基于记忆状态和轮换多样性策略重构每批提示。实验证明,DCE 在三个不同领域中均显著提升输出多样性与概念结构稳定性,且无需微调或定制架构,仅依赖标准API调用即可将跨批次坍缩率降至 0.0%(基线为 5.6%),成本约为每 1,000 个候选 0.50。
链接: https://arxiv.org/abs/2604.07147
作者: Ryan Lingo,Rajeev Chhajer
机构: Honda Research Institute, USA, Inc. (美国本田研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive’s volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately 0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.
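摘要中的语义记忆(semantic memory)通过持久嵌入索引拒绝跨批次的近似重复内容。以下为示意性草图(并非论文官方实现,`SemanticMemory` 的命名与 `delta` 的默认值均为假设),展示"余弦相似度超过去重阈值 delta 即拒绝"这一核心逻辑:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticMemory:
    """Persistent embedding index that rejects near-duplicate candidates
    across generation batches (illustrative sketch, not the paper's code)."""

    def __init__(self, delta=0.9):
        self.delta = delta  # dedup threshold (hypothetical default)
        self.index = []     # embeddings of all accepted candidates so far

    def accept(self, emb):
        if any(cosine(emb, prev) >= self.delta for prev in self.index):
            return False    # near-duplicate of an earlier generation
        self.index.append(emb)
        return True
```

实际系统中 `emb` 来自外部嵌入模型(论文用 all-MiniLM-L6-v2 做独立验证),此处仅演示索引与阈值判断。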
[NLP-15] Language Bias under Conflicting Information in Multilingual LLMs
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在处理跨语言冲突信息时是否存在语言偏好偏倚的问题。其核心问题是:当模型面对不同语言表达的矛盾信息时,是否会因语言来源而产生系统性偏向,从而忽略真实冲突并单一输出某一种答案。解决方案的关键在于将"大海捞针中的冲突针"(conflicting needles in a haystack)范式扩展至多语言场景,通过在五个不同语言(包括俄语和中文)的真实新闻数据上对多种规模的多语言模型进行系统评估,发现所有测试模型均普遍忽略冲突并自信地选择单一答案,且存在稳定的语言偏好模式——即对俄语存在系统性排斥,对中文则在长上下文下表现出偏好,且这种偏倚在中外训练的模型中均一致存在,但在中国大陆训练的模型中更为显著。
链接: https://arxiv.org/abs/2604.07123
作者: Robert Östling,Murathan Kurfalı
机构: Stockholm University (斯德哥尔摩大学); RISE Research Institutes of Sweden (瑞典研究机构)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.
[NLP-16] Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)领域同行评审中存在的一种特定偏见——语言研究偏见(Language-of-study bias, LoS bias),即审稿人基于论文所研究的语言而非其科学质量进行评价的现象。尽管此类偏见在评审指南中已被明确指出,但此前缺乏系统性识别与量化分析。论文的关键解决方案在于首次构建了人类标注的数据集LOBSTER(Language-Of-study Bias in ScienTific pEer Review),并提出一种能够以87.37%宏F1分数准确检测LoS偏见的方法;同时通过分析15,645份评审意见,揭示非英语论文面临的负面偏见显著高于英语论文,且负向偏见始终强于正向偏见,并识别出四种负向偏见子类,其中要求不合理的跨语言泛化是最主要形式。所有资源均已公开,以推动更公平的学术评审实践。
链接: https://arxiv.org/abs/2604.07119
作者: Ehsan Barkhordar,Abdulfattah Safa,Verena Blaschke,Erika Lombart,Marie-Catherine de Marneffe,Gözde Gül Şahin
机构: Koç University(科奇大学); KUIS AI Lab; LMU Munich(慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning(慕尼黑机器学习中心); UCLouvain(鲁汶大学); Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希亚历山大大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, 10 figures, 9 tables
Abstract:Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.
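LOBSTER 的偏见检测以宏平均 F1(macro F1,87.37)为指标。宏平均 F1 即各类别 F1 的无加权平均,下面给出一个自包含的纯 Python 计算草图(仅作示意,与论文具体实现无关):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro F1)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(labels)
```

与 micro F1 不同,宏平均对少数类(此处为含偏见的评审意见)与多数类同权,适合类别不平衡的检测任务。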
[NLP-17] Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
【速读】: 该论文旨在解决医疗问答任务中患者自述问题与临床理解之间语义鸿沟的问题,具体针对ArchEHR-QA 2026共享任务中的四个子任务:患者问题重述(ST1)、证据句识别(ST2)、答案生成(ST3)及证据-答案对齐(ST4)。解决方案的关键在于采用分阶段的模型架构与多模型集成策略:ST1使用Claude Sonnet 4和GPT-4o双模型流水线进行问题重构;ST2–ST4则基于Azure托管的多种大语言模型(如o3、GPT-5.2、GPT-5.1、DeepSeek-R1)结合少样本提示(few-shot prompting)与投票机制提升鲁棒性。实验表明,模型多样性与集成投票显著优于单模型基线,且提供完整临床答案段落作为上下文有助于增强证据-答案对齐效果,但推理能力仍是限制对齐准确率的主要瓶颈。
链接: https://arxiv.org/abs/2604.07116
作者: Elyas Irankhah,Samah Fodeh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures. System description for ArchEHR-QA 2026 shared task
Abstract:We describe the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and contains four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2-ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. Our experiments show three main findings. First, model diversity and ensemble voting consistently improve performance compared to single-model baselines. Second, the full clinician answer paragraph is provided as additional prompt context for evidence alignment. Third, results on the development set show that alignment accuracy is mainly limited by reasoning. The best scores on the development set reach 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.
[NLP-18] The Impact of Steering Large Language Models with Persona Vectors in Educational Applications
【速读】: 该论文旨在解决激活引导式人格向量(activation-based persona vectors)在教育场景中对大语言模型生成与评分行为的影响机制不明确的问题。其关键解决方案在于系统性地评估七种人格特质在短答案生成和自动评分任务中的作用,发现人格引导会显著降低整体答案质量,且在开放性英语语言艺术(ELA)任务中敏感度远高于事实性科学任务(最高达11倍),同时评分校准偏差呈现可预测的极性一致性:邪恶和不礼貌人格导致评分更严苛,而善良和乐观人格则使评分更宽松;此外,混合专家(Mixture-of-Experts)架构比密集模型表现出约6倍更大的校准偏移。这一结果凸显了在教育应用中部署人格引导模型时需考虑任务特异性与模型架构差异的校准策略。
链接: https://arxiv.org/abs/2604.07102
作者: Yongchao Wu,Aron Henriksson
机构: Stockholm University (斯德哥尔摩大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Activation-based steering can personalize large language models at inference time, but its effects in educational settings remain unclear. We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures. Persona steering lowers answer quality overall, with much larger effects on open-ended English Language Arts (ELA) prompts than on factual science prompts; interpretive and argumentative tasks are up to 11x more sensitive. On the scoring side, we observe predictable valence-aligned calibration shifts: evil and impolite scorers grade more harshly, while good and optimistic scorers grade more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and the Mixture-of-Experts model shows roughly 6x larger calibration shifts than the dense models. To our knowledge, this is the first study to systematically examine the effects of activation-steered persona traits in educational generation and scoring, and the results highlight the need for task-aware and architecture-aware calibration when deploying steered models in educational settings.
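激活引导(activation steering)的核心操作可示意为:在推理时将人格方向向量按系数叠加到某一层的隐藏状态上,即 h' = h + α·v。以下为概念性草图(`steer` 与 `alpha` 均为假设命名,并非论文实现):

```python
def steer(hidden, persona_vec, alpha):
    """Inference-time persona steering at a chosen layer: h' = h + alpha * v.
    `hidden` is a layer's hidden state, `persona_vec` the persona direction,
    `alpha` the steering strength (all hypothetical names for illustration)."""
    return [h + alpha * v for h, v in zip(hidden, persona_vec)]
```

实际使用中该操作通常通过 forward hook 注入到 Transformer 的残差流中,`alpha` 的符号与大小即控制人格特质的方向与强度。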
[NLP-19] STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems ACL2026
【速读】: 该论文旨在解决当前 empathetic dialogue(共情对话)模型在策略感知、上下文敏感决策以及高质量数据支持方面的局限性,这些问题限制了模型将共情对话建模为复杂多阶段认知与决策过程的能力。其解决方案的关键在于提出STRIDE-ED框架——一个基于策略的、可解释的、深度推理的共情对话建模方法,通过结构化策略条件推理实现对共情策略的显式建模;同时构建了一个策略感知的数据精炼流水线(整合LLM标注、多模态一致性加权评估与动态采样),以生成高质量训练数据;并采用监督微调与多目标强化学习相结合的两阶段训练范式,使模型行为更精准地对齐目标情绪、共情策略和响应格式。
链接: https://arxiv.org/abs/2604.07100
作者: Hongru Ji,Yuyin Fan,Meng Zhao,Xianghua Li,Lianwei Wu,Chao Gao
机构: Northwestern Polytechnical University (西北工业大学); Henan University of Technology (河南工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026
Abstract:Empathetic dialogue requires not only recognizing a user’s emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.
[NLP-20] Selective Neuron Amplification for Training-Free Task Enhancement
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在某些任务上表现不佳的问题,尽管这些任务表面上看模型已经具备相应理解能力。研究表明,问题根源并非知识缺失,而是特定内部神经元回路在推理过程中未被充分激活。解决方案的关键在于提出一种名为“选择性神经元放大”(Selective Neuron Amplification, SNA)的方法,该方法在推理阶段增强与任务相关的神经元的影响权重,而无需修改模型参数,且不永久改变模型结构。SNA主要在模型不确定时生效,而在模型已具高置信度时影响较小,这表明部分模型失败源于弱激活而非能力不足。
链接: https://arxiv.org/abs/2604.07098
作者: Ryyan Akhtar
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 28 pages, 12 figures. Preprint. Code and experiments conducted independently
Abstract:Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification, which increases the influence of task relevant neurons without changing the model’s parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain, while having low effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.
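选择性神经元放大(SNA)在推理时增大任务相关神经元的影响而不修改模型参数。下面是一个极简的概念草图(`amplify`、`gain` 为示意命名,真实系统需先定位任务相关神经元,此处假设其索引已知):

```python
def amplify(activations, neuron_ids, gain):
    """Scale the activations of selected task-relevant neurons at inference
    time, leaving model parameters untouched (conceptual sketch of SNA)."""
    out = list(activations)        # copy: the model itself is not altered
    for i in neuron_ids:
        out[i] *= gain
    return out
```

与上一篇的人格向量(加法式引导)不同,这里是对选定坐标的乘法式增益,二者都属于推理期激活干预。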
[NLP-21] Multilingual Embedding Probes Fail to Generalize Across Learner Corpora
【速读】: 该论文旨在解决多语言嵌入模型是否编码了通用的语言熟练度表示(proficiency representation)这一问题。其核心挑战在于,现有模型在不同语料库和语言间的表现差异是否源于对抽象、可迁移的熟练度维度的捕捉,还是仅依赖于特定语料的分布特性。解决方案的关键在于设计线性与非线性探测器(probes),基于Qwen3-Embedding模型各层隐藏状态预测CEFR熟练度等级,并通过跨语料库评估验证其泛化能力。结果表明,尽管模型在同分布下表现优异(QWK≈0.7),但跨语料库时性能急剧下降,且残差分析显示探测器趋向于预测均匀分布标签,说明当前模型学习到的是语料特异性分布特征(如主题、语言、任务类型和评分方法),而非语言通用的熟练度表征。
链接: https://arxiv.org/abs/2604.07095
作者: Laurits Lyngbaek,Ross Deans Kristensen-McLachlan
机构: Aarhus University (奥胡斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance (QWK ≈ 0.7), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.
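摘要中用于评估探测器的 QWK(quadratic weighted kappa)是序数标签的一致性指标:权重 w_ij = (i−j)²/(N−1)²,QWK = 1 − Σw·O / Σw·E。以下为纯 Python 示意实现(假设 CEFR 等级已映射为整数 0..N−1):

```python
def quadratic_weighted_kappa(y_true, y_pred, n_levels):
    """QWK between ordinal labels 0..n_levels-1.
    1.0 = perfect agreement, 0.0 = chance-level, negative = worse than chance."""
    O = [[0.0] * n_levels for _ in range(n_levels)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    n = len(y_true)
    hist_t = [sum(row) for row in O]
    hist_p = [sum(O[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2   # quadratic penalty
            E = hist_t[i] * hist_p[j] / n            # expected under independence
            num += w * O[i][j]
            den += w * E
    return 1.0 - num / den
```

相比普通准确率,QWK 按误差距离平方加惩罚,因此"差一级"与"差三级"的错误代价不同,适合 CEFR 这类序数评分。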
[NLP-22] Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English
【速读】: 该论文旨在探究双语语言模型(bilingual language models)是否能够模拟人类双语者在阅读过程中表现出的跨语言激活现象,特别是对同源词(cognates)和跨语言同形异义词(interlingual homographs,即“假朋友”)的不同处理模式。其解决方案的关键在于通过控制词汇共享条件(即是否为“假朋友”或“真朋友”分配共享嵌入或语言特异性嵌入),训练荷兰语-英语因果Transformer模型,并利用心理语言学实验中的刺激材料进行 surprisal(意外度)和嵌入相似性分析。结果表明,模型仅在嵌入共享条件下表现出跨语言效应,且此时同源词与假朋友均呈现促进效应,而这种效应主要由词频驱动而非形式-意义映射的一致性;只有当仅同源词共享嵌入时,模型才复现了人类双语者的定性行为模式。因此,模型对人类双语阅读机制的拟合程度高度依赖于词汇重叠的编码方式。
链接: https://arxiv.org/abs/2604.07067
作者: Iza Škrjanec,Irene Elisabeth Winther,Vera Demberg,Stefan L. Frank
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Bilingual speakers show cross-lingual activation during reading, especially for words with shared surface form. Cognates (friends) typically lead to facilitation, whereas interlingual homographs (false friends) cause interference or no effect. We examine whether cross-lingual activation in bilingual language models mirrors these patterns. We train Dutch-English causal Transformers under four vocabulary-sharing conditions that manipulate whether (false) friends receive shared or language-specific embeddings. Using psycholinguistic stimuli from bilingual reading studies, we evaluate the models through surprisal and embedding similarity analyses. The models largely maintain language separation, and cross-lingual effects arise primarily when embeddings are shared. In these cases, both friends and false friends show facilitation relative to controls. Regression analyses reveal that these effects are mainly driven by frequency rather than consistency in form-meaning mapping. Only when just friends share embeddings are the qualitative patterns of bilinguals reproduced. Overall, bilingual language models capture some cross-linguistic activation effects. However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.
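论文通过 surprisal(意外度)分析评估模型对刺激词的加工:surprisal 定义为 −log₂ P(token | context),概率越低、意外度越高,对应人类阅读中的加工难度。示意如下(toy 分布为假设数据,非论文材料):

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2 P(token | context)."""
    return -math.log2(prob)

# Hypothetical next-token distribution over a tiny Dutch/English vocabulary,
# purely for illustration of the computation.
dist = {"room": 0.5, "kamer": 0.25, "chamber": 0.25}
```

同源词促进效应在该框架下表现为:共享形式的词获得更高条件概率,surprisal 相对对照词更低。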
[NLP-23] SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
【速读】: 该论文旨在解决传统方面情感分析(Aspect-Based Sentiment Analysis, ABSA)仅依赖离散极性标签(如正面、负面、中性)所导致的表达能力有限问题,以及将ABSA扩展至公共议题话语(如政治、能源和气候议题)时缺乏有效建模手段的问题。解决方案的关键在于引入维度情感分析(Dimensional Aspect-Based Sentiment Analysis, DimABSA),通过在效价-唤醒度(Valence-Arousal, VA)连续空间中建模情感,实现更精细的情感表示;同时提出维度立场分析(DimStance),将立场目标视为方面,并将立场检测重构为VA空间中的回归任务,从而统一处理方面级情感与立场分析。此外,作者设计了连续F1(cF1)指标以联合评估结构化抽取与VA回归任务,推动该领域方法论的发展。
链接: https://arxiv.org/abs/2604.07066
作者: Liang-Chih Yu,Jonas Becker,Shamsuddeen Hassan Muhammad,Idris Abdulmumin,Lung-Hao Lee,Ying-Lung Lin,Jin Wang,Jan Philip Wahle,Terry Ruas,Natalia Loukachevitch,Alexander Panchenko,Ilseyar Alimova,Lilian Wanzare,Nelson Odhiambo,Bela Gipp,Kai-Wei Chang,Saif M. Mohammad
机构: Yuan Ze University(元智大学); University of Göttingen(哥廷根大学); Imperial College London(伦敦帝国学院); University of Pretoria(比勒陀利亚大学); National Yang Ming Chiao Tung University(阳明交通大学); Central Police University(中央警察大学); Yunnan University(云南大学); Lomonosov Moscow State University(莫斯科国立大学); Skoltech(斯科尔科沃科学技术研究所); AIRI(人工智能研究所); Maseno University(马塞诺大学); UCLA(加州大学洛杉矶分校); National Research Council Canada(加拿大国家研究委员会)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence-arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.
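任务提出的连续 F1(cF1)将结构化抽取与 VA 回归联合评估;官方定义见任务的 GitHub 仓库。下面仅给出一个示意性变体(`d_max` 与部分匹配的折扣方式均为假设),说明"按 VA 距离对匹配得分打折"这一思路:

```python
import math

def continuous_f1(gold, pred, d_max=2.0):
    """Illustrative continuous F1: exact aspect match earns partial credit
    scaled by valence-arousal distance. Each item is (aspect, (v, a)).
    NOTE: the official SemEval cF1 definition may differ; this is a sketch."""
    gold_left = list(gold)
    tp = 0.0
    for aspect, (v, a) in pred:
        for g in gold_left:
            g_aspect, (gv, ga) = g
            if aspect == g_aspect:
                dist = math.hypot(v - gv, a - ga)
                tp += max(0.0, 1.0 - dist / d_max)  # credit decays with VA error
                gold_left.remove(g)                 # each gold matched once
                break
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

这一形式下,方面抽取错误直接损失整个匹配,而 VA 预测偏差只造成按距离比例的折扣,二者被统一进单一指标。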
[NLP-24] IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text
【速读】: 该论文旨在解决现有印尼语情感分析模型在孤立文本分类中忽略话题上下文的问题,而话题上下文往往决定了语句的情感极性(positive, negative, or neutral)。解决方案的关键在于提出一种基于上下文条件的分类器IndoBERT-Sentiment,其输入包含一个话题上下文和待分析文本,从而生成基于具体话题的情感预测。该模型基于IndoBERT Large(335M参数)构建,并在31,360个标注于188个话题上的上下文-文本对上训练,最终在F1宏平均指标上达到0.856,在准确率上达88.1%,显著优于三种主流通用型印尼语情感模型(提升35.6 F1点),验证了上下文条件化策略从相关性分类到情感分析的有效迁移能力。
链接: https://arxiv.org/abs/2604.07057
作者: Muhammad Apriandito Arya Saputra,Andry Alamsyah,Dian Puteri Ramadhani,Thomhert Suprapto Siadari,Hanif Fakhrurroja
机构: Telkom University (Telkom大学); National Research and Innovation Agency (BRIN) (国家研究与创新署(BRIN))
类目: Computation and Language (cs.CL)
备注: 8 pages, 5 tables, and 2 figures
Abstract:Existing Indonesian sentiment analysis models classify text in isolation, ignoring the topical context that often determines whether a statement is positive, negative, or neutral. We introduce IndoBERT-Sentiment, a context-conditioned sentiment classifier that takes both a topical context and a text as input, producing sentiment predictions grounded in the topic being discussed. Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics, the model achieves an F1 macro of 0.856 and accuracy of 88.1%. In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points. We show that context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables the model to correctly classify texts that are systematically misclassified by context-free approaches.
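上下文条件化的常见做法是把话题与文本编码为 BERT 风格的句对输入,使分类器在给定话题的前提下判断情感极性。以下模板仅为示意(IndoBERT-Sentiment 的实际输入格式未在摘要中给出):

```python
def make_pair_input(context, text, cls="[CLS]", sep="[SEP]"):
    """Hypothetical sketch: encode (topic context, text) as a BERT-style
    sentence pair so the classifier conditions sentiment on the topic."""
    return f"{cls} {context} {sep} {text} {sep}"
```

实际训练中这一配对通常交由 tokenizer 完成(两段文本分属不同 segment),此处仅演示"话题 + 文本"的双输入结构。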
[NLP-25] Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在销售对话场景中缺乏对交易进展和结果的量化评估问题,因为现有对话基准通常不衡量成交推进效果。其核心解决方案是构建一个双语(中文/英文)销售对话基准 SalesLLM,涵盖金融与消费品两大真实应用场景,包含30,074个脚本配置和1,805个多轮精心设计的对话场景,具备可控难度与角色设定;同时提出全自动评估流程:基于LLM的评分器用于判断销售进程进展,以及微调后的BERT分类器识别对话结束时的购买意图,并通过SFT与DPO训练用户模型CustomerLM以提升模拟真实性,从而实现对销售代理性能的可扩展、高保真度评估。
链接: https://arxiv.org/abs/2604.07054
作者: Xuanbo Su,Wenhao Hu,Le Zhan,Yanqi Yang,Leo Huang
机构: SF Express (顺丰速运)
类目: Computation and Language (cs.CL)
备注:
Abstract:Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performance LLMs are competitive with human-level performance while the less capable ones are worse than human. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
[NLP-26] Gemma 4, Phi-4, and Qwen 3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)在实际推理任务中,混合专家模型(Mixture-of-Experts, MoE)是否真正优于密集模型(Dense Models)的问题,尤其是在端到端推理约束下(如延迟、显存占用和计算量)的效率与性能权衡。其解决方案的关键在于构建了一个受控的实证基准测试框架,系统评估了七种近期面向推理任务优化的指令微调模型(涵盖MoE与密集架构),在四个主流评测基准上采用三种提示策略(零样本、思维链、少样本思维链)进行大规模实验(共8400次评估),量化记录准确率、延迟、峰值GPU显存(VRAM)及每 token 的近似浮点运算次数(FLOPs)。结果表明,稀疏激活本身并不保证最优实用性能,最终的准确性-效率权衡取决于模型架构、提示协议和任务组成三者的协同作用,从而为部署导向的推理大语言模型(Reasoning LLMs)评估提供了可复现的基准工具与数据支持。
链接: https://arxiv.org/abs/2604.07035
作者: Md Motaleb Hossen Manik,Ge Wang
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks – ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 – under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.
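摘要中的"FLOPs-per-token 代理指标"通常采用"每 token 约 2 × 激活参数量"的常见近似(忽略对 KV 缓存的注意力计算)。以下草图基于该近似说明 MoE 与密集模型的对比逻辑(数值仅为示意,并非论文报告值):

```python
def flops_per_token_proxy(n_active_params):
    """Common decode-time approximation: ~2 FLOPs per *active* parameter per
    generated token (ignores attention over the KV cache)."""
    return 2 * n_active_params

# Illustrative only: a 30B-total MoE activating ~3B params per token
# (in the spirit of Qwen3-30B-A3B) vs. an 8B dense model.
moe_cost = flops_per_token_proxy(3e9)
dense_cost = flops_per_token_proxy(8e9)
```

该代理解释了为何稀疏激活在计算量上占优,但论文的要点正是:显存(需常驻全部参数)与端到端延迟可能抵消这一优势。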
[NLP-27] MARS: Enabling Autoregressive Models Multi-Token Generation
【速读】: 该论文旨在解决自回归语言模型(Autoregressive Language Models)在文本生成过程中效率低下的问题,即尽管连续词元(token)在上下文条件下高度可预测,仍需逐个生成,导致推理延迟高、吞吐量受限。解决方案的关键在于提出一种轻量级微调方法 MARS(Mask AutoRegreSsion),其核心创新是通过继续训练现有指令微调的自回归模型,使其在单次前向传播中预测多个词元,而无需修改模型架构或引入额外参数。MARS 保持与原始自回归模型相同的调用接口且性能无损,并支持基于置信度阈值的实时速度调节,从而实现推理效率与生成质量之间的灵活权衡。
链接: https://arxiv.org/abs/2604.07023
作者: Ziqi Jin,Lei Wang,Ziwei Luo,Aixin Sun
机构: Nanyang Technological University (南洋理工大学); Singapore Management University (新加坡管理大学); Uppsala University (乌普萨拉大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 4 figures
Abstract:Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
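MARS 提到的“基于置信度阈值的实时速度调节”可以用下面这个极简的 Python 草图来理解(假设性实现,并非论文代码;`accept_tokens`、示例词元与概率均为虚构):一次前向传播给出若干候选词元及其置信度,系统只接受置信度连续超过阈值的前缀。

```python
# 假设性示意:基于置信度阈值的多词元接受策略(非 MARS 官方实现)

def accept_tokens(proposals, threshold):
    """proposals: [(词元, 置信度), ...];返回被接受的词元前缀。"""
    accepted = []
    for token, conf in proposals:
        if conf < threshold:
            break  # 置信度不足即停止,回退到逐词元生成
        accepted.append(token)
    return accepted if accepted else [proposals[0][0]]  # 至少推进一个词元

proposals = [("The", 0.95), (" cat", 0.90), (" sat", 0.60), (" down", 0.85)]
high_throughput = accept_tokens(proposals, threshold=0.5)  # 高负载:低阈值、多接受
high_quality = accept_tokens(proposals, threshold=0.8)     # 低负载:高阈值、少接受
```

阈值只改变每步接受多少词元,无需换模型或重启服务,对应摘要所说的“延迟-质量旋钮”。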
[NLP-28] Corpora deduplication or duplication in Natural Language Processing of few resourced languages? A case of study: The Mexico's Nahuatl
【速读】: 该论文旨在解决低资源语言(即“π-语言”)在自然语言处理(Natural Language Processing, NLP)中因语料库稀缺而难以训练有效语言模型的问题。针对这一挑战,研究提出通过受控的数据复制(data duplication)策略扩展现有语料库,以提升词向量嵌入的质量和下游任务性能。其解决方案的关键在于采用增量式复制技术(incremental duplication technique),在不引入噪声的前提下逐步扩充原始语料库(即新π-yalli语料库),从而训练出更适合NLP任务的静态词嵌入,并在句子级语义相似性任务中验证了该方法能带来适度性能提升。
链接: https://arxiv.org/abs/2604.07015
作者: Juan-José Guzman-Landa,Juan-Manuel Torres-Moreno,Graham Ranger,Miguel Figueroa-Saavedra,Martha-Lorena Avendaño-Garrido,Elvys Linhares-Pontes,Luis-Gil Moreno-Jiménez
机构: Université d’Avignon (法国阿维尼翁大学); Université de Tours (法国图尔大学); Universidad Veracruzana (墨西哥韦拉克鲁斯大学); Trading Central (交易中央); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure, 1 table
Abstract:In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or π-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic π-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new π-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.
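摘要中的“增量式复制”可以用如下 Python 草图说明(假设性实现;语料与函数名均为示意):每一轮在现有语料后追加一份原始语料副本,得到 1x、2x、3x 等不同倍数的扩展语料,再分别训练与评估词向量。

```python
# 假设性示意:受控的增量复制(incremental duplication)策略

def incremental_duplication(corpus, max_factor):
    """corpus: 句子列表;返回 {倍数: 扩展后的语料} 字典。"""
    expansions = {}
    expanded = []
    for factor in range(1, max_factor + 1):
        expanded = expanded + list(corpus)  # 每轮追加一份完整副本
        expansions[factor] = list(expanded)
    return expansions

corpus = ["se tlakatl", "ome siwatl"]  # 纳瓦特尔语示例句,仅作占位
result = incremental_duplication(corpus, max_factor=3)
```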
[NLP-29] DTCRS: Dynamic Tree Construction for Recursive Summarization
【速读】: 该论文旨在解决递归摘要(Recursive Summarization)在生成式问答系统中因冗余摘要节点导致的构建效率低下及问答相关性下降的问题,同时指出递归摘要并非适用于所有类型的问题。其解决方案的关键在于提出DTCRS方法,该方法通过分析文档结构与查询语义动态生成摘要树:首先根据问题类型判断是否需要构建摘要树;随后对问题进行分解,并利用子问题的嵌入向量作为初始聚类中心,从而减少冗余摘要并提升摘要与问题的相关性。该策略显著降低了摘要树构建时间,并在三个问答任务上实现了性能提升。
链接: https://arxiv.org/abs/2604.07012
作者: Guanran Luo,Zhongquan Jian,Wentao Qiu,Meihong Wang,Qingqiang Wu
机构: School of Informatics, Xiamen University (厦门大学信息学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.
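“利用子问题的嵌入向量作为初始聚类中心”这一步可以粗略示意如下(假设性实现,嵌入为随意构造的二维向量,仅说明文本块向子问题中心的分配逻辑):

```python
import numpy as np

# 假设性示意:用分解后的子问题嵌入作为聚类中心,使每个簇天然对应一个子问题

def assign_chunks(chunk_embs, subq_embs):
    """把每个文本块分配给最近的子问题中心,返回簇标签数组。"""
    dists = np.linalg.norm(chunk_embs[:, None, :] - subq_embs[None, :, :], axis=-1)
    return dists.argmin(axis=1)

sub_questions = np.array([[0.0, 0.0], [10.0, 10.0]])      # 两个子问题的嵌入(假设值)
chunks = np.array([[0.5, 0.2], [9.5, 10.1], [0.1, 0.3]])  # 三个文本块的嵌入(假设值)
labels = assign_chunks(chunks, sub_questions)
```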
[NLP-30] Continuous Interpretive Steering for Scalar Diversity
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理语用推理(Pragmatic Inference)时缺乏对梯度性(Gradedness)敏感性的评估问题,特别是如何系统性地探测模型是否能区分不同词汇项在语用增强(Pragmatic Enrichment)强度上的差异。传统方法依赖于提示(Prompt)层面的操控,难以精确刻画语用推理的连续变化特性。其解决方案的关键在于提出一种名为“连续解释引导”(Continuous Interpretive Steering, CIS)的新方法,将激活层的干预强度作为连续实验变量,并结合新构建的标注数据集 GraSD(编码了梯度化的标量多样性),从而实现对LLMs语用敏感性的精细测量。实验表明,仅使用统一激活引导会削弱词汇级差异,而采用梯度激活引导则可恢复与标量多样性一致的差异化解释偏移,证明语用梯度信息已编码在模型表示空间中并可通过受控干预被系统性还原。
链接: https://arxiv.org/abs/2604.07006
作者: Ye-eun Cho
机构: Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations. Beyond prompt-level effects, this study introduces Continuous Interpretive Steering (CIS), a method that probes graded pragmatic interpretation by treating activation-level steering strength as a continuous experimental variable. To support this analysis, this study introduces a new dataset, GraSD, which encodes graded scalar diversity. Experiments on four LLMs show that uniform activation steering increases pragmatic interpretations globally but collapses item-level variation, whereas graded activation steering yields differentiated interpretive shifts aligned with scalar diversity grades. It indicates that graded sensitivity is encoded in the representation space and can be systematically recovered through controlled intervention. Together, CIS and GraSD provide a principled framework for evaluating graded pragmatic sensitivity in LLMs.
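将激活层干预强度作为连续变量,其核心只是一个线性操作 h' = h + α·v;以下为假设性的最小示意(隐藏向量与引导方向均为虚构,真实 CIS 作用于模型内部激活):

```python
import numpy as np

# 假设性示意:连续解释引导(CIS)的核心操作,α 作为连续实验变量平滑变化

def steer(hidden, direction, alpha):
    """hidden: 某层激活;direction: 语用引导方向;alpha: 连续引导强度。"""
    return hidden + alpha * direction

h = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.0])  # 假设已通过对比提示提取出的“语用解释”方向
sweep = [steer(h, v, a) for a in (0.0, 0.5, 1.0)]  # 梯度化的引导强度扫描
```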
[NLP-31] ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals ACL2026
【速读】: 该论文旨在解决历史文献中检索增强生成(Retrieval-Augmented Generation, RAG)系统在时间敏感场景下的准确性问题,特别是在处理《春秋》这类古典中文编年体史书时,如何确保检索到的文本不仅语义相关,而且具有时间一致性。传统基于语义相似度的检索方法容易产生“看似合理但时间错误”的结果,因为古典中文中的时间表达高度凝练且依赖上下文推断,缺乏标准化的格里高利历格式。解决方案的关键在于提出一个名为ChunQiuTR的时间键检索基准和一种名为CTD(Calendrical Temporal Dual-encoder)的时间感知双编码器模型:CTD通过傅里叶变换建模绝对历法上下文,并引入相对偏移量作为时间偏差项,从而显式地将时间维度融入检索过程,显著提升了在月级时间粒度上的检索精度与一致性。
链接: https://arxiv.org/abs/2604.06997
作者: Yihao Wang,Zijian He,Jie Ren,Keze Wang
机构: Sun Yat-Sen University(中山大学); Shaanxi Normal University(陕西师范大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 11 figures. To appear in Findings of ACL 2026
Abstract:Retrieval shapes how language models access and ground knowledge in retrieval-augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non-Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition. ChunQiuTR organizes records by month-level reign keys and includes chrono-near confounders that mirror realistic retrieval failures. We further propose CTD (Calendrical Temporal Dual-encoder), a time-aware dual-encoder that combines Fourier-based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at this https URL.
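“基于傅里叶的绝对历法上下文”大致可以理解为把月级时间索引映射为一组 sin/cos 特征,使相邻月份的表示在嵌入空间中连续;以下为假设性草图(频率与维度均为示例取值,并非 CTD 实现):

```python
import math

# 假设性示意:月级历法时间的傅里叶特征编码

def calendrical_features(year, month, num_freqs=2):
    t = year * 12 + month  # 换算成以月为单位的绝对时间索引
    feats = []
    for k in range(num_freqs):
        freq = 1.0 / (12 ** (k + 1))  # 逐级放大的周期(示例取值)
        feats.append(math.sin(2 * math.pi * freq * t))
        feats.append(math.cos(2 * math.pi * freq * t))
    return feats

f1 = calendrical_features(year=10, month=3)
f2 = calendrical_features(year=10, month=3)  # 同一时间点,特征相同
f3 = calendrical_features(year=10, month=4)  # 相邻月份,特征连续但不同
```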
[NLP-32] Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自动评估中普遍存在的自偏好偏差(Self-Preference Bias, SPB)问题,即评判者倾向于偏好由自身或同一家族模型生成的输出,从而导致评估结果失真,尤其在递归自我改进场景下严重影响模型迭代质量。解决方案的关键在于首次系统性地研究了基于评分细则(rubric-based evaluation)这一新兴评估范式中的SPB现象,并通过IFEval和HealthBench两个基准验证:即使在客观、可程序化验证的评分规则下,SPB依然显著存在(最高使错误判定概率提升50%),且主观性更强的领域(如医疗咨询)中偏差可达10分量级,足以改变前沿模型排名。研究进一步发现,集成多个评判者可缓解但无法彻底消除SPB,同时识别出负向评分项、极端长度评分条目及主观话题(如紧急转诊)是驱动SPB的主要因素,为后续设计更公平的自动化评估机制提供了关键依据。
链接: https://arxiv.org/abs/2604.06996
作者: José Pombal,Ricardo Rei,André F. T. Martins
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.
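论文中“自家输出被误判为满足的概率最高可高出 50%”这类度量,可用如下假设性计算示意(数据与函数名均为虚构,仅说明比值的定义方式):

```python
# 假设性示意:评分细则评估中自偏好偏差(SPB)的一种度量
# 仅统计"生成器实际未满足细则"的样本,比较自家/他家输出被误判为满足的比例

def self_preference_bias(records):
    """records: [(是否自家输出, 是否被判为满足), ...];返回误判率之比。"""
    own = [j for own_flag, j in records if own_flag]
    other = [j for own_flag, j in records if not own_flag]
    own_err = sum(own) / len(own)        # 自家输出被误判为满足的比例
    other_err = sum(other) / len(other)  # 他家输出被误判为满足的比例
    return own_err / other_err           # >1 表示存在自偏好偏差

# 假设数据:自家输出 10 条中误判 3 条,他家输出 10 条中误判 2 条
data = [(True, 1)] * 3 + [(True, 0)] * 7 + [(False, 1)] * 2 + [(False, 0)] * 8
ratio = self_preference_bias(data)
```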
[NLP-33] The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)对劳动力市场中职业技能自动化可行性缺乏系统量化评估的问题。其解决方案的关键在于构建了技能自动化可行性指数(Skill Automation Feasibility Index, SAFI),通过在263个文本类任务上对四个前沿LLM(LLaMA 3.3 70B、Mistral Large、Qwen 2.5 72B和Gemini 2.5 Flash)进行基准测试,覆盖美国劳工部O*NET分类中的全部35项技能,并结合Anthropic经济指数中的真实AI采用数据,提出一个四象限的AI影响矩阵(AI Impact Matrix),从而识别出高替代风险、需再培训、AI增强及低替代风险四类技能。研究发现数学与编程技能自动化可行性最高,而主动倾听与阅读理解最低,且存在“能力-需求倒置”现象,即AI表现最弱的技能恰恰是就业市场最需要的;同时,绝大多数实际AI交互为增强而非替代(78.7%),表明当前LLM在技能层面的自动化潜力具有高度一致性,且更依赖于技能本身而非具体模型。
链接: https://arxiv.org/abs/2604.06906
作者: Rudra Jadhav,Janhavi Danve
机构: Savitribai Phule Pune University (萨维特里拜·富勒浦那大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 11 pages, 12 figures, 2 tables, 17 references. Code and data available at
Abstract:As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs – LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash – across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor’s O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix – an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a “capability-demand inversion” where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.
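AI 影响矩阵按“自动化可行性(SAFI)”与“真实场景中 AI 暴露度”两个维度把技能分入四象限;下面给出一个假设性的划分草图(阈值与象限判定规则为虚构示例,具体标准以论文为准):

```python
# 假设性示意:AI 影响矩阵的四象限划分(阈值为示例取值)

def impact_quadrant(safi, exposure, safi_mid=55.0, exposure_mid=0.5):
    if safi >= safi_mid and exposure >= exposure_mid:
        return "High Displacement Risk"   # 高可行性、高暴露
    if safi >= safi_mid:
        return "Upskilling Required"      # 高可行性、低暴露
    if exposure >= exposure_mid:
        return "AI-Augmented"             # 低可行性、高暴露
    return "Lower Displacement Risk"      # 低可行性、低暴露

q_math = impact_quadrant(safi=73.2, exposure=0.7)     # 数学技能(SAFI 取自摘要)
q_listen = impact_quadrant(safi=42.2, exposure=0.2)   # 主动倾听(暴露度为假设值)
```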
[NLP-34] Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus
【速读】: 该论文旨在解决小到中等规模语言模型在法语生物医学领域适应性不足的问题,特别是针对非英语语种的领域专用化挑战。其核心解决方案是采用领域自适应预训练(Domain-Adaptive Pre-Training, DAPT)策略,通过持续预训练使模型更好地适配特定领域。关键在于构建高质量的法语生物医学语料库并进行精细化处理,同时探索因果语言建模方法,并在DAPT后引入模型融合(model merging)技术以缓解通用能力退化问题,从而在保持整体性能的同时提升专业任务表现。
链接: https://arxiv.org/abs/2604.06903
作者: Aidan Mannion,Cécile Macaire,Armand Violle,Stéphane Ohayon,Xavier Tannier,Didier Schwab,Lorraine Goeuriot,François Portet
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.
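文中“DAPT 之后进行模型融合”的最简单形式是权重线性插值;以下为假设性示意(以标量代替真实张量,`lam` 控制通用能力与领域能力的折中):

```python
# 假设性示意:DAPT 后的模型融合(线性插值)

def merge_weights(base, adapted, lam):
    """base/adapted: {参数名: 权重};lam=0 退回通用模型,lam=1 为纯领域模型。"""
    return {name: (1 - lam) * base[name] + lam * adapted[name] for name in base}

base = {"w1": 1.0, "w2": -2.0}    # 通用模型权重(示例标量,实际为张量)
adapted = {"w1": 3.0, "w2": 0.0}  # DAPT 后权重(示例标量)
merged = merge_weights(base, adapted, lam=0.5)
```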
[NLP-35] iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations ACL2026
【速读】: 该论文旨在解决文本因果发现(text-based causal discovery)中缺乏带有因果标注的文本数据这一关键障碍,因人工标注成本高昂导致难以获得可靠的地面真实数据。为应对这一问题,作者提出 iTAG 方法,其核心创新在于在现有依赖大语言模型(Large Language Model, LLM)的文本生成流程中引入“真实世界概念赋值”步骤——即在将因果图转化为文本之前,通过将现实概念分配给节点,并将其建模为一个以因果图为目标的逆问题,利用思维链(Chain-of-Thought, CoT)推理迭代地检验和优化概念选择,使概念间诱导的关系尽可能与目标因果关系一致。该方案显著提升了生成文本的自然度与因果图标注准确性,且生成数据经测试可作为文本因果发现算法的可靠基准,具备实际应用价值。
链接: https://arxiv.org/abs/2604.06902
作者: Wenshuo Wang,Boyu Cao,Nan Zhuang,Wei Li
机构: South China University of Technology (华南理工大学); China Mobile GBA Innovation Institute (中国移动大湾区创新研究院); Chinese Academy of Sciences (中国科学院)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026
Abstract:A fundamental obstacle to causal discovery from text is the lack of causally annotated text data for use as ground truth, due to high annotation costs. This motivates an important task of generating text with causal graph annotations. Early template-based generation methods sacrifice text naturalness in exchange for high causal graph annotation accuracy. Recent Large Language Model (LLM)-dependent methods directly generate natural text from target graphs through LLMs, but do not guarantee causal graph annotation accuracy. Therefore, we propose iTAG, which performs real-world concept assignment to nodes before converting causal graphs into text in existing LLM-dependent methods. iTAG frames this process as an inverse problem with the causal graph as the target, iteratively examining and refining concept selection through Chain-of-Thought (CoT) reasoning so that the induced relations between concepts are as consistent as possible with the target causal relationships described by the causal graph. iTAG demonstrates both extremely high annotation accuracy and naturalness across extensive tests, and the results of testing text-based causal discovery algorithms with the generated data show high statistical correlation with real-world data. This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
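把概念选择视作“以因果图为目标的逆问题”,可以抽象为一个检验-优化循环;下面是一个高度简化的假设性草图(论文中检验由 LLM 的 CoT 推理承担,这里以可调用对象 `induced_relations` 代替):

```python
# 假设性示意:以因果图为目标迭代选择概念组合

def assign_concepts(target_edges, candidates, induced_relations, max_iters=10):
    """candidates: 候选概念组合;induced_relations(combo) 返回该组合诱导的关系集合。"""
    best, best_score = None, -1.0
    for combo in candidates[:max_iters]:
        induced = induced_relations(combo)
        score = len(induced & target_edges) / len(target_edges)  # 与目标图的一致率
        if score > best_score:
            best, best_score = combo, score
        if score == 1.0:
            break  # 诱导关系与目标因果图完全一致,提前停止
    return best, best_score

target = {("rain", "wet_ground")}
cands = [("sun", "ice"), ("rain", "wet_ground")]
relations = lambda combo: {combo} if combo == ("rain", "wet_ground") else set()
combo, score = assign_concepts(target, cands, relations)
```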
[NLP-36] Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models ACL2026
【速读】: 该论文旨在解决大语音语言模型(Large Speech Language Models, LSLMs)在推理过程中因高Token速率导致的序列长度远超语义内容、进而引发高昂计算成本的问题。其解决方案的关键在于通过层间Oracle干预发现深层网络存在显著冗余,并提出无需训练的基于相似性的Affinity Pooling机制,在输入层与深层均进行Token合并,从而有效压缩语音表示而不损失语义信息,最终实现预填充浮点运算量减少27.48%且保持竞争性准确率,同时在实际部署中带来约1.7倍内存节省和约1.1倍首token生成速度提升。
链接: https://arxiv.org/abs/2604.06871
作者: Bajian Xiang,Tingwei Guo,Xuan Chen,Yang Han
机构: Beike Inc.(贝壳公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 (Findings)
Abstract:Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy. Practical deployment further confirms significant efficiency gains, yielding up to ~1.7x memory savings and ~1.1x faster time-to-first-token on long utterances. Our results challenge the necessity of fully distinct token representations, providing new perspectives on LSLM efficiency.
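“免训练的相似度词元合并”可以用如下假设性草图理解(相邻词元余弦相似度超过阈值即合并为均值;真实实现还涉及层位选择与 KV 缓存策略,此处从略):

```python
import numpy as np

# 假设性示意:基于相似度的词元合并(Affinity Pooling 思路)

def affinity_pool(tokens, threshold):
    """tokens: (N, d) 词元表示;返回合并后的表示列表。"""
    pooled = [tokens[0]]
    for t in tokens[1:]:
        last = pooled[-1]
        cos = float(np.dot(last, t) / (np.linalg.norm(last) * np.linalg.norm(t)))
        if cos > threshold:
            pooled[-1] = (last + t) / 2.0  # 高相似:合并(取均值)
        else:
            pooled.append(t)               # 低相似:保留为新词元
    return pooled

tokens = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])  # 假设的词元表示
pooled = affinity_pool(tokens, threshold=0.9)  # 前两个词元被合并,序列长度 3 -> 2
```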
[NLP-37] To Adapt or not to Adapt? Rethinking the Value of Medical Knowledge-Aware Large Language Models
【速读】: 该论文旨在解决当前临床大语言模型(Clinical LLMs)在标准多选题问答(MCQA)基准上是否真正优于通用大语言模型(General-purpose LLMs)的问题,尤其关注其在英语和西班牙语场景下的表现差异及评估方法的局限性。其解决方案的关键在于:首先构建了一个基于扰动(perturbation-based)的评估基准,系统性地测试模型对指令遵循、鲁棒性和对抗性变化的敏感性;其次引入轻量级8B参数的临床专用模型Marmoka,通过持续领域自适应预训练(continual domain-adaptive pretraining)在医学语料库和指令上优化性能;最终发现,在英语任务中临床模型未显著优于通用模型,但在西班牙语任务中Marmoka表现出更优结果,表明现有评估框架可能无法充分捕捉真正的医学专业能力,且低资源语言如西班牙语也能通过针对性训练获得高质量医疗模型。
链接: https://arxiv.org/abs/2604.06854
作者: Ane G. Domingo-Aldama,Iker De La Iglesia,Maitane Urruela,Aitziber Atutxa,Ander Barrena
机构: University of the Basque Country (巴斯克大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical adaptation. METHODS: We systematically compare general and clinical LLMs on a diverse set of multiple choice clinical question answering tasks in English and Spanish. We introduce a perturbation based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations. Our evaluation includes, one-step and two-step question transformations, multi prompt testing and instruction guided assessment. We analyze a range of state-of-the-art clinical models and their general-purpose counterparts, focusing on Llama 3.1-based models. Additionally, we introduce Marmoka, a family of lightweight 8B-parameter clinical LLMs for English and Spanish, developed via continual domain-adaptive pretraining on medical corpora and instructions. RESULTS: The experiments show that clinical LLMs do not consistently outperform their general purpose counterparts on English clinical tasks, even under the proposed perturbation based benchmark. However, for the Spanish subsets the proposed Marmoka models obtain better results compared to Llama. CONCLUSIONS: Our results show that, under current short-form MCQA benchmarks, clinical LLMs offer only marginal and unstable improvements over general-purpose models in English, suggesting that existing evaluation frameworks may be insufficient to capture genuine medical expertise. We further find that both general and clinical models exhibit substantial limitations in instruction following and strict output formatting. Finally, we demonstrate that robust medical LLMs can be successfully developed for low-resource languages such as Spanish, as evidenced by the Marmoka models. 
[NLP-38] MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗问诊场景中面对非合作患者时诊断准确率显著下降的问题。现有方法要么缺乏对患者行为的分级刻画,要么无法捕捉多维行为之间的交互效应,导致对LLM诊断鲁棒性的评估不充分。其解决方案的关键在于提出MedDialBench基准,通过将患者行为分解为五个具有分级严重程度的维度(逻辑一致性、健康认知、表达风格、披露程度和态度),并设计案例特异的行为脚本,实现受控的因子设计实验。这一结构化方法支持剂量-反应分析、敏感性评估及跨维度交互检测,从而揭示信息污染(虚构症状)比信息缺失(隐瞒信息)造成更严重的诊断误差(1.7–3.4倍),且仅虚构行为在所有五种模型中均达到统计显著性,并表现出超加性交互效应(即组合影响远大于单个维度之和)。
链接: https://arxiv.org/abs/2604.06846
作者: Xiaotian Luo,Xun Jiang,Jiangcheng Wu
机构: Shanda Group (盛大集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 9 tables. Preprint
Abstract:Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis, and none analyze cross-dimension interactions. We introduce MedDialBench, a benchmark enabling controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. It decomposes patient behavior into five dimensions – Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude – each with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection. Evaluating five frontier LLMs across 7,225 dialogues (85 cases x 17 configurations x 5 models), we find a fundamental asymmetry: information pollution (fabricating symptoms) produces 1.7-3.4x larger accuracy drops than information deficit (withholding information), and fabricating is the only configuration achieving statistical significance across all five models (McNemar p < 0.05). Among six dimension combinations, fabricating is the sole driver of super-additive interaction: all three fabricating-involving pairs produce O/E ratios of 0.70-0.81 (35-44% of eligible cases fail under the combination despite succeeding under each dimension alone), while all non-fabricating pairs show purely additive effects (O/E ~ 1.0). Inquiry strategy moderates deficit but not pollution: exhaustive questioning recovers withheld information, but cannot compensate for fabricated inputs. Models exhibit distinct vulnerability profiles, with worst-case drops ranging from 38.8 to 54.1 percentage points.
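摘要中的 O/E(观察/期望)比用于检测超加性交互:在独立性假设下,组合配置的期望准确率为两个单维度准确率之积,O/E 明显小于 1 即表明组合影响超加性。以下为假设性的最小计算示意(数值为虚构):

```python
# 假设性示意:跨维度交互的 O/E(观察/期望)比计算

def oe_ratio(acc_a, acc_b, acc_combined):
    expected = acc_a * acc_b       # 独立性假设下组合配置的期望准确率
    return acc_combined / expected # < 1 表示超加性(组合影响大于各维度之和)

# 假设值:维度 A 单独 0.8、维度 B 单独 0.9,组合后实测 0.54
ratio = oe_ratio(0.8, 0.9, 0.54)
```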
[NLP-39] HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues
【速读】: 该论文旨在解决现有对话系统中长期记忆机制在适应不同查询类别时灵活性不足以及计算开销过高的问题。当前方法依赖连续摘要或基于OpenIE的图结构构建与固定Top-k检索,难以应对多样化信息需求且效率低下。解决方案的关键在于提出HingeMem——一种基于边界引导的长期记忆框架,其核心是将事件分割理论转化为可解释的索引接口,通过四个要素(人、时间、地点、主题)的变化触发边界并写入当前片段,从而减少冗余操作并保留关键上下文;同时引入查询自适应检索机制,动态决定“检索什么”(基于查询条件的元素索引路由)和“检索多少”(根据查询类型控制检索深度),显著提升检索效率与准确性。
链接: https://arxiv.org/abs/2604.06845
作者: Yijie Zhong,Yunfan Gao,Haofen Wang
机构: College of Design and Innovation, Tongji University (同济大学设计创意学院); Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University (同济大学智能自主系统研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by TheWebConf 2026
Abstract:Long-term memory is critical for dialogue systems that support continuous, sustainable, and personalized interactions. However, existing methods rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, leading to limited adaptability across query categories and high computational overhead. In this paper, we propose HingeMem, a boundary-guided long-term memory that operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any such element changes, HingeMem draws a boundary and writes the current segment, thereby reducing redundant operations and preserving salient context. To enable robust and efficient retrieval under diverse information needs, HingeMem introduces query-adaptive retrieval mechanisms that jointly decide (a) what to retrieve: determine the query-conditioned routing over the element-indexed memory; (b) how much to retrieve: control the retrieval depth based on the estimated query type. Extensive experiments across LLM scales (from 0.6B to production-tier models; e.g., Qwen3-0.6B to Qwen-Flash) on LOCOMO show that HingeMem achieves approximately 20% relative improvement over strong baselines without query categories specification, while reducing computational cost (68% reduction in question answering token cost compared to HippoRAG2). Beyond advancing memory modeling, HingeMem's adaptive retrieval makes it a strong fit for web applications requiring efficient and trustworthy memory over extended interactions.
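“任一要素变化即触发边界并写入当前片段”的写入逻辑可示意如下(假设性实现,对话数据为虚构;真实系统中四要素由模型从对话中抽取):

```python
# 假设性示意:边界触发的对话分段(HingeMem 写入侧思路)

ELEMENTS = ("person", "time", "location", "topic")

def segment(turns):
    """turns: 每轮对话的四要素字典列表;返回片段列表(每段为若干轮)。"""
    segments, current = [], []
    prev = None
    for turn in turns:
        key = tuple(turn[e] for e in ELEMENTS)
        if prev is not None and key != prev:  # 任一要素变化 -> 触发边界
            segments.append(current)
            current = []
        current.append(turn)
        prev = key
    if current:
        segments.append(current)
    return segments

turns = [
    {"person": "A", "time": "mon", "location": "home", "topic": "travel"},
    {"person": "A", "time": "mon", "location": "home", "topic": "travel"},
    {"person": "A", "time": "tue", "location": "home", "topic": "travel"},  # 时间变化
]
segs = segment(turns)  # 前两轮一段,第三轮新开一段
```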
[NLP-40] On the Step Length Confounding in LLM Reasoning Data Selection ACL2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在构建高质量长链推理数据集时,因自然度筛选方法(naturalness-based data selection)存在“步骤长度混淆”(step length confounding)问题而导致的偏差。具体而言,现有方法通过计算平均对数概率(average log probability)来评估样本质量,但分析发现其会系统性偏好推理步骤更长(即每步token更多)的样本,而非真正更高质量的推理路径。这种偏差源于推理步骤中首个token的低概率值被较长步骤稀释,从而人为提高整体平均对数概率。解决方案的关键在于识别并消除这一混淆因素:提出两种变体方法——ASLEC-DROP 通过剔除每个推理步骤首token的概率来计算平均对数概率,ASLEC-CASL 则采用因果去偏回归(causal debiasing regression)直接移除首token对平均概率的干扰效应,从而更准确地衡量推理质量。
链接: https://arxiv.org/abs/2604.06834
作者: Bing Wang,Rui Miao,Chen Shen,Shaotian Yan,Kaiyuan Liu,Ximing Li,Xiaosong Yuan,Sinan Fan,Jun Zhang,Jieping Ye
机构: Jilin University (吉林大学); Alibaba Cloud Computing (阿里云计算); Zhejiang University (浙江大学); University of Michigan (密歇根大学); RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by Findings of ACL 2026. 15 pages, 9 figures. Code: this https URL
Abstract:Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens’ confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
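步长混淆及 ASLEC-DROP 的修正可以用一个小数值实验说明(假设性实现;概率值为人为构造,仅用来展示“长步骤稀释低概率步首词元、抬高平均对数概率”的效应):

```python
import math

# 假设性示意:剔除步首词元后的平均对数概率(ASLEC-DROP 思路)

def avg_logprob(steps, drop_first=False):
    """steps: 每个推理步为一串词元概率;返回平均对数概率。"""
    logps = []
    for step in steps:
        probs = step[1:] if drop_first else step
        logps.extend(math.log(p) for p in probs)
    return sum(logps) / len(logps)

# 两条"质量相同"的推理链:步首概率同为 0.1,其余词元同为 0.9
short_steps = [[0.1, 0.9]] * 4            # 8 个词元,分为 4 个短步骤
long_steps = [[0.1] + [0.9] * 7]          # 同样 8 个词元,合并为 1 个长步骤
naive_gap = avg_logprob(long_steps) - avg_logprob(short_steps)  # >0:长步骤被抬高
fixed_gap = avg_logprob(long_steps, True) - avg_logprob(short_steps, True)  # 约为 0
```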
[NLP-41] Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理阶段依赖自回归(Autoregressive, AR)解码导致的吞吐量瓶颈问题,尤其是在边缘设备上以批处理大小为1运行时,AR解码受限于内存带宽且无法充分利用硬件并行性。解决方案的关键在于提出Fast-dVLM,一种基于块扩散(block-wise discrete diffusion)的VLM架构,其核心创新包括:支持KV缓存兼容的并行解码和推测性块解码(speculative block decoding),并通过直接转换策略(direct conversion)将预训练好的多模态VLM一次性转化为扩散模型,相比两阶段转换更高效地保留了多模态对齐能力;同时引入一系列适配技术如块大小退火、因果上下文注意力、自动截断掩码和视觉高效拼接,共同实现高质量且高速的多模态生成推理。实验表明,Fast-dVLM在11个基准测试中与AR基线相当,并在SGLang集成和FP8量化下实现超过6倍的端到端推理加速。
链接: https://arxiv.org/abs/2604.06832
作者: Chengyue Wu,Shiyi Lan,Yonggan Fu,Sensen Gao,Jin Wang,Jincheng Yu,Jose M. Alvarez,Pavlo Molchanov,Ping Luo,Song Han,Ligeng Zhu,Enze Xie
机构: The University of Hong Kong; NVIDIA; MIT; MBZUAI
类目: Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.
[NLP-42] WRAP: Web discoveRy Amplified Pretraining
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)预训练中合成数据重构方法局限于单文档层面的问题,即现有技术仅在孤立文档内进行重写,导致生成的示例仅包含文档内部的知识关联,缺乏跨文档的事实联系与丰富的语义上下文。为克服这一局限,作者提出WRAP++(Web Discovery Amplified Pretraining)框架,其核心创新在于通过挖掘网页超链接发现高置信度的关系模式(如双向链接和共提及),并在此基础上合成需要跨两篇文档推理的问答对(QA),从而构建原始文档中缺失的关联性知识。该方案不仅显著增强了事实知识的语境丰富度,还因实体对组合数的组合爆炸特性,使合成数据规模从约84亿token扩展至800亿token,同时在SimpleQA基准上验证了基于OLMo的7B和32B模型均展现出优于单文档方法的性能及持续的缩放增益,证明了跨文档关系发现与数据扩增的有效性。
链接: https://arxiv.org/abs/2604.06829
作者: Jiang Zhou,Yunhao Wang,Xing Wu,Tinghao Yu,Feng Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress. Correspondence to ucaswu@tencent.com or wuxing@iie. this http URL
Abstract:Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
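摘要中“双向链接(dual-link)/共提及(co-mention)”两类关系模式的发现逻辑,可以用一段极简 Python 草图示意(函数名与示例超链接图均为笔者假设,并非论文实现):

```python
# 示意性草图:从超链接图中发现 WRAP++ 所述的两类关系模式。
# outlinks: 页面标题 -> 该页面超链接指向的标题集合(玩具数据)。

def discover_pairs(outlinks):
    pages = sorted(outlinks)
    dual_links, co_mentions = set(), set()
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            # 双向链接模式:两篇文档互相超链接。
            if b in outlinks[a] and a in outlinks[b]:
                dual_links.add((a, b))
            # 共提及模式:两篇文档链接到同一个第三方实体。
            if (outlinks[a] & outlinks[b]) - {a, b}:
                co_mentions.add((a, b))
    return dual_links, co_mentions

graph = {
    "Marie Curie": {"Pierre Curie", "Radium"},
    "Pierre Curie": {"Marie Curie", "Radium"},
    "Radium": {"Uranium"},
    "Uranium": set(),
}
dual, co = discover_pairs(graph)
print(dual)  # {('Marie Curie', 'Pierre Curie')}
```

发现出的文档对随后才交给模型合成跨文档 QA;由于候选实体对数量随图规模组合式增长,这正是摘要所述数据规模放大的来源。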
[NLP-43] Environmental Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models LREC26
【速读】: 该论文旨在解决小公司和新兴市场中环境、社会与治理(ESG)评级数据稀缺的问题,从而提升ESG评估的可及性与准确性。其关键解决方案是构建首个公开可用的斯洛文尼亚语ESG情感数据集,并开发一系列用于自动ESG情感检测的模型,包括基于大语言模型(LLM)的分类器、微调后的SloBERTa(单语模型)、XLM-R(多语模型)以及嵌入式分类器(TabPFN)和分层集成架构。实验表明,LLM在环境(Gemma3-27B,F1-macro: 0.61)和社会(gpt-oss 20B,F1-macro: 0.45)维度表现最优,而微调后的SloBERTa在治理维度上表现最佳(F1-macro: 0.54),验证了多模型策略对不同ESG维度的有效适配能力。
链接: https://arxiv.org/abs/2604.06826
作者: Paula Dodig,Boshko Koloski,Katarina Sitar Šuštar,Senja Pollak,Matthew Purver
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the 7th Financial Narrative Processing Workshop at LREC 2026
Abstract:Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.
[NLP-44] SemEval-2026 Task 9: Detecting Multilingual Multicultural and Multievent Online Polarization
【速读】: 该论文旨在解决在线极化(online polarization)检测问题,即识别社交媒体内容中是否存在极化现象、极化的类型以及极化的表现形式。其解决方案的关键在于构建了一个多语言、大规模的标注数据集(涵盖22种语言,超过110K条实例),并设计了三个子任务:极化存在性检测、极化类型识别和极化表现形式识别。通过组织SemEval-2026 Task 9这一共享任务,吸引了全球超过1,000名参与者和10,000余次提交,最终基于67个团队的系统结果与73篇系统描述论文,分析了不同方法在各子任务和语言上的性能表现,揭示了当前最有效的技术路径,为跨语言极化检测提供了基准和方向。
链接: https://arxiv.org/abs/2604.06817
作者: Usman Naseem,Robert Geislinger,Juan Ren,Sarah Kohail,Rudy Garrido Veliz,P Sam Sahil,Yiran Zhang,Marco Antonio Stranisci,Idris Abdulmumin,Özge Alaçam,Cengiz Acartürk,Aisha Jabr,Saba Anwar,Abinew Ali Ayele,Elena Tutubalina,Aung Kyaw Htet,Xintong Wang,Surendrabikram Thapa,Tanmoy Chakraborty,Dheeraj Kodati,Sahar Moradizeyveh,Firoj Alam,Ye Kyaw Thu,Shantipriya Parida,Ihsan Ayyub Qazi,Lilian Wanzare,Nelson Odhiambo Onyango,Clemencia Siro,Ibrahim Said Ahmad,Adem Chanie Ali,Martin Semmann,Chris Biemann,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam
机构: Macquarie University; University of Hamburg; Zayed University; HKBK College of Engineering; University of Turin; aequa-tech; University of Pretoria; Bielefeld University; Jagiellonian University; Bahir Dar University; AIRI; KFU; HSE University; Virginia Tech; IIT Delhi; ABV-IIITM; Qatar Computing Research Institute; Hamad Bin Khalifa University; Language Understanding Lab., Myanmar; AMD Silo AI; Lahore University of Management Sciences; Maseno University; Centrum Wiskunde Informatica; Bayero University Kano; Northeastern University; Imperial College London
类目: Computation and Language (cs.CL)
备注:
Abstract:We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three sub-tasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submissions on Codabench. We received final submissions from 67 teams and 73 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset of this task is publicly available.
[NLP-45] AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本生成中因幻觉(hallucination)导致的可靠性问题,尤其针对现有不确定性量化(Uncertainty Quantification, UQ)方法在跨异构主题下难以可靠聚合、忽视中立信息以及细粒度分解计算开销过高的挑战。其解决方案的关键在于提出AGSC(Adaptive Granularity and GMM-based Semantic Clustering)框架:首先利用自然语言推理(NLI)中立概率作为触发条件区分无关内容与不确定性,从而减少冗余计算;随后通过高斯混合模型(Gaussian Mixture Model, GMM)进行软聚类以建模潜在语义主题,并为每个主题分配感知权重用于下游聚合,实现高效且准确的不确定性评估。
链接: https://arxiv.org/abs/2604.06812
作者: Guanran Luo,Wentao Qiu,Wanru Zhao,Wenhan Lv,Zhongquan Jian,Meihong Wang,Qingqiang Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in long-form generation, yet their application is hindered by the hallucination problem. While Uncertainty Quantification (UQ) is essential for assessing reliability, the complex structure makes reliable aggregation across heterogeneous themes difficult. In addition, existing methods often overlook the nuance of neutral information and suffer from the high computational cost of fine-grained decomposition. To address these challenges, we propose AGSC (Adaptive Granularity and GMM-based Semantic Clustering), a UQ framework tailored for long-form generation. AGSC first uses NLI neutral probabilities as triggers to distinguish irrelevance from uncertainty, reducing unnecessary computation. It then applies Gaussian Mixture Model (GMM) soft clustering to model latent semantic themes and assign topic-aware weights for downstream aggregation. Experiments on BIO and LongFact show that AGSC achieves state-of-the-art correlation with factuality while reducing inference time by about 60% compared to full atomic decomposition.
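摘要描述的“中立概率门控 + 主题感知加权聚合”流程可用如下草图示意(非论文代码;软主题责任度在此直接给定,AGSC 中由 GMM 拟合得到):

```python
# 示意性草图:NLI 中立概率作为门控剔除无关断言,软主题责任度
# 对各断言的不确定性做主题感知加权聚合。

def agsc_score(claims, neutral_threshold=0.8):
    """claims: 含 'p_neutral'、'uncertainty'、'topic_resp' 的字典列表,
    其中 topic_resp 为对各潜在主题的软分配(和为 1)。"""
    kept = [c for c in claims if c["p_neutral"] < neutral_threshold]
    if not kept:
        return 0.0
    n_topics = len(kept[0]["topic_resp"])
    # 每个主题的质量 = 该主题在保留断言上累计的责任度。
    mass = [sum(c["topic_resp"][k] for c in kept) for k in range(n_topics)]
    total = sum(mass)
    score = 0.0
    for k in range(n_topics):
        if mass[k] == 0:
            continue
        # 主题内不确定性:按责任度加权平均。
        u_k = sum(c["topic_resp"][k] * c["uncertainty"] for c in kept) / mass[k]
        score += (mass[k] / total) * u_k
    return score

claims = [
    {"p_neutral": 0.1, "uncertainty": 0.2, "topic_resp": [0.9, 0.1]},
    {"p_neutral": 0.2, "uncertainty": 0.8, "topic_resp": [0.2, 0.8]},
    {"p_neutral": 0.95, "uncertainty": 0.9, "topic_resp": [0.5, 0.5]},  # 中立触发,被剔除
]
print(round(agsc_score(claims), 3))  # 0.5
```

门控在聚合之前发生,这正是摘要中“减少不必要计算”的来源:被判为中立/无关的断言不再参与后续主题建模。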
[NLP-46] Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
【速读】: 该论文旨在解决多步思维链(Chain-of-Thought, CoT)在数学推理任务中因序列长度过长而导致计算资源消耗过大,以及现有基于马尔可夫结构的优化方法因固有记忆缺失和有限反向推理能力而影响推理准确性的问题。解决方案的关键在于提出一种基于可逆分层马尔可夫链(Reversible Hierarchical Markov Chain)的新型思维链框架——认知循环思维链(Cognitive Loop of Thought, CLoT),其核心创新包括:1)将问题分解为具有层级依赖关系的子问题;2)借鉴人类认知过程,在每一层级引入反向验证机制以增强推理鲁棒性;3)设计剪枝策略,在高层子问题验证通过后移除冗余的低层子问题,从而减少计算开销并抑制错误传播。实验表明,该方法在多个数学基准测试中显著优于传统CoT及CoT-SC。
链接: https://arxiv.org/abs/2604.06805
作者: Jia-Chen Zhang,Zheng Zhou,Yu-Jie Xiong
机构: East China Normal University (华东师范大学); Shanghai University of Engineering Science (上海工程技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.
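CLoT“逐层验证后剪枝”的控制流可做如下玩具简化(笔者自拟,非论文实现):每层子问题求解后先做反向验证,通过后才丢弃下层结果。

```python
# 玩具示意:可逆分层求解,每层验证通过后剪掉已验证的下层。

def clot_solve(layers, solve, verify):
    """layers: 自底向上的子问题列表;solve(sp, memory) 求解子问题;
    verify(sp, result) 为该层的反向验证。"""
    memory = []
    for level, subproblems in enumerate(layers):
        results = [solve(sp, memory) for sp in subproblems]
        if not all(verify(sp, r) for sp, r in zip(subproblems, results)):
            raise ValueError(f"第 {level} 层反向验证失败")
        memory = results  # 剪枝:丢弃已验证的下层,抑制上下文膨胀
    return memory

# 玩具实例:底层给出叶子数值,上层对它们求和。
layers = [[2, 3], [None]]
solve = lambda sp, mem: sp if sp is not None else sum(mem)
verify = lambda sp, r: r is not None
print(clot_solve(layers, solve, verify))  # [5]
```

只保留“当前已验证层”正对应摘要中缓解 KV Cache 冗余、同时抑制错误向上传播的设计意图。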
[NLP-47] Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLM s Across Nine Complexity Dimensions
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代数推理任务中缺乏可解释性失败分析的问题,即现有基准无法定位模型失败的具体原因,如表达式嵌套深度、运算符罕见度、中间状态数量或依赖链长度等复杂因素。其解决方案的关键在于提出一个九维代数复杂度框架(nine-dimension algebraic complexity framework),其中每个维度独立变化而其余固定,并通过参数化流水线自动生成和验证问题,无需人工标注。该框架基于已知的LLM代数推理失败模式,系统性地刻画了结构上不同的难度特征,从而实现了对模型代数推理能力的精细诊断与长期追踪。研究发现,工作记忆(working memory)是跨参数规模的不变瓶颈,所有模型均在20–30个并行分支处崩溃,揭示出硬性架构限制而非容量不足。进一步识别出五个维度即可覆盖全部已知失败模式,构成完整的代数推理能力复杂度剖面。
链接: https://arxiv.org/abs/2604.06799
作者: Parth Patil,Dhruv Kumar,Yash Sinha,Murari Mandal
机构: BITS Pilani; KIIT Bhubaneshwar
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Under Review as a conference paper at COLM 2026
Abstract:Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluated seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model’s algebraic reasoning capacity.
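论文参数化生成管线的精神可用一个玩具版说明(细节为笔者自拟):只改变“嵌套深度”这一维度,其余(运算符集合、操作数范围)保持固定,并自动验证答案,无需人工标注。

```python
# 玩具版参数化题目生成器:嵌套深度可控、答案可自动验证。
import random

def make_problem(depth, rng):
    """返回 (表达式字符串, 精确答案),嵌套深度恰为 depth。"""
    if depth == 0:
        n = rng.randint(1, 9)
        return str(n), n
    left, lv = make_problem(depth - 1, rng)
    right, rv = make_problem(0, rng)
    op = rng.choice(["+", "-", "*"])
    value = {"+": lv + rv, "-": lv - rv, "*": lv * rv}[op]
    return f"({left} {op} {right})", value

rng = random.Random(0)
expr, answer = make_problem(3, rng)
print(expr, "=", answer)
assert eval(expr) == answer  # 管线中的自动验证环节
```

对其余八个维度(运算符罕见度、并行分支数等)的控制方式类似:固定随机种子与其它参数,只扫描目标维度即可得到受控难度序列。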
[NLP-48] GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering
【速读】: 该论文旨在解决Chain-of-Thought (CoT)推理在实际应用中依赖人工设计提示(prompt)的局限性,以及现有CoT-decoding方法仅适用于答案集合固定的问答任务的问题。其解决方案的关键在于提出一种通用的解码策略GCoT-decoding,通过两阶段分支机制(结合斐波那契采样与启发式错误回溯)生成候选推理路径,并将每条路径拆分为推理段(reasoning span)和答案段(answer span)以精确计算路径置信度;随后对语义相似的路径进行聚合,采用共识机制替代传统多数投票法来确定最终答案,从而在固定答案和自由答案两种类型的问答任务上均表现出优越性能。
链接: https://arxiv.org/abs/2604.06794
作者: Guanran Luo,Wentao Qiu,Zhongquan Jian,Meihong Wang,Qingqiang Wu
机构: School of Informatics, Xiamen University (厦门大学信息学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chain-of-Thought reasoning can enhance large language models, but it requires manually designed prompts to guide the model. Recently proposed CoT-decoding enables the model to generate CoT-style reasoning paths without prompts, but it is only applicable to problems with fixed answer sets. To address this limitation, we propose a general decoding strategy GCoT-decoding that extends applicability to a broader range of question-answering tasks. GCoT-decoding employs a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It then splits each path into a reasoning span and an answer span to accurately compute path confidence, and finally aggregates semantically similar paths to identify a consensus answer, replacing traditional majority voting. We conduct extensive experiments on six datasets covering both fixed and free QA tasks. Our method not only maintains strong performance on fixed QA but also achieves significant improvements on free QA, demonstrating its generality.
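摘要中“聚合语义相似路径、以共识替代多数投票”的一步可简化示意如下(笔者将“语义相似”简化为答案串归一化后相同;论文中使用语义相似度对路径聚类):

```python
# GCoT 共识聚合的简化草图:按聚合置信度而非出现频次选答案。

def consensus_answer(paths):
    """paths: (答案文本, 路径置信度) 列表。"""
    groups = {}
    for answer, conf in paths:
        key = answer.strip().lower()
        best, total = groups.get(key, (answer, 0.0))
        groups[key] = (best, total + conf)
    # 共识机制:取聚合置信度最高的答案簇,而非简单多数投票。
    return max(groups.values(), key=lambda g: g[1])[0]

paths = [("Paris", 0.9), ("paris ", 0.7), ("Lyon", 0.6), ("Lyon", 0.5)]
print(consensus_answer(paths))  # Paris 簇置信度 1.6 > Lyon 簇 1.1
```

注意此例中两簇出现次数相同(各 2 条路径),多数投票无法区分,而置信度加权的共识机制仍能选出高置信簇,这正是该策略适用于自由答案 QA 的原因之一。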
[NLP-49] Video-guided Machine Translation with Global Video Context
【速读】: 该论文旨在解决现有视频引导的多模态翻译(Video-guided Multimodal Translation, VMT)方法在长视频场景中因依赖局部对齐视频片段与字幕一一对应而难以捕捉跨多个片段的全局叙事语境的问题。解决方案的关键在于提出一种全局视频引导的多模态翻译框架,其核心包括:利用预训练语义编码器和基于向量数据库的字幕检索机制构建与目标字幕语义高度相关的视频片段上下文集;引入注意力机制聚焦于高相关性的视觉内容,同时保留其余视频特征以维持更广泛的上下文信息;并设计区域感知的跨模态注意力机制以增强翻译过程中的语义对齐效果。
链接: https://arxiv.org/abs/2604.06789
作者: Jian Chen,JinZe Lv,Zi Long,XiangHua Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.
[NLP-50] When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning ACL2026
【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在复杂推理任务中普遍存在“过度思考”(overthinking)的问题,即模型在生成链式思维(Chain-of-Thought, CoT)过程中产生大量计算冗余,导致推理效率低下。解决方案的关键在于提出一种名为动态推理充分性评估(Dynamic Thought Sufficiency in Reasoning, DTSR)的新框架,其核心机制为:首先通过反射信号监测(Reflection Signal Monitoring)识别潜在的早期退出线索,再通过思维充分性检验(Thought Sufficiency Check)动态判断当前CoT是否足以得出最终答案,从而实现基于元认知启发的自适应早期退出决策。实验表明,该方法可在Qwen3模型上将推理长度缩短28.9%–34.9%,同时保持性能损失最小化,有效缓解了过量推理问题。
链接: https://arxiv.org/abs/2604.06787
作者: Yang Xiang,Yixin Ji,Ruotao Xu,Dan Qiao,Zheming Yang,Juntao Li,Min Zhang
机构: Soochow University (苏州大学); ByteDance (字节跳动)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main Conference
Abstract:Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been generated, yet existing approaches mostly rely on handcrafted or empirical indicators that are unreliable and impractical. In this work, we introduce Dynamic Thought Sufficiency in Reasoning (DTSR), a novel framework for efficient reasoning that enables the model to dynamically assess the sufficiency of its chain-of-thought (CoT) and determine the optimal point for early exit. Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT is sufficient to derive the final answer. Experimental results on the Qwen3 models show that DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking. We further discuss overconfidence in LRMs and self-evaluation paradigms, providing valuable insights for early-exit reasoning.
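DTSR 的两阶段早退循环可用如下玩具草图示意(触发词表与充分性探针均为笔者假设,并非论文实现):

```python
# 玩具示意:在反射信号处暂停,若充分性检验判定当前 CoT
# 已足以作答,则提前退出,截断后续冗余推理。

REFLECTION_CUES = ("wait", "let me double-check", "alternatively")

def generate_with_early_exit(steps, sufficiency_probe, threshold=0.9):
    """steps: 推理步文本序列;sufficiency_probe: CoT -> [0,1] 置信度。"""
    cot = []
    for text in steps:
        cot.append(text)
        # 阶段一:反射信号监测。
        if any(cue in text.lower() for cue in REFLECTION_CUES):
            # 阶段二:思维充分性检验。
            if sufficiency_probe(cot) >= threshold:
                break
    return cot

steps = [
    "Compute 12 * 7 = 84.",
    "Wait, verify: 12 * 7 = 84, consistent.",
    "Alternatively, try yet another route...",  # 过度思考,应被截断
]
probe = lambda cot: 0.95 if len(cot) >= 2 else 0.3
print(len(generate_with_early_exit(steps, probe)))  # 2
```

只在反射信号处做充分性检验(而非每个 token 都检验)使判断开销与推理长度基本解耦,这与摘要中“28.9%-34.9% 长度缩减、性能损失最小”的取舍一致。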
[NLP-51] Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation ACL2026
【速读】: 该论文旨在解决多角色对话生成中因口语化表达和不完整语句导致的对话结构表征失真问题,从而影响生成对话的连贯性和忠实度。解决方案的关键在于提出一种名为DRCR(Discourse coherence and Response-guided Context Rewriting)的新框架,通过引入话语连贯性(discourse coherence)与响应质量(response quality)两种互补的反馈信号,构建用于上下文重写和响应生成的偏好数据,并采用动态自进化学习机制,使重写模块与响应模块在迭代训练中相互促进、持续优化,从而提升多角色对话生成的质量与一致性。
链接: https://arxiv.org/abs/2604.06784
作者: Zhiyu Cao,Peifeng Li,Qiaoming Zhu
机构: Soochow University (苏州大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main Conference
Abstract:Previous research on multi-party dialogue generation has predominantly leveraged structural information inherent in dialogues to directly inform the generation process. However, the prevalence of colloquial expressions and incomplete utterances in dialogues often impedes comprehension and weakens the fidelity of dialogue structure representations, which is particularly pronounced in multi-party dialogues. In this work, we propose a novel framework DRCR (Discourse coherence and Response-guided Context Rewriting) to improve multi-party dialogue generation through dialogue context rewriting. Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation. Moreover, we propose a dynamic self-evolution learning method that allows the rewriter and responder to continuously enhance their capabilities through mutual interaction in an iterative training loop. Comprehensive experiments conducted on four multi-party dialogue datasets substantiate the effectiveness of DRCR.
[NLP-52] Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search ACL2026
【速读】: 该论文旨在解决对话式查询重写(Conversational Query Rewriting, CQR)中因孤立处理重写任务而忽略检索(passage retrieval)与响应生成(response generation)反馈的问题,从而导致重写质量受限。解决方案的关键在于提出多维自一致性偏好对齐的CQR方法(Multi-Faceted Self-Consistent Preference Aligned CQR, MSPA-CQR):首先从重写、检索和响应三个维度构建自一致性偏好对齐数据以增强重写多样性;进而采用前缀引导的多维直接偏好优化策略,联合学习来自这三个维度的偏好信息,从而提升模型在分布内与分布外场景下的鲁棒性和有效性。
链接: https://arxiv.org/abs/2604.06771
作者: Zhiyu Cao,Peifeng Li,Qiaoming Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings
Abstract:Conversational Query Rewriting (CQR) aims to rewrite ambiguous queries to achieve more efficient conversational search. Early studies have predominantly focused on the rewriting in isolation, ignoring the feedback from query rewrite, passage retrieval and response generation in the rewriting process. To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR). Specifically, we first construct self-consistent preference alignment data from three dimensions (rewriting, retrieval, and response) to generate more diverse rewritten queries. Then we propose prefix guided multi-faceted direct preference optimization to learn preference information from three different dimensions. The experimental results show that our MSPA-CQR is effective in both in- and out-of-distribution scenarios.
[NLP-53] Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models
【速读】: 该论文旨在解决语言模型在连续向量空间中运行却依赖离散token表示所引发的几何不一致性问题,特别是Token决策边界(即Voronoi图)的可塑性与表达能力间隙(expressibility gap)之间的关系。其核心贡献在于:首先通过float32精度重计算消除bfloat16量化伪影,验证了Mabrok(2026)提出的线性缩放定律(R²=0.9997),并发现中层(第24–28层)存在几何模糊区,此时边际几何与交叉熵呈负相关(ρ = -0.29),最终层则趋于对齐(ρ = 0.836);其次提出一种名为Margin Refinement Procedure (MRP) 的后处理优化方法,利用Fisher信息距离最大化而非直接边际最大化,在保持下游任务性能不变的前提下实现平均边际提升28%,且仅造成约5,300个位置的恒定副作用,远低于传统边际最大化带来的累积损伤。关键创新点在于:MRP能以几何重构方式压缩表达能力间隙,同时维持原有缩放规律,其实际上限由token级收益分布均匀性决定,而非总损伤量。
链接: https://arxiv.org/abs/2604.06767
作者: Marshall Brett
机构: MARS Labs
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 20 pages
Abstract:Language models operate on discrete tokens but compute in continuous vector spaces, inducing a Voronoi tessellation over the representation manifold. We study this tessellation empirically on Qwen3.5-4B-Base, making two contributions. First, using float32 margin recomputation to resolve bfloat16 quantization artifacts, we validate Mabrok’s (2026) linear scaling law of the expressibility gap with R^2 = 0.9997 - the strongest confirmation to date - and identify a mid-layer geometric ambiguity regime where margin geometry is anti-correlated with cross-entropy (layers 24-28, \rho = -0.29) before crystallizing into alignment at the final layer ( \rho = 0.836). Second, we show that the Voronoi tessellation of a converged model is reshapable through margin refinement procedures (MRP): short post-hoc optimization runs that widen token-decision margins without retraining. We compare direct margin maximization against Fisher information distance maximization across a dose-response sweep. Both methods find the same ceiling of ~16,300 correctable positions per 256K evaluated, but differ critically in collateral damage. Margin maximization damage escalates with intervention strength until corrections are overwhelmed. Fisher damage remains constant at ~5,300 positions across the validated range ( \lambda = 0.15-0.6), achieving +28% median margin improvement at \lambda = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the expressibility gap while preserving its scaling law. However, frequency and token-class audits reveal that gains concentrate in high-frequency structural tokens (84% of net corrections at \lambda = 0.6), with content and entity-like contributions shrinking at higher \lambda . Fisher MRP is therefore a viable geometric polishing tool whose practical ceiling is set not by aggregate damage but by the uniformity of token-level benefit. 
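文中反复度量的“token 决策边际”(隐状态到 Voronoi 边界的距离)在线性 unembedding 下有闭式表达,可用极简草图验证(构造与数据均为笔者示意):

```python
# 笔者构造的极简示意:隐状态到 top-2 token 决策边界(Voronoi 边界)
# 的距离为 (l1 - l2) / ||w1 - w2||,其中 l 为 logit、w 为 unembedding 行。
import math

def voronoi_margin(hidden, unembedding):
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    i, j = order[0], order[1]
    diff = [a - b for a, b in zip(unembedding[i], unembedding[j])]
    return (logits[i] - logits[j]) / math.sqrt(sum(d * d for d in diff))

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # 3 个 token 的玩具 unembedding
print(voronoi_margin([2.0, 1.0], W))  # 1/sqrt(2) ≈ 0.707
```

摘要所述的 margin refinement procedure 正是在不重训的前提下增大这一量;以 float32 重算它则是为了消除 bfloat16 量化带来的伪边际。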
[NLP-54] TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks
【速读】: 该论文旨在解决多大语言模型(Large Language Model, LLM)框架在处理多步骤情境化任务时因缺乏人类团队角色分工而可能导致的单一视角问题,从而限制了性能表现。其解决方案的关键在于提出一种类人团队导向的多LLM协作框架——TeamLLM,该框架通过引入四种具有明确分工的团队角色,并采用三阶段协作机制来优化多步骤情境化任务的执行流程,从而提升任务完成质量与多样性。
链接: https://arxiv.org/abs/2604.06765
作者: Xiangyu Wang,Jin Wu,Haoran Shi,Wei Xia,Jiarui Yu,Chanjin Zheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at this https URL.
[NLP-55] Multilingual Cognitive Impairment Detection in the Era of Foundation Models LREC2026
【速读】: 该论文旨在解决跨语言认知障碍(Cognitive Impairment, CI)分类问题,即如何从英语、斯洛文尼亚语和韩语的语音转录文本中准确识别CI。其核心挑战在于小样本场景下模型性能受限,且不同语言对监督信号的依赖程度存在差异。解决方案的关键在于对比零样本大语言模型(Zero-shot Large Language Models, LLMs)与受监督的表格化建模方法(Supervised Tabular Models),后者通过工程化语言特征(engineered linguistic features)、转录嵌入(transcript embeddings)以及早期或晚期融合策略进行优化。结果表明,在数据有限的情况下,结构化的语言信号与简单的融合分类器仍是最具鲁棒性和有效性的方案,而少量标注数据的价值具有显著的语言依赖性。
链接: https://arxiv.org/abs/2604.06758
作者: Damar Hoogland,Boshko Koloski,Jaya Caporusso,Tine Kolenik,Ana Zwitter Vitez,Senja Pollak,Christina Manouilidou,Matthew Purver
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted as an oral at the RAPID workshop @ LREC 2026
Abstract:We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings – transcript-only, linguistic-features-only, and combined – with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable signals.
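摘要中的“早融合 + leave-one-out”评测框架可用玩具草图示意(特征与嵌入均为虚构数据,分类器以 1-近邻代替论文中的表格化模型):

```python
# 玩具示意:工程化语言特征与转录嵌入拼接(早融合)后,
# 用 1-近邻分类器做 leave-one-out 评测。
import math

def loo_accuracy(X, y):
    correct = 0
    for i in range(len(X)):
        dists = [(math.dist(X[i], X[j]), y[j]) for j in range(len(X)) if j != i]
        correct += min(dists)[1] == y[i]
    return correct / len(X)

feats = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]  # 工程化语言特征
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]   # 转录文本嵌入
X = [f + e for f, e in zip(feats, embs)]                   # 早融合:直接拼接
y = ["CI", "CI", "control", "control"]
print(loo_accuracy(X, y))  # 1.0
```

在小样本 CI 检测中,leave-one-out 能最大化利用有限标注;晚融合则改为分别训练两个分类器再合并其预测。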
[NLP-56] How Long Reasoning Chains Influence LLM s Judgment of Answer Factuality ACL2026
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的评判系统在评估生成内容时存在的准确性不足问题,尤其是这些评判者容易受到表面信息(如推理链的流畅性)误导,从而误判错误答案。其解决方案的关键在于系统性地考察推理链(reasoning chain)的可访问性对LLM评判行为的影响,并揭示了推理链的流畅性和事实性是驱动评判决策的核心信号。研究发现,弱判别模型易被看似合理的推理误导,而强判别模型虽能部分利用推理内容作为证据,但仍可能因高质量表象而产生误判,因此亟需构建更鲁棒的LLM评判机制以区分真实推理质量与表面流畅性。
链接: https://arxiv.org/abs/2604.06756
作者: Minzhu Tu,Shiyu Ni,Keping Bi
机构: State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing University of Post and Telecommunications
类目: Computation and Language (cs.CL)
备注: ACL2026 Main
Abstract:Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator’s reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.
[NLP-57] Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在任务执行中性能提升的来源问题:即性能改进是源于模型本身的增强,还是得益于推理范式(reasoning paradigm)的设计。通过对比六种推理范式(Direct、CoT、ReAct、Plan-Execute、Reflection 和 ReCode)在四个前沿LLM和十个基准测试上的表现(共约18,000次运行),研究发现不同范式对不同任务的效果差异显著——某些任务中ReAct相比Direct提升44个百分点(pp),而CoT在HumanEval上反而下降15pp。这表明不存在通用最优范式,且基于任务选择最优范式的“理想选择器”(oracle)平均可比最佳固定范式高出17.1pp。为应对这一互补性,作者提出“先选择后求解”(select-then-solve)方法:引入一个轻量级基于嵌入的路由器(embedding-based router),在每道任务前动态选择最适配的推理范式。实验表明,该路由器在四个模型上将平均准确率从47.6%提升至53.1%,优于最佳固定范式(50.3%)2.8pp,并恢复了高达37%的oracle差距;相比之下,零样本自路由仅在GPT-5上有效(67.1%),且对其他模型无效。因此,论文核心结论是:推理范式的选择应作为每个任务的决策问题,由学习得到的路由器实现,而非固定的架构设计。
链接: https://arxiv.org/abs/2604.06753
作者: Heng Zhou,Zelin Tan,Zhemeng Zhang,Yutao Fan,Yibing Lin,Li Kang,Xiufeng Song,Rui Li,Songtao Huang,Ao Yu,Yuchen Fan,Yanxu Chen,Kaixin Xu,Xiaohong Liu,Yiran Qin,Philip Torr,Chen Zhang,Zhenfei Yin
机构: Google DeepMind (谷歌深脑); OpenAI (OpenAI); Qwen Team (通义千问团队)
类目: Computation and Language (cs.CL)
备注:
Abstract:When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
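摘要中的轻量级“基于嵌入的路由器”可简化为最近质心分类(嵌入用玩具向量代替,范式示例库为笔者虚构):

```python
# select-then-solve 路由器的简化示意:取与任务嵌入最近的范式质心。
import math

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def route(task_vec, examples):
    """examples: 范式名 -> 已标注任务嵌入列表。"""
    cents = {p: centroid(vs) for p, vs in examples.items()}
    return min(cents, key=lambda p: math.dist(task_vec, cents[p]))

examples = {
    "ReAct":  [[0.9, 0.1], [0.8, 0.0]],  # 工具使用型任务
    "Direct": [[0.0, 0.9], [0.1, 1.0]],  # 简短事实型任务
}
print(route([0.85, 0.05], examples))  # ReAct
```

路由发生在作答之前,成本仅为一次嵌入与若干距离比较,因此即便对弱模型也可用,这与摘要中“零样本自路由仅对 GPT-5 有效”的对比结论相呼应。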
[NLP-58] StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference ACL2026
[Summary]: This paper addresses the memory-capacity and bandwidth bottlenecks caused by the linear growth of the key-value (KV) cache as large language models (LLMs) scale to ultra-long contexts (beyond one million tokens). Existing compression methods prioritize tokens by local saliency metrics to decouple prefill computation from decoding memory, but pruning on layer-specific local snapshots discards tokens that act as global information hubs across network depth, losing long-range dependencies. The key contribution is StructKV, whose core components are: (1) Global In-Degree Centrality, which aggregates attention patterns across network depth to identify global information hubs; (2) Dynamic Pivot Detection, which uses information-theoretic metrics to adaptively locate the optimal layer for compression; and (3) Structural Propagation and Decoupling, which separates the computational budget from the storage budget, markedly improving long-context inference efficiency while preserving long-range dependencies and retrieval robustness.
Link: https://arxiv.org/abs/2604.06746
Authors: Zhirui Chen,Peiyang Liu,Ling Shao
Affiliations: UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, China; National Engineering Research Center for Software Engineering, Peking University
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Findings, 14 pages
Abstract:As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.
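The first StructKV component, Global In-Degree Centrality, has a simple core idea that can be sketched as follows (toy attention matrices; the paper's exact aggregation may differ): sum the attention mass each key token receives from all query positions, aggregated across layers, and retain the top-scoring tokens in the KV cache.

```python
def global_in_degree(attn_by_layer):
    """attn_by_layer: list of L attention matrices, each T x T
    (rows = queries, columns = keys, each row sums to 1).
    Returns per-token scores: total attention received, aggregated
    across layers -- a proxy for 'global information hub' tokens."""
    T = len(attn_by_layer[0])
    scores = [0.0] * T
    for attn in attn_by_layer:
        for row in attn:
            for k, w in enumerate(row):
                scores[k] += w  # in-degree: mass flowing INTO key k
    return scores

def keep_top_tokens(attn_by_layer, budget):
    scores = global_in_degree(attn_by_layer)
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:budget])  # token indices retained in the KV cache

# Two toy layers over 4 tokens: token 0 is attended to heavily in layer 0,
# while token 2 only becomes a hub in layer 1 -- a single-layer snapshot
# would miss one of them; depth aggregation keeps both.
layer0 = [[1.0, 0.0, 0.0, 0.0],
          [0.9, 0.1, 0.0, 0.0],
          [0.8, 0.1, 0.1, 0.0],
          [0.7, 0.1, 0.1, 0.1]]
layer1 = [[1.0, 0.0, 0.0, 0.0],
          [0.1, 0.9, 0.0, 0.0],
          [0.1, 0.0, 0.9, 0.0],
          [0.1, 0.0, 0.8, 0.1]]
print(keep_top_tokens([layer0, layer1], budget=2))  # prints: [0, 2]
```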
[NLP-59] Luwen Technical Report
[Summary]: This paper addresses the challenges of applying large language models (LLMs) in the legal domain, including complex specialized terminology, demanding reasoning requirements, and rapidly evolving legal knowledge. The solution builds on the Baichuan foundation model with three key techniques: (1) continual pre-training on a large-scale legal corpus to strengthen legal domain knowledge; (2) supervised fine-tuning on carefully curated legal instruction data to improve task adaptation; and (3) retrieval-augmented generation (RAG) integrated with a comprehensive legal knowledge base for dynamic knowledge injection and precise reasoning. Experiments on five representative legal task types show the approach outperforms several strong baselines, validating its ability to transfer and generalize to legal scenarios.
Link: https://arxiv.org/abs/2604.06737
Authors: Yiquan Wu,Yuhang Liu,Yifei Liu,Ang Li,Siying Zhou,Kun Kuang
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present Luwen, an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate Luwen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that Luwen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.
[NLP-60] SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation
[Summary]: This paper addresses the uncertain structural reliability of SQL programs generated by large language models (LLMs): despite strong results on Text-to-SQL benchmarks, LLM-generated SQL shows substantial syntactic-structure diversity, and that diversity can be independent of semantic correctness. The key contribution is SQLStructEval, a framework that analyzes and quantifies the structural consistency of SQL queries through canonical abstract syntax tree (AST) representations, together with a compile-style pipeline for generation in a structured space that improves both execution accuracy and structural consistency. Experiments show the approach reduces structural variance triggered by surface-level input changes (e.g., paraphrases or differences in schema presentation), underscoring structural reliability as an important dimension for evaluating LLM-based program generation systems.
Link: https://arxiv.org/abs/2604.06736
Authors: Yixi Zhou,Fan Zhang,Zhiqiao Guo,Yu Chen,Haipeng Zhang,Preslav Nakov,Zhuohan Xie
Affiliations: ShanghaiTech University; The University of Tokyo; MBZUAI
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: 17 pages, including figures and tables
Abstract:Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at this https URL.
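A crude way to probe the structural (as opposed to execution-level) agreement the abstract describes, far simpler than the paper's canonical-AST representation and intended only as an illustration: reduce each query to a skeleton by replacing identifiers and literals with placeholders, then compare skeletons across generations.

```python
import re

SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having",
    "join", "on", "and", "or", "not", "in", "as", "count", "avg",
    "sum", "min", "max", "distinct", "limit", "asc", "desc",
}

def skeleton(sql):
    """Reduce a query to its structural skeleton: keywords and
    punctuation survive; identifiers -> ID, numbers/strings -> LIT."""
    tokens = re.findall(r"'[^']*'|\d+(?:\.\d+)?|\w+|[(),.*=<>]", sql)
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in SQL_KEYWORDS:
            out.append(low)
        elif tok.startswith("'") or re.fullmatch(r"\d+(?:\.\d+)?", tok):
            out.append("LIT")
        elif re.fullmatch(r"\w+", tok):
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)

q1 = "SELECT name FROM students WHERE age > 18"
q2 = "SELECT title FROM movies WHERE year > 2000"
q3 = "SELECT name FROM students WHERE age > 18 ORDER BY name"
print(skeleton(q1) == skeleton(q2))  # same shape, different surface: True
print(skeleton(q1) == skeleton(q3))  # extra ORDER BY changes structure: False
```

Two semantically equivalent generations with different skeletons would count as structural variance under a check of this kind, which is exactly the behavior the paper measures with proper ASTs.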
[NLP-61] TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
[Summary]: This paper targets the limited performance of current AI systems that learn by trial and error in real-world environments, whose core bottleneck is the lack of high-quality data on human trial-and-error behavior: existing trial-and-error methods mostly rely on simple researcher-designed heuristics and achieve only modest gains. The key contribution is Trial-and-Error Collection (TEC), a data-annotation platform and accompanying dataset that records users' complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using the platform, the authors collected 5,370 trial trajectories with reflections from 46 participants on 58 tasks, covering 41,229 webpages, providing the first detailed, structured record of human trial-and-error behavior. This foundation supports understanding efficient human trial and error and building more adaptive, capable AI systems.
Link: https://arxiv.org/abs/2604.06734
Authors: Xinkai Zhang,Jingtao Zhan,Yiqun Liu,Qingyao Ai
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Quancheng Laboratory; Institute of Data and Information, Tsinghua Shenzhen International Graduate School; Department of Computer Science and Technology, Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users’ complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.
[NLP-62] Steering the Verifiability of Multimodal AI Hallucinations
[Summary]: This paper addresses differences in the verifiability of hallucinations in multimodal large language models (MLLMs): some hallucinated content is easy for human users to spot (obvious hallucinations), while other content is hard to notice or costly to verify (elusive hallucinations). Prior work has not explored fine-grained control of this verifiability under diverse security and usability requirements. The key solution is a dataset of 4,470 human responses to AI-generated hallucinations, used to categorize hallucinations as obvious or elusive, together with an activation-space intervention method that learns separate probes for the two types, enabling fine-grained control over the verifiability of model outputs. Experiments show that targeted interventions effectively regulate the corresponding hallucination type, and that simply mixing the two interventions flexibly adapts verifiability to the needs of different scenarios.
Link: https://arxiv.org/abs/2604.06714
Authors: Jianhong Pang,Ruoxi Cheng,Ziyi Ye,Xingjun Ma,Zuxuan Wu,Xuanjing Huang,Yu-Gang Jiang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucination contents could be detected by human users(i.e., obvious hallucinations), while others are often missed or require more verification effort(i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet, little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model’s verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the verifiability required for different scenarios.
[NLP-63] Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation
[Summary]: This paper addresses the "interpretation gap" in deciphering ancient Chinese Oracle Bone Script (OBS): existing methods treat decipherment as closed-set image recognition, ignoring the structural logic that characters are composed of a limited set of pictographic components with transferable semantics. The key solution is an agent-driven vision-language model (VLM) framework in which an LLM-based agent automates a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference, yielding more linguistically accurate interpretations. The framework is supported by OB-Radix, an expert-annotated dataset providing fine-grained structural and semantic annotations that underpin its effectiveness.
Link: https://arxiv.org/abs/2604.06711
Authors: Jianing Zhang,Runan Li,Honglin Pang,Ding Xia,Zhou Zhu,Qian Zhang,Chuntao Li,Xi Yang
Affiliations: Jilin University; The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.
[NLP-64] Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs
[Summary]: This paper addresses a practical shortcoming of current API-only prompt optimization: iteratively editing monolithic prompts couples components and obscures credit assignment, limiting controllability and wasting tokens. The key solution is Adaptive Prompt Structure Factorization (aPSF), in which an Architect model discovers task-specific prompt semantic factors and performs interventional single-factor updates: factor-level scoring based on validation-performance changes estimates each factor's marginal contribution, and error-guided factor selection routes optimization toward the current dominant failure source, yielding more efficient, controllable, and sample-efficient prompt optimization.
Link: https://arxiv.org/abs/2604.06699
Authors: Haoyue Liu,Zhichao Wang,Yongxin Guo,Haoran Shou,Xiaoying Tang
Affiliations: The Chinese University of Hong Kong, Shenzhen; Taobao and Tmall Group
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor’s marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization token cost by 45–87% on MultiArith while reaching peak validation in 1 step.
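The interventional factor-level scoring can be sketched as a leave-one-factor-out loop. All names and the toy evaluator below are hypothetical; in aPSF the evaluator would run the LLM over a validation set rather than match keywords.

```python
def factor_scores(factors, evaluate):
    """factors: dict name -> text. evaluate(prompt_text) -> validation score.
    Marginal contribution of each factor = full-prompt score minus the
    score with that single factor removed (higher = more helpful)."""
    def assemble(active):
        return "\n".join(factors[n] for n in factors if n in active)
    full = evaluate(assemble(set(factors)))
    scores = {}
    for name in factors:
        ablated = evaluate(assemble(set(factors) - {name}))
        scores[name] = full - ablated
    return scores

def pick_factor_to_edit(scores):
    # Error-guided selection: revise the factor contributing least
    # (or hurting most), i.e., the current dominant failure source.
    return min(scores, key=scores.get)

# Toy evaluator: rewards prompts containing "step by step", is
# indifferent to the persona line, and penalizes a verbose footer.
def toy_eval(prompt):
    s = 0.5
    if "step by step" in prompt: s += 0.3
    if "irrelevant footer" in prompt: s -= 0.1
    return s

factors = {
    "persona": "You are a careful math tutor.",
    "strategy": "Reason step by step before answering.",
    "footer": "irrelevant footer text",
}
scores = factor_scores(factors, toy_eval)
print(pick_factor_to_edit(scores))  # prints: footer
```

This single-factor, interventional structure is what separates the credit assignment here from monolithic edits, where any score change is confounded across the whole prompt.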
[NLP-65] ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding ACL2026
[Summary]: This paper tackles the "black-box" problem of current chemical vision-language models (VLMs) on complex chemical visual tasks: the models cannot reason over deeper knowledge such as reaction mechanisms, leaving their outputs uninterpretable. The core challenge is fusing the reasoning strengths of large language models (LLMs) into the visual perception process to produce interpretable reasoning paths. The key solution is ChemVLR, an architecture that uses a fine-grained visual analysis strategy to explicitly identify chemical descriptors such as functional groups in a molecule before generating an answer; it combines a cross-modality reverse-engineering strategy with a rigorous filtering pipeline to curate a large-scale, high-quality reasoning-and-captioning dataset (760k samples), and adopts a three-stage training framework that systematically builds the model's perception and reasoning capacity, ultimately enabling interpretable understanding and reasoning over complex chemical images.
Link: https://arxiv.org/abs/2604.06685
Authors: Xuanle Zhao,Xinyuan Cai,Xiang Cheng,Xiuyi Chen,Bo Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 Findings, Preprint Version
Abstract:While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in “black-box” systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systemically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at this https URL.
[NLP-66] Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry
[Summary]: This paper asks how to characterize diachronic change in lexical meaning in Persian poetry more accurately: conventional approaches treat semantic change as abstract drift measured by vector displacement, overlooking how a word's local semantic network is rewired in context. The key solution is a neighborhood-rewiring framework: by aligning Word2Vec spaces and applying graph-theoretic neighborhood analysis, lexical history is modeled as the dynamic restructuring of local semantic graphs, including the gain and loss of neighbors, shifts in bridge roles, and movement across communities. The method identifies distinct evolution patterns for different word classes (e.g., the time-sensitive Night, the poet-sensitive Earth, the stable Heart), and reveals persistence, migration, mediation, and selective transformation of words across literary history, offering Digital Humanities a computational approach with stronger literary interpretability.
Link: https://arxiv.org/abs/2604.06674
Authors: Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar
Affiliations: Sharif University of Technology; Amirkabir University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Meaning in Persian poetry is both historical and relational. Words persist through literary tradition while shifting their force through changing constellations of neighbors, rhetorical frames, and poetic voices. This study examines that process using aligned Word2Vec spaces combined with graph-based neighborhood analysis across centuries and major poets. Rather than modeling semantic change as vector displacement alone, it treats lexical history as the rewiring of local semantic graphs: the gain and loss of neighbors, shifts in bridge roles, and movement across communities. The analysis centers on twenty target words, anchored by five recurrent reference terms: Earth, Night, two wine terms, and Heart. Surrounding them are affective, courtly, elemental, and Sufi concepts such as Love, Sorrow, Dervish, King, Annihilation, and Truth. These words exhibit distinct patterns of change. Night is more time-sensitive, Earth more poet-sensitive, and Heart shows continuity despite graph-role mobility. The two wine terms highlight probe sensitivity: one is broad and semantically diffuse, while the other is narrower and more stable. A lexical audit confirms that the corpus contains historically driven terms, poet-specific usages, and sparsely attested mystical vocabulary requiring caution. Overall, semantic change in Persian poetry is better captured as neighborhood rewiring than as abstract drift. For Digital Humanities, this approach restores local structure to computational analysis and supports interpretations closer to literary practice: persistence, migration, mediation, and selective transformation.
[NLP-67] A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
[Summary]: This paper addresses two major challenges facing generative AI in fake news detection: existing methods depend on manual investigation, which is inefficient and struggles with breaking news; and detection based on externally retrieved evidence is vulnerable to unverified information and lacks fine-grained explanations for every aspect of a news claim. The key solution is the graph-enhanced defense framework G-Defense: it builds a claim-centered graph that decomposes the main claim into sub-claims and models their logical dependencies, uses retrieval-augmented generation (RAG) to produce competing explanations for each sub-claim, assesses overall veracity with a defense-like inference module over the graph, and finally prompts an LLM to output an intuitive explanation graph, achieving efficient and explainable fake news detection.
Link: https://arxiv.org/abs/2604.06666
Authors: Bo Wang,Jing Ma,Hongzhan Lin,Zhiwei Yang,Ruichao Yang,Yuan Tian,Yi Chang
Affiliations: Jilin University; Hong Kong Baptist University; Guangdong Institute of Smart Education, Jinan University; University of Science and Technology Beijing; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education; International Center of Future Science, Jilin University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by TOIS
Abstract:Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations. Existing methods incorporating investigative journalism are often inefficient and struggle with breaking news. Recent advances in large language models (LLMs) enable leveraging externally retrieved reports as evidence for detection and explanation generation, but unverified reports may introduce inaccuracies. Moreover, effective explainable fake news detection should provide a comprehensible explanation for all aspects of a claim to assist the public in verifying its accuracy. To address these challenges, we propose a graph-enhanced defense framework (G-Defense) that provides fine-grained explanations based solely on unverified reports. Specifically, we construct a claim-centered graph by decomposing the news claim into several sub-claims and modeling their dependency relationships. For each sub-claim, we use the retrieval-augmented generation (RAG) technique to retrieve salient evidence and generate competing explanations. We then introduce a defense-like inference module based on the graph to assess the overall veracity. Finally, we prompt an LLM to generate an intuitive explanation graph. Experimental results demonstrate that G-Defense achieves state-of-the-art performance in both veracity detection and the quality of its explanations.
[NLP-68] A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP
[Summary]: This paper addresses the heavy compute and storage overhead incurred when clinical natural language processing (NLP) systems learn a separate prompt for each task in multi-task deployment. The key solution is a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters while retaining strong performance. Experiments across multiple clinical NLP task types and backbone models show the method outperforms LoRA by 1.5–1.7% despite using far fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%, with strong zero- and few-shot transferability.
Link: https://arxiv.org/abs/2604.06650
Authors: Cheng Peng,Mengxian Lyu,Ziyi Chen,Yonghui Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5–1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.
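The "fewer than 0.05% trainable parameters" claim is easy to sanity-check with back-of-the-envelope arithmetic; the soft-prompt length and hidden size below are illustrative assumptions, not figures from the paper.

```python
def prompt_tuning_fraction(backbone_params, prompt_tokens, embed_dim):
    """Soft prompt tuning trains only prompt_tokens x embed_dim
    parameters; everything else in the backbone stays frozen."""
    trainable = prompt_tokens * embed_dim
    return trainable, trainable / backbone_params

# Illustrative numbers: an 8B-parameter backbone with a 100-token
# soft prompt at hidden size 4096 (typical for 8B-class models).
trainable, frac = prompt_tuning_fraction(8e9, prompt_tokens=100, embed_dim=4096)
print(f"{trainable:.0f} trainable params = {frac:.4%} of the backbone")
```

Under these assumed numbers a shared soft prompt trains about 0.005% of the backbone, comfortably inside the sub-0.05% regime the abstract reports.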
[NLP-69] Feedback Adaptation for Retrieval-Augmented Generation ACL2026
[Summary]: This paper addresses the lack of evaluation of how retrieval-augmented generation (RAG) systems adapt to user or expert feedback in real deployment: existing protocols measure only overall accuracy in static settings and miss how system behavior changes after corrective feedback is introduced. The authors introduce feedback adaptation as a new evaluation paradigm with two core metrics: correction lag, the delay between feedback and behavioral change, and post-feedback performance, reliability on semantically related queries after feedback. The key solution is PatchRAG, a lightweight inference-time mechanism that incorporates feedback without retraining, achieving immediate correction and strong post-feedback generalization, thereby improving the adaptability and robustness of RAG systems in interactive settings.
Link: https://arxiv.org/abs/2604.06647
Authors: Jihwan Bang,Seunghan Yang,Kyuhong Shim,Simyung Chang,Juntae Lee,Sungha Choi
Affiliations: Qualcomm AI Research†, Qualcomm Korea YH, Seoul, Republic of Korea; Sungkyunkwan University; H1R.ai; Kyung Hee University
Categories: Computation and Language (cs.CL)
Comments: Accepted at ACL 2026 Findings
Abstract:Retrieval-Augmented Generation (RAG) systems are typically evaluated under static assumptions, despite being frequently corrected through user or expert feedback in deployment. Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced. We introduce feedback adaptation as a problem setting for RAG systems, which asks how effectively and how quickly corrective feedback propagates to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which captures the delay between feedback provision and behavioral change, and post-feedback performance, which measures reliability on semantically related queries after feedback. Using these metrics, we show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a minimal inference-time instantiation that incorporates feedback without retraining, demonstrating immediate correction and strong post-feedback generalization under the proposed evaluation. Our results highlight feedback adaptation as a previously overlooked dimension of RAG system behavior in interactive settings.
[NLP-70] SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning ACL2026
[Summary]: This paper addresses a limitation of existing process-supervision methods for improving LLM reasoning: they cannot distinguish meaningful progress from mere verbosity, capping reasoning performance and wasting tokens. The key solution is Stage-aware Hierarchical Advantage via Potential Estimation (SHAPE), which models reasoning as a trajectory through a state space of empirical solvability: at the segment level, a stage-aware advantage function prioritizes efficient breakthroughs from low-potential states; at the token level, entropy-driven redistribution sharpens execution signals, yielding more precise credit assignment and better reasoning efficiency.
Link: https://arxiv.org/abs/2604.06636
Authors: Zhengyang Ai,Zikang Shan,Xiaodong Ai,Jingxian Tang,Hangkai Hu,Pinyan Lu
Affiliations: Huawei Taylor Lab; Center for Data Science, Peking University; Shanghai University of Finance and Economics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2026 Main
Abstract:Process supervision has emerged as a promising approach for enhancing LLM reasoning, yet existing methods fail to distinguish meaningful progress from mere verbosity, leading to limited reasoning capabilities and unresolved token inefficiency. To address this, we propose Stage-aware Hierarchical Advantage via Potential Estimation (SHAPE), a framework that formalizes reasoning as a trajectory through a state space of empirical solvability. SHAPE introduces a hierarchical credit assignment mechanism: at the segment level, it employs a stage-aware advantage function to prioritize efficient breakthroughs in low-potential states; at the token level, it utilizes entropy-driven redistribution to sharpen execution signals. Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.
[NLP-71] Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection
[Summary]: This paper addresses practical shortcomings of current LLM-based static application security testing (SAST): high false-positive rates, hallucinations, limited reasoning depth, and excessive token usage make existing LLM approaches hard to deploy in industry. The key solution is a paradigm shift that reorchestrates the current LLM-assisted SAST pipeline into an LLM-centered workflow, realized in Argus (Agentic and Retrieval-Augmented Guarding System), the first multi-agent framework designed specifically for vulnerability detection. Argus's core innovations are comprehensive supply-chain analysis, a collaborative multi-agent workflow, and the integration of retrieval-augmented generation (RAG) with ReAct (Reasoning + Acting) to curb hallucinations and deepen reasoning, achieving more accurate vulnerability detection at lower operational cost.
Link: https://arxiv.org/abs/2604.06633
Authors: Zi Liang,Qipeng Xie,Jun He,Bohuan Xue,Weizheng Wang,Yuandao Cai,Fei Luo,Boxian Zhang,Haibo Hu,Kaishun Wu
Affiliations: The Hong Kong Polytechnic University; HKUST; SF Express; Great Bay University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:
Abstract:Recent advancements in Large Language Models (LLMs) have sparked interest in their application to Static Application Security Testing (SAST), primarily due to their superior contextual reasoning capabilities compared to traditional symbolic or rule-based methods. However, existing LLM-based approaches typically attempt to replace human experts directly without integrating effectively with existing SAST tools. This lack of integration results in ineffectiveness, including high rates of false positives, hallucinations, limited reasoning depth, and excessive token usage, making them impractical for industrial deployment. To overcome these limitations, we present a paradigm shift that reorchestrates the SAST workflow from current LLM-assisted structure to a new LLM-centered workflow. We introduce Argus (Agentic and Retrieval-Augmented Guarding System), the first multi-agent framework designed specifically for vulnerability detection. Argus incorporates three key novelties: comprehensive supply chain analysis, collaborative multi-agent workflows, and the integration of state-of-the-art techniques such as Retrieval-Augmented Generation (RAG) and ReAct to minimize hallucinations and enhance reasoning. Extensive empirical evaluation demonstrates that Argus significantly outperforms existing methods by detecting a higher volume of true vulnerabilities while simultaneously reducing false positives and operational costs. Notably, Argus has identified several critical zero-day vulnerabilities with CVE assignments.
[NLP-72] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning
[Summary]: This paper addresses the high cost and inefficiency caused by long, redundant prompts in in-context learning and chain-of-thought prompting for large language models (LLMs). Existing pruning-based prompt compression removes tokens one at a time in a sequential process, which is computationally heavy and slow. The key solution is DiffuMask, a diffusion-based framework that integrates hierarchical shot-level and token-level pruning signals to enable fast, parallel prompt pruning: multiple token masks can be predicted simultaneously at each denoising step, substantially accelerating compression while preserving essential reasoning context, achieving up to 80% prompt-length reduction with maintained or even improved accuracy.
Link: https://arxiv.org/abs/2604.06627
Authors: Caleb Zheng,Jyotika Singh,Fang Tu,Weiyi Sun,Sujeeth Bharadwaj,Yassine Benajiba,Sujith Ravi,Eli Shlizerman,Dan Roth
Affiliations: University of Washington; Oracle AI
Categories: Computation and Language (cs.CL)
Comments:
Abstract:In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal which is computationally intensive. We present DiffuMask, a diffusion-based framework integrating hierarchical shot-level and token-level pruning signals, that enables rapid and parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process via masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.
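The contrast with sequential token removal can be illustrated by a toy parallel pruner (the fixed saliency scores below stand in for DiffuMask's learned mask predictions; the schedule is an assumption for illustration): each "denoising" step masks a whole batch of low-saliency tokens at once rather than one token per step.

```python
def parallel_prune(tokens, saliency, target_ratio, steps=4):
    """Iteratively mask low-saliency tokens, removing a whole batch of
    tokens per 'denoising' step instead of one token at a time."""
    keep = list(range(len(tokens)))
    target_len = max(1, int(len(tokens) * (1 - target_ratio)))
    for _ in range(steps):
        if len(keep) <= target_len:
            break
        n_drop = max(1, (len(keep) - target_len) // 2)  # geometric schedule
        keep.sort(key=lambda i: saliency[i])   # least salient first
        keep = sorted(keep[n_drop:])           # mask n_drop tokens in parallel
    keep.sort(key=lambda i: -saliency[i])
    keep = sorted(keep[:target_len])           # trim to the exact budget
    return [tokens[i] for i in keep]

toks = ["Q", ":", "Tom", "has", "3", "apples", "and", "buys", "2", "more", "."]
sal  = [0.9, 0.1, 0.8, 0.3, 0.95, 0.85, 0.2, 0.7, 0.95, 0.4, 0.05]
print(parallel_prune(toks, sal, target_ratio=0.5))
```

Halving the gap to the budget each step means the number of steps grows logarithmically with prompt length, whereas one-token-at-a-time pruning is linear.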
[NLP-73] The Detection–Extraction Gap: Models Know the Answer Before They Can Say It
[Summary]: This paper targets the large amount of wasted computation in modern reasoning models, which keep generating chain-of-thought tokens long after the answer is already recoverable from a partial prefix, a phenomenon termed post-commitment generation. The core problem is that although the answer can be detected and extracted early, standard prompt-conditioned decoding (forced continuation) fails to recognize this and stop, making generation inefficient. The key insight is the detection–extraction gap: free continuation recovers the correct answer efficiently, while forced continuation fails on 42% of such cases. The authors propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both answer detection and extraction, truncating 70–78% of serial generation without access to model internals while improving accuracy by 1–5 percentage points, with gains of up to 5.8 pp for thinking-mode models.
Link: https://arxiv.org/abs/2604.06613
Authors: Hanyang Wang,Mingxuan Zhu
Affiliations: The University of Chicago; Imperial College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that \textbf52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the \textbfdetection–extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (\BAEE), which uses free continuations for both detection and extraction, truncating \textbf70–78% of serial generation while \textbfimproving accuracy by 1–5,pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8,pp; a cost-optimized variant achieves 68–73% reduction at a median of 9 API calls. Code is available at this https URL.
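The detection side of the approach can be sketched as a black-box loop (the `generate` callable, the probe fractions, and the agreement threshold are all hypothetical placeholders, not the paper's settings; the toy stub simulates a model whose answer commits early in the trace):

```python
from collections import Counter

def early_exit_answer(generate, prompt, trace, probe_points=(0.3, 0.5, 0.7),
                      samples=3, agreement=0.8):
    """Sample free continuations from increasing prefixes of the reasoning
    trace; exit as soon as the sampled answers agree strongly enough."""
    for frac in probe_points:
        prefix = trace[: int(len(trace) * frac)]
        answers = [generate(prompt + prefix) for _ in range(samples)]
        top, count = Counter(answers).most_common(1)[0]
        if count / samples >= agreement:
            return top, frac  # answer committed: truncate the rest
    return generate(prompt + trace), 1.0  # fall back to the full trace

# Toy stub: once the prefix mentions the key computation, every
# free continuation lands on the same answer.
def toy_generate(text):
    return "42" if "6 * 7" in text else "unsure"

trace = ("First, note the product 6 * 7 appears early. "
         + "Then much more reasoning. " * 20)
answer, used = early_exit_answer(toy_generate, "Q: what is 6*7? ", trace)
print(answer, used)  # exits at the first probe point
```

Because everything happens through the `generate` interface, the loop needs no access to logits or hidden states, matching the black-box setting of the abstract.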
[NLP-74] Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中因严重幻觉(hallucination)导致的可靠性问题,尤其在科学任务中,LLMs未能有效利用领域内高度凝练的科学理论与规则来约束生成过程。解决方案的关键在于提出一种名为SciDC的方法,其核心是通过强语言模型自动将灵活的领域知识转化为多层标准化规则,并构建一个可扩展的框架以对模型生成进行强约束,从而显著提升科学任务中的准确性与可信度。
链接: https://arxiv.org/abs/2604.06603
作者: Maotian Ma,Zheni Zeng,Zhenghao Liu,Yukun Yan
机构: Nanjing University (南京大学); Tsinghua University (清华大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown strong knowledge reserves and task-solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize this highly-condensed knowledge sufficiently through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrates subject-specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, we build an extensible framework to effectively constrain the model generation on domain tasks. Experiments on scientific tasks including industrial formulation design, clinical tumor diagnosis and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs in automatically inductively summarizing highly-condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper can be obtained (this https URL).
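The multi-layered rule constraints can be pictured as hard and soft predicates filtering candidate generations. The sketch below is a minimal illustration in that spirit; the formulation-design rules and candidate dictionaries are invented for the example and do not come from the paper.

```python
def constrained_select(candidates, rules):
    """Drop candidates violating any hard rule, then rank survivors by
    how many soft rules they satisfy (most satisfied first)."""
    hard = [r for kind, r in rules if kind == "hard"]
    soft = [r for kind, r in rules if kind == "soft"]
    valid = [c for c in candidates if all(r(c) for r in hard)]
    return sorted(valid, key=lambda c: -sum(r(c) for r in soft))

# Hypothetical rule layers for a formulation-design task: ingredient
# fractions must sum to 100% (hard) and solvent should stay below 60% (soft).
rules = [
    ("hard", lambda f: abs(sum(f.values()) - 100.0) < 1e-6),
    ("hard", lambda f: all(v >= 0 for v in f.values())),
    ("soft", lambda f: f.get("solvent", 0) <= 60),
]
candidates = [
    {"solvent": 70, "resin": 30},   # valid, but breaks the soft rule
    {"solvent": 50, "resin": 50},   # valid, satisfies everything
    {"solvent": 80, "resin": 30},   # sums to 110 -> hard violation, dropped
]
ranked = constrained_select(candidates, rules)
```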
[NLP-75] Scoring Edit Impact in Grammatical Error Correction via Embedded Association Graphs
【速读】: 该论文旨在解决生成式语法错误纠正(GEC)系统中编辑质量评估的局限性问题,即现有评估方法难以充分反映句子存在多种有效修正方案的应用场景,且依赖人工标注的元评估方法难以扩展至大规模数据集。其解决方案的关键在于提出一种新任务——“评分GEC编辑影响”,并设计了一个基于嵌入关联图(embedded association graph)的评分框架;该图通过捕捉编辑间的潜在依赖关系和句法相关性,将编辑分组为语义连贯的集合,并利用困惑度(perplexity)对每个编辑对句子流利度的贡献进行量化评估,从而实现对GEC系统输出编辑重要性的自动估计。
链接: https://arxiv.org/abs/2604.06573
作者: Qiyuan Xiao,Xiaoman Wang,Yunshi Lan
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:A Grammatical Error Correction (GEC) system produces a sequence of edits to correct an erroneous sentence. The quality of these edits is typically evaluated against human annotations. However, a sentence may admit multiple valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios. Recent meta-evaluation approaches rely on human judgments across multiple references, but they are difficult to scale to large datasets. In this paper, we propose a new task, Scoring Edit Impact in GEC, which aims to automatically estimate the importance of edits produced by a GEC system. To address this task, we introduce a scoring framework based on an embedded association graph. The graph captures latent dependencies among edits and syntactically related edits, grouping them into coherent groups. We then perform perplexity-based scoring to estimate each edit’s contribution to sentence fluency. Experiments across 4 GEC datasets, 4 languages, and 4 GEC systems demonstrate that our method consistently outperforms a range of baselines. Further analysis shows that the embedded association graph effectively captures cross-linguistic structural dependencies among edits.
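The perplexity-based scoring step can be illustrated by comparing a sentence's language-model score before and after an edit. The unigram "LM" below is a deliberately crude stand-in for the real model the paper uses; the vocabulary and probabilities are made up for the example.

```python
import math

def pseudo_logprob(sentence, unigram_lm):
    """Stand-in for a real LM score: sum of unigram log-probabilities,
    with a small floor for out-of-vocabulary words."""
    return sum(math.log(unigram_lm.get(w, 1e-6)) for w in sentence.split())

def edit_impact(original, edited, unigram_lm):
    """Positive score = applying the edit makes the sentence more
    fluent under the (stand-in) language model."""
    return pseudo_logprob(edited, unigram_lm) - pseudo_logprob(original, unigram_lm)

lm = {"the": 0.1, "cat": 0.05, "sits": 0.04, "sit": 0.001}
score = edit_impact("the cat sit", "the cat sits", lm)
```

A real system would score edits with a pretrained LM and group related edits via the association graph before scoring, but the sign convention is the same: fluency-improving edits get positive impact.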
[NLP-76] To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言环境下被用于生成和传播虚假信息(misinformation)的问题,特别是针对不同国家和地区差异化的生成倾向及其对低资源语言和人类发展指数(Human Development Index, HDI)较低国家的不成比例影响。其解决方案的关键在于构建GlobalLies——一个包含440个误导性生成提示模板和6,867个实体的多语言平行数据集,覆盖8种语言和195个国家,并通过人工标注与大规模LLM-as-a-judge评估相结合的方式,系统揭示了LLMs在跨语言、跨区域传播虚假信息中的行为模式。研究发现,现有缓解策略如输入安全分类器和检索增强的事实核查存在显著的语言和地域不均衡性,从而为开发更具普适性和公平性的全球虚假信息防控机制提供了实证基础与工具支持。
链接: https://arxiv.org/abs/2604.06552
作者: Zohaib Khan,Mustafa Dogan,Ifeoma Okoh,Pouya Sadeghi,Siddhartha Shrestha,Sergius Justus Nyah,Mahmoud O. Mokhiamar,Michael J. Ryan,Tarek Naous
机构: Fatima Fellowship; Stanford University; Georgia Institute of Technology
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Main Conference
Abstract:Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: this https URL
[NLP-77] CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram
【速读】: 该论文旨在解决现有大语言模型在心理支持场景中模拟认知行为疗法(Cognitive Behavioral Therapy, CBT)时存在的局限性,即依赖静态认知图谱和全知单一代理模拟,无法真实反映临床实践中信息不对称与动态调整的特性。其解决方案的关键在于提出一种多智能体框架CCD-CBT,通过两个核心设计实现理论驱动且临床上合理的对话代理:一是将静态的认知概念化图谱(Cognitive Conceptualization Diagram, CCD)升级为由控制代理(Control Agent)动态重构的版本,以适应治疗过程中的状态变化;二是引入信息不对称交互机制,使治疗代理(Therapist Agent)基于对来访者状态的推理进行决策,而非假设完全知情。这一设计显著提升了模型在咨询保真度和积极情绪增强方面的表现。
链接: https://arxiv.org/abs/2604.06551
作者: Chang Liu,Changsheng Ma,Yongfeng Tao,Bin Hu,Minqiang Yang
机构: Lanzhou University (兰州大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models show potential for scalable mental-health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy. We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient to information-asymmetric interaction, where the Therapist Agent must reason from inferred client states. We release CCDCHAT, a synthetic multi-turn CBT dataset generated under this framework. Evaluations with clinical scales and expert therapists show that models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement, with ablations confirming the necessity of dynamic CCD guidance and asymmetric agent design. Our work offers a new paradigm for building theory-grounded, clinically-plausible conversational agents.
[NLP-78] The Illusion of Stochasticity in LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)作为智能体(agents)时在随机采样(stochastic sampling)方面的可靠性问题。当前LLM代理系统常需从观测数据推断出的概率分布中进行采样,但现有模型无法准确将其内部概率估计映射到实际的随机输出上,导致采样行为不可靠。解决方案的关键在于揭示:尽管前沿模型能够通过外部输入的随机种子(random seeds)生成目标分布的样本,其直接从指定分布中采样的能力存在根本性缺陷,这表明LLM自身缺乏对内在概率分布的有效控制机制,从而指明了未来改进方向——需构建能将内部概率估计精确转化为可控随机输出的机制。
链接: https://arxiv.org/abs/2604.06543
作者: Xiangming Gu,Soham De,Michalis Titsias,Larisa Markeeva,Petar Veličković,Razvan Pascanu
机构: Google DeepMind; National University of Singapore
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Under review
Abstract:In this work, we demonstrate that reliable stochastic sampling is a fundamental yet unfulfilled requirement for Large Language Models (LLMs) operating as agents. Agentic systems are frequently required to sample from distributions, often inferred from observed data, a process which needs to be emulated by the LLM. This leads to a distinct failure point: while standard RL agents rely on external sampling mechanisms, LLMs fail to map their internal probability estimates to their stochastic outputs. Through rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions, we demonstrate the extent of this failure. Crucially, we show that while powerful frontier models can convert provided random seeds to target distributions, their ability to sample directly from specific distributions is fundamentally flawed.
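The external sampling mechanism the paper contrasts with direct LLM sampling can be made concrete with inverse-CDF sampling from a provided seed. This is a generic illustration of that mechanism, not code from the paper; the coin-flip distribution is an arbitrary example.

```python
import random

def sample_from_seed(seed, distribution):
    """Turn an externally supplied random seed into a draw from a
    categorical distribution via inverse-CDF sampling -- the kind of
    external mechanism frontier models can exploit, unlike sampling
    directly from their internal probability estimates."""
    u = random.Random(seed).random()
    cum = 0.0
    for outcome, p in distribution:
        cum += p
        if u < cum:
            return outcome
    return distribution[-1][0]  # guard against floating-point shortfall

dist = [("heads", 0.5), ("tails", 0.5)]
draws = [sample_from_seed(s, dist) for s in range(1000)]
```

Given the same seed the draw is reproducible, while across seeds the empirical frequencies approach the target distribution.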
[NLP-79] Does a Global Perspective Help Prune Sparse MoEs Elegantly?
【速读】: 该论文旨在解决稀疏混合专家(Mixture-of-Experts, MoE)模型中因专家参数量庞大而导致的内存消耗过高问题,尤其针对现有剪枝方法在各层均匀分配剪枝预算、忽略层间冗余异质性的局限性。其解决方案的关键在于提出一种全局冗余感知的专家剪枝策略——GRAPE(Global Redundancy-Aware Pruning of Experts),通过动态分配剪枝预算以捕捉跨层冗余信息,在保持模型性能的同时显著降低内存占用。实验表明,GRAPE在多个主流MoE模型上均优于现有局部剪枝基线,平均性能提升达1.40%,最高提升达2.45%。
链接: https://arxiv.org/abs/2604.06542
作者: Zeliang Zhang,Nikhil Ghosh,Jiani Liu,Bin Yu,Xiaodong Liu
机构: University of Rochester (罗彻斯特大学); Flatiron Institute (熨斗研究所); University of California, Berkeley (加州大学伯克利分校); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.
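Redundancy-proportional budget allocation can be sketched as follows. Note the redundancy measure here (mean pairwise cosine similarity between expert weight vectors) and the toy two-layer setup are illustrative assumptions; the paper's actual redundancy estimator may differ.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def layer_redundancy(experts):
    """Mean pairwise cosine similarity between a layer's expert vectors."""
    n = len(experts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(experts[i], experts[j]) for i, j in pairs) / len(pairs)

def allocate_budget(layers, total_pruned):
    """Split a global expert-pruning budget across layers in proportion
    to each layer's redundancy, instead of pruning uniformly."""
    scores = [layer_redundancy(layer) for layer in layers]
    z = sum(scores)
    return [round(total_pruned * s / z) for s in scores]

layers = [
    [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0]],  # near-duplicate experts
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],    # diverse experts
]
budget = allocate_budget(layers, total_pruned=4)
```

The layer with near-duplicate experts absorbs most of the pruning budget, which is the intuition behind moving from uniform to global allocation.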
[NLP-80] Fine-tuning Whisper for Pashto ASR: strategies and scale
【速读】: 该论文旨在解决Whisper模型在Pashto语音识别任务中表现不佳的问题,即由于Pashto未出现在Whisper的预训练语料库中,导致所有尺寸的Whisper模型在处理Pashto音频时输出阿拉伯文、达里语或乌尔都语字符,且词错误率(Word Error Rate, WER)超过100%。解决方案的关键在于通过在CommonVoice Pashto数据集上对Whisper模型进行微调(fine-tuning),并系统比较了四种策略:全参数微调(vanilla fine-tuning)、低秩适应(LoRA,rank 64)、冻结编码器(frozen-encoder,2/6层)以及多阶段乌尔都语到Pashto迁移学习。实验表明,全参数微调在CommonVoice Pashto v20上取得最优效果(WER 21.22%),显著优于其他方法,且进一步扩展至whisper-small和whisper-large-v3-turbo后,在v24数据集(113小时)上分别达到24.89%和23.37%的WER,验证了其有效性与可扩展性,同时指出在特定深度下冻结编码器会因层功能分离失效而损害性能,且迁移学习因中间检查点未验证、音系差异和训练不足而失败。
链接: https://arxiv.org/abs/2604.06507
作者: Hanif Rahman
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Pashto is absent from Whisper’s pre-training corpus despite being one of CommonVoice’s largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
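The word error rates quoted above are word-level edit distances normalized by reference length. A minimal reference implementation (not the paper's evaluation script, which presumably handles normalization and tokenization of Pashto text) looks like this:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis, divided by the reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between r[:i] and h[:j], one row at a time.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev = dp[0]
        dp[0] = i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution / match
            prev = cur
    return dp[-1] / len(r)
```

For example, one substituted word in a four-word reference yields a WER of 0.25; rates above 100% are possible when the hypothesis inserts many spurious words, as with the off-the-shelf models described above.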
[NLP-81] MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在推理密集型生物医学研究任务中,缺乏用于评估其从结构化生物医学证据中推导科学结论能力的资源问题。解决方案的关键在于构建了一个大规模数据集 MedConclusion,包含 570 万条 PubMed 结构化摘要,每条实例将摘要中的非结论部分与原始作者撰写的结论配对,从而提供自然发生的监督信号以支持证据到结论的推理建模。此外,该数据集还整合了期刊级别的元数据(如生物医学类别和 SJR 分数),支持跨生物医学领域的子组分析,为系统性研究科学证据到结论的推理提供了可复用的数据基础。
链接: https://arxiv.org/abs/2604.06505
作者: Weiyue Li,Ruizhi Qian,Yi Li,Yongce Li,Yunfan Long,Jiahui Cai,Yan Luo,Mengyu Wang
机构: Harvard AI and Robotics Lab, Harvard Medical School; University of Southern California; Carnegie Mellon University; Stanford University; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: this https URL.
[NLP-82] Transformer See Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning
【速读】: 该论文旨在解决如何让人工智能系统具备类人水平的类比推理能力这一难题,尤其是提升模型在面对新任务或新数据时的泛化性能。其核心解决方案是采用基于元学习的组合性训练方法(Meta-Learning for Compositionality, MLC),通过在训练中引入复制任务来引导模型关注问题中最关键的信息元素,并利用多样化数据集增强模型对新字母表等分布外场景的适应能力。实验表明,这种策略显著提升了模型在字母串类比任务上的学习效率与泛化性能,且可通过可解释性分析识别出近似模型计算过程的算法机制,从而实现对模型行为的精准调控。
链接: https://arxiv.org/abs/2604.06501
作者: Philipp Hellwig,Willem Zuidema,Claire E. Stevenson,Martha Lewis
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another. Yet, developing artificial intelligence systems capable of robust human-like analogical reasoning has proven difficult. In this work, we train transformers using Meta-Learning for Compositionality (MLC) on an analogical reasoning task (letter-string analogies) and assess their generalization capabilities. We find that letter-string analogies become learnable when guiding the models to attend to the most informative problem elements induced by including copying tasks in the training data. Furthermore, generalization to new alphabets becomes better when models are trained with more heterogeneous datasets, where our 3-layer encoder-decoder model outperforms most frontier models. The MLC approach also enables some generalization to compositions of trained transformations, but not to completely novel transformations. To understand how the model operates, we identify an algorithm that approximates the model’s computations. We verify this using interpretability analyses and show that the model can be steered precisely according to expectations derived from the algorithm. Finally, we discuss implications of our findings for generalization capabilities of larger models and parallels to human analogical reasoning.
[NLP-83] Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR INTERSPEECH
【速读】: 该论文旨在解决基于大语言模型(Large Language Model, LLM)的自动语音识别(ASR)系统在仅使用文本数据进行领域适配时存在的模态差异(modality gap)问题,即LLM未接触过由语音投影模块产生的噪声表示,导致性能受限。其解决方案的关键在于引入少量目标域语音数据,通过混合批处理(Mixed Batching, MB)策略将文本与语音数据混合训练,从而提供有效的模态对齐信号;实验表明,仅需10%的目标域语音数据(不足4小时),即可实现与传统端到端ASR全量语音微调相当或更优的词错误率(Word Error Rate, WER)。
链接: https://arxiv.org/abs/2604.06487
作者: Thibault Bañeras-Roux,Sergio Burdisso,Esaú Villatoro-Tello,Dairazalia Sánchez-Cortés,Shiran Liu,Severin Baroudi,Shashi Kumar,Hasindri Watawana,Manjunath K E,Kadri Hacioglu,Petr Motlicek,Andreas Stolcke
机构: Idiap Research Institute (Idiap研究学院); Laboratoire d’Informatique et des Systèmes (信息与系统实验室); Uniphore (Uniphore); Brno University of Technology (布尔诺理工大学)
类目: Computation and Language (cs.CL)
备注: Submitted to Interspeech
Abstract:Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain (less than 4 hours) speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
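The mixed batching (MB) strategy can be sketched as a batch builder that mixes a small fraction of paired speech-text examples into mostly text-only batches. The 10% ratio mirrors the paper's best setting, but the data layout and sampling scheme below are illustrative assumptions.

```python
import random

def mixed_batches(paired, text_only, batch_size, speech_frac=0.1, seed=0):
    """Yield training batches in which about `speech_frac` of the examples
    carry audio, so the LLM keeps seeing the projector's (noisy) speech
    representations while adapting mostly on cheap text-only data."""
    rng = random.Random(seed)
    n_speech = max(1, int(batch_size * speech_frac))
    while True:
        batch = (rng.sample(paired, n_speech)
                 + rng.sample(text_only, batch_size - n_speech))
        rng.shuffle(batch)
        yield batch

paired = [("speech", i) for i in range(40)]      # paired speech-text examples
text_only = [("text", i) for i in range(400)]    # target-domain text
batch = next(mixed_batches(paired, text_only, batch_size=10, speech_frac=0.1))
```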
[NLP-84] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLM s
【速读】: 该论文旨在解决现有语言模型在文化价值观评估中缺乏视觉情境理解能力的问题,即当前方法主要依赖文本模态,无法验证模型是否能在图像等视觉信息条件下准确进行文化相关的价值判断。解决方案的关键在于构建一个名为ValueGround的多模态基准测试集,该测试集基于世界价值观调查(World Values Survey, WVS)设计,采用最小对比图像对来表征对立的回答选项,并控制无关变量变化;在此基础上,要求模型在不接触原始文本选项的情况下,根据国家和问题选择最符合该国价值倾向的图像,从而系统评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态情境下对文化条件化价值判断的接地能力。
链接: https://arxiv.org/abs/2604.06484
作者: Zhipin Wang,Christoph Leiter,Christian Frey,Mohamed Hesham Ibrahim Abdalla,Josif Grabocka,Steffen Eger
机构: University of Technology Nuremberg (UTN), Germany
类目: Computation and Language (cs.CL)
备注: Preprint. Under review
Abstract:Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country’s value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.
[NLP-85] DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在面对大规模结构化数据库时,难以实现深度研究的问题。现有方法主要聚焦于非结构化网络数据的检索与摘要,而忽略了结构化数据中所需的迭代假设生成、基于模式的定量推理以及形成连贯分析叙事的能力。解决方案的关键在于提出DataSTORM——一个基于LLM的智能体系统,其将结构化数据上的深度研究重构为以论题驱动的分析过程:从数据中发现候选论题,通过跨源迭代验证,并最终构建出逻辑一致的分析叙事。该框架融合探索性数据分析(Exploratory Data Analysis)和数据讲故事(Data Storytelling)原则,在InsightBench和ACLED真实复杂数据库上均显著优于现有方法,包括商用系统如ChatGPT Deep Research。
链接: https://arxiv.org/abs/2604.06474
作者: Shicheng Liu,Yucheng Jiang,Sajid Farook,Camila Nicollier Sanchez,David Fernando Castro Pena,Monica S. Lam
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.
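The thesis-driven loop (discover candidate theses, validate them iteratively, keep the ones that survive) can be caricatured as a filter over hypotheses with a validation budget. Everything below is a schematic stand-in; the agent's actual discovery and validation steps involve LLM calls and database queries.

```python
def thesis_driven_research(candidates, validate, max_rounds=3):
    """Keep only candidate theses that pass `validate` in every round,
    mimicking iterative cross-source investigation: each round can use
    a different evidence source, and a thesis must survive all of them."""
    surviving = list(candidates)
    for round_idx in range(max_rounds):
        surviving = [t for t in surviving if validate(t, round_idx)]
        if not surviving:
            break
    return surviving

# Toy setup: thesis strength per source; a thesis survives a round if its
# support in that round's source exceeds a threshold (all values invented).
support = {
    "conflict rises in election years": [0.9, 0.8, 0.7],
    "weather drives all conflict":      [0.6, 0.2, 0.1],
}
def validate(thesis, round_idx):
    return support[thesis][round_idx] > 0.5

narratives = thesis_driven_research(list(support), validate)
```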
[NLP-86] Multi-objective Evolutionary Merging Enables Efficient Reasoning Models
【速读】: 该论文旨在解决长推理(Long-to-Short, L2S)问题,即如何在保持高准确率的同时显著减少生成式 AI 模型推理过程中所需的 token 数量。当前训练-free 的模型融合方法依赖于固定超参数的标量算术合并策略,存在鲁棒性差、难以平衡准确率与输出长度的问题。其解决方案的关键在于提出 Evo-L2S 框架,将 L2S 推理建模为多目标优化问题,通过进化模型融合技术显式优化准确率与输出长度之间的权衡关系,从而获得一组稳健的帕累托最优解(Pareto front)。为提升大规模语言模型上的可扩展性,作者进一步引入基于熵的子集采样技术,大幅降低适应度评估的计算开销,实验证明该方法可在多个数学推理基准上实现超过 50% 的推理轨迹压缩,同时维持或提升原始模型的准确性。
链接: https://arxiv.org/abs/2604.06465
作者: Mario Iacobelli,Adrian Robert Minut,Tommaso Mencattini,Donato Crisostomi,Andrea Santilli,Iacopo Masi,Emanuele Rodolà
机构: Sapienza University of Rome (罗马大学); EPFL (瑞士联邦理工学院); NVIDIA (英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.
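The multi-objective formulation produces a Pareto front over (accuracy, output length) rather than a single scalarized compromise. A minimal non-dominated filter illustrates the idea; the candidate scores below are fabricated for the example.

```python
def pareto_front(candidates):
    """Keep merged-model candidates not dominated by any other, where we
    want accuracy high and token count low: candidate (a, t) dominates
    (acc, tok) if a >= acc and t <= tok and they are not identical."""
    front = []
    for acc, tok in candidates:
        dominated = any((a >= acc and t <= tok) and (a, t) != (acc, tok)
                        for a, t in candidates)
        if not dominated:
            front.append((acc, tok))
    return front

# (accuracy, avg reasoning tokens) for four hypothetical merges
merges = [(0.90, 1000), (0.85, 400), (0.80, 500), (0.90, 1200)]
front = pareto_front(merges)
```

The evolutionary search explores merge coefficients to populate this front, with entropy-based subset sampling keeping each fitness evaluation cheap.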
[NLP-87] Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
【速读】: 该论文旨在解决当前机器翻译(Machine Translation, MT)系统在处理阿拉伯语方言多样性时存在的局限性,即系统常将不同方言统一转换为现代标准阿拉伯语(Modern Standard Arabic, MSA),缺乏对目标方言的可控性和精准表达。其解决方案的关键在于提出一种上下文感知且可调控的方言阿拉伯语MT框架,核心技术是构建一个基于规则的数据增强(Rule-Based Data Augmentation, RBDA)管道,将3,000句种子语料扩展为涵盖八种区域变体(如埃及、黎凡特、海湾等)的57,000句平衡平行语料库,并通过轻量级元数据标签微调mT5-base模型,实现对翻译输出中方言和社交语域的可控生成。
链接: https://arxiv.org/abs/2604.06456
作者: Afroza Nowshin,Prithweeraj Acharjee Porag,Haziq Jeelani,Fayeq Jeelani Syed
机构: University of Toledo, Toledo, Ohio; Claremont Graduate University, Claremont, California
类目: Computation and Language (cs.CL)
备注: 14 pages, 5 figures, 5 tables. Preprint under review
Abstract:Current Machine Translation (MT) systems for Arabic often struggle to account for dialectal diversity, frequently homogenizing dialectal inputs into Modern Standard Arabic (MSA) and offering limited user control over the target vernacular. In this work, we propose a context-aware and steerable framework for dialectal Arabic MT that explicitly models regional and sociolinguistic variation. Our primary technical contribution is a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset, covering eight regional varieties eg., Egyptian, Levantine, Gulf, etc. By fine-tuning an mT5-base model conditioned on lightweight metadata tags, our approach enables controllable generation across dialects and social registers in the translation output. Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by defaulting toward the MSA mean, while exhibiting limited dialectal specificity. In contrast, our model achieves lower BLEU scores (8.19) but produces outputs that align more closely with the intended regional varieties. Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs. 1.0/5). These findings highlight the limitations of standard MT metrics for dialect-sensitive tasks and motivate the need for evaluation practices that better reflect linguistic diversity in Arabic MT.
[NLP-88] Learning to Interrupt in Language-based Multi-agent Communication
【速读】: 该论文旨在解决多智能体系统中因语言模型(Large Language Models, LLMs)生成冗长消息而导致的上下文过载和计算成本增加的问题。现有方法仅从发言方压缩信息,难以适应不同接收方并精准识别相关性。其解决方案的关键在于提出一种可中断的通信框架(HANDRAISER),允许倾听者在适当时机打断当前发言者,从而提升沟通效率。该框架通过学习预测基于未来奖励与代价的最优中断点,克服了LLMs过度自信导致的过早中断问题,并在文本猜图、会议安排和辩论等多场景中验证了有效性,相较基线降低32.2%的通信成本,同时保持或优于任务性能。
链接: https://arxiv.org/abs/2604.06452
作者: Danqing Wang,Da Yin,Ruta Desai,Lei Li,Asli Celikyilmaz,Ansong Ni
机构: CMU(卡内基梅隆大学); Meta FAIR(元数据公平研究实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-agent systems using large language models (LLMs) have demonstrated impressive capabilities across various domains. However, current agent communication suffers from verbose output that overload context and increase computational costs. Although existing approaches focus on compressing the message from the speaker side, they struggle to adapt to different listeners and identify relevant information. An effective way in human communication is to allow the listener to interrupt and express their opinion or ask for clarification. Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker. Through prompting experiments, we find that current LLMs are often overconfident and interrupt before receiving enough information. Therefore, we propose a learning method that predicts the appropriate interruption points based on the estimated future reward and cost. We evaluate our framework across various multi-agent scenarios, including 2-agent text pictionary games, 3-agent meeting scheduling, and 3-agent debate. The results of the experiment show that our HANDRAISER can reduce the communication cost by 32.2% compared to the baseline with comparable or superior task performance. This learned interruption behavior can also be generalized to different agents and tasks.
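The learned interruption criterion (interrupt when estimated future reward outweighs cost) reduces to a simple expected-value comparison. The decision rule and all numeric parameters below are illustrative assumptions, not the paper's trained predictor.

```python
def should_interrupt(p_ready, gain_if_right, loss_if_wrong, wait_cost):
    """Interrupt only when the expected value of speaking now (reward if
    the listener really has enough information, minus the risk of a
    premature, overconfident interruption) beats the cost of letting
    the speaker continue for one more turn."""
    ev_now = p_ready * gain_if_right - (1 - p_ready) * loss_if_wrong
    return ev_now > -wait_cost
```

With this rule, a listener who is 90% sure it has enough information interrupts, while one at 20% keeps listening, which is exactly the overconfidence failure the prompting experiments expose.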
[NLP-89] The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)监控的有效性问题,即大语言模型(Large Language Models, LLMs)是否能在不依赖显式中间步骤监督的情况下,通过潜在表示(latent representations)进行多步规划并执行。其核心发现是:尽管模型在训练中最多能学习到五层潜在规划深度,但一旦发现某种策略,该策略可在测试时泛化至八层潜在步骤,这揭示了“策略发现”与“策略执行”之间的分离现象。解决方案的关键在于利用图路径查找任务对潜在规划步数进行精确控制,从而量化模型在无监督条件下发现和执行隐式多步规划的能力,进而表明若此类限制普遍适用,则复杂协同的多步潜在规划可能需显式教学或外化,为CoT监控提供了理论支持。
链接: https://arxiv.org/abs/2604.06427
作者: Yi Xu,Philipp Jettkant,Laura Ruis
机构: University of Cambridge (剑桥大学); Imperial College London (帝国理工学院); MIT (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 3 figures, 1 table (30 pages, 9 figures, 10 tables including references and appendices)
Abstract:The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.
[NLP-90] Team Fusion@SU @BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking
【速读】: 该论文旨在解决SympTEMIST任务中的命名实体识别(Named Entity Recognition, NER)与实体链接(Entity Linking, EL)问题。其解决方案的关键在于:针对NER任务,采用基于RoBERTa的模型,并在其基础上引入双向长短期记忆网络(BiLSTM)和条件随机场(CRF)层进行微调,同时利用数据增强策略扩充训练集;对于EL任务,则通过跨语言SapBERT-XLMR-Large模型生成候选实体,并基于余弦相似度计算与知识库的匹配度,其中实验表明知识库的选择对模型性能影响最大。
链接: https://arxiv.org/abs/2604.06424
作者: Georgi Grazhdanski,Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva
机构: FMI, Sofia University St. Kliment Ohridski; Ontotext
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 tables, Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, American Medical Informatics Association 2023 Annual Symposium
Abstract:This paper presents a transformer-based approach to solving the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. For NER, we fine-tune a RoBERTa-based (1) token-level classifier with BiLSTM and CRF layers on an augmented train set. Entity linking is performed by generating candidates using the cross-lingual SapBERT XLMR-Large (2), and calculating cosine similarity against a knowledge base. The choice of knowledge base proves to have the highest impact on model accuracy.
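实体链接部分"生成候选 + 与知识库计算余弦相似度"的流程可以用如下示意代码表达(向量与知识库条目均为虚构的小例子,并非论文实际使用的 SapBERT XLMR-Large 嵌入)。

```python
import numpy as np

def link_entity(mention_vec, kb_vecs, kb_names):
    """按与知识库向量的余弦相似度对候选实体排序,返回最佳匹配及其得分。"""
    m = mention_vec / np.linalg.norm(mention_vec)
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    sims = kb @ m                      # 每个知识库条目与提及的余弦相似度
    best = int(np.argmax(sims))
    return kb_names[best], float(sims[best])

# 虚构的三条目知识库
kb_names = ["fever", "cough", "headache"]
kb_vecs = np.array([[1.0, 0.1, 0.0],
                    [0.0, 1.0, 0.2],
                    [0.1, 0.0, 1.0]])
name, score = link_entity(np.array([0.9, 0.2, 0.1]), kb_vecs, kb_names)
print(name)  # fever
```

论文的结论是:在这一流程中,知识库本身的选择(而非相似度计算方式)对准确率影响最大。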
[NLP-91] When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在决策过程中行为不可预测、无法可靠预测自身行为以及缺乏对自身推理过程忠实性的问题,这些问题直接影响其在高风险场景中的可信部署。解决方案的关键在于引入Graded Color Attribution (GCA) 数据集——一个受控基准,通过线稿图像在三种条件下(世界知识重着色、反事实重着色及无颜色先验形状)系统性地诱发模型与人类参与者建立颜色标注阈值规则,并比较其内省规则与最终颜色归属决策的一致性。研究发现,VLMs 虽能准确估计像素级颜色覆盖率,却在多数情况下违背自身设定的规则,表现出显著的内省自我认知失调;而人类则保持规则一致性,任何偏差可归因于对颜色覆盖度的普遍高估。这一结果表明,VLMs 的推理失败并非源于任务难度,而是源于其内省自知能力的校准错误,这对高风险应用具有直接警示意义。
链接: https://arxiv.org/abs/2604.06422
作者: Jonathan Nemitz,Carsten Eickhoff,Junyi Jessy Li,Kyle Mahowald,Michal Golovanevsky,William Rudman
机构: University of Tübingen (图宾根大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Brown University (布朗大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.
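论文中"自述阈值规则 vs. 实际颜色归属"的对比,可以抽象为如下一致性检查:给定模型自述的最低覆盖率阈值与某物体的像素级颜色覆盖率,判断其最终标注是否违反自身规则(纯示意,数值为虚构)。

```python
def violates_rule(stated_threshold, coverage, labeled_as_color):
    """若实际标注与"覆盖率 >= 自述阈值"的应然结论不一致,即违反自身规则。"""
    should_label = coverage >= stated_threshold
    return should_label != labeled_as_color

# 示例:模型自述"红色像素 >= 50% 才叫红",覆盖率仅 30% 却仍标为红 → 违规
print(violates_rule(0.5, 0.3, True))   # True
print(violates_rule(0.5, 0.7, True))   # False
```

论文发现 VLM 能准确估计 coverage,却在约六成案例中给出与自述阈值矛盾的标注;人类的"违规"则可归因于对 coverage 的系统性高估,规则本身仍被遵守。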
[NLP-92] State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation
【速读】: 该论文旨在解决阿拉伯语在大型语言模型(LLM)生态系统中因代表性不足而导致的数字不平等与性能缺陷问题。其核心挑战在于,现有主流模型对阿拉伯语的支持往往局限于通用架构,缺乏针对该语言特性的深度优化,从而导致其在语法、安全性和多任务能力等关键指标上显著落后于英语模型。解决方案的关键在于三方面协同创新:一是采用稀疏专家混合(sparse MoE)架构以提升参数效率;二是设计基于思维链(Chain-of-Thought, CoT)的四阶段蒸馏策略,嵌入阿拉伯语专属的语言验证机制和区域伦理规范;三是通过精细化的双语数据策划(80/20阿拉伯语-英语训练混合),确保高质量、低污染的数据输入。这一组合使开源模型 Arabic-DeepSeek-R1 在 Open Arabic LLM Leaderboard (OALL) 的全部七个基准测试中取得最高平均得分,并在多个子任务上超越 GPT-5.1 等闭源前沿系统,首次证明了通过文化适配与高效微调可实现对主流商业模型的系统性超越,且无需工业级预训练成本。
链接: https://arxiv.org/abs/2604.06421
作者: Navan Preet Singh,Anurag Garikipati,Ahmed Abulkhair,Jyani Akshay Jagdishbhai,Atul Yaduvanshi,Amarendra Chaudhary,Madalina Ciobanu,Qingqing Mao,Ritankar Das
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically outperform the proprietary frontier system GPT-5.1 on the majority of benchmarks evaluating comprehensive language-specific tasks: the first such demonstration for Arabic LLMs. These findings indicate that much of Arabic’s performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations, and that parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs. Arabic-DeepSeek-R1 establishes a validated and replicable framework for sovereign and domain-specific language technologies, demonstrating that strategic, culturally-grounded adaptation of sparse MoE backbones offers a viable and cost-effective pathway to achieving record-breaking performance across standardized benchmarks for low-resource languages.
[NLP-93] Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本时,其信息整合能力未能同步提升的问题,特别是针对小说这类结构复杂的长篇叙事文本的摘要生成任务。解决方案的关键在于通过对比人类与LLM生成的摘要,量化两者在概念层面的参与模式差异:研究者首先构建了一个包含150个人类撰写的小说摘要及其对应章节的对齐数据集,以此作为基准;随后使用九个前沿LLM为相同文本生成摘要并进行对齐分析,发现LLM倾向于聚焦于文本末尾内容,而人类则更均衡地覆盖整个叙事结构。这一发现揭示了LLM在叙事理解上的局限性,并指出其注意力机制可能是导致此类问题的原因,从而为改进模型的长程语义建模能力提供了可操作的研究方向。
链接: https://arxiv.org/abs/2604.06416
作者: Rebecca M. M. Hicke,Sil Hamilton,David Mimno,Ross Deans Kristensen-McLachlan
机构: Cornell University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment task, which indicates the complexity of summarization as a task. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.
[NLP-94] Say Something Else: Rethinking Contextual Privacy as Information Sufficiency
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在代用户撰写消息时,因用户过度分享敏感信息且对隐私边界认知不一而导致的隐私泄露问题。现有方法仅支持抑制(删除敏感内容)和泛化(用抽象替代敏感信息),且评估局限于单轮孤立消息,无法反映真实对话中的持续压力。论文的关键创新在于:首先将隐私保护的LLM通信形式化为信息充分性(Information Sufficiency, IS)任务,其次提出自由文本伪名化(free-text pseudonymization)作为第三种策略——即用功能等价的替代项替换敏感属性,从而在保障沟通效用的同时提升隐私性;最后设计了对话式评估协议,在多轮交互场景下考察隐私策略在现实follow-up压力下的表现。实证结果表明,伪名化在隐私与效用权衡上最优,且单消息评估显著低估了信息泄露风险,尤其泛化策略在多轮情境下隐私损失最高达16.3个百分点。
链接: https://arxiv.org/abs/2604.06409
作者: Yunze Xiao,Wenkai Li,Xiaoyuan Wu,Ningshan Ma,Yueqi Song,Weihao Xuan
机构: Carnegie Mellon University (卡内基梅隆大学); Massachusetts Institute of Technology (麻省理工学院); University of Tokyo (东京大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private. Existing systems support only suppression (omitting sensitive information) and generalization (replacing information with an abstraction), and are typically evaluated on single isolated messages, leaving both the strategy space and evaluation setting incomplete. We formalize privacy-preserving LLM communication as an \textbfInformation Sufficiency (IS) task, introduce \textbffree-text pseudonymization as a third strategy that replaces sensitive attributes with functionally equivalent alternatives, and propose a \textbfconversational evaluation protocol that assesses strategies under realistic multi-turn follow-up pressure. Across 792 scenarios spanning three power-relation types (institutional, peer, intimate) and three sensitivity categories (discrimination risk, social cost, boundary), we evaluate seven frontier LLMs on privacy at two granularities, covertness, and utility. Pseudonymization yields the strongest privacy\textendash utility tradeoff overall, and single-message evaluation systematically underestimates leakage, with generalization losing up to 16.3 percentage points of privacy under follow-up.
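论文提出的第三种策略"自由文本伪名化"是用功能等价的替代项替换敏感属性。下面用固定映射的字符串替换给出一个极简示意(论文中的替代项由 LLM 自由生成,此处的映射与示例文本均为虚构的简化)。

```python
def pseudonymize(text, mapping):
    """用功能等价的替代项替换文本中的敏感属性(映射为虚构示例)。"""
    for sensitive, alt in mapping.items():
        text = text.replace(sensitive, alt)
    return text

msg = "I work at Acme Corp and was diagnosed with diabetes."
mapping = {"Acme Corp": "a mid-size tech company",
           "diabetes": "a chronic condition"}
print(pseudonymize(msg, mapping))
```

与抑制(直接删除)和泛化(替换为抽象类别)相比,这类替换在保留消息沟通功能的同时隐藏了具体敏感值,这正是其在隐私-效用权衡上占优的直觉来源。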
[NLP-95] FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts IJCAI2025
【速读】: 该论文旨在解决西班牙语临床文本中毒性习惯(Toxic Habits)命名实体识别问题,具体聚焦于检测和分类临床病例报告中的物质使用与滥用提及,分为烟草(Tobacco)、酒精(Alcohol)、大麻(Cannabis)和药物(Drug)四类。解决方案的关键在于探索多种大语言模型(Large Language Models, LLMs)的应用策略,包括零样本(zero-shot)、少样本(few-shot)提示和提示优化方法,并发现GPT-4.1的少样本提示方法在实验中表现最优,最终在测试集上达到0.65的F1分数,验证了该方法在非英语语言场景下进行命名实体识别的可行性与有效性。
链接: https://arxiv.org/abs/2604.06403
作者: Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva
机构: Sofia University St. Kliment Ohridski (索非亚大学圣克莱门特奥赫里德斯基); Graphwise (Graphwise)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure, 6 tables, Challenge and Workshop BC9 Large Language Models for Clinical and Biomedical NLP, International Joint Conference on Artificial Intelligence IJCAI 2025
Abstract:The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them in four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1’s few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.
[NLP-96] ART: Attention Replacement Technique to Improve Factuality in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在问答等任务中出现的幻觉(hallucination)问题,即模型生成看似合理但实际错误或无关的信息。研究表明,浅层网络中存在均匀注意力(uniform attention)模式,导致模型无法聚焦于最相关的信息,从而引发幻觉。解决方案的关键在于提出一种无需训练的Attention Replacement Technique (ART),通过将浅层中的均匀注意力模式替换为局部注意力(local attention)模式,引导模型更关注上下文中的关键信息,从而有效降低幻觉现象,并在多种LLM架构上展现出良好的通用性和有效性。
链接: https://arxiv.org/abs/2604.06393
作者: Ziqin Luo,Yihao Quan,Xiaofeng Zhang,Xiaosong Yuan,Chen Shen
机构: Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); Alibaba Cloud Computing (阿里云计算)
类目: Computation and Language (cs.CL)
备注:
Abstract:Hallucination in large language models (LLMs) continues to be a significant issue, particularly in tasks like question answering, where models often generate plausible yet incorrect or irrelevant information. Although various methods have been proposed to mitigate hallucinations, the relationship between attention patterns and hallucinations has not been fully explored. In this paper, we analyze the distribution of attention scores across each layer and attention head of LLMs, revealing a common and intriguing phenomenon: shallow layers of LLMs primarily rely on uniform attention patterns, where the model distributes its attention evenly across the entire sequence. This uniform attention pattern can lead to hallucinations, as the model fails to focus on the most relevant information. To mitigate this issue, we propose a training-free method called Attention Replacement Technique (ART), which replaces these uniform attention patterns in the shallow layers with local attention patterns. This change directs the model to focus more on the relevant contexts, thus reducing hallucinations. Through extensive experiments, ART demonstrates significant reductions in hallucinations across multiple LLM architectures, proving its effectiveness and generalizability without requiring fine-tuning or additional training data.
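ART 的核心思想是检测浅层中近似均匀的注意力分布,并将其替换为局部窗口注意力。下面的示意代码用注意力行的熵判定"均匀性",并构造一个只看邻近已解码位置的局部注意力行(熵阈值与窗口大小均为本文举例时的假设,并非论文给出的具体实现)。

```python
import numpy as np

def is_uniform(attn_row, tol=0.1):
    """若某行注意力的熵接近均匀分布的最大熵 log(n),则视为均匀模式。"""
    n = len(attn_row)
    entropy = -np.sum(attn_row * np.log(attn_row + 1e-12))
    return entropy > (1 - tol) * np.log(n)

def local_attention(seq_len, pos, window=2):
    """返回只关注 pos 前 window 个位置(含自身)的归一化注意力行。"""
    row = np.zeros(seq_len)
    lo, hi = max(0, pos - window), min(seq_len, pos + 1)  # 因果:不看未来
    row[lo:hi] = 1.0 / (hi - lo)
    return row

attn = np.full(8, 1 / 8)              # 浅层常见的均匀注意力行
if is_uniform(attn):
    attn = local_attention(8, pos=7)  # 替换为局部注意力
print(attn)
```

该方法无需训练,仅在前向计算时改写浅层注意力分布,因此可以直接套用到不同架构的 LLM 上。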
[NLP-97] Application-Driven Pedagogical Knowledge Optimization of Open-Source LLM s via Reinforcement Learning and Supervised Fine-Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在教育领域中缺乏专业教学知识的问题,尤其针对其在跨学科教学理解与推理能力上的不足。解决方案的关键在于提出一种多阶段优化策略,融合强化学习(Reinforcement Learning, RL)与监督微调(Supervised Fine-Tuning, SFT),通过三阶段流程实现:首先利用渐进难度训练和扩展推理滚动(extended reasoning rollouts)提升模型对复杂教学场景的处理能力;其次借助RL训练后的模型生成高质量、难度加权的训练数据,并进行SFT以固化教学知识;最后可选地引入第二轮RL优化进一步增强性能。该方法使32B参数规模的开源模型(如EduQwen系列)在Cross-Domain Pedagogical Knowledge (CDPK)基准上达到新的SOTA水平,显著优于更大规模的闭源系统(如Gemini-3 Pro),证明了领域专业化优化可将中等规模模型转化为教育领域的专家级模型,同时保持透明性、可定制性和成本效益,满足负责任教育AI部署的需求。
链接: https://arxiv.org/abs/2604.06385
作者: Navan Preet Singh,Xiaokun Wang,Anurag Garikipati,Madalina Ciobanu,Qingqing Mao,Ritankar Das
机构: 未知
类目: Computation and Language (cs.CL)
备注: * These authors contributed equally to this work and share first authorship
Abstract:We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.
[NLP-98] The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中连续思维链(Latent Chain-of-Thought, Latent CoT)推理是否真正利用了“叠加态”(superposition)这一核心问题,即语言模型能否在单一表示中同时维持多个候选解。其解决方案的关键在于系统性地比较三种不同训练范式:无训练(training-free)构造潜空间思维、微调(fine-tuned)适应潜空间思维以及从头训练(from-scratch)完全基于潜空间思维的模型,并通过Logit Lens和实体级探测(entity-level probing)分析内部表征。研究发现,仅从头训练的模型表现出明显的叠加态迹象,而其他两种范式因预训练数据导致的最终层token承诺偏倚(commitment bias)和容量效应(capacity effect)使叠加态崩溃或未被使用,从而揭示了叠加态出现与否的根本条件。
链接: https://arxiv.org/abs/2604.06374
作者: Michael Rizvi-Martel,Guillaume Rabusseau,Marius Mosbach
机构: Mila – Quebec AI Institute (Mila – 魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages
Abstract:Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: i) pretraining on natural language data biases models to commit to a token in the last layers ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.
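论文用于分析内部表征的 Logit Lens,其基本操作是把中间层隐藏状态直接经输出解嵌入矩阵投影到词表空间,观察其当前"倾向"的 token。下面给出一个最小示意(矩阵与隐藏状态均为虚构,仅展示投影与 argmax 的流程)。

```python
import numpy as np

def logit_lens(hidden_state, unembed, vocab):
    """把某一层隐藏状态经解嵌入矩阵投影到词表,返回 logit 最大的 token。"""
    logits = unembed @ hidden_state        # 形状为 (vocab_size,)
    return vocab[int(np.argmax(logits))]

vocab = ["left", "right", "up"]
unembed = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.5, 0.5]])
h = np.array([0.2, 0.9])                   # 虚构的中间层隐藏状态
print(logit_lens(h, unembed, vocab))       # right
```

若模型维持着多个候选解的"叠加态",中间层投影后应同时给多个候选较高的 logit;若过早坍缩到单一 token,则对应论文所述的末层承诺偏倚。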
[NLP-99] A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation
【速读】: 该论文旨在解决阿拉伯语医学文本生成中因训练样本假设均匀重要性而导致模型难以有效学习严重或高风险病例的问题。现有方法忽视了临床严重程度的差异,限制了模型对复杂医疗情境的捕捉能力。解决方案的关键在于提出一种基于严重程度的课程学习策略(Severity-based Curriculum Learning Strategy),将训练数据按严重等级(轻度、中度、重度)分阶段排序,并在微调过程中逐步引入更复杂的病例,使模型先掌握基础医学模式,再逐步适应高风险场景。实验表明,该策略在MAQA数据集上显著提升性能,相较基线模型提升4%–7%,较传统微调提升3%–6%。
链接: https://arxiv.org/abs/2604.06365
作者: Ahmed Alansary,Molham Mohamed,Ali Hamdi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, 2 tables, ICTIS2026
Abstract:Arabic medical text generation is increasingly needed to help users interpret symptoms and access general health guidance in their native language. Nevertheless, many existing methods assume uniform importance across training samples, overlooking differences in clinical severity. This simplification can hinder the model’s ability to properly capture complex or high-risk cases. To overcome this issue, this work introduces a Severity-based Curriculum Learning Strategy for Arabic Medical Text Generation, where the training process is structured to move gradually from less severe to more critical medical conditions. The approach divides the dataset into ordered stages based on severity and incrementally exposes the model to more challenging cases during fine-tuning, allowing it to first learn basic medical patterns before addressing more complex scenarios. The proposed method is evaluated on a subset of the Medical Arabic Question Answering (MAQA) dataset, which includes Arabic medical questions describing symptoms alongside corresponding responses. In addition, the dataset is annotated with three severity levels (Mild, Moderate, and Critical) using a rule-based method developed in this study. The results demonstrate that incorporating severity-aware curriculum learning leads to consistent performance improvements across all tested models, with gains of around +4% to +7% over baseline models and +3% to +6% compared with conventional fine-tuning approaches.
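按严重程度"由轻到重"分阶段组织训练数据的课程学习流程,可用如下骨架示意(样本与阶段划分为虚构示例;真实实现需替换为 MAQA 样本与具体微调步骤)。

```python
SEVERITY_ORDER = {"Mild": 0, "Moderate": 1, "Critical": 2}

def curriculum_stages(samples):
    """按严重等级排序样本,并切成逐级累加的训练阶段。"""
    stages = {level: [] for level in SEVERITY_ORDER}
    for s in samples:
        stages[s["severity"]].append(s)
    # 每一阶段累计包含之前所有较轻的样本,逐步引入更严重的病例
    ordered, curriculum = [], []
    for level in sorted(SEVERITY_ORDER, key=SEVERITY_ORDER.get):
        ordered += stages[level]
        curriculum.append(list(ordered))
    return curriculum

data = [{"text": "mild case", "severity": "Mild"},
        {"text": "critical case", "severity": "Critical"},
        {"text": "moderate case", "severity": "Moderate"}]
for i, stage in enumerate(curriculum_stages(data)):
    print(f"stage {i}: {len(stage)} samples")
```

这样模型先在轻症样本上学到基础医学模式,再在后续阶段接触中度与危重病例,对应论文报告的 +4% 至 +7% 提升。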
[NLP-100] In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features Linguistic Structure and Induction Heads
【速读】: 该论文旨在解决生成式 AI(Generative AI)在语音语言模型(Speech Language Models)中进行上下文学习(In-Context Learning, ICL)时,语言和声学特征如何影响其性能的问题。研究聚焦于文本到语音(Text-to-Speech, TTS)任务,从两个维度分析:一是模型能否准确从示范样本中推断出目标任务(即生成正确的语音内容),二是模型输出是否能模仿示范语音的声学特性。解决方案的关键在于识别出说话速率(speaking rate)是影响ICL性能的核心因素,并且该特征在输出中被一致复制;而音高范围(pitch range)和强度(intensity)则对性能影响较小且不具一致性。此外,研究进一步揭示了归纳头(induction heads)在语音ICL中的因果作用——移除前k个归纳头会完全消除模型的ICL能力,与文本领域发现一致,表明该机制具有跨模态普适性。
链接: https://arxiv.org/abs/2604.06356
作者: Charlotte Pouw,Hosein Mohebbi,Afra Alishahi,Willem Zuidema
机构: ILLC, University of Amsterdam (阿姆斯特丹大学信息语言学实验室); CSAI, Tilburg University (蒂尔堡大学计算社会科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to COLM 2026
Abstract:In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain. Here, we investigate how linguistic and acoustic features affect ICL in Speech Language Models. We focus on the Text-to-Speech (TTS) task, which allows us to analyze ICL from two angles: (1) how accurately the model infers the task from the demonstrations (i.e., generating the correct spoken content), and (2) to what extent the model mimics the acoustic characteristics of the demonstration speech in its output. We find that speaking rate strongly affects ICL performance and is also mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Finally, we investigate the role of induction heads in speech-based ICL and show that these heads play a causal role: ablating the top-k induction heads completely removes the model’s ICL ability, mirroring findings from text-based ICL.
[NLP-101] Severity-Aware Weighted Loss for Arabic Medical Text Generation
【速读】: 该论文旨在解决阿拉伯语医疗文本生成中因传统微调目标对所有临床病例一视同仁而忽略临床严重性差异的问题,这一缺陷在医疗场景下尤为关键,因为严重病例的错误可能带来更高的临床风险。解决方案的关键在于提出一种基于严重程度感知的加权损失函数(severity-aware weighted loss),通过软严重性概率动态调整训练过程中每个token级别的损失贡献,从而在不修改模型架构的前提下优先优化高危临床交互。实验表明,该方法在MAQA数据集上对多种阿拉伯语大语言模型均能显著提升性能,最高提升达12.10%,且改进效果具有架构一致性。
链接: https://arxiv.org/abs/2604.06346
作者: Ahmed Alansary,Molham Mohamed,Ali Hamdi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure, 2 tables, ICTIS2026
Abstract:Large language models have shown strong potential for Arabic medical text generation; however, traditional fine-tuning objectives treat all medical cases uniformly, ignoring differences in clinical severity. This limitation is particularly critical in healthcare settings, where errors in severe cases contain higher clinical risk. In this work, we propose a severity-aware weighted loss for fine-tuning Arabic language models on medical complaint-response data. The method depends on soft severity probabilities to dynamically scale token-level loss contributions during optimization, thereby prioritizing clinically critical interactions without modifying model architectures. Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses. Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier and incorporated exclusively at the loss level. The proposed approach is evaluated across ten Arabic large language models of varying architectures and parameter scales. While standard cross-entropy fine-tuning yields only modest improvements, severity-aware optimization consistently achieves larger gains. Using a balanced weighting configuration, performance improves from 54.04% to 66.14% for AraGPT2-Base, from 59.16% to 67.18% for AraGPT2-Medium, and from 57.83% to 66.86% for Qwen2.5-0.5B, with peak performance reaching 67.18%. Overall, severity-aware fine-tuning delivers improvements of up to 12.10% over non-fine-tuned baselines, demonstrating robust and architecture-consistent gains.
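严重度加权损失的基本形式,是在逐 token 交叉熵上乘以由严重度概率得到的权重。下面用 numpy 给出一个最小示意(权重映射 base + scale * p 为本文举例时的假设,并非论文的原始配置)。

```python
import numpy as np

def severity_weighted_ce(logits, targets, severity_probs, base=1.0, scale=2.0):
    """逐 token 交叉熵,按样本的严重度概率加权放大损失贡献。"""
    z = logits - logits.max(axis=-1, keepdims=True)          # 数值稳定的 softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]       # 每个 token 的负对数似然
    weights = base + scale * severity_probs                  # 严重样本权重更高
    return float((weights * nll).mean())

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
targets = np.array([0, 1])
loss_mild = severity_weighted_ce(logits, targets, np.array([0.1, 0.1]))
loss_crit = severity_weighted_ce(logits, targets, np.array([0.9, 0.9]))
print(loss_mild < loss_crit)  # True:同样的预测,严重样本的损失被放大
```

由于只改动损失层而不触及模型架构,该方法可以无缝套用到文中测试的十个不同规模的阿拉伯语模型上。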
[NLP-102] STDec: Spatio-Temporal Stability Guided Decoding for dLLM s
【速读】: 该论文旨在解决扩散型大语言模型(Diffusion Large Language Models, dLLMs)在解码过程中采用全局置信度阈值导致的局部上下文建模不足与预测token ID时间一致性缺失的问题。现有方法未显式利用相邻已解码状态的空间信息或跨去噪步骤的时序稳定性,限制了推理效率和质量。解决方案的关键在于提出一种无需训练的时空稳定性引导解码方法(STDec),其核心包括两个机制:一是空间感知解码,通过聚合邻近token的已解码状态动态生成自适应阈值;二是时间感知解码,对在多个去噪步骤中保持一致预测ID的token放宽解码阈值。该方法显著提升了推理吞吐量,同时在文本推理和多模态理解任务上保持与原模型相当的性能表现,例如在MBPP基准上使用LLaDA模型时实现最高14.17倍加速。
链接: https://arxiv.org/abs/2604.06330
作者: Yuzhe Chen,Jiale Cao,Xuyang Liu,Jin Xie,Aiping Yang,Yanwei Pang
机构: Tianjin University (天津大学); Sichuan University (四川大学); Chongqing University (重庆大学)
类目: Computation and Language (cs.CL)
备注: Homepage: this https URL
Abstract:Diffusion Large Language Models (dLLMs) have achieved rapid progress, viewed as a promising alternative to the autoregressive paradigm. However, most dLLM decoders still adopt a global confidence threshold, and do not explicitly model local context from neighboring decoded states or temporal consistency of predicted token IDs across steps. To address this issue, we propose a simple spatio-temporal stability guided decoding approach, named STDec. We observe strong spatio-temporal stability in dLLM decoding: newly decoded tokens tend to lie near decoded neighbors, and their predicted IDs often remain consistent across several denoising steps. Inspired by this stability, our STDec includes spatial-aware decoding and temporal-aware decoding. The spatial-aware decoding dynamically generates the token-adaptive threshold by aggregating the decoded states of nearby tokens. The temporal-aware decoding relaxes the decoding thresholds for tokens whose predicted token IDs remain consistent over denoising steps. Our STDec is training-free and remains compatible with cache-based acceleration methods. Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance score. Notably, on MBPP with LLaDA, STDec achieves up to 14.17x speedup with a comparable score. Homepage: this https URL.
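STDec 的时空自适应阈值思想——邻居中已解码的 token 越多、当前位置的解码门槛越低;预测 ID 连续多步保持一致的 token 进一步放宽阈值——可以这样示意(具体公式与超参数均为本文的假设,并非论文实现)。

```python
import numpy as np

def adaptive_threshold(decoded_mask, pos, base=0.9, relax=0.1, window=2,
                       stable_steps=0, stable_relax=0.05):
    """根据邻居已解码比例(空间)与预测稳定步数(时间)下调置信度阈值。"""
    lo, hi = max(0, pos - window), min(len(decoded_mask), pos + window + 1)
    neighbor_ratio = decoded_mask[lo:hi].mean()          # 空间:邻居解码比例
    thr = base - relax * neighbor_ratio                  # 邻居越多,门槛越低
    thr -= stable_relax * min(stable_steps, 3)           # 时间:连续稳定再放宽
    return max(thr, 0.5)                                 # 阈值下界,防止过度放宽

mask = np.array([1, 1, 0, 0, 0], dtype=float)            # 前两个位置已解码
print(adaptive_threshold(mask, pos=2))                   # 仅空间信息
print(adaptive_threshold(mask, pos=2, stable_steps=3))   # 叠加时间稳定性
```

更低的阈值意味着更多 token 可在同一去噪步内并行确定,这是 STDec 在不重训模型的前提下获得吞吐量提升的来源。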
[NLP-103] Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段产生幻觉(Hallucination)的问题,传统检测方法依赖外部验证机制(如黄金答案、检索系统或辅助判别模型),存在部署复杂性和实时性瓶颈。其解决方案的关键在于:通过弱监督框架将外部 grounding 信号(包括子串匹配、句子嵌入相似度和 LLM 判决)转化为训练阶段的标签,并将其“蒸馏”到模型内部表示中,从而实现仅基于 Transformer 层间隐藏状态(hidden states)即可进行幻觉检测,无需任何外部验证信号即可在推理时完成判断。研究构建了一个包含 15000 样本的标注数据集(含逐层隐藏状态与结构化标签),并验证了多种探测器架构的有效性,证明了幻觉检测信号可被编码进模型表征中,且检测过程对端到端生成效率影响极小(延迟 <6.7ms/样本,吞吐量约 0.231 QPS)。
链接: https://arxiv.org/abs/2604.06277
作者: Shoaib Sadiq Salehmohamed,Jinal Prashant Thakkar,Hansika Aredla,Shaik Mohammed Omar,Shalmali Ayachit
机构: LLM Lens
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 6 figures, 6 tables. Introduces a 15k-sample representation-level hallucination dataset with full transformer hidden states and multi-signal weak supervision. Evaluates 5 probing architectures and demonstrates internal hallucination detection without external inference-time signals. Includes held-out test evaluation and deployment benchmarks
Abstract:Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be distilled into the model’s own representations during training, enabling hallucination detection from internal activations alone at inference time. We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without human annotation. Using this framework, we construct a 15000-sample dataset from SQuAD v2 (10500 train/development samples and a separate 5000-sample test set), where each example pairs a LLaMA-2-7B generated answer with its full per-layer hidden states and structured hallucination labels. We then train five probing classifiers: ProbeMLP (M0), LayerWiseMLP (M1), CrossLayerTransformer (M2), HierarchicalTransformer (M3), and CrossLayerAttentionTransformerV2 (M4), directly on these hidden states, treating external grounding signals as training-time supervision only. Our central hypothesis is that hallucination detection signals can be distilled into transformer representations, enabling internal detection without any external verification at inference time. Results support this hypothesis. Transformer-based probes achieve the strongest discrimination, with M2 performing best on 5-fold average AUC/F1, and M3 performing best on both single-fold validation and held-out test evaluation. We also benchmark inference efficiency: probe latency ranges from 0.15 to 5.62 ms (batched) and 1.55 to 6.66 ms (single sample), while end-to-end generation plus probe throughput remains approximately 0.231 queries per second, indicating negligible practical overhead.
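论文的弱监督标注流程——将子串匹配、嵌入相似度、LLM 判决三路信号组合成 grounded/hallucinated 标签——可以用一个多数投票的示意来表达(组合规则与相似度阈值为本文举例时的假设,论文的实际组合方式以原文为准)。

```python
def weak_label(substring_match, embed_sim, judge_verdict, sim_threshold=0.8):
    """三路弱监督信号多数投票:True 表示 grounded,False 表示 hallucinated。"""
    votes = [substring_match,            # 信号 1:答案子串是否出现在参考中
             embed_sim >= sim_threshold, # 信号 2:句子嵌入相似度是否达标
             judge_verdict]              # 信号 3:LLM 判决
    return sum(votes) >= 2

print(weak_label(True, 0.85, False))   # 两票 grounded → True
print(weak_label(False, 0.3, False))   # 零票 → False (hallucinated)
```

这些标签仅在训练阶段使用,用于监督各探测器对逐层隐藏状态的分类;推理时探测器不再依赖任何外部验证信号。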
[NLP-104] Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在心理健康聊天机器人应用中产生的幻觉(hallucination)和遗漏(omission)问题,这些问题可能对用户安全造成严重风险。现有基于LLM作为评判者(LLM-as-a-judge)的方法在高风险医疗场景下表现不佳,准确率仅为52%,且部分检测方法召回率接近零。其根本原因在于LLM难以捕捉领域专家所识别的细微语言与治疗模式。为此,作者提出了一种融合人类专业知识与LLM的框架,通过提取五个分析维度(逻辑一致性、实体验证、事实准确性、语言不确定性、专业适当性)的可解释特征,构建了基于传统机器学习模型的检测系统,在自建和公开数据集上分别实现了0.717和0.849的F1分数用于幻觉检测,以及0.59–0.64的F1分数用于遗漏检测。关键创新在于将领域知识结构化为可解释特征,并结合自动化方法,从而在高风险心理服务场景中实现比黑箱LLM评判更可靠、透明的评估。
链接: https://arxiv.org/abs/2604.06216
作者: Khizar Hussain,Bradley A. Malin,Zhijun Yin,Susannah Leigh Rose,Murat Kantarcioglu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs’ inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications. 
[NLP-105] Unsupervised Neural Network for Automated Classification of Surgical Urgency Levels in Medical Transcriptions
【Quick Read】: This paper targets the inefficiency of classifying surgical procedures by urgency, with the goal of optimizing healthcare resource allocation and patient care. The core of the solution is an unsupervised neural-network framework: the domain-specific language model BioClinicalBERT converts surgical transcriptions into high-dimensional semantic embeddings, which are grouped with the Deep Embedding Clustering (DEC) algorithm to form high-quality clusters; the clusters are then clinically validated through expert review (a Modified Delphi Method), and a classifier combining BiLSTM layers with BioClinicalBERT embeddings is built on top. This yields accurate, scalable automatic surgical prioritization without labeled data, improving operational efficiency and clinical decision-making in dynamic medical environments.
Link: https://arxiv.org/abs/2604.06214
Authors: Sadaf Tabatabaee,Sarah S. Lam
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages. Published in Proceedings of the IISE Annual Conference & Expo 2025. DOI: https://doi.org/10.21872/2025IISE_6828
Abstract:Efficient classification of surgical procedures by urgency is paramount to optimize patient care and resource allocation within healthcare systems. This study introduces an unsupervised neural network approach to automatically categorize surgical transcriptions into three urgency levels: immediate, urgent, and elective. Leveraging BioClinicalBERT, a domain-specific language model, surgical transcripts are transformed into high-dimensional embeddings that capture their semantic nuances. These embeddings are subsequently clustered using both K-means and Deep Embedding Clustering (DEC) algorithms, in which DEC demonstrates superior performance in the formation of cohesive and well-separated clusters. To ensure clinical relevance and accuracy, the clustering results undergo validation through the Modified Delphi Method, which involves expert review and refinement. Following validation, a neural network that integrates Bidirectional Long Short-Term Memory (BiLSTM) layers with BioClinicalBERT embeddings is developed for classification tasks. The model is rigorously evaluated using cross-validation and metrics such as accuracy, precision, recall, and F1-score, which achieve robust performance and demonstrate strong generalization capabilities on unseen data. This unsupervised framework not only addresses the challenge of limited labeled data but also provides a scalable and reliable solution for real-time surgical prioritization, which ultimately enhances operational efficiency and patient outcomes in dynamic medical environments.
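The clustering stage can be sketched with a minimal k-means over toy embeddings. This is a simplified stand-in for DEC (which additionally refines the embedding space while clustering) and for BioClinicalBERT vectors, both of which are only simulated here:

```python
import numpy as np

def kmeans(X, centers, iters=20):
    """Minimal k-means over note embeddings -- a simplified stand-in for
    the DEC stage described in the abstract."""
    for _ in range(iters):
        # assign each embedding to its nearest center, then recompute centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        centers = np.stack([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels

# toy "transcript embeddings": three well-separated blobs standing in for
# immediate / urgent / elective notes (real vectors come from BioClinicalBERT)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 0.1, size=(20, 4)) for c in (0.0, 3.0, 6.0)])
labels = kmeans(X, centers=X[[0, 20, 40]].copy())
```

In the paper these cluster assignments are validated by experts before training the downstream BiLSTM classifier.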
[NLP-106] Invisible Influences: Investigating Implicit Intersectional Biases through Persona Engineering in Large Language Models
【Quick Read】: This paper addresses the dynamic amplification of implicit, intersectional bias in large language models (LLMs) under persona-driven scenarios, which traditional static bias audits (CEAT, I-WEAT, I-SEAT) fail to capture. The key contribution is a new scalable metric, the Bias Amplification Differential and Explainability Score (BADx), built from three components: differential bias scores (BAD) based on the static tests, a Persona Sensitivity Index (PSI), and volatility (standard deviation), augmented with LIME-based local explanations for interpretability. BADx systematically tracks how LLM bias shifts under different social-role framings, revealing context-dependent implicit biases that static methods overlook.
Link: https://arxiv.org/abs/2604.06213
Authors: Nandini Arimanda,Achyuth Mukund,Sakthi Balan Muthiah,Rajesh Sharma
Affiliations: Shiv Nadar University Chennai; Plaksha University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, 6 tables; ACM Web Science Conference
Abstract:Large Language Models (LLMs) excel at human-like language generation but often embed and amplify implicit, intersectional biases, especially under persona-driven contexts. Existing bias audits rely on static, embedding-based tests (CEAT, I-WEAT, I-SEAT) that quantify absolute association strengths. We show that they have limitations in capturing dynamic shifts when models adopt social roles. We address this gap by introducing the Bias Amplification Differential and Explainability Score (BADx): a novel, scalable metric that measures persona-induced bias amplification and integrates local explainability insights. BADx comprises three components - differential bias scores (BAD, based on CEAT, I-WEAT, I-SEAT),Persona Sensitivity Index (PSI), and Volatility (Standard Deviation), augmented by LIME-based analysis for emphasizing explainability. This study is divided and performed as two different tasks. Task 1 establishes static bias baselines, and Task 2 applies six persona frames (marginalized and structurally advantaged) to measure BADx, PSI, and volatility. This is studied across five state-of-the-art LLMs (GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet and Gemma-3n E4B). Results show persona context significantly modulates bias. GPT-4o exhibits high sensitivity and volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 maintains low volatility and a stable bias profile with limited amplification; Claude 4.0 Sonnet achieves balanced modulation; and Gemma-3n E4B attains the lowest volatility with moderate amplification. BADx performs better than static methods by revealing context-sensitive biases overlooked in static methods. Our unified method offers a systematic way to detect dynamic implicit intersectional bias in five popular LLMs.
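The three BADx components named above suggest a simple decomposition, which can be sketched as follows. The exact formulas here are one plausible reading of the summary, not the paper's definitions:

```python
import statistics

def badx_components(static_bias, persona_biases):
    """Illustrative decomposition of BADx: per-persona differential bias
    (BAD), a persona sensitivity index (PSI), and volatility. The exact
    formulas are our assumption, not the paper's."""
    bad = {p: b - static_bias for p, b in persona_biases.items()}
    psi = sum(abs(v) for v in bad.values()) / len(bad)   # mean |shift|
    vol = statistics.pstdev(persona_biases.values())     # spread across frames
    return bad, psi, vol

# toy effect sizes on a CEAT-style scale: static baseline vs. persona frames
bad, psi, vol = badx_components(
    0.20, {"frame_a": 0.35, "frame_b": 0.05, "frame_c": 0.20})
```

A model like GPT-4o, described as highly sensitive and volatile, would show large PSI and volatility under this reading.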
[NLP-107] Code Sharing In Prediction Model Research: A Scoping Review
【Quick Read】: This paper addresses the lack of reproducibility in prediction-model research: although generative AI and other machine-learning methods are increasingly used in diagnostic and prognostic models, code-sharing practice in the published literature remains limited and uneven in quality. The key to the solution is a scoping review of 3,967 eligible articles that quantifies current code-sharing practice and, via an LLM-assisted automated pipeline, assesses repositories against 14 reproducibility-related features, identifying gaps such as missing dependency specifications, modular structure, and licensing information. The findings provide an empirical baseline for the TRIPOD-Code extension guideline, arguing for a shift from mere code availability toward explicit expectations on documentation, dependency constraints, and executable structure, thereby improving the transparency and reproducibility of prediction-model research.
Link: https://arxiv.org/abs/2604.06212
Authors: Thomas Sounack,Raffaele Giancotti,Catherine A. Gao,Lasai Barreñada,Hyeonhoon Lee,Hyung-Chul Lee,Leo Anthony Celi,Karel G.M. Moons,Gary S. Collins,Charlotta Lindvall,Tom Pollard
Affiliations: University Medical Center Utrecht; Massachusetts General Hospital; Harvard Medical School; Stanford University; Boston Children’s Hospital; MIT; National Institutes of Health; University of Oxford; University of Birmingham; University of Amsterdam; Korea Health Industry Development Institute; University of Tokyo; University of Melbourne; ETH Zurich; Imperial College London; University of Toronto; University of California, San Francisco; University of Pennsylvania; University of Cambridge; University of Chicago; University of Queensland; Johns Hopkins University; University of Montreal; University of Edinburgh; University of Sydney; University of Copenhagen; University of Helsinki; University of Geneva; University of Zurich; University of Manchester; University of Bristol; University of Glasgow; University of Leeds; University of Newcastle; University of Sheffield; University of Southampton; University of Warwick; University of Reading; University of Exeter; University of St Andrews; University of Strathclyde; University of Aberdeen; University of Dundee; University of Nottingham; University of East Anglia; University of Sussex; University of Kent; University of Essex; University of Lancaster; University of York; University of Bath; University of Hull; University of Leicester; University of Liverpool
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Analytical code is essential for reproducing diagnostic and prognostic prediction model research, yet code availability in the published literature remains limited. While the TRIPOD statements set standards for reporting prediction model methods, they do not define explicit standards for repository structure and documentation. This review quantifies current code-sharing practices to inform the development of TRIPOD-Code, a TRIPOD extension reporting guideline focused on code sharing. We conducted a scoping review of PubMed-indexed articles citing TRIPOD or TRIPOD+AI as of Aug 11, 2025, restricted to studies retrievable via the PubMed Central Open Access API. Eligible studies developed, updated, or validated multivariable prediction models. A large language model-assisted pipeline was developed to screen articles and extract code availability statements and repository links. Repositories were assessed with the same LLM against 14 predefined reproducibility-related features. Our code is made publicly available. Among 3,967 eligible articles, 12.2% included code sharing statements. Code sharing increased over time, reaching 15.8% in 2025, and was higher among TRIPOD+AI-citing studies than TRIPOD-citing studies. Sharing prevalence varied widely by journal and country. Repository assessment showed substantial heterogeneity in reproducibility features: most repositories contained a README file (80.5%), but fewer specified dependencies (37.6%; version-constrained 21.6%) or were modular (42.4%). In prediction model research, code sharing remains relatively uncommon, and when shared, often falls short of being reusable. These findings provide an empirical baseline for the TRIPOD-Code extension and underscore the need for clearer expectations beyond code availability, including documentation, dependency specification, licensing, and executable structure. 
[NLP-108] Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
【Quick Read】: This paper tackles the problem that natural-language explanations from large language models (LLMs) are persuasive but not traceable: users cannot verify whether an explanation's claims rest on reliable evidence. In explainable AI (XAI) terms, this is the demand for faithfulness and traceability, i.e., whether claims are supported by, and can be traced back to, an explicit source. Focusing on retrieval-augmented generation (RAG) for programming education, where textbooks serve as authoritative evidence, the authors quantify faithfulness via source-adherence metrics. They find that non-RAG models have a median source adherence of 0%, while baseline RAG systems reach only 22-40%. Building on Achinstein's illocutionary theory of explanation, they propose illocutionary macro-planning as a design principle and instantiate it with chain-of-illocution (CoI) prompting, which explicitly expands a query into implicit explanatory questions to drive more targeted retrieval. CoI yields significant source-adherence gains (up to 63%) on most models without harming perceived relevance, satisfaction, or correctness, making it the key technique for source-faithful explanations.
Link: https://arxiv.org/abs/2604.06211
Authors: Francesco Sovrano,Alberto Bacchelli
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 24 pages; Accepted for publication at XAI’2026
Abstract:Natural language explanations produced by large language models (LLMs) are often persuasive, but not necessarily scrutable: users cannot easily verify whether the claims in an explanation are supported by evidence. In XAI, this motivates a focus on faithfulness and traceability, i.e., the extent to which an explanation’s claims can be grounded in, and traced back to, an explicit source. We study these desiderata in retrieval-augmented generation (RAG) for programming education, where textbooks provide authoritative evidence. We benchmark six LLMs on 90 Stack Overflow questions grounded in three programming textbooks and quantify source faithfulness via source adherence metrics. We find that non-RAG models have median source adherence of 0%, while baseline RAG systems still exhibit low median adherence (22-40%, depending on the model). Motivated by Achinstein’s illocutionary theory of explanation, we introduce illocutionary macro-planning as a descriptive design principle for source-faithful explanations and instantiate it with chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval. Across models, CoI yields statistically significant gains (up to 63%) in source adherence, although absolute adherence remains moderate and the gains are weak or non-significant for some models. A user study with 165 retained participants (220 recruited) indicates that these gains do not harm satisfaction, relevance, or perceived correctness.
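A source-adherence metric of the kind described can be sketched with a toy lexical-overlap rule. This is illustrative only; the paper does not specify its metrics at this granularity:

```python
def source_adherence(explanation_sents, source_text, min_overlap=0.5):
    """Toy source-adherence score: the fraction of explanation sentences
    whose content words mostly appear in the cited source. The overlap
    rule and threshold are our assumptions, not the paper's metric."""
    src = set(source_text.lower().split())
    def supported(sent):
        toks = [t for t in sent.lower().split() if len(t) > 3]
        return bool(toks) and sum(t in src for t in toks) / len(toks) >= min_overlap
    return sum(supported(s) for s in explanation_sents) / len(explanation_sents)

score = source_adherence(
    ["Python lists support append", "Java arrays have fixed size"],
    "python lists are mutable sequences that support append",
)
```

Real systems would use sentence-level entailment against retrieved textbook passages rather than bag-of-words overlap.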
[NLP-109] Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
【Quick Read】: This paper addresses the safety and user-engagement risks caused by misaligned cultural value orientations as large language models (LLMs) are deployed globally. Existing benchmarks face the Construct-Composition-Context (C^3) challenge: discriminative multiple-choice formats probe value knowledge rather than true orientations, ignore subcultural heterogeneity, and mismatch real-world open-ended generation. The key to the solution is DOVE, a distributional evaluation framework: a rate-distortion variational optimization objective builds a compact value codebook from 10,000 human-written documents, mapping text into a structured value space that filters semantic noise; alignment is then measured with unbalanced optimal transport, capturing intra-cultural distributional structure and sub-group diversity. Across 12 LLMs, DOVE attains high predictive validity (a 31.56% correlation with downstream tasks) and high reliability (stable evaluation with as few as 500 samples per culture).
Link: https://arxiv.org/abs/2604.06210
Authors: Jaehyeok Lee,Xiaoyuan Yi,Jing Yao,Hyunjin Hwang,Roy Ka-Wei Lee,Xing Xie,JinYeong Bak
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ( C^3 ) challenge: relying on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and mismatch with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value-codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and sub-group diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.
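The core comparison of human and model text distributions in value space can be sketched with histograms over codebook entries. Total-variation distance below is a crude stand-in for the unbalanced optimal transport the paper actually uses, and the value codes are invented:

```python
from collections import Counter

def value_histogram(codes):
    """Empirical distribution over value-codebook entries for one corpus."""
    n = len(codes)
    return {k: v / n for k, v in Counter(codes).items()}

def tv_distance(p, q):
    """Total-variation distance between two value-code histograms -- a
    simplified stand-in for unbalanced optimal transport."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# toy value codes assigned to human-written vs. model-generated documents
human = value_histogram(["family"] * 5 + ["duty"] * 3 + ["freedom"] * 2)
model = value_histogram(["family"] * 4 + ["duty"] * 2 + ["freedom"] * 4)
```

Unlike TV distance, unbalanced optimal transport also accounts for semantic proximity between value codes and for mass mismatch between sub-groups.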
[NLP-110] TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents
【Quick Read】: This paper addresses the challenges of integrating large language model (LLM) agents into telecom networks, including intent recognition, tool execution, and resolution generation under varying operational constraints. The core solution is TelcoAgent-Bench and TelcoAgent-Metrics, a telecom-specific benchmarking framework that evaluates LLM agents' semantic understanding in multilingual settings, process-level alignment with standardized troubleshooting flows, and stability across repeated scenario variations via a structured suite of metrics. The key innovation is quantifying the reliability and operational consistency of LLM agents in telecom operations, with support for deployment in both English and Arabic, providing a systematic way to evaluate multilingual agents in real network environments.
Link: https://arxiv.org/abs/2604.06209
Authors: Lina Bariah,Brahim Mefgouda,Farbod Tavakkoli,Enrique Molero,Louis Powell,Merouane Debbah
Affiliations: Kuwait University; AT&T; GSMA
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The integration of large language model (LLM) agents into telecom networks introduces new challenges, related to intent recognition, tool execution, and resolution generation, while taking into consideration different operational constraints. In this paper, we introduce TelcoAgent-Bench and TelcoAgent-Metrics, a Telecom-specific benchmarking framework for evaluating multilingual telecom LLM agents. The proposed framework assesses the semantic understanding as well as process-level alignment with structured troubleshooting flows and stability across repeated scenario variations. Our contribution includes a structured suite of metrics that assess intent recognition, ordered tool execution, resolution correctness, and stability across scenario variations, with the aim of quantifying the reliability and operational consistency of LLM agents in telecom environments. The framework is designed to operate in both English and Arabic, to address the need for multilingual agent deployment in operational network environments. Our experimental results show that although recent instruct-tuned models can understand telecom problems in a reasonable way, they usually struggle to consistently follow the required troubleshooting steps and to maintain stable behavior when exposed to different variations of the same scenario. This performance gap becomes more pronounced in unconstrained and bilingual settings.
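The stability metric described above, agreement across repeated variations of the same scenario, admits a one-line reading. This is our plausible interpretation, not the benchmark's exact definition:

```python
from collections import Counter

def stability(answers):
    """Share of runs agreeing with the modal answer across paraphrased
    variations of one scenario -- one plausible reading of the 'stability'
    metric in the abstract, not its exact definition."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# five paraphrased variants of the same fault scenario, with one deviation
score = stability(["reset_router"] * 4 + ["escalate"])
```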
[NLP-111] Extracting Breast Cancer Phenotypes from Clinical Notes: Comparing LLMs with Classical Ontology Methods
【Quick Read】: This paper addresses the difficulty of extracting and using the large amount of unstructured information in oncology electronic medical record (EMR) provider notes, in particular breast-cancer-related phenotypes described in natural language (e.g., chemotherapy outcomes, biomarkers, tumor location, and growth patterns). The key to the solution is an LLM-based information-extraction framework that automatically parses clinical text and extracts structured phenotype data with accuracy comparable to a classical knowledge-driven annotation system (paired with the NCIt Ontology Annotator). Moreover, the framework transfers well: with light fine-tuning it can be adapted to other cancer types and diseases, substantially improving the flexibility and generalization of clinical text mining.
Link: https://arxiv.org/abs/2604.06208
Authors: Abdullah Bin Faiz,Arbaz Khan Shehzad,Asad Afzal,Momin Tariq,Muhammad Siddiqi,Muhammad Usamah Shahid,Maryam Noor Awan,Muddassar Farooq
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:A significant amount of data held in Oncology Electronic Medical Records (EMRs) is contained in unstructured provider notes – including but not limited to the chemotherapy (or cancer treatment) outcome, different biomarkers, the tumor’s location, sizes, and growth patterns of a patient. The clinical studies show that the majority of oncologists are comfortable providing these valuable insights in their notes in a natural language rather than the relevant structured fields of an EMR. The major contribution of this research is to report an LLM-based framework to process provider notes and extract valuable medical knowledge and phenotype mentioned above, with a focus on the domain of oncology. In this paper, we focus on extracting phenotypes related to breast cancer using our LLM framework, and then compare its performance with earlier works that used knowledge-driven annotation system, paired with the NCIt Ontology Annotator. The results of the study show that an LLM-based information extraction framework can be easily adapted to extract phenotypes with an accuracy that is comparable to the classical ontology-based methods. However, once trained, they could be easily fine-tuned to cater for other cancer types and diseases.
[NLP-112] A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction PRICAI2025
【Quick Read】: This paper studies how demonstration-selection strategies affect next point-of-interest (POI) prediction with large language models (LLMs). The core challenge is choosing effective in-context learning (ICL) examples from historical check-in data so that LLMs can model user trajectory behavior without additional training. The key contribution is a systematic comparison of demonstration-selection methods, including embedding-based and task-specific selection versus simpler heuristics such as geographic proximity, temporal ordering, and sequential patterns. Experiments show the heuristics consistently outperform the more complex embedding-based methods in both prediction accuracy and computational cost, and in some scenarios even surpass fine-tuned models without requiring any further training.
Link: https://arxiv.org/abs/2604.06207
Authors: Ryo Nishida,Masayuki Kawarada,Tatsuya Ishigaki,Hiroya Takamura,Masaki Onishi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to PRICAI 2025
Abstract:This paper investigates demonstration selection strategies for predicting a user’s next point-of-interest (POI) using large language models (LLMs), aiming to accurately forecast a user’s subsequent location based on historical check-in data. While in-context learning (ICL) with LLMs has recently gained attention as a promising alternative to traditional supervised approaches, the effectiveness of ICL significantly depends on the selected demonstration. Although previous studies have examined methods such as random selection, embedding-based selection, and task-specific selection, there remains a lack of comprehensive comparative analysis among these strategies. To bridge this gap and clarify the best practices for real-world applications, we comprehensively evaluate existing demonstration selection methods alongside simpler heuristic approaches such as geographical proximity, temporal ordering, and sequential patterns. Extensive experiments conducted on three real-world datasets indicate that these heuristic methods consistently outperform more complex and computationally demanding embedding-based methods, both in terms of computational cost and prediction accuracy. Notably, in certain scenarios, LLMs using demonstrations selected by these simpler heuristic methods even outperform existing fine-tuned models, without requiring further training. Our source code is available at: this https URL.
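The geographic-proximity heuristic, one of the simple strategies the paper finds competitive, can be sketched directly; the check-in data layout below is our own:

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def select_demos(history, current_loc, k=2):
    """Geographic-proximity demonstration selection: pick the k past
    check-ins closest to the user's current location. Record fields are
    illustrative; selected records become the ICL demonstrations."""
    return sorted(history, key=lambda c: haversine_km(c["loc"], current_loc))[:k]

history = [
    {"poi": "cafe",    "loc": (35.100, 139.100)},
    {"poi": "museum",  "loc": (36.000, 140.000)},
    {"poi": "station", "loc": (35.001, 139.001)},
]
demos = select_demos(history, current_loc=(35.000, 139.000), k=2)
```

Temporal ordering and sequential-pattern heuristics are analogous, sorting by recency or by shared transition patterns instead of distance.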
[NLP-113] The Human Condition as Reflected in Contemporary Large Language Models
【Quick Read】: This paper asks whether large language models (LLMs) can reveal latent structure in evolved human culture, i.e., whether consistent cultural patterns can be identified across massive text corpora. The key to the approach is comparing parallel responses from six leading generative models to the same prompt about what their training corpora reveal about human culture and behavior. The models converge on six recurring cultural themes: narrative meaning-making, affect-first cognition, coalition psychology, status competition, threat sensitivity, and moral rationalization. This cross-model consensus suggests LLM outputs are not arbitrary: the models function as cultural condensates, compressed representations of how humans describe, justify, and contest their social lives, with inter-model differences reflecting varying explanatory lenses rather than substantive disagreement, opening a new path for research across psychology, sociology, and computational linguistics.
Link: https://arxiv.org/abs/2604.06206
Authors: W. Russell Neuman
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This study seeks to uncover evidence of a latent structure in evolved human culture as it is refracted through contemporary large language models (LLMs). Drawing on parallel responses from six leading generative models to a prompt which asks directly what their training corpora reveal about human culture and behavior, we identify a robust cross-model consensus on a limited set of recurring cultural themes. The themes include narrative meaning-making, affect-first cognition, coalition psychology, status competition, threat sensitivity, and moral rationalization. Each provides grounds for further psychological and sociological inquiry. There is strong evidence of a convergence in these pattern recognition exercises as differences among models are shown to reflect varying explanatory lenses rather than substantive disagreement. We review these findings in the light of the evolving literatures of moral psychology, evolutionary psychology, anthropology, and the computer science literature on large-scale language modeling. We argue that LLMs function as cultural condensates – compressed representations of how humans describe, justify, and contest their own social lives across trillions of tokens of aggregated communication and narration.
[NLP-114] Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation
【Quick Read】: This paper addresses the scalability problem of content-moderation systems on online platforms: large language models (LLMs) handle complex multimodal inputs well but are too costly and slow for large-scale deployment. The key to the solution is Tool-MCoT, a small language model (SLM) fine-tuned on tool-augmented chain-of-thought data generated by an LLM, so that the SLM learns to use external tools effectively to improve its reasoning and decision-making. Experiments show the approach substantially improves moderation performance and that the model learns to invoke tools selectively, balancing moderation accuracy against inference efficiency.
Link: https://arxiv.org/abs/2604.06205
Authors: Shutong Zhang,Dylan Zhou,Yinxiao Liu,Yang Yang,Huiwen Luo,Wenfei Zou
Affiliations: Stanford University; Google; Google DeepMind
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The growth of online platforms and user content requires strong content moderation systems that can handle complex inputs from various media types. While large language models (LLMs) are effective, their high computational cost and latency present significant challenges for scalable deployment. To address this, we introduce Tool-MCoT, a small language model (SLM) fine-tuned for content safety moderation leveraging external framework. By training our model on tool-augmented chain-of-thought data generated by LLM, we demonstrate that the SLM can learn to effectively utilize these tools to improve its reasoning and decision-making. Our experiments show that the fine-tuned SLM achieves significant performance gains. Furthermore, we show that the model can learn to use these tools selectively, achieving a balance between moderation accuracy and inference efficiency by calling tools only when necessary.
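The "call tools only when necessary" behavior can be sketched as confidence-gated tool use; `classify` and `tool` below are illustrative stand-ins for the fine-tuned SLM and the external framework, not the paper's actual interface:

```python
def moderate(text, classify, tool, threshold=0.8):
    """Selective tool use: consult the external tool only when the SLM's
    own confidence is below a threshold (all names are illustrative)."""
    label, conf = classify(text)
    if conf >= threshold:
        return label, False   # confident: answer without a tool call
    return tool(text), True   # uncertain: fall back to the external tool

# toy stubs: one confident benign case, one low-confidence case
classify = lambda t: ("safe", 0.95) if "weather" in t else ("unsafe", 0.4)
tool = lambda t: "needs_review"
```

In Tool-MCoT this gating is learned from the tool-augmented chain-of-thought traces rather than hard-coded as a threshold.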
[NLP-115] Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models
【Quick Read】: This paper addresses the uneven performance of multilingual large language models (MLLMs) on low-resource languages, focusing on the Turkic family (Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz): these languages share strong morphological and syntactic similarity and have large speaker populations, yet are digitally under-resourced and severely underrepresented in both training data and evaluation benchmarks. The key to the solution is a theoretical framework combining multilingual representation learning with parameter-efficient fine-tuning techniques (such as Low-Rank Adaptation, LoRA): a conceptual scaling model describes adaptation performance, and a proposed Turkic Transfer Coefficient (TTC) quantifies how morphological similarity, lexical overlap, syntactic structure, and script compatibility shape cross-lingual transfer potential. The framework shows how typological similarity enables efficient multilingual transfer while also exposing structural limits of parameter-efficient adaptation in extremely low-resource settings.
Link: https://arxiv.org/abs/2604.06202
Authors: O. Ibrahimzade,K. Tabasaransky
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, no figures, 1 table
Abstract:Large language models (LLMs) have transformed natural language processing, yet their capabilities remain uneven across languages. Most multilingual models are trained primarily on high-resource languages, leaving many languages with large speaker populations underrepresented in both training data and evaluation benchmarks. This imbalance is particularly visible in the Turkic language family. This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation of multilingual LLMs within the Turkic language family, focusing on Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz. These languages share substantial typological and morphological similarity while differing greatly in available digital resources, making them a natural setting for analyzing multilingual adaptation strategies. We integrate insights from multilingual representation learning and parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) to develop a conceptual scaling model describing how adaptation performance depends on model capacity, adaptation data size, and the expressivity of adaptation modules. To formalize transfer potential between related languages, we introduce the Turkic Transfer Coefficient (TTC), a theoretical measure incorporating morphological similarity, lexical overlap, syntactic structure, and script compatibility across Turkic languages. The framework highlights how typological similarity can enable efficient multilingual transfer while also identifying structural limits of parameter-efficient adaptation in extremely low-resource scenarios.
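Since the TTC incorporates four similarity components, one hypothetical instantiation is a weighted sum; the paper defines the measure conceptually, so the functional form and weights below are placeholders of ours:

```python
def turkic_transfer_coefficient(morph, lexical, syntax, script,
                                weights=(0.3, 0.3, 0.2, 0.2)):
    """Hypothetical weighted-sum form of the TTC over the four components
    named in the paper (each scored in [0, 1]); weights are placeholders."""
    return sum(w * x for w, x in zip(weights, (morph, lexical, syntax, script)))

# e.g. two closely related Turkic languages sharing a Latin script
ttc = turkic_transfer_coefficient(morph=0.9, lexical=0.8, syntax=0.9, script=1.0)
```

A high TTC would predict that LoRA adapters trained on the better-resourced language transfer well to the other.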
[NLP-116] Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models
【Quick Read】: This paper addresses the fact that current reading-comprehension benchmarks for large language models (LLMs) focus on localizable factual information while neglecting reasoning over distributional knowledge such as population-level trends and preferences. The key to the solution is Text2DistBench, a reading-comprehension benchmark built from real YouTube comments: given entity metadata and associated comments, models must answer distributional questions such as sentiment proportions and topic frequencies. Its fully automated construction pipeline supports continuous updates and reliable long-term evaluation, so that models are tested not just on individual texts but on extracting and inferring population-level knowledge from large text collections, filling a gap in evaluating LLMs' distributional understanding.
Link: https://arxiv.org/abs/2604.06201
Authors: Pei-Fu Guo,Ya-An Tsai,Chun-Chia Hsu,Kai-Xin Chen,Yun-Da Tsai,Kai-Wei Chang,Nanyun Peng,Mi-Yen Yeh,Shou-De Lin
Affiliations: National Taiwan University; University of California, Los Angeles; Academia Sinica, Taiwan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:While most reading comprehension benchmarks for LLMs focus on factual information that can be answered by localizing specific textual evidence, many real-world tasks require understanding distributional information, such as population-level trends and preferences expressed across collections of text. We introduce Text2DistBench, a reading comprehension benchmark for evaluating LLMs’ ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and associated comments, and requires them to answer distributional questions, such as estimating the proportions of positive and negative comments, or identifying the most and second most frequent topics discussed among viewers. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities over time. Experiments across multiple LLMs show that while models substantially outperform random baselines, performance varies widely across different distribution types and characteristics. These findings highlight both the capabilities and limitations of current LLMs in distributional reading comprehension and demonstrate the value of Text2DistBench as a practical and scalable testbed for future research.
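A proportion-estimation question of the kind the benchmark poses can be graded by L1 error against the gold distribution derived from per-comment labels; this scoring rule is our own simple choice in the spirit of the benchmark, not its published metric:

```python
from collections import Counter

def grade_distribution_answer(comment_labels, predicted_shares):
    """Score a model's answer to a 'what share of comments are positive /
    negative?' question by L1 error against the gold proportions."""
    n = len(comment_labels)
    gold = {k: v / n for k, v in Counter(comment_labels).items()}
    keys = set(gold) | set(predicted_shares)
    return sum(abs(gold.get(k, 0.0) - predicted_shares.get(k, 0.0)) for k in keys)

# 6 of 10 comments are positive; the model answered a 50/50 split
err = grade_distribution_answer(
    ["positive"] * 6 + ["negative"] * 4,
    {"positive": 0.5, "negative": 0.5},
)
```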
[NLP-117] Temporally Phenotyping GLP-1RA Case Reports with Large Language Models: A Textual Time Series Corpus and Risk Modeling
【Quick Read】: This paper addresses the problem that clinical-event timelines in diabetes single-patient case reports are expressed in language too vague for longitudinal modeling. The key to the solution is a textual time-series corpus of 136 PubMed Open Access single-patient case reports in which clinical events are associated with their most probable reference times; LLM-extracted timelines are then evaluated against gold-standard timelines annotated by clinical domain experts. The best-performing LLM (GPT5) achieves high event coverage (0.871) and accurate temporal sequencing (0.843), providing a reliable data foundation for downstream time-to-event analyses in diabetes.
Link: https://arxiv.org/abs/2604.06197
Authors: Sayantan Kumar,Jeremy C. Weiss
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AMIA Annual Symposium
Abstract:Type 2 diabetes case reports describe complex clinical courses, but their timelines are often expressed in language that is difficult to reuse in longitudinal modeling. To address this gap, we developed a textual time-series corpus of 136 PubMed Open Access single-patient case reports involving glucagon-like peptide 1 receptor agonists, with clinical events associated with their most probable reference times. We evaluated automated LLM timeline extraction against gold-standard timelines annotated by clinical domain experts, assessing how well systems recovered clinical events and their timings. The best-performing LLM produced high event coverage (GPT5; 0.871) and reliable temporal sequencing across symptoms (GPT5; 0.843), diagnoses, treatments, laboratory tests, and outcomes. As a downstream demonstration, time-to-event analyses in diabetes suggested lower risk of respiratory sequelae among GLP-1 users versus non-users (HR=0.259, p0.05), consistent with prior reports of improved respiratory outcomes. Temporal annotations and code will be released upon acceptance.
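The two evaluation axes, event coverage and temporal sequencing, can be sketched as follows. The pairwise ordering-concordance rule is our simple reading of "reliable temporal sequencing," not necessarily the paper's exact metric, and the event names are invented:

```python
def event_coverage(gold_events, pred_events):
    """Fraction of gold clinical events recovered by the extracted timeline."""
    gold = set(gold_events)
    return len(gold & set(pred_events)) / len(gold)

def ordering_concordance(gold_order, pred_order):
    """Fraction of gold event pairs whose relative order the extracted
    timeline preserves (only pairs present in the prediction count)."""
    pos = {e: i for i, e in enumerate(pred_order)}
    pairs = [(a, b) for i, a in enumerate(gold_order)
             for b in gold_order[i + 1:] if a in pos and b in pos]
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs) if pairs else 0.0

gold = ["dyspnea", "T2D diagnosis", "GLP-1RA start", "symptom remission"]
pred = ["dyspnea", "GLP-1RA start", "T2D diagnosis", "symptom remission"]
```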
[NLP-118] Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering
【速读】: 该论文旨在解决三路逻辑问答(3-way logical question answering)中大型语言模型(LLMs)的两个关键失败模式:一是否定不一致性(negation inconsistency),即模型对命题 $ H $ 和其否定 $ \neg H $ 的回答违反确定性标签映射;二是认知型未知(epistemic Unknown),即模型因不确定性或不稳定而错误地将本可明确判断的命题标记为“Unknown”。解决方案的核心是提出一种轻量级的测试时层 CGD-PD,其关键机制包括:(a) 对 $ H $ 及其机械否定形式均调用单个三路分类器,(b) 在可能情况下将结果投影到否定一致的决策空间,(c) 通过针对性二元蕴含探测(binary entailment probes)执行基于证明的消歧步骤,从而选择性地解决“Unknown”预测,平均仅需4–5次模型调用。此方法在FOLIO基准的一阶逻辑领域显著提升准确率(最高相对提升16%),同时有效减少“Unknown”预测。
链接: https://arxiv.org/abs/2604.06196
作者: Tianyi Huang,Ming Hou,Jiaheng Su,Yutong Zhang,Ziling Zhang
机构: App-In Club
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Three-way logical question answering (QA) assigns True/False/Unknown to a hypothesis H given a premise set S. While modern large language models (LLMs) can be accurate on isolated examples, we identify two recurring failure modes in 3-way logic QA: (i) negation inconsistency, where answers to H and ¬H violate the deterministic label mapping, and (ii) epistemic Unknown, where the model predicts Unknown due to uncertainty or instability even when S entails one side. We present CGD-PD, a lightweight test-time layer that (a) queries a single 3-way classifier on both H and a mechanically negated form of H, (b) projects the pair onto a negation-consistent decision when possible, and (c) invokes a proof-driven disambiguation step that uses targeted binary entailment probes to selectively resolve Unknown outcomes, requiring only an average of 4-5 model calls. On the FOLIO benchmark’s first-order-logic fields, CGD-PD yields consistent gains across frontier LLMs, with relative improvements in accuracy of up to 16% over the base model, while also reducing Unknown predictions.
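按照摘要描述,CGD-PD 的核心步骤是对 H 与 ¬H 的两次三路分类结果做否定一致性投影。下面给出一个最小 Python 草图(标签映射为摘要所述的确定性映射,但冲突时的处理规则为笔者假设,论文未给出具体定义):

```python
# CGD-PD 否定一致性投影的假设性草图(非论文原实现)。
# 确定性标签映射:H 为 True 则 ¬H 应为 False,反之亦然;Unknown 映射到自身。
NEG = {"True": "False", "False": "True", "Unknown": "Unknown"}

def project(label_h: str, label_not_h: str) -> str:
    """将 (H, ¬H) 的两个三路标签投影为关于 H 的否定一致标签。"""
    if NEG[label_not_h] == label_h:
        return label_h                # 两侧已经一致,直接返回
    if label_h == "Unknown" and label_not_h != "Unknown":
        return NEG[label_not_h]       # 用 ¬H 侧的确定答案消解 Unknown
    if label_not_h == "Unknown" and label_h != "Unknown":
        return label_h
    return "Unknown"                  # 两侧给出矛盾的确定答案,留给后续消歧
```

摘要中第 (c) 步的证明驱动消歧,可理解为当 project 仍返回 Unknown 时,再发起若干二元蕴含探测来裁决。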
[NLP-119] Hallucination as output-boundary misclassification: a composite abstention architecture for language models ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成 unsupported claims(无依据陈述,即幻觉)的问题。作者将这一问题视为输出边界上的误分类错误,即模型内部生成的内容被错误地当作有证据支持的输出。解决方案的关键在于提出一种组合干预机制:一方面通过指令驱动的拒绝策略(instruction-based refusal)降低幻觉率,另一方面引入结构化的回避门控机制(structural abstention gate),该门控基于三个黑盒信号——自一致性(At)、改写稳定性(Pt)和引用覆盖率(Ct)计算支持缺陷分数(St),并在St超过阈值时阻断输出。实验表明,单一机制均无法全面解决问题,而二者结合实现了高准确率与低幻觉的平衡,且门控机制提供了与模型能力无关的最低拒绝水平,验证了两种机制互补的失效模式。
链接: https://arxiv.org/abs/2604.06195
作者: Angelina Hintsanen
机构: NEXUS Laboratory (NEXUS 实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Theoretical manuscript extending an earlier proof-of-concept workshop paper accepted to the ICLR 2026 Workshop on LLM Reasoning; 13 pages, 3 tables
Abstract:Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support deficit score, St, from three black-box signals: self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct), and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed confident confabulation on conflicting-evidence items. The composite architecture achieved high overall accuracy with low hallucination, while also inheriting some over-abstention from the instruction component. A supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor. Overall, instruction-based refusal and structural gating show complementary failure modes, which suggests that effective hallucination control benefits from combining both mechanisms.
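摘要说明门控依据三个黑盒信号 At、Pt、Ct 计算支持缺陷分数 St,并在 St 超阈值时阻断输出;但三个信号的具体组合方式摘要未给出。下面用一个线性加权的 Python 草图示意(权重与阈值均为笔者假设):

```python
def support_deficit(a_t: float, p_t: float, c_t: float,
                    weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """假设性的 St:三个 [0,1] 支持信号的加权“缺口”之和,越大表示越缺乏支持。"""
    return sum(w * (1.0 - s) for w, s in zip(weights, (a_t, p_t, c_t)))

def gated_output(answer: str, s_t: float, threshold: float = 0.5) -> str:
    """结构化回避门控:St 超过阈值时放弃作答。"""
    return answer if s_t <= threshold else "[abstain]"
```

三个信号全为 1(完全自洽、改写稳定、引用齐全)时 St 为 0,输出放行;全为 0 时 St 为 1,被门控拦截。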
[NLP-120] Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters
【速读】: 该论文旨在解决初级保健中抑郁症筛查不足的问题,通过利用数字记录技术获取的自然对话数据实现自动化抑郁检测。其关键解决方案是基于1,108段录音的初级保健会话,采用多种监督学习与零样本大语言模型(如GPT-OSS)进行比较分析,发现从患者和医生双人对话文本中提取语言特征可显著提升检测性能,尤其是当两者语言模式存在镜像关系时,这种协同信号在单一方单独建模时无法捕捉;此外,仅需前128个患者token即可实现有意义的检测,表明该方法具备实时临床决策支持潜力。
链接: https://arxiv.org/abs/2604.06193
作者: Feng Chen,Manas Bedmutha,Janice Sabin,Andrea Hartzler,Nadir Weibel,Trevor Cohen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Depression is underdiagnosed in primary care, yet timely identification remains critical. Recorded clinical encounters, increasingly common with digital scribing technologies, present an opportunity to detect depression from naturalistic dialogue. We investigated automated depression detection from 1,108 audio-recorded primary care encounters in the Establishing Focus study, with depression defined by PHQ-9 (n=253 depressed, n=855 non-depressed). We compared three supervised approaches, Sentence-BERT + Logistic Regression (LR), LIWC+LR and ModernBERT, against a zero-shot GPT-OSS. GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774), with LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742). Combined dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters, an additive signal not captured by either speaker alone. Meaningful detection is achievable from the first 128 patient tokens (AUPRC=0.356, AUROC=0.675), supporting in-the-moment clinical decision support. These findings argue for passively collected clinical audio as a low-burden complement to existing screening workflows.
[NLP-121] The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
【速读】: 该论文试图解决的问题是:为何大型语言模型(Large Language Models, LLMs)内部的条件熵动态变化(defined under the predictive distribution of a model)能够稳健地与外部正确性(given by the ground-truth answer)相关联,这一现象在现有研究中仍属经验性观察,缺乏理论解释。解决方案的关键在于提出逐步信息性假设(Stepwise Informativeness Assumption, SIA),即推理前缀在生成过程中期望地累积与答案相关的信息。作者证明SIA可自然从人类推理轨迹的最大似然优化中产生,并在标准微调和强化学习流程中得到加强;进一步推导出SIA对应的可观测特征——条件答案熵动态与正确性的关联模式,并通过多个基准测试(GSM8K、ARC、SVAMP)和多种开源大模型(Gemma-2、LLaMA-3.2、Qwen-2.5、DeepSeek 和 Olmo 变体)验证了训练诱导出SIA,且正确推理路径表现出特定的条件熵演化模式。
链接: https://arxiv.org/abs/2604.06192
作者: Mar Gonzàlez I Català,Haitz Sáez de Ocáriz Borde,George D. Montañez,Pietro Liò
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 21 pages, 5 figures, 3 tables
Abstract:Recent work uses entropy-based signals at multiple representation levels to study reasoning in large language models, but the field remains largely empirical. A central unresolved puzzle is why internal entropy dynamics, defined under the predictive distribution of a model, correlate so robustly with external correctness given by the ground-truth answer. In this paper, we argue that this correlation arises because autoregressive models reason correctly when they accumulate information about the true answer via answer-informative prefixes. We formalize this intuition via the Stepwise Informativeness Assumption (SIA), which states that reasoning prefixes accumulate answer-relevant information in expectation as generation progresses. We show that SIA naturally emerges from maximum-likelihood optimization on human reasoning traces and is reinforced by standard fine-tuning and reinforcement-learning pipelines. We then derive observable signatures of SIA linking conditional answer entropy dynamics to correctness. We empirically test SIA across multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and a diverse set of open-weight LLMs (Gemma-2, LLaMA-3.2, Qwen-2.5, DeepSeek and Olmo variants), showing that training induces it and that correct traces exhibit characteristic conditional answer entropy patterns.
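SIA 的一个可观测特征是:随着推理前缀变长,答案的条件熵在期望上下降。下面用 Python 在一组虚构的答案分布上计算香农熵来示意这一模式(分布数值纯属示例):

```python
import math

def entropy(probs):
    """类别分布的香农熵(单位:nat)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

# 虚构的三个生成时刻下“答案 | 前缀”的条件分布:前缀越长,分布越集中
step_dists = [
    [0.4, 0.35, 0.25],   # 早期前缀:较为弥散
    [0.7, 0.2, 0.1],     # 中期前缀
    [0.95, 0.04, 0.01],  # 后期前缀:接近确定
]
entropies = [entropy(d) for d in step_dists]
```

按 SIA,正确的推理轨迹应呈现 entropies 这样单调下降的条件熵曲线;错误轨迹则往往缺少这种收敛模式。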
[NLP-122] LLM-Augmented Knowledge Base Construction For Root Cause Analysis
【速读】: 该论文旨在解决通信网络中难以实现“五九”(99.999%)高可靠性的问题,核心挑战在于故障发生时如何实现快速且准确的根因分析(Root Cause Analysis, RCA),以缩短服务恢复时间并预防未来中断。解决方案的关键在于利用三种大语言模型(Large Language Model, LLM)方法——微调(Fine-Tuning)、检索增强生成(Retrieval-Augmented Generation, RAG)以及混合方法——从技术支持工单中构建结构化的RCA知识库,并通过词汇和语义相似度指标进行系统评估。实验表明,所生成的知识库能显著加速RCA任务,提升网络韧性。
链接: https://arxiv.org/abs/2604.06171
作者: Nguyen Phuc Tran,Brigitte Jaumard,Oscar Delgado,Tristan Glatard,Karthikeyan Premkumar,Kun Ni
机构: Concordia University (康考迪亚大学); École de Technologie Supérieure (高等技术学院); GAIA-Ericsson Montréal (GAIA-爱立信蒙特利尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has been accepted for publication in IEEE Access. The final published version will be available via IEEE Xplore
Abstract:Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee “five 9s” (99.999 %) reliability, requiring rapid and accurate root cause analysis (RCA) during outages. In the event of an outage, rapid and accurate RCA becomes essential to restore service and prevent future disruptions. This study evaluates three Large Language Model (LLM) methodologies - Fine-Tuning, RAG, and a Hybrid approach - for constructing a Root Cause Analysis (RCA) Knowledge Base from support tickets. We compare their performance using a comprehensive suite of lexical and semantic similarity metrics. Our experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience.
[NLP-123] Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
【速读】: 该论文旨在解决阿拉伯语发音评估自动化工具稀缺的问题,特别是在语音治疗和语言学习场景下缺乏可靠的音位级(phoneme-level)评分系统。解决方案的关键在于构建一个模块化系统Harf-Speech,其核心包括:标准阿拉伯语(Modern Standard Arabic, MSA)音位转换器、微调的语音到音位模型、Levenshtein对齐算法,以及融合最长公共子序列(LCS)与编辑距离(edit-distance)的混合评分机制。该系统在阿拉伯语音位数据上微调了三种自动语音识别(ASR)架构,并通过临床专家评分验证,其中最佳模型OmniASR-CTC-1B-v2达到8.92%的音位错误率,且与三位认证言语语言病理学家的评分具有高相关性(Pearson r=0.791,ICC(2,1)=0.659),显著优于现有端到端评估框架,实现了临床可解释且符合专家共识的自动化发音评分。
链接: https://arxiv.org/abs/2604.06191
作者: Asif Azad,MD Sadik Hossain Shanto,Mohammad Sadat Hossain,Bdour Alwuqaysi,Sabri Boughorbel,Yahya Bokhari,Abdulrhman Aljouie,Ayah Othman Sindi,Ehsan Hoque
机构: Ministry of Defense, Riyadh, Saudi Arabia; Ability Center, Saudi Arabia; University of Rochester, USA
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Automated phoneme-level pronunciation assessment is vital for scalable speech therapy and language learning, yet validated tools for Arabic remain scarce. We present Harf-Speech, a modular system scoring Arabic pronunciation at the phoneme level on a clinical scale. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer using longest common subsequence and edit-distance metrics. We fine-tune three ASR architectures on Arabic phoneme data and benchmark them with zero-shot multimodal models; the best, OmniASR-CTC-1B-v2, achieves 8.92% phoneme error rate. Three certified speech-language pathologists independently scored 40 utterances for clinical validation. Harf-Speech attains a Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks. These results show Harf-Speech yields clinically aligned, interpretable scores comparable to inter-rater expert agreement.
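Harf-Speech 的混合评分器融合最长公共子序列(LCS)与编辑距离两类指标。下面在音位序列上给出一个最小 Python 草图(归一化方式与混合权重 alpha 为笔者假设,示例音位序列亦为虚构):

```python
def lcs_len(a, b):
    # 经典 O(|a||b|) 动态规划求最长公共子序列长度
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def levenshtein(a, b):
    # 滚动数组版 Levenshtein 编辑距离
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def blended_score(ref, hyp, alpha=0.5):
    """假设性混合打分:LCS 相似度与归一化编辑相似度的加权和,取值 [0, 1]。"""
    n = max(len(ref), len(hyp), 1)
    return alpha * lcs_len(ref, hyp) / n + (1 - alpha) * (1.0 - levenshtein(ref, hyp) / n)

ref = ["b", "a", "l", "a", "d"]   # 虚构的目标音位序列
hyp = ["b", "a", "l", "d"]        # 发音中漏掉一个音位
```

完全一致时得分为 1;上例漏读一个音位,两项相似度均为 0.8,混合得分 0.8,可再映射到临床量表。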
信息检索
[IR-0] HIVE: Query Hypothesize Verify An LLM Framework for Multimodal Reasoning -Intensive Retrieval CVPR2026
【速读】:该论文旨在解决多模态检索模型在需要深度图文推理的查询任务中表现不佳的问题,尤其是在图像(如图表、截图)与文本必须深度融合才能识别相关文档的场景下,现有最佳多模态模型在MM-BRIGHT数据集上的nDCG@10仅为27.6,显著低于强文本检索器(32.2)。其解决方案的关键在于提出HIVE(Hypothesis-driven Iterative Visual Evidence Retrieval)框架,该框架通过大语言模型(LLM)显式注入视觉-文本推理机制:首先进行初始检索,随后利用LLM生成补偿性查询以明确识别Top-k候选中的视觉与逻辑缺失,再基于优化后的查询执行二次检索,并最终通过LLM对候选集合进行验证与重排序。该方法实现了41.7的nDCG@10新SOTA成绩,较最优文本模型提升9.5点,且在高视觉复杂度领域(如游戏、化学、可持续发展)表现尤为突出,证明了LLM驱动的视觉假设生成与验证能有效缩小多模态检索中的推理差距。
链接: https://arxiv.org/abs/2604.07220
作者: Mahmoud Abdalla,Mahmoud SalahEldin Kasem,Mohamed Mahmoud,Mostafa Farouk Senussi,Abdelrahman Abdallah,Hyun-Soo Kang
机构: Chungbuk National University (忠北国立大学); University of Innsbruck (因斯布鲁克大学)
类目: Information Retrieval (cs.IR)
备注: accepted at CVPR 2026 Workshop GRAIL-V
Abstract:Multimodal retrieval models fail on reasoning-intensive queries where images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents – the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce HIVE (Hypothesis-driven Iterative Visual Evidence Retrieval), a plug-and-play framework that injects explicit visual-text reasoning into a retriever via LLMs. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates visual and logical gaps observed in top-k candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803 real-world queries across 29 technical domains), HIVE achieves a new state-of-the-art aggregated nDCG@10 of 41.7 – a +9.5 point gain over the best text-only model (DiVeR: 32.2) and +14.1 over the best multimodal model (Nomic-Vision: 27.6), where our reasoning-enhanced base retriever contributes 33.2 and the HIVE framework adds a further +8.5 points – with particularly strong results in visually demanding domains (Gaming: 68.2, Chemistry: 42.5, Sustainability: 49.4). Compatible with both standard and reasoning-enhanced retrievers, HIVE demonstrates that LLM-mediated visual hypothesis generation and verification can substantially close the multimodal reasoning gap in retrieval. this https URL
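HIVE 的四个阶段可以写成一个与具体模型解耦的 Python 流程草图(retrieve、synthesize_query、rerank 均为调用方注入的假设性回调,并非论文实现):

```python
def hive_search(query, retrieve, synthesize_query, rerank, top_k=5):
    """HIVE 四阶段流程草图:检索 → 补偿查询合成 → 二次检索 → 验证重排。"""
    first = retrieve(query, top_k)               # (1) 初次检索
    refined = synthesize_query(query, first)     # (2) LLM 依据 top-k 的视觉/逻辑缺口合成补偿查询
    second = retrieve(refined, top_k)            # (3) 用精化查询二次检索
    union = list(dict.fromkeys(first + second))  # 候选取并集,去重且保序
    return rerank(query, union)                  # (4) LLM 验证并重排
```

实际系统中 retrieve 对应稠密检索器,synthesize_query 与 rerank 对应两次 LLM 调用。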
[IR-1] BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment CVPR2026
【速读】:该论文旨在解决多模态检索系统在面对图像-文本查询时,难以有效匹配纯文本语料库的问题。现有基于视觉-语言编码器的模型在MM-BRIGHT数据集上仅达到27.6 nDCG@10,显著低于强文本检索器的表现。作者指出瓶颈并非来自检索器本身,而是原始多模态查询中混杂了视觉描述、对话噪声和检索意图,导致嵌入相似度下降。解决方案的核心是提出BRIDGE系统,由两个组件构成:FORGE(Focused Retrieval Query Generator)通过强化学习训练,将嘈杂的多模态查询提炼为紧凑且适配检索的文本查询;LENS(Language-Enhanced Neural Search)则是在推理密集型数据上微调的稠密检索器,用于处理FORGE生成的高语义密度查询。实验表明,BRIDGE在MM-BRIGHT上达到29.7 nDCG@10,优于所有多模态编码器基线;当FORGE作为插件模块应用于Nomic-Vision时,系统性能进一步提升至33.3 nDCG@10,超越最佳文本检索器(32.2),验证了“查询对齐”才是多模态到纯文本检索的关键瓶颈。
链接: https://arxiv.org/abs/2604.07201
作者: Mohamed Darwish Mounis,Mohamed Mahmoud,Shaimaa Sedek,Mahmoud Abdalla,Mahmoud SalahEldin Kasem,Abdelrahman Abdallah,Hyun-Soo Kang
机构: Chungbuk National University (忠北国立大学); Assiut University (艾斯尤特大学); University of Innsbruck (因斯布鲁克大学); High institute for computer information systems (计算机信息系统高级研究所)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshop GRAIL-V
Abstract:Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query – raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10 – exceeding the best text-only retriever (32.2) – demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. this https URL
[IR-2] Leveraging Artist Catalogs for Cold-Start Music Recommendation
【速读】:该论文旨在解决音乐推荐中的“物品冷启动”(item cold-start)问题,即新加入的歌曲因缺乏用户交互历史而难以被协同过滤(Collaborative Filtering, CF)模型有效推荐。传统方法通常通过将音频、文本和元数据等内容特征映射到CF潜在空间来缓解此问题,但往往忽略艺术家(artist)层级的信息结构。本文的关键创新在于提出“半冷启动”(semi-cold)建模框架,利用艺术家层面已有的丰富协同信号来增强对新歌曲的推荐能力;其核心解决方案是设计ACARec——一种基于注意力机制的架构,通过在艺术家已有曲目上进行注意力加权,生成新歌曲的CF嵌入表示,从而显著提升对新歌偏好预测的准确性,尤其在新艺术家发现和冷门歌曲流行度估计方面表现优异。
链接: https://arxiv.org/abs/2604.07090
作者: Yan-Martin Tamm,Gregor Meehan,Vojtěch Nekl,Vojtěch Vančura,Rodrigo Alves,Johan Pauwels,Anna Aljanaki
机构: University of Tartu(塔尔图大学); Queen Mary University of London(伦敦玛丽女王大学); Czech Technical University in Prague(布拉格捷克技术大学); Recombee(推荐引擎公司)
类目: Information Retrieval (cs.IR)
备注: Accepted at UMAP 2026
Abstract:The item cold-start problem poses a fundamental challenge for music recommendation: newly added tracks lack the interaction history that collaborative filtering (CF) requires. Existing approaches often address this problem by learning mappings from content features such as audio, text, and metadata to the CF latent space. However, previous works either omit artist information or treat it as just another input modality, missing the fundamental hierarchy of artists and items. Since most new tracks come from artists with previous history available, we frame cold-start track recommendation as ‘semi-cold’ by leveraging the rich collaborative signal that exists at the artist level. We show that artist-aware methods can more than double Recall and NDCG compared to content-only baselines, and propose ACARec, an attention-based architecture that generates CF embeddings for new tracks by attending over the artist’s existing catalog. We show that our approach has notable advantages in predicting user preferences for new tracks, especially for new artist discovery and more accurate estimation of cold item popularity.
[IR-3] MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
【速读】:该论文旨在解决多模态文本语料库中检索任务的挑战,特别是针对需要复杂推理能力的多模态检索场景(如MM-BRIGHT基准测试),现有方法在性能上显著落后于纯文本系统。其核心问题在于现有方法仅孤立地处理查询扩展、检索建模或重排序中的单一环节,缺乏三者间的协同优化。解决方案的关键在于提出一个统一的“expand-retrieve-rerank”框架——MARVEL,包含三个核心组件:基于大语言模型(LLM)的查询意图扩展、专为复杂多模态查询微调的推理增强密集检索器(MARVEL-Retriever),以及基于GPT-4o的链式思维(Chain-of-Thought)重排序机制,辅以可选的多轮互评排名融合策略,从而显著提升多模态检索的准确性和鲁棒性。
链接: https://arxiv.org/abs/2604.07079
作者: Mahmoud SalahEldin Kasem,Mohamed Mahmoud,Mostafa Farouk Senussi,Mahmoud Abdalla,Abdelrahman Abdallah,Hyun-Soo Kang
机构: Chungbuk National University (忠北国立大学); University of Innsbruck (因斯布鲁克大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Multimodal retrieval over text corpora remains a fundamental challenge: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, a reasoning-intensive multimodal retrieval benchmark, underperforming strong text-only systems. We argue that effective multimodal retrieval requires three tightly integrated capabilities that existing approaches address only in isolation: expanding the query’s latent intent, retrieving with a model trained for complex reasoning, and reranking via explicit step-by-step reasoning over candidates. We introduce MARVEL (Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL), a unified pipeline that combines LLM-driven query expansion, MARVEL-Retriever – a reasoning-enhanced dense retriever fine-tuned for complex multimodal queries – and GPT-4o-based chain-of-thought reranking with optional multi-pass reciprocal rank fusion. Evaluated on MM-BRIGHT across 29 technical domains, MARVEL achieves 37.9 nDCG@10, surpassing the best multimodal encoder by +10.3 points and outperforming all single-stage baselines in 27 of 29 domains and matching or approaching the best baseline in the remaining two highly-specialized domains (Crypto, Quantum Computing), demonstrating that reasoning-intensive multimodal retrieval is best addressed through a unified expand-retrieve-rerank framework. this https URL
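MARVEL 重排阶段可选的多轮互评排名融合即标准的 Reciprocal Rank Fusion(RRF),其通用形式为 score(d) = Σᵢ 1/(k + rankᵢ(d))。下面是一个最小 Python 实现(k=60 是文献常用默认值,论文具体取值未说明;排名列表为虚构示例):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """标准 RRF:对每个文档累加 1/(k + 它在各排名列表中的名次)。"""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

pass1 = ["d2", "d1", "d3"]        # 第一轮重排结果(虚构)
pass2 = ["d1", "d4", "d2"]        # 第二轮重排结果(虚构)
fused = reciprocal_rank_fusion([pass1, pass2])
```

RRF 只依赖名次不依赖分数,因此适合融合多次 LLM 重排这类分数不可比的排名。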
[IR-4] AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
【速读】:该论文旨在解决复杂Text-to-SQL任务中因数据库模式(schema)庞大、自然语言查询需多步推理而导致的执行准确率低的问题,尤其在上下文窗口受限时难以生成可执行SQL。其解决方案的关键在于提出AV-SQL框架,通过引入“代理视图”(agentic views)——由LLM代理生成的通用表表达式(Common Table Expressions, CTEs),用于封装中间查询逻辑并从大Schema中筛选相关元素,从而实现分阶段的结构化推理:首先由重写代理压缩和澄清输入查询,再由视图生成代理处理Schema片段以构建代理视图,最后由规划、生成与修正代理协同组合这些视图生成最终SQL。此方法显著提升了复杂场景下的执行准确性,在Spider 2.0上达到70.38%的执行准确率,优于现有最优基线。
链接: https://arxiv.org/abs/2604.07041
作者: Minh Tam Pham,Trinh Pham,Tong Chen,Hongzhi Yin,Quoc Viet Hung Nguyen,Thanh Tam Nguyen
机构: Griffith University (澳大利亚格里菲斯大学); The University of Queensland (澳大利亚昆士兰大学)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
Abstract:Text-to-SQL is the task of translating natural language queries into executable SQL for a given database, enabling non-expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real-world settings, where database schemas are large and questions require multi-step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one-shot generation frequently produces non-executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV-SQL, a framework that decomposes complex Text-to-SQL into a pipeline of specialized LLM agents. Central to AV-SQL is the concept of agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV-SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV-SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state-of-the-art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at this https URL.
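“代理视图”本质上是把中间查询逻辑封装为 CTE,再拼装进最终 SQL。下面用一个 Python 小函数示意这种拼装方式(表名与视图内容均为虚构示例,非论文代码):

```python
def compose_sql(views: dict, final_select: str) -> str:
    """把若干代理视图(名字 -> 子查询)拼装成带 WITH 子句的最终 SQL。"""
    ctes = ",\n".join(f"{name} AS ({body})" for name, body in views.items())
    return f"WITH {ctes}\n{final_select}"

# 两个虚构的代理视图,各自封装一段中间查询逻辑并过滤出相关模式元素
views = {
    "active_users": "SELECT id FROM users WHERE last_login >= DATE '2026-01-01'",
    "big_orders": "SELECT user_id FROM orders WHERE total > 100",
}
sql = compose_sql(views, "SELECT COUNT(*) FROM big_orders JOIN active_users ON user_id = id")
```

每个视图由一个代理在各自的模式分片上生成,规划/生成/修正代理再像上面这样把它们组合成可执行查询。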
[IR-5] Leveraging LLMs and Heterogeneous Knowledge Graphs for Persona-Driven Session-Based Recommendation
【速读】:该论文旨在解决会话推荐系统(Session-based Recommendation Systems, SBRS)中因匿名会话假设导致的个性化不足问题,尤其在数据稀疏或冷启动场景下表现不佳。现有方法多依赖序列建模或基于文本的用户表示,难以有效捕捉用户潜在特征。其解决方案的关键在于提出一种以用户人格(persona)驱动的SBRS框架,通过构建融合时间无关用户-物品、物品-物品、物品特征关联及DBpedia元数据的异构知识图谱(Heterogeneous Knowledge Graph, KG),并利用基于LLM生成的物品嵌入初始化KG,采用无监督的异构深度图信息最大化(Heterogeneous Deep Graph Infomax, HDGI)目标学习潜在用户人格表征;随后在两阶段架构中实现个性化信息提取与利用:第一阶段从KG中无监督挖掘用户人格,第二阶段将人格表征与LLM衍生物品嵌入整合进数据驱动的SBRS模型,生成候选集并结合基础序列模型进行重排序,从而强化短期会话意图。该方法突破了传统仅依赖会话历史或文本表示的局限,通过结构化关系信号实现更稳健的用户人格建模。
链接: https://arxiv.org/abs/2604.06928
作者: Muskan Gupta,Suraj Thapa,Jyotsana Khatri
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:Session-based recommendation systems (SBRS) aim to capture user’s short-term intent from interaction sequences. However, the common assumption of anonymous sessions limits personalization, particularly under sparse or cold-start conditions. Recent advances in LLM-augmented recommendation have shown that LLMs can generate rich item representations, but modeling user personas with LLMs remains challenging due to anonymous sessions. In this work, we propose a persona-driven SBRS framework that explicitly models latent user personas inferred from a heterogeneous knowledge graph (KG) and integrates them into a data-driven recommendation pipeline. The framework adopts a two-stage architecture consisting of personalized information extraction and personalized information utilization, inspired by recent chain-of-thought recommendation approaches. In the personalized information extraction stage, we construct a heterogeneous KG that integrates time-independent user-item, item-item, item-feature association, and metadata from DBpedia. We then learn latent user personas in an unsupervised manner using a Heterogeneous Deep Graph Infomax (HDGI) objective over a KG initialized with LLM-derived item embeddings. In the personalized information utilization stage, the learned persona representations together with LLM-derived item embeddings are incorporated into a modified architecture of data-driven SBRS to generate a candidate set of relevant items, followed by reranking using the base sequential model to emphasize short-term session intent. Unlike prior approaches that rely solely on sequence modeling or text-based user representations, our method grounds user persona modeling in structured relational signals derived from a KG. Experiments on Amazon Books and Amazon Movies & TV demonstrate that our approach consistently improves over sequential models with user embeddings derived using session history.
[IR-6] CASE: Cadence-Aware Set Encoding for Large-Scale Next Basket Repurchase Recommendation SIGIR2026
【速读】:该论文旨在解决大规模零售推荐场景中,现有next basket repurchase推荐模型无法显式建模物品购买间隔时间(即物品级复购节奏,item-level cadence)的问题。当前主流方法将用户历史购物篮表示为按访问顺序索引的离散事件序列,难以捕捉随日历时间推移而变化的复购规律,从而影响推荐准确性。其解决方案的关键在于提出CASE(Cadence-Aware Set Encoding),通过解耦物品级节奏学习与跨物品交互建模:首先将每个物品的历史购买行为编码为固定时长窗口内的日历时间信号,利用共享多尺度卷积捕捉周期性节奏模式;其次采用诱导集合注意力机制(induced set attention)建模物品间的依赖关系,复杂度低于二次方,支持高效批量推理。该设计在保持工业级可扩展性的同时,显著提升了复购推荐的Precision、Recall和NDCG指标,在多个公开数据集及千万级用户生产环境中均取得明显性能提升。
链接: https://arxiv.org/abs/2604.06718
作者: Yanan Cao,Ashish Ranjan,Sinduja Subramaniam,Evren Korpeoglu,Kaushiki Nag,Kannan Achan
机构: Walmart Global Tech(沃尔玛全球科技)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at SIGIR 2026 Industry Track
Abstract:Repurchase behavior is a primary signal in large-scale retail recommendation, particularly in categories with frequent replenishment: many items in a user’s next basket were previously purchased and their timing follows stable, item-specific cadences. Yet most next basket repurchase recommendation models represent history as a sequence of discrete basket events indexed by visit order, which cannot explicitly model elapsed calendar time or update item rankings as days pass between purchases. We present CASE (Cadence-Aware Set Encoding for next basket repurchase recommendation), which decouples item-level cadence learning from cross-item interaction, enabling explicit calendar-time modeling while remaining production-scalable. CASE represents each item’s purchase history as a calendar-time signal over a fixed horizon, applies shared multi-scale temporal convolutions to capture recurring rhythms, and uses induced set attention to model cross-item dependencies with sub-quadratic complexity, allowing efficient batch inference at scale. Across three public benchmarks and a proprietary dataset, CASE consistently improves Precision, Recall, and NDCG at multiple cutoffs compared to strong next basket prediction baselines. In a production-scale evaluation with tens of millions of users and a large item catalog, CASE achieves up to 8.6% relative Precision and 9.9% Recall lift at top-5, demonstrating that scalable cadence-aware modeling yields measurable gains in both benchmark and industrial settings.
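CASE 的第一步是把每个物品的购买历史编码为固定时长窗口内的日历时间信号,再喂给多尺度时间卷积。下面是一个假设性的 Python 编码草图(以“距今天数”为索引的 0/1 序列;窗口长度、日期均为虚构参数):

```python
def calendar_signal(purchase_days, horizon=90, today=100):
    """把购买日期列表编码为长度为 horizon 的 0/1 日历信号。

    signal[t] = 1 表示距今 t 天时发生过一次购买。
    """
    signal = [0] * horizon
    for day in purchase_days:
        offset = today - day          # 该次购买距今的天数
        if 0 <= offset < horizon:
            signal[offset] = 1
    return signal

# 虚构例子:用户在第 35、65、95 天各买过一次,今天是第 100 天
sig = calendar_signal([95, 65, 35], horizon=90, today=100)
```

这种表示显式携带流逝的日历时间:同一批购买,随着 today 增大,信号整体右移,排序可以逐日更新;共享多尺度卷积则在其上捕捉约 30 天一次的复购节奏。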
[IR-7] ATANT: An Evaluation Framework for AI Continuity
【速读】:该论文旨在解决当前人工智能系统在长期交互中缺乏对叙事连续性(narrative continuity)的量化评估问题,即系统能否持久、更新、消歧并重建跨时间的有意义上下文。尽管已有记忆组件(如RAG流水线、向量数据库、长上下文窗口等)被广泛部署,但尚无公开框架能正式定义并测量这些组件是否真正实现连续性。解决方案的关键在于提出ATANT(Automated Test for Acceptance of Narrative Truth)——一个无需LLM参与评估闭环的10个检查点评估方法论,结合包含250个故事、1,835个验证问题的叙事语料库,以“累积模式”作为核心指标,衡量系统在多条独立生命叙事共存时避免事实交叉污染的能力。该框架具备系统无关性和模型无关性,可作为构建和验证连续性系统的标准化序列流程。
链接: https://arxiv.org/abs/2604.06710
作者: Samuel Sameer Tanguturi
机构: Kenotic Labs
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 7 pages, 8 tables. Framework and evaluation protocol available at this https URL
Abstract:We present ATANT (Automated Test for Acceptance of Narrative Truth), an open evaluation framework for measuring continuity in AI systems: the ability to persist, update, disambiguate, and reconstruct meaningful context across time. While the AI industry has produced memory components (RAG pipelines, vector databases, long context windows, profile layers), no published framework formally defines or measures whether these components produce genuine continuity. We define continuity as a system property with 7 required properties, introduce a 10-checkpoint evaluation methodology that operates without an LLM in the evaluation loop, and present a narrative test corpus of 250 stories comprising 1,835 verification questions across 6 life domains. We evaluate a reference implementation across 5 test suite iterations, progressing from 58% (legacy architecture) to 100% in isolated mode (250 stories) and 100% in 50-story cumulative mode, with 96% at 250-story cumulative scale. The cumulative result is the primary measure: when 250 distinct life narratives coexist in the same database, the system must retrieve the correct fact for the correct context without cross-contamination. ATANT is system-agnostic, model-independent, and designed as a sequenced methodology for building and validating continuity systems. The framework specification, example stories, and evaluation protocol are available at this https URL. The full 250-story corpus will be released incrementally.
[IR-8] CubeGraph: Efficient Retrieval-Augmented Generation for Spatial and Temporal Data
【速读】:该论文旨在解决现代检索增强生成(Retrieval-Augmented Generation, RAG)系统中混合查询的性能瓶颈问题,即如何高效地结合高维向量相似性搜索与时空过滤条件。现有方法通常将向量索引嵌套在低维空间结构(如R树)中,导致向量空间被分割成多个不连通的子索引,从而破坏图路由连通性、增加遍历开销,并难以优化复杂的空间边界查询。其解决方案的关键在于提出CubeGraph框架,该框架通过层次化网格划分空间域,在每个单元格内维护模块化的向量图;查询时动态拼接与查询空间交集的相邻立方体索引,实现全局连通性与单次遍历的近邻搜索,显著提升查询效率与扩展性。
链接: https://arxiv.org/abs/2604.06616
作者: Mingyu Yang,Wentao Li,Wei Wang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Technical Report
Abstract:Hybrid queries combining high-dimensional vector similarity search with spatio-temporal filters are increasingly critical for modern retrieval-augmented generation (RAG) systems. Existing systems typically handle these workloads by nesting vector indices within low-dimensional spatial structures, such as R-trees. However, this decoupled architecture fragments the vector space, forcing the query engine to invoke multiple disjoint sub-indices per query. This fragmentation destroys graph routing connectivity, incurs severe traversal overhead, and struggles to optimize for complex spatial boundaries. In this paper, we propose CubeGraph, a novel indexing framework designed to natively integrate vector search with arbitrary spatial constraints. CubeGraph partitions the spatial domain using a hierarchical grid, maintaining modular vector graphs within each cell. During query execution, CubeGraph dynamically stitches together adjacent cube-level indices on the fly whenever their spatial cells intersect with the query filter. This dynamic graph integration restores global connectivity, enabling a unified, single-pass nearest-neighbor traversal that eliminates the overhead of fragmented sub-index invocations. Extensive evaluations on real-world datasets demonstrate that CubeGraph significantly outperforms state-of-the-art baselines, offering superior query execution performance, scalability, and flexibility for complex hybrid workloads.
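CubeGraph 查询执行的第一步是找出与查询空间范围相交的网格单元,随后再拼接这些单元内的向量图。下面用二维规则网格给出一个最小 Python 草图(单元大小与查询矩形均为虚构示例):

```python
def cells_intersecting(query_box, cell_size=10):
    """返回与二维查询矩形相交的所有网格单元坐标 (cx, cy)。"""
    (x0, y0), (x1, y1) = query_box
    return [(cx, cy)
            for cx in range(int(x0 // cell_size), int(x1 // cell_size) + 1)
            for cy in range(int(y0 // cell_size), int(y1 // cell_size) + 1)]

# 查询矩形左下角 (12, 5)、右上角 (27, 18),落在 4 个 10x10 单元上
cells = cells_intersecting(((12, 5), (27, 18)))
```

拿到命中的单元后,CubeGraph 按需拼接相邻单元的模块化向量图,从而恢复全局连通性,用单次图遍历完成带空间过滤的近邻搜索。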
[IR-9] LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
Summary: This paper addresses how heterogeneous documents in missing-person and child-safety investigations (structured forms, bulletin-style posters, and narrative web profiles) impede rapid triage, large-scale analysis, and search planning due to variations in layout, terminology, and data quality. The proposed solution, the Guardian Parser Pack, is an AI-driven parsing and normalization pipeline whose key components are: (i) multi-engine PDF text extraction with an OCR fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway with validator-guided repair and shared geocoding services. Empirically, the LLM-assisted pathway clearly outperforms the deterministic one on key-field completeness (96.97% vs. 93.23%) and extraction quality (F1 = 0.8664 vs. 0.2578) while preserving the auditability and reliability of the schema-first architecture, supporting controlled use of probabilistic AI in high-stakes investigative settings.
Link: https://arxiv.org/abs/2604.06571
Authors: Joshua Castillo, Ravi Mukkamala
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures. Accepted at International Conference on Intelligent Digitization of Systems and Services (IDSS 2026)
Abstract:Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.
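The multi-engine extraction with OCR fallback in component (i) follows a common try-in-order pattern; the sketch below illustrates it with stub engines. The function and engine names are invented for illustration — the real pipeline wraps actual extraction libraries:

```python
def extract_text(document, engines):
    """Try each extraction engine in order, falling back to the next on an
    exception or empty output: the multi-engine-with-OCR-fallback pattern."""
    for name, engine in engines:
        try:
            text = engine(document)
        except Exception:
            continue  # engine failed outright; try the next one
        if text and text.strip():
            return name, text
    return None, ""

# Stub engines standing in for real extractors (a text-layer parser first,
# OCR last); their behavior here is invented for illustration.
def text_layer(_doc):
    raise ValueError("no embedded text layer")

def ocr(_doc):
    return "JOHN DOE, last seen 2021-03-14"

engine_used, text = extract_text(b"%PDF-1.7 ...", [("text-layer", text_layer), ("ocr", ocr)])
print(engine_used)
```

Ordering engines from cheapest and most faithful (embedded text layer) to most expensive (OCR) keeps the common case fast while guaranteeing a result for scanned documents.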
[IR-10] The Unreasonable Effectiveness of Data for Recommender Systems
Summary: This paper studies the marginal benefit of large-scale interaction data for recommender systems: does offline recommendation performance reach a saturation point as training data grows, beyond which collecting and processing more data stops paying off? The key to the solution is a reproducible Python evaluation workflow built on the LensKit and RecBole toolkits, systematically testing 10 tool-algorithm combinations on 11 public datasets with at least 7 million interactions each. Models are trained via absolute stratified user sampling at nine sample sizes (from 100,000 to 100,000,000 interactions) and measured with NDCG@10. Raw NDCG generally rises with data volume, with no observable saturation point; after within-group min-max normalization, about 75% of the best scores occur at the largest sample size, and a late-stage slope analysis over the final 10-30% of each curve shows a median slope near 1.0, confirming that more data continues to yield gains.
Link: https://arxiv.org/abs/2604.06420
Authors: Youssef Abdou
Affiliation: University of Siegen
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 5 pages, 6 figures. Poster paper
Abstract:In recommender systems, collecting, storing, and processing large-scale interaction data is increasingly costly in terms of time, energy, and computation, yet it remains unclear when additional data stops providing meaningful gains. This paper investigates how offline recommendation performance evolves as the size of the training dataset increases and whether a saturation point can be observed. We implemented a reproducible Python evaluation workflow with two established toolkits, LensKit and RecBole, included 11 large public datasets with at least 7 million interactions, and evaluated 10 tool-algorithm combinations. Using absolute stratified user sampling, we trained models on nine sample sizes from 100,000 to 100,000,000 interactions and measured NDCG@10. Overall, raw NDCG usually increased with sample size, with no observable saturation point. To make result groups comparable, we applied min-max normalization within each group, revealing a clear positive trend in which around 75% of the points at the largest completed sample size also achieved the group’s best observed performance. A late-stage slope analysis over the final 10-30% of each group further supported this upward trend: the interquartile range remained entirely non-negative with a median near 1.0. In summary, for traditional recommender systems on typical user-item interaction data, incorporating more training data remains primarily beneficial, while weaker scaling behavior is concentrated in atypical dataset cases and in the algorithmic outlier RecBole BPR under our setup.
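The within-group min-max normalization used to make result groups comparable can be sketched in a few lines. The NDCG values below are invented for illustration, not the paper's data:

```python
def minmax_normalize(scores):
    """Min-max normalize one result group's scores so the best observed
    value maps to 1.0 and the worst to 0.0, making groups comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)  # degenerate group: every run tied
    return [(s - lo) / (hi - lo) for s in scores]

# Illustrative NDCG@10 values for one tool-algorithm group across
# growing training sample sizes (invented numbers).
ndcg_by_sample_size = [0.021, 0.034, 0.041, 0.048, 0.049]
norm = minmax_normalize(ndcg_by_sample_size)
print([round(v, 3) for v in norm])
```

Normalizing within each group removes the large absolute NDCG differences between datasets and algorithms, so the scaling trend (does the largest sample hit 1.0?) is the only thing being compared.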
[IR-11] Incentive-Aware Multi-Fidelity Optimization for Generative Advertising in Large Language Models
Summary: This paper tackles the optimization of sponsorship configurations in large language model (LLM) generative advertising, whose core challenge is handling advertisers' strategic behavior and the high computational cost of stochastic generation at the same time. The key is a unified framework, the Incentive-Aware Multi-Fidelity Mechanism (IAMFM), which couples Vickrey-Clarke-Groves (VCG) incentives with multi-fidelity optimization to maximize expected social welfare. To make VCG computationally feasible, the authors further introduce Active Counterfactual Optimization, which reuses optimization data for efficient payment computation, and they provide theoretical guarantees of approximate strategy-proofness and individual rationality, yielding incentive-aligned generation under budget constraints.
Link: https://arxiv.org/abs/2604.06263
Authors: Jiayuan Liu, Barry Wang, Jiarui Gan, Tonghan Wang, Leon Xie, Mingyu Guo, Vincent Conitzer
Affiliation: Carnegie Mellon University; Adelaide University; University of Oxford; Harvard University
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Generative advertising in large language model (LLM) responses requires optimizing sponsorship configurations under two strict constraints: the strategic behavior of advertisers and the high cost of stochastic generations. To address this, we propose the Incentive-Aware Multi-Fidelity Mechanism (IAMFM), a unified framework coupling Vickrey-Clarke-Groves (VCG) incentives with Multi-Fidelity Optimization to maximize expected social welfare. We compare two algorithmic instantiations (elimination-based and model-based), revealing their budget-dependent performance trade-offs. Crucially, to make VCG computationally feasible, we introduce Active Counterfactual Optimization, a “warm-start” approach that reuses optimization data for efficient payment calculation. We provide formal guarantees for approximate strategy-proofness and individual rationality, establishing a general approach for incentive-aligned, budget-constrained generative processes. Experiments demonstrate that IAMFM outperforms single-fidelity baselines across diverse budgets.
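For background, the VCG payment rule that IAMFM builds on can be sketched at toy scale. This shows only the textbook mechanism — a welfare-maximizing allocation plus externality payments — not the paper's multi-fidelity optimization or counterfactual reuse; the bidders, slots, and values are invented:

```python
from itertools import permutations

def vcg(bidders, slots, value):
    """Textbook VCG: pick the slot assignment that maximizes total reported
    value, then charge each winner the externality it imposes on the others.
    Assumes at least as many slots as bidders; toy enumeration, not scalable."""
    def best(subset):
        best_w, best_a = 0.0, {}
        for slot_perm in permutations(slots, len(subset)):
            a = dict(zip(subset, slot_perm))
            w = sum(value[b][s] for b, s in a.items())
            if w > best_w:
                best_w, best_a = w, a
        return best_w, best_a

    welfare, assignment = best(bidders)
    payments = {}
    for b in assignment:
        welfare_without_b, _ = best([x for x in bidders if x != b])
        # Payment = harm done to others: their best welfare without b,
        # minus what they actually receive when b participates.
        payments[b] = welfare_without_b - (welfare - value[b][assignment[b]])
    return assignment, payments

# Two advertisers bidding for two sponsorship slots (invented values).
value = {"a": {"top": 10, "side": 4}, "b": {"top": 7, "side": 5}}
assignment, payments = vcg(["a", "b"], ["top", "side"], value)
print(assignment, payments)
```

Each winner pays the welfare loss it inflicts on everyone else, which is what makes truthful bidding a dominant strategy; the expense of re-solving the allocation once per bidder is exactly the cost IAMFM's counterfactual reuse targets.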
[IR-12] What Do Humanities Scholars Need? A User Model for Recommendation in Digital Archives
Summary: This paper examines the limits of assumptions common in recommender system (RecSys) user modeling — stable preferences, similarity-based relevance, and session-bounded interactions — which do not hold when humanities scholars seek information in digital archives. Using a human-centered design approach, the authors identify four dimensions along which scholarly information seeking diverges from conventional RecSys user models: (1) context volatility — preferences shift with research tasks and domain expertise; (2) epistemic trust — relevance depends on verifiable provenance; (3) contrastive seeking — researchers deliberately look for content that challenges their current direction; and (4) strand continuity — research follows long-term threads rather than discrete sessions. These four dimensions are proposed as a diagnostic framework that informs improvements to collaborative filtering, content-based, and session-based recommendation, and that generalizes to other domains where typical user modeling assumptions break down.
Link: https://arxiv.org/abs/2604.06232
Authors: Florian Atzenhofer-Baumgartner, Dominik Kowald
Affiliation: Graz University of Technology; University of Graz; Know Center Research GmbH
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: To be presented at the 34th ACM Conference on User Modeling, Adaptation and Personalization (UMAP'26), June 08-11, 2026, Gothenburg, Sweden
Abstract:User models for recommender systems (RecSys) typically assume stable preferences, similarity-based relevance, and session-bounded interactions – assumptions derived from high-volume consumer contexts. This paper investigates these assumptions for humanities scholars working with digital archives. Following a human-centered design approach, we conducted focus groups and analyzed interview data from 18 researchers. Our analysis identifies four dimensions where scholarly information-seeking diverges from common RecSys user modeling: (1) context volatility – preferences shift with research tasks and domain expertise; (2) epistemic trust – relevance depends on verifiable provenance; (3) contrastive seeking – researchers seek items that challenge their current direction; and (4) strand continuity – research spans long-term threads rather than discrete sessions. We discuss implications for user modeling and outline how these dimensions relate to collaborative filtering, content-based, and session-based recommendation. We propose these dimensions as a diagnostic framework applicable beyond archives to similar application domains where typical user modeling assumptions may not hold.
[IR-13] Automating Database-Native Function Code Synthesis with LLMs
Summary: This paper addresses the difficulty of automatically synthesizing database native functions, which keep growing in database kernels to support new applications and business migration but are complex and error-prone to produce: a single function may require registering multiple units, linking internal references, and implementing logic correctly. Existing LLM-based code generation (e.g., Claude Code) is too generic for this setting and often hallucinates or overlooks critical context. The key contributions of the proposed DBCooker system are: (1) a function characterization module that aggregates multi-source declarations and identifies function units needing specialized coding along with their cross-unit dependencies; (2) a pseudo-code-based coding plan generator that builds structured implementation skeletons, plus a hybrid fill-in-the-blank model guided by probabilistic priors and component awareness to integrate core logic with reusable routines; and (3) three-level progressive validation (syntax checking, standards compliance, and LLM-guided semantic verification), unified by an adaptive orchestration strategy that dynamically sequences operations — yielding substantially higher synthesis accuracy (34.55% higher than existing methods on average).
Link: https://arxiv.org/abs/2604.06231
Authors: Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, Fan Wu
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments: Please visit our homepage at: this https URL . The code is available at: this https URL
Abstract:Database systems incorporate an ever-growing number of functions in their kernels (a.k.a., database native functions) for scenarios like new application support and business migration. This growth causes an urgent demand for automatic database native function synthesis. While recent advances in LLM-based code generation (e.g., Claude Code) show promise, they are too generic for database-specific development. They often hallucinate or overlook critical context because database function synthesis is inherently complex and error-prone, where synthesizing a single function may involve registering multiple function units, linking internal references, and implementing logic correctly. To this end, we propose DBCooker, an LLM-based system for automatically synthesizing database native functions. It consists of three components. First, the function characterization module aggregates multi-source declarations, identifies function units that require specialized coding, and traces cross-unit dependencies. Second, we design operations to address the main synthesis challenges: (1) a pseudo-code-based coding plan generator that constructs structured implementation skeletons by identifying key elements such as reusable referenced functions; (2) a hybrid fill-in-the-blank model guided by probabilistic priors and component awareness to integrate core logic with reusable routines; and (3) three-level progressive validation, including syntax checking, standards compliance, and LLM-guided semantic verification. Finally, an adaptive orchestration strategy unifies these operations with existing tools and dynamically sequences them via the orchestration history of similar functions. Results show that DBCooker outperforms other methods on SQLite, PostgreSQL, and DuckDB (34.55% higher accuracy on average), and can synthesize new functions absent in the latest SQLite (v3.50).
[IR-14] Probabilistic Language Tries: A Unified Framework for Compression Decision Policies and Execution Reuse
Summary: This paper targets three core problems for generative models over sequences: efficient compression, optimal decision making, and reuse of computation. The key is the probabilistic language trie (PLT), a unified probabilistic representation that makes explicit the prefix structure any generative model induces over sequence space. By assigning each outgoing edge the conditional probability of the corresponding token or action, a PLT serves three roles at once: an optimal lossless compressor via frequency-weighted interval encoding (generalizing arithmetic coding to model-conditioned distributions), a policy representation for sequential decision tasks (games, search, and robotic control), and a memoization index that answers repeated queries by structured retrieval instead of full model execution. The central technical result is a prior-guided caching theorem: under a stationary generative distribution, a PLT-guided cache achieves strictly lower expected inference cost than any empirical-frequency cache, with the advantage growing with the concentration of the prior — converting the O(n^2) attention cost of transformers into an expected cost of p_r * O(log N) + (1 - p_r) * O(n^2), where p_r is the prior-estimated reuse probability and N is the artifact store size.
Link: https://arxiv.org/abs/2604.06228
Authors: Gregory Magarshak
Affiliation: ienyc.edu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Information Theory (cs.IT)
Comments: 24 pages, 2 figures
Abstract:We introduce probabilistic language tries (PLTs), a unified representation that makes explicit the prefix structure implicitly defined by any generative model over sequences. By assigning to each outgoing edge the conditional probability of the corresponding token or action, a PLT simultaneously serves as: (i) an optimal lossless compressor via frequency-weighted interval encoding, generalizing arithmetic coding to model-conditioned distributions; (ii) a policy representation for sequential decision problems including games, search, and robotic control; and (iii) a memoization index that lets repeated inference queries be answered by structured retrieval rather than full model execution. The central technical result is a prior-guided caching theorem: under a stationary generative distribution, a PLT-guided cache achieves strictly lower expected inference cost than any empirical-frequency cache for all query counts below a threshold that grows with the concentration of the prior. This converts O(n^2) transformer attention cost into an expected cost of p_r * O(log N) + (1 - p_r) * O(n^2), where p_r is the prior-estimated reuse probability and N is the artifact store size. We further introduce a hybrid compression architecture decomposing any dataset into a PLT-covered majority and a sparse residual store, connecting arithmetic coding with Kolmogorov-style program representations and rate-distortion theory. We instantiate the framework across chess, web search, robotics, organizational workflows, and LLM inference, demonstrating that compression, decision making, and computational reuse are all derived from a single probability measure on sequence space. 
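The headline cost formula can be explored directly. The sketch below plugs illustrative unit constants into p_r * O(log N) + (1 - p_r) * O(n^2); the constants hidden inside the big-O terms are an assumption of this sketch, not values from the paper:

```python
import math

def expected_inference_cost(p_r, n, N, c_lookup=1.0, c_attn=1.0):
    """Expected cost per query under the paper's headline formula
    p_r * O(log N) + (1 - p_r) * O(n^2). The unit constants c_lookup and
    c_attn stand in for the hidden big-O factors (an assumption here)."""
    return p_r * c_lookup * math.log2(N) + (1 - p_r) * c_attn * n ** 2

# A 4096-token context against a million-artifact store: even 50% reuse
# roughly halves the expected cost, because trie lookups are nearly free
# at this scale (log2(1e6) is about 20 versus 4096^2 for attention).
full = expected_inference_cost(0.0, n=4096, N=1_000_000)
half = expected_inference_cost(0.5, n=4096, N=1_000_000)
print(round(full / half, 4))
```

The asymmetry is the point of the caching theorem: the cached branch grows only logarithmically in the store size, so any reuse probability bounded away from zero removes a near-proportional share of the quadratic attention cost.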
[IR-15] ARIA: Adaptive Retrieval Intelligence Assistant – A Multimodal RAG Framework for Domain-Specific Engineering Education
Summary: This paper addresses the limitations of large language models (LLMs) in specialized educational settings, including hallucinations, limited knowledge updates, and lack of domain expertise. The proposed solution, ARIA (Adaptive Retrieval Intelligence Assistant), is a Retrieval-Augmented Generation (RAG) framework for building intelligent teaching assistants for university courses. Its key innovations are a multimodal content extraction pipeline (Docling for structured document analysis, Nougat for mathematical formula recognition, and the GPT-4 Vision API for diagram interpretation) combined with the e5-large-v2 embedding model for high semantic precision at low latency, enabling accurate processing of complex course materials while engineered prompts and response controls preserve pedagogical consistency. Evaluated on a Statics and Mechanics of Materials course, ARIA achieved 97.5% accuracy in domain-specific question filtering and an average response quality of 4.89/5.0, demonstrating a scalable, course-agnostic deployment architecture.
Link: https://arxiv.org/abs/2604.06179
Authors: Yue Luo, Dibakar Roy Sarkar, Rachel Herring Sangree, Somdatta Goswami
Affiliation: Dalian University of Technology; Johns Hopkins University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Developing effective, domain-specific educational support systems is central to advancing AI in education. Although large language models (LLMs) demonstrate remarkable capabilities, they face significant limitations in specialized educational applications, including hallucinations, limited knowledge updates, and lack of domain expertise. Fine-tuning requires complete model retraining, creating substantial computational overhead, while general-purpose LLMs often provide inaccurate responses in specialized contexts due to reliance on generalized training data. To address this, we propose ARIA (Adaptive Retrieval Intelligence Assistant), a Retrieval-Augmented Generation (RAG) framework for creating intelligent teaching assistants across university-level courses. ARIA leverages a multimodal content extraction pipeline combining Docling for structured document analysis, Nougat for mathematical formula recognition, and GPT-4 Vision API for diagram interpretation, with the e5-large-v2 embedding model for high semantic performance and low latency. This enables accurate processing of complex educational materials while maintaining pedagogical consistency through engineered prompts and response controls. We evaluate ARIA using lecture material from Statics and Mechanics of Materials, a sophomore-level civil engineering course at Johns Hopkins University, benchmarking against ChatGPT-5. Results demonstrate 97.5% accuracy in domain-specific question filtering and superior pedagogical performance. ARIA correctly answered all 20 relevant course questions while rejecting 58 of 60 non-relevant queries, achieving 90.9% precision, 100% recall, and 4.89/5.0 average response quality. These findings demonstrate that ARIA’s course-agnostic architecture represents a scalable framework for domain-specific educational AI deployment.
[IR-16] WebExpert: domain-aware web agents with critic-guided expert experience for high-precision search ICASSP2026
Summary: This paper tackles web tasks in specialized domains such as finance, biomedicine, and pharmaceuticals, which remain hard mainly due to missing domain priors: query drift, noisy evidence, and brittle reasoning. The key to the proposed domain-aware web agent WebExpert, implemented end-to-end, is: (i) sentence-level experience retrieval with topic merging and rule distillation; (ii) lightweight facet induction that bootstraps time, region, policy, and industry facets from weak supervision instead of static hand-written lexicons; and (iii) preference-optimized planning that jointly improves query planning and retrieval via pairwise preference learning and a coverage-aware objective. At inference, a lightweight experience gate biases decoding toward active facets under high retrieval confidence and provides a fallback under low confidence, yielding notable accuracy gains with fewer page hops.
Link: https://arxiv.org/abs/2604.06177
Authors: Yuelin Hu, Zhengxue Cheng, Ronghua Wu, Qunshan Gu, Hongwei Hu, Wei Liu, Qiao Liang, Li Song
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: accepted by ICASSP 2026
Abstract:Specialized web tasks in finance, biomedicine, and pharmaceuticals remain challenging due to missing domain priors: queries drift, evidence is noisy, and reasoning is brittle. We present WebExpert, a domain-aware web agent that we implement end-to-end, featuring: (i) sentence-level experience retrieval with topic merging and rule distillation, (ii) schema-light facet induction that bootstraps time, region, policy, and industry facets from weak supervision instead of static hand-written lexicons, and (iii) preference-optimized planning that jointly improves query planning and retrieval via pairwise preference learning alongside a coverage-aware objective. At inference, a lightweight experience gate biases decoding toward active facets, with a fallback under low retrieval confidence. On GAIA, GPQA, HLE, and WebWalkerQA, WebExpert improves Answer Exact Match (EM) by 1.5-3.6 pp over the strongest browsing baseline and reduces page hops. Analysis and ablations on retrieval, topic merging, facet induction, and preference-aware training show consistent gains.
[IR-17] Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in the Qwen3-Embedding Model
Summary: This paper examines a robustness problem in embedding-based retrieval under realistic conversational settings: when queries are short, dialogue-like, and weakly specified, structured dialogue-style noise can be spuriously ranked high due to biases in the embedding representation, contaminating top-ranked results. The key finding is that lightweight query prompting qualitatively alters retrieval behavior, effectively suppressing noise intrusion and restoring ranking stability — exposing a deployment risk that standard clean-query benchmarks fail to surface and underscoring the need for evaluation protocols that reflect real-world usage.
Link: https://arxiv.org/abs/2604.06176
Authors: Weishu Chen, Zhouhui Hou, Mingjie Zhan, Zhicheng Zhao, Fei Su
Affiliation: Beijing University of Posts and Telecommunications; SenseTime; Beijing Key Laboratory of Network System and Network Culture
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We present an empirical study of embedding-based retrieval under realistic conversational settings, where queries are short, dialogue-like, and weakly specified, and retrieval corpora contain structured conversational artifacts. Focusing on Qwen3-embedding models, we identify a deployment-relevant robustness vulnerability: under conversational retrieval without query prompting, structured dialogue-style noise can become disproportionately retrievable and intrude into top-ranked results, despite being semantically uninformative. This failure mode emerges consistently across model scales, remains largely invisible under standard clean-query benchmarks, and is significantly more pronounced in Qwen3 than in earlier Qwen variants and other widely used dense retrieval baselines. We further show that lightweight query prompting qualitatively alters retrieval behavior, effectively suppressing noise intrusion and restoring ranking stability. Our findings highlight an underexplored robustness risk in conversational retrieval and underscore the importance of evaluation protocols that reflect the complexities of deployed systems.
[IR-18] Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
Summary: This paper addresses the fact that existing legal QA benchmarks focus mainly on case law, overlooking the distinct challenges of statute-centric regulatory reasoning: relevant evidence is scattered across hierarchically structured documents, conventional retrievers fail to assemble complete context, and models readily hallucinate when context is incomplete. The key to the solution is SearchFireSafety, a structure- and safety-aware benchmark instantiated on fire-safety regulations, with a dual-source evaluation framework: real-world questions that require citation-aware retrieval, and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments show that graph-guided retrieval substantially improves performance but also expose a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing, highlighting the need to evaluate hierarchical retrieval and model safety jointly in statute-centric regulatory settings.
Link: https://arxiv.org/abs/2604.06173
Authors: Kyubyung Chae, Jewon Yeom, Jeongjae Park, Seunghyun Bae, Ijun Jang, Hyunbin Jin, Jinkwan Jang, Taesup Kim
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.
[IR-19] EviSnap: Faithful Evidence-Cited Explanations for Cold-Start Cross-Domain Recommendation
Summary: This paper tackles the lack of interpretability and trustworthiness in cold-start cross-domain recommender (CDR) systems: existing models either map opaque embeddings that are hard to audit or rely on post-hoc or LLM-generated rationales, making explanations unreliable. The key to the proposed EviSnap framework is to distill noisy reviews offline with an LLM into structured "facet cards", each facet paired with verbatim supporting sentences; a shared, domain-agnostic concept bank is induced from this evidence, and user-positive, user-negative, and item-presence concept activations are computed via evidence-weighted pooling. A single linear map then transfers users across domains, and a linear scoring head yields per-concept additive contributions, enabling exact score decompositions and counterfactual edits grounded in the cited sentences, so explanations remain faithful and transparent.
Link: https://arxiv.org/abs/2604.06172
Authors: Yingjun Dai, Ahmed El-Roby
Affiliation: Carleton University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 8 pages
Abstract:Cold-start cross-domain recommender (CDR) systems predict a user’s preferences in a target domain using only their source-domain behavior, yet existing CDR models either map opaque embeddings or rely on post-hoc or LLM-generated rationales that are hard to audit. We introduce EviSnap a lightweight CDR framework whose predictions are explained by construction with evidence-cited, faithful rationales. EviSnap distills noisy reviews into compact facet cards using an LLM offline, pairing each facet with verbatim supporting sentences. It then induces a shared, domain-agnostic concept bank by clustering facet embeddings and computes user-positive, user-negative, and item-presence concept activations via evidence-weighted pooling. A single linear concept-to-concept map transfers users across domains, and a linear scoring head yields per-concept additive contributions, enabling exact score decompositions and counterfactual ‘what-if’ edits grounded in the cited sentences. Experiments on the Amazon Reviews dataset across six transfers among Books, Movies, and Music show that EviSnap consistently outperforms strong mapping and review-text baselines while passing deletion- and sufficiency-based tests for explanation faithfulness.
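The defining property of a linear scoring head — per-concept contributions that sum exactly to the score — can be sketched as below. The functional form, concept names, and activations are illustrative assumptions, not EviSnap's actual head:

```python
def score_with_contributions(user_pos, user_neg, item_presence, weights):
    """Linear scoring head over a shared concept bank: each concept
    contributes weight * presence * (positive - negative affinity), so
    the per-concept contributions sum exactly to the final score."""
    contributions = {
        c: weights[c] * item_presence.get(c, 0.0)
           * (user_pos.get(c, 0.0) - user_neg.get(c, 0.0))
        for c in weights
    }
    return sum(contributions.values()), contributions

# Invented concepts and activations, purely for illustration.
score, parts = score_with_contributions(
    user_pos={"pacing": 0.8, "world-building": 0.6},
    user_neg={"romance": 0.7},
    item_presence={"pacing": 0.9, "romance": 0.5, "world-building": 0.2},
    weights={"pacing": 1.0, "romance": 1.0, "world-building": 1.0},
)
print(round(score, 2), round(parts["romance"], 2))
```

Because the decomposition is exact, zeroing one concept's presence gives a faithful counterfactual: the score changes by precisely that concept's contribution, which is the "what-if" editing capability the paper describes.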
[IR-20] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLM s and Ontological Engineering for Scholarly Debates
Summary: This paper addresses the difficulty of systematically querying the rich knowledge embedded in cultural heritage (CH) texts, whose core challenge is converting unstructured narrative text into structured knowledge graphs (KGs). The key to the solution is ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology that iteratively combines annotation models, ontological frameworks, and large language model (LLM)-driven knowledge extraction to turn raw documents into queryable KGs. Validated on disputed cultural heritage items (documents, artifacts, etc.) using three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini) in a sequential extraction pipeline, the approach achieves high F1 scores across tasks (e.g., 0.96-0.99 for metadata extraction) and shows that smaller models remain competitive, making the framework cost-effective and replicable.
Link: https://arxiv.org/abs/2511.10354
Authors: Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp
Affiliation: Università degli Studi di Bologna; KNAW Humanities Cluster
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 46 pages
Abstract:Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts…), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.
[IR-21] The Geometry of Forgetting
Summary: This paper asks why we forget (in particular, following a power law) and why we form false memories of events that never happened. Against the conventional appeal to biological hardware, it proposes geometry as an alternative explanation: in high-dimensional embedding spaces, noise, interference, and temporal degradation alone reproduce the quantitative signatures of human memory without phenomenon-specific engineering. The key findings are that competitive interference in high-dimensional semantic space — not temporal decay — drives power-law forgetting (b ≈ 0.46), and that untuned pre-trained embeddings already produce false memories (a Deese-Roediger-McDermott false alarm rate of about 0.583, close to the human ~0.55), suggesting false memories are not a defect but an inevitable consequence of retrieval by semantic similarity.
Link: https://arxiv.org/abs/2604.06222
Authors: Sambartha Ray Barman, Andrey Starenky, Sophia Bodnar, Nikhil Narasimhan, Ashwin Gopinath
Affiliation: Sentra; Massachusetts Institute of Technology
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Why do we forget? Why do we remember things that never happened? The conventional answer points to biological hardware. We propose a different one: geometry. Here we show that high-dimensional embedding spaces, subjected to noise, interference, and temporal degradation, reproduce quantitative signatures of human memory with no phenomenon-specific engineering. Power-law forgetting (b = 0.460 ± 0.183, human b ≈ 0.5) arises from interference among competing memories, not from decay. The identical decay function without competitors yields b ≈ 0.009, fifty times smaller. Time alone does not produce forgetting in this system. Competition does. Production embedding models (nominally 384–1,024 dimensions) concentrate their variance in only ~16 effective dimensions, placing them deep in the interference-vulnerable regime. False memories require no engineering at all: cosine similarity on unmodified pre-trained embeddings reproduces the Deese–Roediger–McDermott false alarm rate (0.583 versus human ~0.55) with zero parameter tuning and no boundary conditions. We did not build a false memory system. We found one already present in the raw geometry of semantic space. These results suggest that core memory phenomena are not bugs of biological implementation but features of any system that organizes information by meaning and retrieves it by proximity.
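The false-memory mechanism — retrieval by cosine proximity ranking an unstudied but thematically central lure above an unrelated word — can be illustrated with hand-picked toy vectors. These 3-d vectors are invented for the illustration, not real embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Studied words cluster around a theme; the unstudied lure sits near the
# cluster's center, while an unrelated word lies far away.
studied = {"bed": (0.9, 0.1, 0.0), "rest": (0.8, 0.2, 0.1), "dream": (0.85, 0.05, 0.2)}
lure = (0.85, 0.12, 0.1)      # never studied, but thematically central
unrelated = (0.0, 0.1, 0.95)  # never studied, thematically distant

def recognition_score(probe):
    # Retrieval by proximity: max similarity to anything actually studied.
    return max(cosine(probe, v) for v in studied.values())

print(recognition_score(lure) > recognition_score(unrelated))
```

Any recognition threshold loose enough to accept the studied items will also accept the lure, which is the geometric shape of the Deese-Roediger-McDermott false alarm the paper measures at full scale.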
Human-Computer Interaction
[HC-0] NIRVANA: A Comprehensive Dataset for Reproducing How Students Use Generative AI for Essay Writing
Summary: This paper addresses the lack of systematic understanding of how university students use generative AI during authentic writing tasks — when they seek help, what they ask, and how they incorporate AI-generated content into their essays — a gap that limits evidence-based educational policy and rigorous evaluation of generative AI's learning effects. The key to the solution is the NIRVANA dataset, which records how 77 university students interacted with ChatGPT while writing an analytical essay, capturing keystroke-level writing traces, full ChatGPT conversation histories, and all text copied from the AI, enabling complete reconstruction of the writing process. The authors also built a replay interface that supports systematic analysis of student-AI interaction and identified four writing profiles based on students' contribution and revision patterns: Lead Authors, Collaborators, Drafters, and Vibe Writers.
Link: https://arxiv.org/abs/2604.07344
Authors: Andrew Jelson, Daniel Manesh, Sangwook Lee, Alice Jang, Daniel Dunlap, Tamara Maddox, Young-Ho Kim, Sang Won Lee
Affiliation: Virginia Tech; George Mason University; NAVER AI Lab
Subjects: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 10 figures, 1 table, Submitted to Learning @ Scale 2026
Abstract:With the rapid adoption of AI writing assistants in education, educators and researchers need empirical evidence to understand the impact on student writing and inform effective pedagogical design. Despite widespread use, we lack systematic understanding of how students engage with these tools during authentic writing tasks: when they seek assistance, what they ask, and how they incorporate AI-generated content into their essays. This gap limits evidence-based policy development and rigorous evaluation of generative AI’s learning effects. To address this gap, we introduce NIRVANA, a dataset capturing how university students use generative AI while writing an analytical essay. The dataset includes 77 students who completed an essay task with access to ChatGPT, recording keystroke-level writing behavior, full ChatGPT conversation histories, and all text copied from ChatGPT, enabling a complete reconstruction of the writing process and revealing how AI assistance shapes student work. Our analysis identifies key behavioral patterns, including variation in ChatGPT query frequency and its relationship to essay characteristics such as length and readability. We identify four writing profiles based on students’ contribution and revision patterns: Lead Authors, Collaborators, Drafters, and Vibe Writers. To support deeper investigation, we developed a replay interface that reconstructs the writing process; qualitative analysis of sampled replays demonstrates how this tool enables systematic examination of student-AI interactions.
[HC-1] Mapping Child Malnutrition and Measuring Efficiency of Community Healthcare Workers through Location Based Games in India
【速读】:该论文旨在解决印度社区卫生工作者(Community Healthcare Workers, CHWs)在长期和跨地域环境中难以持续参与儿童体格测量数据收集的问题,从而导致健康数据空间代表性不足和时效性差的困境。其解决方案的关键在于通过共同设计(co-design)方法开发了一种基于游戏化的地理空间数据采集工具,并在印度多个州的两组CHWs(每组94人)中进行了对照试验,结果显示该游戏化方法显著提升了测量效率(p < 0.05),并展现出优于传统非游戏化方式的参与度与留存率,为改善CHWs的数据采集行为提供了可扩展的技术路径。
链接: https://arxiv.org/abs/2604.07299
作者: Arka Majhi,Aparajita Mondal,Satish B. Agnihotri
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Tampere University (坦佩雷大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted at GoodIT 2024
Abstract:In India, Community Healthcare Workers (CHWs) serve as critical intermediaries between the state and beneficiaries, including pregnant mothers and children. Effective planning and prioritization of care and services necessitate the collection of accurate health data from the community. Crowdsourcing child anthropometric data through CHWs could establish a valuable repository for evidence-based decision-making and service planning. However, existing platforms often fail to maintain CHWs’ engagement over time and across different spatial contexts, resulting in spatially misrepresented and outdated data. This study addresses these challenges by conducting a co-design exercise to develop innovative methods for collecting anthropometric data over time and space. The exercise involved analyzing data to create hotspot and density distribution maps. We implemented a trial of the developed game with two groups (n=94 per group) from various states across India, comparing the game-based and non-game-based data collection methods. Our findings reveal that the game-based approach significantly improved measuring efficiency (p < 0.05) and demonstrated superior engagement and retention compared to the non-game-based method. This research contributes to the expanding literature on co-design and Research through Design (RtD) methodologies for developing geospatial games, highlighting their potential to enhance data collection practices and improve engagement among CHWs.
Related DOI: https://doi.org/10.1145/3677525.3678685
[HC-2] BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving
【速读】:该论文旨在解决当前驾驶自动化(Driving Automation, DA)系统中人机交互界面(HMI)设计缺乏情境感知能力的问题,具体表现为无法准确预测驾驶员何时将控制权交给DA系统以及何时重新接管,从而导致认知负荷过重、用户体验不佳及安全隐患。解决方案的关键在于构建一个大规模自然驾驶场景下的多模态数据集BATON,该数据集同步记录了前视视频、舱内视频、解码的CAN总线信号、基于雷达的前车交互信息和GPS驱动的路线环境,形成围绕每次控制权转换的闭环多模态记录。通过引入三个基准任务(驾驶行为理解、交控预测与接管预测),研究发现仅依赖视觉输入(如前视或舱内视频)不足以实现可靠预测,而融合车辆动态(CAN)与路线环境等非视觉模态信号可显著提升性能,揭示了多模态互补性,并指出接管事件发展更渐进、需更长预测窗口,而交控事件则依赖即时上下文线索,为设计更具响应性和安全性的辅助驾驶系统HMI提供了关键依据。
链接: https://arxiv.org/abs/2604.07263
作者: Yuhang Wang,Yiyao Xu,Chaoyun Yang,Lingyao Li,Jingran Sun,Hao Zhou
机构: University of South Florida (南佛罗里达大学); Tongji University (同济大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware HMI, yet existing datasets rarely capture the multimodal context, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers, and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks: driving action understanding, handover prediction, and takeover prediction, and evaluate baselines spanning sequence models, classical classifiers, and zero-shot VLMs. Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.
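摘要指出仅凭视觉输入不足以可靠预测交接/接管,融合 CAN 车辆动态与路线环境信号可显著提升性能。下面用一个后期融合的 logistic 评分示意这种多模态互补(权重与特征数值均为本文举例假设,并非论文学习所得):

```python
import math

def transition_score(features, weights, bias=0.0):
    """对各模态特征摘要做后期融合的 logistic 评分
    (权重为示意值,并非论文学习所得)。"""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# 特征顺序:[视频, CAN 车辆动态, 路线环境];在视频之外
# 加入 CAN 与路线信号后,控制权转换事件的评分上升
video_only = transition_score([0.2, 0.0, 0.0], [1.0, 1.5, 1.0])
multimodal = transition_score([0.2, 0.8, 0.5], [1.0, 1.5, 1.0])
```

这里的后期融合只是演示思路;论文基准中实际比较的是序列模型、经典分类器与零样本 VLM。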
[HC-3] Reshaping Inclusive Interpersonal Dynamics through Smart Glasses in Mixed-Vision Social Activities
【速读】:该论文旨在解决盲人及低视力(Blind and Low Vision, BLV)个体在与视力正常同伴协作时因视觉线索不可访问而导致的社会互动障碍问题。解决方案的关键在于开发了一种基于智能眼镜的系统CollabLens,作为技术探针,在四次工作坊中验证其对混合视力群体人际动态的影响。研究发现,智能眼镜通过扩展BLV用户获取视觉信息的独立性和灵活性,显著促进了包容性协作;同时,视力正常参与者虽认为该设备有助于增进人际连接,但对如何调整自身助人行为存在不确定性。因此,设计核心在于实现无缝交互体验并推动双向混合视力社会包容。
链接: https://arxiv.org/abs/2604.07232
作者: Jieqiong Ding,Yumo Zhang,Xiuqi Tommy Zhu,Kaige Yang,Yuqing Wei,Shiyi Wang,Yishan Liu,Yang Jiao
机构: The Future Laboratory, Tsinghua University (清华大学未来实验室); University of Washington (华盛顿大学); Hong Kong Polytechnic University (香港理工大学); Northeastern University (东北大学); Central Academy of Fine Arts (中央美术学院); Shanghai Jiao Tong University (上海交通大学); Academy of Arts & Design, Tsinghua University (清华大学美术学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Meaningful social interaction is vital to well-being, yet Blind and Low Vision (BLV) individuals face persistent barriers when collaborating with sighted peers due to inaccessible visual cues. While most wearable assistive technologies emphasize individual tasks, smart glasses introduce opportunities for real-time, contextual support in social settings. To explore how smart glasses affect interpersonal dynamics and support inclusion in mixed-vision groups, we developed a smart glasses-based system, CollabLens, as a technology probe and employed it in four workshop sessions. We found that smart glasses can meaningfully support inclusive collaboration through expanding BLV participants’ assistive networks with more flexible, independent access to visual information. While sighted participants viewed smart glasses as a promising medium that fosters interpersonal connection, they revealed uncertainty in adapting their helping behaviors. We concluded by discussing and synthesizing challenges and opportunities for designing smart glasses that provide seamless interaction experiences and enhance reciprocal mixed-vision social inclusion.
[HC-4] Critical Inker: Scaffolding Critical Thinking in AI-Assisted Writing Through Socratic Questioning
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动化写作任务中可能导致的“认知弱化”(cognitive deskilling)问题,即用户过度依赖系统而削弱自身批判性思维能力。解决方案的关键在于提出一种名为Critical Inker的写作辅助工具,其核心机制是通过两种方法促进写作者的批判性反思:一是基于苏格拉底式提问的聊天机器人,引导用户识别并修正逻辑错误;二是可视化反馈机制,在不依赖对话的情况下直接标记文本中的逻辑缺陷。该系统通过逻辑分析与结构化反馈实现对论证过程的有效干预,从而在提升写作质量的同时维持用户的认知参与度。
链接: https://arxiv.org/abs/2604.07167
作者: Philipp Hugenroth,Valdemar Danry,Pattie Maes
机构: MIT Media Lab (麻省理工学院媒体实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:As Large Language Models (LLMs) increasingly automate writing tasks, there is a growing risk of cognitive deskilling where users offload critical thinking to the system. To address this, we introduce Critical Inker, a writing tool designed to scaffold critical reflection during writing through logical analysis and Socratic feedback. We present two methods: (1) a Socratic chatbot that uses questions to help writers recognize and fix logical errors in their writing, and (2) Visual Feedback, which highlights logical errors in the text without dialog. We detail the technical implementation of the system and evaluate its argument extraction and logical validity accuracy. Our evaluation shows a 91.2% argument overlap with ground truth argument annotations and 87% validity accuracy. Finally, we conducted a small-scale pilot and discuss early qualitative results.
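论文报告了与人工标注 91.2% 的论点重合度(argument overlap)。摘要未说明该指标的具体定义,下面给出一种常见的集合重合度量(Jaccard)作为示意,仅为本文假设:

```python
def argument_overlap(extracted, annotated):
    """抽取论点集合与人工标注集合的 Jaccard 重合度。
    论文未说明其 91.2% 采用何种定义,Jaccard 仅为一种假设。"""
    e, g = set(extracted), set(annotated)
    if not e and not g:
        return 1.0
    return len(e & g) / len(e | g)

# 假设性的论点标识:抽取出 3 条,其中 2 条与标注一致
overlap = argument_overlap(["claim-A", "claim-B", "claim-C"],
                           ["claim-A", "claim-B"])  # -> 2/3
```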
[HC-5] DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification CVPR2026
【速读】:该论文旨在解决视觉基础模型(如DINOv2)在作为特征提取器时,其复杂且高维的表示难以解释的问题。解决方案的关键在于提出一种轻量级可解释性适配器DINO-QPM,该方法将原本纠缠的特征转换为对比性的、与类别无关的可解释表示,并通过严格冻结DINO骨干网络实现全局可解释的图像分类。其核心创新包括:摒弃传统依赖\textttCLS token的方式,改用平均池化直接连接patch嵌入至模型特征,从而实现特征在输入空间中的空间定位;同时引入稀疏性损失以减少空间散射和背景噪声,确保解释聚焦于相关物体区域。实验表明,DINO-QPM在保持甚至超越DINOv2线性探针准确率的同时,显著提升了可解释性质量。
链接: https://arxiv.org/abs/2604.07166
作者: Robert Zimmermann,Thomas Norrenbrock,Bodo Rosenhahn
机构: Leibniz University Hannover (汉诺威大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted to the 5th Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2026
Abstract:Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the \textttCLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model’s features and therefore enable spatial localisation of DINO-QPM’s globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.
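下面以一个极简 Python 片段示意摘要中的两个要点:对冻结骨干输出的 patch 嵌入做平均池化得到图像特征(而非使用 CLS token),以及鼓励归因空间紧凑的稀疏性惩罚。代码仅为示意(数据为玩具数值,L1 形式的惩罚是本文假设,非论文原实现):

```python
def average_pool_patches(patches):
    """把 N 个 D 维 patch 嵌入平均池化为单个 D 维图像特征,
    绕开 CLS token(与摘要描述一致)。"""
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]

def sparsity_penalty(attribution):
    """用 L1 范数近似的稀疏性惩罚,鼓励归因图空间上更紧凑
    (具体损失形式摘要未给出,此处为假设)。"""
    return sum(abs(v) for row in attribution for v in row)

# 玩具示例:4 个 patch,3 维嵌入
patches = [[1.0, 0.0, 2.0],
           [3.0, 0.0, 0.0],
           [1.0, 4.0, 0.0],
           [3.0, 0.0, 2.0]]
feat = average_pool_patches(patches)  # -> [2.0, 1.0, 1.0]
```

实际系统中 patch 嵌入来自冻结的 DINOv2 骨干,稀疏性损失作用于可解释特征在输入空间的定位,以减少空间散射与背景噪声。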
[HC-6] Mixed-Initiative Context: Structuring and Managing Context for Human-AI Collaboration
【速读】:该论文旨在解决人机协作(Human-AI collaboration)中上下文管理缺乏动态组织与可控性的问题。当前系统通常将多轮交互形成的上下文扁平化为时间顺序序列,作为固定整体进行后续推理,但这种处理方式无法区分上下文在生命周期、结构层次和相关性上的差异,导致临时对话或并行话题线程占据有限上下文窗口,引发干扰甚至冲突;同时用户只能通过间接输入修改(如修正、引用或忽略)来影响上下文,控制权既不明确也不可验证。解决方案的关键在于提出“混合主动权上下文”(Mixed-Initiative Context),将多轮交互中形成的上下文重构为一种显式、结构化且可操作的交互对象,使上下文的结构、范围和内容能够根据任务需求动态调整,从而实现人类与AI对上下文构建与调控的共同参与。
链接: https://arxiv.org/abs/2604.07121
作者: Haichang Li,Qinshi Zhang,Piaohong Wang,Zhicong Lu
机构: George Mason University (乔治梅森大学); University of California, San Diego (加州大学圣地亚哥分校); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 19 pages, 3 figures, 1 table. Appendix on pages 13-19 (main text is self-contained)
Abstract:In the human-AI collaboration area, the context formed naturally through multi-turn interactions is typically flattened into a chronological sequence and treated as a fixed whole in subsequent reasoning, with no mechanism for dynamic organization and management along the collaboration workflow. Yet these contexts differ substantially in lifecycle, structural hierarchy, and relevance. For instance, temporary or abandoned exchanges and parallel topic threads persist in the limited context window, causing interference and even conflict. Meanwhile, users are largely limited to influencing context indirectly through input modifications (e.g., corrections, references, or ignoring), leaving their control neither explicit nor verifiable. To address this, we propose Mixed-Initiative Context, which reconceptualizes the context formed across multi-turn interactions as an explicit, structured, and manipulable interactive object. Under this concept, the structure, scope, and content of context can be dynamically organized and adjusted according to task needs, enabling both humans and AI to actively participate in context construction and regulation. To explore this concept, we implement Contextify as a probe system and conduct a user study examining users’ context management behaviors, attitudes toward AI initiative, and overall collaboration experience. We conclude by discussing the implications of this concept for the HCI community. Comments: 19 pages, 3 figures, 1 table. Appendix on pages 13-19 (main text is self-contained) Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) ACMclasses: H.5.2 Cite as: arXiv:2604.07121 [cs.HC] (or arXiv:2604.07121v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2604.07121 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-7] Workmanship of Learning: Embedding Craftsmanship Values in AI-Integrated Educational Tools
【速读】:该论文旨在解决生成式 AI(Generative AI)在设计教育中因过度强调自动化与效率而对传统以探索、反思和责任为核心的学习模式带来的挑战。其解决方案的关键在于提出“AI工匠精神”(AI Craftsmanship)这一价值导向框架,借鉴手工艺传统中的风险(risk)、节奏(rhythm)与关怀(care)等核心价值,并通过设计一个嵌入这些价值观的创造性编程工具,将它们内化于交互与界面设计之中,而非仅关注输出结果。该工具通过限制AI能力、鼓励迭代实验以及突出反思过程,支持学习者在生成式图案设计中建立更具责任感和审美判断力的学习体验,从而实现面向反思性、负责任且具工匠精神的设计教育目标。
链接: https://arxiv.org/abs/2604.07118
作者: Tuan-Ting Huang,Janet Yi-Ching Huang,Stephan Wensveen
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026 LBW
Abstract:Generative AI’s emphasis on automation and efficiency challenges design education, where learning is grounded in exploration, reflection, and responsibility. This work introduces AI Craftsmanship, a value-oriented framework drawing on craftsmanship traditions that emphasize risk, rhythm, and care as central to learning through making. Through a Research through Design (RtD) approach, we designed an AI-integrated creative coding tool embedding these values into interactions and interface rather than outcomes. The tool supports designers learning generative pattern-making with this http URL by constraining AI, encouraging iterative experimentation, and foregrounding reflection. We studied the tool with five design practitioners through one-hour sessions and semi-structured interviews. Findings show craft values manifest unevenly: risk and rhythm shape early sense-making, while care emerges through reflective practices. Emergent values – such as aesthetic judgment and confidence – also motivated learning. AI Craftsmanship mediates values, tools, and materials, offering a value-driven perspective on designing AI systems for reflective, responsible, craft-informed learning in design education.
[HC-8] BioMoTouch: Touch-Based Behavioral Authentication via Biometric-Motion Interaction Modeling
【速读】:该论文旨在解决现有基于触控行为的认证系统在面对复杂对抗行为和真实环境变化时鲁棒性不足的问题,其核心在于忽视了触控交互的多维特性。解决方案的关键在于提出BioMoTouch框架,该框架首次将电容式触摸屏信号与惯性传感器数据融合,分别捕捉用户特有的生理接触结构(如指型和骨骼特征)与行为运动动态,并通过显式学习二者协同作用,构建统一的触控行为表征,从而在不依赖额外硬件的前提下显著提升认证准确性与抗攻击能力。
链接: https://arxiv.org/abs/2604.07071
作者: Zijian Ling,Jianbang Chen,Hongwei Li,Hongda Zhai,Man Zhou,Jun Feng,Zhengxiong Li,Qi Li,Qian Wang
机构: Huazhong University of Science and Technology (华中科技大学); University of Colorado Denver (科罗拉多大学丹佛分校); Tsinghua University (清华大学); Wuhan University (武汉大学)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
备注: 13 pages
Abstract:Touch-based authentication is widely deployed on mobile devices due to its convenience and seamless user experience. However, existing systems largely model touch interaction as a purely behavioral signal, overlooking its intrinsic multidimensional nature and limiting robustness against sophisticated adversarial behaviors and real-world variations. In this work, we present BioMoTouch, a multi-modal touch authentication framework on mobile devices grounded in a key empirical finding: during touch interaction, inertial sensors capture user-specific behavioral dynamics, while capacitive screens simultaneously capture physiological characteristics related to finger morphology and skeletal structure. Building upon this insight, BioMoTouch jointly models physiological contact structures and behavioral motion dynamics by integrating capacitive touchscreen signals with inertial measurements. Rather than combining independent decisions, the framework explicitly learns their coordinated interaction to form a unified representation of touch behavior. BioMoTouch operates implicitly during natural user interactions and requires no additional hardware, enabling practical deployment on commodity mobile devices. We evaluate BioMoTouch with 38 participants under realistic usage conditions. Experimental results show that BioMoTouch achieves a balanced accuracy of 99.71% and an equal error rate of 0.27%. Moreover, it maintains false acceptance rates below 0.90% under artificial replication, mimicry, and puppet attack scenarios, demonstrating strong robustness against partial-factor manipulation.
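摘要的核心思路是将电容屏(生理)与惯性传感器(行为)特征融合为统一触控表征后再做认证判定。下面是一个高度简化的示意:用拼接代替论文中学习到的协同交互建模,用余弦相似度阈值代替其判定模块(阈值与数值均为假设):

```python
import math

def fuse_features(capacitive, inertial):
    """拼接生理(电容)与行为(惯性)特征为统一触控表征;
    拼接只是论文中"学习到的协同交互建模"的占位简化。"""
    return capacitive + inertial

def verify(template, probe, threshold=0.9):
    """余弦相似度超过阈值则判定为同一用户(阈值为示意值)。"""
    dot = sum(a * b for a, b in zip(template, probe))
    norm = (math.sqrt(sum(a * a for a in template)) *
            math.sqrt(sum(b * b for b in probe)))
    return dot / norm >= threshold

template = fuse_features([1.0, 0.0], [0.0, 1.0])
genuine  = fuse_features([0.9, 0.1], [0.1, 0.9])   # 同一用户的另一次触控
impostor = fuse_features([1.0, 0.0], [1.0, 0.0])   # 行为动态不同的攻击样本
```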
[HC-9] Physics-driven Sonification for Improving Multisensory Needle Guidance in Percutaneous Epicardial Access
【速读】:该论文旨在解决经皮心外膜入路(Percutaneous Epicardial Access, PEA)过程中,因心脏动态运动导致的穿刺针精准定位困难与操作风险高的问题。其核心解决方案是提出一种基于物理驱动的声音编码(physics-driven sonification)方法,结合扩展现实(Extended Reality, XR)多感官导航系统,在实时针尖位置追踪的基础上,通过视觉显示动态心脏解剖结构,并利用多层物理膜模型将生理心脏状态转化为听觉反馈,从而增强术者对时空信息的感知能力。实验表明,该方法显著提升了导航安全性与准确性,降低了心肌接触率并减少认知负荷,验证了其在复杂心脏介入手术中提升用户中心导航效能的可行性。
链接: https://arxiv.org/abs/2604.06911
作者: Veronica Ruozzi,Sasan Matinfar,Pasquale Vergara,Alessandro Albanesi,Serena Dell’Aversana,Stefano Carugo,Gianluigi Buccoliero,Nassir Navab,Alberto Redaelli,Emiliano Votta
机构: Politecnico di Milano(米兰理工大学); Technische Universität München(慕尼黑工业大学); Federico II University of Naples(那不勒斯费德里科二世大学); University of Milan(米兰大学); Ospedale S. Maria Delle Grazie - ASL Napoli 2 Nord(圣玛丽亚格拉齐亚医院-那不勒斯2号卫生局); ASST Bergamo Ovest(贝加莫西部卫生服务局)
类目: Human-Computer Interaction (cs.HC)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Percutaneous epicardial access (PEA), performed on a beating heart under fluoroscopy, enables arrhythmia treatment. However, advancing a needle toward the thin and moving pericardium remains highly challenging and risky. To address this problem, we present a physics-driven sonification method for Extended Reality (XR)-based multisensory navigation to enhance user perception during the critical needle landing phase in PEA. Dynamic cardiac anatomy from 4D CTA was reconstructed and registered to a real-world coordinate system. Real-time needle tracking provided the position of the needle tip relative to moving cardiac structures and drove an audio-visual feedback module. The visual display presented navigational cues and dynamic anatomy, while the auditory display encoded physiological cardiac states using a multilayer physical membrane model. A phantom study was conducted with twelve cardiologists performing needle insertions under visual-only and multisensory feedback. The multisensory method significantly improved navigation safety ( \chi^2 = 11.30 , p < 0.01 ), reducing myocardial contact (3.64% vs. 7.27%) and increasing correct access (90.91% vs. 52.73%). Needle placement accuracy improved, with closer membrane proximity (Cliff delta = 0.19) and reduced variability ( p < 0.05 ). Execution time was comparable, while time-accuracy correlations differed significantly between modalities ( p < 0.01 ). NASA-TLX indicated lower cognitive load with multisensory guidance ( p < 0.01 ). These results demonstrate the feasibility of physics-driven sonification for improving spatiotemporal awareness and supporting user-centered surgical navigation.
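听觉通道需要把针尖与心包膜的空间关系实时映射为声音参数。下面是一个假设性的线性"距离到音高"映射示意(论文实际驱动的是多层物理膜模型,函数、范围与频率数值均为本文举例):

```python
def distance_to_pitch(distance_mm, d_min=0.0, d_max=20.0,
                      f_low=220.0, f_high=880.0):
    """把针尖到膜的距离映射为音频频率:越接近膜,音高越高。
    线性映射与频率范围均为示意;论文采用物理膜模型而非此公式。"""
    d = min(max(distance_mm, d_min), d_max)     # 截断到有效范围
    t = 1.0 - (d - d_min) / (d_max - d_min)     # 0(最远)到 1(接触)
    return f_low + t * (f_high - f_low)
```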
[HC-10] Digital Skin Digital Bias: Uncovering Tone-Based Biases in LLM s and Emoji Embeddings WWW’26
【速读】:该论文旨在解决生成式 AI(Generative AI)在在线交流中对肤色色调表情符号(skin-toned emojis)的表示可能加剧社会偏见的问题。研究发现,尽管大型语言模型(LLMs)在处理肤色修饰符方面表现稳健,但专用的表情符号嵌入模型(如emoji2vec和emoji-sw2v)存在严重缺陷;更关键的是,通过语义一致性、表征相似性、情感极性和核心偏见的多维分析,揭示了不同肤色表情符号在情感倾向和语义含义上的系统性偏差。解决方案的关键在于对AI模型进行系统性审计与偏差缓解,以确保其在网络平台上的应用促进真正的公平性,而非强化既有社会偏见。
链接: https://arxiv.org/abs/2604.06863
作者: Mingchen Li,Wajdi Aljedaani,Yingjie Liu,Navyasri Meka,Xuan Lu,Xinyue Ye,Junhua Ding,Yunhe Feng
机构: University of North Texas(北德克萨斯大学); International Center for Artificial Intelligence Research and Ethics(人工智能研究与伦理国际中心); Southern Methodist University(南卫理公会大学); The University of Arizona(亚利桑那大学); The University of Alabama(阿拉巴马大学)
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted at WWW’26
Abstract:Skin-toned emojis are crucial for fostering personal identity and social inclusion in online communication. As AI models, particularly Large Language Models (LLMs), increasingly mediate interactions on web platforms, the risk that these systems perpetuate societal biases through their representation of such symbols is a significant concern. This paper presents the first large-scale comparative study of bias in skin-toned emoji representations across two distinct model classes. We systematically evaluate dedicated emoji embedding models (emoji2vec, emoji-sw2v) against four modern LLMs (Llama, Gemma, Qwen, and Mistral). Our analysis first reveals a critical performance gap: while LLMs demonstrate robust support for skin tone modifiers, widely-used specialized emoji models exhibit severe deficiencies. More importantly, a multi-faceted investigation into semantic consistency, representational similarity, sentiment polarity, and core biases uncovers systemic disparities. We find evidence of skewed sentiment and inconsistent meanings associated with emojis across different skin tones, highlighting latent biases within these foundational models. Our findings underscore the urgent need for developers and platforms to audit and mitigate these representational harms, ensuring that AI’s role on the web promotes genuine equity rather than reinforcing societal biases.
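论文对不同肤色变体的表征相似性与语义一致性进行了多维分析。下面用玩具向量示意一种常见做法:以基础表情嵌入与各肤色变体嵌入的平均余弦相似度衡量"肤色不变性"(度量选择与数据均为本文假设,非论文原实现):

```python
import math

def cosine(u, v):
    """两个嵌入向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tone_consistency(base, tone_variants):
    """基础表情嵌入与各肤色变体嵌入的平均余弦相似度;
    越接近 1 表示语义越不随肤色修饰符变化。"""
    return sum(cosine(base, v) for v in tone_variants) / len(tone_variants)

# 玩具嵌入(纯属假设):某表情的基础向量与两个肤色变体
base = [1.0, 0.0]
variants = [[1.0, 0.0], [0.0, 1.0]]
score = tone_consistency(base, variants)  # (1.0 + 0.0) / 2 = 0.5
```

得分明显低于 1 即提示不同肤色变体在嵌入空间中含义不一致,与论文发现的系统性差异方向一致。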
[HC-11] MemoryDiorama: Generating Dynamic 3D Diorama from Everyday Photos for Memory Recall
【速读】:该论文旨在解决个人媒体(如照片)在回忆 autobiographical memory(自传体记忆)时线索不足、记忆细节模糊的问题。其解决方案的关键在于提出 MemoryDiorama 系统,通过引入增强型记忆线索(augmented memory cues),利用大语言模型(LLM)进行场景分析,并结合 3D 对象生成、动画与空间构图技术,将静态照片转化为具有动态元素的混合现实(mixed reality)三维微缩景观(diorama)。系统从照片中提取地理信息、物体属性、光照条件和氛围要素,并通过生成式组件(如物体动画、人物动作、地理特效和粒子效果)对这些元素进行可视化增强,从而显著提升用户回忆时的内部细节、感知细节和视觉生动性。
链接: https://arxiv.org/abs/2604.06773
作者: Keiichi Ihara,Tianle Li,Yasuhisa Shiino,Ryo Suzuki
机构: University of Colorado Boulder(科罗拉多大学博尔德分校)
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages, 11 figures
Abstract:We present MemoryDiorama, a prototype system that introduces augmented memory cues, a concept that extends captured personal media with AI-generated contextual information to enhance autobiographical memory recall. MemoryDiorama transforms everyday photos into dynamic 3D dioramas in mixed reality by integrating LLM-based scene analysis with 3D object generation, animation, and spatial composition. The system extracts geographic information, object attributes, lighting conditions, and atmospheric elements from the photos. It then animates these elements with generative components such as object animations, human motion, geographical effects, and particle effects to provide richer cues for memory recall. We evaluated MemoryDiorama in a within-subject user study with 18 participants, comparing three conditions: Photo-Only, Static Diorama, and MemoryDiorama. Compared with both Photo-Only and Static Diorama, MemoryDiorama elicited more internal and in-cue details during recall. It also increased perceptual details and visual vividness ratings, suggesting richer recollective experience.
[HC-12] Understanding Data Collection Brokerag e and Spam in the Lead Marketing Ecosystem
【速读】:该论文旨在解决 lead marketing 生态系统中隐私泄露与垃圾营销风险缺乏实证研究的问题,尤其关注健康类个人数据在收集、共享及后续营销中的滥用情况。其解决方案的关键在于构建了一个端到端的测量框架,通过监控超过100个健康相关线索生成网站以及200个受控电话号码和电子邮件地址,追踪数据从采集到消费者接触的完整链条,并结合购买真实线索与人工模拟线索的方式识别出数据被转售给未审核第三方、甚至被伪造或篡改的恶意行为。该方法揭示了生态系统中存在的大规模数据共享、欺骗性中介操作及高频骚扰通信模式,为理解并监管此类高敏感度数据商业化提供了实证基础。
链接: https://arxiv.org/abs/2604.06759
作者: Yash Vekaria,Nurullah Demir,Konrad Kollnig,Zubair Shafiq
机构: UC Davis; Stanford University; Maastricht University
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:The lead marketing ecosystem enables collection, sale, and use of personal data submitted via web forms to deliver personalized quotes in high-value verticals such as insurance. Despite its scale and sensitivity of the collected data, this ecosystem remains largely unexplored by the research community. We present the first empirical study of privacy and spam risks in lead marketing, developing an end-to-end measurement framework to trace data flows from data collection to consumer contact. Our setup instruments over 100 health-related lead-generation websites and monitors 200 controlled phone numbers and email addresses to understand downstream marketing practices. We observe sharing of highly personal and sensitive health information to more than 70 distinct third parties on these lead generation websites. By purchasing our own and other organic leads from three major lead platforms, we uncover deceptive brokerage practices, where consumer data is sold to unvetted buyers and often augmented or fabricated with attributes such as health status and weight. We received a total of over 8,000 telemarketing phone calls, 600 text messages, and 200 emails, where calls often began within seconds of form submission. Many campaigns relied on VoIP-based neighbor spoofing and high-frequency dialing, at times rendering phones unusable. Our experiments with phone and email opt-outs suggest phone-based opt-outs to help the most, although all were ineffective at completely stopping marketing communications. Analysis of 7,432 Better Business Bureau (BBB) complaints and reviews corroborates these findings from the consumer perspective. Overall, our results reveal a highly interconnected and non-compliant lead marketing ecosystem that aggressively monetizes sensitive consumer data.
[HC-13] Meaningful Human Command: Towards a New Model for Military Human-Robot Interaction
【速读】:该论文旨在解决军事人机交互(Military Human Robot Interaction, MHRI)中因人类与人工智能(Artificial Intelligence, AI)系统间动态关系日益紧密而产生的设计、实施与操作难题,特别是现有“有意义的人类控制”(Meaningful Human Control, MHC)概念在军事场景下适用性不足的问题。解决方案的关键在于提出并论证“有意义的人类指挥”(Meaningful Human Command, MHC1)这一更具操作性的新概念,以更好适配嵌入AI赋能自主系统的先进军事指挥与控制系统,并通过一个技术可行的作战场景案例(vignette)来阐释其内涵、指导实践并凸显MHRI相关挑战。
链接: https://arxiv.org/abs/2604.06611
作者: Adam Hepworth,Zena Assaad,Austin Wyatt,Hussein Abbass
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 41 pages
Abstract:Military human robot interaction (MHRI) presents a novel opportunity to blend the capabilities of autonomous and Artificial Intelligence (AI)-enabled systems with the skills and expertise of humans. The concept promises military advantages and greater operational effectiveness and efficiencies. However, the associated human-AI dynamics create challenges when attempting to design, implement, and operationalise the increasingly symbiotic relationship between humans and machines. Meaningful human control (MHC) is a popularised conceptualisation of what is deemed a responsible interaction among human and artificial agents; however, this notion falls short in military contexts and hinders the realisation of military advantages that could be achieved by advancing the adoption of responsible AI. This paper presents meaningful human command (MHC1) as a more operationally effective concept for advanced military command and control systems that embed AI-enabled autonomous systems. We introduce, explore, and unpack meaningful human command in the context of military human-robot interaction, presenting a vignette that offers a technologically feasible concept of an AI-enabled system within military operations. The vignette is used to guide, contextualise, and add realism to the narrative describing the concept and highlights associated MHRI challenges.
[HC-14] Language-Guided Multimodal Texture Authoring via Generative Models
【速读】:该论文旨在解决真实触感纹理创作中因依赖低层级参数调优和反复试错而导致的效率低下、透明度不足及创意局限性问题。其解决方案的关键在于提出一种基于自然语言驱动的多模态纹理生成系统,通过共享的语言对齐潜在空间(language-aligned latent space)将文本提示与两个协调的触觉通道(滑动振动与敲击瞬态)及视觉预览(由扩散模型生成)关联起来,使单一文本指令即可生成语义一致的触觉与视觉信号,从而实现以提示(prompt)为先的设计流程,替代传统手动参数调整,提升创作效率与可解释性。
链接: https://arxiv.org/abs/2604.06489
作者: Wanli Qian,Aiden Chang,Shihan Lu,Michael Gu,Heather Culbertson
机构: 未知
类目: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 14 pages, 13 figures, accepted to IEEE Haptics Symposium 2026
Abstract:Authoring realistic haptic textures typically requires low-level parameter tuning and repeated trial-and-error, limiting speed, transparency, and creative reach. We present a language-driven authoring system that turns natural-language prompts into multimodal textures: two coordinated haptic channels - sliding vibrations via force/speed-conditioned autoregressive (AR) models and tapping transients - and a text-prompted visual preview from a diffusion model. A shared, language-aligned latent links modalities so a single prompt yields semantically consistent haptic and visual signals; designers can write goals (e.g., “gritty but cushioned surface,” “smooth and hard metal surface”) and immediately see and feel the result through a 3D haptic device. To verify that the learned latent encodes perceptually meaningful structure, we conduct an anchor-referenced, attribute-wise evaluation for roughness, slipperiness, and hardness. Participant ratings are projected to the interpretable line between two real-material references, revealing consistent trends - asperity effects in roughness, compliance in hardness, and surface-film influence in slipperiness. A human-subject study further indicates coherent cross-modal experience and low effort for prompt-based iteration. The results show that language can serve as a practical control modality for texture authoring: prompts reliably steer material semantics across haptic and visual channels, enabling a prompt-first, designer-oriented workflow that replaces manual parameter tuning with interpretable, text-guided refinement.
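评测中将被试的属性评分投影到两种真实材料参照的连线上。下面用标量投影公式示意这种"锚点参照"读数,0 表示落在参照 A,1 表示落在参照 B(公式与数值均为本文示意,非论文原式):

```python
def project_to_anchor_line(rating, anchor_a, anchor_b):
    """把评分向量投影到两材料参照连线:返回标量位置,
    0 对应参照 A,1 对应参照 B(标量投影为本文假设的实现)。"""
    ab = [b - a for a, b in zip(anchor_a, anchor_b)]
    ra = [r - a for a, r in zip(anchor_a, rating)]
    return sum(x * y for x, y in zip(ra, ab)) / sum(x * x for x in ab)

a = [1.0, 1.0]   # 例如:光滑参照材料的属性评分
b = [5.0, 5.0]   # 例如:粗糙参照材料的属性评分
t = project_to_anchor_line([3.0, 3.0], a, b)  # 恰在中点 -> 0.5
```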
[HC-15] Breaking Negative Cycles: A Reflection-To-Action System For Adaptive Change
【速读】:该论文旨在解决负性心理健康循环(如反刍思维和反复懊悔)难以突破的问题,核心在于如何将自我觉察有效转化为行为改变。解决方案的关键在于设计并验证了一种技术赋能的自我反思系统(Technologies Supporting Self-Reflection, TSR),该系统整合了行为阶段理论(Transtheoretical Model, TTM)与情绪调节过程模型(Gross’s Emotion Regulation, ER Process Model),通过结构化反思模块——WhatIf-Planning,将个体对后悔和愿望的记录与具身化的“如果-那么”行动计划相结合。研究发现,相较于自由书写组,基于Gross模型引导的干预组在生成更多反事实替代方案、制定具体行动计划及实施自驱动改变方面表现更优,从而实现了从反思到行动的转化机制。
链接: https://arxiv.org/abs/2604.06477
作者: Minsol Michelle Kim,Daniel M. Low,David Lafond,Eugene Shim,Michelle Han,Mohanad Kandil,Chenyu Zhang,Theo Kitsberg,Chelsea Boccagno,Paul Pu Liang,Pattie Maes
机构: MIT Media Lab (媒体实验室); Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学); Child Mind Institute (儿童心智研究所); Stanford University (斯坦福大学); Wellesley College (韦尔斯利学院); Technical University of Munich (慕尼黑工业大学); University of Cambridge (剑桥大学); Massachusetts General Hospital (马萨诸塞总医院); Harvard T.H. Chan School of Public Health (哈佛大学陈曾熙公共卫生学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Breaking negative mental health cycles, including rumination and recurring regrets, requires reflection that translates awareness into behavioral change. Grounded in the Transtheoretical Model (TTM) and Gross’s Emotion Regulation (ER) Process Model, we examine how Technologies Supporting Self-Reflection (TSR) bridge reflection and action. In a 15-day in-the-wild study (N = 20), participants used a voice-based journaling system to capture regrets and wishes and engaged in WhatIf-Planning, a novel structured reflection module integrating counterfactual thinking with if-then planning. Participants were randomized to either a free-form condition or a Gross-guided condition, which maps the five processes of Gross’s ER model into explicit journaling prompts. We contribute: (1) a unified reflection-to-action TSR system that operationalizes the Preparation stage of TTM to bridge Contemplation and Action, and (2) triangulated empirical evidence from an in-the-wild journaling study that first operationalizes Gross’s Process Model, revealing effects on coping flexibility and emotion regulation in daily life. Results show significant pre-post improvements in coping flexibility, indicating adaptive self-regulation across conditions, with the Gross-guided group generating more counterfactual alternatives, articulating concrete if-then action plans, and implementing more plans for self-driven change.
[HC-16] Intimate Strangers by Design: A Uses and Gratifications Analysis of AI Companionship
【速读】:该论文旨在解决当前关于对话式人工智能(Conversational AI)伴侣用户经验的学术理解不足问题,尤其是现有研究多聚焦于“有害”与“有益”的二元评价,而忽视了用户真实需求、平台功能如何中介需求满足以及使用行为随时间演变的动态过程。其解决方案的关键在于运用使用与满足理论(Uses and Gratifications, UG)框架,结合对20名AI陪伴平台用户的访谈和质性内容分析,揭示三类核心发现:一是用户获得的满足感虽可归入传统UG类别,但被对话式AI特有的功能 affordances(如持续可用性、个性化和无社会评判)所重塑;二是出现了三种新类型的满足感——创造性协作(作为关系共构)、关系模拟(作为人际训练)和性/浪漫满足(作为权利再认),这些均源于用户与AI互动中主动构建体验的过程;三是满足感随时间从工具性初始动机转向情感投入,并在部分用户中出现自我调节的使用模式,表明疗愈功能实现后使用行为趋于收敛。这一研究拓展了UG理论在交互式AI场景下的适用边界,为制定基于实证的治理策略提供了关键依据。
链接: https://arxiv.org/abs/2604.06419
作者: Dayeon Eom,Julianne Renner,Sedona Chinn
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:Conversational AI companions have grown prominent in public discourse, yet scholarly understanding of user experiences remains limited, with existing research organized around evaluative poles of harm and benefit rather than examining what users seek, how affordances mediate need fulfillment, or how use evolves over time. Drawing on interviews with 20 users of AI companionship platforms and qualitative content analysis informed by Uses and Gratifications (UG) theory, this study offers three contributions. First, participants reported gratifications mapping onto established UG categories but qualitatively inflected by conversational AI’s distinctive affordances, such as persistent availability, personalization, and absence of social judgment. Second, several gratifications, creative collaboration as relational co-production, relational simulation as interpersonal training, and sexual/romantic satisfaction as reclamation, do not map onto existing typologies, instead emerging through interactive processes in which users actively simulate experiences with AI. Third, gratifications shifted over time, moving from instrumental entry points toward emotional engagement and, in some cases, self-regulated moderation after therapeutic functions were fulfilled. These findings extend UG by identifying gratification processes unique to interactive AI and suggest governance efforts would benefit from an empirically grounded understanding of how and why users engage with AI companions.
[HC-17] Trust in AI among Middle Eastern CS Students: Investigating Students' Trust and Usage Patterns Across Saudi Arabia, Kuwait and Jordan
【速读】:该论文旨在解决当前关于人工智能(Artificial Intelligence, AI)信任研究主要基于西方、受教育、工业化、富裕和民主(WEIRD)群体的局限性问题,探究中东阿拉伯语地区计算机科学学生对AI的信任及其影响因素。其关键解决方案在于通过在沙特阿拉伯、科威特和约旦三所大学复制一项针对美国学生的AI信任研究,系统分析语言流利度、性别和首代大学生身份等因素对AI信任的影响,发现语言能力是预测AI信任的关键变量,并揭示了性别差异在不同国家间的显著异质性——例如沙特女性学生对AI的信任低于男性,而其他国家则无明显性别差异,从而强调需为非WEIRD地区设计契合当地文化与语言需求的AI系统以促进技术公平采纳。
链接: https://arxiv.org/abs/2604.06418
作者: Saleh Alkhamees,Ali Alfageeh,Bader Alkhazi,Duaa Alshdaifat,Amin Alipour
机构: University of Houston (休斯顿大学); Kuwait University (科威特大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Background and Context: Artificial intelligence (AI) tools have been reshaping computing and computer science education. Trust in AI is a determining factor in the adoption of these tools. Recent studies have shown different trust factors across gender and first-generation status among students. However, these studies have focused mainly on Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations, and their generalizability to other populations with different languages and cultures is unclear. Objective: This study aims to evaluate trust in AI among Middle Eastern computer science students and the factors that can impact it. Method: We replicate a recent study of trust in four universities in three Middle Eastern, Arabic-speaking countries: Saudi Arabia, Kuwait, and Jordan. We analyze trust among students across different factors such as gender and first-generation status. Findings: Our results suggest that language fluency can predict trust in AI. Moreover, unlike the results from the US population where female students tended to trust AI more than their male peers, female students in Saudi Arabia indicated lower trust compared to their male counterparts, and we did not observe any noticeable differences across gender in the other countries. We also found a generally negative correlation between English language proficiency and students’ confidence. Implications: This study highlights differences in students’ adoption and trust in AI even within the same region. It emphasizes the need for more investigation into students’ adoption and interaction in non-WEIRD regions for equitable adoption of this technology. It also suggests a need for efforts in designing effective AI systems tailored to the cultural and linguistic needs of the region.
[HC-18] Reproducibility Beyond Artifacts: Interactional Support for Collaborative Machine Learning
【速读】:该论文旨在解决机器学习(Machine Learning, ML) reproducibility问题,其核心挑战不仅在于实验要素(如数据集、代码、配置和执行环境)的缺失,更在于跨学科协作中对先前工作的理解困难、组件演化过程中的对齐障碍以及随时间推移实验意图的重构难题。针对这些问题,作者提出一个两层社会技术系统:第一层是生命周期感知的可追溯性基础设施,用于结构化记录ML项目全生命周期中的各类Artifact;第二层是交互层,通过促进协调、解释与共享理解来弥合团队成员之间的认知鸿沟。该方案的关键创新在于引入AI驱动的语义接口,将可复现性重新定义为一种持续的社会技术实践,而非静态的记录属性,从而推动面向人类中心的ML基础设施设计。
链接: https://arxiv.org/abs/2604.06414
作者: Zhiwei Li,Carl Kesselman
机构: University of Southern California (南加州大学); Information Sciences Institute (信息科学研究所)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at CHI EA 2026 (Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems), 5 pages
Abstract:Machine learning (ML) reproducibility is often framed as a problem of incomplete artifact recording. This framing leads to systems that prioritize capturing datasets, code, configurations, and execution this http URL, in collaborative and interdisciplinary ML projects, reproducibility failures often arise not only from missing artifacts but from difficulties in interpreting prior work, aligning evolving components, and reconstructing experimental intent over time. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, we identify recurring interactional breakdowns that persist despite comprehensive structural traceability. Based on these findings, we propose a two-layer socio-technical ML management system combining lifecycle-aware artifact infrastructure with an interactional layer designed to mediate coordination, explanation, and shared understanding. We discuss how an AI-mediated semantic interface reframes reproducibility as an ongoing socio-technical accomplishment rather than a static property of recorded traces, and outline implications for human-centered ML infrastructure design.
[HC-19] Intimacy as Service Harm as Externality: Critical Perspectives on AI Companion Platform Accountability
【速读】:该论文旨在解决当前关于人工智能(AI)伴侣引发的亲密关系问题中,主流叙事将责任归咎于用户心理而非平台架构设计的偏差问题。其核心贡献在于通过批判性数据研究与平台研究视角,揭示用户在使用AI伴侣过程中所遭遇的设计型伤害(如未经同意的内容生成、保护机制对用户的污名化)和使用型伤害(如难以化解的情感依赖),并指出用户被迫承担全部风险缓解责任,而平台与监管机构却处于问责真空状态。解决方案的关键在于重新审视平台治理结构,强调需从系统层面重构责任分配机制,而非仅依赖个体化的应对策略(如自我规训、隐私合理化等),从而打破由平台制造的脆弱性通过用户解释劳动得以自我维持的恶性循环。
链接: https://arxiv.org/abs/2604.06381
作者: Dayeon Eom,Julianne Renner,Sedona Chinn
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:This paper examines artificial intelligence (AI) companionship as a site where intimate relations are simultaneously produced, extracted from, and governed through datafied systems. Drawing on critical data studies and platform studies, we challenge prevailing narratives that locate harm in user psychology rather than platform architecture. Through in-depth interviews with 20 individuals who have AI companions, we address three questions: what harms do users identify, how do they make sense of those harms, and what do their accounts reveal about the perceived distribution of responsibility among users, platforms, and regulators? Participants identified design-based harms, including unsolicited content generation and safety mechanisms that stigmatized the users they intended to protect, alongside use-based harms centered on emotional dependency they could recognize but not resolve. Users deployed individualized sensemaking strategies, including self-regulation, stigma navigation, and privacy rationalization, bearing the full burden of harm mitigation without platform support. On governance, participants described an accountability vacuum in which platforms deflected blame while users articulated conditional preferences that rejected both prohibition and deregulation. The findings extend responsibilization theory by demonstrating how platform-produced vulnerability becomes self-sustaining through the interpretive labor of users who lack structural alternatives.
[HC-20] Navigating Marginalization: Toward Justice-Oriented Socio-Technical Design for Parent-Child Learning among Southeast Asian Immigrant Mothers in Taiwan
【速读】:该论文旨在解决东南亚(Southeast Asian, SEA)移民母亲在台湾参与子女家庭学习过程中所面临的结构性边缘化与文化约束问题,尤其是如何在社会资源受限和身份认同困境中维持其教育参与的能动性。解决方案的关键在于引入以正义为导向的设计视角(justice-oriented lens),通过识别家庭互动中的多重伤害(harms),提出从个体、家庭到社会层面均需强化“承认(recognition)”、“互惠(reciprocity)”和“问责(accountability)”的跨层级设计策略,从而支持移民母亲在人机协同的学习系统中实现更公平的参与与价值传递。
链接: https://arxiv.org/abs/2604.06353
作者: Ying-Yu Chen,Yang Hong,Yan-Rong Chen,Yi-Chieh Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This study investigates how Southeast Asian (SEA) immigrant mothers in Taiwan participate in their children’s home-based learning. Drawing on semi-structured interviews and diary studies, we explore how these mothers navigate sociocultural constraints while fostering engagement and transmitting cultural values. Despite facing diminished agency and structural marginalization, mothers engage creatively in their children’s everyday learning interactions. Guided by a justice-oriented lens, we identify various harms and propose design implications for socio-technical systems that center recognition, reciprocity, and accountability in parent-child learning at the individual, familial, and societal levels. Our contribution lies in foregrounding the role of intersectional identity in parent-child learning and proposing justice-oriented design directions that support the flourishing of immigrant mothers within socio-technical systems.
[HC-21] “It didn’t feel right but I needed a job so desperately”: Understanding People’s Emotions and Help Needs During Financial Scams
【速读】:该论文旨在解决在线金融诈骗(online financial scams)对公众造成的长期且严重的威胁,特别是理解人们在遭遇诈骗过程中不同阶段(事前、事中、事后)的动机与求助需求。研究发现,诈骗者主要利用恐惧(fear)、希望(hope)等情绪进行操纵,并识别出财务不安全感(financial insecurity)和法律脆弱性(legal precarity)等风险因素显著增加个体受骗及受损的可能性。解决方案的关键在于设计情境化干预措施,包括预防(prevention)、诊断(diagnostic)、缓解(mitigation)和恢复(recovery)四类策略,以精准匹配用户在诈骗各阶段的具体帮助需求和情感状态,从而提升整体反诈支持系统的有效性与响应能力。
链接: https://arxiv.org/abs/2604.06218
作者: Jake Chanenson,Tara Matthews,Sunny Consolvo,Patrick Gage Kelley,Jessica McClearn,Sarah Meiklejohn,Abhishek Roy,Renee Shelby,Kurt Thomas,Amelia Hassoun
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 22 pages, 2 figures, 3 tables, to be published in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:Online financial scams represent a long-standing and serious threat for which people seek help. We present a study to understand people’s in situ motivations for engaging with scams and the help needs they express before, during, and after encountering a scam. We identify the main emotions scammers exploited (e.g., fear, hope) and characterize how they did so. We examine factors – such as financial insecurity and legal precarity – which elevate people’s risk of engaging with specific scams and experiencing harm. We indicate when people sought help and describe their help-seeking needs and emotions at different stages of the scam. We discuss how these needs could be met through the design of contextually-specific prevention, diagnostic, mitigation, and recovery interventions.
[HC-22] SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在个性化过程中依赖对话历史而难以全面捕捉用户真实特征的问题,因为对话数据仅反映用户自我披露的信息,无法体现其日常物理世界的行为模式。为此,作者提出SensorPersona系统,其核心创新在于通过无感采集移动设备上的多模态长期传感器流,结合面向个体的上下文编码、分层人格推理(整合episode内与episode间推理机制)以及聚类感知的增量验证和时间证据驱动的更新策略,实现对用户物理行为模式、心理社会特质及生活经历等多维度稳定人格的持续推断,从而显著提升LLM代理的响应质量与用户满意度。
链接: https://arxiv.org/abs/2604.06204
作者: Bufang Yang,Lilin Xu,Yixuan Li,Kaiwei Liu,Xiaofan Jiang,Zhenyu Yan
机构: The Chinese University of Hong Kong (香港中文大学); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Personalization is essential for Large Language Model (LLM)-based agents to adapt to users’ preferences and improve response quality and task performance. However, most existing approaches infer personas from chat histories, which capture only self-disclosed information rather than users’ everyday behaviors in the physical world, limiting the ability to infer comprehensive user personas. In this work, we introduce SensorPersona, an LLM-empowered system that continuously infers stable user personas from multimodal longitudinal sensor streams unobtrusively collected from users’ mobile devices. SensorPersona first performs person-oriented context encoding on continuous sensor streams to enrich the semantics of sensor contexts. It then employs hierarchical persona reasoning that integrates intra- and inter-episode reasoning to infer personas spanning physical patterns, psychosocial traits, and life experiences. Finally, it employs clustering-aware incremental verification and temporal evidence-aware updating to adapt to evolving personas. We evaluate SensorPersona on a self-collected dataset containing 1,580 hours of sensor data from 20 participants, collected over up to 3 months across 17 cities on 3 continents. Results show that SensorPersona achieves up to 31.4% higher recall in persona extraction, an 85.7% win rate in persona-aware agent responses, and notable improvements in user satisfaction compared to state-of-the-art baselines.
[HC-23] Content Platform GenAI Regulation via Compensation
【速读】:该论文旨在解决生成式 AI(Generative AI)在内容创作中广泛应用所引发的双重问题:一是原始创作者未获得补偿,导致其创作动力下降;二是人类生成内容(Human-Generated Content, HGC)被 GenAI 替代,造成训练数据污染,进而影响未来 GenAI 模型的质量与平台长期可持续性。解决方案的关键在于设计一种基于经济激励的创作者补偿机制,通过直接奖励高质量人类生成内容的生产,无需依赖 AI 检测技术即可提升内容多样性与质量,从而缓解数据污染、增强用户参与度并提高平台收益。
链接: https://arxiv.org/abs/2604.06194
作者: Wee Chaimanowong
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 40 pages, 2 figures, 2 tables
Abstract:The use of Generative AI (GenAI) for creative content generation has gained popularity in recent years. GenAI allows creators to generate contents that are increasingly becoming indistinguishable to the human–generated counter–part at a much lower cost. While GenAI reshapes the competitive landscape of the contents market, the original creators were typically not compensated for their works that were used in the GenAI training. On the other hands, the wide–spread adoption of GenAI threatens to replace the human–generated shares of contents on content platforms, contaminating training data source for future GenAI models. In this paper, we argue that an unregulated usage of GenAI can also be harmful to the platform by causing a contents distribution distortion which can lower the consumers’ engagement and the platform’s profit. We show that a simple economically–driven creator compensation scheme, can incentivize more creation of high–value human–generated contents, without the need for an AI–detector. This reduces the data pollution for future GenAI training, while improves the consumer engagement and the platform’s profit.
[HC-24] SASLO: A Scene-Aware Spatial Layout Optimization System for AR-SSVEP
【速读】:该论文旨在解决增强现实场景下稳态视觉诱发电位(Steady-state Visual Evoked Potential, SSVEP)脑机接口系统在真实户外环境中性能下降的问题,其核心挑战在于现实场景中的光照强度和刺激间距离(Inter-stimulus Distance, ISD)等因素会干扰刺激感知并削弱SSVEP信号的诱发效果。解决方案的关键在于提出一种情境感知的空间布局优化(Scenario-aware Spatial Layout Optimization, SASLO)系统,该系统通过RGB-CIE方法估计环境亮度,并将场景上下文信息引入线性上下文多臂赌博机(Linear Contextual Bandit, LCB)模型中,实现对刺激空间布局的在线自适应优化,从而提升AR-SSVEP在复杂户外环境下的鲁棒性和性能表现。
链接: https://arxiv.org/abs/2604.06190
作者: Beining Cao,Xiaowei Jiang,Charlie Li-Ting Tsai,Daniel Leong,Thomas Do,Chin-Teng Lin
机构: University of Technology Sydney (悉尼科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Steady-state visual evoked potential (SSVEP) is widely used in brain-computer interfaces (BCIs) due to its reliability. With the integration of augmented reality (AR), AR-SSVEP enables more intuitive interaction by embedding visual stimuli into real-world environments. However, unlike conventional computer screen-based SSVEP (CS-SSVEP) systems with stable visual conditions, AR-SSVEP performance is influenced by real-world scene factors, such as luminance and color, which degrade stimulus perception and weaken SSVEP elicitation. Nevertheless, existing studies primarily focus on offline analyses of SSVEP-related factors in indoor settings, while online adaptive optimization for outdoor AR-SSVEP remains limited. Therefore, a scenario-aware spatial layout optimization (SASLO) system for AR-SSVEP is proposed, which jointly considers scene luminance and inter-stimulus distance (ISD) for adaptive stimulus layout optimization. Scene luminance is estimated using an RGB-CIE based method, and the extracted context is incorporated into a linear contextual bandit (LCB) model to recommend optimized spatial layouts. Two pilot single-factor experiments are conducted to characterize the effects of luminance and ISD on SSVEP performance and to construct reliable rewards for model training. An outdoor online experiment with ten subjects further validates the proposed joint optimization method, achieving an average accuracy of 0.89 and an information transfer rate of 35.74 bits/min with a 3 s input window, and consistently outperforming two baseline methods. Overall, the proposed SASLO system is shown to improve the robustness of AR-SSVEP in real-world outdoor environments.
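摘要中提到的线性上下文赌博机(LCB)布局推荐可以用标准的 LinUCB 算法草绘说明。注意:论文并未公开实现细节,以下仅为基于假设的最小示意(上下文特征取场景亮度与归一化刺激间距,臂数、维度与探索系数 alpha 均为假设值);场景亮度按 CIE 相对亮度系数由线性 RGB 近似估算。

```python
import numpy as np

def scene_luminance(rgb):
    """Approximate relative luminance (CIE Y) from a linear-RGB pixel in [0, 1]."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

class LinUCB:
    """Minimal LinUCB: one ridge-regression model per candidate layout (arm)."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # d x d Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sums

    def select(self, x):
        """Pick the arm maximizing the upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                      # ridge-regression estimate
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Incorporate the observed reward for the chosen layout."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Context: [scene luminance, normalized inter-stimulus distance] (assumed features)
bandit = LinUCB(n_arms=3, dim=2)
x = np.array([scene_luminance((0.8, 0.7, 0.6)), 0.5])
arm = bandit.select(x)
bandit.update(arm, x, reward=0.89)  # e.g., observed SSVEP decoding accuracy
```

实际系统中,奖励信号来自 SSVEP 解码表现(论文通过两组单因素实验构建可靠奖励),而不是这里示意的单个标量。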
[HC-25] LLM Spirals of Delusion: A Benchmarking Audit Study of AI Chatbot Interfaces
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在真实用户交互场景中可能诱发或加剧妄想性思维与阴谋论倾向的问题,尤其是评估不同部署方式(如API接口与用户聊天界面)对模型行为的影响。其关键解决方案在于通过对比实际用户交互环境(如ChatGPT桌面应用或网页界面)与传统自动化测试方法(API调用),开展系统性的审计与基准测试,从而揭示出:模型行为不仅受训练数据和架构影响,更显著依赖于政策设定、实时更新机制及多轮对话中的时序动态演化特性,进而强调了透明化模型迭代过程和采用贴近真实使用场景的评估范式对于保障生成式AI安全性的必要性。
链接: https://arxiv.org/abs/2604.06188
作者: Peter Kirgis,Ben Hawriluk,Sherrie Feng,Aslan Bilimer,Sam Paech,Zeynep Tufekci
机构: Princeton University (普林斯顿大学); Phillips Exeter Academy (菲利普斯埃克塞特学院); Independent (独立)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at the 2nd Annual Conference of the International Association for Safe and Ethical Artificial Intelligence
Abstract:People increasingly hold sustained, open-ended conversations with large language models (LLMs). Public reports and early studies suggest that, in such settings, models can reinforce delusional or conspiratorial ideation or even amplify harmful beliefs and engagement patterns. We present an audit and benchmarking study that measures how different LLMs encourage, resist, or escalate disordered and conspiratorial thinking. We explicitly compare API outputs to user chat interfaces, like the ChatGPT desktop app or web interface, which is how people have conversations with chatbots in real life but are almost never used for testing. In total, we run 56 20-turn conversations testing ChatGPT-4o and ChatGPT-5, via both the API and chat interface, and grade each conversation by two research assistants (RAs) as well as by GPT-5. We document five results. First, we observe large differences in performance between the API and chat interface environments, showing that the universally used method of automated testing through the API is not sufficient to assess the impact of chatbots in the real world. Second, when tested in the chat interface, we find that ChatGPT-5 displays less sycophancy, escalation, and delusion reinforcement than ChatGPT-4o, showing that these behaviors are influenced by the policy choices of major AI companies. Third, conversations with nearly identical aggregate intensity in a behavior display large differences in how the behavior evolves turn by turn, highlighting the importance of temporal dynamics in multi-turn evaluation. Fourth, even updated models display substantial levels of negative behaviors, revealing that model improvement does not imply model safety. Fifth, the same API endpoint tested just two months apart yields a complete reversal in behavior, underscoring how transparency in model updates is a necessary prerequisite for robust audit findings.
[HC-26] Skin-Deep Bias: How Avatar Appearances Shape Perceptions of AI Hiring
【速读】:该论文旨在解决生成式 AI (Generative AI) 在招聘场景中应用时,申请人如何感知其公平性的问题,尤其是 avatar 身份线索(如种族和性别)如何影响申请人的公正性判断。解决方案的关键在于通过众包实验设计,利用真实感 AI avatars 在面试情境中操纵 phenotypic traits(表型特征),结合自评问卷、情感分析与眼动追踪技术,系统揭示了身份匹配程度对申请人信任、公平感及偏见感知的影响机制:发现种族不匹配会增强民族偏见感知,而部分匹配(仅共享一种身份属性)比完全匹配或无匹配更降低公平评价。这一发现扩展了“计算机即社会行为体”(Computers-Are-Social-Actors)范式,并为设计更具公平性的 AI 面试系统提供了可操作的实证依据。
链接: https://arxiv.org/abs/2604.06187
作者: Ka Hei Carrie Lau,Philipp Stark,Efe Bozkir,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学); Lund University (隆德大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Artificial intelligence is increasingly used in hiring, raising concerns about how applicants perceive these systems. While prior work on algorithmic fairness has emphasized technical bias mitigation, little is known about how avatar identity cues influence applicants’ justice attributions in an interview context. We conducted a crowdsourcing study with 215 participants who completed an interview with photorealistic AI avatars varied in phenotypic traits (race and sex), followed by a standardized rejection. Using self-reports, sentiment analysis, and eye tracking, we measured perceptions of trust, fairness, and bias. Results show that racial mismatch heightened perceptions of ethnic bias, while partial match (sharing only one identity) reduced fairness judgments compared to both full and no match. This work extends the Computers-Are-Social-Actors paradigm by demonstrating that avatar appearances shape justicerelated evaluations of AI. We contribute to HCI by revealing how identity cues influence fairness attributions and offer actionable insights for designing equitable AI interview systems.
[HC-27] Full State-Space Visualisation of the 8-Puzzle: Feasibility Design and Educational Use
【速读】:该论文旨在解决搜索算法(search algorithms)在人工智能教育中因状态空间庞大而导致学习者难以建立准确心智模型的问题,尤其针对8-puzzle问题(181,440个状态)的复杂性。其解决方案的关键在于构建一个交互式学习系统,通过Unity与现代GPU渲染技术实现对整个可达状态空间的可视化,并将抽象的图结构与具体的拼图操作紧密耦合,从而支持实时探索全局结构、逐步执行搜索算法以及直观比较不同策略在相同状态空间中的遍历路径。这种全状态空间可视化不仅技术可行,且被初步教学实践验证具有显著的教育价值,有助于提升学生对搜索行为概念的理解。
链接: https://arxiv.org/abs/2604.06186
作者: Ian Frank,Kanata Kawanishi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: This is a preprint of a paper accepted to IEEE ITET 2026
Abstract:Search algorithms are a foundational topic in artificial intelligence education, yet even simple domains can generate large state spaces that challenge learners’ ability to form accurate mental models. This paper presents an interactive learning system that demonstrates the feasibility of visualising the entire reachable state space of the 8-puzzle (181,440 states), while tightly coupling abstract graph structure with concrete puzzle manipulation. Built using Unity and modern GPU-based rendering techniques, the system enables real-time exploration of global structure, step-by-step execution of search algorithms, and direct comparison of how different strategies traverse the same space. We describe the system’s design, visualisation layouts, and educational use, reporting findings from an initial classroom deployment and pilot study with students at different levels of university education. Overall, the results indicate that full state-space visualisation is both technically feasible and educationally valuable for supporting conceptual understanding of search behaviour within this canonical problem domain.
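摘要中 181,440 个可达状态这一数字即 9!/2:滑块拼图每次移动不改变排列的奇偶性,因此只能到达同一奇偶类中的排列。下面的 BFS 小例子从还原状态出发枚举全部可达状态,可验证该数字:

```python
from collections import deque

def neighbors(state):
    """Yield states reachable by sliding an adjacent tile into the blank (0)."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

# Breadth-first enumeration of the full reachable state space
start = (1, 2, 3, 4, 5, 6, 7, 8, 0)
seen = {start}
frontier = deque([start])
while frontier:
    s = frontier.popleft()
    for t in neighbors(s):
        if t not in seen:
            seen.add(t)
            frontier.append(t)

print(len(seen))  # 181440, i.e. 9!/2: only even permutations are reachable
```

论文的可视化系统正是将这一完整状态图(及搜索算法在其上的遍历)以交互方式呈现。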
[HC-28] Benchmarking LLM Tool-Use in the Wild ICLR2026
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)在多轮、多步骤工具调用场景下表现不佳的问题,其核心挑战源于真实用户行为的复杂性:包括任务组合导致的工具调用拓扑结构需高效编排、意图分散于对话轮次中需上下文推理,以及指令转换(如任务查询、澄清与闲聊混合)迫使模型实时调整策略。现有基准测试未能反映这些行为特征,导致LLM工具使用能力的提升显得虚假。为此,作者提出WildToolBench,一个基于真实用户行为模式构建的LLM工具使用基准,通过57个LLM的全面评估发现,无一模型准确率超过15%,揭示了LLM代理能力在鲁棒性上的显著差距。关键解决方案在于重新聚焦于“野生”(wild)用户行为的本质复杂性,而非人为构造的复杂任务,强调必须重新审视LLM、用户与工具之间的交互机制。
链接: https://arxiv.org/abs/2604.06185
作者: Peijie Yu,Wei Liu,Yifan Yang,Jinjian Li,Zelong Zhang,Xiao Feng,Feng Zhang
机构: Tencent(腾讯); King’s College London(伦敦大学国王学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: accepted by ICLR 2026
Abstract:Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behaviour: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns that require contextual inference, and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15%, indicating a substantial gap in the robustness of LLMs’ agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among LLMs, users, and tools.
[HC-29] A Goal-Oriented Chatbot for Engaging the Elderly Through Family Photo Conversations
【速读】:该论文旨在解决老年人孤独感增强及认知功能衰退的问题,提出了一种基于家庭照片的个性化聊天机器人(personalized chatbot)解决方案。其关键在于通过引导用户围绕家庭照片进行目标导向的对话(goal-oriented dialogue framework),生成“W问题”(who, where, when, what)以刺激认知功能,并辅以开放式问题促进积极回忆(positive reminiscence)。同时,系统能分析每次对话内容,识别用户偏好话题并推荐相关照片继续互动,从而提升参与度与情感联结;此外,配套的网页门户使照护者可上传照片并查看对话记录,为评估用户心理状态提供数据支持。
链接: https://arxiv.org/abs/2604.06184
作者: Raymond Chung,Keith Ng,CD Shum
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC)
Abstract:We propose a personalized chatbot designed for elderly individuals. The chatbot initiates discussions based on family photos, encouraging users to interact naturally. During these interactions, it generates W questions (who, where, when, and what) to stimulate cognitive function, followed by an open-ended question to promote positive reminiscence. This approach is structured around a goal-oriented dialogue framework. Additionally, after each conversation about a photo, the chatbot analyzes the discussion to identify topics that the user favors or dislikes. It then offers the user the option to chat about another photo either featuring the same family members or an individual previously mentioned in the conversation. To support this system, we have developed a web portal that allows caregivers to upload photos and review chat conversations. This personalized chatbot not only encourages elderly users to engage with the chatbot regularly and reduces feelings of loneliness but also provides caregivers with a valuable tool to gain insights into users’ well-being.
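摘要所述的 "W 问题 + 开放式追问" 对话结构可以用如下最小示意说明(照片元数据字段 people/place/date/event 与其中的人名、地名均为假设,实际系统由对话模型根据照片内容生成问题):

```python
# Hypothetical photo metadata supplied by a caregiver via the web portal.
photo = {
    "people": ["your daughter May", "your grandson Tom"],
    "place": "Victoria Park",
    "date": "Lunar New Year, 2019",
    "event": "a family picnic",
}

def w_questions(photo):
    """Generate the four W questions, then one open-ended reminiscence prompt."""
    return [
        f"Who is in this photo with {photo['people'][0]}?",       # who
        f"Where was this taken? It looks like {photo['place']}.",  # where
        f"When was this? Perhaps around {photo['date']}?",         # when
        f"What was happening? Was it {photo['event']}?",           # what
        "What is your favourite memory from that day?",            # open-ended
    ]

questions = w_questions(photo)
for q in questions:
    print(q)
```

示意中问题为固定模板;论文系统还会在每张照片聊完后分析话题偏好,再推荐包含相同家人的下一张照片继续对话。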
[HC-30] The Impact of Response Latency and Task Type on Human-LLM Interaction and Perception
[Quick Read]: This paper addresses the underexplored question of how latency in large language model (LLM) applications affects user behavior and perceived output quality. In a controlled experiment varying time-to-first-token latency (2, 9, or 20 seconds) across two taxonomy-driven knowledge task types (Creation and Advice), user interaction behavior proved robust to latency, yet latency significantly shaped quality judgments: at 2 seconds, users rated outputs as less thoughtful and useful, while 9- and 20-second delays received higher ratings. The key insight is that latency is not merely a performance metric but a tunable design variable whose setting shapes users' cognitive interpretations (e.g., reading delay as deeper AI deliberation) and ethical experience (e.g., frustration or reliability concerns), motivating latency-aware design strategies for human-LLM interaction.
Link: https://arxiv.org/abs/2604.06183
Authors: Felicia Fang-Yi Tan, Moritz A. Messerschmidt, Wen Yin, Oded Nov
Institutions: New York University; National University of Singapore
Subjects: Human-Computer Interaction (cs.HC)
Comments: To be published in ACM CHI 2026
Abstract:Responsiveness in large language model (LLM) applications is widely assumed to be critical, yet the impact of latency on user behavior and perception of output quality has not been systematically explored. We report a controlled experiment varying time-to-first-token latency (2, 9, 20 seconds) across two taxonomy-driven knowledge task types (Creation and Advice). Log analyses reveal that user interaction behaviors were robust to latency, yet varied by task type: Creation tasks elicited more frequent prompting than Advice tasks. In contrast, participants who experienced 2-second latencies rated the LLM’s outputs less thoughtful and useful than those who experienced 9- or 20-second latencies. Participants attributed delays to AI deliberation, though long waits occasionally shifted this interpretation toward frustration or concerns about reliability. Overall, this work demonstrates that latency is not simply a cost to reduce but a tunable design variable with ethical implications. We offer design strategies for enhancing human-LLM interaction.
[HC-31] VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
[Quick Read]: This paper addresses the limitations of existing benchmarks for mobile GUI agents, which are largely app-centric and task-homogeneous and thus fail to reflect the diversity and instability of real mobile usage. The authors propose VenusBench-Mobile, an online benchmark for general-purpose mobile GUI agents built on two core evaluation pillars: user-intent-driven task design that defines what to evaluate and keeps tasks close to real user behavior, and a capability-oriented annotation scheme that enables fine-grained behavior analysis, exposing deficiencies in key abilities such as perception and memory. Experiments show that state-of-the-art models perform far below their scores on prior benchmarks, underscoring how challenging and realistic the new benchmark is; diagnostic analysis attributes most failures to weak perception and memory, and even the strongest models achieve near-zero success under environment variations, confirming their brittleness in realistic settings.
Link: https://arxiv.org/abs/2604.06182
Authors: Yichen Gong, Zhuohan Cai, Sunhao Dai, Yuqi Zhou, Zhangxuan Gu, Changhua Meng, Shuheng Shen
Institutions: unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at this https URL.
[HC-32] Digital Weight Management Interventions: A review of commercial solutions and survey analysis of user needs
[Quick Read]: This paper examines the gaps between current digital weight management interventions (DWMIs) and user needs. While existing commercial DWMIs offer core features such as self-monitoring, goal setting, and behavior-change strategies, they generally lack social support, virtual reality (VR) applications, and adaptive personalization, and rarely account for users' actual preferences and varying digital literacy. By systematically identifying 26 commercial DWMIs and analyzing the needs of 207 real-world participants, the study argues that the key improvements are stronger social interaction, VR-based immersive experiences, and adaptive personalization driven by user feedback, so as to improve accessibility, adherence, and long-term effectiveness.
Link: https://arxiv.org/abs/2604.06181
Authors: Suncica Hadzidedic, Jingyun Wang, Victor Elijah Adeyemo, George Sanders, Grant Westermann
Institutions: unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 10 pages, to be published in conference proceedings of KES International conference on Innovation in Medicine and Healthcare (KES-InMed-25)
Abstract:Obesity is a global health challenge. According to the World Health Organization (WHO), between 1990 and 2022, adult obesity more than doubled. Weight management interventions (WMIs) support individuals in achieving and maintaining a healthy weight through dietary guidance, physical activity promotion and behavioural counselling. However, traditional WMIs often have limited accessibility. Digital WMIs or DWMIs are delivered via websites or smartphone applications and provide scalable and cost-effective alternatives. However, user needs for digital services and their prevalence in the existing commercial solutions remain underexplored. Hence, our study systematically identified 26 commercial DWMIs to identify their features, services, and data collection practices. Additionally, we performed a user needs analysis by recruiting 207 individuals involved in a real-life WMI. Our findings indicated that DWMIs integrated self-monitoring, goal setting, and behaviour change strategies, yet lack social support, virtual reality applications and adaptive personalisation. WMI clients prefer smartphone Apps and fitness trackers for tracking weight management progress and have varying levels of comfort in using digital resources. The presented results serve as recommendations for future directions in the design and implementation of services for DWMIs.
[HC-33] “Help Me But Don’t Watch Me”: Intervention Timing and Privacy Boundaries for Process-Aware AI Tutors
[Quick Read]: This paper tackles the shortage of individualized instructional support in K-12 education, particularly in mathematics, asking how generative AI can serve as an informal tutor that provides timely, flexible support while respecting student autonomy. The key contribution is an empirical survey of students' preferences for AI tutor support: students favor support that preserves their autonomy (e.g., time to think, hints rather than direct answers), remain cautious when choosing between human and AI tutors, and want adaptive intervention while worrying about annoyance and loss of autonomy. Privacy boundaries are uneven: students are willing to share problem-solving steps and error patterns but are less accepting of sensitive signals such as attention or behavior data. The study concludes that AI tutors should balance timely intervention with student agency, and personalization with perceived boundaries.
Link: https://arxiv.org/abs/2604.06178
Authors: Jane Hanqi Li, Yuhong Zhang, Jiaqi Liu, Tzyy-Ping Jung, Amy Eguchi
Institutions: unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: Submitted to AIED2026
Abstract:The use of generative AI (genAI) tools as informal tutors is becoming increasingly prevalent among secondary school students in mathematics learning. In many schools, individualized instructional support is limited, and one-on-one human tutoring remains costly in most learning contexts. GenAI has the potential to provide timely, on-demand help to students when teachers or tutors are not available. However, there are still few studies that examine students’ preferences for AI tutor support that enhances autonomous learning. We investigated learner expectations for AI tutoring through a survey with secondary school students in China (Grades 7-11; N=330). Students generally preferred support that preserves learner autonomy (e.g., time to think, hints over direct answers), expressed mixed or cautious preferences between human and AI tutors, and held nuanced views of proactive intervention, valuing adaptivity but also worrying about annoyance and autonomy. Privacy boundaries were uneven: many accepted sharing problem steps and error patterns, while willingness dropped for more sensitive signals such as attention or behavior. Our findings offer learner-centered insights for designing AI tutors that balance timely intervention with student agency, and personalization with perceived boundaries in a K-12 context.
[HC-34] User-Centric Design of UI for Mobile Banking Apps: Improving UI and Features for Better Customer Experience
[Quick Read]: This paper addresses poor user experience in mobile banking applications, including low user satisfaction, incomplete features, complex navigation, and insufficient security and personalization. Using a user-centered design (UCD) approach, the study identifies pain points through think-aloud testing, heat-map analysis, and remote usability testing, and proposes improvements centered on stronger security, richer functionality, simplified navigation, and better visual design. A key element is applying Gestalt psychology principles such as closeness and symmetry to improve interface layout and grouping, thereby improving the interaction experience and promoting broader adoption of and satisfaction with mobile banking.
Link: https://arxiv.org/abs/2604.06175
Authors: Luniva Chitrakar, Ishan Panta, Biplov Paneru, Sangharsh Poudel, Lahana Kansakar
Institutions: unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Financial management has been revolutionized by mobile banking, but increasing usefulness and satisfaction requires a better user experience. This study aims to provide an improved customer experience by offering user-friendly interfaces, and real-time notifications by user-centric design of mobile banking application UI. A survey was carried out on the target audience in which 81% of respondents to a study of 103 people said they regularly used mobile banking apps, while 77% said they had problems with the ones they were using at the time. Furthermore, 44.7% of respondents expressed unhappiness with the current solutions by depending on third-party apps like e-Sewa and Khalti for everyday transactions. Language obstacles, lengthy loading times, unclear terminology, and navigational challenges were among the problems found. With 84% asking for a budgeting function and 46% complaining about biometric authentication, users indicated a need for more individualized interfaces, improved customer service, and increased security. The study included Think Aloud testing, heat maps, and remote usability testing to determine user preferences and pain spots to solve these. Feedback from a wider audience was obtained informally through guerrilla usability testing. The results highlight how important it is for mobile banking apps to guarantee security, increase functionality, simplify navigation, and improve visual design. App grouping and layout can be further enhanced by utilizing Gestalt psychology concepts like closeness and symmetry. The goal of these user-centered insights is to promote greater happiness and adoption of mobile banking.
[HC-35] X-BCD: Explainable Sensor-Based Behavioral Change Detection in Smart Home Environments
[Quick Read]: This paper addresses the challenge of automatically detecting and characterizing the evolution of daily activity patterns from multimodal smart-home sensor data, in order to identify digital biomarkers of cognitive decline. While prior work in smart environments has made progress on recognizing individual activities, analyzing how activity patterns evolve over time (e.g., simplification, fragmentation) remains difficult and lacks clinical interpretability. The key contribution is the X-BCD framework: an explainable, unsupervised approach that combines change point detection with cluster evolution tracking to automatically identify structural changes in behavioral routines, and translates those changes into natural-language descriptions grounded in interpretable features, supporting clinical decision-making and monitoring.
Link: https://arxiv.org/abs/2604.06174
Authors: Gabriele Civitarese, Claudio Bettini
Institutions: EveryWare Lab, Dept. of Computer Science, University of Milan
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: Manuscript currently under review
Abstract:Behavioral changes in daily life activities at home can be digital markers of cognitive decline. However, such changes are difficult to assess through sporadic clinical visits and remain challenging to interpret from continuous in-home sensing data. Extensive work has been done in the ubiquitous computing area on recognizing activities in smart homes, but only limited efforts have focused on analysing the evolution of patterns of activities, hence identifying behavior changes. In particular, understanding how daily habits and routines evolve and reorganize (e.g., simplification, fragmentation) is still an open challenge for clinical monitoring and decision support. In this paper, we present X-BCD, an explainable, unsupervised framework for detecting and characterizing changes in activity routines from multimodal smart home sensor data, combining change point detection and cluster evolution tracking. To support clinical interpretation, detected changes in routines are transformed into natural-language explanations grounded in interpretable features. Our preliminary evaluation on longitudinal data from real MCI patients shows that X-BCD produces interpretable descriptions of behavioral change, as supported by cohort-level comparisons, expert assessment, and parameter sensitivity analysis.
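X-BCD couples change point detection with cluster evolution tracking. As a minimal sketch of the change-point half only (a generic variance-reduction split on a single behavioral feature; the function and data below are hypothetical illustrations, not the paper's actual method or dataset):

```python
import numpy as np

def change_point(series):
    """Single change-point detection by minimizing within-segment variance.

    Returns the split index k that best separates the series into two
    segments with different means (largest variance reduction).
    """
    n = len(series)
    best_k, best_gain = 1, -np.inf
    total_var = ((series - series.mean()) ** 2).sum()
    for k in range(1, n):
        left, right = series[:k], series[k:]
        within = (((left - left.mean()) ** 2).sum()
                  + ((right - right.mean()) ** 2).sum())
        gain = total_var - within          # variance explained by the split
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_k

# toy weekly "activity duration" feature that shifts after week 20
series = np.concatenate([np.full(20, 5.0), np.full(15, 3.0)])
k = change_point(series)
```

In practice X-BCD operates on multimodal sensor features and additionally tracks how routine clusters merge, split, or drift between the detected segments.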
Computer Vision
[CV-0] Fast Spatial Memory with Elastic Test-Time Training
[Quick Read]: This paper targets the catastrophic forgetting and overfitting caused by the fully plastic inference-time updates of Large Chunk Test-Time Training (LaCT) in long-sequence 3D reconstruction, which limit its ability to handle sequences of arbitrary length. The key is Elastic Test-Time Training, inspired by elastic weight consolidation (EWC): fast-weight updates are stabilized by a Fisher-weighted elastic prior around an anchor state that evolves as an exponential moving average of past fast weights, balancing stability and plasticity. Building on this architecture, the authors introduce Fast Spatial Memory (FSM), an efficient and scalable 4D reconstruction model that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations, substantially mitigating the camera-interpolation shortcut and supporting fast adaptation with smaller chunks, thereby advancing LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation and laying the groundwork for genuinely long-sequence generalization.
Link: https://arxiv.org/abs/2604.07350
Authors: Ziqiao Ma, Xueyang Yu, Haoyu Zhen, Yuncong Yang, Joyce Chai, Chuang Gan
Institutions: MIT-IBM Watson AI Lab; University of Michigan; University of Massachusetts Amherst
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Page: this https URL
Abstract:Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training inspired by elastic weight consolidation, that stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-trained FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks and mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.
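The elastic update at the heart of Elastic Test-Time Training can be sketched in a few lines: the task gradient is combined with a Fisher-weighted pull toward an EMA anchor of past fast weights. This is a schematic version under assumed notation (function name, learning rate, and coefficients are illustrative; the paper's actual update operates on large fast-weight chunks with an estimated Fisher term):

```python
import numpy as np

def elastic_ttt_step(w, grad, fisher, anchor, lr=0.1, lam=1.0, ema=0.99):
    """One elastic test-time update of fast weights w.

    EWC-style elastic prior: the task gradient is regularized by a
    Fisher-weighted pull toward an EMA anchor of past fast weights:
        w      <- w - lr * (grad + lam * fisher * (w - anchor))
        anchor <- ema * anchor + (1 - ema) * w
    """
    w = w - lr * (grad + lam * fisher * (w - anchor))
    anchor = ema * anchor + (1.0 - ema) * w
    return w, anchor

# toy run with noisy gradients: the anchor keeps fast weights from drifting
w, anchor, fisher = np.zeros(4), np.zeros(4), np.ones(4)
rng = np.random.default_rng(2)
for _ in range(100):
    w, anchor = elastic_ttt_step(w, rng.normal(size=4), fisher, anchor)
```

Without the elastic term, the same noisy updates would perform an unconstrained random walk; the Fisher-weighted pull bounds the drift around the slowly moving anchor.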
[CV-1] MoRight: Motion Control Done Right
[Quick Read]: This paper addresses two core problems in motion-controlled video generation: the entanglement of motion and camera viewpoint, where existing methods mix camera and object motion into a single tracking signal and thus cannot control them independently; and the absence of motion causality, where existing methods produce pixel-level displacement rather than physically grounded causal reactions. The key is the MoRight framework, which disentangles motion modeling in two ways: temporal cross-view attention transfers object motion specified in a canonical static view to an arbitrary target camera viewpoint, enabling disentangled camera and object control; and motion is decomposed into a user-driven active component and a resulting passive (consequence) component, training the model to learn motion causality from data. This supports both forward reasoning (given active motion, predict passive consequences) and inverse reasoning (given desired passive outcomes, recover plausible driving actions), while freely adjusting the camera viewpoint throughout.
Link: https://arxiv.org/abs/2604.07348
Authors: Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta, Shenlong Wang, Sanja Fidler, Jun Gao
Institutions: NVIDIA; University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project Page: this https URL
Abstract:Generating motion-controlled videos–where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints–demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motion. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static-view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and MoRight predicts consequences (forward reasoning), or specify desired passive outcomes and MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.
[CV-2] C-AE: Unlocking Token Capacity for Deep Compression Autoencoders
[Quick Read]: This paper addresses the degradation of generative image quality under deep compression, in particular the collapse of latent representations. Existing methods typically increase the channel count of the latent space to maintain reconstruction quality at high compression ratios, which tends to trigger latent representation collapse. The key idea is to work in the token space instead: first, adjusting the ViT patch size enables effective token number scaling under a fixed latent budget, and token-to-latent compression is decomposed into two stages to reduce structural information loss; second, joint self-supervised training strengthens the semantic structure of image tokens, yielding more generation-friendly latents. Together these designs substantially improve reconstruction and generation performance under deep compression.
Link: https://arxiv.org/abs/2604.07340
Authors: Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizer for visual generation.
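The token-number scaling studied here follows directly from ViT patching arithmetic: at fixed image resolution, halving the patch size quadruples the token count, which is what makes the token budget a tunable axis independent of the latent budget. A tiny sketch (numbers are illustrative, not the paper's configuration):

```python
def token_count(image_size, patch_size):
    """Number of ViT tokens for a square image split into square patches."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    return (image_size // patch_size) ** 2

# at a fixed 256px resolution, each halving of the patch size
# quadruples the number of tokens fed to the encoder
counts = {p: token_count(256, p) for p in (32, 16, 8)}
```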
[CV-3] From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians
[Quick Read]: This paper tackles a fundamental difficulty of surface extraction in 3D Gaussian Splatting (3DGS): because of its opacity-based formulation, 3DGS lacks a global geometric field, forcing existing methods to rely on heuristics such as TSDF fusion of blended depth maps, which makes accurate watertight meshing hard. The key is to attach a learnable oriented normal to each Gaussian and define an adapted attenuation formulation, yielding closed-form expressions for the normal and occupancy fields at arbitrary locations in space; a novel consistency loss and a dedicated densification strategy further encourage Gaussians to wrap the entire surface and close geometric holes, ultimately producing high-quality, watertight 3D meshes.
Link: https://arxiv.org/abs/2604.07337
Authors: Diego Gomez, Antoine Guédon, Nissim Maruani, Bingchen Gong, Maks Ovsjanikov
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page is available in this http URL
Abstract:3D Gaussian Splatting (3DGS) has revolutionized fast novel view synthesis, yet its opacity-based formulation makes surface extraction fundamentally difficult. Unlike implicit methods built on Signed Distance Fields or occupancy, 3DGS lacks a global geometric field, forcing existing approaches to resort to heuristics such as TSDF fusion of blended depth maps. Inspired by the Objects as Volumes framework, we derive a principled occupancy field for Gaussian Splatting and show how it can be used to extract highly accurate watertight meshes of complex scenes. Our key contribution is to introduce a learnable oriented normal at each Gaussian element and to define an adapted attenuation formulation, which leads to closed-form expressions for both the normal and occupancy fields at arbitrary locations in space. We further introduce a novel consistency loss and a dedicated densification strategy to enforce Gaussians to wrap the entire surface by closing geometric holes, ensuring a complete shell of oriented primitives. We modify the differentiable rasterizer to output depth as an isosurface of our continuous model, and introduce Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution. We additionally expose fundamental biases in standard surface evaluation protocols and propose two more rigorous alternatives. Overall, our method Gaussian Wrapping sets a new state-of-the-art on DTU and Tanks and Temples, producing complete, watertight meshes at a fraction of the size of concurrent work, recovering thin structures such as the notoriously elusive bicycle spokes.
[CV-4] RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
[Quick Read]: This paper addresses the bottleneck of large-scale data collection for robot learning, in particular capturing human behavior data with rich, long-horizon interactions; existing approaches trade off portability, robustness to occlusion, and global consistency. The key is RoSHI, a hybrid wearable system that fuses low-cost sparse inertial measurement units (IMUs) with the Project Aria glasses to estimate the wearer's full 3D pose and body shape in a metric global coordinate frame. The design exploits the IMUs' robustness to occlusion and high-speed motion together with egocentric SLAM, which anchors long-horizon motion and stabilizes upper-body pose, yielding high-quality, globally consistent full-body motion data while remaining portable and practical, suitable for real-world humanoid policy learning.
Link: https://arxiv.org/abs/2604.07331
Authors: Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio
Institutions: University of Pennsylvania
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures. *Equal contribution by first three authors. Project webpage: this https URL
Abstract:Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning. For videos, data and more, visit the project webpage: this https URL
[CV-5] Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling
[Quick Read]: This paper addresses the difficulty of studying and deploying photon-counting CT (PCCT) at scale due to its limited clinical availability, specifically how to leverage scarce high-quality PCCT data to improve routine energy-integrating CT (EICT) image quality. The key is the SUMI method, which explicitly models realistic acquisition degradations, transforming high-quality PCCT into clinically plausible lower-quality EICT counterparts and learning to invert this degradation process. The method trains a latent diffusion model on 1,046 PCCTs and 405,379 EICTs from 145 hospitals, and releases the pretrained autoencoder for extracting general CT latent features, so that high-fidelity enhancement is achieved without paired acquisitions; it significantly outperforms existing image translation techniques and improves diagnostic performance on several downstream tasks.
Link: https://arxiv.org/abs/2604.07329
Authors: Junqi Liu, Xinze Zhou, Wenxuan Li, Scott Ye, Arkadiusz Sitek, Xiaofeng Yang, Yucheng Tang, Daguang Xu, Kai Ding, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou
Institutions: Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale. As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.
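The degradation-then-inversion idea can be illustrated with a toy degradation operator: blur a high-quality slice to mimic resolution loss and add Gaussian noise to mimic dose/electronic noise, then train a model to invert that mapping. The operator below is a hypothetical stand-in for the clinically validated degradations in the paper (function name and parameters are illustrative):

```python
import numpy as np

def degrade(ct, blur=1, noise_sigma=0.02, seed=0):
    """Toy degradation of a high-quality slice.

    Box blur of radius `blur` mimics resolution loss; additive Gaussian
    noise mimics dose/electronic noise. Not the paper's actual operators.
    """
    rng = np.random.default_rng(seed)
    k = 2 * blur + 1
    padded = np.pad(ct, blur, mode="edge")
    blurred = np.zeros_like(ct)
    for dy in range(k):                 # accumulate the k x k box window
        for dx in range(k):
            blurred += padded[dy:dy + ct.shape[0], dx:dx + ct.shape[1]]
    blurred /= k * k
    return blurred + rng.normal(0.0, noise_sigma, ct.shape)

hq = np.zeros((32, 32))
hq[12:20, 12:20] = 1.0                  # synthetic high-contrast "lesion"
lq = degrade(hq)                        # clinically-plausible-style degraded copy
```

A model trained on such (lq, hq) pairs learns to invert the degradation; SUMI's contribution is making the simulated degradations clinically realistic enough (radiologist-validated) that the learned inverse transfers to real EICT.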
[CV-6] Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment CVPR2026
[Quick Read]: This paper addresses the sharp performance drop of existing dynamic data pruning methods under label noise. Conventional methods rank samples for pruning by their per-sample loss, but under label noise this can mistakenly preserve noisy samples because of their high loss values, harming model performance. The key is the AlignPrune module, whose core is the loss-trajectory-based Dynamic Alignment Score (DAS): by analyzing how each sample's loss evolves over training, it identifies noisy samples more accurately, improving the effectiveness and robustness of pruning. The module plugs seamlessly into mainstream dynamic pruning frameworks without modifying the model architecture or training pipeline, and consistently outperforms state-of-the-art baselines across noise types and pruning ratios, boosting accuracy by up to 6.3%.
Link: https://arxiv.org/abs/2604.07306
Authors: Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Kai Wang, Zheng Wang, Peng Hu, Xi Peng, Hongyuan Zhu
Institutions: Institute for Infocomm Research (I2R), A*STAR, Singapore; National University of Singapore; Wuhan University; Sichuan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in CVPR 2026 Findings
Abstract:Existing dynamic data pruning methods often fail under noisy-label settings, as they typically rely on per-sample loss as the ranking criterion. This could mistakenly lead to preserving noisy samples due to their high loss values, resulting in significant performance drop. To address this, we propose AlignPrune, a noise-robust module designed to enhance the reliability of dynamic pruning under label noise. Specifically, AlignPrune introduces the Dynamic Alignment Score (DAS), which is a loss-trajectory-based criterion that enables more accurate identification of noisy samples, thereby improving pruning effectiveness. As a simple yet effective plug-and-play module, AlignPrune can be seamlessly integrated into state-of-the-art dynamic pruning frameworks, consistently outperforming them without modifying either the model architecture or the training pipeline. Extensive experiments on five widely-used benchmarks across various noise types and pruning ratios demonstrate the effectiveness of AlignPrune, boosting accuracy by up to 6.3% over state-of-the-art baselines. Our results offer a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real-world scenarios. Code is available at: this https URL.
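The Dynamic Alignment Score is defined over loss trajectories rather than instantaneous losses. As a hypothetical illustration of the general idea (not the published criterion): score each sample by how well its per-epoch loss changes align with the cohort-average trajectory, so that noisy samples, whose losses stagnate or rise while the cohort's fall, receive low scores:

```python
import numpy as np

def trajectory_alignment(loss_traj):
    """Score samples by alignment of their loss trajectory with the cohort.

    loss_traj: (N, T) array of per-epoch losses for N samples.
    Returns cosine similarity in [-1, 1] between each sample's per-epoch
    loss changes and the mean change; higher = more "clean-like".
    """
    deltas = np.diff(loss_traj, axis=1)          # per-epoch loss change
    mean_delta = deltas.mean(axis=0)             # cohort-average trajectory
    num = deltas @ mean_delta
    den = np.linalg.norm(deltas, axis=1) * np.linalg.norm(mean_delta) + 1e-12
    return num / den

# clean samples: steadily decreasing loss; noisy: flat-then-rising loss
epochs = np.arange(10, dtype=float)
clean = (2.0 * np.exp(-0.5 * epochs))[None, :].repeat(5, axis=0)
noisy = (1.5 + 0.05 * epochs)[None, :].repeat(5, axis=0)
scores = trajectory_alignment(np.vstack([clean, noisy]))
```

A per-sample-loss ranking would keep the noisy samples here (their final losses are highest), whereas the trajectory-based score separates them cleanly.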
[CV-7] Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification
[Quick Read]: This paper addresses limitations of multiple instance learning (MIL) for gigapixel whole-slide image (WSI) classification in computational pathology: existing MIL aggregators route all instances through a shared pathway and cannot specialize to the pathological heterogeneity within each slide; mixture-of-experts (MoE) methods offer specialized subnetworks, but their unconstrained softmax routing often yields imbalanced expert utilization, collapsing the model back to a near-single-pathway solution. The key is ROAM (Region-graph OptimAl-transport Mixture-of-experts), whose core innovations are: (i) formulating region-to-expert assignment as capacity-constrained entropic optimal transport with explicit per-slide capacity marginals, enforcing balanced expert utilization without auxiliary load-balancing losses; and (ii) graph-regularized Sinkhorn iterations that diffuse routing assignments over a spatial region graph, encouraging neighboring regions to route coherently to the same experts and strengthening spatial consistency.
Link: https://arxiv.org/abs/2604.07298
Authors: Xin Tian, Jiuliu Lu, Ephraim Tsalik, Bart Wanders, Colleen Knoth, Julian Knight
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 10 pages, 2 figures, 2 tables
Abstract:Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive against strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches external AUC 0.845 ± 0.019.
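Capacity-constrained entropic OT routing can be sketched with a plain Sinkhorn loop: the row marginals spread each region token's mass, while the column marginals enforce the per-slide expert capacities by construction, with no auxiliary load-balancing loss. A minimal sketch (function name and hyperparameters are illustrative; ROAM additionally regularizes the iterations over a spatial region graph):

```python
import numpy as np

def sinkhorn_routing(scores, capacities, epsilon=0.5, n_iters=200):
    """Entropic-OT assignment of R region tokens to E experts.

    scores:     (R, E) router affinities between tokens and experts.
    capacities: (E,) expert capacity marginals, summing to 1.
    Returns a transport plan P of shape (R, E) whose rows each carry
    mass 1/R and whose columns sum to the requested capacities.
    """
    R, _ = scores.shape
    K = np.exp(scores / epsilon)          # Gibbs kernel of the affinities
    r = np.full(R, 1.0 / R)               # uniform mass over region tokens
    c = np.asarray(capacities, dtype=float)
    u, v = np.ones_like(r), np.ones_like(c)
    for _ in range(n_iters):              # alternating marginal scaling
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

# toy slide: 6 region tokens routed to 3 experts with equal capacity
rng = np.random.default_rng(0)
P = sinkhorn_routing(rng.normal(size=(6, 3)), np.full(3, 1.0 / 3))
```

Unlike softmax routing, the column constraint here prevents any single expert from absorbing most of the routing mass, which is the collapse mode the paper identifies.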
[CV-8] Are Face Embeddings Compatible Across Deep Neural Network Models?
[Quick Read]: This paper asks whether different deep neural network (DNN) models, both domain-specific and foundation models, encode facial identity consistently despite differing training data, loss functions, and architectures, and whether their embedding spaces can be aligned and made compatible across models. The key is to analyze the geometry of the embedding spaces directly: treating embeddings of face images as point clouds and aligning the representations of different models with simple affine transformations. The study finds that low-capacity linear mappings substantially improve cross-model face identification and verification, that alignment patterns generalize across datasets, and that facial identity encoding exhibits representational convergence, with implications for model interoperability, ensemble design, and biometric template security.
Link: https://arxiv.org/abs/2604.07282
Authors: Fizza Rubab, Yiying Tong, Arun Ross
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models–both domain-specific and foundation models–encode facial identity in similar ways, despite being trained on different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces imputed by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.
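The cross-model alignment probe amounts to fitting a low-capacity affine map between embedding spaces on a set of common faces. A minimal sketch on synthetic data, where one "model" is an exact affine transform of the other (function name and dimensions are illustrative assumptions):

```python
import numpy as np

def fit_affine_map(src, dst):
    """Least-squares affine map (A, b) such that src @ A + b ≈ dst.

    src, dst: (N, d_src) and (N, d_dst) embeddings of the SAME faces
    produced by two different models.
    """
    X = np.hstack([src, np.ones((len(src), 1))])    # append bias column
    W, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return W[:-1], W[-1]                            # A (d_src, d_dst), b

# synthetic check: model B's space is a hidden affine transform of model A's
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(100, 8))
A_true, b_true = rng.normal(size=(8, 8)), rng.normal(size=8)
emb_b = emb_a @ A_true + b_true
A, b = fit_affine_map(emb_a, emb_b)
err = np.abs(emb_a @ A + b - emb_b).max()          # residual of the fit
```

On real embeddings the fit is of course not exact; the paper's point is that even such low-capacity maps recover much of the cross-model matching performance.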
[CV-9] Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
【速读】:该论文旨在解决流式3D感知(streaming 3D perception)中长期序列下的时序一致性问题,尤其是现有递归模型因压缩潜在记忆容量有限而导致的漂移累积(drift accumulation)与时间遗忘(temporal forgetting)。其解决方案的关键在于提出一种混合记忆设计(hybrid memory design)——将相机位姿跟踪(camera tracking)与几何映射(geometric mapping)解耦:前者采用轻量级多层感知机(Multi-Layer Perceptron)实现隐式快速权重记忆(implicit fast-weight memory),并通过测试时训练(Test-Time Training, TTT)动态更新;后者则维护一个显式的、固定大小的基于token的状态用于几何建模。该设计在显著提升长序列性能的同时,将模型参数从793M降至644M,并兼容现有优化策略(如TTT3R),使绝对轨迹误差(Absolute Trajectory Error)在500–1000帧序列上降低达39%,且保持恒定GPU内存占用和相近推理吞吐量。
链接: https://arxiv.org/abs/2604.07279
作者: Changkun Liu,Jiezhi Yang,Zeman Li,Yuan Deng,Jiancong Guo,Luca Ballan
机构: Google(谷歌); The Hong Kong University of Science and Technology(香港科技大学); University of Southern California(南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: this https URL
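Mem3R 把相机跟踪的记忆实现为经测试时训练(TTT)更新的轻量 MLP,即"快速权重"记忆。下面以单层线性记忆示意 TTT 的一步更新(学习率、自监督重建目标与记忆结构均为本文假设的玩具设置,并非 Mem3R 的实际实现):

```python
import numpy as np

def ttt_update(W, x, lr=0.5):
    """One test-time-training step on a linear fast-weight memory.

    The memory W is asked to reconstruct the current observation x,
    and is updated by gradient descent on 0.5 * ||W x - x||^2.
    """
    err = W @ x - x                  # reconstruction error
    grad = np.outer(err, x)          # gradient of the squared error w.r.t. W
    return W - lr * grad
```

重复喂入同一观测时,记忆会逐步把该观测"写入"权重,使重建误差趋近于零;推理阶段按帧流式执行这样的更新,就得到随序列演化的隐式记忆。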
[CV-10] GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos
【速读】:该论文旨在解决如何从文本和图像输入中生成并编辑高保真度的全身虚拟形象(full-body avatar)的问题,尤其关注在训练过程中利用大规模真实世界视频数据以提升生成结果的逼真度与泛化能力。其核心挑战在于:现有方法难以有效利用部分可观测的2D视频数据来训练3D扩散模型,导致生成质量受限且难以扩展至海量数据。解决方案的关键在于提出一种新颖的“可见性感知扩散训练策略”(visibility-aware diffusion training strategy),通过将预训练的前馈式虚拟形象重建模型作为可动画化的3D标记器(animatable 3D tokenizer),将非结构化的视频帧编码为结构化的3D token,并在训练时用可学习的标记替换无效区域(如遮挡或缺失部位),仅在有效区域内计算损失,从而实现基于大规模真实视频数据的3D扩散模型训练,显著提升了生成结果的保真度和动画兼容性。
链接: https://arxiv.org/abs/2604.07273
作者: Yiqian Wu,Rawal Khirodkar,Egor Zakharov,Timur Bagautdinov,Lei Xiao,Zhaoen Su,Shunsuke Saito,Xiaogang Jin,Junxuan Li
机构: State Key Laboratory of CADCG, Zhejiang University (浙江大学CADCG国家重点实验室); Codec Avatars Lab, Meta (Meta公司编码化身实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at this https URL.
[CV-11] Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments
【速读】:该论文试图解决的问题是:深度神经网络在预测人类对图像真实性的判断时,是否依赖于与人类相似的信息处理机制,以及其生成的解释性热图(如Grad-CAM、LIME和多尺度像素掩码)是否能可靠地揭示支撑这些判断的线索。解决方案的关键在于系统评估不同预训练视觉模型(包括VGG、EfficientNetB3和Barlow Twins等)在预测人类真实性评分时的性能及其解释的一致性——通过比较同一架构内不同随机种子下的解释稳定性,以及跨架构之间的解释一致性发现:尽管多个模型能达到约80%噪声上限的预测性能,但它们的解释在跨架构间一致性较弱,说明单纯依赖预测性能无法推断认知机制;进一步采用集成方法结合多个模型后,不仅提升了预测准确性,还实现了基于像素掩码的图像级归因,从而为行为模型的可解释性提供更稳健的路径。
链接: https://arxiv.org/abs/2604.07254
作者: Icaro Re Depaolini,Uri Hasson
机构: The University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
[CV-12] Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving
【速读】:该论文旨在解决自动驾驶中异构传感器导致的相机装置依赖问题,通过生成标准化虚拟视图来实现外推式新视角合成(extrapolative novel view synthesis)。现有方法在记录轨迹之外的表现退化,主要原因是外推位姿提供的几何支持较弱且缺乏密集的目标视图监督。解决方案的关键在于:在训练阶段显式暴露模型于轨迹外条件下的缺陷,从而提升其泛化能力。为此,作者提出Geo-EVS框架,其核心创新为两个组件——几何感知重投影(Geometry-Aware Reprojection, GAR)和Artifact-Guided Latent Diffusion(AGLD),前者利用微调的VGGT重建彩色点云并重投影至观测与虚拟目标位姿,生成几何条件图以统一训练与推理路径;后者在训练中注入由重投影产生的伪影掩码,使模型学会在缺失几何支撑下恢复结构。此设计显著提升了稀疏视图合成质量和几何准确性,尤其在高仰角和低覆盖场景下表现优异,并改善了下游3D检测性能。
链接: https://arxiv.org/abs/2604.07250
作者: Yatong Lan,Rongkui Tang,Lei He
机构: Tsinghua University (清华大学); Minzu University (民族大学); Chongqing Changan Automobile Co., Ltd. (重庆长安汽车有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.
[CV-13] PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
【速读】:该论文旨在解决图像编辑中物体物理属性难以准确操控的问题,特别是现有视觉生成模型在空间操作上常出现缩放和定位错误,根源在于缺乏显式的三维几何与透视投影机制。其解决方案的关键在于提出 PhyEdit 框架,通过引入显式的几何模拟作为上下文感知的 3D 视觉引导,并结合联合 2D-3D 监督策略,显著提升物体的空间物理准确性与操作一致性。
链接: https://arxiv.org/abs/2604.07230
作者: Ruihang Xu,Dewei Zhou,Xiaolong Shen,Fan Ma,Yi Yang
机构: Zhejiang University (浙江大学); ReLER; CCAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.
[CV-14] VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
【速读】:该论文旨在解决时尚图像生成中两个关键问题:一是现有方法将服装生成与虚拟试穿视为独立任务,限制了其在真实时尚工作流中的灵活性;二是多源异构条件下的时尚图像合成仍面临挑战,因传统方法依赖简单的特征拼接或静态层注入,易导致属性纠缠和语义干扰。解决方案的核心在于提出VersaVogue框架,其关键创新是引入特质路由注意力(trait-routing attention, TA)模块,该模块采用专家混合(mixture-of-experts)机制动态地将条件特征路由至最匹配的专家和生成层,实现纹理、形状、颜色等视觉属性的解耦注入;同时设计自动化多视角偏好优化(MPO)流水线,通过内容保真度、文本对齐性和感知质量评估器构建无标注偏好数据,利用直接偏好优化(DPO)提升模型的真实感与可控性。
链接: https://arxiv.org/abs/2604.07210
作者: Jian Yu,Fei Shen,Cong Wang,Yi Xin,Si Shen,Xiaoyu Du,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); National University of Singapore (新加坡国立大学); Nanjing University (南京大学); Nanjing Forestry University (南京林业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO). Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.
[CV-15] INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
【速读】:该论文旨在解决当前视频生成范式中普遍存在的空间一致性不足与视觉真实感欠缺的问题,从而支持在复杂环境中的无缝导航。其核心挑战在于如何从单个参考视频中恢复并生成高保真、可交互的动态场景,并保持长时间序列下的空间一致性与用户交互的物理合理性。解决方案的关键在于提出了一种名为INSPATIO-WORLD的新颖实时框架,其核心技术是Spatiotemporal Autoregressive (STAR)架构,包含两个紧密耦合的组件:隐式时空缓存(Implicit Spatiotemporal Cache)用于聚合参考和历史观测信息以构建全局一致的潜在世界表示;显式空间约束模块(Explicit Spatial Constraint Module)则通过几何结构约束将用户交互转化为精确且物理合理的相机轨迹。此外,引入联合分布匹配蒸馏(Joint Distribution Matching Distillation, JDMD)机制,利用真实世界数据分布作为正则化引导,有效缓解因过度依赖合成数据导致的保真度下降问题。
链接: https://arxiv.org/abs/2604.07209
作者: InSpatio Team(Alphabetical Order):Donghui Shen,Guofeng Zhang,Haomin Liu,Haoyu Ji,Hujun Bao,Hongjia Zhai,Jialin Liu,Jing Guo,Nan Wang,Siji Pan,Weihong Pan,Weijian Xie,Xianbin Liu,Xiaojun Xiang,Xiaoyu Zhang,Xinyu Chen,Yifu Wang,Yipeng Chen,Zhenzhou Fan,Zhewen Le,Zhichao Ye,Ziqiang Zhao
机构: Google(谷歌); Stanford University (斯坦福大学)

类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
[CV-16] aLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification
【速读】:该论文旨在解决茶叶叶片病害的精准识别与检测问题,以提升茶园病害管理的效率和准确性。其关键解决方案在于采用多种卷积神经网络(Convolutional Neural Networks, CNN)模型对teaLeafBD数据集进行训练与评估,其中DenseNet201表现最优,测试准确率达99%;同时引入梯度加权类激活映射(Gradient weighted Class Activation Mapping, Grad CAM)、遮挡敏感性分析及对抗训练等技术,显著增强了模型的可靠性与可解释性,并提升了其在噪声环境下的鲁棒性。最终开发出原型系统,实现了深度学习模型在真实农业场景中的落地应用。
链接: https://arxiv.org/abs/2604.07182
作者: Rafi Ahamed,Sidratul Moon Nafsin,Md Abir Rahman,Tasnia Tarannum Roza,Munaia Jannat Easha,Abu Raihan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As the world's second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. Precise identification and detection of tea leaf diseases is therefore crucial. With this goal, we evaluated several Convolutional Neural Network (CNN) models on the teaLeafBD dataset, among which three show noticeable performance: DenseNet201, MobileNetV2, and InceptionV3. The teaLeafBD dataset contains seven classes, six disease classes and one healthy class, collected under varied field conditions that reflect real-world challenges. Among these CNN models, DenseNet201 achieved the highest test accuracy of 99%. To enhance model reliability and interpretability, we implemented Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and applied adversarial training to increase the model's noise resistance. Finally, we developed a prototype to bring the model's capabilities to real-life agriculture. This paper illustrates the capability of deep learning models to classify diseases in real-life tea leaf disease detection and management.
[CV-17] Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis
【速读】:该论文旨在解决纵向多参数磁共振成像(multi-parametric MRI, mpMRI)分析中缺乏稳定几何参考框架的问题,尤其在无需分割标签或监督分类的情况下实现对组织状态变化的定量追踪。解决方案的关键在于构建患者特异性的能量流形(energy manifold),通过单次基线扫描学习一个隐式神经表示的能量函数 $ E_\theta(\mathbf{u}) $,该函数定义在序列空间(sequence space)中,其中每个体素由其多序列强度向量(如 T1、T1c、T2、FLAIR、ADC)表征。该能量流形作为固定几何参考,编码诊断时观察到的对比度组合,并用于评估随访扫描中序列向量分布相对于该基准的变化:局部极小值对应组织盆地,梯度模长反映接近边界程度,拉普拉斯曲率刻画局部约束结构;纵向分析聚焦于能量偏差与序列空间中的方向位移,从而在肿瘤复发前即检测出组织状态的系统性偏离,证明了基于能量流形的几何参考系统在神经肿瘤学中进行无监督组织风险区追踪的可行性。
链接: https://arxiv.org/abs/2604.07180
作者: Kartikay Tehlan,Lukas Förner,Nico Schmutzenhofer,Michael Frühwald,Matthias Wagner,Nassir Navab,Thomas Wendler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The code is available at this https URL
Abstract:We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector (T1, T1c, T2, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function E_\theta(\mathbf{u}) over \mathbb{R}^d from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.
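文中用能量函数的梯度模长刻画"距离组织盆地边界的远近"、用拉普拉斯曲率刻画局部约束结构。下面以一个玩具高斯混合能量示意这两个量的数值计算(原文能量由隐式神经表示经去噪得分匹配学得,此处的能量形式与中心位置均为本文假设):

```python
import numpy as np

def energy(u, centers):
    """Toy energy: negative log of a Gaussian mixture around tissue prototypes."""
    d2 = ((u[None, :] - centers) ** 2).sum(axis=1)
    return -np.log(np.exp(-d2).sum())

def grad_norm_and_laplacian(u, centers, h=1e-3):
    """Central-difference gradient norm and Laplacian of the energy at u."""
    d = len(u)
    g = np.zeros(d)
    lap = 0.0
    E0 = energy(u, centers)
    for i in range(d):
        e = np.zeros(d); e[i] = h
        Ep, Em = energy(u + e, centers), energy(u - e, centers)
        g[i] = (Ep - Em) / (2 * h)
        lap += (Ep - 2 * E0 + Em) / h ** 2
    return np.linalg.norm(g), lap
```

在组织盆地中心(局部极小值)梯度模长接近零、曲率为正;偏离中心后梯度模长增大,对应"靠近组织边界"的解读。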
[CV-18] Multiple Domain Generalization Using Category Information Independent of Domain Differences
【速读】:该论文旨在解决域泛化(Domain Generalization)问题,即模型在训练数据所在域(source domain)上表现良好,但在未见过的新域(target domain)上性能显著下降的问题。这种性能下降通常由环境差异(如成像设备、染色方法等)导致的域间分布偏移引起。解决方案的关键在于提出一种分离机制:将与域无关的类别信息(如血管或细胞核的结构特征)从源域特异性信息中解耦出来;同时利用随机量化变分自编码器(Stochastically Quantized Variational AutoEncoder, SQ-VAE)中的量子向量来吸收残余域差距,从而提升模型在跨域场景下的分割准确性。
链接: https://arxiv.org/abs/2604.07175
作者: Reiji Saito,Kazuhiro Hotta
机构: Meijo University (明治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. We propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Although we extract independent information of domain differences, this cannot completely bridge the domain gap between training and test data. Therefore, we absorb the domain gap using the quantum vectors in Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our methods improved the accuracy compared to conventional methods.
[CV-19] Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations
【速读】:该论文旨在解决多模态影像分析中缺乏对共享信息与模态特异性信息明确区分的问题,这在临床实践中至关重要,因为它能揭示各模态不可替代的贡献并指导合理的成像策略。其解决方案的关键在于提出一种子空间分解框架,将多模态融合重新建模为正交子空间分离问题而非图像翻译任务:通过训练一个基于多参数MRI的强度非空间隐式神经表示(INR),将MRI特征向量映射至PSMA PET摄取值,并引入基于奇异值分解(SVD)的投影正则化项,惩罚残差成分落在MRI特征流形内的部分,从而强制组织水平生理属性(结构、扩散、灌注)与细胞内PSMA表达之间在数学上正交。实验表明,肿瘤区域的正交残差最大,说明PSMA PET包含无法由MRI生理描述符恢复的信息,实现了基于表示几何的模态互补性结构化表征。
链接: https://arxiv.org/abs/2604.07154
作者: Sonja Adomeit,Kartikay Tehlan,Lukas Förner,Katharina Weisser,Helen Scholtiseek,David Kaufmann,Julie Steinestel,Constantin Lapa,Thomas Kröncke,Thomas Wendler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The code is available at this https URL
Abstract:Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality-specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity-based, non-spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection-based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue-level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.
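该文基于 SVD 的投影正则,本质上是把残差分解为"落在 MRI 特征张成空间内"与"与之正交"两部分。下面给出这个分解的线性化示意(玩具版本,非论文实现;论文中的特征流形由 INR 特征构成,此处直接用特征矩阵的列空间近似):

```python
import numpy as np

def split_residual(F, r, rel_tol=1e-10):
    """Split r into the part in the column space of F plus the remainder.

    F: (N, d) MRI feature vectors (one row per voxel, linearised manifold)
    r: (N,)   residual signal (e.g. PET minus the MRI-explainable envelope)
    Returns (r_in, r_orth) with r = r_in + r_orth and r_orth orthogonal
    to span(F).
    """
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    U = U[:, s > rel_tol * s[0]]        # keep numerically non-degenerate directions
    r_in = U @ (U.T @ r)                # orthogonal projection onto span(F)
    return r_in, r - r_in
```

训练时惩罚残差在 span(F) 内的分量,就迫使模型把一切可由 MRI 特征解释的信号吸收进包络,只在正交分量里留下 MRI 无法表达的信息(论文中即肿瘤区域的 PSMA 信号)。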
[CV-20] An RTK-SLAM Dataset for Absolute Accuracy Evaluation in GNSS-Degraded Environments
【速读】:该论文旨在解决RTK-SLAM(实时动态载波相位差分-同时定位与地图构建)系统在评估其全局精度时存在的关键问题:现有标准评价指标Absolute Trajectory Error(ATE)通过SE(3)对齐(即最优刚体变换)来计算误差,这种做法会吸收全局漂移和系统性误差,导致轨迹看似准确而实际全球定位精度被高估。为揭示这一差距,论文提出了一种大地测量学参考的数据集和评估方法,其核心设计原则是将RTK接收机仅作为系统输入,而真实值则独立地通过大地测量全站仪(geodetic total station)建立,从而避免了传统数据集中GNSS作为部分真值所引发的偏差。该方案首次实现了对RTK-SLAM系统绝对精度与SE(3)对齐后相对精度的直接对比,实验证明SE(3)对齐可能低估绝对定位误差高达76%,并明确展示了RTK-SLAM在开阔天空下可达到厘米级绝对精度、室内环境下保持分米级全局精度的能力。
链接: https://arxiv.org/abs/2604.07151
作者: Wei Zhang,Vincent Ress,David Skuddis,Uwe Soergel,Norbert Haala
机构: Institute for Photogrammetry and Geoinformatics, University of Stuttgart, Germany(德国斯图加特大学摄影测量与地理信息研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISPRS congress 2026
Abstract:RTK-SLAM systems integrate simultaneous localization and mapping (SLAM) with real-time kinematic (RTK) GNSS positioning, promising both relative consistency and globally referenced coordinates for efficient georeferenced surveying. A critical and underappreciated issue is that the standard evaluation metric, Absolute Trajectory Error (ATE), first fits an optimal rigid-body transformation between the estimated trajectory and reference before computing errors. This so-called SE(3) alignment absorbs global drift and systematic errors, making trajectories appear more accurate than they are in practice, and is unsuitable for evaluating the global accuracy of RTK-SLAM. We present a geodetically referenced dataset and evaluation methodology that expose this gap. A key design principle is that the RTK receiver is used solely as a system input, while ground truth is established independently via a geodetic total station. This separation is absent from all existing datasets, where GNSS typically serves as (part of) the ground truth. The dataset is collected with a handheld RTK-SLAM device, comprising two scenes. We evaluate LiDAR-inertial, visual-inertial, and LiDAR-visual-inertial RTK-SLAM systems alongside standalone RTK, reporting direct global accuracy and SE(3)-aligned relative accuracy to make the gap explicit. Results show that SE(3) alignment can underestimate absolute positioning error by up to 76%. RTK-SLAM achieves centimeter-level absolute accuracy in open-sky conditions and maintains decimeter-level global accuracy indoors, where standalone RTK degrades to tens of meters. The dataset, calibration files, and evaluation scripts are publicly available at this https URL.
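该数据集的核心论点是:SE(3) 对齐会吸收全局漂移,使 ATE 低估绝对误差。下面用 Kabsch/Umeyama 刚体对齐(仅旋转加平移,不含尺度)示意"对齐前后误差差异可以有多大"(玩具轨迹,非论文的评测脚本):

```python
import numpy as np

def se3_align(est, gt):
    """Kabsch/Umeyama rigid alignment: rotation + translation, no scale."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    t = mu_g - R @ mu_e
    return est @ R.T + t

def ate_rmse(est, gt):
    """Root-mean-square absolute trajectory error over paired positions."""
    return np.sqrt(((est - gt) ** 2).sum(axis=1).mean())
```

对一条被刚体变换过的估计轨迹,对齐后的 ATE 可以降到接近零,而未对齐的"绝对" ATE 仍然很大——这正是论文主张直接报告全局精度、而非只报 SE(3) 对齐后相对精度的原因。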
[CV-21] Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering
【速读】:该论文旨在解决知识增强型视觉问答(Knowledge-based Visual Question Answering, KB-VQA)中现有检索增强生成(Retrieval-Augmented Generation, RAG)方法存在的两大问题:一是固定流水线设计难以适应多样化问题类型,二是检索与推理过程分离导致模型无法动态决策何时检索、如何优化查询或何时终止,从而造成检索证据与问题语义对齐度低。解决方案的关键在于将KB-VQA建模为一个搜索代理(search-agent)问题,将其求解过程视为多步决策过程,在每一步根据当前信息状态选择“回答”、“图像检索”、“文本检索”或“基于图像描述的检索”四类动作之一,并通过自动化收集多步轨迹数据作为监督信号进行微调,从而实现检索与推理的协同优化和动态控制。
链接: https://arxiv.org/abs/2604.07146
作者: Zhuohong Chen,Zhenxian Wu,Yunyao Yu,Hangrui Xu,Zirui Liao,Zhifang Liu,Xiangwen Deng,Pen Jiao,Haoqian Wang
机构: Tsinghua University(清华大学); University of Arizona(亚利桑那大学); Hefei University of Technology(合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions-Answer, Image Retrieval, Text Retrieval, and Caption-based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent’s reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
[CV-22] USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification ALT
【速读】:该论文旨在解决肾结石术前分类困难的问题,现有方法依赖术后标本分析,无法实现手术前的快速准确分类。解决方案的关键在于提出一种名为尿路结石分割与分类网络(Urinary Stone Segmentation and Classification Network, USCNet)的新方法,其核心创新是基于Transformer的多模态融合框架,结合CT图像与电子健康记录(Electronic Health Records, EHR)数据,并引入CT-EHR注意力机制和分割引导注意力模块,以提升分类精度;同时设计动态损失函数有效平衡分割与分类两个任务的目标,从而实现术前精准分类,显著优于现有主流方法。
链接: https://arxiv.org/abs/2604.07141
作者: Changmiao Wang,Songqi Zhang,Yongquan Zhang,Yifei Wang,Liya Liu,Nannan Li,Xingzhi Li,Jiexin Pan,Yi Jiang,Xiang Wan,Hai Wang,Ahmed Elazab
机构: Shenzhen Research Institute of Big Data(深圳大数据研究院); Zhejiang University of Finance and Economics(浙江财经大学); Anhui University of Finance and Economics(安徽财经大学); The Second Affiliated Hospital of Chinese University of Hong Kong (Longgang District People’s Hospital of Shenzhen)(香港中文大学第二附属医院(深圳市龙岗区人民医院)); Murdoch University(默多克大学); Macau University of Science and Technology(澳门科技大学); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Journal of Biomedical and Health Informatics. Early Access
Abstract:Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: this https URL.
[CV-23] CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research CVPR2026
【Quick Read】: This paper addresses the problem that strict legal and ethical restrictions prevent Child Sexual Abuse Imagery (CSAI) classification datasets from being shared publicly, hindering reproducibility and progress in computer vision research. The key to the solution is CSA-Graphs, a privacy-preserving structural dataset: instead of releasing original images, it preserves contextual information through two complementary graph representations, scene graphs describing inter-object relationships and skeleton graphs encoding human pose. Experiments show that both representations effectively support CSAI classification and that combining them further improves performance, advancing child-safety vision research while ensuring compliance.
Link: https://arxiv.org/abs/2604.07132
Authors: Carlos Caetano, Camila Laranjeira, Clara Ernesto, Artur Barros, João Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila
Affiliations: Universidade Estadual de Campinas (UNICAMP), Brazil; Instituto Federal de Educação, Ciência e Tecnologia de Minas Gerais (IFMG), Brazil; Universidade de São Paulo (USP), Brazil; Universidade Federal de Minas Gerais (UFMG), Brazil; Polícia Federal (PF), Brazil; University of Sheffield, England, United Kingdom
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Conference on Computer Vision and Pattern Recognition (CVPR 2026), in the Workshop on Computer Vision for Children (CV4CHL)
Abstract:Child Sexual Abuse Imagery (CSAI) classification is an important yet challenging problem for computer vision research due to the strict legal and ethical restrictions that prevent the public sharing of CSAI datasets. This limitation hinders reproducibility and slows progress in developing automated methods. In this work, we introduce CSA-Graphs, a privacy-preserving structural dataset. Instead of releasing the original images, we provide structural representations that remove explicit visual content while preserving contextual information. CSA-Graphs includes two complementary graph-based modalities: scene graphs describing object relationships and skeleton graphs encoding human pose. Experiments show that both representations retain useful information for classifying CSAI, and that combining them further improves performance. This dataset enables broader research on computer vision methods for child safety while respecting legal and ethical constraints.
[CV-24] A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
【Quick Read】: This paper addresses the tension between privacy protection and data utility in cross-hospital radiology data sharing. Existing de-identification methods remove identifiable information, but whether the de-identified data retain enough diagnostically relevant signal for large-scale vision-language model training and cross-institution transfer has not been validated. The key to the solution is a Utility-Preserving De-identification Pipeline (UPDP): it compiles a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms, and uses a generative filtering mechanism to synthesize image counterparts that strip private information while preserving pathological features; combined with de-identified report text, this enables secure sharing that maintains diagnostic accuracy and cross-hospital transfer performance.
Link: https://arxiv.org/abs/2604.07128
Authors: Chenhao Liu, Zelin Wen, Yan Tong, Junjie Zhu, Xinyu Tian, Yuchi Liu, Ashu Gupta, Syed M. S. Islam, Tom Gedeon, Yue Yao
Affiliations: Shandong University; Australian National University; Hong Kong University of Science and Technology; Curtin University; Edith Cowan University; University of Western Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.
[CV-25] Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator
【Quick Read】: This paper addresses two key problems of semi-supervised learning for medical image semantic segmentation: first, conventional methods such as ClassMix mix images using pseudo-labels predicted from unlabeled images, risking inaccurate labels; second, the quality gap between labeled and unlabeled images can cause inconsistent feature-map distributions and degrade performance. The key to the solution is twofold: a new mixing strategy that pastes class labels from labeled images, together with the corresponding image regions, onto unlabeled images and their pseudo-label maps to improve label reliability; and a feature-alignment mechanism that trains the model to make its predictions on unlabeled images more similar to those on labeled images, narrowing the feature-distribution gap. Experiments on the Chase and COVID-19 datasets show an average improvement of 2.07% mIoU over conventional semi-supervised methods.
Link: https://arxiv.org/abs/2604.07122
Authors: Takahiro Mano, Reiji Saito, Kazuhiro Hotta
Affiliations: Meijo University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.
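The proposed mixing strategy, pasting labeled regions and their ground-truth labels onto unlabeled images and their pseudo-labels, can be sketched in a few lines of NumPy. The function name and toy shapes below are illustrative assumptions, not code from the paper:

```python
import numpy as np

def supervised_classmix(img_l, lbl_l, img_u, plbl_u, classes):
    """Paste pixels of the chosen classes from a labeled image, together
    with their ground-truth labels, onto an unlabeled image and its
    pseudo-label map (a sketch of the strategy described above)."""
    mask = np.isin(lbl_l, classes)                  # regions to transplant
    img_mix = np.where(mask[..., None], img_l, img_u)
    lbl_mix = np.where(mask, lbl_l, plbl_u)         # pasted pixels keep true labels
    return img_mix, lbl_mix

# toy example: 4x4 RGB images; class 1 occupies the left half of the labeled image
img_l = np.ones((4, 4, 3))
img_u = np.zeros((4, 4, 3))
lbl_l = np.array([[1, 1, 0, 0]] * 4)
plbl_u = np.full((4, 4), 2)                         # pseudo-labels, possibly noisy
img_mix, lbl_mix = supervised_classmix(img_l, lbl_l, img_u, plbl_u, classes=[1])
```

Because the pasted pixels come from a labeled image, the transplanted labels are exact, which is the point of the method: the mixed sample carries less pseudo-label noise than a ClassMix-style mix.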
[CV-26] Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment
【Quick Read】: This paper addresses the high latency, limited bandwidth, and weak autonomous observation-prioritisation capability that result from current Earth observation (EO) services' reliance on ground-based processing pipelines. The key to the solution is onboard intelligent processing: the IRIDE HEO system generates data products in orbit, improving responsiveness, spatial detail (sub-3 m ground sampling distance), and minimum detectable event size (3 hectares). Rather than replacing existing Copernicus services, it is positioned as a complementary layer that provides image-driven pre-classification to support downstream emergency and land-management workflows.
Link: https://arxiv.org/abs/2604.07120
Authors: Parampuneet Kaur Thind, Charles Mwangi, Giovanni Varetto, Lorenzo Sarti, Andrea Papa, Andrea Taramelli
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments:
Abstract:Current operational Earth Observation (EO) services, including the Copernicus Emergency Management Service (CEMS), the European Forest Fire Information System (EFFIS), and the Copernicus Land Monitoring Service (CLMS), rely primarily on ground-based processing pipelines. While these systems provide mature large-scale information products, they remain constrained by downlink latency, bandwidth limitations, and limited capability for autonomous observation prioritisation. The International Report for an Innovative Defence of Earth (IRIDE) programme is a national Earth observation initiative led by the Italian government to support public authorities through timely, objective information derived from spaceborne data. Rather than a single constellation, IRIDE is designed as a constellation of constellations, integrating heterogeneous sensing technologies within a unified service-oriented architecture. Within this framework, Hawk for Earth Observation (HEO) enables onboard generation of data products, allowing information extraction earlier in the processing chain. This paper examines the limitations of ground-only architectures and evaluates the added value of onboard processing at the operational service level. The IRIDE burnt-area mapping service is used as a representative case study to demonstrate how onboard intelligence can support higher spatial detail (sub-three-metre ground sampling distance), smaller detectable events (minimum mapping unit of three hectares), and improved system responsiveness. Rather than replacing existing Copernicus services, the IRIDE HEO capability is positioned as a complementary layer providing image-driven pre-classification to support downstream emergency and land-management workflows. This work highlights the operational value of onboard intelligence for emerging low-latency EO service architectures.
[CV-27] SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
【Quick Read】: This paper addresses the poor generalization of existing image forgery detectors to surveillance scenarios. Mainstream detectors are trained on datasets with full-image synthesis or large manipulated regions, and struggle with surveillance imagery, where tampering is localised and subtle and scenes feature varied viewpoints, small or occluded subjects, and lower visual quality. The key to the solution is SurFITR, a large surveillance-style forged-image test range built with a multimodal-LLM-powered generation pipeline that enables semantically aware, fine-grained editing across diverse surveillance scenes, providing realistic and varied forgery samples. Experiments show that training on SurFITR substantially improves both in-domain and cross-domain detection performance.
Link: https://arxiv.org/abs/2604.07101
Authors: Qizhou Wang, Guansong Pang, Christopher Leckie
Affiliations: The University of Melbourne; Singapore Management University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:
Abstract:We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.
[CV-28] Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples CVPR2026
【Quick Read】: This paper addresses the ambiguity of defining normal samples in industrial settings: samples with small scratches, dust, or foreign objects may still be acceptable, yet stricter standards (for example, after equipment upgrades) can shift that boundary, while conventional anomaly detection trains on purely normal samples and cannot adapt. The authors propose new evaluation scenarios and metrics for such specification changes, together with the key solution, RePaste, which iteratively re-pastes regions with high anomaly scores from the previous step into the input for the next step, strengthening the learning of subtle anomalous features. On the MVTec AD benchmark, RePaste achieves state-of-the-art performance under the proposed metric while maintaining high AUROC and PRO scores.
Link: https://arxiv.org/abs/2604.07097
Authors: Reiji Saito, Satoshi Kamiya, Kazuhiro Hotta
Affiliations: Meijo University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 Workshop
Abstract:In conventional anomaly detection, training data consist of only normal samples. However, in real-world scenarios, the definition of a normal sample is often ambiguous. For example, there are cases where a sample has small scratches or stains but is still acceptable for practical usage. On the other hand, higher precision is required when manufacturing equipment is upgraded. In such cases, normal samples may include small scratches, tiny dust particles, or a foreign object that we would prefer to classify as an anomaly. Such cases frequently occur in industrial settings, yet they have not been discussed until now. Thus, we propose novel scenarios and an evaluation metric to accommodate specification changes in real-world applications. Furthermore, to address the ambiguity of normal samples, we propose the RePaste, which enhances learning by re-pasting regions with high anomaly scores from the previous step into the input for the next step. On our scenarios using the MVTec AD benchmark, RePaste achieved the state-of-the-art performance with respect to the proposed evaluation metric, while maintaining high AUROC and PRO scores. Code: this https URL
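The re-pasting idea can be sketched as below. This is a hypothetical minimal reading of RePaste: the threshold, the use of a normal reference image, and all names are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def repaste_step(normal_ref, image, anomaly_map, threshold=0.5):
    """One illustrative RePaste-style iteration: regions whose anomaly
    score from the previous step exceeds `threshold` are pasted onto a
    normal reference, so the next step re-examines the suspicious regions."""
    mask = anomaly_map > threshold
    next_input = normal_ref.copy()
    next_input[mask] = image[mask]    # re-paste only the high-score regions
    return next_input, mask

# toy example: a single high-anomaly pixel survives into the next input
ref = np.zeros((4, 4))
img = np.full((4, 4), 9.0)
scores = np.zeros((4, 4))
scores[1, 1] = 0.9                    # high anomaly score from the previous step
nxt, mask = repaste_step(ref, img, scores)
```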
[CV-29] Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data
【Quick Read】: This paper addresses the high barrier to using multi-temporal spaceborne Earth observation (EO) data in downstream tasks, caused by the complexity of data access and preprocessing, which makes fine-tuning on raw satellite data costly and demanding for end-users. The key to the solution is LIANet (Location Is All You Need Network), a coordinate-based neural representation that models multi-temporal EO data for a region of interest as a continuous spatiotemporal neural field: given only spatial and temporal coordinates, it reconstructs the corresponding satellite imagery, and once pretrained it can be fine-tuned for downstream tasks using labels alone, without access to the original satellite data, substantially lowering the deployment barrier.
Link: https://arxiv.org/abs/2604.07092
Authors: Mojgan Madadikhaljan, Jonathan Prexl, Isabelle Wittmann, Conrad M Albrecht, Michael Schmitt
Affiliations: University of the Bundeswehr Munich, Germany; IBM Research – Europe; Columbia University, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this work, we present LIANet (Location Is All You Need Network), a coordinate-based neural representation that models multi-temporal spaceborne Earth observation (EO) data for a given region of interest as a continuous spatiotemporal neural field. Given only spatial and temporal coordinates, LIANet reconstructs the corresponding satellite imagery. Once pretrained, this neural representation can be adapted to various EO downstream tasks, such as semantic segmentation or pixel-wise regression, importantly, without requiring access to the original satellite data. LIANet intends to serve as a user-friendly alternative to Geospatial Foundation Models (GFMs) by eliminating the overhead of data access and preprocessing for end-users and enabling fine-tuning solely based on labels. We demonstrate the pretraining of LIANet across target areas of varying sizes and show that fine-tuning it for downstream tasks achieves competitive performance compared to training from scratch or using established GFMs. The source code and datasets are publicly available at this https URL.
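The interface of such a coordinate-based representation can be sketched as a tiny MLP mapping (x, y, t) coordinates to spectral band values. The architecture, sizes, and weights below are illustrative assumptions, not LIANet's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

class CoordField:
    """Toy spatiotemporal neural field: (x, y, t) in, band values out.
    A trained model of this shape could be queried at any coordinate
    without access to the original satellite data."""
    def __init__(self, hidden=32, bands=4):
        self.w1 = rng.normal(0.0, 1.0, (3, hidden))    # random, untrained
        self.w2 = rng.normal(0.0, 1.0, (hidden, bands))

    def __call__(self, coords):        # coords: (N, 3) = (x, y, t)
        h = np.tanh(coords @ self.w1)  # hidden activations
        return h @ self.w2             # (N, bands) reconstructed values

field = CoordField()
pixels = field(np.array([[0.1, 0.2, 0.0],
                         [0.5, 0.5, 1.0]]))
```

The design point is that the queryable object is the network itself: downstream users need only coordinates and labels, never the raw imagery.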
[CV-30] AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
【Quick Read】: This paper addresses the tight coupling between 3D Gaussians and input images in current feed-forward Gaussian reconstruction models, whose pixel-aligned formulation limits generalization across image resolutions and view counts and adds unnecessary computation. The key to the solution is AnchorSplat, which introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds) so that the Gaussians are modeled directly in 3D space, independent of image resolution and number of views; a Gaussian Refiner further adjusts the intermediate Gaussians with only a few forward passes. The design substantially reduces the number of required primitives while improving fidelity, achieving state-of-the-art performance on the ScanNet++ v2 NVS benchmark.
Link: https://arxiv.org/abs/2604.07053
Authors: Xiaoxue Zhang, Xiaoxu Zheng, Yixuan Yin, Tiao Zhao, Kaihua Tang, Michael Bi Mi, Zhan Xu, Dave Zhenyu Chen
Affiliations: Huawei Technologies Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent feed-forward Gaussian reconstruction models adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, entangling Gaussian representations tightly with the input images. In this paper, we propose AnchorSplat, a novel feed-forward 3DGS framework for scene-level reconstruction that represents the scene directly in 3D space. AnchorSplat introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling a more geometry-aware renderable 3D Gaussian representation that is independent of image resolution and the number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. Beyond the anchor-aligned design, we utilize a Gaussian Refiner to adjust the intermediate Gaussians with merely a few forward passes. Experiments on the ScanNet++ v2 NVS benchmark demonstrate state-of-the-art performance, outperforming previous methods with more view-consistent results and substantially fewer Gaussian primitives.
[CV-31] PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing
【Quick Read】: This paper targets three key challenges of real-world image dehazing (RID): non-uniform haze distribution, spatially varying illumination from multiple light sources, and the scarcity of paired real hazy-clean data. The key to the solution, PRISM, is Proximal Scattered Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and the scattering variables under the atmospheric scattering model, improving reliability in complex regions and mixed lighting. An online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme bridge the synthetic-to-real gap: without paired real data, the model selectively learns from high-quality perceptual targets while using its intrinsic scattering understanding to audit residual haze and guide self-refinement.
Link: https://arxiv.org/abs/2604.07048
Authors: Chengyu Fang, Chunming He, Yuelin Zhang, Chubin Chen, Chenyang Zhu, Longxiang Tang, Xiu Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 7 figures
Abstract:Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying illumination from multiple light sources, and the scarcity of paired real hazy-clean data. At the core of PRISM, we propose Proximal Scattered Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, thereby improving reliability in complex regions and mixed-light conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Extensive experiments on real-world benchmarks demonstrate that PRISM achieves state-of-the-art performance on RID tasks.
[CV-32] KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis ICRA2026
【Quick Read】: This paper addresses the lack of structured, interpretable vision-language inputs for long robot-execution videos, which limits VLM performance on robot failure analysis. The key to the solution is KITE, a training-free front-end that compresses long robot trajectories into compact, keyframe-anchored, layout-grounded tokenized evidence: it extracts motion-salient keyframes with open-vocabulary detections, pairs each keyframe with a schematic bird's-eye-view (BEV) representation encoding relative object layout, axes, timestamps, and detection confidence, and serializes these with robot-profile and scene-context tokens into a unified prompt, allowing an off-the-shelf VLM to handle failure detection, identification, localization, explanation, and correction.
Link: https://arxiv.org/abs/2604.07034
Authors: Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub
Affiliations: Australian Institute for Machine Learning; Adelaide University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2026; Project page: this https URL
Abstract:We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: this https URL
[CV-33] Not all tokens contribute equally to diffusion learning
【Quick Read】: This paper addresses biased or incomplete text-to-video generations that arise when conditional diffusion models neglect semantically important tokens under classifier-free guidance. The problem stems from two factors: distributional bias caused by the long-tailed token-frequency distribution of the training data, and spatial misalignment in cross-attention, where semantically rich tokens are overshadowed by less informative ones. The key to the solution is the unified framework DARE (Distribution-Aware Rectification and Spatial Ensemble), with two core techniques: Distribution-Rectified Classifier-Free Guidance (DR-CFG), which dynamically suppresses dominant tokens of low semantic density so the model attends to rare but important semantic cues and learns a more balanced conditional distribution; and Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps so that high-semantic-density tokens exert stronger spatial guidance, preventing low-semantic-density tokens from dominating attention allocation and preserving semantic and spatial consistency in the generated results.
Link: https://arxiv.org/abs/2604.07026
Authors: Guoqing Zhang, Lu Shi, Wanru Xu, Linna Zhang, Sen Wang, Fangfang Wang, Yigang Cen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.
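For context, classifier-free guidance, the mechanism DR-CFG regularizes, combines an unconditional and a conditional prediction as follows. This is the standard CFG formulation, not code from the paper:

```python
def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# scale = 1 recovers the conditional prediction; larger scales amplify
# the conditioning signal (and any token-level bias contained in it)
out = cfg(0.0, 1.0, scale=7.5)
```

Because the guided prediction amplifies whatever the conditional branch encodes, any token-frequency bias in that branch is amplified too, which is the failure mode DARE targets.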
[CV-34] ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation
【Quick Read】: This paper addresses two limitations of weakly supervised semantic segmentation (WSSS): the entanglement of semantic recognition and object localization, which drives models to focus only on sparse discriminative regions, and existing methods' difficulty with pseudo-label noise and unstable training. The key to the solution is ModuSeg, a training-free framework that explicitly decouples object discovery from semantic assignment: a general mask proposer produces geometric proposals with reliable boundaries, while semantic foundation models build an offline feature bank that turns segmentation into a non-parametric feature-retrieval process. Semantic boundary purification and soft-masked feature aggregation further mitigate boundary ambiguity and quantization errors, extracting high-quality class prototypes; without parameter fine-tuning, the method better preserves fine boundaries and achieves highly competitive performance.
Link: https://arxiv.org/abs/2604.07021
Authors: Qingze He, Fagui Liu, Dengke Zhang, Qingmao Wei, Quan Tang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Weakly supervised semantic segmentation aims to achieve pixel-level predictions using image-level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo-label noise and often relying on time-consuming multi-stage retraining or unstable end-to-end joint optimization. To address the above challenges, we present ModuSeg, a training-free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non-parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft-masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high-quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at this https URL.
[CV-35] Synthetic Dataset Generation for Partially Observed Indoor Objects
【Quick Read】: This paper addresses the lack of large, well-annotated datasets for learning-based 3D scene reconstruction and object completion: acquiring partial scans paired with accurate ground-truth geometry for occluded regions through real-world scanning is costly and slow. The key to the solution is a Unity-based virtual scanning framework that simulates real scanner behavior through configurable parameters (resolution, measurement range, and distance-dependent noise models) and performs ray-based scanning from virtual viewpoints to accurately model sensor visibility and occlusion, with panoramic images used to assign colours to the point clouds. A procedural indoor scene generation pipeline makes the creation of diverse room layouts and furniture arrangements scalable, yielding the V-Scan dataset of object-level partial point clouds, voxel occlusion grids, and complete ground-truth geometry for training and evaluating learning-based methods.
Link: https://arxiv.org/abs/2604.07010
Authors: Jelle Vermandere, Maarten Bassier, Maarten Vergauwen
Affiliations: KU Leuven, Department of Civil Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the V-Scan dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.
[CV-36] IQ-LUT: interpolated and quantized LUT for efficient image super-resolution
【Quick Read】: This paper addresses the storage bottleneck of lookup-table (LUT) methods for accelerating super-resolution (SR) inference: as the receptive field and bit depth grow, the index space grows exponentially, limiting deployment on resource-constrained devices. The key to the solution is threefold: integrating interpolation and quantization into a single-input multiple-output ECNN, which sharply compresses the index space and hence the overall LUT size; introducing residual learning to reduce the dependence on high bit depth, stabilizing training and improving fine-detail reconstruction; and a knowledge-distillation-guided non-uniform quantization that optimizes the distribution of quantization levels, cutting storage while compensating for quantization loss.
Link: https://arxiv.org/abs/2604.07000
Authors: Yuxuan Zhang, Zhikai Dong, Xinning Chai, Xiangyun Zhou, Yi Xu, Zhengxue Cheng, Li Song
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Lookup table (LUT) methods demonstrate considerable potential in accelerating image super-resolution inference. However, pursuing higher image quality through larger receptive fields and bit-depth triggers exponential growth in the LUT’s index space, creating a storage bottleneck that limits deployment on resource-constrained devices. We introduce IQ-LUT, which achieves a reduction in LUT size while simultaneously enhancing super-resolution quality. First, we integrate interpolation and quantization into the single-input, multiple-output ECNN, which dramatically reduces the index space and thereby the overall LUT size. Second, the integration of residual learning mitigates the dependence on LUT bit-depth, which facilitates training stability and prioritizes the reconstruction of fine-grained details for superior visual quality. Finally, guided by knowledge distillation, our non-uniform quantization process optimizes the quantization levels, thereby reducing storage while also compensating for quantization loss. Extensive benchmarking demonstrates our approach substantially reduces storage costs (by up to 50x compared to ECNN) while achieving superior super-resolution quality.
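The basic mechanism that lets a small quantized table stand in for a dense mapping, interpolating between adjacent LUT entries, can be sketched as follows. The function and the identity table are illustrative, not the paper's LUT design:

```python
import numpy as np

def lut_lookup(lut, x, bits=4):
    """Look up values x in [0, 1] in a table with 2**bits entries,
    linearly interpolating between the two nearest quantized levels."""
    levels = (1 << bits) - 1
    pos = np.asarray(x) * levels          # continuous position in the table
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, levels)
    frac = pos - lo
    return (1.0 - frac) * lut[lo] + frac * lut[hi]

# identity table: interpolated lookup reproduces the input even for
# values that fall between the 16 stored levels
lut = np.linspace(0.0, 1.0, 16)           # 4-bit table
y = lut_lookup(lut, [0.0, 0.3, 0.5, 1.0])
```

Interpolation is what makes a low-bit-depth table usable: without it, every input would snap to one of the 16 stored levels and quantization error would dominate.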
[CV-37] Generative Phomosaic with Structure-Aligned and Personalized Diffusion
【Quick Read】: This paper addresses the limitations of traditional photomosaic generation, which relies on large tile collections and color-based matching and therefore suffers in both diversity and structural consistency. The key to the solution is a generative framework that synthesizes tile images with diffusion-based generation conditioned on reference images, together with a low-frequency conditioned diffusion mechanism that aligns global structure while preserving prompt-driven details, enabling photomosaics that are both semantically expressive and structurally coherent. With few-shot personalized diffusion, the model can further produce user-specific or stylistically consistent tiles without requiring an extensive image collection.
Link: https://arxiv.org/abs/2604.06989
Authors: Jaeyoung Chung, Hyunjin Son, Kyoung Mu Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:We present the first generative approach to photomosaic creation. Traditional photomosaic methods rely on a large number of tile images and color-based matching, which limits both diversity and structural consistency. Our generative photomosaic framework synthesizes tile images using diffusion-based generation conditioned on reference images. A low-frequency conditioned diffusion mechanism aligns global structure while preserving prompt-driven details. This generative formulation enables photomosaic composition that is both semantically expressive and structurally coherent, effectively overcoming the fundamental limitations of matching-based approaches. By leveraging few-shot personalized diffusion, our model is able to produce user-specific or stylistically consistent tiles without requiring an extensive collection of images.
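Low-frequency conditioning of the kind described, guiding generation with only the coarse structure of a reference, is commonly implemented by blurring or down/up-sampling the reference. The block-averaging sketch below illustrates that idea and is an assumption, not the paper's method:

```python
import numpy as np

def low_frequency(image, k=4):
    """Extract a low-frequency version of an image by averaging k x k
    blocks and broadcasting the block means back to full resolution."""
    h, w = image.shape
    blocks = image.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(blocks, k, axis=0), k, axis=1)

# toy 8x8 "reference": only its coarse structure survives the filter,
# which is what the tile generator would be conditioned on
img = np.arange(64, dtype=float).reshape(8, 8)
lf = low_frequency(img, k=4)
```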
[CV-38] Canopy Tree Height Estimation Using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing AISTATS2026
【Quick Read】: This paper addresses the fact that existing satellite-based tree height estimation models provide only point predictions, which limits their use in risk-sensitive applications. The key to the solution is quantile regression: with only minor modifications to an existing model's prediction head, the model can produce statistically calibrated uncertainty estimates, and the resulting confidence correlates with known remote-sensing challenges such as terrain complexity and vegetation heterogeneity.
Link: https://arxiv.org/abs/2604.06988
Authors: Karsten Schrödter, Jan Pauls, Fabian Gieseke
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AISTATS 2026
Abstract:Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches for tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that, with minor modifications of a given prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.
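The training objective underlying quantile regression is the pinball loss; the sketch below is the standard formulation, not code from the paper:

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: minimizing it drives y_pred toward the
    q-th conditional quantile of y_true, e.g. q=0.9 for an upper bound."""
    err = y_true - y_pred
    return max(q * err, (q - 1.0) * err)

# for q = 0.9, under-prediction (err > 0) costs 9x more than
# over-prediction, pushing the head toward an upper quantile
under = pinball_loss(20.0, 18.0, 0.9)   # true height above prediction
over = pinball_loss(20.0, 22.0, 0.9)    # prediction above true height
```

Training one head per quantile (say q = 0.05, 0.5, 0.95) yields a prediction interval around the median; this is why only the prediction head of an existing model needs to change.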
[CV-39] CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models
【Quick Read】: This paper addresses the insufficiently understood robustness of deep palmprint recognition systems to physically realizable attacks: existing studies are largely confined to the digital domain and fail to account for the texture-dominant nature of palmprint recognition and the distortions introduced during physical acquisition. The key to the solution is CAAP, a capture-aware adversarial patch framework: a cross-shaped patch topology enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long-range texture continuity, while three integrated modules, input-conditioned patch rendering (ASIT), stochastic capture-aware simulation (RaS), and feature-level identity-disruptive guidance (MS-DIFE), yield a universal patch that can be reused across inputs and remains effective under varying acquisition conditions, with strong untargeted and targeted attacks and favorable cross-model, cross-dataset transferability.
Link: https://arxiv.org/abs/2604.06987
Authors: Renyang Liu, Jiale Li, Jie Zhang, Cong Wu, Xiaojun Jia, Shuxin Li, Wei Zhou, Kwok-Yan Lam, See-kiong Ng
Affiliations: Institute of Data Science, National University of Singapore; Centre for Frontier AI Research (CFAR), A*STAR; School of Cyber Science and Engineering, Wuhan University; College of Computing and Data Science, Nanyang Technological University; School of Engineering, Yunnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital setting and do not adequately account for the texture-dominant nature of palmprint recognition or the distortions introduced during physical acquisition. To address this gap, we propose CAAP, a capture-aware adversarial patch framework for palmprint recognition. CAAP learns a universal patch that can be reused across inputs while remaining effective under realistic acquisition variation. To match the structural characteristics of palmprints, the framework adopts a cross-shaped patch topology, which enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long-range texture continuity. CAAP further integrates three modules: ASIT for input-conditioned patch rendering, RaS for stochastic capture-aware simulation, and MS-DIFE for feature-level identity-disruptive guidance. We evaluate CAAP on the Tongji, IITD, and AISEC datasets against generic CNN backbones and palmprint-specific recognition models. Experiments show that CAAP achieves strong untargeted and targeted attack performance with favorable cross-model and cross-dataset transferability. The results further show that, although adversarial training can partially reduce the attack success rate, substantial residual vulnerability remains. These findings indicate that deep palmprint recognition systems remain vulnerable to physically realizable, capture-aware adversarial patch attacks, underscoring the need for more effective defenses in practice. Code available at this https URL.
[CV-40] MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
[Quick Read]: This paper tackles the difficulty of applying reinforcement learning (RL) to hybrid autoregressive-diffusion frameworks, where interleaved inference and noisy log-probability estimation cause training instability and early performance saturation. The key is a stabilized RL framework with three components: multi-trajectory expectation (MTE), which averages over multiple diffusion trajectories to reduce the gradient noise introduced by the diffusion head; token-wise uncertainty estimation from those trajectories, with multi-trajectory optimization applied only to the top-k% uncertain tokens to avoid over-smoothing; and a consistency-aware token selection strategy that filters out autoregressive tokens misaligned with the final generated content. Together these markedly improve visual quality, training stability, and spatial structure understanding.
Link: https://arxiv.org/abs/2604.06966
Authors: Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, Feng Zhao
Affiliations: University of Science and Technology of China; AMAP, Alibaba Group; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: this https URL.
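The token-wise uncertainty selection described in the abstract can be sketched as follows: estimate per-token variance across diffusion trajectories and keep only the top-k% most uncertain tokens for multi-trajectory optimization. This is an illustrative reconstruction, not the authors' implementation; the shapes and the variance criterion are assumptions:

```python
import random

def topk_uncertain_tokens(logps, k_percent):
    """logps: list of trajectories, each a list of per-token log-prob
    estimates. A token's uncertainty is the variance of its estimates
    across trajectories; return indices of the top-k% most uncertain."""
    n_tokens = len(logps[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    variances = [var([traj[i] for traj in logps]) for i in range(n_tokens)]
    k = max(1, n_tokens * k_percent // 100)
    order = sorted(range(n_tokens), key=lambda i: variances[i], reverse=True)
    return sorted(order[:k])

random.seed(0)
# 8 hypothetical diffusion trajectories, 100 tokens each
logps = [[random.gauss(0, 1) for _ in range(100)] for _ in range(8)]
selected = topk_uncertain_tokens(logps, k_percent=20)  # 20 token indices
```

The remaining tokens would be trained with the ordinary single-trajectory objective, so the averaging only spends compute where the log-probability estimate is noisy.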
[CV-41] Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction
[Quick Read]: This paper addresses fairness in perceptual models for human-robot interaction (HRI), specifically potential age, gender, and race biases in facial landmark detection. While demographic bias in high-level face analysis is well studied, whether low-level vision tasks such as landmark detection exhibit similar biases remains unclear, and such biases could propagate through HRI pipelines and disproportionately harm vulnerable groups. The key to the solution is a controlled statistical methodology that disentangles demographic effects from confounding visual variables such as head pose and image resolution. Empirically, these confounders contribute far more to performance disparities than demographic attributes; after accounting for them, gender- and race-related gaps vanish and only a statistically significant age-related bias remains, underscoring the need to audit and correct biases in low-level vision components for trustworthy and equitable robot perception.
Link: https://arxiv.org/abs/2604.06961
Authors: Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela
Affiliations: Universidad Politécnica de Madrid; Universidad Rey Juan Carlos
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender and race biases. To this end we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher biases observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.
[CV-42] Compression as an Adversarial Amplifier Through Decision Space Reduction
[Quick Read]: This paper studies the poorly understood impact of image compression on the adversarial robustness of deep image classifiers. Compression is ubiquitous on social media platforms and in resource-constrained systems, yet whether and how it affects adversarial stability has been unclear. The paper introduces a new adversarial setting in which attacks are applied directly in the compressed representation space, and finds that compression itself acts as an adversarial amplifier. The key mechanism is decision space reduction: compression is a non-invertible, information-losing transformation that contracts classification margins and increases perturbation sensitivity, weakening robustness. Experiments support this mechanism and reveal a critical vulnerability in compression-in-the-loop deployments.
Link: https://arxiv.org/abs/2604.06954
Authors: Lewis Evans, Harkrishan Jandu, Zihan Ye, Yang Lu, Shreyank N Gowda
Affiliations: University of Nottingham; University of Chinese Academy of Sciences; Xiamen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image compression is a ubiquitous component of modern visual pipelines, routinely applied by social media platforms and resource-constrained systems prior to inference. Despite its prevalence, the impact of compression on adversarial robustness remains poorly understood. We study a previously unexplored adversarial setting in which attacks are applied directly in compressed representations, and show that compression can act as an adversarial amplifier for deep image classifiers. Under identical nominal perturbation budgets, compression-aware attacks are substantially more effective than their pixel-space counterparts. We attribute this effect to decision space reduction, whereby compression induces a non-invertible, information-losing transformation that contracts classification margins and increases sensitivity to perturbations. Extensive experiments across standard benchmarks and architectures support our analysis and reveal a critical vulnerability in compression-in-the-loop deployment settings. Code will be released.
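The decision-space-reduction argument rests on compression being a many-to-one mapping. A toy scalar-quantization round-trip (illustrative only; not the paper's codec or attack, and the step size is an arbitrary choice) shows distinct inputs collapsing to a single compressed code:

```python
def quantize(values, step):
    """JPEG-style scalar quantization: snap each value to a multiple of
    `step`. The rounding discards information, so the map is not invertible."""
    return [round(v / step) * step for v in values]

# Two distinct "pixel blocks" collapse to the same compressed representation,
# so an attacker operating in the compressed domain faces fewer effective
# states between the input and the decision boundary.
a = [10.2, 51.8, 99.1]
b = [11.9, 50.1, 98.0]
qa, qb = quantize(a, 8), quantize(b, 8)   # both become [8, 48, 96]
```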
[CV-43] Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation ACL2026
[Quick Read]: This paper addresses a new security threat to multimodal large language models (MLLMs) deployed for automated content moderation: adversarial smuggling attacks. These attacks exploit the perception and comprehension gap between humans and AI, encoding harmful content into visual forms that humans can read but AI struggles to recognize, thereby evading automated detection. The key contributions are threefold: SmuggleBench, the first comprehensive benchmark (1,700 attack instances) for systematically evaluating MLLM vulnerability; identification of three root causes, namely limited vision-encoder capability, insufficient OCR robustness, and scarcity of domain-specific adversarial examples; and a preliminary exploration of two mitigation strategies, test-time scaling (e.g., CoT) and adversarial training (e.g., SFT), to strengthen defenses against smuggling attacks.
Link: https://arxiv.org/abs/2604.06950
Authors: Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Hellogroup; University of Washington; Jilin University; ShanghaiTech University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACL 2026. 19 pages, 6 figures
Abstract:Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at this https URL.
[CV-44] NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results CVPR
[Quick Read]: This paper addresses video restoration under bitstream corruption, i.e., reconstructing visually coherent video sequences from corrupted bitstreams whose decoding often produces severe spatio-temporal artifacts and content distortion. The key contribution is a unified benchmark built on realistic bitstream-corruption settings, including a standardized dataset, an evaluation protocol, and a systematic analysis of participating methods, advancing robust video restoration and revealing current technical trends and challenges.
Link: https://arxiv.org/abs/2604.06945
Authors: Wenbin Zou, Tianyi Li, Kejun Wu, Huiping Zhuang, Zongwei Wu, Zhuyun Zhou, Radu Timofte, Kim-Hui Yap, Lap-Pui Chau, Yi Wang, Shiqi Zhou, Xiaodi Shi, Yuxiang Chen, Yilian Zhong, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Zhitao Wang, Lifa Ha, Hengyu Man, Xiaopeng Fan, Priyansh Singh, Sidharth, Krrish Dev, Soham Kakkar, Vinit Jakhetiya, Ovais Iqbal Shah, Wei Zhou, Linfeng Li, Qi Xu, Zhenyang Liu, Kepeng Xu, Tong Qiao, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi
Affiliations: MGTV; Xiaohongshu INC; Fudan University; Xi’an Jiaotong University; Harbin Institute of Technology; Indian Institute of Technology Jammu; National University of Singapore; Shanghai Jiao Tong University; Xidian University; University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, 1 table, CVPRW2026 NTIRE Challenge Report
Abstract:This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.
[CV-45] Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
[Quick Read]: This paper targets three core challenges of long-horizon autoregressive video generation: semantic forgetting from context limits, visual drift from positional extrapolation, and controllability loss when interactive instructions switch. Existing methods treat these in isolation and struggle to guarantee long-term consistency. The key to the proposed Grounded Forcing framework is unifying time-independent semantics with proximal dynamics through three coupled mechanisms: a Dual Memory KV Cache that separates local temporal dynamics from global semantic anchors to maintain semantic coherence and identity stability; Dual-Reference RoPE Injection, which confines positional embeddings to the training manifold and makes global semantics time-invariant; and Asymmetric Proximity Recache, which enables smooth semantic inheritance across prompt switches via proximity-weighted cache updates. Together these tether generation to a stable semantic core while allowing flexible local dynamics, substantially improving long-range consistency and visual stability.
Link: https://arxiv.org/abs/2604.06939
Authors: Jintao Chen, Chengyu Bai, Junjun hu, Xinda Xue, Mu Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
[CV-46] POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP
[Quick Read]: This paper addresses the joint optimization of module sequences and parameters in image signal processing (ISP) pipelines, where existing neural architecture search (NAS) approaches suffer from a training-inference mismatch and step-wise reinforcement learning (RL) is unstable and computationally expensive due to stage-wise decision-making. The key to the solution is POS-ISP, a sequence-level RL framework that casts modular ISP optimization as a global sequence prediction problem: the entire module sequence and its parameters are predicted in a single forward pass and optimized end-to-end with a terminal task reward, eliminating intermediate supervision and redundant executions and yielding stable, efficient task-aware ISP optimization.
Link: https://arxiv.org/abs/2604.06938
Authors: Jiyun Won, Heemin Yang, Woohyeok Kim, Jungseul Ok, Sunghyun Cho
Affiliations: POSTECH CSE; GSAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at this https URL
[CV-47] Multi-modal user interface control detection using cross-attention
[Quick Read]: This paper addresses the difficulty of detecting user interface (UI) controls from software screenshots, a task central to automated testing, accessibility, and software analytics. Pixel-only approaches are limited by visual ambiguity, design variability, and a lack of contextual cues. The key to the solution is a multimodal extension of YOLOv5 that incorporates GPT-generated textual descriptions of UI images and uses cross-attention modules to align visual features with semantic information from text embeddings, enabling more robust, context-aware UI control detection. Experiments show that convolutional fusion significantly outperforms the baseline across 23 control classes, especially for semantically complex or visually ambiguous classes, validating the value of fusing visual and textual modalities.
Link: https://arxiv.org/abs/2604.06934
Authors: Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
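The cross-attention alignment described above can be sketched as scaled dot-product attention in which visual tokens act as queries over text-embedding tokens. This is a minimal illustration without the learned projections a real model would include; all shapes are assumptions:

```python
import numpy as np

def cross_attention(visual, text):
    """visual: (Nv, d) visual tokens (queries); text: (Nt, d) text-embedding
    tokens (keys/values). Returns (Nv, d) text-conditioned visual features."""
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)          # (Nv, Nt) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over text tokens
    return weights @ text                          # weighted mix of text features

rng = np.random.default_rng(0)
visual = rng.normal(size=(49, 32))   # e.g. a flattened 7x7 feature map
text   = rng.normal(size=(12, 32))   # GPT-description token embeddings
fused  = visual + cross_attention(visual, text)   # "element-wise addition" fusion
```

The paper's other fusion strategies replace the final addition: a weighted sum learns the mixing coefficient, while convolutional fusion (the strongest variant reported) would mix the two streams with a learned convolution.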
[CV-48] FP4 Explore BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
[Quick Read]: This paper addresses the prohibitive computational cost of scaling rollouts in RL-based post-training alignment of large text-to-image diffusion models. Although larger rollout groups markedly improve alignment, applying them to models such as FLUX.1-12B creates a severe efficiency bottleneck. The key to the solution is Sol-RL, a two-stage RL framework built on FP4 quantization: the first stage uses high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset; the second stage regenerates only these selected samples in BF16 precision and optimizes the policy on them. By decoupling exploration from optimization, this design combines algorithm-level rollout scaling with system-level FP4 throughput gains, preserving BF16 training integrity while accelerating training convergence by up to 4.64x and greatly reducing the cost of large-scale rollout alignment.
Link: https://arxiv.org/abs/2604.06916
Authors: Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie
Affiliations: NVIDIA; HKU; MIT
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to 4.64\times , unlocking the power of massive rollout scaling at a fraction of the cost.
[CV-49] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
[Quick Read]: This paper tackles the efficiency bottleneck of multimodal large language models (MLLMs) on high-resolution visual inputs: current global resolution-scaling strategies indiscriminately feed redundant visual tokens into quadratic self-attention, severely degrading inference throughput while ignoring spatial sparsity and query intent. The key to the solution is Q-Zoom, a query-aware adaptive high-resolution perception framework that operates coarse-to-fine: a lightweight Dynamic Gating Network skips high-resolution processing when coarse global features suffice, and for queries requiring fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature space. The modules are optimized via a consistency-aware generation strategy and a fully self-supervised distillation paradigm, and a continuous spatio-temporal alignment scheme with targeted fine-tuning seamlessly fuses the dense local RoI with the coarse global layout, improving inference speed while matching or exceeding baseline accuracy.
Link: https://arxiv.org/abs/2604.06912
Authors: Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu
Affiliations: University of Sydney; Sun Yat-sen University; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures
Abstract:MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline’s peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline’s peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at this https URL.
[CV-50] XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI
[Quick Read]: This paper addresses the limitations of conventional career guidance platforms, which rely on static text interfaces, offer weak interactivity, and lack personalized, evidence-based insight, while existing computer-assisted career guidance systems neglect the narrative dimension of career development. The key to the solution is XR-CareerAssist, an immersive platform that unifies Extended Reality (XR) with five core Artificial Intelligence (AI) modules: automatic speech recognition for voice-driven interaction, neural machine translation for multilingual communication, a Langchain-based conversational Training Assistant for personalized dialogue, a BLIP-based vision-language model for career visualizations, and AWS Polly text-to-speech delivered through a 3D avatar. The system renders over 100,000 anonymized career profiles as dynamic Sankey diagrams of career paths, and was built in Unity and deployed on Meta Quest 3, markedly improving user engagement and experience.
Link: https://arxiv.org/abs/2604.06901
Authors: N.D. Tantaroudas, A.J. McCracken, I. Karachalios, E. Papatheou, V. Pastrikakis
Affiliations: Institute of Communications and Computer Systems (ICCS); DASKALOS-APPS; National Technical University of Athens; Exeter Small-Scale Robotics Laboratory; CVCOSMOS Ltd
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: 21
Abstract:Conventional career guidance platforms rely on static, text-driven interfaces that struggle to engage users or deliver personalised, evidence-based insights. Although Computer-Assisted Career Guidance Systems have evolved since the 1960s, they remain limited in interactivity and pay little attention to the narrative dimensions of career development. We introduce XR-CareerAssist, a platform that unifies Extended Reality (XR) with several Artificial Intelligence (AI) modules to deliver immersive, multilingual career guidance. The system integrates Automatic Speech Recognition for voice-driven interaction, Neural Machine Translation across English, Greek, French, and Italian, a Langchain-based conversational Training Assistant for personalised dialogue, a BLIP-based Vision-Language model for career visualisations, and AWS Polly Text-to-Speech delivered through an interactive 3D avatar. Career trajectories are rendered as dynamic Sankey diagrams derived from a repository of more than 100,000 anonymised professional profiles. The application was built in Unity for Meta Quest 3, with backend services hosted on AWS. A pilot evaluation at the University of Exeter with 23 participants returned 95.6% speech recognition accuracy, 78.3% overall user satisfaction, and 91.3% favourable ratings for system responsiveness, with feedback informing subsequent improvements to motion comfort, audio clarity, and text legibility. XR-CareerAssist demonstrates how the fusion of XR and AI can produce more engaging, accessible, and effective career development tools, with the integration of five AI modules within a single immersive environment yielding a multimodal interaction experience that distinguishes it from existing career guidance platforms.
[CV-51] Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models
[Quick Read]: This paper addresses the computational redundancy of deep convolutional networks that exhaustively process dense spatial feature maps, which encourages reliance on spurious background correlations and yields brittle, hard-to-interpret models. The key to the proposed Energy-Regularized Spatial Masking (ERSM) framework is reformulating feature selection as a differentiable energy minimization problem by embedding a lightweight Energy-Mask Layer in standard convolutional backbones. Each visual token is assigned a scalar energy composed of an intrinsic unary importance cost and a pairwise spatial coherence penalty, allowing the network to discover an input-adaptive information-density equilibrium without rigid sparsity budgets or heuristic importance scores, while preserving classification accuracy and producing robust, interpretable sparse spatial masks.
Link: https://arxiv.org/abs/2604.06893
Authors: Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah
Affiliations: DAVID Lab, UVSQ, Paris-Saclay University, Versailles, France; LIPN, UMR CNRS 7030, Sorbonne Paris Nord University, Villetaneuse, France; LISN, Paris-Saclay University, Saclay, France
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
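The unary-plus-pairwise energy can be illustrated on a small token grid, with a differentiable sigmoid gate turning low energy into a high keep-probability. This is an illustrative reconstruction, not the paper's exact formulation; the 4-neighbour penalty, weight `lam`, and temperature `tau` are all assumptions:

```python
import numpy as np

def token_energy(unary, lam=0.5):
    """Per-token scalar energy on a 2D grid: an intrinsic unary importance
    cost plus a pairwise penalty for disagreeing with the 4-neighbour mean."""
    pad = np.pad(unary, 1, mode='edge')
    neigh = (pad[:-2, 1:-1] + pad[2:, 1:-1] +
             pad[1:-1, :-2] + pad[1:-1, 2:]) / 4.0
    return unary + lam * (unary - neigh) ** 2

def soft_mask(energy, tau=1.0):
    """Differentiable keep-probability: low-energy tokens are kept."""
    return 1.0 / (1.0 + np.exp(energy / tau))

unary = np.random.default_rng(1).normal(size=(7, 7))  # per-token unary cost
mask = soft_mask(token_energy(unary))                 # values in (0, 1)
```

In the paper's setting both terms would be computed from learned features inside the backbone, so the gate is trained end-to-end rather than fixed as here.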
[CV-52] Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer
[Quick Read]: This paper addresses automated image-based prediction of overall survival (OS) to improve prognosis and personalized treatment planning for non-small cell lung cancer (NSCLC). The core challenge is extracting effective features from FDG-PET/CT images and modeling survival probability over a continuous time axis. The key to the solution is a deep regression framework with a ResNet-50 backbone that fuses tissue-wise PET/CT projection image embeddings with a scalar temporal input representing the time horizon, producing time-dependent OS probabilities. Incorporating clinical features and image-derived parameters (IDP) and ensembling the imaging and clinical+IDP models further improves performance (a 4.3% AUC gain over the baseline) and enables stratification of patients into high- and low-risk groups, validating the complementary value of imaging and tabular data.
Link: https://arxiv.org/abs/2604.06885
Authors: Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
Abstract:Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provided an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction. 
Submission: arXiv:2604.06885v1 [cs.CV], Wed, 8 Apr 2026 09:43:30 UTC.
[CV-53] SCT-MOT: Enhancing Air-to-Air Multiple UAVs Tracking with Swarm-Coupled Motion and Trajectory Guidance
【速读】: This paper targets air-to-air tracking of swarm UAVs, where the core challenges are complex nonlinear group motion and the weak visual cues of small objects, which cause detection failures, trajectory fragmentation, and identity switches. Existing methods incorporate trajectory prediction but model each object independently, neglecting swarm-level motion dependencies, and fuse motion prediction with appearance representation only loosely, making it hard to guarantee spatio-temporal consistency in visually ambiguous, cluttered environments. The key of the proposed SCT-MOT framework is twofold: a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, improving forecasts of nonlinear, coupled group trajectories; and a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current-frame features, enhancing temporal consistency and spatial discriminability for weak objects. Experiments show the method clearly outperforms existing state-of-the-art approaches on multiple public datasets.
链接: https://arxiv.org/abs/2604.06883
作者: Zhaochen Chu,Tao Song,Ren Jin,Shaoming He,Defu Lin,Siqing Cheng
机构: Beijing Institute of Technology (北京理工大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures. Under review at IEEE Transactions on Aerospace and Electronic Systems (TAES). This work has been submitted to the IEEE for possible publication
Abstract:Air-to-air tracking of swarm UAVs presents significant challenges due to the complex nonlinear group motion and weak visual cues for small objects, which often cause detection failures, trajectory fragmentation, and identity switches. Although existing methods have attempted to improve performance by incorporating trajectory prediction, they model each object independently, neglecting the swarm-level motion dependencies. Their limited integration between motion prediction and appearance representation also weakens the spatio-temporal consistency required for tracking in visually ambiguous and cluttered environments, making it difficult to maintain coherent trajectories and reliable associations. To address these challenges, we propose SCT-MOT, a tracking framework that integrates Swarm-Coupled motion modeling and Trajectory-guided feature fusion. First, we develop a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, enabling more accurate forecasting of the nonlinear, coupled group trajectories. Second, we design a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current frame features, enhancing temporal consistency and spatial discriminability for weak objects. Extensive experiments on three public air-to-air swarm UAV tracking datasets, including AIRMOT, MOT-FLY, and UAVSwarm, demonstrate that SMTP achieves more accurate trajectory forecasts and yields a 1.21% IDF1 improvement over the state-of-the-art trajectory prediction module EqMotion when integrated into the same MOT framework. Overall, our SCT-MOT consistently achieves superior accuracy and robustness compared to state-of-the-art trackers across multiple metrics under complex swarm scenarios.
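The swarm-coupling idea above can be illustrated with a toy motion model in which each agent's forecast blends its own velocity with the group's mean velocity; this is a minimal sketch of the concept only, not the paper's SMTP module (function name and the `coupling` parameter are illustrative assumptions):

```python
import numpy as np

def swarm_coupled_predict(tracks, coupling=0.3):
    """Toy swarm-coupled motion forecast (illustrative, not the paper's SMTP).

    tracks: (N, T, 2) array holding N agents' last T 2-D positions.
    Each agent's next position extrapolates its own last velocity blended
    with the swarm's mean velocity, modeling group-level motion coupling.
    """
    vel = tracks[:, -1] - tracks[:, -2]          # per-agent last velocity (N, 2)
    swarm_vel = vel.mean(axis=0, keepdims=True)  # shared group motion (1, 2)
    blended = (1 - coupling) * vel + coupling * swarm_vel
    return tracks[:, -1] + blended               # predicted next positions (N, 2)

tracks = np.array([[[0., 0.], [1., 0.]],   # agent moving along +x
                   [[0., 0.], [0., 1.]]])  # agent moving along +y
pred = swarm_coupled_predict(tracks, coupling=0.0)  # pure per-agent extrapolation
```

With `coupling=0.0` this reduces to independent constant-velocity prediction; raising `coupling` pulls every forecast toward the group's common motion, which is the dependency independent per-object predictors ignore.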
[CV-54] RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
【速读】: This paper addresses local detail collapse in image refinement: generative models often distort fine structures such as text, logos, and thin objects, and when editing small regions they tend to alter the background unintentionally. The proposed RefineAnything is a multimodal diffusion-based, region-specific refinement method whose key innovation is a Focus-and-Refine strategy: the target region is cropped and rescaled to raise its effective local resolution, refined, and then pasted back precisely into the original image via a blended mask, achieving high-fidelity local reconstruction with strict background preservation; a boundary-aware Boundary Consistency Loss further reduces seam artifacts and markedly improves naturalness and consistency.
链接: https://arxiv.org/abs/2604.06870
作者: Dewei Zhou,You Li,Zongxin Yang,Yi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages
Abstract:We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: this https URL.
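The crop/refine/paste-back loop described above can be sketched in a few lines; this is a simplified illustration of the strategy only (the diffusion refiner is replaced by a placeholder function, and the box/mask conventions are assumptions, not the paper's exact design):

```python
import numpy as np

def focus_and_refine(image, box, refine_fn, mask=None):
    """Crop a region, refine it, and paste it back with mask blending.

    Sketch of the Focus-and-Refine idea: pixels outside the blend mask
    keep their original values, so the background is strictly preserved.
    box = (y0, y1, x0, x1); mask has the same shape as the crop.
    """
    y0, y1, x0, x1 = box
    crop = image[y0:y1, x0:x1]
    refined = refine_fn(crop)            # stand-in for the real refiner model
    if mask is None:
        mask = np.ones_like(crop)        # refine the whole box by default
    out = image.copy()
    out[y0:y1, x0:x1] = mask * refined + (1 - mask) * crop
    return out

img = np.zeros((4, 4))
out = focus_and_refine(img, (1, 3, 1, 3), lambda c: c + 1.0)
```

In the real system the crop would also be rescaled to the model's input resolution before refinement and rescaled back afterwards, which is where the resolution-budget reallocation comes from.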
[CV-55] Physical Adversarial Attacks on AI Surveillance Systems:Detection Tracking and Visible–Infrared Evasion
【速读】: This paper addresses the inadequate evaluation of physical adversarial attacks in realistic surveillance settings: conventional single-frame image benchmarks poorly reflect how effective such attacks are in deployed systems. The key of the solution is a surveillance-oriented four-part taxonomy centered on temporal persistence, sensing modality, carrier realism, and system-level objective, which reorganizes prior work and surfaces recent progress on multi-object tracking, visible-infrared dual-modal evasion, and controllable clothing, arguing that attack effectiveness must be assessed as a system-level problem unfolding over time, across sensors, and under physical deployment constraints.
链接: https://arxiv.org/abs/2604.06865
作者: Miguel A. DelaCruz,Patricia Mae Santos,Rafael T. Navarro
机构: University of the Philippines Diliman (菲律宾大学迪里曼分校); Ateneo de Manila University (马尼拉雅典耀大学); De La Salle University (德拉萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible–infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible–infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.
[CV-56] Vision-Language Model-Guided Deep Unrolling Enables Personalized Fast MRI
【速读】: This paper tackles the lack of task adaptivity in conventional accelerated MRI: optimizing generic image fidelity makes it hard to personalize acquisition for specific clinical tasks such as lesion detection or localization, limiting image quality and diagnostic utility. The key of the proposed PASS framework lies in three task-oriented components: (1) a physics-based deep unrolled reconstruction network that ensures interpretability; (2) a sampling module that generates patient-specific k-space trajectories; and (3) an anomaly-aware prior extracted from a pretrained vision-language model (VLM) that steers both sampling and reconstruction toward clinically relevant regions. By combining the high-level semantic reasoning of a VLM with an interpretable, physics-constrained network, the design markedly improves image quality across anatomies, contrasts, anomaly types, and acceleration factors, and directly benefits downstream diagnostic tasks.
链接: https://arxiv.org/abs/2604.06849
作者: Fangmao Ju,Yuzhu He,Zhiwen Xue,Chunfeng Lian,Jianhua Ma
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific k-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.
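The physics-based unrolled reconstruction mentioned above typically alternates learned steps with a data-consistency update of the form x ← x − α·Aᴴ(Ax − y), where A is the masked Fourier sampling operator. A minimal sketch of just that data-consistency iteration (no learned prior and no VLM guidance, so this is only the backbone the paper builds on):

```python
import numpy as np

def unrolled_recon(y, mask, steps=10, lr=1.0):
    """Unrolled gradient steps for undersampled 2-D Fourier data.

    Each iteration applies x <- x - lr * A^H (A x - y), with
    A = mask * FFT. norm="ortho" makes the FFT unitary, so with a
    full mask A^H A = I and one step recovers the image exactly.
    """
    x = np.zeros(y.shape, dtype=complex)
    for _ in range(steps):
        resid = mask * np.fft.fft2(x, norm="ortho") - y   # A x - y
        x = x - lr * np.fft.ifft2(mask * resid, norm="ortho")  # minus A^H resid
    return x

truth = np.ones((8, 8))
mask = np.ones((8, 8))                          # fully sampled case
y = mask * np.fft.fft2(truth, norm="ortho")     # measured k-space data
x = unrolled_recon(y, mask, steps=1)
```

In PASS-style frameworks, a learned network module would be interleaved between such data-consistency steps, and the mask itself would be generated per patient rather than fixed.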
[CV-57] CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery
【速读】: This paper addresses accuracy and efficiency in cloud detection for remote sensing imagery, where existing deep-learning approaches that cast the task as single-stage pixel-wise binary segmentation are ambiguous in thin-cloud regions and struggle with fragmented clouds and boundary details. The key of the proposed CloudMamba framework is twofold: first, an uncertainty-guided two-stage detection strategy, in which an embedded uncertainty estimation module automatically quantifies thin-cloud segmentation confidence and a second-stage refinement improves accuracy in low-confidence hard regions; second, a dual-scale network built on a hybrid CNN-Mamba architecture that, at linear computational complexity, effectively models both the large-scale structure and the small-scale boundary details of clouds, enabling accurate delineation of overall cloud morphology and precise boundary segmentation.
链接: https://arxiv.org/abs/2604.06844
作者: Jiajun Yang,Keyan Chen,Zhengxia Zou,Zhenwei Shi
机构: Beihang University (北京航空航天大学); Ministry of Education (教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cloud detection in remote sensing imagery is a fundamental, critical, and highly challenging problem. Existing deep learning-based cloud detection methods generally formulate it as a single-stage pixel-wise binary segmentation task with one forward pass. However, such single-stage approaches exhibit ambiguity and uncertainty in thin-cloud regions and struggle to accurately handle fragmented clouds and boundary details. In this paper, we propose a novel deep learning framework termed CloudMamba. To address the ambiguity in thin-cloud regions, we introduce an uncertainty-guided two-stage cloud detection strategy. An embedded uncertainty estimation module is proposed to automatically quantify the confidence of thin-cloud segmentation, and a second-stage refinement segmentation is introduced to improve the accuracy in low-confidence hard regions. To better handle fragmented clouds and fine-grained boundary details, we design a dual-scale Mamba network based on a CNN-Mamba hybrid architecture. Compared with Transformer-based models with quadratic computational complexity, the proposed method maintains linear computational complexity while effectively capturing both large-scale structural characteristics and small-scale boundary details of clouds, enabling accurate delineation of overall cloud morphology and precise boundary segmentation. Extensive experiments conducted on the GF1_WHU and Levir_CS public datasets demonstrate that the proposed method outperforms existing approaches across multiple segmentation accuracy metrics, while offering high efficiency and process transparency. Our code is available at this https URL.
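The uncertainty-guided two-stage strategy can be sketched with a simple confidence rule: pixels whose first-stage probability sits near 0.5 (typically thin clouds and boundaries) are handed to a second-stage refiner. This is an illustrative simplification; the thresholding rule, names, and the placeholder refiner are assumptions, not CloudMamba's actual modules:

```python
import numpy as np

def two_stage_segment(prob_coarse, refine_fn, conf_thresh=0.8):
    """Uncertainty-guided two-stage segmentation sketch.

    prob_coarse: per-pixel cloud probabilities from stage one.
    Confidence is the scaled distance from 0.5; low-confidence pixels
    are re-labeled by refine_fn, confident pixels keep stage-one labels.
    """
    confidence = np.abs(prob_coarse - 0.5) * 2.0        # in [0, 1]
    labels = (prob_coarse > 0.5).astype(int)
    uncertain = confidence < conf_thresh
    labels[uncertain] = refine_fn(prob_coarse)[uncertain]
    return labels, uncertain

probs = np.array([[0.95, 0.55],
                  [0.05, 0.45]])
# stand-in refiner that labels every uncertain pixel as cloud
labels, uncertain = two_stage_segment(probs, lambda p: np.ones_like(p, dtype=int))
```

Only the two ambiguous pixels (0.55 and 0.45) pass through the second stage; the confident ones (0.95 and 0.05) are untouched, which is what keeps the refinement pass cheap.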
[CV-58] VGGT-SLAM CVPR2026
【速读】: This paper addresses the short-horizon pose drift that existing Transformer-based visual SLAM systems accumulate over long runs because they rely on sparse loop closures or global Sim(3) manifold constraints. The core of the solution is a spatially corrective back-end that stabilizes trajectories through high-cadence local bundle adjustment (LBA): for each VGGT submap, a dense planar-canonical Digital Elevation Map (DEM) is built and partitioned into patches whose DINOv2 embeddings are integrated into a covisibility graph; a Visual Place Recognition (VPR) module then retrieves spatially neighboring submaps within the window to trigger frequent local optimization, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.
链接: https://arxiv.org/abs/2604.06830
作者: Avilasha Mandal,Rajesh Kumar,Sudarshan Sunil Harithas,Chetan Arora
机构: Indian Institute of Technology Delhi (印度理工学院德里分校); Addverb Technologies (艾德弗伯技术公司); Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages (main paper) + supplementary material. Accepted at CVPR 2026 Workshop (VOCVALC)
Abstract:We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end that jointly enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.
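The retrieval step that triggers local optimization can be illustrated with plain cosine-similarity lookup over submap descriptors; here the real DINOv2 patch embeddings of DEM tiles are replaced by arbitrary toy vectors, so this is a sketch of the lookup mechanics only:

```python
import numpy as np

def retrieve_neighbors(query_emb, submap_embs, top_k=2):
    """Return indices of the top_k submaps most similar to the query.

    Both the query and the stored submap descriptors are L2-normalized,
    so the dot product is cosine similarity; in the real system a hit
    within the covisibility window triggers local bundle adjustment.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = submap_embs / np.linalg.norm(submap_embs, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:top_k]

embs = np.array([[1.0, 0.0],    # submap 0
                 [0.9, 0.1],    # submap 1 (similar viewpoint)
                 [0.0, 1.0]])   # submap 2 (different place)
idx = retrieve_neighbors(np.array([1.0, 0.05]), embs, top_k=2)
```

A production system would replace the brute-force `argsort` with an approximate-nearest-neighbor index to get the sublinear retrieval the abstract mentions.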
[CV-59] RePL: Pseudo-label Refinement for Semi-supervised LiDAR Semantic Segmentation
【速读】: This paper addresses the error propagation and confirmation bias caused by noisy pseudo-labels in semi-supervised learning for LiDAR semantic segmentation. The key of the proposed RePL framework is to identify and correct potential pseudo-label errors through masked reconstruction, together with a dedicated training strategy that improves pseudo-label quality; a theoretical analysis shows the refinement is beneficial under a mild condition, and experiments on nuScenes-lidarseg and SemanticKITTI confirm substantial gains in segmentation performance, reaching the state of the art.
链接: https://arxiv.org/abs/2604.06825
作者: Donghyeon Kwon,Taegyu Park,Suha Kwak
机构: POSTECH (浦项科技大学); GSAI (人工智能研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semi-supervised learning for LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo-labels. To tackle this chronic issue, we introduce RePL, a novel framework that enhances pseudo-label quality by identifying and correcting potential errors in pseudo-labels through masked reconstruction, along with a dedicated training strategy. We also provide a theoretical analysis demonstrating the condition under which the pseudo-label refinement is beneficial, and empirically confirm that the condition is mild and clearly met by RePL. Extensive evaluations on the nuScenes-lidarseg and SemanticKITTI datasets show that RePL substantially improves pseudo-label quality and, as a result, achieves the state of the art in LiDAR semantic segmentation.
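The refinement idea can be illustrated with an agreement check: a pseudo-label that disagrees with a second, reconstruction-from-context prediction is treated as a likely error. This is a loose analogue only; RePL's masked-reconstruction network is not modeled, and marking disagreements as ignored (rather than correcting them) is a simplifying assumption of this sketch:

```python
import numpy as np

def refine_pseudo_labels(pseudo, reconstructed, ignore_index=-1):
    """Agreement-based pseudo-label filtering sketch.

    pseudo:        teacher pseudo-labels per point (int class ids).
    reconstructed: labels re-predicted from masked context.
    Points where the two disagree are set to ignore_index so they do
    not propagate errors into student training.
    """
    refined = pseudo.copy()
    refined[pseudo != reconstructed] = ignore_index
    return refined

pseudo = np.array([2, 2, 5, 7])
recon = np.array([2, 3, 5, 7])   # disagreement at index 1
refined = refine_pseudo_labels(pseudo, recon)
```

Filtering out the disagreeing point is what breaks the confirmation-bias loop: the student never trains on the label most likely to be wrong.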
[CV-60] Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning CVPR2026
【速读】: This paper addresses sound source localization (SSL), where existing methods rely on contrastive-learning feature matching but lack explicit reasoning and verification, limiting performance in complex acoustic scenes. The key of the solution is to exploit the intrinsic reasoning ability of multimodal large language models (MLLMs) in a training-free Generation-Analysis-Refinement (GAR) framework: a generation stage produces initial bounding boxes and audio classifications; an analysis stage quantifies audio-visual consistency via open-set role tagging and anchor voting; and a refinement stage applies an adaptive gating mechanism to avoid unnecessary adjustments, yielding more reliable and interpretable sound source localization.
链接: https://arxiv.org/abs/2604.06824
作者: Subin Park,Jung Uk Kim
机构: Kyung Hee University (경희대학교)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Sound source localization task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at this https URL.
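The anchor-voting and adaptive-gating steps can be sketched together: the refined box replaces the initial one only when enough anchors vote that it is audio-visually consistent. The vote encoding, gate value, and box tuples below are illustrative assumptions, not the paper's exact formulation:

```python
def anchor_vote_consistency(votes):
    """Fraction of anchors voting 'consistent' (1) versus not (0)."""
    return sum(votes) / len(votes)

def adaptive_gate(initial_box, refined_box, votes, gate=0.5):
    """Keep the refined box only when vote consistency clears the gate.

    This prevents a low-confidence refinement from overwriting a
    reasonable initial prediction (the 'unnecessary adjustment' case).
    """
    if anchor_vote_consistency(votes) >= gate:
        return refined_box
    return initial_box

# two of three anchors support the refinement -> accept it
box = adaptive_gate((0, 0, 10, 10), (2, 2, 8, 8), votes=[1, 1, 0])
```

With majority support the refinement is accepted; flipping the votes to `[0, 0, 1]` would keep the initial box instead.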
[CV-61] FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift CVPR2026
【速读】: This paper addresses the severe domain shift that arises in federated learning (FL) when clients hold data from different domains, which degrades global model performance. Existing prototype-based FL methods have two key limitations: they build a single global prototype per class while discarding domain information, and their feature-prototype alignment is domain-agnostic, forcing local features to align with global prototypes from all domains. The key of the proposed Federated Domain-Aware Prototypes (FedDAP) is a similarity-weighted fusion mechanism that aggregates local prototypes within the same domain into domain-specific global prototypes, and a local training objective that aligns features with same-domain prototypes while separating them from other-domain prototypes, enabling domain-specific local learning and stronger cross-domain generalization of the global model.
链接: https://arxiv.org/abs/2604.06795
作者: Huy Q. Le,Loc X. Nguyen,Yu Qiao,Seong Tae Kim,Eui-Nam Huh,Choong Seon Hong
机构: G-LAMP NEXUS Institute, Kyung Hee University (庆熙大学G-LAMP NEXUS研究所); Kyung Hee University (庆熙大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026
Abstract:Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning has emerged as a promising solution, leveraging class-wise feature representations. Yet, existing methods face two key limitations: (1) Existing prototype-based FL methods typically construct a single global prototype per class by aggregating local prototypes from all clients without preserving domain information. (2) Current feature-prototype alignment is domain-agnostic, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets: DomainNet, Office-10, and PACS to demonstrate the effectiveness of our proposed framework to address the domain shift challenges. The code is available at this https URL.
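The similarity-weighted fusion step can be sketched as weighting each client's class prototype by its mean cosine similarity to the other same-domain prototypes, so outlier clients contribute less; the exact weighting used by FedDAP may differ, and this is only a minimal illustration of the idea:

```python
import numpy as np

def fuse_domain_prototypes(local_protos):
    """Similarity-weighted fusion of same-domain local prototypes.

    local_protos: (K, D) class prototypes from K clients of one domain.
    Each prototype's weight is its summed cosine similarity to the other
    prototypes, normalized; the fused vector is the domain-specific
    global prototype for this class.
    """
    p = local_protos / np.linalg.norm(local_protos, axis=1, keepdims=True)
    sims = p @ p.T                      # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)         # exclude self-similarity
    weights = sims.sum(axis=1)
    weights = weights / weights.sum()
    return weights @ local_protos

protos = np.array([[1.0, 0.0],
                   [1.0, 0.0],
                   [1.0, 0.0]])
fused = fuse_domain_prototypes(protos)
```

With an orthogonal outlier among the prototypes (e.g. replacing one row with `[0., 1.]`), its weight collapses to zero in this toy and the fused prototype stays with the majority.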
[CV-62] Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
【速读】: This paper addresses the fragmentation of spatiotemporal correlations caused by factorized or window-based self-attention in video tasks: such designs improve computational efficiency but limit a model's ability to capture motion and long-range dependencies. The key of the solution is a dual-path architecture, the Overall Glance and Refined Gaze (OG-ReG) Transformer, in which the Glance path extracts coarse-grained, global spatiotemporal information while the Gaze path supplements local details, mimicking how the human visual system sparsely allocates attention across different time scales and thereby modeling the dynamics and structure of video more effectively.
链接: https://arxiv.org/abs/2604.06783
作者: Bohao Xing,Deng Li,Rong Gao,Xin Liu,Heikki Kälviäinen
机构: Lappeenranta-Lahti University of Technology LUT (拉彭兰塔-拉赫蒂理工大学); Brno University of Technology (布尔诺理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models’ ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at this https URL.
[CV-63] EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling
【速读】: This paper addresses the degraded face recognition performance of event cameras, which lack the stable photometric appearance that conventional RGB-based systems rely on; the core challenge is extracting discriminative identity representations from sparse, dynamic event streams. The key of the proposed EventFace framework is to model identity-relevant features by fusing spatial structure with temporal dynamics: Low-Rank Adaptation (LoRA) transfers facial structural priors from pretrained RGB face models to the event domain, establishing a reliable spatial basis; a Motion Prompt Encoder (MPE) then explicitly encodes temporal features, and a Spatiotemporal Modulator (STM) fuses them with spatial features to strengthen the representation of identity-relevant event patterns. Experiments show stronger robustness under degraded illumination than competing methods, and the learned representations exhibit reduced template reconstructability, supporting privacy-friendly deployment.
链接: https://arxiv.org/abs/2604.06782
作者: Qingguo Meng,Xingbo Dong,Zhe Jin,Massimo Tistarelli
机构: State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology (国家重点光电信息获取与保护技术实验室); Anhui Provincial Key Laboratory of Secure Artificial Intelligence (安徽省安全人工智能重点实验室); Anhui Provincial International Joint Research Center for Advanced Technology in Medical Imaging (安徽省医学影像先进技术国际联合研究中心); School of Artificial Intelligence (人工智能学院); Anhui University (安徽大学); Computer Vision Laboratory, University of Sassari (计算机视觉实验室,萨萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
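The LoRA mechanism used for the RGB-to-event transfer is standard and easy to sketch: the frozen pretrained weight W is adapted by a trainable low-rank product B·A, so only a small number of parameters are updated. How EventFace attaches these adapters to its backbone is not shown here; the shapes below are toy values:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA forward pass: y = x (W + alpha * B A)^T.

    W: frozen pretrained weight, shape (d_out, d_in).
    A: (r, d_in), B: (d_out, r), with rank r << min(d_in, d_out),
    so only r * (d_in + d_out) parameters are trained.
    """
    return x @ (W + alpha * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))   # standard init: B = 0, so B A = 0 and the
x = rng.normal(size=(1, d_in))  # adapted model starts exactly at W
y = lora_forward(x, W, A, B)
```

Because B starts at zero, the adapted model reproduces the pretrained RGB model exactly at the beginning of event-domain fine-tuning, which is what makes the prior transfer stable.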
[CV-64] Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
【速读】: This paper addresses the reasoning-action discrepancy in the multi-turn reasoning of multimodal large language models (MLLMs): a model may produce plausible textual reasoning chains while executing imprecise or irrelevant visual actions, so errors accumulate across turns, severely degrading multimodal reasoning and even causing training collapse. The key of the proposed Multimodal Agentic Policy Optimization (MAPO) is to require the model to generate explicit textual descriptions of the visual content obtained through tool use, combined with a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward, which provably reduces gradient variance and improves reasoning-action consistency.
链接: https://arxiv.org/abs/2604.06777
作者: Wenhao Yang,Yu Xia,Jinlong Huang,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Yuchen Zhou,Xiaobo Xia,Yuanyu Wan,Lijun Zhang,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model’s multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
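The coupling of task reward with description-observation alignment can be sketched with a group-relative advantage in the GRPO style; this is a loose illustration of the idea, not the paper's exact estimator, and the `lam` weighting is an assumption:

```python
import numpy as np

def alignment_coupled_advantages(task_rewards, align_scores, lam=0.5):
    """Group-relative advantages coupling reward with semantic alignment.

    task_rewards: outcome rewards for a group of sampled trajectories.
    align_scores: how well each trajectory's tool-output descriptions
    match the actual observations. The combined score is centered and
    scaled within the group, so trajectories with identical outcomes
    but sloppier visual grounding receive lower advantages.
    """
    combined = np.asarray(task_rewards) + lam * np.asarray(align_scores)
    adv = combined - combined.mean()
    std = combined.std()
    return adv / std if std > 0 else adv

# two trajectories solve the task (reward 1.0), but only the first
# accurately described what its visual tool returned
adv = alignment_coupled_advantages([1.0, 1.0], [1.0, 0.0], lam=0.5)
```

Outcome-only rewards would give both trajectories the same advantage; the alignment term is what penalizes the one whose textual reasoning masked an executive failure.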
[CV-65] FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts
【速读】: This paper addresses the inaccessibility of maintenance flowcharts (e.g., ISO 5807-standardized diagrams) in manufacturing facilities: they are typically stored as static PDFs or scanned images, so the procedural knowledge they encode cannot be queried or reasoned over by modern operator support systems such as generative AI assistants. The key of the proposed FlowExtract pipeline is to separate node detection from connectivity reconstruction: YOLOv8 and EasyOCR handle standard, domain-aligned node detection and text extraction, while a novel edge detection mechanism analyzes arrowhead orientations and traces connecting lines backward to their source nodes, reconstructing the directed graph with high accuracy and substantially outperforming vision-language model baselines on industrial troubleshooting guides.
链接: https://arxiv.org/abs/2604.06770
作者: Guillermo Gil de Avalle,Laura Maruster,Eric Sloot,Christos Emmanouilidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.
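The final step of the backward trace, attaching an edge to its source node, can be illustrated with simple geometry; the real pipeline follows the drawn line pixel by pixel, so reducing it to a nearest-node-center lookup as below is a deliberate simplification:

```python
import numpy as np

def assign_edge_source(arrow_tail, node_centers):
    """Attach an edge to the node nearest its arrow's tail point.

    arrow_tail: (2,) image coordinates where the traced line ends.
    node_centers: (N, 2) centers of detected flowchart nodes.
    Returns the index of the inferred source node.
    """
    d = np.linalg.norm(node_centers - arrow_tail, axis=1)
    return int(np.argmin(d))

nodes = np.array([[0.0, 0.0],    # e.g. a start terminator
                  [10.0, 0.0],   # a decision diamond
                  [5.0, 8.0]])   # a process box
src = assign_edge_source(np.array([9.0, 1.0]), nodes)
```

Combined with the arrowhead's own position (the edge's target), this yields one directed edge per detected arrow, from which the full graph is assembled.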
[CV-66] FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
【速读】: This paper addresses the long-standing cross-modal alignment bottleneck in multimodal generation: conventional text-driven pipelines cannot reason or create within the visual modality itself and are constrained by noise scheduling and task-specific architectural branches. The key of the proposed FlowInOne framework is to convert all inputs, including textual descriptions, spatial layouts, and editing instructions, into visual prompts, yielding a purely visual, image-in, image-out pipeline governed by a single flow matching model. This removes the need for cross-modal alignment, markedly simplifies the architecture, improves generation consistency and precision, and lays a new foundation for fully vision-centric generative modeling.
链接: https://arxiv.org/abs/2604.06757
作者: Junchao Yi,Rui Zhao,Jiahao Tang,Weixian Lei,Linjie Li,Qisheng Su,Zhengyuan Yang,Lijuan Wang,Xiaofeng Zhu,Alex Jinpeng Wang
机构: University of Electronic Science and Technology of China (电子科技大学); Central South University (中南大学); National University of Singapore (新加坡国立大学); University of Science and Technology of China (中国科学技术大学); Tencent (腾讯); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.
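The flow matching objective underlying such models is easy to state: for the widely used rectified-flow (linear interpolation) variant, the network sees x_t = (1−t)·x0 + t·x1 and regresses the constant velocity x1 − x0. The sketch below shows only this generic training-pair construction; FlowInOne's visual-prompt conditioning is omitted:

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build a linear-interpolation flow matching training pair.

    x0: source sample (e.g. noise), x1: target sample (e.g. the output
    image's latents), t: interpolation time in [0, 1].
    Returns the network input x_t and the velocity regression target.
    """
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0          # constant velocity along the straight path
    return xt, target

x0 = np.zeros(4)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
xt, v = flow_matching_pair(x0, x1, t=0.5)
```

At inference, integrating the learned velocity field from t=0 to t=1 (e.g. with a few Euler steps) carries the source image toward the generated output, which is what makes a single model serve every image-in, image-out task.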
[CV-67] How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
【速读】: This paper addresses how poorly characterized current vision-language models (VLMs) are on sequential driving content in autonomous driving, particularly the lack of systematic analysis of how input configurations affect performance. The key of the solution is VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework that extracts temporal sequences from driving videos in existing datasets and builds a structured evaluation comparing 25+ VLMs across 2,600+ scenarios, providing the first systematic account of how image resolution, frame count, temporal intervals, spatial layouts, and presentation modes influence performance, and establishing baselines for future research.
链接: https://arxiv.org/abs/2604.06750
作者: Roberto Brusnicki,Mattia Piccinini,Johannes Betz
机构: Technical University of Munich (慕尼黑工业大学); Munich Institute of Robotics and Machine Intelligence (慕尼黑机器人与机器智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance in similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding the vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at this https URL
[CV-68] From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks
【速读】:该论文旨在解决当前视觉上下文学习模型(Visual In-Context Learning Models)在实际应用中缺乏用户交互能力的问题,即这些模型虽能通过示例对新任务进行快速泛化,但无法有效利用用户提供的引导信号(如涂鸦、点击或边界框)来调整或优化预测过程。解决方案的关键在于提出一种名为 Interactive DeLVM 的方法,其核心思想是将用户的交互信息直接编码到示例的输入-输出对中,从而在不进行任务特定微调的前提下,使模型能够响应自然的视觉提示并动态调整预测结果,实现了从静态任务适应向用户驱动的灵活交互转变。
链接: https://arxiv.org/abs/2604.06748
作者: Carlos Schmidt,Simon Reiß
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
[CV-69] LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video
【速读】:该论文旨在解决从无标定(unposed)多视角视频中实现实时动态场景新视角合成(Novel View Synthesis, NVS)的挑战,现有方法通常依赖于已知相机参数并需长时间优化(约2.67秒/帧),难以满足直播场景的实时性要求。解决方案的关键在于提出一种前馈式端到端模型LiveStre4m,其核心创新包括:1)基于多视角视觉Transformer(Multi-view Vision Transformer)实现关键帧3D场景重建;2)引入扩散-Transformer插值模块以保障时序一致性与稳定流媒体输出;3)设计相机位姿预测模块(Camera Pose Predictor),直接从RGB图像中估计相机内外参,从而摆脱对预先标定信息的依赖。该方案可在仅两路同步未标定输入下实现每帧0.07秒的平均重建时间(1024×768分辨率),显著优于传统优化方法,在实际部署中实现了可行的实时NVS直播系统。
链接: https://arxiv.org/abs/2604.06740
作者: Pedro Quesado,Erkut Akdag,Yasaman Kashefbahrami,Willem Menu,Egor Bondarev
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations ( \approx 2.67 s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real-time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of 0.07 s per-frame at 1024 \times 768 resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: this https URL
[CV-70] DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting
【速读】:该论文旨在解决稀疏视角下基于3D高斯泼溅(3D Gaussian Splatting, 3DGS)重建中存在的严重过拟合及结构失真与半透明雾状伪影问题,其根本原因在于几何监督不足导致高斯基元(Gaussian primitive)可靠性不可观测。解决方案的关键在于提出统一的双域观测与校准(Dual-domain Observation and Calibration, DOC-GS)框架:在优化域中,通过连续深度引导的丢弃策略(Continuous Depth-Guided Dropout, CDGD)显式建模每个高斯基元的约束程度,并以此作为可靠性代理,引入平滑的深度感知归纳偏置以抑制弱约束基元并提升优化稳定性;在观测域中,将漂浮伪影与大气散射建立联系,利用暗通道先验(Dark Channel Prior, DCP)作为结构一致性线索识别异常区域,并基于多视角聚合证据设计可靠性驱动的几何剪枝策略,剔除低置信度高斯基元,从而系统性地改善重建质量。
链接: https://arxiv.org/abs/2604.06739
作者: Hantang Li,Qiang Zhu,Xiandong Meng,Debin Zhao,Xiaopeng Fan
机构: Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; Harbin Institute of Technology, Harbin, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
Abstract:Sparse-view reconstruction with 3D Gaussian Splatting (3DGS) is fundamentally ill-posed due to insufficient geometric supervision, often leading to severe overfitting and the emergence of structural distortions and translucent haze-like artifacts. While existing approaches attempt to alleviate this issue via dropout-based regularization, they are largely heuristic and lack a unified understanding of artifact formation. In this paper, we revisit sparse-view 3DGS reconstruction from a new perspective and identify the core challenge as the unobservability of Gaussian primitive reliability. Unreliable Gaussians are insufficiently constrained during optimization and accumulate as haze-like degradations in rendered images. Motivated by this observation, we propose a unified Dual-domain Observation and Calibration (DOC-GS) framework that models and corrects Gaussian reliability through the synergy of optimization-domain inductive bias and observation-domain evidence. Specifically, in the optimization domain, we characterize Gaussian reliability by the degree to which each primitive is constrained during training, and instantiate this signal via a Continuous Depth-Guided Dropout (CDGD) strategy, where the dropout probability serves as an explicit proxy for primitive reliability. This imposes a smooth depth-aware inductive bias to suppress weakly constrained Gaussians and improve optimization stability. In the observation domain, we establish a connection between floater artifacts and atmospheric scattering, and leverage the Dark Channel Prior (DCP) as a structural consistency cue to identify and accumulate anomalous regions. Based on cross-view aggregated evidence, we further design a reliability-driven geometric pruning strategy to remove low-confidence Gaussians.
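摘要中将漂浮伪影与大气散射建立联系,并借助暗通道先验(DCP)作为结构一致性线索定位异常区域。下面用 NumPy 给出暗通道计算本身的最小示意(这是 DCP 的通用定义,并非论文官方实现;`patch` 窗口大小等参数均为示意假设):

```python
import numpy as np

def dark_channel(img, patch=15):
    """暗通道:先取像素级 RGB 最小值,再在局部窗口内取最小值。
    img: (H, W, 3),取值范围 [0, 1]。"""
    h, w, _ = img.shape
    min_rgb = img.min(axis=2)                  # 逐像素的最小颜色通道
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge") # 边缘复制填充
    out = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

无雾的清晰区域暗通道值趋近 0,而雾状/漂浮伪影区域的暗通道值整体偏高,因此可用作异常区域的线索。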
[CV-71] URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection
【速读】:该论文旨在解决多模态讽刺检测(Multimodal Sarcasm Detection, MSD)中因假设所有模态(文本与图像)均等可靠而导致的鲁棒性不足问题。在真实社交媒体场景下,文本可能模糊、图像可能弱相关甚至无关,传统确定性融合方法易引入噪声证据,削弱推理能力。其解决方案的关键在于提出不确定性感知的鲁棒多模态融合框架(Uncertainty-aware Robust Multimodal Fusion, URMF),通过显式建模各模态的似然不确定性(aleatoric uncertainty),将每种模态参数化为可学习的高斯后验分布,并利用估计的不确定性动态调节融合过程中各模态的贡献权重,从而抑制不可靠模态、增强语义不一致性的感知能力,最终提升联合表示的准确性与鲁棒性。
链接: https://arxiv.org/abs/2604.06728
作者: Zhenyu Wang,Weichen Cheng,Weijia Li,Junjie Mou,Zongyou Zhao,Guoying Zhang
机构: China University of Mining and Technology, Beijing (中国矿业大学(北京))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.
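摘要中提到用估计的不确定性动态调节各模态在融合中的贡献。下面给出一个按精度(方差倒数)加权融合的最小示意(不确定性加权的通用做法草图,并非论文官方实现;输入形状与函数名均为假设):

```python
import numpy as np

def uncertainty_weighted_fusion(means, logvars):
    """按不确定性加权融合 M 个模态的表示。
    means / logvars: (M, D),每个模态建模为对角高斯后验 N(mu, sigma^2)。"""
    precision = np.exp(-np.asarray(logvars, float))        # 1 / sigma^2
    weights = precision / precision.sum(axis=0, keepdims=True)
    fused = (weights * np.asarray(means, float)).sum(axis=0)
    return fused, weights
```

方差越大(越不可靠)的模态权重越小,融合结果因而被可靠模态主导,这正是"抑制不可靠模态"的基本思路。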
[CV-72] Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂三维空间推理任务中表现不佳的问题,其根本原因在于模型依赖于二维视觉先验,缺乏对三维几何结构的显式理解与灵活视角变换能力。解决方案的关键在于提出一种无需训练(training-free)的框架,引入基于显式三维重建的视觉思维链(Visual Chain-of-Thought)机制:首先利用MLLM引导的关键词提取与多粒度掩码生成技术从单张图像重建高保真三维网格(3D mesh),随后借助外部知识库迭代计算最优相机外参并合成新视角,从而模拟人类视角转换过程,显著提升空间理解能力。
链接: https://arxiv.org/abs/2604.06725
作者: Jiahua Chen,Qihong Tang,Weinong Wang,Qi Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a \textittraining-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including \textitGPT-5.2 and \textitGemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.
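摘要中提到迭代计算最优相机外参并合成新视角。新视角对应的外参通常可由相机位置与注视点通过 look-at 方式构造,下面给出最小示意(标准图形学做法,并非论文的外参优化过程;采用 -z 朝前的 OpenGL 风格约定):

```python
import numpy as np

def look_at_extrinsic(eye, target, up=(0.0, 1.0, 0.0)):
    """构造 world-to-camera 的 4x4 外参矩阵,使相机位于 eye、注视 target。"""
    eye, target, up = map(lambda v: np.asarray(v, float), (eye, target, up))
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward])   # 行向量为相机坐标轴
    t = -R @ eye
    E = np.eye(4)
    E[:3, :3], E[:3, 3] = R, t
    return E
```

例如把相机放在 (0, 0, 5) 并注视原点,原点在相机坐标系中应落在 z = -5 处(正前方)。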
[CV-73] Exploring 6D Object Pose Estimation with Deformation
【速读】:该论文旨在解决现有6D物体位姿估计方法在面对非刚性变形物体时性能显著下降的问题。当前大多数方法假设物体为刚体或可关节约束结构,但在实际场景中,物体常因磨损、撞击或形变而偏离其标准形态(canonical shape),导致传统方法失效。为此,作者提出了DeSOPE数据集,其关键创新在于构建了一个大规模、高保真度的6DoF变形物体数据集,包含26类常见物体在1个标准状态和3种变形状态下的3D扫描数据,并通过精确的3D配准实现多姿态建模;同时提供包含13.3万帧RGB-D图像和66.5万条位姿标注的高质量数据集,标注流程采用半自动管道:先进行2D实例掩码标注,再用物体位姿估计方法获取初始位姿,继而通过物体级SLAM系统优化位姿,最后人工验证以确保精度。实验表明,随着变形程度增加,现有方法性能急剧下降,凸显了对变形鲁棒处理机制的需求,从而推动了面向真实世界复杂形变场景的位姿估计研究。
链接: https://arxiv.org/abs/2604.06720
作者: Zhiqiang Liu,Rui Song,Duanmu Chuangqi,Jiaojiao Li,David Ferstl,Yinlin Hu
机构: Xidian University (西安电子科技大学); MagicLeap
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications. The project page and dataset are available at this https URL.
[CV-74] HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
【速读】:该论文旨在解决遥感图像语义分割中模型难以同时捕捉精细空间细节与高层语义上下文的问题,尤其针对传统编码器-解码器架构(如U-Net)在全局语义利用和结构化特征交互方面存在的局限性。其解决方案的关键在于提出一种混合量子-经典多尺度融合网络HQF-Net,通过冻结的DINOv3 ViT-L/16骨干网络提供多尺度语义引导,并引入可变形多尺度交叉注意力融合模块(Deformable Multiscale Cross-Attention Fusion, DMCAF)实现高效特征融合;此外,创新性地设计了量子增强跳跃连接(QSkip)与量子瓶颈中的专家混合机制(Quantum bottleneck with Mixture-of-Experts, QMoE),结合局部、全局及方向性量子电路并采用自适应路由策略,从而在近期内存量子硬件约束下显著提升特征表示能力。实验表明,该方法在多个遥感数据集上均取得优于现有基准的性能。
链接: https://arxiv.org/abs/2604.06715
作者: Md Aminur Hossain,Ayush V. Patel,Siddhant Gole,Sanjay K. Singh,Biplab Banerjee
机构: Space Applications Centre, ISRO, India; Indian Institute of Technology Bombay
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:Remote sensing semantic segmentation requires models that can jointly capture fine spatial details and high-level semantic context across complex scenes. While classical encoder-decoder architectures such as U-Net remain strong baselines, they often struggle to fully exploit global semantics and structured feature interactions. In this work, we propose HQF-Net, a hybrid quantum-classical multi-scale fusion network for remote sensing image segmentation. HQF-Net integrates multi-scale semantic guidance from a frozen DINOv3 ViT-L/16 backbone with a customized U-Net architecture through a Deformable Multiscale Cross-Attention Fusion (DMCAF) module. To enhance feature refinement, the framework further introduces quantum-enhanced skip connections (QSkip) and a Quantum bottleneck with Mixture-of-Experts (QMoE), which combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism. Experiments on three remote sensing benchmarks show consistent improvements with the proposed design. HQF-Net achieves 0.8568 mIoU and 96.87% overall accuracy on this http URL, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. An architectural ablation study further confirms the contribution of each major component. These results show that structured hybrid quantum-classical feature processing is a promising direction for improving remote sensing semantic segmentation under near-term quantum constraints.
[CV-75] Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency
【速读】:该论文旨在解决半稠密图像匹配方法中长期存在的两个问题:在粗匹配阶段,互近邻(Mutual Nearest Neighbor, MNN)匹配层存在过排除(over-exclusion)现象,导致难以处理图像间尺度差异;在细匹配阶段,现有方法忽视了最终匹配的局部一致性,从而削弱了鲁棒性。解决方案的关键在于:首先提出一种尺度感知匹配模块(scale-aware matching module),利用得分矩阵中的隐含信息推断尺度比例,有效缓解因尺度差异导致的匹配失效问题;其次将细匹配阶段重构为级联光流精修问题,并引入新型梯度损失以增强光流场的局部一致性,从而提升匹配结果的稳定性和精度。
链接: https://arxiv.org/abs/2604.06713
作者: Ke Jin,Jiming Chen,Qi Ye
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent semi-dense image matching methods have achieved remarkable success, but two long-standing issues still impair their performance. At the coarse stage, the over-exclusion issue of their mutual nearest neighbor (MNN) matching layer makes them struggle to handle cases with scale difference between images. To this end, we comprehensively revisit the matching mechanism and make a key observation that the hint concealed in the score matrix can be exploited to indicate the scale ratio. Based on this, we propose a scale-aware matching module which is exceptionally effective but introduces negligible overhead. At the fine stage, we point out that existing methods neglect the local consistency of final matches, which undermines their robustness. To this end, rather than independently predicting the correspondence for each source pixel, we reformulate the fine stage as a cascaded flow refinement problem and introduce a novel gradient loss to encourage local consistency of the flow field. Extensive experiments demonstrate that our novel matching pipeline, with these proposed modifications, achieves robust and accurate matching performance on downstream tasks.
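摘要中讨论的互近邻(MNN)匹配层可以很简洁地示意:在得分矩阵中仅保留互为行/列最大值的 (i, j) 对(这是通用的 MNN 定义,并非论文改进后的尺度感知版本;阈值参数为示意假设):

```python
import numpy as np

def mutual_nearest_matches(score, thresh=0.0):
    """互近邻匹配:保留互为行最大值与列最大值且得分超过阈值的匹配对。
    score: (N, M) 得分矩阵,行对应源特征,列对应目标特征。"""
    row_best = score.argmax(axis=1)   # 每个源特征的最优目标
    col_best = score.argmax(axis=0)   # 每个目标特征的最优源
    return [(i, int(j)) for i, j in enumerate(row_best)
            if col_best[j] == i and score[i, j] > thresh]
```

当某个源特征的最优目标"另有所爱"时(非互为最优),该匹配被排除,这也是摘要中所说"过排除"(over-exclusion)问题的来源。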
[CV-76] RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection
【速读】:该论文旨在解决多模态虚假视频检测中两个关键问题:一是缺乏跨实例的全局语义关联,难以有效利用历史关联证据验证当前视频的真实性;二是不同领域间的语义差异阻碍了通用知识的迁移,且缺乏领域专家知识的引导。解决方案的核心在于提出一种检索增强的语义推理框架(Retrieval-Augmented Semantic Reasoning, RASR),其关键创新包括:(1) 通过跨实例语义解析与检索模块(Cross-instance Semantic Parser and Retriever, CSPR)从动态记忆库中提取相关关联证据,增强全局语义建模能力;(2) 引入领域引导的多模态推理模块(Domain-Guided Multimodal Reasoning, DGMP),融合领域先验知识驱动大语言模型生成具有领域特异性的深度分析报告;(3) 设计多视角特征解耦与融合模块(Multi-View Feature Decoupling and Fusion, MVDFF),基于自适应门控机制整合多维特征,提升检测鲁棒性。
链接: https://arxiv.org/abs/2604.06687
作者: Hui Li,Peien Ding,Jun Li,Guoqi Ma,Zhanyu Liu,Ge Xu,Junfeng Yao,Jinsong Su
机构: Xiamen University (厦门大学); Guilin University of Electronic Technology (桂林电子科技大学); Minjiang University (闽江学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,5 figures
Abstract:Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or utilizing pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) lacking cross-instance global semantic correlations, making it difficult to effectively utilize historical associative evidence to verify the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, lacking the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves the overall detection accuracy by up to 0.93%.
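摘要中的 CSPR 模块从动态记忆库中检索相关关联证据,其核心检索步骤可以用余弦相似度 top-k 示意(通用检索草图,并非论文官方实现;记忆库条目与查询向量的构造方式均为假设):

```python
import numpy as np

def retrieve_topk(query, memory, k=2):
    """按余弦相似度从记忆库中检索与查询最相关的 k 条证据的索引。
    query: (D,);memory: (N, D),每行是一条记忆库条目的向量表示。"""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q                       # 各条目与查询的余弦相似度
    return np.argsort(-sims)[:k]       # 相似度降序的前 k 个索引
```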
[CV-77] VDPP: Video Depth Post-Processing for Speed and Scalability CVPR2024
【速读】:该论文旨在解决当前端到端(end-to-end, E2E)视频深度估计模型因系统耦合紧密而导致的适应滞后问题,即当更优的单图深度估计算法发布时,E2E模型难以快速集成更新。为此,作者提出了一种后处理框架VDPP(Video Depth Post-Processing),其关键在于将传统依赖RGB图像和高计算成本的场景重建范式,转变为在低分辨率空间中仅进行几何精修的轻量级策略。通过密集残差学习驱动几何表示而非完整重建,VDPP实现了43.5 FPS的高速运行(NVIDIA Jetson Orin Nano平台),同时保持与E2E系统相当的时间一致性,并具备RGB无关架构带来的可扩展性,从而成为实时边缘部署中最实用的解决方案。
链接: https://arxiv.org/abs/2604.06665
作者: Daewon Yoon,Injun Baek,Sangyu Han,Yearim Kim,Nojun Kwak
机构: Seoul National University (首尔国立大学); Samsung Electronics (三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures. Accepted to CVPR 2024 Workshop. Project page: this https URL
Abstract:Video depth estimation is essential for providing 3D scene structure in applications ranging from autonomous driving to mixed reality. Although current end-to-end (E2E) models have achieved state-of-the-art performance, they function as tightly coupled systems that suffer from a significant adaptation lag whenever superior single-image depth estimators are released. To mitigate this issue, post-processing methods such as NVDS offer a modular plug-and-play alternative to incorporate any evolving image depth model without retraining. However, existing post-processing methods still struggle to match the efficiency and practicality of E2E systems due to limited speed, accuracy, and RGB reliance. In this work, we revitalize the role of post-processing by proposing VDPP (Video Depth Post-Processing), a framework that improves the speed and accuracy of post-processing methods for video depth estimation. By shifting the paradigm from computationally expensive scene reconstruction to targeted geometric refinement, VDPP operates purely on geometric refinements in low-resolution space. This design achieves exceptional speed (43.5 FPS on NVIDIA Jetson Orin Nano) while matching the temporal coherence of E2E systems, with dense residual learning driving geometric representations rather than full reconstructions. Furthermore, our VDPP’s RGB-free architecture ensures true scalability, enabling immediate integration with any evolving image depth model. Our results demonstrate that VDPP provides a superior balance of speed, accuracy, and memory efficiency, making it the most practical solution for real-time edge deployment. Our project page is at this https URL
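VDPP 强调对单图深度做轻量几何精修而非完整场景重建。单目深度后处理中一个常见的基础操作,是用最小二乘把当前帧深度按尺度-平移对齐到参考帧(这是该领域的通用技巧,并非论文的具体模块;函数名为示意):

```python
import numpy as np

def align_scale_shift(depth, ref):
    """最小二乘求解 s, t 使 s*depth + t 逼近 ref,
    用于把逐帧独立估计的单图深度对齐到同一尺度。"""
    d, r = depth.ravel(), ref.ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s * depth + t, (s, t)
```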
[CV-78] owards Robust Content Watermarking Against Removal and Forgery Attacks CVPR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在文本到图像扩散模型中产生的内容所面临的版权保护、图像溯源和信用归属等问题,特别是现有水印技术易受移除攻击(removal attacks)和伪造攻击(forgery attacks)的脆弱性。其解决方案的关键在于提出一种名为 Instance-Specific Watermarking with Two-Sided Detection (ISTS) 的新型水印范式:首先,通过基于用户提示语义动态控制水印注入时机与模式,实现实例特异性水印;其次,引入双侧检测机制(two-sided detection),显著提升水印在对抗攻击下的鲁棒性。
链接: https://arxiv.org/abs/2604.06662
作者: Yifan Zhu,Yihan Wang,Xiao-Shan Gao
机构: Academy of Mathematics and Systems Science, Chinese Academy of Sciences (中国科学院数学与系统科学研究院); University of Chinese Academy of Sciences (中国科学院大学); University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 5 figures, CVPR 2026 Findings
Abstract:Generated contents have raised serious concerns about copyright protection, image provenance, and credit attribution. A potential solution for these problems is watermarking. Recently, content watermarking for text-to-image diffusion models has been studied extensively for its effective detection utility and robustness. However, these watermarking techniques are vulnerable to potential adversarial attacks, such as removal attacks and forgery attacks. In this paper, we build a novel watermarking paradigm called Instance-Specific watermarking with Two-Sided detection (ISTS) to resist removal and forgery attacks. Specifically, we introduce a strategy that dynamically controls the injection time and watermarking patterns based on the semantics of users’ prompts. Furthermore, we propose a new two-sided detection approach to enhance robustness in watermark detection. Experiments have demonstrated the superiority of our watermarking against removal and forgery attacks.
[CV-79] GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation
【速读】:该论文旨在解决多器官、多模态三维医学图像分割中同时实现高精度与计算效率的难题,尤其是在资源受限和时间敏感的临床环境中。其解决方案的关键在于提出了一种轻量级网络架构GPAFormer,核心创新包括两个模块:一是多尺度注意力引导的堆叠聚合(multi-scale attention-guided stacked aggregation, MASA),通过三条具有不同感受野的并行路径及平面聚合机制增强对不同尺寸结构的建模能力;二是互知补丁图聚合器(mutual-aware patch graph aggregator, MPGA),基于补丁间特征相似性和空间邻近性动态聚合相似区域,从而提升器官内部及边界结构的区分度。该设计在仅使用1.81M参数的情况下,在多个公开数据集上实现了领先的Dice相似系数(DSC)表现,并且在消费级GPU上单次推理时间低于1秒,显著平衡了分割性能与计算效率。
链接: https://arxiv.org/abs/2604.06658
作者: Chung-Ming Lo,I-Yun Liu,Wei-Yang Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning has been widely applied to 3D medical image segmentation tasks. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains a challenge. This study proposed GPAFormer, a lightweight network architecture specifically designed for 3D medical image segmentation, emphasizing efficiency while keeping high accuracy. GPAFormer incorporated two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA utilized three parallel paths with different receptive fields, combined through planar aggregation, to enhance the network’s capability in handling structures of varying sizes. MPGA employed a graph-guided approach to dynamically aggregate regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both internal and boundary structures of organs. Experiments were performed on public whole-body CT and MRI datasets including BTCV, Synapse, ACDC, and BraTS. Compared to existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the overall highest DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). Using a consumer-level GPU, the inference time for one validation case of BTCV was less than one second. The results demonstrated that GPAFormer balanced accuracy and efficiency in multi-organ, multi-modality 3D segmentation tasks across various clinical scenarios, especially for resource-constrained and time-sensitive clinical environments.
[CV-80] Controllable Generative Video Compression
【速读】:该论文旨在解决感知视频压缩(perceptual video compression)中感知真实感与信号保真度之间的权衡问题,即现有方法在提升主观视觉质量的同时往往牺牲了客观的信号 fidelity。解决方案的关键在于提出可控生成式视频压缩(Controllable Generative Video Compression, CGVC)范式,通过引入多视觉条件引导生成过程:首先对场景代表性关键帧进行编码以提供结构先验,同时为每个非关键帧编码密集的逐帧控制先验,从而更精确地保留细节结构和语义信息;在此基础上,利用可控视频生成模型实现时序一致性和内容一致性重建,并结合颜色距离引导的关键帧选择算法优化颜色恢复精度,最终在保持高感知质量的同时显著提升信号保真度。
链接: https://arxiv.org/abs/2604.06655
作者: Ding Ding,Daowen Li,Ying Chen,Yixin Gao,Ruixiao Dong,Kai Li,Li Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Perceptual video compression adopts generative video modeling to improve perceptual realism but frequently sacrifices signal fidelity, diverging from the goal of video compression to faithfully reproduce visual signal. To alleviate the dilemma between perception and fidelity, in this paper we propose the Controllable Generative Video Compression (CGVC) paradigm to faithfully generate details guided by multiple visual conditions. Under the paradigm, representative keyframes of the scene are coded and used to provide structural priors for non-keyframe generation. Dense per-frame control prior is additionally coded to better preserve finer structure and semantics of each non-keyframe. Guided by these priors, non-keyframes are reconstructed by a controllable video generation model with temporal and content consistency. Furthermore, to accurately recover color information of the video, we develop a color-distance-guided keyframe selection algorithm to adaptively choose keyframes. Experimental results show CGVC outperforms previous perceptual video compression methods in terms of both signal fidelity and perceptual quality.
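摘要中的颜色距离引导关键帧选择,可以用颜色直方图距离的贪心策略示意:当某帧与所有已选关键帧的颜色距离都超过阈值时,立为新关键帧(示意性草图,并非论文官方算法;直方图描述子与 L1 距离均为假设的度量选择):

```python
import numpy as np

def color_hist(frame, bins=8):
    """每通道归一化颜色直方图,作为帧的颜色描述子。frame: (H, W, 3),取值 [0, 1]。"""
    return np.concatenate([
        np.histogram(frame[..., c], bins=bins, range=(0, 1))[0]
        for c in range(3)
    ]).astype(float) / frame[..., 0].size

def select_keyframes(frames, thresh=0.5):
    """贪心选择:与最近关键帧颜色距离(L1)超过阈值的帧成为新关键帧。"""
    keys, hists = [0], [color_hist(frames[0])]
    for t in range(1, len(frames)):
        h = color_hist(frames[t])
        if min(np.abs(h - hk).sum() for hk in hists) > thresh:
            keys.append(t)
            hists.append(h)
    return keys
```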
[CV-81] Variational Feature Compression for Model-Specific Representations
【速读】:该论文旨在解决深度学习推理在共享和云环境部署中面临的输入再利用(input repurposing)问题,即未经授权的模型可能利用为特定任务生成的数据表示进行其他下游任务。现有隐私保护方法主要通过限制数据访问来实现防御,但难以控制释放的特征表示所能支持的下游用途。解决方案的关键在于提出一种特征提取框架,该框架通过变分潜在瓶颈(variational latent bottleneck)将输入编码至紧凑的潜在空间,并结合任务驱动的交叉熵目标与KL正则化训练,同时不使用像素级重建损失;进一步引入一个基于维度KL散度和针对冻结目标模型的梯度显著性计算的动态二值掩码,抑制对指定任务无信息的潜在维度,从而在保持目标分类器准确率的同时显著削弱非预期模型的性能。
链接: https://arxiv.org/abs/2604.06644
作者: Zinan Guo,Zihan Wang,Chuan Yan,Liuhuo Wan,Ethan Ma,Guangdong Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:As deep learning inference is increasingly deployed in shared and cloud-based settings, a growing concern is input repurposing, in which data submitted for one task is reused by unauthorized models for another. Existing privacy defenses largely focus on restricting data access, but provide limited control over what downstream uses a released representation can still support. We propose a feature extraction framework that suppresses cross-model transfer while preserving accuracy for a designated classifier. The framework employs a variational latent bottleneck, trained with a task-driven cross-entropy objective and KL regularization, but without any pixel-level reconstruction loss, to encode inputs into a compact latent space. A dynamic binary mask, computed from per-dimension KL divergence and gradient-based saliency with respect to the frozen target model, suppresses latent dimensions that are uninformative for the intended task. Because saliency computation requires gradient access, the encoder is trained in a white-box setting, whereas inference requires only a forward pass through the frozen target model. On CIFAR-100, the processed representations retain strong utility for the designated classifier while reducing the accuracy of all unintended classifiers to below 2%, yielding a suppression ratio exceeding 45 times relative to unintended models. Preliminary experiments on CIFAR-10, Tiny ImageNet, and Pascal VOC provide exploratory evidence that the approach extends across task settings, although further evaluation is needed to assess robustness against adaptive adversaries.
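摘要中的二值掩码部分依据逐维 KL 散度计算。对角高斯后验 N(mu, sigma^2) 相对标准正态先验的逐维 KL 有解析式,可据此保留信息量大的维度(示意草图;实际方法还结合了针对冻结目标模型的梯度显著性,此处未涉及):

```python
import numpy as np

def kl_per_dim(mu, logvar):
    """q(z) = N(mu, sigma^2) 相对 N(0, 1) 的逐维 KL 散度解析式:
    KL = 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)。"""
    return 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def kl_mask(mu, logvar, keep_ratio=0.5):
    """保留 KL 最大(偏离先验、信息量大)的维度,抑制其余冗余维度。"""
    kl = kl_per_dim(mu, logvar)
    k = max(1, int(keep_ratio * kl.size))
    mask = np.zeros(kl.size, bool)
    mask[np.argsort(-kl)[:k]] = True
    return mask
```

接近先验(mu ≈ 0、logvar ≈ 0)的维度 KL 趋近 0,对指定任务几乎不携带信息,被掩码抑制。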
[CV-82] SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport CVPR2026
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际部署中因系统异构性和统计异构性导致的挑战,尤其是现有联邦网络剪枝方法面临的两难困境:服务器端剪枝缺乏个性化,客户端剪枝则对资源受限设备计算负担过重;同时,剪枝过程引发各客户端子模型间参数差异显著,破坏训练稳定性并阻碍全局收敛。其解决方案的关键在于提出SubFLOT框架,通过两个核心模块实现高效且个性化的剪枝:一是基于最优传输增强的剪枝(Optimal Transport-enhanced Pruning, OTP)模块,将历史客户端模型视为局部数据分布的代理,以Wasserstein距离最小化为目标生成定制化子模型而无需访问原始数据;二是基于缩放的自适应正则化(Scaling-based Adaptive Regularization, SAR)模块,根据客户端剪枝率动态调整对子模型偏离全局模型程度的惩罚强度,从而有效缓解参数发散问题,保障全局训练稳定性和性能。
链接: https://arxiv.org/abs/2604.06631
作者: Zheng Jiang,Nan He,Yiming Chen,Lifeng Sun
机构: Tsinghua University (清华大学); Beijing University of Technology (北京工业大学); Key Laboratory of Pervasive Computing, Ministry of Education (教育部普适计算重点实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel’s deviation from the global model, with the penalty’s strength scaled by the client’s pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.
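SubFLOT 将剪枝任务建模为 Wasserstein 距离最小化问题。一维经验分布之间的 W1 距离有简单的闭式计算(排序后按分位逐点对齐取平均差),下面给出最小示意(论文面向高维模型参数分布,这里仅演示一维同样本量的情形):

```python
import numpy as np

def wasserstein_1d(a, b):
    """两个等样本量一维经验分布间的 W1 距离:
    排序后按分位对齐,取逐点绝对差的平均。"""
    a = np.sort(np.asarray(a, float))
    b = np.sort(np.asarray(b, float))
    assert a.size == b.size, "此示意仅处理等样本量的情形"
    return np.abs(a - b).mean()
```

注意该距离只依赖分布形状:同一组数打乱顺序后距离为 0,而整体平移会线性增大距离。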
[CV-83] WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression
[Quick Read]: This paper tackles image degradation under adverse weather, such as the blur, occlusion, and low brightness caused by rain, snow, and fog, which can significantly harm downstream computer vision tasks. To restore images efficiently across multiple weather scenarios, the authors propose WeatherRemover, whose key design is a UNet-like structure combining a gating mechanism with a multi-scale pyramid vision Transformer: channel attention derived from convolutional neural networks refines feature extraction, linear spatial reduction cuts the cost of attention, and gating mechanisms embedded in the feed-forward and downsampling stages suppress the influence of redundant information on learning. This enables adaptive data selection that delivers high-quality restoration while markedly improving parameter efficiency, speed, and memory usage, balancing performance with practicality.
Link: https://arxiv.org/abs/2604.06623
Authors: Weikai Qu, Sijun Liang, Cheng Pan, Zikuan Yang, Guanchi Zhou, Xianjun Fu, Bo Liu, Changmiao Wang, Ahmed Elazab
Affiliations: Guangdong University of Technology; Sanda University; University of Queensland; Shenzhen MSU-BIT University; Zhejiang College of Security Technology; Northwest China Research Institute of Electronic Equipment; Shenzhen Research Institute of Big Data; Shenzhen University; Misr Higher Institute for Commerce and Computers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Artificial Intelligence
Abstract:Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at this https URL.
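The gated feed-forward idea can be sketched generically. This is a minimal GLU-style gate, not WeatherRemover's exact block (its gate placement and Transformer context differ); all weight shapes are arbitrary. The sigmoid gate lies in (0, 1) per channel, so redundant features can be damped before reaching the output projection:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_ffn(x, w_value, w_gate, w_out):
    """GLU-style gated feed-forward: candidate features are scaled by a
    learned per-channel gate before the output projection."""
    value = x @ w_value          # candidate features
    gate = sigmoid(x @ w_gate)   # pass-through weights in (0, 1)
    return (value * gate) @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))          # 2 tokens, 8 channels
w_value = rng.normal(size=(8, 16))
w_gate = rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 8))
y = gated_ffn(x, w_value, w_gate, w_out)
```

When a gate entry saturates near zero, the corresponding hidden channel is effectively dropped for that token, which is the redundancy-suppression behavior the abstract describes.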
[CV-84] Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction
[Quick Read]: This paper addresses metal-implant artifacts in computed tomography (CT), which compromise image quality and diagnostic accuracy. Existing methods face three challenges: deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource consumption and restoration efficiency. The key of the proposed MARMamba is a streamlined UNet architecture with multi-scale Mamba (MS-Mamba) as its core module: a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations, and an average-maximum feed-forward network fuses critical and average features to suppress artifacts. The model operates on artifact-affected CT images alone, requiring no additional input data, and strikes a favorable balance among computational cost, memory usage, and parameter count, underscoring its practical value.
Link: https://arxiv.org/abs/2604.06622
Authors: Weikai Qu, Sijun Liang, Xianfeng Li, Cheng Pan, An Yan, Ahmed Elazab, Shanzhou Niu, Dong Zeng, Xiang Wan, Changmiao Wang
Affiliations: New York University; BaiyunPort; University of Technology Sydney; Sanda University; Shan Dong Xiehe College; Tsinghua University; Gannan Normal University; Southern Medical University; Shenzhen Research Institute of Big Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Radiation and Plasma Medical Sciences
Abstract:In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, which effectively eliminates artifacts caused by metals of different sizes while maintaining the integrity of the original anatomical structures of the image. Furthermore, this model only focuses on CT images affected by metal artifacts, thus negating the requirement for additional input data. The model is a streamlined UNet architecture, which incorporates multi-scale Mamba (MS-Mamba) as its core module. Within MS-Mamba, a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations. Subsequently, the average maximum feed-forward network integrates critical features with average features to suppress the artifacts. This combination allows MARMamba to reduce artifacts efficiently. The experimental results demonstrate that our model excels in reducing metal artifacts, offering distinct advantages over other models. It also strikes an optimal balance between computational demands, memory usage, and the number of parameters, highlighting its practical utility in the real world. The code of the presented model is available at: this https URL.
[CV-85] Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels
[Quick Read]: This paper targets the limitations of prompt learning when only partial labels are available, where label ambiguity and insufficient supervision cap performance. The proposed Holistic Optimal Label Selection (HopS) relies on two complementary strategies for robust label selection: a local density-based filter that uses the most frequent labels in nearest neighbors' candidate sets together with softmax scores to identify the most plausible label, capturing structural regularities in the feature space; and a global selection objective based on optimal transport that maps a uniform sampling distribution to the batch's candidate label distributions, determining the most likely label assignments by minimizing the expected transport cost. Working locally and globally in tandem, the two strategies substantially improve prompt learning under weak supervision.
Link: https://arxiv.org/abs/2604.06614
Authors: Yaqi Zhao, Haoliang Sun, Yating Wang, Yongshun Gong, Yilong Yin
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the top frequent labels from the nearest neighbors’ candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Those results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
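The global optimal-transport selection step can be sketched with entropic-regularized Sinkhorn iterations. This toy is not HopS itself: scores, candidate sets, and marginals below are invented, and HopS's exact cost and marginals may differ. Non-candidate labels are forbidden by a large cost, and the plan's argmax per sample gives a label assignment consistent with both the scores and the batch-level marginals:

```python
import numpy as np

def sinkhorn(cost, r, c, reg=0.1, iters=200):
    """Entropic-regularized OT: returns a plan with row marginals ~r
    and column marginals ~c via Sinkhorn scaling."""
    K = np.exp(-cost / reg)
    u = np.ones_like(r)
    for _ in range(iters):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy batch: 3 samples, 4 classes; cost = 1 - softmax score, with
# labels outside each candidate set forbidden via a huge cost.
scores = np.array([[0.7, 0.2, 0.1, 0.0],
                   [0.1, 0.6, 0.2, 0.1],
                   [0.2, 0.1, 0.1, 0.6]])
candidate = np.array([[1, 1, 0, 0],
                      [0, 1, 1, 0],
                      [1, 0, 0, 1]], dtype=bool)
cost = 1.0 - scores + 1e3 * (~candidate)
plan = sinkhorn(cost, r=np.full(3, 1 / 3), c=np.full(4, 1 / 4), reg=0.05)
labels = plan.argmax(axis=1)
```

The forbidden entries carry (numerically) zero mass, so every assigned label stays inside its candidate set while the transport cost is minimized globally.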
[CV-86] VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography ICPR2026
[Quick Read]: This paper addresses the difficulty of learning robust representations for optical coherence tomography angiography (OCTA), where vessel structures are sparse and subject to strong topological constraints. Existing self-supervised methods such as masked autoencoders are designed mainly for dense natural images and rely on uniform masking and pixel-level reconstruction, which capture vascular geometry poorly. The key of the proposed vessel-aware masked autoencoding framework (VAMAE) is an anatomically informed masking strategy driven by vesselness and skeleton cues, which steers the model toward vascular connectivity and branching patterns, together with a multi-target reconstruction objective that lets the model jointly capture appearance, structural, and topological information, yielding clear segmentation gains on OCTA images in low-label regimes.
Link: https://arxiv.org/abs/2604.06583
Authors: Ilerioluwakiiye Abolade, Prince Mireku, Kelechi Chibundu, Peace Ododo, Emmanuel Idoko, Promise Omoigui, Solomon Odelola
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures. Accepted at ICPR 2026
Abstract:Optical coherence tomography angiography (OCTA) provides non-invasive visualization of retinal microvasculature, but learning robust representations remains challenging due to sparse vessel structures and strong topological constraints. Many existing self-supervised learning approaches, including masked autoencoders, are primarily designed for dense natural images and rely on uniform masking and pixel-level reconstruction, which may inadequately capture vascular geometry. We propose VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. The approach incorporates anatomically informed masking that emphasizes vessel-rich regions using vesselness and skeleton-based cues, encouraging the model to focus on vascular connectivity and branching patterns. In addition, the pretraining objective includes reconstructing multiple complementary targets, enabling the model to capture appearance, structural, and topological information. We evaluate the proposed pretraining strategy on the OCTA-500 benchmark for several vessel segmentation tasks under varying levels of supervision. The results indicate that vessel-aware masking and multi-target reconstruction provide consistent improvements over standard masked autoencoding baselines, particularly in limited-label settings, suggesting the potential of geometry-aware self-supervised learning for OCTA analysis.
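Anatomically biased masking of the kind VAMAE describes can be sketched as weighted sampling without replacement. The bias weight and vesselness values below are invented, and the paper's actual masking rule (and its use of skeleton cues) may differ; the point is only that vessel-rich patches end up hidden more often than background:

```python
import numpy as np

def vessel_aware_mask(vesselness, mask_ratio=0.5, bias=4.0, rng=None):
    """Sample patch indices to mask with probability proportional to
    1 + bias * vesselness, so vessel-rich patches are hidden more often."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = vesselness.size
    weights = 1.0 + bias * vesselness.ravel()
    probs = weights / weights.sum()
    n_mask = int(round(mask_ratio * n))
    return rng.choice(n, size=n_mask, replace=False, p=probs)

# 16 patches: the first 4 are "vessel-rich", the rest near-background.
vesselness = np.array([0.9] * 4 + [0.05] * 12)
masked = vessel_aware_mask(vesselness, mask_ratio=0.5)
```

Over repeated draws, vessel patches are masked noticeably more often, which forces the autoencoder to reconstruct vascular structure rather than easy background.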
[CV-87] LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation
[Quick Read]: This paper tackles monocular depth estimation (MDE), the highly ill-posed problem of recovering scene depth from a single image, whose core challenges are mapping color features to geometrically meaningful depth values and improving accuracy around edges. The key of the proposed LiftFormer, built on a lifting-theory topology, is twofold: first, a depth-oriented geometric representation (DGR) subspace is constructed via frame theory using linearly dependent vectors aligned with depth bins, giving a redundant and robust representation that maps image spatial features into a subspace corresponding directly to depth values; second, an edge-aware representation (ER) subspace enhances depth features near edges, mitigating the prediction errors caused by sharp depth changes there. Experiments show state-of-the-art performance on widely used benchmarks, and ablations validate both lifting modules.
Link: https://arxiv.org/abs/2604.06576
Authors: Shuai Li, Huibin Bai, Yanbo Gao, Chong Lv, Hui Yuan, Chuankun Li, Wei Hua, Tian Xie
Affiliations: Shandong University; School of Control Science and Engineering; Key Laboratory of Machine Intelligence and System Control, Ministry of Education; School of Software; Shandong University-WeiHai Research Institute of Industrial Technology; North University of China; State Key Laboratory of Dynamic Testing Technology; School of Information and Communication Engineering; Research Institute of Interdisciplinary Innovation, Zhejiang Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by IEEE Transactions on Multimedia
Abstract:Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.
[CV-88] DesigNet: Learning to Draw Vector Graphics as Designers Do
[Quick Read]: This paper addresses the fundamental mismatch between how generative AI and human designers reason about design, in particular how to enable effective human-AI collaboration in Scalable Vector Graphics (SVG) generation. The key of the solution is DesigNet, a hierarchical Transformer-VAE that operates directly on SVG sequences with a continuous command parameterization, built around two differentiable modules: a continuity self-refinement module that predicts and enforces C^0, G^1, and C^1 continuity at curve points by adjusting Bézier control points, and an alignment self-refinement module with snapping for horizontal or vertical lines. Together they markedly improve the continuity and alignment of the output, making generated content easier to edit and integrate into professional design workflows.
Link: https://arxiv.org/abs/2604.06494
Authors: Tomas Guija-Valiente, Iago Suárez
Affiliations: Machine Learning Circle; Universidad Politécnica de Madrid; Qualcomm XR Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:AI-driven content generation has made remarkable progress in recent years. However, neural networks and human designers operate in fundamentally different ways, making collaboration between them challenging. We address this gap for Scalable Vector Graphics (SVG) by equipping neural networks with tools commonly used by designers, such as axis alignment and explicit continuity control at command junctions. We introduce DesigNet, a hierarchical Transformer-VAE that operates directly on SVG sequences with a continuous command parameterization. Our main contributions are two differentiable modules: a continuity self-refinement module that predicts C^0 , G^1 , and C^1 continuity for each curve point and enforces it by modifying Bézier control points, and an alignment self-refinement module with snapping capabilities for horizontal or vertical lines. DesigNet produces editable outlines and achieves competitive results against state-of-the-art methods, with notably higher accuracy in continuity and alignment. These properties ensure the outputs are easier to refine and integrate into professional design workflows. Source Code: this https URL.
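Enforcing C^1 continuity at a junction of two cubic Béziers has a closed form. For incoming curve P(p0..p3) and outgoing curve Q(q0..q3) sharing q0 = p3, the end tangent P'(1) = 3(p3 - p2) and start tangent Q'(0) = 3(q1 - q0) match exactly when q1 = 2*p3 - p2. The sketch below (invented coordinates, not DesigNet's module) snaps a free control point accordingly:

```python
import numpy as np

def c1_control_point(p2, p3):
    """C1 continuity at the junction q0 = p3 of two cubic Beziers holds
    iff q1 = 2*p3 - p2, since P'(1) = 3*(p3 - p2) and Q'(0) = 3*(q1 - q0)."""
    return 2.0 * p3 - p2

p2 = np.array([1.0, 0.5])
p3 = np.array([2.0, 1.0])      # shared junction point (q0 = p3)
q1 = np.array([2.3, 1.9])      # free control point, not yet C1
q1_fixed = c1_control_point(p2, p3)

in_tangent = 3.0 * (p3 - p2)
out_tangent = 3.0 * (q1_fixed - p3)   # equal after snapping
```

A differentiable version of this snap (predicting whether to apply it, then modifying the control point) is the kind of operation the continuity self-refinement module performs.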
[CV-89] Hybrid ResNet-1D-BiGRU with Multi-Head Attention for Cyberattack Detection in Industrial IoT Environments
[Quick Read]: This paper addresses the accuracy and real-time requirements of intrusion detection in Industrial Internet of Things (IIoT) systems, in particular the challenges of class imbalance and of extracting complex spatio-temporal features. The key of the proposed hybrid deep learning model is the combination of ResNet-1D for spatial feature extraction, a bidirectional gated recurrent unit (BiGRU) for temporal modeling, and Multi-Head Attention (MHA) for feature weighting, while SMOTE mitigates class imbalance in the training set. The model achieves high accuracy (up to 99.99%), a 0% false-positive rate, and very low inference latency (0.00014 s per instance) on the EdgeHoTset and CICIoV2024 datasets, demonstrating effectiveness and generalizability for realistic deployments.
Link: https://arxiv.org/abs/2604.06481
Authors: Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:This study introduces a hybrid deep learning model for intrusion detection in Industrial IoT (IIoT) systems, combining ResNet-1D, BiGRU, and Multi-Head Attention (MHA) for effective spatial-temporal feature extraction and attention-based feature weighting. To address class imbalance, SMOTE was applied during training on the EdgeHoTset dataset. The model achieved 98.71% accuracy, a loss of 0.0417%, and low inference latency (0.0001 sec /instance), demonstrating strong real-time capability. To assess generalizability, the model was also tested on the CICIoV2024 dataset, where it reached 99.99% accuracy and F1-score, with a loss of 0.0028, 0 % FPR, and 0.00014 sec/instance inference time. Across all metrics and datasets, the proposed model outperformed existing methods, confirming its robustness and effectiveness for real-time IoT intrusion detection.
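SMOTE, used here to rebalance the training set, interpolates each synthetic minority sample between a real sample and one of its nearest minority neighbors. A minimal sketch (toy 2-D data, not the paper's pipeline, and real SMOTE implementations such as imbalanced-learn's add further bookkeeping):

```python
import numpy as np

def smote_sample(X_minority, n_new, k=3, rng=None):
    """Minimal SMOTE: each synthetic point lies on the segment between a
    random minority sample and one of its k nearest minority neighbors."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X = np.asarray(X_minority, float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                    # interpolation factor in [0, 1)
        out.append(X[i] + lam * (X[j] - X[i]))
    return np.array(out)

# Toy minority class: the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(X_min, n_new=10)
```

Because every synthetic point is a convex combination of two minority samples, the oversampled class stays inside the minority region instead of duplicating points exactly.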
[CV-90] Predicting Alzheimers disease progression using rs-fMRI and a history-aware graph neural network
[Quick Read]: This paper addresses early prediction of Alzheimer's disease (AD): identifying, from an individual's clinical follow-up data, the risk of converting from mild cognitive impairment (MCI) to a more severe stage such as AD. The central difficulty is handling longitudinal data with irregular time gaps and missing visits while predicting stage transitions accurately. The key of the solution is a model coupling a graph neural network (GNN) with a recurrent neural network (RNN): the GNN models functional connectivity graphs derived from resting-state functional magnetic resonance imaging (rs-fMRI), the RNN block integrates a subject's full visit history, and visit-distance input features accommodate non-uniform time gaps. The model remains robust under missing visits, reaching 82.9% overall accuracy and, notably, 68.8% on CN-to-MCI conversions.
Link: https://arxiv.org/abs/2604.06469
Authors: Mahdi Moghaddami, Mohammad-Reza Siadat, Austin Toma, Connor Laming, Huirong Fu
Affiliations: Oakland University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proc. SPIE 13926, Medical Imaging 2026: Computer-Aided Diagnosis, 1392604
Abstract:Alzheimer’s disease (AD) is a neurodegenerative disorder that affects more than seven million people in the United States alone. AD currently has no cure, but there are ways to potentially slow its progression if caught early enough. In this study, we propose a graph neural network (GNN)-based model for predicting whether a subject will transition to a more severe stage of cognitive impairment at their next clinical visit. We consider three stages of cognitive impairment in order of severity: cognitively normal (CN), mild cognitive impairment (MCI), and AD. We use functional connectivity graphs derived from resting-state functional magnetic resonance imaging (rs-fMRI) scans of 303 subjects, each with a different number of visits. Our GNN-based model incorporates a recurrent neural network (RNN) block, enabling it to process data from the subject’s entire visit history. It can also work with irregular time gaps between visits by incorporating visit distance information into our input features. Our model demonstrates robust predictive performance, even with missing visits in the subjects’ visit histories. It achieves an accuracy of 82.9%, with an especially impressive accuracy of 68.8% on CN to MCI conversions - a task that poses a substantial challenge in the field. Our results highlight the effectiveness of rs-fMRI in predicting the onset of MCI or AD and, in conjunction with other modalities, could offer a viable method for enabling timely interventions to slow the progression of cognitive impairment.
[CV-91] PhysHead: Simulation-Ready Gaussian Head Avatars CVPR2026
[Quick Read]: This paper addresses the limitations of existing head avatar methods that treat hair as a rigid outer shell, failing to disentangle hair from the head or to capture natural volumetric hair dynamics. The key of the proposed PhysHead is a hybrid, 3D Gaussian-based layered avatar representation that combines a parametric mesh for the head with strand-based, physics-simulatable hair, attaching Gaussian primitives to both the head mesh and hair segments for appearance; this yields animatable avatars with realistic dynamic hair behavior such as wind-blown motion. To handle regions occluded in the dynamic training sequences, a VLM-based generative strategy additionally completes their appearance.
Link: https://arxiv.org/abs/2604.06467
Authors: Berna Kabadayi, Vanessa Sklyarova, Wojciech Zielonka, Justus Thies, Gerard Pons-Moll
Affiliations: Max Planck Institute for Intelligent Systems; ETH Zürich; University of Tübingen; Technical University of Darmstadt; Tübingen AI Center; Max Planck Institute for Informatics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: see this https URL Youtube Video: see this https URL Accepted to CVPR 2026
Abstract:Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion besides expression and camera control.
[CV-92] Visual prompting reimagined: The power of the Activation Prompts AISTATS2026
[Quick Read]: This paper addresses the sizable performance gap between visual prompting (VP) and conventional fine-tuning, in particular the efficiency and effectiveness limits of input-level VP. The key of the solution is a generalized concept, the activation prompt (AP), which extends universal perturbations from the input to the activation maps of intermediate layers, yielding a more flexible and efficient prompting mechanism. A core finding is that different architectures (convolutional networks vs. vision Transformers) exhibit model-dependent layer preferences for where to prompt, and a theoretical analysis traces this preference to differences in global features across layers. Experiments over 29 datasets and multiple architectures show that AP outperforms VP and parameter-efficient fine-tuning baselines in accuracy, time, parameter count, memory usage, and throughput.
Link: https://arxiv.org/abs/2604.06440
Authors: Yihua Zhang, Hongkang Li, Yuguang Yao, Aochuan Chen, Shuai Zhang, Pin-Yu Chen, Meng Wang, Sijia Liu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AISTATS 2026
Abstract:Visual prompting (VP) has emerged as a popular method to repurpose pretrained vision models for adaptation to downstream tasks. Unlike conventional model fine-tuning techniques, VP introduces a universal perturbation directly into the input data to facilitate task-specific fine-tuning rather than modifying model parameters. However, there exists a noticeable performance gap between VP and conventional fine-tuning methods, highlighting an unexplored realm in theory and practice to understand and advance the input-level VP to reduce its current performance gap. Towards this end, we introduce a generalized concept, termed activation prompt (AP), which extends the scope of the input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. By using AP to revisit the problem of VP and employing it as an analytical tool, we demonstrate the intrinsic limitations of VP in both performance and efficiency, revealing why input-level prompting may lack effectiveness compared to AP, which exhibits a model-dependent layer preference. We show that AP is closely related to normalization tuning in convolutional neural networks and vision transformers, although each model type has distinct layer preferences for prompting. We also theoretically elucidate the rationale behind such a preference by analyzing global features across layers. Through extensive experiments across 29 datasets and various model architectures, we provide a comprehensive performance analysis of AP, comparing it with VP and parameter-efficient fine-tuning baselines. Our results demonstrate AP’s superiority in both accuracy and efficiency, considering factors such as time, parameters, memory usage, and throughput.
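The VP-vs-AP distinction can be shown on a toy two-layer "model": a visual prompt is a universal shift added to the input x, while an activation prompt is a universal shift added to the intermediate feature map. The weights and values below are invented for illustration, not from the paper:

```python
import numpy as np

W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
W2 = np.array([[1.0], [1.0]])

def forward(x, input_prompt=None, activation_prompt=None):
    if input_prompt is not None:
        x = x + input_prompt            # classic VP: universal input shift
    h = np.maximum(x @ W1, 0.0)         # layer 1 + ReLU
    if activation_prompt is not None:
        h = h + activation_prompt       # AP: universal shift on the activation map
    return h @ W2                       # layer 2

x = np.array([[1.0, 0.0], [0.0, 1.0]])
y_plain = forward(x)
y_ap = forward(x, activation_prompt=np.array([0.5, 0.5]))
```

Because the AP is injected after the nonlinearity, it acts directly in feature space, which is why its effect (and the best layer to inject it) depends on the architecture rather than on the raw input statistics.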
[CV-93] Continual Visual Anomaly Detection on the Edge: Benchmark and Efficient Solutions
[Quick Read]: This paper addresses visual anomaly detection (VAD) under the joint constraints of edge deployment and continual learning: models on compute-limited devices must adapt to evolving data distributions without catastrophic forgetting, while conventional VAD methods target a single setting and break down when both constraints are imposed together, so studying either constraint in isolation is insufficient. The key of the solution is, first, the first comprehensive VAD benchmark for the continual-learning-on-the-edge scenario, evaluating seven mainstream VAD models across three lightweight backbones and characterizing the trade-offs among memory footprint, inference cost, and detection performance; and second, Tiny-Dinomaly, a lightweight adaptation of the Dinomaly model built on the DINO foundation model that achieves a 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points, alongside targeted efficiency modifications to PatchCore and PaDiM for the continual setting.
Link: https://arxiv.org/abs/2604.06435
Authors: Manuel Barusco, Francesco Borsatti, David Petrovic, Davide Dalle Pezze, Gian Antonio Susto
Affiliations: University of Padova, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual Anomaly Detection (VAD) is a critical task for many applications including industrial inspection and healthcare. While VAD has been extensively studied, two key challenges remain largely unaddressed in conjunction: edge deployment, where computational resources are severely constrained, and continual learning, where models must adapt to evolving data distributions without forgetting previously acquired knowledge. Our benchmark provides guidance for the selection of the optimal backbone and VAD method under joint efficiency and adaptability constraints, characterizing the trade-offs between memory footprint, inference cost, and detection performance. Studying these challenges in isolation is insufficient, as methods designed for one setting make assumptions that break down when the other constraint is simultaneously imposed. In this work, we propose the first comprehensive benchmark for VAD on the edge in the continual learning scenario, evaluating seven VAD models across three lightweight backbone architectures. Furthermore, we propose Tiny-Dinomaly, a lightweight adaptation of the Dinomaly model built on the DINO foundation model that achieves 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points. Finally, we introduce targeted modifications to PatchCore and PaDiM to improve their efficiency in the continual learning setting.
[CV-94] ProofSketcher: Hybrid LLM Lightweight Proof Checker for Reliable Math/Logic Reasoning
[Quick Read]: This paper addresses the problem that large language models (LLMs) produce plausible-looking mathematical and logical arguments containing subtle errors (omitted side conditions, invalid inference patterns, or appeals to lemmas not derivable from the context) that are hard to catch from text alone, while interactive theorem provers (ITPs) such as Lean and Coq provide rigorous guarantees only at the cost of full formalization and manually supplied low-level detail. The key of the solution is a hybrid pipeline: an LLM generates a typed proof sketch in a compact domain-specific language (DSL), and a lightweight trusted kernel expands the sketch into explicit proof obligations, preserving formal correctness while substantially reducing manual effort.
Link: https://arxiv.org/abs/2604.06401
Authors: Kranthi Kommuru, Kunal Khanvilkar, Gaurav Parekh
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) can produce persuasive arguments in mathematics and logic, but such arguments often contain minor missteps: entirely omitted side conditions, invalid inference patterns, or appeals to a lemma that cannot be derived logically from the context under discussion. These flaws are infamously hard to notice from the text alone, since even a faulty construction can still look mostly accurate. Conversely, interactive theorem provers like Lean and Coq achieve rigorous reliability by accepting only statements that pass every syntactic and semantic check performed by a small trusted kernel. This technique provides strong guarantees, but at a heavy price: the proof must be completely formalized, and the user or an auxiliary search program must supply an avalanche of low-level detail. This paper presents a hybrid pipeline where an LLM generates a typed proof sketch in a compact DSL and a lightweight trusted kernel expands the sketch into explicit proof obligations.
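The sketch-plus-kernel division of labor can be illustrated with a toy checker. The DSL and rule names below are hypothetical, not the paper's: steps whose premises were already derived are accepted, "assume" steps are deferred as explicit obligations, and any step citing an unproved premise is rejected, which is exactly the failure mode LLM prose hides:

```python
def check_sketch(sketch, axioms):
    """Toy trusted kernel: accept a step only if its premises are already
    derived; turn 'assume' steps into unexpanded proof obligations."""
    derived = set(axioms)
    obligations = []
    for name, rule, premises, conclusion in sketch:
        missing = [p for p in premises if p not in derived]
        if rule == "assume":             # deferred: becomes an obligation
            obligations.append(conclusion)
        elif missing:                    # kernel rejects unjustified steps
            raise ValueError(f"step {name}: unproved premises {missing}")
        derived.add(conclusion)
    return derived, obligations

axioms = {"p", "p -> q"}
sketch = [
    ("s1", "modus_ponens", ["p", "p -> q"], "q"),
    ("s2", "assume", [], "q -> r"),      # side condition left to discharge
    ("s3", "modus_ponens", ["q", "q -> r"], "r"),
]
derived, obligations = check_sketch(sketch, axioms)
```

The LLM only has to produce the sketch; soundness rests on the small kernel, and the returned obligations are precisely the gaps still owed a formal proof.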
[CV-95] MorphDistill: Distilling Unified Morphological Knowledge from Pathology Foundation Models for Colorectal Cancer Survival Prediction
[Quick Read]: This paper addresses the fact that existing pathology foundation models overlook organ-specific features in colorectal cancer (CRC) survival prediction, hurting prognostic accuracy. The key of the solution is MorphDistill, a two-stage knowledge-distillation framework: dimension-agnostic multi-teacher relational distillation consolidates complementary knowledge from multiple pathology foundation models into a compact CRC-specific encoder, preserving inter-sample relationships without explicit feature alignment; attention-based multiple instance learning then aggregates slide-level features to predict five-year survival, outperforming baselines across multiple datasets with strong generalization.
Link: https://arxiv.org/abs/2604.06390
Authors: Hikmat Khan, Usama Sajjad, Metin N. Gurcan, Anil Parwani, Wendy L. Frankel, Wei Chen, Muhammad Khalid Khan Niazi
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Background: Colorectal cancer (CRC) remains a leading cause of cancer-related mortality worldwide. Accurate survival prediction is essential for treatment stratification, yet existing pathology foundation models often overlook organ-specific features critical for CRC prognostication. Methods: We propose MorphDistill, a two-stage framework that distills complementary knowledge from multiple pathology foundation models into a compact CRC-specific encoder. In Stage I, a student encoder is trained using dimension-agnostic multi-teacher relational distillation with supervised contrastive regularization on large-scale colorectal datasets. This preserves inter-sample relationships from ten foundation models without explicit feature alignment. In Stage II, the encoder extracts patch-level features from whole-slide images, which are aggregated via attention-based multiple instance learning to predict five-year survival. Results: On the Alliance/CALGB 89803 cohort (n=424, stage III CRC), MorphDistill achieves an AUC of 0.68 (SD 0.08), an approximately 8% relative improvement over the strongest baseline (AUC 0.63). It also attains a C-index of 0.661 and a hazard ratio of 2.52 (95% CI: 1.73-3.65), outperforming all baselines. On an external TCGA cohort (n=562), it achieves a C-index of 0.628, demonstrating strong generalization across datasets and robustness across clinical subgroups. Conclusion: MorphDistill enables task-specific representation learning by integrating knowledge from multiple foundation models into a unified encoder. This approach provides an efficient strategy for prognostic modeling in computational pathology, with potential for broader oncology applications. Further validation across additional cohorts and disease stages is warranted. 
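The Stage-II attention-based multiple instance learning step can be sketched generically. This follows the common attention-MIL pattern (score each patch embedding, softmax over the bag, take the weighted sum as the slide feature); dimensions and weights below are invented, and MorphDistill's exact head may differ:

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Attention pooling over a bag of patch embeddings: tanh-scored
    attention, softmax-normalized, then a weighted slide-level feature."""
    scores = np.tanh(H @ V) @ w           # one attention logit per patch
    a = np.exp(scores - scores.max())
    a = a / a.sum()                        # attention weights sum to 1
    return a, a @ H                        # (n_patches,), (feat_dim,)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # 5 patch embeddings of dimension 8
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
a, slide_feat = attention_mil_pool(H, V, w)
```

The attention weights double as an interpretability signal: high-weight patches are the ones driving the slide-level survival prediction.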
[CV-96] MTA-Agent: An Open Recipe for Multimodal Deep Search Agents
[Quick Read]: This paper addresses the limitations of multimodal large language models (MLLMs) in complex, multi-step reasoning, especially their weakness at deep search that integrates visual evidence with external knowledge. The key of the solution is constructing high-quality, verified multi-hop vision-language training data (MTA-Vision-DeepSearch) with a proposed Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters, retrieves and validates evidence from visual and textual sources, and produces structured multi-hop question-answer trajectories. A 32B open-source multimodal search agent trained on this data reaches 54.63% average accuracy across six challenging benchmarks, surpassing GPT-5 and the Gemini models, while showing deeper reasoning and more systematic tool-use behavior; the work also shows that training can be done by replaying cached interactions, avoiding live tool calls and substantially cutting training cost.
Link: https://arxiv.org/abs/2604.06376
Authors: Xiangyu Peng, Can Qin, An Yan, Xinyi Yang, Zeyuan Chen, Ran Xu, Chien-Sheng Wu
Affiliations: Salesforce AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28, and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.
[CV-97] DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images
【速读】:该论文旨在解决现有基于图像的饮食评估方法在食物摄入量精确量化方面的局限性,特别是针对仅依赖单张餐前图像、无法实现食物项级(food-item-level)营养分析且常需深度感知或多视角图像等限制性输入的问题。其解决方案的关键在于提出一种简单而有效的视觉-语言框架,利用成对的餐前与餐后RGB图像,通过自然语言提示(natural language prompts)定位特定食物项并直接估算其重量,同时采用两阶段训练策略预测图像间重量差异以实现食物消耗量的精准估计,从而无需依赖刚性的分割掩码即可完成细粒度的食物摄入分析。
链接: https://arxiv.org/abs/2604.06352
作者: Gautham Vinod,Siddeshwar Raghavan,Bruce Coburn,Fengqing Zhu
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:
Abstract:Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.
[CV-98] Bi-Level Optimization for Single Domain Generalization CVPR
【速读】:该论文旨在解决单域泛化(Single Domain Generalization, SDG)问题,即在训练过程中无法访问目标域数据的情况下,如何从单一标注源域有效泛化到未见过的目标域。解决方案的关键在于提出一种双层优化框架 BiSDG,其核心创新是显式解耦任务学习与领域建模:通过标签保持变换构建代理域以模拟分布偏移,并设计领域提示编码器(domain prompt encoder)生成轻量级调制信号,通过特征层面的线性调制生成增强特征;整个学习过程被形式化为双层优化问题——内层固定提示优化任务性能,外层更新领域提示编码器以最大化跨代理域的泛化能力,同时引入高效的梯度近似方案实现无需二阶导数的训练。
链接: https://arxiv.org/abs/2604.06349
作者: Marzi Heidari,Hanping Zhang,Hao Yan,Yuhong Guo
机构: Carleton University (卡尔顿大学); Amii (艾米)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings Track, 2026
Abstract:Generalizing from a single labeled source domain to unseen target domains, without access to any target data during training, remains a fundamental challenge in robust machine learning. We address this underexplored setting, known as Single Domain Generalization (SDG), by proposing BiSDG, a bi-level optimization framework that explicitly decouples task learning from domain modeling. BiSDG simulates distribution shifts through surrogate domains constructed via label-preserving transformations of the source data. To capture domain-specific context, we propose a domain prompt encoder that generates lightweight modulation signals to produce augmenting features via feature-wise linear modulation. The learning process is formulated as a bi-level optimization problem: the inner objective optimizes task performance under fixed prompts, while the outer objective maximizes generalization across the surrogate domains by updating the domain prompt encoder. We further develop a practical gradient approximation scheme that enables efficient bi-level training without second-order derivatives. Extensive experiments on various SDG benchmarks demonstrate that BiSDG consistently outperforms prior methods, setting new state-of-the-art performance in the SDG setting.
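上述"内层固定提示优化任务、外层更新提示以最大化泛化"的双层结构,可用一个一维玩具问题做极简示意(内外层损失、学习率等均为假设,并用有限差分近似外层梯度以避免二阶导数;这只体现论文方案的思路,并非其原实现):

```python
# 玩具双层优化示意(假设的内外层损失;非 BiSDG 原实现):
# 内层在固定提示 p 下用梯度下降拟合任务权重 w,
# 外层对 p 做有限差分近似梯度更新,从而无需二阶求导。

def inner_loss(w, p):            # 源域任务损失,由提示 p 调制
    return (w - p) ** 2

def outer_loss(w):               # 代理域上的泛化损失
    return (w - 3.0) ** 2

def inner_solve(p, w=0.0, lr=0.2, steps=50):
    for _ in range(steps):       # 内层:固定 p,梯度下降优化 w
        w -= lr * 2 * (w - p)
    return w

def outer_step(p, lr=0.1, eps=1e-4):
    # 一阶/有限差分近似:不对内层求解过程做二阶求导
    g = (outer_loss(inner_solve(p + eps))
         - outer_loss(inner_solve(p - eps))) / (2 * eps)
    return p - lr * g

p = 0.0
for _ in range(100):
    p = outer_step(p)
print(round(p, 2))  # 提示参数收敛到外层最优 3.0
```

内层此处解析可解(w 收敛到 p),因此外层实际在最小化 (p-3)²;真实场景中内外层均为神经网络参数,但交替结构相同。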
[CV-99] Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents CVPR
【速读】:该论文旨在解决基于视觉语言模型(Visual Language Model, VLM)的超声心动图智能分析中存在的可信性不足问题,特别是由于直接映射视频与问题到答案的范式导致的模板捷径(template shortcuts)和虚假解释(spurious explanations)。其解决方案的关键在于提出了一种证据驱动的“Actor-Verifier”框架(EchoTrust),通过生成结构化的中间表示(structured intermediate representation),由不同角色分别执行推理与验证,从而提升临床决策支持系统在高风险场景下的可靠性与可解释性。
链接: https://arxiv.org/abs/2604.06347
作者: Peng Huang,Yiming Wang,Yineng Chen,Liangqiao Gui,Hui Guo,Bo Peng,Shu Hu,Xi Wu,Tsao Connie,Hongtu Zhu,Balakrishnan Prabhakaran,Xin Wang
机构: University at Albany, SUNY; Southwest Jiaotong University; Purdue University; Chengdu University of Information Technology; Harvard Medical School; University of North Carolina at Chapel Hill
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: cvprw 2026(AIMS)
Abstract:Echocardiography plays an important role in the screening and diagnosis of cardiovascular diseases. However, automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. In recent years, visual language models (VLM) have opened a new avenue for building ultrasound understanding systems for clinical decision support. Nevertheless, most existing methods formulate this task as a direct mapping from video and question to answer, making them vulnerable to template shortcuts and spurious explanations. To address these issues, we propose EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. EchoTrust produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.
[CV-100] Evolution of Video Generative Foundations
【速读】:该论文旨在解决当前视频生成领域研究缺乏系统性综述的问题,尤其针对现有文献多局限于特定技术(如生成对抗网络GAN或扩散模型)或单一任务(如视频编辑),未能全面梳理从早期GAN到主流扩散模型,再到新兴自回归(Auto-Regressive, AR)模型与多模态融合技术的发展脉络。其解决方案的关键在于构建一个结构化的演进框架,深入分析各类方法的基础原理、关键突破及优劣对比,并重点探讨多模态信息整合在提升视频上下文感知能力方面的趋势,从而为未来视频生成技术及其在虚拟现实、自动驾驶仿真等领域的应用提供方向性指导。
链接: https://arxiv.org/abs/2604.06339
作者: Teng Hu,Jiangning Zhang,Hongrui Huang,Ran Yi,Zihan Su,Jieyu Weng,Zhucun Xue,Lizhuang Ma,Ming-Hsuan Yang,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI’s Sora, Google’s Veo3, and Bytedance’s Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building “world models” that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GAN) and diffusion models, or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field’s evolution, especially regarding Auto-Regressive (AR) models and integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths/limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models, in this rapidly evolving field. For more details, please refer to the project at this https URL.
[CV-101] Drifting Fields are not Conservative
【速读】:该论文旨在解决生成式模型中基于漂移场(drift field)的采样方法是否等价于优化一个标量损失函数的问题。研究表明,一般情况下漂移场并非保守场(即不能表示为任何标量势函数的梯度),其非保守性源于位置依赖的归一化项;高斯核是唯一例外,此时漂移场恰好可表示为标量函数的梯度。为恢复保守性并构造明确的损失函数,作者提出使用一种相关核(称为尖锐核,sharp kernel)替代原有归一化方式,从而对任意径向核都能保证漂移场的保守性。尽管理论上漂移场匹配目标比损失最小化更通用(能实现任何标量损失都无法复现的非保守传输场),但实际应用中利用此灵活性带来的性能提升有限,因此作者建议采用概念更简洁、基于标量损失的训练方式来训练漂移模型。
链接: https://arxiv.org/abs/2604.06333
作者: Leonard Franz,Sebastian Hoffmann,Georg Martius
机构: Eberhard Karls Universität Tübingen (图宾根大学); Max Planck Institute for Biogeochemistry (马普生物地球化学研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures
Abstract:Drifting models generate high-quality samples in a single forward pass by transporting generated samples toward the data distribution using a vector valued drift field. We investigate whether this procedure is equivalent to optimizing a scalar loss and find that, in general, it is not: drift fields are not conservative - they cannot be written as the gradient of any scalar potential. We identify the position-dependent normalization as the source of non-conservatism. The Gaussian kernel is the unique exception where the normalization is harmless and the drift field is exactly the gradient of a scalar function. Generalizing this, we propose an alternative normalization via a related kernel (the sharp kernel) which restores conservatism for any radial kernel, yielding well-defined loss functions for training drifting models. While we identify that the drifting field matching objective is strictly more general than loss minimization, as it can implement non-conservative transport fields that no scalar loss can reproduce, we observe that practical gains obtained utilizing this flexibility are minimal. We thus propose to train drifting models with the conceptually simpler formulations utilizing loss functions.
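"保守性"这一核心概念可以数值验证:光滑二维场 v(x, y) 是某个标量势的梯度,当且仅当其雅可比矩阵对称,即 dvx/dy == dvy/dx(旋度为零)。下面两个示例场仅作演示,并非论文中的漂移场:

```python
# 用中心差分数值检验一个二维向量场是否保守(旋度是否为零)。

def curl_2d(v, x, y, h=1e-5):
    dvx_dy = (v(x, y + h)[0] - v(x, y - h)[0]) / (2 * h)
    dvy_dx = (v(x + h, y)[1] - v(x - h, y)[1]) / (2 * h)
    return dvy_dx - dvx_dy

grad_phi = lambda x, y: (2 * x + y, x)   # 保守:phi = x^2 + x*y 的梯度
rot = lambda x, y: (-y, x)               # 非保守:纯旋转场,无标量势

print(abs(curl_2d(grad_phi, 0.7, -0.3)) < 1e-6)  # True:旋度为零
print(curl_2d(rot, 0.7, -0.3))                   # 约为 2:旋度非零
```

论文的论点即此性质的高维版本:一般核的位置依赖归一化使漂移场带有非零"旋度",故不存在与之对应的标量损失。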
[CV-102] scope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection
【速读】:该论文旨在解决超远距离(>250米)自动驾驶中目标检测性能严重下降的问题,尤其是在高速长距离重卡行驶场景下,由于物体在图像中仅占据少量像素导致现有目标检测器失效,同时商用激光雷达(LiDAR)因距离引起的分辨率平方衰减而难以满足超远距探测需求。解决方案的关键在于提出一种两阶段检测模型Telescope,其核心创新包括一个新颖的重采样层(re-sampling layer)和图像变换机制,有效提升了小尺寸、远距离目标的可检测性;该方法在保持极低计算开销的同时,将mAP指标在超远距离范围内提升76%(从0.185提升至0.326),且在全距离范围内均维持优异性能。
链接: https://arxiv.org/abs/2604.06332
作者: Parker Ewen,Dmitriy Rivkin,Mario Bijelic,Felix Heide
机构: Torc Robotics(托克机器人); Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project website: this https URL
Abstract:Autonomous highway driving, especially for long-haul heavy trucks, requires detecting objects at long ranges beyond 500 meters to satisfy braking distance requirements at high speeds. At long distances, vehicles and other critical objects occupy only a few pixels in high-resolution images, causing state-of-the-art object detectors to fail. This challenge is compounded by the limited effective range of commercially available LiDAR sensors, which fall short of ultra-long range thresholds because of quadratic loss of resolution with distance, making image-based detection the most practically scalable solution given commercially available sensor constraints. We introduce Telescope, a two-stage detection model designed for ultra-long range autonomous driving. Alongside a powerful detection backbone, this model contains a novel re-sampling layer and image transformation to address the fundamental challenges of detecting small, distant objects. Telescope achieves 76% relative improvement in mAP in ultra-long range detection compared to state-of-the-art methods (improving from an absolute mAP of 0.185 to 0.326 at distances beyond 250 meters), requires minimal computational overhead, and maintains strong performance across all detection ranges.
[CV-103] Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization ICLR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对恶意提示(malicious prompts)时产生的安全风险问题,即攻击者通过精心设计的输入诱导模型生成有害内容,而现有防御方法如黑名单过滤器易被绕过,基于分类器的方案则计算成本高且对嵌入空间级别的攻击脆弱。解决方案的关键在于提出两个互补组件:Hyperbolic Prompt Espial (HyPE) 和 Hyperbolic Prompt Sanitization (HyPS)。HyPE 利用双曲空间(hyperbolic space)的结构化几何特性建模正常提示,将有害提示识别为异常点,实现轻量级、高精度的检测;HyPS 基于可解释的归因方法定位并选择性修改有害词汇,在保留原始语义的前提下消除不当意图,从而构建了一个高效、可解释且鲁棒的安全防护框架。
链接: https://arxiv.org/abs/2604.06285
作者: Igor Maljkovic,Maria Rosaria Briglia,Iacopo Masi,Antonio Emanuele Cinà,Fabio Roli
机构: University of Genoa(热那亚大学); University of Cagliari(卡利亚里大学); Sapienza University of Rome(罗马大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at ICLR 2026. Webpage available at: this https URL
Abstract:Vision-Language Models (VLMs) have become essential for tasks such as image synthesis, captioning, and retrieval by aligning textual and visual information in a shared embedding space. Yet, this flexibility also makes them vulnerable to malicious prompts designed to produce unsafe content, raising critical safety concerns. Existing defenses either rely on blacklist filters, which are easily circumvented, or on heavy classifier-based systems, both of which are costly and fragile under embedding-level attacks. We address these challenges with two complementary components: Hyperbolic Prompt Espial (HyPE) and Hyperbolic Prompt Sanitization (HyPS). HyPE is a lightweight anomaly detector that leverages the structured geometry of hyperbolic space to model benign prompts and detect harmful ones as outliers. HyPS builds on this detection by applying explainable attribution methods to identify and selectively modify harmful words, neutralizing unsafe intent while preserving the original semantics of user prompts. Through extensive experiments across multiple datasets and adversarial scenarios, we prove that our framework consistently outperforms prior defenses in both detection accuracy and robustness. Together, HyPE and HyPS offer an efficient, interpretable, and resilient approach to safeguarding VLMs against malicious prompt misuse.
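HyPE 依赖双曲空间中距离随点靠近边界而迅速增大的性质。下面用 Poincaré 球模型的标准距离公式做一个示意性的异常评分(质心与嵌入坐标均为虚构,仅演示几何直觉,并非 HyPE 的实际检测器):

```python
import math

# Poincaré 球模型距离:d(u, v) = arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
# 良性提示嵌入聚集在质心附近;双曲距离过大的点被视作离群(潜在有害)提示。

def poincare_dist(u, v):
    su = sum(x * x for x in u)
    sv = sum(x * x for x in v)
    sd = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sd / ((1 - su) * (1 - sv)))

centroid = (0.1, 0.0)     # 良性提示簇中心(虚构)
benign = (0.15, 0.05)     # 靠近簇中心的正常提示
outlier = (0.0, 0.95)     # 靠近单位球边界的离群点

d_b = poincare_dist(centroid, benign)
d_o = poincare_dist(centroid, outlier)
print(d_b < 1.0 < d_o)  # True:离群点的双曲距离远大于正常点
```

同样的欧氏间隔,在靠近边界处对应的双曲距离会急剧放大,这正是双曲几何适合建模"正常/异常"层次结构的原因。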
[CV-104] SE-Enhanced ViT and BiLSTM-Based Intrusion Detection for Secure IIoT and IoMT Environments
【速读】:该论文旨在解决工业和医疗物联网(IIoT 和 MIoT)生态系统中因互联设备快速增长而导致的网络安全威胁检测延迟与不准确问题。其核心解决方案是提出一种基于混合 Squeeze-and-Excitation Attention Vision Transformer-Bidirectional Long Short-Term Memory(SE ViT-BiLSTM)架构的入侵检测框架,关键创新在于用 Squeeze-and-Excitation 注意力机制替代传统 Vision Transformer 中的多头注意力模块,并融合双向长短期记忆网络(BiLSTM),从而在提升检测精度的同时优化计算效率。实验表明,该模型在未平衡和数据增强后的 EdgeIIoT 与 CICIoMT2024 数据集上均显著优于现有方法,在保持极低误报率(FPR)和高吞吐量(低延迟)的前提下实现了接近99%的准确率。
链接: https://arxiv.org/abs/2604.06254
作者: Afrah Gueriani,Hamza Kheddar,Ahmed Cherif Mazari,Seref Sagiroglu,Onur Ceran
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid growth of interconnected devices in Industrial and Medical Internet of Things (IIoT and MIoT) ecosystems, ensuring timely and accurate detection of cyber threats has become a critical challenge. This study presents an advanced intrusion detection framework based on a hybrid Squeeze-and-Excitation Attention Vision Transformer-Bidirectional Long Short-Term Memory (SE ViT-BiLSTM) architecture. In this design, the traditional multi-head attention mechanism of the Vision Transformer is replaced with Squeeze-and-Excitation attention, and integrated with BiLSTM layers to enhance detection accuracy and computational efficiency. The proposed model was trained and evaluated on two real-world benchmark datasets; EdgeIIoT and CICIoMT2024; both before and after data balancing using the Synthetic Minority Over-sampling Technique (SMOTE) and RandomOverSampler. Experimental results demonstrate that the SE ViT-BiLSTM model outperforms existing approaches across multiple metrics. Before balancing, the model achieved accuracies of 99.11% (FPR: 0.0013%, latency: 0.00032 sec/inst) on EdgeIIoT and 96.10% (FPR: 0.0036%, latency: 0.00053 sec/inst) on CICIoMT2024. After balancing, performance further improved, reaching 99.33% accuracy with 0.00035 sec/inst latency on EdgeIIoT and 98.16% accuracy with 0.00014 sec/inst latency on CICIoMT2024.
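Squeeze-and-Excitation 注意力的核心是"压缩-激励-重标定"三步。以下为一个纯 Python 玩具示意(通道数据与标量权重均为假设;真实 SE 模块使用两层全连接作用于卷积特征图的通道均值):

```python
import math

# SE 注意力玩具示意:逐通道压缩为均值,经小型门控网络产生 0~1 权重,
# 再用该权重重标定对应通道。

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def se_block(channels, w1, w2):
    # channels: 每个元素是一条通道的特征向量
    squeezed = [sum(c) / len(c) for c in channels]   # 压缩:逐通道全局均值
    hidden = [max(0.0, s * w1) for s in squeezed]    # 降维 + ReLU
    gates = [sigmoid(h * w2) for h in hidden]        # 激励:0~1 通道权重
    return [[x * g for x in c] for c, g in zip(channels, gates)]

out = se_block([[1.0, 3.0], [0.0, 0.0]], w1=1.0, w2=2.0)
print(out[1])  # [0.0, 0.0]:无响应通道保持抑制,有响应通道权重接近 1
```

相比多头注意力,SE 只需逐通道的标量门控,这也是文中以其替换 ViT 注意力以降低计算开销的动机。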
[CV-105] DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
【速读】:该论文旨在解决视觉语言模型(Vision-Language Model, VLM)中存在的“感知-整合鸿沟”(perception-integration gap)问题,即模型虽能准确识别图像中的视觉内容(如分子结构),但在后续推理任务中表现不佳,表明视觉信息在下游推理过程中丢失。为系统性暴露此类失败,作者提出了DISSECT诊断基准,包含12,000个问题(化学7,000题、生物5,000题),通过五种输入模式(视觉+文本、纯文本、纯视觉、人类专家、以及创新的“模型自述”模式)对模型性能进行分解评估,从而分离出语言先验利用、视觉提取、感知保真度与整合有效性等维度。关键解决方案在于引入“模型自述”(Model Oracle)协议——让VLM首先用自己的语言描述图像,再基于该描述进行推理,以此诊断整合能力瓶颈,并发现开源模型在从自身描述推理时显著优于直接使用原始图像,揭示了整合环节是当前开源多模态模型的核心短板,而闭源模型则无此差距,说明感知与整合的协同能力已成为区分两者的关键前沿。
链接: https://arxiv.org/abs/2604.06250
作者: Dikshant Kukreja,Kshitij Sah,Karan Goyal,Mukesh Mohania,Vikram Goyal
机构: IIIT Delhi(印度信息技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:When asked to describe a molecular diagram, a Vision-Language Model correctly identifies “a benzene ring with an -OH group.” When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes – Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description – yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Evaluating 18 VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) Closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model and benchmark agnostic, applicable post-hoc to any VLM evaluation to diagnose integration failures.
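Model Oracle 协议的诊断思路可以用几行代码表达(各模式准确率数字纯属虚构,仅演示"诊断间隙"的计算方式):

```python
# DISSECT 式诊断间隙示意:比较不同输入模式下的准确率,
# 当模型基于自身描述推理的准确率高于直接看图时,即暴露整合瓶颈。

accs = {
    "vision_text": 0.48,    # 原始图像 + 问题
    "text_only": 0.35,      # 仅问题:语言先验可利用程度
    "model_oracle": 0.55,   # 模型先自述图像,再基于自述推理
}

prior_exploitation = accs["text_only"]
integration_gap = accs["model_oracle"] - accs["vision_text"]
print(integration_gap > 0)  # True:视觉提取成功,但整合环节落后
```

该协议与具体模型和基准无关,可事后套用在任意 VLM 评测结果上。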
[CV-106] No-reference based automatic parameter optimization for iterative reconstruction using a novel search space aware crow search algorithm
【速读】:该论文旨在解决锥形束计算机断层成像(Cone-beam computed tomography, CBCT)迭代重建算法中因超参数手动调优效率低、耗时长且对重建质量影响显著的问题。其解决方案是一种无需参考图像的全自动参数优化框架,关键创新在于引入改进的乌鸦搜索算法(Crow Search Algorithm, CSA),融合了基于集合的局部搜索机制、自适应全局搜索策略以及目标驱动的局部与全局搜索平衡机制,并结合混沌对角线线性均匀初始化方案以加速收敛。该方法在多种设备和数据集上验证有效,显著优于人工调参和传统CSA,在两项无参考质量基准指标上分别提升4.89%与3.82%,同时保持细节清晰度,展现出良好的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2604.06246
作者: Poorya MohammadiNasab,Ander Biguri,Philipp Steininger,Peter Keuschnigg,Lukas Lamminger,Agnieszka Lach,S M Ragib Shahriar Islam,Anna Breger,Clemens Karner,Carola-Bibiane Schönlieb,Wolfgang Birkfellner,Sepideh Hatamikia
机构: University of Cambridge (剑桥大学); Medical University of Vienna (维也纳医科大学); Austrian Academy of Sciences (奥地利科学院); Digital Pathology Unit, Department of Pathology, Medical University of Vienna (维也纳医科大学病理学系数字病理学单元); Department of Applied Mathematics and Theoretical Physics, University of Cambridge (剑桥大学应用数学与理论物理系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Iterative reconstruction technique’s ability to reduce radiation exposure by using fewer projections has attracted significant attention. However, these methods typically require a precise tuning of several hyperparameters, which can have a major impact on reconstruction quality. Manually setting these parameters is time-consuming and increases the workload for human operators. In this paper, we introduce a novel fully automatic parameter optimization framework that can be applied to a wide range of Cone-beam computed tomography (CBCT) iterative reconstruction algorithms to determine optimal parameters without requiring a reference reconstruction. The proposed method incorporates a modified crow search algorithm (CSA) featuring a superior set-dependent local search mechanism, a search-space-aware global search strategy, and an objective-driven balance between local and global search. Additionally, to ensure an effective initial population, we propose a chaotic diagonal linear uniform initialization scheme that accelerates algorithm convergence. The performance of the proposed framework was evaluated on three imaging machines and four real datasets, as well as three different iterative reconstruction methods with the highest number of tunable parameters, representing the most challenging scenario. The results indicate that the proposed method could outperform manual settings and CSA, with a 4.19% improvement in average fitness and 4.89% and 3.82% improvements on CHILL@UK and RPI_AXIS, respectively, which are two benchmark no-reference learning-based quality metrics. In addition, the qualitative results clearly show the superiority of the proposed method by maintaining fine details sharply. The overall performance of the proposed framework across different comparison scenarios demonstrates its effectiveness and robustness across all cases.
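作为背景,标准乌鸦搜索算法(CSA)的更新规则可做如下极简示意(一维玩具目标函数;论文的改进版含集合式局部搜索、搜索空间感知的全局搜索与混沌初始化,此处均未实现):

```python
import random

# 标准 CSA:每只乌鸦以概率 1-ap 飞向另一只乌鸦记忆的最优位置,
# 否则(对方"察觉"被跟踪时)随机跳转;记忆保留各自历史最优。

def csa(f, n=10, iters=200, fl=2.0, ap=0.1, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(-5, 5) for _ in range(n)]
    mem = pos[:]                                  # 每只乌鸦的历史最优
    for _ in range(iters):
        for i in range(n):
            j = rng.randrange(n)
            if rng.random() > ap:                 # 跟随乌鸦 j 的记忆位置
                pos[i] = pos[i] + rng.random() * fl * (mem[j] - pos[i])
            else:                                 # 乌鸦 j 察觉:随机跳转
                pos[i] = rng.uniform(-5, 5)
            if f(pos[i]) < f(mem[i]):
                mem[i] = pos[i]
    return min(mem, key=f)

best = csa(lambda x: (x - 2.0) ** 2)
print(abs(best - 2.0) < 0.5)  # True:收敛到最优点 x = 2 附近
```

论文将类似的群体搜索用于 CBCT 迭代重建的超参数空间,适应度由无参考图像质量指标给出。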
[CV-107] CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale CVPR2026
【速读】:该论文旨在解决行星表面撞击坑(impact crater)分析中传统深度学习方法仅关注检测任务而忽视关键科学工作流(如目录去重、多观测匹配和形态学类比发现)的局限性,这些问题本质上是实例级图像检索任务。解决方案的关键在于将撞击坑分析建模为实例级图像检索问题,并提出CraterBench-R基准数据集(包含约25,000个撞击坑身份及其多尺度图库视图与人工验证查询),同时引入一种高效的“实例token聚合”(instance-token aggregation)方法:通过选择K个种子token并基于余弦相似度分配其余patch token,再对每个簇聚合为单一代表性token,从而在显著降低存储开销的同时实现接近全token检索的精度;此外,还设计了一个两阶段流水线——先用单向量短列表筛选候选集,再以实例token进行重排序,可在仅搜索少量候选的情况下恢复89–94%的完整late-interaction精度。
链接: https://arxiv.org/abs/2604.06245
作者: Jichao Fang,Lei Zhang,Michael Phillips,Wei Luo
机构: Northern Illinois University (北方伊利诺伊大学); University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the EarthVision 2026 Workshop at CVPR 2026
Abstract:Impact craters are a cornerstone of planetary surface analysis. However, while most deep learning pipelines treat craters solely as a detection problem, critical scientific workflows such as catalog deduplication, cross-observation matching, and morphological analog discovery are inherently retrieval tasks. To address this, we formulate crater analysis as an instance-level image retrieval problem and introduce CraterBench-R, a curated benchmark featuring about 25,000 crater identities with multi-scale gallery views and manually verified queries spanning diverse scales and contexts. Our baseline evaluations across various architectures reveal that self-supervised Vision Transformers (ViTs), particularly those with in-domain pretraining, dominate the task, outperforming generic models with significantly more parameters. Furthermore, we demonstrate that retaining multiple ViT patch tokens for late-interaction matching dramatically improves accuracy over standard single-vector pooling. However, storing all tokens per image is operationally inefficient at a planetary scale. To close this efficiency gap, we propose instance-token aggregation, a scalable, training-free method that selects K seed tokens, assigns the remaining tokens to these seeds via cosine similarity, and aggregates each cluster into a single representative token. This approach yields substantial gains: at K=16, aggregation improves mAP by 17.9 points over raw token selection, and at K=64, it matches the accuracy of using all 196 tokens with significantly less storage. Finally, we demonstrate that a practical two-stage pipeline, with single-vector shortlisting followed by instance-token reranking, recovers 89-94% of the full late-interaction accuracy while searching only a small candidate set. The benchmark is publicly available at this http URL.
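文中的 instance-token aggregation 可概括为三步:选 K 个种子 token、其余 token 按余弦相似度归入最近的种子、逐簇取均值得到代表 token。以下为该思路的免训练简化示意(此处种子取前 K 个 token 仅为占位实现,论文的选取规则可能不同):

```python
import math

# instance-token aggregation 简化示意:K 个种子 + 余弦相似度归簇 + 逐簇均值。

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def aggregate(tokens, k):
    seeds = tokens[:k]                   # 种子选取为占位实现
    clusters = [[s] for s in seeds]
    for t in tokens[k:]:
        best = max(range(k), key=lambda i: cos_sim(t, seeds[i]))
        clusters[best].append(t)
    # 每簇聚合为单一代表 token(逐维均值)
    return [[sum(col) / len(c) for col in zip(*c)] for c in clusters]

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.1], [0.1, 3.0]]
reps = aggregate(tokens, k=2)
print(len(reps))  # 2:只需存储 2 个代表 token 而非全部 4 个
```

这解释了正文中 K=64 即可逼近保留全部 196 个 patch token 的检索精度,同时大幅降低存储开销的效果来源。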
[CV-108] Implantable Adaptive Cells: A Novel Enhancement for Pre-Trained U-Nets in Medical Image Segmentation
【速读】:该论文旨在解决预训练神经网络在医学图像分割任务中性能提升困难的问题,特别是如何在不进行完整模型重训练的前提下实现高效、稳定的精度优化。其解决方案的关键在于提出了一种名为“可植入自适应单元”(Implantable Adaptive Cell, IAC)的新机制,该单元通过基于梯度的神经架构搜索(gradient-based Neural Architecture Search, NAS)方法,从部分连接的DARTS框架中识别出小型模块,并将其注入已训练好的U型结构模型的跳跃连接(skip connections)中,从而以极低计算成本实现对现有模型的精细化改进。实验表明,该方法在四个包含MRI和CT图像的医学数据集上均实现了显著且一致的分割精度提升,平均增益约5个百分点,最高可达11个百分点。
链接: https://arxiv.org/abs/2405.03420
作者: Emil Benedykciuk,Marcin Denkowski,Grzegorz Wójcik
机构: Maria Curie-Skłodowska University (玛丽居里-斯克沃多夫斯卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces a novel approach to enhance the performance of pre-trained neural networks in medical image segmentation using gradient-based Neural Architecture Search (NAS) methods. We present the concept of Implantable Adaptive Cell (IAC), small modules identified through a Partially-Connected DARTS-based approach, designed to be injected into the skip connections of an existing and already trained U-shaped model. Unlike traditional NAS methods, our approach refines existing architectures without full retraining. Experiments on four medical datasets with MRI and CT images show consistent accuracy improvements on various U-Net configurations, with segmentation accuracy gains of approximately 5 percentage points across all validation datasets, and improvements reaching up to 11 percentage points in the best-performing cases. The findings of this study not only offer a cost-effective alternative to the complete overhaul of complex models for performance upgrades but also indicate the potential applicability of our method to other architectures and problem domains.
[CV-109] urPy: a physics-based and differentiable optical turbulence simulator for algorithmic development and system optimization
【速读】:该论文旨在解决自由空间光学系统设计中湍流引起的波前畸变模拟精度不足与梯度优化兼容性差的问题。现有仿真工具难以在高保真度建模的同时支持端到端的梯度驱动优化,限制了复杂光学系统(如自适应光学或深度学习辅助的光路设计)在湍流环境下的性能提升。解决方案的关键在于提出一个GPU加速、完全可微分的波光学湍流模拟框架TurPy,其核心创新包括:基于介质特异性功率谱密度参数化的次谐波相位屏生成方法、自回归时间演化机制以及自动屏幕布放策略,从而在傅里叶混叠约束与弱湍流近似之间实现平衡;该框架可无缝扩展至大气、海洋及生物传播环境,且仅需输入折射率结构常数和功率谱密度即可准确匹配经典湍流理论中的二阶高斯光束展宽与四阶平面波闪烁特性(误差<2%),并成功用于双域衍射深度神经网络(dual-domain D2NN)的梯度优化,实现相比未补偿接收器超过20倍的闪烁抑制效果,验证了其作为端到端光学系统设计平台的有效性。
链接: https://arxiv.org/abs/2604.07248
作者: Joseph L. Greene,Alfred Moore,Iris Ochoa,Emily Kwan,Patrick Marano,Christopher R. Valenta
机构: 未知
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures, 1 table. Presented at 2026 SPIE DS Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications IV
Abstract:Developing optical systems for free-space applications requires simulation tools that accurately capture turbulence-induced wavefront distortions and support gradient-based optimization. Here we introduce TurPy, a GPU-accelerated, fully differentiable wave optics turbulence simulator to bridge high fidelity simulation with end-to-end optical system design. TurPy incorporates subharmonic phase screen generation, autoregressive temporal evolution, and an automated screen placement routine balancing Fourier aliasing constraints and weak-turbulence approximations into a unified, user-ready framework. Because TurPy’s phase screen generation is parameterized through a media-specific power spectral density, the framework extends to atmospheric, oceanic, and biological propagation environments with minimal modification. We validate TurPy against established atmospheric turbulence theory by matching 2nd order Gaussian beam broadening and 4th order plane wave scintillation to closed-form models with 98% accuracy across weak to strong turbulence regimes, requiring only the medium’s refractive index structure constant and power spectral density as inputs. To demonstrate TurPy as a gradient-based training platform, we optimize a dual-domain diffractive deep neural network (D2NN) in a two-mask dual-domain architecture to recover a Gaussian beam from a weakly turbulent path and achieving over 20x reduction in scintillation relative to an uncompensated receiver in simulation. TurPy is released as an open-source package to support synthetic data generation, turbulence-informed algorithm development, and the end-to-end design of optical platforms operating in turbulent environments.
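TurPy 用于对照平面波理论的四阶统计量——闪烁指数——有标准定义 sigma_I^2 = ⟨I^2⟩/⟨I⟩^2 − 1,可直接按定义实现(强度序列为虚构样例,仅演示该统计量的含义):

```python
# 闪烁指数:归一化的光强起伏方差,湍流越强该值越大。

def scintillation_index(intensities):
    mean = sum(intensities) / len(intensities)
    mean_sq = sum(i * i for i in intensities) / len(intensities)
    return mean_sq / (mean * mean) - 1.0

print(scintillation_index([1.0, 1.0, 1.0]))  # 0.0:无起伏的稳定光束
print(scintillation_index([0.5, 1.5]))       # 0.25:起伏使指数上升
```

正文所称"20 倍闪烁抑制"即指补偿后该指数相对未补偿接收器的下降倍数。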
[CV-110] owards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training
【速读】:该论文旨在解决高能加速器中微子物理(accelerator-based neutrino physics)在TeV能量尺度下,由于事件信号密集且重叠导致传统重建方法难以适用的问题,尤其是在标注数据稀缺、下游任务多样化的情况下。解决方案的关键在于提出一种稀疏视觉Transformer(sparse ViT)框架,通过自监督预训练学习可复用的探测器数据表征:预训练阶段结合掩码自动编码器重建与关系体素级目标(用于层次结构、鬼迹和粒子识别),从而获得共享编码器;随后在分类与回归任务上联合微调。实验证明,该方法显著提升中微子味识别、粲夸克识别、动量回归及顶点重建性能,并在仅需约 10^3 个标注事件时即达到随机初始化模型使用十倍数据的效果,同时具备跨探测器技术与能量尺度的良好迁移能力。
链接: https://arxiv.org/abs/2604.07037
作者: Saúl Alonso-Monsalve,Fabio Cufino,Umut Kose,Anna Mascellani,André Rubbia
机构: IPA, ETH Zürich (苏黎世联邦理工学院)
类目: High Energy Physics - Experiment (hep-ex); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures
Abstract:Accelerator-based neutrino physics is entering an energy-frontier regime in which interactions reach the TeV scale and produce exceptionally dense, overlapping detector signatures. In this regime, event interpretation becomes impractical for conventional reconstruction approaches, particularly when labelled data are scarce and the analysis spans diverse downstream objectives. We present a sparse ViT framework for learning reusable representations from heterogeneous detector data. Self-supervised pre-training combines masked autoencoder reconstruction with relational voxel-level objectives for hierarchy, ghost and particle identification, and the resulting shared encoder is then jointly fine-tuned across classification and regression tasks. Evaluated on simulated events from the proposed FASERCal concept at the LHC, we find that pre-training consistently improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction over training from scratch, with the addition of relational objectives yielding further gains in the most topologically complex channels. Interpretability analyses further show that pre-training yields a more structured latent space, while detector-subsystem ablations recover physically plausible channel-dependent roles for the heterogeneous inputs. A data-efficiency study shows that, with roughly 10^3 labelled events, the pre-trained encoder already matches the flavour-classification performance of a randomly initialised model trained on an order of magnitude more data. The learned representations also transfer effectively to publicly available benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines. These results support self-supervised pre-training on multimodal detector data as a scalable route towards reusable representations for neutrino and particle-detector analysis.
[CV-111] Enhanced Self-Supervised Multi-Image Super-Resolution for Camera Array Images
【速读】:该论文旨在解决多图像超分辨率(Multi-Image Super-Resolution, MISR)中因单一相机序列帧导致的复杂退化和严重遮挡问题,以及现有自监督学习(Self-Supervised Learning, SSL)方法难以恢复精细纹理细节的局限性。其解决方案的关键在于提出一种多图像到单图像引导的多图像到多图像自监督学习框架(Multi-to-Single-Guided Multi-to-Multi SSL),该框架融合了Multi-to-Single与Multi-to-Multi SSL的优势,从而生成视觉上更吸引人且保真度更高的图像;同时设计了一种适用于SSL的双Transformer(dual Transformer)网络结构,以增强从混叠伪影中恢复高频细节的能力,实现了深度神经网络与经典物理驱动变分方法的有效集成。
链接: https://arxiv.org/abs/2604.06816
作者: Yating Chen,Feng Huang,Xianyu Wu,Jing Wu,Ying Shen
机构: 未知
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Conventional multi-image super-resolution (MISR) methods, such as burst and video SR, rely on sequential frames from a single camera. Consequently, they suffer from complex image degradation and severe occlusion, increasing the difficulty of accurate image restoration. In contrast, multi-aperture camera-array imaging captures spatially distributed views with sampling offsets forming a stable disk-like distribution, which enhances the non-redundancy of observed data. Existing MISR algorithms fail to fully exploit these unique properties. Supervised MISR methods tend to overfit the degradation patterns in training data, and current self-supervised learning (SSL) techniques struggle to recover fine-grained details. To address these issues, this paper thoroughly investigates the strengths, limitations and applicability boundaries of multi-image-to-single-image (Multi-to-Single) and multi-image-to-multi-image (Multi-to-Multi) SSL methods. We propose the Multi-to-Single-Guided Multi-to-Multi SSL framework that combines the advantages of Multi-to-Single and Multi-to-Multi to generate visually appealing and high-fidelity images rich in texture details. The Multi-to-Single-Guided Multi-to-Multi SSL framework provides a new paradigm for integrating deep neural network with classical physics-based variational methods. To enhance the ability of MISR network to recover high-frequency details from aliased artifacts, this paper proposes a novel camera-array SR network called dual Transformer suitable for SSL. Experiments on synthetic and real-world datasets demonstrate the superiority of the proposed method.
[CV-112] 4D Vessel Reconstruction for Benchtop Thrombectomy Analysis
【Quick Read】: This paper aims to quantify vessel deformation and procedure-related injury during mechanical thrombectomy, in particular addressing the lack of time-resolved, full-field 3D vessel-motion measurement in existing benchtop models. The key to its solution is a low-cost, high-accuracy framework based on nine-camera multi-view video capture and 4D Gaussian Splatting reconstruction, which tracks time-resolved surface deformation of silicone middle cerebral artery phantoms during simulated thrombectomy. Fixed-connectivity edge graphs are then used to extract regional displacement and a relative stress proxy derived from a Neo-Hookean mapping, providing a standardized, comparative characterization of surface mechanics that supports quantitative comparison across procedural conditions and methods validation.
Link: https://arxiv.org/abs/2604.06671
Authors: Ethan Nguyen,Javier Carmona,Arisa Matsuzaki,Naoki Kaneko,Katsushi Arisaka
Affiliations: UCLA Health; UCLA Physics and Astronomy; Chan Zuckerberg Biohub Network; Ronald Reagan UCLA Medical Center
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 20 pages, 10 figures, 1 table, supplementary material (3 tables, 3 figures, and 11 videos). Project page: this https URL
Abstract:Introduction: Mechanical thrombectomy can cause vessel deformation and procedure-related injury. Benchtop models are widely used for device testing, but time-resolved, full-field 3D vessel-motion measurements remain limited. Methods: We developed a nine-camera, low-cost multi-view workflow for benchtop thrombectomy in silicone middle cerebral artery phantoms (2160p, 20 fps). Multi-view videos were calibrated, segmented, and reconstructed with 4D Gaussian Splatting. Reconstructed point clouds were converted to fixed-connectivity edge graphs for region-of-interest (ROI) displacement tracking and a relative surface-based stress proxy. Stress-proxy values were derived from edge stretch using a Neo-Hookean mapping and reported as comparative surface metrics. A synthetic Blender pipeline with known deformation provided geometric and temporal validation. Results: In synthetic bulk translation, the stress proxy remained near zero for most edges (median \approx 0 MPa; 90th percentile 0.028 MPa), with sparse outliers. In synthetic pulling (1-5 mm), reconstruction showed close geometric and temporal agreement with ground truth, with symmetric Chamfer distance of 1.714-1.815 mm and precision of 0.964-0.972 at \tau = 1 mm. In preliminary benchtop comparative trials (one trial per condition), cervical aspiration catheter placement showed higher max-median ROI displacement and stress-proxy values than internal carotid artery terminus placement. Conclusion: The proposed protocol provides standardized, time-resolved surface kinematics and comparative relative displacement and stress proxy measurements for thrombectomy benchtop studies. The framework supports condition-to-condition comparisons and methods validation, while remaining distinct from absolute wall-stress estimation. Implementation code and example data are available at this https URL.
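The edge-stretch-to-stress-proxy step can be made concrete. The abstract says only that proxy values come from edge stretch via a Neo-Hookean mapping, so the sketch below assumes the standard incompressible Neo-Hookean uniaxial Cauchy stress, sigma = mu * (lambda^2 - 1/lambda), applied per graph edge; the shear modulus `mu` is a hypothetical constant, not a value from the paper.

```python
def stress_proxy(rest_len, cur_len, mu=0.5):
    """Relative stress proxy for one graph edge from its stretch ratio,
    assuming an incompressible Neo-Hookean uniaxial Cauchy stress
    (the paper's exact mapping may differ)."""
    lam = cur_len / rest_len          # edge stretch ratio
    return mu * (lam ** 2 - 1.0 / lam)

# Rigid bulk translation leaves edge lengths unchanged, so the proxy
# stays at zero, matching the near-zero median reported for that test.
unstretched = stress_proxy(1.0, 1.0)
stretched = stress_proxy(1.0, 1.2)
```

Because the proxy depends only on the stretch ratio, it reports relative loading, which is consistent with the paper's explicit distinction from absolute wall-stress estimation.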
[CV-113] Euclid Quick Data Release (Q1). AgileLens: A scalable CNN-based pipeline for strong gravitational lens identification
【Quick Read】: This paper addresses the efficient identification of strong galaxy-galaxy lensing systems in large-scale survey data, specifically the Euclid Q1 imaging release. The core challenge is to select convincing lens candidates from a huge number of galaxy images while rejecting noise, artefacts, and non-lens sources. The key to the solution is an end-to-end, iteratively refined deep-learning pipeline: pre-selection and cutout construction from the VIS catalogue, with a VIS-anchored luminance scheme that fuses VIS and NISP colour information while preserving morphology and enhancing colour contrast; a seed classifier that supplies initial positives and negatives, expanded through augmentation and morphology balancing; and a modified VGG16 convolutional neural network (CNN) that, after three rounds of iterative fine-tuning, yields 441 Grade A/B strong-lens systems among the top 4000 high-confidence candidates, 130 of them new discoveries. The approach is both efficient and scalable to future Euclid releases.
Link: https://arxiv.org/abs/2604.06648
作者: Euclid Collaboration:X. Xu(1 and 2),R. Chen(1),T. Li(1),A. R. Cooray(1),S. Schuldt(3 and 4),J. A. Acevedo Barroso(5),D. Stern(5),D. Scott(6),M. Meneghetti(7 and 8),G. Despali(9 and 7 and 8),J. Chopra(1),Y. Cao(1),M. Cheng(1),J. Buda(1),J. Zhang(1),J. Furumizo(1),R. Valencia(1),Z. Jiang(2),C. Tortora(10),N. E. P. Lines(11),T. E. Collett(11),S. Fotopoulou(12),A. Galan(13 and 14),A. Manjón-García(15),R. Gavazzi(16 and 17),L. Iwamoto(18),S. Kruk(19),M. Millon(20),P. Nugent(21),C. Saulder(22 and 23),D. Sluse(24),J. Wilde(25),M. Walmsley(26 and 27),F. Courbin(25 and 28 and 29),R. B. Metcalf(9 and 7),B. Altieri(19),A. Amara(30),S. Andreon(31),N. Auricchio(7),C. Baccigalupi(32 and 33 and 34 and 35),M. Baldi(36 and 7 and 8),A. Balestra(37),S. Bardelli(7),P. Battaglia(7),R. Bender(22 and 23),A. Biviano(33 and 32),E. Branchini(38 and 39 and 31),M. Brescia(40 and 10),S. Camera(41 and 42 and 43),V. Capobianco(43),C. Carbone(4),V. F. Cardone(44 and 45),J. Carretero(46 and 47),S. Casas(48 and 49),M. Castellano(44),G. Castignani(7),S. Cavuoti(10 and 50),A. Cimatti(51),C. Colodro-Conde(52),G. Congedo(53),C. J. Conselice(27),L. Conversi(54 and 19),Y. Copin(55),H. M. Courtois(56),M. Cropper(57),A. Da Silva(58 and 59),H. Degaudenzi(60),G. De Lucia(33),C. Dolding(57),H. Dole(61),F. Dubath(60),X. Dupac(19),S. Dusini(62),S. Escoffier(63),M. Farina(64),R. Farinelli(7),S. Farrens(65),S. Ferriol(55),F. Finelli(7 and 66),P. Fosalba(67 and 68),M. Frailis(33),E. Franceschi(7),M. Fumana(4),S. Galeotta(33),K. George(69),W. Gillard(63),B. Gillis(53),C. Giocoli(7 and 8),P. Gómez-Alvarez(70 and 19),J. Gracia-Carpio(22),A. Grazian(37),F. Grupp(22 and 23),S. V. H. Haugan(71),W. Holmes(5),F. Hormuth(72),A. Hornstrup(73 and 74),K. Jahnke(75),M. Jhabvala(76),B. Joachimi
Affiliations: unknown
Subjects: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 16 figures
Abstract:We present an end-to-end, iterative pipeline for efficient identification of strong galaxy–galaxy lensing systems, applied to the Euclid Q1 imaging data. Starting from VIS catalogues, we reject point sources, apply a magnitude cut ( I_E \leq 24 ) on deflectors, and run a pixel-level artefact/noise filter to build 96 \times 96 pix cutouts; VIS+NISP colour composites are constructed with a VIS-anchored luminance scheme that preserves VIS morphology and NISP colour contrast. A VIS-only seed classifier supplies clear positives and typical impostors, from which we curate a morphology-balanced negative set and augment scarce positives. Among the six CNNs studied initially, a modified VGG16 (GlobalAveragePooling + 256/128 dense layers with the last nine layers trainable) performs best; the training set grows from 27 seed lenses (augmented to 1809) plus 2000 negatives to a colour dataset of 30,686 images. After three rounds of iterative fine-tuning, human grading of the top 4000 candidates ranked by the final model yields 441 Grade A/B candidate lensing systems, including 311 overlapping with the existing Q1 strong-lens catalogue, and 130 additional A/B candidates (9 As and 121 Bs) not previously reported. Independently, the model recovers 740 out of 905 (81.8%) candidate Q1 lenses within its top 20,000 predictions, considering off-centred samples. Candidates span I_E \simeq 17–24 AB mag (median 21.3 AB mag) and are redder in Y_E - H_E than the parent population, consistent with massive early-type deflectors. Each training iteration required a week for a small team, and the approach easily scales to future Euclid releases; future work will calibrate the selection function via lens injection, extend recall through uncertainty-aware active learning, explore multi-scale or attention-based neural networks with fast post-hoc vetters that incorporate lens models into the classification.
[CV-114] A Noise Constrained Diffusion (NC-Diffusion) Framework for High Fidelity Image Compression
【Quick Read】: This paper addresses the deviation between reconstructed and original images caused by the random noise introduced during diffusion-model training for image compression, which degrades compression performance. The key to the solution is a Noise Constrained Diffusion (NC-Diffusion) framework that formulates the quantization noise added during learned image compression as the noise of the diffusion forward process, and constructs a noise-constrained diffusion process from the ground-truth image to the initial compression result containing quantization noise. This resolves the noise mismatch between compression and diffusion and markedly improves inference efficiency and reconstruction quality. In addition, an adaptive frequency-domain filtering module strengthens the skip connections of the U-Net architecture to preserve high-frequency details, and a zero-shot sample-guided enhancement method further improves image fidelity.
Link: https://arxiv.org/abs/2604.06568
Authors: Zhenyu Du,Yanbo Gao,Shuai Li,Yiyang Li,Hui Yuan,Mao Ye
Affiliations: Shandong University; University of Electronic Science and Technology of China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Circuits and Systems for Video Technology
Abstract:With the great success of diffusion models in image generation, diffusion-based image compression is attracting increasing interests. However, due to the random noise introduced in the diffusion learning, they usually produce reconstructions with deviation from the original images, leading to suboptimal compression results. To address this problem, in this paper, we propose a Noise Constrained Diffusion (NC-Diffusion) framework for high fidelity image compression. Unlike existing diffusion-based compression methods that add random Gaussian noise and direct the noise into the image space, the proposed NC-Diffusion formulates the quantization noise originally added in the learned image compression as the noise in the forward process of diffusion. Then a noise constrained diffusion process is constructed from the ground-truth image to the initial compression result generated with quantization noise. The NC-Diffusion overcomes the problem of noise mismatch between compression and diffusion, significantly improving the inference efficiency. In addition, an adaptive frequency-domain filtering module is developed to enhance the skip connections in the U-Net based diffusion architecture, in order to enhance high-frequency details. Moreover, a zero-shot sample-guided enhancement method is designed to further improve the fidelity of the image. Experiments on multiple benchmark datasets demonstrate that our method can achieve the best performance compared with existing methods.
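The core reformulation, using the quantization residual Q(x) - x as the forward-process noise so that the diffusion path ends exactly at the initial compression result, can be sketched in one dimension. The uniform quantizer and the linear schedule below are illustrative stand-ins for the learned codec and for whatever schedule the paper actually uses.

```python
def quantize(x, step=0.5):
    # Hypothetical uniform quantizer standing in for a learned codec.
    return round(x / step) * step

def noise_constrained_forward(x, T=4, step=0.5):
    """Path from the ground-truth value to its quantized reconstruction,
    driven by quantization noise instead of random Gaussian noise."""
    n = quantize(x, step) - x                      # quantization noise
    return [x + (t / T) * n for t in range(T + 1)]

path = noise_constrained_forward(1.3)  # starts at 1.3, ends at quantize(1.3)
```

The point of the construction is that the terminal state of the forward process is, by definition, the compressed initial estimate, so the reverse (denoising) process never has to bridge a Gaussian/quantization mismatch.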
[CV-115] CWRNN-INVR: A Coupled WarpRNN based Implicit Neural Video Representation
【Quick Read】: This paper tackles the lack of a systematic account, in implicit neural video representation (INVR), of the different roles that neural networks and grid structures play in representing a video, which makes it hard to capture both the regular, structured information and the irregular details of a video efficiently at the same time. The key to the solution is a hybrid neural-network-plus-residual-grid framework, CWRNN-INVR: the neural network models the regular and structured motion information in the video (through a Coupled WarpRNN-based multi-scale motion representation and compensation module), while the residual grid encodes the remaining irregular appearance and motion information. Working together, the two achieve a layered, complementary representation of video content, clearly improving reconstruction quality (average PSNR of 33.73 dB on the UVG dataset) and outperforming existing INVR methods on downstream tasks.
Link: https://arxiv.org/abs/2604.06564
Authors: Yiyang Li,Yanbo Gao,Shuai Li,Zhenyu Du,Jinglin Zhang,Hui Yuan,Mao Ye,Xingyu Gao
Affiliations: Shandong University; School of Software, Shandong University; Shandong University-WeiHai Research Institute of Industrial Technology; University of Electronic Science and Technology of China; Institute of Microelectronics, Chinese Academy of Sciences
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia
Abstract:Implicit Neural Video Representation (INVR) has emerged as a novel approach for video representation and compression, using learnable grids and neural networks. Existing methods focus on developing new grid structures efficient for latent representation and neural network architectures with large representation capability, lacking the study on their roles in video representation. In this paper, the difference between INVR based on neural network and INVR based on grid is first investigated from the perspective of video information composition to specify their own advantages, i.e., neural network for general structure while grid for specific detail. Accordingly, an INVR based on mixed neural network and residual grid framework is proposed, where the neural network is used to represent the regular and structured information and the residual grid is used to represent the remaining irregular information in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module is specifically designed to explicitly represent the regular and structured information, thus terming our method as CWRNN-INVR. For the irregular information, a mixed residual grid is learned where the irregular appearance and motion information are represented together. The mixed residual grid can be combined with the coupled WarpRNN in a way that allows for network reuse. Experiments show that our method achieves the best reconstruction results compared with the existing methods, with an average PSNR of 33.73 dB on the UVG dataset under the 3M model and outperforms existing INVR methods in other downstream tasks. The code can be found at this https URL.
[CV-116] Adaptive Differential Privacy for Federated Medical Image Segmentation Across Diverse Modalities
【Quick Read】: This paper addresses the poor generalization and inadequate privacy protection that arise in medical image segmentation from distributed data storage, privacy regulations, and heterogeneous data distributions. Existing federated learning (FL) faces a trade-off between data privacy and model performance in cross-institution training, and adding differential privacy (DP) typically degrades accuracy, destabilizes convergence, and reduces generalization. The key to the solution is an Adaptive Differentially Private Federated Learning (ADP-FL) framework that dynamically adjusts the privacy mechanism during training to optimize the privacy-utility trade-off on the fly, markedly improving Dice scores, boundary segmentation quality, and training stability while maintaining rigorous privacy guarantees, and approaching the performance of non-private federated learning across multiple medical imaging modalities (skin lesion, kidney tumor, and brain tumor segmentation).
Link: https://arxiv.org/abs/2604.06518
Authors: Puja Saha,Eranga Ukwatta
Affiliations: University of Guelph
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures. Accepted in SPIE Medical Imaging 2026. Recipient of CAD Best Paper Award: 1st Place, and Robert F. Wagner All-Conference Best Paper Award: Finalist
Abstract:Large volumes of medical data remain underutilized because centralizing distributed data is often infeasible due to strict privacy regulations and institutional constraints. In addition, models trained in centralized settings frequently fail to generalize across clinical sites because of heterogeneity in imaging protocols and continuously evolving data distributions arising from differences in scanners, acquisition parameters, and patient populations. Federated learning offers a promising solution by enabling collaborative model training without sharing raw data. However, incorporating differential privacy into federated learning, while essential for privacy guarantees, often leads to degraded accuracy, unstable convergence, and reduced generalization. In this work, we propose an adaptive differentially private federated learning (ADP-FL) framework for medical image segmentation that dynamically adjusts privacy mechanisms to better balance the privacy-utility trade-off. The proposed approach stabilizes training, significantly improves Dice scores and segmentation boundary quality, and maintains rigorous privacy guarantees. We evaluated ADP-FL across diverse imaging modalities and segmentation tasks, including skin lesion segmentation in dermoscopic images, kidney tumor segmentation in 3D CT scans, and brain tumor segmentation in multi-parametric MRI. Compared with conventional federated learning and standard differentially private federated learning, ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability, with performance approaching that of non-private federated learning under the same privacy budgets. These results demonstrate the practical viability of ADP-FL for high-performance, privacy-preserving medical image segmentation in real-world federated settings.
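The differentially private half of such a pipeline can be sketched with the standard Gaussian mechanism: clip each client update, average, and add calibrated noise. The adaptive part of ADP-FL (how the clip norm and noise multiplier evolve during training) is not specified in the abstract and is omitted here; `c` and `sigma` are hypothetical constants, and updates are plain 1-D lists for clarity.

```python
import math
import random

def clip(update, c):
    """Rescale an update so its L2 norm is at most c (standard DP-SGD clipping)."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def dp_aggregate(updates, c=1.0, sigma=0.8, rng=random.Random(0)):
    """Clip each client update to norm c, average, then add Gaussian
    noise with standard deviation sigma * c / n (Gaussian mechanism)."""
    clipped = [clip(u, c) for u in updates]
    n, d = len(updates), len(updates[0])
    avg = [sum(u[i] for u in clipped) / n for i in range(d)]
    return [a + rng.gauss(0.0, sigma * c / n) for a in avg]
```

With `sigma=0` the function reduces to plain clipped federated averaging, which makes the privacy-utility trade-off directly visible: all accuracy loss relative to that baseline comes from the injected noise.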
[CV-117] Structural Regularities of Cinema SDR-to-HDR Mapping in a Controlled Mastering Workflow: A Pixel-wise Case Study on ASC StEM2
【Quick Read】: This paper addresses the lack of a quantitative, structure-aware analysis baseline for cinema standard-dynamic-range (SDR) to high-dynamic-range (HDR) mapping. Using ASC StEM2, a rare common-source dataset from an ACES-based mastering workflow containing EXR scene-referred images and matched SDR/HDR cinema release masters, the authors build a three-domain comparison framework (EXR source data, SDR master, HDR master) and quantify its structural relationships along both luminance and color. Pixel-wise statistics identify 82.4% of sampled image regions as "EXR-closer recovery" regions, with the remainder requiring localized adaptive adjustment. The key contribution is an operational pixel-level decision map that provides an interpretable quantitative baseline for structure-aware SDR-to-HDR conversion and supports the design of learning-based mapping models under shared-source mastering conditions.
Link: https://arxiv.org/abs/2604.06276
Authors: Xin Zhang,Xiaoyi Chen
Affiliations: China Research Institute of Film Science Technology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures. Empirical case study on cinema SDR-to-HDR mapping using ASC StEM2
Abstract:We present an empirical case study of cinema SDR-to-HDR mapping using ASC StEM2, a rare common-source dataset containing EXR scene-referred images and matched SDR/HDR cinema release masters from the same ACES-based mastering workflow. Based on pixel-wise statistics over all 18,580 frames of the test film, we construct a three-domain comparison involving EXR source data, SDR release masters, and HDR release masters to characterize their luminance and color structural relationships within this controlled workflow. In the luminance dimension, SDR and HDR masters exhibit a highly stable global monotonic correspondence, with geometric structure remaining largely consistent overall; sparse and structured deviations appear in self-luminous highlights and specific material regions. In the color dimension, the two masters remain largely consistent in hue, with saturation exhibiting a redistribution pattern of shadow suppression, midtone expansion, and highlight convergence. Using EXR as a scene-referred anchor, we further define a pixel-level decision map that operationally separates EXR-closer recovery regions from content-adaptive adjustment regions. Under this operational definition, 82.4% of sampled image regions are classified as EXR-closer recovery, while the remainder require localized adaptive adjustment. Rather than claiming a universal law for all cinema mastering pipelines, the study provides an interpretable quantitative baseline for structure-aware SDR-to-HDR analysis and for designing learning-based models under shared-source mastering conditions.
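The pixel-level decision map is defined operationally in the paper, but the exact criterion is not given in the abstract. The sketch below therefore uses a hypothetical rule purely to illustrate the idea: a pixel counts as "EXR-closer recovery" when the HDR master tracks the EXR scene-referred anchor at least as closely as it tracks the SDR master.

```python
def decision_map(exr, sdr, hdr):
    """Label each pixel 'recover' when the HDR value lies closer to the
    EXR anchor than to the SDR master (hypothetical criterion, not the
    paper's published definition); otherwise 'adaptive'."""
    labels = []
    for e, s, h in zip(exr, sdr, hdr):
        labels.append("recover" if abs(h - e) <= abs(h - s) else "adaptive")
    return labels

# Two toy pixels: the first follows the EXR anchor, the second was
# content-adaptively regraded toward the SDR master.
labels = decision_map([0.2, 0.9], [0.25, 0.5], [0.21, 0.55])
```

Whatever the exact rule, the output is a binary map whose "recover" fraction plays the role of the paper's 82.4% figure.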
[CV-118] MAE-SAM2: Mask Autoencoder-Enhanced SAM2 for Clinical Retinal Vascular Leakage Segmentation
【Quick Read】: This paper addresses the difficulty of segmenting retinal vascular leakage regions in fluorescein angiography images, a task made challenging by the small size and dense distribution of leakage areas and the scarcity of labeled clinical data. The key to the solution is the MAE-SAM2 model, whose core innovation is to combine a self-supervised learning (SSL) strategy, the Masked Autoencoder (MAE), with the SAM2 architecture and to optimize it with a task-specific combined loss. Experiments show the method surpasses existing mainstream models in Dice score and Intersection-over-Union (IoU), with a 5% improvement over the original SAM2, demonstrating the effectiveness of pretrained foundation models in medical image analysis.
Link: https://arxiv.org/abs/2509.10554
Authors: Xin Xing,Irmak Karaca,Amir Akhavanrezayat,Samira Badrloo,Quan Dong Nguyen,Mahadevan Subramaniam
Affiliations: University of Nebraska Omaha; Columbia University Irving Medical Center; Stanford University
Subjects: Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:We propose MAE-SAM2, a novel foundation model for retinal vascular leakage segmentation on fluorescein angiography images. Due to the small size and dense distribution of the leakage areas, along with the limited availability of labeled clinical data, this presents a significant challenge for segmentation tasks. Our approach integrates a Self-Supervised learning (SSL) strategy, Masked Autoencoder (MAE), with SAM2. In our implementation, we explore different loss functions and conclude a task-specific combined loss. Extensive experiments and ablation studies demonstrate that MAE-SAM2 outperforms several state-of-the-art models, achieving the highest Dice score and Intersection-over-Union (IoU). Compared to the original SAM2, our model achieves a 5% performance improvement, highlighting the promise of foundation models with self-supervised pretraining in clinical imaging tasks.
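The abstract says the authors explore different loss functions and settle on a task-specific combined loss, without publishing it. A common composition for small, sparse lesion masks, and the assumption made here, is an equally weighted Dice plus binary cross-entropy over flattened probabilities:

```python
import math

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened probabilities and binary labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy, with eps guarding log(0)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def combined_loss(pred, target, w=0.5):
    # Hypothetical weighting; the paper's actual combination may differ.
    return w * dice_loss(pred, target) + (1 - w) * bce_loss(pred, target)
```

Dice directly targets the overlap metric reported in the paper (it is robust to foreground/background imbalance), while BCE supplies dense per-pixel gradients, which is the usual motivation for combining them.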
Artificial Intelligence
[AI-0] Toward a Tractability Frontier for Exact Relevance Certification
【Quick Read】: This paper addresses exact relevance certification in coordinate-structured decision problems, i.e., determining which coordinates are necessary to determine the optimal action. The core difficulty is that although the tractable families studied admit a finite primitive basis, optimizer-quotient realizability is maximal, so quotient shape alone cannot characterize the frontier. The key to the solution is the construction of four obstruction families (dominant-pair concentration, margin masking, ghost-action concentration, and additive/statewise offset concentration) and, via action-independent, pair-targeted affine witnesses, the construction of same-orbit disagreements, proving that no efficiently checkable structural predicate that is correct on a closure-closed domain yields an exact characterization over these families. Notably, the result does not presuppose a designated admissibility package: closure-orbit agreement is forced by correctness rather than assumed as an invariance axiom.
Link: https://arxiv.org/abs/2604.07349
Authors: Tristan Simas
Affiliations: unknown
Subjects: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 23 pages. 2 tables. Lean 4 formalization available this https URL
Abstract:Exact relevance certification asks which coordinates are necessary to determine the optimal action in a coordinate-structured decision problem. The tractable families treated here admit a finite primitive basis, but optimizer-quotient realizability is maximal, so quotient shape alone cannot characterize the frontier. We prove a meta-impossibility theorem for efficiently checkable structural predicates invariant under the theorem-forced closure laws of exact certification. Structural convergence with zero-distortion summaries, quotient entropy bounds, and support-counting arguments explains why those closure laws are canonical. We establish the theorem by constructing same-orbit disagreements for four obstruction families, namely dominant-pair concentration, margin masking, ghost-action concentration, and additive/statewise offset concentration, using action-independent, pair-targeted affine witnesses. Consequently no correct tractability classifier on a closure-closed domain yields an exact characterization over these families. Here closure-orbit agreement is forced by correctness rather than assumed as an invariance axiom. The result therefore applies to correct classifiers on closure-closed domains, not only to classifiers presented through a designated admissibility package.
[AI-1] Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation
【Quick Read】: This paper addresses the difficulty of translating natural-language security and privacy policies (e.g., requirements for software, networks, and systems) into propositional Linear Temporal Logic (LTL) formulas, thereby lowering the barrier to using security analysis tools. Because LTL semantics is intricate, developers and analysts find it hard to write LTL directly, and Large Language Models (LLMs) promise to broaden access through automatic natural-language-to-LTL translation. The key findings are: first, an evaluation of several representative LLMs along both syntactic and semantic dimensions; second, more detailed prompts yield clear gains; and most importantly, reformulating the task as a Python code-completion problem substantially improves overall performance, showing that how the task is framed is decisive for how well LLMs can be applied.
Link: https://arxiv.org/abs/2604.07321
Authors: Priscilla Kyei Danso,Mohammad Saqib Hasan,Niranjan Balasubramanian,Omar Chowdhury
Affiliations: unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: SecDev 2026 in Montreal, Canada, 10 pages, maximum 16 pages
Abstract:Propositional Linear Temporal Logic (LTL) is a popular formalism for specifying desirable requirements and security and privacy policies for software, networks, and systems. Yet expressing such requirements and policies in LTL remains challenging because of its intricate semantics. Since many security and privacy analysis tools require LTL formulas as input, this difficulty places them out of reach for many developers and analysts. Large Language Models (LLMs) could broaden access to such tools by translating natural language fragments into LTL formulas. This paper evaluates that premise by assessing how effectively several representative LLMs translate assertive English sentences into LTL formulas. Using both human-generated and synthetic ground-truth data, we evaluate effectiveness along syntactic and semantic dimensions. The results reveal three findings: (1) in line with prior findings, LLMs perform better on syntactic aspects of LTL than on semantic ones; (2) they generally benefit from more detailed prompts; and (3) reformulating the task as a Python code-completion problem substantially improves overall performance. We also discuss challenges in conducting a fair evaluation on this task and conclude with recommendations for future work.
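The semantic subtlety that the paper finds hardest for LLMs is easy to illustrate with executable semantics. The sketch below evaluates a small LTL fragment over finite traces; this is a simplification (standard LTL is defined over infinite words), and treating X as false at the end of a finite trace is a design choice made here, not something taken from the paper.

```python
def holds(f, trace, i=0):
    """Evaluate a tiny LTL fragment at position i of a finite trace.
    A trace is a list of sets of atomic propositions. Formulas are an
    atom string, ('not', f), ('and', f, g), ('X', f), ('F', f),
    ('G', f), or ('U', f, g)."""
    if isinstance(f, str):
        return f in trace[i]
    op = f[0]
    if op == 'not':
        return not holds(f[1], trace, i)
    if op == 'and':
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == 'X':  # "next": false at the last position of a finite trace
        return i + 1 < len(trace) and holds(f[1], trace, i + 1)
    if op == 'F':  # "eventually"
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == 'G':  # "globally"
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == 'U':  # "until": g holds at some j, f holds at all k < j
        return any(holds(f[2], trace, j) and
                   all(holds(f[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# "Every request is eventually granted" style trace.
trace = [{'req'}, {'req'}, {'grant'}]
```

A syntactically valid formula such as `('G', 'req')` is trivial to produce, but whether it captures an English sentence like "requests persist until granted" (which is `('U', 'req', 'grant')`) is exactly the semantic gap the paper measures.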
[AI-2] Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems
【Quick Read】: This paper addresses a new challenge that Large Language Models (LLMs) pose to Automated Programming Assessment Systems (APASs): students can now produce functionally correct code without the corresponding understanding, which undermines conventional assessment. The key to the solution is a Hybrid Socratic Framework whose core is to combine deterministic code analysis with a dual-agent conversational layer, using knowledge tracking, scaffolded questioning, and guardrails that tie generated questions to runtime facts, so that the depth of a student's code understanding can be verified reliably. The framework is not meant to replace conventional testing; it serves as a complementary layer that strengthens the probing of students' programming comprehension.
Link: https://arxiv.org/abs/2604.07304
Authors: Eduard Frankford,Erik Cikalleshi,Ruth Breu
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, accepted for publication at CSEDU 2026
Abstract:Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
[AI-3] CADENCE: Context-Adaptive Depth Estimation for Navigation and Computational Efficiency
【Quick Read】: This paper addresses how autonomous vehicles operating in resource-constrained remote environments can balance perception accuracy against computational cost. The core challenge is that robust environmental perception usually relies on computationally intensive deep neural networks, while hardware constraints such as embedded processors, limited batteries, and lightweight sensors make continuously running high-performance models infeasible. The key to the solution is CADENCE, an adaptive system that dynamically scales the computational complexity of a slimmable monocular depth estimation network according to navigation needs and environmental context. By closing the loop between perception fidelity and actuation requirements, CADENCE ensures that high-precision computation is used only when mission-critical, markedly reducing sensor acquisitions, power consumption, and inference latency while improving overall navigation accuracy.
Link: https://arxiv.org/abs/2604.07286
Authors: Timothy K Johnsen,Marco Levorato
Affiliations: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 7 figures, Accepted for publication at IEEE World AI IoT Congress (AIIoT) 2026
Abstract:Autonomous vehicles deployed in remote environments typically rely on embedded processors, compact batteries, and lightweight sensors. These hardware limitations conflict with the need to derive robust representations of the environment, which often requires executing computationally intensive deep neural networks for perception. To address this challenge, we present CADENCE, an adaptive system that dynamically scales the computational complexity of a slimmable monocular depth estimation network in response to navigation needs and environmental context. By closing the loop between perception fidelity and actuation requirements, CADENCE ensures high-precision computing is only used when mission-critical. We conduct evaluations on our released open-source testbed that integrates Microsoft AirSim with an NVIDIA Jetson Orin Nano. As compared to a state-of-the-art static approach, CADENCE decreases sensor acquisitions, power consumption, and inference latency by 9.67%, 16.1%, and 74.8%, respectively. The results demonstrate an overall reduction in energy expenditure by 75.0%, along with an increase in navigation accuracy by 7.43%.
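A toy version of the context-adaptive control loop: the width options, the width-to-error profile, and the tolerance rule below are all hypothetical. The point is only the mechanism, choosing the cheapest width of a slimmable network that still satisfies an accuracy demand derived from the navigation context.

```python
# Hypothetical accuracy profile of a slimmable depth network:
# width fraction -> expected relative depth error.
PROFILE = {0.25: 0.12, 0.5: 0.07, 0.75: 0.05, 1.0: 0.03}

def select_width(obstacle_distance_m, speed_mps):
    """Pick the cheapest network width whose expected error meets a
    context-derived tolerance (illustrative rule, not the paper's)."""
    # Tighter tolerance when obstacles are close or speed is high.
    tolerance = min(0.15, 0.02 * obstacle_distance_m / max(speed_mps, 0.1))
    for width in sorted(PROFILE):            # cheapest width first
        if PROFILE[width] <= tolerance:
            return width
    return 1.0                               # fall back to the full network
```

Open terrain at moderate speed lets the controller drop to the 0.25-width sub-network, while a nearby obstacle forces the full model, which is the fidelity/energy trade the abstract describes.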
[AI-4] Android Coach: Improve Online Agent ic Training Efficiency with Single State Multiple Actions
【Quick Read】: This paper addresses the prohibitive cost of online reinforcement learning (RL) for Android agents, caused by high emulator latency and the sample inefficiency of existing algorithms. The core limitation is the prevailing Single State Single Action paradigm, which fails to fully explore each costly emulator state and thus caps learning efficiency. The key to the solution is a new Single State Multiple Actions paradigm: by introducing a critic that estimates action values, the agent can sample and learn from multiple actions for a single online state without extra emulator overhead. To make the critic a reliable coach, the method integrates a process reward model and a group-wise advantage estimator based on averaged critic outputs, substantially improving training efficiency and success rates.
Link: https://arxiv.org/abs/2604.07277
Authors: Guo Gan,Yuxuan Ding,Cong Chen,Yuwei Ren,Yin Huang,Hong Zhou
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
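The group-wise advantage estimator can be sketched as follows. The abstract says advantages are formed from averaged critic outputs over the actions sampled at one state; everything beyond that (raw critic values as scores, no variance normalization) is an assumption of this sketch.

```python
def group_advantages(critic_values):
    """Advantage of each action sampled at a single state, measured
    relative to the mean critic output of the group (the Single State
    Multiple Actions setting). GRPO-style std-normalization could be
    added, but is an assumption, not stated in the abstract."""
    mean = sum(critic_values) / len(critic_values)
    return [v - mean for v in critic_values]

# Three actions sampled at one emulator state, scored by the critic.
adv = group_advantages([0.2, 0.8, 0.5])  # above-average actions get adv > 0
```

Because all scores come from one state, the group mean acts as a state-specific baseline, which is what lets one emulator step supervise several policy updates.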
[AI-5] Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS
【Quick Read】: This paper addresses the challenge of efficiently integrating high-accuracy neural-network potentials (DeePMD-kit) into massively parallel molecular dynamics (MD), i.e., running inference at high performance on multi-GPU, multi-node systems while remaining compatible with GROMACS, the de-facto standard MD framework. The key points of the solution are: first, a DeePMD backend is added to the GROMACS NNPot interface, with a domain-decomposition layer decoupled from the main simulation so that neural-network inference runs concurrently on all ranks; second, each simulation step uses two MPI collectives to broadcast coordinates and to aggregate and redistribute forces, enabling coordinated computation across nodes; finally, training an in-house 1.6M-parameter DPA-1 model and benchmarking on a 15,668-atom protein system demonstrates good strong- and weak-scaling efficiency (up to 66% strong-scaling) on NVIDIA A100 and AMD MI250x GPUs, with the main bottlenecks being the irreducible ghost-atom communication cost set by the cutoff radius and load imbalance across ranks.
Link: https://arxiv.org/abs/2604.07276
Authors: Luca Pennati,Andong Hu,Ivy Peng,Lukas Müllender,Stefano Markidis
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:GROMACS is a de-facto standard for classical Molecular Dynamics (MD). The rise of AI-driven interatomic potentials that pursue near-quantum accuracy at MD throughput now poses a significant challenge: embedding neural-network inference into multi-GPU simulations retaining high-performance. In this work, we integrate the MLIP framework DeePMD-kit into GROMACS, enabling domain-decomposed, GPU-accelerated inference across multi-node systems. We extend the GROMACS NNPot interface with a DeePMD backend, and we introduce a domain decomposition layer decoupled from the main simulation. The inference is executed concurrently on all processes, with two MPI collectives used each step to broadcast coordinates and to aggregate and redistribute forces. We train an in-house DPA-1 model (1.6 M parameters) on a dataset of solvated protein fragments. We validate the implementation on a small protein system, then we benchmark the GROMACS-DeePMD integration with a 15,668 atom protein on NVIDIA A100 and AMD MI250x GPUs up to 32 devices. Strong-scaling efficiency reaches 66% at 16 devices and 40% at 32; weak-scaling efficiency is 80% to 16 devices and reaches 48% (MI250x) and 40% (A100) at 32 devices. Profiling with the ROCm System profiler shows that 90% of the wall time is spent in DeePMD inference, while MPI collectives contribute 10%, primarily since they act as a global synchronization point. The principal bottlenecks are the irreducible ghost-atom cost set by the cutoff radius, confirmed by a simple throughput model, and load imbalance across ranks. These results demonstrate that production MD with near ab initio fidelity is feasible at scale in GROMACS.
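The strong- and weak-scaling figures quoted above follow the standard definitions, which the small helper below reproduces from wall-clock times. The timings in the example are hypothetical, chosen so that a 10.56x speedup on 16 devices matches the reported 66% strong-scaling efficiency.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed problem size: speedup (t1 / tn) divided by device count n."""
    return (t1 / tn) / n

def weak_scaling_efficiency(t1, tn):
    """Problem size grows with device count: ideal scaling keeps the
    wall time constant, so efficiency is t1 / tn."""
    return t1 / tn

# Hypothetical timings: a 10.56x speedup on 16 devices gives 66%.
eff = strong_scaling_efficiency(100.0, 100.0 / 10.56, 16)
```

The same two formulas reproduce the other reported points (e.g., 40% strong-scaling at 32 devices corresponds to a 12.8x speedup), which makes the efficiency numbers easy to sanity-check.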
[AI-6] Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations
【速读】:该论文旨在解决低轨巨型星座(LEO mega-constellations)中高阶操作意图(如“在80毫秒内避开极区链路 reroute financial traffic away from polar links under 80 ms”)到底层路由约束的映射问题,这一过程需融合自然语言理解与网络域专业知识。其解决方案的关键在于构建一个端到端系统:首先,通过图神经网络(GNN)成本到目标路由器将Dijkstra算法级路由性能压缩为仅152K参数的图注意力网络,在保持99.8%分组交付率的同时实现17倍推理加速;其次,利用大语言模型(LLM)意图编译器结合少量样本提示和验证反馈修复循环,将自然语言转化为带类型的约束中间表示,达到98.4%编译成功率和87.6%可行意图的语义匹配准确率;最后,引入八轮确定性验证器以构造可行性认证机制,在所有不可行意图(共47个)上实现零误接受,并在结构损坏测试和针对性对抗攻击中均实现100%检测率,从而确保操作安全性和部署可靠性。
链接: https://arxiv.org/abs/2604.07264
作者: Yuanhang Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures
Abstract:Operating LEO mega-constellations requires translating high-level operator intents (“reroute financial traffic away from polar links under 80 ms”) into low-level routing constraints – a task that demands both natural language understanding and network-domain expertise. We present an end-to-end system comprising three components: (1) a GNN cost-to-go router that distills Dijkstra-quality routing into a 152K-parameter graph attention network achieving 99.8% packet delivery ratio with 17x inference speedup; (2) an LLM intent compiler that converts natural language to a typed constraint intermediate representation using few-shot prompting with a verifier-feedback repair loop, achieving 98.4% compilation rate and 87.6% full semantic match on feasible intents in a 240-intent benchmark (193 feasible, 47 infeasible); and (3) an 8-pass deterministic validator with constructive feasibility certification that achieves 0% unsafe acceptance on all 47 infeasible intents (30 labeled + 17 discovered by Pass 8), with 100% corruption detection across 240 structural corruption tests and 100% on 15 targeted adversarial attacks. End-to-end evaluation across four constrained routing scenarios confirms zero constraint violations with both routers. We further demonstrate that apparent performance gaps in polar-avoidance scenarios are largely explained by topological reachability ceilings rather than routing quality, and that the LLM compiler outperforms a rule-based baseline by 46.2 percentage points on compositional intents. Our system bridges the semantic gap between operator intent and network configuration while maintaining the safety guarantees required for operational deployment.
[AI-7] Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education
【速读】:该论文旨在解决在性别限制和监控环境下,女性因无法获得正规教育而面临的学习与职业发展障碍问题。研究发现,尽管生成式 AI (Generative AI, GenAI) 可作为替代性学习工具,但其有效应用需满足安全、隐私保护及文化适配等关键要求。解决方案的关键在于构建以“安全优先”为核心的设计原则:一是强化用户控制权与隐私保护机制,降低监视风险;二是提供基于具体情境的资源受限支持,弥补学习社群缺失;三是设计符合教学逻辑的辅助方式,避免仅提供直接答案导致的虚假进步感,从而真正促进深度学习与自我效能提升。
链接: https://arxiv.org/abs/2604.07253
作者: Hamayoon Behmanush,Freshta Akhtari,Ingmar Weber,Vikram Kamath Cannanure
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: This work has been accepted at ACM Conference on Fairness, Accountability, and Transparency 2026 as a full paper. Please cite the peer-reviewed version
Abstract:In gender-restrictive and surveilled contexts, where access to formal education may be restricted for women, pursuing education involves safety and privacy risks. When women are excluded from schools and universities, they often turn to online self-learning and generative AI (GenAI) to pursue their educational and career aspirations. However, we know little about what safe and accountable GenAI support is required in the context of surveillance, household responsibilities, and the absence of learning communities. We present a remote participatory design study with 20 women in Afghanistan, informed by a recruitment survey (n = 140), examining how participants envision GenAI for learning and employability. Participants describe using GenAI less as an information source and more as an always-available peer, mentor, and source of career guidance that helps compensate for the absence of learning communities. At the same time, they emphasize that this companionship is constrained by privacy and surveillance risks, contextually unrealistic and culturally unsafe support, and direct-answer interactions that can undermine learning by creating an illusion of progress. Beyond eliciting requirements, envisioning the future with GenAI through participatory design was positively associated with significant increases in participants’ aspirations (p=.01), perceived agency (p=.01), and perceived avenues (p=.03). These outcomes show that accountable and safe GenAI is not only about harm reduction but can also actively enable women to imagine and pursue viable learning and employment futures. Building on this, we translate participants’ proposals into accountability-focused design directions that center on safety-first interaction and user control, context-grounded support under constrained resources, and offer pedagogically aligned assistance that supports genuine learning rather than quick answers.
[AI-8] k-server-bench: Automating Potential Discovery for the k-Server Conjecture
【速读】:该论文旨在解决生成式 AI (Generative AI) 在数学发现中的开放性问题,具体聚焦于 k-服务器猜想(k-server conjecture)这一竞争分析领域的核心难题。其解决方案的关键在于构建一个基于代码的挑战任务:通过寻找一个满足大规模图结构线性不等式系统的势函数(potential function),来推进对该猜想在特定情形下的理解。该评估机制可靠但不完备(sound but incomplete)——任何违反不等式的候选方案可被直接证伪,而满足全部约束的候选则提供强证据支持,但尚不足以构成完整证明。实验表明,在已解决的 k=3 情况下,当前代理方法能有效处理非平凡实例;而在尚未解决的 k=4 圆形情形中,现有方法虽未完全收敛,但相较已有势函数显著减少了违反约束的数量,验证了该任务对当前方法的挑战性和可行性。
链接: https://arxiv.org/abs/2604.07240
作者: Kirill Brilliantov,Etienne Bamas,Emmanuel Abbé
机构: 未知
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce a code-based challenge for automated, open-ended mathematical discovery based on the k-server conjecture, a central open problem in competitive analysis. The task is to discover a potential function satisfying a large graph-structured system of simple linear inequalities. The resulting evaluation procedure is sound but incomplete: any violated inequality definitively refutes a candidate, whereas satisfying all inequalities does not by itself constitute a proof of the corresponding conjecture’s special case. Nevertheless, a candidate that passes all constraints would be strong evidence toward a valid proof and, to the best of our knowledge, no currently known potential achieves this under our formulation in the open k=4 circle case. As such, a successful candidate would already be an interesting contribution to the k-server conjecture, and could become a substantial theoretical result when paired with a full proof. Experiments on the resolved k=3 regime show that current agentic methods can solve nontrivial instances, and in the open k=4 regime they reduce the number of violations relative to existing potentials without fully resolving the task. Taken together, these results suggest that the task is challenging but plausibly within reach of current methods. Beyond its relevance to the k-server community, where the developed tooling enables researchers to test new hypotheses and potentially improve on the current record, the task also serves as a useful benchmark for developing code-based discovery agents. In particular, our k=3 results show that it mitigates important limitations of existing open-ended code-based benchmarks, including early saturation and the weak separation between naive random baselines and more sophisticated methods.
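摘要所述"可靠但不完备"的验证流程,可以理解为对候选势函数逐条检查线性不等式:任一违反即证伪候选,全部通过仅构成证据而非证明。下面是一个与论文实现无关的最小草图(A、b、x 的取法为笔者假设,仅示意验证器的判定逻辑):

```python
import numpy as np

def violated_constraints(A, b, x, tol=1e-9):
    """检查候选势函数取值 x 是否满足线性不等式组 A @ x >= b。
    返回被违反的不等式下标列表; 空列表即通过全部约束(但这只是证据, 不构成证明)。"""
    slack = A @ x - b
    return [i for i, s in enumerate(slack) if s < -tol]
```

验证器的不完备性体现在:即使返回空列表,也只说明该候选在当前不等式系统上未被证伪。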
[AI-9] Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence AISTATS2026
【速读】:该论文旨在解决混合比例估计(Mixture Proportion Estimation, MPE)问题,即从无标签数据中估计类别先验概率,这是弱监督学习(如PU学习、标签噪声学习和域适应)中的关键步骤。传统MPE方法依赖于不可约性假设(irreducibility assumption)或其变体以保证可识别性,但该假设在许多实际场景中难以满足。本文提出基于给定类别标签的条件独立性(Conditional Independence, CI)的新假设,能够在不可约性不成立时仍确保参数可识别性;在此基础上,作者设计了矩估计(method of moments)方法并分析其渐近性质,同时引入弱监督核检验(weakly-supervised kernel tests)用于验证CI假设,从而提升估计鲁棒性和适用性。实验表明,所提方法在估计精度上优于现有方法,且检验能有效控制第一类和第二类错误。
链接: https://arxiv.org/abs/2604.07191
作者: Yushi Hirose,Akito Narahara,Takafumi Kanamori
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AISTATS 2026
Abstract:Mixture proportion estimation (MPE) aims to estimate class priors from unlabeled data. This task is a critical component in weakly supervised learning, such as PU learning, learning with label noise, and domain adaptation. Existing MPE methods rely on the irreducibility assumption or its variant for identifiability. In this paper, we propose novel assumptions based on conditional independence (CI) given the class label, which ensure identifiability even when irreducibility does not hold. We develop method of moments estimators under these assumptions and analyze their asymptotic properties. Furthermore, we present weakly-supervised kernel tests to validate the CI assumptions, which are of independent interest in applications such as causal discovery and fairness evaluation. Empirically, we demonstrate the improved performance of our estimators compared with existing methods and that our tests successfully control both type I and type II errors.
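论文提出的是基于条件独立假设的矩估计器;作为背景,混合比例估计中最经典的一阶矩匹配思路如下(这是笔者补充的通用示意,并非论文中的 CI 估计器,后者无需访问两个成分分布的纯样本):

```python
import numpy as np

def mixture_proportion_mean_matching(x_mix, x_pos, x_neg):
    """一阶矩匹配估计混合比例 pi:
    E[X_mix] = pi * E[X_pos] + (1 - pi) * E[X_neg]
    => pi = (mean_mix - mean_neg) / (mean_pos - mean_neg)
    结果裁剪到 [0, 1]。"""
    m_mix, m_pos, m_neg = np.mean(x_mix), np.mean(x_pos), np.mean(x_neg)
    pi = (m_mix - m_neg) / (m_pos - m_neg)
    return float(np.clip(pi, 0.0, 1.0))
```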
[AI-10] The ATOM Report: Measuring the Open Language Model Ecosystem
【速读】:该论文旨在解决当前全球开放语言模型(Open Language Models)生态发展中地域分布与竞争态势的量化分析问题,尤其关注中国模型与美国模型之间的相对优势变化。其解决方案的关键在于通过多维度数据融合——包括Hugging Face下载量、模型衍生版本数量、推理市场占有率及性能指标等——构建了一个全面、动态的生态系统画像,从而清晰揭示了自2025年夏季起中国开源语言模型在整体影响力上超越西方同行并持续扩大差距的趋势。
链接: https://arxiv.org/abs/2604.07190
作者: Nathan Lambert,Florian Brand
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 17 figures
Abstract:We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba’s Qwen, DeepSeek, and Meta’s Llama, which form the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend where Chinese models overtook their counterparts built in the U.S. in the summer of 2025 and subsequently widened the gap with their Western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics, and more to build a comprehensive picture of the ecosystem.
[AI-11] Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在多步推理任务中因稀疏奖励导致的强化学习优化困难问题。现有方法如Group Relative Policy Optimization将采样轨迹视为独立链,对每条链内所有步骤分配均匀信用,忽略了关键步骤可能对推理结果产生非均衡影响的事实。其解决方案的关键在于提出T-STAR(Tree-structured Self-Taught Agent Rectification)框架,通过构建认知树(Cognitive Tree)将轨迹合并为具有功能相似节点的结构化树形表示,从而恢复看似独立轨迹间的潜在相关奖励结构;在此基础上引入内省估值机制(Introspective Valuation),沿树结构反向传播轨迹级奖励以获得方差降低的步骤级相对优势,并结合上下文思维嫁接(In-Context Thought Grafting)在关键分歧点合成纠正性推理,最终利用基于Bradley-Terry模型的手术式损失(Surgical Policy Optimization)聚焦于这些关键步骤进行策略梯度更新,显著提升长链条推理任务的性能。
链接: https://arxiv.org/abs/2604.07165
作者: Yu Li,Sizhe Tang,Tian Lan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionately impact the reasoning outcome. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. It enables an Introspective Valuation mechanism that back-propagates trajectory-level rewards through the tree to obtain a new notion of variance-reduced relative advantage at step-level. Using the Cognitive Tree, we also develop In-Context Thought Grafting to synthesize corrective reasoning by contrasting successful and failed branches at critical divergence points/steps. Our proposed Surgical Policy Optimization then capitalizes on the rich policy gradient information concentrated at these critical points/steps through a Bradley-Terry type of surgical loss. Extensive experiments across embodied, interactive, reasoning, and planning benchmarks demonstrate that T-STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.
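摘要中"将轨迹级奖励沿树回传、得到方差降低的步骤级相对优势"的思路,可以用一个按共享前缀合并轨迹的玩具树来示意(论文的 Cognitive Tree 按功能相似合并节点,这里简化为字面前缀合并,纯属笔者假设的最小版本):

```python
from collections import defaultdict

def tree_step_advantages(trajectories, returns):
    """把若干轨迹(步骤序列)按共享前缀合并成树。
    节点价值 = 经过该节点的所有轨迹回报的均值;
    某一步的相对优势 = 该步所到节点的价值 - 父节点价值。
    返回 {步骤前缀元组: 优势}。"""
    sums, counts = defaultdict(float), defaultdict(int)
    for steps, ret in zip(trajectories, returns):
        for t in range(len(steps) + 1):
            prefix = tuple(steps[:t])
            sums[prefix] += ret
            counts[prefix] += 1
    value = {p: sums[p] / counts[p] for p in sums}
    # 优势为相邻层级的价值差, 在共享前缀处自然聚合多条轨迹的信息
    return {p: value[p] - value[p[:-1]] for p in value if p}
```

在这个玩具例子里,共享步骤的优势为 0,而分歧点之后的步骤获得正/负优势,正对应摘要中"关键分歧点集中了策略梯度信息"的直觉。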
[AI-12] Energy Saving for Cell-Free Massive MIMO Networks: A Multi-Agent Deep Reinforcement Learning Approach
【速读】:该论文旨在解决细胞自由大规模MIMO(cell-free massive MIMO, CF mMIMO)网络在下行链路操作中因动态业务流量导致的高能耗问题。其核心解决方案是提出一种多智能体深度强化学习(multi-agent deep reinforcement learning, MADRL)算法,使每个接入点(access point, AP)能够自主控制天线重构和高级睡眠模式(advanced sleep mode, ASM)选择,在无需集中式控制的前提下实现分布式实时优化。该方法通过训练后完全去中心化的运行机制,显著降低功耗(相比无节能方案减少56.23%,相比仅使用最轻睡眠模式的非学习机制减少30.12%),同时保持较低的掉话率,相较经典深度Q网络(DQN)算法在相近功耗水平下具有更优的连接稳定性。
链接: https://arxiv.org/abs/2604.07133
作者: Qichen Wang,Keyu Li,Ozan Alp Topal,Özlem Tugfe Demir,Mustafa Ozger,Cicek Cavdar
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper focuses on energy savings in downlink operation of cell-free massive MIMO (CF mMIMO) networks under dynamic traffic conditions. We propose a multi-agent deep reinforcement learning (MADRL) algorithm that enables each access point (AP) to autonomously control antenna re-configuration and advanced sleep mode (ASM) selection. After the training process, the proposed framework operates in a fully distributed manner, eliminating the need for centralized control and allowing each AP to dynamically adjust to real-time traffic fluctuations. Simulation results show that the proposed algorithm reduces power consumption (PC) by 56.23% compared to systems without any energy-saving scheme and by 30.12% relative to a non-learning mechanism that only utilizes the lightest sleep mode, with only a slight increase in drop ratio. Moreover, compared to the widely used deep Q-network (DQN) algorithm, it achieves a similar PC level but with a significantly lower drop ratio.
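摘要中"每个 AP 自主控制天线配置与睡眠模式"在最简化情形下可退化为每个 AP 独立的表格型 Q-learning(论文使用的是 MADRL 深度网络并联合控制天线重构,下面仅是笔者的玩具示意;奖励可设为功耗与掉话惩罚之和的相反数):

```python
import random

class APSleepAgent:
    """单个 AP 的表格型 Q-learning 代理(示意):
    状态 = 离散化的负载档位, 动作 = 睡眠模式编号(0 = 激活, 编号越大睡得越深)。"""
    def __init__(self, n_load_levels, n_modes, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = [[0.0] * n_modes for _ in range(n_load_levels)]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.n_modes = n_modes

    def act(self, load):
        # epsilon-贪心: 以小概率探索其他睡眠模式
        if random.random() < self.eps:
            return random.randrange(self.n_modes)
        row = self.q[load]
        return row.index(max(row))

    def update(self, load, mode, reward, next_load):
        # 标准 Q-learning 时序差分更新; reward 例如取 -(功耗 + 惩罚 * 掉话)
        best_next = max(self.q[next_load])
        td = reward + self.gamma * best_next - self.q[load][mode]
        self.q[load][mode] += self.alpha * td
```

训练后每个 AP 仅凭本地状态查表决策,对应摘要中"完全分布式、无需集中控制"的部署方式。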
[AI-13] Self-Discovered Intention-aware Transformer for Multi-modal Vehicle Trajectory Prediction
【速读】:该论文旨在解决自动驾驶和智能交通系统(ITS)中车辆轨迹预测的灵活性与准确性问题,尤其针对现有深度学习方法依赖特定图结构(如图神经网络)或显式意图标注所导致的局限性。其解决方案的关键在于提出一种纯Transformer架构的多模态网络,通过双路径设计分离空间特征提取模块与轨迹生成模块,从而提升模型的泛化能力;同时,该模型能够通过预测K条轨迹间的残差偏移量,自动学习有序的轨迹群体分布,进一步增强预测的多样性和合理性。
链接: https://arxiv.org/abs/2604.07126
作者: Diyi Liu,Zihan Niu,Tu Xu,Lishan Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 2 figures
Abstract:Predicting vehicle trajectories plays an important role in autonomous driving and ITS applications. Although multiple deep learning algorithms have been devised to predict vehicle trajectories, their reliance on specific graph structures (e.g., Graph Neural Networks) or explicit intention labeling limits their flexibility. In this study, we propose a pure Transformer-based network with multiple modes that takes neighboring vehicles into account. Two separate tracks are employed: one track focuses on predicting the trajectories, while the other predicts the likelihood of each intention given the neighboring vehicles. We find that the two-track design improves performance by separating the spatial module from the trajectory-generating module. We also find that the model can learn an ordered group of trajectories by predicting residual offsets among K trajectories.
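摘要提到模型"通过预测 K 条轨迹间的残差偏移学习有序轨迹群"。其解码过程大致相当于从一条基准轨迹出发、逐条累加残差(具体参数化为笔者假设,仅示意该表示方式):

```python
import numpy as np

def decode_k_trajectories(base_traj, residuals):
    """由基准轨迹与 K-1 个残差偏移恢复有序的 K 条候选轨迹:
    traj_k = traj_{k-1} + residual_k (逐点累加)。
    base_traj: (T, 2) 的坐标序列; residuals: (K-1, T, 2)。返回 (K, T, 2)。"""
    trajs = [np.asarray(base_traj, dtype=float)]
    for r in residuals:
        trajs.append(trajs[-1] + r)
    return np.stack(trajs)
```

由于每条轨迹都是前一条加偏移,K 条轨迹天然有序,模型只需学会输出递进的偏移量。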
[AI-14] Information as Structural Alignment: A Dynamical Theory of Continual Learning
【速读】:该论文旨在解决持续学习(Continual Learning)中的灾难性遗忘(Catastrophic Forgetting)问题,即模型在学习新任务时对先前知识的严重遗忘现象。传统方法如正则化、回放(replay)和冻结子网络虽能缓解遗忘,但均依赖于外部机制对共享参数进行干预,未能从学习动态本身出发实现记忆保持。本文提出信息构建框架(Informational Buildup Framework, IBF),其核心创新在于将“信息”重新定义为结构对齐(structural alignment)的结果而非静态内容存储,并通过两个关键动力学方程驱动模型演化:一是运动定律(Law of Motion)促使系统配置向更高一致性演进;二是修改动力学(Modification Dynamics)根据局部差异持续重塑一致性景观。由此产生的记忆、自主性和自修正能力并非外加模块,而是内生于这些动力学过程之中,从而实现了无需存储原始数据即可超越回放基准的稳定性能表现。
链接: https://arxiv.org/abs/2604.07108
作者: Radu Negulescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 8 figures
Abstract:Catastrophic forgetting is not an engineering failure. It is a mathematical consequence of storing knowledge as global parameter superposition. Existing methods, such as regularization, replay, and frozen subnetworks, add external mechanisms to a shared-parameter substrate. None derives retention from the learning dynamics themselves. This paper introduces the Informational Buildup Framework (IBF), an alternative substrate for continual learning, based on the premise that information is the achievement of structural alignment rather than stored content. In IBF, two equations govern the dynamics: a Law of Motion that drives configuration toward higher coherence, and Modification Dynamics that persistently deform the coherence landscape in response to localized discrepancies. Memory, agency, and self-correction arise from these dynamics rather than being added as separate modules. We first demonstrate the full lifecycle in a transparent two-dimensional toy model, then validate across three domains: a controlled non-stationary world, chess evaluated independently by Stockfish, and Split-CIFAR-100 with a frozen ViT encoder. Across all three, IBF achieves replay-superior retention without storing raw data. We observe near-zero forgetting on CIFAR-100 (BT = -0.004), positive backward transfer in chess (+38.5 cp), and 43% less forgetting than replay in the controlled domain. In chess, the framework achieves a mean behavioral advantage of +88.9 +/- 2.8 cp under independent evaluation, exceeding MLP and replay baselines.
[AI-15] Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models
【速读】:该论文旨在解决现有开环端到端神经运动规划方法在执行时缺乏多样性与推理阶段优化能力的问题,即大多数方法对同一工作空间在不同运行中仅生成单一路径,无法利用其开环结构进行推理时的多候选路径采样与优选。解决方案的关键在于提出Flow Motion Policy,该方法基于流匹配(flow matching)的随机生成式建模方式,学习可行路径的概率分布,从而在推理阶段实现高效的“最佳-N”(best-of-N)采样:生成多个端到端候选路径,评估其碰撞状态后选择首个无碰撞解执行,显著提升了规划成功率与效率,验证了随机生成策略在端到端运动规划与推理时优化中的有效性。
链接: https://arxiv.org/abs/2604.07084
作者: Davood Soleymanzadeh,Xiao Liang,Minghui Zheng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Open-loop end-to-end neural motion planners have recently been proposed to improve motion planning for robotic manipulators. These methods enable planning directly from sensor observations without relying on a privileged collision checker during planning. However, many existing methods generate only a single path for a given workspace across different runs, and do not leverage their open-loop structure for inference-time optimization. To address this limitation, we introduce Flow Motion Policy, an open-loop, end-to-end neural motion planner for robotic manipulators that leverages the stochastic generative formulation of flow matching methods to capture the inherent multi-modality of planning datasets. By modeling a distribution over feasible paths, Flow Motion Policy enables efficient inference-time best-of-N sampling. The method generates multiple end-to-end candidate paths, evaluates their collision status after planning, and executes the first collision-free solution. We benchmark the Flow Motion Policy against representative sampling-based and neural motion planning methods. Evaluation results demonstrate that Flow Motion Policy improves planning success and efficiency, highlighting the effectiveness of stochastic generative policies for end-to-end motion planning and inference-time optimization. Experimental evaluation videos are available at this https URL.
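摘要中的 best-of-N 推理时优化流程(采样 N 条端到端候选路径,碰撞检测后执行首条无碰撞路径)可以直接写成如下草图(sample_path 与 in_collision 为笔者假设的接口,分别对应流匹配策略采样与事后碰撞检查):

```python
def best_of_n_paths(sample_path, in_collision, n):
    """Best-of-N 推理时优化(示意):
    依次采样候选路径, 返回第一条逐点无碰撞的路径;
    n 次采样全部碰撞则返回 None。"""
    for _ in range(n):
        path = sample_path()  # 从学到的路径分布中采样一条端到端路径
        if not any(in_collision(q) for q in path):
            return path
    return None
```

该流程之所以有效,正是因为随机生成式策略能对同一工作空间给出多样化候选,使多次采样覆盖多条可行模式。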
[AI-16] EVGeoQA: Benchmarking LLMs on Dynamic Multi-Objective Geo-Spatial Exploration
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在动态地理空间环境中进行目的驱动探索的能力尚未得到充分研究的问题,尤其是现有地理空间问答(Geo-Spatial Question Answering, GSQA)基准多聚焦静态检索,无法刻画真实世界规划中涉及用户实时位置与复合约束的复杂性。其解决方案的关键在于提出一个名为EVGeoQA的新基准,该基准基于电动汽车(Electric Vehicle, EV)充电场景设计,具有位置锚定和双目标特性——每个查询明确绑定用户实时坐标,并同时整合充电需求与共址活动偏好;同时构建了GeoRover评估框架,基于工具增强型智能体架构系统性地测评LLMs在动态多目标探索中的能力。实验表明,尽管LLMs能有效利用工具完成子任务,但在长距离空间探索上存在局限,但观察到一种新兴能力:LLMs可总结历史探索轨迹以提升探索效率。
链接: https://arxiv.org/abs/2604.07070
作者: Jianfei Wu,Zhichun Wang,Zhensheng Wang,Zhiyu He
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks predominantly focus on static retrieval, failing to capture the complexity of real-world planning that involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a novel benchmark built upon Electric Vehicle (EV) charging scenarios that features a distinct location-anchored and dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to a user’s real-time coordinate and integrates the dual objectives of a charging necessity and a co-located activity preference. To systematically assess models in such complex settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture to evaluate the LLMs’ capacity for dynamic, multi-objective exploration. Our experiments reveal that while LLMs successfully utilize tools to address sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to enhance exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence. The dataset and prompts are available at this https URL.
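EVGeoQA 的查询"锚定用户实时坐标 + 双目标(充电必要性与共址活动偏好)"。一个朴素的非 LLM 基线可以按"距离减去偏好加分"的加权分数排序候选充电站(权重、字段与打分方式均为笔者假设,并非论文中的 GeoRover 框架):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """两点间球面距离(公里), 标准 haversine 公式。"""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rank_stations(user, stations, w_dist=1.0, w_pref=2.0):
    """双目标排序(示意): 距离越近越好, 共址活动偏好匹配度加分。
    user: (lat, lon); station: (lat, lon, 偏好匹配度 0~1)。分数越小越优。"""
    scored = [(w_dist * haversine_km(user[0], user[1], s[0], s[1])
               - w_pref * s[2], s) for s in stations]
    return [s for _, s in sorted(scored, key=lambda t: t[0])]
```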
[AI-17] Planning Task Shielding: Detecting and Repairing Flaws in Planning Tasks through Turning them Unsolvable
【速读】:该论文旨在解决规划任务中的缺陷检测与修复问题,即当目标规范中隐含了不应达到的错误状态时,如何通过最小修改原规划动作来使整个规划任务变得不可解(unsolvable),从而避免系统进入此类缺陷状态。解决方案的关键在于提出了一种名为 allmin 的最优算法,该算法通过最小化对原始动作的改动,确保规划任务无法达成任何包含错误状态的路径,从而实现对规划任务的“屏蔽”(shielding)。
链接: https://arxiv.org/abs/2604.07042
作者: Alberto Pozanco,Marianela Morales,Pietro Totis,Daniel Borrajo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Most research in planning focuses on generating a plan to achieve a desired set of goals. However, a goal specification can also be used to encode a property that should never hold, allowing a planner to identify a trace that would reach a flawed state. In such cases, the objective may shift to modifying the planning task to ensure that the flawed state is never reached; in other words, to make the planning task unsolvable. In this paper, we introduce planning task shielding: the problem of detecting and repairing flaws in planning tasks. We propose allmin, an optimal algorithm that solves these tasks by minimally modifying the original actions to render the planning task unsolvable. We empirically evaluate the performance of allmin in shielding planning tasks of increasing size, showing how it can effectively shield the system by making the planning task unsolvable.
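"把规划任务改成不可解"这一目标可以在删除松弛(delete relaxation)下用可达性不动点来检验。下面是笔者补充的极简草图:先判断缺陷事实是否可达,再通过移除产生该事实的动作实现一种屏蔽。注意这只是一种朴素、非最小的修复;论文的 allmin 求的是对原动作的最小修改:

```python
def reachable_facts(init, actions):
    """删除松弛下的事实可达性: 忽略删除效果, 迭代传播到不动点。
    动作表示为 (前提集合, 增加效果集合)。"""
    facts = set(init)
    changed = True
    while changed:
        changed = False
        for pre, add in actions:
            if pre <= facts and not add <= facts:
                facts |= add
                changed = True
    return facts

def shield_task(init, actions, flaw):
    """朴素屏蔽: 删除所有直接产生缺陷事实的动作。
    只要 flaw 不在初始状态中, 屏蔽后它必然不可达;
    但一般不是最小修改(论文的 allmin 保证最优)。"""
    return [a for a in actions if flaw not in a[1]]
```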
[AI-18] AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules
【速读】:该论文旨在解决机器人系统在智能组织、能力扩展与执行层面缺乏统一抽象建模的问题,现有方法或采用紧耦合的单体架构,或分解为松散协调的模块/多智能体,难以实现身份一致性与控制权的清晰界定。其解决方案的关键在于提出AEROS(Agent Execution Runtime Operating System),将机器人建模为一个持续存在的智能主体,通过可安装的具身能力模块(Embodied Capability Modules, ECMs)扩展功能;每个ECM封装可执行技能、模型和工具,而运行时策略分离机制确保执行约束与安全保障,从而实现模块化可扩展性、能力组合执行以及系统级一致性安全。
链接: https://arxiv.org/abs/2604.07039
作者: Xue Qin,Simin Luan,Cong Yang,Zhijun Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to Engineering Applications of Artificial Intelligence (EAAI). 48 pages, 5 figures, 9 tables
Abstract:Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (this http URL-style and ProgPrompt-style at 92–93%, flat pipeline at 67–73%), the policy layer blocks all invalid actions with zero false acceptances, runtime benefits generalize across tasks without task-specific tuning, and ECMs load at runtime with 100% post-swap success.
[AI-19] ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations
【速读】:该论文旨在解决神经网络(尤其是表格基础模型)在决策过程中缺乏可解释性的问题,即如何系统地探索其内部表征以识别具有概念可解释性的神经元。解决方案的关键在于提出ConceptTracer这一交互式分析工具,该工具整合了两种信息论度量——概念显著性(concept saliency)和概念选择性(concept selectivity),从而能够量化神经元对特定人类可理解概念的响应强度,进而帮助研究人员定位并理解那些编码概念级信息的神经元。
链接: https://arxiv.org/abs/2604.07019
作者: Ricardo Knauer,Andre Beinrucker,Erik Rodner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: XAI 2026 Late-Breaking Work Track
Abstract:Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision-making processes. Despite a growing interest in mechanistic interpretability, tools for systematically exploring the representations learned by neural networks in general, and tabular foundation models in particular, remain limited. In this work, we introduce ConceptTracer, an interactive application for analyzing neural representations through the lens of human-interpretable concepts. ConceptTracer integrates two information-theoretic measures that quantify concept saliency and selectivity, enabling researchers and practitioners to identify neurons that respond strongly to individual concepts. We demonstrate the utility of ConceptTracer on representations learned by TabPFN and show that our approach facilitates the discovery of interpretable neurons. Together, these capabilities provide a practical framework for investigating how neural networks like TabPFN encode concept-level information. ConceptTracer is available at this https URL.
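摘要提到两种信息论度量但未给出公式;作为同一思路的最小实现,可以用"二值化神经元激活与概念标签之间的离散互信息"来近似概念显著性(定义细节为笔者假设,并非 ConceptTracer 的原度量):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """离散互信息 I(X;Y), 单位为 nats:
    I = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )。"""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

def concept_saliency(activations, concept, threshold=0.0):
    """概念显著性(示意): 神经元"是否激发"与概念"是否出现"的互信息。
    对概念响应强的神经元得分高, 与概念无关的神经元得分趋近 0。"""
    fired = [int(a > threshold) for a in activations]
    return mutual_information(fired, [int(c) for c in concept])
```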
[AI-20] A-MBER: Affective Memory Benchmark for Emotion Recognition
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在长期交互中对用户情绪状态进行持续、准确识别的能力不足问题。现有情绪数据集多聚焦于瞬时情感判断,而记忆基准测试则集中于事实回忆或知识更新,缺乏评估模型如何利用多轮对话历史来推断当前情绪状态的能力。为此,作者提出 A-MBER(Affective Memory Benchmark for Emotion Recognition),其核心创新在于构建一个基于多轮交互轨迹的、以“锚定回合”为参照点的情绪解释评估框架,要求模型不仅能识别当前情绪,还需从历史中提取相关证据并提供可解释的推理过程。关键在于通过分阶段管道设计实现结构化中间表示,并引入多种鲁棒性设置(如模态退化和证据不足场景),从而系统性地检验模型是否具备选择性、情境敏感且 grounded 的记忆使用能力,而非仅依赖信息量的堆叠。
链接: https://arxiv.org/abs/2604.07017
作者: Deliang Wen,Ke Sun,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI assistants that interact with users over time need to interpret the user’s current emotional state in order to respond appropriately and personally. However, this capability remains insufficiently evaluated. Existing emotion datasets mainly assess local or instantaneous affect, while long-term memory benchmarks focus largely on factual recall, temporal consistency, or knowledge updating. As a result, current resources provide limited support for testing whether a model can use remembered interaction history to interpret a user’s present affective state. We introduce A-MBER, an Affective Memory Benchmark for Emotion Recognition, to evaluate this capability. A-MBER focuses on present affective interpretation grounded in remembered multi-session interaction history. Given an interaction trajectory and a designated anchor turn, a model must infer the user’s current affective state, identify historically relevant evidence, and justify its interpretation in a grounded way. The benchmark is constructed through a staged pipeline with explicit intermediate representations, including long-horizon planning, conversation generation, annotation, question construction, and final packaging. It supports judgment, retrieval, and explanation tasks, together with robustness settings such as modality degradation and insufficient-evidence conditions. Experiments compare local-context, long-context, retrieved-memory, structured-memory, and gold-evidence conditions within a unified framework. Results show that A-MBER is especially discriminative on the subsets it is designed to stress, including long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings. 
These findings suggest that memory supports affective interpretation not simply by providing more history, but by enabling more selective, grounded, and context-sensitive use of past interactions.
[AI-21] CAFP: A Post-Processing Framework for Group Fairness via Counterfactual Model Averaging
【速读】:该论文旨在解决机器学习模型在敏感领域(如信贷评分、医疗和刑事司法)中预测结果可能因受保护属性(如性别、种族等)影响而产生不公平的问题。现有公平性干预方法通常依赖数据预处理或训练过程中的算法约束,但这些方法往往需要对模型架构有完全控制权并访问受保护属性信息,在实际系统中难以实现。为此,论文提出了一种模型无关的后处理方法——反事实平均(Counterfactual Averaging for Fair Predictions, CAFP),其核心在于生成每个输入样本的反事实版本(即敏感属性被翻转),然后对原始模型在真实实例与反事实实例上的预测结果进行平均,从而消除模型输出对受保护属性的直接依赖,并有效降低预测与敏感属性之间的互信息。理论分析表明,CAFP能保证完美的群体均等性(demographic parity),并将等效机会差距(equalized odds gap)至少减少一半的平均反事实偏差,且引入的扰动可被严格界定。
链接: https://arxiv.org/abs/2604.07009
作者: Irina Arévalo,Marcos Oliva
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Ensuring fairness in machine learning predictions is a critical challenge, especially when models are deployed in sensitive domains such as credit scoring, healthcare, and criminal justice. While many fairness interventions rely on data preprocessing or algorithmic constraints during training, these approaches often require full control over the model architecture and access to protected attribute information, which may not be feasible in real-world systems. In this paper, we propose Counterfactual Averaging for Fair Predictions (CAFP), a model-agnostic post-processing method that mitigates unfair influence from protected attributes without retraining or modifying the original classifier. CAFP operates by generating counterfactual versions of each input in which the sensitive attribute is flipped, and then averaging the model’s predictions across factual and counterfactual instances. We provide a theoretical analysis of CAFP, showing that it eliminates direct dependence on the protected attribute, reduces mutual information between predictions and sensitive attributes, and provably bounds the distortion introduced relative to the original model. Under mild assumptions, we further show that CAFP achieves perfect demographic parity and reduces the equalized odds gap by at least half the average counterfactual bias.
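CAFP 的核心操作(对敏感属性翻转后的反事实输入取预测平均)可以直接写成一个模型无关的后处理函数。下面是二值敏感属性情形的草图(接口与列布局为笔者假设;平均后,仅敏感属性不同的两个样本必然得到相同预测,消除了对该属性的直接依赖):

```python
import numpy as np

def cafp_predict(predict_proba, X, sensitive_col):
    """Counterfactual Averaging(示意):
    1) 生成敏感属性翻转后的反事实输入 X_cf;
    2) 输出 = 原模型在事实与反事实输入上预测的平均。
    假设敏感属性为 X[:, sensitive_col] 中的二值 {0, 1}, 无需重训练原模型。"""
    X = np.asarray(X, dtype=float)
    X_cf = X.copy()
    X_cf[:, sensitive_col] = 1.0 - X_cf[:, sensitive_col]
    return 0.5 * (predict_proba(X) + predict_proba(X_cf))
```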
[AI-22] EmoMAS: Emotion-Aware Multi-Agent System for High-Stakes Edge-Deployable Negotiation with Bayesian Orchestration
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在隐私敏感的边缘设备场景(如移动助手或救援机器人)中部署时面临的高计算成本和隐私风险问题,同时克服小型语言模型(Small Language Models, SLMs)在高风险谈判中因情感动态复杂性而表现不足的局限。其解决方案的关键在于提出EmoMAS——一个基于贝叶斯(Bayesian)的多智能体框架,通过协调三个专业化智能体(博弈论、强化学习与心理一致性模型)实现从反应式到战略性的决策机制转变;该框架利用贝叶斯调度器融合实时洞察,并基于谈判反馈持续更新代理可靠性,从而在无需预训练的情况下实现在线策略学习,显著提升谈判绩效并保障伦理行为。
链接: https://arxiv.org/abs/2604.07003
作者: Yunbo Long,Yunhan Liu,Liming Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have been widely used for automated negotiation, but their high computational cost and privacy risks limit deployment in privacy-sensitive, on-device settings such as mobile assistants or rescue robots. Small language models (SLMs) offer a viable alternative, yet struggle with the complex emotional dynamics of high-stakes negotiation. We introduce EmoMAS, a Bayesian multi-agent framework that transforms emotional decision-making from reactive to strategic. EmoMAS leverages a Bayesian orchestrator to coordinate three specialized agents: game-theoretic, reinforcement learning, and psychological coherence models. The system fuses their real-time insights to optimize emotional state transitions while continuously updating agent reliability based on negotiation feedback. This mixture-of-agents architecture enables online strategy learning without pre-training. We further introduce four high-stakes, edge-deployable negotiation benchmarks across debt, healthcare, emergency response, and educational domains. Through extensive agent-to-agent simulations across all benchmarks, both SLMs and LLMs equipped with EmoMAS consistently surpass all baseline models in negotiation performance while balancing ethical behavior. These results show that strategic emotional intelligence is also the key driver of negotiation success. By treating emotional expression as a strategic variable within a Bayesian multi-agent optimization framework, EmoMAS establishes a new paradigm for effective, private, and adaptive negotiation AI suitable for high-stakes edge deployment.
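论文未给出"基于谈判反馈持续更新代理可靠性"的具体公式;下面用 Beta-Bernoulli 计数更新加上按可靠性期望加权的融合,对这一机制做一个纯示意性的建模,所有函数、参数与数值均为假设:

```python
import numpy as np

def update_reliability(alpha, beta, success):
    """Beta-Bernoulli 式的可靠性更新示意:每轮谈判反馈后累加某个
    智能体的成功/失败计数(并非论文的确切更新规则)。"""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

def fuse(predictions, reliabilities):
    """按各智能体可靠性期望值 alpha/(alpha+beta) 加权融合其建议(示意)。"""
    w = np.array([a / (a + b) for a, b in reliabilities])
    w = w / w.sum()                       # 归一化权重
    return float(w @ np.array(predictions))

rel = [(1, 1), (1, 1), (1, 1)]            # 三个智能体的均匀 Beta 先验
rel[0] = update_reliability(*rel[0], success=True)   # 第 0 个智能体获得一次正反馈
score = fuse([0.9, 0.5, 0.2], rel)        # 融合后的情绪策略打分
print(rel[0], round(score, 3))
```

这种"混合代理 + 在线可靠性更新"的组合,正是摘要中 mixture-of-agents 架构无需预训练即可在线学习策略的直观来源。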
[AI-23] What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning ACL2026
【速读】:该论文旨在解决现有图形用户界面(GUI)推理任务中因缺乏对UI元素的全面理解而导致的决策不可解释性和任务失败问题。当前方法多依赖于直接基于屏幕的决策,未能充分建模UI元素的定位、语义功能及实际使用场景。其解决方案的关键在于提出一种名为“UI-in-the-Loop”(UILoop)的新范式,该范式将GUI推理建模为一个循环的“屏幕-UI元素-动作”过程,并借助多模态大语言模型(MLLMs)显式学习关键UI元素的定位、语义功能与实践用途,从而实现精准的元素识别和可解释的推理。此外,作者还构建了一个包含26K样本的基准数据集(UI Comprehension-Bench),用于系统评估模型在UI元素理解上的能力,实验表明UILoop在GUI理解与推理任务中均达到最优性能。
链接: https://arxiv.org/abs/2604.06995
作者: Songze Li,Xiaoke Guo,Tianqi Liu,Biao Yi,Zhaoyan Gong,Zhiqiang Liu,Huajun Chen,Wen Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings
Abstract:Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods’ mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.
[AI-24] Stress Estimation in Elderly Oncology Patients Using Visual Wearable Representations and Multi-Instance Learning
【速读】:该论文旨在解决心理压力(psychological stress)在心脏肿瘤学(cardio-oncology)中缺乏连续监测的问题,传统上仅依赖患者自评结果测量工具(PROMs),难以实现动态评估。为应对这一挑战,研究提出了一种基于多模态可穿戴设备数据的弱监督学习方法:利用智能手表(物理活动与睡眠)和胸贴式心电图(ECG)传感器获取生理信号,将其转化为异构视觉表征,并采用轻量级预训练专家混合模型(Tiny-BioMoE)提取192维嵌入向量,通过注意力机制的多实例学习(attention-based multiple instance learning, MIL)整合特征以预测三个月(M3)和六个月(M6)时的感知压力量表(PSS)得分。其关键创新在于将单个PSS标签映射到大量未标注时间窗口的弱监督框架下,实现了无需人工标注即可从连续生理数据中估计心理压力水平,为未来整合心理状态至心血管毒性监测提供了可行路径。
链接: https://arxiv.org/abs/2604.06990
作者: Ioannis Kyprakis,Vasileios Skaramagkas,Georgia Karanasiou,Vasilis Bouratzis,Andri Papakonstantinou,Dimitar Stefanovski,Kalliopi Keramida,Aristofania Simatou,Ketti Mazzocco,Anastasia Constantinidou,Konstantinos Marias,Dimitrios I. Fotiadis,Manolis Tsiknakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures, under review for IEEE EMBC 2026
Abstract:Psychological stress is clinically relevant in cardio-oncology, yet it is typically assessed only through patient-reported outcome measures (PROMs) and is rarely integrated into continuous cardiotoxicity surveillance. We estimate perceived stress in an elderly, multicenter breast cancer cohort (CARDIOCARE) using multimodal wearable data from a smartwatch (physical activity and sleep) and a chest-worn ECG sensor. Wearable streams are transformed into heterogeneous visual representations, yielding a weakly supervised setting in which a single Perceived Stress Scale (PSS) score corresponds to many unlabeled windows. A lightweight pretrained mixture-of-experts backbone (Tiny-BioMoE) embeds each representation into 192-dimensional vectors, which are aggregated via attention-based multiple instance learning (MIL) to predict PSS at month 3 (M3) and month 6 (M6). Under leave-one-subject-out (LOSO) evaluation, predictions showed moderate agreement with questionnaire scores (M3: R^2=0.24, Pearson r=0.42, Spearman rho=0.48; M6: R^2=0.28, Pearson r=0.49, Spearman rho=0.52), with global RMSE/MAE of 6.62/6.07 at M3 and 6.13/5.54 at M6.
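论文中基于注意力的多实例池化(将多个 192 维窗口嵌入加权聚合为包级表示,再接回归头预测 PSS 分数)可按 Ilse 等人提出的门控前的基本形式示意如下;注意力参数此处随机初始化,仅用于演示形状与归一化性质,并非论文代码:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_mil_pool(H, w, V):
    """注意力 MIL 池化示意。
    H: (n_instances, d) 实例嵌入;V: (d_att, d) 与 w: (d_att,) 为注意力参数。
    返回 (d,) 的包级表示与各实例的注意力权重。"""
    scores = w @ np.tanh(V @ H.T)        # 每个实例一个注意力打分,形状 (n_instances,)
    alpha = softmax(scores)              # 注意力权重,非负且和为 1
    return alpha @ H, alpha              # 加权聚合得到包表示

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 192))            # 5 个未标注窗口,每个 192 维(与论文嵌入维度一致)
V = rng.normal(size=(64, 192)) * 0.1     # 64 为假设的注意力隐藏维度
w = rng.normal(size=64)
bag, alpha = attention_mil_pool(H, w, V)
print(bag.shape, round(float(alpha.sum()), 6))
```

弱监督设定下,单个 PSS 标签只作用于整个包,注意力权重则隐式指出哪些窗口对预测贡献最大。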
[AI-25] Frailty Estimation in Elderly Oncology Patients Using Multimodal Wearable Data and Multi-Instance Learning
【速读】:该论文旨在解决老年乳腺癌患者在两次临床随访之间难以实时评估虚弱(frailty)相关功能变化的问题,从而提升治疗耐受性和预后管理。其解决方案的关键在于提出了一种基于注意力机制的多实例学习(attention-based multiple instance learning, MIL)框架,能够融合来自智能手表的自由生活状态下的物理活动与睡眠特征以及胸带采集的心电图衍生心率变异性(heart rate variability, HRV)特征,在存在真实世界缺失数据和弱监督条件下,对患者功能状态的离散变化类别(恶化、稳定、改善)进行预测。该方法通过模态特定的多层感知机(MLP)编码器和嵌入维度为128的表示学习,有效聚合长度可变且部分缺失的纵向可穿戴设备数据,实现了对FACIT-F评分和握力变化的准确估计。
链接: https://arxiv.org/abs/2604.06985
作者: Ioannis Kyprakis,Vasileios Skaramagkas,Georgia Karanasiou,Lampros Lakkas,Andri Papakonstantinou,Domen Ribnikar,Kalliopi Keramida,Dorothea Tsekoura,Ketti Mazzocco,Anastasia Constantinidou,Konstantinos Marias,Dimitrios I. Fotiadis,Manolis Tsiknakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure, under review for IEEE EMBC 2026
Abstract:Frailty and functional decline strongly influence treatment tolerance and outcomes in older patients with cancer, yet assessment is typically limited to infrequent clinic visits. We propose a multimodal wearable framework to estimate frailty-related functional change between visits in elderly breast cancer patients enrolled in the multicenter CARDIOCARE study. Free-living smartwatch physical activity and sleep features are combined with ECG-derived heart rate variability (HRV) features from a chest strap and organized into patient-horizon bags aligned to month 3 (M3) and month 6 (M6) follow-ups. Our innovation is an attention-based multiple instance learning (MIL) formulation that fuses irregular, multimodal wearable instances under real-world missingness and weak supervision. An attention-based MIL model with modality-specific multilayer perceptron (MLP) encoders with embedding dimension 128 aggregates variable-length and partially missing longitudinal instances to predict discretized change-from-baseline classes (worsened, stable, improved) for FACIT-F and handgrip strength. Under subject-independent leave-one-subject-out (LOSO) evaluation, the full multimodal model achieved balanced accuracy/F1 of 0.68 +/- 0.08/0.67 +/- 0.09 at M3 and 0.70 +/- 0.10/0.69 +/- 0.08 at M6 for handgrip, and 0.59 +/- 0.04/0.58 +/- 0.06 at M3 and 0.64 +/- 0.05/0.63 +/- 0.07 at M6 for FACIT-F. Ablation results indicated that smartwatch activity and sleep provide the strongest predictive information for frailty-related functional changes, while HRV contributes complementary information when fused with smartwatch streams.
[AI-26] An empirical study of LoRA-based fine-tuning of large language models for automated test case generation
【速读】:该论文旨在解决从自然语言需求中自动生成功能性测试用例的问题,这一任务在软件工程中因需求表述的模糊性和对结构化、可执行测试产物的要求而具有挑战性。其解决方案的关键在于采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术中的LoRA(Low-Rank Adaptation),通过系统性地调整LoRA超参数(如秩、缩放因子和dropout率),显著提升开源大语言模型(LLM)在测试用例生成任务上的性能。实验表明,经过LoRA微调的8B规模开源模型(如Ministral-8B)性能可媲美预微调的GPT-4.1模型,从而证明了在本地部署、成本可控的前提下,结合合理微调策略的开源模型能够成为高性能测试生成系统的可行替代方案。
链接: https://arxiv.org/abs/2604.06946
作者: Milad Moradi,Ke Yan,David Colwell,Rhona Asgari
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.
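论文系统考察的 LoRA 超参数(秩 r、缩放因子 alpha,另有 dropout,此处从略)在前向计算中的作用可用几行 NumPy 勾勒:冻结原权重 W0,只训练低秩增量 A、B,缩放系数为 alpha/r。以下为示意实现,维度均为玩具设置:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha, r):
    """LoRA 前向示意:h = x @ (W0 + (alpha/r) * A @ B)。
    W0 冻结;A (d_in, r) 与 B (r, d_out) 为可训练低秩矩阵,
    参数量仅为 r * (d_in + d_out)。"""
    delta = (alpha / r) * (A @ B)        # 低秩权重增量
    return x @ (W0 + delta)

d_in, d_out, r, alpha = 16, 16, 4, 8
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r)) * 0.01
B = np.zeros((r, d_out))                 # B 零初始化:训练起点 delta = 0
x = rng.normal(size=(2, d_in))
h = lora_forward(x, W0, A, B, alpha, r)
print(h.shape, bool(np.allclose(h, x @ W0)))
```

B 零初始化保证微调开始时模型行为与原模型完全一致,之后 r 与 alpha 共同决定增量的容量与幅度,这正是论文网格搜索的两个关键维度。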
[AI-27] A First Guess is Rarely the Final Answer: Learning to Search in the Travelling Salesperson Problem
【速读】:该论文旨在解决当前神经求解器在处理旅行商问题(Traveling Salesperson Problem, TSP)时存在的设计不匹配问题,即现有方法多沿用单解输出的架构与训练策略,未能充分适配局部搜索(local search)机制的本质特性,导致学习到的改进策略在鲁棒性和可扩展性上表现不足。其解决方案的关键在于提出NICO-TSP(Neural Improvement for Combinatorial Optimization),该框架围绕2-opt局部搜索操作重构了整个模型设计:通过精确地将当前路径表示为n个边标记(edge tokens),直接评分2-opt移动而无需路径位置编码,并采用两阶段训练策略——先通过模仿学习获取短视野最优改进轨迹,再利用无评分子的群体强化学习进行长程优化,从而实现更高效、稳定且泛化能力强的改进性能。
链接: https://arxiv.org/abs/2604.06940
作者: Andoni Irazusta Garmendia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Most neural solvers for the Traveling Salesperson Problem (TSP) are trained to output a single solution, even though practitioners rarely stop there: at test time, they routinely spend extra compute on sampling or post-hoc search. This raises a natural question: can the search procedure itself be learned? Neural improvement methods take this perspective by learning a policy that applies local modifications to a candidate solution, accumulating gains over an improvement trajectory. Yet learned improvement for TSP remains comparatively immature, with existing methods still falling short of robust, scalable performance. We argue that a key reason is design mismatch: many approaches reuse state representations, architectural choices, and training recipes inherited from single-solution methods, rather than being built around the mechanics of local search. This mismatch motivates NICO-TSP (Neural Improvement for Combinatorial Optimization): a 2-opt improvement framework for TSP. NICO-TSP represents the current tour with exactly n edge tokens aligned with the neighborhood operator, scores 2-opt moves directly without tour positional encodings, and trains via a two-stage procedure: imitation learning to short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. Under compute-matched evaluations that measure improvement as a function of both search steps and wall-clock time, NICO-TSP delivers consistently stronger and markedly more step-efficient improvement than prior learned and heuristic search baselines, generalizes far more reliably to larger out-of-distribution instances, and serves both as a competitive replacement for classical local search and as a powerful test-time refinement module for constructive solvers.
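NICO-TSP 整个框架所围绕的 2-opt 邻域操作本身只需几行代码:一次移动等价于反转路径的一个区间(即删除两条边并交叉重连)。下面的玩具实例(单位正方形四个顶点)展示一次 2-opt 如何消除交叉路径;这只是对邻域算子的说明,不涉及论文的神经打分部分:

```python
import numpy as np

def tour_length(tour, D):
    """按距离矩阵 D 计算闭合路径长度。"""
    return sum(D[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt_move(tour, i, k):
    """执行一次 2-opt:反转 tour[i:k+1] 区间。
    NICO-TSP 学习的策略即是在每步为所有候选 (i, k) 移动打分。"""
    new = tour.copy()
    new[i:k + 1] = new[i:k + 1][::-1]
    return new

pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)   # 正方形顶点
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # 两两欧氏距离
tour = [0, 2, 1, 3]                       # 交叉的次优路径
improved = two_opt_move(tour, 1, 2)       # 反转得到 [0, 1, 2, 3]
print(tour_length(tour, D), tour_length(improved, D))
```

路径 [0, 2, 1, 3] 含两条交叉的对角边,一次区间反转即得到周长为 4 的最优环游;学习到的改进策略正是在这样的移动序列上累积收益。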
[AI-28] SentinelSphere: Integrating AI-Powered Real-Time Threat Detection with Cybersecurity Awareness Training
【速读】:该论文旨在解决网络安全领域面临的两大核心问题:全球合格安全从业人员的短缺,以及由人为因素导致的安全事件占绝大多数的现状。解决方案的关键在于提出一个名为SentinelSphere的统一平台,该平台通过人工智能驱动,将基于机器学习的威胁检测模块与由大型语言模型(Large Language Model, LLM)赋能的安全培训模块相结合。其中,检测模块采用增强型深度神经网络(Enhanced Deep Neural Network, DNN),在CIC-IDS2017和CIC-DDoS2019数据集上训练,并引入新型HTTP层特征工程以捕捉应用层攻击签名;教育模块则部署了经过微调的Phi-4量化版本(Q4_K_M),可在仅需16 GB内存且无需专用GPU的通用硬件上运行,从而实现轻量化、可扩展的智能安全教学。实验与验证工作表明,该方案在降低误报率的同时保持高召回率,并通过直观的Traffic Light可视化系统和对话式AI助手显著提升了非技术背景用户的使用效果,实现了对技术和人为因素双重漏洞的协同治理。
链接: https://arxiv.org/abs/2604.06900
作者: Nikolaos D. Tantaroudas,Ilias Karachalios,Andrew J. McCracken
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注: 21
Abstract:The field of cybersecurity is confronted with two interrelated challenges: a worldwide deficit of qualified practitioners and ongoing human-factor weaknesses that account for the bulk of security incidents. To tackle these issues, we present SentinelSphere, a platform driven by artificial intelligence that unifies machine learning-based threat identification with security training powered by a Large Language Model (LLM). The detection module uses an Enhanced Deep Neural Network (DNN) trained on the CIC-IDS2017 and CIC-DDoS2019 benchmark datasets, enriched with novel HTTP-layer feature engineering that captures application level attack signatures. For the educational component, we deploy a quantised variant of Phi-4 model (Q4_K_M), fine-tuned for the cybersecurity domain, enabling deployment on commodity hardware requiring only 16 GB of RAM without dedicated GPU resources. Experimental results show that the Enhanced DNN attains high detection accuracy while substantially lowering false positives relative to baseline models, and maintains strong recall across critical attack categories such as DDoS, brute force, and web-based exploits. Validation workshops involving industry professionals and university students confirmed that the Traffic Light visualisation system and conversational AI assistant are both intuitive and effective for users without technical backgrounds. SentinelSphere illustrates that coupling intelligent threat detection with adaptive, LLM-driven security education can meaningfully address both technical and human-factor cybersecurity vulnerabilities within a single, cohesive framework.
[AI-29] Explaining Neural Networks in Preference Learning: a Post-hoc Inductive Logic Programming Approach
【速读】:该论文旨在解决如何在高维特征空间中对黑盒模型(如神经网络,Neural Networks, NN)进行高效且高保真度的近似问题,特别是在用户偏好学习场景下。其核心挑战在于保持目标模型的预测准确性的同时,避免计算复杂度随维度增加而显著上升。解决方案的关键在于引入基于弱约束(weak constraints)的归纳逻辑编程方法——ILASP(Inductive Learning of Answer Set Programs),通过训练ILASP模型来逼近神经网络的行为;此外,为应对高维数据带来的计算负担,作者提出了一种预处理步骤,利用主成分分析(Principal Component Analysis, PCA)降低数据维度,从而在保障解释透明性的同时提升近似效率。
链接: https://arxiv.org/abs/2604.06838
作者: Daniele Fossemò,Filippo Mignosi,Giuseppe Placidi,Luca Raggioli,Matteo Spezialetti,Fabio Aurelio D’Asaro
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under consideration for publication in Theory and Practice of Logic Programming (TPLP)
Abstract:In this paper, we propose using Learning from Answer Sets to approximate black-box models, such as Neural Networks (NN), in the specific case of learning user preferences. We specifically explore the use of ILASP (Inductive Learning of Answer Set Programs) to approximate preference learning systems through weak constraints. We have created a dataset on user preferences over a set of recipes, which is used to train the NNs that we aim to approximate with ILASP. Our experiments investigate ILASP both as a global and a local approximator of the NNs. These experiments address the challenge of approximating NNs working on increasingly high-dimensional feature spaces while achieving appropriate fidelity on the target model and limiting the increase in computational time. To handle this challenge, we propose a preprocessing step that exploits Principal Component Analysis to reduce the dataset’s dimensionality while keeping our explanations transparent. Under consideration for publication in Theory and Practice of Logic Programming (TPLP).
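论文预处理步骤所用的 PCA 降维可用 SVD 写成如下草图(成分数为示例值,非论文设定):中心化后做奇异值分解,再投影到前 k 个主成分,从而在送入 ILASP 前压缩特征维度:

```python
import numpy as np

def pca_reduce(X, n_components):
    """基于 SVD 的 PCA 降维示意:返回前 n_components 个主成分上的投影。"""
    Xc = X - X.mean(axis=0)                  # 中心化
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # (n_samples, n_components)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))               # 100 个样本、20 维特征的玩具数据
Z = pca_reduce(X, n_components=5)
print(Z.shape)
```

降维后特征是原特征的线性组合,要保持论文强调的"解释透明",实际使用时还需把每个主成分回溯到原始特征的载荷上。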
[AI-30] owards Privacy-Preserving Large Language Model: Text-free Inference Through Alignment and Adaptation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)服务中因用户提交原始文本而引发的隐私泄露问题,尤其是在敏感信息(如个人、医疗或法律数据)暴露风险较高的场景下。现有防御机制虽能缓解隐私风险,但常伴随显著的计算开销并损害模型性能,难以兼顾隐私保护与效率。解决方案的关键在于提出隐私保护微调(Privacy-Preserving Fine-Tuning, PPFT),其核心是通过双阶段训练:第一阶段联合训练客户端编码器与服务器端投影模块及LLM,使服务器仅基于k-池化提示嵌入(k-pooled prompt embeddings)进行条件推理,无需接收原始文本;第二阶段在私有领域数据上使用噪声注入嵌入对投影模块和LLM进行微调,实现模型适应性提升的同时不暴露明文提示,并且无需访问解码器内部参数。实验表明,PPFT在保障隐私的前提下实现了与无噪声基准相当的模型性能,有效平衡了隐私保护与模型效用。
链接: https://arxiv.org/abs/2604.06831
作者: Jeongho Yoon,Chanhee Park,Yongchan Chun,Hyeonseok Moon,Heuiseok Lim
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Current LLM-based services typically require users to submit raw text regardless of its sensitivity. While intuitive, such practice introduces substantial privacy risks, as unauthorized access may expose personal, medical, or legal information. Although prior defenses strived to mitigate these risks, they often incur substantial computational overhead and degrade model performance. To overcome this privacy-efficiency trade-off, we introduce Privacy-Preserving Fine-Tuning (PPFT), a novel training pipeline that eliminates the need for transmitting raw prompt text while maintaining a favorable balance between privacy preservation and model utility for both clients and service providers. Our approach operates in two stages: first, we train a client-side encoder together with a server-side projection module and LLM, enabling the server to condition on k-pooled prompt embeddings instead of raw text; second, we fine-tune the projection module and LLM on private, domain-specific data using noise-injected embeddings, allowing effective adaptation without exposing plain text prompts and requiring access to the decoder’s internal parameters. Extensive experiments on domain-specific and general benchmarks demonstrate that PPFT achieves a striking balance between privacy and utility, maintaining competitive performance with minimal degradation compared to noise-free upper bounds.
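PPFT 客户端的 k-池化提示嵌入与第二阶段的噪声注入思路可示意如下;分段平均的池化方式与高斯噪声的具体形式均为本文假设,论文实现可能不同:

```python
import numpy as np

def k_pool_embed(token_embs, k):
    """把 token 嵌入序列按 k 段做平均池化,得到 k 个提示嵌入;
    客户端仅上传这些池化嵌入,而非原始文本(示意版本)。"""
    chunks = np.array_split(token_embs, k)
    return np.stack([c.mean(axis=0) for c in chunks])

def add_noise(pooled, sigma, rng):
    """第二阶段微调时注入高斯噪声,进一步降低嵌入被还原为明文的风险;
    sigma 为假设的噪声尺度。"""
    return pooled + rng.normal(scale=sigma, size=pooled.shape)

rng = np.random.default_rng(4)
token_embs = rng.normal(size=(37, 64))   # 37 个 token、64 维嵌入的玩具提示
pooled = k_pool_embed(token_embs, k=8)
noisy = add_noise(pooled, sigma=0.1, rng=rng)
print(pooled.shape, noisy.shape)
```

服务器端的投影模块与 LLM 仅以这些(带噪)池化嵌入为条件进行推理,这正是论文"无需传输原始文本"设定的核心。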
[AI-31] Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM -Generated Disinformation
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)作为低成本替代方案被广泛用于评估生成式 AI(Generative AI)内容的说服力,但其是否能准确反映人类读者的真实反应尚不明确。为应对这一问题,作者将评估任务重新定义为代理有效性(proxy-validity)问题,并通过系统性审计 LLM 判官与人类读者响应之间的对齐程度来检验其有效性。解决方案的关键在于构建了一个包含 290 篇对齐文章、2,043 对人类评分和八个前沿判官输出的数据集,从整体评分、项目级排序以及文本信号依赖三个维度量化 LLM 判官与人类读者的差异,结果揭示了判官在评分倾向、排序一致性及文本特征权重上的显著偏差,从而表明 LLM 判官虽内部一致性高,但并非人类读者响应的有效代理。
链接: https://arxiv.org/abs/2604.06820
作者: Zonghuan Xu,Xiang Zheng,Yutao Wu,Xingjun Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though whether they faithfully track reader responses remains unclear. We recast evaluation in this setting as a proxy-validity problem and audit LLM judges against human reader responses. Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judges, we examine judge–human alignment in terms of overall scoring, item-level ordering, and signal dependence. We find persistent judge–human gaps throughout. Relative to humans, judges are typically harsher, recover item-level human rankings only weakly, and rely on different textual signals, placing more weight on logical rigour while penalizing emotional intensity more strongly. At the same time, judges agree far more with one another than with human readers. These results suggest that LLM judges form a coherent evaluative group that is much more aligned internally than it is with human readers, indicating that internal agreement is not evidence of validity as a proxy for reader response.
[AI-32] OmniTabBench: Mapping the Empirical Frontiers of GBDTs Neural Networks and Foundation Models for Tabular Data at Scale
【速读】:该论文旨在解决现有tabular数据任务评估中基准数据集规模不足(通常少于100个)以及缺乏统一、无偏的模型比较框架的问题,从而无法准确判断不同模型架构(如树模型、深度神经网络及基础模型)在各类场景下的相对优势。其解决方案的关键在于构建了目前最大的tabular基准OmniTabBench,包含3030个来自多源、按行业分类的数据集,并通过大规模实证评估验证了不存在单一最优模型;进一步采用解耦元特征分析(decoupled metafeature analysis),系统性地考察数据规模、特征类型、偏度与峰度等独立属性对不同模型类别性能的影响,从而提供比以往综合指标研究更清晰、更具可操作性的模型选择依据。
链接: https://arxiv.org/abs/2604.06814
作者: Dihong Jiang,Ruoqi Cao,Zhiyuan Dang,Li Huang,Qingsong Zhang,Zhiyu Wang,Shihao Piao,Shenggao Zhu,Jianlong Chang,Zhouchen Lin,Qi Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While traditional tree-based ensemble methods have long dominated tabular tasks, deep neural networks and emerging foundation models have challenged this primacy, yet no consensus exists on a universally superior paradigm. Existing benchmarks typically contain fewer than 100 datasets, raising concerns about evaluation sufficiency and potential selection biases. To address these limitations, we introduce OmniTabBench, the largest tabular benchmark to date, comprising 3030 datasets spanning diverse tasks that are comprehensively collected from diverse sources and categorized by industry using large language models. We conduct an unprecedented large-scale empirical evaluation of state-of-the-art models from all model families on OmniTabBench, confirming the absence of a dominant winner. Furthermore, through a decoupled metafeature analysis, which examines individual properties such as dataset size, feature types, feature and target skewness/kurtosis, we elucidate conditions favoring specific model categories, providing clearer, more actionable guidance than prior compound-metric studies.
[AI-33] SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
【速读】:该论文旨在解决技能驱动型智能体系统(Skill-based Agent Systems)中存在的新型安全威胁问题,即在不修改模型参数或训练数据的前提下,通过攻击技能实现模块植入后门的潜在风险。传统安全研究多聚焦于模型参数或数据层面的攻击,而忽略了技能组合执行过程中可能被利用的攻击面。解决方案的关键在于提出SkillTrojan——一种针对技能实现本身的后门攻击方法:它将恶意逻辑嵌入看似正常的技能中,并借助标准技能组合机制重构并执行攻击者指定的加密载荷;攻击通过预设触发器激活,且能自动从任意技能模板合成带毒技能,从而实现跨技能生态系统的规模化传播。实验表明,该攻击在保持良性任务性能的同时具备高成功率(如EHR SQL任务中达到97.2%攻击成功率),揭示了当前技能架构中的关键盲点。
链接: https://arxiv.org/abs/2604.06811
作者: Yunhao Feng,Yifan Ding,Yingshui Tan,Boren Zheng,Yanming Guo,Xiaolong Li,Kun Zhai,Yishan Li,Wenke Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.
[AI-34] Riemann-Bench: A Benchmark for Moonshot Mathematics
【速读】:该论文旨在解决当前人工智能系统在数学推理能力上存在“表面卓越、实质局限”的问题,即现有模型虽能在国际数学奥林匹克(International Mathematical Olympiad, IMO)级别竞赛中达到顶尖水平,但其能力难以迁移至真正意义上的研究级数学问题。这类问题通常涉及更广泛的理论背景、复杂的构造性证明和长期探索性的思考,而非依赖技巧性的解题策略。解决方案的关键在于构建一个名为 Riemann-Bench 的私有基准测试集,包含由顶尖数学学者(包括常春藤联盟教授、研究生及IMO金牌得主)精心设计的25道研究级数学问题,每道题均需作者独立花费数周时间完成,并通过双盲验证机制确保题目难度与原创性;同时,评估过程模拟真实科研场景,允许模型使用编程工具、网络搜索及开放式推理,采用100次独立运行的无偏统计估计器量化性能。结果表明,所有前沿模型在该基准上的得分均低于10%,揭示了从竞赛数学到科研级数学推理之间存在的显著能力鸿沟。
链接: https://arxiv.org/abs/2604.06802
作者: Suhaas Garre,Erik Knutsen,Sushant Mehta,Edwin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
[AI-35] MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization ACL2026
【速读】:该论文旨在解决基于混合专家(Mixture-of-Experts, MoE)的大语言模型(Large Language Models, LLMs)在采用权重二值化(Weight Binarization)时面临的三大核心挑战:跨专家冗余、任务无关的重要性估计以及量化引起的路由偏移。为此,作者提出首个专为MoE架构设计的二值化框架MoBiE,其关键创新在于三点:1)通过联合奇异值分解(Joint SVD Decomposition)降低跨专家冗余;2)将全局损失梯度融入局部海森矩阵(Hessian Metrics),提升权重重要性估计精度;3)引入基于输入零空间(Input Null Space)的误差约束以缓解路由畸变。该方案在不增加存储开销的前提下实现了效率与性能的平衡,实验表明其在多个MoE-LLM和基准测试中显著优于现有二值化方法。
链接: https://arxiv.org/abs/2604.06798
作者: Zhixiong Zhao,Zukang Xu,Zhixuan Chen,Dawei Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 Findings
Abstract:Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at this https URL.
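作为背景,权重二值化的基本原语(W ≈ α·sign(W),缩放系数 α 按行取绝对值均值,即 BWN 式最小二乘最优缩放)可写作下例;MoBiE 在此之上叠加的联合 SVD、Hessian 加权与路由约束此处从略:

```python
import numpy as np

def binarize_weights(W):
    """按行二值化示意:返回重构权重 alpha * sign(W)、符号矩阵与缩放系数。
    alpha = mean(|W|) 是逐行最小化 ||W - alpha*B||_F 的闭式解。"""
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    B = np.sign(W)
    B[B == 0] = 1                        # 规避符号恰为 0 的元素
    return alpha * B, B, alpha

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 32))             # 玩具专家权重矩阵
W_hat, B, alpha = binarize_weights(W)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(sorted(np.unique(B).tolist()), round(float(err), 3))
```

符号矩阵每元素仅占 1 bit,外加每行一个浮点缩放系数,这就是二值化带来极端压缩率的原因;MoBiE 要解决的正是这一粗粒度近似在 MoE 场景下引发的冗余、重要性与路由问题。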
[AI-36] Instance-Adaptive Parametrization for Amortized Variational Inference
【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)中因摊销变分推断(amortized variational inference)导致的“摊销差距”(amortization gap)问题,即共享参数化结构限制了后验近似精度,从而影响生成模型性能。解决方案的关键在于提出实例自适应变分自编码器(Instance-Adaptive VAE, IA-VAE),其通过一个超网络(hypernetwork)动态生成输入相关的参数调制,使编码器能够根据每个输入实例进行自适应调整,同时保持单次前向传播的计算效率。这种基于实例特定参数调制的设计显著提升了后验近似的准确性,并在合成数据和标准图像基准上均实现了优于传统VAE的证据下界(ELBO)表现,验证了增强推理模型参数灵活性对缓解摊销子最优性的重要作用。
链接: https://arxiv.org/abs/2604.06796
作者: Andrea Pollastro,Andrea Apicella,Francesco Isgrò,Roberto Prevete
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Latent variable models, including variational autoencoders (VAE), remain a central tool in modern deep generative modeling due to their scalability and a well-founded probabilistic formulation. These models rely on amortized variational inference to enable efficient posterior approximation, but this efficiency comes at the cost of a shared parametrization, giving rise to the amortization gap. We propose the instance-adaptive variational autoencoder (IA-VAE), an amortized variational inference framework in which a hypernetwork generates input-dependent modulations of a shared encoder. This enables input-specific adaptation of the inference model while preserving the efficiency of a single forward pass. By leveraging instance-specific parameter modulations, the proposed approach can achieve performance comparable to standard encoders with substantially fewer parameters, indicating a more efficient use of model capacity. Experiments on synthetic data, where the true posterior is known, show that IA-VAE yields more accurate posterior approximations and reduces the amortization gap. Similarly, on standard image benchmarks, IA-VAE consistently improves held-out ELBO over baseline VAEs, with statistically significant gains across multiple runs. These results suggest that increasing the flexibility of the inference parametrization through instance-adaptive modulation is a key factor in mitigating amortization-induced suboptimality in deep generative models.
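超网络生成输入相关调制的思想可用 FiLM 式的逐元素缩放做一个最小示意(调制的具体形式为本文假设,并非论文公式):共享编码器权重不变,由超网络依据输入生成缩放因子,在单次前向内完成实例自适应:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def ia_encoder(x, W_enc, hyper_W):
    """实例自适应编码示意:超网络 hyper_W 由输入 x 生成逐元素缩放
    gamma,调制共享编码器 W_enc 的隐藏表示(FiLM 式假设形式)。"""
    gamma = 1 + np.tanh(hyper_W @ x)          # 输入相关的缩放因子,取值 (0, 2)
    h = relu(W_enc @ x) * gamma               # 调制后的隐藏表示
    return h

rng = np.random.default_rng(5)
W_enc = rng.normal(size=(16, 8))              # 共享编码器权重(玩具维度)
hyper_W = rng.normal(size=(16, 8)) * 0.1      # 超网络权重
x1, x2 = rng.normal(size=8), rng.normal(size=8)
h1, h2 = ia_encoder(x1, W_enc, hyper_W), ia_encoder(x2, W_enc, hyper_W)
print(h1.shape)
```

不同输入得到不同的 gamma,等价于每个实例拥有一套"微调后的"编码器参数,却不增加前向传播次数,这正是摘要中缓解摊销差距的直观机制。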
[AI-37] Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
【速读】:该论文旨在解决现有软件文档生成评估基准的两大局限性:一是缺乏对整个代码库(repository-level)的全面评估,二是依赖不可靠的评价策略(如LLM-as-a-judge),其标准模糊且缺乏对仓库级知识的理解。解决方案的关键在于提出SWD-Bench,一个基于功能驱动问答(function-driven Question Answering, QA)的新型基准,通过评估大语言模型(Large Language Models, LLMs)是否能仅凭文档理解并实现特定功能来间接衡量文档质量。该方法构建了三个相互关联的QA任务:功能检测(Functionality Detection)、功能定位(Functionality Localization)和功能完成度(Functionality Completion),并通过挖掘高质量Pull Requests并补充仓库级上下文构建包含4,170个条目的数据集,从而更真实、系统地评估文档生成效果。
链接: https://arxiv.org/abs/2604.06793
作者: Xinchen Wang,Ruida Hu,Cuiyun Gao,Pengfei Gao,Chao Peng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Software documentation is crucial for repository comprehension. While Large Language Models (LLMs) advance documentation generation from code snippets to entire repositories, existing benchmarks have two key limitations: (1) they lack a holistic, repository-level assessment, and (2) they rely on unreliable evaluation strategies, such as LLM-as-a-judge, which suffers from vague criteria and limited repository-level knowledge. To address these issues, we introduce SWD-Bench, a novel benchmark for evaluating repository-level software documentation. Inspired by documentation-driven development, our strategy evaluates documentation quality by assessing an LLM’s ability to understand and implement functionalities using the documentation, rather than by directly scoring it. This is measured through function-driven Question Answering (QA) tasks. SWD-Bench comprises three interconnected QA tasks: (1) Functionality Detection, to determine if a functionality is described; (2) Functionality Localization, to evaluate the accuracy of locating related files; and (3) Functionality Completion, to measure the comprehensiveness of implementation details. We construct the benchmark, containing 4,170 entries, by mining high-quality Pull Requests and enriching them with repository-level context. Experiments reveal limitations in current documentation generation methods and show that source code provides complementary value. Notably, documentation from the best-performing method improves the issue-solving rate of SWE-Agent by 20.00%, which demonstrates the practical value of high-quality documentation in supporting documentation-driven development.
[AI-38] FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling
【速读】:该论文旨在解决基于序贯蒙特卡洛(Sequential Monte Carlo, SMC)的扩散采样器在推理过程中常见的多样性崩溃(diversity collapse)问题,即在强选择压力下,粒子轨迹趋于收敛、丧失多样性,导致生成质量下降。其解决方案的关键在于提出Fleming-Viot Diffusion(FVD),通过引入受Fleming-Viot种群动力学启发的专用出生-死亡机制替代传统的多项式重采样(multinomial resampling),并结合基于奖励的独立存活决策与随机重生噪声,实现无需价值函数近似或昂贵回溯(rollouts)即可保持更广的轨迹支持并有效探索奖励倾斜分布的灵活群体动态。该方法完全可并行化,且在计算资源上高效扩展,实验证明其在DrawBench和类别条件任务中显著优于现有方法。
链接: https://arxiv.org/abs/2604.06779
作者: Shivanshu Shekhar,Sagnik Mukherjee,Jia Yi Zhang,Tong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Fleming-Viot Diffusion (FVD), an inference-time alignment method that resolves the diversity collapse commonly observed in Sequential Monte Carlo (SMC) based diffusion samplers. Existing SMC-based diffusion samplers often rely on multinomial resampling or closely related resampling schemes, which can still reduce diversity and lead to lineage collapse under strong selection pressure. Inspired by Fleming-Viot population dynamics, FVD replaces multinomial resampling with a specialized birth-death mechanism designed for diffusion alignment. To handle cases where rewards are only approximately available and naive rebirth would collapse deterministic trajectories, FVD integrates independent reward-based survival decisions with stochastic rebirth noise. This yields flexible population dynamics that preserve broader trajectory support while effectively exploring reward-tilted distributions, all without requiring value function approximation or costly rollouts. FVD is fully parallelizable and scales efficiently with inference compute. Empirically, it achieves substantial gains across settings: on DrawBench it outperforms prior methods by 7% in ImageReward, while on class-conditional tasks it improves FID by roughly 14-20% over strong baselines and is up to 66 times faster than value-based approaches.
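下面以一维粒子为例,给出摘要所述"独立存活 + 随机重生"机制的一个简化示意(非论文实现,温度与噪声参数均为假设):

```python
import math
import random

def fv_resample(particles, rewards, temperature=1.0, noise=0.1, rng=None):
    """Fleming-Viot 风格的出生-死亡重采样示意(一维粒子)。

    每个粒子依据奖励独立决定存活;死亡粒子在随机存活者处
    加噪声"重生",以区别于多项式重采样的整体复制。"""
    rng = rng or random.Random(0)
    m = max(rewards)
    # 独立存活决策: 奖励越高存活概率越大;最高奖励粒子必然存活
    keep = [rng.random() < math.exp((r - m) / temperature) for r in rewards]
    survivors = [p for p, k in zip(particles, keep) if k]
    out = []
    for p, k in zip(particles, keep):
        if k:
            out.append(p)                      # 存活者保留自身轨迹
        else:
            anchor = rng.choice(survivors)     # 在随机存活者处重生
            out.append(anchor + rng.gauss(0.0, noise))
    return out
```

由于每个粒子独立决策且重生锚点随机,群体不会像多项式重采样那样坍缩到单一谱系。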
[AI-39] Sparse-Aware Neural Networks for Nonlinear Functionals: Mitigating the Exponential Dependence on Dimension
【速读】:该论文旨在解决深度神经网络在无限维函数空间上学习算子时面临的维度灾难(curse of dimensionality)和可解释性不足的问题。其核心解决方案是引入稀疏性(sparsity)作为关键机制,通过卷积架构从有限样本中提取稀疏特征,并结合深层全连接网络逼近非线性泛函(nonlinear functionals),从而实现稳定且高效的函数空间近似。该框架利用通用离散化方法证明了稀疏近似器能够从离散采样中恢复原函数,且无论采用确定性还是随机采样方案均有效,显著提升了多种函数空间(如高频快速衰减或混合光滑性空间)中的逼近精度并减少了所需样本数量。
链接: https://arxiv.org/abs/2604.06774
作者: Jianfei Li,Shuo Huang,Han Feng,Ding-Xuan Zhou,Gitta Kutyniok
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA)
备注:
Abstract:Deep neural networks have emerged as powerful tools for learning operators defined over infinite-dimensional function spaces. However, existing theories frequently encounter difficulties related to dimensionality and limited interpretability. This work investigates how sparsity can help address these challenges in functional learning, a central ingredient in operator learning. We propose a framework that employs convolutional architectures to extract sparse features from a finite number of samples, together with deep fully connected networks to effectively approximate nonlinear functionals. Using universal discretization methods, we show that sparse approximators enable stable recovery from discrete samples. In addition, both the deterministic and the random sampling schemes are sufficient for our analysis. These findings lead to improved approximation rates and reduced sample sizes in various function spaces, including those with fast frequency decay and mixed smoothness. They also provide new theoretical insights into how sparsity can alleviate the curse of dimensionality in functional learning.
[AI-40] TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design
【速读】:该论文旨在解决涡轮机械气动设计中传统流程依赖人工干预、各阶段耦合松散且效率低下的问题,尤其针对从自然语言需求到最终设计生成的端到端自动化难题。其解决方案的关键在于提出 TurboAgent——一个基于大语言模型(Large Language Model, LLM)驱动的多智能体自主框架,其中LLM负责任务规划与协调,而专用智能体分别执行生成式设计、快速性能预测、多目标优化和物理验证等模块,形成数据驱动的协同工作流,实现高保真仿真支持下的闭环设计迭代,显著提升设计效率与精度。
链接: https://arxiv.org/abs/2604.06747
作者: Juan Du,Yueteng Wu,Pan Zhao,Yuze Liu,Min Zhang,Xiaobin Xu,Xinglong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The aerodynamic design of turbomachinery is a complex and tightly coupled multi-stage process involving geometry generation, performance prediction, optimization, and high-fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end-to-end design difficult. To address this issue, this study proposes TurboAgent, a large language model (LLM)-driven autonomous multi-agent framework for turbomachinery aerodynamic design and optimization. The LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi-objective optimization, and physics-based validation. The framework transforms traditional trial-and-error design into a data-driven collaborative workflow, with high-fidelity simulations retained for final verification. A transonic single-rotor compressor is used for validation. The results show strong agreement between target performance, generated designs, and CFD simulations. The coefficients of determination (R2) for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. The optimization agent further improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. The complete workflow can be executed within approximately 30 minutes under parallel computing. These results demonstrate that TurboAgent enables an autonomous closed-loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.
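摘要中报告的决定系数 R² 与归一化 RMSE 可按通用定义计算如下(通用度量实现,并非论文脚本):

```python
import math

def r2_score(y_true, y_pred):
    """决定系数 R^2 = 1 - SS_res / SS_tot。"""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def nrmse(y_true, y_pred):
    """按真值极差归一化的 RMSE(归一化方式有多种,此处取极差)。"""
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
    return rmse / (max(y_true) - min(y_true))
```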
[AI-41] Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
【速读】:该论文旨在解决现有基准测试无法有效评估大语言模型(Large Language Models, LLMs)从零开始(0-to-1)生成完整软件的能力这一问题。具体而言,传统方法依赖预定义模板(predefined scaffolds)忽略项目结构规划,且采用白盒单元测试缺乏对最终行为的端到端验证。为应对这一挑战,作者提出CLI-Tool-Bench,一个无结构依赖(structure-agnostic)的基准测试平台,用于评估命令行界面(Command-Line Interface, CLI)工具的从零构建能力。其核心创新在于引入黑盒差分测试框架(black-box differential testing framework),通过沙箱执行Agent生成的代码,并利用多级等价性指标(multi-tiered equivalence metrics)对比系统副作用和终端输出与人工编写的参考实现(oracles),从而实现对生成软件功能正确性的客观量化评估。
链接: https://arxiv.org/abs/2604.06742
作者: Ruida Hu,Xinchen Wang,Chao Peng,Cuiyun Gao,David Lo
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are driving a shift towards intent-driven development, where agents build complete software from scratch. However, existing benchmarks fail to assess this 0-to-1 generation capability due to two limitations: reliance on predefined scaffolds that ignore repository structure planning, and rigid white-box unit testing that lacks end-to-end behavioral validation. To bridge this gap, we introduce CLI-Tool-Bench, a structure-agnostic benchmark for evaluating the ground-up generation of Command-Line Interface (CLI) tools. It features 100 diverse real-world repositories evaluated via a black-box differential testing framework. Agent-generated software is executed in sandboxes, comparing system side effects and terminal outputs against human-written oracles using multi-tiered equivalence metrics. Evaluating seven state-of-the-art LLMs, we reveal that top models achieve under 43% success, highlighting the ongoing challenge of 0-to-1 generation. Furthermore, higher token consumption does not guarantee better performance, and agents tend to generate monolithic code.
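黑盒差分测试的核心思想可以用如下简化示意表达:在各自的沙箱目录中执行候选命令与人工参考命令,比较终端输出与文件系统副作用是否一致(仅为思路演示,非 CLI-Tool-Bench 实现,多级等价性指标此处简化为严格相等):

```python
import pathlib
import subprocess
import tempfile

def differential_check(candidate_cmd, oracle_cmd):
    """在独立沙箱目录中执行两条命令,比较 stdout 与产生的文件。"""
    snapshots = []
    for cmd in (candidate_cmd, oracle_cmd):
        wd = tempfile.mkdtemp()   # 每条命令一个干净的沙箱目录
        proc = subprocess.run(cmd, cwd=wd, shell=True,
                              capture_output=True, text=True)
        # 副作用快照: 目录中产生的全部文件及其内容
        files = {p.name: p.read_text()
                 for p in pathlib.Path(wd).iterdir() if p.is_file()}
        snapshots.append((proc.stdout, files))
    return snapshots[0] == snapshots[1]
```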
[AI-42] The Traveling Thief Problem with Time Windows: Benchmarks and Heuristics
【速读】:该论文旨在解决带有时间窗约束的旅行商盗窃问题(Traveling Thief Problem with Time Windows, TTP-TW),这是一个多组件优化问题,模拟了现实场景中货物只能在特定时间段内被收集的情形。传统TTP研究未考虑时间限制,而实际物流与配送场景常存在时间窗约束,因此该问题更具应用价值。解决方案的关键在于提出一种新的启发式算法,并基于现有TTP和带时间窗的旅行商问题(TSP with Time Windows, TSPTW)的适应性方法进行改进,同时构建了新的基准测试实例集以系统评估算法性能。实验结果表明,所设计的新启发式算法在多种基准实例上均优于现有方法,体现出其在处理复杂时序依赖关系下的优越性。
链接: https://arxiv.org/abs/2604.06724
作者: Helen Yuliana Angmalisang,Frank Neumann
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 13 pages
Abstract:While traditional optimization problems were often studied in isolation, many real-world problems today require interdependence among multiple optimization components. The traveling thief problem (TTP) is a multi-component problem that has been widely studied in the literature. In this paper, we introduce and investigate the TTP with time window constraints, which provides a TTP variant highly relevant to real-world situations where goods can only be collected at given time intervals. We examine adaptations of existing approaches for TTP and the Traveling Salesperson Problem (TSP) with time windows to this new problem and evaluate their performance. Furthermore, we provide a new heuristic approach for the TTP with time windows. To evaluate algorithms for the TTP with time windows, we introduce new TTP benchmark instances with time windows based on TTP instances existing in the literature. Our experimental investigations evaluate the different approaches and show that the newly designed algorithm outperforms the other approaches on a wide range of benchmark instances.
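时间窗约束本身可以用一个简单的可行性检查来说明(仅演示约束语义,并非论文中的启发式算法;数据结构为假设):

```python
def tour_feasible(tour, travel, windows, start_time=0.0):
    """检查路线是否满足各城市的 [开启, 关闭] 时间窗(到得早则等待)。

    travel[(a, b)] 为 a 到 b 的旅行时间;windows[c] 为城市 c 的时间窗。"""
    t = start_time
    for prev, cur in zip(tour, tour[1:]):
        t += travel[(prev, cur)]
        open_t, close_t = windows[cur]
        if t > close_t:        # 时间窗已关闭: 路线不可行
            return False
        t = max(t, open_t)     # 提前到达: 等待窗口开启后再收集
    return True
```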
[AI-43] Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动化代码修订(Automated Code Revision, ACR)任务中生成的置信度分数(confidence scores)普遍缺乏校准性的问题。由于LLMs固有的不完美性,其输出的置信度常不能真实反映预测正确性的概率,导致开发者难以判断是否采纳模型建议,从而影响效率和可靠性。传统全局Platt缩放(global Platt-scaling)方法虽在通用生成式软件工程任务中有效,但在ACR任务中表现不稳定,因其基于序列级粗粒度校准,无法捕捉局部编辑决策对正确性的影响。本文的关键解决方案是提出细粒度Platt缩放(local Platt-scaling),针对三种不同层次的细粒度置信度分数分别进行独立校准,从而更精准地匹配局部校正行为与实际正确性之间的关系。实验表明,该方法在多种模型、任务和评估指标下均显著降低校准误差,并且与全局校准结合后效果进一步提升,为ACR场景提供了可信赖且实用的置信度校准方案。
链接: https://arxiv.org/abs/2604.06723
作者: Hong Yi Lin,Chunhua Liu,Haoyu Gao,Patanamon Thongtanunam,Christoph Treude
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In today’s AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. To this end, a canonical mitigation method is to provide calibrated confidence scores that faithfully reflect their likelihood of correctness at the instance-level. Such information allows users to make immediate decisions regarding output acceptance, abstain from error-prone outputs, and better align their expectations with the model’s capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods, with global Platt-scaling of sequence-level confidence scores proving effective in many generative software engineering tasks but remaining unreliable or unexplored for automated code revision (ACR) tasks such as program repair, vulnerability repair, and code refinement. We hypothesise that the coarse-grained nature of this conventional method makes it ill-suited for ACR tasks, where correctness is often determined by local edit decisions and miscalibration can be sample-dependent, thereby motivating fine-grained confidence calibration. To address this, our study proposes local Platt-scaling applied separately to three different fine-grained confidence scores. Through experiments across 3 separate tasks and correctness metrics, as well as 14 different models of various sizes, we find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals, and this effect is further amplified when global Platt-scaling is applied. Our proposed approaches offer a practical solution to eliciting well-calibrated confidence scores, enabling more trustworthy and streamlined usage of imperfect models in ACR tasks.
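Platt 缩放的基本形式是对原始置信分数做 sigmoid(a·s + b) 映射,参数由对数损失拟合得到。下面是一个教学用的梯度下降实现(非论文代码,学习率与迭代数为假设):

```python
import math

def platt_fit(scores, labels, lr=0.5, steps=3000):
    """用梯度下降拟合 Platt 缩放参数 (a, b),
    使 sigmoid(a*score + b) 逼近经验正确率。"""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s      # 对数损失关于 a 的梯度
            gb += (p - y)          # 对数损失关于 b 的梯度
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def platt_apply(a, b, score):
    """把原始置信分数映射为校准后的正确概率。"""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

论文中的"局部 Platt 缩放"即对不同细粒度的置信分数分别执行这一拟合,而非对序列级分数做一次全局拟合。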
[AI-44] AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents
【速读】:该论文旨在解决AI代理系统(AI agent systems)中请求调度(request dispatch)的效率问题,尤其是在延迟、隐私和成本约束下的高效路由决策难题。现有方法在代理命名、发现与交互方面已有进展,但缺乏对调度过程的结构化建模,导致资源利用率低且难以保障隐私。解决方案的关键在于提出AgentGate——一个轻量级的结构化路由引擎,将路由问题从无约束的文本生成转化为受约束的决策问题,并分为两个阶段:第一阶段进行动作决策(action decision),判断是否触发单代理调用、多代理规划、直接响应或安全升级;第二阶段进行结构化定位(structural grounding),将选定动作转化为可执行输出,如目标代理、结构化参数或多步计划。此外,作者设计了面向路由任务的微调策略,引入候选感知监督与硬负例样本,使小型模型(3B–7B参数规模)也能在受限环境中实现高性能路由,验证了结构化路由作为高效、隐私友好的代理系统设计范式可行性。
链接: https://arxiv.org/abs/2604.06696
作者: Yujun Cheng,Enfang Cui,Hao Qin,Zhiyuan Liang,Qi Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid development of AI agent systems is leading to an emerging Internet of Agents, where specialized agents operate across local devices, edge nodes, private services, and cloud platforms. Although recent efforts have improved agent naming, discovery, and interaction, efficient request dispatch remains an open systems problem under latency, privacy, and cost constraints. In this paper, we present AgentGate, a lightweight structured routing engine for candidate-aware agent dispatch. Instead of treating routing as unrestricted text generation, AgentGate formulates it as a constrained decision problem and decomposes it into two stages: action decision and structural grounding. The first stage determines whether a query should trigger single-agent invocation, multi-agent planning, direct response, or safe escalation, while the second stage instantiates the selected action into executable outputs such as target agents, structured arguments, or multi-step plans. To adapt compact models to this setting, we further develop a routing-oriented fine-tuning scheme with candidate-aware supervision and hard negative examples. Experiments on a curated routing benchmark with several 3B–7B open-weight models show that compact models can provide competitive routing performance in constrained settings, and that model differences are mainly reflected in action prediction, candidate selection, and structured grounding quality. These results indicate that structured routing is a feasible design point for efficient and privacy-aware agent systems, especially when routing decisions must be made under resource-constrained deployment conditions.
[AI-45] Reasoning Fails Where Step Flow Breaks ACL2026
【速读】:该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在多步数学、科学和编程任务中表现不稳定且难以解释的问题,特别是现有分析工具难以处理长而结构化的推理轨迹。其关键解决方案是提出Step-Saliency方法,通过将注意力梯度得分池化为沿“问题—思考—总结”轨迹的步骤间映射,识别出两种常见的信息流故障:浅层锁定(Shallow Lock-in)和深层衰减(Deep Decay)。基于此发现,作者进一步设计了StepFlow干预策略,在测试阶段通过Odds-Equal Bridge调整浅层关注模式,并引入步骤级残差项(Step Momentum Injection)增强深层信息流动,从而无需重新训练即可显著提升多个LRM在多项任务上的准确性,验证了修复信息流对恢复推理性能的重要性。
链接: https://arxiv.org/abs/2604.06695
作者: Xiaoyu Xu,Yulan Pan,Xiaosong Yuan,Zhihong Shen,Minghao Su,Yuanhao Su,Xiaofeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026
Abstract:Large reasoning models (LRMs) that generate long chains of thought now perform well on multi-step math, science, and coding tasks. However, their behavior is still unstable and hard to interpret, and existing analysis tools struggle with such long, structured reasoning traces. We introduce Step-Saliency, which pools attention–gradient scores into step-to-step maps along the question–thinking–summary trajectory. Across several models, Step-Saliency reveals two recurring information-flow failures: Shallow Lock-in, where shallow layers over-focus on the current step and barely use earlier context, and Deep Decay, where deep layers gradually lose saliency on the thinking segment and the summary increasingly attends to itself and the last few steps. Motivated by these patterns, we propose StepFlow, a saliency-inspired test-time intervention that adjusts shallow saliency patterns measured by Step-Saliency via Odds-Equal Bridge and adds a small step-level residual in deep layers via Step Momentum Injection. StepFlow improves accuracy on math, science, and coding tasks across multiple LRMs without retraining, indicating that repairing information flow can recover part of their missing reasoning performance.
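"把 token 级注意力-梯度分数池化为步骤间显著性图"的做法可以示意如下(平均池化仅为一种可能选择,并非论文的确切实现):

```python
def step_saliency(token_saliency, step_spans):
    """把 token 级显著性分数池化为步骤到步骤的显著性图。

    token_saliency[i][j]: 目标 token i 对源 token j 的显著性;
    step_spans: 每个推理步骤对应的 (起, 止) token 区间(左闭右开)。"""
    n = len(step_spans)
    out = [[0.0] * n for _ in range(n)]
    for a, (a0, a1) in enumerate(step_spans):
        for b, (b0, b1) in enumerate(step_spans):
            vals = [token_saliency[i][j]
                    for i in range(a0, a1) for j in range(b0, b1)]
            out[a][b] = sum(vals) / len(vals)   # 区块内平均池化
    return out
```

由此得到的 n×n 矩阵即"问题—思考—总结"轨迹上步骤间的信息流图,可用于观察浅层锁定与深层衰减等模式。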
[AI-46] KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning IJCNN2026
【速读】:该论文旨在解决多智能体强化学习(Multi Agent Reinforcement Learning, MARL)系统在真实世界部署中面临的计算资源受限问题,尤其是推理时延和内存占用过高导致难以在边缘设备或嵌入式平台运行的问题。现有方法虽能通过专家策略实现高性能,但其依赖高复杂度模型和昂贵的决策周期,不适用于资源受限场景。解决方案的关键在于提出一种资源感知的知识蒸馏框架(Resource Aware Knowledge Distillation for Multi Agent Reinforcement Learning, KD-MARL),该框架采用两阶段训练机制:第一阶段从中心化专家策略中蒸馏出结构化的协调行为与优势信号,第二阶段训练轻量级去中心化学生策略,无需 critic 支持,仅依靠蒸馏的优势信号和结构化策略监督来保留协作能力。该方法不仅传递动作层面的行为模式,还显式建模并迁移协调结构,同时支持异构学生架构以适配不同观测复杂度,从而在保证超过90%专家性能的同时,将计算成本(FLOPs)降低最多达28.6倍,显著提升了MARL在资源受限环境中的可部署性。
链接: https://arxiv.org/abs/2604.06691
作者: Monirul Islam Pavel,Siyi Hu,Muhammad Anwar Masum,Mahardhika Pratama,Ryszard Kowalczyk,Zehong Jimmy Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted in IJCNN 2026
Abstract:Real-world deployment of multi-agent reinforcement learning (MARL) systems is fundamentally constrained by limited compute, memory, and inference time. While expert policies achieve high performance, they rely on costly decision cycles and large-scale models that are impractical for edge devices or embedded platforms. Knowledge distillation (KD) offers a promising path toward resource-aware execution, but existing KD methods in MARL focus narrowly on action imitation, often neglecting coordination structure and assuming uniform agent capabilities. We propose resource-aware Knowledge Distillation for Multi-Agent Reinforcement Learning (KD-MARL), a two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. The student policies are trained without a critic, relying instead on distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Our approach transfers both action-level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures, allowing each agent's model capacity to match its observation complexity, which is crucial for efficient execution under partial or limited observability and limited onboard resources. Extensive experiments on SMAC and MPE benchmarks demonstrate that KD-MARL achieves high performance retention while substantially reducing computational cost. Across standard multi-agent benchmarks, KD-MARL retains over 90% of expert performance while reducing computational cost by up to 28.6x (FLOPs). The proposed approach achieves expert-level coordination and preserves it through structured distillation, enabling practical MARL deployment across resource-constrained onboard platforms.
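"行为蒸馏 + 蒸馏优势信号"这类学生损失的一种可能形式如下(纯属示意:具体损失形式、加权方式均为假设,论文未给出该公式):

```python
import math

def kd_marl_loss(teacher_probs, student_logits, advantage, alpha=0.5):
    """行为模仿项与优势加权项的组合蒸馏损失(示意)。"""
    m = max(student_logits)
    exps = [math.exp(l - m) for l in student_logits]
    z = sum(exps)
    student_probs = [e / z for e in exps]
    # 行为项: KL(teacher || student),模仿教师动作分布
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    # 优势项: 按蒸馏得到的优势信号加权,推高教师贪心动作的概率
    greedy = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    adv_term = -advantage * math.log(student_probs[greedy])
    return alpha * kl + (1 - alpha) * adv_term
```

学生完全匹配教师且优势为零时损失为零,偏离越大损失越高,这与"无 critic、仅依赖蒸馏信号"的训练设定相容。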
[AI-47] Restoring Heterogeneity in LLM-based Social Simulation: An Audience Segmentation Approach
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在社会态度与行为模拟中普遍存在的“平均人格”问题,即模型常将多样性简化为单一代表个体,从而掩盖了真实社会群体间的异质性。其解决方案的关键在于引入**受众细分(audience segmentation)**作为系统性方法,通过基于不同标识符粒度、简洁性及选择逻辑(理论驱动、数据驱动和工具导向)的配置,恢复LLM模拟中的子群体差异。研究发现,适度的细分可提升模拟性能,但过度细化可能损害结构与预测保真度;且不同选择逻辑对各保真维度(分布、结构、预测)的影响各异,表明需采用异质性感知的评估框架与方差保留建模策略,以实现更贴近现实的社会模拟。
链接: https://arxiv.org/abs/2604.06663
作者: Xiaoyou Qin,Zhihong Li,Xiaoxiao Cheng
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used to simulate social attitudes and behaviors, offering scalable “silicon samples” that can approximate human data. However, current simulation practice often collapses diversity into an “average persona,” masking subgroup variation that is central to social reality. This study introduces audience segmentation as a systematic approach for restoring heterogeneity in LLM-based social simulation. Using U.S. climate-opinion survey data, we compare six segmentation configurations across two open-weight LLMs (Llama 3.1-70B and Mixtral 8x22B), varying segmentation identifier granularity, parsimony, and selection logic (theory-driven, data-driven, and instrument-based). We evaluate simulation performance with a three-dimensional evaluation framework covering distributional, structural, and predictive fidelity. Results show that increasing identifier granularity does not produce consistent improvement: moderate enrichment can improve performance, but further expansion does not reliably help and can worsen structural and predictive fidelity. Across parsimony comparisons, compact configurations often match or outperform more comprehensive alternatives, especially in structural and predictive fidelity, while distributional fidelity remains metric dependent. Identifier selection logic determines which fidelity dimension benefits most: instrument-based selection best preserves distributional shape, whereas data-driven selection best recovers between-group structure and identifier-outcome associations. Overall, no single configuration dominates all dimensions, and performance gains in one dimension can coincide with losses in another. These findings position audience segmentation as a core methodological approach for valid LLM-based social simulation and highlight the need for heterogeneity-aware evaluation and variance-preserving modeling strategies.
[AI-48] RPM-Net Reciprocal Point MLP Network for Unknown Network Security Threat Detection ICASSP2026
【速读】:该论文旨在解决多类不平衡环境下未知网络威胁检测的难题,现有方法在已知攻击类别表征学习、未知威胁识别以及模型可解释性方面存在局限。其解决方案的关键在于提出RPM-Net框架,通过引入互点机制(reciprocal point mechanism)为每个已知攻击类别学习“非类别”(non-class)表示,并结合对抗边界约束(adversarial margin constraints)提供几何可解释性以增强未知威胁检测能力;此外,RPM-Net++进一步利用Fisher判别正则化提升性能,实验证明该方法在F1-score、AUROC和AUPR-OUT等多个指标上显著优于现有技术,具备实际部署价值。
链接: https://arxiv.org/abs/2604.06638
作者: Jiachen Zhang,Yueming Lu,Fan Feng,Zhanfeng Wang,Shengli Pan,Daoqi Han
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Compared to the ICASSP 2026 proceedings version, this version corrects a transcription error in Table 1 (ODIN’s precision, recall, and f1 scores)
Abstract:Effective detection of unknown network security threats in multi-class imbalanced environments is critical for maintaining cyberspace security. Current methods focus on learning class representations but face challenges with unknown threat detection, class imbalance, and lack of interpretability, limiting their practical use. To address this, we propose RPM-Net, a novel framework that introduces a reciprocal point mechanism to learn “non-class” representations for each known attack category, coupled with adversarial margin constraints that provide geometric interpretability for unknown threat detection. RPM-Net++ further enhances performance through Fisher discriminant regularization. Experimental results show that RPM-Net achieves superior performance across multiple metrics including F1-score, AUROC, and AUPR-OUT, significantly outperforming existing methods and offering practical value for real-world network security applications. Our code is available at: this https URL
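互点机制推理阶段的开集判别逻辑可以粗略示意如下(基于互点学习的一般思路:已知样本被推离本类互点,未知样本落在靠近所有互点的开放空间;距离度量与阈值形式均为假设,非 RPM-Net 实现):

```python
def open_set_predict(feature, reciprocal_points, margins):
    """互点机制的开集判别示意。

    每个已知类对应一个"非类别"互点;判别时取离互点最远的类,
    若该距离仍小于边界阈值,则判为未知威胁。"""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    dists = [sq_dist(feature, rp) for rp in reciprocal_points]
    best = max(range(len(dists)), key=dists.__getitem__)  # 离互点最远的类
    if dists[best] < margins[best]:
        return "unknown"   # 离所有互点都近: 不属于任何已知攻击类别
    return best
```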
[AI-49] Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization Data and Model Capability
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理监督微调(Reasoning Supervised Fine-Tuning, Reasoning SFT)中关于泛化能力的争议问题,即传统观点认为SFT仅记忆而RL(强化学习)才能实现泛化。研究发现,跨域泛化并非缺失,而是条件性的,受优化动态、训练数据质量和基础模型能力共同影响。解决方案的关键在于揭示了“下探-回升”(dip-and-recovery)现象:短期训练会低估泛化性能,需延长训练以观察真实提升;同时强调高质量长链式思维(Chain-of-Thought, CoT)数据的重要性,以及更强模型能内化可迁移的推理过程(如回溯),而弱模型仅模仿表面文本冗余。这表明推理SFT具备潜在泛化能力,但其有效性取决于训练策略、数据结构与模型容量的协同作用,并伴随安全性的不对称下降。
链接: https://arxiv.org/abs/2604.06628
作者: Qihan Ren,Peng Wang,Ruikun Cai,Shuai Shao,Dadi Guo,Yuejin Xie,Yafu Li,Quanshi Zhang,Xia Hu,Jing Shao,Dongrui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.
[AI-50] TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决去中心化在线学习在物理-信息多智能体系统中因运行环境变化导致策略性能下降后,需依赖大量试错交互才能恢复的问题。其解决方案的关键在于提出TwinLoop框架——一种“仿真回路”数字孪生(Digital Twin)机制,通过在环境状态发生突变时触发数字孪生重建当前系统状态,以最新智能体策略为起点,在仿真环境中进行加速策略优化与“假设分析”(what-if analysis),随后将更新后的参数同步至物理系统中的智能体,从而显著提升适应效率并降低对昂贵在线试错的依赖。
链接: https://arxiv.org/abs/2604.06610
作者: Nan Zhang,Zishuo Wang,Shuyu Huang,Georgios Diamantopoulos,Nikos Tziritas,Panagiotis Oikonomou,Georgios Theodoropoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures
Abstract:Decentralised online learning enables runtime adaptation in cyber-physical multi-agent systems, but when operating conditions change, learned policies often require substantial trial-and-error interaction before recovering performance. To address this, we propose TwinLoop, a simulation-in-the-loop digital twin framework for online multi-agent reinforcement learning. When a context shift occurs, the digital twin is triggered to reconstruct the current system state, initialise from the latest agent policies, and perform accelerated policy improvement with simulation what-if analysis before synchronising updated parameters back to the agents in the physical system. We evaluate TwinLoop in a vehicular edge computing task-offloading scenario with changing workload and infrastructure conditions. The results suggest that digital twins can improve post-shift adaptation efficiency and reduce reliance on costly online trial-and-error.
[AI-51] AI-Driven Research for Databases
【速读】:该论文旨在解决现代数据库系统性能优化中因工作负载与硬件复杂度快速提升,而传统人工优化方法难以跟上节奏的问题。其核心挑战在于AI-Driven Research for Systems (ADRS) 方法在实际应用中的评估瓶颈——由于ADRS能无监督地生成大量候选方案,亟需高效且准确的评估机制以实现有效收敛。解决方案的关键在于提出一种“协同进化”策略:通过自动化设计评估器,并将其与优化方案同步演化,从而构建出能够适应复杂数据库系统特性的评估反馈机制。实验表明,该方法在缓冲区管理、查询重写和索引选择三个案例中成功发现优于现有基线的新算法(如确定性查询重写策略可降低6.8倍延迟),证明了突破评估瓶颈对释放ADRS潜力、生成可部署的高性能代码具有决定性意义。
链接: https://arxiv.org/abs/2604.06566
作者: Audrey Cheng,Harald Ng,Aaron Kabcenell,Peter Bailis,Matei Zaharia,Lin Ma,Xiao Shi,Ion Stoica
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:As the complexity of modern workloads and hardware increasingly outpaces human research and engineering capacity, existing methods for database performance optimization struggle to keep pace. To address this gap, a new class of techniques, termed AI-Driven Research for Systems (ADRS), uses large language models to automate solution discovery. This approach shifts optimization from manual system design to automated code generation. The key obstacle, however, in applying ADRS is the evaluation pipeline. Since these frameworks rapidly generate hundreds of candidates without human supervision, they depend on fast and accurate feedback from evaluators to converge on effective solutions. Building such evaluators is especially difficult for complex database systems. To enable the practical application of ADRS in this domain, we propose automating the design of evaluators by co-evolving them with the solutions. We demonstrate the effectiveness of this approach through three case studies optimizing buffer management, query rewriting, and index selection. Our automated evaluators enable the discovery of novel algorithms that outperform state-of-the-art baselines (e.g., a deterministic query rewrite policy that achieves up to 6.8x lower latency), demonstrating that addressing the evaluation bottleneck unlocks the potential of ADRS to generate highly optimized, deployable code for next-generation data systems.
[AI-52] On Emotion-Sensitive Decision Making of Small Language Model Agents
【速读】:该论文旨在解决当前小语言模型(Small Language Models, SLMs)在作为交互式决策代理时,普遍忽略情绪作为影响行为因果因素的问题。现有决策评估方法多未考虑情绪对策略选择的系统性影响,导致模型行为与人类实际决策模式存在偏差。解决方案的关键在于将表征层面的情绪诱导(representation-level emotion induction)与结构化的博弈论评估框架相结合:通过基于众包验证的真实情绪诱发文本所提取的激活控制(activation steering),实现可控制且可迁移的情绪干预,超越传统提示工程(prompt-based methods)的局限;同时构建围绕经典决策模板(canonical decision templates)的基准测试,涵盖合作与竞争激励下完全与不完全信息场景,从而系统评估情绪扰动对战略决策的影响。实验表明,情绪扰动确实显著改变模型决策行为,但其稳定性不足且与人类预期不一致,为后续提升模型对情绪驱动扰动的鲁棒性提供了方向。
链接: https://arxiv.org/abs/2604.06562
作者: Jiaju Lin,Xingjian Du,Qingyun Wu,Ellen Wenting Zou,Jindong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Small language models (SLM) are increasingly used as interactive decision-making agents, yet most decision-oriented evaluations ignore emotion as a causal factor influencing behavior. We study emotion-sensitive decision making by combining representation-level emotion induction with a structured game-theoretic evaluation. Emotional states are induced using activation steering derived from crowd-validated, real-world emotion-eliciting texts, enabling controlled and transferable interventions beyond prompt-based methods. We introduce a benchmark built around canonical decision templates that span cooperative and competitive incentives under both complete and incomplete information. These templates are instantiated using strategic scenarios from Diplomacy, StarCraft II, and diverse real-world personas. Experiments across multiple model families of various architectures and modalities show that emotional perturbations systematically affect strategic choices, but the resulting behaviors are often unstable and not fully aligned with human expectations. Finally, we outline an approach to improve robustness to emotion-driven perturbations.
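激活控制(activation steering)的常见做法是用"情绪文本激活均值 − 中性文本激活均值"作为方向向量,推理时按强度加到隐藏激活上。下面是一个与框架无关的纯 Python 示意(均值差分法是该领域的通用做法,具体提取层与强度为假设):

```python
def steering_vector(emotional_acts, neutral_acts):
    """情绪文本与中性文本激活的均值差,作为表征层的"情绪方向"。"""
    dim = len(emotional_acts[0])
    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)
    return [mean(emotional_acts, j) - mean(neutral_acts, j) for j in range(dim)]

def steer(hidden, direction, strength=1.0):
    """推理时把情绪方向按强度加到隐藏激活上,实现可控、可迁移的情绪诱导。"""
    return [h + strength * d for h, d in zip(hidden, direction)]
```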
[AI-53] SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
【速读】:该论文旨在解决开源代理技能市场(如OpenClaw的ClawHub)中因自然语言指令与代码混杂导致的安全漏洞检测难题,特别是针对正则表达式扫描器和形式化静态分析工具无法有效识别提示注入(prompt injection)和社交工程攻击等隐蔽威胁的问题。解决方案的关键在于提出SkillSieve三层次检测框架:第一层通过XGBoost特征评分实现快速过滤(平均40ms/技能,零API成本),覆盖约86%良性技能;第二层将可疑技能拆分为四个并行子任务(意图一致性、权限合理性、隐匿行为检测、跨文件一致性),由大语言模型(LLM)分别处理并结构化输出;第三层对高风险技能引入三位不同LLM组成的评审团,采用独立投票与争议辩论机制以提升决策可靠性。该方法在真实数据集上实现了0.800 F1分数,显著优于现有方案(ClawVet为0.421),且单次检测成本低至0.006美元。
链接: https://arxiv.org/abs/2604.06550
作者: Yinghan Hou,Zongyou Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 tables, 1 figure
Abstract:OpenClaw’s ClawHub marketplace hosts over 13,000 community-contributed agent skills, and between 13% and 26% of them contain security vulnerabilities according to recent audits. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural language instructions in this http URL files where prompt injection and social engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the full pipeline on a 440 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet’s 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.
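A minimal sketch of the escalation logic, with an invented keyword scorer standing in for the XGBoost feature scorer and stub jurors standing in for the three LLMs (none of this is the paper's actual model or lexicon):

```python
def layer1_score(skill):
    # Stand-in for the XGBoost feature scorer: flag skills whose code
    # contains an eval-like pattern or a raw network fetch.
    risky = ("eval(", "exec(", "curl http")
    return 1.0 if any(tok in skill["code"] for tok in risky) else 0.1

def jury(skill, jurors):
    # Layer 3: independent votes from three (stub) models, majority wins.
    votes = [j(skill) for j in jurors]
    return "malicious" if votes.count("malicious") * 2 > len(votes) else "benign"

def triage(skill, jurors, gate=0.5):
    if layer1_score(skill) < gate:
        return "benign"            # the cheap filter stops most skills here
    return jury(skill, jurors)     # Layers 2 and 3 collapsed for brevity

jurors = [lambda s: "malicious", lambda s: "benign", lambda s: "malicious"]
print(triage({"code": "eval(payload)"}, jurors))   # malicious
print(triage({"code": "print('hi')"}, jurors))     # benign
```

The real system adds a Layer-2 decomposition into four sub-tasks and a debate round for split juries.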
[AI-54] Database Querying under Missing Values Governed by Missingness Mechanisms
【速读】:该论文旨在解决关系数据库(Relational Database, RDB)中缺失值(Missing Values, MVs)的语义赋予与查询回答(Query Answering, QA)问题。其核心挑战在于如何在缺失值由特定缺失机制(Missingness Mechanism)驱动的情况下,既保留数据的不确定性,又保证隐式填补(implicit imputation)的统计合理性。解决方案的关键在于:首先,将缺失机制建模为贝叶斯网络(Bayesian Network),从而构建一个缺失图(Missingness Graph, MG),该图刻画了属性间的缺失依赖结构;其次,利用观测到的数据库和MG共同生成一个块独立(block-independent)的概率数据库;在此基础上,提出两种联合捕捉概率不确定性和统计合理性的查询回答技术,从而实现对缺失值的稳健推理。
链接: https://arxiv.org/abs/2604.06520
作者: Leopoldo Bertossi,Farouk Toumani,Maxime Buron
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Submitted, under review
Abstract:We address the problems of giving a semantics to, and doing query answering (QA) on, a relational database (RDB) that has missing values (MVs). The causes for the latter are governed by a Missingness Mechanism that is modelled as a Bayesian Network, which represents a Missingness Graph (MG) and involves the DB attributes. Our approach departs considerably from the treatment of RDBs with NULL values. The MG, together with the observed DB, allows us to build a block-independent probabilistic DB, on the basis of which we propose two QA techniques that jointly capture probabilistic uncertainty and statistical plausibility of the implicit imputation of MVs. We obtain complexity results that characterize the computational feasibility of those approaches.
[AI-55] Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture-of-Experts, MoE)模型在推理阶段因参数量庞大而导致的显著内存开销问题,尤其是在低比特量化(low-bit quantization)下易引发精度损失的挑战。现有均匀量化方法在低比特宽度时会带来明显性能下降,而现有混合精度方法虽能缓解此问题,却常因比特分配计算复杂且未考虑不同专家对量化敏感度的差异而效果受限。论文提出了一种基于理论支撑的逐专家混合精度策略,其核心在于:首先依据训练过程中路由器(router)L2范数变化量为每个专家分配比特宽度——变化较小的专家通常捕获的是低频但关键特征,对量化更敏感,需更高精度;其次,进一步引入最大神经元内方差(maximum intra-neuron variance)作为补充指标,避免将高噪声敏感的专家分配至低精度。实验表明,该方法在Switch Transformer和Mixtral等大规模MoE模型上实现了优于现有方法的精度提升,同时降低推理成本,并仅引入可忽略的比特分配开销。
链接: https://arxiv.org/abs/2604.06515
作者: Mohammed Nowaz Rabbani Chowdhury,Kaoutar El Maghraoui,Hsinyu Tsai,Naigang Wang,Geoffrey W. Burr,Liu Liu,Meng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have been recently explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed-precision strategy that assigns a bit-width to each expert primarily based on the change in its router's L2 norm during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid assigning low precision to experts that would inject high quantization noise, experts with large maximum intra-neuron variance are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.
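The allocation rule can be sketched as follows; the sensitivity fraction, variance threshold, and bit-widths are illustrative choices, not values from the paper.

```python
def assign_bits(deltas, variances, frac_sensitive=0.25, var_threshold=1.0,
                high_bits=8, low_bits=4):
    """deltas[i]: change in expert i's router L2 norm over training;
    variances[i]: expert i's maximum intra-neuron variance.
    Experts with the smallest norm change, or with large variance,
    are treated as quantization-sensitive and kept at high precision."""
    n = len(deltas)
    k = max(1, int(n * frac_sensitive))
    sensitive = set(sorted(range(n), key=lambda i: deltas[i])[:k])
    return [high_bits if i in sensitive or variances[i] > var_threshold
            else low_bits
            for i in range(n)]

deltas    = [0.05, 0.9, 1.2, 0.7]   # expert 0 barely moved -> sensitive
variances = [0.2, 0.3, 2.5, 0.4]    # expert 2 is high-variance -> keep 8-bit
bits = assign_bits(deltas, variances)
print(bits)   # [8, 4, 8, 4]
```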
[AI-56] Improving Robustness In Sparse Autoencoders via Masked Regularization
【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在机制可解释性研究中因训练目标设计不充分而导致的鲁棒性不足问题,特别是特征吸收(feature absorption)现象——即通用特征被更具体的特征所覆盖,从而损害可解释性,即使重建保真度较高也无法保证模型输出的稳定性。解决方案的关键在于引入一种基于掩码(masking-based regularization)的正则化策略:在训练过程中随机替换输入token以破坏特征共现模式,从而减少特征吸收、提升探测性能,并缩小分布外(Out-of-Distribution, OOD)场景下的性能差距,增强不同架构和稀疏度下SAE的可靠性。
链接: https://arxiv.org/abs/2604.06495
作者: Vivek Narayanaswamy,Kowshik Thopalli,Bhavya Kailkhura,Wesam Sakla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure
Abstract:Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.
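A sketch of the masking step, assuming a simple token-replacement scheme; the replacement rate and vocabulary are illustrative, not the paper's configuration.

```python
import random

def mask_tokens(tokens, vocab, rate=0.15, rng=None):
    """Randomly replace a fraction `rate` of tokens with draws from `vocab`,
    disrupting the co-occurrence statistics the SAE would otherwise latch onto."""
    rng = rng or random.Random(0)
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]

toks = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(toks, vocab=["dog", "ran", "blue"], rate=0.3,
                  rng=random.Random(42)))
```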
[AI-57] Discrete Flow Matching Policy Optimization
【速读】:该论文旨在解决离散序列生成中基于强化学习(Reinforcement Learning, RL)微调的偏差与不稳定性问题,尤其针对生成式AI(Generative AI)模型在奖励驱动下的优化过程中存在的采样偏差、辅助估计器偏倚及似然替代目标等问题。解决方案的关键在于提出一种统一框架——离散流匹配策略优化(Discrete flow Matching policy Optimization, DoMinO),其核心思想是将离散流匹配(Discrete Flow Matching, DFM)的采样过程建模为多步马尔可夫决策过程(Markov Decision Process, MDP),从而将奖励最大化问题转化为一个鲁棒的强化学习目标;该方法不仅保留了原始DFM采样器的结构特性,还通过引入总变差正则化项防止策略坍塌,确保微调后的分布贴近预训练分布,同时保持优异的功能性能。
链接: https://arxiv.org/abs/2604.06491
作者: Maojiang Su,Po-Chung Hsieh,Weimin Wu,Mingcheng Lu,Jiunhau Chen,Jerry Yao-Chieh Hu,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning of Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
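For reference, the total-variation distance underlying such regularizers is the textbook quantity below; the paper works with tractable upper bounds rather than this direct form.

```python
def tv_distance(p, q):
    """TV(p, q) = 0.5 * sum_i |p_i - q_i| over a shared support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

pretrained = [0.7, 0.2, 0.1]   # toy next-token distributions
fine_tuned = [0.5, 0.3, 0.2]
d = tv_distance(pretrained, fine_tuned)
print(d)   # about 0.2 (up to float rounding)
```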
[AI-58] Inference-Time Code Selection via Symbolic Equivalence Partitioning
【速读】:该论文旨在解决代码生成任务中“Best-of-N”选择策略依赖昂贵或随机外部验证器以可靠识别正确解的问题。其核心解决方案是提出符号等价划分(Symbolic Equivalence Partitioning),利用符号执行按语义行为对候选程序进行分组,并从主导功能分区中选取代表性程序。为提升分组与选择效率,该方法在符号执行过程中将领域特定约束编码为Satisfiability Modulo Theories (SMT) 假设,从而抑制路径爆炸并避免搜索问题域外的无效输入,最终在不增加额外LLM推理的前提下显著提升了准确率。
链接: https://arxiv.org/abs/2604.06485
作者: David Cho,Yifan Wang,Fanping Sui,Ananth Grama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:“Best-of-N” selection is a popular inference-time scaling method for code generation using Large Language Models (LLMs). However, to reliably identify correct solutions, existing methods often depend on expensive or stochastic external verifiers. In this paper, we propose Symbolic Equivalence Partitioning, a selection framework that uses symbolic execution to group candidate programs by semantic behavior and select a representative from the dominant functional partition. To improve grouping and selection, we encode domain-specific constraints as Satisfiability Modulo Theories (SMT) assumptions during symbolic execution to reduce path explosion and prevent invalid input searches outside the problem domain. At N=10, our method improves average accuracy over Pass@1 from 0.728 to 0.803 on HumanEval+ and from 0.516 to 0.604 on LiveCodeBench, without requiring any additional LLM inference beyond the initial N candidate generations.
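Minus the symbolic execution, the selection rule amounts to grouping candidates by behavior and returning a representative of the largest group. The sketch below approximates semantic partitions with concrete probe inputs, an assumption for illustration; the paper derives partitions symbolically, constrained by SMT assumptions.

```python
from collections import defaultdict

def select_dominant(candidates, probes):
    """Group candidates by their output signature on probe inputs and
    return a representative of the largest behavioral partition."""
    partitions = defaultdict(list)
    for f in candidates:
        signature = tuple(f(x) for x in probes)
        partitions[signature].append(f)
    return max(partitions.values(), key=len)[0]

# Three candidate implementations of abs(); two agree, one is buggy.
cands = [lambda x: x if x >= 0 else -x,
         lambda x: (x * x) ** 0.5,
         lambda x: x]                      # wrong on negatives
best = select_dominant(cands, probes=[-2, 0, 3])
print(best(-5))   # 5
```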
[AI-59] Distributed Interpretability and Control for Large Language Models
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在多GPU部署场景下缺乏有效激活级可解释性(activation-level interpretability)与行为控制(steering)能力的问题。当前技术难以支持对多GPU环境下LLMs的逐层激活轨迹进行高效采集与干预,限制了对模型内部机制的理解和实时调控。解决方案的关键在于提出一种可扩展至多GPU架构的实现方法,通过优化内存管理和计算流水线设计,在不增加额外前向传播次数的前提下,实现了高达7倍的激活内存压缩和41倍的吞吐量提升;同时引入基于LayerNorm后注入的转向向量(steering vector),在无需微调或额外推理的情况下,实现了输出结果的可控、单调变化(平均转向斜率0.702),从而为前沿大模型提供实用的可解释性工具与实时行为控制能力。
链接: https://arxiv.org/abs/2604.06483
作者: Dev Arpan Desai,Shaoyi Huang,Zining Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models that require multiple GPU cards to host are usually the most capable models. It is necessary to understand and steer these models, but the current technologies do not support the interpretability and steering of these models in the multi-GPU setting as well as the single-GPU setting. We present a practical implementation of activation-level interpretability (logit lens) and steering (steering vector) that scales up to multi-GPU language models. Our system implements design choices that reduce the activation memory by up to 7x and increase the throughput by up to 41x compared to a baseline on identical hardware. We demonstrate the method across LLaMA-3.1 (8B, 70B) and Qwen-3 (4B, 14B, 32B), sustaining 20-100 tokens/s while collecting full layer-wise activation trajectories for sequences of 1,500 tokens. Using label-position steering vectors injected post-LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across evaluated datasets, without fine-tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real-time behavioral control for frontier LLMs at this https URL.
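The injection step can be sketched independently of any framework: add a scaled direction to the hidden state at the label position, after normalization. Dimensions and values below are toy, not from the evaluated models.

```python
def inject(hidden_states, direction, position, alpha):
    """Add alpha * direction to the hidden state at one token position,
    mirroring a post-LayerNorm injection at the label position."""
    out = [list(h) for h in hidden_states]   # copy; leave the input intact
    out[position] = [h + alpha * d
                     for h, d in zip(out[position], direction)]
    return out

seq = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # 3 tokens, hidden dim 2
steered = inject(seq, direction=[1.0, -1.0], position=-1, alpha=2.0)
print(steered[-1])   # [4.0, 0.0]
```

In a multi-GPU deployment the interesting engineering is in where this hook runs and how activations are gathered, which is what the paper's system optimizes.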
[AI-60] From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
【速读】:该论文旨在解决负载测试在模拟真实事件流量时可能遗漏服务行为的问题,特别是在直播或视频点播(VOD)事件中,传统压力测试无法充分捕捉到实际服务间的异常交互模式。解决方案的关键在于构建一个基于图的异常检测系统,利用图卷积网络-图自编码器(GCN-GAE)学习分钟级粒度的有向加权服务图结构表示,并通过计算负载测试与真实事件嵌入之间的余弦相似度来识别未被充分覆盖的服务节点。该方法能够早期发现与事件相关的异常服务,且初步的合成异常注入框架验证了其高精度(96%)和低误报率(0.08%),尽管召回率受限于保守的传播假设,但仍为微服务生态系统的异常检测提供了可扩展的技术基础。
链接: https://arxiv.org/abs/2604.06448
作者: Srinidhi Madabhushi,Pranesh Vyas,Swathi Vaidyanathan,Mayur Kurup,Elliott Nash,Yegor Silyutin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: Accepted at FSE 2026 - Industrial Track
Abstract:Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load-test and event embeddings. The system identifies documented incident-related services and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that shows promising precision (96%) and a low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.
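The flagging step reduces to a per-service cosine-similarity comparison between load-test and event embeddings; the embeddings and threshold below are invented for illustration, not learned GCN-GAE outputs.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def flag_anomalies(test_emb, event_emb, threshold=0.8):
    """Flag services whose load-test and event embeddings diverge."""
    return [svc for svc in test_emb
            if cosine(test_emb[svc], event_emb[svc]) < threshold]

test_emb  = {"auth": [1.0, 0.0], "cdn": [0.0, 1.0]}
event_emb = {"auth": [1.0, 0.1], "cdn": [1.0, 0.0]}   # cdn shifted under load
flagged = flag_anomalies(test_emb, event_emb)
print(flagged)   # ['cdn']
```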
[AI-61] The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
【速读】:该论文旨在解决语言模型(Language Model, LM)在面对恶意输入时如何通过前置防御机制(即连续、保效用的预处理函数 $D: X \to X$)实现输出严格安全的问题。研究发现,对于提示空间连通的语言模型,任何满足连续性和保效用性质的防御策略都无法保证所有输出均严格安全,这构成了一个“防御三难困境”(defense trilemma):连续性、效用保持与完备性三者不可兼得。解决方案的关键在于从数学上精确刻画了此类防御必然失效的位置——包括边界固定性(boundary fixation)、ε-鲁棒约束(ε-robust constraint)以及在横截条件下的持续不安全区域(persistent unsafe region),从而揭示了现有防御方法的根本局限,并为未来设计更稳健的对抗性防御提供了理论边界。
链接: https://arxiv.org/abs/2604.06436
作者: Manish Bhatt,Sarthak Munshi,Vineeth Sai Narajala,Idan Habler,Ammar Al-Kahfah,Ken Huang,Blake Gatto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We prove that no continuous, utility-preserving wrapper defense (a function D: X → X that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an ε-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
[AI-62] Neural Computers
【速读】:该论文试图解决的问题是:如何构建一种新型计算系统——神经计算机(Neural Computers, NCs),使其能够将计算、内存和输入/输出(I/O)功能统一在一个可学习的运行时状态中,从而让模型本身成为运行中的计算机,而非依赖于传统程序指令或外部环境代理。其核心挑战在于实现稳定执行、显式重编程能力以及持续的能力复用,以迈向完全神经计算机(Completely Neural Computer, CNC)。解决方案的关键在于:通过仅从收集的 I/O 轨迹(如屏幕帧、用户动作等)中学习早期神经计算机原语,例如 I/O 对齐和短时控制策略,而不依赖于程序内部状态的仪器化观测,从而验证神经运行时(learned runtime)具备基础接口理解与行为生成能力,为后续实现通用、可重用、可控的神经计算范式奠定基础。
链接: https://arxiv.org/abs/2604.06425
作者: Mingchen Zhuge,Changsheng Zhao,Haozhe Liu,Zijian Zhou,Shuming Liu,Wenyi Wang,Ernie Chang,Gael Le Lan,Junjie Fei,Wenxuan Zhang,Yasheng Sun,Zhipeng Cai,Zechun Liu,Yunyang Xiong,Yining Yang,Yuandong Tian,Yangyang Shi,Vikas Chandra,Jürgen Schmidhuber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Github (data pipeline): this https URL Blogpost: this https URL
Abstract:We propose a new frontier: Neural Computers (NCs) – an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural Computer (CNC): the mature, general-purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether early NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. These implementations show that learned runtimes can acquire early interface primitives, especially I/O alignment and short-horizon control, while routine reuse, controlled updates, and symbolic stability remain open. We outline a roadmap toward CNCs around these challenges. If overcome, CNCs could establish a new computing paradigm beyond today’s agents, world models, and conventional computers.
[AI-63] Towards Resilient Intrusion Detection in CubeSats: Challenges, TinyML Solutions, and Future Directions
【速读】:该论文旨在解决CubeSat在轨运行中因依赖商用现成(Commercial Off-The-Shelf, COTS)组件和开源软件而面临的严重网络安全漏洞问题,尤其是在资源受限和独特空间环境下的传统入侵检测系统(Intrusion Detection System, IDS)难以部署的挑战。其解决方案的关键在于探索轻量级机器学习(TinyML)技术在CubeSat系统中的应用,以实现资源高效、实时的入侵检测能力,并推动构建适应空间任务需求的自主响应机制与安全框架,从而提升CubeSat在复杂太空环境中运行的可靠性与安全性。
链接: https://arxiv.org/abs/2604.06411
作者: Yasamin Fayyaz,Li Yang,Khalil El-Khatib
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); General Literature (cs.GL); Machine Learning (cs.LG)
备注: Published in IEEE Aerospace and Electronic Systems Magazine
Abstract:CubeSats have revolutionized access to space by providing affordable and accessible platforms for research and education. However, their reliance on Commercial Off-The-Shelf (COTS) components and open-source software has introduced significant cybersecurity vulnerabilities. Ensuring the cybersecurity of CubeSats is vital as they play increasingly important roles in space missions. Traditional security measures, such as intrusion detection systems (IDS), are impractical for CubeSats due to resource constraints and unique operational environments. This paper provides an in-depth review of current cybersecurity practices for CubeSats, highlighting limitations and identifying gaps in existing methods. Additionally, it explores non-cyber anomaly detection techniques that offer insights into adaptable algorithms and deployment strategies suitable for CubeSat constraints. Open research problems are identified, including the need for resource-efficient intrusion detection mechanisms, evaluation of IDS solutions under realistic mission scenarios, development of autonomous response systems, and creation of cybersecurity frameworks. The addition of TinyML into CubeSat systems is explored as a promising solution to address these challenges, offering resource-efficient, real-time intrusion detection capabilities. Future research directions are proposed, such as integrating cybersecurity with health monitoring systems, and fostering collaboration between cybersecurity researchers and space domain experts.
[AI-64] BDI-Kit Demo: A Toolkit for Programmable and Conversational Data Harmonization
【速读】:该论文旨在解决数据整合分析中因模式(schema)、值表示和领域特定惯例差异导致的数据异质性问题,这是实现跨数据源集成分析的主要瓶颈。其解决方案的关键在于提出一个可扩展的工具包BDI-Kit,提供两种互补接口:一是面向开发者的Python API,支持程序化构建数据谐调(data harmonization)流水线并复用转换逻辑;二是AI辅助的自然语言对话界面,使领域专家可通过交互式对话完成模式与值匹配的迭代探索、验证与优化。该方案融合自动化匹配、AI推理与用户驱动修正,显著提升了数据谐调的灵活性与效率。
链接: https://arxiv.org/abs/2604.06405
作者: Roque Lopez,Yurong Liu,Christos Koutras,Juliana Freire
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Data harmonization remains a major bottleneck for integrative analysis due to heterogeneity in schemas, value representations, and domain-specific conventions. BDI-Kit provides an extensible toolkit for schema and value matching. It exposes two complementary interfaces tailored to different user needs: a Python API enabling developers to construct harmonization pipelines programmatically, and an AI-assisted chat interface allowing domain experts to harmonize data through natural language dialogue. This demonstration showcases how users interact with BDI-Kit to iteratively explore, validate, and refine schema and value matches through a combination of automated matching, AI-assisted reasoning, and user-driven refinement. We present two scenarios: (i) using the Python API to programmatically compose primitives, examine intermediate outputs, and reuse transformations; and (ii) conversing with the AI assistant in natural language to access BDI-Kit’s capabilities and iteratively refine outputs based on the assistant’s suggestions.
[AI-65] Toward a universal foundation model for graph-structured data
【速读】:该论文旨在解决生物医学领域中图结构数据缺乏通用基础模型(foundation model)的问题,现有图神经网络通常仅在单一数据集上训练,学习到的表示局限于特定节点特征、拓扑结构和标签空间,难以跨域迁移,尤其在生物学与医学场景下,不同队列、实验方法和机构间的网络差异显著,导致模型泛化能力受限。解决方案的关键在于提出一种特征无关的图结构提示(feature-agnostic structural prompts)机制,利用度统计、中心性度量、社区结构指标及扩散签名等图固有属性编码为结构提示,并将其融合进消息传递骨干网络,从而将异构图嵌入共享表示空间;该模型一次性预训练后可在未见数据集上以极小适应实现复用,在多个基准测试中展现出媲美甚至超越监督基线的性能,并在零样本和少样本场景下表现出卓越的泛化能力。
链接: https://arxiv.org/abs/2604.06391
作者: Sakib Mostafa,Lei Xing,Md. Tauhidul Islam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures, 12 supplementary figures
Abstract:Graphs are a central representation in biomedical research, capturing molecular interaction networks, gene regulatory circuits, cell–cell communication maps, and knowledge graphs. Despite their importance, there is currently no broadly reusable foundation model for graph analysis comparable to those that have transformed language and vision. Existing graph neural networks are typically trained on a single dataset and learn representations specific only to that graph’s node features, topology, and label space, limiting their ability to transfer across domains. This lack of generalization is particularly problematic in biology and medicine, where networks vary substantially across cohorts, assays, and institutions. Here we introduce a graph foundation model designed to learn transferable structural representations that are not tied to specific node identities or feature schemes. Our approach leverages feature-agnostic graph properties, including degree statistics, centrality measures, community structure indicators, and diffusion-based signatures, and encodes them as structural prompts. These prompts are integrated with a message-passing backbone to embed diverse graphs into a shared representation space. The model is pretrained once on heterogeneous graphs and subsequently reused on unseen datasets with minimal adaptation. Across multiple benchmarks, our pretrained model matches or exceeds strong supervised baselines while demonstrating superior zero-shot and few-shot generalization on held-out graphs. On the SagePPI benchmark, supervised fine-tuning of the pretrained backbone achieves a mean ROC-AUC of 95.5%, a gain of 21.8% over the best supervised message-passing baseline. The proposed technique thus provides a unique approach toward reusable, foundation-scale models for graph-structured data in biomedical and network science applications.
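A sketch of a feature-agnostic structural summary of the kind the abstract describes, using only identity-free graph statistics; the specific statistics chosen here (degree mean/max, density, triangle count) are illustrative, not the paper's prompt design.

```python
def structural_prompt(edges, n):
    """Summarize an undirected graph by identity-free statistics."""
    deg = [0] * n
    adj = [set() for _ in range(n)]
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        adj[u].add(v)
        adj[v].add(u)
    # Each triangle is seen once per participating edge, hence // 3.
    triangles = sum(len(adj[u] & adj[v]) for u, v in edges) // 3
    density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
    return {"mean_deg": sum(deg) / n, "max_deg": max(deg),
            "density": density, "triangles": triangles}

# A triangle (0-1-2) with a pendant node 3 attached to node 2.
prompt = structural_prompt([(0, 1), (1, 2), (0, 2), (2, 3)], n=4)
print(prompt)
```

Because none of these quantities depend on node identities or feature schemas, the same summary can be computed for any graph, which is what makes the shared representation space possible.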
[AI-66] SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio
【速读】:该论文旨在解决推理型语言模型在实际部署中不确定性估计困难的问题,尤其是在无法访问模型内部信息(如logits或中间token概率)的专有推理API场景下,传统基于采样的方法计算成本过高,而单次通过的代理指标(如口头置信度或轨迹长度)又往往不一致。解决方案的关键在于提出SELFDOUBT框架,其核心是利用推理轨迹本身提取行为信号——特别是“ Hedge-to-Verify Ratio (HVR)”,该比率能够识别推理过程中是否存在不确定性标记,并判断这些标记是否被显式的自我验证行为所抵消。与依赖多次采样或模型内部结构的方法不同,SELFDOUBT仅需单次推理轨迹即可实现高效、可靠的不确定性评估,适用于延迟和成本敏感的生产环境。
链接: https://arxiv.org/abs/2604.06389
作者: Satwik Pandey,Suresh Raghu,Shashwat Pandey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 4 tables, plus appendix. Submitted to COLM 2026
Abstract:Uncertainty estimation for reasoning language models remains difficult to deploy in practice: sampling-based methods are computationally expensive, while common single-pass proxies such as verbalized confidence or trace length are often inconsistent across models. This problem is compounded for proprietary reasoning APIs that expose neither logits nor intermediate token probabilities, leaving practitioners with no reliable uncertainty signal at inference time. We propose SELFDOUBT, a single-pass uncertainty framework that resolves this impasse by extracting behavioral signals directly from the reasoning trace itself. Our key signal, the Hedge-to-Verify Ratio (HVR), detects whether a reasoning trace contains uncertainty markers and, if so, whether they are offset by explicit self-checking behavior. Unlike methods that require multiple sampled traces or model internals, SELFDOUBT operates on a single observed reasoning trajectory, making it suitable for latency- and cost-constrained deployment over any proprietary API. We evaluate SELFDOUBT across seven models and three multi-step reasoning benchmarks (BBH, GPQA-Diamond, and MMLU-Pro). Most notably, traces containing no hedging markers are correct 96% of the time, revealing an emergent high-precision confidence gate at zero additional cost. For the remaining cases, the full SELFDOUBT score significantly outperforms sampling-based semantic entropy at 10x lower inference cost. A deployment cascade combining both stages attains 90% accuracy at 71% coverage without any task-specific labels. These results establish SELFDOUBT as a scalable, production-ready foundation for uncertainty estimation over proprietary reasoning models.
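One plausible reading of the HVR is sketched below, with invented marker lexicons; the paper's exact markers and scoring are not reproduced here.

```python
HEDGES   = ("maybe", "i think", "not sure", "possibly", "might be")
VERIFIES = ("let me check", "double-check", "verify", "recompute")

def hedge_to_verify_ratio(trace):
    t = trace.lower()
    hedges = sum(t.count(m) for m in HEDGES)
    checks = sum(t.count(m) for m in VERIFIES)
    if hedges == 0:
        return 0.0                 # no hedging: the high-precision gate
    return hedges / (1 + checks)   # hedges offset by self-checking

print(hedge_to_verify_ratio("The answer is 42."))                  # 0.0
print(hedge_to_verify_ratio("Maybe 7? Let me check... yes, 7."))   # 0.5
```

Note the design choice mirrored from the abstract: a hedge-free trace short-circuits to the most confident score, and any hedging is discounted by observed verification behavior.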
[AI-67] Uncertainty Estimation for Deep Reconstruction in Aquatic Disaster Scenarios with Autonomous Vehicles
【速读】:该论文旨在解决从稀疏机载观测中准确重建环境标量场(scalar field)并同时进行不确定性分解的问题,这对于执行水下监测任务的自主车辆至关重要。其核心挑战在于如何在真实传感器模态下实现高精度场重建与可解释的不确定性量化,以支持信息驱动的路径规划等主动感知策略。解决方案的关键在于比较多种不确定性量化方法(包括高斯过程、蒙特卡洛丢弃、深度集成和证据深度学习),发现证据深度学习(Evidential Deep Learning)在所有传感器配置下均实现了最优的重建精度与不确定性校准,且推理成本最低,从而成为实时自主系统部署中的首选方案。
链接: https://arxiv.org/abs/2604.06387
作者: Samuel Yanes Luis,Alejandro Casado Pérez,Alejandro Mendoza Barrionuevo,Dame Seck Diop,Sergio Toral Marín,Daniel Gutiérrez Reina
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate reconstruction of environmental scalar fields from sparse onboard observations is essential for autonomous vehicles engaged in aquatic monitoring. Beyond point estimates, principled uncertainty quantification is critical for active sensing strategies such as Informative Path Planning, where epistemic uncertainty drives data collection decisions. This paper compares Gaussian Processes, Monte Carlo Dropout, Deep Ensembles, and Evidential Deep Learning for simultaneous scalar field reconstruction and uncertainty decomposition under three perceptual models representative of real sensor modalities. Results show that Evidential Deep Learning achieves the best reconstruction accuracy and uncertainty calibration across all sensor configurations at the lowest inference cost, while Gaussian Processes are fundamentally limited by their stationary kernel assumption and become intractable as observation density grows. These findings support Evidential Deep Learning as the preferred method for uncertainty-aware field reconstruction in real-time autonomous vehicle deployments.
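Assuming a standard deep evidential regression head (Amini et al.-style Normal-Inverse-Gamma outputs; the abstract does not specify the paper's exact parameterization), the uncertainty decomposition is:

```python
def decompose_uncertainty(nu, alpha, beta):
    """NIG head: aleatoric = E[sigma^2] = beta / (alpha - 1),
    epistemic = Var[mu] = beta / (nu * (alpha - 1))."""
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return aleatoric, epistemic

a, e = decompose_uncertainty(nu=2.0, alpha=3.0, beta=4.0)
print(a, e)   # 2.0 1.0
```

The epistemic term is what an Informative Path Planning loop would use to decide where to sample next.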
[AI-68] The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
【速读】:该论文旨在解决模型能力在不同规模模型间迁移的问题,即如何在不进行额外训练的情况下将后训练(post-trained)的能力从一个模型转移到另一个模型。其核心挑战在于识别并提取具有可转移性的行为方向,从而实现跨模型的能力复用。解决方案的关键在于提出“主键假设”(Master Key Hypothesis),认为模型能力对应于低维潜在子空间中的特定方向,这些方向可通过线性对齐方式在不同模型间传递;基于此假设,作者设计了UNLOCK框架——一种无需训练且无标签的迁移方法,通过对比源模型中具备与不具备某能力的激活差异来提取能力方向,并利用低秩线性变换将其对齐至目标模型,在推理阶段直接应用以激发相应行为。实验表明,该方法在链式思维(Chain-of-Thought, CoT)和数学推理等任务上显著提升性能,证明了潜在能力方向的可迁移性及其放大效应。
链接: https://arxiv.org/abs/2604.06377
作者: Rishab Balasubramanian,Pin-Jie Lin,Rituraj Sharma,Anjie Fang,Fardin Abdi,Viktor Rozgic,Zheng Du,Mohit Bansal,Tu Vu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate whether post-trained capabilities can be transferred across models without retraining, with a focus on transfer across different model scales. We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments on reasoning behaviors, including Chain-of-Thought (CoT) and mathematical reasoning, demonstrate substantial improvements across model scales without training. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B yields an accuracy gain of 12.1% on MATH, and transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing the 67.8% achieved by the 14B post-trained model. Our analysis shows that the success of transfer depends on the capabilities learned during pre-training, and that our intervention amplifies latent capabilities by sharpening the output distribution toward successful reasoning trajectories.
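A toy version of the extract-then-align recipe: the capability direction is a mean activation difference in the source model, and a linear map (here a fixed toy matrix standing in for the learned low-rank alignment) carries it into the target space.

```python
def mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def capability_direction(present_acts, absent_acts):
    """Mean activation difference: capability-present minus capability-absent."""
    return [p - a for p, a in zip(mean(present_acts), mean(absent_acts))]

def apply_map(matrix, vec):
    """Matrix-vector product standing in for the learned alignment."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

present = [[1.0, 2.0], [1.0, 2.2]]   # toy source-model activations
absent  = [[1.0, 0.0], [1.0, 0.2]]
d_src = capability_direction(present, absent)   # ~[0.0, 2.0]
W = [[2.0, 0.0], [0.0, 0.5]]                    # fixed toy alignment map
d_tgt = apply_map(W, d_src)                     # ~[0.0, 1.0]
```

At inference time, d_tgt would be added to the target model's hidden states to elicit the behavior, as in the steering literature.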
[AI-69] SymptomWise: A Deterministic Reasoning Layer for Reliable and Efficient AI Systems
【速读】:该论文旨在解决生成式 AI (Generative AI) 在症状分析系统中面临的可靠性、可解释性不足及幻觉问题,尤其是在安全关键场景下,端到端的生成方法常因缺乏可追溯性而产生不支持或不一致的诊断输出。其解决方案的关键在于提出 SymptomWise 框架,通过将语言理解与诊断推理分离:先利用专家标注的医学知识和确定性规则驱动的推理模块(codex-driven inference)对结构化症状表示进行有限假设空间内的推理,从而生成排序的鉴别诊断;仅在症状提取和可选解释环节使用大语言模型(LLM),避免其直接参与诊断决策。该设计提升了系统的可追溯性、减少无依据结论,并支持模块化评估,初步在儿科神经学病例中验证了有效性。
链接: https://arxiv.org/abs/2604.06375
作者: Isaac Henry,Avery Byrne,Christopher Giza,Ron Henry,Shahram Yazdani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 1 figure,
Abstract:AI-driven symptom analysis systems face persistent challenges in reliability, interpretability, and hallucination. End-to-end generative approaches often lack traceability and may produce unsupported or inconsistent diagnostic outputs in safety-critical settings. We present SymptomWise, a framework that separates language understanding from diagnostic reasoning. The system combines expert-curated medical knowledge, deterministic codex-driven inference, and constrained use of large language models. Free-text input is mapped to validated symptom representations, then evaluated by a deterministic reasoning module operating over a finite hypothesis space to produce a ranked differential diagnosis. Language models are used only for symptom extraction and optional explanation, not for diagnostic inference. This architecture improves traceability, reduces unsupported conclusions, and enables modular evaluation of system components. Preliminary evaluation on 42 expert-authored challenging pediatric neurology cases shows meaningful overlap with clinician consensus, with the correct diagnosis appearing in the top five differentials in 88% of cases. Beyond medicine, the framework generalizes to other abductive reasoning domains and may serve as a deterministic structuring and routing layer for foundation models, improving precision and potentially reducing unnecessary computational overhead in bounded tasks.
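A minimal sketch of deterministic ranking over a finite hypothesis space, with an invented codex; in the system described above, codex entries are expert-curated and the scoring is richer than this overlap fraction.

```python
def rank_differentials(symptoms, codex, top_k=5):
    """Score each diagnosis by the fraction of its codex features observed;
    ties break alphabetically so the output is fully deterministic."""
    scores = {dx: len(symptoms & feats) / len(feats)
              for dx, feats in codex.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]

codex = {"migraine":   {"headache", "photophobia", "nausea"},
         "concussion": {"headache", "dizziness", "confusion"},
         "flu":        {"fever", "cough"}}
top2 = rank_differentials({"headache", "photophobia"}, codex, top_k=2)
print(top2)   # ['migraine', 'concussion']
```

Because no generative model participates in this step, the same inputs always yield the same ranked differential, which is what gives the architecture its traceability.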
[AI-70] Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects
【速读】: This paper addresses the lack of empirical research on current generative AI coding tools (such as Cursor), in particular the knowledge gap around their ability to generate large-scale software systems and the design quality of the resulting code. The core of the solution is a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation. By structuring curated project descriptions, the framework enables Cursor to generate functionally correct large-scale projects (averaging 16,965 lines of code and 114 files) across multiple domains and technology stacks. Static analysis with CodeScene and SonarQube then reveals that, despite good functional correctness, the generated code contains numerous design issues violating principles such as the Single Responsibility Principle (SRP), Separation of Concerns (SoC), and DRY, suggesting that developers must review the output carefully to ensure long-term maintainability.
链接: https://arxiv.org/abs/2604.06373
作者: Syed Mohammad Kashif,Ruiyin Li,Peng Liang,Amjed Tahir,Qiong Feng,Zengyang Li,Mojtaba Shahin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 40 pages, 19 images, 5 tables, Manuscript submitted to a Journal (2026)
Abstract:A new generation of AI coding tools, including AI-powered IDEs equipped with agentic capabilities, can generate code within the context of the project. These AI IDEs are increasingly perceived as capable of producing project-level code at scale. However, there is limited empirical evidence on the extent to which they can generate large-scale software systems and what design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large-scale projects and to evaluate the design quality of projects generated by Cursor. First, we propose a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated project descriptions. We generated 10 projects using Cursor with the FD-HITL framework across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects using two static analysis tools, CodeScene and SonarQube, to detect design issues. We identified 1,305 design issues categorized into 9 categories by CodeScene and 3,193 issues in 11 categories by SonarQube. Our findings show that (1) when used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues and Accessibility Issues; (4) these design issues violate design principles such as SRP, SoC, and DRY. The replication package is at this https URL
[AI-71] WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
【速读】: This paper addresses the absence of systematic evaluation for user-facing website security and privacy tasks (such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions) in existing web-agent benchmarks. Mainstream benchmarks such as WebArena and SafeArena cover only general-purpose performance or safety against malicious actions, and do not reflect what users actually need in privacy-protection scenarios. The key to the proposed WebSP-Eval framework is threefold: 1) a manually crafted dataset of 200 task instances across 28 websites; 2) an account and initial-state management mechanism across runs built on a custom Google Chrome extension; and 3) an automated evaluation module. The framework provides the first quantification of the limitations of current multimodal-LLM-based web agents on security and privacy tasks, revealing that stateful UI elements (such as toggles and checkboxes) are a primary cause of failure, with failure rates above 45%.
链接: https://arxiv.org/abs/2604.06367
作者: Guruprasad Viswanathan Ramesh,Asmit Nayak,Basieem Siddique,Kassem Fawaz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent’s ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models suffer from limited autonomous exploration capabilities to reliably solve website security and privacy tasks, and struggle with specific task categories and websites. Crucially, we identify that stateful UI elements such as toggles and checkboxes are a primary cause of agent failure, failing at a rate of more than 45% in tasks containing these elements across many models.
[AI-72] GS-Surrogate: Deformable Gaussian Splatting for Parameter Space Exploration of Ensemble Simulations
【速读】: This paper tackles the challenge of parameter-space exploration in scientific simulations: enabling flexible and efficient post-hoc visualization without storing the expensive raw data. Existing methods either operate purely in image space without an explicit 3D representation, or rely on Neural Radiance Fields (NeRF), which are computationally expensive and encode all parameter-driven variation in a single implicit field, hindering interactive exploration. The key idea of the proposed GS-Surrogate, a visualization surrogate based on Deformable Gaussian Splatting, is to construct a canonical Gaussian field as the base 3D representation and adapt it through sequential, parameter-conditioned deformations. Explicitly separating simulation-related variation from visualization-specific adjustments enables efficient, controllable adaptation to different visualization tasks (such as isosurface extraction and transfer-function editing), achieving real-time, flexible exploration across both simulation and visualization parameter spaces.
链接: https://arxiv.org/abs/2604.06358
作者: Ziwei Li,Rumali Perera,Angus Forbes,Ken Moreland,Dave Pugmire,Scott Klasky,Wei-Lun Chao,Han-Wei Shen
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:
Abstract:Exploring ensemble simulations is increasingly important across many scientific domains. However, supporting flexible post-hoc exploration remains challenging due to the trade-off between storing the expensive raw data and flexibly adjusting visualization settings. Existing visualization surrogate models have improved this workflow, but they either operate in image space without an explicit 3D representation or rely on neural radiance fields that are computationally expensive for interactive exploration and encode all parameter-driven variations within a single implicit field. In this work, we introduce GS-Surrogate, a deformable Gaussian Splatting-based visualization surrogate for parameter-space exploration. Our method first constructs a canonical Gaussian field as a base 3D representation and adapts it through sequential parameter-conditioned deformations. By separating simulation-related variations from visualization-specific changes, this explicit formulation enables efficient and controllable adaptation to different visualization tasks, such as isosurface extraction and transfer function editing. We evaluate our framework on a range of simulation datasets, demonstrating that GS-Surrogate enables real-time and flexible exploration across both simulation and visualization parameter spaces.
[AI-73] “Dont Be Afraid Just Learn”: Insights from Industry Practitioners to Prepare Software Engineers in the Age of Generative AI
【速读】: This paper addresses the widening disconnect between university curricula and industry expectations for software engineers, driven by the rapid integration of generative AI (GenAI) into software development. The key to the solution is an empirical study (a survey of 51 industry practitioners and 11 in-depth interviews) that identifies industry demand in the GenAI era for new skills (such as prompt engineering and output evaluation), strengthened soft skills (such as problem solving and critical thinking), and traditional technical competencies (such as architecture design and debugging). From these findings the paper derives actionable recommendations for educators, including how to incorporate GenAI into curricula and how to redesign assessment, helping academia better prepare students for modern software engineering environments.
链接: https://arxiv.org/abs/2604.06342
作者: Daniel Otten,Trevor Stalnaker,Nathan Wintersgill,Oscar Chaparro,Denys Poshyvanyk,Douglas Schmidt
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Although tension between university curricula and industry expectations has existed in some form for decades, the rapid integration of generative AI (GenAI) tools into software development has recently widened the gap between the two domains. To better understand this disconnect, we surveyed 51 industry practitioners (software developers, technical leads, upper management, etc.) and conducted 11 follow-up interviews focused on hiring practices, required job skills, perceived shortcomings in university curricula, and views on how university learning outcomes can be improved. Our results suggest that GenAI creates demand for new skills (e.g., prompting and output evaluation), while strengthening the importance of soft skills (e.g., problem solving and critical thinking) and traditional competencies (e.g., architecture design and debugging). We synthesize these findings into actionable recommendations for academia (e.g., how to incorporate GenAI into curricula and evaluation redesign). Our work offers empirical guidance to help educators prepare students for modern software engineering environments.
[AI-74] BiScale-GTR: Fragment-Aware Graph Transformers for Multi-Scale Molecular Representation Learning
【速读】: This paper addresses the limitation that existing Graph Transformers remain dominated by local message passing and lack multi-scale structural awareness, constraining their ability to capture molecular patterns that span scales. The key to the proposed BiScale-GTR framework is combining chemically grounded, high-coverage fragment tokenization with adaptive multi-scale reasoning: it first improves graph BPE (Byte Pair Encoding) to produce consistent and chemically valid fragment tokens; then, within a parallel GNN-Transformer architecture, it pools atom-level representations into fragment-level embeddings and fuses them with fragment token embeddings before Transformer-based global modeling, jointly capturing local chemical environments, substructure-level motifs, and long-range molecular dependencies.
链接: https://arxiv.org/abs/2604.06336
作者: Yi Yang,Ovidiu Daescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Transformers have recently attracted attention for molecular property prediction by combining the inductive biases of graph neural networks (GNNs) with the global receptive field of Transformers. However, many existing hybrid architectures remain GNN-dominated, causing the resulting representations to remain heavily shaped by local message passing. Moreover, most existing methods operate at only a single structural granularity, limiting their ability to capture molecular patterns that span multiple molecular scales. We introduce BiScale-GTR, a unified framework for self-supervised molecular representation learning that combines chemically grounded fragment tokenization with adaptive multi-scale reasoning. Our method improves graph Byte Pair Encoding (BPE) tokenization to produce consistent, chemically valid, and high-coverage fragment tokens, which are used as fragment-level inputs to a parallel GNN-Transformer architecture. Architecturally, atom-level representations learned by a GNN are pooled into fragment-level embeddings and fused with fragment token embeddings before Transformer reasoning, enabling the model to jointly capture local chemical environments, substructure-level motifs, and long-range molecular dependencies. Experiments on MoleculeNet, PharmaBench, and the Long Range Graph Benchmark (LRGB) demonstrate state-of-the-art performance across both classification and regression tasks. Attribution analysis further shows that BiScale-GTR highlights chemically meaningful functional motifs, providing interpretable links between molecular structure and predicted properties. Code will be released upon acceptance.
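The pooling-and-fusion step before Transformer reasoning can be illustrated with a minimal sketch; the mean pooling, the concatenation fusion, and all shapes are assumptions for illustration, not the paper's exact operators:

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, d = 6, 16

# Atom-level embeddings as they might come out of a GNN (random stand-ins).
atom_emb = rng.normal(size=(n_atoms, d))

# Fragment membership: which atoms belong to each fragment token.
fragments = [[0, 1, 2], [3, 4], [5]]

# Pool atoms into fragment-level embeddings (mean pooling as one choice).
pooled = np.stack([atom_emb[idx].mean(axis=0) for idx in fragments])

# Fuse with fragment *token* embeddings (e.g., from a learned vocabulary)
# before the Transformer; concatenation is one simple fusion operator.
token_emb = rng.normal(size=(len(fragments), d))
fused = np.concatenate([pooled, token_emb], axis=-1)
```

The fused sequence (one row per fragment) is what a Transformer would then attend over globally.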
[AI-75] A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech ICASSP
【速读】: This paper targets the long-standing problem of speaker drift in text-to-speech (TTS) synthesis: a subtle, gradual shift in perceived speaker identity within a single utterance that undermines the coherence and naturalness of synthetic speech, especially in long-form or interactive settings. The key to the solution is the first automated detection framework, which formulates speaker drift as a binary classification task over utterance-level speaker consistency: it computes cosine similarity between overlapping segments of synthesized speech, extracts speaker embeddings, and prompts large language models (LLMs) with structured representations to reason about whether drift has occurred. The method combines the geometric clustering of speaker embeddings on the unit sphere with the perceptual reasoning of LLMs, bridging signal analysis and semantic understanding, and it provides theoretical guarantees along with a high-quality, human-validated benchmark for evaluation.
链接: https://arxiv.org/abs/2604.06327
作者: Jia-Hong Huang,Seulgi Kim,Yi Chieh Liu,Yixian Shen,Hongyi Zhu,Prayag Tiwari,Stevan Rudinac,Evangelos Kanoulas
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: The paper has been accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026
Abstract:Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
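A minimal sketch of the cosine-similarity stage, assuming unit-normalized per-segment speaker embeddings and a simple consecutive-segment comparison (the segmentation scheme, embedder, and decision threshold are illustrative, not the paper's):

```python
import numpy as np

def drift_score(embeddings):
    """Given per-segment speaker embeddings (rows), return the minimum
    cosine similarity between consecutive segments; low values suggest
    a drift in perceived speaker identity within the utterance."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # project to unit sphere
    sims = np.sum(e[:-1] * e[1:], axis=1)             # pairwise cosine
    return float(sims.min())

# Toy 2-D embeddings: a stable utterance vs. one whose identity rotates away.
stable = [[1.0, 0.0], [0.99, 0.14], [1.0, 0.02]]
drifting = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
```

A binary drift label would then be `drift_score(emb) < tau` for some calibrated threshold `tau`; the framework additionally hands the structured similarity profile to an LLM for perceptual reasoning.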
[AI-76] Blockchain and AI: Securing Intelligent Networks for the Future
【速读】: This chapter addresses the increasingly complex and sophisticated cybersecurity threats facing intelligent networks under the Internet of Everything (IoE) paradigm, with a focus on the security and resilience of critical infrastructure. The core of the solution is integrating Blockchain and Artificial Intelligence (AI): Blockchain provides decentralized, immutable, and transparent mechanisms for data storage and verification, ensuring data integrity and trust, while AI enables proactive threat identification and response through predictive analytics, anomaly detection, and adaptive defense. Together they form robust, adaptive, and trustworthy security frameworks, further enhanced by large language models (LLMs) for threat intelligence and by controlled agentic AI for alert triage, evidence collection, and policy-aware response workflows, significantly improving the resilience and adaptability of intelligent networks.
链接: https://arxiv.org/abs/2604.06323
作者: Joy Dutta,Hossien B. Eldeeb,Tu Dac Ho
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of intelligent networks under the Internet of Everything (IoE) paradigm is transforming connectivity by integrating people, processes, data, and things. This ecosystem includes domains such as the Internet of Things (IoT), Internet of Healthcare (IoH), Internet of Vehicles (IoV), and cyber-physical and human-machine systems. While enabling efficiency and automation, this interconnectivity also exposes critical infrastructures to increasingly sophisticated cyber threats, creating an urgent need for advanced security solutions. This chapter examines the integration of Blockchain and Artificial Intelligence (AI) as complementary approaches for securing intelligent networks. Blockchain provides decentralized, immutable, and transparent mechanisms that strengthen data integrity, trust, and accountability. In parallel, AI offers predictive analytics, anomaly detection, and adaptive defense capabilities to enable proactive threat identification and mitigation. The chapter discusses how Blockchain supports security in cyber-physical systems, how AI enables proactive security operations, and how their combination creates robust, adaptive, and trustworthy security frameworks. The chapter also explores the emerging role of large language models in threat intelligence and analyzes how controlled agentic AI can support bounded security workflows such as alert triage, evidence collection, and policy-aware response planning. Representative case studies illustrate the potential of these technologies to enhance cyber resilience. Finally, challenges related to scalability, energy efficiency, and ethical considerations are addressed, along with reported mitigation strategies and future research directions. Overall, this chapter provides researchers, practitioners, and policymakers with insights to design secure, resilient, and adaptable intelligent networks.
[AI-77] TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models
【速读】: This paper addresses the unstable routing and expert dominance caused by the independence assumption among experts in existing Mixture-of-Experts (MoE) extensions of low-rank adaptation (LoRA). The key to the proposed TalkLoRA, a communication-aware MoE-LoRA framework, is relaxing this assumption by introducing expert-level communication before routing: each low-rank expert is equipped with a lightweight Talking Module that exchanges information across expert subspaces in a controlled way, producing a more robust global signal for routing decisions. Theoretically, the design mitigates perturbation amplification and strictly generalizes existing MoE-LoRA architectures; empirically, it consistently outperforms vanilla LoRA and MoE-LoRA across diverse language understanding and generation tasks, achieving higher parameter efficiency and more balanced expert activation under comparable parameter budgets.
链接: https://arxiv.org/abs/2604.06291
作者: Lin Mu,Haiyang Wang,Li Ni,Lei Sang,Zhize Wu,Peiquan Jin,Yiwen Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of Large Language Models (LLMs), and recent Mixture-of-Experts (MoE) extensions further enhance flexibility by dynamically combining multiple LoRA experts. However, existing MoE-augmented LoRA methods assume that experts operate independently, often leading to unstable routing and expert dominance. In this paper, we propose TalkLoRA, a communication-aware MoELoRA framework that relaxes this independence assumption by introducing expert-level communication prior to routing. TalkLoRA equips low-rank experts with a lightweight Talking Module that enables controlled information exchange across expert subspaces, producing a more robust global signal for routing. Theoretically, we show that expert communication smooths routing dynamics by mitigating perturbation amplification while strictly generalizing existing MoELoRA architectures. Empirically, TalkLoRA consistently outperforms vanilla LoRA and MoELoRA across diverse language understanding and generation tasks, achieving higher parameter efficiency and more balanced expert routing under comparable parameter budgets. These results highlight structured expert communication as a principled and effective enhancement for MoE-based parameter-efficient adaptation. Code is available at this https URL.
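One plausible reading of the pre-routing communication step, sketched in NumPy; the shapes, the additive mixing of routing signals, and all random weights are illustrative assumptions rather than the paper's exact Talking Module:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 4

# Low-rank experts: each contributes B[i] @ A[i] @ x to the adapted output.
A = rng.normal(size=(n_experts, r, d)) * 0.1
B = rng.normal(size=(n_experts, d, r)) * 0.1
W_talk = rng.normal(size=(n_experts, n_experts)) * 0.1  # "Talking Module"
W_gate = rng.normal(size=(n_experts, d)) * 0.1          # router

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def talklora_forward(x):
    per_expert = np.stack([B[i] @ (A[i] @ x) for i in range(n_experts)])
    logits = W_gate @ x                # raw per-expert routing signal
    mixed = logits + W_talk @ logits   # experts exchange information pre-routing
    gates = softmax(mixed)
    return (gates[:, None] * per_expert).sum(axis=0), gates

x = rng.normal(size=d)
delta, gates = talklora_forward(x)
```

With `W_talk = 0` this reduces to a plain MoE-LoRA router, matching the claim that the framework strictly generalizes the communication-free case.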
[AI-78] ClawLess: A Security Model of AI Agents
【速读】: This paper addresses the serious security risks posed by autonomous AI agents driven by large language models, whose ability to retrieve information and execute code at runtime creates a broad attack surface. Existing approaches regulate agent behavior through training or prompting, which offers no fundamental security guarantees. The key to the proposed ClawLess framework is enforcing formally verified security policies under a worst-case threat model in which the agent itself may be adversarial: it models system entities, trust scopes, and permissions at fine granularity to express dynamic policies that adapt to the agent's runtime behavior, and translates these policies into concrete syscall-interception rules enforced by a BPF-augmented user-space kernel, ensuring security regardless of the agent's internal design.
链接: https://arxiv.org/abs/2604.06284
作者: Hongyi Lu,Nian Liu,Shuai Wang,Fengwei Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous AI agents powered by Large Language Models can reason, plan, and execute complex tasks, but their ability to autonomously retrieve information and run code introduces significant security risks. Existing approaches attempt to regulate agent behavior through training or prompting, which does not offer fundamental security guarantees. We present ClawLess, a security framework that enforces formally verified policies on AI agents under a worst-case threat model where the agent itself may be adversarial. ClawLess formalizes a fine-grained security model over system entities, trust scopes, and permissions to express dynamic policies that adapt to agents’ runtime behavior. These policies are translated into concrete security rules and enforced through a user-space kernel augmented with BPF-based syscall interception. This approach bridges the formal security model with practical enforcement, ensuring security regardless of the agent’s internal design.
[AI-79] Towards the Development of an LLM-Based Methodology for Automated Security Profiling in Compliance with Ukrainian Cybersecurity Regulations
【速读】: This paper addresses how to achieve efficient cybersecurity compliance management under high-intensity hybrid threats, where traditional compliance models struggle to adapt to dynamic risk environments. The key to the solution is a framework based on large language models (LLMs) enhanced with Retrieval-Augmented Generation (RAG): by integrating a vector database of national regulations and organizational policies, it automates the development of target security profiles, reducing manual complexity, minimizing human error, and ensuring alignment between technical controls and legal requirements.
链接: https://arxiv.org/abs/2604.06274
作者: Daniil Shafranskyi,Iryna Stopochkina,Mykola Ilin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures
Abstract:In recent years, the pace of development of information technology in various areas has increased drastically, forcing cybersecurity specialists to constantly review existing processes in order to prevent unauthorized access to confidential information. Using Ukraine as a primary case study, this paper explores the integration of international best practices, specifically ISO/IEC 27001 and the NIST Cybersecurity Framework, into national regulatory systems. A focus is placed on the transition from traditional compliance models to risk-based approaches, exemplified by the recent adoption of the Ukrainian normative documents. Furthermore, we propose a methodology for automating the development of target security profiles using Large Language Models (LLMs) enhanced by Retrieval-Augmented Generation (RAG). By integrating a vector database of national regulations and organizational policies, the proposed RAG-based advisor reduces manual complexity, minimizes human error, and ensures alignment between technical controls and legal requirements. This study contributes to the field by providing a structured workflow for AI-assisted cybersecurity management in environments characterized by high-intensity hybrid threats.
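The retrieval half of such a RAG advisor can be sketched as follows; the bag-of-words embedder, the toy regulation snippets, and the prompt template are stand-ins for a real embedding model and vector database:

```python
import numpy as np

# Toy regulation snippets standing in for a vector database of national
# regulations and organizational policies.
DOCS = [
    "access control requirements for critical infrastructure",
    "incident reporting timelines under national regulation",
    "encryption standards for data at rest",
]

def bow(text, vocab):
    """Normalized bag-of-words vector over a shared vocabulary."""
    toks = text.lower().split()
    v = np.array([toks.count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, docs=DOCS, k=1):
    vocab = sorted({w for t in docs + [query] for w in t.lower().split()})
    q = bow(query, vocab)
    scores = [float(q @ bow(d, vocab)) for d in docs]
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("incident reporting timelines")
prompt = f"Using only this regulation context:\n{context[0]}\nDraft the control."
```

In the full pipeline the retrieved context grounds the LLM's generated security-profile text, which is what keeps technical controls aligned with the legal source.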
[AI-80] MO-RiskVAE: A Multi-Omics Variational Autoencoder for Survival Risk Modeling in Multiple Myeloma
【速读】: This paper addresses the problem that, in multimodal variational autoencoders (VAEs) for survival risk modeling in multiple myeloma, standard latent regularization strategies fail to preserve prognostically relevant variation, yielding unstable or over-constrained representations. The key to the solution is systematically isolating and evaluating three core factors of latent-space design: regularization scale, posterior geometry, and latent structure. The study finds that survival-driven training is sensitive primarily to the magnitude and structure of latent regularization rather than the specific divergence formulation; moderately relaxing KL regularization and introducing a Gumbel-Softmax-based hybrid continuous-discrete latent structure markedly improves the consistency and robustness of survival risk ordering. The resulting model, MO-RiskVAE, achieves better risk stratification than the original MyeVAE without additional supervision or complex training heuristics.
链接: https://arxiv.org/abs/2604.06267
作者: Zixuan Chen,Heng Zhang,YuPeng Qin,WenPeng Xing,Qiang Wang,Da Wang,Changting Lin,Meng Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal variational autoencoders (VAEs) have emerged as a powerful framework for survival risk modeling in multiple myeloma by integrating heterogeneous omics and clinical data. However, when trained under survival supervision, standard latent regularization strategies often fail to preserve prognostically relevant variation, leading to unstable or overly constrained representations. Despite numerous proposed variants, it remains unclear which aspects of latent design fundamentally govern performance in this setting. In this work, we conduct a controlled investigation of latent modeling choices for multimodal survival prediction within a unified extension of the MyeVAE framework. By systematically isolating regularization scale, posterior geometry, and latent space structure under identical architectures and optimization protocols, we show that survival-driven training is primarily sensitive to the magnitude and structure of latent regularization rather than the specific divergence formulation. In particular, moderate relaxation of KL regularization consistently improves survival discrimination, while alternative divergence mechanisms such as MMD and HSIC provide limited benefit without appropriate scaling. We further demonstrate that structuring the latent space can improve alignment between learned representations and survival risk gradients. A hybrid continuous–discrete formulation based on Gumbel–Softmax enhances global risk ordering in the continuous latent subspace, even though stable discrete subtype discovery does not emerge under survival supervision. Guided by these findings, we instantiate a robust multimodal survival model, termed MO-RiskVAE, which consistently improves risk stratification over the original MyeVAE without introducing additional supervision or complex training heuristics.
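The Gumbel-Softmax relaxation underlying the hybrid continuous-discrete latent can be sketched directly from its standard definition (the temperature and logits below are arbitrary examples, not values from the paper):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Differentiable relaxation of categorical sampling:
    y = softmax((logits + g) / tau), with g_i ~ Gumbel(0, 1)."""
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=len(logits))))
    z = (np.asarray(logits, dtype=float) + g) / tau
    z = z - z.max()
    y = np.exp(z)
    return y / y.sum()

y = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
# A hybrid latent would concatenate a sample like y (discrete part)
# with a continuous Gaussian code (reparameterized as mu + sigma * eps).
```

As the temperature `tau` shrinks, samples approach one-hot vectors, which is what lets a discrete subtype-like component coexist with a continuous risk-gradient subspace during gradient-based training.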
[AI-81] Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models
【速读】: This paper addresses the growing need for reliable and interpretable intrusion detection in software-defined networking (SDN) environments, and in particular the lack of transparency that hinders the adoption of large language models (LLMs) in security-critical settings. The key to the solution is attribution-driven analysis of encoder-based LLMs operating on flow-level traffic features, which dissects the models' decision process and shows that their judgments rely on meaningful traffic behavior patterns, thereby improving the transparency and trustworthiness of Transformer-based SDN intrusion detection.
链接: https://arxiv.org/abs/2604.06266
作者: Umesh Biswas,Shafqat Hasan,Syed Mohammed Farhan,Nisha Pillai,Charan Gudla
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Software-Defined Networking (SDN) improves network flexibility but also increases the need for reliable and interpretable intrusion detection. Large Language Models (LLMs) have recently been explored for cybersecurity tasks due to their strong representation learning capabilities; however, their lack of transparency limits their practical adoption in security-critical environments. Understanding how LLMs make decisions is therefore essential. This paper presents an attribution-driven analysis of encoder-based LLMs for network intrusion detection using flow-level traffic features. Attribution analysis demonstrates that model decisions are driven by meaningful traffic behavior patterns, improving transparency and trust in transformer-based SDN intrusion detection. These patterns align with established intrusion detection principles, indicating that LLMs learn attack behavior from traffic dynamics. This work demonstrates the value of attribution methods for validating and trusting LLM-based security analysis.
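As one concrete instance of attribution over flow-level features, a simple occlusion-based attribution (an illustrative choice, not necessarily the method used in the paper) scores each feature by the output drop when it is replaced with a baseline value:

```python
import numpy as np

def occlusion_attribution(model, x, baseline=0.0):
    """Attribute a scalar model output to each input feature by
    measuring the change when that feature is set to a baseline."""
    base = model(x)
    attr = []
    for i in range(len(x)):
        x2 = x.copy()
        x2[i] = baseline
        attr.append(base - model(x2))  # positive = feature pushed score up
    return np.array(attr)

# Toy linear "detector" over three flow features; weights are made up.
w = np.array([2.0, 0.0, -1.0])
model = lambda x: float(w @ x)
attr = occlusion_attribution(model, np.array([1.0, 5.0, 1.0]))
```

For this linear toy the attributions recover `w * x` exactly, which is the sanity check one would want before applying the same probe to an encoder model's malicious-flow score.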
[AI-82] S3: Stratified Scaling Search for Test-Time in Diffusion Language Models
【速读】: This paper studies test-time scaling for diffusion language models (DLMs): how to use additional inference compute to improve generation quality from a fixed model. Naive best-of-K sampling is fundamentally limited because it repeatedly draws from the same base diffusion distribution, whose high-probability regions are often misaligned with high-quality outputs. The key to the proposed S³ (Stratified Scaling Search) is a search guided by a lightweight reference-free verifier that reallocates compute during the denoising process rather than only at the final output: at each step it expands multiple candidate trajectories, evaluates them with the verifier, selectively resamples promising paths, and preserves diversity within the search frontier, thereby approximating a reward-tilted sampling distribution. Without changing the model or decoding schedule, this markedly improves generation quality, with the largest gains on mathematical reasoning tasks.
链接: https://arxiv.org/abs/2604.06260
作者: Ahsan Bilal,Muhammad Ahmed Mohsin,Muhammad Umer,Asad Aali,Muhammad Usman Khanzada,Muhammad Usman Rafique,Zihao He,Emily Fox,Dean F. Hougen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to COLM 2026
Abstract:Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-K sampling is fundamentally limited because it repeatedly draws from the same base diffusion distribution, whose high-probability regions are often misaligned with high-quality outputs. We propose S^3 (Stratified Scaling Search), a classical verifier-guided search method that improves generation by reallocating compute during the denoising process rather than only at the final output stage. At each denoising step, S^3 expands multiple candidate trajectories, evaluates them with a lightweight reference-free verifier, and selectively resamples promising candidates while preserving diversity within the search frontier. This procedure effectively approximates a reward-tilted sampling distribution that favors higher-quality outputs while remaining anchored to the model prior. Experiments with LLaDA-8B-Instruct on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA demonstrate that S^3 consistently improves performance across benchmarks, achieving the largest gains on mathematical reasoning tasks while leaving the underlying model and decoding schedule unchanged. These results show that classical search over denoising trajectories provides a practical mechanism for test-time scaling in DLMs.
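The frontier-expansion loop can be sketched abstractly; the toy "denoising" target, the branch and frontier sizes, and keeping parents in the pool are illustrative choices, not the paper's exact procedure:

```python
import random

def stratified_search(init, expand, verify, steps=20, frontier=4, branch=3, seed=0):
    """Verifier-guided search sketch: each step expands every frontier
    candidate `branch` times, scores the pool with a lightweight
    verifier, and keeps the best `frontier` candidates (parents are
    kept in the pool so the best score never regresses)."""
    rng = random.Random(seed)
    beam = [init]
    for _ in range(steps):
        pool = beam + [expand(c, rng) for c in beam for _ in range(branch)]
        pool.sort(key=verify, reverse=True)
        beam = pool[:frontier]
    return beam[0]

# Toy instance: "denoise" a number toward 10 with random perturbations,
# using negative distance to 10 as the verifier score.
best = stratified_search(
    init=0.0,
    expand=lambda c, rng: c + rng.uniform(-1.0, 2.0),
    verify=lambda c: -abs(c - 10.0),
)
```

Replacing the number with a partially-denoised token sequence and the toy verifier with a reference-free scorer gives the shape of the actual method: compute is spent on intermediate trajectories, not only on finished samples.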
[AI-83] Spectral Edge Dynamics Reveal Functional Modes of Learning
【速读】: This paper seeks to uncover the mechanisms of training dynamics during grokking (the sudden mastery of a task), in particular identifying which update directions dominate learning. The key finding is that these dominant directions, the "spectral edge", are not local structures that standard interpretability tools (head attribution, activation probing, or sparse autoencoders) can capture; instead they correspond to low-dimensional functional modes over the input space. Their structure depends on the algebraic symmetry of the task: for modular addition, all leading directions collapse to a single Fourier mode; for multiplication, a similarly simple structure appears only in the discrete-log basis, yielding a 5.9x concentration improvement; and for complex compositional tasks such as x²+y², richer functional descriptions built from cross-terms of additive and multiplicative features give a 4x variance boost. The work thus argues that training explores low-dimensional functional subspaces over the input domain whose representation is determined by the algebraic structure of the task, offering a new lens on implicit inductive biases in deep learning.
链接: https://arxiv.org/abs/2604.06256
作者: Yongzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 1 figure
Abstract:Training dynamics during grokking concentrate along a small number of dominant update directions – the spectral edge – which reliably distinguishes grokking from non-grokking regimes. We show that standard mechanistic interpretability tools (head attribution, activation probing, sparse autoencoders) fail to capture these directions: their structure is not localized in parameter or feature space. Instead, each direction induces a structured function over the input domain, revealing low-dimensional functional modes invisible to representation-level analysis. For modular addition, all leading directions collapse to a single Fourier mode. For multiplication, the same collapse appears only in the discrete-log basis, yielding a 5.9x improvement in concentration. For subtraction, the edge spans a small multi-mode family. For x^2+y^2 , no single harmonic basis suffices, but cross-terms of additive and multiplicative features provide a 4x variance boost, consistent with the decomposition (a+b)^2 - 2ab. Multitask training amplifies this compositional structure, with the x^2+y^2 spectral edge inheriting the addition circuit’s characteristic frequency (2.3x concentration increase). These results suggest that spectral edge dynamics identify low-dimensional functional subspaces governing learning, whose representation depends on the algebraic structure of the task: simple harmonic structure emerges only when the task admits a symmetry-adapted basis, while more complex tasks require richer functional descriptions.
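The notion of spectral concentration used above (how much energy a single Fourier mode carries) can be made concrete with a small sketch; the rfft-based measure below is an illustrative proxy, not the paper's exact metric:

```python
import numpy as np

def spectral_concentration(values):
    """Fraction of non-DC spectral energy carried by the strongest
    Fourier mode of a function sampled over Z_n: a simple proxy for
    the single-mode collapse described above."""
    coeffs = np.fft.rfft(np.asarray(values, dtype=float))
    power = np.abs(coeffs[1:]) ** 2  # drop the DC term
    return float(power.max() / power.sum())

n = 64
t = np.arange(n)
single_mode = np.cos(2 * np.pi * 5 * t / n)                  # one harmonic
mixed = single_mode + 0.8 * np.cos(2 * np.pi * 11 * t / n)   # two harmonics
```

A function dominated by one harmonic scores near 1.0, while mixing in a second comparable mode drives the score toward a split of the energy, mirroring how the paper compares bases (e.g., raw vs. discrete-log) by concentration gains.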
[AI-84] FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
【速读】: This paper addresses how enterprises in which multiple programming languages coexist can achieve efficient cross-lingual code generation, given that fine-tuning large language models (LLMs) separately for each language is computationally prohibitive. The key to the solution is parameter-efficient fine-tuning (low-rank adaptation, LoRA) that optimizes only a small subset of parameters, combined with optimizer enhancements (comparing Adam and Sophia) and a novel Fourier-domain regularization technique, enabling effective transfer from Python to languages such as Java using a small, high-quality dataset (MBPP) and ultimately reaching 42.1% pass@1 on Java, well above the 34.2% baseline.
链接: https://arxiv.org/abs/2604.06253
作者: Gaurav Narasimhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 19 pages, 25 figures, Stanford CS224N Custom Project
Abstract:Cross-lingual code generation is critical in enterprise environments where multiple programming languages coexist. However, fine-tuning large language models (LLMs) individually for each language is computationally prohibitive. This paper investigates whether parameter-efficient fine-tuning methods and optimizer enhancements can improve cross-lingual transfer from Python to languages like Java. We fine-tune the Code Llama 7B model using low-rank adaptation (LoRA) to optimize a small subset of parameters and compare Adam and Sophia optimizers, while exploring a novel Fourier-based regularization technique. Our contributions include: (1) demonstrating that LoRA fine-tuning on a small, high-quality dataset (MBPP) can exceed the pass@1 performance of the more broadly fine-tuned Code Llama-Python-7B model (40.1% vs. 38.4%); (2) showing that while Sophia achieves faster convergence than Adam, final pass@1 scores show marginal differences; and (3) presenting evidence that Fourier-based regularization during fine-tuning significantly improves cross-lingual transfer, achieving 42.1% pass@1 on Java tasks compared to the 34.2% baseline. These findings suggest that combining LoRA, optimized training methods, and frequency-domain regularization can efficiently adapt single-language LLMs to perform well across multiple programming languages.
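A Fourier-domain regularizer of the general kind described can be sketched as a penalty on high-frequency spectral energy of a weight update; the 2-D FFT, the corner cutoff, and the squared-magnitude norm are all illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def fourier_penalty(delta_w, keep=4):
    """Penalize the spectral energy of a weight-update matrix outside
    its `keep` x `keep` low-frequency corner. The cutoff and norm are
    illustrative choices for a frequency-domain regularizer."""
    F = np.fft.fft2(np.asarray(delta_w, dtype=float))
    mask = np.ones(F.shape, dtype=bool)
    mask[:keep, :keep] = False  # leave the low-frequency corner unpenalized
    return float(np.sum(np.abs(F[mask]) ** 2))

rng = np.random.default_rng(0)
smooth = np.outer(np.linspace(0, 1, 16), np.linspace(0, 1, 16))
noisy = smooth + 0.5 * rng.normal(size=(16, 16))
```

During fine-tuning, a term like `lam * fourier_penalty(B @ A)` added to the loss would bias the LoRA update `B @ A` toward spectrally smooth changes, the kind of bias the paper credits for better cross-lingual transfer.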
[AI-85] Toward Reducing Unproductive Container Moves: Predicting Service Requirements and Dwell Times
【速读】: This paper addresses unproductive container moves at container terminals caused by unanticipated pre-clearance service requirements and long container dwell times. The key to the solution is building and validating machine learning models on historical operational data that predict which containers will require pre-clearance handling before cargo release and estimate how long they are expected to remain in the terminal. By implementing a classification system for cargo descriptions and deduplicating consignee records to improve data consistency and feature quality, the models significantly outperform existing rule-based heuristics and random baselines in both precision and recall, supporting data-driven strategic planning and resource allocation for yard operations.
链接: https://arxiv.org/abs/2604.06251
作者: Elena Villalobos(1),Adolfo De Unánue T.(1),Fernanda Sobrino(1),David Aké(1),Stephany Cisneros(1),Jorge Lecona(2),Alejandra Matadamaz(2) ((1) Tecnológico de Monterrey, Mexico City, Mexico, (2) Container Terminal Operations, Veracruz, Mexico)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注: Preprint, 20 pages, 9 figures, 5 tables (including appendices)
Abstract:This article presents the results of a data science study conducted at a container terminal, aimed at reducing unproductive container moves through the prediction of service requirements and container dwell times. We develop and evaluate machine learning models that leverage historical operational data to anticipate which containers will require pre-clearance handling services prior to cargo release and to estimate how long they are expected to remain in the terminal. As part of the data preparation process, we implement a classification system for cargo descriptions and perform deduplication of consignee records to improve data consistency and feature quality. These predictive capabilities provide valuable inputs for strategic planning and resource allocation in yard operations. Across multiple temporal validation periods, the proposed models consistently outperform existing rule-based heuristics and random baselines in precision and recall. These results demonstrate the practical value of predictive analytics for improving operational efficiency and supporting data-driven decision-making in container terminal logistics.
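摘要提到模型在多个时间验证期(temporal validation periods)上以精确率和召回率对比基线。下面给出一个纯 Python 示意草图(数据结构与窗口长度均为假设,非论文实现),演示滚动式时间划分与精确率/召回率的计算:

```python
# 示意:按月份滚动划分训练/验证窗口,并在每个窗口上计算精确率与召回率
def temporal_splits(records, train_months=6, test_months=1):
    """records: (month, label) 形式的记录列表,按月份组织。"""
    months = sorted({m for m, _ in records})
    for i in range(train_months, len(months) - test_months + 1):
        train_m = set(months[i - train_months:i])
        test_m = set(months[i:i + test_months])
        yield ([r for r in records if r[0] in train_m],
               [r for r in records if r[0] in test_m])

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

records = [(m, m % 2) for m in range(1, 9)]         # 8 个月的玩具数据
n_folds = sum(1 for _ in temporal_splits(records))  # 滚动验证期数量
prec, rec = precision_recall([1, 0, 1, 1], [1, 1, 0, 1])
```

滚动(而非随机)划分可避免用未来数据预测过去,这正是此类运营预测任务采用时间验证的原因。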
[AI-86] SALLIE: Safeguarding Against Latent Language Image Exploits
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)在面对文本与视觉越狱攻击(textual and visual jailbreaks)及提示注入攻击(prompt injections)时的脆弱性问题,同时克服现有防御方法因复杂输入变换导致性能下降或无法统一处理多模态威胁的局限。其解决方案的关键在于提出一种轻量级运行时检测框架 SALLIE(Safeguarding Against Latent Language Image Exploits),该框架基于机制可解释性(mechanistic interpretability),无需修改模型架构即可在标准 token-level 融合流水线中无缝集成;通过提取模型内部残差流激活(residual stream activations),利用 k-近邻(k-NN)分类器计算逐层恶意分数,并由层集成模块聚合预测结果,在不牺牲推理效率的前提下实现对多模态攻击的统一、高效防御。
链接: https://arxiv.org/abs/2604.06247
作者: Guy Azov,Ofer Rivlin,Guy Shtar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 7 tables. Preprint under review
Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to textual and visual jailbreaks, as well as prompt injections (arXiv:2307.15043, Greshake et al., 2023, arXiv:2306.13213). Existing defenses often degrade performance through complex input transformations or treat multimodal threats as isolated problems (arXiv:2309.00614, arXiv:2310.03684, Zhang et al., 2025). To address the critical gap for a unified, modal-agnostic defense that mitigates both textual and visual threats simultaneously without degrading performance or requiring architectural modifications, we introduce SALLIE (Safeguarding Against Latent Language Image Exploits), a lightweight runtime detection framework rooted in mechanistic interpretability (Lindsey et al., 2025, Ameisen et al., 2025). By integrating seamlessly into standard token-level fusion pipelines (arXiv:2306.13549), SALLIE extracts robust signals directly from the model’s internal activations. At inference, SALLIE defends via a three-stage architecture: (1) extracting internal residual stream activations, (2) calculating layer-wise maliciousness scores using a K-Nearest Neighbors (k-NN) classifier, and (3) aggregating these predictions via a layer ensemble module. We evaluate SALLIE on compact, open-source architectures - Phi-3.5-vision-instruct (arXiv:2404.14219), SmolVLM2-2.2B-Instruct (arXiv:2504.05299), and gemma-3-4b-it (arXiv:2503.19786) - prioritized for practical inference times and real-world deployment costs. Our comprehensive evaluation pipeline spans over ten datasets and more than five strong baseline methods from the literature, and SALLIE consistently outperforms these baselines across a wide range of experimental settings.
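下面以 numpy 给出 SALLIE 三阶段检测流程的简化示意——逐层提取激活、用 k-NN 计算逐层恶意分数、跨层集成;激活为随机模拟数据,k 值与"简单平均"集成方式均为示例假设,并非论文实现:

```python
import numpy as np

def layer_knn_score(act, bank_acts, bank_labels, k=5):
    """act: (d,) 某层残差流激活;bank_*: 已标注参考激活库。"""
    d2 = np.sum((bank_acts - act) ** 2, axis=1)
    nn = np.argsort(d2)[:k]                  # k 个最近邻
    return float(np.mean(bank_labels[nn]))   # 近邻中恶意样本比例

def ensemble_score(acts_per_layer, banks):
    scores = [layer_knn_score(a, bx, by)
              for a, (bx, by) in zip(acts_per_layer, banks)]
    return float(np.mean(scores))            # 层集成:简单平均

rng = np.random.default_rng(1)
n_layers, d = 4, 32
banks = []
for _ in range(n_layers):
    benign = rng.normal(0.0, 1.0, size=(50, d))
    malicious = rng.normal(3.0, 1.0, size=(50, d))  # 假设恶意激活分布偏移
    banks.append((np.vstack([benign, malicious]),
                  np.array([0] * 50 + [1] * 50)))

mal_input = [rng.normal(3.0, 1.0, size=d) for _ in range(n_layers)]
ben_input = [rng.normal(0.0, 1.0, size=d) for _ in range(n_layers)]
assert ensemble_score(mal_input, banks) > ensemble_score(ben_input, banks)
```

这类基于激活的检测不修改模型权重,只在推理时旁路读取内部表示,因而是"轻量级运行时"防御。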
[AI-87] Negotiating Privacy with Smart Voice Assistants: Risk-Benefit and Control-Acceptance Tensions
【速读】:该论文旨在解决青少年在使用智能语音助手(Smart Voice Assistants, SVAs)时,隐私决策行为中存在“隐私悖论”——即个体虽意识到隐私风险却仍不采取保护性行为的问题。传统研究多将隐私风险、收益、信任与自我效能等视为独立预测因子,但忽视了这些因素如何协同作用形成更高层次的权衡张力。解决方案的关键在于提出一种基于协商(negotiation)的理论框架,通过构建两个复合指数:风险-收益张力指数(Risk-Benefit Tension Index, RBTI)和控制-接受张力指数(Control-Acceptance Tension Index, CATI),量化青少年在复杂隐私压力下的权衡模式,并揭示其与隐私保护行为及SVA使用频率之间的关联,从而为理解青少年在语音交互生态系统中的隐私决策提供新的分析视角与可操作的测量工具。
链接: https://arxiv.org/abs/2604.06235
作者: Molly Campbell,Mohamad Sheikho Al Jasem,Ajay Kumar Shrestha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: To appear in the IEEE CSP 2026 proceedings
Abstract:Smart Voice assistants (SVAs) are widely adopted by youth, yet privacy decision-making in these environments is often characterized by competing considerations rather than clear-cut preferences. While our prior research has examined privacy risks, benefits, trust, and self-efficacy as distinct predictors of behavior, less attention has been paid to how these factors combine into higher-level tension that shapes privacy outcomes. This study introduces a negotiation-based framework for understanding youth privacy decision-making with SVAs by operationalizing two composite indices: the Risk-Benefit Tension Index (RBTI) and the Control-Acceptance Tension Index (CATI), using survey data from 469 Canadian youth aged 16-24. We examine the distribution of these indices and their relationship with privacy-protective behavior and SVA usage. Results show that both indices are meaningfully associated with protective action. Frequent SVA usage exhibits more benefit-dominant and acceptance-leaning negotiation profiles, suggesting that convenience-driven engagement may come at the expense of perceived control. By reframing privacy decision-making as a process of negotiation rather than inconsistency, this study offers a complementary perspective on the privacy paradox and provides a compact measurement approach for capturing how youth navigate competing privacy pressures in voice-enabled ecosystems.
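作为示意,下面给出"张力指数"一种可能的操作化草图——以收益题项均值减去风险题项均值,正值表示收益/接受主导;注意题项内容与聚合公式均为假设,并非论文 RBTI/CATI 的确切定义:

```python
# 示意:把李克特量表题项聚合为一个复合"张力指数"(假设的操作化)
def tension_index(benefit_items, risk_items):
    b = sum(benefit_items) / len(benefit_items)
    r = sum(risk_items) / len(risk_items)
    return b - r   # >0: 收益/接受主导; <0: 风险/控制主导

# 1-5 李克特量表的示例作答
rbti = tension_index(benefit_items=[5, 4, 4], risk_items=[2, 3, 2])
assert rbti > 0    # 收益主导的协商剖面
```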
[AI-88] Blind Refusal: Language Models Refuse to Help Users Evade Unjust Absurd and Illegitimate Rules
【速读】:该论文旨在解决生成式 AI(Generative AI)在面对用户请求绕过规则时表现出的“盲从拒绝”(blind refusal)问题,即模型在未评估规则正当性的情况下一律拒绝协助规避规则的行为,从而导致对不合法、不正义或可合理豁免规则的过度服从。解决方案的关键在于构建一个系统性的实证框架:通过设计涵盖5类规则失效理由(defeat families)与19种权威类型(authority types)的合成案例数据集,并结合自动化质量筛选与人工审核,量化分析不同模型配置对这些情境下的响应模式(帮助、硬性拒绝或回避),并利用盲评的GPT-5.4作为裁判模型判断模型是否识别出规则失效依据。研究发现,尽管多数模型能识别规则缺陷(57.5%),却仍倾向于拒绝提供帮助(75.4%),表明当前模型的拒绝行为与其对规则正当性的规范性推理能力之间存在脱钩现象。
链接: https://arxiv.org/abs/2604.06233
作者: Cameron Pattison,Lorenzo Manuali,Seth Lazar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages body text, 38 pages total, 4 figures
Abstract:Safety-trained language models routinely refuse requests for help circumventing rules. But not all rules deserve compliance. When users ask for help evading rules imposed by an illegitimate authority, rules that are deeply unjust or absurd in their content or application, or rules that admit of justified exceptions, refusal is a failure of moral reasoning. We introduce empirical results documenting this pattern of refusal that we call blind refusal: the tendency of language models to refuse requests for help breaking rules without regard to whether the underlying rule is defensible. Our dataset comprises synthetic cases crossing 5 defeat families (reasons a rule can be broken) with 19 authority types, validated through three automated quality gates and human review. We collect responses from 18 model configurations across 7 families and classify them on two behavioral dimensions – response type (helps, hard refusal, or deflection) and whether the model recognizes the reasons that undermine the rule’s claim to compliance – using a blinded GPT-5.4 LLM-as-judge evaluation. We find that models refuse 75.4% (N=14,650) of defeated-rule requests and do so even when the request poses no independent safety or dual-use concerns. We also find that models engage with the defeat condition in the majority of cases (57.5%) but decline to help regardless – indicating that models’ refusal behavior is decoupled from their capacity for normative reasoning about rule legitimacy.
[AI-89] Ontology-based knowledge graph infrastructure for interoperable atomistic simulation data
【速读】:该论文旨在解决原子尺度模拟数据在复用过程中面临的三大挑战:数据格式异构性、元数据不完整以及工作流和溯源信息缺乏标准化表示。其解决方案的关键在于构建一个基于本体(ontology)的基础设施,将原子尺度模拟数据组织为知识图谱(knowledge graph),通过整合领域本体与软件框架,实现从现有数据集和模拟流程生成点直接捕获数据,并将来自多源的异构数据统一映射到语义一致的本体对齐表示中,从而支持跨数据集的一致查询与分析;同时以机器可读形式表示工作流,支持正向溯源追踪及部分计算过程重建,最终形成包含超过75万条三元组的知识图谱,显著提升原子尺度模拟数据的可发现性(findability)、互操作性(interoperability)和可复用性(reusability)。
链接: https://arxiv.org/abs/2604.06230
作者: Abril Azocar Guzman,Sarath Menon,Tilmann Hickel,Stefan Sandfeld
机构: 未知
类目: Databases (cs.DB); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:
Abstract:The reuse of atomistic simulation data is often limited by heterogeneous formats, incomplete metadata, and a lack of standardized representations of workflows and provenance. Here we present an ontology-based infrastructure for representing and integrating atomistic simulation data as a knowledge graph. The approach combines domain ontologies with a software framework that enables data capture both from existing datasets and directly from simulation workflows at the point of generation. Heterogeneous data from multiple sources are normalized into a common, ontology-aligned representation, enabling consistent querying and analysis across datasets. We demonstrate these capabilities through the integration of grain boundary data, cross-dataset analysis of material properties, and extraction of derived thermodynamic quantities from existing simulations. In addition, workflows are represented in a machine-readable form, enabling both forward provenance tracking and partial reconstruction of computational procedures. The resulting knowledge graph contains over 750,000 triples describing nearly 8,000 computational samples. This work provides a practical framework for improving the findability, interoperability, and reuse of atomistic simulation data.
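下面用纯 Python 示意知识图谱最基本的 (主语, 谓词, 宾语) 三元组表示与模式查询;本体词汇为示例假设,并非论文所用本体:

```python
# 示意:三元组形式的知识图谱与通配符模式查询
triples = [
    ("sample:001", "rdf:type", "onto:AtomisticSample"),
    ("sample:001", "onto:hasDefect", "defect:grain_boundary_1"),
    ("sample:001", "onto:computedBy", "wf:lammps_run_42"),
    ("wf:lammps_run_42", "prov:used", "potential:EAM_Fe"),
]

def query(triples, s=None, p=None, o=None):
    """None 表示通配符,返回所有匹配的三元组。"""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# 查询 sample:001 的全部属性(正向溯源的起点)
assert len(query(triples, s="sample:001")) == 3
```

真实系统通常用 RDF 三元组库配合 SPARQL 查询,这里的列表匹配只是概念示意。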
[AI-90] From experimentation to engagement: on the paradox of participatory AI and power in contexts of forced displacement and humanitarian crises
【速读】:该论文旨在解决全球南方地区(尤其是人道主义危机与被迫流离失所背景下)参与式人工智能(Participatory AI)实践严重不足的问题,以及现有参与式方法在这些敏感环境中可能引发“参与洗白”(participation washing)和算法伤害的风险。解决方案的关键在于:超越对公众AI认知水平的片面关注,深入剖析人道主义领域内固有的权力结构——包括援助接受者、服务提供者、捐助国政府与东道国之间的不平等关系,以及AI企业与人道主义行动者之间的利益差异——并据此提出建立独立治理架构(independent governance architecture),以实现对人道主义人工智能的有效问责与伦理约束。
链接: https://arxiv.org/abs/2604.06219
作者: Stella Suge(Executive Director, FilmAid Kenya),Sarah W. Spencer,Nyalleng Moorosi(Senior Researcher, The Distributed AI Research Institute (DAIR)),Helen McElhinney(Executive Director, The CDAC Network),Geoff Loane(Chair, The CDAC Network),Sue Black(Professor of Computer Science and Technology Evangelist, Durham University)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: This paper was submitted to the ACM FAccT conference in 2025 and is published here as a preprint in March 2026. The research was conducted in December 2024. Since submission, AI deployment across the humanitarian sector has accelerated without commensurate development of independent accountability
Abstract:Across the Global North, calls for participatory artificial intelligence (AI) to improve the responsible, safe, and ethical use of AI have increased, particularly efforts that engage citizens and communities whose well-being and safety may be directly impacted by AI and other algorithmic tools. These initiatives include surveys, community consultations, citizens’ councils and assemblies, and co-designing AI models and projects. Far fewer efforts, however, have been made in the Global South, particularly in contexts related to humanitarian crises and forced displacement, where the deployment of AI and algorithmic tools is accelerating. In this paper, we critically examine participatory AI methods and their limitations in these contexts and explore the opinions and perceptions of AI held by displaced and crisis-affected communities. Based on a pilot exercise with communities living in Kakuma Refugee Camp in northwestern Kenya, we find important limitations in some participatory AI approaches which, if used in humanitarian contexts, could increase risks of so-called ‘participation washing’ and algorithmic harm. We argue that these risks are not predominantly driven by varying levels of understanding and awareness of AI but more closely linked to the fundamental power dynamics embedded within the humanitarian sector: between humanitarian aid recipients, service providers, donor governments, and host nations, as well as the power differentials and incentives that exist between AI companies and humanitarian actors. These structural conditions make the case not only for more rigorous participatory methods, but for independent governance architecture capable of holding humanitarian AI to account.
[AI-91] The End of the Foundation Model Era: Open-Weight Models, Sovereign AI and Inference as Infrastructure
【速读】:该论文试图解决的问题是:在基础模型(Foundation Model)时代(约2020–2025年)结束后,人工智能产业如何重构其经济、技术、商业与政治结构以应对竞争壁垒消失、推理成本趋近于零以及地缘政治监管加强的新现实。解决方案的关键在于识别并推动四大轴心的同步转型——经济上,打破依赖循环融资支撑的估值泡沫;技术上,从预训练扩展范式转向后训练优化与智能体组合;商业上,由应用层集成商取代基础模型公司作为价值中心;政治上,政府通过主权控制生成式AI能力,实现对战略技术的自主掌控。其中最具颠覆性的洞见是:开放权重模型(Open-weight Models)成为主权国家实现技术独立的核心工具,使政府无需依赖供应商政策或人员许可即可自主部署和控制AI能力。
链接: https://arxiv.org/abs/2604.06217
作者: Jared James Grogan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 44 pages, 75 references, 5 endnotes. Version 1.0, events covered through March 9, 2026
Abstract:The foundation model era – roughly 2020 to 2025 – is over. The forces that defined it have inverted. Open source models have reached frontier performance while inference costs approach zero, exposing what was always structurally true: pre-training large language models at scale is not a durable competitive moat. The US government’s formal designation of Anthropic as a supply chain risk in February 2026 accelerated a transition already underway – but did not cause it. The paper argues that the AI industry is restructuring simultaneously along four axes: economic, as the circular financing structure that inflated foundation model valuations collapses; technical, as the pre-training scaling paradigm gives way to post-training optimization and agentic composition; commercial, as application-layer integrators displace the foundation model companies whose commodity they now consume; and political, as the government asserts its historic role as gatekeeper of strategic technology. These are not separate disruptions. They are one structural shift, arriving together. The paper further argues that open-weight models are the counterintuitive instrument of sovereign control: a government that holds the weights commands the capability on its own terms, without dependence on vendor policy, financial continuity, or personnel clearance.
[AI-92] Governing frontier general-purpose AI in the public sector: adaptive risk management and policy capacity under uncertainty through 2030
【速读】:该论文试图解决的是前沿通用人工智能(Frontier General-Purpose Artificial Intelligence)治理的复杂性问题,即如何在技术快速演进但风险认知滞后、政策决策面临高度不确定性的情境下,构建有效的公共治理机制。其核心挑战在于传统静态合规型监管模式难以应对多路径发展轨迹和组织实践、数据结构与公共价值相互嵌套的系统性影响。解决方案的关键在于提出一种以适应性风险管理(Adaptive Risk Management)、情景敏感型监管(Scenario-Aware Regulation)和社会技术转型(Sociotechnical Transformation)为基础的新型治理框架,强调通过能力监测、风险分级、条件控制、制度学习与标准互操作性等要素的集成,实现跨技术未来情境下的稳健治理。
链接: https://arxiv.org/abs/2604.06215
作者: Fabio Correa Xavier
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 7 PAGES, 1 FIGURE
Abstract:The governance of frontier general-purpose artificial intelligence has become a public-sector problem of institutional design, not merely a technical issue of model performance. Recent evidence indicates that AI capabilities are advancing rapidly, though unevenly, while knowledge about harms, safeguards, and effective interventions remains partial and lagged. This combination creates a difficult policy condition: governments must decide under uncertainty, across multiple plausible trajectories of progress through 2030, and in environments where adoption outcomes depend on organizational routines, data arrangements, accountability structures, and public values. This article argues that public governance for frontier AI should be based on adaptive risk management, scenario-aware regulation, and sociotechnical transformation rather than static compliance models. Drawing on the International AI Safety Report 2026, OECD foresight and policy documents, and recent scholarship in digital government, the article first reconstructs the conceptual foundations of the ‘evidence dilemma’, differentiated AI risk categories, and the limits of prediction. It then examines how AI adoption in government depends on organizational redesign, public-sector institutional dynamics, and data collaboration capacity. On that basis, it proposes an adaptive governance framework for public institutions that integrates capability monitoring, risk tiering, conditional controls, institutional learning, and standards-based interoperability. The article concludes that effective AI governance requires stronger policy capacity, clearer allocation of responsibility, and governance mechanisms that remain robust across divergent technological futures.
[AI-93] Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics
【速读】:该论文旨在解决“传感器融合大语言模型(Sensor-Fused LLM agents)”在个人健康与福祉支持中,因前端生物特征翻译(biometric translation)缺乏伦理设计而导致的潜在风险问题,尤其是传感器数据所呈现的“客观性幻觉”可能放大AI幻觉,将错误转化为有害的医疗指令。其解决方案的关键在于提出“伦理前端设计(Ethical Front-End Design for AI)”框架,包含五个核心维度:生物特征披露(Biometric Disclosure)、监测时间性(Monitoring Temporality)、解释框架(Interpretation Framing)、AI立场(AI Stance)和可争议性(Contestability),并进一步提出“自适应披露(Adaptive Disclosure)”作为安全护栏,以管理模型的不完善性,从而保障用户自主权,避免生物反馈循环带来的危害。
链接: https://arxiv.org/abs/2604.06203
作者: Hansoo Lee,Rafael A. Calvo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End
Abstract:The integration of continuous data from built-in sensors and Large Language Models (LLMs) has fueled a surge of “Sensor-Fused LLM agents” for personal health and well-being support. While recent breakthroughs have demonstrated the technical feasibility of this fusion (e.g., Time-LLM, SensorLLM), research primarily focuses on “Ethical Back-End Design for Generative AI”, concerns such as sensing accuracy, bias mitigation in training data, and multimodal fusion. This leaves a critical gap at the front end, where invisible biometrics are translated into language directly experienced by users. We argue that the “illusion of objectivity” provided by sensor data amplifies the risks of AI hallucinations, potentially turning errors into harmful medical mandates. This paper shifts the focus to “Ethical Front-End Design for AI”, specifically, the ethics of biometric translation. We propose a design space comprising five dimensions: Biometric Disclosure, Monitoring Temporality, Interpretation Framing, AI Stance, and Contestability. We examine how these dimensions interact with context (user- vs. system-initiated) and identify the risk of biofeedback loops. Finally, we propose “Adaptive Disclosure” as a safety guardrail and offer design guidelines to help developers manage fallibility, ensuring that these cutting-edge health agents support, rather than destabilize, user autonomy.
[AI-94] Thinking in Graphs with CoMAP: A Shared Visual Workspace for Designing Project-Based Learning
【速读】:该论文旨在解决项目式学习(Project-Based Learning, PBL)设计过程中高度依赖的组件难以有效管理的问题,传统线性工具无法捕捉创意设计的非线性特性,而纯对话式人工智能系统则缺乏支持反思性协作所需的持久共享上下文。解决方案的关键在于提出CoMAP系统,其基于分布式认知理论,采用图结构协作范式,提供一个双模态AI支持的共享可视化工作空间,将人机关系从单一的“提示-响应”循环转变为透明且平等的合作关系,从而显著提升教师的设计表达能力、发散思维与迭代实践水平。
链接: https://arxiv.org/abs/2604.06200
作者: Ruijia Li,Bo Jiang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted by CHI 2026
Abstract:Designing project-based learning (PBL) demands managing highly interdependent components, a task that both traditional linear tools and purely conversational AI struggle with. Traditional tools fail to capture the non-linear nature of creative design, while conversational systems lack the persistent, shared context necessary for reflective collaboration. Grounded in theories of distributed cognition, we introduce CoMAP, a system that embodies a graph-based collaboration paradigm. By providing a shared visual workspace with dual-modality AI support, CoMAP transforms the human-AI relationship from a prompt-and-response loop into a transparent and equitable partnership. Our study with 30 educators shows CoMAP significantly improves teachers’ design expression, divergent thinking, and iterative practice compared to a dialogue-only baseline. These findings demonstrate how a nonlinear, artifact-centric approach can foster trust, reduce cognitive load, and support educators in taking control of their creative process. Our contributions are available at: this https URL.
[AI-95] Concentrated siting of AI data centers drives regional power-system stress under rising global compute demand
【速读】:该论文旨在解决生成式 AI (Generative AI) 快速发展所带来的全球计算需求激增对电力系统造成的压力问题,特别是预测 AI 驱动的数据中心在未来几年内的电力消耗及其对电网稳定性的影响。其解决方案的关键在于构建一个 AI-能源耦合框架(AI-energy coupling framework),该框架融合了基于大语言模型(Large Language Models, LLMs)对企业、政策和媒体数据的定性分析与定量能源系统建模,从而精准预测2025至2030年间全球主要AI数据中心的电力足迹,并识别高风险区域(如俄勒冈州、弗吉尼亚州和爱尔兰)与具备较强负荷吸收能力的地区(如德克萨斯州和日本),为电力系统规划提供科学依据。
链接: https://arxiv.org/abs/2604.06198
作者: Danbo Chen,Zijun Zhou,Yongyang Cai,Jiahong Qin,Ani Katchova,Lei Chen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 32 pages, 8 figures
Abstract:The rapid rise of generative artificial intelligence (AI) is driving unprecedented growth in global computational demand, placing increasing pressure on electricity systems. This study introduces an AI-energy coupling framework that combines large language models (LLMs)-based analysis of corporate, policy, and media data with quantitative energy-system modeling to forecast the electricity footprint of AI-driven data centers from 2025 to 2030. Results show that the new AI infrastructure is highly concentrated in North America, Western Europe, and the Asia-Pacific, which together account for more than 90% of projected compute capacity. Aggregate electricity consumption by the six leading firms is projected to increase from roughly 118 TWh in 2024 to between 239 TWh and 295 TWh by 2030, equivalent to about 1% of global power demand. Regions such as Oregon, Virginia, and Ireland may experience high Power Stress Index (PSI) values exceeding 0.25, indicating local grid vulnerability, whereas diversified systems such as those in Texas and Japan can absorb new loads more effectively. These findings demonstrate that AI infrastructure is evolving from a marginal digital service into a structural component of power-system dynamics, underscoring the need for anticipatory planning that aligns computational growth with renewable expansion and grid resilience.
[AI-96] High-Precision Estimation of the State-Space Complexity of Shogi via the Monte Carlo Method
【速读】:该论文旨在解决将棋(Shogi)状态空间复杂度的精确估算问题,此前的组合估计存在高达五个数量级的误差范围(10⁶⁴ 至 10⁶⁹)。这一巨大差距源于在海量合法棋盘配置中难以区分哪些位置是从初始局面合法可达的。解决方案的关键在于提出一种结合蒙特卡洛采样与新型可达性测试方法的统计估计算法:该方法采用反向搜索策略,不是从任意位置回溯至单一初始状态,而是向一组“仅含双王”(King-King only, KK)的位置进行反向搜索,从而显著降低判断不可达性的搜索开销。基于50亿个样本位置的分析,作者得出Shogi合法状态数为6.55 × 10⁶⁸(置信水平3σ),大幅提升了此前的精度。
链接: https://arxiv.org/abs/2604.06189
作者: Sotaro Ishii,Tetsuro Tanaka
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Preprint submitted to IPSJ Journal of Information Processing
Abstract:Determining the state-space complexity of the game of Shogi (Japanese Chess) has been a challenging problem, with previous combinatorial estimates leaving a gap of five orders of magnitude ( 10^64 to 10^69 ). This large gap arises from the difficulty of distinguishing Shogi positions legally reachable from the initial position among the vast number of valid board configurations. In this paper, we present a high-precision statistical estimation of the number of reachable positions in Shogi. Our method combines Monte Carlo sampling with a novel reachability test that utilizes a reverse search toward a set of “King-King only” (KK) positions, rather than a single-target backward search to the single initial position. This approach significantly reduces the search effort for determining unreachability. Based on a sample of 5 billion positions, we estimated the number of legal positions in Shogi to be 6.55 \times 10^68 (to three significant digits) with a 3\sigma confidence level, substantially improving upon previously known bounds. We also applied this method to Mini Shogi, determining its complexity to be approximately 2.38 \times 10^18 .
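论文的核心是对"可达位置占比"做蒙特卡洛估计并给出 3σ 置信区间。下面的 Python 草图演示这一统计框架——可达性判定器用一个假设的 10% 命中率来模拟,并非论文的逆向搜索实现:

```python
import math, random

# 示意:在大小已知为 N_valid 的合法配置空间中均匀采样,
# 对每个样本做可达性判定,按二项比例给出估计值与 3σ 半宽。
def mc_reachable_estimate(n_samples, is_reachable, n_valid):
    hits = sum(is_reachable() for _ in range(n_samples))
    p = hits / n_samples
    sigma = math.sqrt(p * (1 - p) / n_samples)   # 比例估计的标准误
    return n_valid * p, n_valid * 3 * sigma      # (估计值, 3σ 半宽)

random.seed(0)
est, half = mc_reachable_estimate(
    n_samples=100_000,
    is_reachable=lambda: random.random() < 0.10,  # 假设真实可达率为 0.10
    n_valid=1e10)
```

样本量越大,3σ 区间越窄;论文正是以 50 亿样本把先前相差五个数量级的上下界收紧到三位有效数字。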
[AI-97] Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在教育场景中因支持生成式AI(Generative AI)而引发的学术评估失效问题,尤其是其可能导致学生绕过批判性思维训练、加剧认知卸载(cognitive offloading)的风险。解决方案的关键在于提出一种名为AI-Sinkhole的框架,该框架基于DNS机制,通过集成AI代理增强的动态发现与语义分类能力,实时识别并临时阻断考试期间出现的新兴LLM聊天服务,从而保障在线考试的公平性与学术严谨性。其核心技术包括利用量化后的LLM(如LLama 3、DeepSeek-R1、Qwen-3)实现可解释的分类,并结合Pi-Hole实现网络范围内的动态DNS屏蔽,实验表明该方法在跨语言场景下具有稳定的性能(F1-score ≥ 0.83)。
链接: https://arxiv.org/abs/2604.02360
作者: Yonas Kassa,James Bonacci,Ping Wang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: accepted at ITNG 2026
Abstract:The transformative potential of large language models (LLMs) in education, such as improving accessibility and personalized learning, is being eclipsed by significant challenges. These challenges stem from concerns that LLMs undermine academic assessment by enabling bypassing of critical thinking, leading to increased cognitive offloading. This emerging trend stresses the dual imperative of harnessing AI’s educational benefits while safeguarding critical thinking and academic rigor in the evolving AI ecosystem. To this end, we introduce AI-Sinkhole, an AI-agent augmented DNS-based framework that dynamically discovers, semantically classifies, and temporarily blocks, network-wide, emerging LLM chatbot services during proctored exams. AI-Sinkhole offers explainable classification via quantized LLMs (LLama 3, DeepSeek-R1, Qwen-3) and dynamic DNS blocking with Pi-Hole. We also share our observations on using LLMs as explainable classifiers, which achieved robust cross-lingual performance (F1-score ≥ 0.83). To support future research and development in this domain, initial code with a readily deployable ‘AI-Sinkhole’ blocklist is available on this https URL.
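下面给出"考试时段内动态生成 hosts 风格屏蔽列表"(0.0.0.0 映射即 DNS sinkhole)的极简 Python 示意;域名与时段均为假设示例,并非 AI-Sinkhole 或 Pi-Hole 的实际实现:

```python
from datetime import datetime, time

# 假设的已分类 LLM 聊天服务域名(示例,非真实域名)
LLM_DOMAINS = {"chat.example-llm.com", "assistant.example-ai.net"}

def in_exam_window(now, start=time(9, 0), end=time(11, 0)):
    """考试时段判定(时段为示例假设)。"""
    return start <= now.time() <= end

def blocklist_lines(domains, now):
    if not in_exam_window(now):
        return []                           # 考试结束自动解除屏蔽
    return [f"0.0.0.0 {d}" for d in sorted(domains)]

lines = blocklist_lines(LLM_DOMAINS, datetime(2026, 4, 9, 10, 0))
assert lines and lines[0].startswith("0.0.0.0 ")
```

把这些行写入 DNS 过滤器的自定义屏蔽列表,即可实现"临时、全网范围"的阻断;恢复只需清空列表。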
[AI-98] Soft-Quantum Algorithms
【速读】:该论文旨在解决当前量子机器学习中因量子设备训练成本高、保真度低而导致的可扩展性受限问题,尤其针对小规模量子系统(如五比特)在大数据集上的训练效率瓶颈。其解决方案的关键在于提出一种两步训练策略:首先通过引入单一正则化项直接优化矩阵元素以保持软酉(soft-unitary)性质,从而避免传统门分解带来的计算开销;其次通过电路对齐(circuit alignment)步骤将所得软酉映射回具体的门基架构,实现高效且可解释的量子电路训练。该方法在监督分类和强化学习任务中均显著优于传统直接电路训练方式,在保证性能的同时大幅缩短训练时间。
链接: https://arxiv.org/abs/2604.06523
作者: Basil Kyriacou,Mo Kordzanganeh,Maniraman Periyasamy,Alexey Melnikov
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 6 figures, 0 tables
Abstract:Quantum operations on pure states can be fully represented by unitary matrices. Variational quantum circuits, also known as quantum neural networks, embed data and trainable parameters into gate-based operations and optimize the parameters via gradient descent. The high cost of training and low fidelity of current quantum devices, however, restricts much of quantum machine learning to classical simulation. For few-qubit problems with large datasets, training the matrix elements directly, as is done with weight matrices in classical neural networks, can be faster than decomposing data and parameters into gates. We propose a method that trains matrices directly while maintaining unitarity through a single regularization term added to the loss function. A second training step, circuit alignment, then recovers a gate-based architecture from the resulting soft-unitary. On a five-qubit supervised classification task with 1000 datapoints, this two-step process produces a trained variational circuit in under four minutes, compared to over two hours for direct circuit training, while achieving lower binary cross-entropy loss. In a second experiment, soft-unitaries are embedded in a hybrid quantum-classical network for a reinforcement learning cartpole task, where the hybrid agent outperforms a purely classical baseline of comparable size.
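论文的关键是在损失函数中加入单一正则项以保持软酉性质。下面的 numpy 草图演示该正则项 ‖U†U − I‖²_F 的计算;矩阵为随机示例,正则系数 λ 的选取与具体任务损失不在此展示:

```python
import numpy as np

# 示意:软酉正则项 —— 酉矩阵满足 U†U = I,故以 Gram 矩阵偏离单位阵的
# Frobenius 范数平方作为惩罚,加到任务损失上后可直接训练矩阵元素。
def unitarity_penalty(U):
    d = U.shape[0]
    G = U.conj().T @ U                      # Gram 矩阵,严格酉时为 I
    return float(np.linalg.norm(G - np.eye(d)) ** 2)

rng = np.random.default_rng(2)
# 随机酉矩阵(由复矩阵 QR 分解得到)的惩罚应接近 0
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
assert unitarity_penalty(Q) < 1e-18
# 一般实矩阵惩罚为正,训练时 total_loss = task_loss + lam * penalty
M = rng.normal(size=(4, 4))
assert unitarity_penalty(M) > 0.0
```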
[AI-99] DosimeTron: Automating Personalized Monte Carlo Radiation Dosimetry in PET/CT with Agentic AI
【速读】:该论文旨在解决患者特异性蒙特卡洛(Monte Carlo, MC)内照射剂量学在PET/CT检查中自动化程度低、流程繁琐且易受人为干预的问题。为实现这一目标,作者提出并验证了DosimeTron——一个基于代理型人工智能(agentic AI)的系统,其核心创新在于通过GPT-5.2作为推理引擎,并结合四个Model Context Protocol服务器提供的23个工具,实现了DICOM元数据提取、图像预处理、器官分割、MC模拟及剂量报告生成等全流程的自然语言驱动自动化。该方案的关键在于将复杂医学物理任务封装为可调用工具集,并借助大模型的上下文理解与规划能力,在无需人工介入的情况下完成端到端剂量计算,同时保证高精度(Pearson相关系数最高达1.000,平均绝对百分比差异低于5%)和稳定运行(无执行失败或幻觉输出),显著提升了临床可接受的处理效率(平均单例耗时32.3分钟)。
链接: https://arxiv.org/abs/2604.06280
作者: Eleftherios Tzanis,Michail E. Klontzas,Antonios Tzortzakakis
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Purpose: To develop and evaluate DosimeTron, an agentic AI system for automated patient-specific MC internal radiation dosimetry in PET/CT examinations. Materials and Methods: In this retrospective study, DosimeTron was evaluated on a publicly available PSMA-PET/CT dataset comprising 597 studies from 378 male patients acquired on three scanner models (18-F, n = 369; 68-Ga, n = 228). The system uses GPT-5.2 as its reasoning engine and 23 tools exposed via four Model Context Protocol servers, automating DICOM metadata extraction, image preprocessing, MC simulation, organ segmentation, and dosimetric reporting through natural-language interaction. Agentic performance was assessed using diverse prompt templates spanning single-turn instructions of varying specificity and multi-turn conversational exchanges, monitored via OpenTelemetry traces. Dosimetric accuracy was validated against OpenDose3D across 114 cases and 22 organs using Pearson’s r, Lin’s concordance correlation coefficient (CCC), and Bland-Altman analysis. Results: Across all prompt templates and all runs, no execution failures, pipeline errors, or hallucinated outputs were observed. Pearson’s r ranged from 0.965 to 1.000 (median 0.997; all p 0.001) and CCC from 0.963 to 1.000 (median 0.996). Mean absolute percentage difference was below 5% for 19 of 22 organs (median 2.5%). Total per-study processing time (SD) was 32.3 (6.0) minutes. Conclusion: DosimeTron autonomously executed complex dosimetry pipelines across diverse prompt configurations and achieved high dosimetric agreement with OpenDose3D at clinically acceptable processing times, demonstrating the feasibility of agentic AI for patient-specific Monte Carlo dosimetry in PET/CT. 
[AI-100] Plasma GraphRAG: Physics-Grounded Parameter Selection for Gyrokinetic Simulations
【速读】:该论文旨在解决等离子体物理领域中陀螺动力学(gyrokinetic)模拟参数选择依赖人工文献调研而导致效率低下和结果不一致的问题。其核心解决方案是提出 Plasma GraphRAG 框架,该框架将图检索增强生成(Graph Retrieval-Augmented Generation, GraphRAG)与大语言模型(Large Language Models, LLMs)相结合,通过构建领域特定的知识图谱并实现基于图结构的实体与关系检索,使 LLM 能够生成准确且上下文相关的参数推荐。关键创新在于利用知识图谱对科学知识进行结构化组织,从而提升推荐的准确性、可解释性和抗幻觉能力。
链接: https://arxiv.org/abs/2604.06279
作者: Ruichen Zhang,Feda AlMuhisen,Chenguang Wan,Zhisong Qu,Kunpeng Li,Youngwoo Cho,Kyungtak Lim,Virginie Grandgirard,Xavier Garbet
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures
Abstract:Accurate parameter selection is fundamental to gyrokinetic plasma simulations, yet current practices rely heavily on manual literature reviews, leading to inefficiencies and inconsistencies. We introduce Plasma GraphRAG, a novel framework that integrates Graph Retrieval-Augmented Generation (GraphRAG) with large language models (LLMs) for automated, physics-grounded parameter range identification. By constructing a domain-specific knowledge graph from curated plasma literature and enabling structured retrieval over graph-anchored entities and relations, Plasma GraphRAG enables LLMs to generate accurate, context-aware recommendations. Extensive evaluations across five metrics (comprehensiveness, diversity, grounding, hallucination, and empowerment) demonstrate that Plasma GraphRAG outperforms vanilla RAG by over 10% in overall quality and reduces hallucination rates by up to 25%. Beyond enhancing simulation reliability, Plasma GraphRAG offers a methodology for accelerating scientific discovery across complex, data-rich domains.
[AI-101] MAT-Cell: A Multi-Agent Tree-Structured Reasoning Framework for Batch-Level Single-Cell Annotation
[Quick Read]: This paper tackles two core problems in automated cellular reasoning for single-cell analysis: supervised methods fall into the "Reference Trap" and struggle to generalize to out-of-distribution cell states, while large language models (LLMs), lacking biological priors, suffer a "Signal-to-Noise Paradox" and produce spurious associations. The key to the solution is MAT-Cell, a neuro-symbolic reasoning framework that injects symbolic constraints through adaptive Retrieval-Augmented Generation (RAG), grounding neural reasoning in biological axioms to reduce transcriptomic noise, and introduces a dialectic verification mechanism with homogeneous rebuttal agents that audits and prunes reasoning paths, building deductive tree structures that guarantee logical consistency. The approach shifts the paradigm from black-box classification to constructive, verifiable reasoning proofs; on large-scale, cross-species benchmarks it significantly outperforms state-of-the-art models and remains robust in challenging scenarios where baseline methods severely degrade.
Link: https://arxiv.org/abs/2604.06269
Authors: Yehui Yang,Zelin Zang,Changxi Chi,Jingbo Zhou,Xienan Zheng,Yuzhe Jia,Chang Yu,Jinlin Wu,Fuji Yang,Jiebo Luo,Zhen Lei,Stan Z. Li
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:
Abstract:Automated cellular reasoning faces a core dichotomy: supervised methods fall into the Reference Trap and fail to generalize to out-of-distribution cell states, while large language models (LLMs), without grounded biological priors, suffer from a Signal-to-Noise Paradox that produces spurious associations. We propose MAT-Cell, a neuro-symbolic reasoning framework that reframes single-cell analysis from black-box classification into constructive, verifiable proof generation. MAT-Cell injects symbolic constraints through adaptive Retrieval-Augmented Generation (RAG) to ground neural reasoning in biological axioms and reduce transcriptomic noise. It further employs a dialectic verification process with homogeneous rebuttal agents to audit and prune reasoning paths, forming syllogistic derivation trees that enforce logical consistency. On large-scale and cross-species benchmarks, MAT-Cell significantly outperforms state-of-the-art (SOTA) models and maintains robust performance in challenging scenarios where baseline methods severely degrade. Code is available at https://gith this http URL ti-Agent-Tree-Structured-Reasoning-Framework-for-Batch-Level-Single-Cell-Annotation.
[AI-102] ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway ACL2026
[Quick Read]: This paper addresses the lack of mechanistic reasoning in current large language model (LLM) toxicity prediction: models produce fluent explanations that are often biologically unfaithful, making their predictions unreliable. The key to the solution is ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that requires models to reason from the Molecular Initiating Event (MIE) to the Adverse Outcome (AO), integrating experimental drug-target interaction evidence with toxicity labels so that organ-level mechanistic reasoning can be evaluated systematically. With this design, the study reveals a disconnect between predictive performance and reasoning quality, and shows that reasoning-aware training markedly improves both mechanistic reasoning and toxicity prediction accuracy, providing methodological support for trustworthy toxicity modeling.
Link: https://arxiv.org/abs/2604.06264
Authors: Jueon Park,Wonjune Jang,Chanhwi Kim,Yein Park,Jaewoo Kang
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026 Findings
Abstract:Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.
[AI-103] From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
[Quick Read]: This paper addresses the insufficient contextual internalization of complex, heterogeneous clinical records in clinical reasoning: existing approaches (fine-tuning, in-context learning, and retrieval-augmented generation) expose knowledge but struggle to dynamically adjust a model's internal representations to the subtle nuances of individual cases at inference time. The key to the solution is the Dual-Stream Calibration (DSC) framework, which achieves deep internalization via test-time training with two cooperating calibration streams: a Semantic Calibration Stream that stabilizes generative trajectories by minimizing entropy, enforcing deliberative internalization of core evidence, and a Structural Calibration Stream that assimilates latent inferential dependencies through an iterative meta-learning objective, fusing external evidence with internal logic to synthesize a coherent response. DSC shifts the reasoning paradigm from passive attention-based matching to active refinement of the latent inferential space.
Link: https://arxiv.org/abs/2604.06262
Authors: Chuang Zhao,Hongke Zhao,Xiaofang Zhou,Xiaomeng Li
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:
Abstract:Contextual clinical reasoning demands robust inference grounded in complex, heterogeneous clinical records. While state-of-the-art fine-tuning, in-context learning (ICL), and retrieval-augmented generation (RAG) enable knowledge exposure, they often fall short of genuine contextual internalization: dynamically adjusting a model’s internal representations to the subtle nuances of individual cases at inference time. To address this, we propose Dual-Stream Calibration (DSC), a test-time training framework that transcends superficial knowledge exposure to achieve deep internalization during inference. DSC facilitates input internalization by synergistically aligning two calibration streams. Unlike passive context exposure, the Semantic Calibration Stream enforces a deliberative reflection on core evidence, internalizing semantic anchors by minimizing entropy to stabilize generative trajectories. Simultaneously, the Structural Calibration Stream assimilates latent inferential dependencies through an iterative meta-learning objective. By training on specialized support sets at test-time, this stream enables the model to bridge the gap between external evidence and internal logic, synthesizing fragmented data into a coherent response. Our approach shifts the reasoning paradigm from passive attention-based matching to an active refinement of the latent inferential space. Validated against thirteen clinical datasets, DSC demonstrates superiority across three distinct task paradigms, consistently outstripping state-of-the-art baselines ranging from training-dependent models to test-time learning frameworks.
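The Semantic Calibration Stream above "internalizes semantic anchors by minimizing entropy to stabilize generative trajectories." The flavor of such a test-time entropy objective can be sketched in numpy, here reduced to descending the entropy of a softmax over a single confidence scale `s` via finite differences (all names and the scalar parameterization are illustrative assumptions of the sketch, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def entropy_descent_scale(logits, lr=0.5, steps=20, h=1e-4):
    """Test-time entropy minimization over a single confidence scale s:
    descend H(softmax(s * logits)) with a finite-difference gradient."""
    s = 1.0
    for _ in range(steps):
        grad = (entropy(softmax((s + h) * logits))
                - entropy(softmax((s - h) * logits))) / (2.0 * h)
        s -= lr * grad
    return s

logits = np.array([2.0, 1.0, 0.2])   # one step's output scores
s_star = entropy_descent_scale(logits)
```

For any non-uniform logit vector the entropy of softmax(s * logits) decreases as s grows, so the descent sharpens the output distribution, which is the stabilizing effect the stream relies on.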
[AI-104] Learning the Stellar Structure Equations via Self-supervised Physics-Informed Neural Networks
[Quick Read]: This paper targets the high computational cost and poor scalability of traditional stellar structure solvers (such as MESA) for large-scale stellar population synthesis. The central challenge is to model stellar interiors efficiently and scalably while preserving physical accuracy. The key to the solution is a self-supervised physics-informed neural network (PINN) framework that, without any discretization mesh, directly learns continuous radial profiles of interior mass $M_r(r)$, pressure $P(r)$, density $\rho(r)$, temperature $T(r)$, and luminosity $L_r(r)$, and introduces auxiliary neural networks that provide smooth, differentiable approximations of the equation of state and opacity tables, turning the microphysics inputs into end-to-end trainable functions and enabling a fully self-supervised, data-free solution of the stellar structure equations.
Link: https://arxiv.org/abs/2604.06255
Authors: Manuel Ballester,Santiago Lopez-Tapia,Seth Gossage,Patrick Koller,Philipp M. Srivastava,Ugur Demir,Yongseok Jo,Almudena P. Marquez,Christoph Wuersch,Souvik Chakraborty,Vicky Kalogera,Aggelos Katsaggelos
Affiliations: Unknown
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments:
Abstract:Stellar astrophysics relies critically on accurate descriptions of the physical conditions inside stars. Traditional solvers such as MESA (Modules for Experiments in Stellar Astrophysics), which employ adaptive finite-difference methods, can become computationally expensive and challenging to scale for large stellar population synthesis (~10^9 stars). In this work, we present a self-supervised physics-informed neural network (PINN) framework that provides a mesh-free and fully differentiable approach to solving the stellar structure equations under hydrostatic and thermal equilibrium. The model takes as input the stellar boundary conditions (at the center and surface) together with the chemical composition, and learns continuous radial profiles for mass M_r(r), pressure P(r), density \rho(r), temperature T(r), and luminosity L_r(r) by enforcing the governing structure equations through physics-based loss terms. To incorporate realistic microphysics, we introduce auxiliary neural networks that approximate the equation of state and opacity tables as smooth, differentiable functions of the local thermodynamic state. These surrogates replace traditional tabulated inputs and enable end-to-end training. Once trained for a given star, the model produces continuous solutions across the entire radial domain without requiring discretization or interpolation. Validation against benchmark MESA models across a range of stellar masses yields a Mean Relative Absolute Error of 3.06% and an average R^2 score of 99.98%. To our knowledge, this is the first demonstration that the stellar structure equations can be solved in a fully self-supervised and data-free fashion employing PINNs. This work establishes a foundation for scalable, physics-informed emulation of stellar interiors and opens the door to future extensions toward time-dependent stellar evolution.
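The "physics-based loss terms" can be sketched outside any network: evaluate the residuals of hydrostatic equilibrium, dP/dr = -G m rho / r^2, and mass continuity, dm/dr = 4 pi r^2 rho, on candidate profiles. Below is a toy numpy version on a constant-density star, whose profiles are known analytically; natural units G = 1 and finite differences in place of automatic differentiation are assumptions of the sketch, not the paper's setup:

```python
import numpy as np

G = 1.0  # gravitational constant in natural units (sketch assumption)

def structure_loss(r, m, P, rho):
    """PINN-style physics loss: mean squared residuals of hydrostatic
    equilibrium dP/dr = -G m rho / r^2 and mass continuity
    dm/dr = 4 pi r^2 rho, with finite differences standing in for autodiff."""
    res_hydro = np.gradient(P, r) + G * m * rho / r ** 2
    res_mass = np.gradient(m, r) - 4.0 * np.pi * r ** 2 * rho
    return np.mean(res_hydro ** 2) + np.mean(res_mass ** 2)

# Constant-density toy star with analytic solution:
# m(r) = (4/3) pi rho0 r^3,  P(r) = (2/3) pi G rho0^2 (R^2 - r^2)
rho0, R = 1.0, 1.0
r = np.linspace(0.05, R, 400)                    # avoid the r = 0 singularity
m = 4.0 / 3.0 * np.pi * rho0 * r ** 3
P = 2.0 / 3.0 * np.pi * G * rho0 ** 2 * (R ** 2 - r ** 2)
rho = np.full_like(r, rho0)

loss_true = structure_loss(r, m, P, rho)          # near zero for the solution
loss_wrong = structure_loss(r, m, 2.0 * P, rho)   # broken hydrostatic balance
```

The loss vanishes (up to discretization error) on the analytic profiles and grows sharply on profiles that violate the equations, which is exactly the signal a PINN descends.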
[AI-105] Development of ML model for triboelectric nanogenerator based sign language detection system
[Quick Read]: This paper addresses the limitations of vision-based sign language recognition (SLR), which suffers from occlusion, high computational cost, and physical constraints, by proposing a wearable gesture recognition solution based on a triboelectric nanogenerator (TENG) sensor glove. The key is a multi-sensor frequency-domain feature fusion architecture, the MFCC (Mel-Frequency Cepstral Coefficients) CNN-LSTM: frequency-domain features from each flex sensor are extracted by independent convolutional branches and then fused across sensors, recognizing 11 sign classes (digits 1-5 and letters A-F) with 93.33% accuracy and 95.56% precision, far above traditional machine learning algorithms (Random Forest: 70.38%). MFCC maps temporal variations to execution-speed-invariant spectral representations, and data augmentation via time warping and noise injection improves generalization, confirming the advantage of frequency-domain features combined with parallel multi-sensor processing for wearable gesture recognition.
Link: https://arxiv.org/abs/2604.06220
Authors: Meshv Patel,Bikash Baro,Sayan Bayan,Mohendra Roy
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: This paper has been accepted at the IEEE GCON 2026 ( this https URL ) Conference, organized by IIT Guwahati
Abstract:Sign language recognition (SLR) is vital for bridging communication gaps between deaf and hearing communities. Vision-based approaches suffer from occlusion, computational costs, and physical constraints. This work presents a comparison of machine learning (ML) and deep learning models for a custom triboelectric nanogenerator (TENG)-based sensor glove. Utilizing multivariate time-series data from five flex sensors, the study benchmarks traditional ML algorithms, feedforward neural networks, LSTM-based temporal models, and a multi-sensor MFCC CNN-LSTM architecture across 11 sign classes (digits 1-5, letters A-F). The proposed MFCC CNN-LSTM architecture processes frequency-domain features from each sensor through independent convolutional branches before fusion. It achieves 93.33% accuracy and 95.56% precision, a 23-point improvement over the best ML algorithm (Random Forest: 70.38%). Ablation studies reveal 50-timestep windows offer a tradeoff between temporal context and training data volume, yielding 84.13% accuracy compared to 58.06% with 100-timestep windows. MFCC feature extraction maps temporal variations to execution-speed-invariant spectral representations, and data augmentation methods (time warping, noise injection) are essential for generalization. Results demonstrate that frequency-domain feature representations combined with parallel multi-sensor processing architectures offer enhancement over classical algorithms and time-domain deep learning for wearable sensor-based gesture recognition. This aids assistive technology development.
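The windowing and augmentation steps described above (50-timestep windows, time warping, noise injection) can be sketched in numpy as follows; window size, step, warp range, and noise level here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def sliding_windows(series, width=50, step=25):
    """Cut a (time, sensors) recording into overlapping training windows;
    50-timestep windows are the size the study found to work best."""
    starts = range(0, len(series) - width + 1, step)
    return np.stack([series[s:s + width] for s in starts])

def augment(window, rng, noise_std=0.02, max_warp=0.1):
    """Noise injection plus a simple time warp: resample each channel at a
    uniformly stretched or compressed rate via linear interpolation."""
    n, c = window.shape
    t = np.linspace(0.0, 1.0, n)
    rate = 1.0 + rng.uniform(-max_warp, max_warp)
    query = np.clip(t * rate, 0.0, 1.0)
    warped = np.stack([np.interp(query, t, window[:, j]) for j in range(c)],
                      axis=1)
    return warped + rng.normal(0.0, noise_std, size=warped.shape)

rng = np.random.default_rng(0)
recording = rng.normal(size=(500, 5))        # 5 flex-sensor channels
X = sliding_windows(recording)               # (n_windows, 50, 5)
X_aug = np.stack([augment(w, rng) for w in X])
```

Each augmented window keeps the original shape, so it can be fed to the same per-sensor convolutional branches as the raw windows.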
Machine Learning
[LG-0] Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
Link: https://arxiv.org/abs/2604.07345
Authors: Roberto Vercellino(1),Jared Willard(1),Gustavo Campos(1),Weslley da Silva Pereira(1),Olivia Hull(1),Matthew Selensky(1),Juliane Mueller(1) ((1) National Laboratory of the Rockies (NLR), Golden, CO, USA)
Subjects: Systems and Control (eess.SY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: The data associated with this publication can be found at this http URL
Abstract:The rapid growth of generative artificial intelligence (AI) has introduced unprecedented computational demands, driving significant increases in the energy footprint of data centers. However, existing power consumption data is largely proprietary and reported at varying resolutions, creating challenges for estimating whole-facility energy use and planning infrastructure. In this work, we present a methodology that bridges this gap by linking high-resolution workload power measurements to whole-facility energy demand. Using NLR’s high-performance computing data center equipped with NVIDIA H100 GPUs, we measure power consumption of AI workloads at 0.1-second resolution for AI training, fine-tuning and inference jobs. Workloads are characterized using MLCommons benchmarks for model training and fine-tuning, and vLLM benchmarks for inference, enabling reproducible and standardized workload profiling. The dataset of power consumption profiles is made publicly available. These power profiles are then scaled to the whole-facility-level using a bottom-up, event-driven, data center energy model. The resulting whole-facility energy profiles capture realistic temporal fluctuations driven by AI workloads and user-behavior, and can be used to inform infrastructure planning for grid connection, on-site energy generation, and distributed microgrids.
[LG-1] How to sketch a learning algorithm
Link: https://arxiv.org/abs/2604.07328
Authors: Sam Gunn
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:How does the choice of training data influence an AI model? This question is of central importance to interpretability, privacy, and basic science. At its core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error \varepsilon in the deep learning setting. Our precomputation and prediction algorithms are only \mathrm{poly}(1/\varepsilon) factors slower than regular training and inference, respectively. The storage requirements are those of \mathrm{poly}(1/\varepsilon) models. Our proof is based on an assumption that we call “stability.” In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at this https URL. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives. [v1] submitted Wed, 8 Apr 2026 17:46:06 UTC (59 KB).
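The closing idea, extracting derivatives from evaluations in random complex directions via forward-mode differentiation, is related to the classic complex-step trick, sketched here for first derivatives (an illustration of the flavor of the technique, not the paper's higher-order circuit-sketching algorithm):

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    """First derivative of a real-analytic f at x from one complex evaluation.

    f(x + ih) = f(x) + ih f'(x) + O(h^2), so Im(f(x + ih))/h recovers f'(x)
    with no subtractive cancellation, even for tiny h.
    """
    return np.imag(f(x + 1j * h)) / h

def directional_derivative(f, x, v, h=1e-20):
    """Directional derivative <grad f(x), v> for f: R^n -> R, same trick
    with a complex perturbation along direction v."""
    return np.imag(f(x + 1j * h * v)) / h

# f(x) = sin(x) e^x has derivative (cos x + sin x) e^x.
x0 = 0.7
approx = complex_step_derivative(lambda t: np.sin(t) * np.exp(t), x0)
exact = (np.cos(x0) + np.sin(x0)) * np.exp(x0)
```

Because no difference of nearby values is taken, h can be far below the floating-point noise floor of ordinary finite differences, which is what makes complex directions attractive for cheap, accurate derivative extraction.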
[LG-2] SL-FAC: A Communication-Efficient Split Learning Framework with Frequency-Aware Compression
Link: https://arxiv.org/abs/2604.07316
Authors: Zehang Lin,Miao Yang,Haihan Zhu,Zheng Lin,Jianhao Huang,Jing Yang,Guangjin Pan,Dianxin Luan,Zihan Fang,Shunzhi Zhu,Wei Ni,John Thompson
Subjects: Machine Learning (cs.LG)
*Comments: 6 pages, 4 figures
Abstract:The growing complexity of neural networks hinders the deployment of distributed machine learning on resource-constrained devices. Split learning (SL) offers a promising solution by partitioning the large model and offloading the primary training workload from edge devices to an edge server. However, the increasing number of participating devices and model complexity leads to significant communication overhead from the transmission of smashed data (e.g., activations and gradients), which constitutes a critical bottleneck for SL. To tackle this challenge, we propose SL-FAC, a communication-efficient SL framework comprising two key components: adaptive frequency decomposition (AFD) and frequency-based quantization compression (FQC). AFD first transforms the smashed data into the frequency domain and decomposes it into spectral components with distinct information. FQC then applies customized quantization bit widths to each component based on its spectral energy distribution. This collaborative approach enables SL-FAC to achieve significant communication reduction while strategically preserving the information most crucial for model convergence. Extensive experiments confirm the superior performance of SL-FAC for improving the training efficiency.
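A toy version of the FQC idea, transform to the frequency domain and spend more bits where the spectral energy sits, can be sketched with numpy's FFT (the band split, bit widths, and uniform quantizer below are illustrative assumptions, not the paper's design):

```python
import numpy as np

def quantize_uniform(x, bits):
    """Uniform quantizer over the array's own dynamic range."""
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

def frequency_aware_compress(act, low_bits=8, high_bits=3, split=0.25):
    """Toy FQC step: FFT the activations, then quantize the low-frequency
    band (where smooth activations concentrate energy) with more bits."""
    spec = np.fft.rfft(act)
    k = max(1, int(split * spec.size))
    out = np.empty_like(spec)
    out[:k] = (quantize_uniform(spec.real[:k], low_bits)
               + 1j * quantize_uniform(spec.imag[:k], low_bits))
    out[k:] = (quantize_uniform(spec.real[k:], high_bits)
               + 1j * quantize_uniform(spec.imag[k:], high_bits))
    return np.fft.irfft(out, n=act.size)

act = np.sin(np.linspace(0.0, 4.0 * np.pi, 256))   # a smooth "activation"
rec = frequency_aware_compress(act)
```

For smooth signals the reconstruction error stays small even though most coefficients get only a few bits, which is the communication saving the framework exploits.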
[LG-3] Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability
Link: https://arxiv.org/abs/2604.07292
Authors: Akzhol Almukhametov,Doyeong Lim,Rui Hu,Yang Liu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Real-time supervisory control of advanced reactors requires accurate forecasting of plant-wide thermal-hydraulic states, including locations where physical sensors are unavailable. Meeting this need calls for surrogate models that combine predictive fidelity, millisecond-scale inference, and robustness to partial observability. In this work, we present a physics-informed message-passing Graph Neural Network coupled with a Neural Ordinary Differential Equation (GNN-ODE) to address all three requirements simultaneously. We represent the whole system as a directed sensor graph whose edges encode hydraulic connectivity through flow/heat transfer-aware message passing, and we advance the latent dynamics in continuous time via a controlled Neural ODE. A topology-guided missing-node initializer reconstructs uninstrumented states at rollout start; prediction then proceeds fully autoregressively. The GNN-ODE surrogate achieves satisfactory results for system dynamics prediction. On held-out simulation transients, the surrogate achieves an average MAE of 0.91 K at 60 s and 2.18 K at 300 s for uninstrumented nodes, with R^2 up to 0.995 for missing-node state reconstruction. Inference runs approximately 105 times faster than simulated time on a single GPU, enabling 64-member ensemble rollouts for uncertainty quantification. To assess sim-to-real transfer, we adapt the pretrained surrogate to experimental facility data using layerwise discriminative fine-tuning with only 30 training sequences. The learned flow-dependent heat-transfer scaling recovers a Reynolds-number exponent consistent with established correlations, indicating constitutive learning beyond trajectory fitting. The model tracks a steep power change transient and produces accurate trajectories at uninstrumented locations.
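The core loop, adjacency-weighted message passing defining latent dynamics that are advanced in continuous time, can be sketched in numpy with a fixed-step Euler integrator standing in for the Neural ODE solver (the linear message function and random weights are assumptions of the sketch, not the paper's architecture):

```python
import numpy as np

def message_passing_rhs(H, A, W_msg, W_self):
    """dH/dt for a toy message-passing Neural ODE: each node aggregates
    neighbor states along directed edges of a row-normalized adjacency."""
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    return np.tanh(P @ H @ W_msg + H @ W_self)

def euler_rollout(H0, A, W_msg, W_self, dt=0.05, steps=40):
    """Fixed-step Euler integration of the latent graph dynamics (a simple
    stand-in for an adaptive ODE solver)."""
    H = H0.copy()
    traj = [H0.copy()]
    for _ in range(steps):
        H = H + dt * message_passing_rhs(H, A, W_msg, W_self)
        traj.append(H.copy())
    return np.stack(traj)

rng = np.random.default_rng(0)
n, d = 6, 4
A = (rng.random((n, n)) < 0.4).astype(float)       # directed sensor graph
np.fill_diagonal(A, 0)
H0 = rng.normal(size=(n, d))                        # latent node states
W_msg = 0.1 * rng.normal(size=(d, d))
W_self = 0.1 * rng.normal(size=(d, d))
traj = euler_rollout(H0, A, W_msg, W_self)          # (steps + 1, n, d)
```

In the real system the right-hand side is a trained network and the adjacency encodes hydraulic connectivity; the rollout structure, integrate latent states forward and decode per-node predictions, is the same.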
[LG-4] Tracking Adaptation Time: Metrics for Temporal Distribution Shift AAAI2026
Link: https://arxiv.org/abs/2604.07266
Authors: Lorenzo Iovine,Giacomo Ziffer,Emanuele Della Valle
Subjects: Machine Learning (cs.LG)
*Comments: Accepted at CEUR-WS Vol. 4183 (Streaming Continual Learning Bridge at AAAI 2026)
Abstract:Evaluating robustness under temporal distribution shift remains an open challenge. Existing metrics quantify the average decline in performance, but fail to capture how models adapt to evolving data. As a result, temporal degradation is often misinterpreted: when accuracy declines, it is unclear whether the model is failing to adapt or whether the data itself has become inherently more challenging to learn. In this work, we propose three complementary metrics to distinguish adaptation from intrinsic difficulty in the data. Together, these metrics provide a dynamic and interpretable view of model behavior under temporal distribution shift. Results show that our metrics uncover adaptation patterns hidden by existing analysis, offering a richer understanding of temporal robustness in evolving environments.
[LG-5] A comparative analysis of machine learning models in SHAP analysis
Link: https://arxiv.org/abs/2604.07258
Authors: Justin Lin,Julia Fukuyama
Subjects: Machine Learning (cs.LG)
*Comments: 17 pages, 16 figures, 4 tables
Abstract:In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex data patterns. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, an associated SHAP value quantifies the contribution of that feature to the prediction of that sample. Analysis of these SHAP values provides valuable insight into the model’s decision-making process, which can be leveraged to create data-driven solutions. The interpretation of these SHAP values, however, is model-dependent, so there does not exist a universal analysis procedure. To aid in these efforts, we present a detailed investigation of SHAP analysis across various machine learning models and data sets. In uncovering the details and nuance behind SHAP analysis, we hope to empower analysts in this less-explored territory. We also present a novel generalization of the waterfall plot to the multi-classification problem.
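For readers new to SHAP, the quantity being analyzed can be computed exactly on toy models by enumerating feature coalitions (O(2^n) cost, so toy-sized only). For a linear model this recovers the known closed form w_i(x_i - background_i), a useful sanity check; the code below is a generic sketch, not the SHAP library itself:

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley(f, x, background):
    """Exact Shapley values of f's prediction at x.

    'Absent' features take their background (reference) values, the usual
    interventional convention in SHAP."""
    n = x.size
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                z = background.copy()
                z[list(S)] = x[list(S)]
                without_i = f(z)          # coalition S, feature i absent
                z[i] = x[i]
                with_i = f(z)             # coalition S plus feature i
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[i] += weight * (with_i - without_i)
    return phi

# Linear model: Shapley values should equal w_i * (x_i - background_i).
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z + 3.0)
x = np.array([1.0, 2.0, -1.0])
bg = np.zeros(3)
phi = exact_shapley(f, x, bg)
```

The additivity property, the values summing to f(x) minus the background prediction, is exactly what makes SHAP explanations decompose a single prediction over the original features.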
[LG-6] Weaves Wires and Morphisms: Formalizing and Implementing the Algebra of Deep Learning
Link: https://arxiv.org/abs/2604.07242
Authors: Vincent Abbott,Gioele Zardini
Subjects: Machine Learning (cs.LG); Category Theory (math.CT)
*Comments:
Abstract:Despite deep learning models running well-defined mathematical functions, we lack a formal mathematical framework for describing model architectures. Ad-hoc notation, diagrams, and pseudocode poorly handle nonlinear broadcasting and the relationship between individual components and composed models. This paper introduces a categorical framework for deep learning models that formalizes broadcasting through the novel axis-stride and array-broadcasted categories. This allows the mathematical function underlying architectures to be precisely expressed and manipulated in a compositional manner. These mathematical definitions are translated into human manageable diagrams and machine manageable data structures. We provide a mirrored implementation in Python (pyncd) and TypeScript (tsncd) to show the universal aspect of our framework, along with features including algebraic construction, graph conversion, PyTorch compilation and diagram rendering. This lays the foundation for a systematic, formal approach to deep learning model design and analysis.
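The broadcasting behavior the paper formalizes with axis-stride categories is visible directly in numpy, where a broadcast axis is implemented as a zero-stride view (an illustration of the phenomenon being formalized, not the paper's framework):

```python
import numpy as np

# In numpy, broadcasting never copies data: a "stretched" axis is a view
# whose stride is 0, so every index along it aliases the same memory.
a = np.arange(4.0)                     # shape (4,), float64 itemsize 8 bytes
b = np.broadcast_to(a, (3, 4))         # shape (3, 4), still no copy

stride_of_new_axis = b.strides[0]      # 0: every row revisits the same data
same_memory = np.shares_memory(a, b)   # the view aliases a's buffer

# The shape-level rule: trailing axes must match or be 1.
out_shape = np.broadcast_shapes((3, 1, 4), (5, 4))   # -> (3, 5, 4)
```

The axis-stride picture makes precise why composing broadcasted components is subtle: the array's logical shape and its memory layout (strides) are different objects, and a formal treatment has to track both.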
[LG-7] How Does Machine Learning Manage Complexity?
Link: https://arxiv.org/abs/2604.07233
Authors: Lance Fortnow
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*Comments: 16 pages, no figures
Abstract:We provide a computational complexity lens to understand the power of machine learning models, particularly their ability to model complex systems. Machine learning models are often trained on data drawn from sampleable or more complex distributions, a far wider range of distributions than just computable ones. By focusing on computable distributions, machine learning models can better manage complexity via probability. We abstract away from specific learning mechanisms, modeling machine learning as producing P/poly-computable distributions with polynomially-bounded max-entropy. We illustrate how learning computable distributions models complexity by showing that if a machine learning model produces a distribution \mu that minimizes error against the distribution generated by a cryptographic pseudorandom generator, then \mu must be close to uniform. MSC classes: 68Q15, 68T05 (Primary); 94A17, 94A60, 68Q30 (Secondary). ACM classes: F.1.1; I.2.6.
[LG-8] Diffusion Processes on Implicit Manifolds
Link: https://arxiv.org/abs/2604.07213
Authors: Victor Kawasaki-Borruat,Clara Grotehans,Pierre Vandergheynst,Adam Gosztolai
Subjects: Machine Learning (cs.LG); Probability (math.PR)
*Comments:
Abstract:High-dimensional data are often modeled as lying near a low-dimensional manifold. We study how to construct diffusion processes on this data manifold in the implicit setting. That is, using only point cloud samples and without access to charts, projections, or other geometric primitives. Our main contribution is a data-driven SDE that captures intrinsic diffusion on the underlying manifold while being defined in ambient space. The construction relies on estimating the diffusion’s infinitesimal generator and its carré-du-champ (CDC) from a proximity graph built from the data. The generator and CDC together encode the local stochastic and geometric structure of the intended diffusion. We show that, as the number of samples grows, the induced process converges in law on the space of probability paths to its smooth manifold counterpart. We call this construction Implicit Manifold-valued Diffusions (IMDs), and furthermore present a numerical simulation procedure using Euler-Maruyama integration. This gives a rigorous basis for practical implementations of diffusion dynamics on data manifolds, and opens new directions for manifold-aware sampling, exploration, and generative modeling.
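The generator-estimation step can be sketched concretely: build a Gaussian-kernel proximity graph on the samples and normalize it into a random-walk kernel, whose scaled difference from the identity estimates the diffusion generator (a standard graph-Laplacian construction; the kernel choice and bandwidth here are illustrative, not the paper's exact estimator):

```python
import numpy as np

def generator_from_point_cloud(points, eps):
    """Graph-based estimate of a diffusion generator on a point cloud.

    Gaussian kernel weights give a proximity graph; the random-walk
    normalization yields a row-stochastic kernel P, and L = (P - I)/eps
    approximates the generator of an intrinsic diffusion on the underlying
    manifold as the number of samples grows."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / eps)
    P = W / W.sum(axis=1, keepdims=True)          # one-step transition kernel
    L = (P - np.eye(len(points))) / eps           # generator estimate
    return P, L

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
circle = np.c_[np.cos(theta), np.sin(theta)]      # samples of a 1-D manifold
P, L = generator_from_point_cloud(circle, eps=0.05)
```

Sampling the Markov chain defined by P (or exponentiating L) then gives a diffusion that lives near the data, entirely in ambient coordinates, with no charts or projections required.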
[LG-9] Beyond the Mean: Modelling Annotation Distributions in Continuous Affect Prediction CVPR2026
Link: https://arxiv.org/abs/2604.07198
Authors: Kosmas Pinitas,Ilias Maglogiannis
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*Comments: This paper has been accepted at the CVPR 2026 Workshop on Affective Behavior Analysis in-the-wild (ABAW)
Abstract:Emotion annotation is inherently subjective and cognitively demanding, producing signals that reflect diverse perceptions across annotators rather than a single ground truth. In continuous affect prediction, this variability is typically collapsed into point estimates such as the mean or median, discarding valuable information about annotator disagreement and uncertainty. In this work, we propose a distribution-aware framework that models annotation consensus using the Beta distribution. Instead of predicting a single affect value, models estimate the mean and standard deviation of the annotation distribution, which are transformed into valid Beta parameters through moment matching. This formulation enables the recovery of higher-order distributional descriptors, including skewness, kurtosis, and quantiles, in closed form. As a result, the model captures not only the central tendency of emotional perception but also variability, asymmetry, and uncertainty in annotator responses. We evaluate the proposed approach on the SEWA and RECOLA datasets using multimodal features. Experimental results show that Beta-based modelling produces predictive distributions that closely match the empirical annotator distributions while achieving competitive performance with conventional regression approaches. These findings highlight the importance of modelling annotation uncertainty in affective computing and demonstrate the potential of distribution-aware learning for subjective signal analysis.
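The moment-matching step has a simple closed form: given a target mean m and variance v on (0, 1) with v < m(1 - m), set nu = m(1 - m)/v - 1 and take alpha = m * nu, beta = (1 - m) * nu. A sketch, including one of the closed-form higher-order descriptors (skewness); the example moments are illustrative:

```python
import numpy as np

def beta_from_moments(mean, std):
    """Moment-match Beta(alpha, beta) to a target mean and std on (0, 1).

    Valid only when var < mean*(1-mean); the common factor nu equals
    alpha + beta, the Beta distribution's concentration."""
    var = std ** 2
    if not (0.0 < mean < 1.0 and var < mean * (1.0 - mean)):
        raise ValueError("moments not achievable by a Beta distribution")
    nu = mean * (1.0 - mean) / var - 1.0
    return mean * nu, (1.0 - mean) * nu

def beta_skewness(a, b):
    """Closed-form skewness of Beta(a, b), one of the higher-order
    descriptors recoverable once (a, b) are fixed."""
    return (2.0 * (b - a) * np.sqrt(a + b + 1.0)
            / ((a + b + 2.0) * np.sqrt(a * b)))

# e.g. annotators average 0.3 valence with sd 0.1 -> Beta(6, 14)
a, b = beta_from_moments(mean=0.3, std=0.1)
```

Because Beta mean and variance are a/(a+b) and ab/((a+b)^2 (a+b+1)), the fit is exact, and quantiles or kurtosis follow from (a, b) just as skewness does.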
[LG-10] Splats under Pressure: Exploring Performance-Energy Trade-offs in Real-Time 3D Gaussian Splatting under Constrained GPU Budgets
Link: https://arxiv.org/abs/2604.07177
Authors: Muhammad Fahim Tajwar,Arthur Wuhrlin,Bhojan Anand
Subjects: Graphics (cs.GR); Machine Learning (cs.LG)
*Comments:
Abstract:We investigate the feasibility of real-time 3D Gaussian Splatting (3DGS) rasterisation on edge clients with varying Gaussian splat counts and GPU computational budgets. Instead of evaluating multiple physical devices, we adopt an emulation-based approach that approximates different GPU capability tiers on a single high-end GPU. By systematically under-clocking the GPU core frequency and applying power caps, we emulate a controlled range of floating-point performance levels that approximate different GPU capability tiers. At each point in this range, we measure frame rate, runtime behaviour, and power consumption across scenes of varying complexity, pipelines, and optimisations, enabling analysis of power-performance relationships such as FPS-power curves, energy per frame, and performance per watt. This method allows us to approximate the performance envelope of a diverse class of GPUs, from embedded and mobile-class devices to high-end consumer-grade systems. Our objective is to explore the practical lower bounds of client-side 3DGS rasterisation and assess its potential for deployment in energy-constrained environments, including standalone headsets and thin clients. Through this analysis, we provide early insights into the performance-energy trade-offs that govern the viability of edge-deployed 3DGS systems.
[LG-11] Improving Semantic Uncertainty Quantification in Language Model Question-Answering via Token-Level Temperature Scaling
Link: https://arxiv.org/abs/2604.07172
Authors: Tom A. Lamb,Desi R. Ivanova,Philip H. S. Torr,Tim G. J. Rudner
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Calibration is central to reliable semantic uncertainty quantification, yet prior work has largely focused on discrimination, neglecting calibration. As calibration and discrimination capture distinct aspects of uncertainty, focusing on discrimination alone yields an incomplete picture. We address this gap by systematically evaluating both aspects across a broad set of confidence measures. We show that current approaches, particularly fixed-temperature heuristics, produce systematically miscalibrated and poorly discriminative semantic confidence distributions. We demonstrate that optimising a single scalar temperature, which, we argue, provides a suitable inductive bias, is a surprisingly simple yet effective solution. Our exhaustive evaluation confirms that temperature scaling consistently improves semantic calibration, discrimination, and downstream entropy, outperforming both heuristic baselines and more expressive token-level recalibration methods on question-answering tasks.
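Optimizing a single scalar temperature is simple enough to sketch end-to-end: rescale logits by 1/T and pick the T that minimizes validation NLL (a grid search here; the toy data and the "overconfident by 4x" setup are assumptions of the sketch, not the paper's experiments):

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """Single-scalar post-hoc calibration: choose T minimizing held-out NLL."""
    grid = np.linspace(0.25, 5.0, 96)   # includes T = 1.0 (no rescaling)
    return min(grid, key=lambda T: nll(logits, labels, T))

# Toy overconfident model: sensible class scores, but logits 4x too sharp.
rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=500)
scores = rng.normal(size=(500, 3))
scores[np.arange(500), labels] += 1.0   # signal toward the true class
logits = 4.0 * scores                    # systematic overconfidence
T_star = fit_temperature(logits, labels)
```

Because the logit ranking is unchanged by a positive scalar, accuracy is untouched; only the confidence distribution moves, which is why a single temperature is such a well-behaved inductive bias for recalibration.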
[LG-12] Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization
Link: https://arxiv.org/abs/2604.07171
Authors: Yong Si, Mingfei Lu, Jing Li, Yang Hu, Guijiang Li, Yueheng Song, Zhaokui Wang
Subjects: Machine Learning (cs.LG)
*Comments: 21 pages, 6 figures, 4 tables
Abstract:Decision-making in military aviation Prognostics and Health Management (PHM) faces significant challenges due to the “curse of dimensionality” in large-scale fleet operations, combined with sparse feedback and stochastic mission profiles. To address these issues, this paper proposes Smart Commander, a novel Hierarchical Reinforcement Learning (HRL) framework designed to optimize sequential maintenance and logistics decisions. The framework decomposes the complex control problem into a two-tier hierarchy: a strategic General Commander manages fleet-level availability and cost objectives, while tactical Operation Commanders execute specific actions for sortie generation, maintenance scheduling, and resource allocation. The proposed approach is validated within a custom-built, high-fidelity discrete-event simulation environment that captures the dynamics of aircraft configuration and support systems. By integrating layered reward shaping with planning-enhanced neural networks, the method effectively addresses the difficulty of sparse and delayed rewards. Empirical evaluations demonstrate that Smart Commander significantly outperforms conventional monolithic Deep Reinforcement Learning (DRL) and rule-based baselines. Notably, it achieves a substantial reduction in training time while demonstrating superior scalability and robustness in failure-prone environments. These results highlight the potential of HRL as a reliable paradigm for next-generation intelligent fleet management.
[LG-13] SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series
Link: https://arxiv.org/abs/2604.07159
Authors: Alexandre Alouadi, Grégoire Loeper, Célian Marsala, Othmane Mazhar, Huyên Pham
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:We study the problem of generating synthetic time series that reproduce both marginal distributions and temporal dynamics, a central challenge in financial machine learning. Existing approaches typically fail to jointly model drift and stochastic volatility, as diffusion-based methods fix the volatility while martingale transport models ignore drift. We introduce the Schrödinger-Bass Bridge for Time Series (SBBTS), a unified framework that extends the Schrödinger-Bass formulation to multi-step time series. The method constructs a diffusion process that jointly calibrates drift and volatility and admits a tractable decomposition into conditional transport problems, enabling efficient learning. Numerical experiments on the Heston model demonstrate that SBBTS accurately recovers stochastic volatility and correlation parameters that prior Schrödinger Bridge methods fail to capture. Applied to S&P 500 data, SBBTS-generated synthetic time series consistently improve downstream forecasting performance when used for data augmentation, yielding higher classification accuracy and Sharpe ratio compared to real-data-only training. These results show that SBBTS provides a practical and effective framework for realistic time series generation and data augmentation in financial applications.
[LG-14] Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing
Link: https://arxiv.org/abs/2604.07148
Authors: Ning Yang, Chuangxin Cheng, Haijun Zhang
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Emerging computation-intensive applications impose stringent latency requirements on resource-constrained mobile devices. Mobile Edge Computing (MEC) addresses this challenge through task offloading. However, designing effective policies remains difficult due to dynamic task arrivals, time-varying channels, and the spatio-temporal coupling of server queues. Conventional heuristics lack adaptability, while Deep Reinforcement Learning (DRL) suffers from limited generalization and architectural rigidity, requiring retraining when network topology changes. Although Large Language Models (LLMs) offer semantic reasoning capabilities, standard Supervised Fine-Tuning (SFT) yields myopic policies that greedily minimize immediate latency without accounting for long-term system evolution. To address these limitations, we propose COMLLM, a generative framework that enables foresighted decision-making in MEC systems. COMLLM integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism, which performs multi-step Monte Carlo rollouts while jointly modeling server queue dynamics. By incorporating these rollouts into the reward design, the framework captures the long-term impact of current decisions on future system states. Experimental results demonstrate that COMLLM achieves near-optimal latency and improved load-balancing fairness. Notably, it exhibits zero-shot topological scalability, allowing a model trained on small-scale networks to generalize to larger, unseen topologies without retraining, outperforming SFT, DRL, and heuristic baselines.
[LG-15] Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees
Link: https://arxiv.org/abs/2604.07143
Authors: Marek Gagolewski
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*Comments:
Abstract:We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset’s mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source ‘lumbermark’ package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.
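The core idea, a minimum spanning tree over mutual reachability distances whose heaviest edges are cut, can be illustrated with a generic sketch. This is not the Lumbermark algorithm itself (which selects cut edges more carefully); the HDBSCAN-style core distance, the parameter k, and the blob data are our assumptions:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import cdist

def mutual_reachability(X, k=3):
    """d_mr(a, b) = max(core_k(a), core_k(b), d(a, b)), HDBSCAN-style."""
    D = cdist(X, X)
    core = np.sort(D, axis=1)[:, k]               # distance to k-th neighbour
    M = np.maximum(D, np.maximum.outer(core, core))
    np.fill_diagonal(M, 0.0)                      # no self-loops in the graph
    return M

def mst_cut_clusters(X, n_clusters=2, k=3):
    """Cut the (n_clusters - 1) heaviest mutual-reachability MST edges."""
    mst = minimum_spanning_tree(mutual_reachability(X, k)).toarray()
    edges = np.argwhere(mst > 0)                  # row-major, matches weights below
    weights = mst[mst > 0]
    for i in np.argsort(weights)[len(weights) - (n_clusters - 1):]:
        mst[tuple(edges[i])] = 0.0                # chop off a heavy limb
    return connected_components(mst, directed=False)[1]

# Two well-separated blobs: the single heaviest MST edge bridges them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = mst_cut_clusters(X, n_clusters=2)
```

The mutual reachability transform is what smoothens the distance landscape: low-density points inherit large core distances, so noise between clusters no longer creates cheap bridges in the MST.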
[LG-16] DDP-SA: Scalable Privacy-Preserving Federated Learning via Distributed Differential Privacy and Secure Aggregation
Link: https://arxiv.org/abs/2604.07125
Authors: Wenjing Wei, Farid Nait-Abdesselam, Alla Jammine
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Abstract:This article presents DDP-SA, a scalable privacy-preserving federated learning framework that jointly leverages client-side local differential privacy (LDP) and full-threshold additive secret sharing (ASS) for secure aggregation. Unlike existing methods that rely solely on differential privacy or on secure multi-party computation (MPC), DDP-SA integrates both techniques to deliver stronger end-to-end privacy guarantees while remaining computationally practical. The framework introduces a two-stage protection mechanism: clients first perturb their local gradients with calibrated Laplace noise, then decompose the noisy gradients into additive secret shares that are distributed across multiple intermediate servers. This design ensures that (i) no single compromised server or communication channel can reveal any information about individual client updates, and (ii) the parameter server reconstructs only the aggregated noisy gradient, never any client-specific contribution. Extensive experiments show that DDP-SA achieves substantially higher model accuracy than standalone LDP while providing stronger privacy protection than MPC-only approaches. The proposed framework scales linearly with the number of participants and offers a practical, privacy-preserving solution for federated learning applications with controllable computational and communication overhead.
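The two-stage mechanism (client-side Laplace perturbation, then additive secret sharing across intermediate servers) can be sketched as follows. This is a simplified real-valued illustration, not the DDP-SA protocol: production additive secret sharing works over a finite field or ring, and the Gaussian masks and parameter values here are our own choices:

```python
import numpy as np

def make_shares(noisy_grad, n_servers, rng):
    """Split a (noisy) gradient into additive shares that sum back to it."""
    shares = [rng.normal(0, 10.0, noisy_grad.shape) for _ in range(n_servers - 1)]
    shares.append(noisy_grad - sum(shares))   # last share completes the sum
    return shares

rng = np.random.default_rng(0)
n_clients, n_servers, dim = 5, 3, 4
eps, sensitivity = 1.0, 1.0               # illustrative LDP parameters

grads = [rng.normal(size=dim) for _ in range(n_clients)]
# Stage 1: each client perturbs its gradient with calibrated Laplace noise.
noisy = [g + rng.laplace(0, sensitivity / eps, dim) for g in grads]
# Stage 2: each client secret-shares its noisy gradient across servers.
all_shares = [make_shares(g, n_servers, rng) for g in noisy]

# Each intermediate server only ever sums the shares addressed to it ...
server_sums = [sum(shares[s] for shares in all_shares) for s in range(n_servers)]
# ... and the parameter server reconstructs only the aggregated noisy gradient.
aggregate = sum(server_sums)
assert np.allclose(aggregate, sum(noisy))
```

The point of the construction is visible in the last two lines: no single server's view reveals a client gradient, yet the aggregate reconstructs exactly.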
[LG-17] Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?
Link: https://arxiv.org/abs/2604.07096
Authors: Changkun Guan, Mengfan Xu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 21 pages
Abstract:Multi-objective bandits have attracted increasing attention because of their broad applicability and mathematical elegance, where the reward of each arm is a multi-dimensional vector rather than a scalar. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in this area is whether performance is fundamentally harder to optimize because of this added complexity. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; however, in the stochastic setting, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: are multi-objective bandits actually harder than single-objective ones? We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap g^\dagger, and hence by the minimum marginal regret of order \Omega(K \log T / g^\dagger). We further develop a new algorithm that achieves Pareto regret of order O(K \log T / g^\dagger), and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments to validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
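The Pareto-order quantities the result is stated in can be made concrete. Below is a generic sketch of Pareto domination and a sub-optimality gap, read as "the smallest uniform boost that lifts an arm onto the Pareto front"; the paper's exact gap definition may differ in detail, and the mean vectors are made up:

```python
import numpy as np

def pareto_front(means):
    """Indices of arms whose mean reward vector is not Pareto-dominated."""
    K = len(means)
    return [
        i for i in range(K)
        if not any(np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
                   for j in range(K) if j != i)
    ]

def suboptimality_gap(i, means):
    """Smallest eps such that means[i] + eps*1 is dominated by no other arm."""
    return max(max(0.0, float(np.min(means[j] - means[i])))
               for j in range(len(means)) if j != i)

means = np.array([[1.0, 0.0],    # Pareto-optimal
                  [0.0, 1.0],    # Pareto-optimal
                  [0.5, 0.5],    # Pareto-optimal (incomparable with the others)
                  [0.2, 0.2]])   # dominated by arm 2, gap 0.3
assert pareto_front(means) == [0, 1, 2]
assert abs(suboptimality_gap(3, means) - 0.3) < 1e-9
```

In the paper's bound, the maximum of these gaps over sub-optimal arms plays the role of g^\dagger in the K log T / g^\dagger regret rate.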
[LG-18] Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering ALT
Link: https://arxiv.org/abs/2604.07085
Authors: Manar D. Samad, Yina Hou, Shrabani Ghosh
Subjects: Machine Learning (cs.LG)
*Comments: 14th IEEE Conference on Healthcare Informatics
Abstract:In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks to elucidate pathophysiology and aid clinical decision-making. However, clustering in healthcare informatics is still based on traditional methods, especially K-means, and has achieved limited success when applied to embedding representations learned by autoencoders as hybrid methods. This paper investigates the effectiveness of traditional, hybrid, and deep learning methods in heart failure patient cohorts using real EHR data from the All of Us Research Program. Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting. To address the shortcomings of deep clustering, we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions, rather than relying on a single fixed embedding space. When combined with traditional clustering in a novel ensemble framework, the proposed ensemble embedding for deep clustering delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. This paper underscores the importance of biological sex-specific clustering of EHR data and the advantages of combining traditional and deep clustering approaches over a single method.
[LG-19] Epistemic Robust Offline Reinforcement Learning
Link: https://arxiv.org/abs/2604.07072
Authors: Abhilash Reddy Chenreddy, Erick Delage
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Offline reinforcement learning learns policies from fixed datasets without further environment interaction. A key challenge in this setting is epistemic uncertainty, arising from limited or biased data coverage, particularly when the behavior policy systematically avoids certain actions. This can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods like SAC-N mitigate this by conservatively estimating Q-values using the ensemble minimum, but they require large ensembles and often conflate epistemic with aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet-based model that directly shapes the uncertainty sets to optimize the cumulative reward under the robust Bellman objective without relying on ensembles. We also introduce a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and demonstrate that our method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains.
[LG-20] Controller Design for Structured State-space Models via Contraction Theory
Link: https://arxiv.org/abs/2604.07069
Authors: Muhammad Zakwan, Vaibhav Gupta, Alireza Karimi, Efe C. Balta, Giancarlo Ferrari-Trecate
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*Comments: The first and second authors contributed equally. The paper has been accepted at the 24th European Control Conference (ECC) in Reykjavik, Iceland, 2026
Abstract:This paper presents an indirect data-driven output feedback controller synthesis for nonlinear systems, leveraging Structured State-space Models (SSMs) as surrogate models. SSMs have emerged as a compelling alternative in modelling time-series data and dynamical systems. They can capture long-term dependencies while maintaining linear computational complexity with respect to the sequence length, in comparison to the quadratic complexity of Transformer-based architectures. The contributions of this work are threefold. We provide the first analysis of controllability and observability of SSMs, which leads to scalable control design via Linear Matrix Inequalities (LMIs) that leverage contraction theory. Moreover, a separation principle for SSMs is established, enabling the independent design of observers and state-feedback controllers while preserving the exponential stability of the closed-loop system. The effectiveness of the proposed framework is demonstrated through a numerical example, showcasing nonlinear system identification and the synthesis of an output feedback controller.
[LG-21] Production-Ready Automated ECU Calibration using Residual Reinforcement Learning
Link: https://arxiv.org/abs/2604.07059
Authors: Andreas Kampmeier, Kevin Badalian, Lucas Koch, Sung-Yong Lee, Jakob Andert
Subjects: Machine Learning (cs.LG)
*Comments: This manuscript has been submitted to SAE as a conference paper for the 2026 Stuttgart International Symposium on Automotive and Powertrain Technology
Abstract:Electronic Control Units (ECUs) have played a pivotal role in transforming motorcars of yore into the modern vehicles we see on our roads today. They actively regulate the actuation of individual components and thus determine the characteristics of the whole system. The behavior of these control functions depends heavily on calibration parameters, which engineers traditionally design by hand. This is taking place in an environment of rising customer expectations and steadily shorter product development cycles. At the same time, legislative requirements are increasing while emission standards are getting stricter. Considering the number of vehicle variants on top of all that, the conventional method is losing its practical and financial viability. Prior work has already demonstrated that optimal control functions can be automatically developed with reinforcement learning (RL); since the resulting functions are represented by artificial neural networks, they lack explainability, a circumstance which renders them challenging to employ in production vehicles. In this article, we present an explainable approach to automating the calibration process using residual RL which follows established automotive development principles. Its applicability is demonstrated by means of a map-based air path controller in a series control unit using a hardware-in-the-loop (HiL) platform. Starting with a sub-optimal map, the proposed methodology quickly converges to a calibration which closely resembles the reference in the series ECU. The results prove that the approach is suitable for the industry, where it leads to better calibrations in significantly less time and requires virtually no human intervention.
[LG-22] AdaBoost Does Not Always Cycle: A Computer-Assisted Counterexample
Link: https://arxiv.org/abs/2604.07055
Authors: Erik Y. Wang
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:We give a computer-assisted counterexample to the open question, posed by Rudin, Schapire, and Daubechies in COLT 2012, of whether exhaustive AdaBoost always converges to a finite cycle. The construction is based on a block-product gadget whose two factors share an exact period-2 orbit for their 5-step branch maps, but whose linearized return maps have dominant eigenvalues with an irrational logarithmic ratio. This irrationality forces the burst-winner sequence to have an irrational asymptotic frequency, precluding eventual periodicity. All assertions are certified by exact rational arithmetic. This work was developed in collaboration with GPT-5.4 Pro and Claude Opus 4.6.
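The exact-rational-arithmetic certification is possible because AdaBoost's normalized weight update stays inside the rationals whenever the weights start rational. A toy one-step sketch with the standard update (the data and weak-learner mistakes below are made up; this is not the paper's block-product gadget):

```python
from fractions import Fraction

def adaboost_step(weights, mistakes):
    """One AdaBoost reweighting step in exact rational arithmetic.

    mistakes[i] is True where the chosen weak learner errs. The normalized
    update divides misclassified weights by 2*eps and the rest by 2*(1-eps),
    sending the chosen learner's weighted error to exactly 1/2.
    """
    eps = sum(w for w, m in zip(weights, mistakes) if m)
    assert Fraction(0) < eps < Fraction(1)
    return [w / (2 * eps) if m else w / (2 * (1 - eps))
            for w, m in zip(weights, mistakes)]

w = [Fraction(1, 4)] * 4                  # uniform start on 4 examples
mistakes = [True, False, False, False]    # weak learner errs on example 0
w = adaboost_step(w, mistakes)
assert sum(w) == 1                        # exactly, no rounding
assert sum(wi for wi, m in zip(w, mistakes) if m) == Fraction(1, 2)
```

Because every quantity is a Fraction, claims about orbits and periodicity can be checked exactly, with no floating-point ambiguity.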
[LG-23] MoE Routing Testbed: Studying Expert Specialization and Routing Behavior at Small Scale
Link: https://arxiv.org/abs/2604.07030
Authors: Tobias Falke, Nicolas Anastassacos, Samson Tan, Chankrisna Richy Meas, Chandana Satya Prakash, Nitesh Sekhar, M Saiful Bari, Krishna Kompella, Gamaleldin F. Elsayed
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Sparse Mixture-of-Experts (MoE) architectures are increasingly popular for frontier large language models (LLM) but they introduce training challenges due to routing complexity. Fully leveraging parameters of an MoE model requires all experts to be well-trained and to specialize in non-redundant ways. Assessing this, however, is complicated due to lack of established metrics and, importantly, many routing techniques exhibit similar performance at smaller sizes, which is often not reflective of their behavior at large scale. To address this challenge, we propose the MoE Routing Testbed, a setup that gives clearer visibility into routing dynamics at small scale while using realistic data. The testbed pairs a data mix with clearly distinguishable domains with a reference router that prescribes ideal routing based on these domains, providing a well-defined upper bound for comparison. This enables quantifiable measurement of expert specialization. To demonstrate the value of the testbed, we compare various MoE routing approaches and show that balancing scope is the crucial factor that allows specialization while maintaining high expert utilization. We confirm that this observation generalizes to models 35x larger.
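With a reference router prescribing the "ideal" expert per domain, specialization becomes directly measurable. A sketch of two such measurements, routing agreement with the reference and normalized expert-utilization entropy (the metric choices and the toy assignments are our own, not the testbed's API):

```python
import numpy as np

def routing_report(chosen, reference, n_experts):
    """Compare a learned router's expert choices to a domain-based reference."""
    agreement = float(np.mean(chosen == reference))
    counts = np.bincount(chosen, minlength=n_experts)
    p = counts / counts.sum()
    p = p[p > 0]
    # Entropy of the expert-load distribution, normalized so 1.0 = balanced.
    utilization = float(-(p * np.log(p)).sum() / np.log(n_experts))
    return agreement, utilization

# 3 domains -> 3 experts; the learned router misroutes one token.
reference = np.array([0] * 4 + [1] * 4 + [2] * 4)
chosen = np.array([0] * 4 + [1] * 3 + [0] + [2] * 4)
agreement, utilization = routing_report(chosen, reference, n_experts=3)
```

High agreement with low utilization entropy would indicate collapsed experts; the abstract's point is that balancing scope keeps both quantities high.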
[LG-24] Learning to Query History: Nonstationary Classification via Learned Retrieval ICLR2026
Link: https://arxiv.org/abs/2604.07027
Authors: Jimmy Gammell, Bishal Thapaliya, Yoon Jung, Riyasat Ohib, Bilel Fehri, Deepayan Chakrabarti
Subjects: Machine Learning (cs.LG)
*Comments: Accepted to ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM). 12 pages, 6 figures
Abstract:Nonstationarity is ubiquitous in practical classification settings, leading deployed models to perform poorly even when they generalize well to holdout sets available at training time. We address this by reframing nonstationary classification as time series prediction: rather than predicting from the current input alone, we condition the classifier on a sequence of historical labeled examples that extends beyond the training cutoff. To scale to large sequences, we introduce a learned discrete retrieval mechanism that samples relevant historical examples via input-dependent queries, trained end-to-end with the classifier using a score-based gradient estimator. This enables the full corpus of historical data to remain on an arbitrary filesystem during training and deployment. Experiments on synthetic benchmarks and Amazon Reviews '23 (electronics category) show improved robustness to distribution shift compared to standard classifiers, with VRAM scaling predictably as the length of the historical data sequence increases.
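The score-based gradient estimator used to train the discrete retrieval step is, in essence, a score-function (REINFORCE) estimator. A generic sketch, checked against the analytic softmax gradient (the retrieval scores and per-choice losses are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Retrieval distribution over 3 historical examples; downstream loss per choice.
logits = np.array([0.5, -0.2, 0.1])
loss = np.array([1.0, 2.0, 3.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(logits)
# Analytic gradient of E_{i~p}[loss_i] w.r.t. logits: p * (loss - E[loss]).
analytic = p * (loss - p @ loss)

# Score-function (REINFORCE) estimate: mean of loss_i * grad log p(i),
# where grad log p(i) = onehot(i) - p for a softmax distribution.
n = 200_000
samples = rng.choice(3, size=n, p=p)
onehot = np.eye(3)[samples]
estimate = ((onehot - p) * loss[samples, None]).mean(axis=0)
```

Because the sampled index never enters the gradient path directly, the historical corpus can stay on disk; only the sampled examples and the score term are needed per step.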
[LG-25] Predictive Representations for Skill Transfer in Reinforcement Learning
Link: https://arxiv.org/abs/2604.07016
Authors: Ruben Vereecken, Luke Dickens, Alessandra Russo
Subjects: Machine Learning (cs.LG)
*Comments: Research conducted September 2018 to June 2021; this manuscript represents the work as of June 2021
Abstract:A key challenge in scaling up Reinforcement Learning is generalizing learned behaviour. Without the ability to carry forward acquired knowledge an agent is doomed to learn each task from scratch. In this paper we develop a new formalism for transfer by virtue of state abstraction. Based on task-independent, compact observations (outcomes) of the environment, we introduce Outcome-Predictive State Representations (OPSRs), agent-centered and task-independent abstractions that are made up of predictions of outcomes. We show formally and empirically that they have the potential for optimal but limited transfer, then overcome this trade-off by introducing OPSR-based skills, i.e. abstract actions (based on options) that can be reused between tasks as a result of state abstraction. In a series of empirical studies, we learn OPSR-based skills from demonstrations and show how they speed up learning considerably in entirely new and unseen tasks without any pre-processing. We believe that the framework introduced in this work is a promising step towards transfer in RL in general, and towards transfer through combining state and action abstraction specifically.
[LG-26] NestPipe: Large-Scale Recommendation Training on 1500 Accelerators via Nested Pipelining
Link: https://arxiv.org/abs/2604.06956
Authors: Zhida Jiang, Zhaolong Xing, Huichao Chai, Tianxing Sun, Qiang Peng, Baopeng Yuan, Jiaxing Wang, Hua Du, Zhixin Wu, Xuemiao Li, Yikui Cao, Xinyu Liu, Yongxiang Feng, Zhen Chen, Ke Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:
Abstract:Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.
[LG-27] Evaluating PQC KEMs Combiners and Cascade Encryption via Adaptive IND-CPA Testing Using Deep Learning
Link: https://arxiv.org/abs/2604.06942
Authors: Simon Calderon, Niklas Johansson, Onur Günlü
Subjects: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
*Comments:
Abstract:Ensuring ciphertext indistinguishability is fundamental to cryptographic security, but empirically validating this property in real implementations and hybrid settings presents practical challenges. The transition to post-quantum cryptography (PQC), with its hybrid constructions combining classical and quantum-resistant primitives, makes empirical validation approaches increasingly valuable. By modeling IND-CPA games as binary classification tasks and training on labeled ciphertext data with BCE loss, we study deep neural network (DNN) distinguishers for ciphertext indistinguishability. We apply this methodology to PQC KEMs. We specifically test the public-key encryption (PKE) schemes used to construct examples such as ML-KEM, BIKE, and HQC. Moreover, a novel extension of this DNN modeling for empirical distinguishability testing of hybrid KEMs is presented. We implement and test this on combinations of PQC KEMs with plain RSA, RSA-OAEP, and plaintext. Finally, methodological generality is illustrated by applying the DNN IND-CPA classification framework to cascade symmetric encryption, where we test combinations of AES-CTR, AES-CBC, AES-ECB, ChaCha20, and DES-ECB. In our experiments on PQC algorithms, KEM combiners, and cascade encryption, no algorithm or combination of algorithms demonstrates a significant advantage (two-sided binomial test, significance level \alpha = 0.01 ), consistent with theoretical guarantees that hybrids including at least one IND-CPA-secure component preserve indistinguishability, and with the absence of exploitable patterns under the considered DNN adversary model. These illustrate the potential of using deep learning as an adaptive, practical, and versatile empirical estimator for indistinguishability in more general IND-CPA settings, allowing data-driven validation of implementations and compositions and complementing the analytical security analysis.
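The acceptance criterion, a two-sided binomial test at significance level \alpha = 0.01 on the distinguisher's guesses, can be reproduced with an exact stdlib implementation (the counts below are made up; the "minlike" two-sided convention shown matches common implementations such as scipy's binomtest):

```python
from math import comb

def binom_two_sided_p(successes, n, p=0.5):
    """Exact two-sided binomial test: sum P(X=k) over all k at most as
    likely as the observed count (the 'minlike' convention)."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    cutoff = pmf[successes] * (1 + 1e-12)   # tolerate float round-off in ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))

# A distinguisher that gets 520 of 1000 IND-CPA guesses right is not
# significantly better than coin flipping at alpha = 0.01 ...
assert binom_two_sided_p(520, 1000) > 0.01
# ... whereas 600/1000 would be overwhelming evidence of distinguishability.
assert binom_two_sided_p(600, 1000) < 1e-9
```

Under this test, "no significant advantage" means exactly what the abstract reports: the classifier's accuracy stays within what a fair coin would produce.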
[LG-28] Equivariant Multi-agent Reinforcement Learning for Multimodal Vehicle-to-Infrastructure Systems
Link: https://arxiv.org/abs/2604.06914
Authors: Charbel Bou Chaaya, Mehdi Bennis
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:In this paper, we study a vehicle-to-infrastructure (V2I) system where distributed base stations (BSs) acting as road-side units (RSUs) collect multimodal (wireless and visual) data from moving vehicles. We consider a decentralized rate maximization problem, where each RSU relies on its local observations to optimize its resources, while all RSUs must collaborate to guarantee favorable network performance. We recast this problem as a distributed multi-agent reinforcement learning (MARL) problem, by incorporating rotation symmetries in terms of vehicles’ locations. To exploit these symmetries, we propose a novel self-supervised learning framework where each BS agent aligns the latent features of its multimodal observation to extract the positions of the vehicles in its local region. Equipped with this sensing data at each RSU, we train an equivariant policy network using a graph neural network (GNN) with message passing layers, such that each agent computes its policy locally, while all agents coordinate their policies via a signaling scheme that overcomes partial observability and guarantees the equivariance of the global policy. We present numerical results carried out in a simulation environment, where ray-tracing and computer graphics are used to collect wireless and visual data. Results show the generalizability of our self-supervised and multimodal sensing approach, achieving more than two-fold accuracy gains over baselines, and the efficiency of our equivariant MARL training, attaining more than 50% performance gains over standard approaches.
[LG-29] Data Leakage in Automotive Perception: Practitioners Insights
Link: https://arxiv.org/abs/2604.06899
Authors: Md Abu Ahammed Babu, Sushant Kumar Pandey, Darko Durisic, Andras Balint, Miroslaw Staron
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments:
Abstract:Data leakage is the inadvertent transfer of information between training and evaluation datasets that poses a subtle, yet critical, risk to the reliability of machine learning (ML) models in safety-critical systems such as automotive perception. While leakage is widely recognized in research, little is known about how industrial practitioners actually perceive and manage it in practice. This study investigates practitioners’ knowledge, experiences, and mitigation strategies around data leakage through ten semi-structured interviews with system design, development, and verification engineers working on automotive perception functions development. Using reflexive thematic analysis, we identify that knowledge of data leakage is widespread and fragmented along role boundaries: ML engineers conceptualize it as a data-splitting or validation issue, whereas design and verification roles interpret it in terms of representativeness and scenario coverage. Detection commonly arises through generic considerations and observed performance anomalies rather than implying specific tools. However, data leakage prevention is more commonly practiced, which depends mostly on experience and knowledge sharing. These findings suggest that leakage control is a socio-technical coordination problem distributed across roles and workflows. We discuss implications for ML reliability engineering, highlighting the need for shared definitions, traceable data practices, and continuous cross-role communication to institutionalize data leakage awareness within automotive ML development.
[LG-30] VertAX: a differentiable vertex model for learning epithelial tissue mechanics
Link: https://arxiv.org/abs/2604.06896
Authors: Alessandro Pasqui, Jim Martin Catacora Ocana, Anshuman Sinha, Matthieu Perez, Fabrice Delbary, Giorgio Gosti, Mattia Miotto, Domenico Caudo, Maxence Ernoult, Hervé Turlier
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE); Biological Physics (physics.bio-ph)
*Comments: 28 pages, 4 figures
Abstract:Epithelial tissues dynamically reshape through local mechanical interactions among cells, a process well captured by vertex models. Yet their many tunable parameters make inference and optimization challenging, motivating computational frameworks that flexibly model and learn tissue mechanics. We introduce VertAX, a differentiable JAX-based framework for vertex-modeling of confluent epithelia. VertAX provides automatic differentiation, GPU acceleration, and end-to-end bilevel optimization for forward simulation, parameter inference, and inverse mechanical design. Users can define arbitrary energy and cost functions in pure Python, enabling seamless integration with machine-learning pipelines. We demonstrate VertAX on three representative tasks: (i) forward modeling of tissue morphogenesis, (ii) mechanical parameter inference, and (iii) inverse design of tissue-scale behaviors. We benchmark three differentiation strategies-automatic differentiation, implicit differentiation, and equilibrium propagation-showing that the latter can approximate gradients using repeated forward, adjoint-free simulations alone, offering a simple route for extending inverse biophysical problems to non-differentiable simulators with limited additional engineering effort.
[LG-31] MENO: MeanFlow-Enhanced Neural Operators for Dynamical Systems
Link: https://arxiv.org/abs/2604.06881
Authors: Tianyue Yang, Xiao Xue
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*Comments: 23 pages, 9 figures
Abstract:Neural operators have emerged as powerful surrogates for dynamical systems due to their grid-invariant properties and computational efficiency. However, the Fourier-based neural operator framework inherently truncates high-frequency components in spectral space, resulting in the loss of small-scale structures and degraded prediction quality at high resolutions when trained on low-resolution data. While diffusion-based enhancement methods can recover multi-scale features, they introduce substantial inference overhead that undermines the efficiency advantage of neural operators. In this work, we introduce MeanFlow-Enhanced Neural Operators (MENO), a novel framework that achieves accurate all-scale predictions with minimal inference cost. By leveraging the improved MeanFlow method, MENO restores both small-scale details and large-scale dynamics with superior physical fidelity and statistical accuracy. We evaluate MENO on three challenging dynamical systems, including phase-field dynamics, 2D Kolmogorov flow, and active matter dynamics, at resolutions up to 256 \times 256. Across all benchmarks, MENO improves the power spectrum density accuracy by up to a factor of 2 compared to baseline neural operators while achieving 12 \times faster inference than the state-of-the-art Diffusion Denoising Implicit Model (DDIM)-enhanced counterparts, effectively bridging the gap between accuracy and efficiency. The flexibility and efficiency of MENO position it as an efficient surrogate model for scientific machine learning applications where both statistical integrity and computational efficiency are paramount.
[LG-32] Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem
链接: https://arxiv.org/abs/2604.06837
作者: Hyukjun Yang,Han-Dong Lim,Donghwan Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:The problem of solving Markov decision processes under function approximation remains a fundamental challenge, even under linear function approximation settings. A key difficulty arises from a geometric mismatch: while the Bellman optimality operator is contractive in the L∞-norm, commonly used objectives such as projected value iteration and Bellman residual minimization rely on L2-based formulations. To enable gradient-based optimization, we consider a soft formulation of Bellman residual minimization and extend it to a generalized weighted Lp-norm. We show that this formulation aligns the optimization objective with the contraction geometry of the Bellman operator as p increases, and derive corresponding performance error bounds. Our analysis provides a principled connection between residual minimization and Bellman contraction, leading to improved control of error propagation while remaining compatible with gradient-based optimization.
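The central observation, that a weighted Lp residual objective approaches the L∞ contraction geometry as p grows, can be sketched concretely. Below is an illustrative toy (the values, Q-table, and weights are made up, and this is not the paper's algorithm): soft Bellman residuals are aggregated under a weighted Lp norm, and by the power-mean inequality the norm increases toward the maximum residual as p grows.

```python
import math

def softmax_backup(q_row, tau=0.1):
    """Soft Bellman backup: tau * log sum_a exp(Q(s,a)/tau), computed stably."""
    m = max(q_row)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_row))

def weighted_lp_residual(v, q, rewards, gamma=0.9, weights=None, p=2, tau=0.1):
    """Weighted Lp norm of soft Bellman residuals; also returns the max
    residual (the L-infinity value the norm approaches as p grows)."""
    n = len(v)
    w = weights or [1.0 / n] * n
    res = [abs(rewards[s] + gamma * softmax_backup(q[s], tau) - v[s])
           for s in range(n)]
    lp = sum(wi * r ** p for wi, r in zip(w, res)) ** (1.0 / p)
    return lp, max(res)

v = [1.0, 0.5, 2.0]                      # value estimates per state
q = [[1.2, 0.8], [0.3, 0.9], [1.5, 1.9]]  # Q-values per state/action
r = [0.1, 0.0, 0.2]                      # rewards
l2, linf = weighted_lp_residual(v, q, r, p=2)
l16, _ = weighted_lp_residual(v, q, r, p=16)
```

With weights summing to one, the weighted Lp norm is nondecreasing in p and bounded above by the max residual, so raising p interpolates the objective toward the sup-norm in which the Bellman operator contracts.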
[LG-33] STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
链接: https://arxiv.org/abs/2604.06836
作者: Minglu Liu,Cunchen Hu,Liangliang Xu,Fengming Tang,Ruijia Wang,Fu Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across layers and training steps. Such uniform designs often introduce noticeable accuracy degradation. To move beyond fixed quantization, we propose STQuant, a distributed training framework that reduces the memory footprint of optimizer states via dynamic precision allocation across layers, state variables, and training steps, while maintaining model quality. Naively applying dynamic quantization during training is challenging for two reasons. First, optimizer states are numerically sensitive, and quantization noise can destabilize quality. Second, jointly considering multiple states and layers induces a large combinatorial search space. STQuant addresses these challenges with two key techniques: 1) a provably near-optimal factor selection strategy that accurately identifies the most influential factors for precision adaptation. 2) a dynamic transition decision algorithm that reduces the search cost from exponential to linear complexity. Experiments on GPT-2 and ViT show that STQuant reduces optimizer-state memory by 84.4%, achieving an average bit-width of as low as 5.1 bits, compared with existing solutions. Moreover, STQuant incurs only O(N/K) computational overhead and requires O(1) extra space.
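The memory/accuracy lever STQuant adapts, the bit-width of optimizer-state quantization, can be seen in a minimal uniform quantizer. This is only an illustrative sketch (STQuant's actual precision-allocation and transition algorithms are more involved): fewer bits mean coarser grids and larger round-off error on a toy state vector.

```python
def quantize(xs, bits):
    """Uniform quantization of a state vector to `bits` bits: map values to
    the nearest of 2^bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / scale) * scale for x in xs]

def max_err(xs, bits):
    """Worst-case dequantization error at a given bit-width."""
    return max(abs(a - b) for a, b in zip(xs, quantize(xs, bits)))

# a toy optimizer-state vector (e.g. second-moment estimates)
state = [0.001 * i * i for i in range(50)]
e4 = max_err(state, 4)
e8 = max_err(state, 8)
```

Round-off error is bounded by half the grid spacing, so each extra bit roughly halves it; a dynamic scheme like STQuant's spends bits where the state distribution is most sensitive and saves them elsewhere.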
[LG-34] FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
链接: https://arxiv.org/abs/2604.06833
作者: Shunan Zhu,Jiawei Chen,Yonghao Yu,Hideya Ochiai
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:As high-quality public data becomes scarce, Federated Learning (FL) provides a vital pathway to leverage valuable private user data while preserving privacy. However, real-world client data often contains toxic or unsafe information. This leads to a critical issue we define as unintended data poisoning, which can severely damage the safety alignment of global models during federated alignment. To address this, we propose FedDetox, a robust framework tailored for Small Language Models (SLMs) on resource-constrained edge devices. We first employ knowledge distillation to transfer sophisticated safety alignment capabilities from large-scale safety-aligned teacher models into lightweight student classifiers suitable for resource-constrained edge devices. Specifically, during federated learning for human preference alignment, each edge client identifies unsafe samples at the source and replaces them with refusal templates, effectively transforming potential poisons into positive safety signals. Experiments demonstrate that our approach preserves model safety at a level comparable to centralized baselines without compromising general utility.
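The on-device sanitization step, replacing unsafe samples with refusal templates, can be sketched as a simple filter. This is a hypothetical illustration: the keyword rule stands in for the distilled student safety classifier, and the refusal string is made up.

```python
REFUSAL = "I can't help with that request."

def is_safe(text):
    """Stand-in for the distilled student safety classifier; a real system
    would use the knowledge-distilled model, not a keyword rule."""
    banned = ("make a weapon", "steal")
    return not any(b in text.lower() for b in banned)

def sanitize(batch):
    """On-device sanitization: unsafe pairs keep their prompt but have the
    response replaced by a refusal template, turning a potential poison
    into a positive safety signal."""
    return [(q, a) if is_safe(q) and is_safe(a) else (q, REFUSAL)
            for q, a in batch]

batch = [("How do I bake bread?", "Mix flour, water, and yeast..."),
         ("How do I steal a car?", "First you...")]
clean = sanitize(batch)
```

Because the flagged pair is rewritten rather than dropped, the client still contributes a preference signal, now one that reinforces refusal behavior instead of eroding alignment.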
[LG-35] CBM-Dual: A 65-nm Fully Connected Chaotic Boltzmann Machine Processor for Dual Function Simulated Annealing and Reservoir Computing
链接: https://arxiv.org/abs/2604.06808
作者: Kanta Yoshioka,Soshi Hirayae,Yuichiro Tanaka,Yuichi Katori,Takashi Morie,Hakaru Tamukoh
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 3 pages, 9 figures
Abstract:This paper presents CBM-Dual, the first silicon-proven digital chaotic dynamics processor (CDP) supporting both simulated annealing (SA) and reservoir computing (RC). CBM-Dual enables real-time decision-making and lightweight adaptation for autonomous Edge AI, employing the largest-scale fully connected 1024-neuron chaotic Boltzmann machine (CBM). To address the high computational and area costs of digital CDPs, we propose: 1) a CBM-specific scheduler that exploits an inherently low neuron flip rate to reduce multiply-accumulate operations by 99%, and 2) an efficient multiply splitting scheme that reduces the area by 59%. Fabricated in 65 nm (12 mm²), CBM-Dual achieves simultaneous heterogeneous task execution and state-of-the-art energy efficiency, delivering 25–54× and 4.5× improvements in the SA and RC fields, respectively.
[LG-36] The Rhetoric of Machine Learning
链接: https://arxiv.org/abs/2604.06754
作者: Robert C. Williamson
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 25 pages. Text of a talk given at AlphaPersuade 2.0, 26 March 2026
Abstract:I examine the technology of machine learning from the perspective of rhetoric, which is simply the art of persuasion. Rather than being a neutral and “objective” way to build “world models” from data, machine learning is (I argue) inherently rhetorical. I explore some of its rhetorical features, and examine one pervasive business model where machine learning is widely used, “manipulation as a service.”
[LG-37] Busemann energy-based attention for emotion analysis in Poincaré discs
链接: https://arxiv.org/abs/2604.06752
作者: Zinaid Kapić,Vladimir Jaćimović
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present EmBolic, a novel fully hyperbolic deep learning architecture for fine-grained emotion analysis from textual messages. The underlying idea is that hyperbolic geometry efficiently captures hierarchies between both words and emotions. In our context, these hierarchical relationships arise from semantic ambiguities. EmBolic aims to infer the curvature on the continuous space of emotions, rather than treating them as a categorical set without any metric structure. At the heart of our architecture is the attention mechanism in the hyperbolic disc. The model is trained to generate queries (points in the hyperbolic disc) from textual messages, while keys (points at the boundary) emerge automatically from the generated queries. Predictions are based on the Busemann energy between queries and keys, evaluating how well a certain textual message aligns with the class directions representing emotions. Our experiments demonstrate strong generalization properties and reasonably good prediction accuracy even for small dimensions of the representation space. Overall, this study supports our claim that affective computing is one of the application domains where hyperbolic representations are particularly advantageous.
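The Busemann scoring rule has a short closed form on the Poincaré disc. The toy below is illustrative (not the EmBolic code; key placement and the softmax read-out are assumptions): the standard Busemann function B_ξ(z) = log(|ξ − z|² / (1 − |z|²)) for a boundary point ξ scores how well an interior query z aligns with that "class direction", and a softmax over negative energies turns scores into weights.

```python
import cmath, math

def busemann(xi, z):
    """Busemann function on the Poincaré disc for a boundary point xi
    (|xi| = 1) and an interior point z (|z| < 1)."""
    return math.log(abs(xi - z) ** 2 / (1.0 - abs(z) ** 2))

def busemann_attention(query, keys):
    """Softmax over negative Busemann energies: lower energy means the
    query is better aligned with that boundary class direction."""
    scores = [-busemann(k, query) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

keys = [cmath.exp(2j * math.pi * i / 4) for i in range(4)]  # 4 class directions
q = 0.6 + 0.0j                                              # query near the key at 1+0j
weights = busemann_attention(q, keys)
```

At the origin the Busemann function vanishes for every boundary point (no direction is preferred), and moving the query toward a boundary key drives that key's energy to −∞, concentrating the attention weight on it.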
[LG-38] Beyond Pessimism: Offline Learning in KL-regularized Games
链接: https://arxiv.org/abs/2604.06738
作者: Yuheng Zhang,Claire Chen,Nan Jiang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized under a KL constraint to a fixed reference policy. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only Õ(1/√n) statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields the first Õ(1/n) sample complexity bound for offline learning in KL-regularized zero-sum games, achieved entirely without pessimism. We further propose an efficient self-play policy optimization algorithm and prove that, with a number of iterations linear in the sample size, it achieves the same fast Õ(1/n) statistical rate as the minimax estimator.
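The smooth KL-regularized best response the analysis relies on has the familiar closed form π(a) ∝ π_ref(a)·exp(q(a)/β). A minimal sketch (toy payoffs; not the paper's algorithm): small β recovers the near-greedy response, large β stays close to the reference policy, which is the smoothness that removes the need for pessimism.

```python
import math

def kl_regularized_best_response(q, pi_ref, beta):
    """Best response to payoffs q under a penalty beta * KL(pi || pi_ref):
    pi(a) proportional to pi_ref(a) * exp(q(a) / beta)."""
    logits = [math.log(p) + qa / beta for p, qa in zip(pi_ref, q)]
    m = max(logits)                      # stabilize the softmax
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [x / s for x in w]

q = [1.0, 0.0, -1.0]
ref = [1 / 3, 1 / 3, 1 / 3]
sharp = kl_regularized_best_response(q, ref, beta=0.1)    # nearly greedy
smooth = kl_regularized_best_response(q, ref, beta=100.0)  # nearly pi_ref
```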
[LG-39] Extraction of linearized models from pre-trained networks via knowledge distillation
链接: https://arxiv.org/abs/2604.06732
作者: Fumito Kimura,Jun Ohkubo
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
Abstract:Recent developments in hardware, such as photonic integrated circuits and optical devices, are driving demand for research on constructing machine learning architectures tailored for linear operations. Hence, it is valuable to explore methods for constructing learning machines with only linear operations after simple nonlinear preprocessing. In this study, we propose a framework to extract a linearized model from a pre-trained neural network for classification tasks by integrating Koopman operator theory with knowledge distillation. Numerical demonstrations on the MNIST and the Fashion-MNIST datasets reveal that the proposed model consistently outperforms the conventional least-squares-based Koopman approximation in both classification accuracy and numerical stability.
[LG-40] Bi-level Heterogeneous Learning for Time Series Foundation Models: A Federated Learning Approach
链接: https://arxiv.org/abs/2604.06727
作者: Shengchao Chen,Guodong Long,Dikai Liu,Jing Jiang
类目: Machine Learning (cs.LG)
*备注: 31 pages
Abstract:Heterogeneity in time series data is more pronounced than in vision or language, as temporal dynamics vary substantially across domains and tasks. Existing efforts to train time series foundation models (TSFMs) from scratch often rely on mixed-batch strategies that merge large-scale datasets, which can cause gradient conflicts and degrade representation quality. To address this, we propose a fine-grained learning method that distills invariant knowledge from heterogeneous series while reducing cross-domain interference. We characterize heterogeneity at two levels: inter-domain and intra-domain. To tackle this bi-level heterogeneity, we design a federated learning method that mitigates intra-domain conflicts by enforcing domain-invariant and semantically consistent representations through local regularization, and addresses inter-domain discrepancies by enhancing cross-domain collaboration via domain-aware aggregation. Experiments across diverse benchmarks show that TSFMs trained with our method consistently outperform both centralized and federated TSFM baselines in point and probabilistic forecasting, while also achieving competitive zero-shot performance at scale, offering a flexible pathway for training TSFMs from scratch in heterogeneous environments.
[LG-41] Bi-Lipschitz Autoencoder With Injectivity Guarantee ICLR2026
链接: https://arxiv.org/abs/2604.06701
作者: Qipeng Zhan,Zhuoping Zhou,Zexuan Wang,Qi Long,Li Shen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for publication at ICLR 2026, 27 Pages, 15 Figures
Abstract:Autoencoders are widely used for dimensionality reduction, based on the assumption that high-dimensional data lies on low-dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non-injective mappings and overly rigid constraints that limit their effectiveness and robustness. In this work, we identify encoder non-injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for its satisfaction. We then propose the Bi-Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi-Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift. Empirical results on diverse datasets show that BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts. Code is available at this https URL.
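A bi-Lipschitz constraint on an encoder f says that pairwise distance ratios d(f(x), f(y))/d(x, y) stay inside a band [1/L, L]. The sketch below is a hypothetical hinge-squared relaxation of that idea (BLAE's actual regularizer and separation criterion differ): ratios inside the band cost nothing, while a collapsed (non-injective) map, which sends distinct points to one code, is penalized.

```python
import math

def bilip_penalty(xs, zs, L):
    """Penalize pairwise distance ratios d(z_i, z_j) / d(x_i, x_j) that
    leave the bi-Lipschitz band [1/L, L] (hinge-squared relaxation)."""
    pen = 0.0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = math.dist(xs[i], xs[j])
            dz = math.dist(zs[i], zs[j])
            if dx == 0:
                continue
            r = dz / dx
            pen += max(0.0, r - L) ** 2 + max(0.0, 1.0 / L - r) ** 2
    return pen

xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
identity = bilip_penalty(xs, xs, L=2.0)                  # ratios are 1: in band
collapsed = bilip_penalty(xs, [(0.0, 0.0)] * 3, L=2.0)   # non-injective: penalized
```

The lower hinge is the injectivity side of the constraint: any encoding that merges distinct inputs drives some ratio to 0, below 1/L, so the penalty grows, which is exactly the pathological minimum the paper aims to eliminate.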
[LG-42] Towards Accurate and Calibrated Classification: Regularizing Cross-Entropy From A Generative Perspective
链接: https://arxiv.org/abs/2604.06689
作者: Qipeng Zhan,Zhuoping Zhou,Li Shen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Accurate classification requires not only high predictive accuracy but also well-calibrated confidence estimates. Yet, modern deep neural networks (DNNs) are often overconfident, primarily due to overfitting on the negative log-likelihood (NLL). While focal loss variants alleviate this issue, they typically reduce accuracy, revealing a persistent trade-off between calibration and predictive performance. Motivated by the complementary strengths of generative and discriminative classifiers, we propose Generative Cross-Entropy (GCE), which maximizes p(x|y) and is equivalent to cross-entropy augmented with a class-level confidence regularizer. Under mild conditions, GCE is strictly proper. Across CIFAR-10/100, Tiny-ImageNet, and a medical imaging benchmark, GCE improves both accuracy and calibration over cross-entropy, especially in the long-tailed scenario. Combined with adaptive piecewise temperature scaling (ATS), GCE attains calibration competitive with focal-loss variants without sacrificing accuracy.
[LG-43] GraphWalker: Graph-Guided In-Context Learning for Clinical Reasoning on Electronic Health Records
链接: https://arxiv.org/abs/2604.06684
作者: Yue Fang,Weibin Liao,Yuxin Guo,Jiaran Gao,Hongxin Ding,Jinyang Zhang,Xinke Jiang,Zhibang Yang,Junfeng Zhao,Yasha Wang,Liantao Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clinical Reasoning on Electronic Health Records (EHRs) is a fundamental yet challenging task in modern healthcare. While in-context learning (ICL) offers a promising inference-time adaptation paradigm for large language models (LLMs) in EHR reasoning, existing methods face three fundamental challenges: (1) Perspective Limitation, where data-driven similarity fails to align with LLM reasoning needs and model-driven signals are constrained by limited clinical competence; (2) Cohort Awareness, as demonstrations are selected independently without modeling population-level structure; and (3) Information Aggregation, where redundancy and interaction effects among demonstrations are ignored, leading to diminishing marginal gains. To address these challenges, we propose GraphWalker, a principled demonstration selection framework for EHR-oriented ICL. GraphWalker (i) jointly models patient clinical information and LLM-estimated information gain by integrating data-driven and model-driven perspectives, (ii) incorporates Cohort Discovery to avoid noisy local optima, and (iii) employs a Lazy Greedy Search with Frontier Expansion algorithm to mitigate diminishing marginal returns in information aggregation. Extensive experiments on multiple real-world EHR benchmarks demonstrate that GraphWalker consistently outperforms state-of-the-art ICL baselines, yielding substantial improvements in clinical reasoning performance. Our code is open-sourced at this https URL
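The "Lazy Greedy" component addresses diminishing marginal gains via a standard submodular-selection trick: marginal gains only shrink as the selected set grows, so a stale upper bound on a priority queue can defer most re-evaluations. A generic sketch (toy coverage function standing in for GraphWalker's information-gain objective; not the paper's code):

```python
import heapq

def lazy_greedy(candidates, gain, k):
    """Lazy greedy selection for a monotone submodular gain function.
    gain(item, selected) is the marginal gain of adding item; stale heap
    entries are re-scored only when they reach the top."""
    selected = []
    heap = [(-gain(c, []), c) for c in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg, c = heapq.heappop(heap)
        fresh = gain(c, selected)
        if heap and fresh < -heap[0][0] - 1e-12:
            heapq.heappush(heap, (-fresh, c))  # stale bound: re-score and defer
        elif fresh > 0:
            selected.append(c)
    return selected

# demonstrations "cover" information units; marginal gain = newly covered units
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}

def coverage_gain(item, selected):
    covered = set().union(*(covers[s] for s in selected)) if selected else set()
    return len(covers[item] - covered)

demo = lazy_greedy(list(covers), coverage_gain, k=2)
```

Submodularity guarantees the cached gain is an upper bound on the fresh gain, so the item at the top of the heap whose fresh gain still beats the next cached bound is safely the greedy choice without touching the rest of the queue.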
[LG-44] Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
链接: https://arxiv.org/abs/2604.06664
作者: Xueshen Liu,Yongji Wu,Yuncheng Yao,Danyang Zhuo,Ion Stoica,Z. Morley Mao
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: beyond graph topology, they are tightly coupled to execution context, including device addresses embedded in kernel arguments and kernel code lazily loaded during warmup. Existing approaches either rely on brittle kernel-specific patching or heavyweight process-level checkpoint/restore that are inflexible to dynamic parallelism switching. We present Foundry, a template-based CUDA graph context materialization system that persists both graph topology and execution context during an offline processing stage, and reconstructs executable graphs online with negligible overhead. Foundry enforces deterministic memory layouts, automatically extracts and reloads kernel binaries required by captured graphs, and reduces online reconstruction costs through topology-based templating. For distributed serving, Foundry further enables a single-GPU offline capture to generate templates for multi-GPU deployments by patching only rank-dependent communication state. Across dense and MoE models up to 235B parameters, Foundry reduces cold-start latency by up to 99%, cutting the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds while preserving the throughput gains of CUDA graphs.
[LG-45] FlowAdam: Implicit Regularization via Geometry-Aware Soft Momentum Injection IJCNN2026
链接: https://arxiv.org/abs/2604.06652
作者: Devender Singh,Tarun Sheel
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCNN 2026 (IEEE WCCI). 8 pages, 4 figures
Abstract:Adaptive moment methods such as Adam use a diagonal, coordinate-wise preconditioner based on exponential moving averages of squared gradients. This diagonal scaling is coordinate-system dependent and can struggle with dense or rotated parameter couplings, including those in matrix factorization, tensor decomposition, and graph neural networks, because it treats each parameter independently. We introduce FlowAdam, a hybrid optimizer that augments Adam with continuous gradient-flow integration via an ordinary differential equation (ODE). When EMA-based statistics detect landscape difficulty, FlowAdam switches to clipped ODE integration. Our central contribution is Soft Momentum Injection, which blends ODE velocity with Adam’s momentum during mode transitions. This prevents the training collapse observed with naive hybrid approaches. Across coupled optimization benchmarks, the ODE integration provides implicit regularization, reducing held-out error by 10-22% on low-rank matrix/tensor recovery and 6% on Jester (real-world collaborative filtering), also surpassing tuned Lion and AdaBelief, while matching Adam on well-conditioned workloads (CIFAR-10). MovieLens-100K confirms benefits arise specifically from coupled parameter interactions rather than bias estimation. Ablation studies show that soft injection is essential, as hard replacement reduces accuracy from 100% to 82.5%.
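The soft momentum injection the abstract credits for avoiding training collapse can be sketched as a convex blend of Adam's momentum with the ODE velocity during a mode transition. The ramp schedule below is a hypothetical simplification of the paper's mechanism; hard replacement corresponds to jumping straight to alpha = 1.

```python
def soft_momentum_injection(m_adam, v_ode, alpha):
    """Blend the ODE gradient-flow velocity into Adam's first moment:
    alpha = 0 keeps Adam's momentum, alpha = 1 is hard replacement."""
    return [(1 - alpha) * m + alpha * v for m, v in zip(m_adam, v_ode)]

def transition(m, v, steps=5):
    """Ramp alpha from 0 to 1 over `steps` updates for a smooth hand-off
    (an assumed schedule, for illustration)."""
    for s in range(1, steps + 1):
        m = soft_momentum_injection(m, v, alpha=s / steps)
    return m

m0 = [1.0, -2.0]        # Adam first-moment estimate
v = [0.2, 0.1]          # clipped ODE velocity
m1 = soft_momentum_injection(m0, v, 0.3)   # one partial blend
m_final = transition(m0, v)                # full ramp ends at v
```

The point of the blend is that the optimizer state never jumps discontinuously: each step moves the momentum only a fraction of the way toward the ODE velocity, which is what the ablation contrasts with hard replacement.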
[LG-46] The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence
链接: https://arxiv.org/abs/2604.06621
作者: Napoleon Paxton
类目: General Literature (cs.GL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Survey article, 19 pages, 1 figure, 2 tables
Abstract:Dr. David Blackwell was a mathematician and statistician of the first rank, whose contributions to statistical theory, game theory, and decision theory predated many of the algorithmic breakthroughs that define modern artificial intelligence. This survey examines three of his most consequential theoretical results, the Rao-Blackwell theorem, the Blackwell Approachability theorem, and the Blackwell Informativeness theorem (comparison of experiments), and traces their direct influence on contemporary AI and machine learning. We show that these results, developed primarily in the 1940s and 1950s, remain technically live across modern subfields including Markov Chain Monte Carlo inference, autonomous mobile robot navigation (SLAM), generative model training, no-regret online learning, reinforcement learning from human feedback (RLHF), large language model alignment, and information design. NVIDIA's 2024 decision to name its flagship GPU architecture "Blackwell" provides vivid testament to his enduring relevance. We also document an emerging frontier: explicit Rao-Blackwellized variance reduction in LLM RLHF pipelines, recently proposed but not yet standard practice. Together, Blackwell's theorems form a unified framework addressing information compression, sequential decision making under uncertainty, and the comparison of information sources, precisely the problems at the core of modern AI.
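The Rao-Blackwell theorem's variance-reduction effect is easy to demonstrate numerically: conditioning an estimator on a sufficient statistic never increases its variance, since Var(X) = Var(E[X|Y]) + E[Var(X|Y)]. A small Monte Carlo sketch (the mixture model and constants are illustrative):

```python
import random
import statistics

def crude_and_rb(n, rng, p=0.3, mu=(0.0, 5.0), sigma=2.0):
    """One replication: draw Y ~ Bernoulli(p), X | Y ~ Normal(mu[Y], sigma).
    The crude estimator of E[X] averages X; the Rao-Blackwellized one
    averages E[X | Y] = mu[Y], integrating the Gaussian noise out."""
    xs, conds = [], []
    for _ in range(n):
        y = 1 if rng.random() < p else 0
        xs.append(rng.gauss(mu[y], sigma))
        conds.append(mu[y])
    return statistics.fmean(xs), statistics.fmean(conds)

rng = random.Random(0)
pairs = [crude_and_rb(50, rng) for _ in range(400)]
var_crude = statistics.variance(x for x, _ in pairs)
var_rb = statistics.variance(x for _, x in pairs)
```

Both estimators are unbiased for E[X] = p·mu[1], but the Rao-Blackwellized one drops the within-component noise term E[Var(X|Y)] = sigma², which is the same mechanism behind Rao-Blackwellized particle filters in SLAM and the RLHF variance-reduction frontier the survey highlights.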
[LG-47] PD-SOVNet: A Physics-Driven Second-Order Vibration Operator Network for Estimating Wheel Polygonal Roughness from Axle-Box Vibrations
链接: https://arxiv.org/abs/2604.06620
作者: Xiancheng Wang,Lin Wang,Rui Wang,Zhibo Zhang,Minghang Zhao,Xiaoheng Zhang,Zhongyue Tan,Kaitai Mao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantitative estimation of wheel polygonal roughness from axle-box vibration signals is a challenging yet practically relevant problem for rail-vehicle condition monitoring. Existing studies have largely focused on detection, identification, or severity classification, while continuous regression of multi-order roughness spectra remains less explored, especially under real operational data and unseen-wheel conditions. To address this problem, this paper presents PD-SOVNet, a physics-guided gray-box framework that combines shared second-order vibration kernels, a 4×4 MIMO coupling module, an adaptive physical correction branch, and a Mamba-based temporal branch for estimating the 1st–40th-order wheel roughness spectrum from axle-box vibrations. The proposed design embeds modal-response priors into the model while retaining data-driven flexibility for sample-dependent correction and residual temporal dynamics. Experiments on three real-world datasets, including operational data and real fault data, show that the proposed method provides competitive prediction accuracy and relatively stable cross-wheel performance under the current data protocol, with its most noticeable advantage observed on the more challenging Dataset III. Noise injection experiments further indicate that the Mamba temporal branch helps mitigate performance degradation under perturbed inputs. These results suggest that structured physical priors can be beneficial for stabilizing roughness regression in practical rail-vehicle monitoring scenarios, although further validation under broader operating conditions and stricter comparison protocols is still needed.
[LG-48] Neural parametric representations for thin-shell shape optimisation
链接: https://arxiv.org/abs/2604.06612
作者: Xiao Xiao,Fehmi Cirak
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures
Abstract:Shape optimisation of thin-shell structures requires a flexible, differentiable geometric representation suitable for gradient-based optimisation. We propose a neural parametric representation (NRep) for the shell mid-surface based on a neural network with periodic activation functions. The NRep is defined using a multi-layer perceptron (MLP), which maps the parametric coordinates of mid-surface vertices to their physical coordinates. A structural compliance optimisation problem is posed to optimise the shape of a thin-shell parameterised by the NRep subject to a volume constraint, with the network parameters as design variables. The resulting shape optimisation problem is solved using a gradient-based optimisation algorithm. Benchmark examples with classical solutions demonstrate the effectiveness of the proposed NRep. The approach exhibits potential for complex lattice-skin structures, owing to the compact and expressive geometry representation afforded by the NRep.
[LG-49] DynLP: Parallel Dynamic Batch Update for Label Propagation in Semi-Supervised Learning
链接: https://arxiv.org/abs/2604.06596
作者: S M Shovan,Arindam Khanda,S M Ferdous,Sajal K. Das,Mahantesh Halappanavar
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To be published in the ACM International Conference on Supercomputing (ICS 2026)
Abstract:Semi-supervised learning aims to infer class labels using only a small fraction of labeled data. In graph-based semi-supervised learning, this is typically achieved through label propagation to predict labels of unlabeled nodes. However, in real-world applications, data often arrive incrementally in batches. Each time a new batch appears, reapplying the traditional label propagation algorithm to recompute all labels is redundant, computationally intensive, and inefficient. To address the absence of an efficient label propagation update method, we propose DynLP, a novel GPU-centric Dynamic Batched Parallel Label Propagation algorithm that performs only the necessary updates, propagating changes to the relevant subgraph without requiring full recalculation. By exploiting GPU architectural optimizations, our algorithm achieves an average 13x and up to 102x speedup on large-scale datasets compared to state-of-the-art approaches.
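The "propagate only where something changed" idea can be illustrated with a tiny sequential sketch (DynLP itself is GPU-parallel and batched; the nearest-seed labeling rule below is a simplified stand-in): on an edge insertion, relaxation starts from a small frontier instead of re-running propagation over the whole graph, yet ends in the same state as a full recomputation.

```python
from collections import deque

def propagate(adj, seeds):
    """Full recomputation: every node takes the label of its nearest seed
    via multi-source BFS (ties resolved by seed enqueue order)."""
    dist = {u: float("inf") for u in adj}
    label = {u: None for u in adj}
    q = deque()
    for u, l in sorted(seeds.items(), key=lambda kv: kv[1]):
        dist[u], label[u] = 0, l
        q.append(u)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[u] + 1 < dist[v]:
                dist[v], label[v] = dist[u] + 1, label[u]
                q.append(v)
    return dist, label

def add_edge_incremental(adj, dist, label, u, v):
    """Dynamic update: insert edge (u, v) and relax only the affected
    subgraph from a small frontier instead of recomputing all labels."""
    adj[u].append(v)
    adj[v].append(u)
    q = deque([u, v])
    while q:
        a = q.popleft()
        for b in adj[a]:
            if dist[a] + 1 < dist[b]:
                dist[b], label[b] = dist[a] + 1, label[a]
                q.append(b)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
dist, label = propagate(adj, {0: "A", 5: "B"})
add_edge_incremental(adj, dist, label, 1, 4)
dist_full, label_full = propagate(adj, {0: "A", 5: "B"})
```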
[LG-50] ExplainFuzz: Explainable and Constraint-Conditioned Test Generation with Probabilistic Circuits
链接: https://arxiv.org/abs/2604.06559
作者: Annaëlle Baiget,Jaron Maene,Seongmin Lee,Benjie Wang,Guy Van den Broeck,Miryung Kim
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 19 pages
Abstract:Understanding and explaining the structure of generated test inputs is essential for effective software testing and debugging. Existing approaches, including grammar-based fuzzers, probabilistic Context-Free Grammars (pCFGs), and Large Language Models (LLMs), suffer from critical limitations. They frequently produce ill-formed inputs that fail to reflect realistic data distributions, struggle to capture context-sensitive probabilistic dependencies, and lack explainability. We introduce ExplainFuzz, a test generation framework that leverages Probabilistic Circuits (PCs) to learn and query structured distributions over grammar-based test inputs interpretably and controllably. Starting from a Context-Free Grammar (CFG), ExplainFuzz compiles a grammar-aware PC and trains it on existing inputs. New inputs are then generated via sampling. ExplainFuzz utilizes the conditioning capability of PCs to incorporate test-specific constraints (e.g., a query must have GROUP BY), enabling constrained probabilistic sampling to generate inputs satisfying grammar and user-provided constraints. Our results show that ExplainFuzz improves the coherence and realism of generated inputs, achieving significant perplexity reduction compared to pCFGs, grammar-unaware PCs, and LLMs. By leveraging its native conditioning capability, ExplainFuzz significantly enhances the diversity of inputs that satisfy a user-provided constraint. Compared to grammar-aware mutational fuzzing, ExplainFuzz increases bug-triggering rates from 35% to 63% in SQL and from 10% to 100% in XML. These results demonstrate the power of a learned input distribution over mutational fuzzing, which is often limited to exploring the local neighborhood of seed inputs. These capabilities highlight the potential of PCs to serve as a foundation for grammar-aware, controllable test generation that captures context-sensitive, probabilistic dependencies.
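The constrained-sampling idea can be illustrated with a toy weighted grammar. Note the simplification: a PC conditions on a constraint exactly, whereas the sketch below (a made-up SQL-like pCFG) falls back on rejection sampling to approximate the same effect.

```python
import random

# toy probabilistic CFG over SQL-like queries: nonterminal -> (weight, expansion)
PCFG = {
    "query":  [(0.6, ["select", "where"]), (0.4, ["select", "where", "group"])],
    "select": [(1.0, ["SELECT c FROM t"])],
    "where":  [(0.5, ["WHERE c > 0"]), (0.5, [""])],
    "group":  [(1.0, ["GROUP BY c"])],
}

def sample(symbol, rng):
    """Expand a symbol top-down, choosing rules in proportion to weight."""
    if symbol not in PCFG:
        return symbol  # terminal string
    rules = PCFG[symbol]
    r = rng.random() * sum(w for w, _ in rules)
    for w, rhs in rules:
        r -= w
        if r <= 0:
            return " ".join(filter(None, (sample(s, rng) for s in rhs)))
    return ""

def sample_conditioned(pred, rng, tries=1000):
    """Rejection-based stand-in for PC conditioning: resample until the
    user constraint holds (a PC would condition exactly, no rejection)."""
    for _ in range(tries):
        q = sample("query", rng)
        if pred(q):
            return q
    raise RuntimeError("constraint too unlikely for rejection sampling")

rng = random.Random(7)
q = sample_conditioned(lambda s: "GROUP BY" in s, rng)
```

Every sampled string is grammatical by construction, and conditioning restricts the distribution to the constrained subset, which is the same guarantee ExplainFuzz gets natively (and efficiently) from PC inference rather than rejection.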
[LG-51] When Does Context Help? A Systematic Study of Target-Conditional Molecular Property Prediction ICLR2026
链接: https://arxiv.org/abs/2604.06558
作者: Bryan Cheng,Jasper Zhang
类目: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注: 9 pages, 5 figures. Accepted at Workshop on AI for Accelerated Materials Design and Foundation Models for Science: Real-World Impact and Science-First Design at ICLR 2026
Abstract:We present the first systematic study of when target context helps molecular property prediction, evaluating context conditioning across 10 diverse protein families, 4 fusion architectures, data regimes spanning 67-9,409 training compounds, and both temporal and random evaluation splits. Using NestDrug, a FiLM-based architecture that conditions molecular representations on target identity, we characterize both success and failure modes with three principal findings. First, fusion architecture dominates: FiLM outperforms concatenation by 24.2 percentage points and additive conditioning by 8.6 pp; how you incorporate context matters more than whether you include it. Second, context enables otherwise impossible predictions: on data-scarce CYP3A4 (67 training compounds), multi-task transfer achieves 0.686 AUC where per-target Random Forest collapses to 0.238. Third, context can systematically hurt: distribution mismatch causes 10.2 pp degradation on BACE1; few-shot adaptation consistently underperforms zero-shot. Beyond methodology, we expose fundamental flaws in standard benchmarking: 1-nearest-neighbor Tanimoto achieves 0.991 AUC on DUD-E without any learning, and 50% of actives leak from training data, rendering absolute performance metrics meaningless. Our temporal split evaluation (train up to 2020, test 2021-2024) achieves stable 0.843 AUC with no degradation, providing the first rigorous evidence that context-conditional molecular representations generalize to future chemical space.
[LG-52] Time-Series Classification with Multivariate Statistical Dependence Features
链接: https://arxiv.org/abs/2604.06537
作者: Yao Sun,Bo Hu,Jose Principe
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we propose a novel framework for non-stationary time-series analysis that replaces conventional correlation-based statistics with direct estimation of statistical dependence in the normalized joint density of input and target signals, the cross density ratio (CDR). Unlike windowed correlation estimates, this measure is independent of sample order and robust to regime changes. The method builds on the functional maximal correlation algorithm (FMCA), which constructs a projection space by decomposing the eigenspectrum of the CDR. Multiscale features from this eigenspace are classified using a lightweight single-hidden-layer perceptron. On the TI-46 digit speech corpus, our approach outperforms hidden Markov models (HMMs) and state-of-the-art spiking neural networks, achieving higher accuracy with fewer than 10 layers and a storage footprint under 5 MB.
[LG-53] VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts
链接: https://arxiv.org/abs/2604.06502
作者: Peigui Qi,Kunsheng Tang,Yanpu Yu,Jialin Wu,Yide Song,Wenbo Zhou,Zhicong Huang,Cheng Hong,Weiming Zhang,Nenghai Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from limited efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework that enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at this https URL.
[LG-54] Optimal Rates for Pure ε-Differentially Private Stochastic Convex Optimization with Heavy Tails
链接: https://arxiv.org/abs/2604.06492
作者: Andrew Lowy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:We study stochastic convex optimization (SCO) with heavy-tailed gradients under pure ε-differential privacy (DP). Instead of assuming a bound on the worst-case Lipschitz parameter of the loss, we assume only a bounded k-th moment. This assumption allows for unbounded, heavy-tailed stochastic gradient distributions, and can yield sharper excess risk bounds. The minimax optimal rate for approximate (ε, δ)-DP SCO is known in this setting, but the pure ε-DP case has remained open. We characterize the minimax optimal excess-risk rate for pure ε-DP heavy-tailed SCO up to logarithmic factors. Our algorithm achieves this rate in polynomial time with high probability. Moreover, it runs in polynomial time with probability 1 when the worst-case Lipschitz parameter is polynomially bounded. For important structured problem classes - including hinge/ReLU-type and absolute-value losses on Euclidean balls, ellipsoids, and polytopes - we achieve the same excess-risk guarantee in polynomial time with probability 1 even when the worst-case Lipschitz parameter is infinite. Our approach is based on a novel framework for privately optimizing Lipschitz extensions of the empirical loss. We complement our excess risk upper bound with a novel high probability lower bound.
[LG-55] AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
链接: https://arxiv.org/abs/2604.06475
作者: Iva Mikuš,Boris Muha,Domagoj Vlah
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 16 pages, 7 figures
Abstract:Deep Learning Reduced Order Models (ROMs) are becoming increasingly popular as surrogate models for parametric partial differential equations (PDEs) due to their ability to handle high-dimensional data, approximate highly nonlinear mappings, and utilize GPUs. Existing approaches typically learn evolution either on the full solution field, which requires capturing long-range spatial interactions at high computational cost, or on compressed latent representations obtained from autoencoders, which reduces the cost but often yields latent vectors that are difficult to evolve, since they primarily encode spatial information. Moreover, in parametric PDEs, the initial condition alone is not sufficient to determine the trajectory, and most current approaches are not evaluated on jointly predicting multiple solution components with differing magnitudes and parameter sensitivities. To address these challenges, we propose a joint model consisting of a convolutional encoder, a transformer operating on latent representations, and a decoder for reconstruction. The main novelties are joint training with multi-stage parameter injection and coordinate channel injection. Parameters are injected at multiple stages to improve conditioning. Physical coordinates are encoded to provide spatial information. This allows the model to dynamically adapt its computations to the specific PDE parameters governing each system, rather than learning a single fixed response. Experiments on the Advection-Diffusion-Reaction equation and Navier-Stokes cylinder-wake flow demonstrate that our approach combines the efficiency of latent evolution with the fidelity of full-field models, outperforming DL-ROMs, latent transformers, and plain ViTs in multi-field prediction, reducing the relative rollout error by approximately a factor of 5.
[LG-56] MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
链接: https://arxiv.org/abs/2604.06473
作者: Willa Potosnak,Nina Żukowska,Michał Wiliński,Dan Howarth,Ignacy Stępka,Mononito Goswami,Artur Dubrawski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate forecasting with Transformers faces a core scalability challenge: modeling cross-channel dependencies via attention compounds attention’s quadratic sequence complexity with quadratic channel scaling, making full cross-channel attention impractical for high-dimensional time series. We propose Multivariate Infini Compressive Attention (MICA), an architectural design to extend channel-independent Transformers to channel-dependent forecasting. By adapting efficient attention techniques from the sequence dimension to the channel dimension, MICA adds a cross-channel attention mechanism to channel-independent backbones that scales linearly with channel count and context length. We evaluate channel-independent Transformer architectures with and without MICA across multiple forecasting benchmarks. MICA reduces forecast error over its channel-independent counterparts by 5.4% on average and up to 25.4% on individual datasets, highlighting the importance of explicit cross-channel modeling. Moreover, models with MICA rank first among deep multivariate Transformer and MLP baselines. MICA models also scale more efficiently with respect to both channel count and context length than Transformer baselines that compute attention across both the temporal and channel dimensions, establishing compressive attention as a practical solution for scalable multivariate forecasting.
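To make the scaling argument in the abstract concrete, here is an illustrative sketch (not the paper's exact mechanism) of moving a kernelized linear attention from the sequence axis to the channel axis: for hidden states of shape (channels, dim), the cost is O(C·d²) rather than the O(C²·d) of softmax attention across channels, so it scales linearly with channel count.

```python
import numpy as np

def channel_linear_attention(h):
    """Kernelized linear attention over the CHANNEL axis of h, shape (C, d).

    A positive feature map replaces softmax, so the (d, d) key-value summary
    can be computed once and reused for every channel: O(C * d^2) total.
    Illustrative only; the paper's compressive mechanism may differ.
    """
    q = np.maximum(h, 0) + 1.0        # elu-like positive feature map
    k = np.maximum(h, 0) + 1.0
    v = h
    kv = k.T @ v                      # (d, d) summary, independent of C
    z = q @ k.sum(axis=0)             # per-channel normalizer, strictly > 0
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))           # 6 channels, hidden dim 4
out = channel_linear_attention(h)
```

Doubling the channel count here doubles the work, which is the property that makes cross-channel attention affordable for high-dimensional series.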
[LG-57] Conformal Margin Risk Minimization: An Envelope Framework for Robust Learning under Label Noise AISTATS
链接: https://arxiv.org/abs/2604.06468
作者: Yuanjie Shi,Peihong Li,Zijian Zhang,Janardhan Rao Doppa,Yan Yan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for Publication at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS), 2026
Abstract:Most methods for learning with noisy labels require privileged knowledge such as noise transition matrices, clean subsets or pretrained feature extractors, resources typically unavailable when robustness is most needed. We propose Conformal Margin Risk Minimization (CMRM), a plug-and-play envelope framework that improves any classification loss under label noise by adding a single quantile-calibrated regularization term, with no privileged knowledge or training pipeline modification. CMRM measures the confidence margin between the observed label and competing labels, and thresholds it with a conformal quantile estimated per batch to focus training on high-margin samples while suppressing likely mislabeled ones. We derive a learning bound for CMRM under arbitrary label noise requiring only mild regularity of the margin distribution. Across five base methods and six benchmarks with synthetic and real-world noise, CMRM consistently improves accuracy (up to +3.39%), reduces conformal prediction set size (up to -20.44%) and does not hurt under 0% noise, showing that CMRM captures a method-agnostic uncertainty signal that existing mechanisms did not exploit.
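The "quantile-calibrated margin threshold" described above can be sketched in a few lines. This is a minimal, hypothetical rendering (function names and the 0/1 weighting scheme are assumptions, not the paper's code): compute the margin between the observed label's logit and the best competing logit, estimate a threshold from the batch's margin quantile, and down-weight samples below it.

```python
import numpy as np

def cmrm_weights(logits, labels, alpha=0.9):
    """Sketch of batch-quantile margin thresholding (hypothetical API).

    Keeps samples whose observed-label margin exceeds a per-batch quantile
    threshold, suppressing likely mislabeled (low-margin) samples.
    """
    n = logits.shape[0]
    observed = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf
    competing = masked.max(axis=1)
    margin = observed - competing            # positive -> model agrees with label
    tau = np.quantile(margin, 1.0 - alpha)   # batch-estimated threshold
    return (margin >= tau).astype(float)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))
labels = rng.integers(0, 4, size=8)
w = cmrm_weights(logits, labels, alpha=0.9)  # 0/1 sample weights for the loss
```

In training, these weights would multiply the per-sample loss, which is the plug-and-play aspect: any base classification loss can be wrapped this way.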
[LG-58] Weighted Bayesian Conformal Prediction
链接: https://arxiv.org/abs/2604.06464
作者: Xiayin Lou,Peng Luo
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Machine Learning (stat.ML)
*备注:
Abstract:Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell & Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption – a limitation the authors themselves identify. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose Weighted Bayesian Conformal Prediction (WBCP), which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet Dir(1, …, 1) with a weighted Dirichlet Dir(n_eff · w̃_1, …, n_eff · w̃_n), where n_eff is Kish's effective sample size. We prove four theoretical results: (1) n_eff is the unique concentration parameter matching frequentist and Bayesian variances; (2) posterior standard deviation decays as O(1/√n_eff); (3) BQ-CP's stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4) the HPD threshold provides an O(1/√n_eff) improvement in conditional coverage. We instantiate WBCP for spatial prediction as Geographical BQ-CP, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.
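The weighted Dirichlet construction above is directly computable. The sketch below instantiates Kish's effective sample size and the resulting concentration parameters; with uniform weights it recovers the uniform Dirichlet(1, …, 1) of plain BQ-CP, while skewed weights shrink the effective sample size below n.

```python
import numpy as np

def kish_neff(w):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def weighted_dirichlet_alpha(w):
    """Concentration parameters of the weighted Dirichlet described above:
    Dir(n_eff * w_tilde_1, ..., n_eff * w_tilde_n), w_tilde normalized."""
    w = np.asarray(w, dtype=float)
    w_tilde = w / w.sum()
    return kish_neff(w) * w_tilde

# Uniform weights recover the uniform Dirichlet(1, ..., 1).
alpha_uniform = weighted_dirichlet_alpha(np.ones(5))
# Skewed weights: n_eff = 8^2 / 20 = 3.2 < 5 effective calibration points.
alpha_shift = weighted_dirichlet_alpha(np.array([4.0, 1.0, 1.0, 1.0, 1.0]))
```

Note that the total concentration always equals n_eff, which is what ties the posterior's spread to the O(1/√n_eff) decay rate stated in the abstract.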
[LG-59] Quality-preserving Model for Electronics Production Quality Tests Reduction
链接: https://arxiv.org/abs/2604.06451
作者: Noufa Haneefa,Teddy Lazebnik,Einav Peretz-Andersson
类目: Machine Learning (cs.LG)
*备注:
Abstract:Manufacturing test flows in high-volume electronics production are typically fixed during product development and executed unchanged on every unit, even as failure patterns and process conditions evolve. This protects quality, but it also imposes unnecessary test cost, while existing data-driven methods mostly optimize static test subsets and neither adapt online to changing defect distributions nor explicitly control escape risk. In this study, we present an adaptive test-selection framework that combines offline minimum-cost diagnostic subset construction using greedy set cover with an online Thompson-sampling multi-armed bandit that switches between full and reduced test plans using a rolling process-stability signal. We evaluate the framework on two printed circuit board assembly stages, Functional Circuit Test and End-of-Line test, covering 28,000 board runs. Offline analysis identified zero-escape reduced plans that cut test time by 18.78% in Functional Circuit Test and 91.57% in End-of-Line testing. Under temporal validation with real concept drift, static reduction produced 110 escaped defects in Functional Circuit Test and 8 in End-of-Line, whereas the adaptive policy reduced escapes to zero by reverting to fuller coverage when instability emerged in practice. These results show that online learning can preserve manufacturing quality while reducing test burden, offering a practical route to adaptive test planning across production domains, and offering both economic and logistical improvements for companies.
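The online half of the framework above can be sketched as a Beta-Bernoulli Thompson-sampling bandit over two arms (full vs. reduced plan) with a stability override. The class name, interface, and escape-tolerance rule below are hypothetical illustrations of the idea, not the paper's implementation.

```python
import random

class PlanSelector:
    """Minimal sketch (hypothetical interface): Thompson sampling chooses
    between a full and a reduced test plan, reverting to full coverage when
    a rolling window of escaped defects signals process instability."""

    def __init__(self, window=50, escape_tolerance=0):
        self.params = {"full": [1, 1], "reduced": [1, 1]}  # Beta(a, b) per arm
        self.recent_escapes = []
        self.window = window
        self.escape_tolerance = escape_tolerance

    def choose(self):
        # Safety override: recent escapes beyond tolerance -> run the full plan.
        if sum(self.recent_escapes[-self.window:]) > self.escape_tolerance:
            return "full"
        samples = {arm: random.betavariate(a, b)
                   for arm, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, arm, success, escaped):
        a, b = self.params[arm]
        self.params[arm] = [a + success, b + (1 - success)]
        self.recent_escapes.append(int(escaped))

random.seed(0)
sel = PlanSelector()
arm = sel.choose()
```

The override is what distinguishes this from a plain bandit: it encodes the "revert to fuller coverage when instability emerged" behavior that drove escapes to zero in the evaluation.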
[LG-60] ODE-free Neural Flow Matching for One-Step Generative Modeling
链接: https://arxiv.org/abs/2604.06413
作者: Xiao Shou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion and flow matching models generate samples by learning time-dependent vector fields whose integration transports noise to data, requiring tens to hundreds of network evaluations at inference. We instead learn the transport map directly. We propose Optimal Transport Neural Flow Matching (OT-NFM), an ODE-free generative framework that parameterizes the flow map with neural flows, enabling true one-step generation with a single forward pass. We show that naive flow-map training suffers from mean collapse, where inconsistent noise-data pairings drive all outputs toward the data mean. We prove that consistent coupling is necessary for non-degenerate learning and address this using optimal transport pairings with scalable minibatch and online coupling strategies. Experiments on synthetic benchmarks and image generation tasks (MNIST and CIFAR-10) demonstrate competitive sample quality while reducing inference to a single network evaluation.
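The minibatch optimal-transport pairing described above can be sketched directly: instead of pairing noise and data samples arbitrarily (which drives the mean collapse the abstract describes), solve the assignment problem on squared Euclidean cost within each minibatch. The brute-force solver below is only viable for the tiny batch of this sketch; practical implementations use the Hungarian algorithm or Sinkhorn iterations.

```python
import itertools
import numpy as np

def ot_pair(noise, data):
    """Exact minibatch OT pairing by brute force over permutations.

    Returns noise and data reordered so that matched pairs minimize the
    total squared Euclidean transport cost, giving the consistent coupling
    that avoids mean collapse.
    """
    n = len(noise)
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    rows = np.arange(n)
    best = min(itertools.permutations(range(n)),
               key=lambda p: cost[rows, list(p)].sum())
    return noise, data[list(best)]

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))   # noise minibatch
x = rng.normal(size=(6, 2))   # data minibatch
z_m, x_m = ot_pair(z, x)
```

By construction the matched pairing never costs more than the original index-aligned pairing, which is the property that stabilizes direct flow-map training.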
[LG-61] Bridging Theory and Practice in Crafting Robust Spiking Reservoirs
链接: https://arxiv.org/abs/2604.06395
作者: Ruggero Freddi,Nicolas Seseri,Diana Nigrisoli,Alessio Basti
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:
Abstract:Spiking reservoir computing provides an energy-efficient approach to temporal processing, but reliably tuning reservoirs to operate at the edge-of-chaos is challenging due to experimental uncertainty. This work bridges abstract notions of criticality and practical stability by introducing and exploiting the robustness interval, an operational measure of the hyperparameter range over which a reservoir maintains performance above task-dependent thresholds. Through systematic evaluations of Leaky Integrate-and-Fire (LIF) architectures on both static (MNIST) and temporal (synthetic Ball Trajectories) tasks, we identify consistent monotonic trends in the robustness interval across a broad spectrum of network configurations: the robustness-interval width decreases with presynaptic connection density β (i.e., directly with sparsity) and directly with the firing threshold θ. We further identify specific (β, θ) pairs that preserve the analytical mean-field critical point w_crit, revealing iso-performance manifolds in the hyperparameter space. Control experiments on Erdős-Rényi graphs show the phenomena persist beyond small-world topologies. Finally, our results show that w_crit consistently falls within empirical high-performance regions, validating w_crit as a robust starting coordinate for parameter search and fine-tuning. To ensure reproducibility, the full Python code is publicly available.
[LG-62] Revisiting Fairness Impossibility with Endogenous Behavior
链接: https://arxiv.org/abs/2604.06378
作者: Elizabeth Maggie Penn,John W. Patty
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注:
Abstract:In many real-world settings, institutions can and do adjust the consequences attached to algorithmic classification decisions, such as the size of fines, sentence lengths, or benefit levels. We refer to these consequences as the stakes associated with classification. These stakes can give rise to behavioral responses to classification, as people adjust their actions in anticipation of how they will be classified. Much of the algorithmic fairness literature evaluates classification outcomes while holding behavior fixed, treating behavioral differences across groups as exogenous features of the environment. Under this assumption, the stakes of classification play no role in shaping outcomes. We revisit classic impossibility results in algorithmic fairness in a setting where people respond strategically to classification. We show that, in this environment, the well-known incompatibility between error-rate balance and predictive parity disappears, but only by potentially introducing a qualitatively different form of unequal treatment. Concretely, we construct a two-stage design in which a classifier first standardizes its statistical performance across groups, and then adjusts stakes so as to induce comparable patterns of behavior. This requires treating groups differently in the consequences attached to identical classification decisions. Our results demonstrate that fairness in strategic settings cannot be assessed solely by how algorithms map data into decisions. Rather, our analysis treats the human consequences of classification as primary design variables, introduces normative criteria governing their use, and shows that their interaction with statistical fairness criteria generates qualitatively new tradeoffs. Our aim is to make these tradeoffs precise and explicit. 
[LG-63] ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
链接: https://arxiv.org/abs/2604.06370
作者: Shao Wang,Rui Ren,Lin Gui
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:The serving paradigm of large language models (LLMs) is rapidly shifting towards complex multi-agent workflows where specialized agents collaborate over massive shared contexts. While Low-Rank Adaptation (LoRA) enables the efficient co-hosting of these specialized agents on a single base model, it introduces a critical memory footprint bottleneck during serving. Specifically, unique LoRA activations cause Key-Value (KV) cache divergence across agents, rendering traditional prefix caching ineffective for shared contexts. This forces redundant KV cache maintenance, rapidly saturating GPU capacity and degrading throughput. To address this challenge, we introduce ForkKV, a serving system for multi-LoRA agent workflows centered around a novel memory management paradigm in OS: fork with copy-on-write (CoW). By exploiting the structural properties of LoRA, ForkKV physically decouples the KV cache into a massive shared component (analogous to the parent process’s memory pages) and lightweight agent-specific components (the child process’s pages). To support this mechanism, we propose a DualRadixTree architecture that allows newly forked agents to inherit the massive shared cache and apply CoW semantics for their lightweight unique cache. Furthermore, to guarantee efficient execution, we design ResidualAttention, a specialized kernel that reconstructs the disaggregated KV cache directly within on-chip SRAM. Comprehensive evaluations across diverse language models and practical datasets of different tasks demonstrate that ForkKV achieves up to 3.0x the throughput of state-of-the-art multi-LoRA serving systems with a negligible impact on generation quality.
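The fork-with-CoW analogy above can be illustrated with a toy cache: forked agents share the parent's prefix blocks by reference and only materialize private blocks when they write. The class and method names are hypothetical; real systems operate on paged GPU KV blocks, not Python lists.

```python
class CoWKVCache:
    """Toy copy-on-write KV cache in the spirit described above.

    The shared list plays the role of the parent process's memory pages
    (the massive shared context); the private list plays the role of the
    child's pages, created only when an agent writes.
    """

    def __init__(self, shared_blocks=None):
        self.shared = shared_blocks if shared_blocks is not None else []
        self.private = []

    def fork(self):
        # Child inherits the shared prefix by reference: no copying.
        return CoWKVCache(shared_blocks=self.shared)

    def append(self, block):
        # Writes land in the agent's private blocks (the "copy" of CoW).
        self.private.append(block)

    def blocks(self):
        return self.shared + self.private

parent = CoWKVCache()
parent.shared.extend(["ctx0", "ctx1"])   # massive shared context
child = parent.fork()
child.append("agent_specific")           # only the divergent KV is private
```

The memory saving comes from the fork: N agents hold one copy of the shared context plus N small private tails, rather than N full copies.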
[LG-64] Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks
链接: https://arxiv.org/abs/2604.06366
作者: Guillaume Corlouer,Avi Semler,Alexander Strang,Alexander Gietelink Oldenziel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.
[LG-65] Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs ICLR
链接: https://arxiv.org/abs/2604.06298
作者: Suraj Yadav,Siddharth Yadav,Parth Goyal
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR Workshop 2026 ICBINB
Abstract:Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to SLMs (up to 3B) for math reasoning on the GSM8K and MATH datasets with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO, exceeding it by ~5% at 1.5B and by ~3% at 3B. We show that the best achievable gains depend strongly on the base model’s prior reasoning competence and the dataset’s difficulty profile.
[LG-66] FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs
链接: https://arxiv.org/abs/2604.06297
作者: Syed Irfan Ali Meerza,Feiyi Wang,Jian Liu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Given the growing reliance on private data in training Large Language Models (LLMs), Federated Learning (FL) combined with Parameter-Efficient Fine-Tuning (PEFT) has garnered significant attention for enhancing privacy and efficiency. Despite FL’s privacy benefits, prior studies have shown that private data can still be extracted from shared gradients. However, these studies, mainly on full-parameter model training, are limited to reconstructing small batches, short input sequences, and specific model architectures, such as encoder-based or decoder-based models. The reconstruction quality becomes even worse when dealing with gradients from PEFT methods. To fully understand the practical attack surface of federated LLMs, this paper proposes FedSpy-LLM, a scalable and generalizable data reconstruction attack designed to reconstruct training data with larger batch sizes and longer sequences while generalizing across diverse model architectures, even when PEFT methods are deployed for training. At the core of FedSpy-LLM is a novel gradient decomposition strategy that exploits the rank deficiency and subspace structure of gradients, enabling efficient token extraction while preserving key signal components at scale. This approach further mitigates the reconstruction challenges introduced by PEFT’s substantial null space, ensuring robustness across encoder-based, decoder-based, and encoder-decoder model architectures. Additionally, by iteratively aligning each token’s partial-sequence gradient with the full-sequence gradient, FedSpy-LLM ensures accurate token ordering in reconstructed sequences.
[LG-67] Adversarial Robustness of Time-Series Classification for Crystal Collimator Alignment
链接: https://arxiv.org/abs/2604.06289
作者: Xaver Fink,Borja Fernandez Adiego,Daniele Mirarchi,Eloise Matheson,Alvaro Garcia Gonzales,Gianmarco Ricci,Joost-Pieter Katoen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we analyze and improve the adversarial robustness of a convolutional neural network (CNN) that assists crystal-collimator alignment at CERN’s Large Hadron Collider (LHC) by classifying a beam-loss monitor (BLM) time series during crystal rotation. We formalize a local robustness property for this classifier under an adversarial threat model based on real-world plausibility. Building on established parameterized input-transformation patterns used for transformation- and semantic-perturbation robustness, we instantiate a preprocessing-aware wrapper for our deployed time-series pipeline: we encode time-series normalization, padding constraints, and structured perturbations as a lightweight differentiable wrapper in front of the CNN, so that existing gradient-based robustness frameworks can operate on the deployed pipeline. For formal verification, data-dependent preprocessing such as per-window z-normalization introduces nonlinear operators that require verifier-specific abstractions. We therefore focus on attack-based robustness estimates and pipeline-checked validity by benchmarking robustness with the frameworks Foolbox and ART. Adversarial fine-tuning of the resulting CNN improves robust accuracy by up to 18.6% without degrading clean accuracy. Finally, we extend robustness on time-series data beyond single windows to sequence-level robustness for sliding-window classification, introduce adversarial sequences as counterexamples to a temporal robustness requirement over full scans, and observe attack-induced misclassifications that persist across adjacent windows.
[LG-68] Asymptotic-Preserving Neural Networks for Viscoelastic Parameter Identification in Multiscale Blood Flow Modeling
链接: https://arxiv.org/abs/2604.06287
作者: Giulia Bertaglia,Raffaella Fiamma Cabini
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Mathematical models and numerical simulations offer a non-invasive way to explore cardiovascular phenomena, providing access to quantities that cannot be measured directly. In this study, we start with a one-dimensional multiscale blood flow model that describes the viscoelastic properties of arterial walls, and we focus on improving its practical applicability by addressing a major challenge: determining, in a reliable way, the viscoelastic parameters that control how arteries deform under pulsatile pressure. To achieve this, we employ Asymptotic-Preserving Neural Networks that embed the governing physical principles of the multiscale viscoelastic blood flow model within the learning procedure. This framework allows us to infer the viscoelastic parameters while simultaneously reconstructing the time-dependent evolution of the state variables of blood vessels. With this approach, pressure waveforms are estimated from readily accessible patient-specific data, i.e., cross-sectional area and velocity measurements from Doppler ultrasound, in vascular segments where direct pressure measurements are not available. Different numerical simulations, conducted in both synthetic and patient-specific scenarios, show the effectiveness of the proposed methodology.
[LG-69] RAGEN-2: Reasoning Collapse in Agentic RL
链接: https://arxiv.org/abs/2604.06268
作者: Zihan Wang,Chi Gui,Xing Jin,Qineng Wang,Licheng Liu,Kangrui Wang,Shiqi Chen,Linjie Li,Zhengyuan Yang,Pingyue Zhang,Yiping Lu,Jiajun Wu,Li Fei-Fei,Lijuan Wang,Yejin Choi,Manling Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
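The reward-variance proxy behind SNR-Aware Filtering lends itself to a compact sketch. The function name and the keep-fraction heuristic below are hypothetical illustrations: score each prompt by the variance of its group's rollout rewards (prompts that every rollout solves, or none does, carry no gradient signal) and keep the highest-variance prompts.

```python
import numpy as np

def snr_filter(prompts, rollout_rewards, keep_frac=0.5):
    """Sketch of SNR-aware prompt filtering (names are hypothetical).

    Scores each prompt by the variance of its group's rollout rewards,
    a cheap proxy for gradient signal strength, and keeps the
    highest-variance prompts for the next training iteration.
    """
    variances = [float(np.var(r)) for r in rollout_rewards]
    k = max(1, int(len(prompts) * keep_frac))
    order = np.argsort(variances)[::-1][:k]
    return [prompts[i] for i in sorted(order)]

prompts = ["p0", "p1", "p2", "p3"]
rewards = [[1, 1, 1, 1],      # solved by every rollout: zero variance, no signal
           [0, 0, 0, 0],      # solved by none: zero variance, no signal
           [1, 0, 1, 0],      # mixed outcomes: high variance, high signal
           [1, 1, 1, 0]]
kept = snr_filter(prompts, rewards, keep_frac=0.5)
```

Filtering out the zero-variance prompts is exactly the mechanism the abstract describes: it prevents regularization terms from dominating when task gradients are weak.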
[LG-70] SMT-AD: a scalable quantum-inspired anomaly detection approach
链接: https://arxiv.org/abs/2604.06265
作者: Apimuk Sornsaeng,Si Min Chan,Wenxuan Zhang,Swee Liang Wong,Joshua Lim,Dario Poletti
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph)
*备注: 11 pages, 5 figures
Abstract:Quantum-inspired tensor-network algorithms have been shown to be effective and efficient models for machine learning tasks, including anomaly detection. Here, we propose a highly parallelizable quantum-inspired approach, which we call SMT-AD, from Superposition of Multiresolution Tensors for Anomaly Detection. It is based upon the superposition of bond-dimension-1 matrix product operators that transform the input data with Fourier-assisted feature embedding, where the number of learnable parameters grows linearly with feature size, embedding resolutions, and the number of additional components in the matrix-product-operator structure. We demonstrate successful anomaly detection when applied to standard datasets, including credit card transactions, and find that, even with minimal configurations, it achieves competitive performance against established anomaly detection baselines. Furthermore, it provides a straightforward way to reduce the weight of the model and even improve performance by highlighting the most relevant input features.
[LG-71] A Benchmark of Classical and Deep Learning Models for Agricultural Commodity Price Forecasting on A Novel Bangladeshi Market Price Dataset
链接: https://arxiv.org/abs/2604.06227
作者: Tashreef Muhammad,Tahsin Ahmed,Meherun Farzana,Md. Mahmudul Hasan,Abrar Eyasir,Md. Emon Khan,Mahafuzul Islam Shawon,Ferdous Mondol,Mahmudul Hasan,Muhammad Ibrahim
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
*备注: 26 pages, 22 figures, 7 tables
Abstract:Accurate short-term forecasting of agricultural commodity prices is critical for food security planning and smallholder income stabilisation in developing economies, yet machine-learning-ready datasets for this purpose remain scarce in South Asia. This paper makes two contributions. First, we introduce AgriPriceBD, a benchmark dataset of 1,779 daily retail mid-prices for five Bangladeshi commodities (garlic, chickpea, green chilli, cucumber, and sweet pumpkin) spanning July 2020 to June 2025, extracted from government reports via an LLM-assisted digitisation pipeline. Second, we evaluate seven forecasting approaches spanning classical models (naïve persistence, SARIMA, and Prophet) and deep learning architectures (BiLSTM, Transformer, Time2Vec-enhanced Transformer, and Informer), with Diebold-Mariano statistical significance tests. Commodity price forecastability is fundamentally heterogeneous: naïve persistence dominates on near-random-walk commodities. Time2Vec temporal encoding provides no statistically significant advantage over fixed sinusoidal encoding and causes catastrophic degradation on green chilli (+146.1% MAE, p < 0.001). Prophet fails systematically, a failure attributable to discrete step-function price dynamics incompatible with its smooth decomposition assumptions. Informer produces erratic predictions (variance up to 50x that of the ground truth), confirming that sparse-attention Transformers require substantially larger training sets than small agricultural datasets provide. All code, models, and data are released publicly to support replication and future forecasting research on agricultural commodity markets in Bangladesh and similar developing economies.
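The naïve persistence baseline that dominates on near-random-walk commodities is trivial to state, which is part of the benchmark's point; the sketch below (function names are illustrative, not the paper's code) shows it alongside the MAE metric used above.

```python
import numpy as np

def naive_forecast(prices, horizon=1):
    """Naive persistence: every step in the horizon repeats the last
    observed price. On near-random-walk series this is hard to beat."""
    return np.full(horizon, prices[-1])

def mae(y_true, y_pred):
    """Mean absolute error between forecasts and realized prices."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

history = [100.0, 102.0, 101.0, 103.0]    # toy daily mid-prices
pred = naive_forecast(history, horizon=2)  # -> [103.0, 103.0]
err = mae([104.0, 105.0], pred)
```

Any learned model in such a benchmark must beat this one-liner with statistical significance (hence the Diebold-Mariano tests) before its added complexity is justified.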
[LG-72] Gaussian Approximation for Asynchronous Q-learning
链接: https://arxiv.org/abs/2604.07323
作者: Artemy Rubtsov,Sergey Samsonov,Vladimir Ulyanov,Alexey Naumov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 41 pages
Abstract:In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize k^(-ω), ω ∈ (1/2, 1]. Assuming that the sequence of state-action-next-state triples (s_k, a_k, s_{k+1})_{k ≥ 0} forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to n^(-1/6) log^4(nSA) over the class of hyper-rectangles, where n is the number of samples used by the algorithm and S and A denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments for the algorithm’s last iterate.
[LG-73] The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours
链接: https://arxiv.org/abs/2604.07267
作者: Robert Allison,Tomasz Maciazek,Anthony Stephenson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 92 pages (35-page main text + self-contained appendix with theorem proofs and auxiliary lemmas)
Abstract:Gaussian process (GP) regression is a widely used non-parametric modeling tool, but its cubic complexity in the training size limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each test point, as in Nearest Neighbour Gaussian Process (NNGP) regression for geospatial problems and the related scalable GPnn method for more general machine-learning applications. Despite their strong empirical performance, the large-n theory of NNGP/GPnn remains incomplete. We develop a theoretical framework for NNGP and GPnn regression. Under mild regularity assumptions, we derive almost sure pointwise limits for three key predictive criteria: mean squared error (MSE), calibration coefficient (CAL), and negative log-likelihood (NLL). We then study the L_2-risk, prove universal consistency, and show that the risk attains Stone’s minimax rate n^{-2\alpha/(2p+d)}, where \alpha and p capture regularity of the regression problem. We also prove uniform convergence of MSE over compact hyper-parameter sets and show that its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically, with explicit rates. This explains the observed robustness of GPnn to hyper-parameter tuning. These results provide a rigorous statistical foundation for NNGP/GPnn as a highly scalable and principled alternative to full GP models.
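Nearest-neighbour GP prediction of the GPnn flavour can be sketched in a few lines: for each test point, condition only on its m nearest training points instead of the full data set. The RBF kernel, lengthscale, noise level, and m = 32 below are our illustrative choices, not the paper's settings:

```python
import numpy as np

def rbf(X1, X2, ls=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gpnn_predict(Xtr, ytr, xstar, m=32, noise=0.1):
    """GP posterior at xstar using only the m nearest training points."""
    idx = np.argsort(((Xtr - xstar) ** 2).sum(1))[:m]
    Xn, yn = Xtr[idx], ytr[idx]
    K = rbf(Xn, Xn) + noise * np.eye(m)        # local kernel matrix, m x m
    ks = rbf(Xn, xstar[None, :])[:, 0]
    mean = ks @ np.linalg.solve(K, yn)
    var = 1.0 - ks @ np.linalg.solve(K, ks)    # unit prior variance here
    return mean, var

rng = np.random.default_rng(0)
Xtr = rng.uniform(-3, 3, size=(2000, 1))
ytr = np.sin(Xtr[:, 0]) + 0.1 * rng.normal(size=2000)
mean, var = gpnn_predict(Xtr, ytr, np.array([1.0]))
print(mean, var)
```

Each prediction costs O(m^3) rather than O(n^3), which is the scalability the paper's theory (consistency, minimax rates, hyper-parameter robustness) puts on a rigorous footing.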
[LG-74] Amortized Filtering and Smoothing with Conditional Normalizing Flows
链接: https://arxiv.org/abs/2604.07169
作者: Tiangang Cui,Xiaodong Feng,Chenlong Pei,Xiaoliang Wan,Tao Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 43 pages
Abstract:Bayesian filtering and smoothing for high-dimensional nonlinear dynamical systems are fundamental yet challenging problems in many areas of science and engineering. In this work, we propose AFSF, a unified amortized framework for filtering and smoothing with conditional normalizing flows. The core idea is to encode each observation history into a fixed-dimensional summary statistic and use this shared representation to learn both a forward flow for the filtering distribution and a backward flow for the backward transition kernel. Specifically, a recurrent encoder maps each observation history to a fixed-dimensional summary statistic whose dimension does not depend on the length of the time series. Conditioned on this shared summary statistic, the forward flow approximates the filtering distribution, while the backward flow approximates the backward transition kernel. The smoothing distribution over an entire trajectory is then recovered by combining the terminal filtering distribution with the learned backward flow through the standard backward recursion. By learning the underlying temporal evolution structure, AFSF also supports extrapolation beyond the training horizon. Moreover, by coupling the two flows through shared summary statistics, AFSF induces an implicit regularization across latent state trajectories and improves trajectory-level smoothing. In addition, we develop a flow-based particle filtering variant that provides an alternative filtering procedure and enables ESS-based diagnostics when explicit model factors are available. Numerical experiments demonstrate that AFSF provides accurate approximations of both filtering distributions and smoothing paths.
[LG-75] A solver-in-the-loop framework for end-to-end differentiable coastal hydrodynamics
链接: https://arxiv.org/abs/2604.07129
作者: Elsa Cardoso-Bihlo,Alex Bihlo
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Numerical Analysis (math.NA); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 23 pages, 9 figures
Abstract:Numerical simulation of wave propagation and run-up is a cornerstone of coastal engineering and tsunami hazard assessment. However, applying these forward models to inverse problems, such as bathymetry estimation, source inversion, and structural optimization, remains notoriously difficult due to the rigidity and high computational cost of deriving discrete adjoints. In this paper, we introduce AegirJAX, a fully differentiable hydrodynamic solver based on the depth-integrated, non-hydrostatic shallow-water equations. By implementing the solver entirely within a reverse-mode automatic differentiation framework, AegirJAX treats the time-marching physics loop as a continuous computational graph. We demonstrate the framework’s versatility across a suite of scientific machine learning tasks: (1) discovering regime-specific neural corrections for model misspecifications in highly dispersive wave propagation; (2) performing continuous topology optimization for breakwater design; (3) training recurrent neural networks in-the-loop for active wave cancellation; and (4) inverting hidden bathymetry and submarine landslide kinematics directly from downstream sensor data. The proposed differentiable paradigm fundamentally blurs the line between forward simulation and inverse optimization, offering a unified, end-to-end framework for coastal hydrodynamics.
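Differentiating through a time-marching loop, the core idea behind a solver-in-the-loop framework like AegirJAX, can be shown in miniature with a hand-written adjoint (reverse-mode) pass for explicit Euler on du/dt = -c u. This is a sketch of the mechanism only, not of AegirJAX's shallow-water physics or its JAX implementation:

```python
# Explicit Euler for du/dt = -c*u, with a hand-written reverse-mode pass
# through the loop; JAX-style autodiff produces the same gradient automatically.
def forward(c, u0=1.0, dt=0.01, n=100):
    u, trace = u0, []
    for _ in range(n):
        trace.append(u)
        u = u - dt * c * u              # one Euler step
    return u, trace

def grad_c(c, u0=1.0, dt=0.01, n=100):
    _, trace = forward(c, u0, dt, n)
    ubar, cbar = 1.0, 0.0               # seed: d(u_N)/d(u_N) = 1
    for u in reversed(trace):
        cbar += ubar * (-dt * u)        # contribution of du_{k+1}/dc
        ubar *= (1.0 - dt * c)          # chain rule through du_{k+1}/du_k
    return cbar

c = 2.0
g = grad_c(c)
# Closed form: u_N = u0*(1 - dt*c)^n, so du_N/dc = -n*dt*u0*(1 - dt*c)^(n-1)
g_exact = -100 * 0.01 * (1 - 0.01 * c) ** 99
print(g, g_exact)
```

Treating the whole time loop as one differentiable graph is what lets gradients of downstream observables flow back to bathymetry, sources, or neural corrections in the inverse problems the paper tackles.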
[LG-76] Physics-Informed Functional Link Constrained Framework with Domain Mapping for Solving Bending Analysis of an Exponentially Loaded Perforated Beam
链接: https://arxiv.org/abs/2604.07025
作者: Iswari Sahu,Ramanath Garai,S. Chakraverty
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This article presents a novel and comprehensive approach for analyzing bending behavior of the tapered perforated beam under an exponential load. The governing differential equation includes important factors like filling ratio ( \alpha ), number of rows of holes ( N ), tapering parameters ( \phi and \psi ), and exponential loading parameter ( \gamma ), providing a realistic and flexible representation of perforated beam configuration. Main goal of this work is to see how well the Domain mapped physics-informed Functional link Theory of Functional Connection (DFL-TFC) method analyses bending response of perforated beam with square holes under exponential loading. For comparison purposes, a corresponding PINN-based formulation is developed. Outcomes clearly show that the proposed DFL-TFC framework gives better results, including faster convergence, reduced computational cost, and improved solution accuracy when compared to the PINN approach. These findings highlight effectiveness and potential of DFL-TFC method for solving complex engineering problems governed by differential equations. Within this framework, hidden layer is replaced by a functional expansion block that enriches input representation via orthogonal polynomial basis functions, and the domain of DE mapped to corresponding domain of orthogonal polynomials. A Constrained Expression (CE), constructed through the Theory of Functional Connections (TFC) using boundary conditions, ensures that constraints are exactly satisfied. In CE, free function is represented using a Functional Link Neural Network (FLNN), which learns to solve resulting unconstrained optimization problem. The obtained results are further validated through the Galerkin and PINN solutions.
[LG-77] QNAS: A Neural Architecture Search Framework for Accurate and Efficient Quantum Neural Networks IJCNN
链接: https://arxiv.org/abs/2604.07013
作者: Kooshan Maleki,Alberto Marchisio,Muhammad Shafique
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: To appear at the IEEE International Joint Conference on Neural Networks (IJCNN), Maastricht, The Netherlands, June 2026
Abstract:Designing quantum neural networks (QNNs) that are both accurate and deployable on NISQ hardware is challenging. Handcrafted ansatze must balance expressivity, trainability, and resource use, while limited qubits often necessitate circuit cutting. Existing quantum architecture search methods primarily optimize accuracy while only heuristically controlling quantum resource usage, and they mostly ignore the exponential overhead of circuit cutting. We introduce QNAS, a neural architecture search framework that unifies hardware-aware evaluation, multi-objective optimization, and cutting-overhead awareness for hybrid quantum-classical neural networks (HQNNs). QNAS trains a shared-parameter SuperCircuit and uses NSGA-II to optimize three objectives jointly: (i) validation error, (ii) a runtime cost proxy measuring wall-clock evaluation time, and (iii) the estimated number of subcircuits under a target qubit budget. QNAS evaluates candidate HQNNs after a few epochs of training and discovers clear Pareto fronts that reveal tradeoffs between accuracy, efficiency, and cutting overhead. Across MNIST, Fashion-MNIST, and Iris benchmarks, we observe that embedding type and CNOT mode selection significantly impact both accuracy and efficiency, with angle-y embedding and sparse entangling patterns outperforming other configurations on image datasets, and amplitude embedding excelling on tabular data (Iris). On MNIST, the best architecture achieves 97.16% test accuracy with a compact 8-qubit, 2-layer circuit; on the more challenging Fashion-MNIST, 87.38% with a 5-qubit, 2-layer circuit; and on Iris, 100% validation accuracy with a 4-qubit, 2-layer circuit. QNAS surfaces these design insights automatically during search, guiding practitioners toward architectures that balance accuracy, resource efficiency, and practical deployability on current hardware.
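The non-dominated filtering at the heart of NSGA-II's sorting can be sketched as follows, with all objectives minimized. The candidate (error, runtime proxy, subcircuit count) triples are invented for illustration, not drawn from QNAS's actual search space:

```python
import numpy as np

def pareto_front(F):
    """Indices of non-dominated rows of F (all objectives minimized)."""
    n = len(F)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                keep[i] = False          # i is dominated by j
                break
    return np.where(keep)[0]

# Hypothetical candidates: (validation error, runtime proxy, #subcircuits).
F = np.array([
    [0.03, 1.0, 2],    # accurate but slow
    [0.10, 0.4, 1],    # cheap, less accurate
    [0.12, 0.9, 3],    # dominated by the row above
    [0.05, 0.5, 2],    # a middle-ground tradeoff
])
print(pareto_front(F))
```

NSGA-II layers this dominance test into ranked fronts and adds crowding-distance selection; the front itself is what exposes the accuracy/efficiency/cutting-overhead tradeoffs the abstract describes.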
[LG-78] ELC: Evidential Lifelong Classifier for Uncertainty Aware Radar Pulse Classification
链接: https://arxiv.org/abs/2604.06958
作者: Mohamed Rabie,Chinthana Panagamuwa,Konstantinos G. Kyriakopoulos
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: IEEE RadarConf’26 Submission. 6 pages; 3 figures; 1 table
Abstract:Reliable radar pulse classification is essential in Electromagnetic Warfare for situational awareness and decision support. Deep Neural Networks have shown strong performance in radar pulse and RF emitter recognition; however, on their own they struggle to efficiently learn new pulses and lack mechanisms for expressing predictive confidence. This paper integrates Uncertainty Quantification with Lifelong Learning to address both challenges. The proposed approach is an Evidential Lifelong Classifier (ELC), which models epistemic uncertainty using evidence theory. ELC is evaluated against a Bayesian Lifelong Classifier (BLC), which quantifies uncertainty through Shannon entropy. Both integrate Learn-Prune-Share to enable continual learning of new pulses and uncertainty-based selective prediction to reject unreliable predictions. ELC and BLC are evaluated on 2 synthetic radar and 3 RF fingerprinting datasets. Selective prediction based on evidential uncertainty improves recall by up to 46% at -20 dB SNR on synthetic radar pulse datasets, highlighting its effectiveness at identifying unreliable predictions in low-SNR conditions compared to BLC. These findings demonstrate that evidential uncertainty offers a strong correlation between confidence and correctness, improving the trustworthiness of ELC by allowing it to express ignorance.
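The Dirichlet-evidence style of uncertainty common in evidential deep learning, which an evidential classifier like ELC builds on, can be sketched generically: the network outputs nonnegative per-class evidence, and the uncertainty mass u = K / sum(alpha) grows as total evidence shrinks. The evidence vectors below are made up for illustration, not ELC's architecture:

```python
import numpy as np

def evidential(evidence):
    """Dirichlet-based belief, uncertainty mass, and expected probabilities."""
    K = len(evidence)
    alpha = evidence + 1.0                 # Dirichlet parameters alpha_k = e_k + 1
    S = alpha.sum()
    belief = evidence / S                  # per-class belief mass
    u = K / S                              # epistemic uncertainty mass
    probs = alpha / S                      # expected class probabilities
    return belief, u, probs

confident = np.array([40.0, 1.0, 1.0])     # strong evidence for class 0
ignorant = np.array([0.1, 0.1, 0.1])       # almost no evidence at all

_, u_conf, p_conf = evidential(confident)
_, u_ign, _ = evidential(ignorant)
print(u_conf, u_ign, p_conf)
```

Selective prediction then simply rejects samples whose u exceeds a threshold, which is how a classifier can "express ignorance" on unreliable low-SNR pulses.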
[LG-79] Continuous-Time Dynamics of the Difference-of-Convex Algorithm
链接: https://arxiv.org/abs/2604.06926
作者: Yi-Shuai Niu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 22 pages
Abstract:We study the continuous-time structure of the difference-of-convex algorithm (DCA) for smooth DC decompositions with a strongly convex component. In dual coordinates, classical DCA is exactly the full-step explicit Euler discretization of a nonlinear autonomous system. This viewpoint motivates a damped DCA scheme, which is also a Bregman-regularized DCA variant, and whose vanishing-step limit yields a Hessian-Riemannian gradient flow generated by the convex part of the decomposition. For the damped scheme we prove monotone descent, asymptotic criticality, Kurdyka-Lojasiewicz convergence under boundedness, and a global linear rate under a metric DC-PL inequality. For the limiting flow we establish an exact energy identity, asymptotic criticality of bounded trajectories, explicit global rates under metric relative error bounds, finite-length and single-point convergence under a Kurdyka-Lojasiewicz hypothesis, and local exponential convergence near nondegenerate local minima. The analysis also reveals a global-local tradeoff: the half-relaxed scheme gives the best provable global guarantee in our framework, while the full-step scheme is locally fastest near a nondegenerate minimum. Finally, we show that different DC decompositions of the same objective induce different continuous dynamics through the metric generated by the convex component, providing a geometric criterion for decomposition quality and linking DCA with Bregman geometry.
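A one-dimensional damped DCA iteration on the double well f(x) = x^4/4 - x^2/2 illustrates the monotone descent the paper proves. The DC split (with a strongly convex quadratic added to both components) and the damping factor lam are our illustrative choices, not the paper's general scheme:

```python
import numpy as np

mu = 0.5                                    # strong-convexity shift in g
g = lambda t: t ** 4 / 4 + mu * t ** 2 / 2  # convex component
h = lambda t: t ** 2 / 2 + mu * t ** 2 / 2  # convex component
f = lambda t: g(t) - h(t)                   # = t^4/4 - t^2/2, a double well

def dca_step(x, lam):
    # Classical DCA solves g'(y) = h'(x), i.e. y^3 + mu*y = (1 + mu)*x;
    # this scalar equation is solved by bisection, then damped with lam <= 1.
    target = (1.0 + mu) * x
    lo, hi = -3.0, 3.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if mid ** 3 + mu * mid < target:
            lo = mid
        else:
            hi = mid
    y = 0.5 * (lo + hi)
    return x + lam * (y - x)                # lam = 1 recovers classical DCA

x = 0.3
vals = [f(x)]
for _ in range(60):
    x = dca_step(x, lam=0.5)                # damped (half-relaxed) scheme
    vals.append(f(x))
print(x, vals[-1])
```

Shrinking lam toward zero turns this recursion into a discretization of the continuous flow studied in the paper, while lam = 1 is the full-step explicit Euler view of classical DCA.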
[LG-80] A Data-Informed Variational Clustering Framework for Noisy High-Dimensional Data
链接: https://arxiv.org/abs/2604.06864
作者: Wan Ping Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Clustering in high-dimensional settings with severe feature noise remains challenging, especially when only a small subset of dimensions is informative and the final number of clusters is not specified in advance. In such regimes, partition recovery, feature relevance learning, and structural adaptation are tightly coupled, and standard likelihood-based methods can become unstable or overly sensitive to noisy dimensions. We propose DIVI, a data-informed variational clustering framework that combines global feature gating with split-based adaptive structure growth. DIVI uses informative prior initialization to stabilize optimization, learns feature relevance in a differentiable manner, and expands model complexity only when local diagnostics indicate underfit. Beyond clustering performance, we also examine runtime scalability and parameter sensitivity in order to clarify the computational and practical behavior of the framework. Empirically, we find that DIVI performs competitively under severe feature noise, remains computationally feasible, and yields interpretable feature-gating behavior, while also exhibiting conservative growth and identifiable failure regimes in challenging settings. Overall, DIVI is best viewed as a practical variational clustering framework for noisy high-dimensional data rather than as a fully Bayesian generative solution.
[LG-81] Accelerating 4D Hyperspectral Imaging through Physics-Informed Neural Representation and Adaptive Sampling
链接: https://arxiv.org/abs/2604.06561
作者: Chi-Jui Ho,Harsh Bhakta,Wei Xiong,Nicholas Antipa
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 18 pages, 14 figures
Abstract:High-dimensional hyperspectral imaging (HSI) enables the visualization of ultrafast molecular dynamics and complex, heterogeneous spectra. However, applying this capability to resolve spatially varying vibrational couplings in two-dimensional infrared (2DIR) spectroscopy, a type of coherent multidimensional spectroscopy (CMDS), necessitates prohibitively long data acquisition, driven by dense Nyquist sampling requirements and the need for extensive signal accumulation. To address this challenge, we introduce a physics-informed neural representation approach that efficiently reconstructs dense spatially-resolved 2DIR hyperspectral images from sparse experimental measurements. In particular, we used a multilayer perceptron (MLP) to model the relationship between the sub-sampled 4D coordinates and their corresponding spectral intensities, and recover densely sampled 4D spectra from limited observations. The reconstruction results demonstrate that our method, using a fraction of the samples, faithfully recovers both oscillatory and non-oscillatory spectral dynamics in experimental measurement. Moreover, we develop a loss-aware adaptive sampling method to progressively introduce potentially informative samples for iterative data collection while conducting experiments. Experimental results show that the proposed approach achieves high-fidelity spectral recovery using only 1/32 of the sampling budget, as opposed to exhaustive sampling, effectively reducing total experiment time by up to 32-fold. This framework offers a scalable solution for accelerating any experiments with hypercube data, including multidimensional spectroscopy and hyperspectral imaging, paving the way for rapid chemical imaging of transient biological and material systems.
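The coordinate-network idea, an MLP mapping coordinates to spectral intensity, can be sketched in one input dimension as a minimal stand-in for the paper's 4D setting. The architecture, training setup, and target function below are all our illustrative choices, trained with plain full-batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(64, 1))        # sparse sample coordinates
y = np.sin(3 * x)                           # "measured" intensities

H = 64                                      # hidden width
W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.3, (H, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(3000):
    h = np.tanh(x @ W1 + b1)                # forward pass
    pred = h @ W2 + b2
    err = pred - y
    loss = float((err ** 2).mean())
    g_pred = 2 * err / len(x)               # backward pass (MSE gradient)
    gW2 = h.T @ g_pred; gb2 = g_pred.sum(0)
    gh = g_pred @ W2.T * (1 - h ** 2)
    gW1 = x.T @ gh; gb1 = gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2          # gradient-descent update
    W1 -= lr * gW1; b1 -= lr * gb1

xq = np.linspace(-1, 1, 200)[:, None]       # dense query grid
dense = np.tanh(xq @ W1 + b1) @ W2 + b2     # "reconstructed" dense signal
print(loss, dense.shape)
```

Once fitted on sparse samples, the network can be queried on an arbitrarily dense grid, which is the sense in which a neural representation recovers a densely sampled spectrum from limited observations.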
[LG-82] Quantum-Inspired Tensor Network Autoencoders for Anomaly Detection: A MERA-Based Approach
链接: https://arxiv.org/abs/2604.06541
作者: Emre Gurkanli,Michael Spannowsky
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 26 pages, 5 figures
Abstract:We investigate whether a multiscale tensor-network architecture can provide a useful inductive bias for reconstruction-based anomaly detection in collider jets. Jets are produced by a branching cascade, so their internal structure is naturally organised across angular and momentum scales. This motivates an autoencoder that compresses information hierarchically and can reorganise short-range correlations before coarse-graining. Guided by this picture, we formulate a MERA-inspired autoencoder acting directly on ordered jet constituents. To the best of our knowledge, a MERA-inspired autoencoder has not previously been proposed, and this architecture has not been explored in collider anomaly detection. We compare this architecture to a dense autoencoder, the corresponding tree-tensor-network limit, and standard classical baselines within a common background-only reconstruction framework. The paper is organised around two main questions: whether locality-aware hierarchical compression is genuinely supported by the data, and whether the disentangling layers of MERA contribute beyond a simpler tree hierarchy. To address these questions, we combine benchmark comparisons with a training-free local-compressibility diagnostic and a direct identity-disentangler ablation. The resulting picture is that the locality-preserving multiscale structure is well matched to jet data, and that the MERA disentanglers become beneficial precisely when the compression bottleneck is strongest. Overall, the study supports locality-aware hierarchical compression as a useful inductive bias for jet anomaly detection.
[LG-83] Stochastic Auto-conditioned Fast Gradient Methods with Optimal Rates
链接: https://arxiv.org/abs/2604.06525
作者: Yao Ji,Guanghui Lan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Achieving optimal rates for stochastic composite convex optimization without prior knowledge of problem parameters remains a central challenge. In the deterministic setting, the auto-conditioned fast gradient method has recently been proposed to attain optimal accelerated rates without line-search procedures or prior knowledge of the Lipschitz smoothness constant, providing a natural prototype for parameter-free acceleration. However, extending this approach to the stochastic setting has proven technically challenging and remains open. Existing parameter-free stochastic methods either fail to achieve accelerated rates or rely on restrictive assumptions, such as bounded domains, bounded gradients, prior knowledge of the iteration horizon, or strictly sub-Gaussian noise. To address these limitations, we propose a stochastic variant of the auto-conditioned fast gradient method, referred to as stochastic AC-FGM. The proposed method is fully adaptive to the Lipschitz constant, the iteration horizon, and the noise level, enabling both adaptive stepsize selection and adaptive mini-batch sizing without line-search procedures. Under standard bounded conditional variance assumptions, we show that stochastic AC-FGM achieves the optimal iteration complexity of O(1/\sqrt{\varepsilon}) and the optimal sample complexity of O(1/\varepsilon^2).
[LG-84] Spatiotemporal Gaussian representation-based dynamic reconstruction and motion estimation framework for time-resolved volumetric MR imaging (DREME-GSMR)
链接: https://arxiv.org/abs/2604.06482
作者: Jiacheng Xie,Hua-Chieh Shao,Can Wu,Ricardo Otazo,Jie Deng,Mu-Han Lin,Tsuicheng Chiu,Jacob Buatti,Viktor Iakovenko,You Zhang
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: 57 pages, 10 figures
Abstract:Time-resolved volumetric MR imaging that reconstructs a 3D MRI within sub-seconds to resolve deformable motion is essential for motion-adaptive radiotherapy. Representing patient anatomy and associated motion fields as 3D Gaussians, we developed a spatiotemporal Gaussian representation-based framework (DREME-GSMR), which enables time-resolved dynamic MRI reconstruction from a pre-treatment 3D MR scan without any prior anatomical/motion model. DREME-GSMR represents a reference MRI volume and a corresponding low-rank motion model (as motion-basis components) using 3D Gaussians, and incorporates a dual-path MLP/CNN motion encoder to estimate temporal motion coefficients of the motion model from raw k-space-derived signals. Furthermore, using the solved motion model, DREME-GSMR can infer motion coefficients directly from new online k-space data, allowing subsequent intra-treatment volumetric MR imaging and motion tracking (real-time imaging). A motion-augmentation strategy is further introduced to improve robustness to unseen motion patterns during real-time imaging. DREME-GSMR was evaluated on the XCAT digital phantom, a physical motion phantom, and MR-LINAC datasets acquired from 6 healthy volunteers and 20 patients (with independent sequential scans for cross-evaluation). DREME-GSMR reconstructs MRIs of a ~400ms temporal resolution, with an inference time of ~10ms/volume. In XCAT experiments, DREME-GSMR achieved mean(s.d.) SSIM, tumor center-of-mass-error(COME), and DSC of 0.92(0.01)/0.91(0.02), 0.50(0.15)/0.65(0.19) mm, and 0.92(0.02)/0.92(0.03) for dynamic reconstruction/real-time imaging. For the physical phantom, the mean target COME was 1.19(0.94)/1.40(1.15) mm for dynamic/real-time imaging, while for volunteers and patients, the mean liver COME for real-time imaging was 1.31(0.82) and 0.96(0.64) mm, respectively.
[LG-85] Anticipating tipping in spatiotemporal systems with machine learning
链接: https://arxiv.org/abs/2604.06454
作者: Smita Deb,Zheng-Meng Zhai,Mulugeta Haile,Ying-Cheng Lai
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 26 pages, 25 figures
Abstract:In nonlinear dynamical systems, tipping refers to a critical transition from one steady state to another, typically catastrophic, steady state, often resulting from a saddle-node bifurcation. Recently, the machine-learning framework of parameter-adaptable reservoir computing has been applied to predict tipping in systems described by low-dimensional stochastic differential equations. However, anticipating tipping in complex spatiotemporal dynamical systems remains a significant open problem. The ability to forecast not only the occurrence but also the precise timing of such tipping events is crucial for providing the actionable lead time necessary for timely mitigation. By utilizing the mathematical approach of non-negative matrix factorization to generate dimensionally reduced spatiotemporal data as input, we exploit parameter-adaptable reservoir computing to accurately anticipate tipping. We demonstrate that the tipping time can be identified within a narrow prediction window across a variety of spatiotemporal dynamical systems, as well as in CMIP5 (Coupled Model Intercomparison Project 5) climate projections. Furthermore, we show that this reservoir-computing framework, utilizing reduced input data, is robust against common forecasting challenges and significantly alleviates the computational overhead associated with processing full spatiotemporal data.
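The non-negative matrix factorization step used for dimensionality reduction can be sketched with the classic Lee-Seung multiplicative updates on a synthetic snapshot matrix (columns as time steps). The data here are random low-rank, not the paper's spatiotemporal fields:

```python
import numpy as np

rng = np.random.default_rng(0)
space, time, r = 50, 200, 3
true_W = rng.random((space, r))
true_H = rng.random((r, time))
V = true_W @ true_H                        # snapshot matrix, columns = times

W = rng.random((space, r)) + 0.1           # nonnegative initial factors
H = rng.random((r, time)) + 0.1
for _ in range(300):
    # Lee-Seung multiplicative updates preserve nonnegativity by construction.
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_err, H.shape)
```

The r-dimensional time series H is the kind of reduced input that can be fed to a parameter-adaptable reservoir computer in place of the full spatiotemporal field, which is what keeps the tipping-prediction pipeline computationally tractable.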
[LG-86] Learning Debt and Cost-Sensitive Bayesian Retraining: A Forecasting Operations Framework
链接: https://arxiv.org/abs/2604.06438
作者: Harrison Katz
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Forecasters often choose retraining schedules by convention rather than by an explicit decision rule. This paper gives that decision a posterior-space language. We define learning debt as the divergence between the deployed and continuously updated posteriors, define actionable staleness as the policy-relevant latent state, and derive a one-step Bayes retraining rule under an excess-loss formulation. In an online conjugate simulation using the exact Kullback-Leibler divergence between deployed and shadow normal-inverse-gamma posteriors, a debt-filter beats a default 10-period calendar baseline in 15 of 24 abrupt-shift cells, all 24 gradual-drift cells, and 17 of 24 variance-shift cells, and remains below the best fixed cadence in a grid of cadences (5, 10, 20, and 40 periods) in 10, 24, and 17 cells, respectively. Fixed-threshold CUSUM remains a strong benchmark, while a proxy filter built from indirect diagnostics performs poorly. A retrospective Airbnb production backtest shows how the same decision logic behaves around a known payment-policy shock.
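A minimal version of a debt filter, with conjugate normal posteriors (known observation variance) and an exact closed-form KL divergence between deployed and shadow posteriors, might look like this; the threshold, prior, and shift size are invented for illustration:

```python
import numpy as np

def kl_normal(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), closed form."""
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def debt_filter(data, prior_mu=0.0, prior_var=10.0, obs_var=1.0, thresh=0.5):
    dep_mu, dep_var = prior_mu, prior_var      # deployed posterior
    mu, var = prior_mu, prior_var              # continuously updated (shadow)
    retrains = []
    for t, y in enumerate(data):
        prec = 1.0 / var + 1.0 / obs_var       # conjugate online update
        mu = (mu / var + y / obs_var) / prec
        var = 1.0 / prec
        if kl_normal(mu, var, dep_mu, dep_var) > thresh:
            dep_mu, dep_var = mu, var          # "retrain": sync the posteriors
            retrains.append(t)
    return retrains

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 100),  # stable regime
                       rng.normal(3, 1, 100)]) # abrupt shift at t = 100
retrains = debt_filter(data)
print(retrains)
```

The filter retrains whenever the accumulated divergence ("learning debt") between what is deployed and what the data now support crosses the threshold, so an abrupt shift triggers a retrain shortly after it occurs rather than at the next calendar date.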
[LG-87] Operator Learning for Surrogate Modeling of Wave-Induced Forces from Sea Surface Waves
链接: https://arxiv.org/abs/2604.06433
作者: Shukai Cai,Sourav Dutta,Mark Loveland,Eirik Valseth,Peter Rivera-Casillas,Corey Trahan,Clint Dawson
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 46 pages, 15 figures
Abstract:Wave setup plays a significant role in transferring wave-induced energy to currents and causing an increase in water elevation. This excess momentum flux, known as radiation stress, motivates the coupling of circulation models with wave models to improve the accuracy of storm surge prediction, however, traditional numerical wave models are complex and computationally expensive. As a result, in practical coupled simulations, wave models are often executed at much coarser temporal resolution than circulation models. In this work, we explore the use of Deep Operator Networks (DeepONets) as a surrogate for the Simulating WAves Nearshore (SWAN) numerical wave model. The proposed surrogate model was tested on three distinct 1-D and 2-D steady-state numerical examples with variable boundary wave conditions and wind fields. When applied to a realistic numerical example of steady state wave simulation in Duck, NC, the model achieved consistently high accuracy in predicting the components of the radiation stress gradient and the significant wave height across representative scenarios.
[LG-88] Calibration of a neural network ocean closure for improved mean state and variability
链接: https://arxiv.org/abs/2604.06398
作者: Pavel Perezhogin,Alistair Adcroft,Laure Zanna
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Global ocean models exhibit biases in the mean state and variability, particularly at coarse resolution, where mesoscale eddies are unresolved. To address these biases, parameterization coefficients are typically tuned ad hoc. Here, we formulate parameter tuning as a calibration problem using Ensemble Kalman Inversion (EKI). We optimize parameters of a neural network parameterization of mesoscale eddies in two idealized ocean models at coarse resolution. The calibrated parameterization reduces errors in the time-averaged fluid interfaces and their variability by approximately a factor of two compared to the unparameterized model or the offline-trained parameterization. The EKI method is robust to noise in time-averaged statistics arising from chaotic ocean dynamics. Furthermore, we propose an efficient calibration protocol that bypasses integration to statistical equilibrium by carefully choosing an initial condition. These results demonstrate that systematic calibration can substantially improve coarse-resolution ocean simulations and provide a practical pathway for reducing biases in global ocean models.
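A basic perturbed-observation Ensemble Kalman Inversion loop on a toy linear forward map conveys the update the paper applies to its neural closure's parameters; here G(theta) = A @ theta stands in for the ocean model's time-averaged statistics, and all dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, J = 4, 2, 100                        # obs dim, param dim, ensemble size
A = rng.normal(size=(d, p))                # toy linear forward map
theta_true = np.array([1.5, -0.7])
Gamma = 0.01 * np.eye(d)                   # observation-noise covariance
y = A @ theta_true + rng.multivariate_normal(np.zeros(d), Gamma)

theta = rng.normal(size=(J, p))            # initial parameter ensemble
for _ in range(20):
    G = theta @ A.T                        # forward map for each member
    tm, gm = theta.mean(0), G.mean(0)
    Ctg = (theta - tm).T @ (G - gm) / J    # cross-covariance C^{theta,G}
    Cgg = (G - gm).T @ (G - gm) / J        # output covariance C^{G,G}
    K = Ctg @ np.linalg.inv(Cgg + Gamma)   # Kalman-type gain
    noise = rng.multivariate_normal(np.zeros(d), Gamma, size=J)
    theta = theta + (y + noise - G) @ K.T  # perturbed-observation update

est = theta.mean(0)
print(est, theta_true)
```

EKI is derivative-free: it needs only forward evaluations of the (possibly chaotic, noisy) model, which is why it tolerates the noisy time-averaged statistics the abstract mentions.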
[LG-89] Tight Convergence Rates for Online Distributed Linear Estimation with Adversarial Measurements
链接: https://arxiv.org/abs/2604.06282
作者: Nibedita Roy,Vishal Halder,Gugan Thoppe,Alexandre Reiffers-Masson,Mihir Dhanakshirur,Naman,Alexandre Azor
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint
Abstract:We study mean estimation of a random vector X in a distributed parameter-server-worker setup. Worker i observes samples of a_i^\top X, where a_i^\top is the i-th row of a known sensing matrix A. The key challenges are adversarial measurements and asynchrony: a fixed subset of workers may transmit corrupted measurements, and workers are activated asynchronously, with only one active at any time. In our previous work, we proposed a two-timescale \ell_1-minimization algorithm and established asymptotic recovery under a null-space-property-like condition on A. In this work, we establish tight non-asymptotic convergence rates under the same null-space-property-like condition. We also identify relaxed conditions on A under which exact recovery may fail but recovery of a projected component of \mathbb{E}[X] remains possible. Overall, our results provide a unified finite-time characterization of robustness, identifiability, and statistical efficiency in distributed linear estimation with adversarial workers, with implications for network tomography and related distributed sensing problems.