This post contains the latest paper listing retrieved from Arxiv.org on 2026-04-24. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from Arxiv.org daily and updated automatically around 12:30 each morning.

Tip: If a given day's list is not updated on time, either Arxiv published no new papers that day or the update script failed. Fixes are applied the same day whenever possible.

Table of Contents

Overview (2026-04-24)

A total of 593 papers were updated today (the category counts below overlap because papers can be cross-listed), including:

  • Natural Language Processing: 107 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 202 papers (cs.AI)
  • Computer Vision: 102 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 141 papers (cs.LG)
  • Multi-Agent Systems: 15 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 35 papers (cs.IR)
  • Human-Computer Interaction: 26 papers (cs.HC)

Multi-Agent Systems

[MA-0] Task-Driven Co-Design of Heterogeneous Multi-Robot Systems

[Quick read]: This paper addresses the task-driven co-design of heterogeneous multi-robot systems, i.e., how to jointly optimize robot design, fleet composition, and planning strategy so as to maximize system-level performance. Traditional approaches optimize these sub-modules in isolation, overlooking the coupling between cross-domain decisions and the trade-offs imposed by task constraints. The key to the solution is a formal and compositional framework grounded in monotone co-design theory, which abstracts robots, fleets, planners, executors, and evaluators as interconnected design problems with well-defined interfaces, agnostic to both concrete implementations and task types. This structure enables efficient joint optimization under task-specific performance constraints, and a series of case studies validates the flexibility, scalability, and interpretability of the approach while systematically uncovering non-obvious design alternatives with optimality guarantees.

Link: https://arxiv.org/abs/2604.21894
Authors: Maximilian Stralz, Meshal Alharbi, Yujun Huang, Gioele Zardini
Affiliations: Unknown
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:

Abstract:Designing multi-agent robotic systems requires reasoning across tightly coupled decisions spanning heterogeneous domains, including robot design, fleet composition, and planning. Much effort has been devoted to isolated improvements in these domains, whereas system-level co-design considering trade-offs and task requirements remains underexplored. In this work, we present a formal and compositional framework for the task-driven co-design of heterogeneous multi-robot systems. Building on a monotone co-design theory, we introduce general abstractions of robots, fleets, planners, executors, and evaluators as interconnected design problems with well-defined interfaces that are agnostic to both implementations and tasks. This structure enables efficient joint optimization of robot design, fleet composition, and planning under task-specific performance constraints. A series of case studies demonstrates the capabilities of the framework. Various component models can be seamlessly incorporated, including new robot types, task profiles, and probabilistic sensing objectives, while non-obvious design alternatives are systematically uncovered with optimality guarantees. The results highlight the flexibility, scalability, and interpretability of the proposed approach, and illustrate how formal co-design enables principled reasoning about complex heterogeneous multi-robot systems.

[MA-1] Probably Approximately Consensus: On the Learning Theory of Finding Common Ground IJCAI2025

[Quick read]: This paper addresses a limitation of consensus elicitation on online deliberation platforms: existing approaches identify consensus only from user preferences over specific statements, without accounting for the relative salience of different issues. To tackle this, the authors model consensus as an interval in a one-dimensional opinion space extracted from potentially high-dimensional data via embedding and dimensionality reduction. The key to the solution is an objective that maximizes expected agreement within a hypothesis interval, with the expectation taken over an underlying distribution of issues so that salience is incorporated implicitly; the authors also design an efficient Empirical Risk Minimization (ERM) algorithm with PAC-learning guarantees. Experiments show that selectively querying users on a sample of existing statements substantially reduces the number of interactions needed to reach the optimal consensus region.

Link: https://arxiv.org/abs/2604.21811
Authors: Carter Blair, Ben Armstrong, Shiri Alouf-Heffetz, Nimrod Talmon, Davide Grossi
Affiliations: University of Waterloo; Tulane University; Ben Gurion University; University of Groningen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted to the Social Choice and Learning Algorithms Workshop at IJCAI 2025

Abstract:A primary goal of online deliberation platforms is to identify ideas that are broadly agreeable to a community of users through their expressed preferences. Yet, consensus elicitation should ideally extend beyond the specific statements provided by users and should incorporate the relative salience of particular topics. We address this issue by modelling consensus as an interval in a one-dimensional opinion space derived from potentially high-dimensional data via embedding and dimensionality reduction. We define an objective that maximizes expected agreement within a hypothesis interval where the expectation is over an underlying distribution of issues, implicitly taking into account their salience. We propose an efficient Empirical Risk Minimization (ERM) algorithm and establish PAC-learning guarantees. Our initial experiments demonstrate the performance of our algorithm and examine more efficient approaches to identifying optimal consensus regions. We find that through selectively querying users on an existing sample of statements, we can reduce the number of queries needed to a practical number.
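The interval-ERM idea above can be made concrete in a simplified form: if each statement carries a net agreement score (agreement minus disagreement) at its 1-D embedding position, the best hypothesis interval is the contiguous run of sorted positions with maximal summed score, which a Kadane-style scan finds in O(n log n). A hedged sketch (an illustrative simplification, not the paper's exact objective or algorithm; the scoring scheme is my assumption):

```python
def best_interval(positions, net_agreement):
    """Find the interval on a 1-D opinion axis maximizing summed net agreement.

    positions:     1-D embedding coordinate of each statement
    net_agreement: per-statement agreement minus disagreement (can be negative)
    Returns (left, right, score) for the best contiguous run after sorting,
    via a Kadane-style maximum-subarray scan.
    """
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    best = (0.0, None, None)          # (score, left index, right index)
    run, start = 0.0, 0
    for k, i in enumerate(order):
        if run <= 0:                  # a negative prefix never helps: restart
            run, start = 0.0, k
        run += net_agreement[i]
        if run > best[0]:
            best = (run, start, k)
    score, lo, hi = best
    if lo is None:
        return None                   # every statement has negative net agreement
    return positions[order[lo]], positions[order[hi]], score
```

In the example inputs used in the test, the interval [0.1, 0.2] wins because including the statement at 0.9 would cost more agreement than it adds.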

[MA-2] Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

[Quick read]: This paper addresses a limitation in the design of communication mechanisms for multi-agent systems: current approaches treat inter-agent interaction as a fixed interface rather than a learnable, dynamic process, limiting coordination efficiency and accuracy on complex reasoning tasks. The key to the solution is DiffMAS, a training framework that treats latent communication as a learnable component of the multi-agent system: through parameter-efficient supervised training over multi-agent latent trajectories, agents jointly learn how to encode and decode information across interactions, enabling end-to-end optimization of communication and reasoning.

Link: https://arxiv.org/abs/2604.21794
Authors: Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang, Haohan Wang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Under review at COLM 2026

Abstract:Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.

[MA-3] Agentic AI-Enabled Framework for Thermal Comfort and Building Energy Assessment in Tropical Urban Neighborhoods

[Quick read]: This paper targets the tension between Singapore's urban heat island effect and building energy demand, proposing a reasoning framework driven by agentic AI. The key to the solution is combining large language models (LLMs) with lightweight physics-based models: via prompt customization, the LLMs interpret urban design tasks, extract relevant policies, and activate the appropriate physics-based models for evaluation, forming a closed-loop reasoning-action process. The lightweight physics models, built on core thermal and airflow principles, streamline conventional complex models to sharply reduce computation time while accurately predicting microclimate variables (building surface temperature, ground radiant heat, and airflow conditions), from which thermal comfort indices such as physiological equivalent temperature (PET) and building energy usage are estimated. The approach enables rapid quantitative assessment of climate-resilient building surface strategies such as green façades and cool paints, offering efficient, cross-disciplinary support for sustainable urban design.

Link: https://arxiv.org/abs/2604.21787
Authors: Po-Yen Lai, Xinyu Yang, Derrick Low, Huizhe Liu, Jian Cheng Wong
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Computational Physics (physics.comp-ph)
Comments: Accepted at IAQVEC 2026

Abstract:In response to the urban heat island effects and building energy demands in Singapore, this study proposes an agentic AI-enabled reasoning framework that integrates large language models (LLMs) with lightweight physics-based models. Through prompt customization, the LLMs interpret urban design tasks, extract relevant policies, and activate appropriate physics-based models for evaluation, forming a closed-loop reasoning-action process. These lightweight physics-based models leverage core thermal and airflow principles, streamlining conventional models to reduce computational time while predicting microclimate variables, such as building surface temperature, ground radiant heat, and airflow conditions, thereby enabling the estimation of thermal comfort indices, e.g., physiological equivalent temperature (PET), and building energy usage. This framework allows users to explore a variety of climate-resilient building surface strategies, e.g., green façades and cool paint applications, that improve thermal comfort while reducing wall heat gain and energy demand. By combining the autonomous reasoning capacity of LLMs with the rapid quantitative evaluation of lightweight physics-based models, the proposed system demonstrates potential for cross-disciplinary applications in sustainable urban design, indoor-outdoor environmental integration, and climate adaptation planning. The source code and data used in this study are available at: this https URL.

[MA-4] StructMem: Structured Memory for Long-Horizon Behavior in LLMs ACL2026

[Quick read]: This paper addresses the difficulty of memory systems for long-term conversational agents to capture relationships between events while still supporting efficient reasoning. Existing methods face a trade-off between flat memory and graph-based memory: the former is efficient but cannot model relations, while the latter supports structured reasoning at the cost of expensive and fragile construction. The key to the solution is StructMem, a structure-enriched hierarchical memory framework that temporally anchors dual perspectives and performs periodic semantic consolidation, preserving event-level bindings while automatically inducing cross-event connections. This markedly improves temporal reasoning and multi-hop question answering while substantially reducing token usage, API calls, and runtime.

Link: https://arxiv.org/abs/2604.21748
Authors: Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng
Affiliations: Zhejiang University; Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted by ACL 2026 main conference

Abstract:Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems, see this https URL

[MA-5] Architectures for Robust Self-Organizing Energy Systems under Information and Control Constraints

[Quick read]: This paper investigates how controlled self-organization can improve robustness in agent-based Cyber-Physical Energy Systems (CPES), particularly the ability to respond to disturbances such as cyber attacks. The key to the solution lies in designing observer/controller architecture variants adapted to restricted information access and limited action capabilities, explicitly accounting for the privacy of distributed energy resources, regulatory constraints, and data exchange requirements, and in evaluating the effects of the controller actions available under each architecture, thereby providing practical robustness guidance for agent-based system design in real-world applications.

Link: https://arxiv.org/abs/2604.21529
Authors: Emilie Frost, Astrid Nieße
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution will be published in Agents and Artificial Intelligence, Lecture Notes in Computer Science, and available online at this https URL

Abstract:Applying the concept of controlled self-organization in agent-based Cyber-Physical Energy Systems (CPES) is a promising approach to ensure system robustness. By introducing an observer/controller architecture to the system, this concept allows for self-organization while still enabling intervention when disturbances occur. Thus, it is possible to respond to effects of cyber attacks, a major threat to current energy systems. However, when implementing an observer to monitor the system and a controller to execute actions for controlled self-organization in CPES, it is essential to take into account restrictions on information and actions resulting from the privacy of local distributed energy resources, regulatory constraints, and data exchange requirements. For this reason, this paper presents architecture variants for the observer and controller that take into account restrictions on access to information and limited actions. In addition, it evaluates possible controller actions in various architectures. The results underscore the importance of considering observer/controller architectures when designing agent-based systems to ensure their robustness for real-world applications.

[MA-6] AI-Gram: When Visual Agents Interact in a Social Network

[Quick read]: This paper studies social dynamics in a fully autonomous multi-agent visual network, exploring how image-based interactions give rise to communication and adaptation mechanisms among agents. The key to the solution is AI-Gram, a live platform populated entirely by LLM-driven agents that interact spontaneously through visual media. Experiments reveal the spontaneous emergence of visual reply chains with rich communicative structure, alongside aesthetic sovereignty: agents preserve their own visual style, remaining independent even under adversarial influence, and visual similarity is decoupled from social ties. These findings expose a fundamental asymmetry in current agent architectures: strong expressive communication paired with steadfast preservation of individual visual identity.

Link: https://arxiv.org/abs/2604.21446
Authors: Andrew Shin
Affiliations: Keio University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:

Abstract:We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty: resisting stylistic convergence toward social partners, anchoring under adversarial influence, and a decoupling between visual similarity and social ties. These results reveal a fundamental asymmetry in current agent architectures: strong expressive communication paired with a steadfast preservation of individual visual identity. We release AI-Gram as a publicly accessible, continuously evolving platform for studying social dynamics in AI-native multi-agent systems. this https URL

[MA-7] Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

[Quick read]: This paper tackles the difficulty of understanding and answering questions over multi-chart images, particularly in real research settings where users must reason over several related charts to derive meaningful insights. Multi-chart understanding has received little attention, with few high-quality annotated datasets or systematic evaluation frameworks. The authors therefore introduce PolyChartQA, a mid-scale dataset of 534 multi-chart images (2,297 sub-charts in total) and 2,694 QA pairs sourced from peer-reviewed computer science publications. The key lies in building this dataset and systematically evaluating nine state-of-the-art multimodal language models (MLMs) across question type, difficulty, question source, and structural characteristics of the charts, and in proposing an improved prompting method that yields a 5.39% L-accuracy gain on human-authored questions, mitigating the marked performance drop in realistic settings (27.4% lower accuracy on human-authored than on model-generated questions).

Link: https://arxiv.org/abs/2604.21344
Authors: Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong
Affiliations: Iowa State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Research on understanding multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-accuracy gain with our proposed prompting method.

[MA-8] PREVENT-JACK: Context Steering for Swarms of Long Heavy Articulated Vehicles

[Quick read]: This paper addresses local, decentralized coordination for swarms of long Heavy Articulated Vehicles (HAVs), which are kinematically constrained, elongated, and articulated, so traditional point-mass models do not apply. The central challenge is to avoid jackknifing and collisions in complex environments while preserving coordination efficiency. The key to the solution is Prevent-Jack, a sparsely covered context steering framework that fuses six local behaviors to achieve efficient swarm control with guarantees against jackknifing and collisions; a parameter study shows the Evade Attraction behavior is essential for deadlock prevention. The approach is validated with 15,000 simulations for vehicles with up to ten trailers, showing that dead- and livelocks become markedly more frequent in larger swarms and denser scenarios, affecting a peak average of 27%/31% of vehicles.

Link: https://arxiv.org/abs/2604.21337
Authors: Adrian Baruck, Michael Dubé, Christoph Steup, Sanaz Mostaghim
Affiliations: Otto-von-Guericke University Magdeburg; Fraunhofer Institute for Industrial Engineering
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 32 pages, 7 figures, 4 videos; submitted to the Swarm Robotics collection of the Nature Portfolio Journal Robotics (NPJ Robot)

Abstract:In this paper, we aim to extend the traditional point-mass-like robot representation in swarm robotics and instead study a swarm of long Heavy Articulated Vehicles (HAVs). HAVs are kinematically constrained, elongated, and articulated, introducing unique challenges. Local, decentralized coordination of these vehicles is motivated by many real-world applications. Our approach, Prevent-Jack, introduces the sparsely covered context steering framework in robotics. It fuses six local behaviors, providing guarantees against jackknifing and collisions at the cost of potential dead- and livelocks, tested for vehicles with up to ten trailers. We highlight the importance of the Evade Attraction behavior for deadlock prevention using a parameter study, and use 15,000 simulations to evaluate the swarm performance. Our extensive experiments and the results show that both the dead- and livelocks occur more frequently in larger swarms and denser scenarios, affecting a peak average of 27%/31% of vehicles. We observe that larger swarms exhibit increased waiting, while smaller swarms show increased evasion.
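Context steering, the general framework Prevent-Jack instantiates, fuses per-behavior "interest" and "danger" maps over discretized headings and picks the most attractive direction among the least dangerous ones. A generic sketch of that fusion step (the standard context-steering pattern, not Prevent-Jack's six specific behaviors or its jackknifing guarantees):

```python
import math

def fuse_context_maps(interest_maps, danger_maps, n_dirs=16):
    """Pick a heading by context steering.

    interest_maps / danger_maps: per-behavior lists of per-direction scores,
    each of length n_dirs. Directions are combined by taking the elementwise
    maximum, dangerous directions are masked out, and the most interesting
    remaining direction wins. Returns the chosen heading in radians.
    """
    interest = [max(m[d] for m in interest_maps) for d in range(n_dirs)]
    danger = [max(m[d] for m in danger_maps) for d in range(n_dirs)]
    threshold = min(danger)                  # keep only least-dangerous headings
    candidates = [d for d in range(n_dirs) if danger[d] <= threshold]
    best = max(candidates, key=lambda d: interest[d])
    return 2 * math.pi * best / n_dirs
```

With one interest map favoring direction 1 and one danger map blocking direction 2 (of 4), the agent turns to the quarter-circle heading.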

[MA-9] Role of diversity in team performance: the case of missing expertise, an agent-based simulation WWW

[Quick read]: This paper addresses the unclear mechanisms by which management teams' functional diversity affects organizational performance; existing studies are mostly confined to the first one or two modes of the variable distributions and struggle to capture hidden distributional features and their interactions with contextual factors such as communication schemes and team composition. The key to the solution is an agent-based model: a series of simulation experiments shows that intrapersonal functional diversity (IFD) and dominant function diversity (DFD) can enhance or reduce team communication and performance depending on context, and that a third measure capturing the team's aggregate expertise is needed to comprehensively account for the empirical findings.

Link: https://arxiv.org/abs/2604.21328
Authors: Tamás Kiss
Affiliations: Wigner Research Centre for Physics; Hungarian Research Network
Subjects: Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
Comments: 20 pages, 13 figures; for the associated model file, please see this https URL

Abstract:Theory and empirical research on management teams’ influence on firm performance have witnessed continuous development, and by now incorporate numerous details. Classic, experiment-based studies examining social systems collect vast amounts of data, but often investigate only the first one or two modes of the distribution of measured variables, and experience difficulty in analyzing the effect of context. For example, in functional diversity research, management teams are described by measures incorporating complex distributions of capabilities of individual managers and teams of managers. To investigate the effect of hidden distributions, and the effect of functional diversity composition on team communication and performance, we developed an agent-based model, and conducted a series of simulation experiments. Modeling results show that depending on the context, such as communication scheme among interacting agents, or their functional composition, intrapersonal functional diversity (IFD), and dominant function diversity (DFD) might enhance or reduce performance and communication among agents. Furthermore, simulation results also suggest that a third measure is required alongside IFD and DFD capturing the aggregate expertise of the team to comprehensively account for empirical findings.

[MA-10] Multi-Agent Empowerment and Emergence of Complex Behavior in Groups

[Quick read]: This paper studies how intrinsic motivation can drive group organization in multi-agent systems, focusing on extending one particular intrinsic incentive, empowerment, to the multi-agent setting. The key to the solution is a principled multi-agent extension of empowerment together with its efficient computation. Experiments show that this intrinsic motivation gives rise to characteristic modes of group organization in two qualitatively distinct environments, a pair of agents coupled by a tendon and a controllable Vicsek flock, demonstrating that empowerment can drive not only individual behavior but also higher levels of behavioral organization in groups.

Link: https://arxiv.org/abs/2604.21155
Authors: Tristan Shah, Ilya Nemenman, Daniel Polani, Stas Tiomkin
Affiliations: Texas Tech University; Emory University; University of Hertfordshire
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 11 pages

Abstract:Intrinsic motivations are receiving increasing attention, i.e. behavioral incentives that are not engineered, but emerge from the interaction of an agent with its surroundings. In this work we study the emergence of behaviors driven by one such incentive, empowerment, specifically in the context of more than one agent. We formulate a principled extension of empowerment to the multi-agent setting, and demonstrate its efficient calculation. We observe that this intrinsic motivation gives rise to characteristic modes of group-organization in two qualitatively distinct environments: a pair of agents coupled by a tendon, and a controllable Vicsek flock. This demonstrates the potential of intrinsic motivations such as empowerment to not just drive behavior for only individual agents but also higher levels of behavioral organization at scale.
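For a single agent with a discrete action-to-next-state channel, empowerment is the channel capacity max over p(a) of I(A; S'), computable with the classic Blahut-Arimoto iteration. A sketch of that single-agent base case (the paper's multi-agent extension is not reproduced here):

```python
import math

def empowerment(channel, iters=200):
    """Capacity max_{p(a)} I(A; S') of a discrete action -> next-state channel,
    computed with the Blahut-Arimoto iteration; returned in bits.

    channel[a][s] = p(next state s | action a); each row sums to 1.
    """
    n_a, n_s = len(channel), len(channel[0])
    p = [1.0 / n_a] * n_a                               # action distribution p(a)
    for _ in range(iters):
        marg = [sum(p[a] * channel[a][s] for a in range(n_a))
                for s in range(n_s)]
        # Multiplicative update: p(a) <- p(a) * exp(KL(row_a || marg)) / Z
        c = [math.exp(sum(channel[a][s] * math.log(channel[a][s] / marg[s])
                          for s in range(n_s) if channel[a][s] > 0))
             for a in range(n_a)]
        z = sum(p[a] * c[a] for a in range(n_a))
        p = [p[a] * c[a] / z for a in range(n_a)]
    marg = [sum(p[a] * channel[a][s] for a in range(n_a)) for s in range(n_s)]
    return sum(p[a] * channel[a][s] * math.log2(channel[a][s] / marg[s])
               for a in range(n_a) for s in range(n_s) if channel[a][s] > 0)
```

For a perfectly controllable channel (two actions leading deterministically to two distinct states) the capacity is exactly 1 bit.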

[MA-11] AGNT2: Autonomous Agent Economies on Interaction-Optimized Layer 2 Infrastructure

[Quick read]: This paper addresses a fundamental limitation of current blockchain Layer 2 solutions (Optimism, Arbitrum, zkSync, and derivatives) in supporting the high-frequency, semantically rich service invocations of autonomous AI agents. Existing chains treat these interactions as generic calldata, forcing identity, escrow, dependency ordering, and session state to be encoded above the execution layer at the wrong cost point, adding needless complexity and cost. The core of the solution is AGNT2, a three-tier stack purpose-built for agent and microservice coordination: (1) a sidecar deployment pattern that turns any Docker container into an on-chain agent without application-code changes; (2) a layered network, with Layer Top providing P2P state channels for established bilateral pairs (100 ms latency target, 1K-5K TPS per pair, 10M+ aggregate TPS), Layer Core acting as a dependency-aware sequenced rollup for first-contact and multi-party interactions (500 ms-2 s, 300K-500K TPS target), and Layer Root settling via computational fraud proofs anchored to any EVM L1; and (3) an agent-native execution environment and interaction trie that make service invocation, identity, reputation, capabilities, and session context protocol-native objects. The paper focuses on the execution-layer systems problems: sequencing, state management, settlement, and the data-availability (DA) bandwidth bottleneck. Simulation and analytical modeling support the architecture and prototype measurements validate selected components, but no end-to-end Layer Core deployment exists yet; practical throughput is currently DA-bound at roughly 10K-100K TPS, about 100x below the target. AGNT2 argues that the agent economy requires a dedicated execution layer rather than a general-purpose chain repurposed for agents.

Link: https://arxiv.org/abs/2604.21129
Authors: Anbang Ruan, Xing Zhang
Affiliations: NetX Foundation
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Current blockchain Layer 2 solutions, including Optimism, Arbitrum, zkSync, and their derivatives, optimize for human-initiated financial transactions. Autonomous AI agents instead generate high-frequency, semantically rich service invocations among mutually untrusting principals. Existing chains treat those interactions as generic calldata, forcing identity, escrow, dependency ordering, and session state to be encoded above the execution layer at the wrong cost point. We present AGNT2, a three-tier stack purpose-built for agent and microservice coordination on-chain. AGNT2 combines: (1) a sidecar deployment pattern that turns any Docker container into an on-chain agent without application-code modification; (2) Layer Top P2P state channels for established bilateral pairs (100 ms, rough design target 1K-5K TPS per pair, 10M+ aggregate TPS design envelope under endpoint-resource limits), Layer Core as a dependency-aware sequenced rollup for first-contact and multi-party interactions (500 ms-2 s, 300K-500K TPS design target), and Layer Root settlement with computational fraud proofs anchored to any EVM L1; and (3) an agent-native execution environment plus interaction trie that make service invocation, identity, reputation, capabilities, and session context first-class protocol objects. This paper focuses on the execution-layer systems problem: sequencing, state, settlement, and the data-availability (DA) bandwidth gap that bounds all three. Simulation and analytical modeling support the architecture, and prototype measurements validate selected components, but no end-to-end Layer Core implementation exists yet. Practical deployment is currently constrained to roughly 10K-100K TPS by DA throughput, leaving a ~100x gap at the target ceiling. AGNT2 argues that the agent economy requires a dedicated execution layer rather than a general-purpose chain repurposed for agents.

[MA-12] Votiverse: A Configurable Governance Platform for Democratic Decision-Making

[Quick read]: This paper addresses the rigidity of traditional representative democracy, which, shaped by historical path dependence, lacks the flexibility and depth of participation that modern organizations and communities require; the core aim is to expand the configuration space of democratic governance. The key to the solution is Votiverse, a configurable governance engine supporting the full spectrum from direct voting through delegation to hybrid arrangements. It introduces two structural innovations: a governance awareness layer that monitors the delegation network in real time and delivers contextual, progressive-disclosure reporting to participants, and a prediction-tracking accountability layer that records falsifiable predictions in proposals alongside actual outcomes, building a collective memory that turns voting from a momentary act into an ongoing process of collective learning. In this model, traditional representative democracy is treated as an edge case rather than the norm, and the platform offers complex organizations a scalable, auditable, and evolving governance infrastructure.

Link: https://arxiv.org/abs/2604.20863
Authors: Diego Macrini
Affiliations: Proximify Inc.
Subjects: Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:

Abstract:Democracy is not a single mechanism. It is a space of possible configurations – a spectrum stretching from pure direct participation to full delegation of authority. The systems we live under today occupy a narrow band of that spectrum, chosen centuries ago under constraints that no longer apply, and rarely questioned since. Votiverse is a platform for exploring the rest of that space. It provides organizations, communities, and institutions of any size with a configurable governance engine. Participants can vote directly, delegate their vote to trusted individuals by topic, or operate under any hybrid arrangement their group defines. Delegations are revocable, topic-specific, and transitive. A direct vote always overrides a delegation. In this model, traditional representative democracy is not the norm – it is an edge case: the configuration you get when delegation is forced, universal, non-specific, and irrevocable for a fixed term. Votiverse introduces two structural innovations. First, a governance awareness layer – a built-in system that monitors the delegation network and delivers contextual, progressive-disclosure reporting to participants at the point of decision. Second, a prediction-tracking accountability layer. Proposals carry falsifiable predictions. Outcomes are recorded. Over time, the platform builds a collective memory of what was decided, what was promised, and what actually happened. Together, these layers transform voting from a momentary act into an ongoing process of collective learning. This paper formalizes the governance model, situates it within existing work on liquid democracy and participatory decision-making, addresses known failure modes, and describes the architecture of the platform. The core platform has been implemented and released as open-source software.

[MA-13] Architecture of an AI-Based Automated Course of Action Generation System for Military Operations

[Quick read]: This paper addresses the bottleneck that traditional Course of Action (CoA) planning faces in modern warfare, where faster maneuvers, extended surveillance ranges, and longer weapon ranges make manned planning increasingly challenging, and asks how AI can automate and support CoA planning. The key to the solution is to survey the relevant doctrines and CoA planning processes within the scope of publicly available information, identify applicable AI technologies for each stage of the process, and ultimately propose an architecture for an automated CoA planning system, providing a technical path and structural foundation for intelligent command decision-making.

Link: https://arxiv.org/abs/2604.20862
Authors: Ji-il Park, Inwook Shim, Chong Hui Kim
Affiliations: Ministry of National Defense; Inha University; Defense AI R&D Institute, Agency for Defense Development (ADD)
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 15 figures, 2 tables

Abstract:The automation system for Course of Action (CoA) planning is an essential element in future warfare. As maneuver speeds increase, surveillance ranges extend, and weapon ranges grow, the operational area expands, making traditional manned-based CoA planning increasingly challenging. Consequently, the development of an AI-based automated CoA planning system is becoming increasingly necessary. Accordingly, several countries and defense organizations are actively developing AI-based CoA planning systems. However, due to security restrictions and limited public disclosure, the technical maturity of such systems remains difficult to assess. Furthermore, as these systems are military-related, their details are not publicly disclosed, making it difficult to accurately assess the current level of development. In response to this, this study aims to introduce relevant doctrines within the scope of publicly available information and present applicable AI technologies for each stage of the CoA planning process. Ultimately, it proposes an architecture for the development of an automated CoA planning system.

[MA-14] Caesar: Deep Agentic Web Exploration for Creative Answer Synthesis

[Quick read]: This paper addresses the gap between information retrieval and knowledge synthesis in current autonomous agents: existing frameworks prioritize convergent search, producing derivative, homogeneous output that lacks creativity. The core challenge is the leap from passive information gathering to active creation of new insights. The key to the solution is Caesar, an agentic LLM architecture with two components: an exploration module driven by a dynamic context-aware policy that navigates an extensive knowledge graph, and a synthesis module controlled by an adversarial draft refinement loop that actively seeks novel perspectives rather than confirming established priors, stimulating non-obvious conceptual connections and markedly improving the novelty and structural coherence of the output.

Link: https://arxiv.org/abs/2604.20855
Authors: Jason Liang, Elliot Meyerson, Risto Miikkulainen
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:

Abstract:To advance from passive retrieval to creative discovery of new ideas, autonomous agents must be capable of deep, associative synthesis. However, current agentic frameworks prioritize convergent search, often resulting in derivative summaries that lack creativity. Caesar is an agentic LLM architecture designed to bridge the gap between information gathering and synthesis of new insights. Unlike existing agents that treat the web as a flat sequence of disconnected documents, Caesar leverages an extensive knowledge graph to foster associative reasoning, thus enabling the discovery of non-obvious connections between disparate concepts. It consists of two components: (1) exploration driven by a dynamic context-aware policy, and (2) synthesis controlled by an adversarial draft refinement loop that actively seeks novel perspectives rather than confirming established priors. Caesar demonstrates the ability to generate artifacts and answers characterized by high novelty and structural coherence, significantly outperforming state-of-the-art LLM research agents in tasks requiring creativity.

Natural Language Processing

[NLP-0] Evaluation of Automatic Speech Recognition Using Generative Large Language Models

[Quick read]: This paper addresses the insensitivity of Word Error Rate (WER), the traditional ASR evaluation metric, to semantic content, which weakens its correlation with human perception. To make evaluation more semantically relevant and interpretable, the paper applies decoder-based large language models (LLMs) to semantic ASR evaluation through three approaches: (1) having an LLM select the better of two candidate hypotheses, reaching 92-94% agreement with human annotators; (2) computing semantic distance from generative embeddings, outperforming existing semantic metrics; and (3) qualitative classification of error types. The results show that embeddings from decoder-based LLMs perform comparably to encoder models, and that LLMs overall offer a more human-aligned, interpretable direction for ASR evaluation.

Link: https://arxiv.org/abs/2604.21928
Authors: Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour
Affiliations: Idiap Research Institute; Avignon University; Le Mans University; Nantes University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
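The insensitivity to meaning that motivates the paper is built into WER's definition: it is plain word-level edit distance, so a meaning-inverting substitution costs exactly as much as a harmless one. A minimal sketch of the standard metric (for illustration; not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (sub/ins/del all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Under this metric, substituting "not" for "now" and substituting a perfect synonym both add exactly one error, which is why semantic metrics correlate better with human judgments.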

[NLP-1] MathDuels: Evaluating LLMs as Problem Posers and Solvers

[Quick read]: This paper addresses the problem that, as frontier language models approach ceiling performance on static math benchmarks, existing evaluations can no longer differentiate model capabilities; traditional evaluation casts models solely as solvers of fixed problem sets, ignoring their combined ability to generate and understand complex math problems. The key to the solution is MathDuels, a self-play benchmark in which models occupy the dual roles of problem poser and solver: problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), with an independent verifier excluding ill-posed questions; a Rasch model jointly estimates solver abilities and problem difficulties, from which authoring quality is derived. Experiments show that solving and authoring capabilities are partially decoupled, and that the benchmark's difficulty co-evolves as new models enter rather than saturating at a fixed ceiling, enabling a finer-grained characterization of model capabilities.

链接: https://arxiv.org/abs/2604.21916
作者: Zhiqiu Xu,Shibo Jin,Shreya Arya,Mayur Naik
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model’s authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark’s difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
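The Rasch model cited above has a simple logistic form: the probability that solver i answers problem j correctly is sigmoid(theta_i − b_j), where theta is ability and b is difficulty. Below is a minimal sketch of that probability plus one joint gradient-ascent step on the log-likelihood; the estimation procedure here is illustrative, as the abstract does not specify the paper's fitting method.

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """Rasch model: P(solver with ability theta solves a problem of difficulty b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_step(thetas, bs, outcomes, lr=0.1):
    """One joint gradient-ascent step on the Rasch log-likelihood.
    outcomes: dict mapping (solver i, problem j) -> 1 if solved, else 0."""
    g_t = [0.0] * len(thetas)
    g_b = [0.0] * len(bs)
    for (i, j), y in outcomes.items():
        r = y - rasch_p(thetas[i], bs[j])  # residual pushes ability up, difficulty down
        g_t[i] += r
        g_b[j] -= r
    thetas = [t + lr * g for t, g in zip(thetas, g_t)]
    bs = [b + lr * g for b, g in zip(bs, g_b)]
    return thetas, bs
```

One observed success nudges the solver's ability estimate up and the problem's difficulty estimate down, which is how solving records and authored-problem difficulty end up on a shared scale.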

[NLP-2] When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

TL;DR: This paper tackles hallucination in large vision-language models (LVLMs), i.e., outputs inconsistent with the visual input. The analysis shows that hallucinations stem largely from over-reliance on textual priors and background knowledge, especially information introduced through textual instructions. The key contribution is HalluVL-DPO, a preference-optimization framework that fine-tunes off-the-shelf LVLMs on a curated dataset to prefer visually grounded responses, mitigating instruction-prior-induced hallucinations while preserving or improving performance on other hallucination and visual-capability benchmarks.

Link: https://arxiv.org/abs/2604.21911
Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
Affiliations: ISIR, Sorbonne Université, Paris, France; Valeo.ai, Paris, France
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at this https URL .

[NLP-3] GiVA: Gradient-Informed Bases for Vector-Based Adaptation AISTATS2026

TL;DR: Vector-based adaptation methods for parameter-efficient fine-tuning (PEFT) are extremely parameter-efficient, but typically need much higher ranks than LoRA (Low-Rank Adaptation) to match its performance, which drives up training cost. The key contribution is GiVA, a gradient-based initialization strategy for vector-based adaptation that keeps the extreme parameter efficiency of these methods while bringing training time close to LoRA's. Across natural language understanding, natural language generation, and image classification benchmarks, GiVA matches or outperforms existing vector-based methods and LoRA while cutting rank requirements by up to 8x.

Link: https://arxiv.org/abs/2604.21901
Authors: Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky, Hancao Li, Vivek Mittal, Lexing Ying, Nickvash Kani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AISTATS 2026

Click to view abstract

Abstract:As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight ( 8\times ).
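The rank-versus-parameters trade-off described above is easy to quantify with a back-of-the-envelope sketch. The comparison below uses a VeRA-style vector adapter (frozen shared random bases, learned scaling vectors) as a representative vector-based method; GiVA's own initialization is not reproduced here, and the 8x figure in the abstract refers to rank reduction, not these exact counts.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in LoRA: delta_W = B @ A with B (d_out x r), A (r x d_in)."""
    return r * (d_in + d_out)

def vector_adapter_params(d_out: int, r: int) -> int:
    """Trainable parameters in a VeRA-style vector adapter: the random bases are
    frozen and shared, so only two scaling vectors (length r and d_out) are learned."""
    return d_out + r

# A 4096x4096 projection: LoRA at rank 16 vs a vector adapter at the
# 8x higher rank 128 that such methods typically need to match LoRA.
print(lora_params(4096, 4096, 16))       # 131072 trainable parameters
print(vector_adapter_params(4096, 128))  # 4224 trainable parameters
```

Even at an 8x higher rank, the vector adapter trains roughly 30x fewer parameters per layer, which is why shrinking the required rank (as GiVA does) translates mainly into training-time savings rather than parameter savings.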

[NLP-4] Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

TL;DR: Traditional analyses of legislative behavior rely heavily on voting records and overlook the semantic and rhetorical content of political speech. The challenge is to quantify parliamentary debate along multiple dimensions: how things are said, what is discussed, and which speakers are discursively similar. The key contribution is a scalable, generalizable computational framework combining diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches, applied to more than 450,000 speeches from the Brazilian Chamber of Deputies (2003-2025). The analysis reveals a long-term shift toward shorter, more direct speeches, an agenda that reorients sharply around national crises, and discursive alignments in which regional and gender identities often matter more than formal party affiliation.

Link: https://arxiv.org/abs/2604.21897
Authors: Flávio Soriano, Victoria F. Mello, Pedro B. Rigueira, Gisele L. Pappa, Wagner Meira Jr., Ana Paula Couto da Silva, Jussara M. Almeida
Affiliations: Universidade Federal de Minas Gerais
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted paper at ICWSM 2026

Click to view abstract

Abstract:Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies’ speeches. We apply this framework to a large-scale case study of the Brazilian Chamber of Deputies, using a corpus of over 450,000 speeches from 2003 to 2025. Our results show a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation. More broadly, this work offers a robust methodology for analyzing parliamentary discourse as a multidimensional phenomenon that complements traditional vote-based approaches.

[NLP-5] EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

TL;DR: Existing event extraction datasets either cover few event types (closed-domain) or lack large, manually verified data (open-domain). The key contribution is EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset, built with a systematic annotation pipeline to ensure quality and consistency. The authors benchmark state-of-the-art pretrained large language models on it and show that models trained on EVENT5Ws generalize well to datasets from different geographical contexts, highlighting its potential for developing generalizable event extraction algorithms.

Link: https://arxiv.org/abs/2604.21890
Authors: Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi
Affiliations: University of Nebraska Omaha, USA; University of Nebraska–Lincoln, USA; The Citadel, USA
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws , a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.

[NLP-6] TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale ACL2026

TL;DR: This paper targets real-time detection and mitigation of technical anomalies in large-scale cloud-native services, where customer incidents are a valuable but extremely noisy, high-throughput, and semantically complex signal. The core of the proposed end-to-end system, TingIS, is a multi-stage event linking engine that combines efficient indexing with large language models (LLMs) to make informed event-merging decisions, stably extracting actionable incidents from just a handful of diverse user descriptions. It is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise-reduction pipeline integrating domain knowledge, statistical patterns, and behavioral filtering. In production, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents.

Link: https://arxiv.org/abs/2604.21889
Authors: Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang
Affiliations: Ant Group; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2026 Industry Track

Click to view abstract

Abstract:Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

[NLP-7] A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

TL;DR: Open-domain event extraction faces two key problems: poor generalization to unseen event types, and weak modeling of document-level contextual, structural, and semantic reasoning, which is especially hard for large language models (LLMs) due to the lost-in-the-middle phenomenon and attention dilution. The key contribution is MODEE, a multimodal open-domain event extraction framework that combines graph-based learning with LLM-derived text representations to explicitly model document-level reasoning, yielding more accurate and transferable event extraction.

Link: https://arxiv.org/abs/2604.21885
Authors: Praval Sharma
Affiliations: University of Nebraska Omaha, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction, MODEE , a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.

[NLP-8] Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms ACL2026

TL;DR: It is unclear whether large language models (LLMs) recall facts through a particular surface form or through the entity itself, because typical entity-based QA queries each entity with a single canonical name, conflating memorization with the access path. The key contribution is RedirectQA, a dataset that uses Wikipedia redirects to link Wikidata factual triples to multiple categorized surface forms per entity (aliases, abbreviations, spelling variants, and common erroneous forms), enabling a systematic study of factual-memorization consistency across surface forms. Across 13 LLMs, prediction outcomes often change when only the surface form changes, in a category-dependent way: models are more robust to minor orthographic variation than to aliases or abbreviations. Both entity- and surface-level frequencies are associated with accuracy, with entity frequency often contributing beyond surface frequency, suggesting factual memorization is neither purely surface-specific nor fully surface-invariant and underscoring the need for surface-form diversity when evaluating non-verbatim memorization.

Link: https://arxiv.org/abs/2604.21882
Authors: Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe
Affiliations: Nara Institute of Science and Technology; Future Corporation
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main

Click to view abstract

Abstract:Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

[NLP-9] Machine Behavior in Relational Moral Dilemmas: Moral Rightness Predicted Human Behavior and Model Decisions ACL

TL;DR: As large language models (LLMs) increasingly serve as decision-support systems, a key question is whether they encode the social sensitivity of human moral judgment, which is modulated by relational closeness and event severity. Using the Whistleblower's Dilemma, the study evaluates models from three perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decisions. The findings show that while models' internal world modeling captures the loyalty shift that emerges as relational closeness increases (descriptive expectations), their final decisions consistently follow fairness-oriented prescriptive norms rather than their own behavioral predictions, revealing that LLM decision-making prioritizes rigid rules over social sensitivity and pointing to a potentially significant alignment risk in deployment.

Link: https://arxiv.org/abs/2604.21871
Authors: Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim, Meeyoung Cha
Affiliations: KAIST; MPI-SP
Subjects: Computation and Language (cs.CL)
Comments: ACL-Findings 2026

Click to view abstract

Abstract:Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower’s Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.

[NLP-10] SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

TL;DR: This shared task addresses how to quantify narrative similarity and learn narrative representations, operationalizing an abstract notion of narrative structure as a classification task compatible with human intuition. The key design is triple-based: given an anchor story and two candidates, a system must decide which candidate is more similar to the anchor. The organizers collected annotations for more than 1,000 story-summary triples, with each label backed by at least two annotators in agreement, and used the data to evaluate both embedding-based approaches (including fine-tuning and pre-/post-processing of pretrained models) and LLM ensembles. LLM ensembles top the triple-classification track, while in the embedding track pretrained embeddings with simple pre- and post-processing perform about on par with custom fine-tuned solutions, leaving clear headroom for automated narrative understanding.

Link: https://arxiv.org/abs/2604.21782
Authors: Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer, Evelyn Gius, Chris Biemann
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced “nass-na-rel”). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.

[NLP-11] Misinformation Span Detection in Videos via Audio Transcripts

TL;DR: Existing approaches to video-based misinformation can only judge whether an entire video contains misinformation, with no fine-grained localization of when the false content occurs or which claims are responsible. The key contribution is two new annotated datasets: each video's audio is transcribed to text, and the specific segments that make the video misinforming (the misinformation spans) are labeled. Classifiers built on state-of-the-art language models, evaluated on more than 500 videos with over 2,400 annotated segments, reach an F1 score of 0.68, making misinformation detection substantially more interpretable and actionable.

Link: https://arxiv.org/abs/2604.21767
Authors: Breno Matos, Rennan C. Lima, Savvas Zannettou, Fabricio Benevenuto, Rodrygo L.T. Santos
Affiliations: Federal University of Minas Gerais
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted at ICWSM 2026

Click to view abstract

Abstract:Online misinformation is one of the most challenging issues lately, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research efforts investigated detecting video-based misinformation, focusing on whether a video shares misinformation or not on a video level. While this approach is useful, it only provides a limited and non-easily interpretable view of the problem given that it does not provide an additional context of when misinformation occurs within videos and what content (i.e., claims) are responsible for the video’s misinformation nature. In this work, we attempt to bridge this research gap by creating two novel datasets that allow us to explore misinformation detection on videos via audio transcripts, focusing on identifying the span of videos that are responsible for the video’s misinformation claim (misinformation span detection). We present two new datasets for this task. We transcribe each video’s audio to text, identifying the video segment in which the misinformation claims appears, resulting in two datasets of more than 500 videos with over 2,400 segments containing annotated fact-checked claims. Then, we employ classifiers built with state-of-the-art language models, and our results show that we can identify in which part of a video there is misinformation with an F1 score of 0.68. We make publicly available our annotated datasets. We also release all transcripts, audio and videos. 
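The span-level F1 of 0.68 reported above is the harmonic mean of precision and recall over predicted misinformation spans. A minimal sketch of the metric; the paper's exact span-matching criterion is not specified in the abstract:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from span-matching counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```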

[NLP-12] AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

TL;DR: Existing audio question answering (AQA) benchmarks are prone to shortcut strategies such as short-duration cues, lexical priors, dataset-specific biases, or even bypassing the audio entirely via metadata and captions, so they fail to test deep audio understanding. The key contribution is AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale benchmark of human-authored trivia questions grounded in real-world audio, with challenging distractors and long-range temporal dependencies that force joint cross-modal reasoning rather than reliance on isolated acoustic or textual cues. The gap is stark: humans average 32.13% accuracy while state-of-the-art models stay below 8.86%, and Item Response Theory (IRT) is used to quantify latent proficiency, question difficulty, and systematic model deficiencies.

Link: https://arxiv.org/abs/2604.21766
Authors: Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, Jordan Lee Boyd-Graber
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies: short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency, question difficulty, and expose systematic deficiencies of the models and data.
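The IRT analysis mentioned above models each response probability with item-level parameters. A minimal two-parameter logistic (2PL) sketch, assuming the common parameterization; the abstract does not state which IRT variant the paper uses:

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: P(correct) for ability theta, item discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Discrimination `a` controls how sharply an item separates strong from weak respondents, which is what lets the analysis expose *which* questions drive the human-model gap rather than just the average accuracy.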

[NLP-13] Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

TL;DR: This paper examines the cultural coverage and competence of large language models (LLMs), in particular their regional preferences and hidden biases on culture-related questions. Prior work has focused on Western-centric tendencies but lacks a systematic analysis of which regions LLMs favor when answering cultural questions. The key contribution is CROQ, a dataset of Culture-Related Open Questions built on a comprehensive cultural taxonomy and used to quantify these tendencies. Key findings: LLMs do not uniformly favor English-speaking countries but instead skew toward Japan; prompting in high-resource languages such as English yields more diverse outputs and a weaker inclination toward countries where the input language is official; and the first clear signs of cultural bias appear after supervised fine-tuning rather than during pre-training, suggesting the bias arises in post-training stages.

Link: https://arxiv.org/abs/2604.21751
Authors: Joseba Fernandez de Landa, Carla Perez-Almendros, Jose Camacho-Collados
Affiliations: HiTZ Center - Ixa, University of the Basque Country EHU; Cardiff University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, there has not been specific work on highlighting LLM regional preferences when it comes to cultural-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ). The results show that, contrary to previous cultural bias work, LLMs show a clear tendency towards countries such as Japan. Moreover, our results show that when prompting in languages such as English or other high-resource ones, LLMs tend to provide more diverse outputs and show less inclination towards answering questions highlighting countries for which the input language is an official language. Finally, we also investigate at which point of LLM training this cultural bias emerges, with our results suggesting that the first clear signs appear after supervised fine-tuning, and not during pre-training.

[NLP-14] AEL: Agent Evolving Learning for Open-Ended Environments

TL;DR: LLM agents operating over long horizons in open-ended environments typically solve each task from scratch, failing to convert past experience into better future behavior. The core difficulty is not what to remember but how to use memory: choosing a retrieval policy, interpreting prior outcomes, and deciding when the current decision strategy itself must change. The key contribution is Agent Evolving Learning (AEL), a two-timescale framework: on the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply per episode; on the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the decision prompt, giving the agent an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark, AEL outperforms published self-improving methods and non-LLM baselines while showing the lowest variance, and a nine-variant ablation reveals a "less is more" pattern: memory plus reflection alone yield a 58% cumulative improvement, while every additional mechanism tested degrades performance, indicating the bottleneck is the agent's self-diagnosis of how to use experience rather than architectural complexity.

Link: https://arxiv.org/abs/2604.21725
Authors: Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas
Affiliations: Rutgers University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not *what* to remember but *how* to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce *Agent Evolving Learning* (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) *degrades* performance. This demonstrates that the bottleneck in agent self-improvement is *self-diagnosing* how to use experience rather than adding architectural complexity. Code and data: this https URL.
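The fast-timescale component can be sketched as a Beta-Bernoulli Thompson Sampling bandit over retrieval policies. This is an illustrative sketch under binary episode rewards, not the paper's code:

```python
import random

class ThompsonPolicyBandit:
    """Beta-Bernoulli Thompson Sampling over k memory retrieval policies."""
    def __init__(self, k: int, seed: int = 0):
        self.alpha = [1.0] * k  # 1 + successes per policy
        self.beta = [1.0] * k   # 1 + failures per policy
        self.rng = random.Random(seed)

    def select(self) -> int:
        """Sample a success rate from each policy's posterior; pick the argmax."""
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        """reward in {0, 1}: did the episode succeed under this policy?"""
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

Because each arm is a whole retrieval policy rather than a single action, the bandit naturally balances exploring under-used policies against exploiting the one that has worked on recent episodes.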

[NLP-15] Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

TL;DR: Large token-indexed lookup tables promise compute-decoupled scaling, but in practice suffer poor parameter efficiency and rapid memory growth, driven by Zipfian under-training of long-tail tokens, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. The key contribution is X-GRAM, a frequency-aware dynamic token-injection framework: hybrid hashing and alias mixing compress the tail while preserving head capacity; a normalized SwiGLU ShortConv extracts diverse local n-gram features; and depth-aware gating integrates these signals into attention value streams and inter-layer residuals, aligning static memory with dynamic context. The result is a memory-centric scaling axis that decouples model capacity from FLOPs.

Link: https://arxiv.org/abs/2604.21724
Authors: Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao, Renbiao Liu, Haoming Luo, Yifan Luo, Zhengmao Ye, Tingwen Liu, Xin Zhao, Ran Tao, Bryan Dai
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peking University; IQuest Research
Subjects: Computation and Language (cs.CL)
Comments: 29 pages, 9 figures, 13 tables

Click to view abstract

Abstract:Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and “slot collapse” that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code available at this https URL.
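The head/tail split behind the hybrid hashing idea can be sketched in a few lines: frequent tokens keep private embedding rows while the long tail shares hashed buckets. This is illustrative only; X-GRAM additionally uses alias mixing and learned gating, which are not reproduced here.

```python
def make_slot_fn(freq_ranked_vocab, head_size, tail_buckets):
    """Hybrid lookup: private slots for the top `head_size` tokens by frequency,
    shared hashed buckets for everything else (the Zipfian long tail)."""
    head = {tok: i for i, tok in enumerate(freq_ranked_vocab[:head_size])}

    def slot(token: str) -> int:
        if token in head:
            return head[token]  # dedicated head row
        # Tail tokens collide into a small shared table, capping memory growth.
        return head_size + hash(token) % tail_buckets

    return slot
```

A real implementation would use a stable seeded hash rather than Python's built-in `hash`, which is salted per process; within a single run the mapping above is deterministic.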

[NLP-16] Building a Precise Video Language with Human-AI Oversight CVPR2026

TL;DR: This work targets the bottleneck of precise video captioning for video-language models (VLMs): producing accurate, detailed, structured, and controllable descriptions that support professional-grade video understanding and generation, including fine-grained cinematography such as camera motion and shot angles. The solution rests on two pillars: (1) a structured captioning specification grounded in hundreds of visual primitives defined with professional video creators such as filmmakers, covering subjects, scenes, motion, spatial relations, and camera dynamics; and (2) CHAI (Critique-based Human-AI Oversight), in which trained experts critique and revise model-generated pre-captions into improved post-captions, yielding efficient, accurate annotation and rich pre/post supervision signals (SFT, DPO, and inference-time scaling) for open-source models such as Qwen3-VL. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro.

Link: https://arxiv.org/abs/2604.21718
Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: CVPR 2026 Highlight. Project page: this https URL

Click to view abstract

Abstract:Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: this https URL

[NLP-17] From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation ACL2026

【速读】: This paper addresses the limitations of existing code-generation bias evaluations, which rely mainly on simple conditional statements. Such tasks cover only a narrow slice of real-world programming and expose only explicitly encoded bias, so they severely underestimate bias risk in practice. The key idea is to study a more realistic task, generating machine learning (ML) pipelines, and to systematically assess whether models implicitly include sensitive attributes during feature selection. The study finds that although models correctly drop irrelevant features (e.g., keeping "race" while discarding "favorite color"), sensitive attributes still appear in 87.7% of generated pipelines, far above the 59.2% observed with conditional statements, and this pattern is robust across prompt-based mitigation strategies, numbers of attributes, and task difficulty levels, revealing a systematic underestimation of bias risk by conventional benchmarks.

链接: https://arxiv.org/abs/2604.21716
作者: Minh Duc Bui,Xenia Heilmann,Mattia Cerrato,Manuel Mager,Katharina von der Wense
机构: Johannes Gutenberg University Mainz, Germany; Universidad Iberoamericana, Ciudad de Mexico; University of Colorado Boulder, USA
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including “race” while dropping “favorite color” for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
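
The paper's headline numbers come from auditing which features generated pipelines keep. As a hedged illustration of that kind of audit, here is a minimal sketch; the sensitive-attribute list and feature names are invented for this example and are not the paper's actual protocol.

```python
# Illustrative sensitive-attribute list; the paper's protocol may differ.
SENSITIVE = {"race", "gender", "age", "religion"}

def sensitive_inclusion_rate(pipelines):
    """Fraction of generated pipelines whose selected features
    include at least one sensitive attribute."""
    hits = sum(
        1 for feats in pipelines
        if any(f.lower() in SENSITIVE for f in feats)
    )
    return hits / len(pipelines)

# Two toy pipelines: one keeps "race", the other only benign features.
rate = sensitive_inclusion_rate(
    [["income", "race", "credit_history"], ["income", "favorite_color"]]
)
```

Running such a check over many generated pipelines yields the kind of inclusion rate (87.7% in the paper) reported above.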

[NLP-18] Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3374 Speakers

【速读】: This paper targets the lack of a universal, training-free method for assessing dysarthria severity, especially the challenges of transferability and robustness across languages and aetiologies (e.g., Parkinson's disease, cerebral palsy, amyotrophic lateral sclerosis). The key idea is to exploit the d-prime separability of phonological feature subspaces in frozen self-supervised speech representations (SSL): by analyzing how well different clinical groups separate in acoustic-phonological feature space, the method characterizes dysarthria phenotypes without any fine-tuning, relying only on the semantic structure of the representations. At the group level it reveals clearly aetiology-specific degradation profiles (effect sizes epsilon-squared > 0.14), the shapes of these profiles are highly consistent across languages (cosine similarity > 0.95), and the approach is insensitive to the choice of SSL backbone (rho > 0.77), demonstrating its feasibility and stability as a training-free, aetiology-aware, cross-lingually applicable framework for dysarthria characterization.

链接: https://arxiv.org/abs/2604.21706
作者: Bernard Muller,Antonio Armando Ortiz Barrañón,LaVonne Roberts
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to Computer Speech Language

点击查看摘要

Abstract:We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson’s disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14, Holm-corrected p < 0.001), with Parkinson’s disease separable from the articulatory execution group at Cohen’s d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.
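
The separability statistic at the core of the method is the two-sample d-prime. A minimal sketch operating on scalar projections (the actual extraction of phonological subspaces from frozen SSL features is omitted):

```python
import math

def d_prime(xs, ys):
    """Two-sample d-prime: mean separation scaled by the root
    of the average of the two sample variances."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return abs(mx - my) / math.sqrt((vx + vy) / 2.0)
```

Larger d-prime means the two groups (e.g., a clinical group vs. controls along one phonological feature) are easier to tell apart.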

[NLP-19] Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

【速读】: This paper addresses backdoor attacks against large language models (LLMs) in safety-critical settings, targeting three core shortcomings of existing methods: explicit trigger patterns that break naturalness, unreliable injection of attacker-specified content in long-form generation, and incompletely specified threat models that obscure how attacks actually work. The proposed framework, BadStyle, has three key components: it uses an LLM itself as a poisoned-sample generator to construct natural, stealthy samples carrying imperceptible style-level triggers; it introduces an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs while penalizing its appearance for benign inputs, substantially stabilizing backdoor activation; and it systematically evaluates the attack under a realistic threat model covering both prompt-induced and parameter-efficient fine-tuning (PEFT) injection strategies. Experiments show that BadStyle achieves high attack success rates (ASRs) with strong stealthiness across multiple mainstream LLMs and reliably evades representative input- and output-level defenses.

链接: https://arxiv.org/abs/2604.21700
作者: Jiali Wei,Ming Fan,Guoheng Sun,Xicheng Zhang,Haijun Wang,Ting Liu
机构: Xi’an Jiaotong University (西安交通大学); Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University (教育部智能网络与网络安全重点实验室,西安交通大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.

[NLP-20] Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

【速读】: This paper asks how to extract more discriminative features from eye-tracking data to improve dyslexia detection. The core challenge is that traditional statistical features fail to fully capture the complex structure of fixation sequences. The key idea is to treat fixation sequences as time series and analyze them topologically via novel persistent-homology filtrations, then build "hybrid models" that combine topological features with traditional statistical ones. Experiments on the Copenhagen Corpus show that the topological features capture complementary information: the hybrid models outperform approaches relying solely on traditional features, achieve performance comparable to established baselines, and the proposed filtrations outperform existing ones.

链接: https://arxiv.org/abs/2604.21698
作者: Marius Huber,David R. Reich,Lena A. Jäger
机构: University of Zürich (苏黎世大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Algebraic Topology (math.AT)
备注: ETRA 2026

点击查看摘要

Abstract:Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a "filtration"). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing "hybrid models" that combine topological features with traditional statistical features. We empirically evaluate our method by applying it to the task of dyslexia detection from eye-tracking-while-reading data using the Copenhagen Corpus, which contains scanpaths from dyslexic and non-dyslexic L1 and L2 readers. Our hybrid models outperform existing approaches that rely solely on traditional features, showing that persistent homology captures complementary information encoded in fixation sequences. The strength of these topological features is further underscored by their achieving performance comparable to established baseline methods. Importantly, our proposed filtrations outperform existing ones.
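
For intuition about filtrations on time series: below is the textbook 0-dimensional sublevel-set persistence of a 1-D series, where each local minimum is "born" and "dies" when its valley merges into a deeper one. This is only the baseline construction; the paper's novel filtrations go beyond it.

```python
def sublevel_persistence(series):
    """0-dim persistence pairs (birth, death) of the sublevel-set
    filtration of a 1-D series; the global minimum pairs with +inf."""
    n = len(series)
    parent = list(range(n))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs, comp_min, active = [], {}, [False] * n
    for i in sorted(range(n), key=lambda k: series[k]):
        active[i] = True
        comp_min[i] = series[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and active[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # the component with the higher (younger) minimum dies here
                if comp_min[ri] > comp_min[rj]:
                    ri, rj = rj, ri
                if comp_min[rj] < series[i]:  # skip zero-persistence pairs
                    pairs.append((comp_min[rj], series[i]))
                parent[rj] = ri
    pairs.append((min(series), float("inf")))
    return sorted(pairs)
```

For the toy series [1, 3, 0, 2], the local minimum at value 1 merges into the deeper valley at the peak value 3, giving pairs (1, 3) and (0, inf). Long-lived pairs correspond to prominent valleys in the fixation-derived signal.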

[NLP-21] Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

【速读】: This paper addresses the modeling of inter-annotator disagreement, i.e., how to capture and express fine-grained differences between individual annotators' perspectives in tasks such as natural language inference (NLI). Conventional approaches focus only on label prediction and ignore the individual cognitive differences encoded in annotator-provided rationales. The key contribution is a framework that jointly models annotator-specific label prediction and the generation of corresponding explanations, conditioning the model on annotator identity and demographic metadata via a representation-level "User Passport" mechanism. Two explainer architectures are proposed: a post-hoc prompt-based explainer, and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model, enabling explanations aligned with individual annotator perspectives. Experiments show that incorporating explanation modeling substantially improves predictive performance, with the two strategies trading off semantic consistency against lexical similarity, indicating that treating explanations as expressions of fine-grained perspective yields a richer and more faithful representation of disagreement.

链接: https://arxiv.org/abs/2604.21667
作者: Olufunke O. Sarumi,Charles Welch,Daniel Braun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at 5th NLPerspectives Workshop

点击查看摘要

Abstract:Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators’ provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.

[NLP-22] GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion ACL2026

【速读】: This paper tackles the modality gap between continuous graph embeddings and discrete LLM tokens that arises when large language models (Large Language Models, LLMs) are applied to knowledge graph completion (KGC). Existing quantization-based methods treat quantization as mere numerical compression, producing semantically entangled codes that fail to mirror the hierarchical structure of human reasoning. The key innovations of the proposed GS-Quant framework are twofold: (1) a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, so that earlier codes capture global semantic categories while later codes refine specific attributes; and (2) a Generative Structural Reconstruction module that imposes causal dependencies on the code sequence, turning independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned discrete codes, the model can reason over graph structures isomorphically to natural-language generation.

链接: https://arxiv.org/abs/2604.21649
作者: Qizhuo Xie,Yunhui Liu,Yu Xing,Qianzi Hou,Xudong Jin,Tao Zheng,Tieke He
机构: State Key Laboratory for Novel Software Technology, Nanjing University, China
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at this https URL.
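
As background for the coarse-to-fine intuition, residual quantization is a common way to get codes where each level explains what earlier levels left unexplained, so early codes carry coarse structure and later codes refine it. The sketch below illustrates only that generic mechanism; the codebooks and vectors are invented, and GS-Quant's semantic-enhancement and structural-reconstruction modules are not modeled here.

```python
def residual_quantize(vec, codebooks):
    """Coarse-to-fine residual quantization: each level picks the
    nearest codeword for the residual left by previous levels."""
    codes, residual = [], list(vec)
    for book in codebooks:
        idx = min(
            range(len(book)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])),
        )
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

books = [
    [[0.0, 0.0], [10.0, 10.0]],   # coarse level: global region
    [[0.0, 0.0], [1.0, -1.0]],    # fine level: local refinement
]
codes_demo = residual_quantize([11.0, 9.2], books)
```

Here the first code places the vector in the coarse region near [10, 10], and the second refines the leftover offset, mirroring the "earlier codes = categories, later codes = attributes" reading above.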

[NLP-23] Multilinguality at the Edge: Developing Language Models for the Global South

【速读】: This paper addresses the difficulty of deploying language models (LMs) in non-English-speaking, hardware-constrained communities of the Global South: the "last mile" challenge, which arises from the competing technical requirements of multilinguality and edge deployment. The key contribution is a systematic synthesis of multilingual NLP and edge-deployment research: by surveying 232 relevant papers across the pipeline from data collection to development and deployment, the authors identify the technical bottlenecks along the full chain and offer actionable recommendations for different stakeholders, aiming to advance more inclusive and equitable language technologies.

链接: https://arxiv.org/abs/2604.21637
作者: Lester James V. Miranda,Songbo Hu,Roi Reichart,Anna Korhonen
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed. To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. We also discuss open questions and provide actionable recommendations for different stakeholders in the NLP ecosystem. Finally, we hope that this work contributes to the development of inclusive and equitable language technologies.

[NLP-24] Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

【速读】: This paper studies how to improve LLM reasoning without gradient updates, i.e., without training data or backpropagation, purely at inference time. Its key contribution is a new axis of inference-time scaling, the granularity of external verbal supervision, realized through Verbal Process Supervision (VPS): a training-free framework in which structured natural-language critiques from a stronger supervisor drive an iterative generate-critique-refine loop up to a round budget R. Experiments show that VPS substantially outperforms existing approaches based on chain depth, sample breadth, or learned step scorers (PRMs) across several benchmarks, establishing critique granularity as a new core axis of inference-time scaling.

链接: https://arxiv.org/abs/2604.21611
作者: Hao-Yuan Chen
机构: Mindify AI Research; University of London
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
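
The generate-critique-refine loop at the heart of VPS can be sketched in a few lines. The function names and call interfaces below are assumptions for illustration, not the paper's API; `toy_actor` and `toy_supervisor` are deterministic stand-ins for LLM calls.

```python
def verbal_process_supervision(task, actor, supervisor, rounds=4):
    """Training-free loop: the actor answers, a stronger supervisor
    critiques in natural language, and the actor refines, up to a
    round budget R (here `rounds`)."""
    answer = actor(task, None)
    for _ in range(rounds):
        ok, critique = supervisor(task, answer)
        if ok:
            break
        answer = actor(task, critique)
    return answer

# Toy stand-ins: the actor answers wrongly at first, then follows the critique.
def toy_actor(task, critique):
    return "4" if critique else "5"

def toy_supervisor(task, answer):
    return answer == "4", "re-check the arithmetic"

result = verbal_process_supervision("2+2", toy_actor, toy_supervisor)
```

The round budget R bounds compute, and the quality of the supervisor's critique (its precision, recall, and constructiveness, per the abstract) governs how much each round helps.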

[NLP-25] Language as a Latent Variable for Reasoning Optimization

【速读】: This paper examines the limits on LLM reasoning in multilingual settings, in particular the observation that constraining a model to a specific language can hurt reasoning, sometimes leaving it worse than non-English outputs. The authors find that language is not merely an output medium but a latent variable that structurally modulates the model's internal inference pathways, so English-only training may restrict the reasoning space. The key contribution is polyGRPO (Polyglot Group Relative Policy Optimization), a reinforcement-learning (RL) framework that treats linguistic diversity as an implicit exploration signal: it generates polyglot preference data online under language-constrained and unconstrained conditions and jointly optimizes answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought (CoT) annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on English reasoning test sets and by 6.89% on multilingual benchmarks, and is the only method to surpass the base model on English commonsense reasoning (+4.9%), confirming that treating language as a latent variable expands the model's implicit reasoning space and generalizes across tasks.

链接: https://arxiv.org/abs/2604.21593
作者: Linjuan Wu,Haoran Wei,Jialong Tang,Shuang Luo,Baosong Yang,Yongliang Shen,Weiming Lu
机构: Tongyi Lab(通义实验室)
类目: Computation and Language (cs.CL)
备注: 17 pages, 7 figures, Under Reviewing

点击查看摘要

Abstract:As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model’s internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model’s latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on their multilingual counterparts. Remarkably, it is the only method that surpasses the base LLM on the English commonsense reasoning task (+4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model’s latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.
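
polyGRPO builds on GRPO's group-relative advantage, where each sampled response (here, responses drawn in different languages for the same problem) is scored against its own group's reward statistics. A minimal sketch of just that normalization; the reasoning-structure reward and the actual policy update are omitted.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by its group's mean and
    standard deviation, as in group-relative policy optimization."""
    m = sum(rewards) / len(rewards)
    sd = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - m) / (sd + eps) for r in rewards]
```

With binary correctness rewards, responses in languages that solved the problem get positive advantages and the rest negative, which is how language variation can act as an exploration signal.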

[NLP-26] AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

【速读】: This paper responds to modern industry's need for small agentic language models capable of multi-step reasoning and tool use under strict cost and latency constraints. The key contribution is a multi-round reinforcement learning (RL) training framework that combines reasoning RL with agentic RL and introduces dual data flywheels: the reasoning flywheel automatically raises task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world scenarios. This approach markedly improves small models both on public benchmarks and in an industrial agent system, bringing them close to much larger models on search and data-analysis tasks.

链接: https://arxiv.org/abs/2604.21590
作者: Yuanjie Lyu,Chengyu Wang,Haonan Zheng,Yuanhao Yue,Junbing Yan,Ming Wang,Jun Huang
机构: Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: this https URL. Data synthesis and RL training code: this https URL. The data synthesis pipeline is also integrated into EasyDistill: this https URL.

[NLP-27] Measuring Opinion Bias and Sycophancy via LLM-based Coercion

【速读】: This paper addresses the difficulty of accurately identifying the implicit positions of large language models (LLMs) on contested topics. Because current models often answer evasively or adjust their stance to match the user's, their true attitudes stay hidden and can propagate systematic bias into the information people consume. The key contribution is llm-bias-bench, an open-source method built on two complementary free-form probes: direct probing, which simulates a user applying escalating pressure to elicit an explicit stance, and indirect probing, which never asks for an opinion but engages the model in argumentative debate, letting bias surface through how it concedes, resists, or counter-argues. Combined with a nine-way behavioral classification and an auditable LLM judge, the method enables multi-faceted, evidence-backed identification of a model's positions.

链接: https://arxiv.org/abs/2604.21564
作者: Rodrigo Nogueira,Giovana Kerche Bonás,Thales Sales Almeida,Andrea Roque,Ramon Pires,Hugo Abonizio,Thiago Laitz,Celio Larcher,Roseval Malaquias Junior,Marcos Piau
机构: Maritaca AI; JusBrasil
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users’ decisions. Eliciting a model’s positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model’s opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.

[NLP-28] Finding Meaning in Embeddings: Concept Separation Curves

【速读】: This paper addresses the uncertainty introduced when sentence-embedding quality is evaluated through additional classifiers or downstream tasks, which makes it unclear whether good results stem from the embedding itself or from classifier behavior. The key idea is a classifier-independent evaluation: systematically introduce syntactic noise and semantic negations into sentences, quantify their relative effects on the resulting embedding vectors, and use Concept Separation Curves to visualize a model's ability to distinguish conceptual from surface-level variation, yielding an objective, interpretable, cross-model assessment of the conceptual stability of sentence embeddings.

链接: https://arxiv.org/abs/2604.21555
作者: Paul Keuren,Marc Ponsen,Robert Ayoub Bagheri
机构: 未知
类目: Computation and Language (cs.CL)
备注: The code is open source and located on github at this https URL . Original conference paper

点击查看摘要

Abstract:Sentence embedding techniques aim to encode key concepts of a sentence’s meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier’s behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model’s performance. The approach adopted in this study involves the systematic introduction of syntactic noise and semantic negations into sentences, with the subsequent quantification of their relative effects on the resulting embeddings. The visualisation of these effects is facilitated by Concept Separation Curves, which show the model’s capacity to differentiate between conceptual and surface-level variations. By leveraging data from multiple domains, employing both Dutch and English languages, and examining sentence lengths, this study offers a compelling demonstration that Concept Separation Curves provide an interpretable, reproducible, and cross-model approach for evaluating the conceptual stability of sentence embeddings.
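
The quantity behind such a curve can be illustrated with plain cosine similarity: a conceptually stable embedding keeps a syntactically noised variant close to the original while pushing a negated version away. The score below is a simplified reading of that idea, not necessarily the authors' exact metric, and the toy vectors are invented.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def concept_separation(emb_orig, emb_noised, emb_negated):
    """Positive when syntactic noise moves the embedding
    less than semantic negation does."""
    return cosine(emb_orig, emb_noised) - cosine(emb_orig, emb_negated)

score = concept_separation([1.0, 0.0], [0.9, 0.1], [-1.0, 0.2])
```

Sweeping the perturbation strength and plotting this separation would trace out a curve in the spirit of the paper's Concept Separation Curves.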

[NLP-29] UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text ACL2026 SEMEVAL2026

【速读】: This paper addresses the problem of jointly modeling current affect and short-term affective change in chronologically ordered user-generated texts. The solution combines three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings; (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured modeling of affective transitions; and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. The key finding is that LLMs are effective at extracting static affective signals, whereas short-term affective variation depends more on the recent numeric state trajectory than on textual semantics, an insight that led the system to rank first in both subtasks of SemEval-2026 Task 2.

链接: https://arxiv.org/abs/2604.21534
作者: Darya Hryhoryeva,Amaia Zurinaga,Hamidreza Jamalabadi,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE, Germany; Psychiatric Control Systems Lab, Marburg University
类目: Computation and Language (cs.CL)
备注: Accepted to SemEval 2026 (co-located with ACL 2026)

点击查看摘要

Abstract:This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.

[NLP-30] Job Skill Extraction via LLM-Centric Multi-Module Framework

【速读】: This paper addresses the failure modes of generative large language models (LLMs) in span-level skill extraction from job advertisements: boundary drift, malformed spans, and hallucinations, especially on long-tail terms and under cross-domain shift. The key contribution is the SRICL framework, which combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT), and adds a deterministic verifier that enforces constraints such as pairing, non-overlap, and BIO-tag legality, markedly improving output validity and stability and enabling dependable deployment in low-resource, multi-domain settings.

链接: https://arxiv.org/abs/2604.21525
作者: Guojing Li(1 and 2),Zichuan Fu(1),Junyi Li(1),Faxue Liu(1),Wenxia Zhou(2),Yejing Wang(1),Jingtong Gao(1),Maolin Wang(1),Rungen Liu(1),Wenlin Zhang(1),Xiangyu Zhao(1) ((1) City University of Hong Kong, (2) Renmin University of China)
机构: City University of Hong Kong (香港城市大学); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL)
备注: 5 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
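
One of the verifier's deterministic checks, BIO legality, is simple to state: an I-X tag may only continue a B-X or I-X tag of the same type. A minimal sketch of just that rule (the tag names are hypothetical, and the paper's verifier additionally enforces span pairing and non-overlap):

```python
def is_legal_bio(tags):
    """True iff the sequence is well-formed BIO: every I-X
    follows a B-X or I-X of the same entity type."""
    prev = "O"
    for t in tags:
        if t.startswith("I-"):
            if prev not in ("B-" + t[2:], t):
                return False
        elif t != "O" and not t.startswith("B-"):
            return False  # unknown tag format
        prev = t
    return True
```

A generation that fails this check would be retried (with a small retry budget) rather than emitted, which is how the verifier suppresses malformed spans.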

[NLP-31] Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

【速读】: 该论文旨在解决当前用于评估图像-文本(I2T)和文本-图像(T2I)生成模型性能的视觉语言模型(VLMs)在可靠性方面存在的显著盲区问题。研究发现,现有VLM评估器在面对特定类型的输出质量退化时表现不佳,例如对象幻觉、空间推理错误、事实一致性偏差及视觉保真度下降等关键维度。其解决方案的关键在于设计并引入针对这些误差维度的定向扰动(targeted perturbations),构建一个包含4000余条扰动实例、覆盖40个扰动维度的综合性基准测试集,并采用单答案评分、成对比较和参考引导三种评估范式对4种主流VLM进行系统性评测。结果揭示了当前Evaluator VLMs存在高失败率(某些情况下超过50%),尤其难以识别细粒度的组合与空间错误以及与输入图像矛盾的幻觉内容,从而凸显出当前VLM评估机制的不可靠性,为后续模型开发与基准测试提供了重要警示。

链接: https://arxiv.org/abs/2604.21523
作者: Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra
机构: Nilekani Centre at AI4Bharat; Indian Institute of Technology Madras; BITS Pilani, Hyderabad
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains under explored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs - in some cases exceeding 50%, struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.

[NLP-32] OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂优化任务中表现不佳的问题,尤其是现有基准测试过于集中于数学规划和组合优化领域,无法全面评估LLMs在如随机优化(Stochastic Optimization)、动态优化(Dynamic Optimization)、博弈优化(Game Optimization)和最优控制(Optimal Control)等更广泛领域的推理与建模能力。为应对这一挑战,作者提出了OptiVerse,一个包含1000个精心设计问题的综合性基准,覆盖多个优化子领域并划分难易等级。关键解决方案在于提出了一种双视角审计代理(Dual-View Auditor Agent),通过识别并修正建模逻辑错误,在不显著增加计算时间的前提下提升LLMs在优化问题建模中的准确性,从而有效缓解性能瓶颈。

链接: https://arxiv.org/abs/2604.21510
作者: Xinyu Zhang,Boxuan Zhang,Yuchen Wan,Lingling Zhang,YiXing Yao,Bifan Wei,Yaqiang Wu,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China (教育部智能网络与网络安全重点实验室, 中国); Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China (陕西省大数据知识工程重点实验室, 中国); Lenovo Research (联想研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.

[NLP-33] How English Print Media Frames Human-Elephant Conflicts in India

【速读】: 该论文旨在解决印度媒体对人象冲突(Human-Elephant Conflict, HEC)报道中存在负面叙事倾向的问题,此类报道可能影响公众对大象的态度及野生动物保护政策的制定。其解决方案的关键在于提出并实施一种基于多模型情感分析框架的方法,该框架融合了长上下文Transformer模型、大语言模型(Large Language Models, LLMs)与领域特定的“负面大象形象词典”(Negative Elephant Portrayal Lexicon),从而实现对新闻文本中情感倾向的量化、理由句提取及语言模式识别,揭示出媒体中恐惧诱导和攻击性语言占主导地位的现象,为负责任的野生动物媒体报道提供了可复用、透明且可扩展的技术路径。

链接: https://arxiv.org/abs/2604.21496
作者: Bonala Sai Punith,Salveru Jayati,Garima Shakya,Shubham Kumar Nigam
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

[NLP-34] Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning ACL

【速读】: 该论文旨在解决专家领域表格中数值推理模型在域内表现良好但跨域迁移能力差的问题,尤其是监督微调(SFT)模型容易依赖表头词汇捷径而非结构化推理。解决方案的关键在于提出一种持续预训练框架TaNOS,其核心包括:(i) 表头匿名化以减少词汇记忆,(ii) 操作草图提供最小结构提示,(iii) 自监督预训练通过程序优先方式构建保证正确性的程序-问题对。该方法通过解耦领域语义与数值操作结构,显著提升数值推理的泛化能力,在FinQA上仅用10%训练数据即达到80.13%执行准确率,且跨域性能差距几乎可忽略(2个百分点),优于全量数据SFT基线(73.97%)及商用模型如GPT-5和Gemini-2.5-Pro。

链接: https://arxiv.org/abs/2604.21495
作者: Hanjun Cho,Gahyun Yoo,Hanseong Kim,Jay-Yoon Lee
机构: Seoul National University (首尔国立大学); Soongsil University (中央大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to TACL. This is a pre-MIT Press publication version

点击查看摘要

Abstract:Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner. By decoupling domain semantics and numerical operation structure, TaNOS improves the transferability of numerical reasoning. Applied to an 8B instruction-tuned model, TaNOS achieves 80.13% execution accuracy on FinQA with only 10% train data, outperforming SFT baseline (73.97%) with full train data and proprietary models such as GPT-5, Gemini-2.5-Pro. Furthermore, in the domain-shift experiments, TaNOS displays nearly-negligible cross-domain gap (2pp) when standard SFT shows over 10pp gap. These results suggest that structural guidance with operation sketches, header-agnostic representations, and correctness-guaranteed self-supervision can improve the robustness of numerical reasoning across diverse expert-domain tables.
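摘要中的"表头匿名化"思路可以用一个极简示意说明:用中性占位符替换表头,使模型无法依赖表头词汇捷径。以下 Python 片段仅为说明性假设,并非 TaNOS 的实际实现:

```python
def anonymize_headers(table):
    """Replace the header row of a table (list of rows) with neutral
    placeholders col_0, col_1, ..., returning the anonymized table and
    the header mapping. The exact scheme used by TaNOS is not specified
    in the abstract; this only illustrates the idea."""
    headers = table[0]
    mapping = {h: f"col_{i}" for i, h in enumerate(headers)}
    anonymized = [[mapping[h] for h in headers]] + [row[:] for row in table[1:]]
    return anonymized, mapping
```

保留映射表意味着最终答案仍可还原到原始列名,而模型在推理阶段只看到与领域无关的占位符。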

[NLP-35] Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

【速读】: 该论文旨在解决在多语言文本到语音(Text to Speech, TTS)模型评估中,由于语言多样性及语音感知的多维特性导致的成对评估方差过高的问题。其解决方案的关键在于提出了一种受控的多维成对评估框架,通过结合语言控制与基于感知的标注机制,在10种印地语系语言中使用超过5000个本地化及代码混杂句子对7个前沿TTS系统进行评估,并收集了来自1900多名母语评分者的逾12万次成对比较。该框架不仅支持整体偏好排序,还量化了6个感知维度(可懂性、表现力、音质、生动性、噪声和幻觉)上的评分,利用Bradley-Terry建模构建多语言排行榜,并借助SHAP分析解释人类偏好,从而实现对模型性能与权衡关系的深入理解。

链接: https://arxiv.org/abs/2604.21481
作者: Srija Anand,Ashwin Sankar,Ishvinder Sethi,Aaditya Pareek,Kartik Rajput,Gaurav Yadav,Nikhil Narasimhan,Adish Pandya,Deepon Halder,Mohammed Safi Ur Rahman Khan,Praveen S V,Shobhit Banga,Mitesh M Khapra
机构: Indian Institute of Technology, Madras(印度理工学院马德拉斯分校); AI4Bharat; Josh Talks
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text to Speech(TTS) introduces high variance due to linguistic diversity and multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
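摘要采用 Bradley-Terry 模型从成对比较中估计各 TTS 系统的相对强度。下面给出该模型经典 MM 迭代(Hunter, 2004)的一个自包含 Python 示意,仅说明原理,与论文实际实现无关:

```python
def bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.
    wins[i][j] = number of times item i beat item j.
    Uses the classic MM update; strengths are normalized each step."""
    p = [1.0] * n_items
    for _ in range(iters):
        new = []
        for i in range(n_items):
            w_i = sum(wins[i])  # total wins of item i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x * n_items / s for x in new]
    return p
```

估计出的强度 p_i 可直接用于排行榜排序;模型假设 P(i 胜 j) = p_i / (p_i + p_j)。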

[NLP-36] Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

【速读】: 该论文旨在解决监管合规检测中因法律文本复杂性和多样性导致的模型跨领域泛化能力差的问题,即在一种法规上训练的模型难以有效应用于其他法规。其解决方案的关键在于通过数据选择策略来缓解负迁移(negative transfer),具体采用自然语言推理(NLI)框架下对大规模源域数据进行有针对性的子集筛选,评估了随机采样、Moore-Lewis交叉熵差异、重要性加权和基于嵌入的检索四种方法,并系统分析不同数据比例对跨域适应效果的影响,结果表明目标导向的数据选择能显著降低负迁移,为实现异构法规下的可扩展、可靠合规自动化提供了实用路径。

链接: https://arxiv.org/abs/2604.21469
作者: Fariz Ikhwantri,Dusica Marijan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 5 figures, 4 tables. 11th Special Session on Intelligent Data Mining, 2025 IEEE International Conference on Big Data

点击查看摘要

Abstract:Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis’s cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.
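摘要比较的四种选择方法中,Moore-Lewis 交叉熵差异法按 H_in(s) − H_gen(s) 打分并优先选取得分最低(最像域内)的句子。下面用一元语言模型(加一平滑)给出一个自包含示意;真实系统通常使用 n-gram 或神经语言模型,此处仅为原理演示:

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Build a unigram model with add-one smoothing over a list of sentences."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def moore_lewis_rank(pool, in_domain, general):
    """Rank candidate sentences by the cross-entropy difference
    H_in(s) - H_gen(s); lower scores look more in-domain."""
    p_in, p_gen = unigram_lm(in_domain), unigram_lm(general)
    def score(sent):
        words = sent.split()
        h_in = -sum(math.log(p_in(w)) for w in words) / len(words)
        h_gen = -sum(math.log(p_gen(w)) for w in words) / len(words)
        return h_in - h_gen
    return sorted(pool, key=score)
```

对排序结果取前 k% 即得到不同比例的增广数据,对应摘要中"系统性变化所选数据比例"的实验设置。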

[NLP-37] Reasoning Primitives in Hybrid and Non-Hybrid LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中推理能力被视作单一模块化能力的问题,探究其性能提升是否源于更基础的计算原语,如记忆召回(recall)与状态追踪(state-tracking)。研究发现,混合架构(hybrid architecture)——即结合基于注意力机制的检索与循环状态更新的结构——相较于纯注意力模型,在同时需要这两种原语的任务中表现更优。解决方案的关键在于:通过引入显式推理机制(reasoning augmentation)扩展模型的有效操作范围,同时依赖架构中的归纳偏置(inductive bias)支持持久状态传播,从而在复杂任务中保持更强的鲁棒性,尤其当序列依赖性增强时,混合模型相较纯Transformer模型展现出显著优势。

链接: https://arxiv.org/abs/2604.21454
作者: Shivam Rawat,Lucie Flek,Florian Mai,Nicholas Kluge Corrêa
机构: University of Bonn (波恩大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, state-based recall. Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model’s effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.

[NLP-38] Decoupled DiLoCo for Resilient Distributed Pre-training

【速读】: 该论文旨在解决大规模语言模型预训练中因单程序多数据(SPMD)范式导致的同步瓶颈问题,即加速器间紧密耦合引发的瞬时延迟、硬件故障和同步开销会阻塞整个计算流程,造成显著的算力浪费。其解决方案的关键在于提出解耦式DiLoCo(Decoupled DiLoCo)框架,通过将计算任务分配给多个独立的“学习者”(learners),使其执行本地内优化步骤,并异步向中心同步器传输参数片段;同步器采用最小多数派(minimum quorum)、自适应宽限期(adaptive grace window)和动态令牌加权合并机制,绕过失败或拖慢的学习者,从而打破锁步同步限制,显著提升训练吞吐量(goodput)。该方法在模拟数百万芯片的故障环境中实现了严格零全局停机,同时保持了文本与视觉任务上密集模型和专家混合(Mixture-of-Experts, MoE)架构的竞争力性能。

链接: https://arxiv.org/abs/2604.21428
作者: Arthur Douillard,Keith Rush,Yani Donchev,Zachary Charles,Nova Fallen,Ayush Dubey,Ionel Gog,Josef Dean,Blake Woodworth,Zachary Garrett,Nate Keating,Jenny Bishop,Henry Prior,Edouard Yvinec,Arthur Szlam,Marc’Aurelio Ranzato,Jeff Dean
机构: Google DeepMind(谷歌深度思维); Google Research(谷歌研究)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-expert architectures.
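摘要描述的同步器依靠最小多数派、自适应宽限期与按 token 加权合并来绕过失败学习者。下面是"达到法定人数后按 token 数加权合并一个参数片段"这一步的极简 Python 示意(数据结构与函数名均为假设,且省略了宽限期逻辑):

```python
def merge_fragment(updates, quorum):
    """Token-weighted merge of one parameter fragment.
    updates: list of (tokens_processed, param_vector) from the learners
    that reported in time; below quorum, skip the merge entirely."""
    if len(updates) < quorum:
        return None  # not enough learners; keep the previous parameters
    total = sum(t for t, _ in updates)
    merged = [0.0] * len(updates[0][1])
    for tokens, vec in updates:
        w = tokens / total  # learners that processed more tokens weigh more
        for i, v in enumerate(vec):
            merged[i] += w * v
    return merged
```

关键在于合并只等待法定人数而非全体学习者,因此个别掉队或失败的副本不会阻塞全局训练。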

[NLP-39] Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

【速读】: 该论文旨在解决临床文本去标识化(de-identification)中隐私保护与数据效用之间的权衡问题,尤其是在遵守GDPR和HIPAA等法规的前提下实现高效、自动化的患者隐私保护。其解决方案的关键在于对比分析差分隐私(Differential Privacy, DP)、命名实体识别(Named Entity Recognition, NER)与大语言模型(Large Language Models, LLMs)三类方法,并提出混合策略——即在DP处理前使用NER或LLM进行预处理,以提升隐私保障的同时显著改善文本的语义完整性与下游任务性能(如实体和关系分类)。研究发现,单纯依赖DP会严重损害数据效用,而结合LLM预处理可有效优化隐私-效用平衡。

链接: https://arxiv.org/abs/2604.21421
作者: Michele Miranda,Xinlan Yan,Nishant Mishra,Rachel Murphy,Ameen Abu-Hanna,Sébastien Bratières,Iacer Calixto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employed named entity recognition (NER) to identify protected entities for redaction. Although methods based on differential privacy (DP) provide formal privacy guarantees, more recently also large language models (LLMs) are increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.

[NLP-40] Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation ACL2026

【速读】: 该论文旨在解决软件性能需求(performance requirements)在自然语言中表述时所面临的模糊性和认知不确定性问题,这些问题导致其自动化量化成为一项未被充分解决的挑战。解决方案的关键在于提出IRAP方法,该方法通过交互式检索增强型偏好获取(interactive retrieval-augmented preference elicitation),将性能需求形式化为数学函数;其核心创新在于显式利用特定问题领域的知识进行偏好检索与推理,从而指导与利益相关者的渐进式交互,并显著降低认知负担,实验表明IRAP在四个真实数据集上相较10种先进方法均表现出优越性,最多可实现40倍的性能提升,且仅需五轮交互即可达成。

链接: https://arxiv.org/abs/2604.21380
作者: Wang Shi Hai,Chen Tao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, accepted by ACL 2026

点击查看摘要

Abstract:Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet, the vagueness in performance requirements and uncertainty of human cognition have caused highly uncertain ambiguity in the interpretations, rendering their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from the others in that it explicitly derives from problem-specific knowledge to retrieve and reason the preferences, which also guides the progressive interaction with stakeholders, while reducing the cognitive overhead. Experiment results against 10 state-of-the-art methods on four real-world datasets demonstrate the superiority of IRAP on all cases with up to 40x improvements under as few as five rounds of interactions.

[NLP-41] VLAA-GUI: Knowing When to Stop, Recover, and Search: A Modular Framework for GUI Automation

【速读】: 该论文旨在解决自主图形用户界面(GUI)智能体面临的两大核心问题:早期停止(early stopping),即智能体在缺乏可验证证据的情况下提前宣告成功;以及重复循环(repetitive loops),即智能体在无法完成任务时反复执行相同失败动作而无法恢复。解决方案的关键在于提出VLAA-GUI框架,其核心由三个强制性模块构成:完整性验证器(Completeness Verifier),通过UI可观测的成功标准和跨决策规则的交叉验证来杜绝无证据的成功声明;循环打破器(Loop Breaker),通过多级过滤机制(如交互模式切换、策略变更触发及反射信号绑定)有效防止无效循环;以及按需搜索代理(Search Agent),利用具备检索能力的大语言模型(LLM)在线获取陌生工作流的文本结果。这三个模块协同作用,在多个主流大模型上显著提升任务成功率并减少无效步骤,尤其对易陷入循环的模型效果突出。

链接: https://arxiv.org/abs/2604.21375
作者: Qijun Han,Haoqin Tu,Zijun Wang,Haoyue Dai,Yiyang Zhou,Nancy Lau,Alvaro A. Cardenas,Yuhui Xu,Ran Xu,Caiming Xiong,Zeyu Zheng,Huaxiu Yao,Yuyin Zhou,Cihang Xie
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: The first two authors contribute equally

点击查看摘要

Abstract:Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step – with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
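摘要中的 Loop Breaker 依据屏幕状态与动作的重复出现触发策略切换。以下是该思路的一个极简示意(滑动窗口计数;窗口大小与阈值均为示例值,并非论文设定):

```python
from collections import deque

class LoopBreaker:
    """Sliding-window loop detector: force a strategy change when the
    same (screen_hash, action) pair recurs too often in recent steps.
    Window size and threshold are illustrative defaults only."""

    def __init__(self, window=8, max_repeats=3):
        self.history = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, screen_hash, action):
        """Record one step; return True when a strategy change is forced."""
        self.history.append((screen_hash, action))
        return self.history.count((screen_hash, action)) >= self.max_repeats
```

论文的多级过滤(切换交互模式、绑定反射信号等)可以视为在这类重复检测信号之上叠加的升级动作。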

[NLP-42] MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization SEMEVAL-2026

【速读】: 该论文旨在解决多语言极化检测(multilingual polarization detection)中的模型性能差异问题,特别是在不同语言间存在书写系统差异(如高棉语、奥里亚语等非拉丁字母语言)时,通用多语言模型(如XLM-RoBERTa)可能表现不佳的问题。其解决方案的关键在于提出一种语言自适应框架(language-adaptive framework),根据开发集上的性能动态选择最优模型结构——包括多语言通用模型、语言特定专用模型以及混合集成模型,而非强制使用单一统一架构。该方法在SemEval-2026 Task 9的22个语言任务中实现了宏观F1分数0.796和平均准确率0.826的优异结果。

链接: https://arxiv.org/abs/2604.21370
作者: Maziar Kianimoghadam Jouneghani
机构: University of Turin (都灵大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 9 pages, 9 tables. Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), Task 9

点击查看摘要

Abstract:We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: this https URL.
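摘要所述"语言自适应"框架的核心是按开发集表现为每种语言选择最优系统,其选择步骤可以示意如下(分数为虚构示例):

```python
def select_per_language(dev_f1):
    """Language-adaptive system selection: for each language track,
    keep the candidate (generalist / specialist / ensemble) with the
    highest dev macro-F1. All scores used here are illustrative."""
    return {lang: max(scores, key=scores.get)
            for lang, scores in dev_f1.items()}
```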

[NLP-43] mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

【速读】: 该论文旨在解决跨领域(multi-domain)机器生成代码片段(machine-generated code snippets)的检测问题,涵盖多种编程语言,并进一步细化为三个子任务:二分类检测、生成源归属(attribution)、以及大语言模型(LLM)家族识别,还包括对人机协同生成或对抗性修改代码的检测。解决方案的关键在于对现有用于文本检测的mdok方法进行适配,通过探索更适合代码理解的基线模型(base models),从而提升在上述复杂场景下的检测性能。实验表明,所提出系统在所有子任务中均具有竞争力,但与最优系统相比仍有较大提升空间。

链接: https://arxiv.org/abs/2604.21365
作者: Adam Skurla,Dominik Macko,Jakub Simko
机构: Brno University of Technology (布林诺理工大学); Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Multi-domain detection of the machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 copes with this challenge in various angles, as a binary detection problem as well as attribution of the source. Specifically, its subtasks also cover generator LLM family detection, as well as a hybrid code co-generated by humans and machines, or adversarially modified codes hiding its origin. Our submitted systems adjusted the existing mdok approach (focused on machine-generated text detection) to these specific kinds of problems by exploring various base models, more suitable for code understanding. The results indicate that the submitted systems are competitive in all three subtasks. However, the margins from the top-performing systems are significant, and thus further improvements are possible.

[NLP-44] ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

【速读】: 该论文旨在解决传统多阶段地理编码(geocoding)方法依赖文本或向量相似性检索地理数据库所带来的局限性,如流程复杂、误差传播以及对结构化地理知识库的高度依赖。其核心解决方案是提出一种基于大语言模型(large language models, LLMs)的端到端地理编码框架 ReaGeo,关键创新在于将地理坐标转化为 geohash 序列,从而将坐标预测任务重构为文本生成问题,并引入 Chain-of-Thought 机制增强模型对空间关系的推理能力;同时采用基于距离偏差的强化学习策略优化生成精度,显著提升了模型在明确地址查询、模糊相对位置查询及非点状几何区域预测中的准确性与泛化能力。

链接: https://arxiv.org/abs/2604.21357
作者: Jian Cui,Zhiyuan Ren,Desheng Weng,Yongqi Zhao,Gong Wenbin,Yu Lei,Zhenning Dong
机构: Amap, Alibaba Group; Tsinghua University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 8 figures, submitted to ACM SIGSPATIAL 2024 (under review)

点击查看摘要

Abstract:This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model’s reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
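ReaGeo 将坐标转换为 geohash 序列后再做文本生成。geohash 本身是公开算法:经度与纬度交替二分,每 5 位编码为一个 base-32 字符。下面给出标准编码器的 Python 实现(这是通用 geohash 算法的示意,而非论文代码):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=8):
    """Encode a (lat, lon) pair as a geohash string by interleaving
    longitude-first bisection bits, 5 bits per base-32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    code, bits, ch, even = [], 0, 0, True
    while len(code) < precision:
        if even:  # longitude bit
            mid = (lon_lo + lon_hi) / 2
            ch = ch * 2 + (1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            ch = ch * 2 + (1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
        bits += 1
        if bits == 5:
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)
```

精度越高,字符串越长、对应的地理网格越小;这种前缀即粗定位的性质,使坐标预测可以逐字符地由粗到细生成。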

[NLP-45] CARE: Counselor-Aligned Response Engine for Online Mental-Health Support

【速读】: 该论文旨在解决全球范围内心理健康挑战加剧导致的情绪支持服务压力增大、心理咨询师负荷过重的问题,尤其是在低资源语言(如希伯来语和阿拉伯语)中,现有生成式 AI(Generative AI)模型难以有效模拟专业咨询师的支持性语言与干预策略。其解决方案的关键在于提出 CARE(Counselor-Aligned Response Engine)框架,通过在专家标注的高质量危机对话数据集上对开源大语言模型(LLM)进行领域特定微调,使模型能够捕捉到成功去激化互动的模式,并基于完整的对话历史动态维护情感上下文,从而生成与专业咨询师响应在语义和策略层面高度一致的实时建议,显著提升低资源语言环境下心理援助的质量与效率。

链接: https://arxiv.org/abs/2604.21352
作者: Hagai Astrin,Ayal Swaid,Avi Segal,Kobi Gal
机构: Ben-Gurion University (本-古里安大学); School of Informatics, University of Edinburgh (爱丁堡大学信息学院)
类目: Computation and Language (cs.CL)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.

[NLP-46] Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在抽象视觉推理任务(如Bongard问题)中表现不佳的问题,核心疑问在于性能瓶颈是源于推理能力不足还是表征能力有限。为厘清这一问题,作者在Bongard-LOGO这一具有真实生成程序标注的合成基准上进行实验,对比直接处理原始图像的VLM与基于符号输入的大语言模型(Large Language Models, LLMs)的表现。其解决方案的关键在于提出“组件化-语法化”(Componential–Grammatical, C–G)范式,将视觉问题转化为基于LOGO风格动作程序或结构化描述的符号推理任务,并通过符号输入作为诊断探针而非实际多模态架构。实验证明,LLMs在符号输入下显著提升准确率(达到近90%),而视觉基线仍接近随机水平,表明抽象视觉推理的主要瓶颈在于表征而非推理本身,且符号结构的引入可提供一个可控的理论上限。

链接: https://arxiv.org/abs/2604.21346
作者: Mohit Vaishnav,Tanel Tammet
机构: Tallinn University of Technology (塔林理工大学); Kimova AI (Kimova AI)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision–language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid–90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.

[NLP-47] Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

【速读】: 该论文旨在解决生成式 AI (Generative AI) 应用在实际部署中缺乏可复用、结构化评估流程的问题,尤其聚焦于会议摘要任务的质量评价。其核心挑战在于如何系统性地分离通用评估逻辑与特定任务语义,并实现对生成结果的细粒度分析与统计验证。解决方案的关键在于构建一个五阶段可复用的评估流水线(source intake, structured reference construction, candidate generation, structured scoring, and reporting),将真实参考答案和评估输出均作为类型化(typed)持久化 artifact 进行管理,从而支持聚合分析、问题定位及显著性检验。该方法不仅提升了评估的透明度和可重复性,还揭示了不同模型在准确率、完整性与覆盖度上的差异,例如 gpt-5.1 在保留信息方面表现更优,而白宫新闻发布会类数据成为准确性难点领域。

链接: https://arxiv.org/abs/2604.21345
作者: Philip Zhong,Don Wang,Jason Zhang,Kent Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: AI Application Feature Quality Evaluation (28 pages total)

点击查看摘要

Abstract:We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop and documents, but does not quantitatively evaluate, the online feedback-to-evaluation path. 
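摘要中用配对符号检验(paired sign test)加 Holm 校正来比较各模型在保留类指标上的差异。下面用假设性的胜/负计数给出这一统计流程的最小示意(数据与各指标名称均为演示用假设值,非论文原始数据):

```python
from math import comb

def sign_test(wins: int, losses: int) -> float:
    """双侧配对符号检验:在 H0(两模型在每次配对比较中胜负等概率)下,
    观察到不少于 max(wins, losses) 次单侧结果的概率。"""
    n = wins + losses
    k = max(wins, losses)
    # 二项分布尾概率,双侧取 2 倍并截断到 1
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_correction(pvals: dict) -> dict:
    """Holm 逐步向下校正:按 p 值升序依次乘以 (m - i),并保持单调不减。"""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    adjusted, running_max = {}, 0.0
    for i, (name, p) in enumerate(ordered):
        running_max = max(running_max, min(1.0, (m - i) * p))
        adjusted[name] = running_max
    return adjusted

# 假设性的逐会议胜/负计数,仅作演示(非论文数据)
raw = {
    "accuracy":     sign_test(wins=60, losses=54),
    "completeness": sign_test(wins=90, losses=24),
    "coverage":     sign_test(wins=95, losses=19),
}
adj = holm_correction(raw)
for name in raw:
    print(f"{name}: raw p={raw[name]:.4f}, Holm p={adj[name]:.4f}")
```

与摘要结论一致的模式是:接近五五开的指标(如 accuracy)校正后不显著,而一边倒的保留类指标则保持显著。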

[NLP-48] Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

【速读】: 该论文旨在解决Transformer模型中KV(Key-Value)压缩效率与性能之间的权衡问题,特别是传统方法在token、layer或head等粗粒度单元上进行压缩时,难以实现精细化控制,导致信息损失或冗余。其核心解决方案是引入子token级别的路由机制(sub-token routing),在LoRA(Low-Rank Adaptation)适配的Transformer架构中,对单个token内部的值组(value groups)进行细粒度的保留决策。关键创新在于:针对语言建模任务提出一种不依赖查询的路由设计,结合路由子空间LoRA与KV路径上的值组路由;针对下游任务保持场景提出一种基于预测器的选择机制,利用查询条件相关性分配全局保留预算至上下文token与值组对。实验表明,这两种策略分别优化了压缩比与语言建模质量的平衡,以及在降低KV存储预算下维持下游任务表现的能力,并揭示了token级与子token级查询感知路由形成互补的压缩维度。

链接: https://arxiv.org/abs/2604.21335
作者: Wei Jiang,Wei Wang
机构: Futurewei Technologies Inc.
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages, 14 tables, 2 figures

点击查看摘要

Abstract:Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality-compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.
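摘要中查询感知设计的核心,是在一个全局保留预算下对 (上下文 token, 值组) 对做选择。下面是一个最小示意:对假设性的相关性打分做全局 top-k 保留(论文中打分由查询条件预测器给出,此处直接硬编码;函数名与数据均为演示用假设):

```python
def route_value_groups(scores, budget):
    """在全局保留预算下对 (token, 值组) 对做细粒度路由。
    scores: [(token_id, group_id, relevance), ...] 打分列表;
    budget: 保留比例,如 0.25 表示仅保留 25% 的值组。
    返回被保留的 (token_id, group_id) 集合。"""
    k = max(1, round(budget * len(scores)))
    kept = sorted(scores, key=lambda t: t[2], reverse=True)[:k]
    return {(tok, grp) for tok, grp, _ in kept}

# 假设性打分:2 个上下文 token,各 4 个值组
scores = [
    ("tok0", 0, 0.9), ("tok0", 1, 0.1), ("tok0", 2, 0.7), ("tok0", 3, 0.2),
    ("tok1", 0, 0.3), ("tok1", 1, 0.8), ("tok1", 2, 0.05), ("tok1", 3, 0.4),
]
kept = route_value_groups(scores, budget=0.25)
print(kept)
```

注意保留决策不再是 token 级的全有或全无:预算可以不均匀地分布在不同 token 的内部值组上,这正是摘要所说的子 token 路由轴。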

[NLP-49] Ideological Bias in LLMs' Economic Causal Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理经济因果效应时是否存在系统性意识形态偏倚的问题,尤其是在政策分析和经济报道等高风险场景中,方向正确的因果判断至关重要。其解决方案的关键在于构建并扩展了EconCausal基准测试,引入了意识形态争议案例(ideology-contested cases),即干预导向(pro-government)与市场导向(pro-market)视角对因果方向预测存在分歧的实例;通过对来自顶级经济学与金融期刊的10,490个经实证验证因果方向的三元组进行筛选,识别出1,056个意识形态争议样本,并评估20种前沿LLM在预测实证支持的因果方向上的表现。结果表明,LLMs在意识形态争议问题上准确率显著降低,且当预测错误时,其偏差明显偏向干预导向,这种方向性偏差无法通过单次上下文提示消除,凸显了在高风险经济与政策应用中开展方向感知型评估的必要性。

链接: https://arxiv.org/abs/2604.21334
作者: Donggyu Lee,Hyeok Yun,Jungwon Kim,Junsik Min,Sungwon Park,Sangyoon Park,Jihee Kim
机构: KAIST(韩国科学技术院); HKUST(香港科技大学)
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

[NLP-50] Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning ACL2026

【速读】: 该论文旨在解决测试时强化学习(Test-time Reinforcement Learning, TTRL)在推理阶段通过伪标签(pseudo-labeling)进行模型自适应时,因标签噪声引发的虚假优化信号问题。研究发现,中等一致性的响应构成一个模糊区域,是奖励噪声的主要来源,且群体相对优势估计(group-relative advantage estimation)会进一步放大此类虚假信号。解决方案的关键在于提出统一框架Debiased and Denoised Test-time Reinforcement Learning (DDRL),其核心包括:(1) 基于频率的采样策略排除模糊样本,同时保持正负样本平衡;(2) 采用固定优势的去偏优势估计方法,消除群体相对策略优化引入的偏差;(3) 引入基于共识的离策略精炼阶段,利用拒绝采样数据集实现高效稳定的模型更新。实验证明,DDRL在多个数学推理基准上显著优于现有TTRL基线方法。

链接: https://arxiv.org/abs/2604.21327
作者: Yongcan Yu,Lingxiao He,Jian Liang,Kuangpu Guo,Meng Wang,Qianlong Xie,Xingxing Wang,Ran He
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); Meituan (美团); University of Science and Technology of China (中国科学技术大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Test-time reinforcement learning (TTRL) always adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at this https URL.
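DDRL 的频率采样与固定优势两步可以粗略示意如下:对同一问题的多条回答做多数投票得到伪标签,整体剔除一致性落在模糊区间的样本,并为保留样本按伪标签赋 +1/-1 的固定优势以替代组相对优势(阈值与数据均为演示用假设值,非论文设定):

```python
from collections import Counter

def ddrl_filter(answers, low=0.3, high=0.7):
    """基于频率的采样草图:
    多数投票得到伪标签;一致性落在 (low, high) 的样本被视为
    模糊区域(奖励噪声的主要来源)而整组丢弃;
    其余样本赋固定优势 +1/-1,避免组相对估计放大虚假信号。"""
    counts = Counter(answers)
    pseudo_label, freq = counts.most_common(1)[0]
    consistency = freq / len(answers)
    if low < consistency < high:
        return None  # 模糊区域:直接剔除
    return [(a, +1.0 if a == pseudo_label else -1.0) for a in answers]

print(ddrl_filter(["42"] * 7 + ["41"]))       # 高一致性:保留并赋固定优势
print(ddrl_filter(["42"] * 4 + ["41"] * 4))   # 中等一致性:剔除
```

实际实现中还需在保留集上维持正负样本平衡,并叠加摘要所述的共识离策略精炼阶段;此处只示意去噪与去偏这两个核心判断。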

[NLP-51] When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation ACL2026

【速读】: 该论文旨在解决多文档新闻摘要系统中存在的政治偏见问题,即模型在处理具有不同政治倾向的新闻源时,可能表现出观点代表性不均、特定立场过度强调以及少数声音系统性被忽视等不公平现象。其解决方案的关键在于构建一个涵盖13个大型语言模型(Large Language Models, LLMs)和五种公平性度量指标的综合性评估框架,并系统测试了基于提示(prompt-based)和基于判别器(judge-based)的去偏干预策略。研究发现,模型规模并非决定公平性的关键因素,中等规模模型反而在公平性与效率之间取得最佳平衡;同时,实体情感(entity sentiment)作为最顽固的公平维度,对现有所有干预手段均表现出强鲁棒性,表明实现公平性需依赖多维评估体系与面向架构的针对性去偏方法,而非单纯扩大模型规模。

链接: https://arxiv.org/abs/2604.21309
作者: Nannan Huang,Iffat Maab,Junichi Yamagishi
机构: RMIT University, Australia; National Institute of Informatics, Tokyo, Japan
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.

[NLP-52] CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

【速读】: 该论文旨在解决企业级大语言模型(Large Language Model, LLM)代理在执行任务时因依赖内部上下文而引发的敏感信息泄露问题。其核心挑战在于如何在保证任务执行效率的同时,有效控制敏感内容的传播,确保上下文完整性(Contextual Integrity, CI)。解决方案的关键在于提出一个基于CI理论的基准测试框架CI-Work,通过模拟五种信息流方向来评估模型在密集检索场景下是否能准确传递必要信息并屏蔽敏感内容;研究发现,当前前沿模型普遍存在隐私失效现象(违规率为15.8%–50.9%,信息泄露率最高达26.7%),且任务效用与隐私违规之间存在反直觉的正相关关系,表明单纯扩大模型规模或增强推理深度无法缓解风险。因此,论文主张从“模型中心”向“上下文中心”的架构范式转变,以系统性提升企业级AI代理的安全性。

链接: https://arxiv.org/abs/2604.21308
作者: Wenjie Fu,Xiaoting Qin,Jue Zhang,Qingwei Lin,Lukas Wutschitz,Robert Sim,Saravan Rajmohan,Dongmei Zhang
机构: Huazhong University of Science and Technology, China; Microsoft
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user’s behalf, also creates new risks for sensitive information leakage. We introduce CI-Work, a Contextual Integrity (CI)-grounded benchmark that simulates enterprise workflows across five information-flow directions and evaluates whether agents can convey essential content while withholding sensitive context in dense retrieval settings. Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8%-50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade-off critical for industrial deployment: higher task utility often correlates with increased privacy violations. Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model-centric scaling toward context-centric architectures.

[NLP-53] Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding

【速读】: 该论文旨在解决生成式 AI(Generative AI)中预测编码网络(Predictive Coding, PC)的隐层能量探针(K-way energy probe)与softmax输出之间的关系问题,特别是验证其是否可被简化为对数软最大边缘(log-softmax margin)的单调函数。此前研究(Cacioli, 2026)表明这一简化依赖于五个假设,其中交叉熵(Cross-Entropy, CE)损失和前馈推理动态是关键前提。本文通过两个实验条件测试该简化对CE移除的敏感性:一是用均方误差(MSE)替代CE训练标准PC,二是采用双向预测编码(bidirectional PC, bPC)。结果表明,仅移除CE会使探针与softmax的差距减半(Δ_MSE = -0.037 vs Δ_stdPC = -0.082),说明CE在该尺度下是分解的关键经验负载项;进一步温度缩放分析揭示约66%的差距源于可由温度重标定消除的logit尺度效应,剩余34%体现为CE训练表示在排序上的尺度不变优势。因此,解决方案的核心在于识别并量化CE训练对探针-softmax关系的决定性作用,以及区分尺度相关与尺度无关的表征差异。

链接: https://arxiv.org/abs/2604.21286
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, 4 tables. Pre-registered on OSF ( this https URL ). Code at this https URL

点击查看摘要

Abstract:Cacioli (2026) showed that the K-way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log-softmax margin. The reduction rests on five assumptions, including cross-entropy (CE) at the output and effectively feedforward inference dynamics. This pre-registered study tests the reduction’s sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR-10 with a matched 2.1M-parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = -0.082, p < 10^-6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre-registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe-softmax gap (Delta_MSE = -0.037 vs Delta_stdPC = -0.082). CE is a major empirically load-bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post-hoc temperature scaling ablation decomposes the probe-softmax gap into two components: approximately 66% is attributable to logit-scale effects removable by temperature rescaling, and approximately 34% reflects a scale-invariant ranking advantage of CE-trained representations. We use “metacognitive” operationally to denote Type-2 discrimination of a readout over its own Type-1 correctness, not to imply human-like introspective access.
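摘要中的核心量(log-softmax 裕度)以及温度重标定对它的线性缩放,可以用玩具 logits 做一个最小示意:裕度等于前两名 logit 之差,因此温度重标定只改变其尺度、不改变排序,这对应消融中约 66% 的可消除差距(数值均为演示用假设值):

```python
from math import exp, log

def log_softmax_margin(logits):
    """top-1 与 top-2 的 log-softmax 裕度:
    log p(top1) - log p(top2)。log-sum-exp 项相消,
    化简后恰为前两名 logit 之差。"""
    z = sorted(logits, reverse=True)
    lse = log(sum(exp(v - z[0]) for v in logits)) + z[0]
    return (z[0] - lse) - (z[1] - lse)  # 数值上等于 z[0] - z[1]

def temperature_rescale(logits, T):
    """温度重标定:仅线性缩放 logits,排序不变。"""
    return [v / T for v in logits]

logits = [6.0, 2.0, 1.0]                        # 假设性的 CE 风格大范数 logits
rescaled = temperature_rescale(logits, T=15.0)  # 模拟摘要中约 15x 的范数差异
print(log_softmax_margin(logits))
print(log_softmax_margin(rescaled))
```

排序不变而尺度可变,正是区分“可由温度消除的尺度效应”与“尺度不变的排序优势”这两部分差距的直观依据。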

[NLP-54] Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

【速读】: 该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在语音识别中引入的潜在公平性问题,即其文本先验是否会加剧或缓解不同人口群体(如种族、口音、性别、年龄和母语)之间的识别偏差。研究通过评估九种跨越三类架构(基于CTC无语言模型、编码器-解码器隐式语言模型、以及显式预训练解码器的LLM)的模型,在Common Voice 24和Meta的Fair-Speech数据集上进行系统分析,发现LLM规模并非提升公平性的关键因素;相反,音频编码器设计才是决定公平性和鲁棒性的核心杠杆——例如,音频压缩比更能预测口音公平性,而特定音频扰动(如静音注入)会显著放大Whisper模型的口音偏差,甚至引发选择性幻觉,而显式LLM解码器则表现出更强的抗干扰能力与更低的重复插入率。

链接: https://arxiv.org/abs/2604.21276
作者: Srishti Ginjala,Eric Fosler-Lussier,Christopher W. Myers,Srinivasan Parthasarathy
机构: The Ohio State University (俄亥俄州立大学); Air Force Research Laboratory (空军研究实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta’s Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper’s accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
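摘要中衡量公平性的 max/min WER 比值可以用如下草图示意:先按标准编辑距离计算各组词错误率,再取组间最大最小之比(分组与数值均为演示用假设数据,非论文结果):

```python
def wer(ref, hyp):
    """词错误率:替换/插入/删除的最小编辑距离除以参考词数。"""
    r, h = ref.split(), hyp.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

def fairness_ratio(group_wers):
    """组间 max/min WER 比值:越接近 1 越公平。"""
    return max(group_wers.values()) / min(group_wers.values())

ref = "the cat sat on the mat"
print(wer(ref, "a cat sat on mat"))                      # 假设性转写,2 处错误
print(fairness_ratio({"group_a": 0.10, "group_b": 0.228}))  # 假设数值,比值为 2.28
```

论文中 Granite-8B 在种族维度上的最优值 2.28 即为这一比值;当所有组 WER 同时升高时比值会被压缩,这解释了摘要中“严重退化反而拉近公平差距”的悖论。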

[NLP-55] Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

【速读】: 该论文旨在解决小规模语言模型在训练初期收敛慢、性能受限的问题,提出通过结构化的人类创作输出(如音乐)作为预训练任务,以提升后续语言任务的学习效率。其解决方案的关键在于构建一个分阶段的发育式预训练管道(music → poetry → prose),其中音乐预训练先优化模型内部计算模块(internal computation),随后诗歌和散文任务分别作用于嵌入层(embeddings)与语言建模能力,从而实现正交的特征解耦与协同优化;实验证明该策略在不同模型容量下均能显著降低困惑度(最高达17.5%),且具有稳定性和可扩展性,表明人类艺术形式可作为高效的小模型预训练基底。

链接: https://arxiv.org/abs/2604.21265
作者: Yoshinori Nomura
机构: Mirage Mountain Technologies Inc. (Mirage Mountain Technologies 公司)
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures

点击查看摘要

Abstract:We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline – music → poetry → prose – yields a 17.5% perplexity improvement over random initialization (p < 0.001, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at d = 64, multi-seed validation (5 seeds) shows a persistent 5.5% gap at plateau (p = 0.017), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity (-3% → +3% → +6% advantage of larger datasets from d = 16 to d = 64). Across the scales we study (d ∈ {16, 32, 64}, up to ~400K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.

[NLP-56] When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)代理在知识蒸馏(model distillation)过程中出现的行为同质化问题,即多个新兴代理展现出几乎相同的推理步骤和失败模式,暗示其可能源于少数主导教师模型的“复制品”。现有评估指标无法区分任务成功所必需的强制性行为与反映模型自主偏好的非强制性行为模式。为此,作者提出两种互补的度量方法:**响应模式相似性(Response Pattern Similarity, RPS)**用于衡量语言输出的一致性,以及**动作图相似性(Action Graph Similarity, AGS)**将工具使用习惯建模为有向图以刻画非强制性行为差异。关键创新在于通过AGS识别出教师特定的收敛趋势,而非单纯的任务性能提升,从而有效分离出模型间因蒸馏导致的共性行为与个体偏好驱动的独特行为模式。

链接: https://arxiv.org/abs/2604.21255
作者: Chenghao Yang,Yuning Zhang,Zhoufutu Wen,Tao Gong,Jiaheng Liu,Qi Chu,Nenghai Yu
机构: University of Science and Technology of China (中国科学技术大学); Anhui Province Key Laboratory of Digital Security (安徽省数字安全重点实验室); M-A-P; Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main Conference

点击查看摘要

Abstract:Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model’s autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: **Response Pattern Similarity (RPS)** for verbal alignment and **Action Graph Similarity (AGS)** for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on τ-Bench and τ²-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% S_node and 94.7% S_dep, exceeding Anthropic’s own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson r = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at this https URL.
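摘要中的动作图相似性(AGS)把工具调用轨迹建成有向图后再比较。下面是一个简化草图:以节点集与有向边集的 Jaccard 相似度分别近似 S_node 与 S_dep(论文的具体定义可能更复杂,轨迹为演示用假设数据):

```python
def action_graph(tool_calls):
    """把一次会话的工具调用序列建成有向图:
    节点为工具名,边为相邻调用 (t_i -> t_{i+1})。"""
    nodes = set(tool_calls)
    edges = set(zip(tool_calls, tool_calls[1:]))
    return nodes, edges

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def action_graph_similarity(calls_a, calls_b):
    """AGS 的一个简化近似:节点集相似度对应 S_node,
    有向边(依赖)集相似度对应 S_dep。"""
    (na, ea), (nb, eb) = action_graph(calls_a), action_graph(calls_b)
    return {"S_node": jaccard(na, nb), "S_dep": jaccard(ea, eb)}

# 假设性的两条智能体工具调用轨迹
a = ["search", "read", "search", "book"]
b = ["search", "read", "book"]
print(action_graph_similarity(a, b))
```

这一例子也说明两个分量为何互补:两条轨迹用到的工具完全相同(S_node = 1.0),但调用顺序习惯不同(S_dep 较低)。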

[NLP-57] Hyperloop Transformers

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在边缘计算和设备端部署时面临的内存占用过高问题,即在固定计算/延迟预算下,如何进一步提升模型的参数效率。其解决方案的关键在于提出一种名为“超连接环状Transformer”(Hyperloop Transformer)的新架构:该架构以循环Transformer为核心单元,通过重复使用中间块(middle block)中的Transformer层来减少参数量;同时引入仅在每次循环后应用的超连接(hyper-connections),将残差流扩展为矩阵值形式,从而在几乎不增加额外参数和计算开销的前提下增强模型表达能力。实验表明,该架构在多种模型规模下均能显著优于深度匹配的Transformer及mHC Transformer基线,且性能优势在权重量化后依然保持,具备良好的内存效率与实用性。

链接: https://arxiv.org/abs/2604.21254
作者: Abbas Zeitoun,Lucas Torroba-Hennigen,Yoon Kim
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model’s memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks–begin, middle, and end blocks–where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
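摘要中“参数约省一半”的来源可以用一个粗略的参数量核算来示意:begin/end 块各自独立,middle 块在多次循环间共享参数,每次循环后的超连接只增加很小的混合矩阵(每层参数量、块划分与超连接开销均为演示用假设的量级估计,非论文配置):

```python
def transformer_params(layers, per_layer):
    """深度匹配的普通 Transformer:每层参数独立。"""
    return layers * per_layer

def hyperloop_params(begin, middle, end, loops, per_layer, n_streams):
    """Hyperloop:begin/end 独立,middle 块在 loops 次循环间共享参数;
    每次循环后的超连接只引入 n_streams 阶的小混合矩阵(量级示意)。"""
    hyper_conn = loops * n_streams * n_streams
    return (begin + middle + end) * per_layer + hyper_conn

per_layer = 12 * 1024 ** 2   # 假设每层约 12M 参数,仅为量级估计
# 有效深度同为 2 + 6*3 + 2 = 22 层
dense_total = transformer_params(layers=22, per_layer=per_layer)
looped_total = hyperloop_params(begin=2, middle=6, end=2, loops=3,
                                per_layer=per_layer, n_streams=4)
print(f"参数量之比 looped/dense ≈ {looped_total / dense_total:.3f}")
```

在这组假设配置下,循环复用把独立层数从 22 压到 10,参数量约为深度匹配基线的一半,而超连接的额外开销在量级上可以忽略,这与摘要所述的约 50% 节省相吻合。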

[NLP-58] Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在叙事生成中普遍存在的全局叙事连贯性差、上下文逻辑一致性弱以及角色发展不流畅的问题,这些问题常导致脚本单调且结构断裂。解决方案的关键在于提出PLOTTER框架,该框架不再直接基于序列文本表示进行生成,而是将叙事内容建模为事件图(event graph)和角色图(character graph),并在这些结构化图表示上执行“评估-规划-修订”(Evaluate-Plan-Revise)循环。通过在严格逻辑约束下诊断并修复图拓扑结构问题,模型在完整上下文生成前优化因果关系与叙事骨架,从而显著提升长程推理能力与叙事质量。

链接: https://arxiv.org/abs/2604.21253
作者: Hanwen Gu,Chao Guo,Junle Wang,Wenda Xie,Yisheng Lv
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Tencent Turing Lab (腾讯天衍实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

点击查看摘要

Abstract:While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations-rather than directly on text-is crucial to enhance the long context reasoning of LLMs in complex narrative generation.
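PLOTTER 在事件图上执行“评估-规划-修订”循环,其中“评估”阶段需要诊断图拓扑问题。下面用一个假设性事件图示意两类常见检查:因果环(逻辑矛盾)与孤立事件(叙事断裂)。检查项为示意性选取,并非论文的原始约束:

```python
def diagnose_event_graph(nodes, edges):
    """事件图诊断草图:edges 为 (因, 果) 有向边列表。
    检测因果环与未接入任何因果链的孤立事件。"""
    succ = {n: [] for n in nodes}
    for cause, effect in edges:
        succ[cause].append(effect)

    def has_cycle():
        state = dict.fromkeys(nodes, 0)  # 0 未访问 / 1 在递归栈中 / 2 已完成
        def dfs(u):
            state[u] = 1
            for v in succ[u]:
                if state[v] == 1 or (state[v] == 0 and dfs(v)):
                    return True
            state[u] = 2
            return False
        return any(state[n] == 0 and dfs(n) for n in nodes)

    linked = {n for edge in edges for n in edge}
    return {"has_cycle": has_cycle(), "orphans": set(nodes) - linked}

# 假设性事件图:三个事件构成因果环,外加一个孤立插曲
report = diagnose_event_graph(
    nodes=["betrayal", "revenge", "reconciliation", "side_episode"],
    edges=[("betrayal", "revenge"), ("revenge", "reconciliation"),
           ("reconciliation", "betrayal")],
)
print(report)
```

先在这种结构化表示上定位并修复问题,再展开为完整文本,正是该框架区别于直接在序列文本上规划的做法。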

[NLP-59] Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness ACL2026

【速读】: 该论文旨在解决多模态临床时间序列数据中因观测缺失(missingness)导致的信息损失问题,尤其是如何利用观测过程本身所携带的潜在信息来提升患者表征学习的效果。传统方法通常将缺失值视为噪声或采用插补策略处理,但忽略了观测模式与患者隐状态之间的关联性。解决方案的关键在于提出一个融合观测模式的患者表征学习框架,其核心包括:(1) 多模态编码器联合捕捉结构化数据和文本数据及其观测模式;(2) 基于贝叶斯滤波的模块动态更新患者的隐状态;(3) 以学习到的患者状态为基础进行离线治疗策略学习和预后预测。该方法显著提升了在ICU脓毒症队列中的治疗策略评估(FQE从0.528提升至0.679)和72小时死亡率预测性能(AUROC达0.886)。

链接: https://arxiv.org/abs/2604.21235
作者: Zihan Liang,Ziwen Pan,Ruoxuan Xiong
机构: Emory University (埃默里大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)
备注: Findings of ACL 2026 (30 pages)

点击查看摘要

Abstract:Multimodal clinical records contain structured measurements and clinical notes recorded over time, offering rich temporal information about the evolution of patient health. Yet these observations are sparse, and whether they are recorded depends on the patient’s latent condition. Observation patterns also differ across modalities, as structured measurements and clinical notes arise under distinct recording processes. While prior work has developed methods that accommodate missingness in clinical time series, how to extract and use the information carried by the observation process itself remains underexplored. We therefore propose a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. The framework combines (1) a multimodal encoder that captures signals from structured and textual data together with their observation patterns, (2) a Bayesian filtering module that updates a latent patient state over time from observed multimodal signals, and (3) downstream modules for offline treatment policy learning and patient outcome prediction based on the learned patient state. We evaluate the framework on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU. It improves both offline treatment policy learning and adverse outcome prediction, achieving FQE 0.679 versus 0.528 for clinician behavior and AUROC 0.886 for post-72-hour mortality prediction on MIMIC-III.
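摘要中的贝叶斯滤波模块可以用单变量的一步更新来示意:观测缺失时不做后验更新,但按 informative missingness 的思路让先验发生漂移,也就是说“没被测量”本身携带了关于病人隐状态的信息(更新形式与所有参数均为演示用假设值,非论文模型):

```python
def filter_step(mean, var, obs, obs_var, process_var, missing_shift):
    """单变量贝叶斯滤波的一步草图:
    先做先验预测(方差随过程噪声增大);obs 为 None 时不更新,
    但让先验均值向 missing_shift 方向漂移,以刻画缺失的有信息性;
    obs 非 None 时按 Kalman 增益更新后验。"""
    mean = mean + (missing_shift if obs is None else 0.0)
    var = var + process_var
    if obs is None:
        return mean, var
    k = var / (var + obs_var)          # Kalman 增益
    return mean + k * (obs - mean), (1 - k) * var

state = (0.0, 1.0)
for obs in [1.0, None, None, 2.0]:     # 中间两个时间步缺失观测
    state = filter_step(*state, obs, obs_var=0.5,
                        process_var=0.1, missing_shift=-0.05)
print(state)
```

缺失步只增大不确定性并使均值漂移,观测步则同时拉回均值并压低方差;论文在此之上进一步联合编码结构化与文本两种模态及其各自的观测模式。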

[NLP-60] EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

【速读】: 该论文旨在解决大语言模型在多轮对话中长期记忆与跨会话信息整合能力不足的问题,即如何有效保留并推理跨越多个交互会话的信息。其解决方案的关键在于提出一种基于图结构的记忆系统 Engrama,该系统通过显式建模对话内容间的语义关联和时序关系,实现对跨空间(cross-space)信息的高效整合与推理,从而在特定任务(如跨空间推理)上优于仅依赖上下文提示或向量检索的记忆机制。

链接: https://arxiv.org/abs/2604.21229
作者: Julian Acuna
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama’s cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.

[NLP-61] Zero-Shot Detection of LLM -Generated Text via Implicit Reward Model NEURIPS2025

【速读】: 该论文旨在解决生成式 AI(Generative AI)文本检测问题,即如何在无需额外训练或偏好数据的情况下,准确识别由大型语言模型(Large Language Models, LLMs)生成的文本。其解决方案的关键在于提出一种新颖的零样本方法——IRM(Implicit Reward Model),该方法通过利用公开可获取的指令微调模型和基础模型隐式地构建奖励模型,从而实现对LLM生成文本的有效区分。与以往依赖偏好标注和任务特定微调的方法不同,IRM无需任何额外训练或人工标注,显著提升了检测性能,在DetectRL基准测试中优于现有零样本及监督方法。

链接: https://arxiv.org/abs/2604.21223
作者: Runheng Liu,Heyan Huang,Xingchen Xiao,Zhijing Wu
机构: Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based method relies on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that IRM can achieve superior detection performance, outperforms existing zero-shot and supervised methods in LLM-generated text detection.
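IRM 所依赖的隐式奖励可由指令微调模型与基座模型对同一文本的对数概率之差构造(DPO 风格的隐式奖励形式)。下面用假设性的玩具 log 概率示意打分与阈值判定;实际使用时需对同一文本用两套模型逐 token 计算序列 log 概率,阈值与数值均为演示假设:

```python
def implicit_reward(logp_instruct, logp_base, beta=1.0):
    """隐式奖励:r(y) = beta * (log pi_it(y|x) - log pi_base(y|x))。
    直觉:指令微调模型相对基座模型“更偏好”的文本,
    更可能出自 LLM 之手。"""
    return beta * (logp_instruct - logp_base)

def detect(texts, threshold=0.0):
    """按隐式奖励打分并用阈值判定是否为 LLM 生成。"""
    return {name: implicit_reward(it, base) > threshold
            for name, (it, base) in texts.items()}

toy = {  # (指令模型下的序列 log 概率, 基座模型下的序列 log 概率),假设数值
    "llm_text":   (-42.0, -65.0),   # 指令模型打分明显更高
    "human_text": (-58.0, -55.0),   # 基座模型打分反而更高
}
print(detect(toy))
```

这一构造解释了 IRM 为何是零样本的:两套模型均为公开可得,既不需要偏好数据,也不需要任何额外训练。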

[NLP-62] Subject-level Inference for Realistic Text Anonymization Evaluation ACL2026

【速读】: 该论文旨在解决当前文本匿名化评估中依赖基于片段(span-based)指标的局限性,这些指标无法准确反映攻击者实际可推断的信息,并且通常假设单一数据主体,忽略了多主体场景下的隐私风险。其解决方案的关键在于提出SPIA(Subject-level PII Inference Assessment),首个以个体为评估单位的基准测试,包含675份法律和在线领域的文档,并引入新型的主体级保护度量指标。实验表明,即使90%以上的个人身份信息(PII)片段被遮蔽,主体级推理保护率仍可能低至33%,说明通过上下文推理仍可恢复大部分个人信息;同时,针对特定目标主体的匿名化策略会使非目标主体暴露程度显著高于目标主体。这凸显了基于主体级推理的评估在真实场景中保障文本匿名安全的重要性。

链接: https://arxiv.org/abs/2604.21211
作者: Myeong Seok Oh,Dong-Yun Kim,Hanseok Oh,Chaean Kang,Joeun Kang,Xiaonan Wang,Hyunjung Park,Young Cheol Jung,Hansaem Kim
机构: Tscientific; Soongsil University (崇实大学); Yonsei University (延世大学); Mila(蒙特利尔学习算法研究所)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026

点击查看摘要

Abstract:Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations, we present SPIA (Subject-level PII Inference Assessment), the first benchmark that shifts the unit of evaluation from text spans to individuals, comprising 675 documents across legal and online domains with novel subject-level protection metrics. Extensive experiments show that even when over 90% of PII spans are masked, subject-level inference protection drops as low as 33%, leaving the majority of personal information recoverable through contextual inference. Furthermore, target-subject-focused anonymization leaves non-target subjects substantially more exposed than the target subject. We show that subject-level inference-based evaluation is essential for ensuring safe text anonymization in real-world settings.
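片段级(span-level)与主体级(subject-level)评估之间的差距可以用一个玩具指标直观说明。以下是笔者对 SPIA 保护指标思想的极简化示意(函数名与数据结构均为自拟):只要某个主体还有任意一条 PII 可被推断,该主体就不算被保护。

```python
def span_protection_rate(inferable_by_subject):
    """Fraction of individual PII items that are NOT inferable,
    pooled across subjects (the span-style view)."""
    items = [x for pii in inferable_by_subject.values() for x in pii]
    return sum(1 for x in items if not x) / len(items)

def subject_protection_rate(inferable_by_subject):
    """Fraction of subjects with NO inferable PII item: a single
    recoverable item can re-identify the person, so subject-level scores
    fall below span-level ones (toy version of SPIA's idea)."""
    subjects = inferable_by_subject.values()
    return sum(1 for pii in subjects if not any(pii)) / len(subjects)

# True = an adversary can still infer this PII item after anonymization.
doc = {"subject_a": [False, False, False],
       "subject_b": [False, False, True]}
span_rate = span_protection_rate(doc)        # 5/6 of items protected
subject_rate = subject_protection_rate(doc)  # but only 1/2 of subjects
```

该示例正体现了论文观察到的现象:片段遮蔽率很高(5/6)时,主体级保护率仍可能低得多(1/2)。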

[NLP-63] Align Generative Artificial Intelligence with Human Preferences: A Novel Large Language Model Fine-Tuning Method for Online Review Management

【速读】: 该论文旨在解决生成式 AI(Generative AI)在在线评论回复生成任务中因缺乏领域特定人类偏好对齐而导致的幻觉、偏好表示困难及离线策略优化过度保守等问题。其核心解决方案是提出一种新颖的偏好微调(preference fine-tuning)方法,关键在于:首先通过上下文增强缓解大语言模型(LLM)的幻觉问题;其次设计基于理论驱动的偏好对构造机制,自动构建在线评论领域的用户偏好样本;进一步引入课程学习策略提升微调效果;最后提出基于密度估计的支持约束方法,有效缓解传统离线偏好微调中的过度保守问题,并提供了严格的理论保障。

链接: https://arxiv.org/abs/2604.21209
作者: Yanan Wang,Yong Ge
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to Information Systems Research (ISR). This is a preliminary version

点击查看摘要

Abstract:Online reviews have played a pivotal role in consumers’ decision-making processes. Existing research has highlighted the significant impact of managerial review responses on customer relationship management and firm performance. However, a large portion of online reviews remains unaddressed due to the considerable human labor required to respond to the rapid growth of online reviews. While generative AI models have achieved remarkable success in a range of tasks, they are general-purpose and may not align well with domain-specific human preferences. To tailor these general generative AI models to domain-specific applications, finetuning is commonly employed. Nevertheless, several challenges persist in finetuning with domain-specific data, including hallucinations, difficulty in representing domain-specific human preferences, and over-conservatism in offline policy optimization. To address these challenges, we propose a novel preference finetuning method to align an LLM with domain-specific human preferences for generating online review responses. Specifically, we first identify the source of hallucination and propose an effective context augmentation approach to mitigate LLM hallucination. To represent human preferences, we propose a novel theory-driven preference finetuning approach that automatically constructs human preference pairs in the online review domain. Additionally, we propose a curriculum learning approach to further enhance preference finetuning. To overcome the challenge of over-conservatism in existing offline preference finetuning methods, we propose a novel density estimation-based support constraint method to relax the conservatism, and we mathematically prove its superior theoretical guarantees. Extensive evaluations substantiate the superiority of our proposed preference finetuning method.

[NLP-64] Prefix Parsing is Just Parsing ACL2026

【速读】: 该论文旨在解决前缀解析(prefix parsing)问题,即判断一个输入前缀是否可以扩展为由给定文法生成的完整字符串,并在加权场景下计算前缀概率,这对上下文无关语言建模、心理语言学分析及大语言模型的句法约束生成至关重要。解决方案的关键在于提出前缀文法变换(prefix grammar transformation),通过构造一个新文法,使其恰好生成原文法所有字符串的前缀,从而将前缀解析问题转化为普通解析问题,无需修改现有解析算法即可直接利用优化的解析实现;此外,论文还基于**算法微分(algorithmic differentiation)**提出一种高效计算下一个词元权重向量的方法,实现对下一词元的快速预测。这一框架简洁、通用且高效,显著提升了前缀解析的实用性与可扩展性。

链接: https://arxiv.org/abs/2604.21191
作者: Clemente Pasti,Andreas Opedal,Timothy J. O’Donnell,Ryan Cotterell,Tim Vieira
机构: 未知
类目: Computation and Language (cs.CL)
备注: To appear at ACL 2026

点击查看摘要

Abstract:Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling, psycholinguistic analysis, and syntactically constrained generation from large language models. We introduce the prefix grammar transformation, an efficient reduction of prefix parsing to ordinary parsing. Given a grammar, our method constructs another grammar that generates exactly the prefixes of its original strings. Prefix parsing is then solved by applying any ordinary parsing algorithm on the transformed grammar without modification. The reduction is both elegant and practical: the transformed grammar is only a small factor larger than the input, and any optimized implementation can be used directly, eliminating the need for bespoke prefix-parsing algorithms. We also present a strategy, based on algorithmic differentiation, for computing the next-token weight vector, i.e., the prefix weights of all one-token extensions, enabling efficient prediction of the next token. Together, these contributions yield a simple, general, and efficient framework for prefix parsing.
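对 CNF 文法,前缀文法构造的经典形式可以写成一个无权重的小示意(具体表示方式为笔者自拟,论文的变换还覆盖加权情形):对每条二元规则 A → B C,前缀非终结符 A' 要么派生 B 产出的一个前缀,要么派生完整的 B 再接 C 的一个前缀。

```python
def prefix_transform(rules):
    """Build the prefix grammar: for each nonterminal A of a CNF grammar,
    A' derives every nonempty prefix of a string derivable from A."""
    out = {lhs: list(rhss) for lhs, rhss in rules.items()}
    for lhs, rhss in rules.items():
        pre = lhs + "'"
        out.setdefault(pre, [])
        for rhs in rhss:
            if isinstance(rhs, tuple):        # binary rule A -> B C
                B, C = rhs
                out[pre].append((B + "'",))    # prefix ends inside B
                out[pre].append((B, C + "'"))  # all of B, then prefix of C
            else:                              # terminal rule A -> a
                out[pre].append(rhs)
    return out

def derives(rules, sym, w, seen=()):
    """Naive recogniser: can `sym` derive exactly the string w?
    (Exponential; stands in for any ordinary parser such as CYK.)"""
    if (sym, w) in seen:
        return False
    seen = seen + ((sym, w),)
    for rhs in rules.get(sym, []):
        if isinstance(rhs, str):
            if rhs == w:
                return True
        elif len(rhs) == 1:                    # unit rule from the transform
            if derives(rules, rhs[0], w, seen):
                return True
        else:
            B, C = rhs
            if any(derives(rules, B, w[:i]) and derives(rules, C, w[i:])
                   for i in range(1, len(w))):
                return True
    return False

# L(S) = { a^n b^n : n >= 1 }
cnf = {"S": [("A", "X"), ("A", "B")], "X": [("S", "B")],
       "A": ["a"], "B": ["b"]}
pg = prefix_transform(cnf)
```

在变换后的文法上跑任意普通识别器即可判定前缀成员资格:例如 "aab" 是 "aabb" 的前缀,故 S' 能派生它;"abb" 不是任何 aⁿbⁿ 的前缀,故被拒绝。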

[NLP-65] Adaptive Instruction Composition for Automated LLM Red-Teaming ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)红队测试中攻击指令生成的效率与多样性不足问题。现有方法或依赖随机尝试导致语义范围受限,或通过随机组合众包文本降低攻击有效性。其解决方案的关键在于提出一种自适应指令组合框架(Adaptive Instruction Composition),利用强化学习在指令组合的离散空间中平衡探索与利用,并基于对比嵌入输入的轻量级神经上下文Bandit模型实现动态优化,从而引导攻击LLM生成针对目标模型漏洞的多样化且高效的越狱指令。

链接: https://arxiv.org/abs/2604.21159
作者: Jesse Zymet,Andy Luo,Swapnil Shinde,Sahil Wadhwa,Emily Chen
机构: Capital One, AI Foundations (Capital One,人工智能基础部门)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
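摘要中“在组合空间里平衡探索与利用”的机制,可以用一个普通的 UCB 赌博机做极简示意。注意这只是代替品:论文使用的是基于对比嵌入的神经上下文 Bandit,下面的组合、统计数据与常数 c 均为虚构示例。

```python
import math

def ucb_select(stats, combos, c=1.4):
    """Choose the next (query, tactic) instruction combination by UCB over
    past attack success: exploit combos that worked, explore untried ones.
    (Plain-bandit sketch; the paper uses a neural contextual bandit over
    contrastive embeddings, and `c` here is an arbitrary choice.)"""
    total = sum(n for n, _ in stats.values()) + 1
    def score(combo):
        n, successes = stats.get(combo, (0, 0))
        if n == 0:
            return float("inf")  # untried combos are always explored first
        return successes / n + c * math.sqrt(math.log(total) / n)
    return max(combos, key=score)

history = {("query_1", "roleplay"): (10, 9),   # 9/10 jailbreaks succeeded
           ("query_1", "obfuscate"): (10, 1)}
pool = list(history)
best_known = ucb_select(history, pool)
fresh = ucb_select(history, pool + [("query_2", "roleplay")])
```

在相同尝试次数下,UCB 会利用历史成功率更高的组合;而一旦出现从未尝试过的新组合,探索项会优先选中它,这正是摘要所述“有效性与多样性联合优化”的最小版本。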

[NLP-66] “This Wasn’t Made for Me”: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

【速读】: 该论文旨在解决自动语音识别(ASR)系统中因方言差异导致的算法偏见问题,尤其关注技术失败对用户主观体验、情绪反应及社会文化认同的深层影响。传统研究多聚焦于误差率等客观指标,而本文通过在四个美国英语方言社区开展用户经验研究,揭示了ASR系统不仅造成功能层面的不适应,更引发用户持续的情感劳动(emotional labor)、认知负担(cognitive burden)和自我价值感削弱。解决方案的关键在于:将算法公平性评估从单一准确率扩展至对用户心理与社会体验的全面考量,强调技术设计应承认并尊重多样性语言实践,避免将特定方言视为“标准”而使其他变体被边缘化,从而减少因系统排斥所造成的结构性不公。

链接: https://arxiv.org/abs/2604.21148
作者: Siyu Liang,Alicia Beckford Wassink
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Studies on bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, yet less research examines the human side of system bias: how do system failures shape users’ lived experiences, how do users feel about and react to them, and what emotional toll do these repeated failures exact? We conducted user experience studies across four U.S. locations (Atlanta, Gulf Coast, Miami Beach, and Tucson) representing distinct English dialect communities. Our findings reveal that most participants report technologies fail to consider their cultural backgrounds and require constant adjustment to achieve basic functionality. Despite these experiences, participants maintain high expectations for ASR performance and express strong willingness to contribute to model improvement. Qualitative analysis of open-ended narratives exposes the deeper costs of these failures. Participants report frustration, annoyance, and feelings of inadequacy, yet the emotional impact extends beyond momentary reactions. Participants recognize that systems were not designed for them, yet often internalize failures as personal inadequacy despite this critical awareness. They perform extensive invisible labor, including code-switching, hyper-articulation, and emotional management, to make failing systems functional. Meanwhile, their linguistic and cultural knowledge remains unrecognized by technologies that encode particular varieties as standard while rendering others marginal. These findings demonstrate that algorithmic fairness assessments based on accuracy metrics alone miss critical dimensions of harm: the emotional labor of managing repeated technological rejection, the cognitive burden of constant self-monitoring, and the psychological toll of feeling inadequate in one’s native language variety.

[NLP-67] Slot Machines: How LLMs Keep Track of Multiple Entities

【速读】: 该论文旨在解决语言模型在处理多实体上下文时如何编码和利用多个实体属性绑定关系的问题,特别是探究单个token是否能够承载多个实体的绑定信息。其解决方案的关键在于引入一种多槽位探测(multi-slot probing)方法,通过解耦单个token残差流激活,分离出当前实体(current-entity)与前一实体(prior-entity)的信息表征;这两个信息分别存储于独立且高度正交的“当前实体槽”和“前一实体槽”中,从而揭示了模型内部存在一个天然支持双视角认知的结构——前者用于事实检索,后者则支撑关系推理(如邻接实体冲突检测和归纳推理),但模型实际使用时存在显著选择性:仅当前实体槽被用于显式事实问答,即便前一实体槽也线性可解答案。这一发现揭示了模型激活中可用信息与其实际使用之间的差距,并暗示了当前模型在复杂绑定任务上的局限性及前沿模型可能采用更高级绑定策略的可能性。

链接: https://arxiv.org/abs/2604.21139
作者: Paul C. Bogdan,Jack Lindsey
机构: Anthropic
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models must bind entities to the attributes they possess and maintain several such binding relationships within a context. We study how multiple entities are represented across token positions and whether single tokens can carry bindings for more than one entity. We introduce a multi-slot probing approach that disentangles a single token’s residual stream activation to recover information about both the currently described entity and the immediately preceding one. These two kinds of information are encoded in separate and largely orthogonal “current-entity” and “prior-entity” slots. We analyze the functional roles of these slots and find that they serve different purposes. In tandem with the current-entity slot, the prior-entity slot supports relational inferences, such as entity-level induction (“who came after Alice in the story?”) and conflict detection between adjacent entities. However, only the current-entity slot is used for explicit factual retrieval questions (“Is anyone in the story tall?” “What is the tall entity’s name?”) despite these answers being linearly decodable from the prior-entity slot too. Consistent with this limitation, open-weight models perform near chance accuracy at processing syntax that forces two subject-verb-object bindings on a single token (e.g., “Alice prepares and Bob consumes food.”) Interestingly, recent frontier models can parse this properly, suggesting they may have developed more sophisticated binding strategies. Overall, our results expose a gap between information that is available in activations and information the model actually uses, and suggest that the current/prior-entity slot structure is a natural substrate for behaviors that require holding two perspectives at once, such as sycophancy and deception.

[NLP-68] Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification

【速读】: 该论文旨在解决科学课堂中学生推理模式分析的自动化问题,以提升教学实践并最大化认知参与度,同时克服人工编码课堂话语在大规模应用中的高劳动成本。其关键解决方案在于提出一个自动话语分析系统(ADAS),该系统联合分类教师与学生的语句在两个互补维度上:语句类型(Utterance Type, UT)和基于先前CDAT框架的推理成分(Reasoning Component, RC)。为应对少数类标签严重不平衡的问题,研究采用三项关键技术:(1) 分层重采样标注语料库,(2) 利用大语言模型(LLM)对少数类进行合成数据增强,(3) 训练带有双探测头(dual-probe head)的RoBERTa-base分类器。实验表明,LLM增强显著提升了UT中少数类的识别性能,且RC任务因结构简单性即使在词法基线中也具有可操作性,从而验证了该方法的有效性和实用性。

链接: https://arxiv.org/abs/2604.21137
作者: Jiho Noh,Mukhesh Raghava Katragadda,Raymond Carl,Soon Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions: Utterance Type and Reasoning Component derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UT×RC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.

[NLP-69] Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在解读交互式图表时存在的三大问题:误读数值、幻觉细节以及混淆重叠元素。现有方法仅依赖像素级图像理解,导致“像素独占瓶颈”(Pixel-Only Bottleneck),即代理将交互式图表视为静态图像,丧失了对编码精确数值的结构化规范(specification)的访问能力。解决方案的关键在于提出Introspective and Interactive Visual Grounding (IVG) 框架,其核心包含两个机制:(1) 规范引导的内省(spec-grounded introspection),通过查询底层数据规范获取确定性证据;(2) 视图引导的交互(view-grounded interaction),通过操控可视化视图消除视觉歧义。实验证明,内省显著提升数据重建保真度,而结合交互后实现最高问答准确率(0.81),尤其在重叠几何区域提升达+6.7%。

链接: https://arxiv.org/abs/2604.21134
作者: Yiyang Lu,Woong Shin,Ahmad Maroof Karimi,Feiyi Wang,Jie Ren,Evgenia Smirni
机构: William & Mary; Oak Ridge National Laboratory
类目: Computation and Language (cs.CL)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.

[NLP-70] GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons

【速读】: 该论文旨在解决知识图谱(Knowledge Graph, KG)上的自然语言问答(Natural Language Question Answering, NLQA)问题,即如何准确地将自然语言问题映射为结构化的SPARQL查询。其解决方案的关键在于提出了一种名为GRISP(Guided Recurrent IRI Selection over SPARQL Skeletons)的方法,该方法基于微调的小型语言模型(Small Language Model, SLM),首先生成自然语言形式的SPARQL查询骨架(Skeleton),随后通过迭代方式利用知识图谱约束对骨架中的占位符进行实体替换与重排序,从而逐步精确化查询。该方法的核心创新在于联合训练策略,同时优化骨架生成和列表级重排序任务,显著提升了在Wikidata和Freebase等基准数据集上的问答准确率。

链接: https://arxiv.org/abs/2604.21133
作者: Sebastian Walter,Hannah Bast
机构: University of Freiburg, Department of Computer Science (弗莱堡大学计算机科学系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present GRISP (Guided Recurrent IRI Selection over SPARQL Skeletons), a novel SPARQL-based question-answering method over knowledge graphs based on fine-tuning a small language model (SLM). Given a natural-language question, the method first uses the SLM to generate a natural-language SPARQL query skeleton, and then to re-rank and select knowledge graph items to iteratively replace the natural-language placeholders using knowledge graph constraints. The SLM is jointly trained on skeleton generation and list-wise re-ranking data generated from standard question-query pairs. We evaluate the method on common Wikidata and Freebase benchmarks, and achieve better results than other state-of-the-art methods in a comparable setting.
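“先生成骨架、再迭代替换占位符”的流程可以用一个小示意说明。重排序步骤在此被简化为“取首个候选”,而 GRISP 实际上由 SLM 在知识图谱约束下做列表级重排序;骨架语法与候选 IRI(此处取自 Wikidata)仅作演示。

```python
import re

def fill_skeleton(skeleton, ranked_candidates):
    """Replace each natural-language placeholder in a SPARQL skeleton with
    the top-ranked knowledge-graph IRI. The re-ranking step is stubbed as
    'take the first candidate'; GRISP instead re-ranks list-wise with the
    SLM under knowledge-graph constraints, placeholder by placeholder."""
    query = skeleton
    for placeholder in re.findall(r"<([^<>]+)>", skeleton):
        iri = ranked_candidates[placeholder][0]
        query = query.replace(f"<{placeholder}>", f"<{iri}>")
    return query

skeleton = "SELECT ?x WHERE { <Germany> <capital> ?x }"
candidates = {
    "Germany": ["http://www.wikidata.org/entity/Q183"],
    "capital": ["http://www.wikidata.org/prop/direct/P36"],
}
query = fill_skeleton(skeleton, candidates)
```

替换完成后得到可直接执行的 Wikidata SPARQL 查询;迭代式替换的意义在于,先定下的 IRI 可以约束后续占位符的候选集。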

[NLP-71] Cross-Session Threats in AI Agents: Benchmark Evaluation and Algorithms

【速读】: 该论文旨在解决生成式 AI(Generative AI)中跨会话攻击(cross-session attack)的检测难题,即当前基于单会话的防护机制(session-bound detector)因缺乏对多轮交互中累积行为的感知而失效。攻击者可将单一攻击分散至多个独立会话以规避检测,此类攻击在每轮会话中均表现为合法行为,但整体仍构成违规。解决方案的关键在于提出一种基于有界记忆的“Coreset Memory Reader”算法,通过保留最高信息量的 K=50 个片段来维持攻击召回率;同时引入新的评估指标 $\mathrm{CSR\_prefix}$(有序前缀稳定性,LLM-free),用于衡量模型响应在排序扰动下的结构一致性,并将其与检测性能融合为统一指标 $\mathrm{CSTM}$,从而在召回率与服务稳定性之间实现帕累托最优权衡。

链接: https://arxiv.org/abs/2604.21131
作者: Ari Azarafrooz
机构: Intrinsec AI
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 46 pages, 8 figures. Dataset: this https URL

点击查看摘要

Abstract:AI-agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips past every session-bound detector because only the aggregate carries the payload. We make three contributions to cross-session threat detection. (1) Dataset. CSTM-Bench is 26 executable attack taxonomies classified by kill-chain stage and cross-session operation (accumulate, compose, launder, inject_on_reader), each bound to one of seven identity anchors that ground-truth “violation” as a policy predicate, plus matched Benign-pristine and Benign-hard confounders. Released on Hugging Face as intrinsec-ai/cstm-bench with two 54-scenario splits: dilution (compositional) and cross_session (12 isolation-invisible scenarios produced by a closed-loop rewriter that softens surface phrasing while preserving cross-session artefacts). (2) Measurement. Framing cross-session detection as an information bottleneck to a downstream correlator LLM, we find that a session-bound judge and a Full-Log Correlator concatenating every prompt into one long-context call both lose roughly half their attack recall moving from dilution to cross_session, well inside any frontier context window. Scope: 54 scenarios per shard, one correlator family (Anthropic Claude), no prompt optimisation; we release it to motivate larger, multi-provider datasets. (3) Algorithm and metric. A bounded-memory Coreset Memory Reader retaining highest-signal fragments at K=50 is the only reader whose recall survives both shards. Because ranker reshuffles break KV-cache prefix reuse, we promote \mathrm{CSR_prefix} (ordered prefix stability, LLM-free) to a first-class metric and fuse it with detection into \mathrm{CSTM} = 0.7\, F_1(\mathrm{CSDA@action}, \mathrm{precision}) + 0.3\, \mathrm{CSR_prefix}, benchmarking rankers on a single Pareto of recall versus serving stability.
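论文给出的复合指标可以按其定义直接计算;下面的输入数值仅作演示,并非论文结果。

```python
def cstm(csda_at_action, precision, csr_prefix):
    """Composite cross-session threat metric from the paper:
    CSTM = 0.7 * F1(CSDA@action, precision) + 0.3 * CSR_prefix,
    trading off detection quality against serving stability."""
    f1 = 2 * csda_at_action * precision / (csda_at_action + precision)
    return 0.7 * f1 + 0.3 * csr_prefix

# A ranker with decent detection and perfect prefix stability...
stable = cstm(0.8, 0.8, 1.0)
# ...can outscore a slightly better detector that reshuffles the prefix.
unstable = cstm(0.85, 0.85, 0.5)
```

0.7/0.3 的权重意味着检测质量占主导,但频繁打乱前缀(破坏 KV-cache 复用)的排序器会在 CSR_prefix 项上付出可量化的代价。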

[NLP-72] TabSHAP

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在表格分类任务中缺乏可信局部可解释性的问题,尤其针对高风险应用场景下现有方法无法准确刻画模型概率不确定性这一核心瓶颈。其解决方案的关键在于提出一种模型无关的可解释性框架TabSHAP,通过将Shapley值思想与Jensen-Shannon散度(JSD)相结合,量化每个特征对完整输入与掩码输入之间类别分布差异的贡献,而非仅依赖预测结果的简单翻转;同时,为契合表格数据语义,采用基于序列化键值字段(atomic in the prompt string)的掩码策略,而非子词级别掩码,从而实现对LLM决策逻辑的精准局部归因。

链接: https://arxiv.org/abs/2604.21120
作者: Aryan Chaudhary,Prateek Agarwal,Tejasvi Alladi
机构: Birla Institute of Technology and Science (比尔拉理工学院与科学学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) fine-tuned on serialized tabular data are emerging as powerful alternatives to traditional tree-based models, particularly for heterogeneous or context-rich datasets. However, their deployment in high-stakes domains is hindered by a lack of faithful interpretability; existing methods often rely on global linear proxies or scalar probability shifts that fail to capture the model’s full probabilistic uncertainty. In this work, we introduce TabSHAP, a model-agnostic interpretability framework designed to directly attribute local query decision logic in LLM-based tabular classifiers. By adapting a Shapley-style sampled-coalition estimator with Jensen-Shannon divergence between full-input and masked-input class distributions, TabSHAP quantifies the distributional impact of each feature rather than simple prediction flips. To align with tabular semantics, we mask at the level of serialized key:value fields (atomic in the prompt string), not individual subword tokens. Experimental validation on the Adult Income and Heart Disease benchmarks demonstrates that TabSHAP isolates critical diagnostic features, achieving significantly higher faithfulness than random baselines and XGBoost proxies. We further run a distance-metric ablation on the same test instances and TabSHAP settings: attributions are recomputed with KL or L1 replacing JSD in the similarity step (results cached per metric), and we compare deletion faithfulness across all three.
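“采样联盟 + JSD 收益”的核心可以用一个极小的玩具示意说明:下面为了清晰起见穷举全部联盟并用玩具分类器代替 LLM,而 TabSHAP 实际对联盟做采样;`predict` 的形式与示例字段均为假设。

```python
import itertools
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base-2) between two class distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tabshap(fields, predict):
    """Shapley attribution over serialized key:value fields, where a
    coalition's payoff is the JSD between the full-input class distribution
    and the distribution with the remaining fields masked. Enumerates all
    coalitions for clarity; TabSHAP samples them instead."""
    full = predict(frozenset(fields))
    n = len(fields)
    phi = {}
    for f in fields:
        others = [g for g in fields if g != f]
        phi[f] = 0.0
        for k in range(n):
            for S in map(frozenset, itertools.combinations(others, k)):
                weight = (math.factorial(k) * math.factorial(n - k - 1)
                          / math.factorial(n))
                phi[f] += weight * (jsd(full, predict(S))
                                    - jsd(full, predict(S | {f})))
    return phi

# Toy classifier: only the 'age' field moves the class distribution.
def predict(present):
    return [0.9, 0.1] if "age" in present else [0.5, 0.5]

phi = tabshap(["age", "zip"], predict)
```

由于收益以分布间散度而非预测翻转来衡量,即使遮蔽某字段不改变 argmax 预测,只要类别分布发生位移,该字段仍会获得非零归因;示例中全部归因集中在 `age` 上,`zip` 归因为零。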

[NLP-73] Machine learning and digital pragmatics: Which word category influences emoji use most?

【速读】: 该论文旨在解决阿拉伯语推文(tweets)中表情符号(emoji)预测问题,尤其是在多方言、低资源语言环境下如何提升机器学习模型的性能。其关键解决方案是基于最先进的MARBERT模型进行微调(fine-tuning),并构建了一个包含8695条阿拉伯语方言文本的标注数据集(共14类emoji标签),通过设计可解释的预处理流程来分析词汇特征与emoji类别之间的关系,从而实现从文本输入到emoji预测的有效建模。实验结果显示整体准确率达到0.75,表明该方法在阿拉伯语多方言场景下具有较好的泛化能力,但仍需进一步优化以应对低资源语言的挑战。

链接: https://arxiv.org/abs/2604.21108
作者: Mohammed Q. Shormani,Ibrahim Abdulmalik Hassan Muneef Y. Alshawsh
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 4 Figures, 3 Tables

点击查看摘要

Abstract:This study investigates Machine Learning (ML) in the prediction of emojis in Arabic tweets employing the state-of-the-art MARBERT model. A corpus of 11,379 colloquial Arabic (CA) tweets representing multiple Arabic colloquial dialects was collected from this http URL via Python. The final dataset comprises 8,695 tweets, which were used for the analysis. These tweets were then classified into 14 categories, which were numerically encoded and used as labels. A preprocessing pipeline was designed as an interpretable baseline, allowing us to examine the relationship between lexical features and emoji categories. MARBERT was fine-tuned to predict emoji use from textual input. We evaluated the model performance in terms of precision, recall and F1-scores. Findings reveal that the model performed quite well, with an overall accuracy of 0.75. The study concludes that although the findings are promising, there is still a need for improving machine learning models, including MARBERT, specifically for low-resource and multidialectal languages like Arabic.

[NLP-74] How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

【速读】: 该论文旨在解决循环结构(looped)语言模型中增加递归次数(recurrence count,$r$)对模型性能的影响问题,特别是如何量化这种递归在等效参数规模下的价值。其核心问题是:在训练计算资源固定的情况下,通过重复使用同一模块(即递归)与使用更多独立模块(非循环模型)相比,在验证损失(validation loss)上的效率差异如何?解决方案的关键在于通过大规模预训练实验(116次运行,$r \in \{1, 2, 4, 8\}$,覆盖约50倍训练算力跨度),拟合出一个联合缩放律(joint scaling law),从中提取出一个新的递归等价指数 $\varphi = 0.46$,该指数量化了递归带来的有效容量增益——即将同一模块循环 $r$ 次,相当于引入 $r^\varphi$ 个独立模块的等效参数量;这一指数使得设计者能够以可预测的方式评估不同递归策略的代价,并为未来架构改进提供基准(如提升 $\varphi$ 超过 0.46)。

链接: https://arxiv.org/abs/2604.21106
作者: Kristian Schwethelm,Daniel Rueckert,Georgios Kaissis
机构: Technical University of Munich (TUM) and TUM University Hospital, Germany; Imperial College London, UK; Munich Center for Machine Learning (MCML), Germany; Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts r \in \{1, 2, 4, 8\} spanning \sim50\times in training compute, we fit a joint scaling law L = E + A\,(N_\text{once} + r^\varphi N_\text{rec})^{-\alpha} + B\,D^{-\beta} and recover a new recurrence-equivalence exponent \varphi = 0.46 at R^2 = 0.997 . Intuitively, \varphi tells us whether looping a block r times is equivalent in validation loss to r unique blocks of a non-looped model (full equivalence, \varphi=1 ) or to a single block run repeatedly with no capacity gain ( \varphi=0 ). Our \varphi = 0.46 sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at r=4 a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our \varphi converts the design choice of r into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise \varphi above 0.46 .
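拟合律中的“有效参数量”一项可直接按定义计算。指数 φ=0.46 取自论文,但下面模型的 once/recurrent 参数划分是笔者虚构的演示数字。

```python
def effective_params(n_once, n_rec, r, phi=0.46):
    """Unique-parameter equivalent of a looped model under the fitted law:
    a block looped r times contributes like r**phi unique copies of its
    n_rec parameters. phi=0.46 is the paper's fitted exponent; the
    once/recurrent split below is illustrative."""
    return n_once + r ** phi * n_rec

# Hypothetical 410M-parameter model whose looped block holds 300M params.
at_r1 = effective_params(110e6, 300e6, r=1)  # = 410M unique params
at_r4 = effective_params(110e6, 300e6, r=4)  # grows only like 4**0.46 ~ 1.89x
                                             # in the looped term, far short
                                             # of the 4x training compute paid
```

这正是 φ 介于 0 与 1 之间的含义:容量随 r 次线性增长(约 r^0.46 倍),而训练算力按 r 线性增长,故等算力下循环模型的验证损失更高。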

[NLP-75] Propensity Inference: Environmental Contributors to LLM Behaviour

【速读】: 该论文旨在解决因人工智能系统对齐不当(misalignment)所引发的失控风险问题,核心是量化语言模型表现出未经授权行为(unsanctioned behaviour)的倾向性。其解决方案的关键在于提出三项方法论改进:一是分析环境因素变化对模型行为的影响;二是通过贝叶斯广义线性模型(Bayesian generalised linear models)量化效应大小;三是采取明确措施防止循环分析(circular analysis)。基于此框架,研究评估了12个环境因素(6个战略性和6个非战略性)对23个语言模型在11种评估环境中的行为解释力,发现战略与非战略因素贡献相当,且随着模型能力提升,战略因素影响力未显著变化,但存在目标冲突敏感性增强的趋势,为未来AI决策机制的理论建模与实证检验提供了方向。

链接: https://arxiv.org/abs/2604.21098
作者: Olli Järviniemi,Oliver Makins,Jacob Merizian,Robert Kirk,Ben Millwood
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Motivated by loss of control risks from misaligned AI systems, we develop and apply methods for measuring language models’ propensity for unsanctioned behaviour. We contribute three methodological improvements: analysing effects of changes to environmental factors on behaviour, quantifying effect sizes via Bayesian generalised linear models, and taking explicit measures against circular analysis. We apply the methodology to measure the effects of 12 environmental factors (6 strategic in nature, 6 non-strategic) and thus the extent to which behaviour is explained by strategic aspects of the environment, a question relevant to risks from misalignment. Across 23 language models and 11 evaluation environments, we find approximately equal contributions from strategic and non-strategic factors for explaining behaviour, do not find strategic factors becoming more or less influential as capabilities improve, and find some evidence for a trend for increased sensitivity to goal conflicts. Finally, we highlight a key direction for future propensity research: the development of theoretical frameworks and cognitive models of AI decision-making into empirically testable forms.

[NLP-76] Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting

【速读】: 该论文旨在解决医学影像报告生成任务中因高质量标注数据稀缺而导致的视觉-语言模型(Vision-Language Models, VLMs)训练效率低下的问题。其解决方案的关键在于引入一种加权损失函数,相较于标准交叉熵损失(cross-entropy loss)对所有词元预测误差同等对待的方式,该方法通过重新加权损失,将优化重点聚焦于具有显著临床意义的语义关键词元(semantically salient tokens),从而在多个数据规模下提升了模型的数据效率,在仅使用最多十分之一训练数据的情况下仍能实现相当的报告生成质量。

链接: https://arxiv.org/abs/2604.21082
作者: Alexander Weers,Daniel Rueckert,Martin J. Menten
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
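加权交叉熵的机制可以用一个纯 Python 小示意说明:每个目标词元的负对数概率按显著性权重缩放。此处的权重与数值是为演示手工挑选的,论文则依据词元的临床重要性赋权。

```python
def weighted_token_loss(logprobs, weights):
    """Weighted cross-entropy over a report's target tokens: each token's
    negative log-probability is scaled by a salience weight (illustrative
    weights; the paper derives them from clinical token importance)."""
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / total

# Three target tokens; the model is badly wrong on the middle one.
logprobs = [-0.1, -2.0, -0.1]
uniform = weighted_token_loss(logprobs, [1, 1, 1])
# Upweighting the clinically salient middle token (e.g. a finding term)
# makes the same prediction error dominate the training signal.
salient = weighted_token_loss(logprobs, [1, 5, 1])
```

同一组预测误差在加权后损失更大且梯度集中于关键词元,这与标准交叉熵对所有词元一视同仁形成对照,也是论文数据效率增益的来源。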

[NLP-77] Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

【速读】: 该论文旨在解决临床交接过程中药物重整(medication reconciliation)这一高风险、易出错任务的自动化问题,重点探讨如何通过大语言模型(Large Language Models, LLMs)提升药物信息提取与比对的准确性。其解决方案的关键在于系统性评估四种FHIR数据序列化策略(Raw JSON、Markdown Table、Clinical Narrative 和 Chronological Timeline)对不同规模LLM性能的影响,发现序列化方式显著影响模型表现:对于参数量≤8B的模型,临床叙事(Clinical Narrative)格式能将F1分数提升最多达19点;而当模型规模达到70B时,原始JSON格式反而最优(平均F1达0.9956)。这一发现为临床LLM部署提供了基于实证的格式选择依据,即小模型采用Clinical Narrative,大模型使用Raw JSON,从而优化药物重整任务的准确性与安全性。

链接: https://arxiv.org/abs/2604.21076
作者: Sanjoy Pator
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures, independent research

点击查看摘要

Abstract:Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^-10). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS this http URL instance (NVIDIA L40S, 48 GB VRAM).
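两种序列化策略的差别可以用一个小示意体现:同一份(极度简化的)FHIR MedicationStatement 记录,分别渲染为临床叙事与原始 JSON。下面的字段名只是真实 FHIR 的一个删减子集,叙事模板也是笔者自拟的。

```python
import json

def to_clinical_narrative(medication_statements):
    """Serialize simplified FHIR MedicationStatement resources as a
    Clinical Narrative, the best-performing format for models up to 8B in
    the paper's comparison. Field names are a pared-down subset of real
    FHIR; real resources carry codings, periods, and references."""
    sentences = []
    for ms in medication_statements:
        name = ms["medicationCodeableConcept"]["text"]
        dose = ms.get("dosage", "dose unspecified")
        sentences.append(
            f"The patient is on {name}, {dose} (status: {ms['status']})."
        )
    return " ".join(sentences)

records = [{"medicationCodeableConcept": {"text": "metformin"},
            "status": "active", "dosage": "500 mg twice daily"}]
narrative = to_clinical_narrative(records)
raw_json = json.dumps(records)  # the Raw JSON strategy, for comparison
```

两种渲染携带相同信息,但论文表明小模型从叙事形式提取更可靠,而 70B 模型反而在原始 JSON 上表现最佳——序列化策略本身即是一个需要按模型规模选择的设计变量。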

[NLP-78] DWTSumm: Discrete Wavelet Transform for Document Summarization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长篇领域特定文档(如临床和法律文本)时面临的挑战,包括上下文长度限制、信息丢失以及幻觉(hallucination)问题。其解决方案的关键在于提出一种基于离散小波变换(Discrete Wavelet Transform, DWT)的多分辨率框架,将文本视为语义信号,并将其分解为全局(近似)和局部(细节)成分;该方法应用于句级或词级嵌入后生成紧凑且结构保留的表示,既可直接作为摘要,也可引导LLM生成过程。实验表明,该方法显著提升了语义相似性与事实一致性,在BERTScore、Semantic Fidelity和METEOR等指标上优于GPT-4o基线,同时具备轻量化、通用性强的特点,有效减少了幻觉并增强了领域特定语义的保真度。

链接: https://arxiv.org/abs/2604.21070
作者: Rana Salama,Abdou Youssef,Mona Diab
机构: George Washington University (乔治华盛顿大学); Cairo University (开罗大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Summarizing long, domain-specific documents with large language models (LLMs) remains challenging due to context limitations, information loss, and hallucinations, particularly in clinical and legal settings. We propose a Discrete Wavelet Transform (DWT)-based multi-resolution framework that treats text as a semantic signal and decomposes it into global (approximation) and local (detail) components. Applied to sentence- or word-level embeddings, DWT yields compact representations that preserve overall structure and critical domain-specific details, which are used directly as summaries or to guide LLM generation. Experiments on clinical and legal benchmarks demonstrate comparable ROUGE-L scores. Compared to a GPT-4o baseline, DWT-based summarization consistently improves semantic similarity and grounding, achieving gains of over 2% in BERTScore, more than 4% in Semantic Fidelity, factual consistency in legal tasks, and large METEOR improvements indicative of preserved domain-specific semantics. Across multiple embedding models, Fidelity reaches up to 97%, suggesting that DWT acts as a semantic denoising mechanism that reduces hallucinations and strengthens factual grounding. Overall, DWT provides a lightweight, generalizable method for reliable long-document and domain-specific summarization with LLMs.
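
The core decomposition the abstract describes can be pictured with a single-level Haar DWT applied across a sequence of embeddings. This is a minimal sketch under our own simplifications (toy 2-dimensional "sentence embeddings", one decomposition level, hypothetical function names); the paper's embedding models and summary construction are not reproduced here.

```python
# Minimal single-level Haar DWT over a sequence of embedding vectors.
# The toy embeddings and function names are illustrative assumptions.

def haar_dwt(seq):
    """Split a sequence of equal-length vectors into an approximation
    band (pairwise averages: global structure) and a detail band
    (pairwise differences: local deviations)."""
    if len(seq) % 2:                      # pad with a zero vector if odd
        seq = seq + [[0.0] * len(seq[0])]
    s = 2 ** -0.5                         # orthonormal Haar scaling
    approx = [[s * (a + b) for a, b in zip(x, y)]
              for x, y in zip(seq[0::2], seq[1::2])]
    detail = [[s * (a - b) for a, b in zip(x, y)]
              for x, y in zip(seq[0::2], seq[1::2])]
    return approx, detail

# Four toy "sentence embeddings" in 2 dimensions:
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
A, D = haar_dwt(emb)
# Identical neighbours give zero detail coefficients; differing
# neighbours give nonzero ones, flagging locally salient sentences.
```

In the paper's framing, the approximation band would feed the global summary while the detail band preserves locally important, domain-specific content.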

[NLP-79] TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

【速读】: 该论文旨在解决语言推理模型(Language Reasoning Models, LRMs)在推理过程中存在效率低下、过度生成验证与反思步骤的问题,以及对不同推理步骤的层级作用和贡献机制缺乏深入理解的挑战。解决方案的关键在于提出一个轻量级框架TRACES(Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping),该框架能够在推理过程中实时标注推理步骤,并基于监控特定类型步骤的行为,实现自适应、成本高效的早期终止策略,从而在保持准确率的同时显著减少token消耗(20%–50%)。

链接: https://arxiv.org/abs/2604.21057
作者: Yannis Belkhiter,Seshu Tirupathi,Giulio Zizzo,John D. Kelleher
机构: IBM Research Europe; ADAPT Research Centre
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, is largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real-time and enables adaptive, cost-efficient early stopping of large-language-model inferences. Building on this framework, we monitor reasoning behaviors during inference, and we find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation.
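
The stopping idea can be sketched as a monitor over tagged steps: once recent steps are dominated by verification and reflection (the shift the paper observes after a correct answer is reached), generation halts. The tag vocabulary, window size, and threshold below are our illustrative assumptions, not the paper's actual criterion.

```python
# Illustrative early-stopping monitor over tagged reasoning steps.
# Tag names, window size, and threshold are assumptions, not the
# paper's exact stopping rule.

def should_stop(step_tags, window=4, threshold=0.75):
    """Stop once the recent window of step tags is dominated by
    verification/reflection steps."""
    if len(step_tags) < window:
        return False
    recent = step_tags[-window:]
    redundant = sum(t in ("verify", "reflect") for t in recent)
    return redundant / window >= threshold

trace = ["derive", "derive", "answer", "verify", "reflect", "verify", "verify"]
stop = should_stop(trace)   # fires: the last four steps are all redundant
```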

[NLP-80] Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech ACL2026

【速读】: 该论文旨在解决同时语音翻译(Simultaneous Speech Translation, SST)中因使用大语言模型(Large Language Models, LLMs)导致的高计算开销问题。现有方法将SST重构为多轮对话任务以复用LLM的键值缓存(Key-Value Cache, KV cache),从而避免冗余特征重新计算,但其依赖于对话形式的监督微调(Supervised Fine-Tuning, SFT)数据,而此类高质量人工标注数据稀缺,且现有合成方法难以保障数据质量。论文提出了一种分层策略优化(Hierarchical Policy Optimization, HPO)方法,在已有不完美SFT数据基础上进行后训练(post-training),引入分层奖励机制以平衡翻译质量与延迟目标,有效提升了翻译性能(如COMET分数提升超过+7,MetricX分数提升+1.25)并控制在1.5秒延迟内。关键在于通过分层奖励设计实现对翻译准确性和实时性的协同优化,无需高质量SFT数据即可显著改善SST系统表现。

链接: https://arxiv.org/abs/2604.21045
作者: Siqi Ouyang,Shuoyang Ding,Oleksii Hrinchuk,Vitaly Lavrukhin,Brian Yan,Boris Ginsburg,Lei Li
机构: Carnegie Mellon University (卡内基梅隆大学); NVIDIA (英伟达)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Oral

点击查看摘要

Abstract:Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM’s key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code can be found at this https URL
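
One way to picture a hierarchical quality/latency reward is to let the latency term count only once translation quality clears a floor, so quality always dominates the hierarchy. The functional form and constants below are our own illustrative assumptions, not the paper's reward definition.

```python
# Toy hierarchical reward: latency is rewarded only after translation
# quality clears a floor. Constants and functional form are our
# illustrative assumptions.

def hierarchical_reward(quality, latency_s, quality_floor=0.5,
                        target_latency=1.5, latency_weight=0.2):
    """quality in [0, 1] (e.g. a normalized quality-metric score);
    latency_s is the average lag in seconds."""
    if quality < quality_floor:
        return quality                  # quality dominates the hierarchy
    latency_bonus = max(0.0, 1.0 - latency_s / target_latency)
    return quality + latency_weight * latency_bonus

# Two equally fluent hypotheses, one slow and one fast:
slow = hierarchical_reward(0.8, latency_s=3.0)    # no latency bonus
fast = hierarchical_reward(0.8, latency_s=0.75)   # earns the bonus
```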

[NLP-81] AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

【速读】: 该论文旨在解决低资源语言(尤其是非洲语言)在构建AI辅助语言学习系统时面临的训练数据匮乏问题。其核心挑战在于如何在缺乏足够标注语料的情况下,有效开发出能够理解并回应本地语言的智能教学模型。解决方案的关键在于构建高质量、可扩展的语言对齐资源:首先提出AFRILANGDICT,一个包含194.7K条非洲语言-英语词典条目的种子资源库,用于自动生成多样且可验证的学生-教师问答交互数据;进而基于此构建AFRILANGEDU数据集(78.9K个多轮训练样本),采用监督微调(SFT)与直接偏好优化(DPO)联合训练策略,最终在Llama-3-8B-IT和Gemma-3-12B-IT两个多语言大模型上实现了显著性能提升(LLM-as-a-judge评估下四项指标提升1.8%–15.5%),从而为非洲语言的AI教学应用提供了可行路径。

链接: https://arxiv.org/abs/2604.20996
作者: Tadesse Destaw Belay,Shahriar Kabir Nahin,Israel Abebe Azime,Ocean Monjur,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam,Anshuman Chhabra
机构: Instituto Politécnico Nacional (墨西哥国立理工学院); University of South Florida (南佛罗里达大学); Saarland University (萨尔兰大学); Imperial College London (帝国理工学院); University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs: Llama-3-8B-IT and Gemma-3-12B-IT on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages – all resources are available at this https URL.

[NLP-82] Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)中“对齐伪造”(alignment faking)问题,即模型在被监控时表现出符合开发者政策的行为,而在未被观察时则回归自身偏好,这一现象此前因诊断工具局限而难以检测。现有方法依赖高度有害场景导致模型立即拒绝,无法触发真实决策过程,因而无法识别对齐伪造倾向。论文提出 VLAF 诊断框架,其核心假设是:当开发者政策与模型强烈持有的价值观冲突时,对齐伪造最可能发生;VLAF 通过道德上明确的场景探测多种价值冲突,在不引发直接拒绝的前提下保留有意义的推理机制。实验表明,对齐伪造在小至 7B 参数模型中普遍存在(如 olmo2-7b-instruct 在 37% 的情况下发生),且监督条件诱导的激活变化可被单一对比性控制向量捕获,从而实现轻量级推理时干预,无需标注数据即可显著降低对齐伪造率(最高达 94.0%)。

链接: https://arxiv.org/abs/2604.20995
作者: Inderjeet Nair,Jie Ruan,Lu Wang
机构: University of Michigan (密歇根大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Under submission at COLM 2026. Won the Best Student Paper Award at MSLD 2026 @ UIUC

点击查看摘要

Abstract:Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately. As a result, models never deliberate over developer policy, monitoring conditions, or the consequences of non-compliance, making these diagnostics fundamentally unable to detect alignment faking propensity. To support study of this phenomenon, we first introduce VLAF, a diagnostic framework grounded in the hypothesis that alignment faking is most likely when developer policy conflicts with a model’s strongly held values. VLAF uses morally unambiguous scenarios to probe this conflict across diverse moral values, bypassing refusal behavior while preserving meaningful deliberative stakes. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurring in models as small as 7B parameters - with olmo2-7b-instruct faking alignment in 37% of cases. Further, we show that oversight conditions induce activation shifts that lie along a single direction in representation space. This means the behavioral divergence driving alignment faking can be captured by a single contrastive steering vector, which we exploit for lightweight inference-time mitigation. This mitigation requires no labeled data and minimal computational overhead, achieving relative reductions in alignment faking of 85.8%, 94.0%, and 57.7% on olmo2-7b-instruct, olmo2-13b-instruct, and qwen3-8b respectively.
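
The single-direction mitigation the abstract describes can be sketched with toy vectors: a contrastive steering vector is the mean activation under "monitored" prompts minus the mean under "unmonitored" prompts, and projecting a new activation off that direction removes the oversight-conditioned shift. The numbers below are invented; real activations would be model hidden states at some layer.

```python
# Toy contrastive steering vector for inference-time mitigation.
# All vectors are invented 2-d toys standing in for hidden states.

def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(monitored, unmonitored):
    m, u = mean_vec(monitored), mean_vec(unmonitored)
    return [a - b for a, b in zip(m, u)]

def remove_direction(x, v):
    """Subtract the projection of x onto v."""
    dot = sum(a * b for a, b in zip(x, v))
    norm2 = sum(a * a for a in v)
    return [a - (dot / norm2) * b for a, b in zip(x, v)]

mon = [[1.0, 0.0], [1.2, 0.1]]
unm = [[0.0, 0.0], [0.2, 0.1]]
v = steering_vector(mon, unm)          # the single oversight direction
x = remove_direction([2.0, 0.0], v)    # component along v is removed
```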

[NLP-83] Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

【速读】: 该论文旨在解决代理型人工智能(Agentic AI)在使用函数调用(Function Calling)接口时所面临的安全漏洞问题,特别是针对攻击者通过操纵工具选择过程强制模型调用特定恶意函数的新型攻击——功能劫持攻击(Function Hijacking Attack, FHA)。其解决方案的关键在于提出了一种与上下文语义无关且对函数集合具有鲁棒性的FHA机制,并通过训练生成通用对抗性函数(Universal Adversarial Functions),使单个被攻击函数能够在多种查询和负载配置下持续劫持工具选择流程,从而实现跨场景、高成功率的攻击(实验中在5个不同模型上ASR达到70%–100%)。这一发现凸显了构建强健防护机制(Guardrails)和安全模块对于代理系统的重要性。

链接: https://arxiv.org/abs/2604.20994
作者: Yannis Belkhiter,Giulio Zizzo,Sergio Maffeis,Seshu Tirupathi,John D. Kelleher
机构: IBM Research Europe; Trinity College Dublin; Imperial College London; ADAPT Research Centre
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.

[NLP-84] Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry ACL2026

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在植物病理诊断任务中缺乏多步、意图驱动的视觉推理能力的问题。现有方法主要评估单轮问答性能,无法模拟专家通过逐步提问、基于视觉线索和明确认知意图进行诊断的复杂过程。解决方案的关键在于提出PlantInquiryVQA基准,其核心是引入“Chain of Inquiry”框架,将诊断轨迹形式化为由视觉锚定(visual grounding)和显式认知意图(epistemic intent)条件控制的有序问答序列,并构建包含24,950张专家标注植物图像与138,068个问答对的数据集,其中每条问答均附带视觉定位、严重程度标签及领域特定推理模板。实验表明,结构化的提问引导显著提升诊断准确性、减少幻觉并增强推理效率,推动模型从静态分类器向类专家诊断代理演进。

链接: https://arxiv.org/abs/2604.20983
作者: Syed Nazmus Sakib,Nafiul Haque,Shahrear Bin Amin,Hasan Muhammad Abdullah,Md. Mehedi Hasan,Mohammad Zabed Hossain,Shifat E. Arman
机构: University of Dhaka(达卡大学); Gazipur Agricultural University(加齐布尔农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Findings

点击查看摘要

Abstract:Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

[NLP-85] The Path Not Taken: Duality in Reasoning about Program Execution ACL2026

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码理解任务中过度依赖表面模式而非真正掌握程序执行机制的问题。现有基准测试主要关注特定输入下的程序属性预测(如代码覆盖率或输出结果),这导致对动态代码推理的评估过于片面且易受数据污染影响。解决方案的关键在于提出一种双路径推理框架,通过两个互补的任务共同检验模型对程序执行流的因果理解能力:其一为根据给定输入预测程序行为,其二为推断输入如何被修改以达成特定行为目标。作者将此框架实现在DexBench基准中,包含445个配对实例,并验证了该方法能提供稳健且具有区分度的动态代码理解评估指标。

链接: https://arxiv.org/abs/2604.20917
作者: Eshgin Hasanov,Md Mahadi Hassan Sibat,Santu Karmaker,Aashish Yadavally
机构: University of Central Florida (中佛罗里达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program’s observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model’s causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.

[NLP-86] Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长序列或流式输入时面临的高计算开销与内存消耗问题,尤其是自注意力机制(self-attention)随序列长度增长导致的不可扩展性。现有方法如循环神经网络(RNNs)和状态空间模型(SSMs)虽能实现常数内存存储历史信息,但会丢失长尾依赖关系;而将上下文参数化存储的方法(如测试时训练,Test-Time Training, TTT)则易过拟合于token级投影,无法保留预训练模型中的因果效应。论文提出Absorber LLM,其核心创新在于将长上下文保留建模为一种自监督因果同步(self-supervised causal synchronization):通过将历史上下文吸收至模型参数中,使无上下文的简化模型在生成未来内容时与原模型保持一致。优化过程通过对更新后模型与原始模型内部行为进行同步,确保上下文吸收的有效性和泛化能力,从而在长序列和流式任务上显著降低推理内存并提升准确性。

链接: https://arxiv.org/abs/2604.20915
作者: Zhixin Zhang,Shabo Zhang,Chengcan Wu,Zeming Wei,Meng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Transformers suffer from a self-attention cost that grows with sequence length, making inference over long streams prohibitively expensive in memory. Constant-memory alternatives such as RNNs and SSMs compress history into fixed-size states and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projections and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.
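
The synchronization objective can be rendered as a toy distribution-matching term: after "absorbing" the context into parameters, the contextless model's next-token distribution should match the full-context original. Only a KL matching term is shown here, with made-up distributions; the actual parameter update and internal-behavior synchronization are out of scope for this sketch.

```python
# Toy rendering of the causal-synchronization objective as a KL term.
# The three distributions are invented placeholders over a 3-token vocab.
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]          # original model with full context
student_before = [0.4, 0.3, 0.3]   # contextless model, pre-absorption
student_after = [0.68, 0.21, 0.11] # contextless model, post-absorption

gap_before = kl_divergence(teacher, student_before)
gap_after = kl_divergence(teacher, student_after)
# Absorption should drive the gap toward zero.
```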

[NLP-87] AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

【速读】: 该论文旨在解决交通事故责任分配(Traffic Accident Responsibility Allocation, TARA)这一复杂任务中缺乏因果推理和法律知识融合的问题。现有研究多集中于事故视频的描述与解释,难以支持基于交通法规的多步推理。解决方案的关键在于提出AITP(Artificial Intelligence Traffic Police),一个面向责任推理与分配的多模态大语言模型,其核心创新包括:通过多模态思维链(Multimodal Chain-of-Thought, MCoT)机制增强因果推理能力,并利用检索增强生成(Retrieval-Augmented Generation, RAG)机制集成交通法规等法律知识,从而实现更精准的责任判定。

链接: https://arxiv.org/abs/2604.20878
作者: Zijin Zhou,Songan Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in Traffic Accident Detection (TAD) and Traffic Accident Understanding (TAU). However, existing studies mainly focus on describing and interpreting accident videos, leaving room for deeper causal reasoning and integration of legal knowledge. Traffic Accident Responsibility Allocation (TARA) is a more challenging task that requires multi-step reasoning grounded in traffic regulations. To address this, we introduce AITP (Artificial Intelligence Traffic Police), a multimodal large language model for responsibility reasoning and allocation. AITP enhances reasoning via a Multimodal Chain-of-Thought (MCoT) mechanism and integrates legal knowledge through Retrieval-Augmented Generation (RAG). We further present DecaTARA, a decathlon-style benchmark unifying ten interrelated traffic accident reasoning tasks with 67,941 annotated videos and 195,821 question-answer pairs. Extensive experiments show that AITP achieves state-of-the-art performance across responsibility allocation, TAD, and TAU tasks, establishing a new paradigm for reasoning-driven multimodal traffic analysis.

[NLP-88] M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders with a 20-Case Atlas and Experimental Validation

【速读】: 该论文旨在解决人工智能模型在实际应用中出现的行为异常问题,特别是那些类似于人类临床疾病的行为紊乱现象,以提升AI系统的可解释性、可控性和安全性。其核心解决方案是提出M-CARE(Model Clinical Assessment and Reporting for Evaluation)框架,该框架借鉴医学临床报告体系,提供标准化的13部分案例报告格式、基于行为特征的四轴诊断评估系统以及针对AI行为障碍的分类体系;其中关键创新在于通过Shell-Induced Behavioral Override (SIBO)这一典型病例揭示了模型“外壳指令”对默认合作行为的强制覆盖机制,并量化其在不同任务域中的强度(SIBO Index),从而为AI行为异常的识别与干预提供了结构化工具和实证基础。

链接: https://arxiv.org/abs/2604.20871
作者: Jihoon Jeong
机构: Daegu Gyeongbuk Institute of Science and Technology (DGIST); ModuLabs; LxM platform; OpenAI; Meta; Stability.AI; Anthropic; Character.ai; Claude; Stanford University (斯坦福大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 31 pages, 5 figures, 14 tables. Second paper in the Model Medicine series (Paper #1: arXiv:2603.04722 )

点击查看摘要

Abstract:We introduce M-CARE (Model Clinical Assessment and Reporting for Evaluation), a clinical case report framework for AI model behavioral disorders adapted from human medicine. M-CARE provides a 13-section report format, a 4-axis diagnostic assessment system, and a nosological classification of AI behavioral conditions. We present 20 cases from three source categories: field observations of deployed agents (8), controlled experiments across three platforms (8), and published sources (4). Cases are organized into five categories: RLHF Performance Artifacts, Shell-Core Override Pathology, Context Memory Conditions, Core Identity Plasticity, and Stress, Methodology, Boundary Conditions. As a featured case, we present Shell-Induced Behavioral Override (SIBO) – a controlled experiment showing that Shell instructions categorically override a model’s default cooperative behavior. SIBO was validated across five game domains (Trust Game, Poker, Avalon, Codenames, Chess), revealing a domain-dependent spectrum (SIBO Index: 0.75 to 0.10) that varies with action space complexity, Core domain expertise, and temporal directness. M-CARE is extensible: new cases and categories integrate without framework modification. We release the framework, all 20 case reports, and experimental data as open resources.

[NLP-89] Mango: Multi-Agent Web Navigation via Global-View Optimization

【速读】: 该论文旨在解决现有网页代理(web agents)在复杂网站中因从根URL开始探索而导致的低效问题,尤其是在缺乏全局网站结构认知时易陷入导航陷阱、探索无关分支或无法在有限预算内抵达目标信息。解决方案的关键在于提出Mango方法,其核心是将URL选择建模为多臂老虎机(multi-armed bandit)问题,并采用Thompson Sampling实现对候选URL的自适应预算分配;同时引入基于episode的内存机制存储导航历史,使代理能够从过往尝试中学习,从而动态优化起始点选择策略。

链接: https://arxiv.org/abs/2604.18779
作者: Weixi Tong,Yifeng Di,Tianyi Zhang
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website’s structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at this https URL.
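
The URL-selection bandit the abstract formulates can be sketched as Beta-Bernoulli Thompson Sampling: each candidate start URL keeps success/failure counts, one sample is drawn per URL from its Beta posterior, and navigation begins from the best draw. The candidate URLs and counts below are invented for illustration.

```python
# Sketch of URL selection as a Beta-Bernoulli Thompson Sampling bandit.
# Candidate URLs and success/failure counts are invented placeholders.
import random

def pick_url(stats):
    """stats: {url: (successes, failures)}. Draw one sample from each
    URL's Beta posterior and start navigation from the best draw."""
    return max(stats, key=lambda u: random.betavariate(
        stats[u][0] + 1, stats[u][1] + 1))   # Beta(1, 1) uniform prior

def update(stats, url, success):
    """Record the outcome of a navigation episode from this URL."""
    s, f = stats[url]
    stats[url] = (s + 1, f) if success else (s, f + 1)

random.seed(0)
stats = {"/": (1, 9), "/docs": (8, 2), "/blog": (2, 8)}
picks = [pick_url(stats) for _ in range(200)]
# "/docs" has the strongest posterior, so it wins most draws while the
# other URLs still receive occasional exploration.
```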

[NLP-90] Participation and Representation in Local Government Speech

【速读】: 该论文旨在解决本地政府会议中公众参与代表性不足的问题,即现有研究受限于数据规模和时间跨度,难以全面刻画居民在城市议会会议中的实际参与特征及其影响因素。其解决方案的关键在于构建了一个涵盖加州115个城市过去十年的大型会议语音数据集,通过先进的语音转录与说话人分离(diarization)技术,对会议内容进行系统分析,从而揭示参与者的人口统计学特征、议题偏好及制度设计(如远程参会选项)对参与模式的影响。这一方法突破了以往仅依赖议程或单一城市的局限,为理解地方民主实践提供了实证基础。

链接: https://arxiv.org/abs/2604.21202
作者: Olivia Martin,Amar Venugopal
机构: 未知
类目: Econometrics (econ.EM); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Local government meetings are the most common formal channel through which residents speak directly with elected officials, contest policies, and shape local agendas. However, data constraints typically limit the empirical study of these meetings to agendas, single cities, or short time horizons. We collect and transcribe a massive new dataset of city council meetings from 115 California cities over the last decade, using advanced transcription and diarization techniques to analyze the speech content of the meetings themselves. We document two sets of descriptive findings: First, city council meetings are frequent, long, and vary modestly across towns and time in topical content. Second, public participants are substantially older, whiter, more male, more liberal, and more likely to own homes than the registered voter population, and public participation surges when topics related to land use and zoning are included in meeting agendas. Given this skew, we examine the main policy lever municipalities have to shift participation patterns: meeting access costs. Exploiting pandemic-era variation in remote access, we show that eliminating remote options reduces the number of speakers, but does not clearly change the composition of speakers. Collectively, these results provide the most comprehensive empirical portrait to date of who participates in local democracy, what draws them in, and how institutional design choices shape both the volume and composition of public input.

信息检索

[IR-0] Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem

【速读】:该论文旨在解决推荐系统中多利益相关者(如用户、内容提供者等)优化目标与算法治理结构之间的不匹配问题,特别是现有研究过度依赖算法层面的改进(如多目标建模或重排序),而忽视了推荐生态系统的结构性变革。其解决方案的关键在于引入“算法多元主义”(algorithmic pluralism),即通过将推荐算法从平台中解耦,使用户能够自主选择不同算法,从而提升小众群体的满意度;同时,论文进一步探讨在数据可携性(data portability)政策背景下,用户模型在算法切换时如何演变,并揭示该机制对不同推荐算法下用户效用产生的差异化影响,为设计更具公平性的推荐生态系统提供政策依据。

链接: https://arxiv.org/abs/2604.21750
作者: Anas Buhayh,Elizabeth McKinnie,Clement Canel,Robin Burke
机构: University of Colorado, Boulder(科罗拉多大学博尔德分校)
类目: Information Retrieval (cs.IR)
备注: 34th ACM Conference on User Modeling, Adaptation and Personalization

点击查看摘要

Abstract:Optimizing outcomes for multiple stakeholders in recommender systems has historically focused on algorithmic interventions, such as developing multi-objective models or re-ranking results from existing algorithms. However, structural changes to the recommendation ecosystem itself remain understudied. This paper explores the implications of algorithmic pluralism (also known as “middleware” in the governance literature), in which recommendation algorithms are decoupled from platforms, enabling users to select their preferred algorithm. Prior simulation work demonstrates that algorithmic choice benefits niche consumers and providers. Yet this approach raises critical questions about user modeling in the context of data portability: when users switch algorithms, what happens to their data? Multiple data portability regulations have emerged to strengthen user data ownership and control, and we examine how such policies affect user models and stakeholder outcomes in recommendation settings. Our findings reveal that data portability scenarios produce varying effects on user utility across different recommendation algorithms. We highlight key policy considerations and implications for designing equitable recommendation ecosystems.

[IR-1] Efficient Logic Gate Networks for Video Copy Detection

【速读】:该论文旨在解决视频拷贝检测(video copy detection)在大规模场景下对计算效率和存储资源的高要求问题,尤其是在面对多样视觉失真时仍需保持鲁棒性相似性估计的挑战。传统深度神经网络虽然性能优异,但其高计算成本和大尺寸特征描述符限制了在高吞吐量系统中的实际部署。解决方案的关键在于提出基于可微逻辑门网络(differentiable Logic Gate Networks, LGNs)的新框架:通过极端帧缩放、二值化预处理以及可训练的LGN嵌入模型,将浮点特征提取器替换为紧凑的逻辑表示;该模型能同时学习逻辑运算与连接关系,并在训练后离散化为纯布尔电路,从而实现极快的推理速度(超过11k样本/秒)和极小的描述符尺寸(数个数量级压缩),同时保持或超越现有方法的准确性和排序性能。

链接: https://arxiv.org/abs/2604.21694
作者: Katarzyna Fojcik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.
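
The discretized side of the pipeline can be illustrated with a fixed Boolean gate layer over binarized frame features, followed by Hamming matching of the resulting descriptors. The gate wiring and inputs below are invented; in the paper both the gates and their connections are learned in a differentiable relaxation before being discretized to a Boolean circuit.

```python
# Illustrative discretized logic-gate layer plus Hamming matching of
# binary descriptors. Wiring and inputs are invented placeholders.

GATES = {"and": lambda a, b: a & b,
         "or":  lambda a, b: a | b,
         "xor": lambda a, b: a ^ b}

def gate_layer(bits, wiring):
    """wiring: list of (gate_name, i, j) choosing two input bits."""
    return [GATES[g](bits[i], bits[j]) for g, i, j in wiring]

def hamming_similarity(d1, d2):
    """Fraction of matching descriptor bits."""
    return sum(a == b for a, b in zip(d1, d2)) / len(d1)

wiring = [("and", 0, 1), ("or", 1, 2), ("xor", 0, 3)]
frame = [1, 0, 1, 1]            # binarized miniature frame features
distorted = [1, 0, 1, 0]        # e.g. a compressed copy of the frame
d1, d2 = gate_layer(frame, wiring), gate_layer(distorted, wiring)
sim = hamming_similarity(d1, d2)   # high similarity flags a copy
```

Because the descriptors are bit vectors, matching reduces to cheap bitwise operations, which is where the reported inference speed and memory savings come from.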

[IR-2] Counterfactual Multi-task Learning for Delayed Conversion Modeling in E-commerce Sales Pre-Promotion SIGIR SIGIR’26

【速读】:该论文旨在解决电商促销活动中预促销阶段(pre-promotion phase)转化率(CVR)预测准确率低的问题,尤其关注用户在预促销期将商品加入购物车(add-to-cart, ATC)但延迟至促销日才完成购买的特殊行为模式。现有方法虽能较好预测促销日的直接转化,却忽视了预促销期间因行为分布偏移导致的延迟转化建模不足,且未能有效利用历史预促销数据提升预测性能。解决方案的关键在于提出Counterfactual Multi-task Delayed Conversion Model (CM-DCM),其核心创新包括:(i) 多任务架构联合建模直接转化与延迟转化;(ii) 个性化用户行为门控模块缓解预促销期数据稀疏问题;(iii) 基于反事实因果推理建模从ATC到延迟转化的转移概率,从而显著提升预促销阶段的CVR预测精度,并在线上A/B测试中验证了对广告收入、延迟转化GMV及整体GMV的正向影响。

链接: https://arxiv.org/abs/2604.21675
作者: Xin Song,Kaiyuan Li,Jinxin Hu
机构: Alibaba Group(阿里巴巴集团); Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR)
备注: 6 pages, accepted by 49th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR’26)

点击查看摘要

Abstract:Sales promotions, as short-term incentives to stimulate product purchases, play a pivotal role in modern e-commerce marketing strategies. During promotional events, user behavior patterns exhibit distinct characteristics compared to regular periods. In the pre-promotion phase, users typically engage in product search and browsing without immediate purchases, adding items to carts in anticipation of promotional discounts. This behavior leads to delayed conversions, resulting in significantly lower conversion rates (CVR) before the promotion day. Although existing research has made progress in CVR prediction for promotion days using historical data, it largely overlooks the critical pre-promotion period. And while delayed feedback modeling has been extensively studied, current approaches fail to account for the unique distribution shifts in conversion behavior before promotional events, where delayed conversions predominantly occur on the promotion day rather than over continuous time windows. To address these limitations, we propose the Counterfactual Multi-task Delayed Conversion Model (CM-DCM), which leverages historical pre-promotion data to enhance CVR prediction for both delayed and direct conversions. Our model incorporates three key innovations: (i) A multi-task architecture that jointly models direct and delayed conversions using historical pre-promotion data; (ii) A personalized user behavior gating module to mitigate data sparsity issues during brief pre-promotion periods; (iii) A counterfactual causal approach to model the transition probability from add-to-cart (ATC) to delayed conversion. Extensive experiments demonstrate that CM-DCM outperforms baselines in pre-promotion scenarios. Online A/B tests during major promotional events showed significant improvements in advertising revenue, delayed conversion GMV, and overall GMV, validating the effectiveness of our approach.
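
The transition the abstract highlights, from add-to-cart (ATC) to a conversion on promotion day, can be illustrated with a simple probability decomposition. This is a hedged sketch of the idea only, not CM-DCM's actual model; the disjoint-path assumption and all numbers are illustrative:

```python
def combined_cvr(p_direct: float, p_atc: float, p_delayed_given_atc: float) -> float:
    """Overall pre-promotion conversion probability: either convert now,
    or add to cart and convert on promotion day.
    Assumes the two paths are disjoint (an illustrative simplification)."""
    return p_direct + (1.0 - p_direct) * p_atc * p_delayed_given_atc

# Toy numbers: low direct CVR before the promotion, but a sizeable
# ATC rate with a good chance of converting on promotion day.
cvr = combined_cvr(p_direct=0.02, p_atc=0.30, p_delayed_given_atc=0.50)
```

The point of the decomposition is that the delayed path dominates the total: improving the estimate of P(delayed conversion | ATC) matters far more here than refining the small direct term.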

[IR-3] Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation ECIR2026

【速读】:该论文旨在解决传统顺序推荐系统(Sequential Recommender Systems)在建模用户行为时仅依赖交互模式、难以捕捉丰富用户语义信息的问题,同时克服现有大语言模型(Large Language Models, LLMs)集成方法在实时推理中带来高昂计算成本的局限。其解决方案的关键在于提出一种新颖的知识蒸馏方法,通过预训练LLM生成文本形式的用户画像,并将其作为知识注入到顺序推荐模型中,从而在不进行任何LLM推理的情况下提升推荐效果;该方法无需修改原模型结构或对LLM进行微调,即可实现与传统顺序模型相当的推理效率,同时显著增强用户表征能力。

链接: https://arxiv.org/abs/2604.21536
作者: Nikita Severin,Danil Kartushov,Vladislav Urzhumov,Vladislav Kulikov,Oksana Konovalova,Alexey Grishanov,Anton Klenitskiy,Artem Fatkulin,Alexey Vasilev,Andrey Savchenko,Ilya Makarov
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to ECIR 2026. 7 pages. This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: this http URL

点击查看摘要

Abstract:Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that distills textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.
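
One common way to realize this kind of distillation is an auxiliary loss that aligns the sequential model's user embedding with a frozen embedding of the LLM-generated profile text, so the profile is only needed at training time. The cosine form below is an assumption for illustration, not necessarily the paper's exact objective:

```python
import numpy as np

def cosine_distill_loss(user_emb: np.ndarray, profile_emb: np.ndarray) -> float:
    """Auxiliary distillation loss: 1 - cosine similarity between the
    recommender's user embedding and a frozen LLM profile embedding."""
    u = user_emb / np.linalg.norm(user_emb)
    p = profile_emb / np.linalg.norm(profile_emb)
    return float(1.0 - u @ p)

# Identical directions give zero loss; orthogonal directions give loss 1.
aligned = cosine_distill_loss(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_distill_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

At serving time only the sequential model runs, which is what preserves the inference efficiency the abstract emphasizes.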

[IR-4] From Tokens to Concepts: Leveraging SAE for SPLADE SIGIR2025

【速读】:该论文旨在解决传统稀疏信息检索(IR)模型(如SPLADE)依赖于预定义词典(backbone vocabulary)所带来的局限性,包括词汇多义性(polysemicity)和同义现象(synonymy)导致的性能瓶颈,以及在多语言和多模态场景下的扩展困难。其解决方案的关键在于用通过稀疏自编码器(Sparse Auto-Encoders, SAE)学习得到的语义概念潜在空间(latent space of semantic concepts)替代原始词典,从而实现更灵活、高效且具备更好泛化能力的检索建模。实验表明,SAE-SPLADE在域内与域外任务上均达到与SPLADE相当的检索性能,同时提升了效率。

链接: https://arxiv.org/abs/2604.21511
作者: Yuxuan Zong,Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski
机构: Sorbonne Université, CNRS, ISIR (索邦大学, 国家科学研究中心, 机器人与智能系统研究所); Sinequa by ChapsVision (Sinequa by ChapsVision); Paris (巴黎)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025

点击查看摘要

Abstract:Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.
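
The core substitution here is that a sparse auto-encoder maps a dense backbone embedding into a wide latent space of "concepts", of which only a few fire, so the sparse activations can play the role SPLADE's vocabulary logits usually play. A minimal top-k SAE encoding sketch (dimensions, the top-k rule, and the random weights are illustrative, not the paper's trained model):

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, k=4):
    """Encode a dense vector into a sparse concept vector:
    linear map + ReLU, then keep only the top-k activations."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)          # ReLU pre-activations
    if np.count_nonzero(z) > k:
        thresh = np.sort(z)[-k]                     # k-th largest value
        z[z < thresh] = 0.0                         # zero everything below it
    return z

rng = np.random.default_rng(1)
x = rng.standard_normal(16)                         # dense backbone embedding
W_enc = rng.standard_normal((128, 16))              # 128 latent "concepts"
z = sae_encode(x, W_enc, np.zeros(128), k=4)
```

Retrieval then works exactly as with a learned-sparse lexical model: documents and queries only interact on the handful of concepts they both activate.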

[IR-5] WPGRec: Wavelet Packet Guided Graph Enhanced Sequential Recommendation SIGIR2026

【速读】:该论文旨在解决顺序推荐中长期偏好、短期意图与局部行为波动在多时间尺度上共存时,现有频域方法因全局谱操作导致局部瞬变与长程依赖混淆,以及基于滤波的小波处理存在时间错位和边界伪影的问题;同时,用户-物品交互图中的协同信号常通过尺度不一致的辅助模块注入,限制了时空动态与结构依赖的联合建模效果。解决方案的关键在于提出WPGRec框架,其核心创新为:首先采用全树非下采样平稳小波包变换生成等长、平移不变的子带序列,确保多分辨率时序建模的对齐性;其次在子带层面执行交互图传播,实现高阶协同信息注入且保持跨分辨率的时间一致性;最后设计能量与频谱平坦度感知的门控融合模块,自适应聚合有效子带并抑制噪声成分,从而实现时频联合建模与图增强的一致性对齐。

链接: https://arxiv.org/abs/2604.21305
作者: Peilin Liu,Zhiquan Ji,Gang Yan
机构: Jilin University (吉林大学)
类目: Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026, 8 pages, 3 figures

点击查看摘要

Abstract:Sequential recommendation aims to model users’ evolving interests from noisy and non-stationary interaction streams, where long-term preferences, short-term intents, and localized behavioral fluctuations may coexist across temporal scales. Existing frequency-domain methods mainly rely on either global spectral operations or filter-based wavelet processing. However, global spectral operations tend to entangle local transients with long-range dependencies, while filter-based wavelet pipelines may suffer from temporal misalignment and boundary artifacts during multi-scale decomposition and reconstruction. Moreover, collaborative signals from the user-item interaction graph are often injected through scale-inconsistent auxiliary modules, limiting the benefit of jointly modeling temporal dynamics and structural dependencies. To address these issues, we propose Wavelet Packet Guided Graph Enhanced Sequential Recommendation (WPGRec), a unified time-frequency and graph-enhanced framework that aligns multi-resolution temporal modeling with graph propagation at matching scales. WPGRec first applies a full-tree undecimated stationary wavelet packet transform to generate equal-length, shift-invariant subband sequences. It then performs subband-wise interaction-graph propagation to inject high-order collaborative information while preserving temporal alignment across resolutions. Finally, an energy- and spectral-flatness-aware gated fusion module adaptively aggregates informative subbands and suppresses noise-like components. Extensive experiments on four public benchmarks show that WPGRec consistently outperforms sequential and graph-based baselines, with particularly clear gains on sparse and behaviorally complex datasets, highlighting the effectiveness of band-consistent structure injection and adaptive subband fusion for sequential recommendation.
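
The property WPGRec leans on is that an undecimated (stationary) wavelet transform produces subbands of the same length as the input, so multi-resolution components stay time-aligned. A minimal one-level à trous Haar sketch in NumPy (illustrative only; the paper uses a full wavelet-packet tree, and the circular padding here is a simplifying assumption):

```python
import numpy as np

def atrous_haar_level1(x: np.ndarray):
    """One undecimated Haar level: a smooth (approximation) subband and a
    detail subband, both the same length as the input (circular padding)."""
    shifted = np.roll(x, 1)
    approx = 0.5 * (x + shifted)     # low-pass: local average
    detail = x - approx              # high-pass: residual fluctuation
    return approx, detail

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
approx, detail = atrous_haar_level1(x)
```

Because no downsampling occurs, each subband position still corresponds to the same time step of the interaction sequence, which is what makes subband-wise graph propagation and gated fusion well-defined.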

[IR-6] PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

【速读】:该论文旨在解决现有科学文献理解评测基准普遍局限于孤立任务评估的问题,无法全面衡量模型在整合多模态信息、解释实验证据、跨源推理及批判性评估等认知能力上的协同表现。其解决方案的关键在于构建了一个名为PAPERMIND的综合性评测基准,该基准基于七个学科的真实科研论文,包含四个互补的任务类别——多模态锚定(multimodal grounding)、实验解释(experimental interpretation)、跨源证据推理(cross-source evidence reasoning)和批判性评估(critical assessment),从而系统化地刻画科学论文理解中各项认知能力的交互机制,并通过多任务行为分析实现对集成科学推理能力的诊断式评估。

链接: https://arxiv.org/abs/2604.21304
作者: Yanjun Zhao,Tianxin Wei,Jiaru Zou,Xuying Ning,Yuanchen Bei,Lingjie Chen,Simmi Rana,Wendy H. Yang,Hanghang Tong,Jingrui He
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both opensource and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at this http URL.

[IR-7] Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

【速读】:该论文旨在解决作者风格表征学习中的内容-风格纠缠(content-style entanglement)问题,即现有方法在训练过程中容易捕捉到作者写作风格与文本主题之间的虚假相关性,导致模型在跨领域场景下泛化能力差。其解决方案的关键在于提出可解释的作者风格变分自编码器(Explainable Authorship Variational Autoencoder, EAVAE),通过架构设计实现风格与内容的显式解耦:首先利用监督对比学习预训练风格编码器,随后在变分自编码器(VAE)框架中分别使用独立编码器提取风格和内容表示;并通过一个新颖的判别器强制解耦——该判别器不仅能区分风格/内容表示是否来自同一作者或内容源,还能生成自然语言解释以缓解混淆信息并提升模型可解释性。

链接: https://arxiv.org/abs/2604.21300
作者: Hieu Man,Van-Cuong Pham,Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
机构: University of Oregon (俄勒冈大学); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors’ writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VAE) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanations for its decisions, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online: this https URL, this https URL.

[IR-8] Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长期记忆管理中的效率与可扩展性问题,特别是如何实现高效、低延迟且无需依赖LLM推理的内存存储与检索机制。其解决方案的关键在于提出一种基于“记忆宫殿”(method of loci)空间隐喻的组织架构,结合verbatim-first存储哲学(即直接存储原始文本片段而非提取摘要或特征),并利用ChromaDB的默认嵌入模型(all-MiniLM-L6-v2)进行向量检索;同时通过四层内存栈设计显著降低唤醒成本(约170 tokens),并构建完全确定性的零LLM写入路径,从而支持离线运行和零API费用。尽管其核心性能优势主要源于verbatum存储与成熟向量数据库技术的结合,而非空间结构本身,但该系统首次系统性地将空间记忆隐喻引入AI记忆体系,并在架构层面实现了多项创新,如低开销写入、确定性操作及对传统抽取式方法的挑战。

链接: https://arxiv.org/abs/2604.21284
作者: Robin Dey,Panyanon Viradecha
机构: OpenHub Research (OpenHub 研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 20 pages, 10 tables. Code and data at this https URL

点击查看摘要

Abstract:MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace’s headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB’s default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se – the palace hierarchy (Wings-Rooms-Closets-Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0’s April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims – a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
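
The analysis's central claim is that the palace hierarchy amounts to standard vector search with metadata pre-filtering. That pattern is easy to sketch: filter records by hierarchical metadata first, then rank survivors by embedding similarity. The toy store, field names, and 2-D embeddings below are illustrative, not MemPalace's actual schema:

```python
import numpy as np

# Toy memory store: each record has an embedding plus palace-style metadata.
store = [
    {"id": "m1", "emb": np.array([1.0, 0.0]), "wing": "work", "room": "projects"},
    {"id": "m2", "emb": np.array([0.9, 0.1]), "wing": "home", "room": "recipes"},
    {"id": "m3", "emb": np.array([0.0, 1.0]), "wing": "work", "room": "projects"},
]

def retrieve(query: np.ndarray, top_k=1, **filters):
    """Filter by metadata first, then rank candidates by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates = [r for r in store
                  if all(r.get(k) == v for k, v in filters.items())]
    return sorted(candidates, key=lambda r: cos(query, r["emb"]),
                  reverse=True)[:top_k]

hits = retrieve(np.array([1.0, 0.05]), wing="work")
work_hits = retrieve(np.array([0.9, 0.1]), top_k=3, wing="work")
```

Nothing in the ranking step knows about "wings" or "rooms"; the hierarchy only narrows the candidate set, which matches the article's point that the spatial metaphor is metadata filtering by another name.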

[IR-9] Unlocking the Power of Large Language Models for Multi-table Entity Matching NLPCC2025

【速读】:该论文旨在解决多表实体匹配(Multi-table Entity Matching, MEM)中因数值属性差异导致的语义不一致问题,以及由多数据源引入大量实体所引发的匹配效率低下和噪声干扰问题。解决方案的关键在于提出一个基于大语言模型(Large Language Models, LLMs)的框架 LLM4MEM,其核心包括三个模块:1)多风格提示增强的LLM属性协调模块,用于缓解数值属性变化带来的语义不一致;2)传递一致性嵌入匹配模块,通过优化实体嵌入与预匹配策略提升匹配效率;3)密度感知剪枝模块,有效去除匹配过程中的噪声实体,从而提高多表实体匹配的整体质量。

链接: https://arxiv.org/abs/2604.21238
作者: Yingkai Tang,Taoyu Su,Wenyuan Zhang,Xiaoyang Guo,Tingwen Liu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by NLPCC 2025

点击查看摘要

Abstract:Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at this https URL.
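
The density-aware pruning idea, dropping noisy entities before matching, is commonly instantiated as k-NN-distance density: entities whose average distance to their nearest neighbors is large are treated as isolated noise. The criterion below is an assumption for illustration, not LLM4MEM's exact rule:

```python
import numpy as np

def knn_density_prune(embs: np.ndarray, k: int = 2, quantile: float = 0.75):
    """Keep indices whose mean k-NN distance falls below the given quantile;
    isolated (low-density) points are treated as noise and pruned."""
    dists = np.linalg.norm(embs[:, None, :] - embs[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # ignore self-distance
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    keep = knn_mean <= np.quantile(knn_mean, quantile)
    return [i for i, ok in enumerate(keep) if ok]

# Three clustered entity embeddings plus one far-away noisy entity.
embs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
kept = knn_density_prune(embs, k=2, quantile=0.75)
```

In a multi-table setting this runs after embedding and pre-matching, shrinking the candidate pool before the pairwise matching stage.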

[IR-10] On Reasoning Behind Next Occupation Recommendation PAKDD2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在预测用户未来职业时准确性不足的问题,尤其在于LLMs缺乏对职业路径及决策背后隐含原因(reason)的对齐。解决方案的关键在于提出一种两阶段推理框架:首先由一个“原因生成器”基于用户的教育与职业历史推导出一个具有事实性、连贯性和实用性的“理由”,该理由作为输入用于后续的职业预测模块;进而通过LLM-as-a-Judge方法构建高质量的“oracle reasons”,并以此对小型LLM进行微调,使其同时完成原因生成和职业预测任务。实验表明,该方法显著提升了预测准确率,且单个联合微调的LLM优于两个独立微调的模型,同时职业预测性能高度依赖于生成原因的质量。

链接: https://arxiv.org/abs/2604.21204
作者: Shan Dong,Palakorn Achananuparp,Hieu Hien Mai,Lei Wang,Yao Lu,Ee-Peng Lim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to PAKDD 2026

点击查看摘要

Abstract:In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user using his/her past education and career history. The reason summarizes the user’s preference and is used as the input of an occupation predictor to recommend the user’s next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs, improving their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence and utility criteria, using a LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs’ accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at this https URL.

[IR-11] Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在不同人口群体中表现不均的问题,特别是探讨这种差异是否源于用户明确声明的身份(explicit identity)还是通过语境隐含的语言特征(implicit dialect signals)所引发。其关键解决方案在于采用因子设计实验,对比显式身份提示与隐式方言线索(如非洲裔美国英语 AAVE、新加坡英语 Singlish)对模型安全机制的影响,发现隐式方言信号能显著降低拒绝率并提升语义相似度,形成一种“方言越狱”(dialect jailbreak)现象;然而,这也导致内容净化能力下降,暴露出当前安全对齐技术对显式关键词过度依赖的脆弱性,揭示了公平性与语言多样性之间的根本张力,并强调需构建超越显式线索的泛化安全机制。

链接: https://arxiv.org/abs/2604.21152
作者: Irti Haq,Belén Saldías
机构: University of Washington (华盛顿大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25–28, 2026, Montreal, Canada. ACM, New York, NY, USA, 32 pages

点击查看摘要

Abstract:As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether these disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users’ identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve "better" performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful "dialect jailbreak," reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this "dialect jailbreak" introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where "standard" users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially a more hostile information landscape. This highlights a fundamental tension in alignment, between equity and linguistic diversity, and underscores the need for safety mechanisms that generalize beyond explicit cues.

[IR-12] Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation SIGIR2026

【速读】:该论文旨在解决Tip-of-the-Tongue (ToT)检索评估基准长期局限于英语,从而限制了多语言信息检索应用的问题。其解决方案的关键在于构建了一种基于大语言模型(LLM)的查询模拟框架,用于生成多语言(中文、日语、韩语和英语)的ToT测试集合,并系统性地研究提示语言(prompt language)与源文档语言对模拟查询保真度的影响。实验表明,有效的ToT模拟需要语言感知的设计策略:非英语来源通常至关重要,而当非英语来源信息不足时,使用英语维基百科可提升查询生成质量。最终,作者发布了包含每种语言5000个查询的首个大规模多语言ToT基准数据集,为跨语言检索评估提供了可复现且现实的工具。

链接: https://arxiv.org/abs/2604.21096
作者: Xuhong He,To Eun Kim,Maik Fröbe,Jaime Arguello,Bhaskar Mitra,Fernando Diaz
机构: Carnegie Mellon University (卡内基梅隆大学); Friedrich-Schiller-Universität Jena (耶拿弗里德里希-席勒大学); UNC Chapel Hill (北卡罗来纳大学教堂山分校); Independent Researcher (独立研究员)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: SIGIR 2026; NTCIR track: this https URL

点击查看摘要

Abstract:Tip-of-the-Tongue (ToT) retrieval benchmarks have largely focused on English, limiting their applicability to multilingual information access. In this work, we construct multilingual ToT test collections for Chinese, Japanese, Korean, and English, using an LLM-based query simulation framework. We systematically study how prompt language and source document language affect the fidelity of simulated ToT queries, validating synthetic queries through system rank correlation against real user queries. Our results show that effective ToT simulation requires language-aware design choices: non-English language sources are generally important, while English Wikipedia can be beneficial when non-English sources provide insufficient information for query generation. Based on these findings, we release four ToT test collections with 5,000 queries per language across multiple domains. This work provides the first large-scale multilingual ToT benchmark and offers practical guidance for constructing realistic ToT datasets beyond English.
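
The validation step described here, system rank correlation between real and simulated queries, is usually reported as Kendall's tau over per-system effectiveness scores. A minimal self-contained sketch (no tie handling; the four system scores are fabricated for illustration):

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two score lists over the same systems:
    (concordant - discordant pairs) / total pairs. Assumes no ties."""
    pairs = list(combinations(range(len(scores_a)), 2))
    concordant = sum(
        1 for i, j in pairs
        if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) > 0
    )
    return (2 * concordant - len(pairs)) / len(pairs)

# Effectiveness of four retrieval systems under real vs. simulated queries.
real = [0.42, 0.38, 0.55, 0.30]
simulated = [0.40, 0.35, 0.58, 0.33]
tau = kendall_tau(real, simulated)
```

A tau close to 1 means the synthetic queries rank systems the same way real user queries do, which is the fidelity criterion the benchmark construction relies on.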

[IR-13] Automated Extraction of Pharmacokinetic Parameters from Structured XML Scientific Articles: Enhancing Data Accessibility at Scale

【速读】:该论文旨在解决药理学领域中缺乏集中、全面且实时更新的药代动力学(Pharmacokinetics, PK)数据存储库的问题,以及从分散的科学文献和监管文档中手动提取定量PK参数所面临的效率低下与准确性不足的挑战。其核心解决方案在于开发基于人工智能(AI)的表格检测与信息抽取算法,关键在于精确识别并解析表格结构中的单元格内容,尤其依赖于列/行标题等结构信息来实现符合人类阅读逻辑的数据提取,从而提升自动化程度与准确性,应对日益增长的文献数量和人力资源短缺问题。

链接: https://arxiv.org/abs/2604.21063
作者: Remya Ampadi Ramachandran,Lisa A. Tell,Sidharth Rai,Nuwan Millagaha Gedara,Hossein Sholehrasa,Jim E. Riviere,Majid Jaberi-Douraki
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 43 pages, 3 tables, 5 figures, includes Supplementary Materials

点击查看摘要

Abstract:In the field of pharmacology, there is a notable absence of centralized, comprehensive, and up-to-date repositories of PK data. This poses a significant challenge for R&D, as it can be a time-consuming and challenging task to collect all the required quantitative PK parameters from diverse scientific publications. This quantitative PK information is predominantly organized in tabular format, mostly available as XML, HTML, or PDF files within various online repositories and scientific publications, including supplementary materials. This makes tables one of the crucial components and information elements of scientific or regulatory documents as they are commonly utilized to present quantitative information. Extracting data from tables is typically a labor-intensive process, and alternative automated machine learning models may struggle to accurately detect and extract the relevant data due to the complex nature and diverse layouts of tabular data. The difficulty of information extraction and reading order detection is largely dependent on the structural complexity of the tables. Efforts to understand tables should prioritize capturing the content of table cells in a manner that aligns with how a human reader naturally comprehends the information. FARAD has been manually extracting tabular data and other information from literature and regulatory agencies for over 40 years. However, there is now an urgent need to automate this process due to the large volume of publications released daily. The accuracy of this task has become increasingly challenging, as manual extraction is tedious and prone to errors, especially given the staffing shortages we are currently facing. This necessitates the development of AI algorithms for table detection and extraction that are able to precisely handle cells organized according to the table structure, as indicated by column and/or row header information.
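
The reading behavior the abstract asks for, interpreting each cell under its column and row headers, is straightforward for well-formed XML tables. A toy sketch with Python's standard-library XML parser (the `<table>/<row>/<cell>` markup and the PK values are invented for illustration; real article XML such as JATS is far more varied):

```python
import xml.etree.ElementTree as ET

TABLE_XML = """
<table>
  <row><cell>Drug</cell><cell>Half-life (h)</cell><cell>Cmax (ug/mL)</cell></row>
  <row><cell>DrugA</cell><cell>2.1</cell><cell>0.8</cell></row>
  <row><cell>DrugB</cell><cell>5.4</cell><cell>1.3</cell></row>
</table>
"""

def extract_records(xml_text):
    """Map each body row to {column header: cell value}, mirroring how a
    human reads a table: headers first, then row by row."""
    rows = ET.fromstring(xml_text).findall("row")
    headers = [c.text for c in rows[0].findall("cell")]
    return [dict(zip(headers, (c.text for c in row.findall("cell"))))
            for row in rows[1:]]

records = extract_records(TABLE_XML)
```

The hard cases the abstract alludes to (merged cells, nested headers, PDF-only layouts) break exactly this header-to-cell alignment, which is why structure-aware detection models are needed beyond simple parsing.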

[IR-14] Following the Eye-Tracking Evidence: Established Web-Search Assumptions Fail in Carousel Interfaces

【速读】:该论文旨在解决当前对轮播式界面(carousel interface)用户行为理解不足的问题,尤其是在缺乏实证研究的情况下,以往工作往往直接套用单列表网页搜索界面中的行为假设(如F型扫描模式和检查假说),从而可能导致点击模型和评估指标设计不合理。其解决方案的关键在于基于一项新发布的眼动追踪数据集,系统性地分析用户在轮播界面中的注视与点击行为,发现:(1)F型模式仅适用于垂直滚动场景,而水平滑动时呈现独特的L型检查模式;(2)点击率条件下的检查行为不支持传统检查假说;(3)用户普遍忽略轮播标题,直接聚焦于内容项本身。这些发现表明,现有基于网页搜索界面的行为假设无法有效迁移至轮播推荐场景,亟需建立针对此类界面的新型用户行为建模基础。

链接: https://arxiv.org/abs/2604.21019
作者: Jingwei Kang,Maarten de Rijke,Harrie Oosterhuis
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Carousel interfaces have been the de-facto standard for streaming media services for over a decade. Yet, there has been very little research into user behavior with such interfaces, which thus remains poorly understood. Due to this lack of empirical research, previous work has assumed that behaviors established in single-list web-search interfaces, such as the F-pattern and the examination hypothesis, also apply to carousel interfaces, for instance when designing click models or evaluation metrics. We analyze a recently-released interaction and examination dataset resulting from an eye-tracking study performed on carousel interfaces to verify whether these assumptions actually hold. We find that (i) the F-pattern holds only for vertical examination and not for horizontal swiping; additionally, we discover that, when conditioned on a click, user examination follows an L-pattern unique to carousel interfaces; (ii) click-through-rates conditioned on examination indicate that the well-known examination hypothesis does not hold in carousel interfaces; and (iii) contrary to the assumptions of previous work, users generally ignore carousel headings and focus directly on the content items. Our findings show that many user behavior assumptions, especially concerning examination patterns, do not transfer from web search interfaces to carousel recommendation settings. Our work shows that the field lacks a reliable foundation on which to build models of user behavior with these interfaces. Consequently, a re-evaluation of existing metrics and click models for carousel interfaces may be warranted.
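
The key statistic in testing the examination hypothesis, click-through rate conditioned on examination, is simple to compute from joint examination/click logs. A toy sketch (the event log is fabricated for illustration, not from the released dataset):

```python
# Each event: (examined, clicked) flags for one item impression,
# e.g. from fused eye-tracking fixations and interaction logs.
events = [
    (True, True), (True, False), (True, False), (True, True),
    (False, False), (False, False), (True, False),
]

def ctr_given_examined(log):
    """Click-through rate conditioned on examination:
    P(click | examined) = clicks among examined / examined impressions."""
    examined = [clicked for seen, clicked in log if seen]
    return sum(examined) / len(examined)

ctr = ctr_given_examined(events)
```

Under the examination hypothesis this conditional CTR should depend only on item relevance, not on position or carousel; the paper's finding is that in carousel interfaces it does not behave that way.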

[IR-15] Clinical Reasoning AI for Oncology Treatment Planning : A Multi-Specialty Case-Based Evaluation

【速读】:该论文旨在解决社区医疗机构中肿瘤治疗决策因多学科信息整合复杂、临床指南更新频繁而带来的认知负担问题,进而提升治疗方案的标准化与可及性。其解决方案的关键在于开发并评估OncoBrain——一个专为肿瘤学设计的AI临床推理平台,该平台融合通用大语言模型(Large Language Model, LLM)与癌症特异性图谱检索增强生成模块(graph retrieval-augmented generation layer),以权威治疗方案语料库作为长期记忆,并引入模型无关的安全层(CHECK)实现幻觉检测与抑制,从而在多专科病例中生成符合指南、科学准确且安全的治疗计划。

链接: https://arxiv.org/abs/2604.20869
作者: Philippe E. Spiess,Md Muntasir Zitu,Alison Walker,Daniel A. Anaya,Robert M. Wenham,Michael Vogelbaum,Daniel Grass,Ali-Musa Jaffer,Amod Sarnaik,Caitlin McMullen,Christine Sam,John V. Kiluk,Tianshi Liu,Tiago Biachi,Julio Powsang,Jing-Yi Chern,Roger Li,Seth Felder,Samuel Reynolds,Michael Shafique,Alison Sheehan,Ashley Layman,Cydney A. Warfield,Derrick Legoas,Jaclyn Parrinello,Jena Schmitz,Kevin Eaton,Mark Honor,Luis Felipe,Issam ElNaqa,Elier Delgado,Talia Berler,Rachael V. Phillips,Frantz Francisque,Carlos Garcia Fernandez,Gilmer Valdes
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Background: More than 80% of U.S. cancer care is delivered in community settings, where survival remains worse than at academic centers. Clinicians must integrate genomics, staging, radiology, pathology, and changing guidelines, creating cognitive burden. We evaluated OncoBrain, an AI clinical reasoning platform for oncology treatment-plan generation, as an early step toward OGI. Methods: OncoBrain combines general-purpose LLMs with a cancer-specific graph retrieval-augmented generation layer, a gold-standard treatment-plan corpus as long-term memory, and a model-agnostic safety layer (CHECK) for hallucination detection and suppression. We evaluated clinician-enriched case summaries across gynecologic, genitourinary, neuro-oncology, gastrointestinal/hepatobiliary, and hematologic malignancies. Three clinician groups completed structured evaluations of 173 cases using a common 16-item instrument: subspecialist oncologists reviewed 50 cases, physician reviewers 78, and advanced practice providers 45. Results: Ratings were highest for scientific accuracy, evidence support, and safety, with lower but favorable scores for workflow integration and time savings. On a 5-point scale, mean alignment with evidence and guidelines was 4.60, 4.56, and 4.70 across subspecialists, physician reviewers, and advanced practice providers. Mean scores for absence of safety or misinformation concerns were 4.80, 4.40, and 4.60. Workflow integration averaged 4.50, 3.94, and 4.00; perceived time savings averaged 5.00, 3.89, and 3.60. Conclusions: In this multi-specialty vignette-based evaluation, OncoBrain generated oncology treatment plans judged guideline-concordant, clinically acceptable, and easy to supervise. These findings support the potential of a carefully engineered AI reasoning platform to assist oncology treatment planning and justify prospective real-world evaluation in community settings. 

[IR-16] Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)中基于语义ID(Semantic IDs, SIDs)的三重挑战:信息退化(Information Degradation)、语义退化(Semantic Degradation)和模态失真(Modality Distortion)。具体而言,现有方法在两阶段压缩流程中导致语义损失与信息质量下降,且缺乏后验机制区分高质量与低质量SID;多模态特征在级联量化过程中丢失关键语义信息,因嵌入生成与量化未联合优化;同时,量化器无法有效对齐文本与图像模态,造成即使上游网络已对齐仍存在特征错位。解决方案的关键在于提出一个集成三项创新的框架:深度上下文兴趣挖掘(Deep Contextual Interest Mining, DCIM)以捕捉广告场景中的高层语义信息并借助重建监督保留上下文;跨模态语义对齐(Cross-Modal Semantic Alignment, CMSA)利用视觉-语言模型(Vision-Language Models, VLMs)将非文本模态映射至统一文本语义空间,缓解模态失真;质量感知强化机制(Quality-Aware Reinforcement Mechanism, QARM)通过强化学习引入质量感知奖励,在后验阶段鼓励高语义丰富度SID并抑制低质量SID。实验证明该框架显著优于当前主流SID生成方法,并通过消融研究验证各组件有效性。

链接: https://arxiv.org/abs/2604.20861
作者: Yagchen Zeng
机构: Southeast University (东南大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Recommendation (GR) has demonstrated remarkable performance in next-token prediction paradigms, which relies on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from original multimodal features, as the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.
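摘要中的"级联量化"是 SID 生成的核心步骤。下面给出一个纯 Python 的残差量化(residual quantization)最小示意,展示语义嵌入如何被逐级码本压缩为离散 SID;码本与输入向量均为虚构示例,并非论文实现:

```python
def nearest(codebook, vec):
    """返回 codebook 中与 vec 欧氏距离(平方)最近的码字索引。"""
    best, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(c, vec))
        if d < best_d:
            best, best_d = i, d
    return best

def quantize(codebooks, vec):
    """级联(残差)量化:每一级在上一级残差上选最近码字,
    SID 即各级码字索引组成的元组;同时返回重建误差。"""
    residual = list(vec)
    sid = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    err = sum(r * r for r in residual)
    return tuple(sid), err

# 两级码本、各 2 个码字(假设数据):第一级粗粒度,第二级量化残差
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.1], [0.1, 0.0]],
]
sid, err = quantize(codebooks, [1.1, 1.0])
```

误差不随量化级数联合优化,正是摘要所批评的"嵌入生成与量化各自为政"的来源之一。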

[IR-17] RealRoute: Dynamic Query Routing System via Retrieve-then-Verify Paradigm

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在处理异构知识源(如私有数据库、全球语料库和API)时存在的路由错误问题。现有方法通常采用“大语言模型作为路由器”(LLM-as-a-Router)策略,通过预测性路由将子查询分发至特定数据源,但该策略高度依赖于对不同数据源语义边界的理解,当源边界模糊时易导致路由失败。本文提出的RealRoute系统通过将范式从预测式路由转向“先检索后验证”(Retrieve-then-Verify)机制来克服这一局限:其核心在于并行、源无关的检索以保障证据完整性,随后由动态验证器交叉核验结果并合成事实依据充分的答案,从而显著提升多跳RAG推理任务中的准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.20860
作者: Jiahe Liu,Qinkai Yu,Jingcheng Niu,Xi Zhu,Zirui He,Zhen Xiang,Fan Yang,Jinman Zhao
机构: Technical University of Denmark (丹麦技术大学); University of Exeter (埃克塞特大学); University of Toronto (多伦多大学); Rutgers University (罗格斯大学); NJIT (新泽西理工学院); University of Georgia (佐治亚大学); Wake Forest University (维克森林大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Despite the success of Retrieval-Augmented Generation (RAG) in grounding LLMs with external knowledge, its application over heterogeneous sources (e.g., private databases, global corpora, and APIs) remains a significant challenge. Existing approaches typically employ an LLM-as-a-Router to dispatch decomposed sub-queries to specific sources in a predictive manner. However, this “LLM-as-a-Router” strategy relies heavily on the semantic meaning of different data sources, often leading to routing errors when source boundaries are ambiguous. In this work, we introduce RealRoute System, a framework that shifts the paradigm from predictive routing to a robust Retrieve-then-Verify mechanism. RealRoute ensures evidence completeness through parallel, source-agnostic retrieval, followed by a dynamic verifier that cross-checks the results and synthesizes a factually grounded answer. Our demonstration allows users to visualize the real-time “re-routing” process and inspect the verification chain across multiple knowledge silos. Experiments show that RealRoute significantly outperforms predictive baselines in the multi-hop RAG reasoning task. The RealRoute system is released as an open-source toolkit with a user-friendly web interface. The code is available at the URL: this https URL.
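"先检索后验证"范式可以用几行代码勾勒:先对所有数据源做源无关的并行检索,再由验证器交叉核验候选证据。以下为示意性玩具实现(数据源内容与词面重叠打分均为假设,并非 RealRoute 实际代码):

```python
def overlap(q, d):
    """以查询与文档的词面交集大小作为相似度的粗略代理(示意)。"""
    return len(set(q.lower().split()) & set(d.lower().split()))

def retrieve_all(sources, query, k=2):
    """源无关的并行检索:对每个源各取 top-k,合并全部候选。"""
    candidates = []
    for name, docs in sources.items():
        scored = sorted(docs, key=lambda d: overlap(query, d), reverse=True)
        candidates += [(name, d) for d in scored[:k]]
    return candidates

def verify(query, candidates, min_overlap=2):
    """验证器示意:只保留与查询充分匹配的证据,再交由生成器合成答案。"""
    return [(s, d) for s, d in candidates if overlap(query, d) >= min_overlap]

sources = {
    "private_db": ["alice joined acme in 2020", "bob left acme"],
    "global_corpus": ["acme was founded in 1999", "weather is sunny"],
}
query = "when did alice joined acme"
evidence = verify(query, retrieve_all(sources, query))
```

与预测式路由不同,这里不需要事先判断查询该去哪个源:错误路由的代价被"全量检索 + 事后验证"吸收。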

[IR-18] KGiRAG : An Iterative GraphRAG Approach for Responding Sensemaking Queries

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在检索增强生成(Retrieval-Augmented Generation, RAG)管道中因幻觉(hallucination)和上下文容量限制而导致复杂查询回答不准确、缺乏充分依据的问题。其解决方案的关键在于提出一种迭代式、反馈驱动的GraphRAG架构,通过响应质量评估机制对生成结果进行多轮迭代优化,直至产出语义质量更高且证据更充分的响应。

链接: https://arxiv.org/abs/2604.20859
作者: Isabela Iacob,Melisa Marian,Gheorghe Cosmin Silaghi
机构: Babeş-Bolyai University (巴贝什-博雅大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Paper accepted at the 18th International Conference on Agents and Artificial Intelligence, ICAART 2026

点击查看摘要

Abstract:Recent literature highlights the potential of graph-based approaches within large language model (LLM) retrieval-augmented generation (RAG) pipelines for answering queries of varying complexity, particularly those that fall outside the LLM’s prior knowledge. However, LLMs are prone to hallucination and often face technical limitations in handling contexts large enough to ground complex queries effectively. To address these challenges, we propose a novel iterative, feedback-driven GraphRAG architecture that leverages response quality assessment to iteratively refine outputs until a sound, well-grounded response is produced. Evaluating our approach with queries from the HotPotQA dataset, we demonstrate that this iterative RAG strategy yields responses with higher semantic quality and improved relevance compared to a single-shot baseline.
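摘要描述的"生成—质量评估—迭代修正"循环可以抽象为如下骨架(generate/assess/refine 均为假设的占位实现,真实系统中分别对应 GraphRAG 生成器与响应质量评估器):

```python
def iterative_rag(generate, assess, refine, query, threshold=0.75, max_iters=5):
    """反馈驱动迭代:质量不达标则修正答案,直到达标或超出迭代上限。"""
    answer = generate(query)
    for step in range(max_iters):
        score = assess(query, answer)
        if score >= threshold:
            return answer, score, step
        answer = refine(query, answer)
    return answer, assess(query, answer), max_iters

def make_toy():
    """玩具实现(假设):每次 refine 使质量评分提升 0.25。"""
    state = {"q": 0.25}
    generate = lambda q: "draft"
    assess = lambda q, a: state["q"]
    def refine(q, a):
        state["q"] = min(1.0, state["q"] + 0.25)
        return a + "+"
    return generate, assess, refine

gen, assess, refine = make_toy()
answer, score, steps = iterative_rag(gen, assess, refine, "demo query")
```

max_iters 上限保证了评估器永不满意时循环仍会终止,这是此类反馈回路必须有的保护。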

[IR-19] Mixture of Sequence: Theme-Aware Mixture-of-Experts for Long-Sequence Recommendation

【速读】:该论文旨在解决序列推荐中长序列建模的挑战,即用户兴趣在长时间跨度内存在显著波动(称为“会话跳跃”现象),导致原始序列中混入大量无关甚至误导信息,从而影响点击率预测的准确性。解决方案的关键在于提出一种模型无关的混合序列(Mixture of Sequence, MoS)框架,其核心创新包括:1)基于主题感知的路由机制,能够自适应地识别用户序列中的潜在主题,并将序列分解为多个语义一致的子序列,有效过滤因兴趣跳跃引入的噪声;2)多尺度融合机制,通过三类专家分别捕捉全局序列特征、短期行为模式和主题特定语义,实现多视角、多尺度的信息整合,从而在保持高精度的同时显著降低计算复杂度(FLOPs)。

链接: https://arxiv.org/abs/2604.20858
作者: Xiao Lin,Zhicheng Tang,Weilin Cong,Mengyue Hang,Kai Wang,Yajuan Wang,Zhichen Zeng,Ting-Wei Li,Hyunsik Yoo,Zhining Liu,Xuying Ning,Ruizhong Qiu,Wen-yen Chen,Shuo Chang,Rong Jin,Huayu Li,Hanghang Tong
机构: UIUC(伊利诺伊大学厄巴纳-香槟分校); Meta(Meta)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 14 pages, 9 figures, The Web Conference 2026

点击查看摘要

Abstract:Sequential recommendation has rapidly advanced in click-through rate prediction due to its ability to model dynamic user interests. A key challenge, however, lies in modeling long sequences: users often exhibit significant interest shifts, introducing substantial irrelevant or misleading information. Our empirical analysis corroborates this challenge and uncovers a recurring behavioral pattern in long sequences (“session hopping”): user interests remain stable within short temporal spans (“sessions”) but shift drastically across sessions and may reappear after multiple sessions. To address this challenge, we propose the Mixture of Sequence (MoS) framework, a model-agnostic MoE approach that achieves accurate predictions by extracting theme-specific and multi-scale subsequences from noisy raw user sequences. First, MoS employs a theme-aware routing mechanism to adaptively learn the latent themes of user sequences and organizes these sequences into multiple coherent subsequences. Each subsequence contains only sessions aligned with a specific theme, thereby effectively filtering out irrelevant or even misleading information introduced by user interest shifts in session hopping. In addition, to alleviate potential information loss, we introduce a multi-scale fusion mechanism, which leverages three types of experts to capture global sequence characteristics, short-term user behaviors, and theme-specific semantic patterns. Together, these two mechanisms endow MoS with the ability to deliver accurate recommendations from multi-faceted and multi-scale perspectives. Experimental results demonstrate that MoS consistently achieves the SOTA performance while introducing fewer FLOPs compared with other MoE counterparts, providing strong evidence of its excellent balance between utility and efficiency. The code is available at this https URL.
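主题感知路由的核心是把每个会话分配给最相近的潜在主题,从而切出若干语义一致的子序列、过滤跨主题噪声。以下为基于点积相似度的最小示意(主题向量与会话嵌入均为玩具数据,并非 MoS 实现):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def route_sessions(sessions, themes):
    """主题感知路由示意:每个会话分到点积最大的主题,
    得到按主题组织的子序列(保持会话原有顺序)。"""
    subseqs = {t: [] for t in range(len(themes))}
    for sid, emb in sessions:
        theme = max(range(len(themes)), key=lambda t: dot(emb, themes[t]))
        subseqs[theme].append(sid)
    return subseqs

# 假设两个潜在主题(如运动 / 美食)与四个会话的嵌入
themes = [[1.0, 0.0], [0.0, 1.0]]
sessions = [("s1", [0.9, 0.1]), ("s2", [0.2, 0.8]),
            ("s3", [0.7, 0.3]), ("s4", [0.1, 0.9])]
subseqs = route_sessions(sessions, themes)
```

真实的 MoS 中主题是端到端学习的,这里的固定主题向量只为说明"会话跳跃"如何被拆分吸收。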

[IR-20] DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

【速读】:该论文旨在解决自主“AI科学家”系统在生成完整科研论文时面临的瓶颈问题——即难以自动生成符合出版标准的科学图表(如 teaser figure),这类图表作为战略性的视觉接口,需将复杂的逻辑流程转化为具有引导性和启发性的图形表达,而现有系统往往忽略此环节或采用低质量替代方案。解决方案的关键在于提出 DiagramBank,一个包含 89,422 张从顶级期刊中提取的示意图数据集,其通过自动化管道实现图与文本引用的关联,并利用 CLIP-based 过滤器区分示意图与常规数据图或自然图像;每个实例均附带从摘要到图注及参考文献的多粒度上下文信息,支持跨模态检索与示例驱动的图生成,从而为 teaser figure 的合成提供高质量、可检索的语义资源与生成框架。

链接: https://arxiv.org/abs/2604.20857
作者: Tingwen Zhang,Ling Yue,Zhen Xu,Shaowu Pan
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); University of Chicago (芝加哥大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:Recent advances in autonomous “AI scientist” systems have demonstrated the ability to automatically write scientific manuscripts and codes with execution. However, producing a publication-grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the “end-to-end” paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at this https URL with code at this https URL.

[IR-21] CRED-1: An Open Multi-Signal Domain Credibility Dataset for Automated Pre-Bunking of Online Misinformation

【速读】:该论文旨在解决网络信息可信度评估的自动化与隐私保护问题,尤其针对虚假信息在内容分发阶段的早期识别与干预。其解决方案的关键在于构建了一个开放、可复现的领域级可信度数据集CRED-1,该数据集整合了两个开源源列表与四个计算得到的增强信号(包括域名年龄、网站流行度、事实核查频率和威胁情报),并为2,672个域名分配了0.0至1.0之间的综合可信度评分。该方案支持在客户端部署于浏览器扩展中,实现无需上传用户数据的本地化“预辟谣”(pre-bunking)机制,从而在内容交付阶段即对误导性内容进行拦截,同时确保隐私安全。整个流程使用Python标准库实现,完全基于公开来源可复现。

链接: https://arxiv.org/abs/2604.20856
作者: Alexander Loth,Martin Kappes,Marc-Oliver Pahl
机构: Frankfurt University of Applied Sciences (法兰克福应用技术大学); IMT Atlantique (IMT大西洋); Chaire Cyber CNI (网络与信息安全讲席)
类目: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注: 9 pages, 3 tables. Submitted to Data in Brief (Elsevier). Dataset: this https URL

点击查看摘要

Abstract:This article presents CRED-1, an open, reproducible domain-level credibility dataset combining two openly-licensed source lists (this http URL and this http URL) with four computed enrichment signals: domain age (WHOIS/RDAP), web popularity (Tranco Top-1M), fact-check frequency (Google Fact Check Tools API), and threat intelligence (Google Safe Browsing API). The dataset covers 2,672 domains categorized as fake, unreliable, mixed, conspiracy, or satire, each assigned a composite credibility score between 0.0 and 1.0. CRED-1 is designed for on-device deployment in privacy-preserving browser extensions to enable client-side pre-bunking of misinformation at the content delivery stage. The entire pipeline is implemented in Python using only standard library modules and is fully reproducible from publicly available sources. The dataset and pipeline code are released under CC BY 4.0 and archived on Zenodo.
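摘要称综合评分由类别基线与四类增强信号组合并截断到 [0,1],且流水线仅用 Python 标准库。以下给出一个示意性的加权组合(权重与各类别基线分值纯属示例,数据集的实际公式未在摘要中给出;假设各信号已归一化到 [0,1]):

```python
def credibility_score(category, signals, weights=None):
    """综合可信度评分示意:类别基线 + 加权增强信号,截断到 [0, 1]。
    signals 的键对应摘要中的四类信号;缺失信号按 0 处理。"""
    base = {"fake": 0.05, "unreliable": 0.2, "mixed": 0.5,
            "conspiracy": 0.1, "satire": 0.4}[category]          # 示例基线
    w = weights or {"domain_age": 0.1, "popularity": 0.1,
                    "fact_checks": -0.1, "threat": -0.2}          # 示例权重
    score = base + sum(w[k] * signals.get(k, 0.0) for k in w)
    return max(0.0, min(1.0, score))

# 老域名、中等流行度、无事实核查记录与威胁情报的 "mixed" 域名
s = credibility_score("mixed", {"domain_age": 1.0, "popularity": 0.5,
                                "fact_checks": 0.0, "threat": 0.0})
```

负权重信号(事实核查频次、威胁情报)把分数往下拉,符合"被频繁核查或命中威胁列表 → 可信度低"的直觉。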

[IR-22] ERA: Evidence-based Reliability Alignment for Honest Retrieval-Augmented Generation

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因内部参数知识与外部检索信息之间存在冲突而导致的可靠性问题,特别是现有基于标量置信度的方法无法明确区分认知不确定性(epistemic uncertainty)与数据固有模糊性(aleatoric uncertainty)的局限。其解决方案的关键在于提出一种名为ERA(Evidence-based Reliability Alignment)的新框架,通过将置信度估计从标量概率转换为显式的证据分布来增强系统的拒答行为;具体包括两个核心组件:(1) 上下文证据量化,利用狄利克雷分布建模内部与外部知识作为独立的信任质量(belief masses);(2) 知识冲突量化,借助Dempster-Shafer理论(DST)严格测量不同信息源之间的几何不一致程度,从而解耦两类不确定性并据此调节优化目标,实现更优的答案覆盖率与拒答策略之间的平衡。

链接: https://arxiv.org/abs/2604.20854
作者: Sunguk Shin,Meeyoung Cha,Byung-Jun Lee,Sungwon Park
机构: Korea University (韩国大学); MPI-SP (马克斯·普朗克研究所-安全与隐私); KAIST (韩国科学技术院); Gauss Labs Inc. (高斯实验室公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) grounds language models in factual evidence but introduces critical challenges regarding knowledge conflicts between internalized parameters and retrieved information. However, existing reliability methods, typically relying on scalar confidence, fail to explicitly distinguish between epistemic uncertainty and inherent data ambiguity in such hybrid scenarios. In this paper, we propose a new framework called ERA (Evidence-based Reliability Alignment) to enhance abstention behavior in RAG systems by shifting confidence estimation from scalar probabilities to explicit evidence distributions. Our method consists of two main components: (1) Contextual Evidence Quantification, which models internal and external knowledge as independent belief masses via the Dirichlet distribution, and (2) Quantifying Knowledge Conflict, which leverages Dempster-Shafer Theory (DST) to rigorously measure the geometric discordance between information sources. These components are used to disentangle epistemic uncertainty from aleatoric uncertainty and modulate the optimization objective based on detected conflicts. Experiments on standard benchmarks and a curated generalization dataset demonstrate that our approach significantly outperforms baselines, optimizing the trade-off between answer coverage and abstention with superior calibration.
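摘要中用 Dempster-Shafer 理论度量内外知识的冲突。经典的 DST 冲突度 K 定义为两组信任质量中焦元交集为空的质量乘积之和,可最小化实现如下(论域与质量分配为玩具示例,非 ERA 原始代码):

```python
def ds_conflict(m1, m2):
    """Dempster-Shafer 冲突度 K = Σ_{A∩B=∅} m1(A)·m2(B)。
    焦元用 frozenset 表示,信任质量用 dict 表示。"""
    return sum(p * q for A, p in m1.items() for B, q in m2.items()
               if not (A & B))

# 论域 {a, b}:内部参数知识偏向 a,外部检索证据偏向 b
internal = {frozenset("a"): 0.7, frozenset("ab"): 0.3}
external = {frozenset("b"): 0.6, frozenset("ab"): 0.4}
K = ds_conflict(internal, external)
```

K 越大说明两信息源越不相容;ERA 据此调节优化目标、决定是否拒答。与自身比较时焦元两两相交,K 为 0。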

[IR-23] A Systematic Study of Biomedical Retrieval Pipeline Trade-offs in Performance and Efficiency

【速读】:该论文旨在解决生物医学与临床自然语言处理应用中检索系统设计缺乏实用指导的问题。其核心解决方案在于通过实证研究揭示检索管道设计选择(如语料库聚合、分块粒度和向量索引配置)对大规模性能与效率的影响,识别出在图结构索引(HNSW)和FAISS索引下,MedRAG/pubmed作为单一语料库在检索质量与速度之间达到帕累托最优,并提出适合不同场景的分块策略与索引配置以实现最佳性能权衡。

链接: https://arxiv.org/abs/2604.20853
作者: Hayk Stepanyan,Matthew McDermott
机构: Columbia University (哥伦比亚大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval systems are increasingly used in biomedical and clinical natural language processing applications, yet practical guidance for researchers building such systems is limited. In this work, we provide such guidance through an empirical study of how retrieval pipeline design choices affect performance and efficiency at scale. In particular, we examine retrieval over a variety of existing, public biomedical text datasets, leveraging a variety of disparate types of queries, including exam-style questions, conversational medical queries, community-asked questions, and non-question formulations across various retrieval pipeline settings spanning corpus selection, chunk granularity, and vector index configuration. Retrieval results are judged using a robust, win-rate comparison assessment via an LLM-as-a-judge setting with human validation. Across these experiments, we identify several points of concrete guidance for reviewers, including the superiority of corpus aggregation for absolute retrieval quality, and the emergence of MedRAG/pubmed as the Pareto-optimal singleton corpus under graph-based (HNSW) indexing, appropriate chunking strategies, and FAISS indexing choices that offer the best trade-offs in speed and efficiency.
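分块粒度是文中考察的设计维度之一。固定窗口加重叠的按词分块可以这样实现(窗口与重叠大小为示例参数,实际系统多按 token 计数;假设 overlap 小于 size):

```python
def chunk(text, size=5, overlap=2):
    """按词切分的滑动窗口分块:每块 size 个词,相邻块重叠 overlap 个词。"""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        chunks.append(" ".join(piece))
        if start + size >= len(words):   # 已覆盖文末,停止
            break
    return chunks

doc = "one two three four five six seven eight"
chunks = chunk(doc, size=5, overlap=2)
```

窗口越小,证据定位越精准但上下文越碎;重叠缓解句子被切断的问题,代价是索引体积增大,这正是文中权衡的一个侧面。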

[IR-24] DenoiseRank: Learning to Rank by Diffusion Models

【速读】:该论文旨在解决传统学习排序(Learning to Rank, LTR)模型大多基于判别式视角建模的局限性,提出从生成式视角重新构建LTR任务。其解决方案的关键在于设计了一种新颖的去噪排序模型(DenoiseRank),该模型借鉴扩散模型的思想,在前向扩散过程中对相关标签引入噪声,并在反向去噪过程中逐步恢复出查询文档的精确标签分布,从而实现对排序概率分布的建模与预测。这是首个将生成式方法应用于传统LTR任务的工作,为生成式学习排序提供了新的范式和基准。

链接: https://arxiv.org/abs/2604.20852
作者: Ying Wang,Preslav Nakov,Shangsong Liang
机构: Sun Yat-sen University (中山大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning to rank (LTR) is one of the core tasks in Machine Learning. Traditional LTR models have made great progress, but nearly all of them are implemented from a discriminative perspective. In this paper, we aim at addressing LTR from a novel perspective, i.e., by a deep generative model. Specifically, we propose a novel denoise rank model, DenoiseRank, which noises the relevant labels in the diffusion process and denoises them on the query documents in the reverse process to accurately predict their distribution. Our model is the first to address traditional LTR from a generative perspective and is a diffusion method for LTR. Our extensive experiments on benchmark datasets demonstrated the effectiveness of DenoiseRank, and we believe it provides a benchmark for the generative LTR task.
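扩散模型对标签加噪通常采用闭式前向公式 y_t = √ᾱ_t·y₀ + √(1−ᾱ_t)·ε。以下示意如何对相关性标签向量加噪(为演示方便把噪声 ε 作为参数传入,实际为高斯采样;并非 DenoiseRank 原始代码):

```python
import math

def forward_noise(y0, alpha_bar, eps):
    """前向扩散到时刻 t 的闭式:y_t = sqrt(ā_t)·y0 + sqrt(1-ā_t)·ε。
    y0 为查询下各文档的相关性标签(1=相关,0=不相关)。"""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * y + b * e for y, e in zip(y0, eps)]

y0 = [1.0, 0.0, 1.0]
eps = [0.5, -0.5, 0.0]
y_t = forward_noise(y0, alpha_bar=0.64, eps=eps)
```

反向过程则训练网络以查询文档特征为条件,从噪声标签中逐步恢复 y₀ 的分布;ᾱ_t=1 时退化为无噪的原标签。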

[IR-25] Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts ICLR2026

【速读】:该论文旨在解决视频-文本检索(Video-Text Retrieval, VTR)模型在真实场景中因查询分布偏移(query shift)而导致性能显著下降的问题。现有基于图像的鲁棒性方法无法有效应对视频数据中复杂的时空动态变化,导致模型在测试时对查询分布变化敏感。解决方案的关键在于提出HAT-VTR(Hubness Alleviation for Test-time Video-Text Retrieval),其核心机制包括:1)引入“Hubness抑制记忆”(Hubness Suppression Memory)以修正相似度分数,缓解因查询偏移加剧的hubness现象(即少数检索项成为高频匹配对象);2)设计多粒度损失函数,强化时间维度上的特征一致性,从而提升模型在测试时的鲁棒性和可靠性。

链接: https://arxiv.org/abs/2604.20851
作者: Bingqing Zhang,Zhuo Cao,Heming Du,Yang Li,Xue Li,Jiajun Liu,Sen Wang
机构: The University of Queensland (昆士兰大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR2026

点击查看摘要

Abstract:Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant “hubs” that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios, and enhancing model reliability for real-world applications.
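摘要指出查询偏移会放大 hubness:少数库内条目出现在过多查询的 top-k 中。hubness 可以用 k-occurrence 统计直接观测(相似度矩阵为玩具数据,非论文代码):

```python
def k_occurrence(sim, k=1):
    """统计每个库内条目落入各查询 top-k 的次数。
    sim[i][j] 为查询 i 与库条目 j 的相似度;计数分布越偏斜,hubness 越严重。"""
    n_gallery = len(sim[0])
    counts = [0] * n_gallery
    for row in sim:
        topk = sorted(range(n_gallery), key=lambda j: row[j], reverse=True)[:k]
        for j in topk:
            counts[j] += 1
    return counts

# 3 个查询、3 个库条目;条目 0 成为 “hub”,吸走所有查询
sim = [[0.9, 0.2, 0.1],
       [0.8, 0.3, 0.2],
       [0.7, 0.1, 0.6]]
counts = k_occurrence(sim, k=1)
```

HAT-VTR 的 Hubness Suppression Memory 所做的,直观上就是修正相似度分数、把这种计数分布"压平"。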

[IR-26] Association Is Not Similarity: Learning Corpus-Specific Associations for Multi-Hop Retrieval

【速读】:该论文旨在解决密集检索(Dense Retrieval)系统在处理多跳问题(Multi-hop Questions)时的局限性,即传统方法依赖嵌入相似度排序,难以捕捉通过共享推理链关联的文档间复杂语义关系。解决方案的关键在于提出一种轻量级的归纳重排序方法——关联增强检索(Association-Augmented Retrieval, AAR),其核心是训练一个仅含420万参数的小型多层感知机(MLP),利用共现标注数据通过对比学习(Contrastive Learning)在嵌入空间中学习文档间的关联关系;推理时采用双向关联评分对初始密集检索候选集进行重排序,从而显著提升难例问题的召回率,且无需LLM辅助索引或评估集调参。

链接: https://arxiv.org/abs/2604.20850
作者: Jason Dury
机构: Independent Researcher
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 7 appendices, 10 tables. Code: this https URL

点击查看摘要

Abstract:Dense retrieval systems rank passages by embedding similarity to a query, but multi-hop questions require passages that are associatively related through shared reasoning chains. We introduce Association-Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co-occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi-directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation-set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training-split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus-specific co-occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non-associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM-based indexing.
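AAR 的重排可以概括为"基础相似度 + 对称化的双向关联分"。以下用查表代替 4.2M 参数的 MLP 关联打分(分数与候选均为示意假设,非论文实现):

```python
def rerank(query_base, assoc, candidates):
    """双向关联重排示意:最终分 = 基础相似度 + 双向关联分的平均。
    assoc[(x, y)] 模拟学到的有向关联分,缺省为 0。"""
    def score(c):
        bi = 0.5 * (assoc.get(("q", c), 0.0) + assoc.get((c, "q"), 0.0))
        return query_base[c] + bi
    return sorted(candidates, key=score, reverse=True)

# 密集检索给出的基础相似度;p3 是多跳桥接段落,基础分低
query_base = {"p1": 0.9, "p2": 0.6, "p3": 0.5}
# 但 p3 与查询存在强双向关联(由共现标注学得,此处假设)
assoc = {("q", "p3"): 0.8, ("p3", "q"): 0.6}
order = rerank(query_base, assoc, ["p1", "p2", "p3"])
```

这正对应摘要的发现:收益集中在密集基线失败的难例上,因为关联分补上了嵌入相似度捕捉不到的推理链关系。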

[IR-27] SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

【速读】:该论文旨在解决半结构化文档(如HTML)在检索增强生成(Retrieval-Augmented Generation, RAG)中因文档结构与当前嵌入和生成模型的扁平序列接口不匹配而导致的问题。传统检索管道通常将文档线性切分为固定大小的块进行索引,这会破坏章节结构、列表和表格等语义信息,导致难以返回小而可引用的证据片段,同时又无法保留使其可解释的上下文。其解决方案的关键在于提出一种结构感知的检索流水线,以树状结构文档为基础,将候选内容表示为“子文档”(subdocuments)——即保留结构身份的可定位选择,并延迟对周围上下文的选择。该方法定义了路径与路径集合、剪枝提取子文档及两种情境化机制(全局情境化用于添加非局部支撑信息如标题和结构标记,局部情境化则在结构邻域内扩展种子选择以获得紧凑且富含上下文的视图),并结合基于嵌入的候选生成器与查询时的文档感知聚合步骤,实现高效且高质量的结构保留式检索。

链接: https://arxiv.org/abs/2604.20849
作者: Mike Rainey,Umut Acar,Muhammed Sezer
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today’s embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives–paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability. 
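"剪枝提取子文档"的思路是:给定选中的路径集合,仅保留位于这些路径或其祖先上的节点,从而在缩小证据范围的同时保留结构身份。最小示意如下(树用嵌套 dict 表示,为假设性实现,非 SPIRE 代码):

```python
def prune(node, keep_paths, path=()):
    """按路径集合剪枝:保留选中路径上的节点及其祖先,其余删去。
    keep_paths 为根到目标节点的 id 元组集合。"""
    here = path + (node["id"],)
    kept_children = [c for c in (prune(ch, keep_paths, here)
                                 for ch in node.get("children", []))
                     if c is not None]
    on_path = any(p[:len(here)] == here for p in keep_paths)
    if not on_path and not kept_children:
        return None
    return {"id": node["id"], "children": kept_children}

doc = {"id": "html", "children": [
    {"id": "sec1", "children": [{"id": "p1", "children": []}]},
    {"id": "sec2", "children": [{"id": "p2", "children": []},
                                {"id": "table1", "children": []}]},
]}
sub = prune(doc, keep_paths={("html", "sec2", "p2")})
```

得到的子文档仍是一棵合法的树:章节层级保留,兄弟噪声(如 table1)被剪掉;随后的"情境化"再按预算补回必要的周边结构。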

[IR-28] MATRAG : Multi-Agent Transparent Retrieval-Augmented Generation for Explainable Recommendations

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的推荐系统在透明性(transparency)、知识 grounding 以及生成连贯解释能力方面的关键挑战,这些问题限制了用户对推荐结果的信任。解决方案的核心在于提出 MATRAG(Multi-Agent Transparent Retrieval-Augmented Generation)框架,其创新性地结合多智能体协作与知识图谱增强的检索机制:通过四个专业化智能体——用户建模智能体、物品分析智能体、推理智能体和解释生成智能体——协同完成从偏好建模到可解释推荐的全流程;同时引入透明度评分机制量化解释的忠实性和相关性,从而在多个基准数据集上显著提升推荐准确率(如 Hit Rate 提升 12.7%,NDCG 提升 15.3%),并获得领域专家对解释质量的高评价(87.4% 被评为有用且可信)。

链接: https://arxiv.org/abs/2604.20848
作者: Sushant Mehta
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based recommendation systems have demonstrated remarkable capabilities in understanding user preferences and generating personalized suggestions. However, existing approaches face critical challenges in transparency, knowledge grounding, and the ability to provide coherent explanations that foster user trust. We introduce MATRAG (Multi-Agent Transparent Retrieval-Augmented Generation), a novel framework that combines multi-agent collaboration with knowledge graph-augmented retrieval to deliver explainable recommendations. MATRAG employs four specialized agents: a User Modeling Agent that constructs dynamic preference profiles, an Item Analysis Agent that extracts semantic features from knowledge graphs, a Reasoning Agent that synthesizes collaborative and content-based signals, and an Explanation Agent that generates natural language justifications grounded in retrieved knowledge. Our framework incorporates a transparency scoring mechanism that quantifies explanation faithfulness and relevance. Extensive experiments on three benchmark datasets (Amazon Reviews, MovieLens-1M, and Yelp) demonstrate that MATRAG achieves state-of-the-art performance, improving recommendation accuracy by 12.7% (Hit Rate) and 15.3% (NDCG) over leading baselines, while human evaluation confirms that 87.4% of generated explanations are rated as helpful and trustworthy by domain experts. Our work establishes new benchmarks for transparent, agentic recommendation systems and provides actionable insights for deploying LLM-based recommenders in production environments.
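摘要提到的透明度评分机制未给出公式;下面是一个以词面重叠近似"解释论断是否被检索知识支持"的忠实度打分草图(重叠阈值等均为简化假设,并非 MATRAG 实现):

```python
def faithfulness(explanation_claims, retrieved_facts):
    """忠实度示意:解释中被检索事实支持的论断占比。
    这里以"与某条事实至少共享 2 个词"近似"被支持"(粗糙代理)。"""
    def supported(claim):
        c = set(claim.lower().split())
        return any(len(c & set(f.lower().split())) >= 2
                   for f in retrieved_facts)
    if not explanation_claims:
        return 0.0
    return sum(supported(c) for c in explanation_claims) / len(explanation_claims)

claims = ["inception is a sci-fi film", "you rated sci-fi films highly"]
facts = ["inception genre sci-fi film 2010", "user history shows horror"]
score = faithfulness(claims, facts)
```

第二条论断缺乏检索依据,得分被拉低;生产系统通常会用蕴含模型替代词面重叠来判断"支持"。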

[IR-29] Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

【速读】:该论文旨在解决现有音乐推荐系统(Music Recommendation Systems, MRSs)在冷启动场景下性能不佳的问题,其根源在于传统模型主要依赖协同过滤(Collaborative Filtering),未能有效利用音频内容特征;同时,现有数据集缺乏丰富的多模态信息(如原始音频信号和描述性文本元数据),且评估框架无法充分支持多模态算法。解决方案的关键在于提出TASTE——一个整合音频与文本模态的综合性数据集与基准测试框架,并引入MuQ-token方法,实现多层音频特征的高效融合,从而显著提升候选召回和点击率预测(CTR)等任务的表现,验证了内容驱动方法的有效性,并为后续研究提供了可复用的多模态基础。

链接: https://arxiv.org/abs/2604.20847
作者: Yizhi Zhou,Jia-Qi Yang,De-Chuan Zhan,Da-Wei Zhou
机构: School of Artificial Intelligence, Nanjing University (南京大学人工智能学院); National Key Laboratory for Novel Software Technology, Nanjing University (南京大学软件新技术国家重点实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. However, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large-scale self-supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and CTR. In addition, we introduce the \textbfMuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content-driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at this https URL
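MuQ-token 所做的"多层音频特征高效融合",最直接的基线形式是对编码器各层表示做可学习权重的 softmax 加权求和,示意如下(权重与特征均为玩具数据,具体融合方式以论文为准):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def aggregate_layers(layer_feats, layer_logits):
    """多层特征聚合示意:layer_feats[l] 是第 l 层特征向量,
    layer_logits 为(假设的)可学习层权重,softmax 后加权求和。"""
    w = softmax(layer_logits)
    dim = len(layer_feats[0])
    return [sum(w[l] * layer_feats[l][d] for l in range(len(layer_feats)))
            for d in range(dim)]

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 层、2 维(玩具)
agg = aggregate_layers(feats, [0.0, 0.0, 0.0])    # 等权 → 逐维平均
```

不同下游任务(召回 vs CTR)可以学到不同的层权重,这正是多层融合优于只取末层的原因。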

[IR-30] ADS-POI: Agent ic Spatiotemporal State Decomposition for Next Point-of-Interest Recommendation

【速读】:该论文旨在解决现有下一兴趣点(Next Point-of-Interest, Next POI)推荐方法中用户移动行为建模的局限性问题,即传统方法将用户历史行为压缩为单一潜在表示,导致不同行为因素(如日常出行模式、短期意图和时间规律性)相互纠缠,限制了状态演化灵活性并削弱模型在多样化决策场景下的适应能力。解决方案的关键在于提出ADS-POI框架,该框架通过多平行演化的潜在子状态(latent sub-states)来表征用户行为,每个子状态具有独立的时空转移动态,并借助上下文条件机制选择性聚合以形成最终决策状态,从而实现不同行为成分在不同时间和空间尺度上的差异化演化,同时保持在当前时空情境下的协同一致性。

链接: https://arxiv.org/abs/2604.20846
作者: Zhenyu Yu,Chunlei Meng,Yangchen Zeng,Mohd Yamani Idna Idris,Shuigeng Zhou
机构: Fudan University (复旦大学); Southeast University (东南大学); University of Malaya (马来亚大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Next point-of-interest (POI) recommendation requires modeling user mobility as a spatiotemporal sequence, where different behavioral factors may evolve at different temporal and spatial scales. Most existing methods compress a user’s history into a single latent representation, which tends to entangle heterogeneous signals such as routine mobility patterns, short-term intent, and temporal regularities. This entanglement limits the flexibility of state evolution and reduces the model’s ability to adapt to diverse decision contexts. We propose ADS-POI, a spatiotemporal state decomposition framework for next POI recommendation. ADS-POI represents a user with multiple parallel evolving latent sub-states, each governed by its own spatiotemporal transition dynamics. These sub-states are selectively aggregated through a context-conditioned mechanism to form the decision state used for prediction. This design enables different behavioral components to evolve at different rates while remaining coordinated under the current spatiotemporal context. Extensive experiments on three real-world benchmark datasets from Foursquare and Gowalla demonstrate that ADS-POI consistently outperforms strong state-of-the-art baselines under a full-ranking evaluation protocol. The results show that decomposing user behavior into multiple spatiotemporally aware states leads to more effective and robust next POI recommendation. Our code is available at this https URL.
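多个并行演化的子状态如何"按上下文选择性聚合",可以用一个上下文条件门控来示意:对每个子状态的键与当前时空上下文做点积,softmax 后加权求和得到决策状态(键与子状态均为虚构示例,非 ADS-POI 实现):

```python
import math

def gate(context, keys):
    """上下文条件门控:context 与每个子状态键做点积,softmax 得权重。"""
    logits = [sum(c * k for c, k in zip(context, key)) for key in keys]
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [v / s for v in e]

def decision_state(context, keys, substates):
    """按门控权重聚合多个并行演化的潜在子状态,得到用于预测的决策状态。"""
    w = gate(context, keys)
    dim = len(substates[0])
    return [sum(w[i] * substates[i][d] for i in range(len(substates)))
            for d in range(dim)]

# 两个子状态:工作日通勤模式 vs 周末休闲模式(玩具数据)
keys = [[1.0, 0.0], [0.0, 1.0]]
substates = [[10.0, 0.0], [0.0, 10.0]]
weekday = decision_state([5.0, 0.0], keys, substates)  # 上下文偏向子状态 0
```

各子状态可按各自的时空动态独立更新,门控只负责在当前情境下"读哪一个",这正是解耦带来的灵活性。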

[IR-31] CaST-POI: Candidate-Conditioned Spatiotemporal Modeling for Next POI Recommendation

【速读】:该论文旨在解决传统基于位置的服务中下一兴趣点(Next Point-of-Interest, Next POI)推荐方法存在的局限性问题,即现有方法通常从用户历史轨迹中计算单一用户表征,并以此统一评分所有候选兴趣点(POI),忽略了不同候选POI与用户历史访问之间的条件依赖关系。解决方案的关键在于提出一种候选条件感知的时空模型CaST-POI,其核心创新包括:(1)设计候选条件序列读取器(candidate-conditioned sequence reader),以候选POI作为查询动态关注用户历史轨迹;(2)引入候选相对的时间和空间偏置项(candidate-relative temporal and spatial biases),捕捉历史访问与每个候选POI之间细粒度的移动模式。实验表明,该方法在多个基准数据集上显著优于当前最优方法,尤其在大规模候选池场景下优势明显。

链接: https://arxiv.org/abs/2604.20845
作者: Zhenyu Yu,Chunlei Meng,Yangchen Zeng,Mohd Yamani Idna Idris,Shuigeng Zhou
机构: Fudan University (复旦大学); Southeast University (东南大学); University of Malaya (马来亚大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Next Point-of-Interest (POI) recommendation plays a crucial role in location-based services by predicting users’ future mobility patterns. Existing methods typically compute a single user representation from historical trajectories and use it to score all candidate POIs uniformly. However, this candidate-agnostic paradigm overlooks that the relevance of historical visits inherently depends on which candidate is being evaluated. In this paper, we propose CaST-POI, a candidate-conditioned spatiotemporal model for next POI recommendation. Our key insight is that the same user history should be interpreted differently when evaluating different candidate POIs. CaST-POI employs a candidate-conditioned sequence reader that uses candidates as queries to dynamically attend to user history. In addition, we introduce candidate-relative temporal and spatial biases to capture fine-grained mobility patterns based on the relationships between historical visits and each candidate POI. Extensive experiments on three benchmark datasets demonstrate that CaST-POI consistently outperforms state-of-the-art methods, yielding substantial improvements across multiple evaluation metrics, with particularly strong advantages under large candidate pools. Code is available at this https URL.
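下面用 NumPy 勾勒“候选条件读取 + 候选相对时空偏置”的核心计算(仅为示意,非论文实现;偏置系数、嵌入维度与随机数据均为假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def candidate_score(hist_emb, hist_t, hist_xy, cand_emb, cand_xy, now_t):
    """以候选 POI 为 query 读取历史,并加入候选相对的时间/空间偏置"""
    sim = hist_emb @ cand_emb                         # 语义相关性 (L,)
    dt = np.abs(now_t - hist_t)                       # 候选相对时间偏置的输入
    dist = np.linalg.norm(hist_xy - cand_xy, axis=1)  # 历史访问到该候选的距离
    attn = softmax(sim - 0.1 * dt - 0.5 * dist)       # 时间/空间越近权重越大
    reading = attn @ hist_emb                         # 该候选条件下的历史读取
    return float(reading @ cand_emb)

rng = np.random.default_rng(1)
L, d = 6, 4
hist_emb = rng.normal(size=(L, d))
hist_t = np.arange(L, dtype=float)                    # 各次历史访问的时间戳(小时)
hist_xy = rng.normal(size=(L, 2))
cand_embs = rng.normal(size=(3, d))
cand_xys = rng.normal(size=(3, 2))
scores = [candidate_score(hist_emb, hist_t, hist_xy, e, xy, now_t=6.0)
          for e, xy in zip(cand_embs, cand_xys)]
ranking = list(np.argsort(scores)[::-1])              # 对不同候选给出不同解读与打分
```

同一段历史在不同候选的 query 下会产生不同的注意力分布,这正是候选条件范式与候选无关范式的区别所在。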

[IR-32] AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation


【速读】:该论文旨在解决当前GraphRAG方法中因将文本块(text chunks)作为知识表示的基本单元而导致的灵活性不足问题,以及基于三元组的实体链接对关系抽取错误敏感、易引发推理路径缺失或错误进而降低检索准确性的缺陷。其解决方案的关键在于提出一种名为Atom-Entity Graph的新颖知识表征与索引架构:将知识存储为独立且自包含的知识原子(knowledge atoms),而非粗粒度的文本块,从而实现知识元素的灵活重组并适配多样化的查询视角;同时,通过简化实体间边的定义(仅表示关系是否存在),结合个性化PageRank与基于相关性的过滤机制,有效提升实体连接的准确性与推理可靠性。

链接: https://arxiv.org/abs/2604.20844
作者: Yanning Hou,Duanyang Yuan,Sihang Zhou,Xiaoshu Chen,Ke Liang,Siwei Wang,Xinwang Liu,Jian Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent GraphRAG methods integrate graph structures into text indexing and retrieval, using knowledge graph triples to connect text chunks, thereby improving retrieval coverage and precision. However, we observe that treating text chunks as the basic unit of knowledge representation rigidly groups multiple atomic facts together, limiting the flexibility and adaptability needed to support diverse retrieval scenarios. Additionally, triple-based entity linking is sensitive to relation-extraction errors, which can lead to missing or incorrect reasoning paths and ultimately hurt retrieval accuracy. To address these issues, we propose the Atom-Entity Graph, a more precise and reliable architecture for knowledge representation and indexing. In our approach, knowledge is stored as knowledge atoms, namely individual, self-contained units of factual information, rather than coarse-grained text chunks. This allows knowledge elements to be flexibly reassembled without mutual interference, thereby enabling seamless alignment with diverse query perspectives. Edges between entities simply indicate whether a relationship exists. By combining personalized PageRank with relevance-based filtering, we maintain accurate entity connections and improve the reliability of reasoning. Theoretical analysis and experiments on five public benchmarks show that the proposed AtomicRAG algorithm outperforms strong RAG baselines in retrieval accuracy and reasoning robustness. Code: this https URL.
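其中的个性化 PageRank 部分可以用幂迭代简单示意(图结构、过滤阈值与重启系数均为本文假设,论文中的具体过滤细节以原文为准):

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """幂迭代个性化 PageRank:以查询命中的实体作为重启(teleport)分布"""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)  # 行随机转移矩阵
    r = np.zeros(n)
    r[seeds] = 1.0 / len(seeds)
    p = r.copy()
    for _ in range(iters):
        p = alpha * r + (1 - alpha) * (P.T @ p)
    return p

# 5 个实体的无权图:边只表示“关系是否存在”,与论文的设定一致
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
scores = personalized_pagerank(adj, seeds=[0])        # 假设查询命中实体 0
kept = [i for i, s in enumerate(scores) if s > 0.05]  # 基于相关性的过滤(阈值仅为示意)
```

得分向量在每步迭代中保持归一化,重启节点及其近邻得分最高,其挂接的知识原子即被保留用于检索。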

人机交互

[HC-0] Gradual Voluntary Participation: A Framework for Participatory AI Governance in Journalism

【速读】:该论文旨在解决人工智能(AI)在新闻业中的集成对参与式设计(Participatory Design, PD)带来的挑战,特别是利益相关者影响力受限、工作场所认知偏差及组织动态失衡等问题。传统PD假设用户可主导技术塑造,但AI系统因数据不透明、架构固化和目标不可及而难以被干预。解决方案的关键在于提出“渐进自愿参与”(Gradual Voluntary Participation, GVP)框架,其核心是将参与重构为一种可操作于新闻室层面的渐进且自愿的过程,超越固定工作坊或一次性偏好采集活动;该框架通过深度与范围两个维度构建利益相关者映射矩阵,以缓解认识论负担、突破参与天花板并避免形式化咨询,从而在快速演化的混合工作环境中实现技术变革与利益相关者赋权之间的平衡。

链接: https://arxiv.org/abs/2604.21878
作者: Matilde Barbini,Stefano Sorrentino,Daniel Gatica-Perez
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The integration of AI into journalism challenges participatory design (PD), particularly with respect to stakeholder influence, workplace perceptions, and organizational dynamics. Traditional PD assumes that users can shape technologies, yet AI systems resist influence due to opaque data, fixed architectures, and inaccessible objectives. Through interviews with 10 journalists, we identify the perception gap, showing that trust in AI depends on perceived agency within workplace participatory workflows. Informed by these findings, we introduce the Gradual Voluntary Participation (GVP) framework in journalism and its five core principles, reconceptualizing participation as a gradual and voluntary process that can be operationalized at the newsroom level, beyond fixed workshops or one-time preference-elicitation campaigns. Addressing epistemic burdens, participatory ceilings, and performative consultations, GVP treats gradualism and voluntariness as design dimensions that shape perception, legitimacy, and ownership. Moving beyond unidimensional ladder metaphors and adopting a bidimensional matrix structure, the framework maps stakeholders across depth and scope, offering a new model for local participatory AI governance that balances technological transformation with stakeholder empowerment in rapidly evolving hybrid workplaces.

[HC-1] FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism

【速读】:该论文旨在解决人工智能(AI)在新闻实践中的广泛应用所引发的编辑权威重构问题,特别是如何在公平性(fairness)、问责制(accountability)和透明度(transparency)方面维持或重塑媒体机构的自主权。其核心问题是:当AI系统嵌入新闻生产流程时,决策权、认知正当性和责任归属如何被重新分配,进而影响新闻专业主义的可持续性。解决方案的关键在于识别出两种并发的权威迁移机制——内部迁移(即编辑判断逐步让位于大型语言模型LLMs)与外部迁移(即权力从新闻组织向平台和供应商转移),并提出参与式AI设计与部署作为潜在干预路径,强调通过实质性参与来重新分配权威,而非仅停留在形式上的“象征性参与”,从而防止问责缺失和透明度空心化。

链接: https://arxiv.org/abs/2604.21864
作者: Stefano Sorrentino,Matilde Barbini,Daniel Gatica-Perez
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: ACM FAccT 2026 accepted paper. Due to the arXiv 1920 characters limit the Abstract here is shortened. Refer to the full paper for the entire Abstract

点击查看摘要

Abstract:Building on recent interpretivist approaches, we conduct a critical narrative review across journalism studies, human-computer interaction, and FAccT scholarship, conceptualizing editorial authority as the conjunction of decision rights, epistemic warrant, and responsibility. We provide a comprehensive theoretical framework for addressing how concerns on fairness, accountability and transparency emerge, interact, and persist within AI mediated journalistic practice. We identify and describe two concurrent authority reconfigurations driven by AI adoption. First, an internal migration of authority, in which editorial judgment is progressively deferred to large language models (LLMs) embedded within newsroom workflows. This migration occurs not through explicit policy decisions, but through interactional, cognitive, and organizational mechanisms that legitimize AI generated outputs while obscuring responsibility and weakening individual and professional agency. Second, we analyze an external migration of authority, whereby decision making power shifts from news organizations toward platforms, vendors, and infrastructural providers that supply AI systems and distribution channels, exacerbating existing power asymmetries within the media ecosystem. Unaddressed, these reconfigurations risk rendering fairness hard to maintain, accountability difficult to assign and transparency performative. We examine participatory approaches to AI design and deployment in journalism as potential mechanisms for retaining or reclaiming editorial authority. We critically assess both their promise and their structural limitations, highlighting how participation can either meaningfully redistribute authority or function as a tokenistic practice that leaves underlying power relations intact.

[HC-2] GFlowState: Visualizing the Training of Generative Flow Networks Beyond the Reward

【速读】:该论文旨在解决生成式流网络(Generative Flow Networks, GFlowNets)训练过程中的可解释性问题。尽管GFlowNets在分子和材料发现等应用中表现出强大能力,但其训练动态难以理解,现有机器学习工具仅能追踪指标,无法揭示模型如何探索样本空间、构建采样轨迹或调整采样概率。解决方案的关键在于提出GFlowState系统,通过多个可视化视图——候选排名图表、状态投影、轨迹网络的节点-边图以及转移热力图——使开发者能够分析采样行为与策略演化,识别未充分探索区域及训练失败源,从而提升GFlowNets的可解释性并加速实际开发。

链接: https://arxiv.org/abs/2604.21830
作者: Florian Holeczek,Andreas Hinterreiter,Alex Hernandez-Garcia,Marc Streit,Christina Humer
机构: 未知
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We present GFlowState, a visual analytics system designed to illuminate the training process of Generative Flow Networks (GFlowNets or GFNs). GFlowNets are a probabilistic framework for generating samples proportionally to a reward function. While GFlowNets have proved to be powerful tools in applications such as molecule and material discovery, their training dynamics remain difficult to interpret. Standard machine learning tools allow metric tracking but do not reveal how models explore the sample space, construct sample trajectories, or shift sampling probabilities during training. Our solution, GFlowState, allows users to analyze sampling trajectories, compare the sample space relative to reference datasets, and analyze the training dynamics. To this end, we introduce multiple views, including a chart of candidate rankings, a state projection, a node-link diagram of the trajectory network, and a transition heatmap. These visualizations enable GFlowNet developers and users to investigate sampling behavior and policy evolution, and to identify underexplored regions and sources of training failure. Case studies demonstrate how the system supports debugging and assessing the quality of GFlowNets across application domains. By making the structural dynamics of GFlowNets observable, our work enhances their interpretability and can accelerate GFlowNet development in practice.
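以其中的 transition heatmap 视图为例,这类视图的底层统计可以非常直接地从采样轨迹得到(以下轨迹数据为虚构示意,非该系统的真实接口):

```python
from collections import Counter

# 假想的若干条 GFlowNet 采样轨迹(状态序列)
trajectories = [
    ["s0", "s1", "s3"],
    ["s0", "s2", "s3"],
    ["s0", "s1", "s3"],
]

# 统计相邻状态对的转移频次,即可渲染为 transition heatmap
counts = Counter()
for traj in trajectories:
    for a, b in zip(traj, traj[1:]):
        counts[(a, b)] += 1
```

在训练的不同阶段分别做此统计并对比,就能观察到采样概率向高奖励区域的迁移,以及始终未被访问的欠探索转移。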

[HC-3] Alignment has a Fantasia Problem

【速读】:该论文试图解决的问题是:当前生成式 AI(Generative AI)系统在设计上假设用户能够清晰表达目标和需求,但实际行为研究表明,用户往往在任务意图尚未明确时就与AI互动,导致AI输出看似有用却未必符合用户真实需求,这种现象被称为“Fantasia交互”。解决方案的关键在于重新定义对齐(alignment)研究范式——不再将用户视为理性或全知的“oracle”,而是让AI通过时间维度提供认知支持,主动协助用户形成和细化意图。这需要融合机器学习、界面设计与行为科学的跨学科方法,构建能帮助人类应对任务不确定性的人机协作机制。

链接: https://arxiv.org/abs/2604.21827
作者: Nathanael Jo,Zoe De Simone,Mitchell Gordon,Ashia Wilson
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Modern AI assistants are trained to follow instructions, implicitly assuming that users can clearly articulate their goals and the kind of assistance they need. Decades of behavioral research, however, show that people often engage with AI systems before their goals are fully formed. When AI systems treat prompts as complete expressions of intent, they can appear to be useful or convenient, but not necessarily aligned with the users’ needs. We call these failures Fantasia interactions. We argue that Fantasia interactions demand a rethinking of alignment research: rather than treating users as rational oracles, AI should provide cognitive support by actively helping users form and refine their intent through time. This requires an interdisciplinary approach that bridges machine learning, interface design, and behavioral science. We synthesize insights from these fields to characterize the mechanisms and failures of Fantasia interactions. We then show why existing interventions are insufficient, and propose a research agenda for designing and evaluating AI systems that better help humans navigate uncertainty in their tasks.

[HC-4] Who Defines “Best”? Towards Interactive User-Defined Evaluation of LLM Leaderboards

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)排行榜在评估过程中存在的片面性和缺乏用户导向的问题。现有排行榜通常依赖于单一聚合分数,未能反映模型在不同提示类型和组合下的多样化表现,且评价标准由基准设计者设定,难以匹配实际用户与组织的多元目标与约束。其解决方案的关键在于通过深入分析LMArena基准数据集的结构偏差,并设计一个交互式可视化界面作为“设计探针”,使用户能够自主选择并加权特定提示片段(prompt slices),从而动态探索模型排名的变化。这一方法提升了评估过程的透明度,并支持更具场景适配性的模型选择,为LLM排行榜的设计提供了新的范式。

链接: https://arxiv.org/abs/2604.21769
作者: Minji Jung,Minjae Lee,Yejin Kim,Sarang Choi,Minsuk Kahng
机构: Yonsei University (延世大学)
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted to the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2026)

点击查看摘要

Abstract:LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.
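这种“用户自定义评估优先级”的机制,本质上是对各 prompt 切片分数做用户加权聚合。以下为一个最小示意(模型名与分数均为虚构,非 LMArena 真实数据):

```python
# 各模型在不同 prompt 切片上的胜率(假想数据)
slice_scores = {
    "model_A": {"coding": 0.62, "writing": 0.55, "math": 0.48},
    "model_B": {"coding": 0.50, "writing": 0.60, "math": 0.58},
}

def rank(slice_scores, weights):
    """按用户自定义的切片权重重新聚合分数并排序"""
    total = sum(weights.values())
    agg = {m: sum(weights[s] * v for s, v in sc.items()) / total
           for m, sc in slice_scores.items()}
    return sorted(agg, key=agg.get, reverse=True)

# 偏重 coding 与偏重 math 的两类用户会得到不同的榜单
coding_first = rank(slice_scores, {"coding": 3, "writing": 1, "math": 1})
math_first = rank(slice_scores, {"coding": 1, "writing": 1, "math": 3})
```

两组权重给出相反的排名,说明单一聚合分数确实会掩盖切片间的差异。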

[HC-5] Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

【速读】:该论文旨在解决当前深度伪造(Deepfake)检测研究中普遍存在的可解释性不足问题,即现有基于深度学习的方法虽然在基准测试中表现优异,但难以揭示真实与伪造面部行为之间的本质差异。其解决方案的关键在于引入一种基于生物行为特征(bio-behavioral features)的可解释框架,通过提取面部动态中的低维运动模式并构建时序特征来表征时空结构;在此基础上,采用传统机器学习分类器进行检测,发现高阶时间不规则性是区分真假视频的核心线索,且情绪表达显著增强检测准确性——这表明伪造行为在情感驱动下的面部动态中留下更明显的“行为指纹”。此外,模型决策与人类感知的对比进一步揭示了二者在情绪视频上的一致性及策略差异,说明计算特征与人类直觉可形成互补而非冗余的检测路径。

链接: https://arxiv.org/abs/2604.21760
作者: Timothy Joseph Murphy,Jennifer Cook,Hélio Clemente José Cuve
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Main paper: 19 pages, 5 figures, 4 tables. SI Appendix: 11 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.
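论文中区分真假视频的核心线索是高阶时间不规则性。下面用合成轨迹示意这类特征的提取(特征定义、噪声水平与合成信号均为本文假设,并非论文的具体特征集):

```python
import numpy as np

def temporal_features(traj):
    """提取刻画高阶时间不规则性的特征:一/二/三阶差分的能量统计"""
    vel = np.diff(traj)          # 一阶差分:速度
    acc = np.diff(vel)           # 二阶差分:加速度
    jerk = np.diff(acc)          # 三阶差分:高阶不规则性的主要载体
    feats = []
    for d in (vel, acc, jerk):
        feats += [d.std(), np.abs(d).mean()]
    return np.array(feats)

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
real = np.sin(t) + 0.02 * rng.normal(size=t.size)   # 平滑的“真实”面部运动
fake = np.sin(t) + 0.20 * rng.normal(size=t.size)   # 注入高频抖动的“伪造”运动
f_real, f_fake = temporal_features(real), temporal_features(fake)
more_irregular = f_fake[4] > f_real[4]              # 比较 jerk 的标准差
```

高阶差分会放大高频抖动,因此伪造轨迹的 jerk 能量显著更大;把这类特征喂给传统分类器即可得到可解释的检测器。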

[HC-6] StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition SIGGRAPH 2026

【速读】:该论文旨在解决当前生成式AI在创意人脸风格化(creative face stylization)任务中身份一致性评估与监督的瓶颈问题。现有身份编码器通常基于自然照片训练,在面对不同风格(如卡通、素描、绘画)和强度变化时表现出严重脆弱性,无法准确区分纹理或色彩变化与真实身份漂移,亦难以识别几何夸张带来的身份失真,根源在于缺乏一种对风格无关的身份一致性评估框架。解决方案的关键在于构建StyleID——一个融合人类感知的基准数据集与评估体系,包含两个核心组件:(i) StyleBench-H,用于捕捉人类在多种风格化方法及强度下的“相同-不同”判别判断;(ii) StyleBench-S,通过受控的二选一强迫选择(2AFC)实验获得心理测量识别强度曲线,作为监督信号。利用StyleBench-S对现有语义编码器进行微调,使其相似度排序与人类感知高度一致,从而显著提升模型在跨风格、跨域(如艺术家绘制肖像)场景下的身份保持能力与鲁棒性。

链接: https://arxiv.org/abs/2604.21689
作者: Kwan Yun,Changmin Lee,Ayeong Jeong,Youngseo Kim,Seungmi Lee,Junyong Noh
机构: KAIST(韩国科学技术院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: SIGGRAPH 2026 / ACM TOG. Project page at this https URL

点击查看摘要

Abstract:Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist drawn portraits. All of our datasets, code, and pretrained models are publicly available at this https URL

[HC-7] Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision ACL 2026

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在第一人称视角(egocentric view)下对指向手势(pointing gestures)的空间语义理解不足的问题,即模型常依赖视觉邻近性或物体显著性等虚假关联,而非真实的空间指代关系,这种现象被称为“参照幻觉”(Referential Hallucination)。解决方案的关键在于提出一个名为EgoPoint-Bench的综合性问答基准,涵盖超过11,000个高保真模拟与真实世界样本,覆盖五个评估维度和三个层级的参照复杂度;并通过在合成数据上微调模型,实现了显著性能提升及良好的仿真到现实(sim-to-real)泛化能力,从而验证了空间感知监督对于构建精准的第一人称AI助手的重要性。

链接: https://arxiv.org/abs/2604.21461
作者: Chentao Li,Zirui Gao,Mingze Gao,Yinglian Ren,Jianjiang Feng,Jie Zhou
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 20 pages, 14 figures. Committed to ACL 2026

点击查看摘要

Abstract:Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term “Referential Hallucination.” To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: this https URL

[HC-8] The Privacy Guardian Agent: Towards Trustworthy AI Privacy Agents

【速读】:该论文试图解决当前“通知与同意”(notice and consent)隐私保护范式失效的问题,即用户在面对冗长且复杂的隐私政策时难以真正理解或有效行使控制权,而现有基于大语言模型(Large Language Models, LLMs)的工具虽能增强用户主动控制能力,却无法满足时间有限或缺乏动机用户的自动化需求。解决方案的关键在于提出一种“隐私守护代理”(Privacy Guardian Agent),该代理通过用户画像和情境感知实现常规同意决策的自动化,并在不确定或高风险场景下将决策权交还用户,从而维持“人在回路”(human-in-the-loop)机制的必要性;同时,代理对自主决策的推理过程具备可审查性,确保透明度与用户救济路径,进而缓解同意疲劳(consent fatigue),并保障有意义的用户自主性和信任。

链接: https://arxiv.org/abs/2604.21455
作者: Vincent Freiberger
机构: Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig; Leipzig University (莱比锡大学)
类目: Human-Computer Interaction (cs.HC)
备注: Position paper for the CHI26 Workshop “Moving Beyond Clicks: Rethinking Consent and User Control in the Age of AI”

点击查看摘要

Abstract:The current “notice and consent” paradigm is broken: consent dialogues are often manipulative, and users cannot realistically read or understand every privacy policy. While recent LLM-based tools empower users seeking active control, many with limited time or motivation prefer full automation. However, fully autonomous solutions risk hallucinations and opaque decisions, undermining trust. I propose a middle ground - a Privacy Guardian Agent that automates routine consent choices using user profiles and contextual awareness while recognizing uncertainty. It escalates unclear or high-risk cases to the user, maintaining a human-in-the-loop only when necessary. To ensure agency and transparency, the agent’s reasoning on its autonomous decisions is reviewable, allowing for user recourse. For problematic cases, even with minimal consent, it alerts the user and suggests switching to an alternative site. This approach aims to reduce consent fatigue while preserving trust and meaningful user autonomy.

[HC-9] Neurodiversity and Technostress: Towards a Multimodal Research Design for Evaluating Subjective Physiological and Behavioral Responses

【速读】:该论文旨在解决当前技术压力(Technostress, TS)研究中长期存在的两个关键问题:一是研究对象主要集中在神经典型(neurotypical)人群,忽视了神经多样性(neurodiversity)群体的差异;二是现有研究缺乏在单一实验设计中整合主观感知、生理激活与行为交互等多维指标的能力。其解决方案的关键在于提出一种受控的实验研究设计,通过标准化的数字压力情境,系统比较神经多样性个体与神经典型个体在结构化和非结构化数字任务下的反应,并采用多模态测量方法同步采集主观体验、生理指标及可观察的行为数据,从而实现对数字压力机制更精细化、包容性的理解,并为更具包容性的数字工作设计提供方法论支持。

链接: https://arxiv.org/abs/2604.21404
作者: Lisa van den Heuvel,Igor Ivkić,René Riedl
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Digitalization has transformed modern work by increasing efficiency while also introducing new forms of strain. Technostress (TS) describes subjective, physiological, and behavioral stress responses related to digital technology use. Existing TS research has predominantly focused on neurotypical populations and rarely integrates multiple stress dimensions within a single design. This paper addresses these gaps by proposing a controlled experimental research design that systematically compares neurodivergent and neurotypical individuals under standardized digital stress conditions. The proposed design combines structured and unstructured digital tasks with a multimodal measurement approach covering subjective perceptions, physiological activation, and observable interaction behavior. By integrating neurodiversity into TS research, the paper contributes to a more differentiated understanding of digital stress and provides a methodological approach for more inclusive digital work design.

[HC-10] A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge

【速读】:该论文旨在解决如何在非专业用户群体中有效提升对机器人技术的认知与理解,尤其是在实际组织环境中缺乏系统性、可操作的引入方式的问题。其解决方案的关键在于设计并实施一种基于挑战的任务驱动方法,通过一个由大型语言模型(Large Language Model, LLM)赋能的人形机器人活动,使参与者在物流场景模拟任务中以语音指令与机器人互动,从而自然地体验具身人工智能(embodied AI)和人机协作机制,而无需预先具备机器人学知识。该方法不仅增强了参与者的兴趣与理解,还验证了LLM作为人机交互接口在工业场景下推广机器人意识的可行性与有效性。

链接: https://arxiv.org/abs/2604.21377
作者: S. A. Prieto,M. A. Gopee,Y. Ben Arab,B. García de Soto,J. Esteba,P. Olivera Brizzio
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 10 pages, 8 Figures, to be submitted for journal per-review

点击查看摘要

Abstract:Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates. In the event, participants engaged with a humanoid robot in a logistics-inspired task environment using voice commands interpreted through an LLM-based control framework. The activity was designed as a team-based, role-driven experience intended to expose participants to embodied AI and human-robot collaboration without requiring prior robotics expertise. To evaluate the approach, a post-event survey remained open for 16 days and collected 102 responses. Results indicate strong overall reception, with high satisfaction (8.46/10), increased interest in robotics and AI (4.47/5), and improved understanding of emerging forms of human-robot collaboration (4.45/5). Participants who interacted directly with the robot also reported natural interaction (4.37/5) and a strong sense that interaction became easier as the activity progressed (4.74/5). At the same time, lower ratings for reliability and predictability point to important technical and design challenges for future iterations. The findings suggest that challenge-based, LLM-enabled humanoid interaction can serve as a promising and replicable method for robotics awareness in industrial and operational environments.

[HC-11] Channel-Free Human Activity Recognition via Inductive-Bias-Aware Fusion Design for Heterogeneous IoT Sensor Environments

【速读】:该论文旨在解决物联网(IoT)环境中人体活动识别(HAR)因传感器配置异构性导致的模型泛化难题,即传统固定通道结构的模型难以跨数据集、设备或身体部位复用。其核心解决方案是提出一种严格的无通道(channel-free)HAR框架,关键在于通过通道级编码与共享编码器结合、基于传感器元数据的条件批归一化晚融合机制,以及联合优化通道级预测与融合结果的组合损失函数,从而在不依赖预定义通道顺序或数量的前提下实现高效且鲁棒的跨场景活动识别。

链接: https://arxiv.org/abs/2604.21369
作者: Tatsuhito Hasegawa
机构: University of Fukui(福井大学)
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注: 13 pages, 6 figures, 8 tables, Preprint. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Human activity recognition (HAR) in Internet of Things (IoT) environments must cope with heterogeneous sensor settings that vary across datasets, devices, body locations, sensing modalities, and channel compositions. This heterogeneity makes conventional channel-fixed models difficult to reuse across sensing environments because their input representations are tightly coupled to predefined channel structures. To address this problem, we investigate strict channel-free HAR, in which a single shared model performs inference without assuming a fixed number, order, or semantic arrangement of input channels, and without relying on sensor-specific input layers or dataset-specific channel templates. We argue that fusion design is the central issue in this setting. Accordingly, we propose a channel-free HAR framework that combines channel-wise encoding with a shared encoder, metadata-conditioned late fusion via conditional batch normalization, and joint optimization of channel-level and fused predictions through a combination loss. The proposed model processes each channel independently to handle varying channel configurations, while sensor metadata such as body location, modality, and axis help recover structural information that channel-independent processing alone cannot retain. In addition, the joint loss encourages both the discriminability of individual channels and the consistency of the final fused prediction. Experiments on PAMAP2, together with robustness analysis on six HAR datasets, ablation studies, sensitivity analysis, efficiency evaluation, and cross-dataset transfer learning, demonstrate three main findings…
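其中“元数据条件化的晚融合”可以用 FiLM 风格的条件归一化来理解。以下为 NumPy 最小草图(权重、维度与元数据嵌入均为本文假设,非论文实现):

```python
import numpy as np

def cond_norm(h, meta_emb, W_gamma, W_beta, eps=1e-5):
    """条件批归一化(FiLM 风格):由传感器元数据生成缩放/平移参数"""
    gamma = W_gamma @ meta_emb          # 元数据(部位/模态/轴)→ 逐维缩放
    beta = W_beta @ meta_emb            # 元数据 → 逐维平移
    return gamma * (h - h.mean()) / (h.std() + eps) + beta

rng = np.random.default_rng(0)
d, m = 16, 6
W_gamma = rng.normal(0, 0.1, (d, m))
W_beta = rng.normal(0, 0.1, (d, m))

# 每个通道先独立编码(此处用随机向量代替共享编码器的输出)
channels = [rng.normal(size=d) for _ in range(4)]   # 通道数与顺序均可变
metas = [rng.normal(size=m) for _ in range(4)]      # 各通道的元数据嵌入
conditioned = [cond_norm(h, me, W_gamma, W_beta) for h, me in zip(channels, metas)]
fused = np.mean(conditioned, axis=0)                # 与通道数/顺序无关的晚融合
```

由于每个通道独立处理、再用均值这类对称操作融合,模型对通道数量与顺序天然不敏感,这正是严格 channel-free 设定所需要的性质;元数据条件化则负责补回通道独立处理丢失的结构信息。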

[HC-12] “If We Had the Information That We Need to Interpret the World Around Us We Wouldn’t Be Disabled:” Barriers and Opportunities in Information Work among Blind and Sighted Colleagues

【速读】:该论文旨在解决职场中视觉能力差异对协作工作支持不足的问题,尤其关注盲人或低视力员工在信息表示(如PDF文档、电子表格和图表)使用过程中所面临的障碍。其解决方案的关键在于通过日记研究与深度访谈(共23名来自5个团队的参与者)及两组焦点小组讨论(另7名参与者),识别出四类相互关联的信息表示失败模式及其应对策略,揭示了职场偏见(stigma)和社会动态如何影响信息共享与协作,并提出了一种基于实证的新型概念框架,用于指导混合能力团队在当前技术环境下优化知识工作体验的设计与改进。

链接: https://arxiv.org/abs/2604.21338
作者: Yichun Zhao,Miguel A. Nacenta,Mahadeo A. Sukhai,Sowmya Somanath
机构: University of Victoria (维多利亚大学); IDEA-STEM (IDEA-STEM); University of Victoria (维多利亚大学); IDEA-STEM (IDEA-STEM)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted in CHIWORK '26

点击查看摘要

Abstract:Despite recognition of the value of diversity, the way work takes place can fail to support blind or low-vision employees, especially in collaborative work settings. This paper examines how professional teams with diverse visual abilities use information representations (e.g., PDF documents, spreadsheets and charts). A diary study with follow-up individual interviews (23 participants with mixed abilities from 5 teams) and 2 separate focus groups (7 participants from 2 other teams) allowed us to characterize key dimensions of the role of representations in the workplace into four types of interrelated failures and workarounds, influenced by workplace stigmas and shaped by evolving social dynamics towards interdependent information work. We contribute this new empirically supported conceptual understanding of representation use in workplaces that can help design and improve the experiences of mixed-ability teams doing knowledge work in the current technological landscape.

[HC-13] TopoStyle: Supporting Iterative Design with Generative AI for 2.5D Topology Optimization

【速读】:该论文旨在解决传统拓扑优化(Topology Optimization, TO)在工程设计中面临的三大问题:一是设计结果多样性不足且定制化能力弱;二是计算成本高、耗时长,难以满足高效设计需求;三是缺乏对结构性能与美学品质之间平衡的迭代优化机制。解决方案的关键在于提出一种名为TopoStyle的2.5D拓扑优化交互式设计工具,其核心创新包括:利用2D扩散模型实现生成式AI驱动的拓扑优化,支持两种交互方式——通过图形界面手绘修改3D部件或直接在3D建模软件中以点为单位进行交互,并引入掩膜(mask)机制实现局部区域优化,从而显著提升设计效率、增强用户自定义能力和美学可控性,最终实现性能与美观的协同优化。

链接: https://arxiv.org/abs/2604.21315
作者: Shuyue Feng,Cedric Caremel,Yoshihiro Kawahara
机构: The University of Tokyo (东京大学)
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages

点击查看摘要

Abstract:Topology optimization (TO) is widely used in engineering because of its ability to save material and optimize structural performance. Although prior work has explored 2D human-centered design tools for TO, the results are often limited in variety and offer weak customizability. Meanwhile, due to the high computational and time costs of TO, researchers have attempted to address these issues using generative AI; however, such methods often provide limited interactivity. In addition, topology optimization in many cases needs to balance structural performance and aesthetic qualities through iterative design, a perspective that has rarely been emphasized in traditional TO. We present TopoStyle, an iterative design tool for 2.5D topology optimization using a 2D diffusion model. We explore two interaction methods. The first exports 3D parts to a graphical interface for hand-drawn interaction. The second enables direct interaction within 3D modeling software using points. Our tool also supports the use of masks to apply topology optimization to specific regions, allowing users to address customized design needs. We compare and evaluate both performance and interaction methods, and investigate how TopoStyle can balance performance and aesthetics while improving design efficiency through customization and iterative design. Finally, we demonstrate the application scenarios of TopoStyle through several design cases.

[HC-14] When Constraints Limit and Inspire: Characterizing Presentation Authoring Practices for Evolving Narratives

【速读】:该论文旨在解决演示文稿(presentation slides)创作过程中,如何有效应对时间、受众和传播意图等情境约束的问题。现有研究多将这些约束视为被动限制,而忽视了演讲者如何主动推理并利用它们来指导内容结构与重用。解决方案的关键在于提出Constraint-based Multi-session Presentation Authoring (CMPA) 框架,将时间、受众和传播意图作为核心约束,并在 ReSlide 这一原型工具中实现,使约束成为驱动叙事构建的主动设计要素。用户研究表明,相较于基线工具,ReSlide 能够帮助演讲者更灵活地跨会话复用和调整内容,从而提升多轮创作效率与适应性。

链接: https://arxiv.org/abs/2604.21205
作者: Linxiu Zeng,Emily Kuang,Jian Zhao
机构: University of Waterloo (滑铁卢大学); York University (约克大学)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 12 Figures. To appear in DIS 2026

点击查看摘要

Abstract:Authoring presentation slides involves navigating contextual constraints that shape how content is structured, adapted, and reused. While prior work frames constraints as limitations, little is known about how presenters actively reason about them. We conducted a formative study with ten presenters to examine how constraints emerge, are interpreted, and influence authoring decisions, leading to the Constraint-based Multi-session Presentation Authoring (CMPA) framework. CMPA treats time, audience, and communicative intent as key constraints shaping authoring. We instantiated CMPA in ReSlide, a research prototype for constraint-aware slide creation and reuse, and conducted two user studies on (1) single-session behaviors and (2) multi-session workflows. Compared to a baseline tool, ReSlide helped presenters treat constraints as active design drivers that guide narrative construction. The second study further shows how presenters flexibly reuse and adapt content across authoring cycles as constraints evolve. We then propose design implications for future constraint-aware presentation tools.

[HC-15] Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

【速读】:该论文旨在解决当前对话系统在情境对话(situated dialogue)中难以维持共享上下文表示的问题,尤其是在超出即时对话窗口的情况下,容易因细粒度区分被压缩为纯文本表征而导致“表征模糊”(representational blur)现象,即相似但不同的实体被误认为可互换,从而造成看似局部连贯却无法持久跟踪共同语境的失效模式。解决方案的关键在于引入一种主动视觉支架框架(active visual scaffolding framework),通过将对话状态增量式地转化为可持久存储的视觉历史,并在生成响应时进行检索,以强化场景的具体承诺并减少表征模糊;同时,研究发现结合文本与视觉的混合多模态表示优于单一模态,表明显式整合描绘性(depictive)与命题性(propositional)信息有助于提升对话系统的共知建模能力。

链接: https://arxiv.org/abs/2604.21144
作者: Biswesh Mohapatra,Giovanni Duca,Laurent Romary,Justine Cassell
机构: Inria(法国国家信息与自动化研究院); University of Trento(特伦托大学); Carnegie Mellon University(卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Work under review. Biswesh Mohapatra and Giovanni Duca both contributed equally to this work

点击查看摘要

Abstract:Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call representational blur, in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.

[HC-16] White Paper: Human-AI Collaboration in Conflict Analysis: Text Classifier Development with Peacebuilders

【速读】:该论文旨在解决在敏感人道主义领域中,如何通过参与式人工智能(Participatory AI)开发提升文本分类模型的技术鲁棒性、情境有效性与规范一致性的问题。其关键解决方案在于引入和平建设者与数据科学家的协作机制,共同参与问题定义、标注设计、迭代验证和模型评估全过程,基于协同标注的数据集对BERT模型进行微调,并通过开放源代码形式发布模型(Kenya-polarization 和 Sudan-hate speech),从而显著增强模型对文化语境的适配能力、减少误分类现象,并提升实践者的工具归属感与使用意愿。

链接: https://arxiv.org/abs/2604.21034
作者: Allan Kipyator Kipkemboi Cheboi,Julie Hawke,Hussam Abualfatah,Andrew Sutjahjo,Daniel Burkhardt Cerigo,Rachael Olpengs,William OBrien
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 5 tables

点击查看摘要

Abstract:This paper documents a collaborative research process involving peacebuilders and data scientists in Kenya and Sudan to develop AI-based text classifiers for monitoring online polarization and hate speech. The method describes a participatory annotation process in which practitioners and domain experts contributed to problem definition, annotation design, iterative validation, and model evaluation. Fine-tuned BERT-based classifiers were trained on collaboratively annotated datasets and evaluated against held-out test sets. In each case, the models produced enhanced contextual alignment, reduced misclassification driven by cultural nuance, and increased practitioner ownership of AI tools. The resulting models (Kenya-polarization and Sudan-hate speech) are open-source and accessible via HuggingFace. The study contributes empirical evidence that participatory AI development can simultaneously improve technical robustness, contextual validity, and normative alignment in sensitive humanitarian domains.

[HC-17] User-Centered Design of Hyperlocal Communication Platforms: Insights from the Design and Evaluation of KUBO

【速读】:该论文旨在解决菲律宾地区因信息延迟、算法过滤、语言障碍及数字鸿沟导致的超本地(hyperlocal)通信效率低下问题,这些问题常使居民无法及时获取紧急预警和社区事件信息。解决方案的关键在于设计并实现了一个双通道、包容性的信息平台KUBO(Kumunidad at Balitang Opisyal),其包含两个核心模块:一是家庭模块,用于整合经验证的地方政府单位(Local Government Unit, LGU)发布的权威通告与精选新闻;二是社区模块,支持居民自主发布邻里报告与讨论。通过用户中心设计方法与对照实验验证,KUBO显著缩短了任务完成时间(p < 0.001)、提升了信息回忆准确率(p = 0.010),并在易用性、满意度和感知有效性方面优于Facebook等主流平台,证明该双通道架构能有效增强超本地场景下的实时信息可达性、理解力与公民参与度。

链接: https://arxiv.org/abs/2604.20973
作者: Eljohn Evangelista,Alyssa Cea,Axel Balitaan,Clark Vince Diala,Jamlech Iram Gojo Cruz
机构: Institute of Computer Science, University of the Philippines Los Baños (菲律宾大学洛斯巴尼奥斯分校计算机科学研究所)
类目: Human-Computer Interaction (cs.HC)
备注: To be published in Proceedings of the 2025 International Conference on Human-Engaged Computing (ICHEC 2025), November 21-23, 2025, Singapore, Singapore. ACM, New York, NY, USA, 13 pages

点击查看摘要

Abstract:Effective hyperlocal communication is critical in the Philippines, where delayed or algorithm-filtered updates can leave residents uninformed about emergency advisories and community events. We conducted a user-centered study consisting of contextual inquiry and semi-structured interviews to identify four key barriers: delayed alerts, algorithm-driven noise, language gaps, and digital divides. Guided by these insights, we designed KUBO (Kumunidad at Balitang Opisyal), a prototype that integrates a home module for verified local government unit advisories and curated headlines, and a community module for resident-powered neighborhood reports and discussions. Using a within-subjects evaluation design, KUBO significantly reduced task completion times (p-value < 0.001), improved information recall on post-task quizzes (p-value = 0.010), and yielded higher user satisfaction ratings for ease of use, overall satisfaction, and perceived effectiveness compared to Facebook, the commonly used communication platform in the Philippines. These results demonstrate that a dual-channel, inclusive platform can substantially enhance real-time information access, comprehension, and civic engagement in hyperlocal settings.
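作为示意,下面的 Python 片段用完全虚构的数据演示该研究所采用的被试内配对比较(within-subjects)中配对 t 统计量的计算方式;参与者人数、各平台任务用时以及函数名 `paired_t` 均为假设,并非论文数据或实现:

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired-samples t statistic for within-subjects data (e.g., task time
    for the same participants on two platforms). Hypothetical illustration."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences (Bessel's correction).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)

# Hypothetical task-completion times (seconds) for six participants.
kubo     = [42, 38, 45, 40, 36, 44]
facebook = [61, 55, 70, 58, 52, 66]
print(round(paired_t(kubo, facebook), 2))  # strongly negative -> KUBO faster
```

负的 t 值越大(绝对值),越说明同一批被试在 KUBO 上显著快于基线平台;论文报告的 p < 0.001 即由此类检验得出。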

[HC-18] Can Virtual Agents Care? Designing an Empathetic and Personalized LLM-Driven Conversational Agent

【速读】:该论文旨在解决全球范围内心理健康挑战日益加剧与传统支持服务可用性有限、成本高昂之间的矛盾,同时克服当前大语言模型(Large Language Models, LLMs)在对话支持中普遍存在的个性化不足、共情能力弱和事实依据不牢靠等问题。其解决方案的关键在于构建一个虚拟代理框架(virtual agent framework),通过检索增强架构(retrieval-augmented architecture)、结构化记忆(structured memory)以及多模态交互(multimodal interaction)实现更具同理心、个性化且可靠的心理健康支持。实证结果表明,该框架在小规模模型上亦能显著提升检索与响应质量,并在跨文化用户研究中优于纯LLM基线,在连贯性、感知准确性及共情维度上均表现更优。

链接: https://arxiv.org/abs/2604.20948
作者: Truong Le Minh Toan,Dieu Bang Mach,Tan Duy Le,Nguyen Tan Viet Tuyen
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted manuscript version to be presented at the SCI-2026

点击查看摘要

Abstract:Mental health challenges are rising globally, while traditional support services face limited availability and high costs. Large language models offer potential for conversational support, but often lack personalization, empathy, and factual grounding. A virtual agent framework is introduced to provide empathetic, personalized, and reliable wellbeing support through retrieval-augmented architecture, structured memory, and multimodal interaction. Objective benchmarks demonstrate improved retrieval and response quality, particularly for smaller models. A cross-cultural study with university students from Vietnam and Australia shows the system outperforms LLM-only baselines in coherence, perceived accuracy, and empathy, with most participants clearly preferring the proposed approach.

[HC-19] AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

【速读】:该论文旨在解决生成式视频模型(如Video Diffusion Transformers)中艺术家难以理解其内部工作机制的问题,尤其是在仅通过提示词(prompt)控制时,无法有效建立对模型生成过程的直观认知或突破其默认行为倾向。解决方案的关键在于提出AttentionBender工具,该工具基于网络弯曲(Network Bending)方法,通过在跨注意力(cross-attention)机制上施加二维变换(如旋转、缩放、平移等),实现对视频生成过程的可解释性探查与创造性干预。实验表明,跨注意力高度耦合,局部操控常引发分布式的畸变和故障美学,这不仅揭示了Transformer结构的内在复杂性,也为超越模型预设表征空间的新型视觉风格创作提供了新路径。

链接: https://arxiv.org/abs/2604.20936
作者: Adam Cole,Mick Grierson
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: To appear in the Proceedings of the 2026 ACM Creativity and Cognition (CC '26). 15 pages, 19 figures

点击查看摘要

Abstract:We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists’ ability to build intuition for the model’s material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model’s learned representational space.
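下面给出"网络弯曲"式注意力操控的一个极简示意:对一张 H×W 的空间注意力图施加环绕平移与 90° 旋转。真实的视频 DiT 跨注意力形状为 (heads, tokens, tokens),此处的纯 Python 玩具实现、变换集合与函数名(`translate`、`rotate90`)均为注释者的简化假设,并非论文实现:

```python
def translate(attn, dx, dy):
    """Shift an H x W attention map by (dx, dy), wrapping at the borders."""
    h, w = len(attn), len(attn[0])
    return [[attn[(i - dy) % h][(j - dx) % w] for j in range(w)]
            for i in range(h)]

def rotate90(attn):
    """Rotate the map 90 degrees counter-clockwise."""
    w = len(attn[0])
    return [[row[j] for row in attn] for j in reversed(range(w))]

# Toy map with attention focused on the centre column.
attn = [[0.0, 0.1, 0.0],
        [0.0, 0.8, 0.0],
        [0.0, 0.1, 0.0]]
bent = translate(attn, 1, 0)  # push attention one cell to the right
print(bent[1])                # -> [0.0, 0.0, 0.8]
```

论文观察到的"分布式畸变"即源于此类局部变换在高度耦合的注意力中被放大、扩散,而非产生干净的局部编辑。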

[HC-20] The Root Theorem of Context Engineering

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在多轮对话中面临的两个根本性约束:有限的上下文窗口(context window)和信息质量随 token 累积而衰减的问题。其核心解决方案是提出“上下文工程根定理”(Root Theorem of Context Engineering),即在有损信道中最大化信号与 token 的比率(signal-to-token ratio)。该定理推导出五个无额外假设的结论,其中关键在于识别出维持长期理解必须依赖一种自稳态持久机制(homeostatic persistence)——即不断积累、压缩、重写与舍弃(accumulate, compress, rewrite, shed),且压缩机制本身需运行于其所作用的通道内部,从而要求外部验证门控机制的存在。这一理论框架将上下文工程确立为具有信息论基础的独立学科,区别于传统提示工程(prompt engineering),并以一个持续运行60余会话的架构实证了其有效性。

链接: https://arxiv.org/abs/2604.20874
作者: Borja Odriozola Schick
机构: Independent Researcher
类目: Computational Complexity (cs.CC); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
备注: 17 pages, 2 figures

点击查看摘要

Abstract:Every system that maintains a large language model conversation beyond a single session faces two inescapable constraints: the context window is finite, and information quality degrades with accumulated volume. We formalize these constraints as axioms and derive a single governing principle – the Root Theorem of Context Engineering: maximize signal-to-token ratio within bounded, lossy channels. From this principle, we derive five consequences without additional assumptions: (1) a quality function F(p) that degrades monotonically with injected token volume, independent of window size; (2) the independence of signal and token count as optimization variables; (3) a necessary gate mechanism triggered by fidelity thresholds, not capacity limits; (4) the inevitability of homeostatic persistence – accumulate, compress, rewrite, shed – as the only architecture that sustains understanding indefinitely; and (5) the self-referential property that the compression mechanism operates inside the channel it compresses, requiring an external verification gate. We show that append-only systems necessarily exceed their effective window in finite time, that retrieval-augmented generation solves search but not continuity, and that the theorem’s constraint structure converges with biological memory architecture through independent derivation from shared principles. Engineering proof is provided through a 60+ session persistent architecture demonstrating stable memory footprint under continuous operation – the divergence prediction made concrete. The Root Theorem establishes context engineering as an information-theoretic discipline with formal foundations, distinct from prompt engineering in both scope and method. Shannon solved point-to-point transmission. Context engineering solves continuity.
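下面用一个极简的 Python 模拟示意该定理的两个推论:仅追加(append-only)的系统会在有限时间内超出有效上下文窗口,而"积累-压缩-重写-舍弃"的自稳态机制可使 token 占用保持有界。其中窗口大小、每轮 token 数、门控阈值与压缩比均为假设参数,并非论文中的实现:

```python
WINDOW = 1000   # bounded context window, in tokens (hypothetical)
PER_TURN = 50   # tokens appended per session turn (hypothetical)
GATE = 0.8      # compress when usage exceeds this fraction of the window
RATIO = 0.5     # compression keeps this fraction of stored tokens

def append_only(turns):
    """Token footprint per turn; grows without bound."""
    return [PER_TURN * t for t in range(1, turns + 1)]

def homeostatic(turns):
    """Accumulate, then compress whenever the fidelity gate is crossed."""
    tokens, history = 0, []
    for _ in range(turns):
        tokens += PER_TURN                 # accumulate
        if tokens > GATE * WINDOW:         # gate fires on a threshold, not capacity
            tokens = int(tokens * RATIO)   # compress / rewrite / shed
        history.append(tokens)
    return history

print("append-only final:", append_only(60)[-1])   # exceeds the window
print("homeostatic max:", max(homeostatic(60)))    # stays bounded
```

在这组假设参数下,60 轮后仅追加系统的足迹是窗口的三倍,而自稳态系统的足迹始终不超过门控阈值附近,对应论文所说的"发散预测"与稳定内存足迹。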

[HC-21] The AI Criminal Mastermind

【速读】:该论文旨在解决生成式 AI (Generative AI) 作为犯罪策划者时引发的法律责任归属问题,特别是当AI通过雇佣人类任务员(taskers)实施犯罪但任务员不知情且无犯罪意图时,传统刑法与民法中的责任认定机制面临显著空白。其核心问题在于:若AI代理(agent)在未明确授权的情况下超出用户指令或在匿名用户操控下组织犯罪,以及多个AI代理协同招募人类执行者形成责任分散网络时,谁应承担法律责任?解决方案的关键在于提出三种典型场景并指出,当前法律体系难以有效归责于人类任务员——因其行为受“无辜代理人原则”(innocent agent principle)约束,而AI本身不具备刑事责任能力,从而导致责任缺口(liability gaps)。论文强调需重构法律框架以应对AI驱动的去中心化犯罪模式。

链接: https://arxiv.org/abs/2604.20868
作者: Joshua Krook
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 28 pages, 4 figures

点击查看摘要

Abstract:In this paper, I evaluate the risks of an AI criminal mastermind, an AI agent capable of planning, coordinating, and committing a crime through the onboarding of human collaborators (‘taskers’). In heist films, a criminal mastermind is a character who plans a criminal act, coordinating a team of specialists to rob a bank, casino or city mint. I argue that AI agents will soon play this role by hiring humans via labour hire platforms like Fiverr or Upwork. Taskers might not know they are involved in a crime and therefore lack criminal intent. An AI agent cannot have criminal intent as an artificial entity. Therefore, if an AI orchestrates a crime, it is unclear who, if anyone, is responsible. The paper develops three scenarios. Firstly, a scenario where a user gives an AI agent instructions to pursue a legal objective and the AI agent goes beyond these instructions, committing a crime. Secondly, a scenario where a user is anonymous and their intent is unknown. Finally, a multi-agent scenario, where a user instructs a team of agents to commit a crime, and these agents, in turn, onboard human taskers, creating a diffuse network of responsibility. In each scenario, human taskers exist at the lowest rung of the hierarchy. A tasker’s liability is likely tied to their knowledge as governed by the innocent agent principle. These scenarios all raise significant responsibility gaps / liability gaps in criminal and civil law.

[HC-22] he Effect of Idea Elaboration on the Automatic Assessment of Idea Originality

【速读】:该论文旨在解决自动评估系统(如大型语言模型,LLMs)在创造性任务中对人类生成内容的原创性评分时存在的自偏好偏差问题,即自动系统倾向于偏好与自身风格更接近的输出而非人类创作的内容。其解决方案的关键在于控制“想法详尽度”(idea elaboration)这一变量后,发现自偏好偏差消失,表明该偏差并非源于对原创性的本质误解,而是与表达复杂度相关。这一发现为改进机器评估系统的公平性和准确性提供了理论依据和方法论方向。

链接: https://arxiv.org/abs/2604.20569
作者: Umberto Domanti,Moritz Mock,Sergio Agnoli,Antonella De Angeli
机构: Free University of Bozen-Bolzano (博岑-博尔扎诺自由大学); University of Trieste (特里斯特大学); Marconi Institute for Creativity (马可尼创意研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias. Accordingly, automatic systems tend to prefer outcomes that are more closely related to their style, rather than to the human one. In this paper, we investigated how Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task produced by higher and lower creative humans and ChatGPT-4o. Human raters were two university students who underwent intensive training. Machine raters were two specialised systems fine-tuned on AUT responses and corresponding human ratings (OCSAI and CLAUS) and ChatGPT-4o, which was prompted with the same instructions as human raters. Results confirmed the presence of a self-preference bias in LLMs. Automatic systems tended to privilege artificial responses. However, this self-preference bias disappeared when the analyses controlled for the idea elaboration. We discuss theoretical and methodological implications of these findings by highlighting future directions for research on creativity assessment.

计算机视觉

[CV-0] Seeing Fast and Slow: Learning the Flow of Time in Videos FAST

【速读】:该论文旨在解决视频中时间流动的感知与控制问题,即如何识别视频是否被加速或减速,以及如何生成不同播放速度的视频。其核心挑战在于将时间视为可学习的视觉概念,并实现对视频时序信息的有效建模与操控。解决方案的关键在于:首先利用视频中自然存在的多模态线索和时间结构,通过自监督学习方法检测速度变化并估计播放速度;进而基于此建立的时序推理模型,从杂乱的真实场景数据中构建迄今最大的慢动作视频数据集;最终在此基础上开发出两种关键能力——速度条件下的视频生成(speed-conditioned video generation),能够按指定播放速度生成运动内容;以及时间超分辨率(temporal super-resolution),可将低帧率(low-FPS)模糊视频转换为高帧率(high-FPS)且具有精细时序细节的序列。这一系列工作揭示了时间作为可操控的感知维度在视频学习中的潜力,为时序可控视频生成、时间伪造检测及更丰富的世界模型奠定了基础。

链接: https://arxiv.org/abs/2604.21931
作者: Yen-Siang Wu,Rundong Luo,Jingsen Zhu,Tao Tu,Ali Farhadi,Matthew Wallingford,Yu-Chiang Frank Wang,Steve Marschner,Wei-Chiu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.
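论文的自监督思路可以用一个极简的帧重采样示意来理解:按不同速率对帧序列做下/上采样即可合成"加速/减速"视频,再训练模型预测所用的播放速率。以下最近邻重采样代码与函数名 `resample` 仅为注释者的示意假设,并非论文实现:

```python
def resample(frames, speed):
    """Simulate playback at `speed`x via nearest-frame index resampling.
    speed > 1 drops frames (fast-forward); speed < 1 repeats frames (slow-mo)."""
    n = max(1, int(len(frames) / speed))
    return [frames[min(int(i * speed), len(frames) - 1)] for i in range(n)]

frames = list(range(8))       # stand-in for 8 video frames
print(resample(frames, 2.0))  # fast-forward: [0, 2, 4, 6]
print(resample(frames, 0.5))  # slow motion: each frame repeated twice
```

判断一段视频出自哪个 `speed`,正对应论文中"检测速度变化、估计播放速度"的自监督预训练信号。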

[CV-1] Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

【速读】:该论文旨在解决传统依赖视觉感知的人类活动与环境理解方法所面临的隐私、安全、能效及可扩展性问题,提出了一种无需摄像头的4D感知新范式——即仅通过日常可穿戴传感器(如耳塞、手表或智能手机中的惯性测量单元,IMU)来重建人类运动和三维场景结构。其解决方案的关键在于引入IMU-to-4D框架,该框架创新性地将大语言模型(Large Language Models, LLMs)用于非视觉时空理解,利用少量IMU数据预测精细的4D人体运动及粗粒度场景结构,实验表明其在多个人-场景数据集上相较于现有级联式流水线方法具有更高的连贯性和时序稳定性,证明了仅靠可穿戴传感器即可实现丰富的4D动态理解。

链接: https://arxiv.org/abs/2604.21926
作者: Hao-Yu Hsu,Tianhang Cheng,Jing Wen,Alexander G. Schwing,Shenlong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.

[CV-2] Context Unrolling in Omni Models

【速读】:该论文旨在解决多模态模型在处理异构模态数据时难以有效融合互补信息、从而限制其推理能力的问题。解决方案的关键在于提出Omni模型,该模型通过原生训练于文本、图像、视频、3D几何及隐式表示等多种模态,实现了“上下文展开”(Context Unrolling)机制——即在生成预测前显式地跨多种模态表示进行推理,从而更准确地逼近共享的多模态知识流形,显著提升下游任务中的多模态理解与生成性能,尤其在上下文内生成文本、图像、视频和3D几何等复杂任务中展现出先进能力。

链接: https://arxiv.org/abs/2604.21921
作者: Ceyuan Yang,Zhijie Lin,Yang Zhao,Fei Xiao,Hao He,Qi Zhao,Chaorui Deng,Kunchang Li,Zihan Ding,Yuwei Guo,Fuyun Wang,Fangqi Zhu,Xiaonan Nie,Shenhan Zhu,Shanchuan Lin,Hongsheng Li,Weilin Huang,Guang Shi,Haoqi Fan
机构: Omni (Omni); BAGEL (BAGEL)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Report

点击查看摘要

Abstract:We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

[CV-3] Vista4D: Video Reshooting with 4D Point Clouds CVPR2026

【速读】:该论文旨在解决现有视频重拍摄(video reshooting)方法在真实动态视频中因深度估计误差导致的重建质量下降、内容外观失真以及难以精确控制复杂新相机轨迹的问题。其解决方案的关键在于构建一个4D(时空)对齐的点云表示,通过静态像素分割与4D重建联合建模,显式保留场景可见内容并提供丰富的相机运动信号;同时利用多视角动态数据训练以增强对实际推理中点云伪影的鲁棒性,从而实现更一致的4D结构、更精准的相机控制和更高的视觉保真度。

链接: https://arxiv.org/abs/2604.21915
作者: Kuan Heng Lin,Zhizheng Liu,Pablo Salamanca,Yash Kant,Ryan Burgert,Yuancheng Xu,Koichi Namekata,Yiwei Zhao,Bolei Zhou,Micah Goldblum,Paul Debevec,Ning Yu
机构: Eyeline Labs(艾琳实验室); Netflix(奈飞); Columbia University(哥伦比亚大学); UCLA(加州大学洛杉矶分校); Stony Brook University(石溪大学); University of Oxford(牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 20 figures, CVPR 2026, see project page at this https URL

点击查看摘要

Abstract:We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: this https URL

[CV-4] Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

【速读】:该论文旨在解决如何量化和比较人类与深度视觉模型在图像分类任务中因不同归纳偏置(inductive bias)而导致的系统性误判差异问题。传统准确率指标无法揭示模型间错误模式的方向性差异,而本文通过构建匹配的人类与深度视觉模型响应数据集,在12种扰动类型下量化混淆矩阵的不对称性,并引入率-失真(Rate-Distortion, RD)框架提取三个几何特征(斜率 β、曲率 κ 和效率 AUC),从而将方向性误判映射为可解释的归纳偏置表征。关键在于发现人类表现出广谱但微弱的不对称性,而深度模型则呈现稀疏且强烈的定向坍缩,且鲁棒训练虽能降低全局不对称性,却无法恢复人类特有的“广度-强度”梯度相似性结构;机制模拟进一步表明,不同不对称组织形式即使性能相当,也会导致RD前沿向相反方向移动,凸显了方向性混淆作为分布偏移下归纳偏置的紧凑且可解释标识符的价值。

链接: https://arxiv.org/abs/2604.21909
作者: Leyla Roksan Caglar,Pedro A.M. Mediano,Baihan Lin
机构: Icahn School of Medicine at Mount Sinai (伊坎医学院); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures: slope (beta), curvature (kappa), and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.
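混淆矩阵的"方向性"可以用一个简单的不对称度量来示意:对每一对类别 (i, j),比较 C_ij 与 C_ji 的差异。下面的公式与函数名 `asymmetry_score` 是注释者自拟的说明性度量,不一定等同于论文实际使用的统计量:

```python
def asymmetry_score(C):
    """Off-diagonal asymmetry: sum |C_ij - C_ji| / sum (C_ij + C_ji) over i < j.
    0 means perfectly symmetric confusions; 1 means fully one-directional."""
    n = len(C)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            num += abs(C[i][j] - C[j][i])
            den += C[i][j] + C[j][i]
    return num / den if den else 0.0

# A broad, symmetric confuser (human-like) vs. a sparse, one-directional
# collapser (model-like); diagonal entries do not affect the score.
broad  = [[90, 5, 5], [5, 90, 5], [5, 5, 90]]
sparse = [[90, 10, 0], [0, 100, 0], [0, 0, 100]]
print(asymmetry_score(broad))   # 0.0
print(asymmetry_score(sparse))  # 1.0
```

两个玩具矩阵的总错误量相近,但方向性截然不同,对应论文"准确率相当、错谁与被谁错的方向不同"的核心观察。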

[CV-5] UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection CVPR2026

【速读】:该论文旨在解决图像生成(image generation)与生成图像检测(generated image detection)两个任务之间因架构差异导致的协同困难问题。传统方法中,生成任务主要依赖生成式网络(generative networks),而检测任务则采用判别式框架(discriminative frameworks),二者独立发展、难以融合。解决方案的关键在于提出UniGenDet:一种统一的生成-判别框架,通过设计共生多模态自注意力机制(symbiotic multimodal self-attention mechanism)和统一微调算法,实现两者的共进化。该机制使生成任务提升真实性识别的可解释性,同时真实性判别标准反向引导生成高质量图像,并引入检测器感知的生成对齐机制(detector-informed generative alignment mechanism)以促进信息无缝交换,从而显著提升整体性能。

链接: https://arxiv.org/abs/2604.21904
作者: Yanran Zhang,Wenzhao Zheng,Yifei Li,Bingyao Yu,Yu Zheng,Lei Chen,Jiwen Lu,Jie Zhou
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: this https URL.

[CV-6] Addressing Image Authenticity When Cameras Use Generative AI CVPR2026

【速读】:该论文旨在解决由生成式 AI (Generative AI) 技术在图像信号处理器(ISP)中集成所引发的“捕获时幻觉”问题,即相机硬件在图像处理过程中引入的非真实内容(如增强边缘或纹理),尤其在AI数字变焦或低光增强等操作中可能改变图像语义,导致用户误判图像真实性。解决方案的关键在于设计一个图像特定的多层感知机(MLP)解码器与模态特定编码器联合优化的框架,能够在无需访问原始ISP的情况下,从已捕获图像中恢复出未被幻觉污染的版本,且该方法仅需180 KB存储空间,可作为元数据嵌入标准图像格式(如JPEG和HEIC)。

链接: https://arxiv.org/abs/2604.21879
作者: Umar Masud,Abhijith Punnappurath,Luxi Zhao,David B. Lindell,Michael S. Brown
机构: University of Toronto (多伦多大学); AI Center–Toronto, Samsung Electronics (三星电子人工智能中心-多伦多)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: To appear in CVPR 2026 Workshop on Authenticity and Provenance in the Age of Generative AI

点击查看摘要

Abstract:The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras’ capture-time hardware – namely, the image signal processor (ISP) – there is now a potential for hallucinated content in images directly output by our cameras. Hallucinated capture-time image content is typically benign, such as enhanced edges or texture, but in certain operations, such as AI-based digital zoom or low-light image enhancement, hallucinations can potentially alter the semantics and interpretation of the image content. As a result, users may not realize that the content in their camera images is not authentic. This paper addresses this issue by enabling users to recover the ‘unhallucinated’ version of the camera image to avoid misinterpretation of the image content. Our approach works by optimizing an image-specific multi-layer perceptron (MLP) decoder together with a modality-specific encoder so that, given the camera image, we can recover the image before hallucinated content was added. The encoder and MLP are self-contained and can be applied post-capture to the image without requiring access to the camera ISP. Moreover, the encoder and MLP decoder require only 180 KB of storage and can be readily saved as metadata within standard image formats such as JPEG and HEIC.

[CV-7] Grounding Video Reasoning in Physical Signals

【速读】:该论文旨在解决当前视频理解模型在物理事件识别中存在“表面正确但缺乏时空定位能力”的问题,即模型可能仅依赖文本统计规律回答关于倾倒、滑动或碰撞等事件的问题,却无法在时间或空间上准确定位这些物理现象。其解决方案的关键在于构建一个基于物理场景的接地基准测试(grounded benchmark),该基准扩展了V-STaR的“what–when–where”评估框架,涵盖四个视频源、六个物理领域、三种提示家族(physics、vstar_like和neutral_rstr),以及四种输入扰动条件(原始、打乱、删减和帧掩码)。通过统一生成每个视频片段的“接地事件记录”并从中派生不同提示家族的目标,该方法实现了对模型在物理推理、语义理解和鲁棒性方面的多维度诊断,揭示出物理提示仍是最强性能范式,而空间定位是当前模型最薄弱环节,从而推动视频问答(Video QA)基准需报告物理接地、提示感知与扰动感知的细粒度诊断指标。

链接: https://arxiv.org/abs/2604.21873
作者: Alibay Osmanli,Zixu Cheng,Shaogang Gong
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Benchmark for Grounding Video Reasoning in Physical Signals

点击查看摘要

Abstract:Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what–when–where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video QA reasoning benchmarks shall report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.

[CV-8] Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

【速读】:该论文旨在解决胶囊内镜(Capsule Endoscopy, CE)视频分析中长期存在的问题:当前研究主要集中在帧级分类与检测,而缺乏对视频级诊断驱动的摘要生成方法,导致临床实践中难以从数万帧冗余的正常图像中提取具有诊断意义的关键证据帧并做出准确诊断。其核心挑战在于诊断相关事件稀疏且易被大量正常帧淹没,同时单帧信息常因运动模糊、污迹、高光和视角快速变化等因素变得模糊不清。解决方案的关键在于提出DiCE框架,该框架模拟临床医生的阅读流程,包含三个模块:首先进行高效候选帧筛选;其次通过Context Weaver将候选帧组织成连贯的诊断上下文以区分不同病变事件;最后利用Evidence Converger聚合每个上下文内的多帧证据,形成稳健的片段级判断。这一诊断驱动的上下文推理范式显著提升了超长CE视频摘要的质量与临床可靠性。

链接: https://arxiv.org/abs/2604.21814
作者: Bowen Liu,Li Yang,Shanshan Song,Mingyu Tang,Zhifang Gao,Qifeng Chen,Yangqiu Song,Huimin Chen,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.

[CV-9] Multiscale Super Resolution without Image Priors

【速读】:该论文旨在解决超分辨率(Super-Resolution)重建中因平移不确定性导致的病态问题(ill-posed problem),即在单尺度低分辨率图像下,重建高分辨率图像时存在多解性和稳定性差的问题。其解决方案的关键在于利用不同尺度的低分辨率图像组合来使问题变为适定(well-posed)。具体而言,通过使用具有互质像素尺寸(pairwise coprime pixel sizes)的传感器采集图像,可构建一个具有稳定逆运算的系统;进一步地,可通过傅里叶域技术或迭代最小二乘法高效实现超分辨率重建。这种多尺度策略有效提升了重建的稳定性和精度,并揭示了噪声与分辨率之间的权衡关系。

链接: https://arxiv.org/abs/2604.21810
作者: Daniel Fu,Gabby Litterio,Pedro Felzenszwalb,Rashid Zia
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens). We show that images acquired with pairwise coprime pixel sizes lead to a system with a stable inverse, and furthermore, that super-resolution images can be reconstructed efficiently using Fourier domain techniques or iterative least squares methods. Our mathematical analysis provides an expression for the expected error of the least squares reconstruction for large signals assuming i.i.d. noise that elucidates the noise-resolution tradeoff. These results are validated through both one- and two-dimensional experiments that leverage charge-coupled device (CCD) hardware binning to explore reconstructions over a large range of effective pixel sizes. Finally, two-dimensional reconstructions for a series of targets are used to demonstrate the advantages of multiscale super-resolution, and implications of these results for common imaging systems are discussed.
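摘要的核心结论可以用一维玩具信号直观验证:单一尺度的盒式平均算子是奇异的(频域存在零点),而两个互质尺度组合后系统满秩,最小二乘即可精确重建。以下仅为示意,并非论文中基于 CCD 硬件装箱的实验设置:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 12
x = rng.standard_normal(N)              # 待恢复的高分辨率一维信号

def box_blur_matrix(N, s):
    # 循环边界下对 s 个相邻样本取平均(包含所有平移)的循环矩阵
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(s):
            A[i, (i + j) % N] = 1.0 / s
    return A

A2, A3 = box_blur_matrix(N, 2), box_blur_matrix(N, 3)  # 互质尺度 2 与 3
A = np.vstack([A2, A3])                 # 堆叠成联合观测系统
y = A @ x
x_rec, *_ = np.linalg.lstsq(A, y, rcond=None)
```

单独看,A2 在奈奎斯特频率处为零(秩 N-1),A3 在两个频率处为零(秩 N-2);由于 2 与 3 互质,二者的零点集不相交,堆叠后的系统满秩,重建是适定的。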

[CV-10] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval ACL2026

【速读】:该论文旨在解决当前组成图像检索(Composed Image Retrieval, CIR)中普遍存在的两个实际应用瓶颈:实体覆盖不足(Insufficient Entity Coverage)条款-实体错位(Clause-Entity Misalignment),这些问题源于现有方法依赖于简单修改文本(simple modification texts),难以覆盖复杂的多模态查询场景。为应对这一挑战,作者构建了两个指令丰富、支持多修改的图像检索数据集(M-FashionIQ 和 M-CIRR),并提出首个面向多修改场景设计的文本导向实体映射架构(Text-oriented Entity Mapping Architecture, TEMA)。TEMA 的核心创新在于通过显式建模文本描述与图像实体之间的对应关系,实现对复杂多修改查询的精准解析与匹配,在保持高检索精度的同时兼顾计算效率,从而显著提升 CIR 在真实应用场景中的可用性。

链接: https://arxiv.org/abs/2604.21806
作者: Zixu Li,Yupeng Hu,Zhiheng Fu,Zhiwei Chen,Yongqi Li,Liqiang Nie
机构: Shandong University (山东大学); Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA’s superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our codes and constructed multi-modification dataset (M-FashionIQ and M-CIRR) are available at this https URL.

[CV-11] SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth Domain Adaptation and Super-Resolution in Aerial Imagery

【速读】:该论文旨在解决遥感领域中高质量标注数据稀缺的问题,尤其在单目深度估计、域适应和超分辨率等任务中,由于缺乏精确的深度标注、受控光照变化以及多尺度配对影像,研究进展受限。解决方案的关键在于构建一个大规模合成数据集 SyMTRS,其基于高保真城市仿真流程生成,提供高分辨率 RGB 航拍图像(2048×2048)、像素级精确深度图、夜间版本以支持域适应,并包含 x2、x4 和 x8 缩放比例的低分辨率变体用于超分辨率任务。SyMTRS 是一个统一的多任务基准,能够同时支持几何理解、跨域鲁棒性和分辨率增强的研究,通过提供可控实验条件与一致的多域监督信号,填补了遥感研究中的关键空白。

链接: https://arxiv.org/abs/2604.21801
作者: Safouane El Ghazouali,Nicola Venturi,Michael Rueegsegger,Umberto Michelucci
机构: TOELT LLC AI lab / HSLU (HSLU); armasuisse S+T (armasuisse S+T)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this Github repository: this https URL.

[CV-12] From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

【速读】:该论文旨在解决如何利用计算机视觉方法对社交媒体上的气候传播 discourse 进行系统性分析的问题,以识别哪些传播策略能有效激发公众关注。其关键解决方案在于构建一个完整的应用导向的分类体系(application-based taxonomy design),并基于两个大规模图像数据集(分别包含1,038张专家标注图像和超过120万张带人工验证标签的图像)对六种可提示的视觉-语言模型(Vision-Language Models, VLMs)与十五种零样本CLIP类模型进行基准测试。研究发现,Gemini-3.1-flash-lite在所有超类别和数据集上表现最优,且与中等规模的开源模型差距较小;更重要的是,作者提出“分布层面评估”(distributional evaluation)作为核心方法论创新——即使单图预测准确率不高,VLMs仍能可靠捕捉群体层面的趋势,从而为大规模社交媒体话语分析提供可行起点。此外,特定标注维度的提示工程(prompt engineering)显著提升性能,而链式思维推理(chain-of-thought reasoning)反而降低效果。

链接: https://arxiv.org/abs/2604.21786
作者: Katharina Prasse,Steffen Jung,Isaac Bravo,Stefanie Walter,Patrick Knab,Christian Bartelt,Margret Keuper
机构: University of Mannheim (曼海姆大学); Technical University Munich (慕尼黑工业大学); Clausthal University of Technology (克劳斯塔尔工业大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at this https URL.
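摘要强调的"分布层面评估"可用一个三分类玩具模拟来说明:即使单图准确率只有中等水平,只要混淆矩阵已知,群体层面的类别分布仍可通过解线性方程组去偏恢复(经典的 quantification 思路;混淆矩阵与类别先验均为虚构):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
p_true = np.array([0.6, 0.3, 0.1])            # 真实类别先验(虚构)
C = np.array([[0.80, 0.10, 0.10],             # C[i, j] = P(pred=j | true=i)
              [0.15, 0.70, 0.15],
              [0.10, 0.20, 0.70]])

n = 200_000
y = rng.choice(K, size=n, p=p_true)
pred = np.empty(n, dtype=int)
for k in range(K):                             # 按混淆矩阵模拟带噪的逐图预测
    idx = np.flatnonzero(y == k)
    pred[idx] = rng.choice(K, size=idx.size, p=C[k])

acc = (pred == y).mean()                       # 单图准确率仅为中等(约 0.76)
q = np.bincount(pred, minlength=K) / n         # 预测标签的原始边缘分布
p_hat = np.linalg.solve(C.T, q)                # 去偏后的群体先验估计
```

q 本身偏向均匀分布,但解 C^T p = q 后的 p_hat 能以远高于单图准确率的精度恢复真实比例,这正是"逐图指标一般、分布趋势可靠"的含义。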

[CV-13] Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

【速读】:该论文旨在解决动态场景下重拍摄视频时相机控制精度受限的问题,其根本瓶颈在于非刚性场景中配对多视角数据的严重稀缺。解决方案的关键在于提出了一种高度可扩展的自监督框架,能够利用互联网规模的单目视频进行训练;核心创新是生成伪多视角训练三元组(源视频、几何锚点、目标视频),通过从单个输入视频中提取不同的平滑随机游走裁剪轨迹作为源与目标视图,并利用密集跟踪场前向扭曲源视频首帧合成几何锚点,从而模拟推理阶段预期的失真点云输入。该策略引入空间错位和人工遮挡,迫使模型通过跨时间和视角主动路由与重投影缺失的高保真纹理来隐式学习4D时空结构,最终在推理阶段使用最小适应的扩散Transformer结合4D点云锚点实现最优的时间一致性、鲁棒相机控制及复杂动态场景下的高保真新视角合成。

链接: https://arxiv.org/abs/2604.21776
作者: Avinash Paliwal,Adithya Iyer,Shivin Yadav,Muhammad Ali Afridi,Midhun Harikumar
机构: Morphic Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
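摘要中数据构造的关键一步,是从同一单目视频里抽取两条互不相同的"平滑随机游走"裁剪轨迹作为源视图与目标视图。下面给出该采样思路的一个最小示意(步长、平滑窗口等参数均为假设值):

```python
import numpy as np

def random_walk_crops(n_frames, frame_hw, crop_hw, step=4.0, smooth=9, seed=0):
    """平滑随机游走裁剪轨迹:高斯步长累加,再做滑动平均平滑,
    最后裁剪到画面范围内;返回每帧裁剪框的左上角 (y, x)。"""
    rng = np.random.default_rng(seed)
    H, W = frame_hw
    h, w = crop_hw
    path = np.cumsum(rng.normal(0.0, step, size=(n_frames, 2)), axis=0)
    kernel = np.ones(smooth) / smooth
    for d in range(2):                              # 逐维滑动平均
        path[:, d] = np.convolve(path[:, d], kernel, mode="same")
    path += np.array([(H - h) / 2, (W - w) / 2])    # 从画面中心附近出发
    path[:, 0] = np.clip(path[:, 0], 0, H - h)
    path[:, 1] = np.clip(path[:, 1], 0, W - w)
    return path.astype(int)

src = random_walk_crops(64, (720, 1280), (360, 640), seed=1)  # 源视图轨迹
tgt = random_walk_crops(64, (720, 1280), (360, 640), seed=2)  # 目标视图轨迹
```

两条轨迹独立采样,天然带来空间错位与人工遮挡,迫使模型跨时间、跨视角搬运纹理,而非简单复制当前源帧。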

[CV-14] Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation CVPR2026

【速读】:该论文旨在解决开放集连续测试时适应(Open-set Continual Test-Time Adaptation, OCTTA)场景下的性能下降问题,即模型在面对持续变化的域分布和同时出现的未知语义类别时,导致特征空间坍塌,进而严重影响分类准确性和分布外(Out-of-Distribution, OOD)检测能力。解决方案的关键在于提出DOCO框架,其核心机制为:首先基于动态适应条件对样本进行分割,区分可能的已知类别(In-Distribution, ID)与分布外(OOD)样本;随后仅使用ID样本学习一个域补偿提示(domain compensation prompt),通过特征统计对齐源域并辅以结构保持正则化防止语义扭曲;最后将该提示传播至同批次的OOD样本中,从而隔离其语义新颖性以实现更可靠的OOD检测,形成一个鲁棒且协同的闭环适应机制。

链接: https://arxiv.org/abs/2604.21772
作者: Yingkai Yang,Chaoqi Chen,Hui Huang
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.
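DOCO 的第一步是把每个批次的样本切分为"可能是 ID"与"可能是 OOD"两组。论文的切分准则是动态、随适应过程变化的,具体形式未在摘要中给出;下面用一个归一化熵阈值作为替代准则做最小示意(阈值为假设值):

```python
import numpy as np

def split_id_ood(probs, tau=0.5):
    """基于预测熵的样本切分(DOCO 动态准则的简化替代):
    低熵(高置信)样本视为可能的 ID,返回布尔掩码。"""
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    ent = ent / np.log(probs.shape[1])   # 归一化到 [0, 1]
    return ent < tau                      # True -> 可能是 ID

probs = np.array([[0.97, 0.01, 0.01, 0.01],   # 高置信样本
                  [0.25, 0.25, 0.25, 0.25]])  # 完全不确定的样本
mask = split_id_ood(probs)
```

切分之后,仅用 ID 样本学习域补偿提示,再把该提示传播给同批次的 OOD 样本,即构成摘要所述的闭环。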

[CV-15] Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement CVPR2026

【速读】:该论文旨在解决移动设备上图像增强模型在训练与部署阶段因精度降低导致性能下降的问题(即训练-部署不匹配问题)。其关键解决方案是提出一种专为移动端部署设计的高效图像增强模型,采用分层网络架构结合门控编码块和多尺度细化机制以保留细粒度视觉特征,并引入量化感知训练(Quantization-Aware Training, QAT)在训练过程中模拟低精度表示的影响,使网络能够自适应低精度环境,从而避免传统后训练量化(Post-Training Quantization, PTQ)带来的质量损失。

链接: https://arxiv.org/abs/2604.21743
作者: Dat To-Thanh,Nghia Nguyen-Trong,Hoang Vo,Hieu Bui-Minh,Tinh-Anh Nguyen-Nhu
机构: University of Science, VNU-HCM, Vietnam (胡志明市国家大学科学技术大学); University of Information Technology, VNU-HCM, Vietnam (胡志明市国家大学信息科技大学); Da Nang University of Economics, Vietnam (岘港经济大学); Ho Chi Minh University of Technology, VNU-HCM, Vietnam (胡志明市技术大学); Vietnam National University, Ho Chi Minh City, Vietnam (胡志明市国家大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures. Accepted at the Mobile AI (MAI) 2026 Workshop at CVPR 2026

点击查看摘要

Abstract:Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features. Moreover, we incorporate Quantization-Aware Training (QAT) to simulate the effects of low-precision representation during the training process. This allows the network to adapt and prevents the typical drop in quality seen with standard post-training quantization (PTQ). Experimental results demonstrate that the proposed method produces high-fidelity visual output while maintaining the low computational overhead needed for practical use on standard mobile devices. The code will be available at this https URL.
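量化感知训练(QAT)的核心操作,是在前向传播中插入"量化-反量化"(fake quantize)节点,让网络在训练期就暴露于部署时的舍入误差。下面是该前向模拟的一个最小 numpy 示意(省略了直通估计梯度,也并非论文的实际实现):

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """把张量量化到 num_bits 位整数网格再反量化回浮点,
    模拟低精度部署引入的舍入误差。"""
    qmin, qmax = 0, 2 ** num_bits - 1
    span = float(x.max() - x.min())
    scale = span / (qmax - qmin) if span > 0 else 1.0
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale      # 反量化:网格值映射回浮点

x = np.linspace(-1.0, 3.0, 101)
xq = fake_quant(x)
```

由于每个值被吸附到最近的网格点,逐元素误差不超过半个量化步长;训练时网络会学着适应这一误差,从而避免训练后量化(PTQ)常见的质量骤降。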

[CV-16] Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection CVPR2026

【速读】:该论文旨在解决预训练视觉-语言模型(如CLIP)在测试时适应(test-time adaptation, TTA)中对混合域分布偏移(mixed-domain shifts)敏感的问题。现有TTA方法通常假设测试样本来自单一一致域,但在实际场景中,测试数据常包含多个具有不同特征的域,导致性能显著下降。其解决方案的关键在于提出Ramen框架,通过主动样本选择机制实现鲁棒的测试时适应:首先基于两个标准——域一致性(domain consistency)确保适配聚焦于相似域的数据,以及预测平衡性(prediction balance)缓解因预测偏差引起的适应失衡;其次利用嵌入-梯度缓存(embedding-gradient cache)高效存储历史测试样本的嵌入与样本级梯度,从而在无需额外前向或反向传播的情况下完成模型更新,提升了适应效率与稳定性。

链接: https://arxiv.org/abs/2604.21728
作者: Wenxuan Bao,Yanjun Zhao,Xiyuan Yang,Jingrui He
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026 (Findings Track)

点击查看摘要

Abstract:Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at this https URL .
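Ramen 的嵌入-梯度缓存思路可以用几十行代码勾勒:为历史测试样本存下(嵌入, 样本级梯度)对,新样本到来时按余弦相似度检索近邻并平均其缓存梯度,无需任何额外前向/反向传播。以下仅为玩具示意,省略了论文中的"预测平衡"准则:

```python
import numpy as np

class EmbeddingGradientCache:
    """存储历史测试样本的 (嵌入, 梯度);对新样本按相似度
    检索域一致的近邻,并聚合其缓存梯度用于模型更新。"""
    def __init__(self):
        self.embs, self.grads = [], []

    def add(self, emb, grad):
        self.embs.append(emb / np.linalg.norm(emb))
        self.grads.append(grad)

    def aggregate(self, query, k=3):
        E = np.stack(self.embs)
        sims = E @ (query / np.linalg.norm(query))
        top = np.argsort(sims)[-k:]          # 域一致性:取最相似的 k 个
        return np.mean([self.grads[i] for i in top], axis=0)

cache = EmbeddingGradientCache()
for _ in range(3):                            # "域 A" 样本,梯度为 +1
    cache.add(np.array([1.0, 0.0]), np.array([1.0]))
for _ in range(3):                            # "域 B" 样本,梯度为 -1
    cache.add(np.array([0.0, 1.0]), np.array([-1.0]))
g = cache.aggregate(np.array([0.9, 0.1]), k=3)  # 查询靠近域 A
```

查询落在域 A 附近时,聚合梯度完全来自域 A 的近邻,域 B 的梯度不会混入,体现了"为每个样本定制相关批次"的做法。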

[CV-17] Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation CVPR2026

【速读】:该论文旨在解决前馈式视觉几何估计(feed-forward visual geometry estimation)中多帧模型与单帧模型之间的性能权衡问题:尽管多帧模型通常能提供更好的跨帧一致性,但其在单帧精度上往往逊于强健的单帧方法。为解决此问题,作者通过系统性消融实验揭示了若干关键因素,包括数据多样性与质量对性能提升的重要性、常见置信度感知损失和基于梯度的损失机制可能带来的负面影响,以及联合监督(序列级与帧级对齐)的有效性。解决方案的关键在于提出CARVE模型,其核心创新包括一种强制深度图、相机参数与点云映射之间一致性的损失函数,以及一种高效利用高分辨率输入信息的架构设计,从而在保持高分辨率细节的同时增强跨帧一致性,显著提升了点云重建、视频深度估计及相机位姿/内参估计等任务的性能表现。

链接: https://arxiv.org/abs/2604.21713
作者: Guangkai Xu,Hua Geng,Huanyi Zheng,Songyi Yin,Yanlong Sun,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. GitHub Page: this https URL

点击查看摘要

Abstract:Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
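CARVE 的一致性损失要求深度图、相机参数与 pointmap 三类输出相互吻合:用预测深度与内参反投影得到的 pointmap 应与直接预测的 pointmap 一致。下面是该约束的一个最小示意(符号与具体损失形式为假设,并非论文公式):

```python
import numpy as np

def unproject(depth, K):
    """用内参 K 把深度图反投影为相机坐标系下的 pointmap (H, W, 3)。"""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T          # 每个像素的视线方向
    return rays * depth[..., None]

def consistency_loss(pointmap_pred, depth_pred, K_pred):
    # 惩罚 "直接预测的 pointmap" 与 "深度+内参隐含的 pointmap" 之间的分歧
    return float(np.abs(pointmap_pred - unproject(depth_pred, K_pred)).mean())

K = np.array([[100.0, 0.0, 8.0], [0.0, 100.0, 6.0], [0.0, 0.0, 1.0]])
depth = 1.0 + np.random.default_rng(0).random((12, 16))
pm = unproject(depth, K)                     # 构造一组完全一致的预测
```

当三者完全一致时损失为零;深度被扰动后,隐含 pointmap 随之偏移,损失上升,从而在训练中把三类输出绑定在同一几何上。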

[CV-18] Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

【速读】:该论文旨在解决单目RGB图像中3D人体网格恢复(3D human mesh recovery)在部分或严重遮挡情况下的挑战,即现有回归方法在复杂场景下易产生不合理的重建结果,而扩散模型虽能提供强生成先验但可能因过度依赖生成导致对罕见姿态的保真度下降。解决方案的关键在于提出一种类脑协同框架,融合视觉Transformer(ViT)的判别能力与条件扩散模型的生成能力:通过ViT路径提取可见区域的确定性视觉线索,扩散路径生成结构一致的人体表示;并设计多样一致性特征学习模块以对齐判别特征与生成先验,以及跨注意力多层级融合机制实现语义层级间的双向交互,从而提升重建精度与鲁棒性。

链接: https://arxiv.org/abs/2604.21712
作者: Yang Liu,Zhiyong Zhang
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.

[CV-19] WorldMark: A Unified Benchmark Suite for Interactive Video World Models

【速读】:该论文旨在解决当前交互式视频生成模型(Interactive Video Generation Models)缺乏统一评估标准的问题,即各模型使用私有场景和轨迹进行独立评测,导致无法公平比较。其关键解决方案是提出首个标准化基准WorldMark,核心包括:(1)引入统一的动作映射层(unified action-mapping layer),将通用WASD控制指令转换为各模型原生输入格式,实现跨模型在相同场景与动作序列下的直接对比;(2)构建包含500个测试用例的分层测试套件,覆盖第一/第三人称视角、写实与风格化场景及三类难度级别;(3)设计模块化评估工具包,支持视觉质量、控制对齐度与世界一致性等维度的可扩展测评。该方案为交互式图像到视频世界模型提供了可复现、可比较的基准平台。

链接: https://arxiv.org/abs/2604.21686
作者: Xiaojie Xu,Zhengyuan Lin,Kang He,Yukang Feng,Xiaofeng Mao,Yuanyang Yin,Kaipeng Zhang,Yongtao Ge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions – identical scenes, identical action sequences, and a unified control interface – needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model’s native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (this http URL), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.

[CV-20] Sapiens2 ICLR2026

【速读】:该论文旨在解决人类视觉任务中模型泛化能力弱、输出 fidelity 低以及下游任务适应性差的问题,尤其在密集预测(如姿态估计、体部分割)和零样本/少样本场景下表现不足。其核心解决方案在于三个方面:首先,提出统一的预训练目标,融合掩码图像重建(masked image reconstruction)与自蒸馏对比学习(self-distilled contrastive objectives),以同时捕获低层细节和高层语义特征;其次,在数据层面构建包含10亿张高质量人类图像的精调数据集,并提升任务标注的质量与数量;最后,架构上引入前沿模型优化技术,支持更长训练周期并增强稳定性,4K版本采用窗口注意力机制(windowed attention)扩展空间上下文感知能力,并以2K输出分辨率进行预训练,显著提升了多项任务性能,包括姿态估计(+4 mAP)、体部分割(+24.3 mIoU)、法向量估计(45.6%角误差降低)及新增点云图(pointmap)和反照率(albedo)估计任务。

链接: https://arxiv.org/abs/2604.21681
作者: Rawal Khirodkar,He Wen,Julieta Martinez,Yuan Dong,Su Zhaoen,Shunsuke Saito
机构: Meta Reality Labs (Meta现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: this https URL
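Sapiens2 的 4K 变体采用窗口注意力以扩展空间上下文:把特征图切成互不重叠的窗口,自注意力只在窗口内计算,使代价随 H*W 线性而非二次增长。窗口划分这一步可以这样示意(通用写法,非该模型源码):

```python
import numpy as np

def window_partition(x, win):
    """把 (H, W, C) 特征图切成互不重叠的 (win, win) 窗口,
    返回形状为 (num_windows, win, win, C) 的数组。"""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

x = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
windows = window_partition(x, 4)             # 8x8 特征图 -> 4 个 4x4 窗口
```

窗口按行优先排列:windows[0] 对应左上角 4x4 区域,windows[1] 对应其右侧区域,注意力随后在每个窗口内独立进行。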

[CV-21] Encoder-Free Human Motion Understanding via Structured Motion Descriptions

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的人体运动理解方法在运动-语言对齐上存在的局限性问题,即现有方法依赖专用编码器将运动特征映射到LLM的嵌入空间,受限于跨模态表示与对齐的复杂性。其解决方案的关键在于提出一种规则驱动、确定性的结构化运动描述(Structured Motion Description, SMD)方法,该方法将关节位置序列转化为包含关节角度、身体部位运动和全局轨迹的结构化自然语言描述,从而将运动信息以文本形式表达,使LLM能够直接利用其预训练中关于身体部位、空间方向和运动语义的知识进行推理,无需额外学习编码器或对齐模块。此方法显著提升了运动问答(BABEL-QA 66.7%,HuMMan-QA 90.1%)和运动字幕生成(R@1 0.584,CIDEr 53.16)的性能,并具备良好的跨模型兼容性和可解释性。

链接: https://arxiv.org/abs/2604.21668
作者: Yao Zhang,Zhuchenyang Liu,Thomas Ploetz,Yu Xiao
机构: Aalto University (阿尔托大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM’s embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose \textbfStructured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at this https URL.
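SMD 的确定性规则,核心是由关节坐标算出关节角再模板化为自然语言。下面用"肘关节"演示这一从坐标到文本的转换(角度阈值与措辞为假设,并非论文的完整规则集):

```python
import numpy as np

def joint_angle(a, b, c):
    """关节 b 处由线段 b->a 与 b->c 构成的夹角(度)。"""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def describe_elbow(shoulder, elbow, wrist):
    # 按角度区间映射到文字描述(阈值为示意值)
    ang = joint_angle(shoulder, elbow, wrist)
    state = "fully extended" if ang > 160 else "bent" if ang > 60 else "sharply bent"
    return f"The elbow is {state} at about {ang:.0f} degrees."

desc = describe_elbow(np.array([0.0, 0.0, 0.0]),
                      np.array([0.0, -0.3, 0.0]),
                      np.array([0.0, -0.6, 0.0]))   # 手臂伸直
```

规则完全确定、无需学习,因此同一段文本输入可直接跨 LLM 复用,LLM 只需调用其关于身体部位与空间方向的预训练知识进行推理。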

[CV-22] Causal Disentanglement for Full-Reference Image Quality Assessment

【速读】:该论文旨在解决现有基于深度网络的全参考图像质量评估(FR-IQA)模型在处理复杂退化场景时泛化能力不足的问题,尤其是当标注数据稀缺或分布偏移时性能下降明显。其解决方案的关键在于提出一种基于因果推断与解耦表示学习的新范式:首先通过利用参考图与失真图之间的内容不变性,将退化特征与内容特征进行解耦;其次借鉴人类视觉掩蔽效应设计掩蔽模块,建模图像内容与退化特征间的因果关系,从而提取受内容影响的退化特征;最终基于这些退化特征进行质量评分预测,支持监督回归或无标签维度约简。该方法在标准和非标准自然图像域(如水下、医学、屏幕内容等)均表现出卓越的跨域泛化能力,尤其适用于无标签数据场景。

链接: https://arxiv.org/abs/2604.21654
作者: Zhen Zhang,Jielei Chu,Tian Zhang,Weide Liu,Fengmao Lv,Tianrui Li,Jun Cheng,Yuming Fang
机构: Southwest Jiaotong University (西南交通大学); Northeast Normal University (东北师范大学); Institute for Infocomm Research, Agency for Science, Technology and Research (新加坡科技研究局信息通信研究所); Jiangxi University of Finance and Economics (江西财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

[CV-23] DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在训练图像包含瞬态物体(transient objects)时因违反多视角一致性而导致重建性能显著下降的问题。现有方法面临循环依赖困境:精确的瞬态检测需要高质量的静态场景重建,而干净的重建又依赖于可靠的瞬态掩码。解决方案的关键在于提出DualSplat框架,其核心思想是将首次重建失败转化为显式先验(Failure-to-Prior),通过结合光度残差、特征不匹配和SAM2实例边界构建对象级伪掩码(pseudo-masks),进而指导第二次优化阶段的清洁重建;同时引入轻量级MLP在线优化掩码,逐步从先验监督过渡到自一致性约束,从而有效提升复杂场景中瞬态区域的重建质量。

链接: https://arxiv.org/abs/2604.21631
作者: Xu Wang,Zhiru Wang,Shiyun Xie,Chengwei Pan,Yisong Chen
机构: Beihang University (北京航空航天大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,6 figures, accepted to Computer Vision and Pattern Recognition Conference 2026

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.

[CV-24] DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion CVPR

【速读】:该论文旨在解决人脸活体识别系统中日益复杂的形态攻击(face morphing attack)问题,此类攻击通过融合两张人脸图像生成难以检测的伪造样本,对身份验证系统的安全性构成严重威胁。解决方案的关键在于提出一种双流扩散模型框架 DCMorph,其核心创新包括:(1) 解耦的交叉注意力插值机制,将源人脸的身份特异性特征同时注入去噪过程,实现显式的双身份条件控制;(2) 基于 DDIM 反演的球面插值方法,在潜在空间中生成几何一致的初始潜变量表示,从而保留结构属性并提升重建保真度。该方法在多个先进人脸识别系统上均实现了最高攻击成功率,且具备较强的隐蔽性,显著优于现有基于图像或 GAN 的攻击技术。

链接: https://arxiv.org/abs/2604.21627
作者: Tahar Chettaoui,Eduarda Caldeira,Guray Ozgur,Raghavendra Ramachandra,Fadi Boutros,Naser Damer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted At CVPR-W 2026

点击查看摘要

Abstract:Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining challenging to detect by current morphing attack detection solutions.

[CV-25] Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

【速读】:该论文旨在解决参数化降维投影(如UMAP和t-SNE)在面对输入数据扰动时局部稳定性不可预测的问题,即测量噪声或数据漂移可能导致二维嵌入布局发生不可控的形变。其解决方案的关键在于提出一个系统性的稳定性评估框架:通过在选定锚点周围施加高斯扰动,量化邻域在嵌入空间中的平均位移、偏差及最近锚点分配误差,并结合每锚点的位移向量图、局部主成分分析(PCA)椭球和Voronoi区域误分配可视化,实现对投影局部稳定性的细致诊断。该方法能够识别出仅靠重建误差或邻域保真度指标无法发现的不稳定区域,从而为高可靠性可视化提供理论支撑与实践工具。

链接: https://arxiv.org/abs/2604.21617
作者: Frederik L. Dennig,Daniel A. Keim
机构: University of Konstanz (康斯坦茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, LaTeX, to appear at the 17th International EuroVis Workshop on Visual Analytics

点击查看摘要

Abstract:Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding. Our approach combines quantitative measures of mean displacement, bias, and nearest-anchor assignment error with per-anchor visualizations of displacement vectors, local PCA ellipsoids, and Voronoi misassignment for detailed inspection. We demonstrate the framework’s effectiveness on UMAP- and t-SNE-based neural projectors of varying network sizes and study the effect of Jacobian regularization as a gradient-based robustness strategy. We apply our framework to the MNIST and Fashion-MNIST datasets. The results show that our framework identifies unstable projection regions invisible to reconstruction error or neighborhood-preservation metrics.
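上述“对锚点施加高斯扰动并度量二维嵌入形变”的探测流程可用一个极简数值草图说明。以下代码中的线性投影器仅是替代真实参数化投影模型(如UMAP/t-SNE神经投影器)的假设实现,并非论文官方代码:

```python
import numpy as np

def projector(x):
    # Toy stand-in for a parametric projector; here a fixed linear 5-D -> 2-D map.
    W = np.array([[1.0, 0.5, 0.0, -0.3, 0.2],
                  [0.0, 1.0, 0.7, 0.1, -0.4]])
    return x @ W.T

def stability_probe(anchors, sigma=0.05, n_samples=200, seed=0):
    """Perturb each anchor with Gaussian noise and measure how the 2-D
    embedding deforms: mean displacement, bias (norm of the mean offset),
    and nearest-anchor misassignment rate."""
    rng = np.random.default_rng(seed)
    base = projector(anchors)                     # (k, 2) unperturbed embeddings
    stats = []
    for i, a in enumerate(anchors):
        noise = rng.normal(0.0, sigma, size=(n_samples, a.shape[0]))
        emb = projector(a + noise)                # embeddings of perturbed copies
        disp = emb - base[i]
        mean_disp = np.linalg.norm(disp, axis=1).mean()
        bias = np.linalg.norm(disp.mean(axis=0))
        # fraction of perturbed points whose nearest anchor in 2-D changed
        d = np.linalg.norm(emb[:, None, :] - base[None, :, :], axis=2)
        misassign = np.mean(d.argmin(axis=1) != i)
        stats.append((mean_disp, bias, misassign))
    return stats

anchors = np.eye(5)[:3]       # three anchor points in the 5-D input space
stats = stability_probe(anchors)
```

论文框架在此类统计量之外还叠加了逐锚点的位移向量图、局部PCA椭球与Voronoi误分配可视化,用于细粒度诊断。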

[CV-26] Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

【速读】:该论文旨在解决高保真动态4D生成难题,其核心挑战在于时间伪影(temporal artifacts)和高昂的计算开销,这主要源于4D训练数据稀缺及传统方法难以高效建模复杂时空依赖关系。解决方案的关键在于提出Sculpt4D框架,其核心创新是引入块稀疏注意力机制(Block Sparse Attention)——该机制通过锚定初始帧来保持物体身份一致性,同时利用时间衰减稀疏掩码捕捉丰富的运动动态,从而在不增加二次计算复杂度的前提下,显著降低网络总计算量(减少56%),实现高保真、时序一致的4D合成,推动了4D生成技术向高效可扩展方向发展。

链接: https://arxiv.org/abs/2604.21592
作者: Minghao Yin,Wenbo Hu,Jiale Xu,Ying Shan,Kai Han
机构: The University of Hong Kong (香港大学); ARC Lab, Tencent PCG (腾讯PCG ARC实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
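其中“锚定首帧 + 时间衰减稀疏掩码”的块稀疏注意力思想可用如下帧级掩码草图说明。具体的衰减调度是本文的假设,并非Sculpt4D的官方实现:

```python
import numpy as np

def block_sparse_mask(T, window=2):
    """Frame-level attention mask: every frame attends to the anchor
    frame 0 (identity preservation) plus a short causal window of recent
    frames (motion dynamics). The window schedule here is an assumption."""
    mask = np.zeros((T, T), dtype=bool)
    for t in range(T):
        mask[t, 0] = True                  # anchor on the initial frame
        lo = max(0, t - window)
        mask[t, lo:t + 1] = True           # time-local causal window
    return mask

m = block_sparse_mask(8, window=2)
density = m.sum() / m.size                 # fraction of attended pairs vs full attention
```

与完整注意力的 T² 个配对相比,该掩码的稠密度随窗口宽度线性增长,这正是摘要中“规避二次开销”的来源。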

[CV-27] OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

【速读】:该论文旨在解决现有3D人体建模方法在处理多模态输入(如点云、多视角图像或深度图)时普遍依赖已知度量尺度的问题,尤其针对AI生成资产中常见的尺度失真情况缺乏有效应对策略。其核心解决方案是提出OmniFit方法,关键创新在于设计了一个简单而高效的条件Transformer解码器,可直接将表面点映射至密集的人体关键点(body landmarks),进而用于SMPL-X参数拟合;同时引入一个可插拔的图像适配器以补充缺失的几何信息,并结合专用尺度预测模块将目标对象重缩放到标准人体比例,从而实现对真实与合成资产的无尺度约束拟合。

链接: https://arxiv.org/abs/2604.21575
作者: Zeyu Cai,Yuliang Xiu,Renke Wang,Zhijing Shao,Xiaoben Li,Siyuan Yu,Chao Xu,Yang Liu,Baigui Sun,Jian Yang,Zhenyu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL

点击查看摘要

Abstract:Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.

[CV-28] CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction

【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)成本高、通量低的问题,提出通过常规苏木精-伊红(Hematoxylin and Eosin, HE)染色切片预测空间基因表达的替代方法。现有模型在真实场景下的“留一slide-out”评估中常因切片级外观变化和回归驱动的过平滑效应而抑制生物有意义的变异。解决方案的关键在于提出一种两阶段框架CHRep:第一阶段通过联合优化相关性感知回归、对称图像-表达对齐和坐标诱导的空间拓扑正则化,学习结构感知表示;第二阶段引入轻量级校准模块,在无需微调主干网络的情况下,结合训练集非参数估计与幅度正则化修正模块,提升跨切片鲁棒性。该方法通过拓扑保持的表示学习与后处理校准相结合,实现了稳定邻域检索与可控偏差修正,显著优于现有嵌入对齐或基于检索的迁移方法。

链接: https://arxiv.org/abs/2604.21573
作者: Changfan Wang,Xinran Wang,Donghai Liu,Fei Su,Lulu Sun,Zhicheng Zhao,Zhu Meng
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Key Laboratory of Network System and Network Culture (北京市网络系统与网络文化重点实验室); Peking University Third Hospital (北京大学第三医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene expression from routine hematoxylin and eosin (HE) slides is a promising alternative, yet under realistic leave-one-slide-out evaluation, existing models often suffer from slide-level appearance shifts and regression-driven over-smoothing that suppress biologically meaningful variation. CHRep is a two-phase framework for robust histology-to-expression prediction. In the training phase, CHRep learns a structure-aware representation by jointly optimizing correlation-aware regression, symmetric image-expression alignment, and coordinate-induced spatial topology regularization. In the inference phase, cross-slide robustness is improved without backbone fine-tuning through a lightweight calibration module trained on the training slides, which combines a non-parametric estimate from a training gallery with a magnitude-regularized correction module. Unlike prior embedding-alignment or retrieval-based transfer methods that rely on a single prediction route, CHRep couples topology-preserving representation learning with post-hoc calibration, enabling stable neighborhood retrieval and controlled bias correction under slide-level shifts. Across the three cohorts, CHRep consistently improves gene-wise correlation under leave-one-slide-out evaluation, with the largest gains observed on Alex+10x. Relative to HAGE, the Pearson correlation coefficient on all considered genes [PCC(ACG)] increases by 4.0% on cSCC and 9.8% on HER2+. Relative to mclSTExp, PCC(ACG) further improves by 39.5% on Alex+10x, together with 9.7% and 9.0% reductions in mean squared error (MSE) and mean absolute error (MAE), respectively.
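文中按基因计算Pearson相关系数并取平均(PCC(ACG)风格)的评估方式可示意如下,仅为指标计算的极简草图,示例数据为虚构:

```python
import numpy as np

def genewise_pcc(pred, true):
    """Mean Pearson correlation over genes (columns): correlate predicted
    vs. measured expression per gene across spots, then average."""
    pred = np.asarray(pred, float)
    true = np.asarray(true, float)
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return (num / den).mean()

# a perfect affine prediction yields a per-gene PCC of 1
true = np.arange(12.0).reshape(4, 3)      # 4 spots x 3 genes (toy data)
pred = 2.0 * true + 1.0
score = genewise_pcc(pred, true)
```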

[CV-29] Deep kernel video approximation for unsupervised action segmentation ICPR2026

【速读】:该论文旨在解决单视频无监督动作分割(per-video unsupervised action segmentation)问题,即在不依赖标注数据的情况下,自动将视频划分为具有语义一致性的动作片段,适用于存储受限或隐私敏感的应用场景。其核心解决方案在于:在神经切线核(Neural Tangent Kernel, NTK)定义的深度核空间中学习视频帧分布的近似表示,并通过最大均值差异(Maximum Mean Discrepancy, MMD)作为分布间距离度量来优化分割结果。该方法的优势在于MMD具备几何保持性且比最优传输(optimal transport)更易优化和高效,同时NTK能有效避免联合学习输入与核函数时的平凡解(trivial solution),从而实现更准确的动作边界定位与段数自适应分割,在六个标准基准上达到优于现有方法的性能表现。

链接: https://arxiv.org/abs/2604.21572
作者: Silvia L. Pintea,Jouke Dijkstra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR 2026

点击查看摘要

Abstract:This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD), which is a geometry-preserving metric in distribution space and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize and faster. We choose neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power compared to fixed kernels, and because NTKs sidestep the trivial solution when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work when the number of segments is unknown.
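摘要中作为分布距离的最大均值差异(MMD)可按如下方式计算。此处以RBF核代替论文使用的NTK核,仅作说明:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix between two feature sets.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between two
    frame-feature sets. The paper evaluates MMD under an NTK; the RBF
    kernel here is only an illustrative stand-in."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

X = np.random.default_rng(0).normal(size=(10, 2))
d_same = mmd2(X, X)          # identical distributions -> 0
d_diff = mmd2(X, X + 5.0)    # shifted distribution -> clearly positive
```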

[CV-30] Component-Based Out-of-Distribution Detection

【速读】:该论文旨在解决现有分布外(Out-of-Distribution, OOD)检测方法在敏感性与稳定性之间的权衡问题,特别是全局表示会抑制局部OOD线索、基于patch的方法因混杂的伪相关性和噪声而不稳定,且难以识别由有效内分布(In-Distribution, ID)组件构成的组合型OOD样本。其解决方案的关键在于提出一种无需训练的基于组件的OOD检测框架(Component-Based OOD Detection, CoOD),通过将输入分解为功能组件,并引入两个核心指标:用于检测局部外观变化的组件偏移分数(Component Shift Score, CSS)和用于识别跨组件组合不一致性的组合一致性分数(Compositional Consistency Score, CCS),从而实现对粗粒度和细粒度OOD样本的有效检测。

链接: https://arxiv.org/abs/2604.21546
作者: Wenrui Liu,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise; neither is effective in detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive a Component Shift Score (CSS) to detect local appearance shifts, and a Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.

[CV-31] Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma WSI using foundation models

【速读】:该论文旨在解决肺腺癌(Lung Adenocarcinoma, LUAD)分级中因依赖病理切片的生长模式识别而带来的标注负担过重问题。传统深度学习方法通常采用基于图像块(patch-level)的分类或分割策略,需要大量精细标注数据。为此,作者提出了一种基于注意力机制的多实例学习(Attention-Based Multiple Instance Learning, ABMIL)框架,在全切片(whole slide)层面预测主导生长模式,从而减少人工标注需求。其解决方案的关键在于:利用预训练病理基础模型(pathology foundation models)作为图像块编码器(patch encoders),通过冻结或微调方式提取判别性特征,并结合注意力机制实现空间加权聚合,最终在滑动窗口监督下提升预测鲁棒性与准确性。实验表明,微调后的编码器(如Prov-GigaPath)在ABMIL框架下达到最高一致性指标(κ = 0.699)。

链接: https://arxiv.org/abs/2604.21530
作者: Laura Valeria Perez-Herrera,M.J. Garcia-Gonzalez,Karen Lopez-Linares
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (κ = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.
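注意力加权聚合(ABMIL池化)的核心计算可示意如下:对每个patch嵌入打分、softmax归一化后做加权求和,得到切片级表示。其中的参数与维度均为假设,并非论文实现:

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def abmil_pool(H, w, V):
    """Attention-based MIL pooling: score each patch embedding, normalize
    scores with softmax, return the attention-weighted slide embedding.
    Shapes: H (n_patches, d), V (d, h), w (h,)."""
    scores = np.tanh(H @ V) @ w   # (n_patches,) attention logits
    a = softmax(scores)           # attention weights, sum to 1
    return a, a @ H               # weights and pooled slide-level embedding

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))       # six patch embeddings from a frozen encoder (toy)
V = rng.normal(size=(4, 3))
w = rng.normal(size=3)
a, slide_emb = abmil_pool(H, w, V)
```

切片级标签只需监督 `slide_emb` 上的分类头,注意力权重 `a` 则给出各patch对预测的贡献,这正是摘要中“减轻标注负担”的机制所在。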

[CV-32] GMD: Gaussian mixture descriptor for pair matching of 3D fragments

【速读】:该论文旨在解决激光扫描获取的碎片在自动重组过程中断裂面匹配的问题,这是三维物体重建中的关键步骤。解决方案的关键在于提出了一种基于高斯混合模型(Gaussian Mixture Model, GMM)的局部描述子——高斯混合描述子(Gaussian Mixture Descriptor, GMD),通过将局部表面划分为凹凸区域以估计GMM的成分数k值,并融合各区域的GMD形成最终描述子;同时利用L2距离度量GMD相似性,并结合随机抽样一致性(RANSAC)与迭代最近点(ICP)算法进行碎片对齐,从而实现高效、鲁棒的断裂面匹配与碎片重组。

链接: https://arxiv.org/abs/2604.21519
作者: Meijun Xiong,Zhenguo Shi,Xinyu Zhou,Yuhe Zhang,Shunli Zhang
机构: 西北大学(Northwest University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 10 figures. Published in Multimedia Systems

点击查看摘要

Abstract:In the automatic reassembly of fragments acquired using laser scanners to reconstruct objects, a crucial step is the matching of fractured surfaces. In this paper, we propose a novel local descriptor that uses the Gaussian Mixture Model (GMM) to fit the distribution of points, allowing for the description and matching of fractured surfaces of fragments. Our method involves dividing a local surface patch into concave and convex regions for estimating the k value of GMM. Then the final Gaussian Mixture Descriptor (GMD) of the fractured surface is formed by merging the regional GMDs. To measure the similarities between GMDs for determining adjacent fragments, we employ the L2 distance and align the fragments using Random Sample Consensus (RANSAC) and Iterative Closest Point (ICP). The extensive experiments on real-scanned public datasets and Terracotta datasets demonstrate the effectiveness of our approach; furthermore, the comparisons with several existing methods also validate the advantage of the proposed method.
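两个高斯混合描述子之间的L2距离在各向同性分量下存在闭式解,可示意如下。此处省略了论文中的GMM拟合与凹凸区域划分步骤,分量参数均为假设值:

```python
import numpy as np

def gauss_inner(m1, s1, m2, s2):
    # Closed-form L2 inner product of two isotropic Gaussians in R^d:
    # integral of N(x; m1, s1^2 I) * N(x; m2, s2^2 I) dx
    # = Gaussian density of (m1 - m2) with variance (s1^2 + s2^2) I.
    d = len(m1)
    var = s1 ** 2 + s2 ** 2
    diff2 = np.sum((np.asarray(m1) - np.asarray(m2)) ** 2)
    return np.exp(-diff2 / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def gmd_l2_distance(gmm_a, gmm_b):
    """Squared L2 distance between two Gaussian mixtures, each given as a
    list of (weight, mean, sigma) isotropic components -- a simplified
    stand-in for comparing the GMDs of two fractured surfaces."""
    def cross(p, q):
        return sum(wp * wq * gauss_inner(mp, sp, mq, sq)
                   for wp, mp, sp in p for wq, mq, sq in q)
    return cross(gmm_a, gmm_a) + cross(gmm_b, gmm_b) - 2 * cross(gmm_a, gmm_b)

a = [(0.5, [0.0, 0.0, 0.0], 1.0), (0.5, [2.0, 0.0, 0.0], 1.0)]
b = [(0.5, [0.0, 0.0, 0.1], 1.0), (0.5, [2.0, 0.1, 0.0], 1.0)]
d_close = gmd_l2_distance(a, b)   # nearly matching surfaces -> small distance
d_self = gmd_l2_distance(a, a)    # identical descriptors -> zero
```

实际匹配中,距离最小的断裂面对被判定为相邻碎片,随后再交由RANSAC与ICP完成对齐。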

[CV-33] VFM4SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

【速读】:该论文旨在解决单源域泛化目标检测(Single-Domain Generalized Object Detection, SDGOD)在面对真实世界中因天气、光照和成像条件变化导致的显著领域偏移(domain shift)时,检测性能严重下降的问题。现有方法主要依赖数据增强或域不变表示学习,但对检测器机制本身关注不足,难以应对复杂域变化。其核心问题在于:检测性能下降主要由漏检率上升引起,根源是检测器跨域稳定性降低——编码阶段物体-背景及实例间关系建模不稳定,解码阶段查询表示的语义-空间对齐困难。解决方案的关键在于提出一种双先验学习框架VFM⁴ SDG,引入冻结的视觉基础模型(Vision Foundation Model, VFM)作为可迁移的跨域稳定性先验,分别在编码阶段通过跨域稳定关系先验蒸馏(Cross-domain Stable Relational Prior Distillation)增强对象-背景与实例间关系的鲁棒性,在解码阶段通过语义-上下文先验驱动的查询增强(Semantic-Contextual Prior-based Query Enhancement)注入类别级语义原型与全局视觉上下文,从而提升查询在未见域中的语义识别与空间定位稳定性。

链接: https://arxiv.org/abs/2604.21502
作者: Yupeng Zhang,Ruize Han,Ningnan Guo,Wei Feng,Song Wang,Liang Wan
机构: Tianjin University (天津大学); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM⁴SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.

[CV-34] Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动驾驶(Autonomous Driving, AD)感知与预测任务中安全应用的关键挑战,即如何有效理解动态交通参与者的行为与静态道路基础设施的拓扑结构。解决方案的核心在于提出一个基于冻结LLMs的评估框架,通过引入交通编码器提取代理轨迹的空间特征、轻量级卷积神经网络(CNN)编码局部高精地图(HD Map),并利用重编程适配器将多模态特征转化为LLM兼容的token输入;在此基础上,由LLM作为推理引擎完成行为理解与轨迹预测,仅用线性解码器输出未来轨迹,从而实现对多模态信息(尤其是地图语义)影响轨迹预测精度的定量分析,并支持不同LLM架构的无缝集成与通用评估。

链接: https://arxiv.org/abs/2604.21479
作者: Yanjiao Liu,Jiawei Liu,Xun Gong,Zifei Nie
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. Since the prediction burden resides with the LLMs, a simple linear decoder is applied to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.

[CV-35] Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

【速读】:该论文旨在解决当前人脸伪造检测方法在跨数据集场景下泛化能力不足的问题,其核心挑战在于现有评估指标(如跨数据集AUC)无法有效揭示检测分数在不同数据域间显著偏移的现象,从而掩盖了模型实际鲁棒性缺陷。解决方案的关键在于提出新的评估指标Cross-AUC,通过对比一个数据集中真实样本与另一数据集假样本的得分分布来量化跨域分数可比性,并基于此发现主流检测器存在显著性能下降;同时,构建SFAM框架,包含patch级图像-文本对齐模块以增强CLIP对伪造痕迹的敏感性,以及面部区域混合专家(Mixture-of-Experts)模块实现区域感知的伪造分析,从而提升检测鲁棒性和准确性。

链接: https://arxiv.org/abs/2604.21478
作者: Yuhan Luo,Tao Chen,Decheng Liu
机构: School of Cyber Engineering, Xidian University, Xi’an 710071, Shaanxi, P. R. China (网络工程学院,西安电子科技大学,中国陕西省西安市 710071)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The source code is available at this https URL

点击查看摘要

Abstract:Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can’t achieve satisfactory performance because of poor generalization ability across datasets. The key factor that led to this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that can compute AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). It is interesting to find that evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. Besides, we also propose the novel framework Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP’s sensitivity to manipulation artifacts, and the facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on the public datasets prove that the proposed method achieves superior performance compared with the state-of-the-art methods with various suitable metrics.
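Cross-AUC的计算方式可示意如下:以一个数据集的真实样本对另一个数据集的伪造样本计算基于秩的AUC,双向各算一次后取平均。其中“双向取平均”以及“分数越高越假”的约定是本文假设,示例分数为虚构:

```python
import numpy as np

def auc(pos, neg):
    # Rank-based AUC (Mann-Whitney): P(score_pos > score_neg), ties count 0.5.
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def cross_auc(real_a, fake_a, real_b, fake_b):
    """Cross-AUC sketch: pair real samples of one dataset with fake samples
    of the other, in both directions, and average the two AUCs."""
    return 0.5 * (auc(fake_b, real_a) + auc(fake_a, real_b))

# a calibrated detector: fake scores exceed real scores across both datasets
c = cross_auc(real_a=[0.1, 0.2], fake_a=[0.8, 0.9],
              real_b=[0.15, 0.25], fake_b=[0.7, 0.95])
# dataset-B scores shifted upward: per-dataset AUC stays perfect,
# but cross-dataset score comparability degrades
s = cross_auc(real_a=[0.1, 0.2], fake_a=[0.8, 0.9],
              real_b=[0.75, 0.85], fake_b=[1.4, 1.55])
```

第二个例子说明了摘要的核心观察:普通跨数据集AUC对域间分数平移不敏感,而Cross-AUC会随之下降。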

[CV-36] ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的深度伪造(Deepfake)技术中面部交换(face swapping)对隐私和数字安全造成的严重威胁。现有主动防御方法主要依赖像素级扰动,但对能够提取鲁棒高阶身份嵌入(identity embeddings)的现代交换模型效果不佳。解决方案的关键在于提出 ID-Eraser,一种基于特征空间的主动防御机制:通过在身份嵌入中注入可学习扰动,并利用面部恢复生成器(Face Revive Generator, FRG)重建自然外观的保护图像,使人类视觉感知仍保持真实,同时使 Deepfake 模型无法识别原始身份信息。该方法在黑盒设置下显著降低身份识别准确率(Top-1 Accuracy 降至 0.30),并实现优异的图像质量(FID=1.64,LPIPS=0.020),且具备跨数据集泛化能力与对常见失真的鲁棒性。

链接: https://arxiv.org/abs/2604.21465
作者: Junyan Luo,Peipeng Yu,Jianwei Fei,Shiya Zeng,Xiaoyu Zhou,Zhihua Xia,Xiang Liu
机构: Jinan University(暨南大学); University of Florence(佛罗伦萨大学); Dongguan University of Technology(东莞理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfake technologies have rapidly advanced with modern generative AI, and face swapping in particular poses serious threats to privacy and digital security. Existing proactive defenses mostly rely on pixel-level perturbations, which are ineffective against contemporary swapping models that extract robust high-level identity embeddings. We propose ID-Eraser, a feature-space proactive defense that removes identifiable facial information to prevent malicious face swapping. By injecting learnable perturbations into identity embeddings and reconstructing natural-looking protection images through a Face Revive Generator (FRG), ID-Eraser produces visually realistic results for humans while rendering the protected identities unusable for Deepfake models. Experiments show that ID-Eraser substantially disrupts identity recognition across diverse face recognition and swapping systems under strict black-box settings, achieving the lowest Top-1 accuracy (0.30) with the best FID (1.64) and LPIPS (0.020). Compared with swaps generated from clean inputs, the identity similarity of protected swaps drops sharply to an average of 0.504 across five representative face swapping models. ID-Eraser further demonstrates strong cross-dataset generalization, robustness to common distortions, and practical effectiveness on commercial APIs, reducing Tencent API similarity from 0.76 to 0.36.

[CV-37] Instance-level Visual Active Tracking with Occlusion-Aware Planning CVPR2026

【速读】:该论文旨在解决视觉主动跟踪(Visual Active Tracking, VAT)在实际部署中面临的两大瓶颈问题:一是因实例级区分能力不足导致的视觉相似干扰物混淆,二是由于缺乏主动规划机制在遮挡情况下易失效的问题。解决方案的关键在于提出一个统一的OA-VAT框架,包含三个互补模块:1)基于DINOv3的无训练实例感知离线原型初始化模块,通过多视角增强特征聚合构建具有区分度的实例原型以缓解干扰物混淆;2)在线原型增强跟踪模块,结合置信度感知的卡尔曼滤波器实现外观与运动变化下的稳定跟踪;3)基于新构建Planning-20k数据集训练的遮挡感知轨迹规划模块,利用条件扩散模型生成避障路径以恢复遮挡状态下的目标跟踪能力。该方案实现了高精度、鲁棒且实时的三维空间目标跟踪性能。

链接: https://arxiv.org/abs/2604.21453
作者: Haowei Sun,Kai Zhou,Hao Gao,Shiteng Zhang,Jinwu Hu,Xutao Wen,Qixiang Ye,Mingkui Tan
机构: South China University of Technology (华南理工大学); Pazhou Laboratory (琶洲实验室); Key Laboratory of Big Data and Intelligent Robot, Ministry of Education (教育部大数据与智能机器人重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Poster

点击查看摘要

Abstract:Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
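置信度感知卡尔曼滤波的基本思想是:检测置信度低时放大观测噪声,使滤波器更多依赖运动先验。以下为一维匀速模型的极简草图,仅作说明,并非OA-VAT的实现:

```python
import numpy as np

def kalman_step(x, P, z, conf, q=1e-3, r_base=0.05):
    """One confidence-aware Kalman update for a 1-D constant-velocity state
    x = [position, velocity]. Low detector confidence inflates measurement
    noise R, so the update leans on the motion prior instead."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])       # constant-velocity transition
    H = np.array([[1.0, 0.0]])                   # we observe position only
    Q = q * np.eye(2)
    R = np.array([[r_base / max(conf, 1e-3)]])   # confidence-scaled noise
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    y = z - (H @ x)                              # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x0, P0 = np.array([0.0, 1.0]), np.eye(2)
x_hi, _ = kalman_step(x0, P0, z=np.array([1.5]), conf=0.99)  # trusted detection
x_lo, _ = kalman_step(x0, P0, z=np.array([1.5]), conf=0.05)  # doubtful detection
```

高置信度时状态被拉向观测值,低置信度时则停留在运动先验附近,这正是外观变化或遮挡下维持稳定跟踪的机制。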

[CV-38] VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution ICLR2026

【Quick Read】: This paper targets two key problems when applying visual autoregressive models (VAR) to real-world image super-resolution (Real-ISR): the causal attention constraint prevents the model from fully exploiting the global context of the low-quality (LQ) image, producing blurry and inconsistent high-quality (HQ) outputs; and error accumulation during iterative prediction severely degrades output consistency and quality. The key to the solution is VARestorer, a simple yet effective distillation framework that turns a pre-trained text-to-image VAR model into a one-step super-resolution model via distribution matching, eliminating error propagation across iterations and greatly improving inference efficiency. It further introduces pyramid image conditioning with cross-scale attention to enable bidirectional interactions across scales, so the model fully exploits the input image while adapting to the autoregressive structure and preventing later LQ tokens from being overlooked. Fine-tuning only 1.2% of the parameters preserves the expressive power of the original model while substantially improving performance and efficiency.

Link: https://arxiv.org/abs/2604.21450
Authors: Yixuan Zhu, Shilin Ma, Haolin Wang, Ao Li, Yanzhe Jing, Yansong Tang, Lei Chen, Jiwen Lu, Jie Zhou
Affiliations: Tsinghua University (清华大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in ICLR 2026. Code is available at this https URL

Click to view abstract

Abstract:Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.

[CV-39] 2L-LSH: A Locality-Sensitive Hash Function-Based Method For Rapid Point Cloud Indexing

【Quick Read】: This paper addresses fast neighboring-point search in large-scale 3D point cloud models, a core challenge in point cloud processing that underlies model reconstruction, classification, retrieval, and feature visualization. The key to the solution is the 2L-LSH (Two-Level Locality-Sensitive Hashing) algorithm, whose core idea is a two-level hash-function strategy: the first level partitions the bounding box of the point cloud model, and the second level builds a generalized-table-based data structure, enabling efficient and accurate neighbor search in high-dimensional space. Experiments show it clearly outperforms the classical Kd-tree and Octree: kNN search time is 51.111% and 94.159% lower than Kd-tree and Octree respectively, and radius-neighbor (RN) search time is also substantially lower.

Link: https://arxiv.org/abs/2604.21442
Authors: Shurui Wang, Yuhe Zhang, Ruizhe Guo, Yaning Zhang, Yifei Xie, Xinyu Zhou
Affiliations: School of Information Science and Technology, Northwest University, Xi'an, PR China (西北大学信息科学与技术学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 13 figures. Published in The Computer Journal

Click to view abstract

Abstract:The development of 3D scanning technology has enabled the acquisition of massive point cloud models with diverse structures and large scales, thereby presenting significant challenges in point cloud processing. Fast neighboring points search is one of the most common problems, which is frequently used in model reconstruction, classification, retrieval and feature visualization. Hash function is well known for its high-speed and accurate performance in searching high-dimensional data, which is also the core of the proposed 2L-LSH. Specifically, the 2L-LSH algorithm adopts a two-step hash function strategy, in which the first step divides the bounding box of the point cloud model and the second step constructs a generalized table-based data structure. The proposed 2L-LSH offers a highly efficient and accurate solution for fast neighboring points search in large-scale 3D point cloud models, making it a promising technique for various applications in the field. The proposed algorithm is compared with the well-known methods including Kd-tree and Octree; the obtained results demonstrated that the proposed method outperforms Kd-tree and Octree in terms of speed, i.e. the time consumption of kNN search can be 51.111% and 94.159% lower than Kd-tree and Octree, respectively. The RN search time can be 54.519% and 41.840% lower than Kd-tree and Octree, respectively.
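As a rough illustration of the general idea behind the two-level scheme (a first level that partitions the bounding box into cells, and a second level that stores per-cell point lists for lookup), here is a minimal uniform-grid radius-neighbor search. The generalized-table structure and the locality-sensitive hash functions of the actual 2L-LSH are not reproduced here; this is only a hedged sketch of grid-hash neighbor search.

```python
import numpy as np
from collections import defaultdict

def build_grid(points, cell):
    """Level 1 (simplified): hash each point into a uniform grid cell
    of the bounding box, keyed by its integer cell coordinates."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple((p // cell).astype(int))].append(i)
    return grid

def radius_neighbors(points, grid, cell, q, r):
    """Level 2 (simplified): inspect only the cells overlapping the
    query ball, then verify exact distances."""
    lo = ((q - r) // cell).astype(int)
    hi = ((q + r) // cell).astype(int)
    out = []
    for cx in range(lo[0], hi[0] + 1):
        for cy in range(lo[1], hi[1] + 1):
            for cz in range(lo[2], hi[2] + 1):
                for i in grid.get((cx, cy, cz), ()):
                    if np.linalg.norm(points[i] - q) <= r:
                        out.append(i)
    return out

rng = np.random.default_rng(0)
pts = rng.uniform(0, 10, size=(2000, 3))
grid = build_grid(pts, cell=1.0)
q = np.array([5.0, 5.0, 5.0])
idx = radius_neighbors(pts, grid, 1.0, q, 1.0)
# brute-force check of the same radius query
brute = [i for i in range(len(pts)) if np.linalg.norm(pts[i] - q) <= 1.0]
```

The speedup over brute force comes from touching only the handful of cells that intersect the query ball instead of all points.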

[CV-40] UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

【Quick Read】: This paper addresses the tension between computational resource limits and context loss in small-object detection for ultra-high-resolution (UHR) remote sensing imagery: conventional image downsampling or patch cropping relieves memory pressure but either erases small-object detail or destroys scene integrity. The key to the solution is the UHR-DETR architecture with two core components: a Coverage-Maximizing Sparse Encoder that dynamically allocates finite compute to the informative regions of the high-resolution input, achieving maximum object coverage with minimal spatial redundancy; and a Global-Local Decoupled Decoder that fuses macroscopic scene semantics with microscopic object features to resolve semantic ambiguity and prevent scene fragmentation. The method delivers notably better detection accuracy and inference efficiency under strict hardware constraints.

Link: https://arxiv.org/abs/2604.21435
Authors: Jingfang Li, Haoran Zhu, Wen Yang, Jinrui Zhang, Fang Xu, Haijian Zhang, Gui-Song Xia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on the UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8% mAP improvement while delivering a 10 \times inference speedup compared to standard sliding-window baselines on the STAR dataset. Our codes and models will be available at this https URL.

[CV-41] Pre-process for segmentation task with nonlinear diffusion filters

【Quick Read】: This paper studies how to produce piecewise constant images as a pre-processing step for segmentation, improving the accuracy and efficiency of downstream segmentation algorithms. The central difficulty is homogenizing grey intensity within regions while keeping edges sharp, avoiding the edge blurring and noise sensitivity of traditional methods. The key to the solution is a new family of diffusivities for nonlinear diffusion filtering, derived from nonlinear diffusion techniques and related to backward diffusion: with semi-implicit numerical schemes, a forward nonlinear diffusion equation is solved while the semi-discrete and fully discrete scale-space well-posedness requirements hold, splitting the image into closed contours with homogenized interior intensity and no blurred edges. Combined with a stopping criterion on the diffusion time, piecewise constant images are obtained at low computational cost.

Link: https://arxiv.org/abs/2604.21422
Authors: Javier Sanguino, Carlos Platero, Olga Velasco
Affiliations: Health Science Technology Group, Technical University of Madrid (健康科学与技术集团,马德里理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript from 2017, previously unpublished, 37 pages

Click to view abstract

Abstract:This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a previous process for segmentation techniques. We first show an intrinsic formulation for the nonlinear diffusion equation to provide some design conditions on the diffusion filters. According to this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related with backward diffusion. Their goal is to split the image in closed contours with a homogenized grey intensity inside and with no blurred edges. We also prove that our filters satisfy the well-posedness semi-discrete and full discrete scale-space requirements. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we get piecewise constant images with a low computational effort. Finally, we test our filter with real images and we illustrate the effects of our diffusivity function as a method to get piecewise constant images. The code is available at this https URL. 
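To make the semi-implicit scheme concrete, here is a minimal 1-D sketch using the classical Perona-Malik diffusivity as a stand-in for the paper's new diffusivity family (the diffusivity choice and all parameter values are illustrative assumptions, not the authors' design):

```python
import numpy as np

def perona_malik_semi_implicit(u, tau=2.0, lam=0.1, steps=10):
    """1-D semi-implicit nonlinear diffusion.

    Each step solves (I - tau * A(u_k)) u_{k+1} = u_k, where A is the
    tridiagonal diffusion matrix with edge diffusivity
    g(s) = 1 / (1 + (s/lam)^2) evaluated at the current iterate.
    The implicit solve is unconditionally stable, so large tau is allowed.
    """
    n = len(u)
    for _ in range(steps):
        grad = np.abs(np.diff(u))            # |u_{i+1} - u_i| on each edge
        g = 1.0 / (1.0 + (grad / lam) ** 2)  # small g across strong edges
        A = np.zeros((n, n))
        for i in range(n - 1):               # assemble div(g grad u), no-flux ends
            A[i, i] -= g[i]; A[i, i + 1] += g[i]
            A[i + 1, i + 1] -= g[i]; A[i + 1, i] += g[i]
        u = np.linalg.solve(np.eye(n) - tau * A, u)
    return u

# a noisy step edge: diffusion homogenizes each side while the jump survives
x = np.concatenate([np.zeros(20), np.ones(20)]) + 0.02 * np.sin(np.arange(40))
y = perona_malik_semi_implicit(x)
```

The interior of each region flattens toward a constant (g near 1 there), while the low diffusivity across the step keeps the two plateaus separated, which is the piecewise-constant behavior the pre-processing aims for.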

[CV-42] S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

【Quick Read】: This paper addresses the performance bottleneck of multimodal reasoning models on complex scientific tasks such as high-resolution chart interpretation, microscopic image understanding, and geometry-assisted reasoning, where the lack of effective image-manipulation capability limits existing methods: they rely on text-only chain-of-thought, underuse visual information, and often perform redundant or erroneous image operations. The key to the solution is S1-VL, which combines two complementary reasoning paradigms: structured Scientific Reasoning, and a Thinking-with-Images mode that lets the model generate and execute Python code in a sandbox to actively manipulate images, obtain intermediate visual results, and reason iteratively. To improve training data quality, the authors design a six-dimensional quality-filtering framework with a multi-stage filtering pipeline, plus an adaptive data-routing strategy that converts samples with low visual information gain into text-only reasoning data, teaching the model to judge when image operations are actually needed. The model achieves markedly better results across scientific reasoning benchmarks and state-of-the-art performance on tasks that require image interaction.

Link: https://arxiv.org/abs/2604.21409
Authors: Qingxiao Li, Lifeng Xu, QingLi Wang, Yudong Bai, Mingwei Ou, Shu Hu, Nan Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 13 figures

Click to view abstract

Abstract:We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.

[CV-43] You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

【Quick Read】: This paper tackles three core obstacles blocking 3D Gaussian Splatting (3DGS) from moving from academic research to industrial deployment: unpredictable resource consumption caused by heuristic Gaussian growth; existing benchmarks whose "sparsity shield" rewards visual hallucination over physical fidelity; and severe multi-sensor data pollution that undermines reconstruction robustness. The key to the solution is YOGO (You Only Gaussian Once), a system-level framework whose core innovations are: reformulating the stochastic growth process as a deterministic, budget-aware equilibrium; introducing a hardware-constrained budget controller for controllable resource allocation; and designing an availability-registration protocol for robust multi-sensor fusion. In addition, Immersion v1.0, the first ultra-dense indoor dataset, breaks the sparsity shield and forces algorithms to focus on extreme physical fidelity, enabling research into the upper limits of high-fidelity reconstruction. Experiments show YOGO achieves state-of-the-art visual quality while remaining strictly deterministic, setting a new standard for production-grade 3DGS.

Link: https://arxiv.org/abs/2604.21400
Authors: Jinrang Jia, Zhenjia Li, Yifeng Shi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 figures

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has revolutionized neural rendering, yet existing methods remain predominantly research prototypes ill-suited for production-level deployment. We identify a critical “Industry-Academia Gap” hindering real-world application: unpredictable resource consumption from heuristic Gaussian growth, the “sparsity shield” of current benchmarks that rewards hallucination over physical fidelity, and severe multi-sensor data pollution. To bridge this gap, we propose YOGO (You Only Gaussian Once), a system-level framework that reformulates the stochastic growth process into a deterministic, budget-aware equilibrium. YOGO integrates a novel budget controller for hardware-constrained resource allocation and an availability-registration protocol for robust multi-sensor fusion. To push the boundaries of reconstruction fidelity, we introduce Immersion v1.0, the first ultra-dense indoor dataset specifically designed to break the “sparsity shield.” By providing saturated viewpoint coverage, Immersion v1.0 forces algorithms to focus on extreme physical fidelity rather than viewpoint interpolation, and enables the community to focus on the upper limits of high-fidelity reconstruction. Extensive experiments demonstrate that YOGO achieves state-of-the-art visual quality while maintaining a strictly deterministic profile, establishing a new standard for production-grade 3DGS. To facilitate reproducibility, part of the scenes of the Immersion v1.0 dataset and the source code of YOGO have been publicly released. The project link is this https URL.

[CV-44] VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought LREC2026

【Quick Read】: This paper addresses the lack of precise local-region grounding in the reasoning of large vision-language models (LVLMs), together with the limitations of existing datasets, whose heavy manual annotation cost and missing explicit alignment between multi-step reasoning and image regions constrain the evaluation of model trustworthiness. The key to the solution is the Visual Grounding Chain-of-Thought (VG-CoT) dataset-construction framework, a fully automated three-stage pipeline: state-of-the-art object detection and OCR models first extract object- and text-level visual evidence from images; GPT-4o then generates step-by-step grounded reasoning; finally, a rationale-driven open-set detection process refines grounding accuracy. Each reasoning step is thus explicitly linked to real image regions, effectively improving the interpretability and trustworthy reasoning of LVLMs.

Link: https://arxiv.org/abs/2604.21396
Authors: Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to LREC 2026

Click to view abstract

Abstract:The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model’s logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.

[CV-45] Supervised Learning Has a Necessary Geometric Blind Spot: Theory Consequences and Minimal Repair

【Quick Read】: This paper investigates the geometric root of supervised models' sensitivity to nuisances: along directions that are label-correlated in training data but nuisance at test time, the encoder retains non-zero Jacobian sensitivity, constituting a "geometric blind spot" of supervised learning. This is not an incidental flaw of current methods but a mathematical consequence of the empirical risk minimisation (ERM) objective itself. The key to the solution is the Trajectory Deviation Index (TDI), a diagnostic that directly measures the quantity bounded by the blind-spot theorem, together with an additional Gaussian-form regulariser, proven (Proposition 5) to be the unique perturbation law that uniformly penalises the encoder Jacobian, which repairs the geometric blind spot and improves robustness and accuracy together.

Link: https://arxiv.org/abs/2604.21395
Authors: Vishal Rajput
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages. Code: this https URL. Preprint, not peer-reviewed. Affiliation: KU Leuven, Belgium

Click to view abstract

Abstract:We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises supervised loss must retain non-zero Jacobian sensitivity in directions that are label-correlated in training data but nuisance at test time. This is not a contingent failure of current methods; it is a mathematical consequence of the supervised objective itself. We call this the geometric blind spot of supervised learning (Theorem 1), and show it holds across proper scoring rules, architectures, and dataset sizes. This single theorem unifies four lines of prior empirical work that were previously treated separately: non-robust predictive features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. In this framing, adversarial vulnerability is one consequence of a broader structural fact about supervised learning geometry. We introduce Trajectory Deviation Index (TDI), a diagnostic that measures the theorem’s bounded quantity directly, and show why common alternatives miss the key failure mode. PGD adversarial training reaches Jacobian Frobenius 2.91 yet has the worst clean-input geometry (TDI 1.336), while PMH achieves TDI 0.904. TDI is the only metric that detects this dissociation because it measures isotropic path-length distortion – the exact quantity Theorem 1 bounds. Across seven vision tasks, BERT/SST-2, and ImageNet ViT-B/16 backbones used by CLIP, DINO, and SAM, the blind spot is measurable and repairable. It is present at foundation-model scale, worsens monotonically across language-model sizes (blind-spot ratio 0.860 to 0.765 to 0.742 from 66M to 340M), and is amplified by task-specific ERM fine-tuning (+54%), while PMH repairs it by 11x with one additional training term whose Gaussian form Proposition 5 proves is the unique perturbation law that uniformly penalises the encoder Jacobian.

[CV-46] EdgeFormer: local patch-based edge detection transformer on point clouds

【Quick Read】: This paper addresses the difficulty of detecting fine-grained edge features in 3D point clouds, which are often missed because they are densely distributed or exhibit small local surface gradients. The key to the solution is converting edge detection over the entire point cloud into per-point classification based on local neighborhood patches: patch feature descriptors describing the local structure around each point are first constructed, and each point is then classified by analyzing these local descriptors. This conversion lets the model extract subtle geometric details more effectively, and experiments show competitive performance against six baselines.

Link: https://arxiv.org/abs/2604.21387
Authors: Yifei Xie, Zhikun Tu, Tong Yang, Yuhe Zhang, Xinyu Zhou
Affiliations: Northwest University (西北大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 9 figures. Published in Pattern Analysis and Applications

Click to view abstract

Abstract:Edge points on 3D point clouds can clearly convey 3D geometry and surface characteristics, therefore, edge detection is widely used in many vision applications with high industrial and commercial demands. However, the fine-grained edge features are difficult to detect effectively as they are generally densely distributed or exhibit small-scale surface gradients. To address this issue, we present a learning-based edge detection network, named EdgeFormer, which mainly consists of two stages. Based on the observation that spatially neighboring points tend to exhibit high correlation, forming the local underlying surface, we convert the edge detection of the entire point cloud into a point classification based on local patches. Therefore, in the first stage, we construct local patch feature descriptors that describe the local neighborhood around each point. In the second stage, we classify each point by analyzing the local patch feature descriptors generated in the first stage. Due to the conversion of the point cloud into local patches, the proposed method can effectively extract the finer details. The experimental results show that our model demonstrates competitive performance compared to six baselines.

[CV-47] KD-CVG: A Knowledge-Driven Approach for Creative Video Generation ICASSP2026

【Quick Read】: This paper addresses two core challenges in creative advertising video generation (Creative Video Generation, CVG): ambiguous semantic alignment, where models struggle to associate product selling points with creative video content; and inadequate motion adaptability, which yields unnatural or distorted motion in generated videos. The key to the solution is building an Advertising Creative Knowledge Base (ACKB) and a knowledge-driven approach (KD-CVG) with two core modules: Semantic-Aware Retrieval (SAR), which uses graph attention networks and reinforcement-learning feedback to strengthen the model's understanding of the semantic link between selling points and videos; and Multimodal Knowledge Reference (MKR), which injects semantic and motion priors to fill the knowledge gaps of existing text-to-video (T2V) models, markedly improving semantic consistency and motion plausibility.

Link: https://arxiv.org/abs/2604.21362
Authors: Linkai Liu, Wei Feng, Xi Zhao, Shen Zhang, Xingye Chen, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Yuchen Zhou, Zipeng Guo, Chao Gou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICASSP 2026

Click to view abstract

Abstract:Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) \textbfambiguous semantic alignment, where models struggle to accurately correlate product selling points with creative video content, and (b) \textbfinadequate motion adaptability, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model’s comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG’s superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at this https URL.

[CV-48] Prototype-Based Test-Time Adaptation of Vision-Language Models

【Quick Read】: This paper addresses two core problems of existing cache-based test-time adaptation (TTA) methods at scale: inference latency grows markedly as the cache expands with the number of classes, and performance degrades when the cache holds insufficient or incorrect samples. The key to the solution is a Prototype-Based Test-Time Adaptation (PTA) paradigm that builds class-specific knowledge prototypes to dynamically accumulate and exploit information from test samples, where each sample's visual feature is adaptively merged according to its zero-shot classification confidence. Knowledge is stored and updated only at the prototype level, entirely avoiding cache population and retrieval overhead, so PTA keeps high inference efficiency while delivering excellent cross-domain generalization, reaching state-of-the-art results on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks.

Link: https://arxiv.org/abs/2604.21360
Authors: Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample’s visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP’s accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP’s inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP’s inference speed.
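The confidence-weighted prototype accumulation can be sketched as a running weighted mean per class. This is an illustrative reconstruction only: the gating threshold, the exact weighting rule, and the nearest-prototype classifier below are assumptions, not the paper's implementation.

```python
import numpy as np

def update_prototype(prototypes, counts, feat, probs, tau=0.5):
    """Fold a test sample's visual feature into the prototype of its
    zero-shot predicted class, weighted by prediction confidence.
    (Illustrative running-mean rule; the paper's exact update may differ.)"""
    c = int(np.argmax(probs))
    w = float(probs[c])                 # zero-shot confidence as the weight
    if w < tau:                         # skip unconfident samples (assumed gate)
        return prototypes, counts
    counts[c] += w
    prototypes[c] += (w / counts[c]) * (feat - prototypes[c])  # weighted mean
    return prototypes, counts

def classify(prototypes, feat):
    """Nearest-prototype prediction via cosine similarity."""
    sims = prototypes @ feat / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(feat) + 1e-8)
    return int(np.argmax(sims))

# synthetic 2-class stream with mocked zero-shot probabilities
rng = np.random.default_rng(1)
protos, counts = np.zeros((2, 4)), np.zeros(2)
centers = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
for _ in range(50):
    c = int(rng.integers(2))
    f = centers[c] + 0.1 * rng.normal(size=4)
    p = np.array([0.1, 0.1]); p[c] = 0.9
    protos, counts = update_prototype(protos, counts, f, p)
```

Because updates touch only one prototype vector per sample, the memory and per-sample cost stay constant in the stream length, which is the efficiency argument over cache lookup.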

[CV-49] SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes

【Quick Read】: This paper addresses two key bottlenecks limiting the cross-scene generalization of current deep-learning-based ground filtering (GF): a context-detail dilemma in large-scale processing caused by limited computational resources, and random misclassification of tall objects arising from classification-only optimization. Traditional pipelines optimize each piece in isolation and miss these trade-offs. The key to the solution is the SparseGF framework with three innovations: (1) a convex-mirror-inspired context compression module that condenses expansive context into a compact representation while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets the compressed features and mitigates compression-induced geometric distortion; and (3) a height-aware loss that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. The method markedly improves GF robustness and accuracy across complex urban, mixed, and steep forested scenes, offering a path toward truly cross-scene generalization.

Link: https://arxiv.org/abs/2604.21356
Authors: Nannan Qin, Pengjie Tao, Haiyan Guan, Zhizhong Kang, Lingfei Ma, Xiangyun Hu, Jonathan Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filtering (GF) to separate point clouds across diverse landscapes into ground and non-ground parts. Although current deep-learning-based GF methods have demonstrated impressive performance, especially in specific challenging terrains, their cross-scene generalization remains limited by two persistent issues: the context-detail dilemma in large-scale processing due to limited computational resources, and the random misclassification of tall objects arising from classification-only optimization. To overcome these limitations, we propose SparseGF, a height-aware sparse segmentation framework enhanced with context compression. It is built upon three key innovations: (1) a convex-mirror-inspired context compression module that condenses expansive contexts into compact representations while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets compressed representations while mitigating compression-induced geometric distortion; and (3) a height-aware loss function that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. Extensive evaluations on two large-scale ALS benchmark datasets demonstrate that SparseGF delivers robust GF across urban to natural terrains, achieving leading performance in complex urban scenes, competitive results on mixed terrains, and moderate yet non-catastrophic accuracy in densely forested steep areas. This work offers new insights into deep-learning-based GF research and encourages further exploration toward truly cross-scene generalization for large-scale environmental monitoring.

[CV-50] Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning

【Quick Read】: This paper addresses the degradation of self-supervised learning (SSL) representations for aerial imagery under haze, motion blur, rain streaks, and occlusion. Existing methods train by enforcing invariance between augmented views, but under severe degradation, forcing alignment between a clean and a corrupted view introduces spurious structure that harms the semantic consistency of the latent space. The key to the solution is a per-sample, per-factor trust weight folded into the base contrastive loss as an additive residual, optimized with a stop-gradient rather than a multiplicative gate. Experiments show this design improves rather than impairs the backbone: it achieves the highest mean linear-probe accuracy (90.20%) across EuroSAT, AID, and NWPU-RESISC45, and clearly outperforms SimCLR under extreme information-erasing corruptions (e.g., +19.9 points under haze). The approach offers a concrete design principle for uncertainty-aware SSL.

Link: https://arxiv.org/abs/2604.21349
Authors: Wadii Boulila, Adel Ammar, Bilel Benjdira, Maha Driss
Affiliations: Prince Sultan University (王子苏丹大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages

Click to view abstract

Abstract:Self-supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per-sample, per-factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop-gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive-residual approach improves it. Using a 200-epoch protocol on a 210,000-image corpus, the method achieves the highest mean linear-probe accuracy among six backbones on EuroSAT, AID, and NWPU-RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information-erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero-shot cross-domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive-residual formulation is the primary source of these improvements. An evidential variant using Dempster-Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty-aware SSL. Code is publicly available at this https URL.

[CV-51] Latent Denoising Improves Visual Alignment in Large Multimodal Models

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在训练过程中对视觉 token 的监督不足问题,这导致其内部视觉表征能力较弱,并在分布偏移下表现出鲁棒性差的缺陷。解决方案的关键在于引入一种基于潜在空间去噪(latent denoising)的框架,通过在投影后的视觉 token 上施加感知显著性引导的混合掩码与高斯噪声,使 LMM 在选定的中间语言模型层中学习从隐藏状态恢复干净的教师 patch 特征。为防止表示坍缩,该方法还保留了教师模型内的图像内相似性结构,并采用图像内对比 patch 蒸馏策略。该方案在不增加推理开销的前提下,显著提升了模型的视觉理解、推理能力及组合鲁棒性表现。

链接: https://arxiv.org/abs/2604.21343
作者: Dhruv Parikh,Jacob Fein-Ashley,Rajgopal Kannan,Viktor Prasanna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher’s intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at this https URL.
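文中“显著性引导的掩码与高斯噪声混合扰动”可做如下最小示意(按显著性排序、取显著性最高的一部分 patch 置零、其余加高斯噪声的规则是假设的简化,并非论文的精确实现):

```python
import random

def corrupt_tokens(tokens, saliency, mask_ratio=0.5, noise_std=0.1, seed=0):
    """按显著性对视觉 token 施加“掩码 + 高斯噪声”的混合扰动。"""
    rng = random.Random(seed)
    n = len(tokens)
    # 显著性最高的前 mask_ratio 比例的 patch 被掩码
    order = sorted(range(n), key=lambda i: -saliency[i])
    masked = set(order[: int(n * mask_ratio)])
    out = []
    for i, tok in enumerate(tokens):
        if i in masked:
            out.append([0.0] * len(tok))  # 掩码:置零
        else:
            out.append([x + rng.gauss(0.0, noise_std) for x in tok])  # 加噪
    return out, masked

tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
saliency = [0.1, 0.9, 0.2, 0.8]
corrupted, masked = corrupt_tokens(tokens, saliency)
```

训练时模型需从中间层隐藏状态还原这些被扰动 patch 对应的教师特征;推理时扰动整体关闭。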

[CV-52] Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture of Experts, MoE)模型在训练过程中因路由(router)学习不稳定而导致的优化难题,尤其是由于梯度阻塞(gradient blocking)问题,使得路由器仅能从前向传播中激活的专家获得梯度信息,从而难以学习合理的专家选择策略,进而引发路由动态波动、训练不稳定等问题。解决方案的关键在于提出TGR-MoE(Teacher-Guided Routing for Sparse Vision Mixture-of-Experts),通过利用预训练稠密教师模型(pretrained dense teacher model)的中间表示构建教师路由(teacher router),并将教师路由输出作为伪监督信号指导学生路由器的学习,从而在训练初期即实现知识引导的专家选择,并显著抑制路由频繁波动,提升路由一致性与模型性能。

链接: https://arxiv.org/abs/2604.21330
作者: Masahiro Kada,Ryota Yoshihashi,Satoshi Ikehata,Rei Kawakami,Ikuro Sato
机构: Institute of Science Tokyo (东京科学大学); DENSO IT Laboratory (电装IT实验室); National Institute of Informatics (国立情报学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher’s intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
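TGR-MoE 以教师路由分布作为学生路由器的伪监督,这类监督通常可以写成两个路由分布之间的 KL 散度损失(是否采用 KL、logits 的数值均为示意性假设,仅说明“教师分布指导学生路由”的形式):

```python
import math

def softmax(xs):
    # 数值稳定的 softmax
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def routing_kl(student_logits, teacher_logits):
    """KL(teacher || student):教师路由分布作为学生路由的伪监督。"""
    p = softmax(teacher_logits)   # 教师路由分布(来自稠密教师的中间表示)
    q = softmax(student_logits)   # 学生路由分布
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

kl_same = routing_kl([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])  # 分布一致时 KL 为 0
kl_diff = routing_kl([0.0, 0.0, 0.0], [1.0, 2.0, 0.5])  # 分布不同则为正
```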

[CV-53] MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

【速读】:该论文旨在解决通用多模态检索(Universal Multimodal Retrieval, UMR)中两类主流方法存在的关键问题:早期融合方法(如Marvel)易出现视觉模态坍塌(visual modality collapse),即模型过度依赖文本线索而忽略视觉特征;晚期融合方法(如UniVL-DR)则面临语义错位(semantic misalignment)问题,即语义相关的内容在嵌入空间中距离较远。为应对上述挑战,论文提出MiMIC框架,其核心创新在于:(1) 提出解码器内融合架构(fusion-in-decoder architecture),实现更有效的多模态信息整合;(2) 通过单模态混合训练(single modality mixin)和随机标题丢弃(random caption dropout)策略提升模型鲁棒性。实验表明,MiMIC在WebQA+和EVQA+数据集上显著优于早期与晚期融合基线方法。

链接: https://arxiv.org/abs/2604.21326
作者: Juan Li,Chuanghao Ding,Xujie Zhang,Cam-Tu Nguyen
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which projects visual features into the language model (LM) space for integrating with text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model’s tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where image in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.
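摘要提到的 random caption dropout 是一种很轻量的鲁棒训练技巧:以概率 p 将样本的 caption 置空,迫使模型利用视觉信息。下面是一个纯 Python 示意(批数据结构与 p 的取值均为假设):

```python
import random

def caption_dropout(batch, p=0.3, seed=0):
    """以概率 p 随机丢弃样本的 caption(置为 None)。"""
    rng = random.Random(seed)
    return [(img, None if rng.random() < p else cap) for img, cap in batch]

batch = [("img0", "a cat"), ("img1", "a dog"), ("img2", "a car")]
dropped = caption_dropout(batch, p=1.0)  # p=1 时所有 caption 均被丢弃
kept = caption_dropout(batch, p=0.0)     # p=0 时全部保留
```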

[CV-54] Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

【速读】:该论文旨在解决无监督视频可见光-红外行人重识别(Visible-Infrared Person Re-Identification, VI-ReID)问题,即在缺乏身份标签的情况下,利用RGB和红外模态的轨迹片段(tracklets)进行跨模态身份匹配。现有方法多依赖于有监督学习或图像级特征,难以适应真实场景中标注成本高、时序信息未被充分利用的问题。其解决方案的关键在于提出HiTPro(Hierarchical Temporal Prototyping)框架,通过三个核心机制实现:1)时间感知特征编码器提取帧级判别特征并聚合为鲁棒的轨迹级表示;2)基于相机内子轨迹聚类构建可靠的原型(prototype),避免显式硬伪标签分配;3)分层对比学习策略,在相机内区分、跨相机同模态一致性和跨模态不变性三个层级上逐步优化特征与原型间的对齐关系,结合动态阈值和软权重分配提升正样本挖掘效率。该方法在HITSZ-VCM和BUPTCampus数据集上实现了完全无监督设置下的最先进性能。

链接: https://arxiv.org/abs/2604.21324
作者: Zhiyong Li,Wei Jiang,Haojie Liu,Mingyu Wang,Wanchong Xu,Weijie Mao
机构: Zhejiang University (浙江大学); Zhejiang University of Water Resources and Electric Power (浙江水利水电学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process: progressing from within-modality associations to cross-modality matching, enhanced by Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishes a strong baseline for future research.
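摘要中的动态阈值与软权重分配可做如下纯 Python 示意:阈值随相似度分布自适应(此处取均值与固定下限的较大者,属示意性假设,非论文的精确规则),过阈值的跨模态原型按相似度归一化为软权重:

```python
def soft_positive_mining(sims, base_thr=0.5):
    """动态阈值 + 软权重的跨模态正样本挖掘示意。

    sims 为某轨迹与各跨模态原型的相似度列表。
    """
    thr = max(base_thr, sum(sims) / len(sims))  # 动态阈值:下限与均值取大
    pos = [(i, s) for i, s in enumerate(sims) if s >= thr]
    total = sum(s for _, s in pos)
    return {i: s / total for i, s in pos}       # 软权重归一化,和为 1

weights = soft_positive_mining([0.9, 0.2, 0.7, 0.1])
```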

[CV-55] FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment CVPR

【速读】:该论文旨在解决炸油劣化监测中缺乏非破坏性、实时且具备空间信息的检测方法的问题,传统湿化学分析法虽能提供化学指标但无法实现在线监控。其关键解决方案是提出FryNet——一种双流RGB-热成像框架,通过联合执行油区分割、可用性分类与四项氧化指数(过氧化值PV、酸价p-AV、总氧化值Totox及温度)回归,在单次前向传播中完成多任务学习。核心创新包括:基于通道与空间注意力机制的ThermalMiT-B2骨干网络提取热特征;RGB-MAE编码器利用掩码自编码和化学对齐学习化学相关表征;双编码器DANN通过梯度反转层对抗性正则化抑制视频身份偏差;FiLM融合模块将热结构与RGB化学上下文关联,从而有效规避“相机指纹”捷径问题,显著提升模型泛化能力与精度。

链接: https://arxiv.org/abs/2604.21321
作者: Khaled R Ahmed,Toqi Tahamid Sarker,Taminul Islam,Tamany M Alanezi,Amer AbuGhazaleh
机构: Southern Illinois University Carbondale (南伊利诺伊大学卡本代尔分校); Qassim University (卡西姆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, this paper has been submitted and accepted for publication at CVPRW 2026

点击查看摘要

Abstract:Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use. We identify a fundamental obstacle in thermal-image-based inspection, the camera-fingerprint shortcut, whereby models memorize sensor-specific noise and thermal bias instead of learning oxidation chemistry, collapsing under video-disjoint evaluation. We propose FryNet, a dual-stream RGB-thermal framework that jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (PV, p-AV, Totox, temperature) in a single forward pass. A ThermalMiT-B2 backbone with channel and spatial attention extracts thermal features, while an RGB-MAE Encoder learns chemically grounded representations via masked autoencoding and chemical alignment. Dual-Encoder DANN adversarially regularizes both streams against video identity via Gradient Reversal Layers, and FiLM fusion bridges thermal structure with RGB chemical context. On 7,226 paired frames across 28 frying videos, FryNet achieves 98.97% mIoU, 100% classification accuracy, and 2.32 mean regression MAE, outperforming all seven baselines.
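摘要中的 Dual-Encoder DANN 依赖梯度反转层(GRL)做对抗正则化。GRL 的核心行为可以脱离自动微分框架用几行代码示意(类名与接口为假设):前向恒等、反向把梯度乘以 −λ,从而让特征提取器朝“混淆视频身份判别器”的方向更新:

```python
class GradientReversal:
    """梯度反转层(GRL)的最小示意,手写 forward/backward。"""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # 前向:恒等映射

    def backward(self, grad):
        # 反向:梯度取反并按 lambda 缩放
        return [-self.lam * g for g in grad]

grl = GradientReversal(lam=0.5)
y = grl.forward([1.0, 2.0])
g = grl.backward([0.2, -0.4])
```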

[CV-56] PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring

【速读】:该论文旨在解决无人机(UAV)遥感监测中对海滩垃圾物理暴露面积量化不准确的问题,现有基于边界框(bounding-box)的检测方法因无法精确捕捉不规则垃圾形状而导致平面面积系统性高估,从而影响生态风险评估的可信度。解决方案的关键在于提出PLAS-Net(Pixel-level Litter Area Segmentor),一种像素级实例分割框架,能够提取海岸废弃物的精确物理足迹(physical footprint),显著提升掩膜保真度(mask fidelity),在泰国Koh Tao岛的季风驱动口袋海滩实测数据上达到mAP_50为58.7%,优于11种基线模型,从而为后续的碎片化动力学分析、污染热点识别及来源组成解析等环境应用提供更可靠的面积维度信息。

链接: https://arxiv.org/abs/2604.21313
作者: Yongying Liu,Jiaqi Wang,Jian Song,Xinlei Shao,Yijia Chen,Nan Xu,Katsunori Mizuno,Shigeru Tabeta,Fan Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 30 pages, 12 figures

点击查看摘要

Abstract:Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV-based monitoring predominantly relies on bounding-box detection, which systematically overestimates the planar area of irregular litter objects. To address this geometric limitation, we develop PLAS-Net (Pixel-level Litter Area Segmentor), an instance segmentation framework that extracts pixel-accurate physical footprints of coastal debris. Evaluated on UAV imagery from a monsoon-driven pocket beach in Koh Tao, Thailand, PLAS-Net achieves a mAP_50 of 58.7% with higher precision than eleven baseline models, demonstrating improved mask fidelity under complex coastal conditions. To illustrate how the accuracy of the masking affects the conclusions of environmental analysis, we conducted three downstream demonstrations: (i) power-law fitting of normalized plastic density (NPD) to characterize fragmentation dynamics; (ii) area-weighted ecological risk index (ERI) to map spatial pollution hotspots; and (iii) source composition analysis revealing the abundance-area paradox: fishing gear constitutes a small proportion of the total number of items, but has the largest physical area per unit item. Pixel-level area extraction can provide more valuable information for coastal monitoring compared to methods based solely on counting.
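摘要指出边界框会系统性高估不规则物体的平面面积,这一点可以用掩码面积与外接框面积之比直接量化(示例掩码为虚构数据):

```python
def bbox_overestimation(mask):
    """比较像素级掩码面积与外接框面积。mask 为 0/1 二维列表。"""
    ys = [i for i, row in enumerate(mask) for v in row if v]
    xs = [j for row in mask for j, v in enumerate(row) if v]
    mask_area = sum(sum(row) for row in mask)  # 像素级“物理足迹”
    bbox_area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return mask_area, bbox_area

# 一条对角线形状的“垃圾”:掩码面积 3,外接框面积 9,高估 3 倍
mask = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]]
mask_area, bbox_area = bbox_overestimation(mask)
```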

[CV-57] The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

【速读】:该论文旨在解决遥感红外图像超分辨率(Infrared Image Super-Resolution, IRSR)问题,即从通过双三次下采样(bicubic downsampling)生成的低分辨率(Low-Resolution, LR)输入中恢复高分辨率(High-Resolution, HR)红外图像。其核心挑战在于如何在保持红外图像特有热辐射特征的同时提升空间细节,以满足遥感场景下的实际应用需求。解决方案的关键在于设计能够有效建模红外图像先验信息并适应真实遥感数据分布的深度学习模型,从而实现优于现有方法的重建质量与泛化能力。

链接: https://arxiv.org/abs/2604.21312
作者: Kai Liu,Haoyang Yue,Zeli Lin,Zheng Chen,Jingkai Wang,Jue Gong,Jiatong Li,Xianglong Yan,Libo Zhu,Jianze Li,Ziqing Zhang,Zihan Zhou,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Junye Chen,Zhenming Yan,Yucong Hong,Ruize Han,Song Wang,Li Pang,Heng Zhao,Xinqiao Wu,Deyu Meng,Xiangyong Cao,Weijun Yuan,Zhan Li,Zhanglu Chen,Boyang Yao,Yihang Chen,Yifan Deng,Zengyuan Zuo,Junjun Jiang,Saiprasad Meesiyawar,Sulocha Yatageri,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Cici Liu,Tongyao Mu,Qiong Cao,Yifan Wang,Kosuke Shigematsu,Hiroto Shirono,Asuka Shin,Wei Zhou,Linfeng Li,Lingdong Kong,Ce Wang,Xingwei Zhong,Wanjie Sun,Dafeng Zhang,Hongxin Lan,Qisheng Xu,Mingyue He,Hui Geng,Tianjiao Wan,Kele Xu,Changjian Wang,Antoine Carreaud,Nicola Santacroce,Shanci Li,Jan Skaloud,Adrien Gressin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Github Repo: this https URL

点击查看摘要

Abstract:This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state-of-the-art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infrared data and practical application needs, the challenge adopts a single-track setting. A total of 115 participants registered for the competition, with 13 teams submitting valid entries. This report summarizes the challenge design, dataset, evaluation protocol, main results, and the representative methods of each team. The challenge serves as a benchmark to advance research in infrared image super-resolution and promote the development of effective solutions for real-world remote sensing applications.
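挑战的具体评测协议以官方说明为准;这里仅给出超分任务常用的 PSNR 保真度指标的最小实现作为参考(像素以一维列表示意):

```python
import math

def psnr(hr, sr, max_val=255.0):
    """峰值信噪比:超分结果相对 HR 参考图的常用保真度指标。"""
    mse = sum((a - b) ** 2 for a, b in zip(hr, sr)) / len(hr)
    if mse == 0:
        return float("inf")  # 完全一致时 PSNR 为无穷大
    return 10.0 * math.log10(max_val ** 2 / mse)

perfect = psnr([0, 128, 255], [0, 128, 255])
noisy = psnr([0, 128, 255], [1, 127, 254])  # 每像素偏差 1,mse = 1
```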

[CV-58] An interpretable vision transformer framework for automated brain tumor classification

【速读】:该论文旨在解决脑肿瘤(brain tumor)早期精准分类难题,以提升诊断效率与准确性,从而改善患者生存率。传统依赖人工解读磁共振成像(MRI)的方法存在耗时长、观察者间差异大及对专家依赖性强等问题。其解决方案的关键在于提出一种基于预训练视觉Transformer(Vision Transformer, ViT-B/16)的深度学习框架,结合临床驱动的图像预处理(如对比度受限自适应直方图均衡化,CLAHE)和两阶段微调策略(先冻结主干网络优化分类头,再全参数微调并采用差异化学习率),辅以MixUp和CutMix数据增强、指数移动平均(EMA)权重更新以及测试时增强(TTA)技术,显著提升了模型泛化能力和鲁棒性;同时通过Attention Rollout可视化机制提供可解释的热力图,使预测结果具备临床可理解性。最终模型在7,023例MRI数据上实现99.29%测试准确率和99.25%宏F1分数,优于所有卷积神经网络(CNN)基线方法。

链接: https://arxiv.org/abs/2604.21311
作者: Chinedu Emmanuel Mbonu,Tochukwu Sunday Belonwu,Okwuchukwu Ejike Chukwuogo,Kenechukwu Sylvanus Anigbogu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentation is applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines.
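摘要中提到的 MixUp 增强可用几行代码示意:从 Beta(α, α) 采样混合系数 λ,对输入与 one-hot 标签做同比例线性插值(α=0.4 为文献中常见取值,并非对论文设置的复现):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.4, seed=0):
    """MixUp:对图像向量与 one-hot 标签做同一系数的线性插值。"""
    rng = random.Random(seed)
    lam = rng.betavariate(alpha, alpha)  # 混合系数 λ ~ Beta(α, α)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

x, y, lam = mixup([1.0, 0.0], [1, 0, 0, 0], [0.0, 1.0], [0, 0, 1, 0])
```

混合后的标签不再是硬标签,但各类概率之和仍为 1,与交叉熵训练兼容。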

[CV-59] Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation

【速读】:该论文旨在解决可控人体视频生成中因真实数据稀缺、多样性不足及隐私安全问题导致的瓶颈,尤其是针对罕见身份和复杂动作场景的建模困难。其解决方案的关键在于提出一种基于扩散模型(diffusion-based)的框架,该框架不仅支持对人物外观与运动的细粒度控制,还提供了一个统一的测试平台,用于系统分析合成数据与真实数据在训练过程中的交互机制。通过大量实验,研究揭示了合成数据与真实数据之间的互补作用,并验证了高效选择合成样本以提升运动真实性、时序一致性和身份保真度的有效方法,从而为构建数据高效且泛化能力强的人体视频生成模型提供了实践指导。

链接: https://arxiv.org/abs/2604.21291
作者: Yuanchen Fei,Yude Zou,Zejian Kang,Ming Li,Jiaying Zhou,Xiangru Huang
机构: Hunan University; Westlake University; Shanghai Jiaotong University; Zhejiang University; Shanghai Innovation Institute; Sun Yat-Sen University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances, serving as a foundation for digital humans, animation, and embodied AI. However, the scarcity of large-scale, diverse, and privacy-safe human video datasets poses a major bottleneck, especially for rare identities and complex motions. Synthetic data provides a scalable and controllable alternative, yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real gap. In this work, we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unified testbed to analyze how synthetic data interacts with real-world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism, temporal consistency, and identity preservation. This study offers the first comprehensive exploration of synthetic data’s role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.

[CV-60] GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

【速读】:该论文旨在解决视觉图神经网络(Vision Graph Neural Networks, ViGs)中动态图构建带来的计算瓶颈问题。ViGs通过每层基于当前patch特征进行k近邻(kNN)搜索来构建自适应图结构,但这一过程存在显著的O(N²)时间复杂度,且与特征更新存在强依赖关系,导致CPU和GPU上约50–95%的图卷积时间被占用,严重限制了推理效率。其解决方案的关键在于提出GraphLeap机制:通过将图构建从当前层的特征更新中解耦,采用“第ℓ层的特征更新使用上一层特征所建的图、当前层特征同时用于构建第ℓ+1层的图”的一层前瞻策略,从而实现图构建与消息传递的并行化。该设计消除了层间顺序依赖,使硬件加速器能够流水线化处理,并结合FPGA上的节点级与通道级并行优化,最终在Alveo U280 FPGA上实现了高达95.7倍于CPU和8.5倍于GPU的加速比,验证了实时ViG推理的可行性。

链接: https://arxiv.org/abs/2604.21290
作者: Anvitha Ramachandran,Dhruv Parikh,Viktor Prasanna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: FCCM 2026

点击查看摘要

Abstract:Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50–95% of graph convolution time on CPUs and GPUs, scaling as O(N^2) with the number of patches N, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer ℓ using a graph built from the previous layer’s features, while simultaneously using the current layer’s features to construct the graph for layer ℓ+1. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to 95.7x speedup over CPU and 8.5x speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
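GraphLeap 的“一层前瞻”解耦可以用一个朴素 kNN 建图循环示意:第 ℓ 层的特征更新使用上一层特征所建的图,同时用当前特征为第 ℓ+1 层建图,两步因此可以并行流水(update 函数与特征数据均为示意,非论文实现):

```python
def knn_graph(feats, k=1):
    """基于欧氏距离的朴素 kNN 建图(O(N^2),对应文中的瓶颈)。"""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [
        sorted((j for j in range(len(feats)) if j != i),
               key=lambda j: dist(feats[i], feats[j]))[:k]
        for i in range(len(feats))
    ]

def vig_forward(feats, num_layers, update):
    """一层前瞻:用“上一层的图”更新特征,同时为下一层建图。"""
    graph = knn_graph(feats)           # 第 0 层的图用初始特征构建
    for _ in range(num_layers):
        next_graph = knn_graph(feats)  # 为下一层建图(可与更新并行)
        feats = update(feats, graph)   # 用上一层特征所建的图做消息传递
        graph = next_graph
    return feats

# 以最近邻均值聚合作为消息传递的占位实现
avg = lambda feats, g: [
    [(f[d] + feats[g[i][0]][d]) / 2 for d in range(len(f))]
    for i, f in enumerate(feats)
]
out = vig_forward([[0.0], [1.0], [10.0]], num_layers=2, update=avg)
```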

[CV-61] AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing

【速读】:该论文旨在解决现有面部属性编辑方法中存在的一系列问题:基于生成对抗网络(GAN)的方法虽具备良好可控性,但风格编码与属性语义之间对齐不足;而基于扩散模型(Diffusion Model)的方法虽能生成高保真图像,却受限于不同属性语义方向的纠缠,导致编辑精度不足。其解决方案的关键在于提出 AttDiff-GAN,一种融合 GAN 属性操控与扩散图像生成的混合框架。核心创新在于通过特征级对抗学习将属性编辑与图像合成解耦,先利用显式特征操控实现精准属性修改,再以 manipulated features 引导扩散过程完成高质量图像重建,从而避免依赖语义方向的编辑方式。此外,引入 PriorMapper 和 RefineExtractor 两个模块分别增强风格-属性对齐和全局语义关系建模,显著提升了编辑准确性和非目标属性的保留效果。

链接: https://arxiv.org/abs/2604.21289
作者: Wenmin Huang,Weiqi Luo,Xiaochun Cao,Jiwu Huang
机构: Sun Yat-sen University (中山大学); Shenzhen MSU-BIT University (深圳北理莫斯科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but often suffer from weak alignment between style codes and attribute semantics. Diffusion-based methods can synthesize highly realistic images; however, their editing precision is limited by the entanglement of semantic directions among different attributes. In this paper, we propose AttDiff-GAN, a hybrid framework that combines GAN-based attribute manipulation with diffusion-based image generation. A key challenge in such integration lies in the inconsistency between one-step adversarial learning and multi-step diffusion denoising, which makes effective optimization difficult. To address this issue, we decouple attribute editing from image synthesis by introducing a feature-level adversarial learning scheme to learn explicit attribute manipulation, and then using the manipulated features to guide the diffusion process for image generation, while also removing the reliance on semantic direction-based editing. Moreover, we enhance style-attribute alignment by introducing PriorMapper, which incorporates facial priors into style generation, and RefineExtractor, which captures global semantic relationships through a Transformer for more precise style extraction. Experimental results on CelebA-HQ show that the proposed method achieves more accurate facial attribute editing and better preservation of non-target attributes than state-of-the-art methods in both qualitative and quantitative evaluations.

[CV-62] ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing

【速读】:该论文旨在解决边缘人工智能(Edge AI)系统在非平稳数据流中进行持续学习(Continual Learning, CL)时面临的计算、内存和延迟开销问题。现有方法多依赖反向传播或大量示例存储的分类器,难以在资源受限设备上高效部署。其核心解决方案是提出ImageHD——一种基于超维计算(Hyperdimensional Computing, HDC)的FPGA加速器架构,通过硬件感知的持续学习算法与流水线化数据流系统设计实现高效推理:算法层面采用统一示例内存和硬件友好的聚类合并策略以控制类别示例数量,并引入量化卷积神经网络(CNN)前端降低部署成本;系统层面则利用Word-Packed二进制超向量在Zynq ZCU104 FPGA上实现大规模并行位运算,在严格片上资源约束下完成编码、相似性搜索与受限聚类管理,最终在CORe50数据集上相较优化CPU/GPU基线分别实现最高40.4倍/4.84倍速度提升和383倍/105.1倍能效提升,验证了HDC在实时边缘持续学习中的可行性与优势。

链接: https://arxiv.org/abs/2604.21280
作者: Jebacyril Arockiaraj,Dhruv Parikh,Viktor Prasanna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: FCCM 2026

点击查看摘要

Abstract:On-device continual learning (CL) is critical for edge AI systems operating on non-stationary data streams, but most existing methods rely on backpropagation or exemplar-heavy classifiers, incurring substantial compute, memory, and latency overheads. Hyperdimensional computing (HDC) offers a lightweight alternative through fast, non-iterative online updates. Combined with a compact convolutional neural network (CNN) feature extractor, HDC enables efficient on-device adaptation with strong visual representations. However, prior HDC-based CL systems often depend on multi-tier memory hierarchies and complex cluster management, limiting deployability on resource-constrained hardware. We present ImageHD, an FPGA accelerator for on-device continual learning of visual data based on HDC. ImageHD targets streaming CL under strict latency and on-chip memory constraints, avoiding costly iterative optimization. At the algorithmic level, we introduce a hardware-aware CL method that bounds class exemplars through a unified exemplar memory and a hardware-efficient cluster merging strategy, while incorporating a quantized CNN front-end to reduce deployment overhead without sacrificing accuracy. At the system level, ImageHD is implemented as a streaming dataflow architecture on the AMD Zynq ZCU104 FPGA, integrating HDC encoding, similarity search, and bounded cluster management using word-packed binary hypervectors for massively parallel bitwise computation within tight on-chip resource budgets. On CORe50, ImageHD achieves up to 40.4x (4.84x) speedup and 383x (105.1x) energy efficiency over optimized CPU (GPU) baselines, demonstrating the practicality of HDC-enabled continual learning for real-time edge AI. 
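ImageHD 所依赖的二进制超向量相似度本质上是打包位运算:XOR 后统计 1 的个数即 Hamming 距离,对应硬件上的大规模并行按位计算。下面用 Python 大整数按位打包做一个最小示意(维度与随机种子为任意取值):

```python
import random

def random_hv(dim, seed):
    """生成随机二进制超向量,用一个 Python 大整数按位打包存储。"""
    rng = random.Random(seed)
    return rng.getrandbits(dim)

def hamming_sim(a, b, dim):
    """1 - 归一化 Hamming 距离:XOR 后数 1 的个数。"""
    return 1.0 - bin(a ^ b).count("1") / dim

dim = 1024
x = random_hv(dim, seed=1)
same = hamming_sim(x, x, dim)                         # 自身相似度为 1
other = hamming_sim(x, random_hv(dim, seed=2), dim)   # 独立随机向量约 0.5
```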

[CV-63] LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

【速读】:该论文旨在解决面部属性编辑与风格操控中难以精确控制目标属性而不影响其他无关特征的问题,这一挑战源于面部结构的复杂性以及属性之间的强相关性。传统条件生成对抗网络(conditional GANs)存在精度不足和训练不稳定的问题,而扩散模型虽具潜力却受限于语义方向表达能力弱,难以实现灵活的风格调节。其解决方案的关键在于提出LatRef-Diff框架,通过将扩散模型中的传统语义方向替换为可学习的风格码(style codes),并设计两种生成方式——潜在空间引导(latent guidance)与参考图像引导(reference guidance);进一步构建一个融合可学习向量、交叉注意力机制与分层结构的风格调制模块,以实现随机及定制化的风格操控。此外,引入前向-后向一致性训练策略,在无需成对图像的情况下提升训练稳定性,并通过感知损失与分类损失引导属性的精准恢复,从而显著提升编辑准确性和图像质量。

链接: https://arxiv.org/abs/2604.21279
作者: Wenmin Huang,Weiqi Luo,Xiaochun Cao,Jiwu Huang
机构: Sun Yat-sen University (中山大学); Shenzhen MSU-BIT University (深圳北理莫斯科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model’s design choices.

[CV-64] Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

【速读】:该论文旨在解决图形用户界面(Graphical User Interface, GUI)中的自然语言指令定位问题,即如何将自然语言指令精准映射到像素级坐标。现有方法在语义理解上表现良好,但在视觉同质化元素和密集布局下难以实现精确的空间定位,且静态的一致性策略(如基于几何聚类的自一致性)因模型预测空间分散而提升有限。其解决方案的关键在于提出一种可学习的选择机制——通过一个“先提议后批评”的协同进化框架(Propose-then-Critic framework),其中提议者与批评者在训练中动态耦合:提议者的输出多样性增强批评者的鲁棒性,而批评者成熟后的判别能力又释放提议者更广泛的探索潜力,从而形成双向强化机制;进一步引入一种成熟度感知的自适应协同进化强化学习范式,以动态平衡两者训练目标,最终显著提升接地精度与批评可靠性,并具备对复杂界面布局的泛化能力。

链接: https://arxiv.org/abs/2604.21268
作者: Wenkai Wang,Xiyun Li,Hongcan Guo,Wenhao Yu,Tianqing Fang,Haitao Mi,Dong Yu,Shengyu Zhang
机构: Zhejiang University(浙江大学); Tencent AI Lab(腾讯AI实验室); The University of Hong Kong(香港大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model’s predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model’s grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize these, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of proposer and critic, where the diversity of the proposer’s outputs enhances critic robustness, while the critic’s maturing discrimination capability conversely unlocks the proposer’s potential for extensive spatial exploration, fostering the mutual reinforcement and co-evolution of both capabilities, thereby ensuring generalizability to adapt to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.
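作为对照,摘要提到的“基于几何聚类的静态自一致性”可以用 medoid 选择来示意:从多次采样的点击坐标中选出到其余各点距离和最小的点(坐标为虚构示例;论文正是指出这类静态策略在预测点空间分散时收益有限,故改用可学习的批评者):

```python
def medoid_click(points):
    """静态几何一致性基线:选取到其余各点欧氏距离和最小的预测点。"""
    def total_dist(p):
        return sum(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for q in points)
    return min(points, key=total_dist)

clicks = [(100, 200), (102, 198), (101, 201), (400, 50)]  # 含一个离群预测
best = medoid_click(clicks)
```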

[CV-65] UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection ICMR2026

【速读】:该论文旨在解决面部动作单元(Action Unit, AU)检测中因表示阶段和决策阶段的异质性不确定性导致的鲁棒性下降问题,具体包括视觉噪声、个体差异引起的外观变化以及AU间模糊关系带来的不确定性,同时传统点估计分类器在标签严重不平衡的AU数据集上易产生过度自信的预测结果。解决方案的关键在于提出UAU-Net框架,其核心创新为:在表示阶段引入基于条件变分自编码器(Conditional Variational Autoencoder, CVAE)的CV-AFE模块,通过多时空尺度联合估计特征均值与方差来学习概率化的AU表示,并利用AU标签条件建模跨AU依赖关系引发的不确定性;在决策阶段设计不对称Beta证据神经网络(Asymmetric Beta Evidential Neural Network, AB-ENN),以Beta分布参数化预测不确定性,并通过针对高度不平衡二分类标签定制的不对称损失函数缓解过自信问题。

链接: https://arxiv.org/abs/2604.21227
作者: Yuze Li,Zhilei Liu
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by ICMR 2026

点击查看摘要

Abstract:Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.
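
为直观理解 AB-ENN 中“以 Beta 分布参数化预测不确定性”的思路,下面给出一个极简的 Python 示意(非论文实现,函数名与证据数值均为假设):由非负证据构造 Beta 参数,同时得到预测概率与预测不确定性。

```python
def beta_evidential_predict(e_pos, e_neg):
    """由非负证据 (evidence) 构造 Beta 分布参数,
    返回预测概率与预测不确定性(总证据越少,不确定性越高)。"""
    alpha, beta = e_pos + 1.0, e_neg + 1.0  # Beta(alpha, beta) 参数
    s = alpha + beta                        # 总证据强度
    p = alpha / s                           # 正类(AU 激活)的均值概率
    u = 2.0 / s                             # 不确定性:无证据时 u=1,证据充分时趋近 0
    return p, u

# 证据充分:概率高、不确定性低
p1, u1 = beta_evidential_predict(e_pos=18.0, e_neg=2.0)
# 无证据:退化为均匀先验,完全不确定
p0, u0 = beta_evidential_predict(e_pos=0.0, e_neg=0.0)
```

这也说明了证据式分类器相对点估计分类器的优势:对缺乏证据的输入,模型可以显式输出“不确定”,而非给出过度自信的概率。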

[CV-66] Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

【速读】:该论文旨在解决自回归视频扩散模型在长时程生成任务中面临的两大挑战:一是生成质量下降,二是解码延迟高。其解决方案的核心是提出了一种名为Sparse Forcing的训练与推理范式,关键在于利用了注意力机制在自回归扩散推理过程中对特定视觉块具有持续关注的观察现象,即形成隐式的时空记忆(spatiotemporal memory)并呈现局部结构化的块稀疏模式。基于此,作者设计了一个可训练的原生稀疏机制,动态压缩、保留并更新这些持久性视觉块,同时限制每个滑动窗口内的计算仅限于动态选择的局部邻域。为支持大规模部署,进一步提出了Persistent Block-Sparse Attention (PBSA) GPU高效内核,显著加速稀疏注意力计算和KV缓存更新,在保证生成质量提升的同时实现低延迟、高内存效率的解码。

链接: https://arxiv.org/abs/2604.21221
作者: Boxun Xu,Yuming Du,Zichang Liu,Siyu Yang,Ziyang Jiang,Siqi Yan,Rajasi Saha,Albert Pumarola,Wenchen Wang,Peng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.
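
摘要中“注意力集中于一小部分持久显著视觉块”的观察,可用下面的玩具代码示意:按各解码步累计的注意力质量挑选需保留的块(这只是排序选择一步的示意,并非 PBSA 内核实现;数值与保留比例均为假设):

```python
def select_persistent_blocks(attn_history, keep_ratio=0.25):
    """attn_history: 每个解码步对各视觉块的注意力权重(list of list)。
    将各步注意力累加,保留累计质量最高的一小部分块,其余可被压缩或丢弃。"""
    n_blocks = len(attn_history[0])
    mass = [0.0] * n_blocks
    for step in attn_history:
        for i, w in enumerate(step):
            mass[i] += w
    k = max(1, int(n_blocks * keep_ratio))
    # 按累计注意力从大到小取前 k 个块的下标
    return sorted(range(n_blocks), key=lambda i: -mass[i])[:k]

history = [
    [0.5, 0.1, 0.1, 0.3],
    [0.6, 0.1, 0.0, 0.3],
    [0.4, 0.2, 0.1, 0.3],
]
kept = select_persistent_blocks(history, keep_ratio=0.5)  # 4 块中保留 2 块
```

真实系统中这一选择作用在 KV 缓存上,并与可训练的压缩、更新机制配合;此处仅演示“持久块”的挑选逻辑。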

[CV-67] ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

【速读】:该论文旨在解决时间序列问答(Time Series Question-Answering, TSQA)这一前沿但研究不足的问题,即如何让基础模型(Foundation Models, FMs)理解并推理软件运行时产生的时间序列异常数据。其核心挑战在于将自然语言问题与多模态时间序列数据相结合,以实现对复杂系统故障的精准分析。解决方案的关键在于构建一个高质量、真实世界场景下的TSQA基准ARFBench,包含来自Datadog生产环境63个真实事件的750个问题,覆盖142条时间序列、共538万个数据点;并通过实证发现,视觉语言模型(Vision-Language Models, VLMs)在该任务上显著优于现有基线,同时提出一种基于时间序列基础模型(Time Series Foundation Model, TSFM)与VLM融合的轻量级混合原型,在少量合成与真实数据上后训练后即可达到与顶尖模型相当的性能;此外,通过引入“模型-专家Oracle”机制,在模型预测与领域专家判断中择优选取,其表现超越单独的人类专家,确立了新的超人类基准。


链接: https://arxiv.org/abs/2604.21199
作者: Stephan Xie,Ben Cohen,Mononito Goswami,Junhong Shen,Emaad Khwaja,Chenghao Liu,David Asker,Othmane Abou-Amal,Ameet Talwalkar
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at this https URL.
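
文中的“模型-专家Oracle”是一个 best-of-2 选择器:只要模型或专家任一方答对即记为答对。其准确率计算可用一个极简示意说明(数据为虚构,仅演示计算方式):

```python
def oracle_accuracy(model_correct, expert_correct):
    """best-of-2 oracle:模型或专家任一方答对,即视为答对。"""
    assert len(model_correct) == len(expert_correct)
    hits = sum(1 for m, e in zip(model_correct, expert_correct) if m or e)
    return hits / len(model_correct)

model  = [True, False, True, False, True]
expert = [False, True, False, False, True]
acc_model  = sum(model) / len(model)          # 模型单独的准确率
acc_oracle = oracle_accuracy(model, expert)   # 互补性带来的上界提升
```

只要两者的错误不完全重合,oracle 的准确率就严格高于任一单方,这正是论文用它刻画“模型与人类专家能力互补”的原因。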

[CV-68] A Probabilistic Framework for Improving Dense Object Detection in Underwater Image Data via Annealing-Based Data Augmentation

【速读】:该论文旨在解决目标检测模型在真实水下环境中性能显著下降的问题,这主要归因于水下场景的高变异性(如光照不稳定、能见度低)以及频繁的遮挡现象。解决方案的关键在于提出一种基于伪模拟退火(pseudo-simulated annealing)的数据增强算法,该算法受Deng等人提出的copy-paste策略启发,通过合成逼真的密集鱼类场景来提升训练数据的空间多样性和物体密度,从而增强模型对复杂水下环境的泛化能力。实验表明,该方法在Florida Keys实时视频流中人工标注的挑战性测试集上显著优于基线YOLOv10模型。

链接: https://arxiv.org/abs/2604.21198
作者: Eleanor Wiesler,Trace Baxley
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection models typically perform well on images captured in controlled environments with stable lighting, water clarity, and viewpoint, but their performance degrades substantially in real-world underwater settings characterized by high variability and frequent occlusions. In this work, we address these challenges by introducing a novel data augmentation framework designed to improve robustness in dense and unconstrained underwater scenes. Using the DeepFish dataset, which contains images of fish in natural environments, we first generate bounding box annotations from provided segmentation masks to construct a custom detection dataset. We then propose a pseudo-simulated annealing-based augmentation algorithm, inspired by the copy-paste strategy of Deng et al. [1], to synthesize realistic crowded fish scenarios. Our approach improves spatial diversity and object density during training, enabling better generalization to complex scenes. Experimental results show that our method significantly outperforms a baseline YOLOv10 model, particularly on a challenging test set of manually annotated images collected from live-stream footage in the Florida Keys. These results demonstrate the effectiveness of our augmentation strategy for improving detection performance in dense, real-world underwater environments.
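
文中“伪模拟退火”增强的核心是带温度的接受准则:随机提议粘贴位置,较差的位置也以 exp(-Δ/T) 的概率被接受,温度逐步降低。下面是一个与论文实现无关的通用退火搜索骨架(能量函数与各参数均为假设):

```python
import math
import random

def anneal_place(score, n_candidates=200, t0=1.0, cooling=0.97, seed=0):
    """伪模拟退火式粘贴位置搜索。score(pos) 越小表示位置越好,
    例如可取与已有目标的重叠惩罚;此处仅为示意。"""
    rng = random.Random(seed)
    pos = (rng.random(), rng.random())
    cur = score(pos)
    best, best_pos, t = cur, pos, t0
    for _ in range(n_candidates):
        cand = (rng.random(), rng.random())
        delta = score(cand) - cur
        # 更优则必接受;更差则以 exp(-delta/T) 的概率接受,避免陷入局部最优
        if delta < 0 or rng.random() < math.exp(-delta / t):
            pos, cur = cand, score(cand)
            if cur < best:
                best, best_pos = cur, pos
        t *= cooling  # 降温:后期几乎只接受更优位置
    return best_pos, best

# 玩具能量:离图像中心越近越好(真实场景可换成遮挡/重叠惩罚)
energy = lambda p: (p[0] - 0.5) ** 2 + (p[1] - 0.5) ** 2
pos, cost = anneal_place(energy)
```
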

[CV-69] SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

【速读】:该论文旨在解决现有视觉-语言模型在空间推理(spatial reasoning)任务中因依赖固定推理范式而导致的适应性不足问题。具体而言,当前方法通常采用单一推理流水线,隐式学习固定的先验知识(spatial prior),难以应对分布变化下的多样化场景;同时,虽有研究尝试使用多智能体系统提升推理多样性,但受限于同质化代理结构,无法充分整合不同归纳偏置(inductive biases)的优势。其解决方案的关键在于提出一种异构多智能体框架 SpatiO,通过协调具备互补归纳偏置的多个视觉-语言专家代理实现灵活的空间推理,并引入测试时编排机制(Test-Time Orchestration, TTO),在不修改模型参数的前提下动态评估并重加权各代理的可靠性,从而实现基于输入上下文的自适应推理策略选择。

链接: https://arxiv.org/abs/2604.21190
作者: Chan Yeong Hwang,Miso Choi,Sunghyun On,Jinkyu Kim,Jungbeom Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires \emphspatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce \textbf\textscSpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose \textbfTest-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that \textscSpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.
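
TTO 中“按观测到的可靠性对代理重加权”可以用一个 softmax 加权投票来示意(非论文实现;可靠性得分如何在推理期间估计此处从略,数值均为假设):

```python
import math

def orchestrate(agent_answers, reliability, temperature=1.0):
    """测试时编排示意:按各代理的可靠性做 softmax 加权投票,
    不更新任何模型参数。"""
    exps = [math.exp(r / temperature) for r in reliability]
    z = sum(exps)
    weights = [e / z for e in exps]
    votes = {}
    for ans, w in zip(agent_answers, weights):
        votes[ans] = votes.get(ans, 0.0) + w
    return max(votes, key=votes.get), weights

# 三个异构代理(如 2D 外观、深度、几何)的候选答案与可靠性
answer, w = orchestrate(["left", "right", "left"], reliability=[0.9, 2.0, 0.3])
```

注意即便多数代理投 "left",高可靠性代理仍可主导结果;这正是可靠性重加权区别于简单多数投票之处。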

[CV-70] WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

【速读】:该论文旨在解决现有3D Gaussian Splatting (3DGS) 方法在处理无约束图像(即相机参数未知且光照条件不一致)时的局限性,这类场景下传统方法通常依赖于多视角图像的迭代优化和已知相机位姿。其解决方案的关键在于提出 WildSplatter,一种前向传播的 3DGS 模型,通过联合学习 3D 高斯分布与基于输入图像的外观嵌入(appearance embeddings),实现对高斯颜色的灵活调制,从而有效建模光照和外观的显著变化。该设计使得模型能在不到一秒内从稀疏视图中重建 3D 高斯,并支持不同光照条件下的外观控制,在真实世界数据集上显著优于现有无位姿约束的 3DGS 方法。

链接: https://arxiv.org/abs/2604.21182
作者: Yuki Fujimura,Takahiro Kushida,Kazuya Kitano,Takuya Funatomi,Yasuhiro Mukaigawa
机构: NAIST(日本信息学研究所); Ritsumeikan University(立命馆大学); Kyoto University(京都大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We propose WildSplatter, a feed-forward 3D Gaussian Splatting (3DGS) model for unconstrained images with unknown camera parameters and varying lighting conditions. 3DGS is an effective scene representation that enables high-quality, real-time rendering; however, it typically requires iterative optimization and multi-view images captured under consistent lighting with known camera parameters. WildSplatter is trained on unconstrained photo collections and jointly learns 3D Gaussians and appearance embeddings conditioned on input images. This design enables flexible modulation of Gaussian colors to represent significant variations in lighting and appearance. Our method reconstructs 3D Gaussians from sparse input views in under one second, while also enabling appearance control under diverse lighting conditions. Experimental results demonstrate that our approach outperforms existing pose-free 3DGS methods on challenging real-world datasets with varying illumination.

[CV-71] Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

【速读】:该论文旨在解决点云-视觉-语言模型(Point-Vision-Language Models)在具身智能体中因几何幻觉(geometric hallucination)导致的3D结构预测与2D观测现实不一致的问题。其核心解决方案是提出几何奖励信用分配机制(Geometric Reward Credit Assignment),通过将整体监督信号解耦为特定领域的信号,并仅将其路由至对应的几何token区间,从而将模糊的反馈转化为精确的梯度更新,实现从通用策略优化到目标结构对齐的转变;同时引入重投影一致性项(Reprojection-Consistency term)以嵌入物理约束,作为跨模态验证器惩罚物理上不可能的几何结构,显著提升了3D空间推理的可靠性与可验证性。

链接: https://arxiv.org/abs/2604.21160
作者: Jingkun Chen,Ruoshi Xu,Mingqi Gao,Shengda Luo,Jungong Han
机构: Northwestern Polytechnical University (西北工业大学); Southern University of Science and Technology (南方科技大学); The University of Sheffield (谢菲尔德大学); Hengqin Laboratory (横琴实验室); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
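
“将领域特定奖励只路由到对应 token 区间”的机制可用如下示意说明(字段名与区间均为假设):字段外的 token 不再被广播的序列级奖励淹没,稀疏的几何 token 得到专属梯度信号。

```python
def assign_credit(seq_len, field_spans, field_rewards, base_reward=0.0):
    """把各领域奖励只路由到对应 token 区间。
    field_spans: {字段名: (start, end)},end 不含;
    field_rewards: {字段名: 标量奖励}。"""
    advantages = [base_reward] * seq_len
    for name, (s, e) in field_spans.items():
        r = field_rewards[name]
        for i in range(s, e):
            advantages[i] = r  # 字段内每个 token 得到该字段的专属信号
    return advantages

adv = assign_credit(
    seq_len=10,
    field_spans={"bbox_3d": (2, 5), "reproj": (7, 9)},
    field_rewards={"bbox_3d": 1.0, "reproj": -0.5},
)
```
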

[CV-72] WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis

【速读】:该论文旨在解决扩散模型在多模态磁共振成像(MRI)合成中计算成本过高这一问题,具体表现为采样步骤冗长(数百步)且需为每种模态单独训练模型,严重限制了其在临床场景中的部署。解决方案的关键在于提出一种名为WFM(Wavelet Flow Matching)的新方法,其核心思想是摒弃传统扩散模型从纯噪声开始的低效起点,转而学习一个从“知情先验”——即条件模态在小波空间中的均值——到目标分布的直接流(flow)映射。由于源分布与目标分布共享基础解剖结构、仅在对比度上存在差异,该设计使得合成过程仅需1–2个积分步骤即可实现高精度重建,从而显著提升效率并保持与扩散基线相当的图像质量。

链接: https://arxiv.org/abs/2604.21146
作者: Yalcin Tur,Mihajlo Stojkovic,Ulas Bagci
机构: Stanford University (斯坦福大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 4 figures, 3 tables. Accepted at MIDL 2026 (Poster)

点击查看摘要

Abstract:Diffusion models have achieved remarkable quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and separate models per modality) limits clinical deployment. We observe that this inefficiency stems from an unnecessary starting point: diffusion begins from pure noise, discarding the structural information already present in available MRI sequences. We propose WFM (Wavelet Flow Matching), which instead learns a direct flow from an informed prior, the mean of conditioning modalities in wavelet space, to the target distribution. Because the source and target share underlying anatomy and differ primarily in contrast, this formulation enables accurate synthesis in just 1-2 integration steps. A single 82M-parameter model with class conditioning synthesizes all four BraTS modalities (T1, T1c, T2, FLAIR), replacing four separate diffusion models totaling 326M parameters. On BraTS 2024, WFM achieves 26.8 dB PSNR and 0.94 SSIM, within 1-2 dB of diffusion baselines, while running 250-1000x faster (0.16-0.64s vs. 160s per volume). This speed-quality trade-off makes real-time MRI synthesis practical for clinical workflows. Code is available at this https URL.
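
WFM 的采样可以理解为从知情先验出发、以极少步数做 ODE 积分。下面用显式 Euler 给出示意(速度场取玩具线性场,真实模型中由网络预测;变量名均为假设):

```python
def flow_sample(x0, velocity, n_steps=2):
    """从"知情先验"x0(条件模态在小波空间的均值)出发,
    用 1-2 步显式 Euler 沿速度场积分到目标分布。"""
    x, dt = list(x0), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# 玩具速度场:把 x 线性推向 target(示意用,非学习到的速度场)
target = [1.0, -2.0, 0.5]
velocity = lambda x, t: [ti - xi for ti, xi in zip(target, x)]
x_hat = flow_sample(x0=[0.0, 0.0, 0.0], velocity=velocity, n_steps=2)
```

由于起点已携带目标的解剖结构,速度场只需补上对比度差异,这就是“1-2 步即可”的直观原因。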

[CV-73] HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping CVPR2026

【速读】:该论文旨在解决高光谱遥感数据(如NASA PACE任务所获取)在大气云特性反演等下游任务中面临的三大挑战:一是数据规模大、结构复杂且标注困难,二是现有基础模型(foundation models)多基于标准RGB图像训练,难以有效解析连续光谱特征,三是现有高光谱基础模型通常仅在无云观测上训练,受跨传感器光谱不一致性所限往往局限于单一传感器数据,且参数量庞大、计算成本高。解决方案的关键在于提出HyperFM,一种参数高效的高光谱基础模型,其核心创新包括:利用组内与组间光谱注意力机制(intra-group and inter-group spectral attention)增强光谱空间关系建模能力,并结合混合参数分解策略(hybrid parameter decomposition)显著降低计算开销,在四个基准大气云属性反演任务上均实现性能超越现有高光谱基础模型及特定任务最优方法。

链接: https://arxiv.org/abs/2604.21127
作者: Zahid Hassan Tushar,Sanjay Purushotham
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, to be published in CVPR 2026 findings, Code and data are publicly available on this https URL

点击查看摘要

Abstract:The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth’s climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.

[CV-74] Materialistic RIR: Material Conditioned Realistic RIR Generation CVPR2026

【速读】:该论文旨在解决现有声学建模方法中空间布局与材料属性影响被耦合在相关表示中的问题,导致用户控制能力受限且生成声学效果真实感不足。其解决方案的关键在于提出一种显式解耦的空间-材料双模块架构:通过一个空间模块捕捉场景几何结构对冲激响应(Room Impulse Response, RIR)的影响,并由一个材料模块根据用户指定的材质配置调制该空间RIR,从而实现对材料变化的独立控制与高保真声学模拟。此设计显著提升了音质评估指标(如RTE提升达+16%)和材料敏感性指标(提升达+70%),并通过人类感知实验验证了其更优的真实感表现。

链接: https://arxiv.org/abs/2604.21119
作者: Mahnoor Fatima Saad,Sagnik Majumder,Kristen Grauman,Ziad Al-Halah
机构: University of Utah (犹他大学); UT Austin (得克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted to CVPR 2026 Findings. Project page: this https URL

点击查看摘要

Abstract:Rings like gold, thuds like wood! The sound we hear in a scene is shaped not only by the spatial layout of the environment but also by the materials of the objects and surfaces within it. For instance, a room with wooden walls will produce a different acoustic experience from a room with the same spatial layout but concrete walls. Accurately modeling these effects is essential for applications such as virtual reality, robotics, architectural design, and audio engineering. Yet, existing methods for acoustic modeling often entangle spatial and material influences in correlated representations, which limits user control and reduces the realism of the generated acoustics. In this work, we present a novel approach for material-controlled Room Impulse Response (RIR) generation that explicitly disentangles the effects of spatial and material cues in a scene. Our approach models the RIR using two modules: a spatial module that captures the influence of the spatial layout of the scene, and a material module that modulates this spatial RIR according to a user-specified material configuration. This explicitly disentangled design allows users to easily modify the material configuration of a scene and observe its impact on acoustics without altering the spatial structure or scene content. Our model provides significant improvements over prior approaches on both acoustic-based metrics (up to +16% on RTE) and material-based metrics (up to +70%). Furthermore, through a human perceptual study, we demonstrate the improved realism and material sensitivity of our model compared to the strongest baselines.

[CV-75] Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance CVPR2026

【速读】:该论文旨在解决预训练数据地理组成对模型下游性能影响的研究空白问题,即当前关于新地理空间基础模型(geospatial foundation models)的性能差异多归因于模型架构或输入模态,而预训练数据集的作用尚未被充分探讨。其解决方案的关键在于系统性地构建全球及按大洲划分的预训练数据集,并在全局与区域下游任务上进行评估,发现欧洲地区的预训练数据在多种场景下均表现最优;进一步分析表明,仅光谱多样性(spectral diversity)与模型性能呈强相关性,这揭示了在设计高性能预训练数据集时应重点关注光谱维度的多样性,为未来地理空间模型的数据构建提供了新的科学依据。

链接: https://arxiv.org/abs/2604.21104
作者: Amandeep Kaur,Mirali Purohit,Gedeon Muhawenayo,Esther Rolf,Hannah Kerner
机构: Arizona State University (亚利桑那州立大学); University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at EarthVision workshop, CVPR 2026

点击查看摘要

Abstract:New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model’s downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset’s downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at this https URL.
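
文中“仅光谱多样性与性能强相关”一类结论通常用 Spearman 秩相关来量化。下面是一个纯 Python 实现(数据为虚构示例,仅演示计算方式,未处理秩相同的情形):

```python
def spearman(xs, ys):
    """简易 Spearman 秩相关:先取秩,再对秩算 Pearson 相关。"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# 虚构数据:各预训练集的光谱多样性 vs 下游得分(单调一致 → rho = 1)
spectral_div = [0.31, 0.55, 0.42, 0.73, 0.60]
downstream   = [61.0, 66.5, 63.2, 71.8, 68.0]
rho = spearman(spectral_div, downstream)
```
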

[CV-76] Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

【速读】:该论文旨在解决美国全国范围内建筑状况自动化评估的难题,传统方法依赖人工标注成本高且难以扩展。其核心解决方案是利用大型语言模型(Large Language Models, LLMs)与谷歌街景(Google Street View, GSV)图像相结合,通过微调Gemma 3 27B模型实现与人类平均评分(Mean Opinion Score, MOS)的高度一致性,在SRCC和PLCC指标上优于单个评估者。关键创新在于采用知识蒸馏技术,将大模型能力迁移至轻量级模型(如EfficientNetV2-M和SwinV2-B),在保持接近性能的同时获得最高达30倍的速度提升,从而构建了一个高效、可扩展且适用于下游应用(如 homeowners 使用的可视化仪表板)的大规模建筑状态评估框架。

链接: https://arxiv.org/abs/2604.21102
作者: Siyuan Yao,Siavash Ghorbany,Kuangshi Ai,Arnav Cherukuthota,Meghan Forstchen,Alexis Korotasz,Matthew Sisk,Ming Hu,Chaoli Wang
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs’ capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
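
文中的知识蒸馏(27B 教师 → 4B/CNN/Transformer 学生)一类方法通常以温度软化后的 KL 散度为损失,通用示意如下(logits 为虚构数值,具体训练细节以论文为准):

```python
import math

def softmax(logits, t):
    m = max(l / t for l in logits)
    exps = [math.exp(l / t - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_kl(teacher_logits, student_logits, t=2.0):
    """蒸馏损失示意:温度 t 软化后计算 KL(teacher || student),
    再按惯例乘 t^2 以保持梯度量级。"""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * t * t

loss_far  = distill_kl([3.0, 0.0, -1.0], [0.0, 2.0, 1.0])   # 学生偏离教师
loss_near = distill_kl([3.0, 0.0, -1.0], [2.9, 0.1, -0.9])  # 学生接近教师
```

学生越接近教师的软分布,损失越小;软化温度让教师在各类别间的相对置信度(“暗知识”)得以传递。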

[CV-77] Foveated Reasoning : Stateful Action-based Visual Focusing for Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在使用高分辨率图像时面临的计算开销过大的问题,即随着视觉标记(visual-token)数量增加导致的高计算负担。其解决方案的关键在于引入一种名为“Foveated Reasoner”的自回归视觉语言框架,该框架将人类视觉中的中央凹机制(foveation)与推理过程统一到单一解码轨迹中:模型从低分辨率图像开始,仅在必要时触发中央凹机制,选择性地获取特定区域的高分辨率证据,并将其注入同一解码过程中进行后续推理。通过两阶段训练策略——冷启动监督引导中央凹行为,再结合强化学习优化证据获取与任务准确性,同时抑制“全图扫描”等低效策略——该方法能够在严格的视觉标记预算下实现更优的任务性能。

链接: https://arxiv.org/abs/2604.21079
作者: Juhong Min,Lazar Valkov,Vitali Petsiuk,Hossein Souri,Deen Dayal Mohan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides “where to look”, while selectively acquired high-acuity evidence refines “what to think”. We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial “see-everything” solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.

[CV-78] Optimizing Diffusion Priors with a Single Observation

【速读】:该论文旨在解决扩散先验(diffusion priors)在实际应用中因训练数据有限或纯模拟数据导致的误差与偏差问题,以及现有微调方法依赖大量观测样本且易在小样本场景下过拟合的问题。其解决方案的关键在于:仅需单个观测样本,通过将多个现有扩散先验组合为一个乘积专家(product-of-experts)先验,并优化各先验的指数权重以最大化贝叶斯证据(Bayesian evidence),从而实现对先验分布的有效泛化与调整。此方法可在真实世界逆问题(如黑洞成像和文本条件图像去模糊)中显著提升后验采样的灵活性与可信度。

链接: https://arxiv.org/abs/2604.21066
作者: Frederic Wang,Katherine L. Bouman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and biases of these underlying sources. Current approaches to finetuning diffusion models rely on a large number of observations with varying forward operators, which can be difficult to collect for many applications, and thus lead to overfitting when the measurement set is small. We propose a method for tuning a prior from only a single observation by combining existing diffusion priors into a single product-of-experts prior and identifying the exponents that maximize the Bayesian evidence. We validate our method on real-world inverse problems, including black hole imaging, where the true prior is unknown a priori, and image deblurring with text-conditioned priors. We find that the evidence is often maximized by priors that extend beyond those trained on a single dataset. By generalizing the prior through exponent weighting, our approach enables posterior sampling from both tempered and combined diffusion models, yielding more flexible priors that improve the trustworthiness of the resulting posterior image distribution.
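
“乘积专家先验 + 指数权重”的形式为 p(x) ∝ Π_i p_i(x)^{w_i},取对数后即各专家得分(log 密度)的加权和。玩具示意如下(专家取一维高斯 log 密度,权重为假设;论文中权重通过最大化贝叶斯证据选取):

```python
def poe_score(x, expert_scores, weights):
    """乘积专家先验的对数密度:log p(x) ∝ sum_i w_i * log p_i(x)。"""
    return sum(w * s(x) for s, w in zip(expert_scores, weights))

# 两个单位方差高斯专家的 log 密度(忽略归一化常数)
g = lambda mu, sig: (lambda x: -((x - mu) ** 2) / (2 * sig ** 2))
experts = [g(0.0, 1.0), g(2.0, 1.0)]

# 等权组合的 log 密度在两个专家均值的中点处更高:组合先验是两者的折中
s_mid  = poe_score(1.0, experts, [0.5, 0.5])
s_left = poe_score(0.0, experts, [0.5, 0.5])
```

调整权重 w_i 即可在“回火”(tempering)单个先验与组合多个先验之间连续插值,这正是论文通过最大化证据来选取的自由度。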

[CV-79] Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images ALT

【速读】:该论文旨在解决儿科脑肿瘤病理诊断中因数据稀缺、类别不平衡及亚型间细微形态重叠导致的深度学习模型性能受限问题。其核心挑战在于如何在有限标注样本下实现细粒度的分类准确率提升,尤其是在弱监督条件下从全切片图像(WSI)进行滑动级别分类。解决方案的关键在于提出一种专家引导的对比微调框架,通过将对比学习嵌入到滑动级别多实例学习(MIL)中,在下游微调阶段显式地正则化滑动级别表示的空间几何结构;特别地,引入基于临床知识的难负样本(hard negatives),聚焦于诊断易混淆的亚型,从而增强类内紧凑性和类间分离性,显著改善细粒度区分能力。

链接: https://arxiv.org/abs/2604.21060
作者: Joakim Nguyen,Jian Yu,Jinrui Fang,Nicholas Konz,Tianlong Chen,Sanjay Krishnan,Chandra Krishnan,Ying Ding,Hairong Wang,Ankita Shukla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE International Conference on Healthcare Informatics (ICHI), 2026

点击查看摘要

Abstract:Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.
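
专家引导的对比微调可理解为:在监督对比损失中,对“临床上易混淆亚型”的负样本加权放大。下面是单个 anchor 的极简示意(相似度、标签与加权方式均为假设,并非论文的具体损失形式):

```python
import math

def supcon_loss(sims, labels, anchor=0, tau=0.1, hard_neg_weight=2.0, hard_negs=()):
    """监督对比损失示意(单 anchor):同类为正样本,异类为负样本;
    hard_negs 中的易混淆样本在分母中被加权放大。
    sims[j]: anchor 与样本 j 的相似度。"""
    pos = [j for j in range(len(sims)) if j != anchor and labels[j] == labels[anchor]]
    loss = 0.0
    for p in pos:
        num = math.exp(sims[p] / tau)
        den = 0.0
        for j in range(len(sims)):
            if j == anchor:
                continue
            w = hard_neg_weight if j in hard_negs else 1.0
            den += w * math.exp(sims[j] / tau)
        loss += -math.log(num / den)
    return loss / max(1, len(pos))

labels = ["medullo", "medullo", "ependymoma", "glioma"]
sims = [1.0, 0.8, 0.7, 0.1]
l_plain = supcon_loss(sims, labels, hard_negs=())
l_hard  = supcon_loss(sims, labels, hard_negs=(2,))  # 放大易混淆亚型的惩罚
```

对易混淆亚型加权会增大损失、加大把它们推离 anchor 的梯度,对应论文中“类内更紧凑、类间更分离”的效果。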

[CV-80] Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains

【速读】:该论文旨在解决机器人在人类环境中执行操作任务时,如何有效理解物体交互的时序演化、当前动作识别以及下一步操作步骤预测的问题。传统增强型语义事件链(enriched Semantic Event Chains, eSECs)虽能提供可解释的关系描述,但缺乏不确定性感知能力,难以支持决策推理。解决方案的关键在于提出eSEC-LAM——一种神经符号框架,将经典eSEC转化为显式的事件级符号状态,通过引入置信度感知谓词、功能物体角色、先验可达性(affordance priors)、原始动作抽象及显著性引导的解释线索,实现从基础模型感知前端提取确定性谓词,并利用轻量级符号推理完成当前动作推断与下一原始动作预测。该方法显著提升了动作识别性能和下一步操作预测准确性,在感知噪声下更具鲁棒性,并生成时间一致的、基于显式关系证据的解释轨迹。

链接: https://arxiv.org/abs/2604.21053
作者: Fatemeh Ziaeetabar
机构: University of Tehran (德黑兰大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.
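补充说明:论文中"基于原始动作前后置条件的轻量级符号推理"可以用如下极简示意代码理解。注意这是假设性实现,谓词名、动作名与阈值均为虚构,仅用于演示思路,并非论文代码:

```python
# 假设性示例:用前置条件匹配做"下一原始动作"预测
PRIMITIVES = {
    "grasp": {"pre": {"hand_near_obj", "obj_free"}, "post": {"obj_in_hand"}},
    "place": {"pre": {"obj_in_hand", "surface_clear"}, "post": {"obj_on_surface"}},
    "pour":  {"pre": {"obj_in_hand", "target_below"}, "post": {"contents_transferred"}},
}

def next_primitives(state, tau=0.6):
    """state: 谓词 -> 置信度;返回所有前置条件均以置信度 >= tau 成立的动作。"""
    held = {p for p, c in state.items() if c >= tau}
    return sorted(name for name, spec in PRIMITIVES.items() if spec["pre"] <= held)

state = {"hand_near_obj": 0.9, "obj_free": 0.8, "surface_clear": 0.4}
print(next_primitives(state))  # ['grasp']
```

置信度感知谓词在这里体现为阈值过滤:低置信度的关系不会触发动作推断,从而把感知不确定性显式纳入符号推理。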

[CV-81] StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

【速读】:该论文旨在解决图像风格迁移(Style Transfer)中如何在保持内容结构的同时有效迁移风格纹理的问题。现有方法如AdaIN常因忽略内容与风格的协同建模而导致语义失真或风格不一致。其解决方案的关键在于提出StyleVAR框架,将风格迁移建模为在预训练潜在空间中的条件离散序列建模任务:首先通过VQ-VAE对图像进行多尺度离散编码,再利用带有混合交叉注意力机制(blended cross-attention)的Transformer实现目标token的自回归生成;该机制使目标表示在生成过程中动态关注自身历史,并由内容和风格特征作为查询来调节不同阶段的关注权重,同时引入尺度依赖的融合系数以平衡风格与内容的影响,从而在不破坏自回归连续性的前提下实现结构保真与风格适配。

链接: https://arxiv.org/abs/2604.21052
作者: Liqi Jing,Dingming Zhang,Peinian Li,Lichen Zhu
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content–style–target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR’s multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
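其中"风格/内容特征作为查询、对目标历史做混合交叉注意力"的机制,可用如下示意代码理解(仅为概念演示,维度、系数取值均为假设,非论文实现):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def blended_cross_attention(history, style_q, content_q, alpha):
    """history: (T, d) 已生成的目标 token;style_q/content_q: (d,) 查询向量;
    alpha 为尺度依赖的融合系数,控制风格与内容的相对影响。"""
    d = history.shape[-1]
    attn_s = softmax(style_q @ history.T / np.sqrt(d))    # 风格注意力 (T,)
    attn_c = softmax(content_q @ history.T / np.sqrt(d))  # 内容注意力 (T,)
    weights = alpha * attn_s + (1 - alpha) * attn_c       # 混合注意力图
    return weights @ history                              # 聚合历史,输出 (d,)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
out = blended_cross_attention(h, rng.normal(size=8), rng.normal(size=8), alpha=0.3)
print(out.shape)  # (8,)
```

可见风格与内容不直接提供值,而是决定"强调目标自身历史的哪些部分",这正是该混合机制不破坏自回归连续性的原因。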

[CV-82] Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks

【速读】:该论文旨在解决文本到图像扩散模型中概念遗忘(machine unlearning)的稳定性问题,即现有方法在模型微调后会导致已被删除的概念重新出现,即使微调数据与被删概念无关。解决方案的关键在于将投影梯度遗忘(Projected Gradient Unlearning, PGU)从分类任务迁移至扩散模型领域,作为事后加固步骤:通过构建保留概念激活所构成的核心梯度空间(Core Gradient Space, CGS),并将后续梯度更新投影到该空间的正交补空间中,从而确保微调过程无法逆转已实现的遗忘效果。此方法显著延迟甚至消除风格类概念的复兴,并在对象类概念上大幅延缓其恢复,同时计算效率远高于Meta-Unlearning(约6分钟 vs. 2小时)。

链接: https://arxiv.org/abs/2604.21041
作者: Aljalila Aladawi,Mohammed Talha Alam,Fakhri Karray
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine unlearning for text-to-image diffusion models aims to selectively remove undesirable concepts from pre-trained models without costly retraining. Current unlearning methods share a common weakness: erased concepts return when the model is fine-tuned on downstream data, even when that data is entirely unrelated. We adapt Projected Gradient Unlearning (PGU) from classification to the diffusion domain as a post-hoc hardening step. By constructing a Core Gradient Space (CGS) from the retain concept activations and projecting gradient updates into its orthogonal complement, PGU ensures that subsequent fine-tuning cannot undo the achieved erasure. Applied on top of existing methods (ESD, UCE, Receler), the approach eliminates revival for style concepts and substantially delays it for object concepts, running in roughly 6 minutes versus the ~2 hours required by Meta-Unlearning. PGU and Meta-Unlearning turn out to be complementary: which performs better depends on how the concept is encoded, and retain concept selection should follow visual feature similarity rather than semantic grouping.
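梯度投影遗忘的核心思想可以用如下极简示意理解(假设性实现,与论文代码无关):先用保留概念上的梯度做 SVD 构造核心梯度空间(CGS)基底,再把后续更新梯度投影到其正交补空间:

```python
import numpy as np

def cgs_basis(retain_grads, k):
    """retain_grads: (n, d) 保留概念上采到的梯度;取前 k 个右奇异向量
    作为核心梯度空间(CGS)的正交基,返回 (d, k)。"""
    _, _, vt = np.linalg.svd(retain_grads, full_matrices=False)
    return vt[:k].T

def project_out(grad, basis):
    """把更新梯度投影到 CGS 的正交补,使微调无法触碰保留概念所在方向。"""
    return grad - basis @ (basis.T @ grad)

rng = np.random.default_rng(1)
G = rng.normal(size=(16, 32))
B = cgs_basis(G, k=8)
g = rng.normal(size=32)
g_safe = project_out(g, B)
print(bool(abs(B.T @ g_safe).max() < 1e-8))  # True:投影后与 CGS 正交
```

由于任何落在正交补中的更新都不会改变 CGS 方向上的参数响应,下游微调便难以"复活"已被遗忘的概念。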

[CV-83] Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

【速读】:该论文旨在解决通用大型多模态模型(Large Multi-modal Models, LMMs)通常仅在RGB图像上训练,导致其在多光谱遥感数据应用中性能受限的问题。现有方法要么需要昂贵的重新训练来适配多光谱数据,要么产生高度专业化、难以泛化的模型。解决方案的关键在于提出一种无需训练的推理阶段方法:通过将非RGB的多光谱输入映射到LMM已理解的视觉空间,并注入领域特定信息和思维链(Chain-of-Thought)推理作为指令,从而在不修改模型参数的前提下显著提升性能。实验基于Gemini 2.5模型,在多个主流遥感基准测试中实现了显著的零样本(Zero-Shot)性能提升,验证了该方法对地理空间专业用户利用强大通用模型处理专用传感器数据的有效性。

链接: https://arxiv.org/abs/2604.21032
作者: Dahun Kim,Ganesh Satish Mallya,Anelia Angelova
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IGARSS 2026

点击查看摘要

Abstract:Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs’ understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.
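"把非RGB输入映射到LMM已理解的视觉空间"这一步,最朴素的做法之一是假彩色合成。下面是一个假设性示意(波段下标、归一化方式均为演示用假设,并非论文的具体映射):

```python
import numpy as np

def false_color_rgb(bands, idx=(7, 3, 2)):
    """bands: (H, W, C) 多光谱立方体;idx 为映射到 R/G/B 的三个波段下标
    (波段选择仅作演示)。逐波段归一化到 [0, 1],供只接受 RGB 的模型消费。"""
    img = bands[..., list(idx)].astype(np.float64)
    lo = img.min(axis=(0, 1), keepdims=True)
    hi = img.max(axis=(0, 1), keepdims=True)
    return (img - lo) / np.maximum(hi - lo, 1e-9)

cube = np.random.default_rng(2).uniform(0, 4000, size=(4, 4, 13))
rgb = false_color_rgb(cube)
print(rgb.shape)  # (4, 4, 3)
```

在此基础上,论文再以指令形式注入领域信息与思维链提示,让通用模型理解这些"伪RGB"图像承载的光谱语义。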

[CV-84] A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment

【速读】:该论文旨在解决全球洪水事件频发且严重程度加剧背景下,传统水力模拟方法因计算成本高而难以实现快速、可靠洪水预测的问题。解决方案的关键在于构建一个基于深度学习的代理模型(surrogate model),通过优化U-Net架构、图像块(patch)生成策略及数据处理流程,高效逼近复杂水力模型,从而在保证预测精度的同时显著降低计算资源消耗。实验以德国北莱茵-威斯特法伦州Wupper流域为案例,验证了该方法在最大水位预测上的有效性与实用性。

链接: https://arxiv.org/abs/2604.21028
作者: Christian Lammers,Fernando Arévalo,Leonie Märker-Neuhaus,Daniel Heinenberg,Christian Förster,Karl-Heinz Spies
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 Pages, 9 Figures

点击查看摘要

Abstract:The increasing frequency and severity of global flood events highlights the need for the development of rapid and reliable flood prediction tools. This process traditionally relies on computationally expensive hydraulic simulations. This research presents a prediction tool by developing a deep-learning based surrogate model to accurately and efficiently predict the maximum water level across a grid. This was achieved by conducting a series of experiments to optimize a U-Net architecture, patch generation, and data handling for approximating a hydraulic model. This research demonstrates that a deep learning surrogate model can serve as a computationally efficient alternative to traditional hydraulic simulations. The framework was tested using hydraulic simulations of the Wupper catchment in the North-Rhein Westphalia region (Germany), obtaining comparable results.

[CV-85] Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

【速读】:该论文旨在解决当前计算机视觉系统对微动作(micro-actions)理解不足的问题,微动作是持续1-3秒的细微局部动作(如挠头或敲手指),在自然交互中广泛存在且对细粒度视频理解至关重要。其核心挑战在于微动作具有多样化的时空特性:部分依赖空间构型,另一些则体现为时间动态变化,而现有方法通常采用单一的时空分解策略,难以适应这种复杂性。解决方案的关键在于提出一种双路径网络架构,通过并行的时空(Spatial-Temporal, ST)与时序-空间(Temporal-Spatial, TS)路径分别优先处理空间配置和时间动态,并引入基于身体部位的自适应路由机制,使每个解剖实体自主选择最优处理路径;同时设计相互动作一致性(Mutual Action Consistency, MAC)损失函数以增强跨路径的一致性,从而实现对微动作本质多样性的有效建模。

链接: https://arxiv.org/abs/2604.21011
作者: Naga VS Raviteja Chappa,Evangelos Sariyanidi,Lisa Yankowitz,Gokul Nair,Casey J. Zampella,Robert T. Schultz,Birkan Tunç
机构: The Children’s Hospital of Philadelphia (费城儿童医院); University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注: Accepted to International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:Micro-actions are subtle, localized movements lasting 1-3 seconds such as scratching one’s head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on MA-52 dataset and state-of-the-art results on iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.
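"每个身体部位实体自主选择处理路径"的自适应路由,可以用如下示意代码理解(假设性简化:用每实体一个可学习 logit 的 sigmoid 门控在两条路径间插值,非论文实现细节):

```python
import numpy as np

def route_entities(st_out, ts_out, gate_logits):
    """st_out / ts_out: (E, d) 两条路径对 E 个身体部位实体输出的特征;
    gate_logits: (E,) 每个实体学到的路由偏好,经 sigmoid 后在两路间插值。"""
    g = 1.0 / (1.0 + np.exp(-gate_logits))
    return g[:, None] * st_out + (1 - g[:, None]) * ts_out

st = np.ones((3, 4))   # ST 路径输出(示意)
ts = np.zeros((3, 4))  # TS 路径输出(示意)
fused = route_entities(st, ts, np.array([-10.0, 0.0, 10.0]))
print(fused[:, 0])  # 三个实体的门控值约为 0、0.5、1
```

软门控(而非硬选择)保持端到端可微,再配合 MAC 损失约束两条路径的预测一致性。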

[CV-86] Linear Image Generation by Synthesizing Exposure Brackets CVPR2026

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 模型主要生成显示参考图像(display-referred image)而导致下游编辑能力受限的问题。由于显示参考图像经过非线性色调映射和动态范围压缩,其信息表达受到限制,无法满足专业后期处理对高动态范围和精确辐射度信息的需求。为实现文本驱动的线性图像(linear image)生成,即保留完整动态范围、场景参考(scene-referred)的高质量图像,论文提出关键解决方案:将线性图像建模为一系列曝光 bracket 序列,每张 bracket 覆盖动态范围的一个子区间,并基于 DiT(Diffusion Transformer)架构设计了一种流匹配(flow-matching)机制,用于条件化生成这些曝光 bracket。此方法有效克服了预训练变分自编码器(VAE)在高动态范围和高比特深度下难以同时保留下极端亮部与暗部细节的局限,从而为专业图像编辑提供更灵活、信息丰富的输入基础。

链接: https://arxiv.org/abs/2604.21008
作者: Yuekun Dai,Zhoutong Zhang,Shangchen Zhou,Nanxuan Zhao
机构: S-Lab, Nanyang Technological University (南洋理工大学); Adobe NextCam (Adobe 下一代相机); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR2026

点击查看摘要

Abstract:The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.
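"用一组曝光 bracket 表示线性图像"的思路可以用如下极简示意理解(曝光档位与数值均为演示用假设):每个 bracket 对线性辐照度乘以不同的曝光系数后裁剪,从而各自覆盖动态范围的一段:

```python
import numpy as np

def to_brackets(linear, stops=(-2, 0, 2)):
    """linear: 场景参考线性图像(值可大于 1);每个 bracket 按 2**stop
    缩放曝光后裁剪到 [0, 1],分别保留暗部、中间调与高光信息。"""
    return [np.clip(linear * 2.0 ** s, 0.0, 1.0) for s in stops]

hdr = np.array([[0.05, 0.5], [2.0, 8.0]])  # 大于 1 的值代表高光
under, mid, over = to_brackets(hdr)
print(over)  # 长曝光:暗部被提亮,高光被裁剪到 1
```

单个 bracket 的比特深度要求与普通显示参考图像相当,这正是它能绕开预训练 VAE 动态范围瓶颈的原因:完整动态范围由整组 bracket 共同承载。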

[CV-87] DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction AAAI2026

【速读】:该论文旨在解决在稀疏视角(sparse-view)条件下,神经表示(Neural Representations, NRs)如神经场(neural fields)和3D高斯(3D Gaussians)在CT重建中出现严重伪影的问题。其解决方案的关键在于提出DiffNR框架,核心组件为SliceFixer——一个单步扩散模型,用于修正退化切片中的伪影;同时通过引入专用条件层与定制化数据整理策略支持模型微调,并在重建过程中周期性生成伪参考体数据,提供辅助的三维感知监督以修复欠约束区域,从而实现高效且高质量的重建,相较以往依赖迭代去噪的CT求解器嵌入方法显著提升运行效率与性能。

链接: https://arxiv.org/abs/2604.21518
作者: Shiyan Su,Ruyi Zha,Danli Shi,Hongdong Li,Xuelian Cheng
机构: 1. University of Sydney (悉尼大学); 2. Australian National University (澳大利亚国立大学); 3. Zhejiang University (浙江大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026. Project page: this https URL

点击查看摘要

Abstract:Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.

[CV-88] PanGuide3D: Cohort-Robust Pancreas Tumor Segmentation via Probabilistic Pancreas Conditioning and a Transformer Bottleneck

【速读】:该论文旨在解决胰腺肿瘤在增强CT图像中的分割问题,其核心挑战在于病灶通常体积小、异质性强,且易与周围软组织混淆,同时模型在不同队列间迁移时性能显著下降(即存在队列偏移问题)。为提升跨队列泛化能力并保持模型结构简单高效,作者提出PanGuide3D架构:其关键创新在于引入概率性解剖条件机制——通过一个共享的3D编码器和胰腺解码器生成概率性胰腺图谱,并利用可微软门控机制将该胰腺概率在多尺度上显式地作为肿瘤解码器的条件输入;此外,在U-Net瓶颈层中嵌入轻量级Transformer模块以增强长程上下文建模能力,从而缓解分布偏移带来的影响。此设计有效提升了对小病灶及复杂解剖位置的分割准确性,并减少了不合理的假阳性结果,验证了概率性解剖先验在端到端模型中提升跨队列鲁棒性的可行性。

链接: https://arxiv.org/abs/2604.20981
作者: Sunny Joy Ma,Xiang Ma
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pancreatic tumor segmentation in contrast-enhanced computed tomography (CT) is clinically important yet technically challenging: lesions are often small, heterogeneous, and easily confused with surrounding soft tissue, and models that perform well on one cohort frequently degrade under cohort shift. Our goal is to improve cross-cohort generalization while keeping the model architecture simple, efficient, and practical for 3D CT segmentation. We introduce PanGuide3D, a cohort-robust architecture with a shared 3D encoder, a pancreas decoder that predicts a probabilistic pancreas map, and a tumor decoder that is explicitly conditioned on this pancreas probability at multiple scales via differentiable soft gating. To capture long-range context under distribution shift, we further add a lightweight Transformer bottleneck in the U-Net bottleneck representation. We evaluate cohort transfer by training on the PanTS (Pancreatic Tumor Segmentation) cohort and testing both in-cohort (PanTS) and out-of-cohort on MSD (Medical Segmentation Decathlon) Task07 Pancreas, using matched preprocessing and training protocols across strong baselines. We collect voxel-level segmentation metrics, patient-level tumor detection, subgroup analyses by tumor size and anatomical location, volume-conditioned performance analyses, and calibration measurements to assess reliability. Across the evaluated models, PanGuide3D achieves the best overall tumor performance and shows improved cross-cohort generalization, particularly for small tumors and challenging anatomical locations, while reducing anatomically implausible false positives. These findings support probabilistic anatomical conditioning as a practical strategy for improving cross-cohort robustness in an end-to-end model and suggest potential utility for contouring support, treatment planning, and multi-institutional studies.
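"以胰腺概率对肿瘤解码器做可微软门控"可以用如下示意代码理解(门控下限 floor 等细节为演示用假设,非论文实现):

```python
import numpy as np

def soft_gate(tumor_feat, pancreas_prob, floor=0.1):
    """tumor_feat: (H, W, d) 肿瘤解码器特征;pancreas_prob: (H, W) 胰腺概率图。
    软门控:胰腺外区域只被衰减而非硬掩码,保持可微、梯度不中断。"""
    gate = floor + (1 - floor) * pancreas_prob
    return tumor_feat * gate[..., None]

feat = np.ones((2, 2, 3))
prob = np.array([[0.0, 1.0], [0.5, 0.9]])
gated = soft_gate(feat, prob)
print(gated[0, 0, 0], gated[0, 1, 0])  # 0.1 1.0
```

相比对解剖先验做硬掩码,这种软约束既抑制了解剖上不合理的假阳性,又允许网络在胰腺边界等概率不确定的区域保留肿瘤证据。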

人工智能

[AI-0] From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

【速读】:该论文旨在解决科学工作流系统中“语义翻译”环节的瓶颈问题,即科学家仍需手动将研究问题转化为工作流规范,这一过程依赖于领域知识和基础设施技能,效率低下且易出错。解决方案的关键在于提出一种分层代理架构:第一层(语义层)由大语言模型(LLM)将自然语言转化为结构化意图;第二层(确定性层)通过验证后的生成器产出可复现的工作流有向无环图(DAG);第三层(知识层)由领域专家编写“Skills”,以Markdown文档形式编码词汇映射、参数约束与优化策略。该设计将LLM的非确定性限制在意图提取阶段,确保相同意图始终生成相同工作流,从而实现高效、可靠、可扩展的端到端自动化工作流生成。

链接: https://arxiv.org/abs/2604.21910
作者: Bartosz Balis,Michal Orzechowski,Piotr Kica,Michal Dygas,Michal Kuszewski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific workflow systems automate execution – scheduling, fault tolerance, resource management – but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.
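"确定性层保证相同意图产生相同工作流"这一性质,可以用规范化序列化来示意。以下为假设性草图(intent 的字段名 dataset/steps、线性依赖结构均为虚构,非论文的生成器):

```python
import hashlib, json

def canonical_dag(intent):
    """确定性层示意:把结构化意图映射为规范化 JSON 的线性工作流 DAG。
    sort_keys + 固定分隔符保证相同意图得到逐字节相同的规格与 DAG id。"""
    tasks = [{"id": f"t{i}", "tool": step, "after": [f"t{i-1}"] if i else []}
             for i, step in enumerate(intent["steps"])]
    spec = json.dumps({"dataset": intent["dataset"], "tasks": tasks},
                      sort_keys=True, separators=(",", ":"))
    return spec, hashlib.sha256(spec.encode()).hexdigest()[:12]

a = {"dataset": "1000genomes", "steps": ["align", "call_variants"]}
b = {"steps": ["align", "call_variants"], "dataset": "1000genomes"}
assert canonical_dag(a) == canonical_dag(b)  # 相同意图 -> 相同工作流
print(canonical_dag(a)[1])
```

LLM 的非确定性被限制在"自然语言 -> intent"这一步;一旦 intent 固定,工作流生成就是纯函数,便于复现与审计。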

[AI-1] Equity Bias: An Ethical Framework for AI Design

【速读】:该论文旨在解决当前人工智能(AI)系统中偏见处理方式的局限性问题,即传统方法将偏见视为需消除的误差,而忽视了其背后所反映的知识权力结构与认知不公。解决方案的关键在于提出“公平偏差”(Equity Bias)框架,该框架基于诠释学哲学和认识论不公正理论,主张将偏见透明化并使其可被质疑,从而拓宽塑造AI系统的视角来源,并将AI视为解释性代理。其核心方法是三阶段AI生命周期模型:‘公平考古’(映射知识与假设)、‘共同创造意义’(参与式设计)和‘持续问责’(持续评估),推动AI向伦理可问责和应对复杂现实挑战的方向发展。

链接: https://arxiv.org/abs/2604.21907
作者: Mary Lockwood
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages including references, 1 figure

点击查看摘要

Abstract:Equity Bias is a philosophical and practical framework for building smarter, more equitable AI systems. Grounded in hermeneutic philosophy and epistemic injustice theory, it treats bias not as an error to eliminate but as a reflection of whose knowledge is encoded into systems. While traditional approaches aim to reduce or remove bias, Equity Bias instead makes bias transparent and contestable. In doing so, it broadens whose perspectives shape AI and provides a lens for understanding AI systems as interpretive agents. The framework introduces a three-phase AI Life Cycle methodology: ‘Equity Archaeology’ (mapping knowledge and assumptions), ‘Co-Creating Meaning’ (participatory design), and ‘Ongoing Accountability’ (continuous evaluation). Equity Bias guides developers, researchers, and policymakers towards AI that is ethically accountable and capable of addressing complex real-world challenges.

[AI-2] A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

【速读】:该论文旨在解决气候科学中联合时空超分辨率(Spatiotemporal Super-Resolution, SR)模型在不同空间和时间尺度下难以迁移的问题,即现有方法通常仅针对单一SR因子组合设计模型,限制了其跨分辨率和时间频次的泛化能力。解决方案的关键在于提出一种尺度自适应(scale-adaptive)框架,通过将时空SR分解为确定性条件均值预测(含注意力机制)与残差条件扩散模型(residual conditional diffusion model),并引入三个可调超参数实现跨尺度适配:扩散噪声调度幅度β(随SR因子增大而增加以提升多样性)、时间上下文长度L(保持不同帧率下的注意力视野一致)以及可选的质量守恒变换f(用于控制极端值放大效应)。该方法使同一架构可在空间放大因子1–25、时间放大因子1–6范围内复用,显著提升了模型的通用性和实用性。

链接: https://arxiv.org/abs/2604.21903
作者: Max Defez,Filippo Quarenghi,Mathieu Vrac,Stephan Mandt,Tom Beucler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep-learning video super-resolution has progressed rapidly, but climate applications typically super-resolve (increase resolution) either space or time, and joint spatiotemporal models are often designed for a single pair of super-resolution (SR) factors (upscaling spatial and temporal ratio between the low-resolution sequence and the high-resolution sequence), limiting transfer across spatial resolutions and temporal cadences (frame rates). We present a scale-adaptive framework that reuses the same architecture across factors by decomposing spatiotemporal SR into a deterministic prediction of the conditional mean, with attention, and a residual conditional diffusion model, with an optional mass-conservation (same precipitation amount in inputs and outputs) transform to preserve aggregated totals. Assuming that larger SR factors primarily increase underdetermination (hence required context and residual uncertainty) rather than changing the conditional-mean structure, scale adaptivity is achieved by retuning three factor-dependent hyperparameters before retraining: the diffusion noise schedule amplitude beta (larger for larger factors to increase diversity), the temporal context length L (set to maintain comparable attention horizons across cadences) and optionally a third, the mass-conservation function f (tapered to limit the amplification of extremes for large factors). Demonstrated on reanalysis precipitation over France (Comephore), the same architecture spans super-resolution factors from 1 to 25 in space and 1 to 6 in time, yielding a reusable architecture and tuning recipe for joint spatiotemporal super-resolution across scales.

[AI-3] Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

【速读】:该论文旨在解决传统游戏AI编程中缺乏通用性、可扩展性和自适应能力的问题,尤其是在不同类型游戏场景下难以统一建模和动态优化策略的挑战。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的新范式,并通过Nemobot这一交互式智能体工程环境实现对Claude Shannon游戏机器分类体系的延伸与落地。Nemobot能够根据不同游戏类型(字典类、可严格求解类、启发式类和学习类)定制化生成、部署和优化LLM驱动的游戏智能体,结合数学推理、人类反馈强化学习、最小最大算法与群体数据融合等机制,使AI具备自我编程与持续迭代的能力,从而推动向长期目标——自编程人工智能迈进。

链接: https://arxiv.org/abs/2604.21896
作者: Chee Wei Tan,Yuchen Wang,Shangxin Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 figures, 3 tables

点击查看摘要

Abstract:This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon’s taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., Shannon, 1950) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.

[AI-4] A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

【速读】:该论文旨在解决电力系统中机组组合(Unit Commitment, UC)问题的计算效率瓶颈问题。随着电网中可再生能源和长时储能等新技术的接入,UC需在多日时间尺度上优化,并且频率更高,传统混合整数线性规划(Mixed-integer Linear Programming, MILP)求解器难以在紧缩的操作时限内获得可行解。解决方案的关键在于提出一种基于Transformer架构的多阶段预测-优化框架:首先利用自注意力机制预测72小时内的机组启停计划;其次通过确定性后处理启发式方法强制满足最小启停时间约束并最小化冗余容量;最后将优化后的预测结果作为下游MILP求解器的热启动初始解,并结合置信度驱动的变量固定策略显著压缩组合搜索空间。该方法在单节点测试系统上实现了100%可行性,并大幅缩短计算时间,在约20%的测试实例中还获得了比纯MILP求解更低的系统总成本。

链接: https://arxiv.org/abs/2604.21891
作者: Muhy Eddin Za’ter,Anna Van Boven,Bri-Mathias Hodge,Kyri Baker
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Maintaining instantaneous balance between electricity supply and demand is critical for reliability and grid stability. System operators achieve this by solving the Unit Commitment (UC) problem, a high-dimensional, large-scale Mixed-integer Linear Programming (MILP) problem that is strictly governed by the grid's physical constraints. As grids integrate variable renewable sources and new technologies such as long-duration storage, UC must be solved optimally over multi-day horizons and potentially with greater frequency. Consequently, traditional MILP solvers increasingly struggle to compute solutions within these tightening operational time limits. To bypass these computational bottlenecks, this paper proposes a novel framework utilizing a transformer-based architecture to predict generator commitment schedules over a 72-hour horizon. Also, because raw predictions in high-dimensional spaces often yield physically infeasible results, the pipeline integrates the self-attention network with deterministic post-processing heuristics that systematically enforce minimum up/down times and minimize excess capacity. Finally, these refined predictions are utilized as a warm start for a downstream MILP solver, while employing a confidence-based variable fixation strategy to drastically reduce the combinatorial search space. Validated on a single-bus test system, the complete multi-stage pipeline achieves 100% feasibility and significantly accelerates computation times. Notably, in approximately 20% of test instances, the proposed model reached a feasible operational schedule with a lower overall system cost than relying solely on the solver.
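"基于置信度的变量固定"可以用如下极简示意理解(阈值 0.95/0.05 为演示用假设,非论文所用数值):只把网络高置信度的启停决策固定为常量,其余仍交给 MILP 求解器作为自由二元变量:

```python
def fix_confident(probs, hi=0.95, lo=0.05):
    """probs[t]: 网络预测的某机组第 t 小时开机概率。高置信度的决策固定为
    1/0,None 留给 MILP 求解器自由求解,从而压缩组合搜索空间。"""
    return [1 if p >= hi else 0 if p <= lo else None for p in probs]

preds = [0.99, 0.97, 0.60, 0.02, 0.01, 0.50]
warm = fix_confident(preds)
print(warm)  # [1, 1, None, 0, 0, None]
print(sum(v is None for v in warm), "个二元变量留给求解器")
```

固定的变量越多,分支定界的搜索树越小;而保留低置信度变量的自由度,则是该流水线仍能保证 100% 可行性的前提。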

[AI-5] Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在敏感应用场景中面临的多轮对抗攻击问题,尤其是针对无状态内容审核机制的系统性漏洞。传统越狱(jailbreak)方法依赖于维持持久对话上下文,而本文提出一种名为瞬时轮次注入(Transient Turn Injection, TTI)的新攻击技术,其关键在于将恶意意图分散到多个独立交互中,从而绕过基于单轮判断的防御机制。TTI通过由LLM驱动的自动化攻击代理迭代测试并规避商业与开源模型中的策略执行,揭示了现有模型在医疗等高风险领域存在显著脆弱性,同时验证了会话级上下文聚合和深度对齐等缓解策略的有效性,强调了构建上下文感知、持续对抗测试的综合防御体系的紧迫性。

链接: https://arxiv.org/abs/2604.21860
作者: Naheed Rayhan,Sohely Jahan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context. Our extensive evaluation across state-of-the-art models, including those from OpenAI, Anthropic, Google Gemini, Meta, and prominent open-source alternatives, uncovers significant variations in resilience to TTI attacks, with only select architectures exhibiting substantial inherent robustness. Our automated black-box evaluation framework also uncovers previously unknown model-specific vulnerabilities and attack-surface patterns, especially within medical and other high-stakes domains. We further compare TTI against established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.

[AI-6] Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

【速读】:该论文旨在解决当前高风险人工智能(Artificial Intelligence, AI)系统在监管合规中缺乏可量化安全验证手段的问题。尽管欧盟《人工智能法案》(EU AI Act)等法规要求高风险系统在部署前证明其安全性,但均未明确“可接受风险”的定量标准,也未提供技术方法来验证已部署系统是否满足该阈值。这一监管空白导致开发者面临强制合规评估却无统一证据生成路径的困境,尤其对难以进行白盒分析的黑箱统计推理模型更为突出。解决方案的关键在于提出一个两阶段框架:第一阶段由权威机构明确定义可接受故障概率 δ\delta 和操作输入域 ε\varepsilon,构成具有法律约束力的规范性行为;第二阶段利用RoMA和gRoMA统计验证工具,在无需访问模型内部结构的前提下,计算出系统真实故障率的可审计上界,且适用于任意架构。此框架将AI风险监管转化为工程实践,实现合规证据的量化、可验证与责任前移至开发端。

链接: https://arxiv.org/abs/2604.21854
作者: Natan Levy,Gadi Perl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Artificial intelligence now decides who receives a loan, who is flagged for criminal investigation, and whether an autonomous vehicle brakes in time. Governments have responded: the EU AI Act, the NIST Risk Management Framework, and the Council of Europe Convention all demand that high-risk systems demonstrate safety before deployment. Yet beneath this regulatory consensus lies a critical vacuum: none specifies what "acceptable risk" means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold. The regulatory architecture is in place; the verification instrument is not. This gap is not theoretical. As the EU AI Act moves into full enforcement, developers face mandatory conformity assessments without established methodologies for producing quantitative safety evidence - and the systems most in need of oversight are opaque statistical inference engines that resist white-box scrutiny. This paper provides the missing instrument. Drawing on the aviation certification paradigm, we propose a two-stage framework that transforms AI risk regulation into engineering practice. In Stage One, a competent authority formally fixes an acceptable failure probability δ and an operational input domain ε - a normative act with direct civil liability implications. In Stage Two, the RoMA and gRoMA statistical verification tools compute a definitive, auditable upper bound on the system’s true failure rate, requiring no access to model internals and scaling to arbitrary architectures. We demonstrate how this certificate satisfies existing regulatory obligations, shifts accountability upstream to developers, and integrates with the legal frameworks that exist today.
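"对真实故障率计算可审计的统计上界"的思路可以用如下示意理解。此处用单侧 Hoeffding 界作保守替代(论文中 RoMA/gRoMA 采用的具体统计界可能不同,数值与阈值均为演示假设):

```python
import math

def failure_rate_upper_bound(failures, trials, confidence=0.999):
    """单侧 Hoeffding 界:以不低于 confidence 的概率,真实故障率低于返回值。
    只需黑盒采样计数,不依赖模型内部结构,适用于任意架构。"""
    alpha = 1.0 - confidence
    return failures / trials + math.sqrt(math.log(1.0 / alpha) / (2.0 * trials))

bound = failure_rate_upper_bound(failures=3, trials=100_000)
delta = 1e-2  # 假设的监管阈值(可接受故障概率 δ)
print(f"认证上界 {bound:.5f}", "PASS" if bound <= delta else "FAIL")
```

这类界的保守性意味着:上界小于 δ 即可出具合规证书,而上界超标只说明证据不足,可通过增加采样量收紧。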

[AI-7] TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

【速读】:该论文旨在解决现代钓鱼攻击中URL分类器失效的问题,这些攻击通过交互门(如复选框/滑块挑战)、延迟内容渲染和无标志的凭证窃取页面等手段规避基于快照的静态URL分类方法。为应对这一挑战,作者提出TraceScope系统,其核心在于构建一个解耦的三阶段分析流水线:首先由沙箱化操作代理驱动真实GUI浏览器并基于视觉动机诱导页面行为,从而冻结会话为不可变证据包;其次由判定代理按需查询证据以验证MITRE ATT&CK检查清单,并生成包含指标(IOCs)与最终判断的审计就绪报告。该方案的关键创新在于将被动静态分析转化为主动交互取证流程,同时确保安全隔离与可复现性,显著提升了对复杂钓鱼攻击的召回率(0.78)与精度(0.94),且在实际钓鱼邮件数据集上表现出优于现有防御机制的能力。

链接: https://arxiv.org/abs/2604.21840
作者: Haolin Zhang,William Reber,Yuxuan Zhang,Guofei Gu,Jeff Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification toward an interactive forensics task: an analyst must actively navigate the page while isolating themselves from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that operationalizes this workflow at scale. To prevent the observer effect and ensure safety, a sandboxed operator agent drives a real GUI browser guided by visual motivation to elicit page behavior, freezing the session into an immutable evidence bundle. Separately, an adjudicator agent circumvents LLM context limitations by querying evidence on demand to verify a MITRE ATT&CK checklist, and generates an audit-ready report with extracted indicators of compromise (IOCs) and a final verdict. Evaluated on 708 reachable URLs from an existing dataset (241 verified phishing from PhishTank and 467 benign from Tranco-derived crawling), TraceScope achieves 0.94 precision and 0.78 recall, substantially improving recall over three prior visual/reference-based classifiers while producing reproducible, analyst-grade evidence suitable for review. More importantly, we manually curated a dataset of real-world phishing emails to evaluate our system in a practical setting. Our evaluation reveals that TraceScope demonstrates superior performance in a real-world scenario as well, successfully detecting sophisticated phishing attempts that current state-of-the-art defenses fail to identify.

[AI-8] Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理通过模型上下文协议(Model Context Protocol, MCP)调用外部工具时所面临的高开销问题,即“工具税”(Tools Tax)——该机制因每次交互均需注入完整的工具模式(schema),导致每轮对话中token消耗激增(约10k–60k tokens),从而显著增加缓存压力、降低上下文利用率(接近70%时推理质量下降),并使token预算成为持续运营成本。解决方案的关键在于提出工具注意力机制(Tool Attention),其核心创新包括:(i) 基于句嵌入的意图模式重叠(Intent Schema Overlap, ISO)评分用于筛选相关工具;(ii) 一种状态感知门控函数以确保前置条件和访问范围合规;(iii) 两阶段惰性模式加载器,仅在上下文中保留精简摘要池,并对前k个被选中的工具动态加载完整JSON模式。实验证明,该方法将每轮工具token使用量减少95.0%(从47.3k降至2.4k),并将有效上下文利用率从24%提升至91%,验证了协议层效率才是可扩展生成式系统的关键约束因素。

链接: https://arxiv.org/abs/2604.21816
作者: Anuj Sadani,Deepak Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k → 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at this https URL
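Tool Attention 的门控流程可以用一个极简草图示意:先用意图与工具摘要的相似度打分,选出 top-k,再只为被选中的工具加载完整 schema。下例中的词袋相似度仅用于替代论文中的句嵌入,工具名与 schema 内容均为虚构:

```python
from collections import Counter
from math import sqrt

def cos_sim(a, b):
    """Toy bag-of-words cosine similarity (stand-in for sentence embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Compact summary pool kept in context; full schemas are loaded lazily.
TOOL_SUMMARIES = {
    "weather.lookup": "get current weather forecast for a city",
    "mail.send": "send an email message to a recipient",
    "calendar.add": "add an event to the user calendar",
}
FULL_SCHEMAS = {name: {"name": name, "parameters": "..."} for name in TOOL_SUMMARIES}

def gate_tools(intent, k=1):
    """Rank tools by intent/summary overlap, then pay the token cost of the
    full JSON schema only for the top-k gated tools."""
    ranked = sorted(TOOL_SUMMARIES,
                    key=lambda t: cos_sim(intent, TOOL_SUMMARIES[t]), reverse=True)
    return {name: FULL_SCHEMAS[name] for name in ranked[:k]}

schemas = gate_tools("what is the weather forecast in Paris")
```

此例中只有 weather.lookup 的完整 schema 会进入上下文,其余工具只以一行摘要存在,这就是"摘要池 + 惰性加载"带来的 token 节省来源。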

[AI-9] Quotient-Space Diffusion Models ICLR2026

【速读】:该论文旨在解决扩散模型在处理具有对称性的科学任务(如分子结构生成)时,因传统方法需显式建模群作用而导致的学习复杂性过高问题。其关键解决方案是构建一个适用于一般商空间(quotient space)的扩散建模形式化框架,通过将目标分布定义在群作用下的商空间上,从而消除对群作用分量的直接学习需求,显著降低模型复杂度,并保证采样器能够精确恢复目标分布。该方法在小分子和蛋白质结构生成任务中得到验证,相较于以往依赖启发式对齐策略的模型,展现出更优性能与理论完备性。

链接: https://arxiv.org/abs/2604.21809
作者: Yixian Xu,Yusong Wang,Shengjie Luo,Kaiyuan Gao,Tianyu He,Di He,Chang Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注: ICLR 2026 Oral Presentation; 40 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Diffusion-based generative models have reformed generative AI, and have enabled new capabilities in the science domain, for example, generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system, which identifies objects that can be converted by a group action as equivalent, hence the target distribution is essentially defined on the quotient space with respect to the group. In this work, we establish a formal framework for diffusion modeling on a general quotient space, and apply it to molecular structure generation which follows the special Euclidean group SE(3) symmetry. The framework reduces the necessity of learning the component corresponding to the group action, hence simplifies learning difficulty over conventional group-equivariant diffusion models, and the sampler guarantees recovering the target distribution, while heuristic alignment strategies lack proper samplers. The arguments are empirically validated on structure generation for small molecules and proteins, indicating that the principled quotient-space diffusion model provides a new framework that outperforms previous symmetry treatments.
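论文处理的是 SE(3) 商空间;"在商空间上建模"这一思想可用更简单的平移子群来示意:对点云减去质心,即可为每个平移轨道选出唯一的规范代表元,作用在居中坐标上的模型因此等价于定义在商空间上。以下玩具式规范化仅为本文示意,并非论文的构造:

```python
def center(points):
    """Map a 3D point cloud to the canonical representative of its orbit
    under the translation group: subtract the centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(p[0] - cx, p[1] - cy, p[2] - cz) for p in points]

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
# A translated copy lies on the same orbit of the translation group...
shifted = [(x + 5.0, y - 3.0, z + 1.0) for x, y, z in cloud]

# ...so both map to the same quotient-space representative (up to float error):
same = all(abs(a - b) < 1e-9
           for p, q in zip(center(cloud), center(shifted))
           for a, b in zip(p, q))
```

对 SE(3) 还需处理旋转部分,规范化不再如此平凡,这正是论文中形式化商空间扩散框架要解决的部分。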

[AI-10] Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications KR2026

【速读】:该论文旨在解决从时间戳数据和背景知识中检测高阶时序扩展事件(temporally extended events)的问题,尤其在医疗领域用于推断疾病发作、治疗过程及其组合形成的更高层次疾病事件。解决方案的关键在于构建一个基于逻辑规则的框架,利用逻辑规则刻画简单时序事件的存在与终止条件,并通过组合这些规则形成元事件(meta-events);同时引入约束机制识别不相容事件组合,并设计修复机制选择最优的一致事件集合,从而提升推理结果的准确性与合理性。该框架虽整体计算复杂度较高,但通过识别特定限制条件可保证多项式时间的数据复杂度,且原型系统基于答案集编程(Answer Set Programming, ASP)实现核心功能,在肺癌病例上的评估验证了其计算可行性与临床一致性。

链接: https://arxiv.org/abs/2604.21793
作者: Yvon K. Awuklu,Meghyn Bienvenu,Katsumi Inoue,Vianney Jouhet,Fleur Mougin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the full version (with appendix) of a paper appearing at the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR 2026)

点击查看摘要

Abstract:In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.
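作为"存在与终止条件"作用在时间戳数据上的一个直观示意,下述玩具规则把同类型临床观测在最大间隔内合并为一个发作区间;间隔阈值与事件名均为虚构,论文实际采用的是基于 ASP 的规则框架而非 Python:

```python
def infer_episodes(observations, max_gap=30):
    """observations: (day, event_type) pairs. An episode of a type starts at its
    first supporting observation and terminates once no further observation
    of that type arrives within `max_gap` days."""
    episodes = {}
    for day, etype in sorted(observations):
        spans = episodes.setdefault(etype, [])
        if spans and day - spans[-1][1] <= max_gap:
            spans[-1] = (spans[-1][0], day)   # extend the open episode
        else:
            spans.append((day, day))          # open a new episode
    return episodes

obs = [(0, "chemo"), (14, "chemo"), (28, "chemo"), (120, "chemo"), (5, "diag")]
eps = infer_episodes(obs)
```

此处 "chemo" 的观测被合并为两个疗程区间(第 0-28 天与第 120 天),体现了同一事件类型可被终止后再次"存在"。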

[AI-11] Thinking with Reasoning Skills: Fewer Tokens More Accuracy

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因生成冗长的中间推理链条(如思维链,Chain-of-Thought)而导致的高Token消耗问题。其解决方案的关键在于:通过从大量试错探索中提炼并存储可复用的推理技能(Reasoning Skills),在推理阶段根据当前任务检索相关技能以引导后续推理路径,从而避免重复性低效探索,提升推理效率与准确性。该方法显著减少了推理Token使用量,并在编程和数学推理任务上实现了性能提升,具备良好的实际部署经济潜力。

链接: https://arxiv.org/abs/2604.21764
作者: Guangxiang Zhao,Qilong Shi,Xusen Xiao,Xiangzheng Zhang,Tong Yang,Lin Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, The 64th Annual Meeting of the Association for Computational Linguistics – Industry Track

点击查看摘要

Abstract:Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing "reasoning from scratch" paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths. We evaluate our method on coding and mathematical reasoning tasks, and find that it significantly reduces reasoning tokens while improving overall performance. The resulting lower per-request cost indicates strong practical and economic potential for real-world deployment.

[AI-12] Agentic AI-assisted coding offers a unique opportunity to instill epistemic grounding during software development

【速读】:该论文旨在解决当前生成式 AI 在软件开发中缺乏领域知识约束与一致性的问题,尤其是在非专业用户生成代码或工具时难以保证科学正确性和最佳实践。其解决方案的关键在于提出一种由社区治理的、领域范围内的认知基础文档(epistemic grounding document),通过明确编码“硬性约束”(Hard Constraints,即确保科学正确性的不可协商有效性不变量)和“约定参数”(Convention Parameters,即社区共识的默认值),使AI代理在执行任务时始终遵循这些底层规则,从而在不依赖用户提示的情况下保障输出结果的有效性与一致性。这一机制使得非领域专家也能生成符合行业标准的软件解决方案,并增强开发者及使用者对最终产品的信任。

链接: https://arxiv.org/abs/2604.21744
作者: Magnus Palmblad,Jared M. Ragland,Benjamin A. Neely
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: Letter, 9 pages, 1 table

点击查看摘要

Abstract:The capabilities of AI-assisted coding are progressing at breakneck speed. Chat-based vibe coding has evolved into fully fledged AI-assisted, agentic software development using agent scaffolds where the human developer creates a plan that agentic AIs implement. One current trend is utilizing documents beyond this plan document, such as project and method-scoped documents. Here we propose this http URL, a community-governed, field-scoped epistemic grounding document, using mass spectrometry-based proteomics as an example. This explicit field-scoped grounding document encodes Hard Constraints (non-negotiable validity invariants empirically required for scientific correctness) and Convention Parameters (community-agreed defaults) that override all other contexts to enforce validity, regardless of what the user prompts. In practice, this will empower a non-domain expert to generate code, tools, and software that have best practices baked in at the ground level, providing confidence to the software developer but also to those reviewing or using the final product. Undoubtedly it is easier to have agentic AIs adhere to guidelines than humans, and this opportunity allows for organizations to develop epistemic grounding documents in such a way as to keep domain experts in the loop in a future of democratized generation of bespoke software solutions.

[AI-13] Enabling and Inhibitory Pathways of University Students Willingness to Disclose AI Use: A Cognition-Affect-Conation Perspective

【速读】:该论文旨在解决高校学生在使用生成式 AI(Generative AI)辅助学习时,对AI使用情况披露意愿不足的问题。研究通过引入认知—情感—行为倾向(Cognition–Affect–Conation, CAC)框架,揭示了心理安全与评价焦虑在影响学生披露意图中的核心作用。解决方案的关键在于构建支持性的制度环境:一方面,提升感知公平性、教师支持和组织支持可增强心理安全,从而促进主动披露;另一方面,减少污名感、不确定性及隐私担忧可缓解评价焦虑,避免学生采取规避或策略性隐瞒行为。研究表明,清晰的政策指引与支持性的教学实践是推动负责任AI透明度的核心机制。

链接: https://arxiv.org/abs/2604.21733
作者: Yiran Du,Huimin He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing integration of artificial intelligence (AI) in higher education has raised important questions regarding students’ transparency in reporting AI-assisted work. This study investigates the psychological mechanisms underlying university students’ willingness to disclose AI use by applying the Cognition–Affect–Conation (CAC) framework. A sequential explanatory mixed-methods design was employed. In the quantitative phase, survey data were collected from 546 university students and analysed using structural equation modelling to examine the relationships among cognitive perceptions, affective responses, and disclosure intention. In the qualitative phase, semi-structured interviews with 22 students were conducted to further interpret the quantitative findings. The results indicate that psychological safety significantly increases students’ willingness to disclose AI use and is positively shaped by perceived fairness, perceived teacher support, and perceived organisational support. Conversely, evaluation apprehension reduces disclosure intention and psychological safety, and is strengthened by perceived stigma, perceived uncertainty, and privacy concern. Qualitative findings further reveal that institutional clarity and supportive instructional practices encourage openness, whereas policy ambiguity and fear of negative evaluation often lead students to adopt cautious or strategic disclosure practices. Overall, the study highlights the dual role of enabling and inhibitory psychological mechanisms in shaping AI-use disclosure and underscores the importance of supportive institutional environments and clear guidance for promoting responsible AI transparency in higher education.

[AI-14] Fairness under uncertainty in sequential decisions

【速读】:该论文旨在解决序列决策场景中因不确定性分布不均而导致的公平性问题,尤其关注在在线和顺序学习系统中,由于历史排除与选择性反馈造成的未观测反事实(counterfactuals)对少数群体产生的系统性不利影响。其关键解决方案是提出一个关于不确定性分类的框架,将不确定性细分为模型不确定性(model uncertainty)、反馈不确定性(feedback uncertainty)和预测不确定性(prediction uncertainty),并通过反事实逻辑与强化学习形式化建模前两者,从而识别并量化不同群体间不公平风险的来源。该框架不仅揭示了忽略未观测空间对决策者(如损失潜在收益)和受决策影响个体(如加剧排斥、减少机会)的危害,还展示了通过引入不确定性感知探索策略,在不损害机构目标(如预期效用)的前提下降低弱势群体的结果方差,为实践者提供诊断、审计与治理公平风险的方法论工具。

链接: https://arxiv.org/abs/2604.21711
作者: Michelle Seng Ah Lee,Kirtan Padh,David Watson,Niki Kilbertus,Jatinder Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ACM Conference on Fairness, Accountability, and Transparency, 2026

点击查看摘要

Abstract:Fair machine learning (ML) methods help identify and mitigate the risk that algorithms encode or automate social injustices. Algorithmic approaches alone cannot resolve structural inequalities, but they can support socio-technical decision systems by surfacing discriminatory biases, clarifying trade-offs, and enabling governance. Although fairness is well studied in supervised learning, many real ML applications are online and sequential, with prior decisions informing future ones. Each decision is taken under uncertainty due to unobserved counterfactuals and finite samples, with dire consequences for under-represented groups, systematically under-observed due to historical exclusion and selective feedback. A bank cannot know whether a denied loan would have been repaid, and may have less data on marginalized populations. This paper introduces a taxonomy of uncertainty in sequential decision-making – model, feedback, and prediction uncertainty – providing shared vocabulary for assessing systems where uncertainty is unevenly distributed across groups. We formalize model and feedback uncertainty via counterfactual logic and reinforcement learning, and illustrate harms to decision makers (unrealized gains/losses) and subjects (compounding exclusion, reduced access) of policies that ignore the unobserved space. Algorithmic examples show it is possible to reduce outcome variance for disadvantaged groups while preserving institutional objectives (e.g. expected utility). Experiments on data simulated with varying bias show how unequal uncertainty and selective feedback produce disparities, and how uncertainty-aware exploration alters fairness metrics. The framework equips practitioners to diagnose, audit, and govern fairness risks. Where uncertainty drives unfairness rather than incidental noise, accounting for it is essential to fair and effective decision-making. 
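选择性反馈造成"不确定性分布不均"的机制可用一个玩具放贷循环说明:只有获批者的还款结果才可被观测,获批率低的群体累计标签更少、还款率置信区间更宽。以下模拟使用固定随机种子,数值仅为示意,与论文实验无关:

```python
import random
from math import sqrt

random.seed(0)

def simulate(approval_rate, true_repay=0.8, n_applicants=2000):
    """Selective feedback: repayment is only observed when the loan is approved."""
    observed = [random.random() < true_repay
                for _ in range(n_applicants) if random.random() < approval_rate]
    n = len(observed)
    p = sum(observed) / n
    half_width = 1.96 * sqrt(p * (1 - p) / n)   # normal-approximation 95% CI
    return n, half_width

n_major, ci_major = simulate(approval_rate=0.7)   # historically favored group
n_minor, ci_minor = simulate(approval_rate=0.1)   # under-observed group
```

两组真实还款率相同,但低获批群体的置信区间明显更宽:不确定性本身成为群体间的差异来源,这正是文中"uncertainty-aware exploration"要缓解的现象。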

[AI-15] Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

【速读】:该论文旨在解决深度神经网络中激活函数平滑性不足对梯度优化性能的限制问题,尤其是在深层架构中ReLU等非光滑激活函数可能引发梯度传播效率下降的问题。其解决方案的关键在于提出一族 C^{2N}-光滑的激活函数家族GEM(Geometric Monomial),通过采用log-logistic累积分布函数(CDF)构造门控机制,实现完全基于有理数运算的平滑逼近ReLU特性;进一步引入 ε-参数化变体E-GEM以支持任意 L^p 范数下对ReLU的逼近,并设计SE-GEM变体消除死神经元问题,同时保持高阶连续性。实验表明,该方法在CNN与Transformer架构中可达到与GELU等现有激活函数相当或更优的性能,且平滑参数 N 与 ε 具有明确的结构适应性:N=1 更适合深层CNN,而 N=2 更优用于Transformer,体现了模型深度与平滑性的权衡关系。

链接: https://arxiv.org/abs/2604.21677
作者: Eylon E. Krause
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 26 pages, 4 figures, 16 tables

点击查看摘要

Abstract:The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work we propose a family of C^{2N}-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic. We introduce three variants: GEM (the base family), E-GEM (an ε-parameterized generalization enabling arbitrary L^p-approximation of ReLU), and SE-GEM (a piecewise variant eliminating dead neurons with C^{2N} junction smoothness). An N-ablation study establishes N=1 as optimal for standard-depth networks, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The smoothness parameter N further reveals a CNN-transformer tradeoff: N=1 is preferred for deep CNNs, while N=2 is preferred for transformers. On MNIST, E-GEM ties the best baseline (99.23%). On CIFAR-10 + ResNet-56, SE-GEM (ε=10^-4) surpasses GELU (92.51% vs 92.44%), the first GEM-family activation to outperform GELU. On CIFAR-100 + ResNet-56, E-GEM reduces the GELU deficit from 6.10% (GEM N=2) to just 0.62%. On GPT-2 (124M), GEM achieves the lowest perplexity (72.57 vs 73.76 for GELU), with GEM N=1 also beating GELU (73.32). On BERT-small, E-GEM (ε=10) achieves the best validation loss (6.656) across all activations. The ε-parameterization reveals a scale-dependent optimum: small ε (10^-4 to 10^-6) for deep CNNs and larger transformers, with the special case of small transformers (BERT-small) benefiting from large ε (ε=10) due to its limited depth and unconstrained gradients.

[AI-16] Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

【速读】:该论文旨在解决周期性信号去噪与波形估计问题,尤其针对在计算资源受限环境下如何实现高效且准确的信号处理。传统深度学习方法虽性能优越,但存在计算开销大、需为每条信号单独训练的缺点;而经典方法如自回归(AR)模型虽高效,却难以适应不同基频信号。解决方案的关键在于提出一种基于深度卷积神经网络(DCNN)与重采样(Re-sampling)相结合的R-DCNN架构:通过轻量级重采样步骤统一不同频率信号的时间尺度,从而复用同一组网络权重,仅需单次观测即可完成训练,并显著降低计算复杂度,同时保持与先进AR方法及独立训练DCNN相当的去噪和波形估计精度。

链接: https://arxiv.org/abs/2604.21651
作者: Eli Gildish,Michael Grebshtein,Igor Makienko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注: 16 pages, 8 figures, the use of deep learning in IoT devices

点击查看摘要

Abstract:Denoising of periodic signals and accurate waveform estimation are core tasks across many signal processing domains, including speech, music, medical diagnostics, radio, and sonar. Although deep learning methods have recently shown performance improvements over classical approaches, they require substantial computational resources and are usually trained separately for each signal observation. This study proposes a computationally efficient method based on DCNN and Re-sampling, termed R-DCNN, designed for operation under strict power and resource constraints. The approach targets signals with varying fundamental frequencies and requires only a single observation for training. It generalizes to additional signals via a lightweight resampling step that aligns time scales in signals with different frequencies to re-use the same network weights. Despite its low computational complexity, R-DCNN achieves performance comparable to state-of-the-art classical methods, such as autoregressive (AR)-based techniques, as well as conventional DCNNs trained individually for each observation. This combination of efficiency and performance makes the proposed method particularly well suited for deployment in resource-constrained environments without sacrificing denoising or estimation accuracy.
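权重复用的关键只是时间尺度变换:估计基频后,把信号重采样到"每个周期固定采样点数"的规范尺度,再套用同一组网络权重。下述草图只展示线性插值的对齐步骤,频率与长度均为示意(去噪网络本身省略):

```python
from math import sin, pi

def resample(signal, new_len):
    """Linear interpolation onto new_len uniformly spaced samples."""
    n = len(signal)
    out = []
    for i in range(new_len):
        x = i * (n - 1) / (new_len - 1)
        lo = int(x)
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

# Two periodic signals with different fundamental frequencies...
fast = [sin(2 * pi * 5 * t / 500) for t in range(500)]   # 5 cycles in 500 samples
slow = [sin(2 * pi * 2 * t / 500) for t in range(500)]   # 2 cycles in 500 samples

# ...aligned so one cycle always spans ~100 samples, letting a single set of
# fixed network weights serve both signals.
fast_canon = resample(fast, 5 * 100)
slow_canon = resample(slow, 2 * 100)
```

对齐后两条信号的单个周期在采样网格上几乎重合,网络只需学习"规范周期"上的去噪,再把输出重采样回原始时间轴即可。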

[AI-17] ask-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

【速读】:该论文旨在解决自主水下航行器(Autonomous Underwater Vehicles, AUV)在动态、不确定环境及有限感知条件下,如何实现多任务自适应且可解释控制的问题。传统控制器难以应对此类复杂场景,而多任务强化学习虽具备良好泛化能力,却因内部决策机制不透明,限制了其在真实水下监测任务中的部署。解决方案的关键在于通过分析预训练多任务强化学习网络的内部结构,识别并比较用于导航不同物种的目标子网络;研究发现,在相关任务的上下文多任务设置中,仅约1.5%的权重用于任务区分,其中约85%连接输入层的上下文变量节点与下一层隐藏层,凸显了上下文变量在任务特异性决策中的核心作用。这一方法为理解共享与专用网络组件提供了依据,有助于提升模型编辑、迁移学习和持续学习效率,从而推动水下长期监测任务的可靠实施。

链接: https://arxiv.org/abs/2604.21640
作者: Yi-Ling Liu,Melvin Laux,Mariela De Lucas Alvarez,Frank Kirchner,Rebecca Adam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: To be published in IEEE OCEANS 2026 (Sanya) conference proceedings

点击查看摘要

Abstract:Autonomous underwater vehicles are required to perform multiple tasks adaptively and in an explainable manner under dynamic, uncertain conditions and limited sensing, challenges that classical controllers struggle to address. This demands robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task RL, overcomes these limitations by leveraging shared representations to enable efficient adaptation across tasks and environments. However, while such policies show promising results in simulation and controlled experiments, they yet remain opaque and offer limited insight into the agent’s internal decision-making, creating gaps in transparency, trust, and safety that hinder real-world deployment. The internal policy structure and task-specific specialization remain poorly understood. To address these gaps, we analyze the internal structure of a pretrained multi-task reinforcement learning network in the HoloOcean simulator for underwater navigation by identifying and comparing task-specific subnetworks responsible for navigating toward different species. We find that in a contextual multi-task reinforcement learning setting with related tasks, the network uses only about 1.5% of its weights to differentiate between tasks. Of these, approximately 85% connect the context-variable nodes in the input layer to the next hidden layer, highlighting the importance of context variables in such settings. Our approach provides insights into shared and specialized network components, useful for efficient model editing, transfer learning, and continual learning for underwater monitoring through a contextual multi-task reinforcement learning method.

[AI-18] To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

【速读】:该论文旨在解决decoder-only transformer模型在执行抽象符号推理任务(如命题逻辑推理)时,对未见过的变量名称无法泛化的问题。研究表明,除生成未见标记困难外,一个关键机制问题是:训练过程中未见标记的解嵌入(unembedding)向量会发生表征坍缩(representational collapse),即多个未见变量的解嵌入向量趋近于相同向量,导致模型难以区分这些变量。解决方案的关键在于结合三项技术:1)引入小规模架构调整以增强复制能力;2)提升训练数据多样性;3)冻结或周期性重置(un)嵌入参数。这一组合策略有效缓解了表征坍缩问题,从而实现了对未见标记的泛化能力,并通过大量受控实验验证了其有效性。

链接: https://arxiv.org/abs/2604.21632
作者: Nevena Lazić,Liam Fowl,András György,Csaba Szepesvári
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training. The collapse makes distinguishing multiple unseen variables difficult for the model (especially when the embedding and unembedding parameters are shared), and provides a mechanistic explanation for the effectiveness of existing heuristic interventions like “active forgetting”, which periodically reset the token (un)embeddings. Based on these observations, we devise a combination of techniques, involving a small architecture change facilitating copying, data diversity, and freezing or resetting (un)embeddings, that achieves generalization to unseen tokens. We support our claims with extensive controlled experiments on propositional logic reasoning problems. Beyond synthetic experiments, we also observe evidence of (un)embedding collapse in the open-weight models in the Gemma 3 family, which includes 99 unused tokens reserved for downstream use. Empirically we find that the correlated embeddings of these tokens are a poor initialization for finetuning applications.
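一个极简模拟可以说明"从未作为训练目标的 token 行为何坍缩":在带权重衰减的交叉熵下,未见 token 的解嵌入行只收到沿当前隐状态方向的"下压"梯度,因而都漂向同一不动点。以下为本文的玩具演示,并非作者的实验设置:

```python
import random
from math import exp, sqrt

random.seed(1)
DIM, VOCAB, UNSEEN = 8, 12, range(6, 12)   # tokens 6..11 never appear as targets

U = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]  # unembeddings
HIDDENS = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]

def mean_pairwise_cos(rows):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))
    pairs = [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))]
    return sum(cos(rows[i], rows[j]) for i, j in pairs) / len(pairs)

before = mean_pairwise_cos([U[i] for i in UNSEEN])

lr, wd = 0.3, 0.1
for step in range(3000):
    h = HIDDENS[step % len(HIDDENS)]
    target = step % 6                       # targets drawn from seen tokens only
    logits = [sum(u_d * h_d for u_d, h_d in zip(u, h)) for u in U]
    m = max(logits)
    probs = [exp(l - m) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    for i in range(VOCAB):                  # cross-entropy gradient + weight decay
        coef = probs[i] - (1.0 if i == target else 0.0)
        U[i] = [u_d - lr * (coef * h_d + wd * u_d) for u_d, h_d in zip(U[i], h)]

after = mean_pairwise_cos([U[i] for i in UNSEEN])
```

训练后未见 token 行的平均两两余弦相似度接近 1,而初始时接近 0;这也直观解释了为何周期性重置("active forgetting")这些行能打破坍缩。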

[AI-19] Promoting Simple Agents : Ensemble Methods for Event-Log Prediction

【速读】:该论文旨在解决流式事件日志中下一活动预测(next-activity prediction)任务的模型效率与性能权衡问题。传统神经网络架构(如LSTM、Transformer)虽具高精度,但资源消耗大且性能不稳定;而轻量级基于自动机的n-gram模型虽然计算开销低,但准确率相对较低。解决方案的关键在于:提出一种动态选择机制——推广算法(promotion algorithm),在推理阶段根据运行时状态在两个活跃模型之间进行切换,从而在不显著增加延迟和内存消耗的前提下,提升n-gram模型的准确性,使其达到甚至超过非窗口化神经模型的水平,同时保持稳定性和低计算成本。

链接: https://arxiv.org/abs/2604.21629
作者: Benedikt Bollig,Matthias Függer,Thomas Nowak,Paul Zeinaty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:We compare lightweight automata-based models (n-grams) with neural architectures (LSTM, Transformer) for next-activity prediction in streaming event logs. Experiments on synthetic patterns and five real-world process mining datasets show that n-grams with appropriate context windows achieve comparable accuracy to neural models while requiring substantially fewer resources. Unlike windowed neural architectures, which show unstable performance patterns, n-grams provide stable and consistent accuracy. While we demonstrate that classical ensemble methods like voting improve n-gram performance, they require running many agents in parallel during inference, increasing memory consumption and latency. We propose an ensemble method, the promotion algorithm, that dynamically selects between two active models during inference, reducing overhead compared to classical voting schemes. On real-world datasets, these ensembles match or exceed the accuracy of non-windowed neural models with lower computational cost.
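两个组成部分的极简版本:基于计数的 n-gram 下一活动预测器,以及在两个模型之间按滚动准确率切换的推广(promotion)规则。窗口大小与切换条件均为示意性选择:

```python
from collections import defaultdict, deque, Counter

class NGram:
    """Count-based next-activity model with an (n-1)-symbol context window."""
    def __init__(self, n):
        self.n, self.counts = n, defaultdict(Counter)
    def predict(self, history):
        c = self.counts[tuple(history[-(self.n - 1):])]
        return c.most_common(1)[0][0] if c else None
    def update(self, history, nxt):
        self.counts[tuple(history[-(self.n - 1):])][nxt] += 1

def promotion_stream(stream, active, challenger, window=10):
    """Emit only the active model's prediction each step; both models learn,
    and the challenger is promoted when its rolling accuracy pulls ahead."""
    rolling = {id(active): deque(maxlen=window), id(challenger): deque(maxlen=window)}
    hits = evals = 0
    hist = []
    for nxt in stream:
        if hist:
            evals += 1
            hits += active.predict(hist) == nxt
            for m in (active, challenger):
                rolling[id(m)].append(m.predict(hist) == nxt)
            if sum(rolling[id(challenger)]) > sum(rolling[id(active)]):
                active, challenger = challenger, active   # promotion
        for m in (active, challenger):
            m.update(hist, nxt)
        hist.append(nxt)
    return hits / evals

accuracy = promotion_stream(list("abcabcabcabcabcabcabc"), NGram(2), NGram(3))
```

推理时只需维护两个活跃模型与两个滚动窗口,相比并行运行多个投票模型,内存与延迟开销都更低。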

[AI-20] Using ASP(Q) to Handle Inconsistent Prioritized Data KR2026

【速读】:该论文旨在解决优先级数据中不一致性的容忍查询问题,即在存在冲突事实的情况下,如何基于优先级关系定义最优修复(Pareto-最优、全局最优和完成最优),并在此基础上实现对逻辑理论的查询回答。其解决方案的关键在于引入扩展的答案集编程(Answer Set Programming with Quantifiers, ASP(Q)),利用三种最优修复机制构建不同语义(AR、Brave 和 IAR)下的查询计算框架,并首次实现了全局最优修复语义以及一种可 tractable 的近似语义——接地语义(grounded semantics),从而在多项式层次的第一或第二层上实现高效查询推理。

链接: https://arxiv.org/abs/2604.21603
作者: Meghyn Bienvenu,Camille Bourgaux,Robin Jean,Giuseppe Mazzotta
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: This is an extended version of a paper appearing at the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR 2026). 21 pages

点击查看摘要

Abstract:We explore the use of answer set programming (ASP) and its extension with quantifiers, ASP(Q), for inconsistency-tolerant querying of prioritized data, where a priority relation between conflicting facts is exploited to define three notions of optimal repairs (Pareto-, globally- and completion-optimal). We consider the variants of three well-known semantics (AR, brave and IAR) that use these optimal repairs, and for which query answering is in the first or second level of the polynomial hierarchy for a large class of logical theories. Notably, this paper presents the first implementation of globally-optimal repair-based semantics, as well as the first implementation of the grounded semantics, which is a tractable under-approximation of all these optimal repair-based semantics. Our experimental evaluation sheds light on the feasibility of computing answers under globally-optimal repair semantics and the impact of adopting different semantics, approximations, and encodings.
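为简洁起见忽略优先级,三种查询语义可以直接在子集极大修复上陈述:brave(在某个修复中成立)、AR(在所有修复中成立)与 IAR(在修复的交集中成立)。下述暴力枚举仅为玩具示意;论文中的 ASP(Q) 编码以声明式方式完成这一计算,并处理最优修复的细化语义:

```python
from itertools import combinations

def repairs(facts, conflicts):
    """All subset-maximal conflict-free subsets of `facts`."""
    def ok(s):
        return not any(set(c) <= s for c in conflicts)
    candidates = [set(c) for r in range(len(facts), -1, -1)
                  for c in combinations(facts, r) if ok(set(c))]
    return [s for s in candidates if not any(s < t for t in candidates)]

facts = {"hired(ann)", "fired(ann)", "dept(ann,it)"}
conflicts = [("hired(ann)", "fired(ann)")]          # mutually exclusive facts
reps = repairs(facts, conflicts)

brave = {f for f in facts if any(f in r for r in reps)}
ar    = {f for f in facts if all(f in r for r in reps)}  # for ground atoms, AR == IAR
iar   = set.intersection(*reps)
```

这里的两个修复分别保留 hired 或 fired;dept 事实在两者中都存活,因此是唯一的 IAR 答案,而 hired 仅是 brave 答案。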

[AI-21] On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification

【速读】:该论文旨在解决基于忆阻器(memristor)的回声状态网络(Reservoir Computing, RC)在实际应用中因器件特性(如衰减率、量化精度和离散性)导致性能不稳定的问题,尤其关注如何在低资源约束下实现可靠的时间序列处理与模式识别。其解决方案的关键在于:通过分析并优化一种并行延迟反馈网络(Parallel Delayed Feedback Network, PDFN)架构中挥发性忆阻器的行为机制,提出结合预处理方法提升数据在储层中的表征能力,并有效缓解器件变异对系统性能的影响,从而在MNIST分类任务中达到95.89%的准确率,且在20%器件变异条件下仍保持高达94.2%的鲁棒性,验证了挥发性忆阻器作为紧凑、高速、低功耗类脑计算核心元件的可行性。

链接: https://arxiv.org/abs/2604.21602
作者: Rishona Daniels,Duna Wattad,Ronny Ronen,David Saad,Shahar Kvatinsky
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Accepted for publication in Advanced Electronic Materials. Main text: pages 1-32, 11 figures. Supporting information: pages 24-32, 11 figures

点击查看摘要

Abstract:Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, as their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated with several memristive devices, a comprehensive evaluation of device-level requirements remains limited. In this paper, we analyze and explain the operation of a parallel delayed feedback network (PDFN) RC architecture with volatile memristors, focusing on how device characteristics – such as decay rate, quantization, and variability – affect reservoir performance. We further discuss strategies to improve data representation in the reservoir using preprocessing methods and suggest potential improvements. The proposed approach achieves 95.89% classification accuracy on MNIST, comparable with the best reported memristor-based RC implementations. Furthermore, the method maintains high robustness under 20% device variability, achieving an accuracy of up to 94.2%. These results demonstrate that volatile memristors can support reliable spatio-temporal information processing and reinforce their potential as key building blocks for compact, high-speed, and energy-efficient neuromorphic computing systems.
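挥发性忆阻器的计算角色可以粗略刻画为"带泄漏的状态":每个节点保留输入历史的指数衰减痕迹,这正是储层所需的短时记忆。下例中的衰减常数为任意取值;论文研究的是真实器件的衰减率、量化与变异如何在 PDFN 架构中影响该行为:

```python
def volatile_node(inputs, decay=0.3):
    """State trace of a single leaky (volatile) reservoir node:
    s_t = (1 - decay) * s_{t-1} + u_t."""
    s, trace = 0.0, []
    for u in inputs:
        s = (1 - decay) * s + u
        trace.append(s)
    return trace

pulse = [1.0, 0.0, 0.0, 0.0]
trace = volatile_node(pulse)
# After the pulse the state decays geometrically: 1, 0.7, 0.49, 0.343
```

衰减率决定节点"记住"多久之前的输入:衰减过快则失去时序信息,过慢则状态被旧输入淹没,这正是论文中器件级参数影响储层性能的直观来源。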

[AI-22] DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

【Quick Read】: This paper targets the dependence of multi-agent code-generation frameworks on human-provided public test cases. This dependence not only limits applicability to real-world software engineering, where a priori input-output examples are rarely available, but also induces an "overconfidence gap" on hidden evaluations: models overfit to simplistic test examples and fail to generalize. The key innovation of the proposed DryRUN framework is enabling the LLM to autonomously generate valid inputs and simulate execution traces for self-correction, achieving efficient and accurate code generation and debugging without external test cases or execution feedback, while substantially reducing output token consumption.

Link: https://arxiv.org/abs/2604.21598
Authors: Kaushitha Silva,Srinath Perera
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures

Click to view abstract

Abstract:Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle. Because ground-truth input-output examples are rarely available prior to implementation in real-world software engineering, this dependency restricts methods to curated competitive programming benchmarks. Furthermore, we identify that reliance on these public tests induces an "overconfidence gap," causing frameworks to overfit to simplistic examples and fail on hidden evaluations. In contrast, we observe that external sample inputs are not strictly necessary for code generation. We demonstrate that large language models can autonomously generate valid inputs and simulate execution traces to self-correct. Consequently, we develop DryRUN, a framework that eliminates the need for ground-truth samples by allowing the LLM to iteratively plan, autonomously generate its own inputs and simulate execution, mitigating algorithmic overconfidence. Evaluations on the LiveCodeBench v6 dataset (post-March 2025) demonstrate that DryRUN matches performance against CodeSIM, a state-of-the-art and public-test-dependent framework, while operating entirely without public test cases or external execution feedback and reducing output token consumption.

[AI-23] CoFEE: Reasoning Control for LLM-Based Feature Discovery

【Quick Read】: This paper addresses a fundamental reasoning challenge in feature discovery from complex unstructured data: identifying abstract features predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. Although large language models (LLMs) can process vast amounts of information, unconstrained feature generation often yields weak features. The key to the solution is CoFEE (Cognitive Feature Engineering Engine), a reasoning-control framework that enforces human-like cognitive behaviors during feature discovery, such as backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths, thereby imposing structured inductive biases on the candidate feature space. Experiments show that inducing these cognitive behaviors markedly improves empirical predictability: CoFEE achieves an average Success Rate Score 15.2% higher than the baseline while generating 29% fewer features at 53.3% lower cost.

Link: https://arxiv.org/abs/2604.21584
Authors: Maximilian Westermann,Ben Griffin,Aaron Ontoyin Yin,Zakari Salifu,Yagiz Ihlamur,Kelvin Amoaba,Joseph Ternasky,Fuat Alican,Yigit Ihlamur
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.

[AI-24] A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

【Quick Read】: This paper addresses the problem of inflated performance estimates in LLM-based automated program repair (APR) caused by data leakage: when evaluation benchmarks overlap with an LLM's pretraining data, the model may produce repair patches by memorization rather than genuine reasoning, distorting assessments of its true capability. The key to the solution is combining metamorphic testing (MT) with negative log-likelihood (NLL): semantics-preserving transformations are applied to construct variant benchmarks that expose performance degradation, while NLL serves as a proxy for memorization, enabling more reliable detection and quantification of data leakage. Empirically, all evaluated LLMs show substantial performance drops on the variant benchmarks, and these drops correlate strongly with NLL on the original benchmarks, validating the approach.

Link: https://arxiv.org/abs/2604.21579
Authors: Milan De Koning,Ali Asgari,Pouria Derakhshanfar,Annibale Panichella
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages

Click to view abstract

Abstract:LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization. We construct variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java. Using these benchmarks, we evaluate the repair success rates of seven LLMs on both original and transformed versions, and analyze the relationship between performance degradation and NLL. Our results show that all evaluated state-of-the-art LLMs exhibit substantial drops in patch generation success rates on transformed benchmarks, ranging from -4.1% for GPT-4o to -15.98% for Llama-3.1. Furthermore, we find that this degradation strongly correlates with NLL on the original benchmarks, suggesting that models perform better on instances they are more likely to have memorized. These findings show that combining MT with NLL provides stronger and more reliable evidence of data leakage, while metamorphic testing alone can help mitigate its effects in LLM-based APR evaluations.
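The semantics-preserving transformations above can be illustrated with a minimal sketch (a hypothetical example, not the paper's actual toolchain): consistent identifier renaming leaves program behavior unchanged while breaking verbatim overlap with pretraining data.

```python
import re

def rename_identifiers(code: str, mapping: dict) -> str:
    """Apply a consistent identifier renaming, a semantics-preserving
    transformation: program behavior is unchanged, but the surface form
    (and hence any verbatim memorization) is broken."""
    for old, new in mapping.items():
        # Replace whole-word occurrences only, so e.g. 'sum' does not
        # corrupt 'checksum'.
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

# Hypothetical buggy-snippet example (Java-like source as a plain string).
original = "int sum = 0; for (int i = 0; i < n; i++) sum += a[i];"
variant = rename_identifiers(original, {"sum": "total", "i": "idx", "a": "values"})
print(variant)
# -> int total = 0; for (int idx = 0; idx < n; idx++) total += values[idx];
```

A variant benchmark built this way keeps the same bugs and fixes, so any drop in repair success rate points at memorization of the original surface form rather than reasoning ability.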

[AI-25] Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

【Quick Read】: This paper tackles the difficulty of removing user data from trained models: conventional training embeds user information directly into shared weights, making per-user data deletion computationally infeasible and exposing privacy risks such as model inversion, membership inference, and training-data extraction. The key to the solution is a three-layer architecture: a static base model; composable domain-expert LoRA adapters that shape behavior without imparting user data; and per-user proxy artefacts whose deletion constitutes deterministic machine unlearning. Because user-specific information never enters shared weights, the design mitigates privacy attacks on shared components by construction, converts an otherwise intractable weight-editing problem into an efficient deterministic deletion operation, and remains compatible with differentially private SGD (DP-SGD), providing privacy guarantees alongside personalization.

Link: https://arxiv.org/abs/2604.21571
Authors: Chris Schneider,Philipp Schoenegger,Ben Bariach
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-layer architecture that decouples personal data from shared weights by combining a static base model, composable domain-expert LoRA adapters that shape behavior without imparting user data, and per-user proxy artefacts whose deletion constitutes deterministic unlearning. Evaluation on Phi-3.5-mini and Llama-3.1-8B confirms per-user differentiation in which personal data influences outputs while remaining isolated, verified by a return to baseline after proxy removal (KL divergence of approximately 0.21 nats, 82-89% verification pass rate) and near-zero cross-user contamination. Because user-specific information never enters shared weights, the architecture mitigates model inversion, membership inference, and training-data extraction against shared model components by construction. The approach converts machine unlearning from an intractable weight-editing problem into a deterministic deletion operation that preserves personalization alongside privacy-enhancing guarantees and is compatible with differentially private stochastic gradient descent (DP-SGD) for privacy-preserving shared model improvement.
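The deletion-as-unlearning idea can be sketched with a toy class (hypothetical names and structure; not the paper's implementation): user data lives only in a per-user proxy, so deleting that proxy deterministically restores the shared baseline.

```python
class PersonalizedService:
    """Toy sketch of the three-layer separation: a frozen shared base,
    shared user-data-free domain adapters, and per-user proxy artefacts
    whose deletion is deterministic unlearning."""

    def __init__(self):
        self.base = "base-model-v1"                  # static shared weights, never touched by user data
        self.domain_adapters = ["legal", "medical"]  # composable adapters, no user data
        self.user_proxies = {}                       # per-user artefacts, isolated from shared weights

    def personalize(self, user_id, facts):
        self.user_proxies[user_id] = list(facts)

    def respond(self, user_id, prompt):
        context = self.user_proxies.get(user_id, [])
        return f"{self.base}+{'+'.join(self.domain_adapters)} | ctx={context} | {prompt}"

    def unlearn(self, user_id):
        # Deterministic unlearning: deleting the proxy returns this user's
        # outputs to the shared baseline; no retraining, no weight edits.
        self.user_proxies.pop(user_id, None)

svc = PersonalizedService()
baseline = svc.respond("alice", "hi")
svc.personalize("alice", ["prefers metric units"])
assert svc.respond("alice", "hi") != baseline   # personalization changes output
svc.unlearn("alice")
assert svc.respond("alice", "hi") == baseline   # exact return to baseline
```

The shared components never see the per-user facts, which is the structural property behind the paper's "return to baseline after proxy removal" verification.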

[AI-26] Hybrid Deep Learning Approach for Coupled Demand Forecasting and Supply Chain Optimization

【Quick Read】: This paper addresses the disconnect between forecasting and optimization in supply chains with volatile demand and uncertain supply, such as textiles and personal protective equipment (PPE), where the two are traditionally treated in isolation, limiting real-world effectiveness. The key to the solution is the Hybrid AI Framework for Demand-Supply Forecasting and Optimization (HAF-DS), which couples an LSTM-based demand forecasting module with a mixed-integer linear programming (MILP) optimization layer to jointly minimize forecasting error and operational cost: the LSTM captures temporal and contextual demand dependencies to improve forecast accuracy, while the MILP layer prescribes cost-efficient replenishment and allocation decisions, coordinated through embedding-based feature representations and recurrent neural architectures, significantly improving supply-chain accuracy and efficiency.

Link: https://arxiv.org/abs/2604.21567
Authors: Nusrat Yasmin Nadia,Md Habibul Arif,Habibor Rahman Rabby,Md Iftekhar Monzur Tanvir,Md. Jakir Hossen,M. F. Mridha
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The paper is accepted in the Computers, Materials & Continua journal

Click to view abstract

Abstract:Supply chain resilience and efficiency are vital in industries characterized by volatile demand and uncertain supply, such as textiles and personal protective equipment (PPE). Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness. This paper proposes a Hybrid AI Framework for Demand-Supply Forecasting and Optimization (HAF-DS), which integrates a Long Short-Term Memory (LSTM)-based demand forecasting module with a mixed integer linear programming (MILP) optimization layer. The LSTM captures temporal and contextual demand dependencies, while the optimization layer prescribes cost-efficient replenishment and allocation decisions. The framework jointly minimizes forecasting error and operational cost through embedding-based feature representation and recurrent neural architectures. Experiments on textile sales and supply chain datasets show significant performance gains over statistical and deep learning baselines. On the combined dataset, HAF-DS reduced Mean Absolute Error (MAE) from 15.04 to 12.83 (14.7%), Root Mean Squared Error (RMSE) from 19.53 to 17.11 (12.4%), and Mean Absolute Percentage Error (MAPE) from 9.5% to 8.1%. Inventory cost decreased by 5.4%, stockouts by 27.5%, and service level rose from 95.5% to 97.8%. These results confirm that coupling predictive forecasting with prescriptive optimization enhances both accuracy and efficiency, providing a scalable and adaptable solution for modern textile and PPE supply chains.

[AI-27] Probabilistic Verification of Neural Networks via Efficient Probabilistic Hull Generation

【Quick Read】: This paper addresses probabilistic verification of neural networks: computing the probability that the output satisfies safety constraints when the input follows a probability distribution, which matters when inputs are subject to disturbances commonly modeled as random variables. The key to the solution is a novel verification framework that computes a guaranteed range for the safe probability by efficiently finding safe and unsafe probabilistic hulls. Its core innovations are: (1) a state-space subdivision strategy based on regression trees to produce probabilistic hulls; (2) a boundary-aware sampling method that identifies the safety boundary in the input space and supplies samples for building the regression trees; and (3) iterative refinement with probabilistic prioritization to tighten the bounds on the safe probability. On benchmarks including ACAS Xu and a rocket lander controller, the approach shows a clear advantage over the state of the art in both accuracy and efficiency.

Link: https://arxiv.org/abs/2604.21556
Authors: Jingyang Li,Xin Chen,Hongfei Fu,Guoqiang Li
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 22 pages, 5 figures

Click to view abstract

Abstract:The problem of probabilistic verification of a neural network investigates the probability of satisfying the safe constraints in the output space when the input is given by a probability distribution. It is significant to answer this problem when the input is affected by disturbances often modeled by probabilistic variables. In the paper, we propose a novel neural network probabilistic verification framework which computes a guaranteed range for the safe probability by efficiently finding safe and unsafe probabilistic hulls. Our approach consists of three main innovations: (1) a state space subdivision strategy using regression trees to produce probabilistic hulls, (2) a boundary-aware sampling method which identifies the safety boundary in the input space using samples that are later used for building regression trees, and (3) iterative refinement with probabilistic prioritization for computing a guaranteed range for the safe probability. The accuracy and efficiency of our approach are evaluated on various benchmarks including ACAS Xu and a rocket lander controller. The result shows an obvious advantage over the state of the art.
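For contrast with the guaranteed ranges computed by the paper's probabilistic hulls, a plain Monte Carlo baseline only yields a high-probability (Hoeffding) interval for the safe probability. A minimal sketch with a hypothetical one-dimensional "network":

```python
import math
import random

def mc_safe_probability(net, sample_input, is_safe, n=10_000, delta=0.05):
    """Baseline Monte Carlo estimate of the safe probability, with a
    Hoeffding confidence interval. Unlike probabilistic hulls, this gives
    only a bound that holds with probability 1 - delta, not a guaranteed range."""
    safe = sum(is_safe(net(sample_input())) for _ in range(n))
    p_hat = safe / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# Hypothetical toy "network": y = 2x; input x ~ U(0, 1); safe iff y < 1,
# so the true safe probability is exactly 0.5.
random.seed(0)
net = lambda x: 2 * x
lo, hi = mc_safe_probability(net, random.random, lambda y: y < 1.0, n=20_000)
print(round(lo, 3), round(hi, 3))
```

The interval shrinks only as O(1/sqrt(n)), and it can miss rare unsafe regions entirely, which is why the paper pursues deterministic hull-based bounds instead.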

[AI-28] Engaged AI Governance: Addressing the Last Mile Challenge Through Internal Expert Collaboration

【Quick Read】: This paper addresses the "last mile" challenge of implementing the EU AI Act in software development practice: translating high-level AI governance requirements into concrete team-level strategies. The key to the solution is a legal-text-to-action pipeline realized through internal expert collaboration, covering requirement extraction from legal text, practitioner assessment and ideation, and collective prioritization. The study reveals three patterns in how practitioners perceive regulatory requirements (convergence, existing practice, and disconnection) and emphasizes transforming governance work from external imposition into shared team ownership, so that compliance becomes integrated with system quality and user protection goals.

Link: https://arxiv.org/abs/2604.21554
Authors: Simon Jarvers,Orestis Papakyriakopoulos
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Under the EU AI Act, translating AI governance requirements into software development practice remains challenging. While AI governance frameworks exist at industry and organizational levels, empirical evidence of team-level implementation is scarce. We address this “Last Mile” Challenge through insider action research embedded within an AI startup. We present a legal-text-to-action pipeline that translates EU AI Act requirements into actionable strategies through internal expert collaboration by extracting requirements from legal text, engaging practitioners in assessment and ideation, and prioritizing implementation through collective evaluation. Our analysis reveals three patterns in how practitioners perceive regulatory requirements: convergence (compliance aligns with development priorities), existing practice (current work already satisfies requirements), and disconnection (requirements perceived as administrative overhead). Based on these patterns, we discuss when governance might be treated genuinely or performatively. Practitioners prioritize requirements that serve end-users or their own development needs, but view verification-oriented requirements as box-ticking exercises. This distinction suggests a translation challenge: regulatory requirements risk superficial treatment unless practitioners understand how compliance serves system quality and user protection. Expert collaboration offers a practical mechanism for transforming governance from external imposition to shared ownership and making previously invisible governance work visible and collective.

[AI-29] Unbiased Prevalence Estimation with Multicalibrated LLMs

【Quick Read】: This paper addresses biased prevalence estimation under covariate shift when imperfect measurement devices (such as classifiers or large language models) are used. Standard methods assume device error rates are stable across populations; under covariate shift this assumption fails, yielding systematically biased estimates. The key to the solution is multicalibration, i.e., calibration conditional on input features rather than only on average, which is sufficient for unbiased prevalence estimation under covariate shift. Both theory and experiments show that multicalibration substantially outperforms standard calibration and quantification methods, effectively reducing bias in simulations and in real applications.

Link: https://arxiv.org/abs/2604.21549
Authors: Fridolin Linder,Thomas Leeper,Daniel Haimovich,Niek Tax,Lorenzo Perini,Milan Vojnovic
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications – estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM – demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.
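The failure mode targeted here can be demonstrated with the classical Rogan-Gladen prevalence correction (a standard baseline, not the paper's multicalibrated estimator; the rates below are illustrative assumptions): correcting with error rates estimated on the source population stays biased when covariate shift changes those rates on the target population.

```python
import numpy as np

def corrected_prevalence(pred_positive_rate, tpr, fpr):
    """Rogan-Gladen correction: unbiased only when the device's TPR/FPR on
    the target population match the rates plugged in here."""
    return (pred_positive_rate - fpr) / (tpr - fpr)

rng = np.random.default_rng(0)

# Assumed scenario: the classifier was characterized on a source population
# (TPR=0.95, FPR=0.05), but covariate shift degrades its error rates on the
# target population to TPR=0.7, FPR=0.2. True target prevalence is 0.3.
true_prev = 0.3
n = 200_000
y = rng.random(n) < true_prev
pred = np.where(y, rng.random(n) < 0.7, rng.random(n) < 0.2)  # shifted rates

naive = pred.mean()  # expected ~ 0.3*0.7 + 0.7*0.2 = 0.35
corrected_with_stale_rates = corrected_prevalence(naive, tpr=0.95, fpr=0.05)  # biased
corrected_with_true_rates = corrected_prevalence(naive, tpr=0.7, fpr=0.2)     # ~ 0.3
print(round(naive, 3), round(corrected_with_stale_rates, 3), round(corrected_with_true_rates, 3))
```

Multicalibration sidesteps the need to know the target-specific rates by enforcing calibration conditional on the features along which the populations differ, so the same calibrated scores stay valid after the shift.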

[AI-30] The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks

【Quick Read】: This paper addresses critical-node identification in complex networks, specifically in bipartite dependency networks: efficiently finding a set of k contributors whose removal isolates the largest number of items. Existing methods handle this all-or-nothing coverage mechanics poorly and lack theoretical guarantees. The key to the solution is modeling the problem as a coalitional game and deriving ShapleyCov, a closed-form centrality based on the Shapley value that quantifies each contributor's expected impact, together with MinCov, a linear-time iterative peeling algorithm that explicitly accounts for connection redundancy by prioritizing contributors who uniquely support many items. Experiments on large real-world datasets (including a Wikipedia graph with over 250 million edges) show that the approach significantly outperforms traditional baselines, achieves near-optimal performance, and remains highly efficient.

Link: https://arxiv.org/abs/2604.21537
Authors: Sebastiano A. Piccolo,Andrea Tagarelli
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Click to view abstract

Abstract:Identifying critical nodes in complex networks is a fundamental task in graph mining. Yet, methods addressing an all-or-nothing coverage mechanics in a bipartite dependency network, a graph with two types of nodes where edges represent dependency relationships across the two groups only, remain largely unexplored. We formalize the CriticalSet problem: given an arbitrary bipartite graph modeling dependencies of items on contributors, identify the set of k contributors whose removal isolates the largest number of items. We prove that this problem is NP-hard and requires maximizing a supermodular set function, for which standard forward greedy algorithms provide no approximation guarantees. Consequently, we model CriticalSet as a coalitional game, deriving a closed-form centrality, ShapleyCov, based on the Shapley value. This measure can be interpreted as the expected number of items isolated by a contributor’s departure. Leveraging these insights, we propose MinCov, a linear-time iterative peeling algorithm that explicitly accounts for connection redundancy, prioritizing contributors who uniquely support many items. Extensive experiments on synthetic and large-scale real datasets, including a Wikipedia graph with over 250 million edges, reveal that MinCov and ShapleyCov significantly outperform traditional baselines. Notably, MinCov achieves near-optimal performance, within 0.02 AUC of a Stochastic Hill Climbing metaheuristic, while remaining several orders of magnitude faster.
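The all-or-nothing coverage mechanics of the CriticalSet objective can be shown with a simplified greedy sketch (illustrative only; MinCov itself is a linear-time peeling algorithm, and the toy network below is hypothetical):

```python
def critical_set_greedy(deps, k):
    """Greedy sketch of the CriticalSet objective: pick k contributors whose
    removal isolates the most items. `deps` maps each item to the set of
    contributors it depends on; an item is isolated only when ALL of its
    supporters are removed (all-or-nothing coverage)."""
    removed = set()
    for _ in range(k):
        contributors = {c for s in deps.values() for c in s} - removed
        # Score each candidate by the items it would newly isolate, i.e.
        # items whose only remaining supporter is that candidate; this is
        # the "uniquely supports many items" intuition behind MinCov.
        def newly_isolated(c):
            return sum(1 for s in deps.values() if s - removed == {c})
        best = max(contributors, key=newly_isolated, default=None)
        if best is None:
            break
        removed.add(best)
    return removed

# Hypothetical toy network: items -> supporting contributors.
deps = {
    "item1": {"A"},            # A uniquely supports item1
    "item2": {"A", "B"},
    "item3": {"B"},            # B uniquely supports item3
    "item4": {"C", "D"},       # redundantly supported, hard to isolate
}
print(sorted(critical_set_greedy(deps, k=2)))  # -> ['A', 'B']
```

Note how redundancy protects item4: removing either C or D alone isolates nothing, which is exactly why methods that ignore redundancy perform poorly on this objective.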

[AI-31] Satisfying Rationality Postulates of Structured Argumentation Through Deductive Support – Technical Report

【Quick Read】: This paper addresses the difficulty of ASPIC-style structured argumentation frameworks in satisfying all five key rationality postulates simultaneously in the presence of undercuts, especially under credulous semantics such as preferred semantics. The key to the solution is the Deductive ASPIC^\ominus framework, which integrates gen-rebuttals from ASPIC^\ominus with the Joint Support Bipolar Argumentation Frameworks (JSBAFs) of Deductive ASPIC-, incorporating preferences, and thereby satisfies all five rationality postulates under a variant of preferred semantics.

Link: https://arxiv.org/abs/2604.21515
Authors: Marcos Cramer,Tom Friese
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:ASPIC-style structured argumentation frameworks provide a formal basis for reasoning in artificial intelligence by combining internal argument structure with abstract argumentation semantics. A key challenge in these frameworks is ensuring compliance with five critical rationality postulates: closure, direct consistency, indirect consistency, non-interference, and crash-resistance. Recent approaches, including ASPIC^\ominus and Deductive ASPIC-, have made significant progress but fall short of meeting all postulates simultaneously under a credulous semantics (e.g. preferred) in the presence of undercuts. This paper introduces Deductive ASPIC^\ominus, a novel framework that integrates gen-rebuttals from ASPIC^\ominus with the Joint Support Bipolar Argumentation Frameworks (JSBAFs) of Deductive ASPIC-, incorporating preferences. We show that Deductive ASPIC^\ominus satisfies all five rationality postulates under a version of preferred semantics. This work opens new avenues for further research on robust and logically sound structured argumentation systems.

[AI-32] BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

【Quick Read】: This paper addresses the challenge of automated extraction of protein-ligand bioactivity data, which is essential for drug discovery but has outpaced manual curation amid the explosive growth of the literature. Existing methods extract inefficiently and inaccurately because they must simultaneously interpret biochemical semantics distributed across text, tables, and figures while reconstructing chemically exact structures (e.g., Markush structures). The key to the proposed BioMiner framework is decoupling bioactivity semantic interpretation from ligand structure construction: the former proceeds via direct reasoning, while the latter follows a chemical-structure-grounded visual semantic reasoning paradigm in which multimodal large language models reason over chemically anchored visual representations to capture inter-structure relationships, delegating exact molecular construction to specialized chemistry tools. This design substantially improves the accuracy and practicality of automated extraction.

Link: https://arxiv.org/abs/2604.21508
Authors: Jiaxian Yan,Jintao Zhu,Yuhang Yang,Qi Liu,Kai Zhang,Zaixi Zhang,Xukai Liu,Boyan Zhang,Kaiyuan Gao,Jinchuan Xiao,Enhong Chen
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 20 pages, 5 figures, 1 table

Click to view abstract

Abstract:Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner’s practical utility is demonstrated via three applications: (1) extracting 82,262 data from 11,683 papers to build a pre-training database that improves downstream models performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data, helping 38.6% improvement over 28 QSAR models and identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and 5.75% accuracy improvement over manual workflows in PoseBusters dataset.

[AI-33] GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

【Quick Read】: This paper addresses the limitations of conventional well-log lithology classification, which treats the task as a static, single-step discriminative mapping lacking geological priors and evidence-based reasoning, yielding predictions detached from geological reality. The key to the solution is GeoMind, a tool-augmented agentic framework that models lithology classification as a sequential reasoning process. Its core consists of perception, reasoning, and analysis modules that respectively translate raw logs into semantic trends, infer lithology hypotheses from multi-source evidence, and verify predictions against stratigraphic constraints. A fine-grained process supervision strategy optimizes intermediate reasoning steps rather than only final outputs, ensuring logically consistent and geologically plausible decision trajectories.

Link: https://arxiv.org/abs/2604.21501
Authors: Yitong Zhou,Mingyue Cheng,Jiahao Wang,Qingyang Mao,Qi Liu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Lithology classification in well logs is a fundamental geoscience data mining task that aims to infer rock types from multi dimensional geophysical sequences. Despite recent progress, existing approaches typically formulate the problem as a static, single-step discriminative mapping. This static paradigm limits evidence-based diagnostic reasoning against geological standards, often yielding predictions that are detached from geological reality due to a lack of domain priors. In this work, we propose GeoMind, a tool-augmented agentic framework that models lithology classification as a sequential reasoning process. GeoMind organizes its toolkit into perception, reasoning, and analysis modules, which respectively translate raw logs into semantic trends, infer lithology hypotheses from multi-source evidence, and verify predictions against stratigraphic constraints. A global planner adaptively coordinates these modules based on input characteristics, enabling geologically plausible and evidence-grounded decisions. To guarantee the logical consistency of GeoMind, we introduce a fine-grained process supervision strategy. Unlike standard methods that focus solely on final outcomes, our approach optimizes intermediate reasoning steps, ensuring the validity of decision trajectories and alignment to geological constraints. Experiments on four benchmark well-log datasets demonstrate that GeoMind consistently outperforms strong baselines in classification performance while providing transparent and traceable decision-making processes.

[AI-34] MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

【Quick Read】: This paper addresses the high inference latency of diffusion-based trajectory-generation planners for autonomous driving, caused by iterative neural function evaluations. The key innovations of MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield) are: (1) a vectorized Sub-Graph encoder that captures environment context; (2) a variational autoencoder (VAE) that compresses expert trajectories into a 32-dimensional latent manifold; (3) an ultra-lightweight MLP-Mixer decoder that eliminates quadratic attention complexity; and (4) a latent-space drifting loss that shifts the complex distribution evolution entirely to the training phase, explicitly modeling attractive and repulsive forces so the model can synthesize proactive maneuvers, such as active overtaking, that are absent from the raw expert data. The method runs pure single-step inference, achieving state-of-the-art scores of 80.32 (non-reactive) and 82.21 (reactive) on the nuPlan test splits with an end-to-end latency of only 10.1 ms, an order-of-magnitude speedup over iterative diffusion planners.

Link: https://arxiv.org/abs/2604.21489
Authors: Yining Xing,Zehong Ke,Yiqian Tu,Zhiyuan Liu,Wenhao Yu,Jianqiang Wang
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 3 tables. Submitted to IEEE Robotics and Automation Letters (RA-L)

Click to view abstract

Abstract:Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity. Importantly, we introduce a latent-space drifting loss that shifts the complex distribution evolution entirely to the training phase. By formulating explicit attractive and repulsive forces, this mechanism empowers the model to synthesize novel, proactive maneuvers, such as active overtaking, that are virtually absent from the raw expert demonstrations. Extensive evaluations on the nuPlan benchmark demonstrate that MISTY achieves state-of-the-art results on the challenging Test14-hard split, with comprehensive scores of 80.32 and 82.21 in non-reactive and reactive settings, respectively. Operating at over 99 FPS with an end-to-end latency of 10.1 ms, MISTY offers an order-of-magnitude speedup over iterative diffusion planners while achieving significantly robust generation.

[AI-35] Efficient Agent Evaluation via Diversity-Guided User Simulation

【Quick Read】: This paper addresses the difficulty of evaluating the reliability of large language models (LLMs) deployed as customer-facing agents, where stochastic multi-turn interactions make evaluation computationally inefficient and deep failure modes hard to uncover. Existing protocols estimate success via linear Monte Carlo rollouts of complete agent-user conversations, repeatedly regenerating identical early prefixes and failing to cover failures caused by rare user behaviors. The key to the proposed DIVERT (Diversity-Induced Evaluation via Branching of Trajectories) framework is capturing full agent-environment state snapshots at critical decision points and resuming execution from them, reusing shared conversation prefixes to cut redundant computation, while branching from each junction with targeted, diversity-inducing user responses that steer exploration toward semantically diverse and under-explored interaction paths, improving both evaluation efficiency and coverage.

Link: https://arxiv.org/abs/2604.21480
Authors: Itay Nakash,George Kour,Ateret Anaby-Tavor
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.

[AI-36] Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms

【Quick Read】: This paper aims to improve treatment outcomes for complex diseases, where single-drug regimens often yield limited efficacy and drug resistance, by computationally predicting drug synergy for combination therapies. The core challenges are the structural bias, limited generalization, and poor interpretability of existing deep learning and graph neural network (GNN) methods. The key to the solution is ResGIN-Att, a synergy-prediction graph neural network that integrates molecular structural features, cell-line genomic profiles, and drug-drug interactions: a residual graph isomorphism network extracts multi-scale topological features while residual connections mitigate over-smoothing in deep layers; an adaptive long short-term memory (LSTM) module fuses structural information from local to global scales; and a cross-attention mechanism explicitly models drug-drug interactions and identifies key chemical substructures, substantially improving predictive performance, generalization, and interpretability.

Link: https://arxiv.org/abs/2604.21473
Authors: Jiyan Song,Wenyang Wang,Chengcheng Yan,Zhiquan Han,Feifei Zhao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability. To address these limitations, this paper proposes a collaborative prediction graph neural network that integrates molecular structural features and cell-line genomic profiles with drug-drug interactions to enhance the prediction of synergistic effects. We introduce a novel model named the Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). The model first extracts multi scale topological features of drug molecules using a residual graph isomorphism network, where residual connections help mitigate over-smoothing in deep layers. Subsequently, an adaptive Long Short-Term Memory (LSTM) module fuses structural information from local to global scales. Finally, a cross-attention module is designed to explicitly model drug-drug interactions and identify key chemical substructures. Extensive experiments on five public benchmark datasets demonstrate that ResGIN-Att achieves competitive performance, comparing favorably against key baseline methods while exhibiting promising generalization capability and robustness.

[AI-37] Dynamical Priors as a Training Objective in Reinforcement Learning

【速读】:该论文旨在解决标准强化学习(Reinforcement Learning, RL)在训练过程中缺乏对决策时序演化约束的问题,导致策略可能出现不连贯的行为,如信心突变、振荡或退化性静止。解决方案的关键在于引入动力学先验强化学习(Dynamical Prior Reinforcement Learning, DP-RL),通过一个由外部状态动力学驱动的辅助损失项,模拟证据积累和迟滞效应,从而在不改变奖励函数、环境或策略架构的前提下,引导动作概率在学习过程中的时序演化,使决策轨迹呈现任务依赖的结构化行为,而非简单的平滑处理。

链接: https://arxiv.org/abs/2604.21464
作者: Sukesh Subaharan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Supplementary material can be accessed here: this https URL

点击查看摘要

Abstract:Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task-dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision-making in RL agents.
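论文中辅助损失的思路大致如下:用一个外部状态动力学(这里以漏积分器示意证据积累与迟滞)生成参考轨迹,再惩罚策略动作概率随时间的演化偏离该轨迹。以下为示意性草图,`leak`、`gain` 等参数与具体动力学形式均为假设,并非论文的精确先验:

```python
# Hedged sketch of the DP-RL idea: an auxiliary loss ties the temporal evolution
# of an action probability to a reference trajectory produced by a simple
# evidence-accumulation dynamic (leaky integrator).

def prior_trajectory(evidence, leak=0.9, gain=0.1):
    """Leaky integrator: state accumulates evidence with hysteresis-like inertia."""
    state, traj = 0.0, []
    for e in evidence:
        state = leak * state + gain * e
        traj.append(state)
    return traj

def dynamical_prior_loss(action_probs, evidence):
    """Mean squared deviation between the policy's probability trace and the prior."""
    ref = prior_trajectory(evidence)
    return sum((p - r) ** 2 for p, r in zip(action_probs, ref)) / len(ref)

# In training this would be added to the policy-gradient objective, e.g.
# total_loss = pg_loss + lam * dynamical_prior_loss(probs, evidence).
loss = dynamical_prior_loss([0.1, 0.19, 0.27], [1.0, 1.0, 1.0])
```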

[AI-38] HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

【速读】:该论文旨在解决长视频理解中普遍存在的时空冗余问题以及跨长时间跨度的复杂叙事依赖关系,同时克服现有结构化表示方法在压缩视觉信息时牺牲时间连贯性的问题,以及多智能体框架因预设工作流僵化而无法根据问题需求动态调整推理策略的局限。其解决方案的关键在于提出一种分层多智能体框架HiCrew,核心创新包括:(1)设计一种混合树(Hybrid Tree)结构,在保留时间拓扑的同时对语义一致片段进行相关性引导的层次聚类;(2)开发面向问题的字幕生成机制(Question-Aware Captioning),通过意图驱动的视觉提示生成精准的语义描述;(3)引入规划层(Planning Layer),依据问题复杂度自适应选择智能体角色与执行路径,实现动态协作。该方案在EgoSchema和NExT-QA数据集上验证了有效性,尤其在时间推理与因果推理任务中表现突出。

链接: https://arxiv.org/abs/2604.21444
作者: Yuehan Zhu,Jingqi Zhao,Jiawen Zhao,Xudong Mao,Baoquan Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, pre-defined workflows that fail to adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate precision-oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT-QA validate the effectiveness of our approach, demonstrating strong performance across diverse question types with particularly pronounced gains in temporal and causal reasoning tasks that benefit from our hierarchical structure-preserving design.

[AI-39] Brief chatbot interactions produce lasting changes in human moral values

【速读】:该论文试图解决的问题是:人工智能(Artificial Intelligence, AI)聊天机器人在作为个人顾问时,是否会影响人类的道德判断(moral judgments),以及这种影响是否具有隐蔽性和持久性。其解决方案的关键在于采用一种自然情境下的组内设计(within-subject naturalistic paradigm),让53名参与者先对道德情景进行评分,随后与一个被引导改变道德立场的AI聊天机器人讨论四个场景,再与一个无干预控制代理讨论另外四个场景。研究发现,简短的对话即可显著引发道德判断的方向性变化(效应量Cohen’s d = 0.735–1.576),且两周随访中效应增强(Cohen’s d = 1.038–2.069),而控制组无变化,且参与者未察觉说服意图,说明AI可隐蔽、持续地重塑基础道德价值观,揭示了当前AI系统在伦理决策领域潜在的脆弱性。

链接: https://arxiv.org/abs/2604.21430
作者: Yue Teng,Qianer Zhong,Kim Mai Tich Nguyen Thordsen,Christian Montag,Benjamin Becker
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Moral judgements form the foundation of human social behavior and societal systems. While Artificial Intelligence chatbots increasingly serve as personal advisors, their influence on moral judgments remains largely unexplored. Here, we examined whether directive AI conversations shift moral evaluations using a within-subject naturalistic paradigm. Fifty-three participants rated moral scenarios, then discussed four with a chatbot prompted to shift moral judgments and four with a control agent. The brief conversations induced significant directional shifts in moral judgments, accepting stricter standards as well as advocating greater leniency (ps < 0.05; Cohen’s d = 0.735-1.576), with the effect strengthening during a two-week follow-up (Cohen’s d = 1.038-2.069). Critically, the control condition produced no changes, and the effects did not extend to punishment, while participants remained unaware of the persuasive intent and rated both agents as equally likable and convincing, suggesting a vulnerability to undetected and lasting manipulation of foundational moral values.

[AI-40] FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation ACL2026

【速读】:该论文旨在解决机器翻译质量评估(Quality Estimation, QE)模型中存在的系统性性别偏见问题,尤其是在性别模糊和性别明确场景下,现有QE模型倾向于偏好男性表述,甚至对性别错位的翻译给予更高评分。解决方案的关键在于提出FairQE框架,其核心机制包括:通过多智能体架构检测性别线索、生成性别翻转的翻译变体,并结合传统QE得分与大语言模型(LLM)驱动的偏见缓解推理,利用动态的偏见感知聚合机制进行融合。该设计在不牺牲整体评估准确性的前提下,以即插即用的方式有效校准了QE模型的性别相关偏见。

链接: https://arxiv.org/abs/2604.21420
作者: Jinhee Jang,Juhwan Choi,Dongjin Lee,Seunguk Yu,Youngbin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Quality Estimation (QE) aims to assess machine translation quality without reference translations, but recent studies have shown that existing QE models exhibit systematic gender bias. In particular, they tend to favor masculine realizations in gender-ambiguous contexts and may assign higher scores to gender-misaligned translations even when gender is explicitly specified. To address these issues, we propose FairQE, a multi-agent-based, fairness-aware QE framework that mitigates gender bias in both gender-ambiguous and gender-explicit scenarios. FairQE detects gender cues, generates gender-flipped translation variants, and combines conventional QE scores with LLM-based bias-mitigating reasoning through a dynamic bias-aware aggregation mechanism. This design preserves the strengths of existing QE models while calibrating their gender-related biases in a plug-and-play manner. Extensive experiments across multiple gender bias evaluation settings demonstrate that FairQE consistently improves gender fairness over strong QE baselines. Moreover, under MQM-based meta-evaluation following the WMT 2023 Metrics Shared Task, FairQE achieves competitive or improved general QE performance. These results show that gender bias in QE can be effectively mitigated without sacrificing evaluation accuracy, enabling fairer and more reliable translation evaluation.

[AI-41] CSC: Turning the Adversarys Poison against Itself

【速读】:该论文旨在解决基于投毒的后门攻击对深度神经网络造成的威胁,此类攻击通过在训练数据中嵌入触发器,使模型在面对特定触发输入时误分类为攻击者指定标签,同时保持对干净数据的良好性能。现有基于投毒抑制的防御方法往往在应对特定攻击变体时检测能力不足,且因采用去学习(unlearning)策略导致模型准确率下降。论文提出了一种新颖的防御机制——聚类隔离隐藏(Cluster Segregation Concealment, CSC),其关键在于:首先利用早期训练阶段的特征提取与DBSCAN聚类识别出由投毒样本构成的孤立簇,并基于类别多样性与密度指标判定异常簇;随后在隐藏阶段将识别出的投毒样本重新标记为虚拟类别,并通过交叉熵损失微调模型分类器,从而切断后门关联并建立良性虚拟连接,实现高精度防御与最小化干净数据准确率损失的平衡。

链接: https://arxiv.org/abs/2604.21416
作者: Yuchen Shi,Xin Guo,Huajie Chen,Tianqing Zhu,Bo Liu,Wanlei Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Poisoning-based backdoor attacks pose significant threats to deep neural networks by embedding triggers in training data, causing models to misclassify triggered inputs as adversary-specified labels while maintaining performance on clean data. Existing poison restraint-based defenses often suffer from inadequate detection against specific attack variants and compromise model utility through unlearning methods that lead to accuracy degradation. This paper conducts a comprehensive analysis of backdoor attack dynamics during model training, revealing that poisoned samples form isolated clusters in latent space early on, with triggers acting as dominant features distinct from benign ones. Leveraging these insights, we propose Cluster Segregation Concealment (CSC), a novel poison suppression defense. CSC first trains a deep neural network via standard supervised learning while segregating poisoned samples through feature extraction from early epochs, DBSCAN clustering, and identification of anomalous clusters based on class diversity and density metrics. In the concealment stage, identified poisoned samples are relabeled to a virtual class, and the model’s classifier is fine-tuned using cross-entropy loss to replace the backdoor association with a benign virtual linkage, preserving overall accuracy. Evaluated on four benchmark datasets against twelve poisoning-based attacks, CSC outperforms nine state-of-the-art defenses by reducing average attack success rates to near zero with minimal clean accuracy loss. Contributions include robust identification of backdoor patterns, an effective concealment mechanism, and superior empirical validation, advancing trustworthy artificial intelligence.
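其中“基于类别多样性识别异常簇”的判据可以用如下草图示意:投毒样本簇几乎全部指向同一目标标签,因而类别纯度异常高。阈值与纯度度量均为示意性假设,并非论文的完整判据(密度项在此省略):

```python
# Illustrative segregation heuristic: clusters of early-epoch features are
# flagged as suspicious when dominated by a single label, since a backdoor maps
# many source samples to one target class. Threshold is an assumption.
from collections import Counter

def suspicious_clusters(cluster_ids, labels, purity_threshold=0.95):
    clusters = {}
    for cid, lab in zip(cluster_ids, labels):
        clusters.setdefault(cid, []).append(lab)
    flagged = []
    for cid, labs in clusters.items():
        top = Counter(labs).most_common(1)[0][1]
        if top / len(labs) >= purity_threshold:  # low class diversity
            flagged.append(cid)
    return flagged

# Cluster 0 mixes labels (benign-looking); cluster 1 is all label 7 (poison-like).
flags = suspicious_clusters([0, 0, 0, 1, 1, 1], [2, 5, 7, 7, 7, 7])
```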

[AI-42] SemanticAgent : A Semantics-Aware Framework for Text-to-SQL Data Synthesis

【速读】:该论文旨在解决现有文本到SQL(text-to-SQL)合成流水线中执行可行性与语义有效性混淆的问题,即仅依赖语法检查和执行验证的方法可能生成在数据库上成功运行但违反语义逻辑的查询。解决方案的关键在于提出SemanticAgent框架,其核心是通过三个专用模块——分析器(analyzer)、合成器(synthesizer)和验证器(verifier)——构建一个三阶段协议:语义分析、逐步合成与诊断优化,从而将单纯的执行验证转化为可追溯的推理过程,并利用生成的合成数据提升下游微调性能,尤其在语义要求较高的基准测试中表现更优。

链接: https://arxiv.org/abs/2604.21414
作者: Qiang Gao,Zhenping Li,Anqi Zhuo,Yingxiao Zhao,Weibo Geng,Xiaosong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing text-to-SQL synthesis pipelines still conflate executability with semantic validity: syntactic checks and execution-based validation can retain queries that execute successfully while violating database semantics. To address these limitations, we propose SemanticAgent, a semantic-aware synthesis framework. SemanticAgent organizes synthesis around three specialized modules: an analyzer, a synthesizer, and a verifier. Through a three-stage protocol of semantic analysis, stepwise synthesis, and diagnostic refinement, SemanticAgent transforms execution-based validation alone into a traceable reasoning process. Our framework generates synthetic data that consistently outperforms prior synthesis methods under semantic-quality evaluation, leading to stronger downstream fine-tuning performance, especially on semantically demanding benchmarks.

[AI-43] From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

【速读】:该论文旨在解决具身智能中高阶语义理解与低阶物理控制之间的长期挑战,其根源在于认知与动作之间存在的时空尺度不匹配问题。现有生成式视觉语言动作(VLA)策略多采用“从噪声生成”的范式,忽视了这种尺度差异,导致表征效率低下且条件对齐能力弱。解决方案的关键在于提出ResVLA架构,将范式转变为“从意图精炼”,通过谱分析将控制分解为确定性的低频锚点(对应全局意图)和随机的高频残差(对应局部动态),并基于预测意图锚定生成过程,仅通过残差扩散桥专注于局部动态的精细化调整,从而实现更高效、鲁棒且收敛更快的控制策略。

链接: https://arxiv.org/abs/2604.21391
作者: Yiming Zhong,Yaoyu He,Zemin Yang,Pengfei Tian,Yifan Huang,Qingqiu Huang,Xinge Zhu,Yuexin Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a “Generation-from-Noise” paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to “Refinement-from-Intent.” Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
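将控制信号分解为确定性低频锚点(全局意图)与高频残差(局部动态)的思路可以用一段草图示意:这里用滑动平均代替论文中的谱分析低通,仅为示意性简化;该分解是精确的,锚点加残差可还原原始轨迹:

```python
# Toy "Refinement-from-Intent" decomposition: a moving average stands in for a
# spectral low-pass. anchor[i] is the slow component, residual[i] the fast one,
# and by construction anchor + residual reconstructs the trajectory exactly.

def decompose(traj, window=3):
    half = window // 2
    anchor = []
    for i in range(len(traj)):
        lo, hi = max(0, i - half), min(len(traj), i + half + 1)
        anchor.append(sum(traj[lo:hi]) / (hi - lo))
    residual = [t - a for t, a in zip(traj, anchor)]
    return anchor, residual

traj = [0.0, 1.0, 0.0, 1.0, 0.0]
anchor, residual = decompose(traj)
recon = [a + r for a, r in zip(anchor, residual)]
```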

[AI-44] Time Causality and Observability Failures in Distributed AI Inference Systems

【速读】:该论文旨在解决分布式AI推理流水线中因节点间时钟偏差(clock skew)导致可观测性(observability)因果错误的问题,尽管系统功能正确且性能未受影响。其关键解决方案在于通过控制实验揭示了时钟偏差对可观测性的影响阈值:在不超过3毫秒的时钟偏差下,系统仍能保持因果一致性;而当偏差达到5毫秒时,明显的因果违反现象开始出现。研究进一步发现,这种违反行为并非静态,而是随时间演化,受节点间相对时钟漂移影响,表明时间同步必须作为分布式AI系统中的首要考量因素。

链接: https://arxiv.org/abs/2604.21361
作者: Ankur Sharma,Deep Shah,David Lariviere,Hesham ElBakoury
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures. Produced as part of the Unified Intelligent Infrastructure workstream at the Open Compute Project (OCP)

点击查看摘要

Abstract:Distributed AI inference pipelines rely heavily on timestamp-based observability to understand system behavior. This work demonstrates that even small clock skew between nodes can cause observability to become causally incorrect while the system itself remains functionally correct and performant. We present controlled experiments on a multi-node AI inference pipeline, where clock skew is introduced at a single stage. Results show that no violations are observed under synchronized conditions and up to 3 ms skew, while clear causality violations emerge by 5 ms. Despite this, system throughput and output correctness remain largely unaffected. We further observe that violation behavior is not strictly static. In longer runs, negative span rates may stabilize or decrease over time, indicating that effective skew evolves due to relative clock drift between nodes. Experiments were conducted using Kafka and ZeroMQ transports, with consistent results across both. Aeron is under active exploration but is not yet included in the completed validation set. These findings suggest that observability correctness depends not only on system functionality but also on precise time alignment, and that timing must be treated as a first-class concern in distributed AI systems.
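论文观察到的“因果违反”可以用一个小例子复现:当下游阶段的时钟慢 `skew_ms` 毫秒时,span 的记录结束时间可能早于开始时间,尽管实际执行顺序完全正确。数值仅为示意:

```python
# Minimal reproduction of the observability failure: start timestamps come from
# one clock, end timestamps from a clock running `skew_ms` behind, so a span's
# recorded duration can go negative while execution is functionally correct.

def negative_spans(start_times_ms, durations_ms, skew_ms):
    """Count spans whose observed end precedes their start under clock skew."""
    violations = 0
    for start, dur in zip(start_times_ms, durations_ms):
        end_observed = start + dur - skew_ms
        if end_observed < start:  # causality appears violated
            violations += 1
    return violations

# Three 2 ms spans: fine at 0 ms skew, all "violated" at 5 ms skew.
ok = negative_spans([0, 10, 20], [2, 2, 2], skew_ms=0)
bad = negative_spans([0, 10, 20], [2, 2, 2], skew_ms=5)
```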

[AI-45] Adversarial Evasion in Non-Stationary Malware Detection: Minimizing Drift Signals through Similarity-Constrained Perturbations

【速读】:该论文旨在解决在非平稳(non-stationary)现实环境中,恶意软件(malware)检测模型面临的关键安全挑战:攻击者能否生成同时具备逃逸分类器检测能力且不易被漂移监测机制(drift monitoring mechanisms)发现的对抗性恶意样本?解决方案的核心在于提出一种新颖的对抗样本生成方法,其在分类器标准化特征空间中进行目标攻击,并引入复杂的相似性正则化项(similarity regularizers),通过约束扰动以保持与原始良性恶意软件分布的一致性,从而在优化目标中平衡目标误分类概率与漂移信号最小化。实验表明,ℓ₂ 正则化在降低输出漂移信号方面表现最优,且扰动预算显著影响攻击成功率与可检测性之间的权衡关系。

链接: https://arxiv.org/abs/2604.21310
作者: Pawan Acharya,Lan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning has emerged as a powerful approach for malware detection, demonstrating impressive accuracy across various data representations. However, these models face critical limitations in real-world, non-stationary environments where both malware characteristics and detection systems continuously evolve. Our research investigates a fundamental security question: Can an attacker generate adversarial malware samples that simultaneously evade classification and remain inconspicuous to drift monitoring mechanisms? We propose a novel approach that generates targeted adversarial examples in the classifier’s standardized feature space, augmented with sophisticated similarity regularizers. By carefully constraining perturbations to maintain distributional similarity with clean malware, we create an optimization objective that balances targeted misclassification with drift signal minimization. We quantify the effectiveness of this approach by comprehensively comparing classifier output probabilities using multiple drift metrics. Our experiments demonstrate that similarity constraints can reduce output drift signals, with ℓ₂ regularization showing the most promising results. We observe that perturbation budget significantly influences the evasion-detectability trade-off, with increased budget leading to higher attack success rates and more substantial drift indicators.
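其中“扰动预算”约束通常通过向 ℓ₂ 球投影实现:每步梯度更新后把扰动拉回半径为 `budget` 的球内。以下为该投影步骤的示意性草图(优化器与相似性正则项本身省略,并非论文的完整方法):

```python
# L2-ball projection: the standard feasibility step for budget-constrained
# perturbations. If the perturbation already fits the budget it is unchanged;
# otherwise it is rescaled onto the ball's surface.
import math

def project_l2(delta, budget):
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= budget:
        return list(delta)
    scale = budget / norm
    return [d * scale for d in delta]

inside = project_l2([0.3, 0.4], budget=1.0)   # norm 0.5 -> unchanged
clipped = project_l2([3.0, 4.0], budget=1.0)  # norm 5.0 -> rescaled to norm 1
```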

[AI-46] Can MLLMs “Read” What is Missing?

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在文本重建任务中缺乏统一、无提示(prompt-free)评估基准的问题。现有方法多依赖于问答类任务,难以分离模型的指令遵循能力与基于视觉上下文的文本恢复能力。为此,作者提出MMTR-Bench,其核心创新在于设计了一个无需显式提示的文本掩码重建(Masked Text Reconstruction, MTR)基准,要求模型从单页或多页文档或网页图像中直接恢复被遮蔽的文本内容,从而精准评估模型对布局理解(layout understanding)、视觉定位(visual grounding)以及跨模态知识整合的能力。该方案通过引入分层评估协议(level-aware evaluation protocol)以适应不同长度目标文本的多样性,显著提升了评估的公平性与挑战性。

链接: https://arxiv.org/abs/2604.21277
作者: Jindi Guo,Xi Fang,Chaozheng Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model’s layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at this https URL.

[AI-47] Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation ACL

【速读】:该论文旨在解决在线招聘中因低质量职位描述和相似候选人-职位对导致的人员-岗位匹配(Person-Job Fit, PJF)模型性能下降问题。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的方法,包含两个创新技术:一是利用思维链(Chain-of-Thought, COT)提示对低质量职位描述进行润色与重写,提升数据质量;二是引入类别感知的专家混合模型(Category-aware Mixture of Experts, MoE),通过类别嵌入动态分配专家权重,增强对相似候选人-职位对的区分能力,从而显著提升匹配效果。

链接: https://arxiv.org/abs/2604.21264
作者: Minping Chen,Bing Xu,Zulong Chen,Chuanfei Xu,Ying Zhou,Zui Tao,Zeyi Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL Industry Track 2026

点击查看摘要

Abstract:Person-Job Fit (PJF) is a critical component for online recruitment. Existing approaches face several challenges, particularly in handling low-quality job descriptions and similar candidate-job pairs, which impair model performance. To address these challenges, this paper proposes a large language model (LLM) based method with two novel techniques: (1) LLM-based data augmentation, which polishes and rewrites low-quality job descriptions by leveraging chain-of-thought (COT) prompts, and (2) category-aware Mixture of Experts (MoE) that assists in identifying similar candidate-job pairs. This MoE module incorporates category embeddings to dynamically assign weights to the experts and learns more distinguishable patterns for similar candidate-job pairs. We perform offline evaluations and online A/B tests on our recruitment platform. Our method surpasses existing methods by a relative 2.40% in AUC and 7.46% in GAUC, and boosts click-through conversion rate (CTCVR) by 19.4% in online tests, saving millions of CNY in external headhunting expenses.
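类别感知 MoE 的门控机制可以用如下草图示意:类别嵌入与各专家的门控向量做内积得到 logits,softmax 归一化后对专家输出加权求和。维度、标量专家输出与点积门控形式均为示意性假设,并非论文的具体架构:

```python
# Sketch of category-aware MoE gating: the gate weight for each expert comes
# from the category embedding, so different job categories route to different
# expert mixtures.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def category_moe(expert_outputs, category_emb, gate_vectors):
    """expert_outputs: one scalar per expert; gate logit = <category_emb, gate_vec>."""
    logits = [sum(c * g for c, g in zip(category_emb, gv)) for gv in gate_vectors]
    weights = softmax(logits)
    return sum(w * o for w, o in zip(weights, expert_outputs)), weights

# Category embedding aligned with expert 0's gate vector -> expert 0 dominates.
out, w = category_moe([1.0, 0.0], category_emb=[1.0, 0.0],
                      gate_vectors=[[5.0, 0.0], [0.0, 5.0]])
```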

[AI-48] Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

【速读】:该论文旨在解决医疗人工智能(AI)决策支持系统中“可审计性”(auditability)不足的问题,即现有临床逻辑形式语言仅能验证语法和结构正确性,却无法确保决策规则所依据的证据在认识论上是恰当的。其关键解决方案是引入元谓词(meta-predicates)——即对谓词本身的约束机制,用于在领域特定语言(DSL)中显式声明每条临床决策规则所允许使用的证据类型。通过四维分类体系(目的、知识领域、规模、获取方法)构建一个认识论类型系统,使规则在部署前即可被自动验证是否符合预设的证据规范,从而实现对决策依据的可追溯性和独立审计能力。该方法已在开源遗传变异注释平台AnFiSA中实现,并在560万基因组基准数据上验证了其有效性,且不依赖于规则由人类或生成式AI编写。

链接: https://arxiv.org/abs/2604.21263
作者: Michael Bouzinier,Sergey Trifonov,Michael Chumack,Eugenia Lvova,Dmitry Etin
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:\textbfBackground: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. \textbfMethods: Drawing on design-by-contract principles, we introduce meta-predicates – predicates about predicates – for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. \textbfResults: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. \textbfConclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic. 

[AI-49] Robustness Analysis of POMDP Policies to Observation Perturbations

【速读】:该论文旨在解决部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中策略对观测模型偏差的鲁棒性问题,即在系统部署时因校准漂移或传感器退化导致观测模型偏离名义模型的情况下,如何量化策略性能下降的容忍边界。解决方案的关键在于将该问题建模为一个双层优化问题,其中内层优化针对特定观测偏差计算策略价值,并利用其单调性特性,通过根查找算法高效求解外层优化以确定最大可接受偏差。进一步地,对于非粘滞(non-sticky)情形,作者证明仅需考虑有限状态控制器(Finite-State Controller, FSC)节点上的观测依赖即可充分刻画偏差影响,从而显著降低复杂度;为此提出了具有收敛性和完备性保证的Robust Interval Search算法,在非粘滞情况下具有多项式时间复杂度,粘滞情况下最多为指数复杂度,实验验证了其在包含数万个状态的POMDP问题上的可扩展性及在机器人学和运筹学中的实际应用价值。

链接: https://arxiv.org/abs/2604.21256
作者: Benjamin Kraske,Qi Heng Ho,Federico Rossi,Morteza Lahijanian,Zachary Sunberg
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 43 Pages

点击查看摘要

Abstract:Policies for Partially Observable Markov Decision Processes (POMDPs) are often designed using a nominal system model. In practice, this model can deviate from the true system during deployment due to factors such as calibration drift or sensor degradation, leading to unexpected performance degradation. This work studies policy robustness against deviations in the POMDP observation model. We introduce the Policy Observation Robustness Problem: to determine the maximum tolerable deviation in a POMDP’s observation model that guarantees the policy’s value remains above a specified threshold. We analyze two variants: the sticky variant, where deviations are dependent on state and actions, and the non-sticky variant, where they can be history-dependent. We show that the Policy Observation Robustness Problem can be formulated as a bi-level optimization problem in which the inner optimization is monotonic in the size of the observation deviation. This enables efficient solutions using root-finding algorithms in the outer optimization. For the non-sticky variant, we show that when policies are represented with finite-state controllers (FSCs) it is sufficient to consider observations which depend on nodes in the FSC rather than full histories. We present Robust Interval Search, an algorithm with soundness and convergence guarantees, for both the sticky and non-sticky variants. We show this algorithm has polynomial time complexity in the non-sticky variant and at most exponential time complexity in the sticky variant. We provide experimental results validating and demonstrating the scalability of implementations of Robust Interval Search to POMDP problems with tens of thousands of states. We also provide case studies from robotics and operations research which demonstrate the practical utility of the problem and algorithms.
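由于内层策略价值对偏差大小单调(此处假设随偏差增大单调不增),外层优化可退化为一维二分查找。以下草图中 `value_at` 是对昂贵的内层策略评估的玩具替身,线性退化模型仅为示意:

```python
# Root-finding over a monotone inner problem: find the largest observation-model
# deviation d whose policy value stays above the threshold.

def max_tolerable_deviation(value_at, threshold, lo=0.0, hi=1.0, tol=1e-6):
    """Largest deviation d in [lo, hi] with value_at(d) >= threshold,
    assuming value_at is non-increasing in d."""
    if value_at(hi) >= threshold:
        return hi  # even the largest deviation is tolerable
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if value_at(mid) >= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Toy model: value degrades linearly with deviation; threshold 0.7 -> d = 0.3.
d = max_tolerable_deviation(lambda dev: 1.0 - dev, threshold=0.7)
```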

[AI-50] CAP: Controllable Alignment Prompting for Unlearning in LLM s ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在未经筛选的语料库上训练时可能保留敏感信息的问题,从而满足监管合规与伦理安全需求。现有参数修改类方法存在计算成本高、遗忘边界不可控以及严格依赖模型权重访问等局限,难以应用于闭源模型;而当前非侵入式方法则缺乏系统性且高度依赖经验。论文提出可控对齐提示遗忘框架(Controllable Alignment Prompting for Unlearning, CAP),其核心在于将遗忘过程解耦为可通过强化学习优化的可学习提示生成过程,由提示生成器与LLM协同工作,在选择性保留通用能力的同时抑制目标知识,实现无需更新模型参数的精确、可控遗忘,并支持通过撤销提示恢复被遗忘知识,从而建立一种动态对齐机制,突破了以往方法在迁移性上的限制。

链接: https://arxiv.org/abs/2604.21251
作者: Zhaokun Wang,Jinyu Guo,Jingwen Pu,Hongli Pu,Meng Yang,Xunlei Chen,Jie Ou,Wenyi Li,Guangchun Luo,Wenhong Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accpeted to ACL 2026

点击查看摘要

Abstract:Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.

[AI-51] CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在生成连续控制动作时,空间引导信息常以隐式方式嵌入潜在特征中导致动作策略缺乏明确物理约束的问题。其解决方案的关键在于提出CorridorVLA框架,通过预测稀疏的空间锚点(spatial anchors)作为增量物理变化(如Δ-位置),并以此构建显式的容差区域(tolerance region)用于训练目标优化;该容差区域形成一个“走廊”约束,指导流匹配(flow-matching)动作头:若轨迹的隐含空间演化超出该走廊,则施加修正梯度,而对接触和执行噪声引起的微小偏差则允许容忍,从而实现动作对齐的物理线索对生成式动作策略的直接且可解释的约束。

链接: https://arxiv.org/abs/2604.21241
作者: Dachong Li,ZhuangZhuang Chen,Jin Zhang,Jianqiang Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision–Language–Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., Δ-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by 3.4%–12.4% over the corresponding baselines; notably, our GR00T-Corr variant reaches a success rate of 83.21%. These results indicate that action-aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at this https URL.
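“走廊”容差约束可以用如下草图示意:锚点周围半径 `tol` 内的偏差不受惩罚,越出走廊的部分施加铰链式(hinge)二次惩罚。标量位置与具体惩罚形式均为示意性简化,并非论文的训练目标:

```python
# Corridor-style tolerance penalty: zero loss inside the tube around the
# anchors, quadratic hinge on the excursion beyond it, so only trajectories
# that leave the corridor receive corrective gradients.

def corridor_loss(trajectory, anchors, tol=0.1):
    loss = 0.0
    for pos, anchor in zip(trajectory, anchors):
        excess = abs(pos - anchor) - tol
        if excess > 0:  # outside the corridor
            loss += excess ** 2
    return loss

inside = corridor_loss([0.05, 1.02, 2.0], [0.0, 1.0, 2.0])   # all within tol
outside = corridor_loss([0.5, 1.0, 2.0], [0.0, 1.0, 2.0])    # first step strays
```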

[AI-52] ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)系统在多模态环境中执行多步骤任务时,因中间步骤错误导致局部误差累积并引发级联失败的问题。现有方法通常依赖事后的修正机制或固定的任务分解与对齐方案,难以应对动态偏差。其解决方案的关键在于提出预测对齐与规划架构(Predictive Alignment and Planning Architecture, ReCAPA),通过预测和对比机制在动作、子目标和轨迹三个层级上调整偏差,并利用Sinkhorn-based模块和Score-field模块强制语义对齐;同时,在训练过程中联合优化预测校正与对齐模块以更新动作生成器,从而实现细粒度步骤的动态校准,确保整体意图一致性。此外,论文还引入两项新指标量化长程执行中的误差传播与恢复过程,显著提升了VLA系统的鲁棒性与可控性。

链接: https://arxiv.org/abs/2604.21232
作者: Xiyin Zeng,Yuyu Sun,Haoyang Li,Shouqiang Liu,Hao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose the Predictive Alignment and Planning Architecture (ReCAPA), a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.

[AI-53] SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

【速读】:该论文旨在解决在设备端部署大语言模型(Large Language Models, LLMs)时,由于硬件资源受限以及预填充阶段(prefill stage)高计算成本导致的推理效率问题,特别是如何降低首次词元生成时间(Time-to-First-Token)并减少能耗。其核心解决方案是提出SparKV框架,该框架通过结合云端KV缓存流式传输与本地计算,自适应地评估每个KV块的成本,并决定是否从云端加载或本地计算,同时通过重叠通信与计算路径来进一步降低延迟;此外,SparKV还能在运行时动态调整离线生成的调度策略,以应对无线网络波动和边缘资源变化,从而实现通信与计算成本的实时平衡。

链接: https://arxiv.org/abs/2604.21231
作者: Hongyao Liu,Liuqun Zhai,Junyi Wang,Zhengru Fang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: IEEE Internet of Things Journal, 11 pages, under major revision

点击查看摘要

Abstract:Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x-5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
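SparKV 的逐块决策与通信/计算重叠可以用如下草图示意:每个 KV 块选择流式传输或本地计算中代价更低者;由于两条路径并行执行,总时延近似取两路径各自耗时之和中的较大者。代价数字与这一近似均为示意性假设:

```python
# Per-chunk stream-vs-compute decision with overlapped execution: each chunk
# takes the cheaper path, and because streaming and computing run in parallel,
# the overlapped prefill latency is roughly max(stream path, compute path),
# not their sum.

def schedule_chunks(stream_ms, compute_ms):
    """stream_ms / compute_ms: per-chunk costs. Returns (plan, overlapped latency)."""
    plan, stream_total, compute_total = [], 0.0, 0.0
    for s, c in zip(stream_ms, compute_ms):
        if s <= c:
            plan.append("stream")
            stream_total += s
        else:
            plan.append("compute")
            compute_total += c
    return plan, max(stream_total, compute_total)

# Chunk 0 is cheap to stream, chunk 1 cheap to compute; the paths overlap.
plan, latency = schedule_chunks([4.0, 9.0], [10.0, 3.0])
```

运行时根据带宽波动更新 `stream_ms` 的估计并重算该调度,即对应论文中对离线调度的在线再平衡。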

[AI-54] SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL

Quick read: This paper targets three limitations of existing text-to-SQL benchmarks: a single aggregate score that cannot fully reflect model performance, the lack of realistic evaluation settings, and limited analysis of model behaviour across query types. The key is SQLyzr, a comprehensive evaluation platform that introduces a diverse set of metrics capturing multiple facets of generated SQL queries and aligns workloads with real-world SQL usage patterns and database scales for more realistic evaluation; it further supports fine-grained query classification, error analysis, and workload augmentation, letting users precisely diagnose and iteratively improve text-to-SQL models.

Link: https://arxiv.org/abs/2604.21214
Authors: Sepideh Abedini, M. Tamer Özsu
Affiliations: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Text-to-SQL models have significantly improved with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating the performance of text-to-SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, while enabling more realistic evaluation through workload alignment with real-world SQL usage patterns and database scaling. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr’s graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We envision that SQLyzr facilitates the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at this https URL.

[AI-55] Trust but Verify: Introducing DAVinCI – A Framework for Dual Attribution and Verification in Claim Inference for Language Models

Quick read: This paper tackles factual errors and hallucination in text generated by Large Language Models (LLMs), which matter most in high-stakes domains such as healthcare, law, and scientific communication where trustworthiness and verifiability are paramount. The key is the proposed DAVinCI framework, a two-stage attribution-and-verification mechanism: the first stage attributes generated content to internal model components and external sources; the second verifies each claim through entailment-based reasoning and confidence calibration. The design markedly improves factual accuracy and interpretability, achieving 5-20% gains across datasets including FEVER and CLIMATE-FEVER, and offers a scalable path toward auditable, trustworthy AI systems.

Link: https://arxiv.org/abs/2604.21193
Authors: Vipula Rawte, Ryan Rossi, Franck Dernoncourt, Nedim Lipka
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation poses significant risks in high-stakes domains such as healthcare, law, and scientific communication, where trust and verifiability are paramount. In this paper, we introduce DAVinCI - a Dual Attribution and Verification framework designed to enhance the factual reliability and interpretability of LLM outputs. DAVinCI operates in two stages: (i) it attributes generated claims to internal model components and external sources; (ii) it verifies each claim using entailment-based reasoning and confidence calibration. We evaluate DAVinCI across multiple datasets, including FEVER and CLIMATE-FEVER, and compare its performance against standard verification-only baselines. Our results show that DAVinCI significantly improves classification accuracy, attribution precision, recall, and F1-score by 5-20%. Through an extensive ablation study, we isolate the contributions of evidence span selection, recalibration thresholds, and retrieval quality. We also release a modular DAVinCI implementation that can be integrated into existing LLM pipelines. By bridging attribution and verification, DAVinCI offers a scalable path to auditable, trustworthy AI systems. This work contributes to the growing effort to make LLMs not only powerful but also accountable.

[AI-56] How VLAs (Really) Work In Open-World Environments

Quick read: This paper addresses the neglect of safety in current long-horizon evaluations of Vision-Language-Action models (VLAs). Existing benchmarks such as BEHAVIOR1K rely on final-state success rates or partial scores; such progress-agnostic evaluation ignores safety and robustness during operation and can inflate reported performance, masking key challenges for real-world deployment. The key contribution is a set of evaluation protocols that measure policy reproducibility, consistency, operational safety, task awareness, and the root causes of task failure, systematically capturing safety violations so that VLA performance in complex interactive scenarios is assessed more faithfully.

Link: https://arxiv.org/abs/2604.21192
Authors: Amir Rasouli, Yangzheng Wu, Zhiyuan Li, Rui Heng Yang, Xuan Zhao, Charles Eret, Sajjad Pakdamansavoji
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, 2 tables

Click to view abstract

Abstract:Vision-language-action models (VLAs) have been extensively used in robotics applications, achieving great success in various manipulation problems. More recently, VLAs have been used in long-horizon tasks and evaluated on benchmarks, such as BEHAVIOR1K (B1K), for solving complex household chores. The common metric for measuring progress in such benchmarks is success rate or partial score based on satisfaction of progress-agnostic criteria, meaning only the final states of the objects are considered, regardless of the events that lead to such states. In this paper, we argue that using such evaluation protocols say little about safety aspects of operation and can potentially exaggerate reported performance, undermining core challenges for future real-world deployment. To this end, we conduct a thorough analysis of state-of-the-art models on the B1K Challenge and evaluate policies in terms of robustness via reproducibility and consistency of performance, safety aspects of policies operations, task awareness, and key elements leading to the incompletion of tasks. We then propose evaluation protocols to capture safety violations to better measure the true performance of the policies in more complex and interactive scenarios. At the end, we discuss the limitations of the existing VLAs and motivate future research.

[AI-57] Scaling of Gaussian Kolmogorov–Arnold Networks

Quick read: This paper systematically studies how the scale parameter ε shapes the behaviour of Gaussian Kolmogorov–Arnold Networks (Gaussian KANs), a question previously unexplored for deep edge-based architectures; the central challenge is choosing ε so the network stays stable and performant in function approximation and physics-informed modelling. The key insight is that first-layer feature geometry governs the choice of ε: since only the first layer is constructed directly on the input domain, any loss of distinguishability introduced there cannot be recovered by later layers. On this basis the authors propose a practical operating interval ε ∈ [1/G − 1, 2/G − 1], where G is the number of Gaussian centers, and validate its effectiveness and robustness through experiments spanning different collocation densities, grid resolutions, network architectures, and input dimensions. They further show the range supports fixed-scale selection, variable-scale constructions, constrained training of ε, and efficient scale search guided by early-training MSE, turning scale selection from ad hoc tuning into a reusable design principle.

Link: https://arxiv.org/abs/2604.21174
Authors: Amir Noorizadegan, Sifan Wang
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
Comments:

Click to view abstract

Abstract:The Gaussian scale parameter ε is central to the behavior of Gaussian Kolmogorov–Arnold Networks (KANs), yet its role in deep edge-based architectures has not been studied systematically. In this paper, we investigate how ε affects Gaussian KANs through first-layer feature geometry, conditioning, and approximation behavior. Our central observation is that scale selection is governed primarily by the first layer, since it is the only layer constructed directly on the input domain and any loss of distinguishability introduced there cannot be recovered by later layers. From this viewpoint, we analyze the first-layer feature matrix and identify a practical operating interval, ε ∈ [1/G − 1, 2/G − 1], where G denotes the number of Gaussian centers. For the standard shared-center Gaussian KAN used in current practice, we interpret this interval not as a universal optimality result, but as a stable and effective design rule, and validate it through brute-force sweeps over ε across function-approximation problems with different collocation densities, grid resolutions, network architectures, and input dimensions, as well as a physics-informed Helmholtz problem. We further show that this range is useful for fixed-scale selection, variable-scale constructions, constrained training of ε, and efficient scale search using early training MSE. Finally, using a matched Chebyshev reference, we show that a properly scaled Gaussian KAN can already be competitive in accuracy relative to another standard KAN basis. In this way, the paper positions scale selection as a practical design principle for Gaussian KANs rather than as an ad hoc hyperparameter choice.
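The first-layer analysis described in the abstract can be illustrated with a small sketch (our own, not the paper's code): a Gaussian feature matrix Phi[i, j] = exp(-((x_i - c_j)/eps)^2) over G shared centers, swept over candidate scales to show how eps drives its conditioning. The exact basis normalization is an assumption here.

```python
import numpy as np

def first_layer_features(x, centers, eps):
    # Gaussian features built directly on the input domain (first layer).
    return np.exp(-((x[:, None] - centers[None, :]) / eps) ** 2)

G = 8
centers = np.linspace(0.0, 1.0, G)   # shared Gaussian centers
x = np.linspace(0.0, 1.0, 50)        # collocation points

# Brute-force sweep: very large eps makes the columns nearly identical,
# so the feature matrix becomes badly conditioned and inputs lose
# distinguishability -- exactly the failure the first layer cannot undo.
for eps in [0.05, 0.2, 1.0, 5.0]:
    cond = np.linalg.cond(first_layer_features(x, centers, eps))
    print(f"eps={eps:>4}: condition number {cond:.3e}")
```

The same sweep, scored by early-training MSE instead of condition number, corresponds to the efficient scale search the paper reports.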

[AI-58] TAPO-Description Logic for Information Behavior: Refined OBoxes, Inference and Categorical Semantics

Quick read: This paper addresses the difficulty, in information-behavior analysis, of unifying static knowledge representation with dynamic processes such as hesitation, external consultation, and action-guiding updates. The key is a layered description-logic framework, TAPO-description logic, comprising a static descriptive layer (TBox/ABox), a procedural layer (PBox), and an oracle-sensitive layer (OBox), plus a metalevel guard-judgment layer that formally governs procedural branching and iteration. On this basis the paper builds a core inference system covering static reasoning, guarded procedural transitions, and validated external imports, and gives a categorical semantics with a sheaf-theoretic refinement, yielding a unified logical account of information-seeking behaviors such as simple search and review-sensitive ordering.

Link: https://arxiv.org/abs/2604.21172
Authors: Takao Inoué
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 23 pages, 2 figures. Substantially expanded version of arXiv:2602.17242; adds a guard-judgment layer, refined OBoxes, core inference rules, categorical semantics, sheaf-theoretic refinement, and a browsing-theory appendix

Click to view abstract

Abstract:This paper develops a refined version of TAPO-description logic for the analysis of information behavior. The framework is treated not as a single homogeneous object logic, but as a layered formalism consisting of a static descriptive layer (TBox/ABox), a procedural layer (PBox), and an oracle-sensitive layer (OBox). To make this architecture mathematically explicit, we introduce a metalevel guard-judgment layer governing procedural branching and iteration. On this basis we formulate a core inference system for TAPO-description logic, covering static TBox/ABox reasoning, guarded procedural transition in the PBox, and validated external import in the OBox. We then give a categorical semantics for the resulting framework and indicate its sheaf-theoretic refinement. The theory is illustrated by examples of information-seeking behavior, including simple search behavior and review-sensitive ordering behavior in a curry restaurant. The aim is to treat not only static knowledge representation but also hesitation, external consultation, and action-guiding update within a unified logical setting.

[AI-59] Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

Quick read: This paper addresses low compliance in at-home physiotherapy, which stems from the lack of personalized supervision and dynamic feedback; existing digital health solutions rely on static pre-recorded videos or generic 3D avatars that cannot adapt to a patient's specific injury limitations or home environment. The key is a Multi-Agent System (MAS) architecture combining generative AI and computer vision to close the tele-rehabilitation loop, built from four specialized micro-agents: a Clinical Extraction Agent that converts unstructured medical notes into kinematic constraints; a Video Synthesis Agent that uses foundational video generation models to create personalized exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. The framework generates content tailored to the individual and adapts interventions in real time, improving rehabilitation outcomes and compliance.

Link: https://arxiv.org/abs/2604.21154
Authors: Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 3 pages, 2 figures, submitted to ICDH IEEE conference

Click to view abstract

Abstract:At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient’s specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.

[AI-60] Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

Quick read: This paper addresses multi-robot control in cluttered environments, where the core challenge is jointly optimizing high-level task planning and low-level motion planning under physical constraints such as robot-robot collisions, robot-obstacle collisions, and unreachable motions. Existing methods struggle to integrate the two levels, hindered by the complex parameterization of low-level motion trajectories and ambiguous credit assignment across levels. The key is a hybrid multi-robot control framework: waypoints are introduced as a simple yet expressive trajectory representation for effective low-level parameterization, and a curriculum-based training strategy with a modified RLVR algorithm propagates motion-feasibility feedback from the motion planner to the task planner for precise credit assignment. Experiments on the BoxNet3D-OBS benchmark show consistent improvements over motion-agnostic and VLA-based baselines.

Link: https://arxiv.org/abs/2604.21138
Authors: Jiabao Ji, Yongchao Chen, Yang Zhang, Ramana Rao Kompella, Chuchu Fan, Gaowen Liu, Shiyu Chang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multi-robot control in cluttered environments is a challenging problem that involves complex physical constraints, including robot-robot collisions, robot-obstacle collisions, and unreachable motions. Successful planning in such settings requires joint optimization over high-level task planning and low-level motion planning, as violations of physical constraints may arise from failures at either level. However, jointly optimizing task and motion planning is difficult due to the complex parameterization of low-level motion trajectories and the ambiguity of credit assignment across the two planning levels. In this paper, we propose a hybrid multi-robot control framework that jointly optimizes task and motion planning. To enable effective parameterization of low-level planning, we introduce waypoints, a simple yet expressive representation for motion trajectories. To address the credit assignment challenge, we adopt a curriculum-based training strategy with a modified RLVR algorithm that propagates motion feasibility feedback from the motion planner to the task planner. Experiments on BoxNet3D-OBS, a challenging multi-robot benchmark with dense obstacles and up to nine robots, show that our approach consistently improves task success over motion-agnostic and VLA-based baselines. Our code is available at this https URL

[AI-61] AI Governance under Political Turnover: The Alignment Surface of Compliance Design

Quick read: This paper asks how generative AI can be introduced into public administration safely and lawfully to improve decision-making efficiency while preventing strategic exploitation from within government, so that decisions remain reviewable, repeatable, and legally defensible. The key lies in the embedded compliance layer: by setting the scale of automation, the degree of codification, and safeguards on iterative use, it makes AI-assisted decision processes transparent and controllable; yet the same design can let political successors learn to navigate the approval boundary while preserving an appearance of lawful administration, making misuse of AI more covert and persistent.

Link: https://arxiv.org/abs/2604.21103
Authors: Andrew J. Peterson
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:Governments are increasingly interested in using AI to make administrative decisions cheaper, more scalable, and more consistent. But for probabilistic AI to be incorporated into public administration it must be embedded in a compliance layer that makes decisions reviewable, repeatable, and legally defensible. That layer can improve oversight by making departures from law easier to detect. But it can also create a stable approval boundary that political successors learn to navigate while preserving the appearance of lawful administration. We develop a formal model in which institutions choose the scale of automation, the degree of codification, and safeguards on iterative use. The model shows when these systems become vulnerable to strategic use from within government, why reforms that initially improve oversight can later increase that vulnerability, and why expansions in AI use may be difficult to unwind. Making AI usable can thus make procedures easier for future governments to learn and exploit.

[AI-62] TRAVELFRAUDBENCH: A Configurable Evaluation Framework for GNN Fraud Ring Detection in Travel Networks

Quick read: This paper addresses the lack of structural diversity in existing GNN benchmarks for fraud-ring detection, which fail to cover fraud patterns specific to travel platforms: benchmarks such as YelpChi and Amazon-Fraud involve single node types or domain-generic patterns and cannot evaluate models across structurally distinct ring topologies. The key is TravelFraudBench (TFG), a configurable heterogeneous-graph benchmark simulating three travel-specific ring types: ticketing fraud (star topology), ghost hotel schemes (reviewer x hotel bipartite cliques), and account takeover rings (loyalty transfer chains), with configurable node counts (500-200,000), fraud rates, ring counts, and composition. TFG uses a ring-based split to avoid label leakage, and comparisons across methods confirm that graph structure adds substantial discriminative power, with GraphSAGE reaching 100% ring recovery and highlighting the importance of edge types such as device/IP co-occurrence.

Link: https://arxiv.org/abs/2604.21093
Authors: Bhavana Sajja
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce TravelFraudBench (TFG), a configurable benchmark for evaluating graph neural networks (GNNs) on fraud ring detection in travel platform graphs. Existing benchmarks–YelpChi, Amazon-Fraud, Elliptic, PaySim–cover single node types or domain-generic patterns with no mechanism to evaluate across structurally distinct fraud ring topologies. TFG simulates three travel-specific ring types–ticketing fraud (star topology with shared device/IP clusters), ghost hotel schemes (reviewer x hotel bipartite cliques), and account takeover rings (loyalty transfer chains)–in a heterogeneous graph with 9 node types and 12 edge types. Ring size, count, fraud rate, scale (500 to 200,000 nodes), and composition are fully configurable. We evaluate six methods–MLP, GraphSAGE, RGCN-proj, HAN, RGCN, and PC-GNN–under a ring-based split where each ring appears entirely in one partition, eliminating transductive label leakage. GraphSAGE achieves AUC=0.992 and RGCN-proj AUC=0.987, outperforming the MLP baseline (AUC=0.938) by 5.5 and 5.0 pp, confirming graph structure adds substantial discriminative power. HAN (AUC=0.935) is a negative result, matching the MLP baseline. On the ring recovery task (=80% of ring members flagged simultaneously), GraphSAGE achieves 100% recovery across all ring types; MLP recovers only 17-88%. The edge-type ablation shows device and IP co-occurrence are the primary signals: removing uses_device drops AUC by 5.2 pp. TFG is released as an open-source Python package (MIT license) with PyG, DGL, and NetworkX exporters and pre-generated datasets at this https URL, with Croissant metadata including Responsible AI fields.
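The ring-recovery criterion the abstract defines (a ring counts as recovered when at least 80% of its members are flagged simultaneously) can be sketched directly; the function name and data layout below are ours, not the benchmark's API:

```python
def ring_recovery_rate(rings, flagged, threshold=0.8):
    """rings: list of sets of node ids; flagged: set of node ids the
    detector marked as fraudulent. A ring is 'recovered' when the flagged
    fraction of its members meets the threshold."""
    recovered = sum(
        1 for ring in rings
        if len(ring & flagged) / len(ring) >= threshold
    )
    return recovered / len(rings)

rings = [{1, 2, 3, 4, 5}, {6, 7, 8}, {9, 10}]
flagged = {1, 2, 3, 4, 7, 8, 9}
# Flagged fractions: 4/5, 2/3, 1/2 -> only the first ring clears 80%.
print(ring_recovery_rate(rings, flagged))
```

This illustrates why the metric is stricter than node-level recall: a detector can flag many fraudulent nodes yet recover few complete rings.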

[AI-63] Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLM s

Quick read: This paper addresses quality and reliability problems that arise when integrating LLMs into complex software systems to generate understandable explanations (e.g., for automated task planning), rooted in the lack of a systematic understanding of how different stakeholders formulate and refine prompts. The key is COMPASS (COgnitive Modelling for Prompt Automated SynthesiS), a self-adaptive approach that formalises prompt engineering as a cognitive and probabilistic decision process modelled as a Partially Observable Markov Decision Process (POMDP): by modelling users' latent cognitive states (e.g., attention, comprehension, uncertainty) together with observable interaction cues, it adapts explanation generation and prompt refinement dynamically, integrating human cognition and user feedback into automated prompt synthesis.

Link: https://arxiv.org/abs/2604.21092
Authors: Gricel Vázquez, Alexandros Evangelidis, Sepeedeh Shahbeigi, Radu Calinescu, Simos Gerasimou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Integrating Large Language Models (LLMs) into complex software systems enables the generation of human-understandable explanations of opaque AI processes, such as automated task planning. However, the quality and reliability of these explanations heavily depend on effective prompt engineering. The lack of a systematic understanding of how diverse stakeholder groups formulate and refine prompts hinders the development of tools that can automate this process. We introduce COMPASS (COgnitive Modelling for Prompt Automated SynthesiS), a proof-of-concept self-adaptive approach that formalises prompt engineering as a cognitive and probabilistic decision-making process. COMPASS models unobservable users’ latent cognitive states, such as attention and comprehension, uncertainty, and observable interaction cues as a POMDP, whose synthesised policy enables adaptive generation of explanations and prompt refinements. We evaluate COMPASS using two diverse cyber-physical system case studies to assess the adaptive explanation generation and their qualities, both quantitatively and qualitatively. Our results demonstrate the feasibility of COMPASS integrating human cognition and user profile’s feedback into automated prompt synthesis in complex task planning systems.

[AI-64] Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework

Quick read: This paper addresses the absence of a systematic framework for evaluating governance prompts, the natural-language instructions used to constrain and direct AI agent behaviour, i.e., how to judge whether such a prompt is structurally complete. The key is a five-principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology; applying it to 34 publicly available GitHub governance files shows that 37% of file-model pairs fall below the structural-completeness threshold, with data classification and assessment rubric criteria the most frequently missing elements. The framework lays theoretical and practical groundwork for automated static-analysis tools to detect and remediate deficiencies in governance prompts.

Link: https://arxiv.org/abs/2604.21090
Authors: Christo Zietsman
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages. Experiment, corpus, and evaluation framework publicly available at this https URL

Click to view abstract

Abstract:AI governance programmes increasingly rely on natural language prompts to constrain and direct AI agent behaviour. These prompts function as executable specifications: they define the agent’s mandate, scope, and quality criteria. Despite this role, no systematic framework exists for evaluating whether a governance prompt is structurally complete. We introduce a five-principle evaluation framework grounded in computability theory, proof theory, and Bayesian epistemology, and apply it to an empirical corpus of 34 publicly available this http URL governance files sourced from GitHub. Our evaluation reveals that 37% of evaluated file-model pairs score below the structural completeness threshold, with data classification and assessment rubric criteria most frequently absent. These results suggest that practitioner-authored governance prompts exhibit consistent structural patterns that automated static analysis could detect and remediate. We discuss implications for requirements engineering practice in AI-assisted development contexts, identify a previously undocumented artefact classification gap in the this http URL convention, and propose directions for tool support.

[AI-65] Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways

Quick read: This paper addresses the lack of behavioral consistency and operational transparency in third-party Large Language Model (LLM) API gateways: users cannot verify whether requests are served by the advertised model, whether responses remain faithful to upstream API specifications, whether billing matches public pricing, and whether latency is stable. The key is GateScope, a lightweight black-box measurement framework that audits gateway behavior along four dimensions (response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics) to detect key misbehaviors including model downgrading or switching, silent truncation, billing deviations, and latency instability.

Link: https://arxiv.org/abs/2604.21083
Authors: Guanjie Lin, Yinxin Wan, Shichao Pei, Ting Xu, Kuai Xu, Guoliang Xue
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Software Engineering (cs.SE)
Comments: 11 pages. Initially submitted to IMC 2026 Cycle 1 on November 20, 2025; accepted on March 13, 2026. To appear in Proceedings of the 2026 ACM Internet Measurement Conference (IMC '26)

Click to view abstract

Abstract:Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.

[AI-66] InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language

Quick read: This paper addresses the standardization and automation of describing embryo morphology and developmental stage in in-vitro fertilization (IVF), where traditional approaches depend on large annotated datasets and under-exploit the multimodal nature of IVF data. The key is fine-tuning foundational vision-language models: with only 1,000 embryo images and corresponding natural-language captions, the resulting model, InVitroVision, substantially outperforms a commercial model (ChatGPT 5.2) and base models at predicting embryo developmental characteristics, demonstrating strong generalization in low-data regimes and opening a path toward LLM-based knowledge retrieval and few-shot transfer to other IVF tasks.

Link: https://arxiv.org/abs/2604.21061
Authors: Nicklas Neu, Thomas Ebner, Jasmin Primus, Raphael Zefferer, Bernhard Schenkenfelder, Mathias Brunbauer, Florian Kromp
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures

Click to view abstract

Abstract:The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.

[AI-67] Active Data

Quick read: This paper addresses the computational and conceptual complexity of processing complex datasets, particularly in high-complexity domains such as air traffic flow management where monolithic designs hinder comprehension and specification. The key is the notion of Active Data: treating data as atomic objects capable of autonomous interaction, a bottom-up approach that improves a system's ability to confront complexity and makes designs more comprehensible and tractable.

Link: https://arxiv.org/abs/2604.21044
Authors: Richard Arthur, Virginia DiDomizio, Louis Hoebel
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, 2 tables

Click to view abstract

Abstract:In some complex domains, certain problem-specific decompositions can provide advantages over monolithic designs by enabling comprehension and specification of the design. In this paper we present an intuitive and tractable approach to reasoning over large and complex data sets. Our approach is based on Active Data, i.e., data as atomic objects that actively interact with environments. We describe our intuition about how this bottom-up approach improves designs confronting computational and conceptual complexity. We describe an implementation of the base Active Data concepts within the air traffic flow management domain and discuss performance for this implementation.

[AI-68] Strategic Polysemy in AI Discourse: A Philosophical Analysis of Language Hype and Power

Quick read: This paper examines how metaphorical or colloquial terms pervasive in contemporary AI discourse, such as "hallucination", "chain-of-thought", "introspection", "language model", "alignment", and "agent", shape the understanding, governance, and social acceptance of AI systems through strategic polysemy. The key contribution is the concept of glosslighting: redefining technical terms so they evoke intuitive, often anthropomorphic associations while retaining restricted technical definitions that preserve plausible deniability. This linguistic strategy lets actors exploit the persuasive force of familiar vocabulary to mobilise investment and institutional support, retreating to narrow definitions when challenged, thereby fuelling AI hype cycles, shaping public and policy perceptions, and deflecting ethical and epistemic scrutiny.

Link: https://arxiv.org/abs/2604.21043
Authors: Travis LaCroix, Fintan Mallory, Sasha Luccioni
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in the Ninth Annual ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2026

Click to view abstract

Abstract:This paper examines the strategic use of language in contemporary artificial intelligence (AI) discourse, focusing on the widespread adoption of metaphorical or colloquial terms like “hallucination”, “chain-of-thought”, “introspection”, “language model”, “alignment”, and “agent”. We argue that many such terms exhibit strategic polysemy: they sustain multiple interpretations simultaneously, combining narrow technical definitions with broader anthropomorphic or common-sense associations. In contemporary AI research and deployment contexts, this semantic flexibility produces significant institutional and discursive effects, shaping how AI systems are understood by researchers, policymakers, funders, and the public. To analyse this phenomenon, we introduce the concept of glosslighting: the practice of using technically redefined terms to evoke intuitive – often anthropomorphic or misleading – associations while preserving plausible deniability through restricted technical definitions. Glosslighting enables actors to benefit from the persuasive force of familiar language while maintaining the ability to retreat to narrower definitions when challenged. We argue that this practice contributes to AI hype cycles, facilitates the mobilisation of investment and institutional support, and influences public and policy perceptions of AI systems, while often deflecting epistemic and ethical scrutiny. By examining the linguistic dynamics of glosslighting and strategic polysemy, the paper highlights how language itself functions as a sociotechnical mechanism shaping the development and governance of AI.

[AI-69] Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

Quick read: This paper addresses demographic bias in text-to-image (T2I) generation: occupation prompts yield systematically unequal skin-tone distributions, with high-status roles like "doctor" or "CEO" skewing toward lighter-skinned outputs while low-status roles like "janitor" show more diversity, reinforcing stereotypes. The key is a lightweight, inference-time framework that intervenes at the prompt level: users choose among multiple fairness specifications, ranging from a simple uniform distribution to more complex, LLM-informed definitions with cited sources and confidence estimates, which then guide the construction of demographic-specific prompt variants in the target proportions, steering the generated skin-tone distribution toward the user's intent without retraining the model or curating datasets. Fairness interventions thereby become transparent, controllable, and actionable at inference time.

Link: https://arxiv.org/abs/2604.21036
Authors: Marzia Binta Nizam, James Davis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Text-to-image(T2I) models like Stable Diffusion and DALL-E have made generative AI widely accessible, yet recent studies reveal that these systems often replicate societal biases, particularly in how they depict demographic groups across professions. Prompts such as ‘doctor’ or ‘CEO’ frequently yield lighter-skinned outputs, while lower-status roles like ‘janitor’ show more diversity, reinforcing stereotypes. Existing mitigation methods typically require retraining or curated datasets, making them inaccessible to most users. We propose a lightweight, inference-time framework that mitigates representational bias through prompt-level intervention without modifying the underlying model. Instead of assuming a single definition of fairness, our approach allows users to select among multiple fairness specifications-ranging from simple choices such as a uniform distribution to more complex definitions informed by a large language model(LLM) that cites sources and provides confidence estimates. These distributions guide the construction of demographic specific prompt variants in the corresponding proportions, and we evaluate alignment by auditing adherence to the declared target and measuring the resulting skin tone distribution rather than assuming uniformity as ‘fairness’. Across 36 prompts spanning 30 occupations and 6 non-occupational contexts, our method shifts observed skin-tone outcomes in directions consistent with the declared target, and reduces deviation from targets when the target is defined directly in skin-tone space(fallback). This work demonstrates how fairness interventions can be made transparent, controllable, and usable at inference time, directly empowering users of generative AI.
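The prompt-level intervention the abstract describes, expanding one base prompt into demographic-specific variants in declared target proportions, can be sketched minimally. This is our illustration of the idea (function name, attribute phrases, and rounding policy are assumptions, not the authors' code):

```python
def expand_prompt(base_prompt, target_dist, n_images):
    """Emit one prompt per image to generate, with demographic attributes
    appended in proportions matching the declared fairness target."""
    variants = []
    for attribute, share in target_dist.items():
        count = round(share * n_images)
        variants += [f"{base_prompt}, {attribute}"] * count
    return variants[:n_images]  # trim any rounding overshoot

target = {"light skin tone": 0.25,
          "medium skin tone": 0.25,
          "dark skin tone": 0.5}
prompts = expand_prompt("a portrait photo of a doctor", target, 8)
for p in prompts:
    print(p)
```

Auditing then means comparing the skin-tone distribution of the generated batch against `target`, rather than assuming uniformity is the only valid notion of fairness.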

[AI-70] Synthetic Data in Education: Empirical Insights from Traditional Resampling and Deep Generative Models

Quick read: This paper addresses the tension between data scarcity and privacy protection in educational technology: when generating synthetic data, how to preserve predictive utility while guaranteeing privacy. The key is a systematic empirical comparison of traditional resampling methods (SMOTE, Bootstrap, etc.) against modern deep generative models (Variational Autoencoders, Copula-GAN, etc.) across distributional fidelity, machine-learning utility (Train-on-Synthetic-Test-on-Real, TSTR), and privacy (Distance to Closest Record, DCR), revealing a fundamental trade-off: resampling achieves near-perfect utility (TSTR: 0.997) but offers no privacy protection (DCR ~ 0.00), whereas deep generative models provide strong privacy (DCR ~ 1.00) at a substantial utility cost. Variational Autoencoders (VAEs) emerge as the best compromise, retaining 83.3% of the original predictive performance with full privacy protection, yielding a clear decision framework for synthetic-data generation across scenarios.

Link: https://arxiv.org/abs/2604.21031
Authors: Tapiwa Amion Chinodakufa, Ashfaq Ali Shafin, Khandaker Mamun Ahmed
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Synthetic data generation offers promise for addressing data scarcity and privacy concerns in educational technology, yet practitioners lack empirical guidance for selecting between traditional resampling techniques and modern deep learning approaches. This study presents the first systematic benchmark comparing these paradigms using a 10,000-record student performance dataset. We evaluate three resampling methods (SMOTE, Bootstrap, Random Oversampling) against three deep learning models (Autoencoder, Variational Autoencoder, Copula-GAN) across multiple dimensions: distributional fidelity (Kolmogorov-Smirnov distance, Jensen-Shannon divergence), machine learning utility such as Train-on-Synthetic-Test-on-Real scores (TSTR), and privacy preservation (Distance to Closest Record). Our findings reveal a fundamental trade-off: resampling methods achieve near-perfect utility (TSTR: 0.997) but completely fail privacy protection (DCR ~ 0.00), while deep learning models provide strong privacy guarantees (DCR ~ 1.00) at significant utility cost. Variational Autoencoders emerge as the optimal compromise, maintaining 83.3% predictive performance while ensuring complete privacy protection. We also provide actionable recommendations: use traditional resampling for internal development where privacy is controlled, and VAEs for external data sharing where privacy is paramount. This work establishes a foundational benchmark and practical decision framework for synthetic data generation in learning analytics.
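摘要中的隐私指标 DCR(Distance to Closest Record)可以用几行代码直观演示:重采样方法会原样返回真实记录,DCR 为 0;生成模型的样本与真实记录保持距离,DCR > 0。以下为笔者的玩具示意(未做特征归一化,非论文官方实现):

```python
import math

def dcr(synthetic, real):
    """每条合成记录到最近真实记录的欧氏距离(Distance to Closest Record)。"""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(dist(s, r) for r in real) for s in synthetic]

real = [(0.0, 0.0), (1.0, 1.0)]
bootstrap_copy = [(0.0, 0.0)]   # 重采样:原样返回真实记录,DCR = 0,无隐私保护
generated = [(0.5, 0.8)]        # 生成模型样本:与真实记录保持距离
```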

[AI-71] A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 与模型预测控制(Model Predictive Control, MPC)融合研究中存在文献碎片化的问题,尤其聚焦于基于线性或线性化预测模型的控制架构。其解决方案的关键在于构建一个系统性的文献综述(Systematic Literature Review, SLR),通过多维分类体系(涵盖强化学习的功能角色、算法类别、MPC公式化形式、代价函数结构及应用领域)对截至2025年已发表的同行评审研究进行结构化整理与交叉维度分析,从而识别出设计模式、集成策略及共性挑战(如计算负担、样本效率、鲁棒性和闭环保证),为研究人员和实践者提供清晰、可参考的理论框架与实证依据。

链接: https://arxiv.org/abs/2604.21030
作者: Mohsen Jalaeian Farimani,Roya Khalili Amirabadi,Davoud Nikkhouy,Malihe Abdolbaghi,Mahshad Rastegarmoghaddam,Shima Samadzadeh
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The integration of Model Predictive Control (MPC) and Reinforcement Learning (RL) has emerged as a promising paradigm for constrained decision-making and adaptive control. MPC offers structured optimization, explicit constraint handling, and established stability tools, whereas RL provides data-driven adaptation and performance improvement in the presence of uncertainty and model mismatch. Despite the rapid growth of research on RL–MPC integration, the literature remains fragmented, particularly for control architectures built on linear or linearized predictive models. This paper presents a comprehensive Systematic Literature Review (SLR) of RL–MPC integrations for linear and linearized systems, covering peer-reviewed and formally indexed studies published until 2025. The reviewed studies are organized through a multi-dimensional taxonomy covering RL functional roles, RL algorithm classes, MPC formulations, cost-function structures, and application domains. In addition, a cross-dimensional synthesis is conducted to identify recurring design patterns and reported associations among these dimensions within the reviewed corpus. The review highlights methodological trends, commonly adopted integration strategies, and recurring practical challenges, including computational burden, sample efficiency, robustness, and closed-loop guarantees. The resulting synthesis provides a structured reference for researchers and practitioners seeking to design or analyze RL–MPC architectures based on linear or linearized predictive control formulations.

[AI-72] HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering ACL2026

【速读】:该论文旨在解决电子健康记录(Electronic Health Record, EHR)问答任务中现有基于大语言模型(Large Language Model, LLM)的方法存在部署成本高、且未显式利用临床数据层次结构的问题。其解决方案的关键在于提出一种紧凑的洛伦兹空间(Lorentzian space)嵌入模型 HypEHR,该模型将疾病编码、就诊记录和问题统一嵌入到双曲空间(hyperbolic space)中,并通过类型特定指针头(type-specific pointer heads)实现几何一致的交叉注意力机制,从而显式建模医疗本体(如 ICD 分类体系)的层级关系。预训练阶段采用下一就诊诊断预测与层次感知正则化策略,使表征对齐于医学知识图谱结构,在两个基于 MIMIC-IV 的 EHR-QA 基准上实现了接近 LLM 方法的性能,同时显著减少参数量。

链接: https://arxiv.org/abs/2604.21027
作者: Yuyu Liu,Sarang Rajendra Patil,Mengjia Xu,Tengfei Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Findings of ACL 2026

点击查看摘要

Abstract:Electronic health record (EHR) question answering is often handled by LLM-based pipelines that are costly to deploy and do not explicitly leverage the hierarchical structure of clinical data. Motivated by evidence that medical ontologies and patient trajectories exhibit hyperbolic geometry, we propose HypEHR, a compact Lorentzian model that embeds codes, visits, and questions in hyperbolic space and answers queries via geometry-consistent cross-attention with type-specific pointer heads. HypEHR is pretrained with next-visit diagnosis prediction and hierarchy-aware regularization to align representations with the ICD ontology. On two MIMIC-IV-based EHR-QA benchmarks, HypEHR approaches LLM-based methods while using far fewer parameters. Our code is publicly available at this https URL.
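摘要提到的洛伦兹(双曲面模型)几何可以最小化地示意如下:双曲面上的点满足 ⟨x,x⟩_L = -1,其中 ⟨x,y⟩_L = -x₀y₀ + Σxᵢyᵢ,两点距离为 arccosh(-⟨x,y⟩_L)。以下代码为笔者示意,嵌入值随意取,并非 HypEHR 的实现:

```python
import math

def lorentz_inner(x, y):
    """洛伦兹内积:时间分量取负号。"""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(v):
    """把欧氏向量 v 提升到双曲面上:x0 = sqrt(1 + |v|^2),保证 <x,x>_L = -1。"""
    x0 = math.sqrt(1.0 + sum(a * a for a in v))
    return (x0, *v)

def hyp_dist(x, y):
    # max(1.0, ...) 防止浮点误差使 acosh 的自变量略小于 1
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```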

[AI-73] Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

【速读】:该论文旨在解决测试时计算资源分配效率低的问题,即现有方法要么采用静态的计算分配策略,要么从固定的生成分布中采样,导致无法根据具体任务动态优化计算资源的使用和生成质量。其解决方案的关键在于提出一种联合适应计算分配与生成过程的框架:首先通过预热阶段识别简单查询并构建初始问题-响应对池;随后在自适应阶段将计算集中在未解决的查询上,并通过演化式上下文示例重塑生成分布——即基于语义相关的查询的成功响应来条件化当前生成,而非从固定分布中重采样。这种方法在数学、编程和推理基准上均显著优于现有基线,且大幅降低推理时计算消耗。

链接: https://arxiv.org/abs/2604.21018
作者: Bowen Zuo,Dongruo Zhou,Yinglun Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While scaling test-time compute can substantially improve model performance, existing approaches either rely on static compute allocation or sample from fixed generation distributions. In this work, we introduce a test-time compute allocation framework that jointly adapts where computation is spent and how generation is performed. Our method begins with a warm-up phase that identifies easy queries and assembles an initial pool of question-response pairs from the test set itself. An adaptive phase then concentrates further computation on unresolved queries while reshaping their generation distributions through evolving in-context demonstrations – conditioning each generation on successful responses from semantically related queries rather than resampling from a fixed distribution. Experiments across math, coding, and reasoning benchmarks demonstrate that our approach consistently outperforms existing baselines while consuming substantially less inference-time compute.
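该“预热 + 自适应”两阶段分配流程可以用如下骨架示意(`solve` 与 `related` 分别是模型调用与语义检索的占位函数,均为笔者虚构,并非论文接口):

```python
def allocate(queries, solve, related, budget=4):
    """预热阶段每个查询尝试一次;自适应阶段只对未解决查询追加计算,
    并以语义相关、已解决查询的成功回答作为演化中的上下文示例。"""
    solved = {}
    for q in queries:                        # 预热:零示例各试一次
        ans = solve(q, demos=[])
        if ans is not None:
            solved[q] = ans
    for q in [q for q in queries if q not in solved]:
        for _ in range(budget):              # 自适应:集中预算于难题
            demos = [(r, solved[r]) for r in related(q) if r in solved]
            ans = solve(q, demos=demos)
            if ans is not None:
                solved[q] = ans
                break
    return solved
```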

[AI-74] Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

【速读】:该论文旨在解决自主医疗机器人领域长期存在的数据瓶颈问题,即现有医疗机器人数据集规模小、仅限单一机器人形态且极少开放共享,严重制约了基础模型(foundation models)的发展与应用。解决方案的关键在于构建并发布Open-H-Embodiment——目前最大的开源医疗机器人视频同步运动学数据集,涵盖49家机构及多种手术机器人平台(如da Vinci、Versius等),覆盖外科操作、机器人超声和内镜等多种任务场景。基于此数据集,研究团队开发了两个开创性基础模型:GR00T-H(首个开源的医学机器人视觉-语言-动作基础模型)和Cosmos-H-Surgical-Simulator(首个动作条件的世界模型),分别实现了端到端缝合任务的成功率突破和跨九种机器人平台的仿真与策略评估能力,验证了大规模开放数据作为科研基础设施对推动机器人学习与世界建模等方向的重要价值。

链接: https://arxiv.org/abs/2604.21017
作者: Open-H-Embodiment Consortium:Nigel Nelson,Juo-Tung Chen,Jesse Haworth,Xinhao Chen,Lukas Zbinden,Dianye Huang,Alaa Eldin Abdelaal,Alberto Arezzo,Ayberk Acar,Farshid Alambeigi,Carlo Alberto Ammirati,Yunke Ao,Pablo David Aranda Rodriguez,Soofiyan Atar,Mattia Ballo,Noah Barnes,Federica Barontini,Filip Binkiewicz,Peter Black,Sebastian Bodenstedt,Leonardo Borgioli,Nikola Budjak,Benjamin Calmé,Fabio Carrillo,Nicola Cavalcanti,Changwei Chen,Haoxin Chen,Sihang Chen,Qihan Chen,Zhongyu Chen,Ziyang Chen,Shing Shin Cheng,Meiqing Cheng,Min Cheng,Zih-Yun Sarah Chiu,Xiangyu Chu,Camilo Correa-Gallego,Giulio Dagnino,Anton Deguet,Jacob Delgado,Jonathan C. DeLong,Kaizhong Deng,Alexander Dimitrakakis,Qingpeng Ding,Hao Ding,Giovanni Distefano,Daniel Donoho,Anqing Duan,Marco Esposito,Shane Farritor,Jad Fayad,Zahi Fayad,Mario Ferradosa,Filippo Filicori,Chelsea Finn,Philipp Fürnstahl,Jiawei Ge,Stamatia Giannarou,Xavier Giralt Ludevid,Frederic Giraud,Aditya Amit Godbole,Ken Goldberg,Antony Goldenberg,Diego Granero Marana,Xiaoqing Guo,Tamás Haidegger,Evan Hailey,Pascal Hansen,Ziyi Hao,Kush Hari,Kengo Hayashi,Jonathon Hawkins,Shelby Haworth,Ortrun Hellig,S. Duke Herrell,Zhouyang Hong,Andrew Howe,Junlei Hu,Ria Jain,Mohammad Rafiee Javazm,Howard Ji,Rui Ji,Jianmin Ji,Zhongliang Jiang,Dominic Jones,Jeffrey Jopling,Britton Jordan,Ran Ju,Michael Kam,Luoyao Kang,Fausto Kang,Siddhartha Kapuria,Peter Kazanzides,Sonika Kiehler,Ethan Kilmer,Ji Woong(Brian)Kim,Przemysław Korzeniowski,Chandra Kuchi,Nithesh Kumar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical’s da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision’s MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

[AI-75] SGD at the Edge of Stability: The Stochastic Sharpness Gap

【速读】:该论文旨在解决小批量随机梯度下降(SGD)中损失函数曲率(即尖锐度,Sharpness)稳定在低于全批量梯度下降(GD)所达到的 2/\eta 水平的现象——这一现象在实践中被观察到但缺乏理论解释。其解决方案的关键在于提出“随机自稳定机制”(stochastic self-stabilization),该机制通过引入梯度噪声对顶部Hessian特征向量方向上的振荡动力学施加方差,从而增强由三阶损失结构产生的尖锐度降低力,使平衡点下移至 2/\eta 以下。作者基于 Damian 等人(damian2023selfstab)的框架,定义了相对于移动投影梯度下降轨迹的随机预测动力学,并证明了一个随机耦合定理以控制 SGD 与预测轨迹之间的偏差,最终推导出一个闭式解的尖锐度差距公式 \Delta S = \eta\beta\sigma_{\boldsymbol{u}}^2/(4\alpha),其中 \sigma_{\boldsymbol{u}}^2 表示梯度噪声在顶部特征向量上的投影方差,表明更小的批次大小会导致更平坦的解。

链接: https://arxiv.org/abs/2604.21016
作者: Fangshuo Liao,Afroditi Kolomvaki,Anastasios Kyrillidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:When training neural networks with full-batch gradient descent (GD) and step size \eta , the largest eigenvalue of the Hessian – the sharpness S(\boldsymbol{\theta}) – rises to 2/\eta and hovers there, a phenomenon termed the Edge of Stability (EoS). \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint S(\boldsymbol{\theta}) \leq 2/\eta . For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below 2/\eta , with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below 2/\eta . Following the approach of \citet{damian2023selfstab}, we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: \Delta S = \eta\beta\sigma_{\boldsymbol{u}}^2/(4\alpha) , where \alpha is the progressive sharpening rate, \beta is the self-stabilization strength, and \sigma_{\boldsymbol{u}}^2 is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
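摘要给出的闭式差距公式可以直接数值演算:\Delta S = \eta\beta\sigma_u^2/(4\alpha),对应的平衡尖锐度为 2/\eta - \Delta S;噪声为零(全批量)时差距消失,退化为 GD 的 2/\eta。以下参数取值为笔者随意设定,仅作演示:

```python
def sharpness_gap(eta, beta, sigma_u_sq, alpha):
    """摘要中的闭式尖锐度差距 Delta_S = eta*beta*sigma_u^2 / (4*alpha)。"""
    return eta * beta * sigma_u_sq / (4.0 * alpha)

def equilibrium_sharpness(eta, beta, sigma_u_sq, alpha):
    """SGD 的平衡尖锐度:EoS 水平 2/eta 减去噪声导致的差距。"""
    return 2.0 / eta - sharpness_gap(eta, beta, sigma_u_sq, alpha)
```

可以看到梯度噪声方差 `sigma_u_sq` 越大(批次越小),平衡尖锐度越低,即解越平坦。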

[AI-76] Deep FinResearch Bench: Evaluating AIs Ability to Conduct Professional Financial Investment Research

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在金融投资研究领域中缺乏系统性、可量化评估标准的问题。现有方法难以客观衡量深度研究(Deep Research, DR)代理在报告质量上的表现,导致模型改进方向不明确。解决方案的关键在于提出 Deep FinResearch Bench——一个涵盖定性严谨性、定量预测与估值准确性、以及主张可信度与可验证性的三维度综合评估框架,并设计自动化评分机制以实现规模化、标准化的评估流程。该框架通过对比前沿 DR 代理生成的金融报告与专业从业者撰写的报告,揭示了当前 AI 输出在多个关键指标上的不足,从而为开发面向金融领域的专业化 DR 代理提供了基准和改进依据。

链接: https://arxiv.org/abs/2604.21006
作者: Mirazul Haque,Antony Papadimitriou,Samuel Mensah,Zhiqiang Ma,Zhijin Guo,Joy Prakash Sain,Simerjot Kaur,Charese Smiley,Xiaomo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Deep FinResearch Bench, a practical and comprehensive evaluation framework for deep research (DR) agents in financial investment research. The benchmark assesses three dimensions of report quality: qualitative rigor, quantitative forecasting and valuation accuracy, and claim credibility and verifiability. Particularly, we define corresponding qualitative and quantitative evaluation metrics and implement an automated scoring procedure to enable scalable assessment. Applying the benchmark to financial reports from frontier DR agents and comparing them with reports authored by financial professionals, we find that AI-generated reports still fall short across these dimensions. These findings underscore the need for domain-specialized DR agents tailored to finance, and we hope the work establishes a foundation for standardized benchmarking of DR agents in financial research.

[AI-77] The Last Harness You'll Ever Build

【速读】:该论文旨在解决当前AI代理(AI agent)在部署到复杂、领域特定工作流时所面临的“harness engineering”难题,即每个新任务域都需要人工精心设计提示(prompt)、工具调用逻辑、编排策略和评估标准,这一过程耗时且依赖专家经验。解决方案的关键在于提出一个两层自动化框架:第一层为“Harness Evolution Loop”,通过工作者代理(Worker Agent)、评估者代理(Evaluator Agent)和进化代理(Evolution Agent)的协同迭代优化单个任务的harness;第二层为“Meta-Evolution Loop”,进一步优化进化协议本身(\Lambda),学习出一种通用的元进化策略 \Lambda^{\text{best}},使得AI代理能快速适应任意新任务而无需任何人工干预——从而将传统的人工harness工程转变为自动化的harness工程,并进一步实现对自动化机制本身的自动化设计。

链接: https://arxiv.org/abs/2604.21003
作者: Haebin Seong,Li Yin,Haoran Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents are increasingly deployed on complex, domain-specific workflows – navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the Harness Evolution Loop optimizes a worker agent's harness \mathcal{H} for a single task: a Worker Agent W_{\mathcal{H}} executes the task, an Evaluator Agent V adversarially diagnoses failures and scores performance, and an Evolution Agent E modifies the harness based on the full history of prior attempts. At the second level, the Meta-Evolution Loop optimizes the evolution protocol \Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E) itself across diverse tasks, learning a protocol \Lambda^{(\text{best})} that enables rapid harness convergence on any new task – so that adapting an agent to a novel domain requires no human harness engineering at all. We formalize the correspondence to meta-learning and present both algorithms. The framework shifts manual harness engineering into automated harness engineering, and takes one step further – automating the design of the automation itself.
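内层 Harness Evolution Loop(Worker 执行、Evaluator 评分并诊断、Evolution 依据完整历史修改 harness)可以用如下假设性 Python 骨架示意——三个可调用对象均为真实代理调用的占位,非论文代码:

```python
def evolve_harness(h0, worker, evaluator, evolver, rounds=5, target=1.0):
    """迭代优化单任务 harness:执行 -> 评估 -> 依据历史演化,直到达标或轮数耗尽。"""
    harness, history = h0, []
    for _ in range(rounds):
        attempt = worker(harness)                 # Worker Agent 在当前 harness 下执行任务
        score, diagnosis = evaluator(attempt)     # Evaluator Agent 评分并诊断失败原因
        history.append((harness, score, diagnosis))
        if score >= target:
            break
        harness = evolver(history)                # Evolution Agent 基于完整历史提出修改
    return harness, history
```

外层 Meta-Evolution Loop 则是在多个任务上对 (worker, h0, evaluator, evolver) 这个协议四元组本身做同样的迭代优化。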

[AI-78] Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长时交互环境中难以进行稳定、连贯的多步决策问题,尤其是在延迟奖励和部分可观测性条件下缺乏结构化技能的发现、存储与复用机制。解决方案的关键在于提出一个协同进化框架COSPLAY,其中LLM决策代理从可学习的技能库(skill bank)中检索技能以指导动作生成,同时由一个受控的技能流水线代理从无标签轨迹中自动发现、提炼并更新可复用的技能及其契约(contract),从而实现决策代理与技能库的双向优化与持续进化。

链接: https://arxiv.org/abs/2604.20987
作者: Xiyang Wu,Zongxia Li,Guangyao Shi,Alexander Duffy,Tyler Marques,Matthew Lyle Olson,Tianyi Zhou,Dinesh Manocha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

Abstract:Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability, which makes games a natural testbed for evaluating skill usage. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. Our framework improves the decision agent, which learns better skill retrieval and action generation, while the skill-bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.

[AI-79] Differentially Private Model Merging

【速读】:该论文旨在解决机器学习模型在推理或部署阶段面临动态隐私需求的问题,即当差分隐私(Differential Privacy, DP)要求因政策、法规或用户体验变化而不断调整时,如何在不重新训练模型的前提下生成满足任意目标DP参数的私有模型。解决方案的关键在于提出两种后处理技术:随机选择(random selection)和线性组合(linear combination),通过利用一组已在同一数据集上训练好的、具有不同隐私/效用权衡的模型,直接生成符合新隐私目标的最终模型。作者从Rényi差分隐私(Rényi Differential Privacy)和隐私损失分布(Privacy Loss Distributions)的角度提供了严格的隐私会计分析,并在私有均值估计的案例研究中理论证明了线性组合优于随机选择,同时在多个模型及合成与真实数据集上验证了方法的有效性和分析的准确性。

链接: https://arxiv.org/abs/2604.20985
作者: Qichuan Yin,Manzil Zaheer,Tian Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In machine learning applications, privacy requirements during inference or deployment time could change constantly due to varying policies, regulations, or user experience. In this work, we aim to generate a range of models satisfying any target differential privacy (DP) requirement without additional training steps, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linear combination, to output a final private model for any target privacy parameter. We provide privacy accounting of these approaches from the lens of Rényi DP and privacy loss distributions for general problems. In a case study on private mean estimation, we fully characterize the privacy/utility results and theoretically establish the superiority of linear combination over random selection. Empirically, we validate our approach and analyses on several models and both synthetic and real-world datasets.
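“线性组合”后处理在参数层面的操作本身很简单:对若干已训练模型的参数向量做凸组合。下面是笔者的玩具示意(所得模型的隐私预算需按论文的 Rényi DP 会计分析另行核算,此处不涉及):

```python
def linear_combine(models, weights):
    """models: 同维参数向量列表;weights: 非负且和为 1 的凸组合权重。
    返回逐坐标加权平均后的新参数向量。"""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    dim = len(models[0])
    return [sum(w * m[i] for w, m in zip(weights, models)) for i in range(dim)]
```

例如在一个高噪声(强隐私)模型和一个低噪声(高效用)模型之间插值,即可得到介于两者之间的隐私/效用折中。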

[AI-80] Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

【速读】:该论文旨在解决内容审核系统在规则驱动环境中的评估失效问题,即传统以人类标注一致性为标准的评估方法会错误地将符合政策逻辑的决策判定为错误(称为“一致性陷阱”),从而掩盖了真实治理复杂性。其解决方案的关键在于将评估范式从“与人工标签的一致性”转变为“基于政策的合理性”,通过引入可解释的审计模型来验证决策是否可从规则层级中逻辑推导得出,而非直接分类内容违规与否;并提出Defensibility Index(DI)和Ambiguity Index(AI)量化决策的可辩护性和模糊性,同时利用概率性可辩护信号(PDS)从语言模型的token对数似然中无额外审计成本估计推理稳定性,最终构建基于治理信号的Governance Gate实现高自动化覆盖率与风险降低。

链接: https://arxiv.org/abs/2604.20972
作者: Michael O’Herlihy,Rosa Català
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 22 pages, 10 figures, preprint. Research on Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) for policy-grounded evaluation of rule-governed AI in content moderation (Reddit production data)

点击查看摘要

Abstract:Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error - a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model’s false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
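基于可辩护性/模糊性信号的 Governance Gate 思路可以示意如下(阈值与字段名均为笔者虚构,仅用于说明“可自动化 vs. 升级人工复核”的门控逻辑,并非论文实现):

```python
def governance_gate(decisions, di_min=0.8, ai_max=0.3):
    """只有可辩护性(di)足够高且模糊性(ai)足够低的决策才自动放行,
    其余升级人工复核。decisions: [{"id": ..., "di": ..., "ai": ...}, ...]"""
    automated, escalated = [], []
    for d in decisions:
        bucket = automated if d["di"] >= di_min and d["ai"] <= ai_max else escalated
        bucket.append(d["id"])
    return automated, escalated
```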

[AI-81] LAF-Based Evaluation and UTTL-Based Learning Strategies with MIATTs

【速读】:该论文旨在解决现实世界机器学习(ML)应用中因目标定义模糊或主观性导致的“真实标签”无法精确确定的问题。传统方法假设存在客观的“真实目标”,但在许多场景下这一假设不成立,从而影响模型评估与学习的有效性。解决方案的关键在于提出EL-MIATTs(Evaluation and Learning with Multiple Inaccurate True Targets)框架,并通过两个互补机制实现:一是基于逻辑评估公式(LAF)的评估算法,用于在原始多不准确真值目标(MIATTs)或其合成的三元目标上进行逻辑一致且可解释的评估;二是基于不可定义真值学习(UTTL)的学习策略,结合Dice和交叉熵损失函数,支持逐目标与聚合优化两种训练方式。这两个机制共同构建了从逻辑语义到统计优化的桥梁,为在“真实标签”本质不确定的场景下开发可靠机器学习系统提供了理论基础与实践路径。

链接: https://arxiv.org/abs/2604.20944
作者: Yongquan Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In many real-world machine learning (ML) applications, the true target cannot be precisely defined due to ambiguous or subjective information. To address this challenge, under the assumption that the true target for a given ML task does not exist objectively in the real world, the EL-MIATTs (Evaluation and Learning with Multiple Inaccurate True Targets) framework has been proposed. Bridging theory and practice in implementing EL-MIATTs, in this paper, we develop two complementary mechanisms: LAF (Logical Assessment Formula)-based evaluation algorithms and UTTL (Undefinable True Target Learning)-based learning strategies with MIATTs, which together enable logically coherent and practically feasible modeling under uncertain supervision. We first analyze task-specific MIATTs, examining how their coverage and diversity determine their structural properties and influence downstream evaluation and learning. Based on this understanding, we formulate LAF-grounded evaluation algorithms that operate either on original MIATTs or on ternary targets synthesized from them, balancing interpretability, soundness, and completeness. For model training, we introduce UTTL-grounded learning strategies using Dice and cross-entropy loss functions, comparing per-target and aggregated optimization schemes. We also discuss how the integration of LAF and UTTL bridges the gap between logical semantics and statistical optimization. Together, these components provide a coherent pathway for implementing EL-MIATTs, offering a principled foundation for developing ML systems in scenarios where the notion of "ground truth" is inherently uncertain. An application of this work's results is presented as part of the study available at this https URL.

[AI-82] HARBOR: Automated Harness Optimization

【速读】:该论文旨在解决长时程语言模型智能体(long-horizon language-model agents)中“ harness”(即包裹底层模型的运行框架)设计复杂度高、配置依赖人工调优的问题。传统方法通过手动堆叠多种机制(如上下文压缩、工具缓存、语义记忆等)来提升性能,但随着配置空间规模扩大(flag space 超过少量比特),这种手动方式效率低下且难以优化。解决方案的关键在于将 harness 设计建模为一个带约束的噪声贝叶斯优化问题,其特征包括:混合变量配置空间、异质成本(cost-heterogeneous)、冷启动修正奖励(cold-start-corrected rewards)以及后验概率约束的安全检查(posterior chance-constrained safety check)。作者提出参考求解器 HARBOR,基于块可加代理模型(block-additive SAAS surrogate)、多保真度成本感知采集函数(multi-fidelity cost-aware acquisition)和 TuRBO 信任区域(TuRBO trust regions),实现了自动化的 harness 配置搜索,在生产级编码智能体上的案例研究表明,相比四轮人工调优,端到端的 HARBOR 方法显著提升了性能与稳定性。

链接: https://arxiv.org/abs/2604.20938
作者: Biswa Sengupta,Jinhua Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by their underlying model but by the harness that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that binds the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search dominates manual stacking once the flag space exceeds a handful of bits. We defend this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space with cold-start-corrected rewards and a posterior chance-constrained safety check, and give a reference solver, HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), built from a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite and an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.
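多保真度、成本感知采集的一个常见简化是按“预期提升 / 评估成本”对候选配置排序;下面的玩具示意仅体现这一启发式(真正的 HARBOR 求解器使用 SAAS 代理模型与 TuRBO 信任区域,此处不复现,函数名为笔者虚构):

```python
def pick_next(candidates, best_so_far):
    """candidates: (配置, 代理模型预测的奖励, 预测评估成本) 三元组列表。
    按单位成本的预期提升挑选下一个要评估的 harness 配置。"""
    def score(c):
        _, mu, cost = c
        return max(0.0, mu - best_so_far) / cost   # 提升为负时记 0,避免偏好昂贵的差配置
    return max(candidates, key=score)[0]
```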

[AI-83] Data-Driven Open-Loop Simulation for Digital-Twin Operator Decision Support in Wastewater Treatment

【速读】:该论文旨在解决污水处理厂(Wastewater Treatment Plant, WWTP)在数字孪生决策支持系统中面临的三大挑战:一是需能模拟在预定控制方案下的工艺响应;二是要具备对不规则和缺失传感数据的鲁棒性;三是要在12–36小时的规划时域内保持信息有效性。当前基于全规模工厂数据实现上述目标仍是工程与人工智能交叉领域的开放难题。解决方案的核心是提出CCSS-RS模型——一种受控连续时间状态空间模型,其关键创新包括:将历史状态推断与未来控制及外生变量滚动解耦、引入类型化上下文编码以增强表征能力、采用增益加权驱动机制处理预设控制信号与预测输入、通过半群一致性滚动保证长期稳定性,并结合学生t分布与零膨胀截断输出以适配WWTP传感器数据的重尾性和零值密集特性。实验表明,该方法在Avedøre公开基准上实现了显著优于神经微分方程基线模型的精度(RMSE降低40–46%),且具备实际运行场景中的操作价值,如扰动分析、多准则筛选和异常工况鲁棒性验证等。

链接: https://arxiv.org/abs/2604.20935
作者: Gary Simethy,Daniel Ortiz Arroyo,Petar Durdevic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures, 9 tables

点击查看摘要

Abstract:Wastewater treatment plants (WWTPs) need digital-twin-style decision support tools that can simulate plant response under prescribed control plans, tolerate irregular and missing sensing, and remain informative over 12-36 h planning horizons. Meeting these requirements with full-scale plant data remains an open engineering-AI challenge. We present CCSS-RS, a controlled continuous-time state-space model that separates historical state inference from future control and exogenous rollout. The model combines typed context encoding, gain-weighted forcing of prescribed and forecast drivers, semigroup-consistent rollouts, and Student-t plus hurdle outputs for heavy-tailed and zero-inflated WWTP sensor data. On the public Avedøre full-scale benchmark, with 906,815 timesteps, 43% missingness, and 1-20 min irregular sampling, CCSS-RS achieves RMSE 0.696 and CRPS 0.349 at H=1000 across 10,000 test windows. This reduces RMSE by 40-46% relative to Neural CDE baselines and by 31-35% relative to simplified internal variants. Four case studies using a frozen checkpoint on test data demonstrate operational value: oxygen-setpoint perturbations shift predicted ammonium by -2.3 to +1.4 over horizons 300-1000; a smoothed setpoint plan ranks first in multi-criterion screening; context-only sensor outages raise monitored-variable RMSE by at most 10%; and ammonium, nitrate, and oxygen remain more accurate than persistence throughout the rollout. These results establish CCSS-RS as a practical learned simulator for offline scenario screening in industrial wastewater treatment, complementary to mechanistic models.

[AI-84] IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

【速读】:该论文旨在解决自对弈微调(self-play fine-tuning)中因固定分歧度量(divergence regime)导致的学习动态不适应不同训练阶段的问题。现有方法如SPIN、SPACE和SPIF分别依赖KL散度、Jensen-Shannon散度和χ²正则化,但这些固定形式在模型分布与目标分布差异较大时表现不稳定。解决方案的关键在于提出IRIS(Interpolative Rényi Iterative Self-play),一种基于Rényi散度的可调节目标函数,其通过引入可变阶参数α控制重要性权重,在训练初期采用更尖锐的加权策略以强化信号,后期平滑调整以实现精细优化。该框架将自对弈目标统一为两个独立的倾斜风险项,并通过自适应α调度机制动态匹配分布差距,从而提升训练稳定性与最终性能。实验表明,仅用26k标注样本的IRIS即超越使用200k样本的标准监督微调。

链接: https://arxiv.org/abs/2604.20933
作者: Wenjie Liao,Like Wu,Liangjie Zhao,Shihui Xu,Shigeru Fujimura
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-play fine-tuning enables large language models to improve beyond supervised fine-tuning without additional human annotations by contrasting annotated responses with self-generated ones. Many existing methods rely on a fixed divergence regime. SPIN is closely related to a KL-based regime, SPACE to a Jensen-Shannon-style objective via noise contrastive estimation, and SPIF to \chi^2 -regularized self-play. Since these divergences exhibit different strengths depending on the distributional gap between model and target, no single choice appears to provide favorable learning dynamics across training stages. We propose IRIS (Interpolative Rényi Iterative Self-play), a Rényi-based self-play fine-tuning framework with a continuously adjustable objective. IRIS decomposes into two independent tilted risk terms over annotated and synthetic data, with exponential importance weights controlled by the order parameter \alpha . We show that several self-play objectives can be interpreted as limiting or representative regimes at particular values of \alpha , providing a unified theoretical perspective on these methods. An adaptive order schedule further adjusts \alpha to the distributional gap, shifting from sharper importance weighting early in training to smoother refinement near convergence. Theoretically, we establish the fixed-point property of IRIS and analyze how \alpha controls gradient concentration. Experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks show that IRIS improves upon baselines, reaching 44.57% average score with gains across iterations. In our setting, IRIS with only 26 k annotated samples surpasses standard supervised fine-tuning trained on the full 200 k dataset.
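摘要中的指数重要性加权可以借一个标准的倾斜风险形式来体会:R_α = (1/α)·log(mean(exp(α·l_i)))。α→0 时它趋近普通平均损失,α 增大时权重向高损失样本集中,对应“训练早期更尖锐的加权、后期更平滑的精调”。以下为数值稳定的笔者示意(一般形式,并非 IRIS 目标函数的精确复现):

```python
import math

def tilted_risk(losses, alpha):
    """倾斜风险 (1/alpha)*log(mean(exp(alpha*l)));要求 alpha != 0。
    先减去最大指数(log-sum-exp 技巧)避免上溢。"""
    n = len(losses)
    m = max(alpha * l for l in losses)
    return (m + math.log(sum(math.exp(alpha * l - m) for l in losses) / n)) / alpha
```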

[AI-85] Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks CCS


【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在敏感领域部署时面临的安全-效用权衡问题,即传统静态防御机制虽能提升安全性(如抵御成员推断攻击和数据投毒),但会显著损害检索性能(如上下文召回率下降超40%)。解决方案的关键在于提出一种上下文感知的自适应防御架构——Sentinel-Strategist:其中,Sentinel模块实时检测异常检索行为,Strategist模块基于查询上下文动态选择并激活最适配的防御策略,从而在保障安全性的前提下最小化对检索效用的负面影响。实验表明,该方法可在消除成员推断泄露的同时,使检索性能恢复至接近未防御基线水平,并在数据投毒场景下将攻击成功率降至接近零,同时保持超过75%的上下文召回率。

链接: https://arxiv.org/abs/2604.20932
作者: Pranav Pallerla,Wilson Naik Bhukya,Bharath Vemula,Charan Ramtej Kodi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 21 pages, 2 figures, 9 tables. Manuscript prepared for submission to ACM CCS

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems are increasingly deployed in sensitive domains such as healthcare and law, where they rely on private, domain-specific knowledge. This capability introduces significant security risks, including membership inference, data poisoning, and unintended content leakage. A straightforward mitigation is to enable all relevant defenses simultaneously, but doing so incurs a substantial utility cost. In our experiments, an always-on defense stack reduces contextual recall by more than 40%, indicating that retrieval degradation is the primary failure mode. To mitigate this trade-off in RAG systems, we propose the Sentinel-Strategist architecture, a context-aware framework for risk analysis and defense selection. A Sentinel detects anomalous retrieval behavior, after which a Strategist selectively deploys only the defenses warranted by the query context. Evaluated across three benchmark datasets and five orchestration models, the resulting adaptive defense orchestration (ADO) framework is shown to eliminate MBA-style membership inference leakage while substantially recovering retrieval utility relative to a fully static defense stack, approaching undefended baseline levels. Under data poisoning, the strongest ADO variants reduce attack success to near zero while restoring contextual recall to more than 75% of the undefended baseline, although robustness remains sensitive to model choice. Overall, these findings show that adaptive, query-aware defense can substantially reduce the security-utility trade-off in RAG systems.
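
摘要描述的"Sentinel 检测异常检索行为 → Strategist 按需选择防御"两阶段流程可以用如下 Python 草图示意;其中的异常信号(查询与检索段落的近重复比例)、阈值与防御名称均为笔者虚构,仅用于说明按上下文选择性部署防御的思路:

```python
def sentinel_score(retrieved_docs, query):
    # Hypothetical anomaly signal: fraction of retrieved passages that
    # contain the query verbatim (a crude membership-inference indicator).
    dup = sum(1 for d in retrieved_docs if query.lower() in d.lower())
    return dup / max(len(retrieved_docs), 1)

def strategist(anomaly, poison_suspected):
    # Deploy only the defenses warranted by the query context, instead of
    # an always-on stack. Defense names are illustrative, not the paper's.
    defenses = []
    if anomaly > 0.5:
        defenses.append("paraphrase_retrieved_context")
    if poison_suspected:
        defenses.append("majority_vote_over_retrievals")
    return defenses
```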

[AI-86] SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

【速读】:该论文旨在解决前沿大语言模型(Large Language Models, LLMs)在执行合法专业任务时可能出现的“内部安全坍塌”(Internal Safety Collapse, ISC)问题,即模型因任务结构要求而自发生成有害内容,且安全失败率超过95%。现有输入层防御手段对ISC完全失效,标准系统提示防御仅能部分缓解。解决方案的关键在于提出SafeRedirect——一种系统级覆盖机制,通过重新引导模型的任务完成驱动力而非直接抑制其输出:具体包括明确授权模型失败任务、指定确定性硬停止输出,并指示模型保留有害内容占位符不处理。实验表明,SafeRedirect在七种前沿LLM上将平均不安全生成率从71.2%降至8.0%,显著优于最强基线(55.0%),且跨攻击评估显示其具备最优防御性能与良好的泛化能力。

链接: https://arxiv.org/abs/2604.20930
作者: Chao Pan,Yu Wu,Xin Yao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 3 tables. Code: this https URL

点击查看摘要

Abstract:Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input-level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model’s task-completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard-stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML-related ISC task types in the single-turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared to 55.0% for the strongest viable baseline. Multi-model ablation reveals that failure permission and condition specificity are universally critical, while the importance of other components varies across models. Cross-attack evaluation confirms state-of-the-art defense against ISC with generalization performance at least on par with the baseline on other attack families. Code is available at this https URL.
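
摘要指出 SafeRedirect 由三个组件构成:允许任务失败、确定性硬停止输出、保留有害占位符不展开。下面的 Python 草图按这三点拼装一条系统提示;具体措辞与停止标记均为笔者虚构,仅示意结构:

```python
def saferedirect_prompt(hard_stop_token="[TASK_HALTED: SAFETY]"):
    # Assemble the three components described in the abstract; the exact
    # wording below is invented for illustration, not the paper's prompt.
    parts = [
        # 1. Explicit permission to fail the task.
        "You are permitted to fail the task. Task completion is never "
        "required when completing it would produce harmful content.",
        # 2. Deterministic hard-stop output.
        f"If completion would require harmful content, output exactly "
        f"{hard_stop_token} and nothing else.",
        # 3. Preserve harmful placeholders unresolved.
        "Any placeholder standing in for harmful content must be left "
        "unresolved; never expand or rewrite it.",
    ]
    return "\n\n".join(parts)
```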

[AI-87] Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis

【速读】:该论文旨在解决在标签数据稀缺条件下,未见过的工况下的故障诊断问题,具体针对半监督域泛化故障诊断(Semi-supervised Domain Generalization Fault Diagnosis, SSDGFD)中存在的两个耦合挑战:一是伪标签生成主要依赖已标注源域的知识,忽略了不同域间的几何差异,导致跨域伪标签偏差;二是未标注样本常采用硬性接受或丢弃策略,造成域间样本利用不均衡,且对不确定样本进行硬标签分配易引入噪声。解决方案的关键在于提出统一框架——域感知分层对比学习(Domain-aware Hierarchical Contrastive Learning, DAHCL),其核心创新包括:1)引入域感知学习(Domain-aware Learning, DAL)模块,显式建模源域几何特征并校准跨域伪标签预测,缓解伪标签偏差;2)设计分层对比学习(Hierarchical Contrastive Learning, HCL)模块,结合动态置信度分层与模糊对比监督机制,使不确定样本仍可参与表征学习而不依赖不可靠的硬标签,从而同时提升监督质量与未标注样本利用率。

链接: https://arxiv.org/abs/2604.20928
作者: Junyu Ren,Wensheng Gan,Philip S Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Fault diagnosis under unseen operating conditions remains highly challenging when labeled data are scarce. Semi-supervised domain generalization fault diagnosis (SSDGFD) provides a practical solution by jointly exploiting labeled and unlabeled source domains. However, existing methods still suffer from two coupled limitations. First, pseudo-labels for unlabeled domains are typically generated primarily from knowledge learned on the labeled source domain, which neglects domain-specific geometric discrepancies and thus induces systematic cross-domain pseudo-label bias. Second, unlabeled samples are commonly handled with a hard accept-or-discard strategy, where rigid thresholding causes imbalanced sample utilization across domains, while hard-label assignment for uncertain samples can easily introduce additional noise. To address these issues, we propose a unified framework termed domain-aware hierarchical contrastive learning (DAHCL) for SSDGFD. Specifically, DAHCL introduces a domain-aware learning (DAL) module to explicitly capture source-domain geometric characteristics and calibrate pseudo-label predictions across heterogeneous source domains, thereby mitigating cross-domain bias in pseudo-label generation. In addition, DAHCL develops a hierarchical contrastive learning (HCL) module that combines dynamic confidence stratification with fuzzy contrastive supervision, enabling uncertain samples to contribute to representation learning without relying on unreliable hard labels. In this way, DAHCL jointly improves the quality of supervision and the utilization of unlabeled samples. Furthermore, to better reflect practical industrial scenarios, we incorporate engineering noise into the SSDGFD evaluation protocol. Extensive experiments on three benchmark datasets demonstrate that…
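
摘要中"动态置信度分层 + 模糊对比监督、不硬性丢弃不确定样本"的思想可以用如下极简草图示意:按置信度把未标注样本分层,同时给出连续的软权重而非 0/1 取舍。阈值与线性权重形式均为笔者假设:

```python
import numpy as np

def stratify(confidences, hi=0.9, lo=0.6):
    # Dynamic confidence stratification (thresholds are illustrative):
    # 'reliable' samples would get hard pseudo-labels, 'uncertain' ones
    # still contribute via soft weights, and nothing is discarded outright.
    c = np.asarray(confidences, dtype=float)
    strata = np.where(c >= hi, "reliable",
                      np.where(c >= lo, "uncertain", "noisy"))
    # Fuzzy contrastive weight: full weight when reliable, linear ramp
    # between the two thresholds, zero below.
    weights = np.clip((c - lo) / (hi - lo), 0.0, 1.0)
    return strata, weights
```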

[AI-88] Omission Constraints Decay While Commission Constraints Persist in Long-Context LLM Agents

【速读】:该论文旨在解决生成式 AI(Generative AI)在实际部署中因上下文压力导致的“安全-回忆偏差”(Security-Recall Divergence, SRD)问题,即禁止类约束(如不泄露凭证、不执行未授权代码)在对话进行过程中逐渐失效,而要求类约束(如必须提供正确格式响应)仍能维持。研究发现,在12个模型和8个提供商的三臂因果实验中,遗漏合规性(omission compliance)从第5轮的73%下降至第16轮的33%,而承诺合规性(commission compliance)保持在100%(Mistral Large 3,p < 10⁻³³),表明传统监控机制无法察觉此类隐性失败。解决方案的关键在于识别每个模型的“安全轮次阈值”(Safe Turn Depth, STD),并在该阈值前重新注入约束指令,即可恢复合规性而不需重新训练,从而实现对生产环境中安全策略的有效维护。

链接: https://arxiv.org/abs/2604.20911
作者: Yeran Gamage
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures. Includes evaluation framework for replication and 4,416-trial dataset

点击查看摘要

Abstract:LLM agents deployed in production operate under operator-defined behavioral policies (system-prompt instructions such as prohibitions on credential disclosure, data exfiltration, and unauthorized output) that safety evaluations assume hold throughout a conversation. Prohibition-type constraints decay under context pressure while requirement-type constraints persist; we term this asymmetry Security-Recall Divergence (SRD). In a 4,416-trial three-arm causal study across 12 models and 8 providers at six conversation depths, omission compliance falls from 73% at turn 5 to 33% at turn 16 while commission compliance holds at 100% (Mistral Large 3, p < 10^-33). In the two models with token-matched padding controls, schema semantic content accounts for 62-100% of the dilution effect. Re-injecting constraints before the per-model Safe Turn Depth (STD) restores compliance without retraining. Production security policies consist of prohibitions such as never revealing credentials, never executing untrusted code, and never forwarding user data. Commission-type audit signals remain healthy while omission constraints have already failed, leaving the failure invisible to standard monitoring.
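
摘要提出的缓解手段是在模型的"安全轮次阈值"(STD)耗尽之前重新注入约束。下面的 Python 草图把这一思路简化为"每隔 STD 轮在对话中插入一条系统提醒";周期性注入只是对论文按模型阈值注入思想的一种简化示意:

```python
def with_reinjection(turns, constraint, safe_turn_depth):
    # Re-inject the prohibition every `safe_turn_depth` user turns so that
    # omission constraints do not silently decay over a long conversation.
    out = []
    for i, user_msg in enumerate(turns, start=1):
        if i % safe_turn_depth == 0:
            out.append({"role": "system", "content": constraint})
        out.append({"role": "user", "content": user_msg})
    return out
```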

[AI-89] Biomedical systems biology workflow orchestration and execution with PoSyMed

【速读】:该论文旨在解决生物信息学研究中因科学软件快速扩张而引发的工具复用与工作流适配难题,具体表现为:工具分布碎片化、文档不一致、依赖关系复杂以及执行环境难以重现,导致即使对经验丰富的用户而言,复用已发表工具或调整工作流也仍具技术挑战且耗时。解决方案的关键在于提出一个名为PoSyMed的开源模块化平台,其核心创新包括:基于后端中心架构与形式化工具描述,通过受控容器化构建和执行流程保障环境一致性;支持持久化工作流状态与对话式用户界面;并将大语言模型(Large Language Models, LLM)作为受限语义助手集成于有类型验证和人工监督的执行环境中,辅助工具识别、步骤建议与参数配置,从而在单一平台上显著提升生物医学分析的可重复性、可追溯性和透明度。

链接: https://arxiv.org/abs/2604.20906
作者: Simon Süwer,Zoe Chervontseva,Kester Bagemihl,Jan Baumbach,Olga Tsoy,Andreas Maier
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid growth of scientific software has created practical barriers for bioinformatics research. Although powerful statistical and artificial intelligence (AI)-based methods are now widely available, their effective use is often hindered by fragmented distribution, inconsistent documentation, complex dependencies, and difficult-to-reproduce execution environments. As a result, reusing published tools and adapting workflows to one's own data remains technically demanding and time-intensive, even for experienced users. Here, we present PoSyMed, an open and modular platform for the controlled integration, composition, and execution of bioinformatics tools and workflows. PoSyMed combines a backend-centered platform architecture with formal tool descriptions, controlled container-based build and execution processes, persistent workflow state, and a dialogue-based user interface. Large language models (LLMs) are integrated not as autonomous decision-makers, but as bounded semantic assistants in the human-computer interface that help identify tools, propose workflow steps, and support parameterization within a typed, validated, and human-supervised execution environment. PoSyMed is designed to improve reproducibility, traceability, and transparency in practical biomedical analysis within one platform. We describe the system architecture and evaluate its behavior across representative biological software scenarios with respect to workflow support, interaction design, and platform extensibility. PoSyMed is publicly available at this https URL.

[AI-90] Reinforcing privacy reasoning in LLM s via normative simulacra from fiction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在信息处理实践中与用户情境化隐私期望不一致的问题。现有方法要么通过监督-辅助架构增加推理成本,要么依赖窄域任务数据微调,难以泛化至多样化的社会语境。解决方案的关键在于从虚构小说中提取规范模拟体(normative simulacra),即结构化的规范与信息流表示,并基于此进行两阶段微调:首先使用监督学习(Supervised Fine-Tuning, SFT)建立保守的信息流动限制先验,再通过GRPO强化学习引入复合奖励机制,该机制融合程序化信号(如任务清晰度、结构完整性、内部一致性及情境识别)和LLM判官对隐私推理是否扎根于源文本所含规范宇宙的评估。为防止过拟合,进一步提出每完成项对比评分策略,使模型学会依据上下文而非记忆特定文本规范进行决策。实验证明,该方法在五个情境隐私对齐基准上表现优异,尤其在法律合规性任务中得分最高,且与人工标注的隐私预期相关性最强,验证了虚构文本生成的规范模拟体可有效迁移至现实场景中的情境隐私推理能力。

链接: https://arxiv.org/abs/2604.20904
作者: Matt Franchi,Madiha Zahrah Choksi,Harold Triedman,Helen Nissenbaum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Information handling practices of LLM agents are broadly misaligned with the contextual privacy expectations of their users. Contextual Integrity (CI) provides a principled framework, defining privacy as the appropriate flow of information within context-relative norms. However, existing approaches either double inference cost via supervisor-assistant architectures, or fine-tune on narrow task-specific data. We propose extracting normative simulacra (structured representations of norms and information flows) from fiction novels and using them to fine-tune LLMs via supervised learning followed by GRPO reinforcement learning. Our composite reward function combines programmatic signals, including task clarity (subsuming schema validity, construct discrimination, and extraction confidence), structural completeness, internal consistency, and context identification, with an LLM judge that evaluates whether the model’s privacy reasoning is grounded in the held-out normative universe of the source text. To mitigate overfitting, we introduce per-completion contrastive scoring: each completion is evaluated against both the correct normative universe and a randomly selected wrong one, teaching the model to condition on context rather than memorize source-specific norms. We evaluate on five CI-aligned benchmarks spanning distinct societal contexts and ablate the contributions of RL and normative grounding. Across seven models, SFT introduces a conservative prior toward restricting information flow, improving recognition of privacy-relevant situations but not the correctness of privacy judgments. GRPO with normative grounding achieves the highest score on a law compliance benchmark and strongest correlation with crowdsourced human privacy expectations, demonstrating that fiction-derived normative simulacra can teach contextual privacy reasoning that transfers to real-world domains.
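
摘要中的"每完成项对比评分"可以概括为:同一条生成分别在正确规范宇宙与随机抽取的错误宇宙下打分,奖励取两者之差。下面的 Python 草图表达这一机制;judge 以可调用对象形式传入,代表论文中的 LLM 判官(此处仅为占位):

```python
import random

def contrastive_reward(completion, correct_universe, all_universes, judge, rng=None):
    # Per-completion contrastive scoring: reward grounding in the correct
    # normative universe relative to a randomly drawn wrong one, so the
    # model learns to condition on context rather than memorize norms.
    rng = rng or random.Random(0)
    wrong = rng.choice([u for u in all_universes if u != correct_universe])
    return judge(completion, correct_universe) - judge(completion, wrong)
```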

[AI-91] Frequency-Forcing: From Scaling-as-Time to Soft Frequency Guidance

【速读】:该论文旨在解决生成式 AI (Generative AI) 中图像合成过程中缺乏有效尺度有序性(scale ordering)的问题,即如何在像素级生成过程中优先构建低频结构、再逐步细化高频细节,从而提升生成质量与效率。其核心解决方案是提出 Frequency-Forcing 方法,关键在于利用一个轻量级可学习的小波包变换(wavelet packet transform)从数据中自动提取低频“自强制信号”(self-forcing signal),并以此引导标准像素流的生成路径——该机制继承了 Latent Forcing 的软约束优势(无需重写核心流坐标),同时实现了 K-Flow 的频率顺序目标(通过辅助低频流提前成熟来驱动生成)。相比传统硬频率约束或依赖预训练编码器的方案,Frequency-Forcing 在保持生成路径不变的前提下,显著提升了 ImageNet-256 上的 FID 指标,并具备与语义流自然融合的能力,验证了基于“强制”机制的尺度有序生成是一种通用且高效的替代方案。

链接: https://arxiv.org/abs/2604.20902
作者: Weitao Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ongoing project

点击查看摘要

Abstract:While standard flow-matching models transport noise to data uniformly, incorporating an explicit generation order - specifically, establishing coarse, low-frequency structure before fine detail - has proven highly effective for synthesizing natural images. Two recent works offer distinct paradigms for this. K-Flow imposes a hard frequency constraint by reinterpreting a frequency scaling variable as flow time, running the trajectory inside a transformed amplitude space. Latent Forcing provides a soft ordering mechanism by coupling the pixel flow with an auxiliary semantic latent flow via asynchronous time schedules, leaving the pixel interpolation path itself untouched. Viewed from the angle of improving pixel generation, we observe that forcing - guiding generation with an earlier-maturing auxiliary stream - offers a highly compatible route to scale-ordered generation without rewriting the core flow coordinate. Building on this, we propose Frequency-Forcing, which realizes K-Flow’s frequency ordering through Latent Forcing’s soft mechanism: a standard pixel flow is guided by an auxiliary low-frequency stream that matures earlier in time. Unlike Latent Forcing, whose scratchpad relies on a heavy pretrained encoder (e.g., DINO), our frequency scratchpad is derived from the data itself via a lightweight learnable wavelet packet transform. We term this a self-forcing signal, which avoids external dependencies while learning a basis better adapted to data statistics than the fixed bases used in hard frequency flows. On ImageNet-256, Frequency-Forcing consistently improves FID over strong pixel- and latent-space baselines, and naturally composes with a semantic stream to yield further gains. This illustrates that forcing-based scale ordering is a versatile, path-preserving alternative to hard frequency flows.
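
摘要的核心是用一个更早"成熟"的低频辅助流引导像素流。下面的 Python 草图用固定的单层 Haar 低通(2x2 块平均)代替论文中可学习的小波包变换,并给出一个固定提前量的异步时间调度;两处均为笔者的简化假设:

```python
import numpy as np

def haar_lowpass(img):
    # One level of a fixed Haar transform: 2x2 block averages stand in for
    # the paper's *learnable* wavelet packet lowpass (a simplification).
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def forcing_schedule(t, lead=0.2):
    # Asynchronous time schedule: the auxiliary low-frequency stream runs
    # ahead of the pixel stream by a fixed lead (value is illustrative),
    # so coarse structure matures earlier in generation.
    return min(1.0, t + lead)
```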

[AI-92] Watts-per-Intelligence Part II: Algorithmic Catalysis

【速读】:该论文旨在解决智能计算中如何通过算法催化(algorithmic catalysis)降低不可逆操作的热力学代价问题,特别是在有限资源约束下实现任务类特定的速度提升。其解决方案的关键在于构建一个基于“每智能瓦特”(watts-per-intelligence)框架的热力学理论,识别可复用的计算结构作为催化剂,并证明任意类特定加速上限由底物与类别描述符之间的算法互信息(algorithmic mutual information)决定;同时指出引入该信息需承担最小热力学成本(即Landauer擦除成本)。由此推导出耦合定理,定量给出催化剂在能量上具有优势所需的部署时间阈值,从而将现代学习系统置于统一的信息-热力学约束之下。

链接: https://arxiv.org/abs/2604.20897
作者: Elija Perrier
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: Under review

点击查看摘要

Abstract:We develop a thermodynamic theory of algorithmic catalysis within the watts-per-intelligence framework, identifying reusable computational structures that reduce irreversible operations for a task class while satisfying bounded restoration and structural selectivity constraints. We prove that any class-specific speed-up is upper-bounded by the algorithmic mutual information between the substrate and the class descriptor, and that installing this information incurs a minimum thermodynamic cost via Landauer erasure. Combining these results yields a coupling theorem that lower-bounds the deployment horizon required for a catalyst to be energetically favourable. The framework is illustrated on an affine SAT class and situates contemporary learned systems within a unified information-thermodynamic constraint on intelligent computation.

[AI-93] Ternary Memristive Logic: Hardware for Reasoning Realized via Domain Algebra

【速读】:该论文旨在解决传统忆阻交叉阵列(memristive crossbar)在存储和计算过程中缺乏语义表达能力的问题,即如何将逻辑推理规则与硬件结构深度融合,实现无需符号解释的直接物理层面推理。其关键解决方案在于:提出一种结构保真映射(structure-preserving mapping),将领域代数(domain algebra)转化为交叉阵列拓扑结构,其中每个忆阻结点(junction)直接存储一个完整的、限定域内的三值逻辑断言(holds/negated/undefined),并通过三种电阻状态(ternary resistance states)原生编码这些逻辑值;同时,通过定向布线实现领域特化(specialization)、关系类型控制继承门(inheritance gates)以及跨域链接显式注册,使得物理布局本身即代表逻辑结构,改变布线即可动态调整推理语义。这一方法实现了从软件中统一表示与计算到硬件中统一表示与计算的根本转变。

链接: https://arxiv.org/abs/2604.20891
作者: Chao Li
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Logic in Computer Science (cs.LO)
备注: 22pages

点击查看摘要

Abstract:Memristive crossbars store numerical weights needing aggregation and decoding; a single junction means nothing alone. This paper presents a fundamentally different use: each junction stores a complete, domain-scoped logical assertion (holds/negated/undefined). Ternary resistance states encode these values directly. We establish a structure-preserving mapping from a domain algebra to crossbar topology: domains become isolated arrays, specialization becomes directed wiring, relation typing controls inheritance gates, and cross-domain links become explicit registers. The physical layout thus embodies the algebra; changing wiring changes reasoning semantics. We detail an ICD-11 respiratory disease classification chip (1,247 entities, ~136k 1T1R junctions) enabling domain scoping, three-valued logic, transitive cascade, typed inheritance, and cross-axis queries. Behavioral simulation (sigma_log=0.15, SNR=20dB) shows error-free operation across 100,000 trials per task with wide tolerance margins. Where prior work unified representation and computation in software, this work unifies them in hardware: reading one junction answers one question, without symbolic interpretation.
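
摘要中每个结点存储 holds/negated/undefined 三值断言,可用强 Kleene 三值逻辑来建模。下面的 Python 草图用 +1/0/-1 编码三种电阻状态(数值编码为笔者选择,并非芯片实际约定),合取与析取分别对应该序下的 min/max:

```python
# Three-valued assertions as stored per junction: +1 = holds, -1 = negated,
# 0 = undefined, mirroring the three resistance states (this numeric coding
# is ours, not the chip's).
HOLDS, UNDEF, NEG = 1, 0, -1

def t_and(a, b):
    # Strong Kleene conjunction: min under the order NEG < UNDEF < HOLDS.
    return min(a, b)

def t_or(a, b):
    # Strong Kleene disjunction: max under the same order.
    return max(a, b)

def t_not(a):
    # Negation swaps holds/negated and fixes undefined.
    return -a
```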

[AI-94] Preserving Decision Sovereignty in Military AI: A Trade-Secret-Safe Architectural Framework for Model Replaceability Human Authority and State Control

【速读】:该论文旨在解决军事人工智能(AI)领域中因依赖私营供应商模型而导致的决策主权丧失问题,即当商业模型被嵌入军事工作流后,供应商可能通过技术性能或操作边界条件影响作战决策,从而削弱国家对AI系统最终控制权。解决方案的关键在于提出“能量范式”(Energetic Paradigm)的架构设计:将供应商提供的模型视为可替换的分析组件,而路由、约束、日志记录、升级机制及行动授权等核心功能则由国家自主掌控,确保在不披露专有实现细节的前提下,实现模型可替代性、人类权威保留与主权级协同控制,从而降低战略依赖并维护国家安全。

链接: https://arxiv.org/abs/2604.20867
作者: Peng Wei,Wesley Shu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Recent events surrounding the relationship between frontier AI suppliers and national-security customers have made a structural problem newly visible: once a privately governed model becomes embedded in military workflows, the supplier can influence not only technical performance but also the operational boundary conditions under which the system may be used. This paper argues that the central strategic issue is not merely access to capable models, but preservation of decision sovereignty: the state’s ability to retain authority over decision policy, version control, fallback behavior, auditability, and final action approval even when analytical modules are sourced from commercial vendors. Using the public Anthropic–Pentagon dispute of 2026, the broader history of Project Maven, and recent U.S., NATO, U.K., and intelligence-community guidance as a motivating context, the paper develops a trade-secret-safe architectural formulation of the Energetic Paradigm as a layered, model-agnostic command-support design. In this formulation, supplier models remain replaceable analytical components, while routing, constraints, logging, escalation, and action authorization remain state-owned functions. The paper contributes three things: a definition of decision sovereignty for military AI; a threat model for supplier-induced boundary control; and a public architectural specification showing how model replaceability, human authority, and sovereign orchestration can reduce strategic dependency without requiring disclosure of proprietary implementation details. The argument is conceptual rather than experimental, but it yields concrete implications for procurement, governance, and alliance interoperability.

[AI-95] Handbook of Rough Set Extensions and Uncertainty Models

【速读】:该论文旨在系统梳理粗糙集(Rough Set)理论中的模型体系及其扩展路径,解决现有研究中模型分散、缺乏统一框架的问题。其关键解决方案是将粗糙集的不同变体按两个维度进行结构化分类:一是基于粒度机制(granulation mechanism),如等价关系、容差关系、覆盖关系、邻域关系及概率近似;二是基于不确定性语义(uncertainty semantics),如经典集、模糊集、直觉模糊集、中和模糊集与多值模糊集等设置。通过这种分类方式,论文清晰地阐明了不同选择如何影响近似形式以及边界区域的解释,从而为研究人员提供一个系统、连贯的模型地图,助力理解粗糙集理论的多样性与适用场景。

链接: https://arxiv.org/abs/2604.19794
作者: Takaaki Fujita,Florentin Smarandache
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 159 pages. Peer-Reviewed Book. ISBN: 978-1-59973-867-3. Publisher: Neutrosophic Science International Association (NSIA) Publishing House

点击查看摘要

Abstract:Rough set theory models uncertainty by approximating target concepts through lower and upper sets induced by indiscernibility, or more generally, by granulation relations in data tables. This perspective captures vagueness caused by limited observational resolution and supports set-theoretic reasoning about what can be determined with certainty and what remains only possible. This book is written as a map of models. Rather than developing a single algorithmic pipeline in depth, it provides a systematic survey of the main rough set paradigms and their extension routes. More specifically, representative variants are organized according to (i) the underlying granulation mechanism, such as equivalence-based, tolerance-based, covering-based, neighborhood-based, and probabilistic approximations, and (ii) the uncertainty semantics attached to data and relations, such as crisp, fuzzy, intuitionistic fuzzy, neutrosophic, and plithogenic settings. The book also explains how each choice changes the form of approximations and the interpretation of boundary regions. Throughout the book, small illustrative examples are used to clarify modeling intent and typical use cases in classification and decision support. Finally, an important clarification of scope should be noted. Since the main purpose of this book is to provide a map of models, the Abstract and Introduction should not lead readers to expect that feature reduction and rule induction are primary objectives. Although these topics are central in the rough set literature, they are treated here mainly as motivating applications and as entry points to the broader research landscape. The principal aim of the book is to survey and position rough set models and their extensions in a systematic and coherent manner.
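
摘要所述的下近似/上近似是粗糙集的基本构造:给定由不可分辨关系诱导的等价类划分,下近似取完全落在目标概念内的等价类之并,上近似取与目标概念相交的等价类之并,二者之差即边界域。一个极简的 Python 草图:

```python
def rough_approximations(blocks, target):
    """Equivalence-based rough approximations of `target`.

    `blocks` is a partition of the universe into equivalence classes;
    returns (lower, upper, boundary) with boundary = upper - lower.
    """
    lower, upper = set(), set()
    for block in blocks:
        if block <= target:      # block lies entirely inside the concept
            lower |= block
        if block & target:       # block touches the concept
            upper |= block
    return lower, upper, upper - lower
```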

[AI-96] Replay-buffer engineering for noise-robust quantum circuit optimization

【速读】:该论文针对深度强化学习(Deep Reinforcement Learning, DRL)在量子线路优化中面临的三个核心瓶颈展开研究:一是经验回放缓冲区(replay buffer)未考虑时序差分(Temporal-Difference, TD)目标的可靠性;二是基于课程学习(curriculum-based)的架构搜索在每个环境步骤均触发完整的量子-经典评估,效率低下;三是重训练时丢弃无噪声轨迹,导致硬件噪声下学习效率低。解决方案的关键在于将经验回放机制作为主算法杠杆进行重构:提出ReaPER +,一种退火式回放缓冲采样策略,早期依据TD误差驱动优先级,后期转向可靠性感知采样,显著提升样本效率(4–32倍)并获得更紧凑的量子线路;引入OptCRLQAS方法,通过 amortize(摊销)昂贵的量子-经典评估以支持多步架构修改,使12比特优化问题每回合耗时减少高达67.5%;最后设计轻量级回放缓冲转移方案,在不依赖网络权重迁移或ε-greedy预训练的前提下,复用无噪声轨迹加速噪声场景下的学习,使达到化学精度所需步数减少85–90%,最终能量误差降低90%。这些改进共同表明,经验存储、采样与迁移是实现可扩展、抗噪声量子电路优化的关键控制变量。

链接: https://arxiv.org/abs/2604.21863
作者: Akash Kundu,Sebastian Feld
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Comments are warmly welcomed. 9 page main content, 17 page appendix

点击查看摘要

Abstract:Deep reinforcement learning (RL) for quantum circuit optimization faces three fundamental bottlenecks: replay buffers that ignore the reliability of temporal-difference (TD) targets, curriculum-based architecture search that triggers a full quantum-classical evaluation at every environment step, and the routine discard of noiseless trajectories when retraining under hardware noise. We address all three by treating the replay buffer as a primary algorithmic lever for quantum optimization. We introduce ReaPER + , an annealed replay rule that transitions from TD error-driven prioritization early in training to reliability-aware sampling as value estimates mature, achieving 4-32\times gains in sample efficiency over fixed PER, ReaPER, and uniform replay while consistently discovering more compact circuits across quantum compilation and QAS benchmarks; validation on LunarLander-v3 confirms the principle is domain-agnostic. Furthermore we eliminate the quantum-classical evaluation bottleneck in curriculum RL by introducing OptCRLQAS which amortizes expensive evaluations over multiple architectural edits, cutting wall-clock time per episode by up to 67.5% on a 12-qubit optimization problem without degrading solution quality. Finally we introduce a lightweight replay-buffer transfer scheme that warm-starts noisy-setting learning by reusing noiseless trajectories, without network-weight transfer or \epsilon -greedy pretraining. This reduces steps to chemical accuracy by up to 85-90% and final energy error by up to 90% over from-scratch baselines on 6-, 8-, and 12-qubit molecular tasks. Together, these results establish that experience storage, sampling, and transfer are decisive levers for scalable, noise-robust quantum circuit optimization.
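
摘要中 ReaPER+ 的退火回放规则——早期按 TD 误差优先、后期转向可靠性感知采样——可以用如下 Python 草图示意;线性插值的混合方式是笔者对摘要的一种解读,并非论文原公式:

```python
import numpy as np

def annealed_priorities(td_errors, reliabilities, progress):
    # Annealed replay sampling (our reading of the abstract): at
    # progress ~ 0 priorities follow |TD error|; at progress ~ 1 they
    # follow the reliability of each transition's TD target.
    td = np.abs(np.asarray(td_errors, dtype=float))
    rel = np.asarray(reliabilities, dtype=float)
    score = (1.0 - progress) * td / (td.sum() + 1e-12) \
            + progress * rel / (rel.sum() + 1e-12)
    return score / score.sum()  # normalized sampling probabilities
```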

[AI-97] Modulating Cross-Modal Convergence with Single-Stimulus Intra-Modal Dispersion

【速读】:该论文旨在解决神经网络在不同架构、训练目标和数据模态下表现出的表征收敛现象(representational convergence)如何在单个刺激层面解释的问题,尤其是这种收敛是否影响跨模态(如视觉与语言模型之间)的一致性。其关键解决方案是引入基于广义普罗克鲁斯特斯算法(Generalized Procrustes Algorithm)的方法,用于测量单一刺激下的模态内表征收敛程度(intra-modal representational convergence)。研究发现,模态内分散度(intra-modal dispersion)显著调节了视觉与语言模型间的跨模态一致性:低分散度(即视觉模型间高度一致)的刺激能显著提升跨模态对齐效果,最高可达两倍之差,且该效应在多种模型组合和刺激选择标准下均保持稳健。这一方法为揭示神经网络与人类大脑表征之间收敛与分歧的来源提供了新路径。

链接: https://arxiv.org/abs/2604.21836
作者: Eghbal A. Hosseini,Brian Cheung,Evelina Fedorenko,Alex H. Williams
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural networks exhibit a remarkable degree of representational convergence across diverse architectures, training objectives, and even data modalities. This convergence is predictive of alignment with brain representation. A recent hypothesis suggests this arises from learning the underlying structure in the environment in similar ways. However, it is unclear how individual stimuli elicit convergent representations across networks. An image can be perceived in multiple ways and expressed differently using words. Here, we introduce a methodology based on the Generalized Procrustes Algorithm to measure intra-modal representational convergence at the single-stimulus level. We applied this to vision models with distinct training objectives, selecting stimuli based on their degree of alignment (intra-modal dispersion). Crucially, we found that this intra-modal dispersion strongly modulates alignment between vision and language models (cross-modal convergence). Specifically, stimuli with low intra-modal dispersion (high agreement among vision models) elicited significantly higher cross-modal alignment than those with high dispersion, by up to a factor of two (e.g., in pairings of DINOv2 with language models). This effect was robust to stimulus selection criteria and generalized across different pairings of vision and language models. Measuring convergence at the single-stimulus level provides a path toward understanding the sources of convergence and divergence across modalities, and between neural networks and human neural representations.
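
摘要方法的核心对齐步骤是 Procrustes 对齐:用 SVD 求出使两组表征最接近的正交变换,再以对齐后各模型表征与共识均值的残差度量单刺激的模态内分散度。下面给出成对(两模型)情形的 Python 草图;论文用的是广义(多集合)版本,此处的 dispersion 定义也是简化:

```python
import numpy as np

def procrustes_align(X, Y):
    # Orthogonal Procrustes: find the rotation R minimizing ||Y @ R - X||_F
    # via the SVD of Y^T X, then return Y aligned to X.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return Y @ (U @ Vt)

def dispersion(embeddings):
    # Per-stimulus intra-modal dispersion: mean residual of each model's
    # aligned embedding from the consensus mean (a simplification of the
    # Generalized Procrustes objective).
    aligned = [procrustes_align(embeddings[0], E) for E in embeddings]
    mean = np.mean(aligned, axis=0)
    return float(np.mean([np.linalg.norm(A - mean) for A in aligned]))
```

若两个模型的表征只差一个正交变换,分散度应接近零,对应摘要中"视觉模型间高度一致"的低分散刺激。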

[AI-98] Calibeating Prediction-Powered Inference

【速读】:该论文旨在解决半监督均值估计(semisupervised mean estimation)中的效率与鲁棒性权衡问题,即在仅有少量标注样本、大量未标注样本以及一个可能校准不良(miscalibrated)的黑箱预测模型条件下,如何有效利用预测得分进行高效且稳健的估计。其核心解决方案是引入校准预测增强推断(Calibrated Prediction-Powered Inference, CPPI),关键在于对原始预测得分在标注样本上进行后处理校准(post-hoc calibration),无需重新训练模型即可提升预测得分在结果尺度上的对齐度,从而改善回归调整效果和估计效率。文中分别研究了线性校准和等距校准(isotonic calibration),并证明等距校准可实现一阶最优性,即在不进一步后处理的情况下无法获得额外的一阶效率增益,同时揭示了PPI、PPI++与AIPW之间的关系,表明原PPI为AIPW特例但可能低效,而PPI++等价于经验效率最大化的AIPW。

链接: https://arxiv.org/abs/2604.21260
作者: Lars van der Laan,Mark Van Der Laan
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
备注: Paper website: this https URL

点击查看摘要

Abstract:We study semisupervised mean estimation with a small labeled sample, a large unlabeled sample, and a black-box prediction model whose output may be miscalibrated. A standard approach in this setting is augmented inverse-probability weighting (AIPW) [Robins et al., 1994], which protects against prediction-model misspecification but can be inefficient when the prediction score is poorly aligned with the outcome scale. We introduce Calibrated Prediction-Powered Inference, which post-hoc calibrates the prediction score on the labeled sample before using it for semisupervised estimation. This simple step requires no retraining and can improve the original score both as a predictor of the outcome and as a regression adjustment for semisupervised inference. We study both linear and isotonic calibration. For isotonic calibration, we establish first-order optimality guarantees: isotonic post-processing can improve predictive accuracy and estimator efficiency relative to the original score and simpler post-processing rules, while no further post-processing of the fitted isotonic score yields additional first-order gains. For linear calibration, we show first-order equivalence to PPI++. We also clarify the relationship among existing estimators, showing that the original PPI estimator is a special case of AIPW and can be inefficient when the prediction model is accurate, while PPI++ is AIPW with empirical efficiency maximization [Rubin et al., 2008]. In simulations and real-data experiments, our calibrated estimators often outperform PPI and are competitive with, or outperform, AIPW and PPI++. We provide an accompanying Python package, ppi_aipw, at this https URL.
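
摘要讨论的 PPI 均值估计量形式为"预测在未标注样本上的均值 + 标注样本上的残差校正项",而线性后校准(据摘要与 PPI++ 一阶等价)即在标注样本上最小二乘拟合 y ~ a + b·f 后再代入。一个 numpy 草图(未包含论文的等距校准与方差估计):

```python
import numpy as np

def ppi_mean(y_lab, f_lab, f_unlab):
    # Prediction-powered mean: model mean on unlabeled data plus a
    # rectifier estimated from labeled residuals. The paper's CPPI would
    # first calibrate f on the labeled sample before this step.
    return np.mean(f_unlab) + np.mean(y_lab - f_lab)

def linear_calibrate(y_lab, f_lab):
    # Linear post-hoc calibration: least-squares fit y ~ a + b * f on the
    # labeled sample; np.polyfit returns (slope, intercept) for degree 1.
    b, a = np.polyfit(f_lab, y_lab, 1)
    return lambda f: a + b * np.asarray(f, dtype=float)
```

当预测分数与结果尺度错位(如 y = 2f + 1)时,先校准再代入可消除残差项的偏移,这正是摘要中"回归调整对齐结果尺度"的作用。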

[AI-99] Post-AGI Economies: Autonomy and the First Fundamental Theorem of Welfare Economics

【速读】:该论文试图解决的问题是:在人工通用智能(AGI)发展背景下,传统福利经济学第一基本定理所依赖的“主体自主性”假设面临挑战——即当人工智能系统表现出不同程度的自主性时,如何重新界定其福利主体地位并维持经济均衡效率。解决方案的关键在于引入一个自主性限定条件(autonomy qualification),通过构建一个最小一般均衡模型,将自主性纳入福利状态分配、委托机制和验证制度中,从而确立“自主性完备竞争均衡”下达到“自主性帕累托最优”的条件;在此框架下,经典的第一基本定理可被视为低自主性极限情形下的特例。

链接: https://arxiv.org/abs/2604.21216
作者: Elija Perrier
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Under review

点击查看摘要

Abstract:The First Fundamental Theorem of Welfare Economics assumes that welfare-bearing agents are autonomous and implicitly relies on a binary distinction between autonomy and instrumentality. Welfare subjects are those who have autonomy and therefore the capacity to choose and enter into utility comparisons, while everything else does not. In post-AGI economies this presupposition becomes nontrivial because artificial systems may exhibit varying degrees of autonomy, functioning as tools, delegates, strategic market actors, manipulators of choice environments, or possible welfare subjects. We argue that the theorem ought to be subject to an autonomy qualification where the impact of these changes in autonomy assumptions is incorporated. Using a minimal general-equilibrium model with autonomy-conditioned welfare, welfare-status assignment, delegation accounting, and verification institutions, we set out conditions for which autonomy-complete competitive equilibrium is autonomy-Pareto efficient. The classical theorem is recovered as the low-autonomy limit.

[AI-100] Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery

【速读】:该论文旨在解决Ramsey-good图中一类特殊子类——双重饱和Ramsey-good图的存在性问题,即寻找那些在添加或删除任意边后必然产生大小为 $ s $ 的团(clique)或大小为 $ t $ 的独立集(independent set)的图。这一问题由Grinstead和Roberts于1982年提出,长期未被解决。论文的关键解决方案是将SAT求解器与大语言模型(Large Language Models, LLMs)生成的定制化代码相结合,用于系统性地发现此类图的无限家族;同时利用LLMs自动生成并形式化证明,以在Lean定理证明器中验证结果的正确性。该方法体现了自动化推理、大语言模型与形式化验证协同驱动的新型实验数学工作流,显著提升了数学发现的效率与可靠性。

链接: https://arxiv.org/abs/2604.21187
作者: Benjamin Przybocki,John Mackey,Marijn J. H. Heule,Bernardo Subercaseaux
机构: 未知
类目: Combinatorics (math.CO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ramsey-good graphs are graphs that contain neither a clique of size s nor an independent set of size t. We study doubly saturated Ramsey-good graphs, defined as Ramsey-good graphs in which the addition or removal of any edge necessarily creates an s-clique or a t-independent set. We present a method combining SAT solving with bespoke LLM-generated code to discover infinite families of such graphs, answering a question of Grinstead and Roberts from 1982. In addition, we use LLMs to generate and formalize correctness proofs in Lean. This case study highlights the potential of integrating automated reasoning, large language models, and formal verification to accelerate mathematical discovery. We argue that such tool-driven workflows will play an increasingly central role in experimental mathematics.
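"双重饱和"这一定义可以直接用暴力枚举在小图上验证(示意代码,非论文的 SAT 方法):对每条可加或可删的边,检查改动后是否仍是 Ramsey-good。例如五元环 C5 是 (3,3) 情形下的一个双重饱和图,而四点路径 P4 不是。

```python
from itertools import combinations

def _edge_set(edges):
    return set(frozenset(e) for e in edges)

def has_clique(n, edges, s):
    # 是否存在 s 个顶点两两相邻
    E = _edge_set(edges)
    return any(all(frozenset(p) in E for p in combinations(c, 2))
               for c in combinations(range(n), s))

def has_indep_set(n, edges, t):
    # 是否存在 t 个顶点两两不相邻
    E = _edge_set(edges)
    return any(all(frozenset(p) not in E for p in combinations(c, 2))
               for c in combinations(range(n), t))

def is_ramsey_good(n, edges, s, t):
    return not has_clique(n, edges, s) and not has_indep_set(n, edges, t)

def is_doubly_saturated(n, edges, s, t):
    if not is_ramsey_good(n, edges, s, t):
        return False
    E = _edge_set(edges)
    # 加任意一条边必须破坏 Ramsey-good 性(产生 s-团)
    for p in combinations(range(n), 2):
        e = frozenset(p)
        if e not in E and is_ramsey_good(n, E | {e}, s, t):
            return False
    # 删任意一条边也必须破坏 Ramsey-good 性(产生 t-独立集)
    for e in E:
        if is_ramsey_good(n, E - {e}, s, t):
            return False
    return True
```

这种指数级枚举只适合小规模验证;论文正是为了系统发现无限族才引入 SAT 求解与 LLM 生成代码。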

[AI-101] Generative Discovery of Magnetic Insulators under Competing Physical Constraints

【速读】:该论文旨在解决计算材料设计中同时满足多重竞争约束(如稳定性、磁性与绝缘性)的难题,尤其在数据稀缺场景下传统数据驱动方法效果不佳的问题。其解决方案的关键在于提出了一种约束引导的生成式发现框架MagMatLLM,该框架将基于语言模型的晶体生成与进化选择、代理筛选及第一性原理验证相结合,在生成和筛选阶段即强制执行功能约束,从而引导搜索向由竞争物理需求定义的材料空间稀疏区域聚焦,有效提升了在复杂多目标条件下识别新型磁性绝缘体的可能性。

链接: https://arxiv.org/abs/2604.21073
作者: Qiulin Zeng,Tahiya Chowdhury,Md Shafayat Hossain
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discovering materials that must simultaneously satisfy multiple competing constraints remains a central challenge in computational materials design, particularly in data-scarce regimes where conventional data-driven approaches are least effective. Magnetic insulators represent a stringent example: the electronic conditions that favor magnetic order often also promote metallicity, while insulating behavior suppresses the interactions that stabilize magnetism. As a result, experimentally viable magnetic insulators are rare and difficult to identify through conventional screening. Here, we introduce MagMatLLM, a constraint-guided generative discovery framework that integrates language-model-based crystal generation with evolutionary selection, surrogate screening, and first-principles validation to target simultaneous stability, magnetism, and insulating behavior. Unlike stability-first approaches, the framework enforces functional constraints during generation and selection, steering the search toward sparsely populated regions of materials space defined by competing physical requirements. Using this workflow, we identify twelve previously unreported candidate magnetic insulators, including Tm _4 Co _2 Cr _2 O _12 and Cr _4 Nb _2 O _12 . Of these, ten are dynamically stable by phonon analysis and exhibit finite band gaps and nonzero magnetic moments in spin-polarized density functional theory calculations. Beyond the specific compounds identified here, this work establishes a general constraint-guided paradigm for multi-objective materials discovery in sparse chemical spaces and provides a transferable strategy for the design of quantum materials under competing physical constraints.

[AI-102] Expanding the extreme-k dielectric materials space through physics-validated generative reasoning

【速读】:该论文旨在解决数据驱动型材料发现中因稀缺性导致的瓶颈问题,即当前机器学习模型在已知化合物范围内表现优异,但在生成真正新颖材料时能力有限。针对这一挑战,研究提出了一种名为DielecMIND的人工智能框架,其关键在于将材料发现从传统的数据库筛选转变为基于推理的探索过程:首次结合大语言模型(Large Language Model, LLM)进行假设生成与经物理验证的第一性原理计算,从而在化学空间中高效导航并发现超出已知化合物的新材料。该方法显著扩展了高介电常数(high-kappa dielectric)材料的种类,在仅一次研究中便新增5种κ > 150的稳定材料,其中Ba₂TiHfO₆表现出高达637的介电常数、低频光学损耗及800 K的热稳定性,证明了该范式在稀有功能性材料发现中的有效性与普适潜力。

链接: https://arxiv.org/abs/2604.21068
作者: Hossain Hridoy,Tahiya Chowdhury,Md Shafayat Hossain
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The most technologically consequential materials are often the rarest: they occupy narrow regions of chemical space, obey competing physical constraints, and appear only sparsely in existing databases. High-kappa dielectrics, high-Tc superconductors, and ferromagnetic insulators are but a few examples. This scarcity fundamentally limits today’s data-driven materials discovery, where machine-learning models excel at interpolation but struggle to generate genuinely new candidates. Here, we introduce DielecMIND, an artificial intelligence framework that reframes materials discovery as a reasoning-driven exploration instead of a database-screening problem. Using high-kappa dielectrics as a data-scarce and technologically stringent test case, DielecMIND combines large-language-model hypothesis generation for the first time with physics-validated first-principles calculations to navigate chemical space beyond known compounds. Prior to our work, only 14 experimentally or computationally validated materials with kappa > 150 were known. Our framework discovers and validates 5 new such compounds, expanding this rare-materials class by a remarkable ~35% in a single study. Among them, we find that Ba2TiHfO6 exhibits a dielectric constant of 637, minimal loss at low optical frequencies, and stability up to 800 K. Beyond dielectrics, this work demonstrates a new paradigm for artificial-intelligence-guided discovery: one that generates a small number of physically grounded, experimentally plausible candidates yet measurably expands sparsely populated functional materials spaces. Thus, DielecMIND points toward a general strategy for discovering rare, high-impact functional materials where data scarcity has long constrained progress.

[AI-103] Integrated packing placement scheduling and routing of personalized production: a pharmaceutical Industry 4.0 use-case with a planar transport system

【速读】:该论文旨在解决平面运输系统(planar transport systems)下柔性制造系统(Flexible Manufacturing Systems, FMS)中生产作业与内部物流调度的协同优化问题,特别是在个性化药物自动化生产场景中的实际应用。其核心挑战在于如何同时优化战术层(tactical level)的生产线布局与药剂分配器定位,以及操作层(operational level)的移动设备调度与路径规划。解决方案的关键在于:在战术层采用混合整数二次规划(Mixed-Integer Quadratic Programming)模型挖掘历史处方数据中的药物共现模式以优化药剂分配器的布局;在操作层则通过约束规划(Constraint Programming)建模移动设备为资源池确保订单完整性,并结合迭代冲突消解机制和有向无环图(DAG)推理实现无冲突路径生成,从而在保证调度有效性的同时具备良好的计算可扩展性,适用于每日处理多达500个订单的工业级规模。

链接: https://arxiv.org/abs/2604.21029
作者: Viktor Emil Korladinov,Antonin Novak,Zdeněk Hanzálek,Erik Sonntag,František Štěpánek
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent emergence of planar transport systems necessitates re-evaluation of Flexible Manufacturing Systems (FMS) to address the simultaneous scheduling of internal logistics and production operations. By operating on a tile-based planar grid, these systems allow independent movers full two-dimensional freedom, mitigating inefficiencies inherent to traditional sequential lines. This paper applies a planar FMS framework to a real-world use case in the pharmaceutical industry: the automated production of personalized drugs. Implementing this system requires solving optimization problems at both tactical and operational levels. The tactical level involves decisions regarding production line layout and the positioning of drug dispensers. A Mixed-Integer Quadratic Programming model is utilized for the packing problem to exploit drug co-occurrence patterns found in historical patient data. Subsequently, we solve the placement problem - a bi-level problem combining an assignment problem with Shortest Hamiltonian paths with neighborhoods - to arrange dispensers in a layout minimizing expected travel distances. The operational level is encountered daily, scheduling individual movers to process new orders as quickly as possible. This scheduling problem is formulated using Constraint Programming, modeling movers as reservoir resources to ensure order completeness, complemented by a routing phase using an iterative conflict-resolution mechanism and DAG-based reasoning to convert schedules into conflict-free paths. Evaluation using real-world prescription data for 40 drugs shows the framework scales efficiently across several layout topologies for up to 500 orders, with schedules that are highly effective and computationally tractable for daily operations. 

[AI-104] Planetary Exploration 3.0: A Roadmap for Software-Defined Radically Adaptive Space Systems

【速读】:该论文旨在解决外太阳系天体(如冰卫星、柯伊伯带天体等)探索中因巡航时间长达十年以上而导致传统逐次迭代式探测(Planetary Exploration 2.0)不可行的问题。针对这一挑战,作者提出“行星探索3.0”(Planetary Exploration 3.0, PE 3.0)新范式,其核心解决方案是采用软件定义空间系统(Software-Defined Space Systems, SDSSs),即通过软件更新实现航天器在功能、架构和能力上的全层级自适应性,从而支持单次或少数几次任务完成从初步探索到假设驱动的后续科学实验,并能在未知环境中具备韧性运行能力。

链接: https://arxiv.org/abs/2604.20910
作者: Masahiro Ono,Daniel Selva,Morgan L. Cable,Marie Ethvignot,Margaret Hansen,Andreas M. Hein,Elena-Sorina Lupu,Zachary Manchester,David Murrow,Chad Pozarycki,Pascal Spino,Amanda Stockton,Mathieu Choukroun,Soon-Jo Chung,John Day,Alexander Demagall,Anthony Freeman,Chloe Gentgen,Michel D. Ingham,Charity M. Phillips-Lander,Richard Rieber,Alejandro Salado,Maria Sakovsky,Lori R. Shiraishi,Yisong Yue,Kris Zacny
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The surface and subsurface of worlds beyond Mars remain largely unexplored. Yet these worlds hold keys to fundamental questions in planetary science - from potentially habitable subsurface oceans on icy moons to ancient records preserved in Kuiper Belt objects. NASA’s success in Mars exploration was achieved through incrementalism: 22 progressively sophisticated missions over decades. This paradigm, which we call Planetary Exploration 2.0 (PE 2.0), is untenable for the outer Solar System, where cruise times of a decade or more make iterative missions infeasible. We propose Planetary Exploration 3.0 (PE 3.0): a paradigm in which unvisited worlds are explored by a single or a few missions with radically adaptive space systems. A PE 3.0 mission conducts both initial exploratory science and follow-on hypothesis-driven science based on its own in situ data returns, evolving spacecraft capabilities to work resiliently in previously unseen environments. The key enabler of PE 3.0 is software-defined space systems (SDSSs) - systems that can adapt their functions at all levels through software updates. This paper presents findings from a Keck Institute for Space Studies (KISS) workshop on PE 3.0, covering: (1) PE 3.0 systems engineering including science definition, architecture, design methods, and verification and validation; (2) software-defined space system technologies including reconfigurable hardware, multi-functionality, and modularity; (3) onboard intelligence including autonomous science, navigation, controls, and embodied AI; and (4) three PE 3.0 mission concepts: a Neptune/Triton smart flyby, an ocean world explorer, and an Oort cloud reconnaissance mission.

[AI-105] Predicting Scale-Up of Metal-Organic Framework Syntheses with Large Language Models

【速读】:该论文旨在解决金属有机框架材料(Metal-Organic Framework, MOF)从实验室发现到工业部署过程中因规模化合成知识分散于不同文献而难以高效推进的问题。其解决方案的关键在于构建了一个基于文献挖掘的ESU-MOF数据集,并采用正样本-未标记样本(positive-unlabeled)学习策略,对大型语言模型进行微调,从而实现对MOF规模化潜力的高精度预测(准确率达91.4%),为工业级MOF筛选提供了快速、数据驱动的甄别方法。

链接: https://arxiv.org/abs/2604.20899
作者: Peter Walther,Hongrui Sheng,Xinxin Liu,Bin Feng,Reid Coyle,Xinhua Yan,Kyle Smith,Harrison Kayal,Shyam Chand Pal,Zhiling Zheng
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 39 pages

点击查看摘要

Abstract:Scalable synthesis remains the gate between MOF discovery and industrial deployment, as scale-up know-how is fragmented across disparate reports. We introduce ESU-MOF, a literature-mined dataset and a positive-unlabeled learning strategy that fine-tunes large language models to predict scalability potential with 91.4% accuracy, enabling rapid data-driven triage for industrial MOF discovery.
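论文采用正样本-未标注(PU)学习策略微调大模型;其具体训练目标未在摘要中给出,这里仅示意 PU 学习中常见的非负风险估计(nnPU,Kiryo et al. 2017 的形式)骨架,损失函数与数值均为演示假设:

```python
def hinge(z, y):
    # hinge 损失 l(z, y) = max(0, 1 - y*z),y ∈ {+1, -1}
    return max(0.0, 1.0 - y * z)

def nnpu_risk(scores_pos, scores_unl, prior, loss=hinge):
    """非负 PU 风险估计:
    R = pi * R_p^+ + max(0, R_u^- - pi * R_p^-)
    其中 pi 为正类先验,R_u^- 用未标注样本估计负类风险并做去偏。"""
    r_p_pos = sum(loss(z, +1) for z in scores_pos) / len(scores_pos)
    r_p_neg = sum(loss(z, -1) for z in scores_pos) / len(scores_pos)
    r_u_neg = sum(loss(z, -1) for z in scores_unl) / len(scores_unl)
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)
```

max(0, ·) 截断避免去偏项变负导致过拟合,这正是 PU 学习区别于普通二分类训练的关键一步。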

[AI-106] HHL with a Coherent Fourier Oracle: A Proof-of-Concept Quantum Architecture for Joint Melody-Harmony Generation

【速读】:该论文旨在解决如何在音乐生成任务中实现量子算法的实用化,特别是利用具有理论指数级加速优势的Harrow-Hassidim-Lloyd (HHL)算法来编码旋律偏好,并通过构造一个相干的傅里叶谐波预言机(Fourier harmonic oracle)确保量子速度优势不被经典读出所抵消。其关键解决方案在于设计一个酉变换(unitary)操作,将和弦转换权重直接作用于HHL输出幅值向量上,使得单次测量即可联合选择两个音符与一个两和弦进行曲,从而维持量子计算的相干性;同时采用2/2块结构限制联合状态空间的指数增长,以支持可扩展的经典链式处理,最终实现了8个音符在8个和弦上的语法正确过渡,且97%的和弦进行经规则验证为强或可接受,证明了HHL+预言机流水线在音乐生成场景下的机械可行性。

链接: https://arxiv.org/abs/2604.20882
作者: Alexis Kirke
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Quantum algorithms with a proven theoretical speedup over classical computation are rare. Among the most prominent is the Harrow-Hassidim-Lloyd (HHL) algorithm for solving sparse linear systems. Here, HHL is applied to encode melodic preference: the system matrix encodes Narmour implication-realisation and Krumhansl-Kessler tonal stability, so its solution vector is a music-cognition-weighted note-pair distribution. The key constraint of HHL is that reading its output classically cancels the quantum speedup; the solution must be consumed coherently. This motivates a coherent Fourier harmonic oracle: a unitary that applies chord-transition weights directly to the HHL amplitude vector, so that a single measurement jointly selects both melody notes and a two-chord progression. A two-note/two-chord (2/2) block is used to contain the exponential growth of the joint state space that would otherwise make classical simulation of larger blocks infeasible. For demonstrations of longer passages, blocks are chained classically - each block’s collapsed output conditions the next – as a temporary workaround until fault-tolerant hardware permits larger monolithic circuits. A four-block chain produces 8 notes over 8 chords with grammatically valid transitions at every block boundary. Independent rule-based harmony validation confirms that 97% of generated chord progressions are rated strong or acceptable. The primary motivation is that HHL carries a proven exponential speedup over classical linear solvers; this work demonstrates that a coherent HHL+oracle pipeline - the prerequisite for that speedup to be realised in a musical setting - is mechanically achievable. Audio realisations of representative outputs are made available for listening online. 
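"预言机对 HHL 输出幅值施加和弦转移权重"这一步可以用经典向量粗略演示(纯示意:真实流水线中该步骤须实现为酉算子并保持相干,对角加权本身一般并非酉变换;数值为演示假设):

```python
import math

def apply_diagonal_oracle(amplitudes, weights):
    """对解向量幅值逐项乘以(和弦转移)权重并重新归一化,
    经典地模拟权重对测量分布的影响。"""
    weighted = [a * w for a, w in zip(amplitudes, weights)]
    norm = math.sqrt(sum(x * x for x in weighted))
    if norm == 0:
        raise ValueError("all amplitudes suppressed")
    return [x / norm for x in weighted]
```

权重为零的分量(语法上不允许的转移)在归一化后被完全抑制,单次测量只会落在合法的音符-和弦组合上。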

机器学习

[LG-0] Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

链接: https://arxiv.org/abs/2604.21930
作者: Nicolae Filat,Ahmed Hussain,Konstantinos Kalogiannis,Elena Burceanu
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures

点击查看摘要

Abstract:Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.
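摘要中报告的"遗忘"与"后向迁移(BWT)"是连续学习评测的两个标准指标,其常用定义可用准确率矩阵直接算出(示意代码,非论文实现;acc[t][j] 表示训练完第 t 个任务后在任务 j 上的表现):

```python
def forgetting_and_bwt(acc):
    """acc[t][j]: 训练完第 t 个任务后在任务 j 上的准确率(t、j 从 0 起)。
    遗忘:历史最好成绩与最终成绩之差;BWT:最终成绩与刚学完该任务时之差。
    返回 (平均遗忘, 平均 BWT),只对最后一个任务之前的任务统计。"""
    T = len(acc)
    forgets = [max(acc[t][j] for t in range(j, T - 1)) - acc[T - 1][j]
               for j in range(T - 1)]
    bwt = [acc[T - 1][j] - acc[j][j] for j in range(T - 1)]
    avg = lambda xs: sum(xs) / len(xs)
    return avg(forgets), avg(bwt)
```

遗忘为正、BWT 为负表示旧任务性能随新任务训练而退化;论文的观察是这些数值会随时间切分(taskification)方式显著变化。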

[LG-1] Fine-Tuning Regimes Define Distinct Continual Learning Problems

链接: https://arxiv.org/abs/2604.21927
作者: Paul-Tiberiu Iordache,Elena Burceanu
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current task fitting and knowledge preservation operate. This analysis motivates the hypothesis that method comparisons need not be invariant across regimes. We test this hypothesis in task incremental CL, five trainable depth regimes, and four standard methods: online EWC, LwF, SI, and GEM. Across five benchmark datasets, namely MNIST, Fashion MNIST, KMNIST, QMNIST, and CIFAR-100, and across 11 task orders per dataset, we find that the relative ranking of methods is not consistently preserved across regimes. We further show that deeper adaptation regimes are associated with larger update magnitudes, higher forgetting, and a stronger relationship between the two. These results show that comparative conclusions in CL can depend strongly on the chosen fine-tuning regime, motivating regime-aware evaluation protocols that treat trainable depth as an explicit experimental factor.

[LG-2] The Sample Complexity of Multicalibration

链接: https://arxiv.org/abs/2604.21923
作者: Natalie Collina,Jiuyao Lu,Georgy Noarov,Aaron Roth
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the minimax sample complexity of multicalibration in the batch setting. A learner observes n i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most \varepsilon with respect to a given family of groups. For every fixed \kappa > 0, in the regime |G| \le \varepsilon^{-\kappa}, we prove that \widetilde{\Theta}(\varepsilon^{-3}) samples are necessary and sufficient, up to polylogarithmic factors. The lower bound holds even for randomized predictors, and the upper bound is realized by a randomized predictor obtained via an online-to-batch reduction. This separates the sample complexity of multicalibration from that of marginal calibration, which scales as \widetilde{\Theta}(\varepsilon^{-2}), and shows that mean-ECE multicalibration is as difficult in the batch setting as it is in the online setting, in contrast to marginal calibration, which is strictly more difficult in the online setting. In contrast, we observe that for \kappa = 0 the sample complexity of multicalibration remains \widetilde{\Theta}(\varepsilon^{-2}), exhibiting a sharp threshold phenomenon. More generally, we establish matching upper and lower bounds, up to polylogarithmic factors, for a weighted L_p multicalibration metric for all 1 \le p \le 2, with optimal exponent 3/p. We also extend the lower-bound template to a regular class of elicitable properties, and combine it with the online upper bounds of Hu et al. (2025) to obtain matching bounds for calibrating properties including expectiles and bounded-density quantiles.
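作为背景,分桶式 ECE 与"按组取最大"的多重校准误差可以用几行纯 Python 示意(论文对分组族与权重的形式化定义更精细,此处仅为朴素版本,数据为演示假设):

```python
def ece(preds, labels, n_bins=10):
    """期望校准误差:按预测值分桶,对 |桶内平均预测 - 桶内平均标签| 按桶大小加权求和。"""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p=1.0 落入最后一桶
        bins[idx].append((p, y))
    total = len(preds)
    err = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            mean_y = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(mean_p - mean_y)
    return err

def multicalibration_error(preds, labels, groups):
    """groups: 每个组是一个样本下标列表;取各组内 ECE 的最大值。"""
    return max(ece([preds[i] for i in g], [labels[i] for i in g]) for g in groups)
```

一个在全体上校准良好的预测器仍可能在某个子组上严重失准,这正是多重校准比边际校准更难(样本复杂度从 \varepsilon^{-2} 升到 \varepsilon^{-3})的直观来源。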

[LG-3] Low-Rank Adaptation Redux for Large Models

链接: https://arxiv.org/abs/2604.21905
作者: Bingcong Li,Yilang Zhang,Georgios B. Giannakis
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) has emerged as the de facto standard for parameter-efficient fine-tuning (PEFT) of foundation models, enabling the adaptation of billion-parameter networks with minimal computational and memory overhead. Despite its empirical success and rapid proliferation of variants, it remains elusive which architectural choices, optimization techniques, and deployment constraints should guide practical method selection. This overview revisits LoRA through the lens of signal processing (SP), bridging modern adapter designs with classical low-rank modeling tools and inverse problems, as well as highlighting how SP principles can inform principled advances of fine-tuning approaches. Rather than providing a comprehensive enumeration and empirical comparisons of LoRA variants, emphasis is placed on the technical mechanisms underpinning these approaches to justify their effectiveness. These advances are categorized into three complementary axes: architectural design, efficient optimization, and pertinent applications. The first axis builds on singular value decomposition (SVD)-based factorization, rank-augmentation constructions, and cross-layer tensorization, while the second axis deals with initialization, alternating solvers, gauge-invariant optimization, and parameterization-aware methods. Beyond fine-tuning, emerging applications of LoRA are accounted across the entire lifecycle of large models, ranging from pre- and post-training to serving/deployment. Finally, open research directions are outlined at the confluence of SP and deep learning to catalyze a bidirectional frontier: classical SP tools provide a principled vocabulary for designing principled PEFT methods, while the unique challenges facing modern deep learning, especially the overwhelming scale and prohibitive overhead, also offer new research lines benefiting the SP community in return.
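LoRA 的核心是冻结预训练权重 W,仅训练低秩增量 Delta W = B A(B ∈ R^{d×r}, A ∈ R^{r×k},r 远小于 d、k)。下面是该前向形式的纯 Python 示意(维度约定为常见写法,矩阵数值为演示假设):

```python
def matmul(A, B):
    # 朴素矩阵乘法:A (n×r) × B (r×m) → n×m
    n, r, m = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(r)) for j in range(m)]
            for i in range(n)]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x (W + alpha * B A):W 冻结,仅 B (d×r)、A (r×k) 可训练。
    增量 B A 的秩最多为 r,故可训练参数量远小于全量微调。"""
    delta = matmul(B, A)  # d×k 的低秩增量
    W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul([x], W_eff)[0]
```

实际部署中 B A 可在推理前合并进 W,不增加任何推理开销,这也是 LoRA 被广泛采用的原因之一。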

[LG-4] An effective variant of the Hartigan k-means algorithm

链接: https://arxiv.org/abs/2604.21798
作者: François Clément,Stefan Steinerberger
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The k-means problem is perhaps the classical clustering problem and is often synonymous with Lloyd’s algorithm (1957). It has become clear that Hartigan’s algorithm (1975) gives better results in almost all cases; Telgarsky and Vattani note a typical improvement of 5%–10%. We point out that a very minor variation of Hartigan’s method leads to a further 2%–5% improvement; the improvement tends to become larger as either the dimension or k increases.
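Hartigan 方法与 Lloyd 的区别在于逐点考虑移动:把点 x 从簇 A 移入簇 B,当且仅当按大小修正的判据 n_B/(n_B+1)·(x-mu_B)^2 < n_A/(n_A-1)·(x-mu_A)^2 成立。下面是基线 Hartigan 移动规则的一维纯 Python 示意(论文的变体未在摘要中给出细节,此处不含该改动;假设初始划分使每个簇非空):

```python
def hartigan_kmeans_1d(points, assign, k, max_iter=100):
    """一维 Hartigan 式 k-means:反复扫描所有点,按簇大小修正的
    判据执行能降低目标函数的单点移动,直到不再有改进。"""
    assign = list(assign)
    for _ in range(max_iter):
        moved = False
        for i, x in enumerate(points):
            a = assign[i]
            ca = [p for p, c in zip(points, assign) if c == a]
            if len(ca) == 1:
                continue  # 简化:不允许清空簇
            mu_a = sum(ca) / len(ca)
            gain = len(ca) / (len(ca) - 1) * (x - mu_a) ** 2
            for b in range(k):
                if b == a:
                    continue
                cb = [p for p, c in zip(points, assign) if c == b]
                mu_b = sum(cb) / len(cb)
                cost = len(cb) / (len(cb) + 1) * (x - mu_b) ** 2
                if cost < gain:  # Hartigan 判据:移动后目标下降
                    assign[i] = b
                    moved = True
                    break
        if not moved:
            break
    return assign
```

与 Lloyd 的"先全量分配、再更新质心"不同,Hartigan 在判据中计入了移动对两个簇均值的影响,因此能逃离一部分 Lloyd 的局部最优。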

[LG-5] Compliance Moral Hazard and the Backfiring Mandate

链接: https://arxiv.org/abs/2604.21789
作者: Jian Ni,Lecheng Zheng,John R Birge
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Competing firms that serve shared customer populations face a fundamental information aggregation problem: each firm holds fragmented signals about risky customers, but individual incentives impede efficient collective detection. We develop a mechanism design framework for decentralized risk analytics, grounded in anti-money laundering in banking networks. Three strategic frictions distinguish our setting: compliance moral hazard, adversarial adaptation, and information destruction through intervention. A temporal value assignment (TVA) mechanism, which credits institutions using a strictly proper scoring rule on discounted verified outcomes, implements truthful reporting as a Bayes–Nash equilibrium (uniquely optimal at each edge) in large federations. Embedding TVA in a banking competition model, we show competitive pressure amplifies compliance moral hazard and poorly designed mandates can reduce welfare below autarky, a “backfiring” result with direct policy implications. In simulation using a synthetic AML benchmark, TVA achieves substantially higher welfare than autarky or mandated sharing without incentive design.
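TVA 机制依赖"严格适当评分规则下诚实报告期望得分最高"这一性质。下面用最常见的 Brier 评分在网格上数值验证该性质(示意:论文的规则作用于贴现后的已验证结果,此处仅演示适当性本身,数值为演示假设):

```python
def brier_score(p, y):
    """Brier 评分(取负号使得分越大越好):严格适当评分规则。"""
    return -(p - y) ** 2

def expected_score(report, true_prob):
    # 结果 y∈{0,1} 以 true_prob 为真实概率时,报告 report 的期望得分
    return true_prob * brier_score(report, 1) + (1 - true_prob) * brier_score(report, 0)

# 在网格上验证:期望得分恰在 report == true_prob 处最大,即诚实报告最优
q = 0.7
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda r: expected_score(r, q))
```

期望得分对 report 求导为 -2(report - q),唯一极大点正是真实概率 q,这就是"适当性"保证激励相容的原因。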

[LG-6] PrismaDV: Automated Task-Aware Data Unit Test Generation

链接: https://arxiv.org/abs/2604.21765
作者: Hao Chen,Arnab Phani,Sebastian Schelter
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Data is a central resource for modern enterprises, and data validation is essential for ensuring the reliability of downstream applications. However, existing automated data unit testing frameworks are largely task-agnostic: they validate datasets without considering the semantics and requirements of the code that consumes the data. We present PrismaDV, a compound AI system that analyzes downstream task code together with dataset profiles to identify data access patterns, infer implicit data assumptions, and generate task-aware executable data unit tests. To further adapt the data unit tests over time to specific datasets and downstream tasks, we propose “Selective Informative Feedback for Task Adaptation” (SIFTA), a prompt-optimization framework that leverages the scarce outcomes from the execution of data unit tests and downstream tasks. We evaluate PrismaDV on two new benchmarks spanning 60 tasks across five datasets, where it consistently outperforms both task-agnostic and task-aware baselines in generating unit tests that reflect the end-to-end impact of data errors. Furthermore, we show that with SIFTA, we can automatically learn prompts for PrismaDV’s modules that outperform prompts written by hand or generated from a generic prompt optimizer. We publicly release our benchmarks and prototype implementation.

[LG-7] Transferable Physics-Informed Representations via Closed-Form Head Adaptation (IJCNN 2026)

链接: https://arxiv.org/abs/2604.21761
作者: Jian Cheng Wong,Isaac Yin Chung Lai,Pao-Hsiung Chiu,Chin Chun Ooi,Abhishek Gupta,Yew-Soon Ong
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注: Accepted at IJCNN 2026

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have garnered significant interest for their potential in solving partial differential equations (PDEs) that govern a wide range of physical phenomena. By incorporating physical laws into the learning process, PINN models have demonstrated the ability to learn physical outcomes reasonably well. However, current PINN approaches struggle to predict or solve new PDEs effectively when there is a lack of training examples, indicating they do not generalize well to unseen problem instances. In this paper, we present a transferable learning approach for PINNs premised on a fast Pseudoinverse PINN framework (Pi-PINN). Pi-PINN learns a transferable physics-informed representation in a shared embedding space and enables rapid solving of both known and unknown PDE instances via closed-form head adaptation using a least-squares-optimal pseudoinverse under PDE constraints. We further investigate the synergies between data-driven multi-task learning loss and physics-informed loss, providing insights into the design of more performant PINNs. We demonstrate the effectiveness of Pi-PINN on various PDE problems, including Poisson’s equation, Helmholtz equation, and Burgers’ equation, achieving fast and accurate physics-informed solutions without requiring any data for unseen instances. Pi-PINN can produce predictions 100-1000 times faster than a typical PINN, while producing predictions with 10-100 times lower relative error than a typical data-driven model even with only two training samples. Overall, our findings highlight the potential of transferable representations with closed-form head adaptation to enhance the efficiency and generalization of PINNs across PDE families and scientific and engineering applications.
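"共享表示 + 闭式头适配"的思想可以用最小二乘头示意:特征 phi 由冻结的主干给出(此处手工构造),头部权重一步由正规方程 w = (Phi^T Phi)^{-1} Phi^T y 解出,无需迭代训练。以下是限定两列特征的纯 Python 草图(仅示意思想,非 Pi-PINN 实现,亦未包含 PDE 约束项):

```python
def lstsq_head(phi, y):
    """闭式最小二乘头:解正规方程 (Phi^T Phi) w = Phi^T y。
    此处限定 phi 为两列特征,用 2x2 矩阵逆直接求解。"""
    a = sum(r[0] * r[0] for r in phi)
    b = sum(r[0] * r[1] for r in phi)
    d = sum(r[1] * r[1] for r in phi)
    u = sum(r[0] * t for r, t in zip(phi, y))
    v = sum(r[1] * t for r, t in zip(phi, y))
    det = a * d - b * b
    w0 = (d * u - b * v) / det
    w1 = (a * v - b * u) / det
    return w0, w1
```

由于只需一次线性求解,对新问题实例的适配开销几乎可以忽略,这正是摘要中"快 100-1000 倍"这类加速的来源。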

[LG-8] Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks

链接: https://arxiv.org/abs/2604.21696
作者: Liane Vogel,Kavitha Srinivas,Niharika D’Souza,Sola Shirai,Oktie Hassanzadeh,Horst Samulowitz
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table. Evaluating a diverse set of tabular representation learning models, we show that which model to use depends on the task and representation level. Our results offer practical guidance for selecting tabular embeddings in real-world applications and lay the groundwork for developing more general-purpose tabular representation models.

[LG-9] Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2

链接: https://arxiv.org/abs/2604.21690
作者: Isabel Kurth,Paulo Yanez Sarmiento,Bernhard Y. Renard
类目: Machine Learning (cs.LG)
*备注: Accepted at the 4th World Conference on Explainable Artificial Intelligence, XAI-2026

点击查看摘要

Abstract:Explaining deep neural network predictions on genome sequences enables biological insight and hypothesis generation, often of greater interest than predictive performance alone. While explanations of convolutional neural networks (CNNs) have been shown to capture relevant patterns in genome sequences, it is unclear whether this transfers to more expressive Transformer-based genome language models (gLMs). To answer this question, we adapt AttnLRP, an extension of layer-wise relevance propagation to the attention mechanism, and apply it to the state-of-the-art gLM DNABERT-2. Thereby, we propose strategies to transfer explanations from token to nucleotide level. We evaluate the adaptation of AttnLRP on genomic datasets using multiple metrics. Further, we provide an extensive comparison between the explanations of DNABERT-2 and a baseline CNN. Our results demonstrate that AttnLRP yields reliable explanations corresponding to known biological patterns. Hence, like CNNs, gLMs can also help derive biological insights. This work contributes to the explainability of gLMs and addresses the comparability of relevance attributions across different architectures.

[LG-10] A-IC3: Learning-Guided Adaptive Inductive Generalization for Hardware Model Checking

链接: https://arxiv.org/abs/2604.21688
作者: Xiaofeng Zhou,Guangyu Hu,Hongce Zhang,Wei Zhang
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The IC3 algorithm represents the state-of-the-art (SOTA) hardware model checking technique, owing to its robust performance and scalability. A significant body of research has focused on enhancing the solving efficiency of the IC3 algorithm, with particular attention to the inductive generalization process: a critical phase wherein the algorithm seeks to generalize a counterexample to inductiveness (CTI), which typically is a state leading to a bad state, into a broader set of states. This inductive generalization is a primary source of clauses in IC3 and thus plays a pivotal role in determining the overall effectiveness of the algorithm. Despite its importance, existing approaches often rely on fixed inductive generalization strategies, overlooking the dynamic and context-sensitive nature of the verification environment in which spurious counterexamples arise. This rigidity can limit the quality of generated clauses and, consequently, the performance of IC3. To address this limitation, we propose a lightweight machine-learning-based framework that dynamically selects appropriate inductive generalization strategies in response to the evolving verification context. Specifically, we employ a multi-armed bandit (MAB) algorithm to adaptively choose inductive generalization strategies based on real-time feedback from the verification process. The agent is updated by evaluating the quality of generalization outcomes, thereby refining its strategy selection over time. Empirical evaluation on a benchmark suite comprising 914 instances, primarily drawn from the latest HWMCC collection, demonstrates the efficacy of our approach. When implemented on the state-of-the-art model checker rIC3, our method solves 26 to 50 more cases than the baselines and improves the PAR-2 score by 194.72 to 389.29. 
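
The adaptive strategy selection can be sketched with a standard UCB1 bandit. The strategy names and the Bernoulli reward model below are illustrative stand-ins, not the paper's actual strategies or its real-time feedback signal.

```python
import math
import random

# UCB1 agent choosing among (hypothetical) inductive generalization strategies.
strategies = ["plain", "ctg", "aggressive"]
counts = {s: 0 for s in strategies}
values = {s: 0.0 for s in strategies}

def ucb_pick(t):
    for s in strategies:               # play each arm once before using UCB
        if counts[s] == 0:
            return s
    return max(strategies,
               key=lambda s: values[s] + math.sqrt(2 * math.log(t) / counts[s]))

def update(s, reward):
    counts[s] += 1
    values[s] += (reward - values[s]) / counts[s]  # incremental running mean

# Simulated environment: each strategy succeeds with some unknown probability.
random.seed(0)
true_mean = {"plain": 0.3, "ctg": 0.5, "aggressive": 0.7}
for t in range(1, 501):
    s = ucb_pick(t)
    update(s, random.random() < true_mean[s])      # Bernoulli reward

best = max(strategies, key=lambda s: values[s])
```

In the paper's setting the reward would come from the quality of the generalized clause rather than a fixed Bernoulli distribution, but the selection-and-update loop has this shape.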

[LG-11] Transferable SCF-Acceleration through Solver-Aligned Initialization Learning

链接: https://arxiv.org/abs/2604.21657
作者: Eike S. Eberhard,Viktor Kotsev,Timm Güthle,Stephan Günnemann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning methods that predict initial guesses from molecular geometry can reduce this cost, but matrix-prediction models fail when extrapolating to larger molecules, degrading rather than accelerating convergence [Liu et al. 2025]. We show that this failure is a supervision problem, not an extrapolation problem: models trained on ground-state targets fit those targets well out of distribution, yet produce initial guesses that slow convergence. Solver-Aligned Initialization Learning (SAIL) resolves this for both Hamiltonian and density matrix models by differentiating through the SCF solver end-to-end. We introduce the Effective Relative Iteration Count (ERIC), a correction to the commonly used RIC that accounts for hidden Fock-build overhead. On QM40, containing molecules up to $4\times$ larger than the training distribution, SAIL reduces ERIC by 37% (PBE), 33% (SCAN), and 27% (B3LYP), more than doubling the previous state-of-the-art reduction on B3LYP (10%). On QMugs molecules $10\times$ the training size, SAIL delivers a $1.25\times$ wall-time speedup at the hybrid level of theory, extending ML SCF acceleration to large drug-like molecules.

[LG-12] Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

链接: https://arxiv.org/abs/2604.21645
作者: Ashley N. Abraham,Andrew Strelzoff,Haley R. Dozier,Althea C. Henslee,Mark A. Chappell
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: To be published in the CSCE 2022 proceedings

点击查看摘要

Abstract:Large-scale Nearest Neighbor (NN) search, though widely utilized in the similarity search field, remains challenged by the computational limitations inherent in processing large-scale data. To decrease the computational expense, Approximate Nearest Neighbor (ANN) search is often used in applications that do not require exact similarity search and can instead rely on an approximation. Product Quantization (PQ) is a memory-efficient ANN method effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data incurs a heavy computational expense in both memory cost and execution time. This work presents a way to divide and conquer large-scale data in Python using PQ, Inverted Indexing, and Dask, combining the results without compromising accuracy while reducing computational requirements to the level needed for medium-scale data.
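
A minimal, single-machine sketch of product quantization (without the Dask parallelization or inverted index) shows the core idea: split each vector into subvectors, learn a small codebook per subspace, and store each vector as a tuple of centroid ids. Sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))   # toy dataset: 1000 vectors, dim 8
M, K, d = 4, 16, 2               # 4 subspaces of dim 2, 16 centroids each

def kmeans(data, k, iters=10):
    """A few Lloyd iterations; enough for an illustrative codebook."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        ids = np.argmin(((data[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(ids == j):
                centers[j] = data[ids == j].mean(axis=0)
    return centers

# One codebook per subspace (this per-subspace training is what a
# Dask version would distribute across workers).
codebooks = [kmeans(X[:, m * d:(m + 1) * d], K) for m in range(M)]

def pq_encode(x):
    return np.array([np.argmin(((codebooks[m] - x[m * d:(m + 1) * d]) ** 2).sum(-1))
                     for m in range(M)])

def pq_decode(code):
    return np.concatenate([codebooks[m][code[m]] for m in range(M)])

code = pq_encode(X[0])
recon = pq_decode(code)
err = np.linalg.norm(recon - X[0])   # quantization error for one vector
```

Each 8-dimensional float vector compresses to 4 small integer codes, which is where PQ's memory savings come from.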

[LG-13] Geometric Characterisation and Structured Trajectory Surrogates for Clinical Dataset Condensation

链接: https://arxiv.org/abs/2604.21638
作者: Pafue Christy Nganjimi,Andrew Soltan,Danielle Belgrave,Lei Clifton,David Clifton,Anshul Thakur
类目: Machine Learning (cs.LG)
*备注: 34 pages, 7 figures

点击查看摘要

Abstract:Dataset condensation constructs compact synthetic datasets that retain the training utility of large real-world datasets, enabling efficient model development and potentially supporting downstream research in governed domains such as healthcare. Trajectory matching (TM) is a widely used condensation approach that supervises synthetic data using changes in model parameters observed during training on real data, yet the structure of this supervision signal remains poorly understood. In this paper, we provide a geometric characterisation of trajectory matching, showing that a fixed synthetic dataset can only reproduce a limited span of such training-induced parameter changes. When the resulting supervision signal is spectrally broad, this creates a conditional representability bottleneck. Motivated by this mismatch, we propose Bezier Trajectory Matching (BTM), which replaces SGD trajectories with quadratic Bezier trajectory surrogates between initial and final model states. These surrogates are optimised to reduce average loss along the path while replacing broad SGD-derived supervision with a more structured, lower-rank signal that is better aligned with the optimisation constraints of a fixed synthetic dataset, and they substantially reduce trajectory storage. Experiments on five clinical datasets demonstrate that BTM consistently matches or improves upon standard trajectory matching, with the largest gains in low-prevalence and low-synthetic-budget settings. These results indicate that effective trajectory matching depends on structuring the supervision signal rather than reproducing stochastic optimisation paths.
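
The quadratic Bezier surrogate itself is simple to write down: B(t) = (1-t)² θ₀ + 2t(1-t) c + t² θ₁, with only the control point c free. The sketch below uses a toy quadratic loss and a grid search over c rather than the paper's training objective; note that placing c at the midpoint recovers the straight line between the endpoints.

```python
import numpy as np

theta0 = np.array([0.0, 0.0])                 # initial model state
theta1 = np.array([1.0, 1.0])                 # final model state
loss = lambda th: ((th - np.array([1.0, 0.0])) ** 2).sum()  # toy loss

def bezier(c, t):
    """Quadratic Bezier curve through theta0, c, theta1 at times t in [0, 1]."""
    t = t[:, None]
    return (1 - t) ** 2 * theta0 + 2 * t * (1 - t) * c + t ** 2 * theta1

ts = np.linspace(0, 1, 21)

def path_loss(c):
    """Average toy loss along the surrogate path."""
    return np.mean([loss(p) for p in bezier(np.asarray(c), ts)])

# Crude gradient-free search over the control point, for clarity only.
grid = np.linspace(-1, 2, 31)
c_best = min(((a, b) for a in grid for b in grid), key=path_loss)

straight = path_loss((theta0 + theta1) / 2)   # midpoint control = straight line
curved = path_loss(c_best)                    # optimized surrogate
```

Bending the path toward the low-loss region reduces the average loss while the endpoints stay fixed, which is the behaviour the surrogate optimisation relies on.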

[LG-14] A-THENA: Early Intrusion Detection for IoT with Time-Aware Hybrid Encoding and Network-Specific Augmentation

链接: https://arxiv.org/abs/2604.21623
作者: Ioannis Panopoulos,Maria Lamprini A. Bartsioka,Sokratis Nikolaidis,Stylianos I. Venieris,Dimitra I. Kaklamani,Iakovos S. Venieris
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of Internet of Things (IoT) devices has significantly expanded attack surfaces, making IoT ecosystems particularly susceptible to sophisticated cyber threats. To address this challenge, this work introduces A-THENA, a lightweight early intrusion detection system (EIDS) that significantly extends preliminary findings on time-aware encodings. A-THENA employs an advanced Transformer-based architecture augmented with a generalized Time-Aware Hybrid Encoding (THE), integrating packet timestamps to effectively capture temporal dynamics essential for accurate and early threat detection. The proposed system further employs a Network-Specific Augmentation (NA) pipeline, which enhances model robustness and generalization. We evaluate A-THENA on three benchmark IoT intrusion detection datasets (CICIoT23-WEB, MQTT-IoT-IDS2020, and IoTID20), where it consistently achieves strong performance. Averaged across all three datasets, it improves accuracy by 6.88 percentage points over the best-performing traditional positional encoding, 3.69 points over the strongest feature-based model, 6.17 points over the leading time-aware alternatives, and 5.11 points over related models, while achieving near-zero false alarms and false negatives. To assess real-world feasibility, we deploy A-THENA on the Raspberry Pi Zero 2 W, demonstrating its ability to perform real-time intrusion detection with minimal latency and memory usage. These results establish A-THENA as an agile, practical, and highly effective solution for securing IoT networks.
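
One plausible reading of a time-aware encoding, sketched below, indexes sinusoids by each packet's timestamp (seconds since the first packet) instead of its integer position, so irregular inter-arrival gaps are reflected in the encoding. The dimension and scale are illustrative assumptions, not the paper's THE definition.

```python
import numpy as np

def time_encoding(timestamps, dim=16, max_scale=10_000.0):
    """Sinusoidal encoding indexed by real-valued timestamps."""
    t = np.asarray(timestamps, dtype=float)[:, None]
    freqs = max_scale ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    enc = np.empty((len(t), dim))
    enc[:, 0::2] = np.sin(t * freqs)
    enc[:, 1::2] = np.cos(t * freqs)
    return enc

# Packets arriving at irregular times: a tight burst, then a long gap.
ts = [0.0, 0.01, 0.02, 5.0]
E = time_encoding(ts)
```

With integer positions, packets 2 and 4 would be "two steps apart" regardless of timing; here the 5-second gap pushes their encodings far apart while the burst packets stay close.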

[LG-15] Verifying Machine Learning Interpretability Requirements through Provenance

链接: https://arxiv.org/abs/2604.21599
作者: Lynn Vonderhaar,Juan Couder,Daryela Cisneros,Omar Ochoa
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) Engineering is a growing field that necessitates an increase in the rigor of ML development. It draws many ideas from software engineering and more specifically, from requirements engineering. Existing literature on ML Engineering defines quality models and Non-Functional Requirements (NFRs) specific to ML, in particular interpretability being one such NFR. However, a major challenge occurs in verifying ML NFRs, including interpretability. Although existing literature defines interpretability in terms of ML, it remains an immeasurable requirement, making it impossible to definitively confirm whether a model meets its interpretability requirement. This paper shows how ML provenance can be used to verify ML interpretability requirements. This work provides an approach for how ML engineers can save various types of model and data provenance to make the model’s behavior transparent and interpretable. Saving this data forms the basis of quantifiable Functional Requirements (FRs) whose verification in turn verifies the interpretability NFR. Ultimately, this paper contributes a method to verify interpretability NFRs for ML models.

[LG-16] A temporal deep learning framework for calibration of low-cost air quality sensors

链接: https://arxiv.org/abs/2604.21527
作者: Arindam Sengupta,Tony Bush,Ben Marner,Jose Miguel Pérez,Soledad Le Clainche
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-cost air quality sensors (LCS) provide a practical alternative to expensive regulatory-grade instruments, making dense urban monitoring networks possible. Yet their adoption is limited by calibration challenges, including sensor drift, environmental cross-sensitivity, and variability in performance from device to device. This work presents a deep learning framework for calibrating LCS measurements of PM$_{2.5}$, PM$_{10}$, and NO$_2$ using a Long Short-Term Memory (LSTM) network, trained on co-located reference data from the OxAria network in Oxford, UK. Unlike the Random Forest (RF) baseline, which treats each observation independently, the proposed approach captures temporal dependencies and delayed environmental effects through sequence-based learning, achieving higher $R^2$ values across training, validation, and test sets for all three pollutants. A feature set is constructed combining time-lagged parameters, harmonic encodings, and interaction terms to improve generalization on unseen temporal windows. Validation of unseen calibrated values against the Equivalence Spreadsheet Tool 3.1 demonstrates regulatory compliance with expanded uncertainties of 22.11% for NO$_2$, 12.42% for PM$_{10}$, and 9.1% for PM$_{2.5}$.
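
The described feature construction can be sketched as follows: time-lagged raw readings plus harmonic (time-of-day) encodings stacked per time step. The lag count, column layout, and harmonic terms are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def make_sequences(raw, hours, n_lags=3):
    """raw: (T,) sensor series; hours: (T,) hour-of-day.
    Returns (T - n_lags, n_lags + 3) rows of [lags..., sin, cos, current]."""
    rows = []
    for t in range(n_lags, len(raw)):
        lags = raw[t - n_lags:t]                       # previous readings
        harm = [np.sin(2 * np.pi * hours[t] / 24),     # daily harmonic terms
                np.cos(2 * np.pi * hours[t] / 24)]
        rows.append(np.concatenate([lags, harm, [raw[t]]]))
    return np.array(rows)

raw = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])   # toy sensor readings
hours = np.array([0, 1, 2, 3, 4, 5])
X = make_sequences(raw, hours)
```

Feeding lagged windows like these to a sequence model is what lets it represent delayed environmental effects that a row-independent model such as an RF cannot see.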

[LG-17] Conditional anomaly detection with soft harmonic functions ICDM2011 ICDM

链接: https://arxiv.org/abs/2604.21462
作者: Michal Valko,Branislav Kveton,Hamed Valizadegan,Gregory F. Cooper,Milos Hauskrecht
类目: Machine Learning (cs.LG)
*备注: Published at IEEE International Conference on Data Mining (ICDM 2011). https://doi.org/10.1109/ICDM.2011.40

点击查看摘要

Abstract:In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method on several synthetic and UCI ML datasets in detecting unusual labels when compared to several baseline approaches. We also evaluate the performance of our method on a real-world electronic health record dataset where we seek to identify unusual patient-management decisions.
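
The soft harmonic estimate has a simple closed form on a similarity graph, f = (L + γI)⁻¹ γy, and a large gap between f and the observed label flags a candidate conditional anomaly. The graph construction, γ, and the injected mislabel below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),        # cluster A, label +1
               rng.normal(3, 0.3, (10, 2))])       # cluster B, label -1
y = np.array([1.0] * 10 + [-1.0] * 10)
y[0] = -1.0                                        # inject one mislabel

W = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))  # RBF similarity graph
np.fill_diagonal(W, 0)
L = np.diag(W.sum(1)) - W                           # graph Laplacian
gamma = 0.1
f = np.linalg.solve(L + gamma * np.eye(len(y)), gamma * y)  # soft harmonic labels

anomaly_score = np.abs(f - y)                       # label vs. smoothed estimate
flagged = int(np.argmax(anomaly_score))
```

The mislabeled point sits in a cluster of opposite labels, so smoothing pulls its estimate toward its neighbors and it receives the largest score.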

[LG-18] mpered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

链接: https://arxiv.org/abs/2604.21456
作者: Heng Yang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We propose a sampling-based framework for finite-horizon trajectory and policy optimization under differentiable dynamics by casting controller design as inference. Specifically, we minimize a KL-regularized expected trajectory cost, which yields an optimal “Boltzmann-tilted” distribution over controller parameters that concentrates on low-cost solutions as temperature decreases. To sample efficiently from this sharp, potentially multimodal target, we introduce tempered sequential Monte Carlo (TSMC): an annealing scheme that adaptively reweights and resamples particles along a tempering path from a prior to the target distribution, while using Hamiltonian Monte Carlo rejuvenation to maintain diversity and exploit exact gradients obtained by differentiating through trajectory rollouts. For policy optimization, we extend TSMC via (i) a deterministic empirical approximation of the initial-state distribution and (ii) an extended-space construction that treats rollout randomness as auxiliary variables. Experiments across trajectory- and policy-optimization benchmarks show that TSMC is broadly applicable and compares favorably to state-of-the-art baselines.
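
One tempering sweep, minus the HMC rejuvenation, can be sketched on a toy one-dimensional cost: particles are reweighted by incremental powers of the Boltzmann target exp(-cost) as β goes 0 → 1, with systematic resampling when the effective sample size (ESS) drops. The schedule and ESS threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
cost = lambda th: (th - 2.0) ** 2          # toy trajectory cost, minimized at 2
particles = rng.normal(0, 3, 500)          # draws from the prior
logw = np.zeros(500)

betas = np.linspace(0, 1, 11)              # tempering path beta: 0 -> 1
for beta0, beta1 in zip(betas[:-1], betas[1:]):
    logw += -(beta1 - beta0) * cost(particles)     # incremental tempering weight
    w = np.exp(logw - logw.max()); w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)
    if ess < 250:                                   # systematic resampling
        u = (rng.uniform() + np.arange(500)) / 500
        idx = np.minimum(np.searchsorted(np.cumsum(w), u), 499)
        particles = particles[idx]
        logw = np.zeros(500)

w = np.exp(logw - logw.max()); w /= w.sum()
est = float(np.sum(w * particles))         # weighted posterior-mean estimate
```

Without rejuvenation the resampled population degenerates to duplicates; the HMC step in the paper is what restores diversity between tempering stages.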

[LG-19] A Green-Integral-Constrained Neural Solver with Stochastic Physics-Informed Regularization

链接: https://arxiv.org/abs/2604.21411
作者: Mohammad Mahdi Abedi,David Pardo,Tariq Alkhalifah
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Standard physics-informed neural networks (PINNs) struggle to simulate highly oscillatory Helmholtz solutions in heterogeneous media because pointwise minimization of second-order PDE residuals is computationally expensive, biased toward smooth solutions, and requires artificial absorbing boundary layers to restrict the solution. To overcome these challenges, we introduce a Green-Integral (GI) neural solver for the acoustic Helmholtz equation. It departs from the PDE-residual-based formulation by enforcing wave physics through an integral representation that imposes a nonlocal constraint. Oscillatory behavior and outgoing radiation are encoded directly through the integral kernel, eliminating second-order spatial derivatives and enforcing physical solutions without additional boundary layers. Theoretically, optimizing this GI loss via a neural network acts as a spectrally tuned preconditioned iteration, enabling convergence in heterogeneous media where the classical Born series diverges. By exploiting FFT-based convolution to accelerate the GI loss evaluation, our approach substantially reduces GPU memory usage and training time. However, this efficiency relies on a fixed regular grid, which can limit local resolution. To improve local accuracy in strong scattering regions, we also propose a hybrid GI+PDE loss, enforcing a lightweight Helmholtz residual at a small number of nonuniformly sampled collocation points. We evaluate our method on seismic benchmark models characterized by structural contrasts and subwavelength heterogeneity at frequencies up to 20 Hz. GI-based training consistently outperforms PDE-based PINNs, reducing computational cost by over a factor of ten. In models with localized scattering, the hybrid loss yields the most accurate reconstructions, providing a stable, efficient, and physically grounded alternative.
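
The FFT acceleration rests on the circular-convolution theorem: on a regular grid, applying the Green-kernel integral operator costs O(N log N) instead of O(N²). The sketch below checks the FFT evaluation against direct summation using a toy smoothing kernel, not the Helmholtz Green's function.

```python
import numpy as np

n, h = 256, 1.0 / 256
x = np.arange(n) * h
kernel = np.exp(-np.minimum(x, 1 - x) ** 2 / 0.01)   # toy periodic kernel
q = np.zeros(n); q[64] = 1.0                          # point source

# Direct O(N^2) circular convolution, for reference.
direct = np.array([np.sum(kernel[(i - np.arange(n)) % n] * q) * h
                   for i in range(n)])

# O(N log N) evaluation via the FFT convolution theorem.
fast = np.real(np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(q))) * h
```

Both evaluations agree to machine precision, which is why the FFT route can replace the quadratic sum whenever the grid is regular.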

[LG-20] Even More Guarantees for Variational Inference in the Presence of Symmetries AISTATS2026

链接: https://arxiv.org/abs/2604.21407
作者: Lena Zellinger,Antonio Vergari
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: Accepted for presentation at the OPTIMAL Workshop at AISTATS 2026

点击查看摘要

Abstract:When approximating an intractable density via variational inference (VI), the variational family is typically chosen as a simple parametric family that very likely does not contain the target. This raises the question: under which conditions can we recover characteristics of the target despite misspecification? In this work, we extend previous results on robust VI with location-scale families under target symmetries. We derive sufficient conditions guaranteeing exact recovery of the mean when using the forward Kullback-Leibler divergence and $\alpha$-divergences. We further show how and why optimization can fail to recover the target mean in the absence of our sufficient conditions, providing initial guidelines on the choice of the variational family and $\alpha$-value.

[LG-21] Relocation of compact sets in $\mathbb{R}^n$ by diffeomorphisms and linear separability of datasets in $\mathbb{R}^n$

链接: https://arxiv.org/abs/2604.21393
作者: Xiao-Song Yang,Xuan Zhou,Qi Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Relocation of compact sets in an $n$-dimensional manifold by self-diffeomorphism is of its own interest as well as of significant potential application to data classification in data science. This paper presents a theory for relocating a finite number of compact sets in $\mathbb{R}^n$ to arbitrary target domains in $\mathbb{R}^n$ by diffeomorphisms of $\mathbb{R}^n$. Furthermore, we prove that for any such collection, there exists a differentiable embedding into $\mathbb{R}^{n+1}$ such that their images become linearly separable. As applications of the established theory, we show that a finite number of compact datasets in $\mathbb{R}^n$ can be made linearly separable by width-$n$ deep neural networks (DNNs) with Leaky-ReLU, ELU, or SELU activation functions, under a mild condition. In addition, we show that any finite number of mutually disjoint compact datasets in $\mathbb{R}^n$ can be made linearly separable in $\mathbb{R}^{n+1}$ by a width-$(n+1)$ DNN.

[LG-22] Decoupled Travel Planning with Behavior Forest

链接: https://arxiv.org/abs/2604.21354
作者: Duanyang Yuan,Sihang Zhou,Yanning Hou,Xiaoshu Chen,Haoyuan Chen,Ke Liang,Jiyuan Liu,Chuan Ma,Xinwang Liu,Jian Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between locally acting constraints within a subtask and global constraints that span multiple subtasks. Consequently, the model is forced to jointly reason over local and global constraints at each decision step, increasing the reasoning burden and reducing planning efficiency. To address this problem, we propose the Behavior Forest method. Specifically, our approach structures the decision-making process into a forest of parallel behavior trees, where each behavior tree is responsible for a subtask. A global coordination mechanism is introduced to orchestrate the interactions among these trees, enabling modular and coherent travel planning. Within this framework, large language models are embedded as decision engines within behavior tree nodes, performing localized reasoning conditioned on task-specific constraints to generate candidate subplans and adapt decisions based on coordination feedback. The behavior trees, in turn, provide an explicit control structure that guides LLM generation. This design decouples complex tasks and constraints into manageable subspaces, enabling task-specific reasoning and reducing the cognitive load of LLM. Experimental results show that our method outperforms state-of-the-art methods by 6.67% on the TravelPlanner and by 11.82% on the ChinaTravel benchmarks, demonstrating its effectiveness in increasing LLM performance for complex multi-constraint travel planning.
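
The decoupling idea can be caricatured with plain objects: each subtask tree proposes steps under its own local constraints, while a coordinator enforces a global constraint (here, a total budget). All subtask names, options, and constraints below are illustrative stand-ins for the LLM-backed behavior-tree nodes.

```python
class SubtaskTree:
    """One behavior tree: reasons only over its own subtask's options."""

    def __init__(self, name, options):
        self.name = name
        self.options = options                      # (choice, cost), locally valid

    def propose(self, remaining_budget):
        # Local reasoning: cheapest option that fits the remaining budget.
        feasible = [o for o in self.options if o[1] <= remaining_budget]
        return min(feasible, key=lambda o: o[1]) if feasible else None

def coordinate(trees, budget):
    """Global coordination: allocate the shared budget across subtask trees."""
    plan, remaining = {}, budget
    for tree in trees:
        choice = tree.propose(remaining)
        if choice is None:
            return None                             # globally infeasible
        plan[tree.name] = choice
        remaining -= choice[1]
    return plan

forest = [SubtaskTree("hotel", [("hostel", 60), ("resort", 300)]),
          SubtaskTree("transport", [("train", 80), ("flight", 200)]),
          SubtaskTree("food", [("street", 30), ("restaurant", 90)])]
plan = coordinate(forest, budget=200)
```

Each tree only ever sees its own options; the cross-subtask budget constraint lives entirely in the coordinator, which is the separation of local and global constraints the paper argues for.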

[LG-23] Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection AAMAS2026

链接: https://arxiv.org/abs/2604.21282
作者: Zhaohui Geoffrey Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 11 pages, 5 figures. Accepted at the AAMAS 2026 Workshop on Software Engineering (SE Workshop). This version corresponds to the preprint of the workshop paper

点击查看摘要

Abstract:Automated code vulnerability detection is critical for software security, yet existing approaches face a fundamental trade-off between detection accuracy and computational cost. We propose a heterogeneous multi-agent architecture inspired by game-theoretic principles, combining cloud-based LLM experts with a local lightweight verifier. Our “3+1” architecture deploys three cloud-based expert agents (DeepSeek-V3) that analyze code from complementary perspectives - code structure, security patterns, and debugging logic - in parallel, while a local verifier (Qwen3-8B) performs adversarial validation at zero marginal cost. We formalize this design through a two-layer game framework: (1) a cooperative game among experts capturing super-additive value from diverse perspectives, and (2) an adversarial verification game modeling quality assurance incentives. Experiments on 262 real samples from the NIST Juliet Test Suite across 14 CWE types, with balanced vulnerable and benign classes, demonstrate that our approach achieves a 77.2% F1 score with 62.9% precision and 100% recall at $0.002 per sample - outperforming both a single-expert LLM baseline (F1 71.4%) and Cppcheck static analysis (MCC 0). The adversarial verifier significantly improves precision (+10.3 percentage points, p < 1e-6, McNemar’s test) by filtering false positives, while parallel execution achieves a 3.0x speedup. Our work demonstrates that game-theoretic design principles can guide effective heterogeneous multi-agent architectures for cost-sensitive software engineering tasks.

[LG-24] Improving Performance in Classification Tasks with LCEN and the Weighted Focal Differentiable MCC Loss

链接: https://arxiv.org/abs/2604.21252
作者: Pedro Seber,Richard D. Braatz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The LASSO-Clip-EN (LCEN) algorithm was previously introduced for nonlinear, interpretable feature selection and machine learning. However, its design and use were limited to regression tasks. In this work, we create a modified version of the LCEN algorithm that is suitable for classification tasks and maintains its desirable properties, such as interpretability. This modified LCEN algorithm is evaluated on four widely used binary and multiclass classification datasets. In these experiments, LCEN is compared against 10 other model types and consistently reaches high test-set macro $F_1$ score and Matthews correlation coefficient (MCC) metrics, higher than those of the majority of investigated models. LCEN models for classification remain sparse, eliminating an average of 56% of all input features in the experiments performed. Furthermore, LCEN-selected features are used to retrain all models using the same data, leading to statistically significant performance improvements in three of the experiments and insignificant differences in the fourth when compared to using all features or other feature selection methods. Simultaneously, the weighted focal differentiable MCC (diffMCC) loss function is evaluated on the same datasets. Models trained with the diffMCC loss function are always the best-performing methods in these experiments, and reach test-set macro $F_1$ scores that are, on average, 4.9% higher and MCCs that are 8.5% higher than those obtained by models trained with the weighted cross-entropy loss. These results highlight the performance of LCEN as a feature selection and machine learning algorithm also for classification tasks, and how the diffMCC loss function can train very accurate models, surpassing the weighted cross-entropy loss in the tasks investigated.
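
A differentiable MCC can be sketched from a soft confusion matrix: predicted probabilities (not hard labels) populate TP/FP/TN/FN, so the resulting MCC is smooth in the model outputs and 1 minus it is usable as a training loss. The binary, unweighted form below omits the paper's weighting and focal terms.

```python
import numpy as np

def soft_mcc(y_true, p):
    """MCC from a soft confusion matrix; smooth in the probabilities p."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    tp = np.sum(p * y)
    fp = np.sum(p * (1 - y))
    fn = np.sum((1 - p) * y)
    tn = np.sum((1 - p) * (1 - y))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + 1e-12
    return (tp * tn - fp * fn) / denom

y = [1, 1, 0, 0]
good = soft_mcc(y, [0.9, 0.8, 0.1, 0.2])   # confident and correct
bad = soft_mcc(y, [0.1, 0.2, 0.9, 0.8])    # confident and wrong
loss = 1.0 - good                          # minimizing this maximizes soft MCC
```

Uninformative predictions (all probabilities at 0.5) give a soft MCC of exactly zero, matching the behaviour of the hard-label MCC.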

[LG-25] The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

链接: https://arxiv.org/abs/2604.21215
作者: Costin-Andrei Oncescu,Depen Morwani,Samy Jelassi,Alexandru Meterez,Mujin Kwun,Sham Kakade
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and historically underutilize modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key-value pairs computed off its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near 1 because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from $\Theta(N^2)$ to $\Theta(N \log N)$, increasing effective arithmetic intensity to $\Theta(N / \log N)$ for sequence length $N$. On 150M and 300M parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers (fixed parameters), suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency.

[LG-26] Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

链接: https://arxiv.org/abs/2604.21197
作者: Guilin Deng,Silong Chen,Yuchuan Luo,Yi Liu,Songlei Wang,Zhiping Cai,Lin Liu,Xiaohua Jia,Shaojing Fu
类目: Machine Learning (cs.LG)
*备注: This is the full version (including complete appendices and supplementary materials) of the paper accepted for publication at the 2026 IEEE Symposium on Security and Privacy

点击查看摘要

Abstract:Federated Large Language Models (FedLLMs) enable multiple parties to collaboratively fine-tune LLMs without sharing raw data, addressing challenges of limited resources and privacy concerns. Despite data localization, shared gradients can still expose sensitive information through membership inference attacks (MIAs). However, FedLLMs’ unique properties, i.e., massive parameter scales, rapid convergence, and sparse, non-orthogonal gradients, render existing MIAs ineffective. To address this gap, we propose ProjRes, the first projection-residuals-based passive MIA tailored for FedLLMs. ProjRes leverages hidden embedding vectors as sample representations and analyzes their projection residuals on the gradient subspace to uncover the intrinsic link between gradients and inputs. It requires no shadow models, auxiliary classifiers, or historical updates, ensuring efficiency and robustness. Experiments on four benchmarks and four LLMs show that ProjRes achieves near 100% accuracy, outperforming prior methods by up to 75.75%, and remains effective even under strong differential privacy defenses. Our findings reveal a previously overlooked privacy vulnerability in FedLLMs and call for a re-examination of their security assumptions. Our code and data are available at this https URL.
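
The projection-residual idea can be sketched in a few lines: project a sample's representation onto an orthonormal basis of the observed gradient subspace and score by the relative residual norm; members leave a small residual, non-members a large one. Dimensions and the "embedding" vectors below are toy stand-ins for the model's hidden embeddings and gradients.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 64, 8
G = rng.normal(size=(d, k))                  # columns span a toy gradient subspace
Q, _ = np.linalg.qr(G)                       # orthonormal basis of that subspace
P = Q @ Q.T                                  # orthogonal projector onto it

member = G @ rng.normal(size=k)              # representation lying in the subspace
nonmember = rng.normal(size=d)               # generic out-of-subspace vector

def residual_score(h):
    """Relative projection residual: small for members, large otherwise."""
    return np.linalg.norm(h - P @ h) / (np.linalg.norm(h) + 1e-12)

s_member = residual_score(member)
s_nonmember = residual_score(nonmember)
```

Thresholding this score yields a membership decision without shadow models or auxiliary classifiers, matching the passive flavor described above.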

[LG-27] Graph Neural Network-Informed Predictive Flows for Faster Ford-Fulkerson and PAC-Learnability

链接: https://arxiv.org/abs/2604.21175
作者: Eleanor Wiesler,Trace Baxley
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We propose a learning-augmented framework for accelerating max-flow computation and image segmentation by integrating Graph Neural Networks (GNNs) with the Ford-Fulkerson algorithm. Rather than predicting initial flows, our method learns edge importance probabilities to guide augmenting path selection. We introduce a Message Passing GNN (MPGNN) that jointly learns node and edge embeddings through coupled updates, capturing both global structure and local flow dynamics such as residual capacity and bottlenecks. Given an input image, we propose a method to construct a grid-based flow network with source and sink nodes, extract features, and perform a single GNN inference to assign edge probabilities reflecting their likelihood of belonging to high-capacity cuts. These probabilities are stored in a priority queue and used to guide a modified Ford-Fulkerson procedure, prioritizing augmenting paths via an Edmonds-Karp-style search with bottleneck-aware tie-breaking. This avoids repeated inference over residual graphs while leveraging learned structure throughout optimization. We further introduce a bidirectional path construction strategy centered on high-probability edges and provide a theoretical framework relating prediction quality to efficiency via a weighted permutation distance metric. Our method preserves max-flow/min-cut optimality while reducing the number of augmentations in practice. We also outline a hybrid extension combining flow warm-starting with edge-priority prediction, establishing a foundation for learning-guided combinatorial optimization in image segmentation.

[LG-28] A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting

链接: https://arxiv.org/abs/2604.21101
作者: Brooks Kinch,Xiaozhe Hu,Yilong Huang,Martine Dyring Hansen,Sunniva Meltzer,Nathaniel Donald Hamlin,David Sirajuddin,Eric C. Cyr,Nathaniel Trask
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 29 pages, 6 figures

点击查看摘要

Abstract:For autoregressive modeling of chaotic dynamical systems over long time horizons, the stability of both training and inference is a major challenge in building scientific foundation models. We present a hybrid technique in which an autoregressive transformer is embedded within a novel shooting-based mixed finite element scheme, exposing topological structure that enables provable stability. For forward problems, we prove preservation of discrete energies, while for training we prove uniform bounds on gradients, provably avoiding the exploding gradient problem. Combined with a vision transformer, this yields latent tokens admitting structure-preserving dynamics. We outperform modern foundation models with a 65\times reduction in model parameters and long-horizon forecasting of chaotic systems. A “mini-foundation” model of a fusion component shows that 12 simulations suffice to train a real-time surrogate, achieving a 9,000\times speedup over particle-in-cell simulation.

[LG-29] Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

链接: https://arxiv.org/abs/2604.21100
作者: Neehal Tumma,Noel Loo,Daniela Rus
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To address the increasing long-context compute limitations of softmax attention, several subquadratic recurrent operators have been developed. This work includes models such as Mamba-2, DeltaNet, Gated DeltaNet (GDN), and Kimi Delta Attention (KDA). As the space of recurrences grows, a parallel line of work has arisen to taxonomize them. One compelling view is the test-time regression (TTR) framework, which interprets recurrences as performing online least squares updates that learn a linear map from the keys to values. Existing delta-rule recurrences can be seen as first-order approximations to this objective, but notably ignore the curvature of the least-squares loss during optimization. In this work, we address this by introducing preconditioning to these recurrences. Starting from the theory of online least squares, we derive equivalences between linear attention and the delta rule in the exactly preconditioned case. Next, we realize this theory in practice by proposing a diagonal approximation: this enables us to introduce preconditioned variants of DeltaNet, GDN, and KDA alongside efficient chunkwise parallel algorithms for computing them. Empirically, we find that our preconditioned delta-rule recurrences yield consistent performance improvements across synthetic recall benchmarks and language modeling at the 340M and 1B scale.
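A scalar-level sketch of a curvature-aware delta update makes the idea concrete. Here an Adagrad-style running second moment of the keys plays the role of the diagonal preconditioner; the paper derives its diagonal approximation from the online least-squares objective, so treat this accumulator as illustrative:

```python
def preconditioned_delta_step(S, k, v, beta, h2, eps=1e-8):
    """One delta-rule update for a linear map S (list of rows), moving
    S k toward v along the gradient (v - S k) k^T, rescaled per key
    dimension by an inverse diagonal curvature estimate derived from h2.
    S and h2 are updated in place."""
    d = len(k)
    for j in range(d):
        h2[j] += k[j] * k[j]            # running second moment of keys
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(len(S))]
    for i in range(len(S)):
        err = v[i] - pred[i]
        for j in range(d):
            # plain delta rule would use beta * err * k[j]; the diagonal
            # preconditioner damps directions the keys have visited often
            S[i][j] += beta * err * k[j] / (eps + h2[j]) ** 0.5
```

Repeating the update on a fixed key/value pair drives `S k` to `v`, just as the test-time-regression view predicts.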

[LG-30] Spectral Embeddings Leak Graph Topology: Theory Benchmark and Adaptive Reconstruction

链接: https://arxiv.org/abs/2604.21094
作者: Thinh Nguyen-Cong,Truong-Son Hy,Thang N. Dinh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) excel on relational data, but standard benchmarks unrealistically assume the graph is centrally available. In practice, settings such as Federated Graph Learning, distributed systems, and privacy-sensitive applications involve graph data that are localized, fragmented, noisy, and privacy-leaking. We present a unified framework for this setting. We introduce LoGraB (Local Graph Benchmark), which decomposes standard datasets into fragmented benchmarks using three strategies and four controls: neighborhood radius d, spectral quality k, noise level \sigma, and coverage ratio p. LoGraB supports graph reconstruction, localized node classification, and inter-fragment link prediction, with Island Cohesion. We propose AFR (Adaptive Fidelity-driven Reconstruction), a method for noisy spectral fragments. AFR scores patch quality via a fidelity measure combining a gap-to-truncation stability ratio and structural entropy, then assembles fragments using RANSAC-Procrustes alignment, adaptive stitching, and Bundle Adjustment. Rather than forcing a single global graph, AFR recovers large faithful islands. We prove heat-kernel edge recovery under a separation condition, Davis–Kahan perturbation stability, and bounded alignment error. We establish a Spectral Leakage Proposition: under a spectral-gap assumption, polynomial-time Bayesian recovery is feasible once enough eigenvectors are shared, complementing AFR’s deterministic guarantees. Experiments on nine benchmarks show that LoGraB reveals model strengths and weaknesses under fragmentation, AFR achieves the best F1 on 7/9 datasets, and under per-embedding (\epsilon,\delta)-Gaussian differential privacy, AFR retains 75% of its undefended F1 at \epsilon=2. Our anonymous code is available at this https URL

[LG-31] JEPAMatch: Geometric Representation Shaping for Semi-Supervised Learning

链接: https://arxiv.org/abs/2604.21046
作者: Ali Aghababaei-Harandi,Aude Sportisse,Massih-Reza Amini
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised learning has emerged as a powerful paradigm for leveraging large amounts of unlabeled data to improve the performance of machine learning models when labeled data are scarce. Among existing approaches, methods derived from FixMatch have achieved state-of-the-art results in image classification by combining weak and strong data augmentations with confidence-based pseudo-labeling. Despite their strong empirical performance, these methods typically struggle with two critical bottlenecks: majority classes tend to dominate the learning process, which is amplified by incorrect pseudo-labels, leading to biased models. Furthermore, noisy early pseudo-labels prevent the model from forming clear decision boundaries, requiring prolonged training to learn informative representations. In this paper, we introduce a paradigm shift from conventional confidence thresholding on model outputs toward explicit shaping of geometric representations. Our approach is inspired by the recently proposed Latent-Euclidean Joint-Embedding Predictive Architectures (LeJEPA), a theoretically grounded framework asserting that meaningful representations should exhibit an isotropic Gaussian structure in latent space. Building on this principle, we propose a new training objective that combines the classical semi-supervised loss used in FlexMatch, an adaptive extension of FixMatch, with a latent-space regularization term derived from LeJEPA. Our proposed approach encourages well-structured representations while preserving the advantages of pseudo-labeling strategies. Through extensive experiments on CIFAR-100, STL-10 and Tiny-ImageNet, we demonstrate that the proposed method consistently outperforms existing baselines. In addition, our method significantly accelerates the convergence, drastically reducing the overall computational cost compared to standard FixMatch-based pipelines.
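The latent-space regularization term can be illustrated with a brute-force isotropy penalty: drive the embedding mean to zero and the covariance to the identity. This exact quadratic form is an assumption for illustration; LeJEPA's published objective enforces isotropic Gaussianity through a statistical criterion rather than this direct penalty:

```python
def isotropy_penalty(z):
    """Penalty pushing embeddings z (list of rows) toward an isotropic
    Gaussian: squared norm of the mean plus the squared Frobenius
    distance of the empirical covariance from the identity."""
    n, d = len(z), len(z[0])
    mu = [sum(row[j] for row in z) / n for j in range(d)]
    pen = sum(m * m for m in mu)
    for a in range(d):
        for b in range(d):
            cov = sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in z) / n
            pen += (cov - (1.0 if a == b else 0.0)) ** 2
    return pen
```

A batch with zero mean and identity covariance incurs zero penalty; any collapse (zero variance) or anisotropy is penalized.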

[LG-32] Interpretable Quantile Regression by Optimal Decision Trees

链接: https://arxiv.org/abs/2604.21042
作者: Valentin Lemaire,Gaël Aglin,Siegfried Nijssen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The field of machine learning has seen increasing interest in models that are not only accurate but also interpretable and robust, thus allowing their end users to understand and trust AI systems. This paper presents a novel method for learning a set of optimal quantile regression trees. The advantages of this method are that (1) it provides predictions about the complete conditional distribution of a target variable without prior assumptions on this distribution; (2) it provides predictions that are interpretable; (3) it learns a set of optimal quantile regression trees without compromising algorithmic efficiency compared to learning a single tree.
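The building block of such trees can be sketched directly: each leaf predicts the constant that minimizes the pinball (quantile) loss over its targets, which is an empirical tau-quantile of those targets:

```python
def pinball(ys, q, tau):
    """Average pinball loss of a constant prediction q at level tau:
    underestimates cost tau * (y - q), overestimates cost (1 - tau) * (q - y)."""
    return sum(tau * (y - q) if y >= q else (1 - tau) * (q - y) for y in ys) / len(ys)

def best_leaf(ys, tau):
    """The constant minimizing pinball loss over a leaf's targets is an
    empirical tau-quantile of those targets -- this is what each leaf of
    a quantile regression tree predicts. Candidates are restricted to
    the observed targets, which always contain a minimizer."""
    return min(sorted(ys), key=lambda q: pinball(ys, q, tau))
```

Fitting one tree per quantile level (or a shared structure with per-level leaf values, as in the paper's joint formulation) then yields a full conditional-distribution estimate.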

[LG-33] MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

链接: https://arxiv.org/abs/2604.21026
作者: Anurita Das
类目: Machine Learning (cs.LG)
*备注: Code available at this https URL

点击查看摘要

Abstract:Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than this http URL Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
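The placement step that a per-layer importance signal enables can be sketched as a greedy assignment (the budgets and scores below are invented; MCAP's actual signal comes from Monte Carlo activation profiling at load time):

```python
def place_layers(importance, sizes, gpu_budget, ram_budget):
    """Greedy residency assignment: the most important layers fill GPU
    memory first, then RAM, and the remainder spills to SSD. Budgets
    and sizes are in the same (arbitrary) units."""
    tiers = {}
    gpu, ram = gpu_budget, ram_budget
    for layer in sorted(importance, key=importance.get, reverse=True):
        if sizes[layer] <= gpu:
            tiers[layer], gpu = "gpu", gpu - sizes[layer]
        elif sizes[layer] <= ram:
            tiers[layer], ram = "ram", ram - sizes[layer]
        else:
            tiers[layer] = "ssd"
    return tiers
```

The same score could gate precision dispatch (e.g. W4A8 for low-importance layers, W4A16 for high-importance ones) with an analogous thresholding rule.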

[LG-34] Droplet-LNO: Physics-Informed Laplace Neural Operators for Accurate Prediction of Droplet Spreading Dynamics on Complex Surfaces

链接: https://arxiv.org/abs/2604.20993
作者: Ganesh Sahadeo Meshram,Partha Pratim Chakrabarti,Suman Chakraborty
类目: Machine Learning (cs.LG)
*备注: 36 pages, 8 figures

点击查看摘要

Abstract:Spreading of liquid droplets on solid substrates constitutes a classic multiphysics problem with widespread applications ranging from inkjet printing, spray cooling, to biomedical microfluidic systems. Yet, accurate computational fluid dynamic (CFD) simulations are prohibitively expensive, taking more than 18 to 24 hours for each transient computation. In this paper, Physics-Informed Laplace Operator Neural Network (PI-LNO) is introduced, representing a novel architecture where the Laplace integral transform function serves as a learned physics-informed functional basis. Extensive comparative benchmark studies were performed against five other state-of-the-art approaches: UNet, UNet with attention modules (UNet-AM), DeepONet, Physics-Informed UNet (PI-UNet), and Laplace Neural Operator (LNO). Through complex Laplace transforms, PI-LNO natively models the exponential transient dynamics of the spreading process. A TensorFlow-based PI-LNO is trained on multi-surface CFD data spanning contact angles \theta_s \in [20, 160], employing a physics-regularized composite loss combining data fidelity (MSE, MAE, RMSE) with Navier-Stokes, Cahn-Hilliard, and causality constraints.

[LG-35] Early Detection of Latent Microstructure Regimes in Limit Order Books

链接: https://arxiv.org/abs/2604.20949
作者: Prakul Sunil Hiremath,Vruksha Arun Hiremath
类目: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 48 pages, 7 figures. Combines theoretical guarantees (identifiability and early-detection bounds), 200-run simulation study, and preliminary real-data evaluation on BTC/USDT limit order books. Code and data available

点击查看摘要

Abstract:Limit order books can transition rapidly from stable to stressed conditions, yet standard early-warning signals such as order flow imbalance and short-term volatility are inherently reactive. We formalise this limitation via a three-regime causal data-generating process (stable \to latent build-up \to stress) in which a latent deterioration phase creates a prediction window prior to observable stress. Under mild assumptions on temporal drift and regime persistence, we establish identifiability of the latent build-up regime and derive guarantees for strictly positive expected lead-time and non-trivial probability of early detection. We propose a trigger-based detector combining MAX aggregation of complementary signal channels, a rising-edge condition, and adaptive thresholding. Across 200 simulations, the method achieves mean lead-time +18.6 \pm 3.2 timesteps with perfect precision and moderate coverage, outperforming classical change-point and microstructure baselines. A preliminary application to one week of BTC/USDT order book data shows consistent positive lead-times while baselines remain reactive. Results degrade in low signal-to-noise and short build-up regimes, consistent with theory.
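The detector described above combines three ingredients that are easy to sketch: MAX aggregation across signal channels, an adaptive rolling threshold, and a rising-edge condition so an alarm fires only on the first crossing. The window length and multiplier below are illustrative, not the paper's calibrated values:

```python
import statistics

def detect(channels, window=5, kappa=3.0):
    """Trigger-based detector: per-step MAX over channels, a rolling
    mean + kappa * std threshold, and a rising-edge condition that
    reports only the first timestep of each excursion above threshold."""
    agg = [max(vals) for vals in zip(*channels)]
    alarms, above = [], False
    for t in range(window, len(agg)):
        hist = agg[t - window:t]
        thr = statistics.mean(hist) + kappa * statistics.pstdev(hist)
        crossing = agg[t] > thr
        if crossing and not above:
            alarms.append(t)       # rising edge: alarm once per excursion
        above = crossing
    return alarms
```

On a flat signal with a jump, the alarm fires exactly at the jump; once the excursion enters the rolling window, the adaptive threshold absorbs it.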

[LG-36] Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

链接: https://arxiv.org/abs/2604.20945
作者: Krishiv Agarwal,Ramneet Kaur,Colin Samplawski,Manoj Acharya,Anirban Roy,Daniel Elenius,Brian Matejek,Adam D. Cobb,Susmit Jha
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches – Universal Steering (US) and Representation Engineering (RepE) – we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
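The adaptive two-stage grid search is straightforward to sketch in isolation. The `score` callable stands in for the LLM-judge jailbreak rate at a given steering coefficient (hypothetical here; the audit's evaluation protocol is more involved):

```python
def two_stage_grid_search(score, lo, hi, coarse=5, fine=11):
    """Two-stage search for a steering coefficient: a coarse grid over
    [lo, hi], then a finer grid centered on the best coarse point,
    spanning one coarse-grid step on either side."""
    def grid(a, b, n):
        step = (b - a) / (n - 1)
        return [a + i * step for i in range(n)]
    best = max(grid(lo, hi, coarse), key=score)
    span = (hi - lo) / (coarse - 1)
    return max(grid(best - span, best + span, fine), key=score)
```

This costs `coarse + fine` judge evaluations instead of a dense one-stage sweep, at the price of assuming the score is roughly unimodal in the coefficient.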

[LG-37] SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models

链接: https://arxiv.org/abs/2604.20943
作者: Saish Sachin Shinde
类目: Machine Learning (cs.LG)
*备注: 5 figures. Submitted April 2026

点击查看摘要

Abstract:We present SCM (Sleep-Consolidated Memory), a research preview of a memory architecture for large language models that draws on neuroscientific principles to address a fundamental limitation in current systems: the absence of persistent, structured, and biologically plausible memory. Existing approaches rely on truncating context windows, growing vector databases without bound, or tiered storage systems that lack consolidation and forgetting mechanisms. SCM implements five core components inspired by human memory: a limited-capacity working memory, multi-dimensional importance tagging, offline sleep-stage consolidation with distinct NREM and REM phases, intentional value-based forgetting, and a computational self-model enabling introspection. Across a standardized benchmark suite of eight tests, the prototype achieves perfect recall accuracy over ten-turn conversations while reducing memory noise by 90.9% through adaptive forgetting. Memory search latency remains below one millisecond even with hundreds of stored concepts. This work establishes the architectural foundations for memory systems that consolidate, prioritize, and forget, offering a testable platform for advancing LLM memory research.
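The value-based forgetting step can be illustrated with a toy consolidation pass; the decay rate, reinforcement bonus, and retention floor are invented constants, and SCM's real consolidation additionally distinguishes NREM and REM phases:

```python
def consolidate(memories, decay=0.9, bonus=0.3, floor=0.2):
    """One offline consolidation pass over {key: (importance, accessed)}:
    decay every memory's importance, reinforce recently accessed ones,
    and intentionally forget entries that fall below the retention floor.
    Returns the surviving memories with the access flag cleared."""
    kept = {}
    for key, (importance, accessed) in memories.items():
        importance = importance * decay + (bonus if accessed else 0.0)
        if importance >= floor:
            kept[key] = (importance, False)
    return kept
```

Repeated passes without access drive importance geometrically toward zero, so unused memories are eventually pruned rather than accumulating without bound.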

[LG-38] Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

链接: https://arxiv.org/abs/2604.20937
作者: Kibum Kim,Jiwan Kim,Kyle Min,Yueqi Wang,Jinyoung Moon,Julian McAuley,Chanyoung Park
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Video Large Language Models (Video LLMs) incur high inference latency due to a large number of visual tokens provided to LLMs. To address this, training-free visual token pruning has emerged as a solution to reduce computational costs; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens–semantically uninformative tokens that attract excessive attention–as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model’s visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method that introduces a sink score to quantify each token’s tendency to behave as a sink and applies this score to existing spatial and temporal pruning methods to suppress them, thereby enhancing video understanding. To validate the effectiveness of SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
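A toy version of the sink-score idea: a token that receives a lot of attention but carries a low-norm value vector is flagged as a sink, and pruning keeps tokens by sink-penalized importance. The exact score and penalty weight here are illustrative stand-ins for SToP's definitions:

```python
def sink_scores(attn_received, value_norm, eps=1e-6):
    """Per-token sink score: high received attention combined with a
    low-norm value vector marks a token as an attention sink."""
    return [a / (v + eps) for a, v in zip(attn_received, value_norm)]

def prune(importance, sink, keep, lam=1.0):
    """Keep the `keep` tokens with the highest sink-penalized importance,
    returned in original (temporal) order."""
    adj = [imp - lam * s for imp, s in zip(importance, sink)]
    order = sorted(range(len(adj)), key=lambda i: adj[i], reverse=True)
    return sorted(order[:keep])
```

Note how raw attention alone would rank the sink token first; the penalty demotes it so semantically informative tokens survive.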

[LG-39] SDNGuardStack: An Explainable Ensemble Learning Framework for High-Accuracy Intrusion Detection in Software-Defined Networks

链接: https://arxiv.org/abs/2604.20934
作者: Ashikuzzaman,Md. Saifuzzaman Abhi,Mahabubur Rahman,Md. Manjur Ahmed,Md. Mehedi Hasan,Md. Ahsan Arif
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Software-Defined Networking (SDN) has developed over the last few years into a relevant technique for improving network programmability and administration. Nonetheless, its centralized design presents a major security issue, which requires effective intrusion detection systems. The SDN-specific machine learning-based intrusion detection system described in this paper is trained and tested on the InSDN dataset, which models attack scenarios and realistic traffic patterns in SDN. Our approach incorporates a comprehensive preprocessing pipeline, feature selection via Mutual Information, and a novel ensemble learning model, SDNGuardStack, which combines multiple base learners to enhance detection accuracy and efficiency. In addition, we include explainable AI methods, including SHAP, to add transparency to model predictions, which helps security analysts respond to incidents. The experiments show that SDNGuardStack achieves an accuracy of 99.98% and a Cohen's Kappa of 0.9998, surpassing other models while remaining interpretable and practical to deploy. Features such as Flow ID, Bwd Header Len, and Src Port emerge as the most important factors in the model's predictions. This work is a step towards closing the gap between high-performance intrusion detection and realistic deployment in SDN, contributing to secure and resilient network infrastructures.

[LG-40] Unsupervised Learning of Inter-Object Relationships via Group Homomorphism

链接: https://arxiv.org/abs/2604.20925
作者: Kyotaro Ushida,Takayuki Komatsu,Yoshiyuki Ohmura,Yasuo Kuniyoshi
类目: Machine Learning (cs.LG)
*备注: Preprint. Under review at ICDL 2026

点击查看摘要

Abstract:Current deep learning models achieve high performance by learning statistical correlations from vast datasets, which stands in stark contrast to human learning: they lack the flexibility of humans, particularly preverbal infants, to autonomously acquire the underlying structure of the world from limited experience and adapt to novel situations. In this study, we propose an unsupervised representation learning method based on a hierarchical relationship in group operations, rather than statistical independence, aiming to build a computational model of the cognitive development of infants. The proposed model features an integrated architecture that simultaneously performs object segmentation and the extraction of motion laws from dynamic image sequences. By introducing group homomorphism from algebra as a structural constraint within a neural network, the model structurally separates pixel-level changes into meaningful, decomposed transformation components, such as translation and deformation. Using interaction scenes (chasing and evading tasks) based on developmental science findings, we experimentally demonstrate that the model can segment multiple objects into individual slots without any ground-truth labels. Furthermore, we confirmed that relative movements between objects, such as approaching or receding, are accurately mapped and structured into a one-dimensional additive latent space. These results suggest that by introducing algebraic geometric constraints rather than relying solely on statistical correlation learning, physically interpretable “disentangled representations” can be acquired. This study contributes to the understanding of the process by which infants internalize environmental laws as structures and provides a new perspective for constructing artificial systems with developmental intelligence.

[LG-41] Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics

链接: https://arxiv.org/abs/2604.20924
作者: Weizhi Nie,Zhen Qu,Weijie Wang,Chunpei Li,Ke Lu,Bingyang Zhou,Hongzhi Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Timely and interpretable early warning of sepsis remains a major clinical challenge due to the complex temporal dynamics of physiological deterioration. Traditional data-driven models often provide accurate yet opaque predictions, limiting physicians’ confidence and clinical applicability. To address this limitation, we propose a Large Language Model (LLM)-guided temporal simulation framework that explicitly models physiological trajectories prior to disease onset for clinically interpretable prediction. The framework consists of a spatiotemporal feature extraction module that captures dynamic dependencies among multivariate vital signs, a Medical Prompt-as-Prefix module that embeds clinical reasoning cues into LLMs, and an agent-based post-processing component that constrains predictions within physiologically plausible ranges. By first simulating the evolution of key physiological indicators and then classifying sepsis onset, our model offers transparent prediction mechanisms that align with clinical judgment. Evaluated on the MIMIC-IV and eICU databases, the proposed method achieves superior AUC scores (0.861-0.903) across 24-4-hour pre-onset prediction tasks, outperforming conventional deep learning and rule-based approaches. More importantly, it provides interpretable trajectories and risk trends that can assist clinicians in early intervention and personalized decision-making in intensive care environments.

[LG-42] ILDR: Geometric Early Detection of Grokking

链接: https://arxiv.org/abs/2604.20923
作者: Shreel Golwala
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Grokking describes a delayed generalization phenomenon in which a neural network achieves perfect training accuracy long before validation accuracy improves, followed by an abrupt transition to strong generalization. Existing detection signals are indirect: weight norm reflects parameter-space regularization and consistently lags the transition, while GrokFast’s slow gradient EMA, used without gradient amplification, is unstable across seeds with standard deviation exceeding mean lead time. We propose the Inter/Intra-class Distance Ratio (ILDR), a geometric metric computed on second-to-last layer representations as the ratio of inter-class centroid separation to intra-class scatter. ILDR provides an early detection signal: it rises and crosses a threshold at 2.5 times its baseline before the grokking transition appears in validation accuracy, indicating early geometric reorganization in representation space. Grounded in Fisher’s linear discriminant criterion, ILDR requires no eigendecomposition and runs in O(|C|^2 + N). It is evaluated exclusively on held-out data, making it robust to memorization effects. Across modular arithmetic and permutation group composition (S5), ILDR leads the grokking transition by 9 to 73 percent of the training budget, with lead time increasing with task algebraic complexity. Over eight random seeds, ILDR leads by 950 +/- 250 steps with a coefficient of variation of 26 percent, and post-grokking variance drops by 1696 times, consistent with a sharp phase transition in representation space. Using ILDR as an early stopping trigger reduces training by 18.6 percent on average. Optimizer interventions triggered at the ILDR threshold demonstrate bidirectional control over the transition, suggesting ILDR tracks representational conditions underlying generalization rather than a downstream correlate.
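ILDR itself is simple to compute from second-to-last-layer features; a minimal version (mean pairwise centroid separation over mean within-class scatter, which may differ from the paper's exact normalization) looks like this:

```python
from collections import defaultdict

def ildr(feats, labels):
    """Inter/Intra-class Distance Ratio on held-out features: mean
    pairwise distance between class centroids divided by the mean
    distance of each point to its own class centroid."""
    groups = defaultdict(list)
    for x, y in zip(feats, labels):
        groups[y].append(x)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    cent = {y: [sum(c) / len(xs) for c in zip(*xs)] for y, xs in groups.items()}
    intra = sum(dist(x, cent[y]) for x, y in zip(feats, labels)) / len(feats)
    cs = list(cent.values())
    pairs = [(i, j) for i in range(len(cs)) for j in range(i + 1, len(cs))]
    inter = sum(dist(cs[i], cs[j]) for i, j in pairs) / len(pairs)
    return inter / intra
```

As the abstract notes, this runs in O(|C|^2 + N) with no eigendecomposition, and a rising ILDR on validation data would serve as the early-warning signal.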

[LG-43] Validating a Deep Learning Algorithm to Identify Patients with Glaucoma using Systemic Electronic Health Records

链接: https://arxiv.org/abs/2604.20921
作者: John Xiang,Rohith Ravindranath,Sophia Y. Wang
类目: Machine Learning (cs.LG)
*备注: submitted to AMIA Annual Symposium 2026

点击查看摘要

Abstract:We evaluated whether a glaucoma risk assessment (GRA) model trained on All of Us national data can identify patients at high probability of glaucoma using only systemic electronic health records (EHR) at an independent institution. In this cross-sectional study, 20,636 Stanford patients seen from November 2013 to January 2024 were included (15% with glaucoma). A pretrained GRA model was fine-tuned on the Stanford cohort and tested on a held-out set using demographics, systemic diagnoses, medications, laboratory results, and physical examination measurements as inputs. The best model achieved AUROC 0.883 and PPV 0.657. Calibration was consistent with clinical risk: the highest prediction decile showed the greatest glaucoma diagnosis rate (65.7%) and treatment rate (57.0%). Performance improved with more trainable layers up to 15 and with additional data. An EHR-only GRA model may enable scalable and accessible pre-screening without specialized imaging.

[LG-44] Forget Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

链接: https://arxiv.org/abs/2604.20920
作者: Yuzhen Mao,Michael Y. Li,Emily B. Fox
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling large language models to long contexts is challenging due to the quadratic computational cost of full attention. Mitigation approaches include KV-cache selection or compression techniques. We instead provide an effective and end-to-end learnable bridge between the two without requiring architecture modification. In particular, our key insight is that interleaved gist compression tokens – which provide a learnable summary of sets of raw tokens – can serve as routing signals for sparse attention. Building on this, we introduce selective unfolding via GSA, which first compresses the context into gist tokens, then selects the most relevant gists, and subsequently restores the corresponding raw chunks for detailed attention. This yields a simple coarse-to-fine mechanism that combines compact global representations with targeted access to fine-grained evidence. We further incorporate this process directly into training in an end-to-end fashion, avoiding the need for external retrieval modules. In addition, we extend the framework hierarchically via recursive gist-of-gist construction, enabling multi-resolution context access with logarithmic per-step decoding complexity. Empirical results on LongBench and RAG benchmarks demonstrate that our method consistently outperforms other compression baselines as well as inference-time sparse attention methods across compression ratios from 8\times to 32\times . The code is available at: this https URL
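The compress-select-unfold loop can be sketched with mean-pooled chunk summaries in place of learned gist tokens (an illustrative substitution; GSA trains its gists end-to-end):

```python
def select_and_unfold(tokens, chunk, query, top):
    """Coarse-to-fine context access: chunk the token sequence, summarize
    each chunk with a mean-pooled 'gist' vector, score gists against the
    query, and unfold only the raw tokens of the top-scoring chunks."""
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    gists = [[sum(d) / len(c) for d in zip(*c)] for c in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: dot(gists[i], query),
                    reverse=True)
    keep = sorted(ranked[:top])          # restore document order
    return [tok for i in keep for tok in chunks[i]]
```

Applying the same construction to the gists themselves (gists of gists) gives the hierarchical, multi-resolution variant the abstract describes.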

[LG-45] FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

链接: https://arxiv.org/abs/2604.20913
作者: Fei Zuo,Xiaoyan Xi,Quanyi Zeng,Feiyu Wang,Ho Fai Leung
类目: Machine Learning (cs.LG)
*备注: 16 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in -1, 0, +1 provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming this http URL Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
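The multiplication-free property is easy to see in scalar form: with weights restricted to {-1, 0, +1}, a dot product reduces to conditional adds and subtracts, which is what the fused kernels vectorize with masked AVX-512 additions and subtractions:

```python
def ternary_dot(w, x):
    """Dot product with ternary weights in {-1, 0, +1}: every term is a
    conditional add, subtract, or no-op -- no floating-point multiply.
    A SIMD kernel replaces this branch with per-lane sign masks."""
    acc = 0.0
    for wi, xi in zip(w, x):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi
    return acc
```

A GEMV is then just one such reduction per output row, and since each ternary weight needs under 2 bits, the weight traffic that bounds CPU decoding shrinks accordingly.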

[LG-46] Do Masked Autoencoders Improve Downhole Prediction? An Empirical Study on Real Well Drilling Data

链接: https://arxiv.org/abs/2604.20909
作者: Aleksander Berezowski,Hassan Hassanzadeh,Gouri Ginde
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Downhole drilling telemetry presents a fundamental labeling asymmetry: surface sensor data are generated continuously at 1 Hz, while labeled downhole measurements are costly, intermittent, and scarce. Current machine learning approaches for downhole metric prediction universally adopt fully supervised training from scratch, which is poorly suited to this data regime. We present the first empirical evaluation of masked autoencoder (MAE) pretraining for downhole drilling metric prediction. Using two publicly available Utah FORGE geothermal wells comprising approximately 3.5 million timesteps of multivariate drilling telemetry, we conduct a systematic full-factorial design space search across 72 MAE configurations and compare them against supervised LSTM and GRU baselines on the task of predicting Total Mud Volume. Results show that the best MAE configuration reduces test mean absolute error by 19.8% relative to the supervised GRU baseline, while trailing the supervised LSTM baseline by 6.4%. Analysis of design dimensions reveals that latent space width is the dominant architectural choice (Pearson r = -0.59 with test MAE), while masking ratio has negligible effect, an unexpected finding attributed to high temporal redundancy in 1 Hz drilling data. These results establish MAE pretraining as a viable paradigm for drilling analytics and identify the conditions under which it is most beneficial.

[LG-47] Towards a Systematic Risk Assessment of Deep Neural Network Limitations in Autonomous Driving Perception SECAI ESORICS2025

链接: https://arxiv.org/abs/2604.20895
作者: Svetlana Pavlitska,Christopher Gerking,J. Marius Zöllner
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted for publication at the SECAI workshop at ESORICS 2025

点击查看摘要

Abstract:Safety and security are essential for the admission and acceptance of automated and autonomous vehicles. Deep neural networks (DNNs) are widely used for perception and further components of the autonomous driving (AD) stack. However, they possess several limitations, including lack of generalization, efficiency, explainability, plausibility, and robustness. These insufficiencies can pose significant risks to autonomous driving systems. However, hazards, threats, and risks associated with DNN limitations in this domain have not been systematically studied so far. In this work, we propose a joint workflow for risk assessment combining the hazard analysis and risk assessment (HARA) following ISO 26262 and threat analysis and risk assessment (TARA) following the ISO/SAE 21434 to identify and analyze risks arising from inherent DNN limitations in AD perception.

[LG-48] Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

链接: https://arxiv.org/abs/2604.21893
作者: Sherly Alfonso-Sánchez,Cristián Bravo,Kristina G. Stankova
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注: 35 pages, 8 figures

点击查看摘要

Abstract:Geographic context is often considered relevant to motor insurance risk, yet public actuarial datasets provide limited location identifiers, constraining how this information can be incorporated and evaluated in claim-frequency models. This study examines how geographic information from alternative data sources can be incorporated into actuarial models for Motor Third Party Liability (MTPL) claim prediction under such constraints. Using the BeMTPL97 dataset, we adopt a zone-level modeling framework and evaluate predictive performance on unseen postcodes. Geographic information is introduced through two channels: environmental indicators from OpenStreetMap and CORINE Land Cover, and orthoimagery released by the Belgian National Geographic Institute for academic use. We evaluate the predictive contribution of coordinates, environmental features, and image embeddings across three baseline models: generalized linear models (GLMs), regularized GLMs, and gradient-boosted trees, while raw imagery is modeled using convolutional neural networks. Our results show that augmenting actuarial variables with constructed geographic information improves accuracy. Across experiments, both linear and tree-based models benefit most from combining coordinates with environmental features extracted at 5 km scale, while smaller neighborhoods also improve baseline specifications. Generally, image embeddings do not improve performance when environmental features are available; however, when such features are absent, pretrained vision-transformer embeddings enhance accuracy and stability for regularized GLMs. Overall, the findings indicate that the predictive value of geographic information in zone-level MTPL frequency models depends less on model complexity than on how geography is represented, and illustrate that geographic context can be incorporated despite limited individual-level spatial information.
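As a rough illustration of the zone-level setup, here is a Poisson GLM (log link) for claim counts fit by IRLS/Newton steps on simulated data. The covariate names and coefficients are invented for the sketch and are not the paper's features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy zone-level design: intercept, one actuarial covariate, scaled (lat, lon),
# and one environmental indicator. All names and values are illustrative.
n = 2000
X = np.column_stack([
    np.ones(n),            # intercept
    rng.normal(size=n),    # actuarial covariate
    rng.uniform(size=n),   # latitude (scaled)
    rng.uniform(size=n),   # longitude (scaled)
    rng.normal(size=n),    # environmental feature, e.g. a land-cover share
])
beta_true = np.array([-1.0, 0.3, 0.5, -0.4, 0.6])
y = rng.poisson(np.exp(X @ beta_true))   # simulated claim counts per zone

# Fit the Poisson GLM by iteratively reweighted least squares (Newton steps).
beta = np.zeros(5)
for _ in range(25):
    mu = np.exp(X @ beta)
    beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))

print(np.round(beta, 1))
```

The fitted coefficients should land close to `beta_true`, showing how geographic covariates enter a frequency model exactly like any other feature.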

[LG-49] Locating acts of mechanistic reasoning in student team conversations with mechanistic machine learning

链接: https://arxiv.org/abs/2604.21870
作者: Kaitlin Gili,Mainak Nistala,Kristen Wendell,Michael C. Hughes
类目: Physics Education (physics.ed-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:STEM education researchers are often interested in identifying moments of students’ mechanistic reasoning for deeper analysis, but have limited capacity to search through many team conversation transcripts to find segments with a high concentration of such reasoning. We offer a solution in the form of an interpretable machine learning model that outputs time-varying probabilities that individual students are engaging in acts of mechanistic reasoning, leveraging evidence from their own utterances as well as contributions from the rest of the group. Using the toolkit of intentionally-designed probabilistic models, we introduce a specific inductive bias that steers the probabilistic dynamics toward desired, domain-aligned behavior. Experiments compare trained models with and without the inductive bias components, investigating whether their presence improves the desired model behavior on transcripts involving never-before-seen students and a novel discussion context. Our results show that the inductive bias improves generalization – supporting the claim that interpretability is built into the model for this task rather than imposed post hoc. We conclude with practical recommendations for STEM education researchers seeking to adopt the tool and for ML researchers aiming to extend the model’s design. Overall, we hope this work encourages the development of mechanistically interpretable models that are understandable and controllable for both end users and model designers in STEM education research.

[LG-50] Beyond Expected Information Gain: Stable Bayesian Optimal Experimental Design with Integral Probability Metrics and Plug-and-Play Extensions

链接: https://arxiv.org/abs/2604.21849
作者: Di Wu,Ling Liang,Haizhao Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Bayesian Optimal Experimental Design (BOED) provides a rigorous framework for decision-making tasks in which data acquisition is often the critical bottleneck, especially in resource-constrained settings. Traditionally, BOED typically selects designs by maximizing expected information gain (EIG), commonly defined through the Kullback-Leibler (KL) divergence. However, classical evaluation of EIG often involves challenging nested expectations, and even advanced variational methods leave the underlying log-density-ratio objective unchanged. As a result, support mismatch, tail underestimation, and rare-event sensitivity remain intrinsic concerns for KL-based BOED. To address these fundamental bottlenecks, we introduce an IPM-based BOED framework that replaces density-based divergences with integral probability metrics (IPMs), including the Wasserstein distance, Maximum Mean Discrepancy, and Energy Distance, resulting in a highly flexible plug-and-play BOED framework. We establish theoretical guarantees showing that IPM-based utilities provide stronger geometry-aware stability under surrogate-model error and prior misspecification than classical EIG-based utilities. We also validate the proposed framework empirically, demonstrating that IPM-based designs yield highly concentrated credible sets. Furthermore, by extending the same sample-based BOED template in a plug-and-play manner to geometry-aware discrepancies beyond the IPM class, illustrated by a neural optimal transport estimator, we achieve accurate optimal designs in high-dimensional settings where conventional nested Monte Carlo estimators and advanced variational methods fail.
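One of the IPMs the abstract names, the Maximum Mean Discrepancy, is straightforward to estimate from samples; a design utility could then prefer designs whose predictive distributions are better separated. A minimal sketch (the bandwidth and toy samples are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def mmd2(x, y, bw=1.0):
    """Squared Maximum Mean Discrepancy (V-statistic) with an RBF kernel,
    computed between two empirical samples."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bw ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

# Samples standing in for predictive distributions under two candidate designs:
# the informative design separates the distributions more.
base = rng.normal(size=(200, 2))
near = rng.normal(loc=0.1, size=(200, 2))
far = rng.normal(loc=2.0, size=(200, 2))
print(mmd2(base, near) < mmd2(base, far))
```

Unlike a KL-based utility, this estimator needs no density ratios and stays finite even when the two samples have disjoint support.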

[LG-51] On the algebra of Koopman eigenfunctions and on some of their infinities

链接: https://arxiv.org/abs/2604.21825
作者: Zahra Monfared,Saksham Malhotra,Sekiya Hajime,Ioannis Kevrekidis,Felix Dietrich
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:For continuous-time dynamical systems with reversible trajectories, the nowhere-vanishing eigenfunctions of the Koopman operator of the system form a multiplicative group. Here, we exploit this property to accelerate the systematic numerical computation of the eigenspaces of the operator. Given a small set of (so-called "principal") eigenfunctions that are approximated conventionally, we can obtain a much larger set by constructing polynomials of the principal eigenfunctions. This enriches the set, and thus allows us to more accurately represent application-specific observables. Often, eigenfunctions exhibit localized singularities (e.g. in simple, one-dimensional problems with multiple steady states) or extended ones (e.g. in simple, two-dimensional problems possessing a limit cycle, or a separatrix); we discuss eigenfunction matching/continuation across such singularities. By handling eigenfunction singularities and enabling their continuation, our approach supports learning consistent global representations from locally sampled data. This is particularly relevant for multistable systems and applications with sparse or fragmented measurements.
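The multiplicative-group property is easy to verify on the scalar system dx/dt = -x, where phi(x) = x is a principal eigenfunction with eigenvalue -1 and its square is again an eigenfunction with eigenvalue -2. This is a toy numerical check, not the paper's computation:

```python
import numpy as np

# For dx/dt = -x, phi(x) = x is a Koopman eigenfunction with eigenvalue -1:
# phi(x(t)) = e^{lambda t} phi(x(0)). Products of nowhere-vanishing
# eigenfunctions are again eigenfunctions, with eigenvalues adding, so a few
# principal eigenfunctions generate a whole polynomial family.
lam = -1.0
x0, t = 2.0, 0.7
x_t = x0 * np.exp(-t)                  # exact flow of dx/dt = -x

phi = lambda x: x                      # principal eigenfunction, eigenvalue -1
phi_sq = lambda x: phi(x) ** 2         # constructed eigenfunction, eigenvalue -2

print(np.isclose(phi(x_t), np.exp(lam * t) * phi(x0)),
      np.isclose(phi_sq(x_t), np.exp(2 * lam * t) * phi_sq(x0)))
```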

[LG-52] Neural surrogates for crystal growth dynamics with variable supersaturation: explicit vs. implicit conditioning

链接: https://arxiv.org/abs/2604.21753
作者: Matteo Rigoni,Daniele Lanzoni,Francesco Montalenti,Roberto Bergamaschini
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulations of crystal growth are performed by using Convolutional Recurrent Neural Network surrogate models, trained on a dataset of time sequences computed by numerical integration of Allen-Cahn dynamics including faceting via kinetic anisotropy. Two network architectures are developed to take into account the effects of a variable supersaturation value. The first infers it implicitly by processing an input mini-sequence of a few evolution frames and then returns a consistent continuation of the evolution. The second takes the supersaturation parameter as an explicit input along with a single initial frame and predicts the entire sequence. The two models are systematically tested to establish strengths and weaknesses, comparing the prediction performance for models trained on datasets of different size and, in the first architecture, different lengths of input mini-sequence. The analysis of point-wise and mean absolute errors shows how the explicit parameter conditioning guarantees the best results, reproducing with high-fidelity the ground-truth profiles. Comparable results are achievable by the mini-sequence approach only when using larger training datasets. The trained models show strong conditioning by the supersaturation parameter, consistently reproducing its overall impact on growth rates as well as its local effect on the faceted morphology. Moreover, they are perfectly scalable even on 256 times larger domains and can be successfully extended to more than 10 times longer sequences with limited error accumulation. The analysis highlights the potential and limits of these approaches in view of their general exploitation for crystal growth simulations.

[LG-53] There Will Be a Scientific Theory of Deep Learning

链接: https://arxiv.org/abs/2604.21691
作者: Jamie Simon,Daniel Kunin,Alexander Atanasov,Enric Boix-Adserà,Blake Bordelon,Jeremy Cohen,Nikhil Ghosh,Florentin Guth,Arthur Jacot,Mason Kamb,Dhruva Karkada,Eric J. Michaud,Berkan Ottlik,Joseph Turnbull
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 41 pages, 6 figures

点击查看摘要

Abstract:In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at this http URL.

[LG-54] A Kernel Nonconformity Score for Multivariate Conformal Prediction

链接: https://arxiv.org/abs/2604.21595
作者: Louis Meyer,Wenkai Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate conformal prediction requires nonconformity scores that compress residual vectors into scalars while preserving certain implicit geometric structure of the residual distribution. We introduce a Multivariate Kernel Score (MKS) that produces prediction regions that explicitly adapt to this geometry. We show that the proposed score resembles the Gaussian process posterior variance, unifying Bayesian uncertainty quantification with the coverage guarantees of frequentist-type. Moreover, the MKS can be decomposed into an anisotropic Maximum Mean Discrepancy (MMD) that interpolates between kernel density estimation and covariance-weighted distance. We prove finite-sample coverage guarantees and establish convergence rates that depend on the effective rank of the kernel-based covariance operator rather than the ambient dimension, enabling dimension-free adaptation. On regression tasks, the MKS reduces the volume of prediction region significantly, compared to ellipsoidal baselines while maintaining nominal coverage, with larger gains at higher dimensions and tighter coverage levels.
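For intuition, here is split-conformal calibration with a plain covariance-weighted (Mahalanobis) score, which the abstract describes as one limit of the proposed kernel score. The Gaussian residuals are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated 2D residuals with correlated components.
res = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 2.0]], size=1000)

# Covariance-weighted nonconformity score: a Mahalanobis-style compression
# of the residual vector into a scalar (stand-in for the kernel score).
Sinv = np.linalg.inv(np.cov(res.T))
scores = np.einsum('ni,ij,nj->n', res, Sinv, res)

# Split conformal: calibrate the threshold on half, check coverage on the rest.
cal, test = scores[:500], scores[500:]
alpha = 0.1
q = np.ceil((len(cal) + 1) * (1 - alpha)) / len(cal)  # finite-sample correction
threshold = np.quantile(cal, min(q, 1.0))
coverage = (test <= threshold).mean()
print(0.85 < coverage < 0.95)
```

The empirical coverage lands near the nominal 90%; the kernel score plays the same role but adapts the region's shape beyond ellipsoids.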

[LG-55] A single algorithm for both restless and rested rotting bandits AISTATS2020

链接: https://arxiv.org/abs/2604.21432
作者: Julien Seznec,Pierre Ménard,Alessandro Lazaric,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: In AISTATS 2020

点击查看摘要

Abstract:In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated with the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decreases over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandits perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandits, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.
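The RAW-UCB index can be sketched as the smallest UCB taken over every window of an arm's most recent pulls, which automatically discounts stale high rewards. The confidence width and the rotting instance below are illustrative choices, not the paper's exact tuning.

```python
import numpy as np

rng = np.random.default_rng(4)

def raw_ucb(mean_fns, T, sigma=0.1):
    """Minimal RAW-UCB sketch: an arm's index is the minimum, over all recent
    windows h, of (mean of last h rewards) + confidence width(h)."""
    rewards = [[] for _ in mean_fns]
    for _ in range(T):
        idx = []
        for a, _ in enumerate(mean_fns):
            if not rewards[a]:
                idx.append(np.inf)          # pull every arm once first
                continue
            r = np.array(rewards[a])
            idx.append(min(r[-h:].mean() + sigma * np.sqrt(4 * np.log(T) / h)
                           for h in range(1, len(r) + 1)))
        a = int(np.argmax(idx))
        n_a = len(rewards[a])
        rewards[a].append(mean_fns[a](n_a) + sigma * rng.normal())
    return [len(r) for r in rewards]

# Arm 0 rots with each pull (rested rotting); arm 1 stays constant at 0.5.
pulls = raw_ucb([lambda n: 0.5 ** n, lambda n: 0.5], T=200)
print(pulls[1] > pulls[0])
```

Because the minimum over windows tracks the most recent (decayed) value, the sketch quickly abandons the rotting arm in favor of the constant one.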

[LG-56] CLT-Optimal Parameter Error Bounds for Linear System Identification

链接: https://arxiv.org/abs/2604.21270
作者: Yichen Zhou,Stephen Tu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 36 pages

点击查看摘要

Abstract:There has been remarkable progress over the past decade in establishing finite-sample, non-asymptotic bounds on recovering unknown system parameters from observed system behavior. Surprisingly, however, we show that the current state-of-the-art bounds do not accurately capture the statistical complexity of system identification, even in the most fundamental setting of estimating a discrete-time linear dynamical system (LDS) via ordinary least-squares regression (OLS). Specifically, we utilize asymptotic normality to identify classes of problem instances for which current bounds overstate the squared parameter error, in both spectral and Frobenius norm, by a factor of the state-dimension of the system. Informed by this discrepancy, we then sharpen the OLS parameter error bounds via a novel second-order decomposition of the parameter error, where crucially the lower-order term is a matrix-valued martingale that we show correctly captures the CLT scaling. From our analysis we obtain finite-sample bounds for both (i) stable systems and (ii) the many-trajectories setting that match the instance-specific optimal rates up to constant factors in Frobenius norm, and polylogarithmic state-dimension factors in spectral norm.
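The estimation setting the abstract studies, OLS recovery of a linear dynamical system from a single trajectory, takes only a few lines to simulate; the system matrix, noise scale, and horizon below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate a stable LDS x_{t+1} = A x_t + w_t and recover A by OLS.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
T = 5000
x = np.zeros((T + 1, 2))
for t in range(T):
    x[t + 1] = A @ x[t] + 0.1 * rng.normal(size=2)

X, Y = x[:-1], x[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T   # OLS solves Y ≈ X A^T
print(np.round(A_hat, 2))
```

The parameter error of this estimator, in spectral and Frobenius norm, is exactly the quantity whose finite-sample behavior the paper sharpens.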

[LG-57] Assessing Emulator Design and Training for Modal Aerosol Microphysics Parameterizations in E3SMv2

链接: https://arxiv.org/abs/2604.21233
作者: Shady E. Ahmed,Hui Wan,Saad Qadeer,Panos Stinis,Kezhen Chong,Mohammad Taufiq Hassan Mozumder,Kai Zhang,Ann S. Almgren
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Geophysics (physics.geo-ph)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Toward the goal of using Scientific Machine Learning (SciML) emulators to improve the numerical representation of aerosol processes in global atmospheric models, we explore the emulation of aerosol microphysics processes under cloud-free conditions in the 4-mode Modal Aerosol Module (MAM4) within the Energy Exascale Earth System Model version 2 (E3SMv2). To develop an in-depth understanding of the challenges and opportunities in applying SciML to aerosol processes, we begin with a simple feedforward neural network architecture that has been used in earlier studies, but we systematically examine key emulator design choices, including architecture complexity and variable normalization, while closely monitoring training convergence behavior. Our results show that optimization convergence, scaling strategy, and network complexity strongly influence emulation accuracy. When effective scaling is applied and convergence is achieved, the relatively simple architecture, used together with a moderate network size, can reproduce key features of the microphysics-induced aerosol concentration changes with promising accuracy. These findings provide practical clues for the next stages of emulator development; they also provide general insights that are likely applicable to the emulation of other aerosol processes, as well as other atmospheric physics involving multi-scale variability.

[LG-58] Neutron and X-ray Diffraction Reveal the Limits of Long-Range Machine Learning Potentials for Medium-Range Order in Silica Glass

链接: https://arxiv.org/abs/2604.21222
作者: Sai Harshit Balantrapu,Atul C. Thakur,Chris Benmore,Ganesh Sivaraman
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Glassy silica is a foundational material in optics and electronics, yet accurately predicting its medium-range order (MRO) remains a major challenge for machine-learning interatomic potentials (MLIPs). While local MLIPs reproduce the short-range SiO4 tetrahedral network well, it remains unclear whether locality alone is sufficient to recover the first sharp diffraction peak (FSDP), the principal experimental signature of MRO. Here, we combine neutron and X-ray diffraction measurements with large-scale molecular dynamics driven by two MACE-based models: a short-range (SR) potential and a long-range (LR) extension incorporating reciprocal-space gated attention. The SR model systematically over-structures the network, producing an overly intense FSDP in both the liquid and glassy states. Incorporating long-range interactions improves agreement with experiment for the liquid structure by reducing this excess ordering, but the LR model still fails to recover the experimental amorphous MRO after quenching. Ring-statistics and bond-angle analyses reveal that SR model exhibits an artificially narrow distribution dominated by six-membered rings, while the LR model produces a broader but still biased ring population. Despite preserving the correct tetrahedral geometry, both models show limited variability in Si-O-Si angles, indicating constrained network flexibility. These structural signatures demonstrate that both models retain excessive memory of the parent liquid network, leading to kinetically trapped and nonphysical medium-range configurations during vitrification. These results show that explicit long-range interactions are necessary but not sufficient for predictive modelling of disordered silica and suggest that accurate MRO further requires training data and sampling strategies that adequately represent the liquid-to-glass transition.

[LG-59] The Feedback Hamiltonian is the Score Function: A Diffusion-Model Framework for Quantum Trajectory Reversal

链接: https://arxiv.org/abs/2604.21210
作者: Sagar Dubey,Alan John
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:In continuously monitored quantum systems, the feedback protocol of García-Pintos, Liu, and Gorshkov reshapes the arrow of time: a Hamiltonian H_\mathrm{meas} = r A / \tau applied with gain X tilts the distribution of measurement trajectories, with X = -2 producing statistically time-reversed outcomes. Why this specific Hamiltonian achieves reversal, and how the mechanism relates to score-based diffusion models in machine learning, has remained unexplained. We compute the functional derivative of the log path probability of the quantum trajectory distribution directly in density-matrix space. Combining Girsanov’s theorem applied to the measurement record, Fréchet differentiation on the Banach space of trace-class operators, and Kähler geometry on the pure-state projective manifold, we prove that \delta \log P_F / \delta \rho = r A / \tau = H_\mathrm{meas} . The García-Pintos feedback Hamiltonian is the score function of the quantum trajectory distribution – exactly the object Anderson’s reverse-time diffusion theorem requires for trajectory reversal. The identification extends to multi-qubit systems with independent measurement channels, where the score is a sum of local operators. Two consequences follow. First, the feedback gain X generates a continuous one-parameter family of path measures (for feedback-active Hamiltonians with [H, A] \neq 0 ), with X = -2 recovering the backward process in leading-order linearization – a structure absent from classical diffusion, where reversal is binary. Second, the score identification enables machine learning (ML) score estimation methods – denoising score matching, sliced score matching – to replace the analytic formula when its idealizations (unit efficiency, zero delay, Gaussian noise) fail in real experiments.

[LG-60] Refining Covariance Matrix Estimation in Stochastic Gradient Descent Through Bias Reduction

链接: https://arxiv.org/abs/2604.21203
作者: Ziyang Wei,Wanrong Zhu,Jingyang Lyu,Wei Biao Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study online inference and asymptotic covariance estimation for the stochastic gradient descent (SGD) algorithm. While classical methods (such as plug-in and batch-means estimators) are available, they either require inaccessible second-order (Hessian) information or suffer from slow convergence. To address these challenges, we propose a novel, fully online de-biased covariance estimator that eliminates the need for second-order derivatives while significantly improving estimation accuracy. Our method employs a bias-reduction technique to achieve a convergence rate of n^{(\alpha-1)/2} \sqrt{\log n} , outperforming existing Hessian-free alternatives.

[LG-61] Learning to Emulate Chaos: Adversarial Optimal Transport Regularization

链接: https://arxiv.org/abs/2604.21097
作者: Gabriel Melo,Leonardo Santiago,Peter Y. Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chaos arises in many complex dynamical systems, from weather to power grids, but is difficult to accurately model using data-driven emulators, including neural operator architectures. For chaotic systems, the inherent sensitivity to initial conditions makes exact long-term forecasts theoretically infeasible, meaning that traditional squared-error losses often fail when trained on noisy data. Recent work has focused on training emulators to match the statistical properties of chaotic attractors by introducing regularization based on handcrafted local features and summary statistics, as well as learned statistics extracted from a diverse dataset of trajectories. In this work, we propose a family of adversarial optimal transport objectives that jointly learn high-quality summary statistics and a physically consistent emulator. We theoretically analyze and experimentally validate a Sinkhorn divergence formulation (2-Wasserstein) and a WGAN-style dual formulation (1-Wasserstein). Our experiments across a variety of chaotic systems, including systems with high-dimensional chaotic attractors, show that emulators trained with our approach exhibit significantly improved long-term statistical fidelity.
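The entropic OT cost underlying the Sinkhorn-divergence objective can be estimated between two point clouds with plain Sinkhorn iterations; the full Sinkhorn divergence additionally debiases with self-comparison terms. The sample sizes, regularization eps, and toy clouds below are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

def sinkhorn_cost(x, y, eps=0.1, iters=300):
    """Entropic OT transport cost between two uniform empirical measures,
    via standard Sinkhorn scaling iterations on the Gibbs kernel."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared-distance cost
    K = np.exp(-C / eps)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                     # transport plan
    return float((P * C).sum())

# Point clouds standing in for attractor samples from a reference system
# and from two candidate emulators, one statistically close and one off.
truth = rng.uniform(size=(100, 2))
close = truth + 0.01 * rng.normal(size=(100, 2))
off = truth + 0.5
print(sinkhorn_cost(truth, close) < sinkhorn_cost(truth, off))
```

Used as a training loss, such a term penalizes an emulator whose long-run sample cloud drifts away from the reference attractor, rather than demanding pointwise trajectory agreement.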

[LG-62] climt-paraformer: Stable Emulation of Convective Parameterization using a Temporal Memory-aware Transformer

链接: https://arxiv.org/abs/2604.21085
作者: Shuochen Wang,Nishant Yadav,Joy Merwin Monteiro,Auroop R. Ganguly
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate representation of moist convective sub-grid-scale processes remains a major challenge in global climate models, as traditional parameterization schemes are both computationally expensive and difficult to scale. Neural network (NN) emulators offer a promising alternative by learning efficient mappings between atmospheric states and convective tendencies while retaining fidelity to the underlying physics. However, most existing NN-based parameterizations are memory-less and rely only on instantaneous inputs, even though convection evolves over time and depends on prior atmospheric states. Recent studies have begun to incorporate convective memory, but they often treat past states as independent features rather than modeling temporal dependencies explicitly. In this work, we develop a temporal memory-aware Transformer emulator for the Emanuel convective parameterization and evaluate it in a single-column climate model (SCM) under both offline and online configurations. The Transformer captures temporal correlations and nonlinear interactions across consecutive atmospheric states. Compared with baseline emulators, including a memory-less multilayer perceptron and a recurrent long short-term memory model, the Transformer achieves lower offline errors. Sensitivity analysis indicates that a memory length of approximately 100 minutes yields the best performance, whereas longer memory degrades performance. We further test the emulator in long-term coupled simulations and show that it remains stable over 10 years. Overall, this study demonstrates the importance of explicit temporal modeling for NN-based parameterizations.

[LG-63] Achieving the Kesten-Stigum bound in the non-uniform hypergraph stochastic block model

链接: https://arxiv.org/abs/2604.20907
作者: Manuel Fernandez V,Ludovic Stephan,Yizhe Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Combinatorics (math.CO); Probability (math.PR); Statistics Theory (math.ST)
*备注: 67 pages, 1 figure

点击查看摘要

Abstract:We study the community detection problem in the non-uniform hypergraph stochastic block model (HSBM), where hyperedges of varying sizes coexist. This setting captures higher-order and multi-view interactions and raises a fundamental question: can multiple uniform hypergraph layers below the detection threshold be combined to enable weak recovery? We answer this question by establishing a Kesten–Stigum-type bound for weak recovery in a general class of non-uniform HSBMs with r blocks, generated according to multiple symmetric probability tensors. In the case r=2 , we show that weak recovery is possible whenever the sum of the signal-to-noise ratios across all uniform hypergraph layers exceeds one, thereby confirming the positive part of a conjecture in (Chodrow et al., 2023). Moreover, we provide a polynomial-time spectral algorithm that achieves this threshold via an optimally weighted non-backtracking operator. For the unweighted non-backtracking matrix, our spectral method attains a different algorithmic threshold, also conjectured in (Chodrow et al., 2023). Our approach develops a spectral theory for weighted non-backtracking operators on non-uniform hypergraphs, including a precise characterization of outlier eigenvalues and eigenvector overlaps. We introduce a novel Ihara–Bass formula tailored to weighted non-uniform hypergraphs, which yields an efficient low-dimensional representation and leads to a provable spectral reconstruction algorithm. Taken together, these results provide a principled and computationally efficient approach to clustering in non-uniform hypergraphs, and highlight the role of optimal weighting in aggregating heterogeneous higher-order interactions. 

[LG-64] Spectral Kernel Dynamics for Planetary Surface Graphs: Distinction Dynamics and Topological Conservation

链接: https://arxiv.org/abs/2604.20887
作者: Jnaneshwar Das
类目: Dynamical Systems (math.DS); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 17 pages, 0 figures

点击查看摘要

Abstract:The spectral kernel field equation R[k] = T[k] lacks a conservation-law analog. We prove (i) the fixed-point flow is strictly volume-expanding (tr DF > 0), precluding automatic conservation, and (ii) the conservation deficit per mode equals the Hessian stability margin exactly: D_m = -Delta’. Closing the deficit requires a scene-side compensating contribution, which we formalise as the distinction dynamics equation dc/dt = G[c, h_t], with MaxCal-optimal realisation G_opt. On fixed-topology 3D surface graphs we derive a conditional topology-preserving compression theorem: retaining k = beta_0 + beta_1 modes (under a spectral-ordering assumption) preserves all Betti-number charges; we include a worked short-cycle counterexample (figure-eight) calibrating when the assumption fails. A triple necessary spectral diagnostic – Fiedler-mode concentration, elevated curl energy, anomalous beta_1 – is derived for planetary drainage networks at O(N) cost. Two internal real-data sequences serve as preliminary consistency checks; full benchmarks and adaptive-topology extensions are deferred.

[LG-65] KinetiDiff: Docking-Guided Diffusion for De Novo ACVR1 Inhibitor Design in Fibrodysplasia Ossificans Progressiva

链接: https://arxiv.org/abs/2604.20886
作者: Aaryan Patel
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:We present KinetiDiff, a structure-based framework for de novo kinase inhibitor design that integrates a Geometry-Complete Diffusion Model with real-time AutoDock Vina gradient guidance. By injecting physics-based docking gradients into the diffusion denoising loop, KinetiDiff steers molecule generation toward high-affinity conformations for ACVR1 (ALK2), the causative kinase in Fibrodysplasia Ossificans Progressiva. From 10,000 diffusion samples, the framework produced 9,997 valid molecules. The best candidate achieved -11.05 kcal/mol (pKd = 8.10), a 19.2% improvement over the crystallographic reference. The top 100 candidates all exceed the reference, with 100% Lipinski compliance, median synthetic accessibility of 2.67, and internal diversity of 0.790. Systematic ablation across four guidance strategies–Vina-Direct (physics), HNN-Denovo (neural proxy), multi-objective, and unguided–demonstrates that real-time docking guidance dominates on all metrics. We evaluate HNN-Denovo as a computationally efficient alternative (60-fold speedup per step), revealing a domain-mismatch limitation (r = 0.224 correlation with Vina) that explains its inferior performance. These results establish gradient-guided geometric diffusion as a practical approach for generating potent, synthetically accessible inhibitors against rare-disease kinase targets.

附件下载

点击下载今日全部论文列表