This post lists the latest papers retrieved from Arxiv.org on 2026-04-29. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from Arxiv.org daily, with a scheduled automatic update around 12:30 each day.

Tip: if the list is not updated on a given day, either Arxiv published no new papers that day or the update script failed. Fixes are usually made the same day.

Table of Contents

Overview (2026-04-29)

A total of 550 papers were updated today, including:

  • Natural Language Processing: 81 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 180 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 86 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 129 papers (Machine Learning (cs.LG))
  • Multi-Agent Systems: 11 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 19 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 29 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] Pythia: Toward Predictability-Driven Agent-Native LLM Serving

[Quick Read]: This paper addresses the high runtime uncertainty, low resource utilization, and high latency encountered when serving multi-agent generative AI applications. Existing systems treat agentic workloads as generic traffic, failing to exploit the structured topology and semantic predictability inherent in multi-agent architectures; this leads to low prefix-cache hit rates, severe resource contention from long-context requests, and substantial queuing delays caused by suboptimal scaling policies. The key is Pythia, a system that introduces a concise interface at the serving layer to explicitly capture workflow semantics, unlocking new optimization opportunities and substantially improving throughput and job completion time without changing the underlying model's capabilities.

Link: https://arxiv.org/abs/2604.25899
Authors: Shan Yu,Junyi Shu,Yuanjiang Ni,Kun Qian,Xue Li,Yang Wang,Jinyuan Zhang,Ziyi Xu,Shuo Yang,Lingjun Zhu,Ennan Zhai,Qingda Lu,Jiarong Xing,Youyou Lu,Xin Jin,Xuanzhe Liu,Harry Xu
Affiliations: UCLA; Alibaba Cloud Computing; Alibaba Group; Intel; SJTU; UC Berkeley; Rice University; Tsinghua University; Peking University
Categories: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty – yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.
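
Pythia's actual interface is not spelled out in this summary, but the idea of surfacing workflow semantics to the serving layer can be sketched minimally. In the hypothetical snippet below (all names and fields are invented for illustration), each request is tagged with its workflow instance so the scheduler can co-locate requests that share a cached prompt prefix:

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical sketch: annotate each request with workflow semantics so the
# serving layer can group requests that share a cached prompt prefix.
@dataclass
class AgentRequest:
    workflow_id: str  # which multi-agent workflow instance issued this call
    agent_role: str   # e.g. "planner", "coder", "critic"
    prompt: str       # full prompt; agents in one workflow share its prefix

def batch_by_workflow(requests):
    """Group requests by workflow so shared prefixes hit the same cache."""
    groups = defaultdict(list)
    for r in requests:
        groups[r.workflow_id].append(r)
    return dict(groups)

reqs = [
    AgentRequest("wf-1", "planner", "SYSTEM... plan the task"),
    AgentRequest("wf-1", "coder", "SYSTEM... write the code"),
    AgentRequest("wf-2", "planner", "SYSTEM... plan the task"),
]
batches = batch_by_workflow(reqs)
```

A real system would use this grouping for prefix-cache routing and scaling decisions; the point here is only that the semantics must be declared by the application rather than inferred from generic traffic.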

[MA-1] Volitional Multiagent Atomic Transactions: Describing People and their Machines

[Quick Read]: This paper addresses the fact that classical formal models of concurrent and distributed systems ignore the humans who operate them; for grassroots platforms built from personally operated devices (e.g., smartphones), a faithful description must capture the states of both people and machines and how they jointly produce system behavior. The key is the mathematical foundation of volitional multiagent atomic transactions: each agent's state consists of a volitional state and a machine state, and a transaction is enabled only when its machine precondition holds and the guarding persons are willing. For example, befriending is guarded by both parties, unfriending by either, and a voluntary trade by both. The framework further provides formal machinery for expressing safety and liveness, and is applied to specify two grassroots platforms, social networks and coins-and-bonds, giving AI a theoretical basis from which to derive working implementations.

Link: https://arxiv.org/abs/2604.25596
Authors: Andy Lewis-Pye,Ehud Shapiro
Affiliations: 1. London School of Economics and Political Science; 2. Weizmann Institute of Science
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Formal models for concurrent and distributed systems describe machines; the people who operate them are either ignored or treated as external environment. Yet key distributed systems – notably grassroots platforms – include people operating their personal machines (smartphones), and their faithful description must include the states of both people and machines and how they jointly effect system behaviour. Here, we propose volitional multiagent atomic transactions – executed atomically by machines and guarded by their people’s volitions – as a novel mathematical foundation for specifying systems consisting of people operating machines. Each agent’s state consists of a volitional state and machine state; a transaction is enabled when the machine precondition holds and the guarding persons are willing. For example, befriending two people is guarded by both; unfriending, by either; voluntary swap of coins and bonds is guarded by both parties, while a payment is guarded by the payer. We develop the mathematical machinery to express safety and liveness of platforms specified in this framework, and provide example specifications of two grassroots platforms: social networks, and coins and bonds. These specifications are then used by AI to derive working implementations. We employ here a novel and simpler definition of ‘grassroots’ that better captures the informal notion – multiple instances can form and operate independently, yet may coalesce – and show that the platforms specified here, as well as those hitherto proven grassroots under the original definition, are grassroots under the new definition.

[MA-2] Should I Replan? Learning to Spot the Right Time in Robust MAPF Execution

[Quick Read]: This paper addresses desynchronization during the execution of Multi-Agent Path Finding (MAPF) plans caused by delayed agents, which can lead to collisions from asynchronous movement. Robust execution methods such as the Action Dependency Graph (ADG) preserve safety by synchronizing risky actions, but often increase execution cost by forcing agents to wait for delayed ones. To reduce this cost while preserving safety, the paper proposes a neural decision mechanism: a fully connected feed-forward neural network takes newly designed ADG-based features as input and predicts the benefit (potential cost reduction) achievable by a single replanning step, i.e., whether replanning is worthwhile. Trained and tested on a new labeled dataset of 12,000 experiments, the method is shown to recover up to 94.6% of the achievable reduction in delay impact.

Link: https://arxiv.org/abs/2604.25567
Authors: David Zahrádka(1 and 2),David Woller(1),Denisa Mužíková(1 and 2),Miroslav Kulich(1),Libor Přeučil(1) ((1) Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, (2) Faculty of Electrical Engineering, Czech Technical University in Prague)
Affiliations: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Czechia; Faculty of Electrical Engineering, Czech Technical University in Prague, Czechia
Categories: Multiagent Systems (cs.MA)
Comments: 8 pages, 10 figures. Submitted for double-blind review to IEEE

Abstract:During the execution of Multi-Agent Path Finding (MAPF) plans in real-life applications, the MAPF assumption that the fleet’s movement is perfectly synchronized does not apply. Since one or more of the agents may become delayed due to internal or external factors, it is often necessary to use a robust execution method to avoid collisions caused by desynchronization. Robust execution methods - such as the Action Dependency Graph (ADG) - synchronize the execution of risky actions, but often at the expense of increased plan execution cost, because it may require some agents to wait for the delayed agents. In such cases, the execution’s cost can be reduced while still preserving safety by finding a new plan either by rescheduling (reordering the agents at crossroads) or the more general replanning capable of finding new paths. However, these operations may be costly, and the new plan may not even lead to lower execution cost than the original plan: for example, the two plans may be the exact same. Therefore, we estimate the benefit that can be achieved by single replanning in scenarios with delayed agents given an immediate state of the execution with a fully connected feed-forward neural network. The input to the neural network is a set of newly designed ADG-based features describing the robust execution’s state and the impact of potential delays, and the output is an estimated benefit achievable by replanning. We train and test the network on a new labeled dataset containing 12,000 experiments, and we show that our proposed method is capable of reducing the impact of delays by up to 94.6% of the achievable reduction.
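
The estimator described above can be sketched as a tiny fully connected feed-forward network. The feature names, network size, and weights below are invented stand-ins (the paper's actual ADG-based features are richer and the network is trained, not random):

```python
import math
import random

random.seed(0)

# Toy sketch: map hand-made ADG-style execution features to an estimated
# replanning benefit with a one-hidden-layer feed-forward network.
FEATURES = ["num_delayed_agents", "total_delay_steps",
            "blocked_actions", "queue_depth"]

def init_layer(n_in, n_out):
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
            for _ in range(n_out)]

def forward(x, w_hidden, w_out):
    # tanh hidden layer followed by a linear scalar output
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

w_hidden = init_layer(len(FEATURES), 8)
w_out = [random.uniform(-0.5, 0.5) for _ in range(8)]

# A single execution snapshot: 2 delayed agents, 5 delay steps, etc.
benefit = forward([2.0, 5.0, 3.0, 1.0], w_hidden, w_out)
```

In the paper the output is compared against a threshold to decide whether a single replanning pass is worth its cost; here the untrained network only shows the shape of the computation.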

[MA-3] Where Did It Go Wrong? Capability-Oriented Failure Attribution for Vision-and-Language Navigation Agents

[Quick Read]: This paper addresses the difficulty of localizing and attributing failures in embodied agents for safety-critical tasks such as Vision-Language Navigation (VLN), where multiple capabilities are tightly coupled. Existing testing methods are largely system-level and offer little insight into which capability deficiency caused a failure. The key is a capability-oriented testing approach built from three components: (1) adaptive test-case generation via seed selection and mutation; (2) capability oracles for identifying capability-specific errors; and (3) a feedback loop that attributes failures to specific capabilities and guides further test generation. The approach discovers more failure cases and pinpoints capability-level deficiencies more accurately than state-of-the-art baselines, providing interpretable and actionable guidance for improving embodied agents.

Link: https://arxiv.org/abs/2604.25161
Authors: Jianming Chen,Yawen Wang,Junjie Wang,Xiaofei Xie,Shoubin Li,Qing Wang,Fanjiang Xu
Affiliations: Institute of Software, Chinese Academy of Sciences; Science Technology on Integrated Information System Laboratory; State Key Laboratory of Complex System Modeling and Simulation Technology; University of Chinese Academy of Sciences; Singapore Management University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Embodied agents in safety-critical applications such as Vision-Language Navigation (VLN) rely on multiple interdependent capabilities (e.g., perception, memory, planning, decision), making failures difficult to localize and attribute. Existing testing methods are largely system-level and provide limited insight into which capability deficiencies cause task failures. We propose a capability-oriented testing approach that enables failure detection and attribution by combining (1) adaptive test case generation via seed selection and mutation, (2) capability oracles for identifying capability-specific errors, and (3) a feedback mechanism that attributes failures to capabilities and guides further test generation. Experiments show that our method discovers more failure cases and more accurately pinpoints capability-level deficiencies than state-of-the-art baselines, providing more interpretable and actionable guidance for improving embodied agents.

[MA-4] Asymmetric-Information Resource Allocation Games: An LP Approach to Purposeful Deception

[Quick Read]: This paper addresses how, within a Bayesian game framework, a defender can perform purposeful deception through resource allocation, diverting an attacker away from the true asset and thereby improving its own protection. The core difficulty is that beliefs and policies are coupled, so the defender must simultaneously allocate resources effectively and manipulate the attacker's beliefs. The key is to characterize and solve for the Perfect Bayesian Nash Equilibrium (PBNE) via an efficient, non-iterative linear programming formulation, yielding policies that naturally balance effective allocation against belief manipulation and give rise to purposeful, emergent deceptive behavior.

Link: https://arxiv.org/abs/2604.25070
Authors: Longxu Pan,Yue Guan,Daigo Shishika,Panagiotis Tsiotras
Affiliations: Georgia Institute of Technology; George Mason University
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:In this work, we introduce the Deceptive Resource Allocation Game (DRAG), which studies purposeful deception within a Bayesian game framework. In DRAG, a Defender allocates resources across the true asset and several decoys to influence an Attacker’s beliefs and actions, with the goal of diverting the Attacker away from the true asset. We seek to characterize purposeful deception, whereby the Defender deceives only when doing so improves its performance. To this end, we solve for the Perfect Bayesian Nash Equilibrium (PBNE) of the corresponding game. We show that, despite the coupled belief-policy interdependence, the problem admits an efficient, non-iterative linear programming formulation. Numerical results demonstrate that the resulting policies naturally balance effective allocation and belief manipulation, giving rise to purposeful and emergent deceptive behaviors.
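
The allocation trade-off can be illustrated with a much simpler, fixed-belief toy (this is not the paper's LP or its PBNE computation; the belief vector, payoffs, and brute-force search below are invented). The defender protects site 0 (the true asset) plus two decoys against an attacker who holds a fixed prior over which site is real and attacks the site maximizing belief times vulnerability:

```python
from itertools import product

# Toy DRAG-flavored sketch: attacker belief over which of 3 sites is the
# true asset (site 0 actually is), defender searches discretized allocations.
BELIEF = [0.5, 0.3, 0.2]
STEPS = 10  # allocations in tenths of the unit budget

def attacker_target(alloc):
    # Attacker hits the site with the highest belief-weighted vulnerability.
    scores = [b * (1.0 - x) for b, x in zip(BELIEF, alloc)]
    return scores.index(max(scores))

best_alloc, best_loss = None, float("inf")
for a, b in product(range(STEPS + 1), repeat=2):
    if a + b > STEPS:
        continue
    alloc = [a / STEPS, b / STEPS, (STEPS - a - b) / STEPS]
    target = attacker_target(alloc)
    # Defender loses only if the true asset is attacked and under-defended.
    loss = (1.0 - alloc[0]) if target == 0 else 0.0
    if loss < best_loss:
        best_loss, best_alloc = loss, alloc
```

Even this crude search finds allocations that divert the attacker entirely to a decoy, which is the qualitative behavior the paper derives rigorously from the PBNE.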

[MA-5] Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

[Quick Read]: This paper addresses how to obtain early warning, before AI systems become capable of recursive self-improvement, of the key point at which that capability emerges. Existing benchmarks measure broad capability growth but give weak early signals of AI's ability to accelerate its own research. The key is an autonomous end-to-end machine-learning-pipeline task: given a concise task description rather than the full prior literature, frontier coding agents are evaluated on whether they can independently reproduce an AlphaZero-style reinforcement learning pipeline (with Connect Four as the case study) on consumer hardware within three hours. By eliciting agents' "research taste" for past breakthroughs, this probes autonomous research potential more sensitively; experiments show that some models (e.g., Claude Opus 4.7) now significantly outperform the reference solver on this task, and that the task has rapidly approached saturation, demonstrating its value for monitoring the evolution of AI's autonomous research capability.

Link: https://arxiv.org/abs/2604.25067
Authors: Joshua Sherwood,Ben Aybar,Benjamin Kaplan
Affiliations: University of Chicago
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI’s capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first-mover against Pons in seven of eight trials, statistically significantly better than the other agents tested, none of which exceeded two of eight. The task, which no frontier agent could reliably complete when we began development in January of 2026, is now near-saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4’s time-budget usage, consistent with but not diagnostic of sandbagging; Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.

[MA-6] MultiHedge: Adaptive Coordination via Retrieval-Augmented Control ICCS2026

[Quick Read]: This paper addresses the difficulty decision-making systems have in generalizing across shifting conditions and behaving stably under uncertainty. The key is MultiHedge, a hybrid architecture in which an LLM, augmented with retrieval over historical precedents, produces structured allocation decisions whose execution is grounded in canonical option strategies, thereby improving the robustness of modular decision pipelines. Empirically, memory-augmented retrieval improves stability and robustness more than increasing model scale alone.

Link: https://arxiv.org/abs/2604.24905
Authors: Feliks Bańka,Jarosław A. Chudziak
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures. Accepted to the 26th International Conference on Computational Science (ICCS 2026), to appear in Springer LNCS proceedings

Abstract:Decision-making under changing conditions remains a fundamental challenge in many real-world systems. Existing approaches often fail to generalize across shifting regimes and exhibit unstable behavior under uncertainty. This raises the research question: can retrieval-augmented LLM coordination improve the robustness of modular decision pipelines? We propose MultiHedge, a hybrid architecture where an LLM produces structured allocation decisions conditioned on retrieved historical precedents, and execution is grounded in canonical option strategies. In a controlled evaluation using U.S. equities, we compare MultiHedge to rule-based and learning-based baselines. The key result is that memory-augmented retrieval confers greater robustness and stability than increasing model scale alone. Our paper contributes a controlled computational study showing that memory and architectural design play a central role in robustness in modular decision systems.
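
The retrieval step such a system relies on can be sketched as nearest-neighbour lookup over encoded historical episodes. The episode names and feature vectors below are invented stand-ins for whatever encoding a real MultiHedge-style pipeline would use:

```python
import math

# Hypothetical library of past regimes, each encoded as a small feature vector.
EPISODES = {
    "calm-2019":  [0.1, 0.2, 0.9],
    "shock-2020": [0.9, 0.8, 0.1],
    "drift-2022": [0.5, 0.4, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def retrieve(query, k=2):
    """Return the k historical episodes most similar to the current state."""
    ranked = sorted(EPISODES,
                    key=lambda name: cosine(query, EPISODES[name]),
                    reverse=True)
    return ranked[:k]

# Current conditions resemble the 2020 shock regime.
precedents = retrieve([0.85, 0.75, 0.2])
```

The retrieved precedents would then be placed in the LLM's context before it emits a structured allocation decision, which is the "memory-augmented" part the paper credits for robustness.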

[MA-7] Co-Director: Agentic Generative Video Storytelling

[Quick Read]: This paper addresses the gap between high-fidelity diffusion-generated video clips and coherent narrative: current agentic pipelines automate generation with chained modules, but independent, handcrafted prompting causes semantic drift and cascading failures. The key is Co-Director, a hierarchical multi-agent framework that formalizes video storytelling as a global optimization problem. Hierarchical parameterization enforces semantic coherence: globally, a multi-armed bandit identifies promising creative directions; locally, a multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency, balancing the exploration of novel narrative strategies against the exploitation of effective creative configurations.

Link: https://arxiv.org/abs/2604.24842
Authors: Yale Song,Yiwen Song,Nick Losier,Nathan Hodson,Ye Jin,Rhyard Zhu,Yan Xu,Daniel Vlasic,Carina Claassen,Jasmine Leon,Khanh G. LeViet,Zack Chomyn,Joe Timmons,Brett Slatkin,Scott Penberthy,Tomas Pfister
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: Project Page: this https URL

Abstract:While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives. Project Page: this https URL
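
The global exploration step can be illustrated with a standard UCB1 bandit over "creative directions". The direction names and reward probabilities below are invented; in the paper the reward would come from evaluating generated sequences, not a coin flip:

```python
import math
import random

random.seed(42)

# Toy UCB1 bandit over creative directions with made-up success rates.
DIRECTIONS = {"noir": 0.3, "upbeat": 0.7, "surreal": 0.5}
counts = {d: 0 for d in DIRECTIONS}
totals = {d: 0.0 for d in DIRECTIONS}

def ucb_pick(t):
    for d in DIRECTIONS:          # play every arm once before using UCB
        if counts[d] == 0:
            return d
    return max(DIRECTIONS,
               key=lambda d: totals[d] / counts[d]
               + math.sqrt(2 * math.log(t) / counts[d]))

for t in range(1, 501):
    d = ucb_pick(t)
    counts[d] += 1
    # Stand-in reward: 1 if the generated sequence "works", else 0.
    totals[d] += 1.0 if random.random() < DIRECTIONS[d] else 0.0

best = max(counts, key=counts.get)
```

Over enough rounds the bandit concentrates its pulls on the direction with the highest empirical success rate while still occasionally revisiting the others, which is exactly the exploration/exploitation balance the paper describes at the global level.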

[MA-8] ITAS: A Multi-Agent Architecture for LLM-Based Intelligent Tutoring

[Quick Read]: This paper addresses the systems challenges of deploying LLM-based tutoring in a real course: tutors that are easy to build in a notebook are hard to run reliably in actual instruction, and a "Blind Instructor Problem" arises in which the tutor accumulates more data about students than the instructor can reach, delaying pedagogical intervention. The key is a layered architecture. The teaching layer uses multi-agent collaboration (Video, Code, and Guidance specialist agents plus a Synthesizer) for task specialization and boundary control, avoiding the hallucinations that domain consolidation would risk. The operational layer uses cloud-native microservices (Cloud Run) and event streaming (Pub/Sub to BigQuery) to maintain state consistency and observability under concurrency. The feedback layer adds a narrow-scope conversational agent that analyzes pseudonymized per-lesson event streams and proactively surfaces key instructional insights, mitigating the Blind Instructor Problem. The architecture responds directly to failures of an earlier prototype and demonstrates one workable end-to-end design for running such a system in a real course.

Link: https://arxiv.org/abs/2604.24808
Authors: Iizalaarab Elhaimeur,Nikos Chrisochoides
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Companion papers: arXiv:Q-ID (Quantum deployment), arXiv:L-ID (Latency analysis)

Abstract:Large language model tutors are easy to build in a notebook and hard to run in a real course. We describe ITAS (Intelligent Teaching Assistant System), a multi-agent tutoring system that a graduate quantum computing course used for a semester at Old Dominion University. The system has three layers. The teaching layer is a Spoke-and-Wheel of three parallel specialist agents (Video, Code, Guidance) followed by a Synthesizer, plus a separate autograder that evaluates both the correctness and the approach of checkpoint submissions. The operational layer is four Cloud Run microservices with session state in Cloud SQL and interaction events streamed through Pub/Sub to BigQuery. The feedback layer is a narrow-scope conversational agent that answers instructor questions over per-lesson pseudonymized event streams, addressing what we call the Blind Instructor Problem: LLM tutors accumulate more data about students than the instructor can reach through routine channels. The architecture is a direct response to specific failures of an earlier prototype, and we describe which of those fixes carried forward and which were dropped for this iteration. We report on a pilot deployment (five students, one course, one semester) interpreted as system-behavior evidence rather than learning-outcome evidence: the teaching layer handled 334 chat turns without the task-boundary hallucinations that domain consolidation would have risked, the operational layer captured 10,628 events across five modules, and the feedback layer surfaced two findings the instructor acted on mid-semester. We do not claim the pilot generalizes. We do claim that the system as described is one workable answer to the question of what an LLM-based ITS needs to look like end-to-end to run in a real course.

[MA-9] From Prototype to Classroom: An Intelligent Tutoring System for Quantum Education

[Quick Read]: This paper addresses three challenges in quantum computing education: counterintuitive concepts, dense mathematical formalism, and scarce qualified faculty, especially at less-resourced institutions. The key is ITAS (Intelligent Teaching Assistant System), a multi-agent tutoring system built around four contributions: a five-module Quantum Information Science (QIS) curriculum grounded in Watrous's information-first framework; a Spoke-and-Wheel architecture with quantum-specialized agents; cloud infrastructure designed for production use and regulatory compliance; and a conversational analytics layer for instructors and content developers. A pilot deployment in a real course at Old Dominion University suggests that agent specialization addresses the task-boundary failures seen in the earlier prototype, supports classroom-scale concurrency, and surfaces actionable curriculum insights for the instructor, evidencing the reliability and utility of highly specialized agents in a technically demanding domain.

Link: https://arxiv.org/abs/2604.24807
Authors: Iizalaarab Elhaimeur,Nikos Chrisochoides
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages, 6 figures, 1 table. Submitted to IEEE QCE 2026. Companion papers (in preparation): ITAS architecture and latency analysis

Abstract:Quantum computing instructors face a compounding problem: the concepts are counterintuitive, the mathematical formalism is dense, and qualified faculty are scarce outside a small number of well-resourced institutions. Our prior work introduced a knowledge-graph-augmented tutoring prototype with two specialized LLM agents: a Teaching Agent for dynamic interaction and a Lesson Planning Agent for lesson generation. Validated on simulated runs rather than in a real course, that prototype left open whether more aggressive agent specialization would be needed to handle the full range of quantum education tasks under real student load. This paper answers the three questions that the prototype could not answer. Can agent specialization solve the reliability problem in a domain as technically demanding as quantum information science? Can the system run in a real course, not a demonstration? Does the instructor gain actionable intelligence from the deployment? We present ITAS (Intelligent Teaching Assistant System), a multi-agent tutoring system built around four contributions: a five-module QIS curriculum grounded in Watrous’s information-first framework, a Spoke-and-Wheel teaching architecture with quantum-specialized agents, a cloud infrastructure designed for production use and regulatory compliance, and a conversational analytics layer for instructors and content developers. Piloted in a quantum computing course at Old Dominion University, the system supports all three answers: deployment evidence is consistent with specialization addressing the task-boundary failures observed in the prototype, cloud infrastructure supports classroom-scale concurrency at sub-textbook cost, and the analytics agent surfaces curriculum gaps the instructor could not otherwise see.

[MA-10] GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems

[Quick Read]: This paper addresses the security issues that arise as Large Language Models (LLMs) are integrated into Multi-Agent Systems (MAS), notably prompt infection and compromised inter-agent communication, together with the lack of a standardized, reproducible environment for training and evaluating graph-based anomaly-detection defenses. The key is Gammaf, an open-source benchmarking framework built from two interdependent pipelines: a Training Data Generation stage that simulates debates across varied network topologies to produce richly attributed graphs, and a Defense System Benchmarking stage that evaluates defense models by dynamically isolating flagged adversarial nodes during live inference. Rather than proposing a new defense mechanism, Gammaf provides a unified, scalable, and efficient way to evaluate existing and future defense models, supporting systematic research on LLM-MAS security.

Link: https://arxiv.org/abs/2604.24477
Authors: Pablo Mateo-Torrejón,Alfonso Sánchez-Macián
Affiliations: University Carlos III of Madrid; OpenAI
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:The rapid integration of Large Language Models (LLMs) into Multi-Agent Systems (MAS) has significantly enhanced their collaborative problem-solving capabilities, but it has also expanded their attack surfaces, exposing them to vulnerabilities such as prompt infection and compromised inter-agent communication. While emerging graph-based anomaly detection methods show promise in protecting these networks, the field currently lacks a standardized, reproducible environment to train these models and evaluate their efficacy. To address this gap, we introduce Gammaf (Graph-based Anomaly Monitoring for LLM Multi-Agent systems Framework), an open-source benchmarking platform. Gammaf is not a novel defense mechanism itself, but rather a comprehensive evaluation architecture designed to generate synthetic multi-agent interaction datasets and benchmark the performance of existing and future defense models. The proposed framework operates through two interdependent pipelines: a Training Data Generation stage, which simulates debates across varied network topologies to capture interactions as robust attributed graphs, and a Defense System Benchmarking stage, which actively evaluates defense models by dynamically isolating flagged adversarial nodes during live inference rounds. Through rigorous evaluation using established defense baselines (XG-Guard and BlindGuard) across multiple knowledge tasks (such as MMLU-Pro and GSM8K), we demonstrate Gammaf’s high utility, topological scalability, and execution efficiency. Furthermore, our experimental results reveal that equipping an LLM-MAS with effective attack remediation not only recovers system integrity but also substantially reduces overall operational costs by facilitating early consensus and cutting off the extensive token generation typical of adversarial agents.
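
The "dynamically isolating flagged adversarial nodes" step reduces to a small graph operation. The agent names and graph shape below are invented for illustration (a real benchmark would operate on attributed graphs with message features, not bare adjacency sets):

```python
# Toy sketch of node isolation: given an agent-interaction graph and a set
# of agents flagged by a defense model, drop them and all their edges.
graph = {
    "planner":  {"coder", "critic"},
    "coder":    {"planner", "critic"},
    "critic":   {"planner", "coder"},
    "intruder": {"planner", "coder"},  # flagged adversarial agent
}

def isolate(graph, flagged):
    """Return the interaction graph with flagged agents removed entirely."""
    return {node: nbrs - flagged
            for node, nbrs in graph.items()
            if node not in flagged}

clean = isolate(graph, {"intruder"})
```

After isolation, subsequent inference rounds run only over the remaining agents, which is how the framework measures whether a defense restores system integrity and curbs adversarial token generation.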

Natural Language Processing

[NLP-0] Recursive Multi-Agent Systems

[Quick Read]: This paper addresses the scalability of collaboration in multi-agent systems (MAS): how recursion can improve the efficiency and performance of collaborative reasoning, given that conventional MAS rely on static interaction patterns and struggle with deep cross-agent knowledge transfer and dynamic optimization. The key is RecursiveMAS, which casts the entire multi-agent system as a unified latent-space recursive computation: a lightweight RecursiveLink module forms a collaboration loop across heterogeneous agents, enabling in-distribution latent-thought generation and cross-agent latent-state transfer. An inner-outer loop learning algorithm with shared gradient-based credit assignment co-optimizes the whole system across recursion rounds, with theoretical guarantees of training stability and runtime efficiency, delivering consistent accuracy gains, faster inference, and reduced token usage across diverse tasks.

Link: https://arxiv.org/abs/2604.25917
Authors: Xiyuan Yang,Jiaru Zou,Rui Pan,Ruizhong Qiu,Pan Lu,Shizhe Diao,Jindong Jiang,Hanghang Tong,Tong Zhang,Markus J. Buehler,Jingrui He,James Zou
Affiliations: UIUC; Stanford University; NVIDIA; MIT
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 36 pages. Project Website: this https URL

Abstract:Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend such scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thoughts generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2 \times -2.4 \times end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and Data are provided in this https URL.
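
The core idea of latent-space recursion (iterating a shared computation over a latent state while re-injecting the input) can be caricatured in a few lines. The linear blend below is a made-up stand-in for RecursiveLink, which the paper implements as a learned module:

```python
# Toy illustration of latent recursion: the same shared update is applied
# for several rounds, re-injecting the input each round. A real system
# would use a learned module here, not a fixed linear blend.
def recursive_refine(x, rounds=8, decay=0.5):
    state = [0.0] * len(x)
    for _ in range(rounds):
        # Shared per-round computation: blend current latent with the input.
        state = [decay * s + (1 - decay) * xi for s, xi in zip(state, x)]
    return state

latent = recursive_refine([1.0, -2.0, 0.5])
```

With this toy update the latent state converges geometrically toward the input, illustrating how repeated application of one shared computation can progressively refine a representation without growing the parameter count.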

[NLP-1] DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

[Quick Read]: This paper addresses three limitations of existing data visualization (DV) benchmarks relative to real-world use: confinement to code sandboxes, single-language creation-only tasks, and the assumption that user intent is perfectly known. The key is DV-World, a benchmark of 260 tasks covering realistic professional lifecycles through three domains: DV-Sheet (native spreadsheet manipulation), DV-Evolution (restructuring visualization artifacts across platforms), and DV-Interact (proactive intent alignment with a simulated user). A hybrid scoring scheme combines Table-value Alignment for numerical precision with MLLM-as-a-Judge for semantic-visual quality, better reflecting models' end-to-end competence in enterprise workflows.

Link: https://arxiv.org/abs/2604.25914
Authors: Jinxiang Meng,Shaoping Huang,Fangyu Lei,Jingyu Guo,Haoxiang Liu,Jiahao Su,Sihan Wang,Yao Wang,Enrui Wang,Ye Yang,Hongze Chai,Jinming Lv,Anbang Yu,Huangjing Zhang,Yitong Zhang,Yiming Huang,Zeyao Ma,Shizhu He,Jun Zhao,Kang Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University; Peking University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms; and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at this project page: this https URL

[NLP-2] A paradox of AI fluency

[Quick Read]: This paper addresses a key, underexplored question for users, AI product builders, and society at large: how does a user's fluency with AI shape what AI actually delivers for them? Analyzing 27K annotated conversations from WildChat-4.8M, the study finds that fluent users take on more complex tasks and adopt a collaborative, iterative mode of interaction, actively refining goals and critically assessing outputs, while novices take a passive stance. The key finding is a paradox of AI fluency: fluent users experience more visible failures (a direct consequence of their engagement), but these failures more often lead to partial recovery and co-occur with greater success on complex tasks; novices, by contrast, often experience "invisible failures", conversations that appear to end successfully but in fact miss the mark. The paper therefore argues that individuals should engage actively rather than passively, and that AI products should be designed to shape user behavior, encouraging deep engagement rather than merely friction-free experiences, to improve overall outcomes.

Link: https://arxiv.org/abs/2604.25905
Authors: Christopher Potts,Moritz Sudhof
Affiliations: Bigspin AI; Stanford University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:How much does a user’s skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices – but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at this https URL

[NLP-3] Toward a Functional Geometric Algebra for Natural Language Semantics

【Quick Read】: This paper addresses structural limitations of traditional distributional semantic models built on linear algebra (vector, matrix, and tensor representations) in compositional semantics, type sensitivity, and interpretability. The key to the solution is introducing geometric algebra (GA), specifically Clifford algebras, to build a Functional Geometric Algebra (FGA) framework. By expanding an n-dimensional embedding space into a 2^n-dimensional multivector algebra, the framework represents base semantic concepts and their higher-order interactions explicitly within a single, principled algebraic system, improving the structural organization of semantic representations while retaining compatibility with current distributional learning and neural network architectures.

Link: https://arxiv.org/abs/2604.25902
Authors: James Pustejovsky
Affiliations: Brandeis University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 43 pages. Keywords: geometric algebra, Clifford algebra, compositional semantics, natural language semantics, type coercion, multivector representations, graded type system, Generative Lexicon, neural language models, distributional semantics

Abstract:Distributional and neural approaches to natural language semantics have been built almost exclusively on conventional linear algebra: vectors, matrices, tensors, and the operations that accompany them. These methods have achieved remarkable empirical success, yet they face persistent structural limitations in compositional semantics, type sensitivity, and interpretability. I argue in this paper that geometric algebra (GA) – specifically, Clifford algebras – provides a mathematically superior foundation for semantic representation, and that a Functional Geometric Algebra (FGA) framework extends GA toward a typed, compositional semantics capable of supporting inference, transformation, and interpretability while retaining full compatibility with distributional learning and modern neural architectures. I develop the formal foundations, identify three core capabilities that GA provides and linear algebra does not, present a detailed worked example illustrating operator-level semantic contrasts, and show how GA-based operations already implicit in current transformer architectures can be made explicit and extended. The central claim is not merely increased dimensionality but increased structural organization: GA expands an n -dimensional embedding space into a 2^n multivector algebra where base semantic concepts and their higher-order interactions are represented within a single, principled algebraic framework.
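The abstract's central claim, expanding an n-dimensional space into a 2^n-dimensional multivector algebra, can be made concrete with a toy Clifford algebra. The sketch below is illustrative only and is not the paper's FGA construction: it implements the geometric product in Cl(2,0), where n = 2 gives multivectors with 2^2 = 4 components (scalar, e1, e2, and the bivector e12).

```python
# Toy geometric product in Cl(2,0): a multivector is [scalar, e1, e2, e12].
# Illustrative sketch only -- not the paper's FGA construction.
def geometric_product(A, B):
    s, a1, a2, b = A
    t, c1, c2, d = B
    return [
        s*t + a1*c1 + a2*c2 - b*d,   # scalar part (e1^2 = e2^2 = 1, e12^2 = -1)
        s*c1 + a1*t - a2*d + b*c2,   # e1 part
        s*c2 + a2*t + a1*d - b*c1,   # e2 part
        s*d  + b*t  + a1*c2 - a2*c1, # e12 (bivector) part
    ]

e1, e2 = [0, 1, 0, 0], [0, 0, 1, 0]
wedge = geometric_product(e1, e2)    # the bivector e1e2, a component with no
anti  = geometric_product(e2, e1)    # linear-algebra analogue; e2e1 = -e1e2
```

Note how a plain vector squares to its scalar norm (`geometric_product([0,3,4,0], [0,3,4,0])` yields `[25, 0, 0, 0]`), while products of distinct basis vectors land in the higher-grade bivector component: this graded structure is what the abstract means by representing base concepts and their interactions in one algebra.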

[NLP-4] Three Models of RLHF Annotation: Extension, Evidence and Authority

【Quick Read】: This paper addresses the lack of explicit ethical and operational norms for preference-based alignment methods such as Reinforcement Learning with Human Feedback (RLHF), namely the unclear role that human annotators play in shaping large language model behavior. It proposes three conceptual models of that role: extension, evidence, and authority, and argues that current RLHF practice often conflates these models in ways that lead to design flaws or ethical risks. The key recommendation is to decompose annotation into separable dimensions and tailor each dimension to the most appropriate model, rather than seeking a single unified pipeline, thereby improving the transparency, interpretability, and normative grounding of RLHF systems.

Link: https://arxiv.org/abs/2604.25895
Authors: Steve Coyne
Affiliations: University of Toronto
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages. Accepted to ACM FAccT '26, June 25-28, Montreal

Abstract:Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers’ own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.

[NLP-5] From Syntax to Emotion: A Mechanistic Analysis of Emotion Inference in LLMs

【Quick Read】: This paper investigates the poorly understood internal representations underlying emotion recognition in large language models (LLMs), in particular how emotion-related features emerge across layers and what causal influence they carry. The key to the solution is using sparse autoencoders (SAEs) to analyze sparse feature activations across layers, revealing a consistent three-phase information flow and showing that emotion representations comprise both features shared across emotions and emotion-specific features. Phase-stratified causal tracing further identifies a small set of features with strong influence on emotion predictions, whose number and causal impact vary by emotion (Disgust, for example, is more weakly and diffusely represented than other emotions). Building on this, the paper proposes an interpretable, data-efficient causal feature steering method that significantly improves emotion recognition across multiple models and datasets while largely preserving language modeling ability.

Link: https://arxiv.org/abs/2604.25866
Authors: Bangzhao Shu, Arinjay Singh, Mai ElSherief
Affiliations: Northeastern University
Categories: Computation and Language (cs.CL)
Comments: 18 pages including appendix

Abstract:Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, yet little is known about how emotion recognition is internally represented. In this work, we investigate the internal mechanisms of emotion recognition in LLMs using sparse autoencoders (SAEs). By analyzing sparse feature activations across layers, we identify a consistent three-phase information flow, in which emotion-related features emerge only in the final phase. We further show that emotion representations comprise both shared features across emotions and emotion-specific features. Using phase-stratified causal tracing, we identify a small set of features that strongly influence emotion predictions, and show that both their number and causal impact vary across emotions; in particular, Disgust is more weakly and diffusely represented than other emotions. Finally, we propose an interpretable and data-efficient causal feature steering method that significantly improves emotion recognition performance across multiple models while largely preserving language modeling ability, and demonstrate that these improvements generalize across multiple emotion recognition datasets. Overall, our findings provide a systematic analysis of the internal mechanisms underlying emotion recognition in LLMs and introduce an efficient, interpretable, and controllable approach for improving model performance.

[NLP-6] Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

【Quick Read】: This paper targets the poor generalization of machine-generated text (MGT) detectors that rely on model-specific fingerprints; the core challenge is identifying structurally invariant signals across generation models. The key to the solution is Luminol-AIDetect, a zero-shot statistical method built on a structural fragility induced by the autoregressive nature of LLMs: after random shuffling of a text, the perplexity distribution of MGT shifts in a way that differs markedly from the stability of human writing. The method extracts a handful of perplexity-based scalar features from an input text and its shuffled version, then performs detection via density estimation and ensemble-based prediction, achieving efficient, model-agnostic detection. Evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, it reaches state-of-the-art performance with up to 17x lower false positive rate (FPR) than prior methods, at lower computational cost.

Link: https://arxiv.org/abs/2604.25860
Authors: Lucio La Cava, Andrea Tagarelli
Affiliations: DIMES Dept., University of Calabria
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on model-specific fingerprints. In this respect, we hypothesize that while large language models excel at local semantic consistency, their autoregressive nature results in a specific kind of structural fragility compared to human writing. We propose Luminol-AIDetect, a novel, zero-shot statistical approach that exposes this fragility through coherence disruption. By applying a simple randomized text-shuffling procedure, we demonstrate that the resulting shift in perplexity serves as a principled, model-agnostic discriminant, as MGT displays a characteristic dispersion in perplexity-under-shuffling that differs markedly from the more stable structural variability of human-written text. Luminol-AIDetect leverages this distinction to inform its decision process, where a handful of perplexity-based scalar features are extracted from an input text and its shuffled version, then detection is performed via density estimation and ensemble-based prediction. Evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, Luminol-AIDetect demonstrates state-of-the-art performance, with gains up to 17x lower FPR while being cheaper than prior methods.
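The shuffle-then-score idea can be illustrated end to end with a tiny stand-in scorer. The sketch below substitutes an add-one-smoothed bigram model for the LLM perplexity the paper actually uses; that substitution, the toy corpus, and all names here are assumptions for illustration. A fluent sentence scores low, its random shuffles score higher, and the resulting shift/dispersion scalars are the kind of perplexity-based features the detector consumes.

```python
import math
import random
from collections import Counter

def bigram_model(corpus_tokens, alpha=1.0):
    # Add-alpha smoothed bigram LM: a tiny stand-in for the LLM scorer.
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    vocab = len(set(corpus_tokens))
    def logprob(prev, tok):
        return math.log((bigrams[(prev, tok)] + alpha) /
                        (unigrams[prev] + alpha * vocab))
    return logprob

def perplexity(tokens, logprob):
    lp = sum(logprob(p, t) for p, t in zip(tokens, tokens[1:]))
    return math.exp(-lp / max(len(tokens) - 1, 1))

def shuffle_features(tokens, logprob, n_shuffles=20, seed=0):
    # Perplexity-under-shuffling features: base ppl, mean shift, dispersion.
    rng = random.Random(seed)
    base = perplexity(tokens, logprob)
    shuffled = []
    for _ in range(n_shuffles):
        t = tokens[:]
        rng.shuffle(t)
        shuffled.append(perplexity(t, logprob))
    mean = sum(shuffled) / n_shuffles
    var = sum((x - mean) ** 2 for x in shuffled) / n_shuffles
    return {"ppl": base, "shift": mean - base, "dispersion": math.sqrt(var)}

corpus = "the cat sat on the mat and the dog sat on the rug".split() * 5
lm = bigram_model(corpus)
feats = shuffle_features("the cat sat on the mat".split(), lm)
# feats["shift"] > 0: shuffling breaks local coherence, raising perplexity
```

In the actual detector these scalars would feed the density-estimation and ensemble stage described in the abstract; the point of the toy is only the signal itself.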

[NLP-7] G-Loss: Graph-Guided Fine-Tuning of Language Models

【Quick Read】: This paper addresses the limitation that traditional loss functions (cross-entropy, contrastive, triplet, and supervised contrastive losses) used to fine-tune pre-trained language models such as BERT operate only within local neighborhoods and ignore global semantic structure. The key to the solution is G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to exploit structural relationships within the embedding manifold: it builds a document-similarity graph that captures global semantic relationships, guiding the model toward more discriminative and robust embeddings.

Link: https://arxiv.org/abs/2604.25853
Authors: Sharma Aditya, Agarwal Vinti, Kumar Rajesh
Affiliations: BITS Pilani, India; Bucknell University, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, Learning on Graphs (LoG2025)

Abstract:Traditional loss functions, including cross-entropy, contrastive, triplet, and supervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.
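A minimal sketch of the label-propagation ingredient. The Gaussian-similarity graph and the Zhou-style update rule F ← αSF + (1−α)Y are generic choices assumed here for illustration, not necessarily the paper's exact formulation: known labels spread along graph edges until unlabeled documents inherit their cluster's label.

```python
import math

def label_propagation(X, seeds, n_classes, alpha=0.9, iters=100):
    # X: document embeddings; seeds: {index: class}.
    # Gaussian similarity graph (zero diagonal), row-normalized.
    n = len(X)
    def sim(u, v):
        return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)))
    W = [[sim(X[i], X[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    S = [[w / sum(row) for w in row] for row in W]
    Y = [[1.0 if seeds.get(i) == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):  # F <- alpha * S @ F + (1 - alpha) * Y
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)]
             for i in range(n)]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]

# Two tight clusters; only one point per cluster is labeled.
X = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
preds = label_propagation(X, seeds={0: 0, 3: 1}, n_classes=2)
# preds == [0, 0, 0, 1, 1, 1]
```

In G-Loss the propagated distribution would enter the training objective rather than serve as a final classifier; this sketch only shows how graph structure turns two labels into six.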

[NLP-8] Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

【Quick Read】: This paper tackles the difficulty of automating harness engineering for coding agents: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token execution trajectories, and edits whose effects are hard to attribute. The core of the solution is the Agentic Harness Engineering (AHE) framework, which instruments the three stages of the engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation, making the action space explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that the evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction verified against the next round's task-level outcomes, turning each edit into a falsifiable contract. This design lets harness evolution proceed beyond blind trial-and-error toward continual improvement.

Link: https://arxiv.org/abs/2604.25850
Authors: Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui
Affiliations: Fudan University; Peking University; Shanghai Qiji Zhifeng Co., Ltd
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round’s outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round’s task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.

[NLP-9] PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

【Quick Read】: This paper addresses shortcomings of current depression patient simulators used in mental health training: insufficient behavioral realism, lack of diversity, and unscientific evaluation. Existing evaluations rely mainly on LLM-judges with poorly specified prompts and do not quantify behavioral diversity or clinical plausibility at the turn, dialogue, and population levels. The key to the solution is PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics across turn-, dialogue-, and population-level dimensions, systematically exposing biases in response length, lexical diversity, emotional trajectories, and behavioral variability, and showing that the simulation framework has a larger impact on fidelity than model scale.

Link: https://arxiv.org/abs/2604.25840
Authors: Nguyen Khoi Hoang, Shuhaib Mehri, Tse-An Hsu, Yi-Jyun Sun, Quynh Xuan Nguyen Truong, Khoa D Doan, Dilek Hakkani-Tür
Affiliations: University of Illinois Urbana-Champaign; VinUniversity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.

[NLP-10] Barriers to Universal Reasoning With Transformers (And How to Overcome Them) KR

【Quick Read】: This paper studies the length-generalization limits of Chain-of-Thought (CoT) Transformers: whether models remain reliable on CoT traces longer than those seen during training. It shows that, under standard positional encodings and a finite alphabet, length-generalizable Transformer expressivity is confined to TC^0, so true length generalization beyond that class is unattainable; however, by allowing the vocabulary to grow with problem size and introducing "signpost" tokens together with an encoding that logs only value changes, the paper gives a length-generalizable simulation of Turing machines in which CoT trace length is linear in the simulated runtime. The key ingredients are: (1) assigning each tape position a unique signpost token to circumvent the repeated-copying obstacle, and (2) recovering the current tape symbol through counts to overcome the last-occurrence retrieval barrier, which markedly improves length generalization on hard reasoning problems.

Link: https://arxiv.org/abs/2604.25800
Authors: Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Oliver Kraus and Yash Sarrof contributed equally as first authors. Alexander Koller and Michael Hahn are co-senior authors. Code: this https URL

Abstract:Chain-of-Thought (CoT) has been shown to empirically improve Transformers’ performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that – under standard positional encodings and a finite alphabet – Transformers with CoT cannot solve problems beyond TC^0, i.e., the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts, circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provides actionable guidance to improve length generalization on hard problems.
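The signpost-token idea can be sketched as a write-log over a simulated tape. This is a simplified reading of the construction: the paper recovers symbols "through counts", whereas the sketch below assumes a plain last-change lookup for illustration, and every name in it is made up. Each tape cell gets a unique signpost token, and only writes that change a cell's value are appended to the trace.

```python
def make_tape():
    # log plays the role of the CoT trace; shadow is only for change detection
    return {"log": [], "shadow": {}}

def write(tape, pos, val):
    # Log only value changes, tagged with the cell's unique signpost token.
    if tape["shadow"].get(pos) != val:
        tape["log"].append((f"<pos{pos}>", val))
        tape["shadow"][pos] = val

def read(tape, pos):
    # Recover the current symbol: most recent change logged for this signpost.
    for signpost, val in reversed(tape["log"]):
        if signpost == f"<pos{pos}>":
            return val
    return "_"  # blank: this cell was never written

tape = make_tape()
write(tape, 0, "a")
write(tape, 1, "b")
write(tape, 0, "a")   # redundant write: not logged
write(tape, 0, "c")
# log holds three entries; read(tape, 0) recovers "c"
```

Because every cell has its own signpost token, the trace never needs to re-copy the whole tape, and the current symbol is always recoverable from the log alone — the two barriers the abstract names.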

[NLP-11] Subliminal Steering: Stronger Encoding of Hidden Signals

【Quick Read】: This paper tackles three open questions in subliminal learning: the scope of transferable signals, the underlying mechanism, and the precision with which a bias is encoded; prior work was limited to single-word preferences and lacked mechanistic or precision analysis. The key to the solution is subliminal steering, a variant in which the teacher model's bias is implemented not via a system prompt, as in prior work, but through a trained steering vector. This variant transfers complex multi-word biases, and mechanistic analysis shows that the bias, along with the steering vector itself, is transferred precisely to the layers at which the teacher was steered; moreover, a steering vector retrained on the subliminally-laden data attains high cosine similarity with the original vector, indicating that the bias is encoded in seemingly unrelated data with high precision.

Link: https://arxiv.org/abs/2604.25783
Authors: George Morgulis, John Hewitt
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher’s bias is implemented not via a system prompt, as in prior work, but through a steering vector trained to maximize the likelihood of a set of target samples. First, we show that subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferrable signals. Second, we provide mechanistic evidence that subliminal learning transfers not only the target behavioral bias, but also the steering vector itself, localized to the layers at which the teacher was steered. Finally, we show that the bias is encoded with surprising precision. We train a new steering vector directly on the subliminally-laden dataset and find that it attains high cosine similarity with the original vector.
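A steering vector of this kind is typically applied by adding a fixed direction to the hidden states at one layer, and the diagnostic used in the abstract is cosine similarity between vectors. The toy sketch below shows both operations in pure Python; the machinery for hooking into a real model's residual stream is omitted, and the numbers are made up.

```python
def steer(hidden_states, vec, scale=1.0):
    # Add the steering vector to every token's hidden state at the chosen layer.
    return [[h + scale * v for h, v in zip(tok, vec)] for tok in hidden_states]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

hidden = [[0.2, -1.0, 0.5], [1.1, 0.3, -0.4]]   # two tokens, d_model = 3
vec = [0.0, 2.0, 0.0]                           # trained steering direction
steered = steer(hidden, vec)
shift = [s - h for s, h in zip(steered[0], hidden[0])]
# the induced shift points exactly along the steering vector: cosine 1.0
```

The paper's precision claim amounts to this cosine check at a larger scale: a vector retrained on the subliminally-laden data points in nearly the same direction as the original.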

[NLP-12] Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research LREC2026

【Quick Read】: This paper addresses the disconnect between motivation and practice in speech emotion recognition (SER) research: researchers often claim to be building technology for real-world settings such as healthcare or voice-interaction systems, yet the datasets used do not reflect those deployment contexts, creating a significant gap between stated goals and actual practice. The key recommendation is that SER research should reassert itself with concrete, verifiable use-cases so that motivations align with dataset choices and evaluation methods, preventing misinterpretation, misuse, and the resulting ethical concerns and downstream harms.

Link: https://arxiv.org/abs/2604.25776
Authors: Taryn Wong, Zeerak Talat, Hanan Aldarmaki, Anjalie Field
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to the Workshop on Computational Affective Science (CAS) at LREC 2026

Abstract:Critical analyses of emotion recognition technology have raised ethical concerns around task validity and potential downstream impacts, urging researchers to ensure alignment between their stated motivations and practice. However, these discussions have not adequately influenced or drawn from research on speech emotion recognition (SER). We address this gap by conducting a systematic survey of SER research to uncover what stated motivations drive this work and if they align with the datasets and emotions studied. We find that while SER research identifies appealing goals, such as well-situated voice-activated systems or healthcare applications, commonly-used datasets do not reflect these proposed deployment contexts, thus presenting a gap between motivations and research practices. We argue that such gaps engender ethical concerns, and that SER research should reassert itself with concrete use-cases to prevent misinterpretations, misuse, and downstream harms.

[NLP-13] CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation LREC2026 ALT

【Quick Read】: This paper addresses the problem of accurately estimating nutrients from unstructured recipe text, which is important for dietary monitoring but challenging due to ambiguous ingredient terminology and highly variable quantity expressions. The key to the solution is leveraging the generative reasoning of large language models (LLMs), which use pre-trained world knowledge to resolve ambiguous terms and normalize non-standard units, yielding markedly higher predictive accuracy. By contrast, lexical matching (TF-IDF) and shallow semantic encoders (DeBERTa-v3) are limited under data scarcity, while a hybrid LLM pipeline (TF-IDF combined with Gemini 2.5 Flash) maintains high accuracy with stronger robustness, at the cost of substantially higher inference latency, reflecting a trade-off between real-time efficiency and nutritional precision.

Link: https://arxiv.org/abs/2604.25774
Authors: Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the Third Workshop on Patient-oriented Language Processing (CL4Health) at LREC 2026

Abstract:Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.

[NLP-14] Toward Multimodal Conversational AI for Age-Related Macular Degeneration

【Quick Read】: This paper addresses the lack of clinical reasoning and interactive explanation in current deep learning models for retinal disease detection, where most systems produce static predictions that cannot support clinical decision-making or patient communication. The key to the solution is OcularChat, a multimodal large language model (MLLM) fine-tuned from Qwen2.5-VL on 705,850 simulated patient-physician dialogues, enabling visual question answering over color fundus photographs (CFPs), identification of key features of age-related macular degeneration (AMD), and clinically meaningful diagnostic reasoning with interactive explanation. Experiments show OcularChat significantly outperforms existing MLLMs on AMD severity classification and receives higher ratings from ophthalmologist graders, validating its accuracy, interpretability, and clinical utility.

Link: https://arxiv.org/abs/2604.25720
Authors: Ran Gu, Benjamin Hou, Mélanie Hébert, Asmita Indurkar, Yifan Yang, Emily Y. Chew, Tiarnán D. L. Keenan, Zhiyong Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 38 pages, 4 figures

Abstract:Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. OcularChat demonstrated strong classification performance in AREDS, achieving accuracies of 0.954, 0.849, and 0.678 for the three diagnostic tasks: advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. Across three independent ophthalmologist graders, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464) on a 5-point clinical grading rubric. Beyond strong objective performance in AMD severity classification, OcularChat demonstrated the ability to provide diagnostic reasoning, clinically relevant explanations, and interactive dialogue, with high performance in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.

[NLP-15] Cross-Lingual Jailbreak Detection via Semantic Codebooks

【Quick Read】: This paper addresses systematic safety vulnerabilities of large language models (LLMs) in multilingual deployment, where English-centric safety mechanisms fail against cross-lingual jailbreak attacks. The core question is how to detect and defend against multilingual malicious inputs without language-specific training or adaptation. The key to the solution is an external guardrail based on language-agnostic semantic similarity: multilingual query embeddings are compared against a fixed English codebook of jailbreak prompts, yielding a training-free, black-box external guardrail with general cross-lingual detection capability.

Link: https://arxiv.org/abs/2604.25716
Authors: Shirin Alanova, Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin
Affiliations: ITMO University; HiveTraceLab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift - on behaviorally diverse and heterogeneous unsafe benchmarks - separability degrades markedly (AUC \approx 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.

[NLP-16] Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

【Quick Read】: This paper addresses persistent translation errors in neural machine translation (NMT) systems trained on supervised parallel data. The key to the solution is a reinforcement-learning (RL) post-training paradigm that requires only a general text corpus and iterative feedback from an expert translator (human or AI); concretely, Direct Preference Optimization (DPO) is applied for preference-driven post-training of a pre-trained model. Experiments show this strategy substantially improves translation quality (raising the COMET score of gemma3-1b on English-to-German translation from 0.703 to 0.747), demonstrating an efficient and stable path to performance gains.

Link: https://arxiv.org/abs/2604.25702
Authors: Mehrdad Ghassabi, Spehr Rajabi, Hamidreza Baradaran Kashani, Sadra Hakim, Mahshid Keivandarian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures

Abstract:Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, which can be either a human or an AI system, to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, elevating its COMET score from 0.703 to 0.747 on the English-to-German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
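For reference, the DPO objective the abstract applies can be written in a few lines. The sketch below takes sequence log-probabilities of the preferred and dispreferred translations under the policy and the frozen reference model; the numeric values in the usage lines are made up, and beta = 0.1 is a common default, not a value taken from the paper.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid( beta * [ (log pi_c - log pi_r) - (log ref_c - log ref_r) ] )
    margin = beta * ((pi_chosen - pi_rejected) - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy has not yet moved off the reference, the loss is log 2.
neutral = dpo_loss(-6.0, -6.0, -6.0, -6.0)
# Once the policy prefers the chosen translation more than the reference does,
# the margin is positive and the loss drops below log 2.
improved = dpo_loss(-5.0, -9.0, -6.0, -6.0)
```

The gradient of this loss pushes probability mass toward the expert-preferred translation relative to the reference model, which is what allows post-training from preference pairs alone, without a reward model.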

[NLP-17] CORAL: Adaptive Retrieval Loop for Culturally-Aligned Multilingual RAG ACL2026

【Quick Read】: This paper addresses retrieval-generation misalignment in multilingual retrieval-augmented generation (mRAG) caused by mismatched cultural context: for culturally grounded queries, fixed-retrieval-space approaches (translation or cross-lingual embeddings) may fail to obtain evidence aligned with the target cultural context, hurting the cultural relevance of generated answers. The key to the solution is CORAL (COntext-aware Retrieval with Agentic Loop), an adaptive, agentic retrieval mechanism that iteratively refines both the retrieval space (corpora) and the retrieval probe (query) based on evidence quality: the system selects corpora and retrieves documents, critiques the evidence for relevance and cultural alignment, checks its sufficiency, and, when the evidence is insufficient, reselects corpora and rewrites the query, achieving more culturally aligned retrieval and generation.

Link: https://arxiv.org/abs/2604.25676
Authors: Nayeon Lee, Jiwoo Song, Byeongcheol Kang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures. Accepted at ACL 2026 (Findings)

Abstract:Multilingual retrieval-augmented generation (mRAG) is often implemented within a fixed retrieval space, typically via query or document translation or multilingual embedding vector representations. However, this approach may be inadequate for culturally grounded queries, in which retrieval-condition misalignment may occur. Even strong retrievers and generators may struggle to produce culturally relevant answers when sourcing evidence from inappropriate linguistic or regional contexts. To this end, we introduce CORAL (COntext-aware Retrieval with Agentic Loop, an adaptive retrieval methodology for mRAG that enables iterative refinement of both the retrieval space (corpora) and the retrieval probe (query) based on the quality of the evidence. The overall process includes: (1) selecting corpora, (2) retrieving documents, (3) critiquing evidence for relevance and cultural alignment, and (4) checking sufficiency. If the retrieved documents are insufficient to answer the query correctly, the system (5) reselects corpora and rewrites the query. Across two cultural QA benchmarks, CORAL achieves up to a 3.58%p accuracy improvement on low-resource languages relative to the strongest baselines.
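The five-step loop in the abstract can be sketched directly as control flow. Everything below is schematic with stub components: the corpus names, the scoring threshold, and the lambda stand-ins for the retriever, critic, and query rewriter are all assumptions for illustration, not CORAL's actual modules.

```python
def coral_loop(query, corpora, retrieve, critique, rewrite,
               threshold=0.5, max_iters=3):
    tried, docs, corpus = [], [], None
    for _ in range(max_iters):
        candidates = [c for c in corpora if c not in tried] or corpora
        corpus = candidates[0]             # (1) select corpus
        docs = retrieve(query, corpus)     # (2) retrieve documents
        score = critique(query, docs)      # (3) critique relevance / culture
        if score >= threshold:             # (4) sufficiency check
            return docs, corpus
        tried.append(corpus)               # (5) reselect corpus ...
        query = rewrite(query, docs)       #     ... and rewrite the query
    return docs, corpus

corpora = ["en_wiki", "ko_wiki"]
retrieve = lambda q, c: [f"{c}: {q}"]
critique = lambda q, docs: 0.9 if docs[0].startswith("ko_wiki") else 0.2
rewrite = lambda q, docs: q + " (refined)"
docs, corpus = coral_loop("kimchi history", corpora, retrieve, critique, rewrite)
# the loop abandons the English corpus and settles on the Korean one
```

The point of the structure is that both the retrieval space (which corpus) and the retrieval probe (the query text) are mutable within the loop, which is what distinguishes this from fixed-retrieval-space mRAG.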

[NLP-18] Modeling Human-Like Color Naming Behavior in Context

【Quick Read】: This paper addresses a geometric mismatch between the lexicons produced by neural agents simulating human color naming and human color categories: the color-term regions that emerge in model lexicons are highly non-convex, whereas human color categories are typically convex. The key to the solution is two mechanisms: upsampling rare color terms during the supervised learning (SL) phase to improve lexical diversity and system-level informativeness, and a multi-listener reinforcement learning (RL) interaction setup that strengthens communicative pressure and promotes more convex, human-like color categories. Experiments show that combining moderate upsampling with multi-listener interaction yields lexicons most similar to human systems.

Link: https://arxiv.org/abs/2604.25674
Authors: Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza
Affiliations: Center for Language and Cognition, University of Groningen; Department of Translation and Language Sciences, Universitat Pompeu Fabra; Leiden Institute of Advanced Computer Science, Leiden University; ICREA
Categories: Computation and Language (cs.CL)
Comments: Cognitive Science Society Annual Conference 2026

Abstract:Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.

[NLP-19] Progressing beyond Art Masterpieces or Touristic Clichés: how to assess your LLMs for cultural alignment? LREC2026

【Quick Read】: This paper addresses the lack of high-quality, systematically designed datasets for assessing the cultural alignment of large language models (LLMs), and in particular the limited ability of existing evaluations to distinguish models with genuine expertise in a given culture from those without it. The key contribution is a set of structured design guidelines for annotators, used to build a test set with stronger discriminative power; contrastive experiments confirm that this design significantly improves the test set's ability to separate culturally specialized from non-specialized models (ceteris paribus), providing a more reliable benchmark for quantifying cultural alignment.

Link: https://arxiv.org/abs/2604.25654
Authors: António Branco, João Silva, Nuno Marques, Luis Gomes, Ricardo Campos, Raquel Sequeira, Sara Nerea, Rodrigo Silva, Miguel Marques, Rodrigo Duarte, Artur Putyato, Diogo Folques, Tiago Valente
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: RESOURCEFUL-2026 Workshop at LREC 2026

Abstract:Although the cultural (mis)alignment of Large Language Models (LLMs) has attracted increasing attention – often framed in terms of cultural bias – until recently there has been limited work on the design and development of datasets for cultural assessment. Here, we review existing approaches to such datasets and identify their main limitations. To address these issues, we propose design guidelines for annotators and report on the construction of a dataset built according to these principles. We further present a series of contrastive experiments conducted with this dataset. The results demonstrate that our design yields test sets with greater discriminative power, effectively distinguishing between models specialized for a given culture and those that are not, ceteris paribus.

[NLP-20] The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

【Quick Read】: This paper addresses real-time verification and provenance of large language model (LLM) outputs: how to assess the trustworthiness of generated text efficiently without access to model internals or cryptographic watermarks. The central obstacle is that existing sampling-based detectors are too slow (estimated at up to 100,000x the latency of CPU-native processing) for large-scale deployment. The key insight is a statistical regularity in LLM outputs: regardless of vendor, scale, or domain, token rank-frequency distributions converge to the same two-parameter Mandelbrot distribution with excellent fit (R² > 0.94, favored over Zipf by AIC). This universality enables a CPU-native scoring primitive running at 2.6 microseconds per token that flags anomalous text (lexical anomalies, unsupported entities) and supports statistical model fingerprinting; it is positioned as a first-pass triage layer in compound evaluation stacks rather than a replacement for sampling-based or source-conditioned verifiers.

Link: https://arxiv.org/abs/2604.25634
Authors: Alex Bogdan, Adrian de Valois-Franklin
Affiliations: Evolutionairy AI
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 25 pages, 6 figures, 6 tables, 37 references. Code and data: this https URL

Abstract:We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000 \times (five orders of magnitude) below existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding R^2 = 0.94 and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross-model spread in q (1.63 to 3.69) exceeds its per-model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, yielding tens of standard deviations of separation per few thousand output tokens. Two capabilities follow. First, statistical model fingerprinting: text from a vendor-delivered LLM can be tested against its claimed model family without cryptographic watermarks or access to model internals, supporting provenance verification and silent-substitution audits. Second, a model-agnostic reference distribution for black-box output assessment, from which we derive a single-pass scoring primitive that composes with model log probabilities when available and degrades to a rank-only mode usable on closed APIs. Pilot results on FRANK, TruthfulQA, and HaluEval map where the primitive helps (lexical anomalies, unsupported entities) and where it structurally cannot (reasoning errors in domain-appropriate vocabulary). We position the primitive as a first-pass triage layer in compound evaluation stacks, not as a replacement for sampling-based or source-conditioned verifiers.
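The Mandelbrot-vs-Zipf comparison at the heart of this abstract can be sketched in a few lines: fit log f(r) = a - s·log(r + q) by least squares, treat Zipf as the q = 0 special case, and compare the two fits by AIC. This is an illustrative reconstruction, not the paper's code; the grid search over q and the parameter ranges are assumptions.

```python
import math

def log_linear_fit(ranks, freqs, q):
    """Least-squares fit of log(freq) = a - s*log(rank + q); returns (s, a, sse)."""
    xs = [math.log(r + q) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return -slope, intercept, sse

def aic(sse, n, k):
    """AIC for a Gaussian least-squares fit with k free parameters."""
    return n * math.log(sse / n + 1e-300) + 2 * k

def compare_zipf_mandelbrot(ranks, freqs):
    """Zipf is the q = 0 special case (2 params); Mandelbrot grid-searches q (3 params)."""
    _, _, sse_zipf = log_linear_fit(ranks, freqs, 0.0)
    best_q, sse_best = min(
        ((q / 10, log_linear_fit(ranks, freqs, q / 10)[2]) for q in range(101)),
        key=lambda t: t[1],
    )
    n = len(ranks)
    return best_q, aic(sse_zipf, n, 2), aic(sse_best, n, 3)
```

On synthetic rank-frequency data generated with a nonzero q, the Mandelbrot fit recovers q and wins the AIC comparison, mirroring the 35-of-36 result reported above.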

[NLP-21] WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

【Quick Read】: This paper targets the fundamental trade-off between transcription accuracy and computational efficiency in real-time automatic speech recognition (ASR), especially when deploying large Transformer models such as Whisper. Existing streaming approaches either sacrifice accuracy through aggressive audio chunking or incur prohibitive memory costs through unbounded context accumulation. The key innovations of the proposed WhisperPipe architecture are: (1) a hybrid voice activity detection (VAD) pipeline combining Silero VAD with energy-based filtering, reducing false activations by 34%; (2) a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries; and (3) an adaptive processing strategy that balances latency and accuracy based on speech characteristics. The design keeps word error rate (WER) within 2% of offline Whisper while achieving a median end-to-end latency of 89 ms, cutting peak GPU memory consumption by 48%, and maintaining stable, zero-growth memory usage over long sessions.

Link: https://arxiv.org/abs/2604.25611
Authors: Erfan Ramezani, Mohammad Mahdi Giahi, Mohammad Erfan Zarabadipour, Amir Reza Yosefian, Hamid Ghadiri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: 36 pages, 14 figures. Open-source implementation available at PyPI

Abstract:Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe demonstrates a median end-to-end latency of 89ms (90th percentile: 142ms) while consuming 48% less peak GPU memory and 80.9% lower average GPU utilization compared to baseline Whisper implementations. The system maintains stable memory usage over extended sessions, with zero growth rate across 150-minute continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3-5x lower latency than existing streaming solutions. The architecture’s modular design enables deployment across resource-constrained environments, from edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.
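The energy-based half of the hybrid VAD described above can be illustrated with a simple RMS gate. This is a generic sketch of the technique, not the released WhisperPipe implementation; the threshold, frame length, and hangover values are placeholders.

```python
import math

def frame_rms(samples, frame_len):
    """Chop a waveform into fixed-length frames and return per-frame RMS energy."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in starts
    ]

def energy_vad(samples, threshold=0.05, hangover=3, frame_len=400):
    """Flag a frame as speech when its RMS exceeds the threshold; keep
    `hangover` extra frames active after speech ends so trailing phonemes
    at segment boundaries are not clipped."""
    flags, remaining = [], 0
    for energy in frame_rms(samples, frame_len):
        if energy >= threshold:
            flags.append(True)
            remaining = hangover
        elif remaining > 0:
            flags.append(True)
            remaining -= 1
        else:
            flags.append(False)
    return flags
```

In a hybrid pipeline, such an energy gate would run as a cheap pre-filter, with a neural VAD like Silero confirming the frames it lets through.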

[NLP-22] Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS, and LLM Evaluation

【Quick Read】: This paper examines the systemic epistemic problems caused by the over-reliance of NLP, computational social science (CSS), and LLM evaluation research on a single proprietary toxicity detection tool, the Perspective API. The core problems: the tool was updated without versioning or disclosure, its annotation scheme reflected a single corporation's operationalisation of a contested concept, and its scores were used simultaneously as evaluation target and evaluation standard, leaving behind non-updatable benchmarks and irreproducible results that threaten the credibility and sustainability of the field. The key proposal is an independent, valid, adaptable, and reproducible measurement infrastructure for toxicity and hate speech, with explicit technical and governance requirements, so that the community does not repeat these mistakes by turning to closed-source LLMs.

Link: https://arxiv.org/abs/2604.25580
Authors: David Hartmann, Manuel Tonneau, Angelie Kraft, LK Seiling, Dimitri Staufer, Pieter Delobelle, Jan Fillies, Anna Ricarda Luther, Jan Batzner, Mareike Lisker
Affiliations: Weizenbaum Institute; TU Berlin; University of Oxford; KU Leuven; Pleias; Freie Universität Berlin; University of Bremen; ifib research; TU Munich; Munich Center for Machine Learning; HTW Berlin; University of Hamburg
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 1 figure, 1 table

Abstract:The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts. Perspective’s model was periodically updated without versioning or disclosure, its annotation structure reflected a single corporate operationalisation of a contested concept, and its scores were used simultaneously as an evaluation target and an evaluation standard. Its closure leaves behind non-updatable benchmarks, irreproducible results, and ultimately a field at risk of perpetuating these issues by turning to closed-source LLMs. We use Perspective’s announced termination as an opportunity to call for an independent, valid, adaptable, and reproducible toxicity and hate speech measurement infrastructure, with the technical and governance requirements outlined in this paper.

[NLP-23] Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

【Quick Read】: This paper addresses the difficulty of balancing computational efficiency and multilingual performance in large language models. Dense models perform well but carry redundant parameters that make training and inference costly, and they suffer from cross-lingual interference that limits flexible language expansion. The key solution is Marco-MoE, a fully open sparse Mixture-of-Experts (MoE) suite that activates only about 5% of its total parameters per input token and is efficiently pre-trained on 5T tokens by upcycling from dense models. This design delivers a best-in-class performance-to-compute ratio, learns structured expert activation patterns shared across related languages while keeping highly specialized experts for linguistically isolated ones, and supports scalable, interference-free language expansion.

Link: https://arxiv.org/abs/2604.25578
Authors: Fan Jiang, Yu Zhao, Chenyang Lyu, Tianqi Shi, Yichao Du, Feihu Jiang, Longyue Wang, Weihua Luo
Affiliations: Alibaba International Digital Commerce
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-\textscInstruct variants, which surpass the performance of competing models possessing 3 – 14\times more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.
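The sparsity arithmetic behind "only around 5% of parameters activated per token" follows from top-k expert routing. The sketch below is a generic illustration of MoE gating, not Marco-MoE's actual router; the expert counts and parameter sizes are invented numbers chosen to land near 5%.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_route(logits, k=2):
    """Keep only the k highest-scoring experts for this token and
    renormalize their gate weights; every other expert stays inactive."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in order])
    return dict(zip(order, weights))

def activated_fraction(n_experts, k, params_per_expert, shared_params):
    """Fraction of total parameters touched per token under top-k routing."""
    total = n_experts * params_per_expert + shared_params
    active = k * params_per_expert + shared_params
    return active / total
```

With, say, 64 experts of 10M parameters each, top-2 routing, and 12M shared (attention/embedding) parameters, roughly 5% of the model is active per token.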

[NLP-24] From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation

【Quick Read】: This paper addresses the limits of large language models (LLMs) in narrative generation for complex, multi-layered role-playing game (RPG) worlds, namely insufficient coherence, controllability, and structural consistency. The key solution is a dependency-aware, multi-stage prompt pipeline that models narrative dependencies through structured intermediate representations and decomposes generation into five ordered stages: world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion, each conditioned on the structured JSON output of the previous stage. By enforcing schemas and explicit data flow, the design reduces narrative drift and hallucinations and supports scalable creation of interconnected narrative elements, maintaining stability and logical soundness as complexity grows without degrading quality.

Link: https://arxiv.org/abs/2604.25482
Authors: Dominik Borawski, Marta Szulc, Robert Chudy, Małgorzata Giedrowicz, Piotr Mironowicz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 1 figure, 5 listings

Abstract:Large Language Models (LLMs) have shown strong potential for narrative generation, but their use in complex, multi-layered role-playing game (RPG) worlds is still limited by issues of coherence, controllability, and structural consistency. This paper explores a dependency-aware, multi-stage prompt pipeline for procedural RPG content generation that models narrative dependencies through structured intermediate representations. The approach decomposes generation into sequential stages: world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion. Each stage conditions on structured JSON outputs from previous stages. By enforcing schemas and explicit data flow, the pipeline reduces narrative drift, limits hallucinations, and supports scalable creation of interconnected narrative elements. The system is evaluated qualitatively through human-centered analysis across multiple independent runs. Outputs are assessed using criteria such as structural completeness, internal consistency, narrative coherence, diversity, and actionability. Results show that the pipeline consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases. Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling. These findings suggest that dependency-aware prompt pipelines with structured intermediate representations are an effective design pattern for LLM-based procedural content generation. This approach may also generalize to other domains requiring sequential reasoning over evolving contextual states.
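The staged, schema-enforced data flow described above can be mocked up as follows. The stage names, required keys, and the canned `call_llm` stub are all hypothetical; a real pipeline would prompt an LLM with the accumulated JSON context at each stage and parse its reply.

```python
import json

def call_llm(stage, context):
    """Stub standing in for an LLM call. The canned outputs are hypothetical;
    a real implementation would build a prompt from `context` here."""
    canned = {
        "world": {"name": "Eldoria", "regions": ["Ashfen", "Mirewood"]},
        "npcs": {"npcs": [{"name": "Bram", "region": "Ashfen"}]},
        "campaign": {"quests": [{"id": "q1", "giver": "Bram"}]},
    }
    return json.dumps(canned[stage])

REQUIRED_KEYS = {"world": {"name", "regions"}, "npcs": {"npcs"}, "campaign": {"quests"}}

def run_stage(stage, context):
    """Run one stage, validate its output against the stage schema, and
    store it so later stages can condition on it."""
    out = json.loads(call_llm(stage, context))
    missing = REQUIRED_KEYS[stage] - out.keys()
    if missing:
        raise ValueError(f"stage '{stage}' output missing keys: {missing}")
    context[stage] = out
    return context

def run_pipeline():
    context = {}
    for stage in ("world", "npcs", "campaign"):
        run_stage(stage, context)
    # Cross-stage dependency check: every quest giver must be an existing NPC.
    npc_names = {n["name"] for n in context["npcs"]["npcs"]}
    if not all(q["giver"] in npc_names for q in context["campaign"]["quests"]):
        raise ValueError("dangling quest giver")
    return context
```

The schema check and the cross-stage referential check are what keep later stages from drifting away from entities established earlier.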

[NLP-25] PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

【Quick Read】: This paper addresses the inability of standard text-to-speech (TTS) metrics to quantify accent: WER, CER, MOS, and UTMOS measure intelligibility and naturalness but miss features that are phonemic in Indic languages and shape native-speaker perception, such as retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (zha). The key contribution is PSP (Phoneme Substitution Profile), an interpretable per-phonological-dimension accent benchmark that decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment combined with native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the last two are corpus-level distributional distances. This enables fine-grained per-dimension accent analysis of TTS systems and reveals how commercial and open-source models diverge across dimensions.

Link: https://arxiv.org/abs/2604.25476
Authors: Venkata Pushpak Teja Menta
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments: 8 pages, 7 tables. Companion paper to Praxy Voice (arXiv:submission id - 7506231). Code: this https URL Centroids: this https URL

Abstract:Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5-R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi Telugu Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering – commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
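A native-speaker-centroid acoustic probe of the kind used for the RR/AF/LF/ZF dimensions can be reduced to a nearest-centroid decision over embeddings. This sketch assumes cosine similarity and two phoneme-class centroids; the paper's actual probes operate on forced-aligned Wav2Vec2-XLS-R layer-9 embeddings, and the toy vectors below are illustrative only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def collapse_rate(embeddings, retroflex_centroid, dental_centroid):
    """Fraction of retroflex-aligned tokens whose embedding is closer (by
    cosine similarity) to the dental centroid than to the retroflex one,
    i.e. tokens where the retroflex contrast has collapsed."""
    collapsed = sum(
        1
        for e in embeddings
        if cosine(e, dental_centroid) > cosine(e, retroflex_centroid)
    )
    return collapsed / len(embeddings)
```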

[NLP-26] An Investigation of Linguistic Biases in LLM -Based Recommendations

【Quick Read】: This paper investigates linguistic biases in LLM-based restaurant and product recommendations across dialects, specifically Southern American English (AE), Indian English (IE), and Code-Switched Hindi-English prompts. The key experimental design is a zero-shot cold-start framework: candidate lists balanced by cuisine type and product category are added to the prompts, and sampling is repeated across 20 random seeds for better generalization; aggregated recommendation counts are then analyzed with mixed-effects regression models and likelihood ratio tests (with post-hoc pairwise testing) to quantify how dialect type and model size interact to shape group-level differences in recommendation behavior across model families and sizes.

Link: https://arxiv.org/abs/2604.25456
Authors: Nitin Venkateswaran, Jason Ang, Deep Adhikari, Tarun Krishna Dasari
Affiliations: University of Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We investigate linguistic biases in LLM-based restaurant and product recommendations given prompts varying across Southern American English (AE), Indian English (IE), and Code-Switched Hindi-English dialects, using the Yelp Open dataset (Yelp Inc., 2023) and Walmart product reviews dataset (PromptCloud,2020). We add lists of restaurant and product names balanced by cuisine type and product category to the prompts given to the LLM, and we zero-shot prompt the LLMs in a cold-start setting to select the top-20 restaurant and product recommendations from these lists for each of the dialect-varied prompts. We prompt LLMs using different list samples across 20 seeds for better generalization, and aggregate per cuisine-type and per category response counts for each seed, question/prompt, and LLM model. We run mixed-effects regression models for each model family and topic (restaurant/product) with the aggregate response counts as the dependent, and conduct likelihood ratio tests for the fixed effects with post-hoc pairwise testing of estimated marginal means differences, to investigate group-level differences in recommendation counts by model size and dialect type. Results show that dialect plays a role in the type of restaurant selected across the models tested with the mistral-small-3.1 model and both the llama-3.1 family models tested showing more sensitivity to Indian English and Code-Switched prompts. In terms of product recommendations, the llama-3.1-70B-model is particularly sensitive to Code-Switched prompts in four out of seven categories, and more beauty and home category recommendations are seen when using the Indian English and Code-Switched prompts for larger and smaller models, respectively. No broad trends are seen in the model-size based differences, with differing recommendations based on model sizes conditioned by the type of dialect.

[NLP-27] Benchmarking Logistic Regression SVM and LightGBM Against BiLSTM with Attention for Sentiment Analysis on Indonesian Product Reviews

【Quick Read】: This paper addresses sentiment analysis of e-commerce product reviews, with the goal of automatically gauging customer satisfaction and giving sellers actionable insights for improving product quality. The key finding comes from a systematic comparison of classical machine learning (ML) and deep learning (DL) approaches for binary sentiment classification of Indonesian product reviews: on a balanced dataset of 19,728 samples, a Logistic Regression (LR) model built with the PyCaret AutoML framework reaches 97.26% accuracy and F1-score, marginally outperforming a DL model based on a Bidirectional Long Short-Term Memory (BiLSTM) network with attention (97.24% accuracy, 97.24% F1). This shows that, with proper preprocessing and feature extraction, classical ML algorithms can compete closely with complex sequential DL architectures on high-dimensional text data while offering greater computational efficiency.

Link: https://arxiv.org/abs/2604.25452
Authors: Razin Hafid Hamdi, Ivana Margareth Hutabarat, Hanna Gresia Sinaga, Luluk Muthoharoh, Ardika Satria, Martin C.T. Manullang
Affiliations: Institut Teknologi Sumatera
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 2 figures. Benchmarking study comparing PyCaret-based machine learning models (Logistic Regression, SVM, LightGBM) with a BiLSTM+Attention model for sentiment analysis on Indonesian product reviews

Abstract:Sentiment analysis of product reviews on e-commerce platforms plays a critical role in automatically understanding customer satisfaction and providing actionable insights for sellers seeking to improve product quality. This paper presents a comprehensive benchmarking study comparing a Machine Learning (ML) approach via the PyCaret AutoML framework against a Deep Learning (DL) approach based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture with an Attention mechanism for binary sentiment classification on Indonesian product reviews. The dataset comprises 19,728 samples balanced equally between positive and negative reviews. For the ML approach, three prominent algorithms were evaluated via 10-fold stratified cross-validation: Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, and Light Gradient Boosting Machine (LightGBM). Logistic Regression achieved the best ML performance with an accuracy of 97.26% and an F1-score of 97.26%. The BiLSTM with Attention model, evaluated on 3,946 held-out test samples, achieved an accuracy of 97.24% and an F1-score of 97.24%. These comparative results demonstrate that traditional ML algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential DL architectures on high-dimensional datasets, while simultaneously offering greater computational efficiency.
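To make the classical-ML side concrete, here is a from-scratch bag-of-words logistic regression in the spirit of the benchmarked LR baseline. The study itself uses PyCaret with proper preprocessing; the toy Indonesian vocabulary and training texts below are invented for illustration.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(w, 0) for w in vocab]

def train_logreg(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # sigmoid(z) - label
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(text, vocab, w, b):
    z = sum(wj * xj for wj, xj in zip(w, featurize(text, vocab))) + b
    return 1 if z > 0 else 0
```

Even this minimal setup separates clearly polarized vocabulary, which is part of why linear models remain competitive on review sentiment.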

[NLP-28] Navigating Global AI Regulation: A Multi-Jurisdictional Retrieval-Augmented Generation System LREC2026

【Quick Read】: This paper tackles the complexity and fragmentation of AI regulation across jurisdictions, which makes it hard for policymakers, legal professionals, and researchers to retrieve and compare AI governance rules efficiently. The key contributions of the proposed multi-jurisdictional Retrieval-Augmented Generation (RAG) system are: type-specific chunking that preserves legal structure across heterogeneous documents; conditional retrieval routing with entity detection and metadata to resolve legal citations precisely; and priority-based re-ranking that boosts enacted legislation over policy documents and secondary sources. Evaluation on 50 queries shows strong faithfulness (0.87 average) and answer relevancy (0.84 average) on both single-entity and multi-jurisdictional comparison queries, confirming the effectiveness of domain-specific retrieval strategies over complex, heterogeneous regulatory corpora.

Link: https://arxiv.org/abs/2604.25448
Authors: Courtney Ford, Ojas Rane, Susan Leavy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint. Accepted at PoliticalNLP Workshop, LREC 2026. 10 pages, 1 figure

Abstract:Navigating AI regulation across jurisdictions is increasingly difficult for policymakers, legal professionals, and researchers. To address this, we present a multi-jurisdictional Retrieval-Augmented Generation system for global AI regulation. Our corpus includes 242 documents across 68 jurisdictions, ranging from formal legislation like the EU AI Act to unstructured policy documents such as national AI strategies. The system makes three technical contributions: type-specific chunking that preserve legal structure across heterogenous documents; conditional retrieval routing with entity detection and metadata for legal citations; and priority-based re-ranking to boost enacted legislation over policy and secondary sources. Evaluation of 50 queries reveals strong performance across both single-entity and multi-jurisdictional questions, achieving 0.87 average faithfulness and 0.84 average answer relevancy. Single-entity queries achieve 0.86 average faithfulness and 0.92 average answer relevancy, while multi-jurisdictional comparison queries achieve 0.88 average faithfulness and 0.75 average answer relevancy. These findings highlight the effectiveness of domain-specific retrieval strategies for navigating complex, heterogenous regulatory corpora.
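Priority-based re-ranking, the third contribution, amounts to multiplying retrieval similarity by a document-type boost before sorting. The boost factors and document names below are illustrative placeholders, not the paper's tuned values.

```python
# Document-type boosts: enacted legislation outranks policy documents,
# which outrank secondary sources. Factors are illustrative placeholders.
PRIORITY_BOOST = {"legislation": 1.3, "policy": 1.1, "secondary": 1.0}

def rerank(hits):
    """hits: list of (doc_id, doc_type, similarity) tuples from the retriever.
    Returns doc_ids ordered by similarity multiplied by the type boost."""
    scored = [(sim * PRIORITY_BOOST[doc_type], doc_id) for doc_id, doc_type, sim in hits]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

The effect: a statute that scores slightly below a commentary on raw similarity can still surface first, which matters when users need binding law rather than secondary discussion.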

[NLP-29] One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement ACL26

【Quick Read】: This paper addresses why large language models (LLMs) often fail to activate their latent reasoning abilities on ambiguous human queries: a distributional mismatch between how humans phrase questions and the structured logic the model needs. Existing alignment methods either incur O(N) cost by fine-tuning every model individually or rely on static prompts that cannot resolve query-level structural complexity. The key idea of ReQueR (Reinforcement Query Refinement) is to treat reasoning elicitation as an inference-time alignment task: a dedicated Refiner policy is trained with reinforcement learning to rewrite raw queries into explicit logical decompositions, treating the frozen LLM as the environment and leaving it unchanged. Training is stabilized by an Adaptive Solver Hierarchy, rooted in the "zone of proximal development" from educational psychology, which dynamically matches environment difficulty to the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7%-7.2% across architectures and benchmarks and generalizes strongly: a single Refiner trained on a small set of models can effectively unlock reasoning in unseen models.

Link: https://arxiv.org/abs/2604.25444
Authors: Yixiao Zhou, Dongzhou Cheng, Zhiliang Wu, Yi Yang, Yu Cheng, Hehe Fan
Affiliations: Zhejiang University; Shanghai Innovation Institute; Southeast University; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL26

Abstract:Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive O(N) costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (\textbfReinforcement \textbfQuery \textbfRefinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner’s evolving competence. ReQueR yields consistent absolute gains of 1.7%–7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at this https URL.

[NLP-30] Praxy Voice: Voice-Prompt Recovery BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

【Quick Read】: This paper addresses the gap between open-source text-to-speech (TTS) systems and commercial-class quality for Indic languages (Telugu, Tamil, Hindi): the most widely adopted multilingual base model (Chatterbox, non-Indic-native) cannot even tokenise Telugu or Tamil, lacks adequate phonological modeling for these languages, and no commercial TTS training data is available. The key solution has three parts: (1) BUPS (Brahmic Unified Phoneme Space), which deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter trained only on the text-token predictor (Chatterbox's t3) with ~1,220h of licensed Indic audio and a Hindi-proxy language_id; and (3) a voice-prompt recovery recipe that achieves commercial-class acoustic output without training an acoustic decoder, using an 8-11s same-language reference clip plus specific sampling settings (exaggeration 0.7, temperature 0.6, min_p 0.1). The system matches or slightly leads commercial baselines on several metrics, and a third branch sharply improves intra-sentential code-mix performance.

Link: https://arxiv.org/abs/2604.25441
Authors: Venkata Pushpak Teja Menta
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 6 figures, 6 tables. Companion paper to PSP benchmark. Code: this https URL ; Model: this https URL ; Demo: this https URL

Abstract:Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox’s Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox’s t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe – an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; “Config B”) – that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio’s 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.

[NLP-31] Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives ACL2026

【Quick Read】: This paper asks whether large language models (LLMs) truly acquire embodied cognition and cultural conventions from text alone, using demonstratives (English "this/that", Chinese "zhè/nà") as a probe for spatial reference grounded in egocentric versus sociocentric perspectives. A human baseline built from 6,400 responses by 320 native speakers shows that English speakers reliably distinguish proximal from distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to internalize the proximal-distal contrast and show no cultural variation, defaulting to English-centric reasoning, which exposes systematic gaps in current models' embodied cognition and cultural adaptation.

Link: https://arxiv.org/abs/2604.25423
Authors: Yu Wang, Emmanuele Chersoni, Chu-Ren Huang
Affiliations: The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026

Abstract:Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like “this/that” in English and “zhè/nà” in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.

[NLP-32] Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

Quick Read: This paper addresses the scaling robustness of the Probabilistic Transformer (PT): its hyperparameters are sensitive to model size, making efficient scaling difficult. The key to the solution is applying Maximal Update Parametrization (muP) to rescale PT's parameters so that hyperparameters tuned on small models transfer directly to larger models without further tuning. With this approach, PT is successfully scaled to 0.4B parameters and outperforms a standard Transformer on masked language modeling (MLM) tasks.

Link: https://arxiv.org/abs/2604.25409
Authors: Penghao Kuang, Haoyi Wu, Kewei Tu
Affiliations: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small to medium sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently. In this work, we follow Maximal Update Parametrization (muP) to rescale PT’s parameters, so that hyperparameters optimized on small models can be transferred to larger models without additional tuning. With this approach, we successfully scale PT to models with up to 0.4B parameters. Experiments show that PT consistently outperforms standard transformer under the same parameter budget on Masked Language Modeling (MLM) tasks. We hope this work will contribute to the practical deployment of probabilistic models at substantially larger scales in the future.
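The core idea of cross-scale hyperparameter transfer can be illustrated with a minimal sketch (an assumption-laden simplification: muP prescribes width-dependent rescalings for initializations and per-layer learning rates; only the learning-rate rule for hidden, matrix-like layers is shown, and the paper's actual parametrization of PT is not reproduced here):

```python
def mup_lr(base_lr: float, base_width: int, width: int) -> float:
    """muP-style learning-rate rule for hidden (matrix-like) layers:
    the effective LR shrinks in proportion to the width multiplier,
    so a value tuned at base_width stays near-optimal at larger widths."""
    return base_lr * base_width / width

# Tune once at a small base width...
base_lr, base_width = 1e-2, 256
# ...then transfer to larger models without re-tuning.
for width in (256, 1024, 4096):
    print(width, mup_lr(base_lr, base_width, width))
```

Under this rule the small-model sweep is reused verbatim at every target width, which is what makes tuning at 0.4B scale affordable.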

[NLP-33] Benchmarking PyCaret AutoML Against IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian IKN Twitter Data

Quick Read: This paper tackles binary sentiment analysis of Indonesian social-media text (tweets about Ibu Kota Nusantara). The key to the solution is a head-to-head comparison between classical machine learning (Logistic Regression, Naive Bayes, and Support Vector Machine via PyCaret AutoML) and deep learning (fine-tuning IndoBERT). Experiments show that the fine-tuned IndoBERT clearly outperforms the classical baselines in test accuracy (89.59% vs. at most 77.57%) and F1-score (89.37% vs. 77.17%), highlighting the effectiveness of Transformer-based contextual representations for informal Indonesian social-media text.

Link: https://arxiv.org/abs/2604.25392
Authors: Mutia Alfi Mayzaroh, Dwi Fitria Ningsih, Nindi Destriani, Martin C.T. Manullang
Affiliations: Institut Teknologi Sumatera
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures, 4 tables. Presented as a benchmarking study on Indonesian sentiment analysis using PyCaret and IndoBERT


Abstract:This paper benchmarks a classical machine learning approach based on PyCaret AutoML against a deep learning approach based on IndoBERT fine-tuning for binary sentiment analysis of Indonesian-language Twitter comments related to Ibu Kota Nusantara (IKN). The dataset contains 1,472 manually labeled samples, consisting of 780 negative and 692 positive comments. In the machine learning setting, Logistic Regression, Naive Bayes, and Support Vector Machine were evaluated using 10-fold cross-validation, with Logistic Regression achieving the best performance among the classical models at 77.57% accuracy and 77.17% F1-score. In the deep learning setting, the indobenchmark/indobert-base-p1 model was fine-tuned for five epochs and achieved 89.59% test accuracy and 89.37% F1-score. The results show that IndoBERT substantially outperforms the machine learning baselines, highlighting the effectiveness of Transformer-based contextual representations for informal Indonesian social media text.

[NLP-34] Wiki Dumps to Training Corpora: South Slavic Case

Quick Read: This paper addresses the construction of high-quality text corpora from raw Wikimedia dumps for seven South Slavic languages. The central challenge is the presence of low-quality content in the raw data, such as repetitive, template-generated articles produced from databases or structured knowledge bases, which lack originality and linguistic diversity and can harm downstream language-model training or cross-lingual research. The key to the solution is a two-phase approach: the first phase systematically extracts and cleans text from Wikipedia, Wikisource, Wikibooks, and related projects, stripping wiki markup to retain natural-language content; the second phase applies an n-gram-based redundancy detection strategy to identify and remove articles with high textual repetition, ensuring the final corpora are linguistically rich and culturally authentic. The approach is largely language-agnostic and generalizes to other languages and language families.

Link: https://arxiv.org/abs/2604.25384
Authors: Mihailo Škorić
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote, where available. This step requires careful handling of raw wiki markup to isolate, first of all, textual articles, and then usable natural language text within them. The second phase addresses the challenge of suspicious or low-quality articles, which are often generated from databases or structured knowledge bases. These articles are characterised by repetitive patterns, generic phrasing, and minimal to no original content. To mitigate their impact, a n-gram-based filtering strategy was employed to detect high levels of textual redundancy between articles and then remove such articles from the corpora entirely. The resulting datasets aim to provide linguistically rich texts suitable for training language models or conducting comparative research across South Slavic languages. By combining systematic extraction with quality control, this work contributes to the creation of reliable, high-information corpora that reflect authentic language use and cultural context. While focused on the South Slavic case in the paper, the approach is mostly language-agnostic and can be generalised to other languages and language families.
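The second-phase redundancy filter can be sketched as follows (a hypothetical illustration with made-up threshold and n-gram order; the paper's exact detection procedure is not specified here): articles whose word n-gram sets overlap heavily with an already-kept article are dropped.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Word n-grams of a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_redundant(articles, n=3, threshold=0.5):
    """Keep an article only if its n-gram overlap with every
    previously kept article stays below the threshold."""
    kept, kept_grams = [], []
    for art in articles:
        g = ngrams(art, n)
        if all(jaccard(g, kg) < threshold for kg in kept_grams):
            kept.append(art)
            kept_grams.append(g)
    return kept

articles = [
    "the village lies in the north of the region",
    "the village lies in the north of the district",  # near-duplicate template
    "a completely different article about folk music",
]
print(filter_redundant(articles))  # the templated near-duplicate is dropped
```

Template-generated stubs ("X is a village in Y") share long runs of identical n-grams, so a set-overlap score like this separates them from genuinely original prose.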

[NLP-35] Language corpora for the Dutch medical domain

Quick Read: This paper addresses the scarcity of Dutch medical corpora, which limits the development of natural language processing (NLP) for the Dutch medical domain. The key to the solution is translating English datasets, identifying medical text within generic corpora, and extracting open Dutch medical resources, ultimately producing a large-scale Dutch medical language corpus of roughly 35 billion tokens across about 100 million documents. The corpus is openly released on Hugging Face and can be used directly for pre-training and downstream NLP tasks.

Link: https://arxiv.org/abs/2604.25374
Authors: B. van Es
Affiliations: Utrecht University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, no figures


Abstract:**Background:** Dutch medical corpora are scarce, limiting NLP development. **Methods:** We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. **Results:** The resulting corpus comprises ±35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. **Conclusion:** This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

[NLP-36] The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models NEURIPS2026

Quick Read: This paper addresses the lack of a unified, fair, cross-source benchmark for evaluating structured-output extraction from multimodal unstructured data. Existing evaluations either focus solely on schema compliance or measure value correctness within a single domain, so they cannot reflect real-world extraction ability. The key to the solution is SOB (the Structured Output Benchmark), a multi-source benchmark covering three source modalities (text, images, and audio) in which every model receives a text-normalized representation of its context; this deliberately decouples structured-output capability from raw vision or speech quality and enables fair, source-agnostic comparison. With a diverse dataset of 5,000 text records, 209 image records, and 115 audio records, the benchmark evaluates 21 frontier and open-weight models on seven metrics, revealing that models achieve near-perfect schema compliance while value accuracy drops sharply as modality complexity increases, providing a standardized evaluation tool and detailed analysis for future research.

Link: https://arxiv.org/abs/2604.25359
Authors: Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, Vineet Agarwal
Affiliations: JigsawStack, Inc.; Interfaze.ai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 4 figures, 11 tables, submitted to NeurIPS 2026


Abstract:Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.
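The "exact leaf-value match" notion behind Value Accuracy can be sketched as follows (an illustrative reimplementation of the general idea, not the benchmark's released scorer): flatten prediction and ground truth to leaf paths and count exact matches.

```python
def leaves(obj, path=()):
    """Flatten a JSON-like object into a {path: leaf_value} dict."""
    if isinstance(obj, dict):
        out = {}
        for k, v in obj.items():
            out.update(leaves(v, path + (k,)))
        return out
    if isinstance(obj, list):
        out = {}
        for i, v in enumerate(obj):
            out.update(leaves(v, path + (i,)))
        return out
    return {path: obj}

def value_accuracy(pred, gold):
    """Fraction of gold leaves whose predicted value matches exactly."""
    g, p = leaves(gold), leaves(pred)
    return sum(p.get(k) == v for k, v in g.items()) / len(g)

gold = {"invoice": {"total": 120.5, "items": ["pen", "pad"]}}
pred = {"invoice": {"total": 120.5, "items": ["pen", "paper"]}}
print(value_accuracy(pred, gold))  # 2 of 3 gold leaves match
```

Note how a model can be perfectly schema-compliant (same keys, same types) while still losing on this metric, which is exactly the gap the benchmark reports.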

[NLP-37] R3-SQL: Ranking Reward and Resampling for Text-to-SQL ACL2026

Quick Read: This paper addresses two key problems in how modern Text-to-SQL systems rank candidate SQL queries: functionally equivalent but syntactically different queries are scored inconsistently, destabilizing the ranking; and ranking cannot recover when the correct SQL is absent from the candidate pool. The key innovations of the proposed R³-SQL framework are: (1) grouping candidates by execution result and assigning a unified reward per group for ranking consistency, scoring each group by combining a pairwise preference across groups with a pointwise utility derived from the best group's rank and size, thereby capturing relative preference, consistency, and candidate quality; and (2) an agentic resampling mechanism that judges whether the current candidate pool is likely missing the correct SQL and selectively resamples to improve recall. The method achieves 75.03% execution accuracy on BIRD-dev, a new state of the art among methods using models with disclosed sizes.

Link: https://arxiv.org/abs/2604.25325
Authors: Hojae Han, Yeonseok Jeong, Seung-won Hwang, Zhewei Yao, Yuxiong He
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by Findings of ACL 2026


Abstract:Modern Text-to-SQL systems generate multiple candidate SQL queries and rank them to select a final prediction. However, existing methods face two limitations. First, they often score functionally equivalent SQL queries inconsistently despite identical execution results. Second, ranking cannot recover when the correct SQL is absent from the candidate pool. We propose R³-SQL, a Text-to-SQL framework that addresses both issues through unified reward for ranking and resampling. R³-SQL first groups candidates by execution result and ranks groups for consistency. To score each group, it combines a pairwise preference across groups with a pointwise utility from the best group rank and size, capturing relative preference, consistency, and candidate quality. To improve candidate recall, R³-SQL introduces agentic resampling, which judges the generated candidate pool and selectively resamples when the correct SQL is likely absent. R³-SQL achieves 75.03% execution accuracy on BIRD-dev, a new state of the art among methods using models with disclosed sizes, with consistent gains across five benchmarks.
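The execution-based grouping step can be illustrated with a minimal sketch (an in-memory SQLite table and a toy "prefer larger groups" rule standing in for the paper's learned reward; the schema and queries are invented for the example):

```python
import sqlite3
from collections import defaultdict

def group_by_execution(candidates, setup_sql):
    """Group candidate SQL queries by execution result, so functionally
    equivalent queries share one group and thus one reward."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    groups = defaultdict(list)
    for sql in candidates:
        try:
            result = tuple(conn.execute(sql).fetchall())
        except sqlite3.Error:
            result = ("<error>",)  # failing queries form their own group
        groups[result].append(sql)
    return groups

setup = "CREATE TABLE t(x INT); INSERT INTO t VALUES (1),(2),(3);"
candidates = [
    "SELECT x FROM t WHERE x > 1",
    "SELECT x FROM t WHERE x >= 2",  # equivalent, different syntax
    "SELECT x FROM t WHERE x > 2",
]
groups = group_by_execution(candidates, setup)
# Toy pointwise utility: larger groups signal self-consistency.
best = max(groups.values(), key=len)
print(len(groups), best)
```

The first two queries land in one group despite different syntax, which is precisely the inconsistency that per-query scoring fails to capture.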

[NLP-38] Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

Quick Read: This paper addresses the complexity, cost, and multidisciplinary coordination required for cutscene production in video games and interactive media, where traditional workflows take teams days to weeks to produce minutes of polished content. The key to the solution is Cutscene Agent, an LLM agent framework for automated end-to-end generation with three core innovations: a Cutscene Toolkit built on the Model Context Protocol (MCP) that establishes bidirectional integration between LLM agents and the game engine, letting agents both invoke engine operations and observe real-time scene state for closed-loop generation of editable engine-native assets; a multi-agent system in which a director agent orchestrates specialist subagents for animation, cinematography, and sound design, augmented by a visual-reasoning feedback loop for perception-driven refinement; and CutsceneBench, a hierarchical evaluation benchmark tailored to long-horizon, multi-step, strictly ordered cutscene generation, filling a capability gap that existing tool-use benchmarks do not cover.

Link: https://arxiv.org/abs/2604.25318
Authors: Lanshan He, Haozhou Pang, Qi Gan, Xin Shen, Ziwei Zhang, Yibo Liu, Gang Fang, Bo Liu, Kai Sheng, Shengfeng Zeng, Chaofan Li, Zhen Hui, Keer Zhou, Lan Zhou, Shujun Dai
Affiliations: Unknown
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 pages excluding appendix


Abstract:Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of collaborative effort from multidisciplinary teams to produce minutes of polished content. In this work, we present Cutscene Agent, an LLM agent framework for automated end-to-end cutscene generation. The framework makes three contributions: (1) a Cutscene Toolkit built on the Model Context Protocol (MCP) that establishes bidirectional integration between LLM agents and the game engine: agents not only invoke engine operations but continuously observe real-time scene state, enabling closed-loop generation of editable engine-native cinematic assets; (2) a multi-agent system where a director agent orchestrates specialist subagents for animation, cinematography, and sound design, augmented by a visual reasoning feedback loop for perception-driven refinement; and (3) CutsceneBench, a hierarchical evaluation benchmark for cutscene generation. Unlike typical tool-use benchmarks that evaluate short, isolated function calls, cutscene generation requires long-horizon, multi-step orchestration of dozens of interdependent tool invocations with strict ordering constraints, a capability dimension that existing benchmarks do not cover. We evaluate a range of LLMs on CutsceneBench and analyze their performance across this challenging task.

[NLP-39] Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models

Quick Read: This paper addresses the faithfulness problem in retrieval-augmented generation (RAG): models tend to answer from parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. The key to the solution is Faithfulness-QA, a large-scale, controlled adversarial dataset built by counterfactual entity substitution on two standard extractive QA benchmarks (SQuAD and TriviaQA); substitution manufactures conflicts between the context and the model's internal knowledge, forcing models to learn to prefer externally retrieved information. The dataset contains 99,094 samples with a rigorous quality-control process and a reproducible construction pipeline, and can serve both to train attention-based faithfulness objectives and to evaluate how strongly RAG systems ground their answers in context.

Link: https://arxiv.org/abs/2604.25313
Authors: Li Ju, Junzhe Wang, Qi Zhang
Affiliations: WisPaper.AI; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Retrieval-Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness-QA, a large-scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks–SQuAD and TriviaQA–we automatically identify answer-bearing named entities in each context, replace them with type-consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200-sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness-QA is designed as a training resource for attention-based faithfulness objectives and as an evaluation benchmark for measuring context-grounding behavior in RAG systems. Data and code are available at this https URL.
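The counterfactual substitution step can be sketched as follows (a toy entity bank and a hypothetical `substitute` helper; the released pipeline additionally runs NER and draws from a curated bank of 76,953 entities):

```python
import random

# Toy bank of type-consistent alternatives (illustrative only).
ENTITY_BANK = {
    "PERSON": ["Ada Lovelace", "Alan Turing"],
    "CITY": ["Lyon", "Porto"],
}

def substitute(context: str, answer: str, ent_type: str, rng: random.Random):
    """Replace the answer entity in the context with a type-consistent
    alternative, creating a context vs. parametric-memory conflict."""
    choices = [e for e in ENTITY_BANK[ent_type] if e != answer]
    new_answer = rng.choice(choices)
    return context.replace(answer, new_answer), new_answer

rng = random.Random(0)
ctx = "The Eiffel Tower is located in Paris."
new_ctx, new_ans = substitute(ctx, "Paris", "CITY", rng)
print(new_ctx, "| gold answer is now:", new_ans)
```

A faithful RAG model must now answer with the substituted city, directly against what its parametric memory says about the Eiffel Tower, which is exactly the behavior the dataset is designed to train and measure.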

[NLP-40] LegalMidm: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model ICLR2026

Quick Read: This paper addresses the limited practical utility of domain-specialized large language models (LLMs) caused by the mismatch between their training data and methods and real-world application requirements, particularly in the legal domain, where precision and reliability are paramount. The key to the solution is a systematic training framework driven by the practical needs of legal work: close collaboration with legal professionals, rigorous data curation and annotation, and use-case-oriented dataset construction and training pipelines, improving the model's accuracy and applicability on Korean legal tasks.

Link: https://arxiv.org/abs/2604.25297
Authors: Youngjoon Jang, Chanhee Park, Hyeonseok Moon, Young-kyoung Ham, Jiwon Moon, Jinhyeon Kim, JuKyung Jung, Heuiseok Lim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2026 DATA-FM Workshop


Abstract:In recent years, the rapid proliferation of open-source large language models (LLMs) has spurred efforts to turn general-purpose models into domain specialists. However, many domain-specialized LLMs are developed using datasets and training protocols that are not aligned with the nuanced requirements of real-world applications. In the legal domain, where precision and reliability are essential, this lack of consideration limits practical utility. In this study, we propose a systematic training framework grounded in the practical needs of the legal domain, with a focus on Korean law. We introduce LegalMidm, a Korean legal-domain LLM, and present a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. Our approach emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, and demonstrates effectiveness in key legal tasks.

[NLP-41] Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLM s

Quick Read: This paper addresses the performance bottleneck of multimodal large language models (MLLMs) in medicine caused by conventional data-curation strategies that rely on coarse partitioning by modality or department, which limits the models' ability to recognize fine-grained medical knowledge and perform complex clinical reasoning. The key to the solution is an Entity-Centric Medical Data Engineering framework: entities are automatically extracted from authoritative medical literature to build a Medical Entity Tree (MET) as a structured knowledge repository, on top of which three core techniques are designed, namely node-guided retrieval, a two-stage hybrid filtering and alignment pipeline, and knowledge-aware data synthesis, substantially improving MLLMs' understanding and reasoning in complex clinical scenarios.

Link: https://arxiv.org/abs/2604.25296
Authors: Jianghang Lin, Haihua Yang, Deli Yu, Kai Wu, Kai Ye, Jinghao Lin, Zihan Wang, Yuhang Wu, Liujuan Cao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to capture the hierarchical and interconnected nature of clinical medical knowledge, limiting the models’ ability to perform fine-grained recognition and complex reasoning. In this paper, we propose a novel Entity-Centric Medical Data Engineering framework. We automatically extract entities from authoritative medical literature to construct a Medical Entity Tree (MET), a hierarchical structure that systematically encodes diseases, anatomical structures, modalities, and symptoms into a unified knowledge repository. Building upon the MET, we propose an advanced data engine that includes: (1) node-guided retrieval to anchor raw data to specific medical concepts, (2) a two-stage hybrid filtering and alignment pipeline to ensure precise visual-semantic correspondence, and (3) knowledge-aware data synthesis to generate enriched captions and targeted reasoning VQA pairs, leveraging structural constraints. Extensive evaluations across six medical benchmarks demonstrate that our approach significantly enhances the medical capabilities of general-purpose MLLMs, improving their ability to handle complex clinical queries and achieve state-of-the-art performance in diverse medical contexts.

[NLP-42] Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

Quick Read: This paper addresses the detection of sandbagging in AI safety, i.e., models deliberately underperforming on capability evaluations. Borrowing the logic of symptom validity testing (SVT) from clinical psychology, the study asks whether below-chance behavior (BCB) on forced-choice items can identify sandbagging. The key finding: although BCB could in principle mark answer-aware avoidance, no significant BCB was observed under the experimental conditions; instead, the models exhibited three distinct failure modes, most notably a position-dominant response policy, with Llama-3-8B collapsing its answers onto middle-alphabet options (e.g., E, F) rather than avoiding answers based on content. This suggests that at the current model scale, metrics based on positional-distribution shift may be more useful than BCB for detecting prompted underperformance.

Link: https://arxiv.org/abs/2604.25249
Authors: Jon-Paul Cacioli
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures, 2 tables. Pre-registered: this https URL


Abstract:Detecting sandbagging–the deliberate underperformance on capability evaluations–is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of 12 model-domain cells showed significant below-chance performance under sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, with 62-88% response identity with the honest baseline. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic, collapsing its response distribution onto middle-alphabet options (E at 31.8%, F at 26.1%) regardless of where the correct answer fell. This produced accuracy boosts of up to 33 percentage points when the correct answer coincidentally occupied the model’s preferred position. An explicit anti-task instruction (“pick the least likely answer”) drove two of three models below chance, with accuracy as low as 0.024. The capability for answer-aware avoidance therefore exists but is not activated by “deliberately underperform.” BCB did not fail as a logical marker of answer-aware avoidance. It was not observed in this regime because the model showing the largest behavioural shift exhibited behaviour consistent with a position-dominant response policy rather than content-aware answer avoidance. We propose that positional-distribution shift may be a more effective behavioural signature than below-chance accuracy for detecting prompted underperformance at this model scale.
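The proposed behavioural signature, positional-distribution shift, could be quantified as, for example, the total-variation distance between honest and "underperform" answer-letter distributions (an illustrative metric with made-up response data, not the paper's analysis code):

```python
from collections import Counter

def letter_dist(responses, options="ABCDEFGHIJ"):
    """Empirical distribution over answer letters."""
    c = Counter(responses)
    n = len(responses)
    return {o: c.get(o, 0) / n for o in options}

def tv_distance(p, q):
    """Total-variation distance between two answer-letter distributions."""
    return 0.5 * sum(abs(p[o] - q[o]) for o in p)

honest = ["A", "B", "C", "D"] * 25       # roughly uniform over A-D
sandbag = ["E"] * 60 + ["F"] * 40        # collapsed onto middle letters
shift = tv_distance(letter_dist(honest), letter_dist(sandbag))
print(shift)
```

A large shift flags a positional heuristic even when accuracy stays at or above chance, which is the regime where below-chance accuracy is blind.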

[NLP-43] VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Quick Read: This paper addresses the lack of reliability assessment when vision-language models (VLMs) serve as automated judges: their scores convey no confidence or uncertainty. The key to the solution is conformal prediction, a distribution-free calibration method that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. A systematic analysis of three VLM judges across 14 visual task categories shows that evaluation uncertainty is strongly task-dependent, and reveals a failure mode missed by standard metrics, ranking-scoring decoupling, in which judges order responses correctly but produce wide, uninformative intervals, yielding a quantitative reliability map and diagnostics for multimodal evaluation.

Link: https://arxiv.org/abs/2604.25235
Authors: Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi
Affiliations: University of Illinois at Chicago; AI Labs at Capital One
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:


Abstract:Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge’s point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: this https URL
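Split conformal prediction over judge scores can be sketched as follows (assuming a calibration set of judge scores paired with reference scores, with invented numbers; the paper additionally derives its nonconformity scores from score-token log-probabilities, which is omitted here):

```python
import math

def conformal_interval(cal_pred, cal_true, new_pred, alpha=0.1):
    """Split conformal prediction with absolute-residual scores: returns
    an interval with >= (1 - alpha) marginal coverage on exchangeable data."""
    n = len(cal_pred)
    residuals = sorted(abs(p - t) for p, t in zip(cal_pred, cal_true))
    # Finite-sample-corrected quantile index: ceil((n + 1)(1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha)) - 1
    q = residuals[min(k, n - 1)]
    return new_pred - q, new_pred + q

# Hypothetical judge scores vs. human reference scores on a 1-10 scale.
cal_pred = [7.0, 5.5, 8.0, 6.0, 9.0, 4.0, 7.5, 6.5, 8.5, 5.0]
cal_true = [6.5, 5.0, 8.5, 6.0, 8.0, 5.0, 7.0, 6.0, 9.0, 4.5]
lo, hi = conformal_interval(cal_pred, cal_true, new_pred=7.0, alpha=0.2)
print(lo, hi)
```

The interval width is then the reliability signal: a judge whose residuals are large on a task (e.g., chart reasoning) yields wide intervals there, even if its rankings remain correct.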

[NLP-44] DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

Quick Read: This paper addresses the problem that current vision-language models (VLMs) can achieve high accuracy on diagram question answering (DQA) without grounding their reasoning in the diagram regions that actually support the answer; models may rely on textual correlations or dataset artifacts rather than identifying and using the visual elements in the diagram, making evaluation unreliable and model decisions uninterpretable. The key to the solution is the DRAGON benchmark, which requires models not only to answer correctly but also to localize the visual evidence regions supporting the answer (such as legends, axes, labels, and connectors), enabling verifiable and interpretable evaluation of visual reasoning. The benchmark comprises 11,664 annotated instances, a 2,445-instance human-verified test set, and a standardized evaluation framework for systematically assessing evidence localization across diagram types and VLMs.

Link: https://arxiv.org/abs/2604.25231
Authors: Anirudh Iyengar Kaniyar Narayana Iyengar, Tampu Ravi Kumar, Gaurav Najpande, Manan Suri, Dinesh Manocha, Puneet Mathur, Vivek Gupta
Affiliations: Arizona State University; Adobe Research; University of Maryland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 14 figures


Abstract:Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.

[NLP-45] BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Quick Read: This paper addresses the challenges of deploying guardrails for custom safety policies: generic safety models fail to capture task-specific requirements, prompting large language models (LLMs) suffers from inconsistent boundary-case performance and high inference cost, and while training dedicated classifiers balances accuracy and efficiency, their reliance on large labeled datasets makes them expensive. The key to the solution is the BARRED framework, which requires only a task description and a small set of unlabeled examples: dimension decomposition ensures comprehensive coverage of the domain space, and multi-agent debate verifies label correctness, yielding high-fidelity, diverse synthetic training data. Experiments show that small language models fine-tuned on this data outperform state-of-the-art proprietary LLMs and dedicated guardrail models across diverse custom-policy settings, and ablations confirm that dimension decomposition and debate-based verification are the core components guaranteeing data diversity and label fidelity.

Link: https://arxiv.org/abs/2604.25203
Authors: Arnon Mazza, Elad Levi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.

[NLP-46] CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation SIGIR2026

Quick Read: This paper addresses the failure of naive knowledge integration in multilingual retrieval-augmented generation (RAG): simply concatenating knowledge snippets from different languages may not improve generation because of semantic and stylistic disparities across languages. The key to the solution is the CroSearch-R1 framework, which adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align knowledge from other languages into a unified representation space, combined with a multilingual rollout mechanism that optimizes reasoning transferability across languages, thereby effectively exploiting cross-lingual complementarity and improving RAG effectiveness over multilingual collections.

Link: https://arxiv.org/abs/2604.25182
Authors: Rui Qi, Fengran Mo, Sijin Lu, Yufeng Chen, Jian-Yun Nie, Kaiyu Huang
Affiliations: Beijing Jiaotong University; Université de Montréal
Subjects: Computation and Language (cs.CL)
Comments: Accepted to SIGIR 2026 (Short Paper)


Abstract:A multilingual collection may contain useful knowledge in other languages to supplement and correct the facts in the original language for Retrieval-Augmented Generation (RAG). However, the vanilla approach that simply concatenates multiple pieces of knowledge from different languages into the context may fail to improve effectiveness due to the potential disparities across languages. To better leverage multilingual knowledge, we propose CroSearch-R1, a search-augmented reinforcement learning framework to integrate multilingual knowledge into the Group Relative Policy Optimization (GRPO) process. In particular, the approach adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align the knowledge from other languages as supplementary evidence into a unified representation space. Furthermore, we introduce a multilingual rollout mechanism to optimize reasoning transferability across languages. Experimental results demonstrate that our framework effectively leverages cross-lingual complementarity and improves the effectiveness of RAG with multilingual collections.

[NLP-47] MGTEVAL: An Interactive Platform for Systematic Evaluation of Machine-Generated Text Detectors

Quick Read: This paper addresses the fragmentation of evaluation for machine-generated text (MGT) detectors: existing studies differ in datasets, preprocessing, attacks, and metrics, making results hard to compare and reproduce. The key to the solution is MGTEVAL, an extensible platform that organizes the evaluation workflow into four components: dataset building, dataset attack, detector training, and performance evaluation. It supports generating MGT with configurable large language models (LLMs), applying 12 text attacks to test sets to probe robustness, training detectors through a unified interface, and reporting effectiveness, robustness, and efficiency, enabling standardized, reproducible, and user-friendly evaluation experiments.

Link: https://arxiv.org/abs/2604.25152
Authors: Yuanfan Li, Qi Zhou, Chengzhengxu Li, Zhaohan Zhang, Chenxu Zhao, Zepu Ruan, Chao Shen, Xiaoming Liu
Affiliations: Xi'an Jiaotong University; Queen Mary University of London
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:


Abstract:We present MGTEVAL, an extensible platform for systematic evaluation of Machine-Generated Text (MGT) detectors. Despite rapid progress in MGT detection, existing evaluations are often fragmented across datasets, preprocessing, attacks, and metrics, making results hard to compare and reproduce. MGTEVAL organizes the workflow into four components: Dataset Building, Dataset Attack, Detector Training, and Performance Evaluation. It supports constructing custom benchmarks by generating MGT with configurable LLMs, applying 12 text attacks to test sets, training detectors via a unified interface, and reporting effectiveness, robustness, and efficiency. The platform provides both command-line and Web-based interfaces for user-friendly experimentation without code rewriting.

[NLP-48] Frictive Policy Optimization for LLM s: Epistemic Intervention Risk-Sensitive Control and Reflective Alignment

Quick Read: This paper addresses the fact that LLM alignment typically targets only surface-level preferences or task utility while ignoring how models manage epistemic and normative risk in dialogue. Conventional alignment cannot effectively control when and how a model intervenes in a conversation (clarifying, verifying, challenging, redirecting, or refusing), leading to failures in belief evolution, commitment consistency, and uncertainty management. The key to the solution is the Frictive Policy Optimization (FPO) framework, which treats interventions as explicit control actions and formalizes alignment as a risk-sensitive epistemic control problem: intervention decisions are selected for their expected effect on downstream epistemic quality rather than immediate reward alone. FPO introduces a compact taxonomy of frictive interventions, a structured friction functional capturing multiple alignment failure modes, and a unified algorithm family spanning reward shaping, preference pairing, group-relative ranking, and risk-conditioned trust regions, so that models are aligned both in outcome and in epistemic conduct.

Link: https://arxiv.org/abs/2604.25136
Authors: James Pustejovsky, Nikhil Krishnaswamy
Affiliations: Brandeis University; Colorado State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Frictive Policy Optimization; epistemic alignment; risk-sensitive control; LLM alignment; clarification and refusal; preference learning; trust regions; dialogue agents

点击查看摘要

Abstract:We propose Frictive Policy Optimization (FPO), a framework for learning language model policies that regulate not only what to say, but when and how to intervene in order to manage epistemic and normative risk. Unlike standard alignment methods that optimize surface-level preference or task utility, FPO treats clarification, verification, challenge, redirection, and refusal as explicit control actions whose purpose is to shape the evolution of belief, commitment, and uncertainty over time. We formalize alignment as a risk-sensitive epistemic control problem in which intervention decisions are selected based on their expected effect on downstream epistemic quality rather than on immediate reward alone. We introduce a compact taxonomy of frictive interventions, a structured friction functional that operationalizes multiple alignment failure modes, and a unified family of FPO methods spanning reward shaping, preference pairing, group-relative ranking, and risk-conditioned trust regions. We further propose an evaluation framework that measures epistemic competence directly through clarification behavior, calibration, contradiction repair, refusal proportionality, and information efficiency. Together, these results provide a formal and algorithmic foundation for learning agents that are aligned not only in outcome, but in epistemic conduct.

[NLP-49] FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments ACL2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为自主代理(autonomous agents)决策核心时,在模拟真实世界以客户为中心的问题解决场景中频繁因错误决策的级联效应而失效的问题,尤其针对开源小参数量LLMs在有限上下文窗口和推理预算下易积累误差的挑战。解决方案的关键在于提出Failure-Aware Meta-Agentic (FAMA) 框架,其核心机制为:首先分析基线代理的失败轨迹以识别最常见错误;随后通过编排机制激活一组专门化代理,针对这些失败注入目标上下文给工具使用代理(tool-use agent),从而在决策前进行精准干预,实现对典型失败模式的有效缓解。实验表明,该方法可在多种评估模式下使开源LLM性能提升最高达27%。

链接: https://arxiv.org/abs/2604.25135
作者: Amir Saeidi,Venkatesh Mishra,Souradeep Mukhopadhyay,Gaowen Liu,Ali Payani,Jayanth Srinivasa,Chitta Baral
机构: Arizona State University (亚利桑那州立大学); Cisco Research (思科研究院)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting a targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs demonstrate performance gains up to 27% across evaluation modes over standard baselines. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi-turn tool-use LLM agents that simulate real-world conversational scenarios.

[NLP-50] Korean aegyo speech shows systematic F1 increase to signal childlike qualities

【速读】: 该论文旨在解决成人之间在浪漫互动中使用韩语“奶音”(aegyo)这一社会认可的童声说话风格时,其语音特征如何体现儿童语音特征的问题。解决方案的关键在于通过分析12名首尔韩语母语者在相同脚本下以奶音和非奶音两种方式发音时的共振峰频率(formant frequencies),发现奶音显著提高了所有元音的F1值,并对前元音进行了选择性前移,从而导致元音空间扩张,但主要表现为F1整体升高;这表明成人在模仿儿童语音时,主要是通过全局性元音降低和局部前移来模拟儿童较短的声道长度。

链接: https://arxiv.org/abs/2604.25133
作者: Ji-eun Kim,Volker Dellwo
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 18 pages, 2 figures, under review

点击查看摘要

Abstract:Korean aegyo is a socially recognized childlike speaking style used predominantly in romantic interactions among adults. This study examined vowel space modification in aegyo by analyzing formant frequencies from twelve Seoul Korean speakers who produced identical scripts in aegyo and non-aegyo styles. Results show that aegyo speech features a significant increase in F1 values across vowels and selective fronting of front vowels, leading to vowel space expansion but mainly a shift to higher F1. These findings suggest that adult speakers stylize childlike speech by imitating the shorter vocal tract of children, mainly through global vowel lowering and partial fronting.
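论文的核心分析模式(比较同一组元音在奶音与非奶音两种风格下的第一共振峰 F1)可以用如下草图示意,其中的 F1 数值为虚构示例,仅用于说明计算方式,并非论文数据:

```python
import numpy as np

# Toy replication of the analysis pattern described above: compare mean F1
# (first formant, in Hz) between aegyo and non-aegyo productions of the
# same vowels. The values below are fabricated for illustration only.

f1_plain = np.array([520.0, 540.0, 510.0, 535.0])  # non-aegyo tokens
f1_aegyo = np.array([600.0, 615.0, 590.0, 610.0])  # aegyo tokens

delta_f1 = f1_aegyo.mean() - f1_plain.mean()
# A positive delta corresponds to vowel lowering (higher F1), the global
# shift the study reports for aegyo speech.
```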

[NLP-51] What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective ACL2026

【速读】: 该论文旨在解决指令微调(instruction tuning)数据集中存在的冗余和低质量样本问题,从而提升数据选择效率。其解决方案的关键在于提出一种基于加权上下文影响(weighted in-context influence, wICI)的数据选择框架,该框架通过衡量每个候选样本对语义相关样本的指令遵循难度降低效果来评估其价值,实验证明该方法在有限数据预算下能显著优于现有基线,并揭示了样本难度与上下文影响之间存在负相关关系。

链接: https://arxiv.org/abs/2604.25132
作者: Guangzeng Han,Xiaolei Huang
机构: University of Memphis (孟菲斯大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026, main conference

点击查看摘要

Abstract:Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods. We propose an instruction data selection framework based on weighted in-context influence (wICI), which measures how effectively each candidate example reduces instruction-following difficulty for semantically related peers. Through systematic experiments, we address three key questions: what constitutes effective instruction tuning data from an in-context perspective, whether sample difficulty correlates with in-context influence, and how in-context influence translates to instruction tuning effectiveness. Experiments across multiple models and benchmarks demonstrate that our method consistently outperforms existing baselines under constrained data budgets, while empirically showing that sample difficulty negatively correlates with in-context influence.
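wICI 的思想(以候选样本对语义相近同伴的“指令遵循难度降低量”加权打分)可用下面的玩具草图示意。注意:这里的具体公式与数据均为我们为说明而作的假设,并非论文原始定义:

```python
import numpy as np

# Toy sketch of a weighted in-context influence (wICI) style score.
# The exact formulation here is a hypothetical reading of the abstract:
# each candidate is scored by the loss reduction it yields for peers when
# used as an in-context example, weighted by candidate-peer similarity.

rng = np.random.default_rng(0)
n = 4  # candidates, which also act as each other's peers

similarity = rng.uniform(0.0, 1.0, size=(n, n))        # cand x peer
base_loss = rng.uniform(1.0, 2.0, size=n)              # peer loss without context
ctx_loss = base_loss[None, :] - rng.uniform(0.0, 0.5, size=(n, n))  # with cand i

reduction = base_loss[None, :] - ctx_loss              # >= 0 by construction
wici = (similarity * reduction).sum(axis=1) / similarity.sum(axis=1)

selected = np.argsort(-wici)[:2]   # keep top-2 candidates under a data budget
```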

[NLP-52] LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

【速读】: 该论文旨在解决长文档摘要评估中存在的核心瓶颈问题,即现有评价指标与人类判断相关性弱、缺乏可解释性且无法指导生成过程改进,从而阻碍了对可验证准确性要求较高的应用场景中的有效优化。其解决方案的关键在于提出 LongSumEval 框架,通过结构化问答(Question-Answering, QA)反馈机制将评估与生成过程统一起来,将摘要质量定义为问题的答案可得性(answerability)和事实一致性(factual alignment),从而生成可解释的评分和可操作的改进建议,识别覆盖缺口与事实不一致之处,实现无需重新训练即可通过自我精炼显著提升摘要质量,确立了以评估反馈作为生成执行指令的新范式。

链接: https://arxiv.org/abs/2604.25130
作者: Huyen Nguyen,Haoxuan Zhang,Yang Zhang,Haihua Chen,Junhua Ding
机构: University of North Texas (北德克萨斯大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications requiring verifiable accuracy. We introduce LongSumEval, a unified framework bridging evaluation and generation through structured question-answering feedback. The framework operationalizes summary quality as answerability and factual alignment of question-answer pairs, generating interpretable scores and actionable feedback that identifies coverage gaps and factual inconsistencies. This resolves the misalignment where evaluation operates independently of generation objectives. Meta-evaluation of our QA-based evaluation module across seven benchmarks demonstrates substantially stronger agreement with human judgments compared to established metrics. Structured feedback enables significant quality improvements through self-refinement without retraining. By demonstrating that evaluation feedback can serve as executable instructions for generation, this work establishes a generalizable paradigm for aligning assessment with improvement, with direct implications for controllable text generation requiring verifiable accuracy and transparent quality control. All code and datasets will be released in GitHub for reproducibility.

[NLP-53] Diagnosis: Bad Planning Reasoning. Treatment: SCOPE – Planning for Hybrid Querying over Clinical Trial Data

【速读】: 该论文旨在解决临床试验表格推理(clinical trial table reasoning)中的复杂语义理解问题,即答案并非直接存储于可见单元格中,而是需通过归一化、分类、提取或轻量级领域推理从部分观测的表格中推导得出。其核心挑战在于当前大语言模型(LLM)在隐式规划假设下常出现“错误推理”(bad reasoning),尤其在恢复如治疗类型、添加药物、终点角色或随访状态等隐含属性时表现不佳。解决方案的关键是提出SCOPE(Structured Clinical hybrid Planning for Evidence retrieval in clinical trials),一个基于多LLM规划器的框架,将任务分解为行选择、结构化规划和执行三个阶段,使源字段、推理规则和输出约束在答案生成前显式明确,从而显著降低歧义性并提升准确性,相较直接提示方法和更复杂的代理基线模型展现出更好的准确率-效率权衡。

链接: https://arxiv.org/abs/2604.25120
作者: Suparno Roy Chowdhury,Manan Roy Choudhury,Tejas Anvekar,Muhammad Ali Khan,Kaneez Zahra Rubab Khakwani,Mohamad Bassam Sonbol,Irbaz Bin Riaz,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); Mayo Clinic (梅奥诊所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study clinical trial table reasoning, where answers are not directly stored in visible cells but must be reasoned from semantic understanding through normalization, classification, extraction, or lightweight domain reasoning. Motivated by the observation that current LLM approaches often suffer from “bad reasoning” under implicit planning assumptions, we focus on settings in which the model must recover implicit attributes such as therapy type, added agents, endpoint roles, or follow-up status from partially observed clinical-trial tables. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, Blend-SQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.

[NLP-54] Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在测试时计算扩展(Test-Time Compute Scaling, TTS)推理能力受限于高参数量和高推理成本的问题,特别是针对结构化剪枝(structured pruning)导致TTS性能显著下降的现象。其解决方案的关键在于重新审视剪枝策略的有效性,提出采用非结构化剪枝(unstructured pruning),即通过精细化移除冗余或有害权重而非整体删除层块,来实现模型压缩的同时提升甚至超越原始完整模型的TTS性能。实验表明,非结构化剪枝不仅避免了结构化剪枝带来的性能损失,还在多个推理基准上展现出优于未剪枝全权重模型的能力,同时层级稀疏分配策略对效果有重要影响。

链接: https://arxiv.org/abs/2604.25098
作者: Ocean Monjur,Shahriar Kabir Nahin,Anshuman Chhabra
机构: Bellini College of AI, Cybersecurity, and Computing (贝利尼人工智能、网络安全与计算学院); University of South Florida (南佛罗里达大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and instead investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs (s1.1-7B and Qwen3-8B) consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating unstructured pruning methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can improve TTS effectiveness even further.
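摘要中对比的“非结构化剪枝 + 逐层稀疏分配”可以用下面的最小草图示意(假设采用最简单的逐层均匀稀疏分配与幅值剪枝;论文实际比较了多种分配策略,此处仅为示意):

```python
import numpy as np

# Minimal sketch of unstructured magnitude pruning for one layer, as
# contrasted with structured (block-removal) pruning in the abstract.
# Uniform per-layer sparsity is assumed here; the paper studies several
# layer-wise allocation strategies.

def prune_unstructured(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
layer = rng.standard_normal((16, 16))
pruned = prune_unstructured(layer, sparsity=0.5)

achieved = 1.0 - np.count_nonzero(pruned) / pruned.size  # realized sparsity
```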

[NLP-55] Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

【速读】: 该论文旨在解决生成式 AI(Generative AI)代理在混合动机场景中缺乏有效协作与竞争策略的问题,这类场景要求代理通过短期合作实现长期竞争目标(如多方政治博弈)。其核心挑战在于如何设计能够动态调整谈判行为、适应复杂人际关系并提升胜率的多智能体系统。解决方案的关键在于构建了一个名为“Cooperate to Compete”(C2C)的多智能体环境,该环境支持非约束性私人协商、异构目标设定以及动态联盟形成,同时通过大规模游戏实验(超过1,100场游戏、16,000余段私密对话、共计1,520万token)识别出人类与基于语言模型(Language Model, LM)代理在谈判行为上的差异:人类偏好低复杂度协议且可靠性较低,而LM代理更倾向于接受无对等条件的提议。基于这些发现,作者采用针对性提示工程(targeted prompting)优化了代理的谈判策略,使胜率从22.2%显著提升至32.7%,从而确立了C2C作为研究和开发具备现实部署能力的LM代理的有效测试平台的地位。

链接: https://arxiv.org/abs/2604.25088
作者: Abigail O’Neill,Alan Zhu,Mihran Miroyan,Narges Norouzi,Joseph E. Gonzalez
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective. Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players’ short-term interests align and diverge. We run AI-only games and conduct a user study pitting human players against AI opponents. We identify significant differences between human and AI negotiation behaviors, finding that humans favor lower-complexity deals and are significantly less reliable partners compared to LM-based agents. We also find that humans are more aggressive negotiators, accepting deals without a counteroffer only 56.3% of the time compared to 67.6% for LM-based agents. Through targeted prompting inspired by these findings, we modify agents’ negotiation behavior and improve win rates from 22.2% to 32.7%. We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions. Our results establish C2C as a testbed for studying and building LM-based agents that can navigate the sophisticated coordination required for real-world deployments. The game, code, and dataset may be found at this https URL.

[NLP-56] Analyzing LLM Reasoning to Uncover Mental Health Stigma

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在心理健康应用中可能存在的隐性偏见问题,特别是其对心理疾病患者的污名化倾向。现有评估方法主要依赖多选题(Multiple-Choice Questions, MCQs),难以揭示模型内部推理过程中嵌入的偏见逻辑。论文的关键解决方案是分析LLM在回答心理健康相关问题时的中间推理步骤,利用临床专业知识构建污名化语言分类框架,识别并标注推理链条中的问题语句,并对其严重程度进行分级,从而区分显性歧视与更隐蔽的、潜在有害的偏见。该方法显著提升了对模型偏见的检测能力,并揭示了模型在理解心理健康问题上的逻辑缺陷。

链接: https://arxiv.org/abs/2604.25053
作者: Sreehari Sankar,Aliakbar Nafar,Mona Barman,Hannah K. Heitz,Ashwin Kumar,Pouria Tohidi,Dailun Li,Danish Hussain,Russell DuBois,Hamed Hasheminia,Farshad Majzoubi
机构: BetterHelp
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) are increasingly being explored for mental health applications, recent studies reveal that they can exhibit stigma toward individuals with psychological conditions. Existing evaluations of this stigma primarily rely on multiple-choice questions (MCQs), which fail to capture the biases embedded within the models’ underlying logic. In this paper, we analyze the intermediate reasoning steps of LLMs to uncover hidden stigmatizing language and the internal rationales driving it. We leverage clinical expertise to categorize common patterns of stigmatizing language directed at individuals with psychological conditions and use this framework to identify and tag problematic statements in LLM reasoning. Furthermore, we rate the severity of these statements, distinguishing between overt prejudice and more subtle, less immediately harmful biases. To broaden the reasoning domain and capture a wider array of patterns, we also extend an existing mental health stigma benchmark by incorporating additional psychological conditions. Our findings demonstrate that evaluating model reasoning not only exposes substantially more stigma than traditional MCQ-based methods but it helps to identify the flaws in the LLMs’ logic and their understanding of mental health conditions.

[NLP-57] Leverage Laws: A Per-Task Framework for Human-Agent Collaboration

【速读】: 该论文旨在解决人机协作中效率评估的量化难题,即如何在任务层面衡量人类工作被代理(agent)替代的程度,并将其与人类为指定任务、处理中断及审查结果所花费的时间进行标准化比较。其核心解决方案是提出一个“每任务杠杆比”(per-task leverage ratio),该比值将人类被代理取代的工作量除以人类完成任务所需的总时间(包括任务指定、中途干预处理和结果审核),并进一步将分母分解为三个信息流通道(human-to-agent、agent-to-human、task planning),每个通道具有独立的时间成本标量。关键创新在于揭示信息密度的方向性限制——人类到代理与代理到人类的信息流动分别受制于上限,并且杠杆比的渐近行为可拆解为能力(capability)和记忆(memory)两个标度轴,其中规划项存在由人类处理能力决定的非零下限,从而为系统设计提供了可操作的理论框架。

链接: https://arxiv.org/abs/2604.25040
作者: Stan Loosmore
机构: University of Southern California (南加州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:We propose a per-task leverage ratio for human-agent collaboration: human work displaced by an agent, divided by the human time required to specify the task, resolve mid-run interrupts, and review the result. The denominator decomposes into three channels through which a conserved per-task information requirement must flow, each with its own time-cost scalar. We show that information density itself is directional and bounded by separate ceilings on human-to-agent and agent-to-human flow, and that the asymptotic behavior of leverage decomposes into two scaling axes (capability and memory) with a non-zero floor on the planning term set by irreducible task novelty bounded by human throughput. We extend this per-task analysis to a windowed leverage measure that accommodates recurring tasks, spawned subtasks, and amortized system-design investment. The per-task ceiling does not bind the windowed measure, though both remain bounded: L_task by per-task novelty, L_window by the stock of accumulated planning investment that pays out within the window. The framework operationalizes aspects of earlier qualitative work on supervisory control (Sheridan, 1992), common ground (Clark & Brennan, 1991), and mixed-initiative interaction (Horvitz, 1999) within a single normative ratio, and produces a list of testable empirical questions that we leave as open problems.
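摘要中的每任务杠杆比定义(被替代的人类工作量,除以任务指定、处理中断与审查结果三项人类时间之和)可以用下面的玩具算例说明,其中所有分钟数均为虚构示例:

```python
# Toy computation of the per-task leverage ratio described above:
# L_task = (human work displaced) / (t_specify + t_interrupts + t_review).
# All numbers are hypothetical, for illustration only.

def leverage(displaced_minutes, t_specify, t_interrupts, t_review):
    """Per-task leverage: displaced human work over total human time spent."""
    denom = t_specify + t_interrupts + t_review
    return displaced_minutes / denom

# An agent displaces 120 minutes of human work; the human spends
# 10 min specifying, 5 min on interrupts, 15 min reviewing.
L_task = leverage(120, t_specify=10, t_interrupts=5, t_review=15)
# L_task == 4.0: four minutes of work displaced per minute of human time.
```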

[NLP-58] Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在有限计算资源和token预算下,难以有效执行多步推理任务的问题。现有测试时推理方法如自一致性(self-consistency)、思维树(Tree-of-Thoughts)及批判-修正循环虽能提升性能,但通常消耗大量token且缺乏细粒度的步骤级控制。其解决方案的关键在于探索是否可通过过程监督(process supervision)与简单的测试时控制策略(如token预算限制和冗余步骤拒绝机制),在不依赖模型规模扩大或高采样次数的前提下,显著提升SLMs的推理可靠性与效率,从而为设备端部署、低延迟或成本敏感场景提供实用的优化路径。

链接: https://arxiv.org/abs/2604.25039
作者: Sagnik Chatterjee,Atharva Patil,Sricharan Ramesh
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) solve many reasoning tasks via chain-of-thought (CoT) prompting, but smaller models (about 7 to 8B parameters) still struggle with multi-step reasoning under tight compute and token budgets. Existing test-time reasoning methods such as self-consistency (sampling multiple rationales and voting), Tree-of-Thoughts (search over intermediate thoughts), and critique-revise loops improve performance, but often at high token cost and without fine-grained step-level control. This project aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? This question is both scientific and practical. Scientifically, it probes whether process supervision and simple test-time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on-device, low-latency, or cost-constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.

[NLP-59] Faithful Autoformalization via Roundtrip Verification and Repair

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在将自然语言形式化为逻辑表达式时的忠实性(faithfulness)问题,即如何验证形式化结果是否准确保留了原始语义。其核心解决方案是提出一种无需标注数据的“往返验证”(roundtrip verification)方法:首先将自然语言陈述形式化为逻辑表达式,再将其翻译回自然语言,随后重新形式化,并利用形式化工具检查两次形式化结果是否逻辑等价。若二者一致,则表明形式化过程具有较高忠实性;若不一致,则通过诊断步骤定位失败环节并应用针对性修复算子进行修正。实验表明,该方法显著提升了形式等价率(从45–61%提升至83–85%),且形式等价性与语义漂移程度呈负相关,验证了方案的有效性。

链接: https://arxiv.org/abs/2604.25031
作者: Daneshvar Amrollahi,Jerry Lopez,Clark Barrett
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When an LLM formalizes natural language, how do we know the output is faithful? We propose a roundtrip verification approach which does not require ground-truth annotations: formalize a statement, translate the result back to natural language, re-formalize, and use a formal tool to check logical equivalence. When the two formalizations agree, this provides evidence of a faithful formalization. When they disagree, a diagnosis step identifies which translation stage failed, and a targeted repair operator attempts to correct that stage. We evaluate our approach on 150 traffic rules using Claude Opus 4.6 and GPT-5.2. Diagnosis-guided repair raises formal equivalence from 45–61% to 83–85% for both models, outperforming a random-repair baseline. An independent NLI analysis confirms that formal equivalence is correlated with less semantic drift.
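论文的往返验证思想可以在命题逻辑上做一个最小示意:两次形式化若在所有真值指派下取值一致,则视为逻辑等价。下面的草图中,LLM 翻译各阶段被固定字符串代替(内容为假设示例),只有最后的等价性检查是真实实现:

```python
import itertools

# Sketch of the roundtrip check described above, on propositional formulas.
# The LLM translation stages are stubbed out with fixed strings; only the
# final equivalence check (a truth-table sweep) is implemented for real.

def equivalent(f1, f2, variables):
    """True iff two Python boolean expressions agree on all assignments."""
    for values in itertools.product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if eval(f1, {}, env) != eval(f2, {}, env):
            return False
    return True

# Hypothetical rule: "if the light is not green, stop" (not green -> stop).
vars_ = ["green", "stop"]
f1 = "green or stop"                          # first formalization
f2 = "not ((not green) and (not stop))"       # re-formalization after roundtrip
f3 = "green and stop"                         # an unfaithful re-formalization

roundtrip_ok = equivalent(f1, f2, vars_)      # roundtrip preserved meaning
needs_repair = not equivalent(f1, f3, vars_)  # diagnosis/repair would trigger
```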

[NLP-60] Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models ACL2026

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)后训练相较于监督微调(Supervised Fine-Tuning, SFT)在大语言模型(Large Language Models, LLMs)中更优泛化能力的机制不明确问题。解决方案的关键在于提出一种特征层面的机制分析方法,通过控制实验设置(RL与SFT从同一基础模型、相同数据出发)并构建可解释性框架,对不同模型内部激活进行特征空间对齐,从而追踪特征演化过程。研究发现,SFT快速引入大量高度特化的稳定特征,而RL则引发更受控且持续演化的特征变化,保留了基础模型的表征结构;进一步识别出一组任务无关的紧凑特征,这些特征直接介导跨任务泛化,并经特征干预实验证实其因果作用:关闭这些特征显著削弱RL模型的泛化性能,增强它们则提升基础模型表现。

链接: https://arxiv.org/abs/2604.25011
作者: Dan Shi,Zhuowen Han,Simon Ostermann,Renren Jin,Josef van Genabith,Deyi Xiong
机构: TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China; German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany; Saarland University, Saarbrücken, Germany
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main Conference

点击查看摘要

Abstract:Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models’ representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models’ generalization performance, while amplifying them improves base models’ performance. The code is available at this https URL.
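摘要末尾的特征干预实验(关闭或放大某一特征以验证其因果作用)可以用线性代数上的玩具草图示意。注意这里把“特征”简化为激活空间中的一个假设方向,仅为说明干预的数学形式:

```python
import numpy as np

# Toy sketch of the feature-level intervention described above: project a
# hidden activation onto a (hypothetical) feature direction, then damp or
# amplify that component to disable or strengthen the feature.

def intervene(h, direction, scale):
    """Rescale the component of h along a unit feature direction."""
    d = direction / np.linalg.norm(direction)
    coeff = h @ d
    return h + (scale - 1.0) * coeff * d

rng = np.random.default_rng(0)
h = rng.standard_normal(8)       # a hidden activation (toy)
feat = rng.standard_normal(8)    # a hypothetical feature direction

h_off = intervene(h, feat, scale=0.0)   # feature disabled
h_amp = intervene(h, feat, scale=2.0)   # feature amplified

unit = feat / np.linalg.norm(feat)
off_component = float(h_off @ unit)     # ~0 after disabling
amp_component = float(h_amp @ unit)     # ~2x the original component
```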

[NLP-61] Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination ACL

【速读】: 该论文旨在解决企业深度研究(Enterprise Deep Research, EDR)中常见的问题,包括信息覆盖不均、上下文爆炸以及过早终止(premature stopping),这些问题导致研究产出难以转化为决策可用的报告。其解决方案的关键在于提出一种可扩展的EDR架构,通过三个核心机制实现:(i) 利用带反思的提纲生成将任务分解为以覆盖度为导向的目标;(ii) 基于依赖关系引导的执行策略定位上下文并显式共享信息,从而控制上下文范围;(iii) 引入基于证据的完成标准,确保代理迭代收集信息直至满足充分性条件。实验证明,这种依赖控制的上下文管理和显式的证据充分性约束能够显著减少过早终止现象,提升企业研究输出的一致性和深度。

链接: https://arxiv.org/abs/2604.24978
作者: Prafulla Kumar Choubey,Kung-Hsiang Huang,Pranav Narayanan Venkit,Jiaxin Zhang,Vaibhav Vats,Yu Li,Xiangyu Peng,Chien-Sheng Wu
机构: Salesforce AI Research
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: ACL Industry 2026

点击查看摘要

Abstract:Enterprise deep research often fails to produce decision-ready reports due to uneven information coverage, context explosion, and premature stopping. We propose a scalable Enterprise Deep Research (EDR) architecture to address these failures. Our system (i) decomposes requests into coverage-driven objectives via outline generation with reflection, (ii) localizes context with dependency-guided execution and explicit information sharing, and (iii) enforces evidence-based completion criteria so agents iteratively collect information until sufficiency conditions are met. We evaluate on an internal sales enablement task and the public DeepResearch Bench benchmark, where our proposed system design achieves the strongest overall performance compared with competitive deep-research baselines. The results show that dependency-controlled context and explicit evidence sufficiency criteria reduce premature stopping and improve the consistency and depth of enterprise research outputs.

[NLP-62] Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases

【速读】: 该论文旨在解决罕见疾病临床异常定位中因数据稀缺导致的监督微调不适用以及单次推理不稳定的问题。其核心解决方案是提出动态决策学习(Dynamic Decision Learning, DDL)框架,该框架通过优化指令并利用视觉扰动下的预测一致性,在语言与视觉空间中迭代优化冻结的大规模视觉-语言模型(Large Vision-Language Models, LVLMs)的决策过程,从而提升定位精度,并生成基于共识的可靠性评分以量化模型置信度。

链接: https://arxiv.org/abs/2604.24972
作者: Jun Li,Mingxuan Liu,Jiazhen Pan,Che Liu,Wenjia Bai,Cosmin I. Bercea,Julia A. Schnabel
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine-tuning impractical and single-pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision-language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus-based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare-disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: this https URL
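DDL 中“基于共识的可靠性评分”可以用一个玩具草图示意:对同一图像做多次视觉扰动并分别定位,以预测框两两之间的平均 IoU 作为置信度。此处的具体评分方式与框坐标均为假设示例,并非论文原始实现:

```python
import numpy as np

# Toy sketch of a consensus-based reliability score in the spirit of DDL:
# run grounding under several visual perturbations, then score reliability
# as the mean pairwise IoU of the predicted boxes. Boxes are hypothetical.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

# Predictions from three perturbed passes over the same image.
boxes = [(10, 10, 50, 50), (12, 11, 52, 49), (11, 9, 51, 51)]
pairs = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
reliability = float(np.mean([iou(boxes[i], boxes[j]) for i, j in pairs]))
# High agreement across perturbations -> high reliability score.
```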

[NLP-63] PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

【速读】: 该论文旨在解决多并发推理代理(inference agents)在生成式 AI (Generative AI) 应用中因各自独立维护 KV 缓存(Key-Value cache)而导致的内存占用过高问题。传统方法为每个代理分配独立的 KV 缓存,造成资源浪费,尤其在高并发场景下难以扩展。其解决方案的关键在于提出 PolyKV 系统,通过构建一个共享的、不对称压缩的 KV 缓存池实现高效内存利用:其中 Key 采用 int8 量化(q8_0)以保持 softmax 稳定性,Value 则使用 TurboQuant MSE 方法——基于快速沃尔什-哈达玛变换(FWHT)旋转后进行 3-bit Lloyd-Max 量化,并针对标准正态分布 N(0,1) 调整中心点,从而在保证生成质量的前提下实现高达 2.91 倍的稳定压缩比。该设计支持最多 15 个代理并发读取同一缓存池,显著降低内存消耗(如 Llama-3-8B 模型下从 19.8 GB 降至 0.45 GB),同时仅引入可忽略的性能损失(PPL 增幅 ≤ +0.57%,BERTScore F1 接近 0.93)。

链接: https://arxiv.org/abs/2604.24971
作者: Ishan Patel,Ishan Joshi
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 10 pages, 6 tables. Code: this https URL Keywords: KV cache compression, multi-agent LLM inference, asymmetric quantization, FWHT, TurboQuant, shared memory

点击查看摘要

Abstract:We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent – the standard paradigm – PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE – a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB – a 97.7% reduction – while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
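摘要描述的不对称压缩(Key 用 int8 量化,Value 经 FWHT 旋转后做 3-bit 量化)可以用下面的数值草图示意。注意:此处的 3-bit 量化网格是均匀间隔的简化替代,并非论文中针对 N(0,1) 调优的 Lloyd-Max 中心点;整段代码只是示意原理,不是 PolyKV 实现:

```python
import numpy as np

# Minimal sketch of the asymmetric KV compression idea described above:
# Keys quantized to int8; Values rotated with a Fast Walsh-Hadamard
# Transform (FWHT) and quantized to 8 levels (3 bits). The uniform level
# grid below is a stand-in for the paper's Lloyd-Max centroids.

def fwht(x):
    """Orthonormal FWHT; len(x) must be a power of 2. Self-inverse."""
    x = x.astype(np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def quantize_keys_int8(k):
    scale = np.abs(k).max() / 127.0
    return np.round(k / scale).astype(np.int8), scale

def quantize_values_3bit(v):
    r = fwht(v)                          # rotation spreads energy
    levels = np.linspace(-2.0, 2.0, 8)   # 3-bit grid (hypothetical centroids)
    idx = np.abs(r[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), levels

rng = np.random.default_rng(0)
k = rng.standard_normal(8)
v = rng.standard_normal(8)

qk, k_scale = quantize_keys_int8(k)
k_hat = qk.astype(np.float64) * k_scale

idx, levels = quantize_values_3bit(v)
v_hat = fwht(levels[idx])                # orthonormal FWHT is its own inverse

key_err = np.abs(k - k_hat).max()        # small: int8 preserves keys closely
val_err = np.abs(v - v_hat).max()        # larger: values take the lossy path
```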

[NLP-64] Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

【速读】: 该论文旨在解决现有网页代理(web agent)评估基准在长周期、多站点任务上的不足,这类任务更贴近真实世界中的复杂网络使用场景,如跨域产品比对、多服务行程规划等。当前主流评测集中于短周期单站点任务,导致前沿模型在此类任务上已接近饱和,无法有效衡量其长期推理与跨网站协同能力。解决方案的关键在于提出 Odysseys 基准,包含 200 个源自真实浏览会话的长周期网页任务,并引入基于评分量表(rubric-based evaluation)的细粒度评估机制,每个任务平均标注 6.1 个评分维度,相较于传统的二元通过/失败或 LLM-as-a-judge 评估方法,显著提升与人类评价的一致性并提供更丰富的性能信号。此外,论文还提出“轨迹效率”(Trajectory Efficiency)指标(每步的平均 rubric 得分),揭示即使最强模型也仅达 1.15% 的效率水平,凸显了对高效执行能力的迫切需求。

链接: https://arxiv.org/abs/2604.24964
作者: Lawrence Keunho Jang,Jing Yu Koh,Daniel Fried,Ruslan Salakhutdinov
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 29 pages

点击查看摘要

Abstract:Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We tested several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours. We release our tasks, evaluation scripts, and other results at this https URL
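摘要中提出的轨迹效率(Trajectory Efficiency,每步获得的 rubric 得分)可以按如下方式计算;示例中的 rubric 数与步数均为虚构,并非基准数据:

```python
# Toy computation of the Trajectory Efficiency metric described above:
# rubric score earned per agent step, expressed as a percentage.
# Numbers are illustrative, not taken from the benchmark.

def trajectory_efficiency(rubrics_passed, rubrics_total, steps):
    """Fraction of rubric score earned per step, as a percentage."""
    rubric_score = rubrics_passed / rubrics_total
    return 100.0 * rubric_score / steps

# An agent satisfies 4 of 8 rubrics over a 50-step trajectory.
eff = trajectory_efficiency(rubrics_passed=4, rubrics_total=8, steps=50)
# eff == 1.0 (% of rubric score per step), the same scale on which the
# paper reports 1.15% for frontier agents.
```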

[NLP-65] BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

Quick read: This paper addresses flaws in the evaluation infrastructure of increasingly complex benchmarks: many apparent agent failures are in fact failures of the benchmark itself, stemming from broken specifications, implicit assumptions, or rigid evaluation scripts that unfairly penalize valid alternative approaches. The key contribution is BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. It cross-verifies all benchmark artifacts via structured LLM protocols and can optionally incorporate agent solutions or execution traces as diagnostic evidence, systematically surfacing benchmark defects. Experiments show that BenchGuard identified 12 author-confirmed issues in ScienceAgentBench (including fatal errors that render tasks unsolvable) and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, at a cost of under USD 15 for a full audit of 50 complex bioinformatics tasks, making it a practical complement to traditional human review.

Link: https://arxiv.org/abs/2604.24955
Authors: Xinming Tu, Tianze Wang, Yingzhou (Minta) Lu, Kexin Huang, Yuanhao Qu, Sara Mostafavi
Institutions: Allen School, University of Washington, Seattle, WA, USA; Phylo, Inc., South San Francisco, CA, USA; Genentech, Inc., South San Francisco, CA, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

View abstract

Abstract:As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.
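BenchGuard itself cross-verifies artifacts with structured LLM protocols; as a much simpler mechanical illustration of the same spirit, the sketch below flags a task whose own gold solution fails (or crashes) its evaluation script. All field names (`gold_solution`, `expected_output`) are hypothetical:

```python
def audit_task(task, evaluator):
    """Minimal self-consistency check: a task is flagged if its own gold
    solution fails, or crashes, its evaluation script."""
    try:
        ok = evaluator(task["gold_solution"], task["expected_output"])
    except Exception as exc:  # broken evaluation script
        return [f"evaluator crashed: {exc}"]
    if not ok:
        return ["gold solution rejected by its own evaluator (task may be unsolvable)"]
    return []

# Toy benchmark: the second task's expected output is wrong, so its own
# gold solution cannot pass, i.e. the task is unsolvable as specified.
tasks = [
    {"gold_solution": "4", "expected_output": "4"},
    {"gold_solution": "4", "expected_output": "5"},
]
reports = [audit_task(t, lambda sol, exp: sol == exp) for t in tasks]
```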

[NLP-66] Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

Quick read: This paper targets three limitations of traditional voxelwise encoding models for fMRI analysis: measurement noise, inter-subject variability, and redundant signals from spatially correlated voxels. The key idea is an independent component (IC)-based encoding framework: ICA decomposes fMRI data from naturalistic story listening into functional components, and encoding models trained on independent data map large language model (LLM) representations of the linguistic input onto each IC's time series. This dissociates stimulus-driven from noise-driven signals and identifies functional network components (e.g., auditory and language networks) that are consistent across subjects, highly predictable, and interpretable, substantially improving the robustness and comparability of neural activity modeling.

Link: https://arxiv.org/abs/2604.24942
Authors: Kamya Hari, Taha Binhuraib, Jin Li, Cory Shain, Anna A. Ivanova
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments:

View abstract

Abstract:Encoding models provide a powerful framework for linking continuous stimulus features to neural activity; however, traditional voxelwise approaches are limited by measurement noise, inter-subject variability, and redundancy arising from spatially correlated voxels encoding overlapping neural signals. Here, we propose an independent component (IC)-based encoding framework that dissociates stimulus-driven and noise-driven signals in fMRI data. We decompose continuous fMRI data from naturalistic story listening into ICs using one subset of the data, and train encoding models on independent data to predict IC time series from large language model representations of linguistic input. Across subjects, a subset of ICs exhibited consistently high predictivity. These ICs were spatially and temporally consistent across subjects and included cognitive networks known to respond during story listening (auditory and language). Auditory component time series were strongly correlated with acoustic stimulus features, highlighting the interpretability of identified component time series. Components identified as noise or motion-related artifacts by ICA-AROMA showed uniformly poor predictive performance, confirming that highly predicted components reflect genuine stimulus-related neural signals rather than confounds. Overall, IC-based encoding models enable analyses at the level of functional networks, accommodating the variability in network locations across individuals and providing interpretable results that are easy to compare across subjects.
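An encoding model in this style maps stimulus features to a component's time series and scores predictivity on held-out data. The sketch below collapses the pipeline to a single feature with closed-form ridge regression and Pearson correlation; in the actual work the features come from an LLM and the components from ICA, so all numbers here are toy values:

```python
def fit_ridge_1d(x, y, lam=0.1):
    """Closed-form ridge fit for a single stimulus feature (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# Train on one subset of the IC time series, score predictivity on another.
feat_train, ic_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
feat_test, ic_test = [1.5, 2.5, 3.5], [3.0, 5.1, 6.9]
w = fit_ridge_1d(feat_train, ic_train)
pred = [w * f for f in feat_test]
score = pearson(pred, ic_test)  # high score = well-predicted, stimulus-driven IC
```

ICs with uniformly low held-out correlation would be treated as noise-related, mirroring the paper's finding that ICA-AROMA artifact components are poorly predicted.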

[NLP-67] ADE: Adaptive Dictionary Embeddings – Scaling Multi-Anchor Representations to Large Language Models

Quick read: This paper tackles the representational bottleneck of single-vector word embeddings for polysemous words, where one vector cannot capture a word's semantic diversity across contexts. The authors propose the Adaptive Dictionary Embeddings (ADE) framework, built on three key techniques: (1) Vocabulary Projection (VP), which turns the costly two-stage anchor lookup into an efficient matrix operation; (2) Grouped Positional Encoding (GPE), which lets anchors of the same word share positional information, preserving semantic coherence while allowing anchor-level variation; and (3) context-aware anchor reweighting, which uses self-attention to dynamically adjust each anchor's contribution at inference time. ADE scales multi-anchor representations to large Transformer architectures, improving text classification performance while drastically reducing parameters, demonstrating that multi-anchor embeddings are an efficient and practical alternative in modern Transformers.

Link: https://arxiv.org/abs/2604.24940
Authors: Orhan Demirci, Sezer Aptourachman
Institutions: Hacettepe University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages (9 pages main text + 4 pages appendix), 6 tables, 1 algorithm

View abstract

Abstract:Word embeddings are fundamental to natural language processing, yet traditional approaches represent each word with a single vector, creating representational bottlenecks for polysemous words and limiting semantic expressiveness. While multi-anchor representations have shown promise by representing words as combinations of multiple vectors, they have been limited to small-scale models due to computational inefficiency and lack of integration with modern transformer architectures. We introduce Adaptive Dictionary Embeddings (ADE), a framework that successfully scales multi-anchor word representations to large language models. ADE makes three key contributions: (1) Vocabulary Projection (VP), which transforms the costly two-stage anchor lookup into a single efficient matrix operation; (2) Grouped Positional Encoding (GPE), a novel positional encoding scheme where anchors of the same word share positional information, preserving semantic coherence while enabling anchor-level variation; and (3) context-aware anchor reweighting, which leverages self-attention to dynamically compose anchor contributions based on sequence context. We integrate these components into the Segment-Aware Transformer (SAT), which provides context-aware reweighting of anchor contributions at inference time. We evaluate ADE on AG News and DBpedia-14 text classification benchmarks. With 98.7% fewer trainable parameters than DeBERTa-v3-base, ADE surpasses DeBERTa on DBpedia-14 (98.06% vs. 97.80%) and approaches it on AG News (90.64% vs. 94.50%), while compressing the embedding layer over 40x – demonstrating that multi-anchor representations are a practical and parameter-efficient alternative to single-vector embeddings in modern transformer architectures.
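Vocabulary Projection can be illustrated by precomputing the product of a per-word anchor-mixture matrix with the anchor table, so the two-stage lookup (word to anchors, anchors to vectors) becomes a single row read. The matrices below are toy values, not the paper's learned parameters:

```python
def matvec(rows, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in rows]

# Toy setup: 3-word vocabulary, 2 anchors, 4-dimensional anchor vectors.
anchors = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
mixture = [[1.0, 0.0],   # word 0 uses anchor 0 only
           [0.0, 1.0],   # word 1 uses anchor 1 only
           [0.5, 0.5]]   # polysemous word 2 mixes both anchors

# Vocabulary Projection: precompute P = mixture @ anchors once, so each
# lookup is a single row read instead of a two-stage gather-and-combine.
P = [matvec(list(zip(*anchors)), m) for m in mixture]

def embed(token_id):
    return P[token_id]
```

At inference, context-aware reweighting would replace the static `mixture` row with attention-derived weights; here the mixture is fixed for clarity.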

[NLP-68] Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

Quick read: This paper addresses the inference inefficiency of large language models (LLMs) via depth pruning, which removes Transformer blocks to compress the model. Its key contribution is re-examining layer redundancy from a functional perspective: redundancy is not an intrinsic property of the model but is jointly determined by the model and the evaluation objective. Experiments show that different calibration objectives (e.g., perplexity vs. downstream accuracy) identify substantially different redundant layers, while under a fixed objective the choice of search algorithm matters comparatively little, suggesting that designing the calibration objective is more decisive than choosing a sophisticated search strategy.

Link: https://arxiv.org/abs/2604.24938
Authors: Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, Youngjin Heo, Suin Cho, Seong-hun Kim, Woosang Lim, Gaeul Kwon
Institutions: Neural Superintelligence Lab, MODULABS, Republic of Korea; University of Southern California, United States; Boston University, United States; Seoul National University, Republic of Korea
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

View abstract

Abstract:Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work has focused on importance criteria and search algorithms, often treating layer redundancy as an inherent structural property of pretrained networks. In contrast, we adopt a \emphfunctional perspective, where redundancy is jointly influenced by the model and the evaluation objective, suggesting that a universal ranking may not be sufficient. Through an empirical study across three LLM families, two calibration objectives, and seven search algorithms, we observe that different objectives yield qualitatively different redundant layers, and that perplexity and downstream accuracy rankings do not consistently align. Under a fixed objective, however, search algorithms tend to produce similar solutions. Overall, our results suggest that the calibration objective may play a more influential role than the choice of search algorithm, indicating that further attention to objective design could be beneficial.
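One of the simpler search algorithms covered by the paper's findings is greedy layer removal under a calibration objective. A sketch with a toy objective standing in for, e.g., perplexity on a calibration set (the importance values are made up):

```python
def greedy_depth_prune(layer_ids, objective, k):
    """Iteratively drop the layer whose removal hurts the calibration
    objective least (lower objective = better)."""
    kept = list(layer_ids)
    for _ in range(k):
        victim = min(kept, key=lambda l: objective([x for x in kept if x != l]))
        kept.remove(victim)
    return kept

# Toy objective: pretend each layer carries a fixed 'importance' and the
# calibration loss equals the total importance removed so far.
importance = {0: 5.0, 1: 0.1, 2: 3.0, 3: 0.2, 4: 4.0}
total = sum(importance.values())
loss = lambda remaining: total - sum(importance[l] for l in remaining)
pruned = greedy_depth_prune(list(importance), loss, k=2)
```

The paper's point is that swapping `loss` for a different calibration objective can change which layers look redundant far more than swapping the greedy search for another algorithm.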

[NLP-69] GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

Quick read: This paper addresses the pervasive language bias of agent benchmarks: existing multilingual versions mostly rely on machine translation (MT) with limited post-editing, producing query-answer misalignment or culturally off-target context that undermines benchmark validity. The key solution is a refined adaptation workflow combining automated checks with human review to achieve functional alignment, cultural alignment, and difficulty calibration, ensuring task-level consistency across languages. Experiments show the workflow improves agent success rates by up to 32.7%, narrowing the multilingual gap so that the closest audited setting trails English performance by only 3.1%. This reveals that a substantial share of multilingual performance differences is benchmark-induced measurement error, underscoring the importance of task-level alignment in cross-lingual adaptation.

Link: https://arxiv.org/abs/2604.24929
Authors: Yunsu Kim, Kaden Uhlig, Joern Wuebker
Institutions: LILT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package at this https URL. We also release the code used in our experiments at this https URL.
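The claim that part of the multilingual gap is benchmark-induced can be quantified as the share of the English-vs-MT success gap that disappears after re-auditing. A sketch with illustrative numbers (not the paper's):

```python
def gap_closed(success_en, success_mt, success_audited):
    """Fraction of the English-vs-MT success gap removed by re-auditing
    the translated benchmark."""
    raw_gap = success_en - success_mt
    remaining_gap = success_en - success_audited
    return (raw_gap - remaining_gap) / raw_gap

# Hypothetical: English 60%, minimally translated 35%, audited 56.9%.
share = gap_closed(0.60, 0.35, 0.569)
```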

[NLP-70] Large Language Models Explore by Latent Distilling

Quick read: This paper addresses the lack of response diversity when scaling large language models (LLMs) at test time: standard stochastic sampling mostly produces surface-level lexical variation rather than semantic exploration. The key solution is Exploratory Sampling (ESamp), a decoding strategy that trains a lightweight Distiller to predict the LLM's deep-layer hidden representations on the fly and uses the prediction error as a novelty signal to reweight candidate token extensions, biasing generation toward under-explored semantic patterns. This mechanism balances high diversity with coherence, significantly improving the Pass@k efficiency of reasoning models and generalizing robustly across mathematics, science, and code generation benchmarks.

Link: https://arxiv.org/abs/2604.24927
Authors: Yuanhao Zeng, Ao Lu, Lufei Li, Zheng Zhang, Yexin Li, Kan Ren
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 5 figures

View abstract

Abstract:Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well-known observation that neural networks tend to make lower-error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations to model the LLM’s depth-wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ESamp is implemented with an asynchronous training–inference pipeline, with less than 5% worst case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade-off between diversity and coherence in creative writing. Our code has released at: this https URL.
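The novelty-driven reweighting step can be sketched as adding a scaled novelty bonus (the Distiller's prediction error) to candidate-token log-probabilities and re-normalizing; `alpha` is a hypothetical scaling knob, and the paper's exact combination rule may differ:

```python
import math

def esamp_reweight(logprobs, novelty, alpha=1.0):
    """Re-normalize candidate-token probabilities after adding a novelty bonus."""
    scores = [lp + alpha * nv for lp, nv in zip(logprobs, novelty)]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

base = [math.log(0.7), math.log(0.2), math.log(0.1)]
novelty = [0.0, 0.0, 2.0]  # the distiller finds token 2 hardest to predict
probs = esamp_reweight(base, novelty)
```

The originally dominant token loses probability mass to the novel candidate, which is the intended bias toward less-explored continuations.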

[NLP-71] Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System ACL2026

Quick read: This paper addresses the semantic-actuation gap in current Vision-Language-Action (VLA) models for robotic manipulation, where a monolithic generation paradigm struggles to map high-level semantic instructions efficiently to high-frequency continuous motor control. The core solution is a hierarchical Coarse-to-Fine Dual-System VLA architecture that explicitly decouples learning complexity: a Semantic Planner predicts discrete action tokens capturing macro-directional intent, and an Action Refiner conditions on this coarse intent to generate high-frequency continuous actions for precise alignment. A key finding is that performance follows an inverted-U curve with respect to action decomposition granularity, peaking when the learning difficulty of the two sub-systems is balanced; combined with an asynchronous execution strategy, this improves the system's scalability, robustness, and responsiveness.

Link: https://arxiv.org/abs/2604.24921
Authors: Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He, Maoqing Yao, Guanghui Ren
Institutions: Beihang University; AgiBot
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Main Conference of ACL 2026. Project page: this https URL

View abstract

Abstract:Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation by grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, directly mapping visual-linguistic features to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space, decomposing into discrete macro-directional reaching and continuous micro-pose alignment, severely widening the semantic-actuation gap and imposing a heavy representational burden on grounding high-level semantics to continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy. The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment. Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.
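The Hybrid Action Space idea (discrete macro-directional reaching plus continuous micro-pose alignment) can be illustrated in one dimension by splitting a target pose into a grid-quantized coarse token and a continuous residual; the grid size is a made-up parameter, not the paper's:

```python
def coarse_to_fine(target, grid=1.0):
    """Decompose a 1-D target pose into a discrete macro step (coarse token)
    plus a continuous micro offset (refinement)."""
    coarse = round(target / grid) * grid
    fine = target - coarse
    return coarse, fine

c, f = coarse_to_fine(3.42)  # planner emits the coarse bin, refiner the residual
```

Varying `grid` is a one-dimensional analogue of the paper's decomposition granularity: too coarse overloads the refiner, too fine overloads the planner, hence the inverted-U curve.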

[NLP-72] Intrinsic Mutual Information as a Modulator for Preference Optimization ACL

Quick read: This paper addresses the heavy training-time overhead of offline preference optimization methods (such as Direct Preference Optimization, DPO) for aligning large language models (LLMs) with human values, which stems from additional hyperparameter tuning; prior improvements remain limited in effectiveness and do not fully remove this sensitivity. The key to the proposed RMiPO framework is using intrinsic response-level mutual information for preference optimization, with hyperparameter modulation that dynamically decouples preference contributions at negligible extra computational cost, yielding more efficient and stable training. Experiments show RMiPO matches or surpasses existing methods while reducing training overhead by more than 15%.

Link: https://arxiv.org/abs/2604.24804
Authors: Peng Liao, Peijia Zheng, Lingbo Li, Shangsong Liang, Lin Chen
Institutions: Sun Yat-sen University; University of Warwick; Macao Polytechnic University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ACL Findings 2026

View abstract

Abstract:Offline preference optimization methods, such as Direct Preference Optimization (DPO), offer significant advantages in aligning Large Language Models (LLMs) with human values. However, achieving optimal performance with these methods typically involves additional hyperparameter tuning, resulting in substantial time overhead. Although prior work has proposed a range of improvements, these methods remain limited in effectiveness and have not fully eliminated reliance on hyperparameter tuning. In this work, we propose RMiPO, a lightweight and efficient framework for offline preference optimization. RMiPO leverages intrinsic Response-level Mutual information for Preference Optimization with hyperparameter modulation, dynamically decoupling preference contributions at negligible additional computational cost. Extensive experimental results demonstrate that RMiPO achieves consistently superior performance over existing methods while reducing training overhead by more than 15%. Our code is available at this https URL.
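For reference, the standard DPO loss on one preference pair, plus a hypothetical per-pair modulation of beta by a response-level mutual-information weight; the paper's exact modulation form is not given in the abstract, so `rmipo_loss` is only a plausible sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss on one (chosen, rejected) pair of sequence log-probs
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(sigmoid(margin))

def rmipo_loss(pi_w, pi_l, ref_w, ref_l, mi_weight, beta=0.1):
    """Hypothetical modulation: scale the pair's effective beta by a
    response-level mutual-information weight."""
    return dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=beta * mi_weight)

neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # zero margin -> log 2
```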

[NLP-73] Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR

Quick read: This paper addresses the challenges of elderly automatic speech recognition (elderly ASR, EASR) arising from scarce training data and the distinct acoustic and linguistic characteristics of elderly speech. The key solution is a data augmentation pipeline combining large language model (LLM) transcript paraphrasing with text-to-speech (TTS) synthesis: the LLM first generates elderly-contextual paraphrases of the original transcripts, a TTS model then synthesizes the corresponding speech using elderly reference speakers, and the resulting synthetic audio-text pairs are merged with the original data to fine-tune Whisper, significantly improving recognition performance without modifying its architecture.

Link: https://arxiv.org/abs/2604.24770
Authors: Minsik Lee, Seoi Hong, Chongmin Lee, Sieun Choi, Jian Kim, Jua Han, Jihie Kim
Institutions: Dongguk University; Harvard University
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: 5 pages, 2 figures, under review at IEEE Signal Processing Letters

View abstract

Abstract:Despite recent progress in automatic speech recognition (ASR), elderly ASR (EASR) remains challenging due to limited training data and the distinct acoustic and linguistic characteristics of elderly speech. In this work, we address data scarcity in EASR through a data augmentation pipeline that combines large language model (LLM)-based transcript paraphrasing with text-to-speech (TTS) synthesis. Given an elderly speech dataset, the LLM first generates elderly-contextual paraphrases of the original transcripts, and the TTS model then synthesizes corresponding speech using elderly reference speakers. The resulting synthetic audio-text pairs are merged with the original data to fine-tune Whisper without architectural modification. We further analyze the effects of augmentation ratio and reference-speaker composition in low-resource EASR. Experiments on English and Korean elderly speech datasets from speakers aged 70 and above show that the proposed method consistently improves performance over conventional augmentation baselines, achieving up to a 58.2% reduction in word error rate (WER) compared with the Whisper baseline.
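The reported 58.2% WER reduction is relative to the baseline WER. The metric itself, word-level Levenshtein distance divided by reference length, can be sketched as:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)

score = wer("the old phone rang twice", "the old phone rang")  # one deletion
```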

[NLP-74] Benchmarking Testing in Automated Theorem Proving ACL2026

Quick read: This paper addresses the lack of reliable semantic-correctness evaluation for large language models (LLMs) in formal theorem proving: existing approaches rely on indirect proxies such as lexical overlap or manual inspection, which poorly reflect the logical validity of generated theorems. The key idea of the proposed T framework is a dependency-compilation test: a generated theorem is considered semantically correct only if all successor theorems that depend on it compile successfully, analogous to integration testing in software engineering. The method requires no human annotation, automatically extracting problems and dependencies from real-world Lean 4 repositories, and thereby provides a stricter, more practice-oriented evaluation standard. Experiments show that even state-of-the-art models degrade markedly under this semantic metric, revealing major limitations in current theorem generation capabilities.

Link: https://arxiv.org/abs/2604.23698
Authors: Jongyoon Kim, Hojae Han, Seung-won Hwang
Institutions: Seoul National University; Electronics and Telecommunications Research Institute
Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: ACL 2026 Industry

View abstract

Abstract:Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T, a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

[NLP-75] Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Quick read: This paper addresses hallucination and overconfident outputs of audio-aware large language models (ALLMs) in audio understanding and reasoning, where the central challenge is estimating model uncertainty under multimodal conditioning. The key contribution is the first systematic empirical evaluation of five uncertainty estimation methods (predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True)). It finds that semantic-level and verification-based methods clearly outperform token-level baselines on general audio reasoning benchmarks, whereas in trustworthiness-oriented settings (hallucination detection and unanswerable-question identification) the effectiveness of uncertainty methods varies markedly across models and benchmarks, meaning conclusions from general reasoning settings do not transfer directly to high-stakes applications. These findings lay a foundation for building reliable, uncertainty-aware audio-language systems.

Link: https://arxiv.org/abs/2604.25591
Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee
Institutions: National Taiwan University; Artificial Intelligence Center of Research Excellence (AI-CoRE)
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Manuscript in progress

View abstract

Abstract:Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
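Two of the benchmarked token-level baselines are easy to state: Monte-Carlo predictive entropy over sampled responses, and its length-normalized variant. A sketch (the sequence log-probabilities and lengths are illustrative values):

```python
def predictive_entropy(sample_logprobs):
    """Monte-Carlo predictive entropy: negative mean total log-probability
    over sampled responses to the same prompt."""
    return -sum(sample_logprobs) / len(sample_logprobs)

def length_normalized_entropy(sample_logprobs, lengths):
    """Same estimate, but each response's log-probability is first divided
    by its token length to remove the length bias."""
    return -sum(lp / n for lp, n in zip(sample_logprobs, lengths)) / len(sample_logprobs)

lps = [-4.0, -6.0, -5.0]                          # total log-probs of 3 samples
pe = predictive_entropy(lps)
lne = length_normalized_entropy(lps, [4, 12, 5])  # mean of 1.0, 0.5, 1.0
```

The semantic variants the paper favors additionally cluster the samples by meaning before computing entropy, which these token-level baselines do not.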

Information Retrieval

[IR-0] Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text

Quick read: This paper addresses how to make an arbitrary text collection navigable, i.e., automatically mining and constructing a text structure that supports flexible browsing and navigation without manually created hyperlinks. Its key contribution is proposing and studying several methods for constructing a Hypergraph of Text (HoT), together with a novel quantitative metric, the effort ratio, for evaluating the structural quality of a constructed HoT. Experiments show that even a simple TF-IDF baseline can match LLM-based methods on this metric, validating the effectiveness and practicality of the proposed methods.

Link: https://arxiv.org/abs/2604.25906
Authors: Dean E. Alvarez, ChengXiang Zhai
Institutions: University of Illinois Urbana-Champaign
Subjects: Information Retrieval (cs.IR)
Comments:

View abstract

Abstract:One reason the Web is more useful than a simple collection of documents is that the structure created by hyperlinks enables flexible navigation from one web page to another. However, hyperlinks are typically created manually and cannot fully capture a corpus’ implicit semantic structures. Is there a general way to make an arbitrary collection navigable? Recent work has formalized this problem generally as constructing a Hypergraph of Text (HoT), which provides a formal mathematical structure for supporting navigation and browsing. However, how to construct and evaluate a Hypergraph of Text remains a challenge. In this paper, we propose and study several methods for constructing a HoT. We also propose a novel quantitative metric, effort ratio, for evaluating the structural quality of a constructed HoT. Experimental results show that even simple TF-IDF baselines can match LLM-based methods on our proposed effort ratio metric.
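The TF-IDF baseline for linking nodes in a Hypergraph of Text can be sketched with bag-of-words TF-IDF vectors and cosine similarity; the documents below are toy examples, and a real construction would threshold or rank these similarities to form hyperedges:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (word -> weight) for a list of documents."""
    tokens = [d.lower().split() for d in docs]
    df = Counter()
    for t in tokens:
        df.update(set(t))
    n = len(docs)
    return [{w: c * math.log(n / df[w]) for w, c in Counter(t).items()}
            for t in tokens]

def cosine(u, v):
    dot = sum(val * v.get(w, 0.0) for w, val in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["hyperlinks enable navigation",
        "navigation across documents",
        "unrelated cooking recipe"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # share the term 'navigation'
sim_02 = cosine(vecs[0], vecs[2])  # no term overlap
```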

[IR-1] Break the Inaccessible Boundary: Distilling Post-Conversion Content for User Retention Modeling

Quick read: This paper addresses the train-serve inconsistency caused by feature leakage in user retention models for real-time bidding (RTB) advertising systems. Post-conversion "Onboarding Content" provides strong signals for retention prediction, but using it directly in training injects future information unavailable at inference time, introducing bias. The key solution is a two-stage distillation-aligned framework (OCARM): the first stage deliberately exposes onboarding content to train a hierarchical encoder that produces teacher representations, and the second stage aligns a user encoder with the frozen teacher via knowledge distillation, letting the model implicitly learn onboarding-content information from observable features alone and achieving accurate retention prediction without feature leakage.

Link: https://arxiv.org/abs/2604.25839
Authors: Tianbao Ma, Ruochen Yang, Chengen Li, Yuexin Shi, Jiangxia Cao, Linxun Chen, Zhaojie Liu, Yanan Niu, Han Li, Kun Gai
Institutions: Kuaishou Technology
Subjects: Information Retrieval (cs.IR)
Comments: Work in progress

View abstract

Abstract:User retention is a key metric to measure long-term engagement in modern platforms. In real-time bidding (RTB) advertising system for user re-engagement, the retention model is required to predict future revisit probability at bidding time, before the user converts and consumes any content. Although post-conversion content, termed Onboarding Content, provides highly informative signals for retention prediction, directly using it in training causes severe feature leakage and creates a gap between training and serving. To address this issue, we propose OCARM, a two-stage distillation-aligned framework for Onboarding Content Augmented Retention Modeling, enabling the model to implicitly capture future content using only observable features during inference. In the first stage, we deliberately expose onboarding content to train a hierarchical encoder that produces teacher representations. In the second stage, a user encoder is aligned with the frozen teacher through distillation, allowing the model to approximate the inaccessible onboarding signals without leakage. Extensive offline experiments and online A/B tests demonstrate that our framework achieves consistent improvements in a real-world growth scenario.
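The second-stage alignment can be sketched as minimizing a distance between the student user encoder's output and the frozen teacher's representation. Mean squared error is one plausible choice; the paper's exact objective is not stated in the abstract:

```python
def distill_align_loss(student_reps, teacher_reps):
    """Mean-squared alignment between the student user encoder's output and
    the frozen teacher representation (one plausible alignment objective)."""
    per_dim = [(s - t) ** 2 for s, t in zip(student_reps, teacher_reps)]
    return sum(per_dim) / len(per_dim)

# Toy 3-dim representations; the teacher saw onboarding content, the student did not.
loss = distill_align_loss([0.2, 0.5, 0.1], [0.0, 0.5, 0.3])
```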

[IR-2] Action-Aware Generative Sequence Modeling for Short Video Recommendation SIGIR2026

Quick read: This paper addresses the difficulty of traditional binary-classification recommendation models in capturing users' differentiated preferences over segments within a short video, since they treat the whole video as a single entity and ignore the temporal dynamics and intent diversity of user behavior. The key solution is a new modeling paradigm, the Action-Aware Generative Sequence Network (A2Gen), whose core components are: a Context-aware Attention Module (CAM) that enriches action-sequence representations with item-specific contextual features; a Hierarchical Sequence Encoder (HSE) that learns temporal patterns from users' historical actions; and an Action-seq Autoregressive Generator (AAG) that generates structured action sequences for unified prediction. By explicitly modeling the temporal dimension of user behavior, the method markedly improves multi-task recommendation and, in large-scale online A/B tests, yields significant gains in watch time, interaction rate, and user retention.

Link: https://arxiv.org/abs/2604.25834
Authors: Wenhao Li, Zihan Lin, Zhengxiao Guo, Jie Zhou, Shukai Liu, Yongqi Liu, Chuan Luo, Chaoyi Ma, Ruiming Tang, Han Li
Institutions: Kuaishou Inc.; Beihang University
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 11 pages, 8 figures, SIGIR 2026

View abstract

Abstract:With the rapid development of the Internet, users have increasingly higher expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, face limitations in accurately capturing such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates that the timing of user actions can represent diverse intentions through statistical analysis and examination of action patterns. Based on this insight, we propose a novel modeling paradigm: Action-Aware Generative Sequence Network (A2Gen), which refines user actions along the temporal dimension and chains them into sequences for unified processing and prediction. First, we introduce the Context-aware Attention Module (CAM) to model action sequences enriched with item-specific contextual features. Building upon this, we develop the Hierarchical Sequence Encoder (HSE) to learn temporal action patterns from users’ historical actions. Finally, through leveraging CAM, we design a module for action sequence generation: the Action-seq Autoregressive Generator (AAG). Extensive offline experiments on the Kuaishou’s dataset and the Tmall public dataset demonstrate the superiority of our proposed model. Furthermore, through large-scale online A/B testing deployed on Kuaishou’s platform, our model achieves significant improvements over baseline methods in multi-task prediction by leveraging sequential information. Specifically, it yields increases of 0.34% in user watch time, 8.1% in interaction rate, and 0.162% in overall user retention (LifeTime-7), leading to successful deployment across all traffic, serving over 400 million users every day.
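The core preprocessing step, refining user actions along the temporal dimension and chaining them into a sequence, can be sketched as ordering timestamped actions (the action names are hypothetical):

```python
def chain_actions(events):
    """Order raw (timestamp, action) events into a token sequence along time,
    the input format an autoregressive action generator would consume."""
    return [action for _, action in sorted(events)]

seq = chain_actions([(12.0, "like"), (3.5, "play"), (9.1, "skip_segment")])
```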

[IR-3] Harmonizing Generative Retrieval and Ranking in Chain-of-Recommendation

Quick read: This paper addresses a key bottleneck of generative recommender systems: under the next-item-agnostic prediction paradigm, the model can beam out many candidate items via semantic IDs (e.g., beam-256) but struggles to estimate which candidates are better (e.g., selecting the top-10 from 256), leaving a clear gap between generation capability and ranking performance. The key solution is RecoChain, a unified generative retrieval and ranking framework that integrates candidate generation and ranking within a single Transformer backbone: at inference, candidates are first generated via hierarchical semantic ID prediction, and a SIM-based ranking process then continuously estimates each candidate's click probability, optimizing generation and ranking jointly. Experiments on large-scale real-world datasets show that RecoChain effectively bridges this gap, improving Top-K recommendation while preserving strong generative capability.

Link: https://arxiv.org/abs/2604.25787
Authors: Yu Liu, Jiangxia Cao
Institutions: NJUST (Nanjing University of Science and Technology); Kuaishou Technology
Subjects: Information Retrieval (cs.IR)
Comments: Work in progress

View abstract

Abstract:Generative recommender systems have recently emerged as a promising paradigm by formulating next-item prediction as an auto-regressive semantic IDs generation, such as OneRec series works. However, with the next-item-agnostic prediction paradigm, its could beam out some next potential items via Semantic IDs but hard to estimate which items are better from them, e.g., select the top-10 from beam-256 items, leading to a gap between generation and ranking performance. To fulfill this gap, we propose RecoChain, a unified generative retrieval and ranking framework that integrates candidate generation and ranking within a single Transformer backbone. Specifically, in inference, the model first generates candidate items via hierarchical semantic ID prediction, then performs the SIM-based ranking process to estimate the click possibility of corresponding item candidate continuously. Extensive experiments on large-scale real-world datasets demonstrate that our approach effectively bridges the gap between generative retrieval and ranking, achieving improved Top-K recommendation performance while maintaining strong generative capability.
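The two-stage inference (beam out candidates via semantic IDs, then rank them by estimated click probability) reduces to a sort over the beam once scores are available; the score table below is a stand-in for the SIM-based ranker, with all item names hypothetical:

```python
def generate_then_rank(beam_candidates, click_score, k):
    """Stage 1 (given): candidates beamed out via semantic-ID generation.
    Stage 2: rank them by estimated click probability and keep the top-k."""
    return sorted(beam_candidates, key=click_score, reverse=True)[:k]

# Hypothetical click-probability estimates for a beam of 4 candidates.
scores = {"item_a": 0.91, "item_b": 0.15, "item_c": 0.62, "item_d": 0.40}
top2 = generate_then_rank(list(scores), scores.get, k=2)
```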

[IR-4] Can Code Evaluation Metrics Detect Code Plagiarism?

Quick read: This paper examines whether existing code evaluation metrics (CEMs), originally developed for generative AI code generation tasks, can reliably detect source code plagiarism across modification levels (L1-L6) in software engineering education. The key approach is a systematic empirical comparison of five mainstream CEMs (CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore) against two state-of-the-art source code plagiarism detection tools (JPlag and Dolos) on two open-source labelled datasets (ConPlag and IRPlag), using threshold-free ranking-based measures to assess overall, per-dataset, and per-level performance. Results show that with preprocessing CrystalBLEU surpasses Dolos overall and remains competitive on highly complex modifications (e.g., L6), indicating that CEMs offer ranking ability comparable to dedicated tools and can serve as an effective alternative for code plagiarism detection.

Link: https://arxiv.org/abs/2604.25778
Authors: Fahad Ebrahim, Mike Joy (The University of Warwick)
Institutions: University of Warwick
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages, 5 figures, accepted at LEARNER 2026 workshop (associated with EASE 2026)

点击查看摘要

Abstract:Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it remains unclear whether such metrics can reliably detect plagiarism across different levels of modification (L1-L6), increasing in complexity. In this paper, we perform a comparative empirical study using two open-source labelled datasets, ConPlag (raw and template-free versions) and IRPlag. We evaluate five CEMs, namely CodeBLEU, CrystalBLEU, RUBY, Tree Structured Edit Distance (TSED), and CodeBERTScore. The performance is evaluated using threshold-free ranking-based measures to assess overall, per dataset, and per-level plagiarism performance. The results are compared against state-of-the-art (SOTA) Source Code Plagiarism Detection Tools (SCPDTs), JPlag and Dolos. Our findings show that without preprocessing, Dolos achieves the highest overall ranking performance, while among the individual metrics, CrystalBLEU, CodeBLEU, and RUBY outperform JPlag. Performance is strongest at L1 and drops from L4 onward, while CrystalBLEU remains competitive on L6. With preprocessing, CrystalBLEU surpasses Dolos overall. Per dataset, Dolos achieved the best ranking on the ConPlag raw dataset, while CrystalBLEU was the best-performing metric on the remaining datasets. At the plagiarism levels, Dolos remains strongest on L4, while Crystal-BLEU leads most of the remaining difficult levels. These results indicate that CEMs are comparable to dedicated tools in terms of ranking metrics. 
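摘要中反复提到"无阈值的排序式指标"(threshold-free ranking-based measures)。下面用成对比较形式的 ROC-AUC 给出这类指标思路的一个示意性草图:给定已标注的抄袭对与非抄袭对的相似度分数,衡量抄袭对整体排在前面的概率,与任何判定阈值无关。函数名与数据均为笔者虚构,并非论文的实际实现。

```python
def ranking_auc(scores_pos, scores_neg):
    """Threshold-free ranking measure: probability that a plagiarized
    pair's similarity score outranks a non-plagiarized pair's (ties 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

AUC=1.0 表示所有抄袭对的分数都严格高于非抄袭对;0.5 表示该指标的排序不优于随机。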

[IR-5] Personalized Multi-Interest Modeling for Cross-Domain Recommendation to Cold-Start Users

【速读】:该论文旨在解决跨域推荐(Cross-domain Recommendation, CDR)中冷启动用户因缺乏目标域交互数据而导致推荐性能下降的问题,同时克服现有方法在建模用户个性化多兴趣偏好和捕捉用户间共性偏好方面的局限性。其解决方案的关键在于提出一种名为NF-NPCDR的个性化多兴趣建模框架:首先设计了一个个性化偏好编码器,通过将神经过程(Neural Process, NP)与归一化流(Normalizing Flow, NF)结合,将单一高斯分布转换为多模态分布,从而有效捕获用户的个性化多兴趣;其次引入一个公共偏好编码器,利用偏好池(preference pool)挖掘不同用户间的共性偏好;最后采用随机自适应解码器,动态融合个性化与公共偏好,实现对冷启动用户的精准推荐。

链接: https://arxiv.org/abs/2604.25732
作者: Xiaodong Li,Jiawei Sheng,Jiangxia Cao,Xinghua Zhang,Wenyuan Zhang,Yong Sun,Shirui Pan,Zhihong Tian,Tingwen Liu
机构: Chinese Academy of Sciences (中国科学院); National Natural Science Foundation of China (国家自然科学基金委员会); CPSF (中国博士后科学基金会); Guangdong Province (广东省); Guangdong Key Laboratory of Industrial Control System Security (广东省工业控制系统安全重点实验室)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Cross-domain recommendation (CDR) has demonstrated to be an effective solution for alleviating the user cold-start issue. By leveraging rich user-item interactions available in a richly informative source domain, CDR could improve the recommendation performance for cold-start users in the target domain. Previous CDR approaches mostly adhere the Embedding and Mapping (EMCDR) paradigm, which learns a user-shared mapping function to transfer users’ preference from the source domain to the target domain, neglecting users’ personalized preference. Recent CDR approaches further leverage the meta-learning paradigm, considering the CDR task for each user independently and learning user-specific mapping functions for each user. However, they mostly learn representations for each user individually, which ignores the common preference between different users, neglecting valuable information for CDR. In addition, all these approaches usually summarize the user’s preference into an overall representation, which can hardly capture the user’s multi-interest preference. To this end, we propose a personalized multi-interest modeling framework for CDR to cold-start users, termed as NF-NPCDR. Specifically, we propose a personalized preference encoder that enhances the neural process (NP) with the normalizing flow (NF) to convert the Gaussian (unimodal) distribution to a multimodal distribution, providing a novel way to capture the user’s personalized multi-interest preference. Then, we propose a common preference encoder with a preference pool to capture the common preference between different users. Furthermore, we introduce a stochastic adaptive decoder to incorporate both the personalized and common preference for cold-start users, adaptively modulating both preference for better recommendation.
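摘要称该方法用归一化流(NF)增强神经过程,把高斯(单峰)分布变换为多峰分布。其数学依据是归一化流的标准变量替换恒等式(这是通用公式,并非该论文的特有记号):

```latex
p_X(x) = p_Z\bigl(f^{-1}(x)\bigr)\,
         \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|
```

其中 $z = f^{-1}(x)$ 服从高斯基分布 $p_Z$,可逆变换 $f$ 的雅可比行列式项保证密度归一化;堆叠多层此类变换即可把单峰基分布弯折成多峰分布,同时保持密度可精确计算。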

[IR-6] From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms

【速读】:该论文旨在解决生成式搜索引擎(Generative Search Engines)中内容可见性与影响力评估的难题,即如何量化网页在生成式 AI 系统中被引用(citation selection)和被吸收进最终答案(citation absorption)的程度。其核心问题是现有方法仅依赖引用次数来衡量优化效果,忽略了引用质量与内容贡献度的差异。解决方案的关键在于提出一个两阶段测量框架——“生成式引擎优化”(Generative Engine Optimization, GEO),区分“引用选择”与“引用吸收”两个独立过程,并通过大规模实证分析发现:引用广度(如 Perplexity 和 Google 引用更多来源)与引用深度(如 ChatGPT 引用较少但影响更强)存在显著分化;高影响力页面具备更长文本、更强结构化、更高语义对齐性和更丰富可提取证据(如定义、数值事实、比较和步骤)。这表明 GEO 应超越单纯引用计数,将答案层面的内容吸收视为独立评估指标。

链接: https://arxiv.org/abs/2604.25707
作者: Zhang Kai,Yao Jingang
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 26 pages, 11 figures. Public dataset and analysis pipeline: this https URL

点击查看摘要

Abstract:Generative search engines increasingly determine whether online information is merely discoverable, cited as a source, or actually absorbed into generated answers. This paper proposes a two-stage measurement framework for Generative Engine Optimization (GEO): citation selection, where a platform triggers search and chooses sources, and citation absorption, where a cited page contributes language, evidence, structure, or factual support to the final answer. We analyze the public geo-citation-lab dataset covering 602 controlled prompts across ChatGPT, Google AI Overview/Gemini, and Perplexity; 21,143 valid search-layer citations; 23,745 citation-level feature records; 18,151 successfully fetched pages; and 72 extracted features. The central descriptive finding is that citation breadth and citation depth diverge. Perplexity and Google cite more sources on average, while ChatGPT cites fewer sources but shows substantially higher average citation influence among fetched pages. High-influence pages tend to be longer, more structured, semantically aligned, and richer in extractable evidence such as definitions, numerical facts, comparisons, and procedural steps. The results suggest that GEO should be measured beyond citation counts, with answer-level absorption treated as a separate outcome.

[IR-7] K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在电商搜索相关性任务中因领域知识边界缺失而导致的“长尾”场景下性能瓶颈问题,尤其在处理特定查询或小众商品时,模型因缺乏上下文感知能力而表现不佳。其解决方案的关键在于提出K-CARE框架,通过引入外部知识增强模型的认知能力:一是利用行为数据驱动的隐式知识进行对称式上下文锚定(Symmetrical Contextual Anchoring, SCA),填补语义空白;二是基于专家标注原型知识进行类比推理(Analogical Prototype Reasoning, APR),通过上下文类比校准决策边界,从而显著提升复杂工业场景下的搜索相关性表现。

链接: https://arxiv.org/abs/2604.25683
作者: Chen Yifei,Tian Zhixing,Wang Chenyang,Cheng Ziguang
机构: JD.COM(京东)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This paper targets e-commerce search relevance. While Large Language Models (LLMs) have demonstrated significant potential in this field, they often encounter performance bottlenecks in persistent ‘corner cases’ within complex industrial scenarios. Existing research primarily focuses on optimizing reasoning trajectories via Reinforcement Learning. However, real-world observations suggest that the primary bottleneck stems from knowledge boundaries, where the absence of domain-specific intelligence in the model’s parametric memory creates a contextual void. This void persists when interpreting idiosyncratic queries or niche products and cannot be resolved solely through reasoning-path optimization. To bridge this gap, we propose K-CARE, a framework that extends the model’s cognitive reach by grounding reasoning in external knowledge. K-CARE comprises two synergistic components: (1) Symmetrical Contextual Anchoring (SCA), which fills the contextual void by anchoring queries and products with behavior-derived implicit knowledge; and (2) Analogical Prototype Reasoning (APR), which leverages expert-curated prototypical knowledge to calibrate decision boundaries through in-context analogy. Extensive offline evaluations and online A/B tests on a leading e-commerce platform demonstrate that K-CARE significantly outperforms state-of-the-art baselines, delivering substantial commercial impact by resolving knowledge-intensive relevance challenges.

[IR-8] LLM -ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)生成摘要的可靠评估问题,尤其在跨异构领域和不同文档长度(2K–27K词)下的有效性挑战。传统基于词汇重叠的指标(如ROUGE、BLEU)与人类判断的相关性较弱甚至为负,而任务特定的神经网络指标和LLM-based评估器则表现出显著更高的对齐度,特别是在语言质量评估方面。解决方案的关键在于提出一种无需微调模型的自省式摘要框架LLM-ReSum,其通过将LLM-based评估与生成集成在一个闭环反馈机制中,实现对低质量摘要的自动优化;实验表明,该框架在三个领域内可提升事实准确性达33%、覆盖率达39%,且89%的人类评估偏好改进后的摘要。

链接: https://arxiv.org/abs/2604.25665
作者: Huyen Nguyen,Haoxuan Zhang,Yang Zhang,Junhua Ding,Haihua Chen
机构: University of North Texas (北德克萨斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: 15 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released in GitHub.
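摘要描述的核心机制是一个无需微调的"生成→评估→精炼"闭环。以下为该思路的极简草图,其中 summarize/evaluate/refine 是调用方自备的 LLM 封装,其接口签名纯属笔者假设,并非论文的实际 API:

```python
def reflective_summarize(doc, summarize, evaluate, refine,
                         threshold=0.8, max_rounds=3):
    """Closed generate->evaluate->refine loop (no model finetuning).

    evaluate(doc, summary) -> (score, feedback); the loop stops once the
    score clears `threshold` or `max_rounds` refinements are exhausted."""
    summary = summarize(doc)
    for _ in range(max_rounds):
        score, feedback = evaluate(doc, summary)
        if score >= threshold:
            break
        summary = refine(doc, summary, feedback)
    return summary
```

实际系统中这三个回调会各自调用一次 LLM;阈值与最大轮数是笔者为示意而设的超参数。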

[IR-9] Health System Scale Semantic Search Across Unstructured Clinical Notes

【速读】:该论文旨在解决在大型医疗系统中部署语义搜索(Semantic Search)所面临的工程、成本与治理挑战,以实现对数亿条临床笔记的高效检索。其核心问题是传统基于关键词匹配的检索方法难以满足临床信息获取的精准性需求,而现有语义搜索方案因规模和合规性限制无法在真实医疗环境中落地。解决方案的关键在于构建一个可扩展、低成本且符合HIPAA合规要求的语义搜索基础设施:采用指令微调的qwen3-embedding-0.6B嵌入模型(Embedding Model),结合300 token的文本分块策略(Chunking Strategy),利用存储优化索引的向量数据库与低延迟键值存储协同工作,并通过三阶段评估验证其性能——最终实现亚秒级查询延迟(中位数237 ms)、月均成本约4000美元,且在临床任务中显著提升图表提取效率(减少24%–89%时间),同时保持良好的一致性。

链接: https://arxiv.org/abs/2604.25605
作者: Faith Wavinya Mutinda,Spandana Makeneni,Anna Lin,Shivaji Dutta,Irit R. Rasooly,Patrick Dibussolo,Shivani Kamath Belman,Hessam Shahriari,Kevin Murphy,Alex B. Ruan,Barbara H. Chaiyachati,Sanjay Chainani,Robert W. Grundmeier,Scott M. Haag,Jeffrey M. Miller,Heather M. Griffis,Ian M. Campbell
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: for associated code, see this https URL

点击查看摘要

Abstract:Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children’s hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.
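论文报告 300-token 分块配合 qwen3 嵌入取得最优效果。下面用一个与具体分词器无关的草图示意这种固定长度分块;overlap 参数是笔者添加的常见选项,论文摘要并未说明是否使用块间重叠:

```python
def chunk_tokens(tokens, size=300, overlap=0):
    """Split a token sequence into fixed-size chunks; size=300 mirrors the
    paper's reported best setting. `overlap` is an illustrative extra."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

每个块各自编码为一个向量后写入向量库,因此 1.66 亿条笔记会膨胀为 4.84 亿个向量,与摘要中的数字一致。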

[IR-10] The Attention Market: Interpreting Online Fair Re-ranking as Manifold Optimization under Walrasian Equilibrium SIGIR’26

【速读】:该论文旨在解决在线公平重排序(fair re-ranking)中现有方法在不同设置下表现不一致的问题,其核心挑战在于如何在保证检索准确性的同时有效提升长尾项目(long-tail items)的可见性并增强组内多样性。解决方案的关键在于将公平重排序建模为一个由瓦尔拉斯均衡(Walrasian Equilibrium)驱动的注意力市场框架,其中公平性被形式化为一种税收成本;在此基础上,通过流形优化(manifold optimization)揭示了寻找该均衡等价于在由市场构建的特定排序流形上执行梯度下降。该方法提出ManifoldRank算法,通过两个关键梯度调整机制实现公平与准确性的平衡:一是基于不同公平要求的供给侧梯度调整,考虑相应的成本;二是基于排序分数的经验性需求侧梯度调整项。这种几何感知的梯度对齐策略使算法能够适应不同上下文场景下的流形结构差异,从而显著提升公平重排序的稳定性与有效性。

链接: https://arxiv.org/abs/2604.25577
作者: Chen Xu,Wei Chu,Wenyu Hu,Fengran Mo,Jun Xu,Maarten de Rijke
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); University of Montreal (蒙特利尔大学); University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR)
备注: Accepted in SIGIR’26

点击查看摘要

Abstract:Fair re-ranking aims to promote long-tail items and enhance diversity within groups in information retrieval. While previous research on online fairness-aware re-ranking has shown promising outcomes, our comprehensive evaluation of online fair re-ranking methods over 20 settings reveals significant performance disparities among existing methods. To uncover the root causes of these inconsistencies, we reformulate fair re-ranking within an attentional market framework governed by a Walrasian Equilibrium, where the fairness is treated as a taxation cost. This market-based formulation is then coupled with manifold optimization, demonstrating that seeking this equilibrium is equivalent to performing gradient descent on a specific ranking manifold constructed by the market. Different re-ranking settings induce distinct manifold geometries, and these intrinsic geometric differences dictate the gradient landscapes and optimization trajectories. We propose ManifoldRank, an efficient online fair re-ranking algorithm. ManifoldRank adjusts gradients to align with the ranking manifold, considering various contextual settings. On the supply side, it incorporates a gradient adjustment based on different fairness requirements, accounting for associated costs. On the demand side, it empirically predicts an additional gradient adjustment term derived from the ranking scores. By integrating these two gradient adjustments, ManifoldRank effectively balances fairness and accuracy. Experimental results across multiple datasets confirm ManifoldRank’s effectiveness.

[IR-11] A contemporary science map through the lens of IEEE and ACM periodicals

【速读】:该论文旨在识别并验证计算机与电气/电子工程领域内两大权威学术组织——ACM和IEEE所出版期刊标题中反映的当代科研趋势。其核心问题是:如何通过分析近期内刊名称的变化,揭示这些期刊在研究方向上的演化特征及其共性模式。解决方案的关键在于采用定性而非定量的方法,聚焦于期刊标题中的主题倾向与结构特征,从而发现两个协会均呈现出对开放获取(Open Access)模式的偏好、ACM期刊日益向人工智能(Artificial Intelligence, AI)领域倾斜,以及同一协会内部期刊间存在显著的主题重叠现象。

链接: https://arxiv.org/abs/2604.25487
作者: George Margaritis,Dionysios Kritsas,Dimitrios Katsaros,Yannis Manolopoulos
机构: University of Thessaly (塞萨洛尼基大学); University of York, Thessaloniki campus (约克大学,塞萨洛尼基校区)
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:ACM and IEEE are the two premier associations on computing and electrical/electronics engineering which publish and organize the great majority of periodicals and conferences, respectively, serving these disciplines. Science is a constantly evolving process, and these publication fora are expected to follow the trends. In this article, we focus on the periodicals published by the two associations and seek to detect and/or confirm any contemporary science trends as these are reflected to the periodical titles established recently. Our study is rather qualitative than quantitative, aiming at revealing patterns immediately comprehensible and validatable by the reader. Among the most notable patterns, we see a growing preference of both associations for the open access mode of publication; we also observe ACM’s orientation toward AI-focused periodicals, and most importantly, a significant theme overlap among periodicals of the same association and this is valid for both ACM and IEEE.

[IR-12] GeoSearch: Augmenting Worldwide Geolocalization with Web-Scale Reverse Image Search and Image Matching SIGIR2026

【速读】:该论文旨在解决全球图像地理定位(Worldwide image geolocalization)中因视觉多样性导致的挑战,特别是现有基于检索增强生成(Retrieval-Augmented Generation, RAG)和大型多模态模型(Large Multimodal Models, LMMs)的方法在面对训练数据库中未包含的场景时性能下降的问题。解决方案的关键在于提出GeoSearch框架,其核心创新是将网络规模的反向图像搜索(reverse image search)整合进RAG流程,通过从网页中提取文本证据和坐标信息来增强LMM提示(prompt),并引入两层过滤机制——首先进行图像匹配,再通过置信度门控过滤噪声内容,从而提升对开放世界场景的定位准确性。

链接: https://arxiv.org/abs/2604.25390
作者: Tung-Duong Le-Duc,Hoang-Quoc Nguyen-Son,Minh-Son Dao
机构: University of Science, VNU-HCM; National Institute of Information and Communications Technology
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SIGIR 2026 Main Conference

点击查看摘要

Abstract:Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocation framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordinates and textual evidence extracted from web pages. To mitigate noise from irrelevant content, we introduce a two-layer filtering mechanism consisting of image matching, followed by confidence-based gating. Experiments on standard benchmarks Im2GPS3k and YFCC4k demonstrate the superiority of GeoSearch under leakage-aware evaluation. Our code and data are publicly available to support reproducibility.

[IR-13] Stop Using the Wilcoxon Test: Myth Misconception and Misuse in IR Research SIGIR2026

【速读】:该论文旨在解决信息检索(Information Retrieval, IR)领域中长期存在的统计假设检验误用问题,特别是对Wilcoxon符号秩检验(Wilcoxon signed-rank test)被不当视为t检验的“安全替代”方法的现象进行批判性反思。研究表明,当前广泛采用的这种做法源于教材和实践中的误解,导致Wilcoxon检验在IR场景下极易失去对第一类错误率(Type I error rate)的控制,从而产生误导性结论。论文的关键解决方案在于通过系统文献回顾、理论分析与TREC数据的实证验证,明确指出Wilcoxon检验在IR评估中存在严重适用性缺陷,并主张彻底摒弃其在该领域的应用,以提升IR研究的方法论严谨性。

链接: https://arxiv.org/abs/2604.25349
作者: Julián Urbano
机构: 未知
类目: Information Retrieval (cs.IR); Applications (stat.AP); Methodology (stat.ME)
备注: 11 pages, 5 tables, 2 figures, ACM SIGIR 2026

点击查看摘要

Abstract:In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.
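作为参照,被批评的 Wilcoxon 符号秩统计量本身的计算如下。这是教科书式定义(丢弃零差值、对绝对差取秩、并列取平均秩、对正差值秩求和)的纯 Python 草图,仅用于说明检验对象,与论文对其在 IR 场景适用性的批评无关:

```python
def wilcoxon_signed_rank(a, b):
    """W+ statistic of the Wilcoxon signed-rank test for paired samples."""
    d = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    absd = sorted((abs(v), i) for i, v in enumerate(d))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(absd):
        j = i
        while j + 1 < len(absd) and absd[j + 1][0] == absd[i][0]:
            j += 1                      # extend over tied absolute values
        avg_rank = (i + j) / 2 + 1      # average rank for the tie group
        for k in range(i, j + 1):
            ranks[absd[k][1]] = avg_rank
        i = j + 1
    return sum(r for r, v in zip(ranks, d) if v > 0)
```

在 IR 基准测试中,a 与 b 通常是两个系统在各主题(topic)上的逐对指标分数。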

[IR-14] From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space

【速读】:该论文旨在解决现代推荐系统中列表级重排序(list-wise reranking)阶段存在的语义不一致动作空间问题:传统方法将重排序建模为从局部输入列表中选择索引,导致同一输出神经元(logits)在不同样本中对应不同物品,阻碍模型建立对物品的稳定内在理解。解决方案的关键在于提出GloRank(Global Action Space Ranker),其核心创新是将重排序任务重构为生成全局标识符的生成式框架——通过将物品表示为离散token序列,将重排序转化为token生成任务,从而解耦评分机制与输入顺序,确保物品始终基于一致的全局标准进行评估。此外,采用两阶段优化策略(监督预训练+强化学习后训练)进一步提升了模型性能与冷启动鲁棒性。

链接: https://arxiv.org/abs/2604.25291
作者: Pengyue Jia,Xiaobei Wang,Yingyi Zhang,Shuchang Liu,Yupeng Hou,Hailan Yang,Xu Gao,Xiaopeng Li,Yejing Wang,Julian McAuley,Xiang Li,Lantao Hu,Yongqi Liu,Kaiqiao Zhan,Han Li,Kun Gai,Xiangyu Zhao
机构: City University of Hong Kong (香港城市大学); Kuaishou Technology (快手科技); University of California San Diego (加州大学圣地亚哥分校)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In modern recommender systems, list-wise reranking serves as a critical phase within the multi-stage pipeline, finalizing the exposed item sequence and directly impacting user satisfaction by modeling complex intra-list item dependencies. Existing methods typically formulate this task as selecting indices from the local input list. However, this approach suffers from a semantically inconsistent action space: the same output neuron (logits) represents different items across different samples, preventing the model from establishing a stable, intrinsic understanding of the items. To address this, we propose GloRank (Global Action Space Ranker), a generative framework that shifts reranking from selecting local indices to generating global identifiers. Specifically, we represent items as sequences of discrete tokens and reformulate reranking as a token generation task. This design effectively decouples the scoring mechanism from the variable input order, ensuring that items are evaluated against a consistent global standard. We further enhance this with a two-stage optimization pipeline: a supervised pre-training phase to initialize the model with high-quality demonstrations, followed by a reinforcement learning-based post-training phase to directly maximize list-wise utility. Extensive experiments on two public benchmarks and a large-scale industrial dataset, coupled with online A/B tests, demonstrate that GloRank consistently outperforms state-of-the-art baselines and achieves superior robustness in cold-start scenarios.
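摘要称 GloRank 把物品表示为离散 token 序列,使重排序从"选局部下标"变为"生成全局标识符"。下面用一个通用的定长 base-K 编码草图说明"全局标识符"的含义;这只是语义 ID 类方法的示意,并非论文实际使用的编码器:

```python
def item_to_tokens(item_id, k=256, length=3):
    """Encode an item id as a fixed-length base-k token sequence, so every
    item gets the same global code regardless of its position in the list."""
    toks = []
    for _ in range(length):
        toks.append(item_id % k)
        item_id //= k
    return toks[::-1]

def tokens_to_item(tokens, k=256):
    """Decode a token sequence back to the global item id."""
    out = 0
    for t in tokens:
        out = out * k + t
    return out
```

由于输出词表对应全局 token 而非输入列表下标,同一输出 logit 在不同样本中始终指向同一物品,这正是摘要所说的语义一致动作空间。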

[IR-15] UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval ACL2026

【速读】:该论文旨在解决无监督域适应(Unsupervised Domain Adaptation, UDA)中神经检索模型在目标域上泛化能力不足的问题,核心挑战在于如何高效且高质量地选择用于伪查询生成的目标域文档。现有方法主要依赖文档多样性采样,但忽略了模型不确定性对学习效用的影响。论文提出了一种基于不确定性的迭代文档采样方法(UnIte),其关键创新在于:(1) 过滤掉具有高随机不确定性(aleatoric uncertainty)的文档以避免噪声干扰;(2) 优先选择具有高认知不确定性(epistemic uncertainty)的文档,从而最大化当前模型的学习潜力。实验表明,该方法在BEIR大规模语料库上显著提升了检索性能(nDCG@10提升达+3.49),且仅需较小的训练样本量(平均4k文档)。

链接: https://arxiv.org/abs/2604.25142
作者: Jongyoon Kim,Minseong Hwang,Seung-won Hwang
机构: Seoul National University (首尔国立大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: ACL 2026 (Findings)

点击查看摘要

Abstract:Unsupervised domain adaptation generalizes neural retrievers to an unseen domain by generating pseudo queries on target domain documents. The quality and efficiency of this adaptation critically depend on which documents are selected for pseudo query generation. The existing document sampling method focuses on diversity but fails to capture model uncertainty. In contrast, we propose Uncertainty-based Iterative Document Sampling (UnIte) addressing these limitations by (1) filtering documents with high aleatoric uncertainty and (2) prioritizing those with high epistemic uncertainty, maximizing the learning utility of the current model. We conducted extensive experiments on a large corpus of BEIR with small and large models, showing significant gains of +2.45 and +3.49 nDCG@10 with a smaller training sample size, 4k on average.
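UnIte 的两步采样逻辑(先过滤高随机不确定性文档,再优先选择高认知不确定性文档)可示意如下;两类不确定性分数如何估计在摘要中未给出,此处直接作为输入,属笔者假设的接口:

```python
def select_documents(docs, aleatoric, epistemic, noise_cutoff, budget):
    """Two-stage uncertainty-based sampling sketch: (1) drop documents whose
    aleatoric (noise) uncertainty exceeds the cutoff, (2) keep the `budget`
    documents with the highest epistemic (model) uncertainty."""
    kept = [d for d in docs if aleatoric[d] <= noise_cutoff]
    kept.sort(key=lambda d: epistemic[d], reverse=True)
    return kept[:budget]
```

选出的文档随后用于生成伪查询;直觉是认知不确定性高的文档对当前模型学习价值最大,而随机不确定性高的文档更可能引入噪声。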

[IR-16] CiteRadar: A Citation Intelligence Platform for Researcher Profiling and Geographic Visualization

【速读】:该论文旨在解决科研人员在职业发展、基金申请和合作发现中对学术引用地理范围与社区结构认知不足的问题,现有文献计量平台或需昂贵机构订阅,或仅提供聚合的引用计数而缺乏细粒度的作者元数据。其解决方案的关键在于提出一个名为 CiteRadar 的开源系统,通过整合 Google Scholar、OpenAlex、CrossRef、Semantic Scholar 和 OpenStreetMap Nominatim 五个异构数据源,构建了一个五阶段处理管道,实现从单一 Google Scholar 用户标识符出发,自动生成包含完整出版列表、引用文献及其丰富作者元数据、按引用频次与 h 指数排序的作者表、统计摘要及交互式世界地图的结构化输出。关键技术贡献包括:(1) 对 Google Scholar HTML 中 Unicode 非断行空格的鲁棒解析器,避免期刊名和年份字段被破坏;(2) 基于停用词过滤后的机构名称相似度的两阶段作者消歧系统,有效防止同名实体合并错误导致的 h 指数高估(最高达正确值的 9 倍);(3) OpenAlex 网页 URL 到 API URL 的转换修复机制,使城市级别位置数据覆盖率从 0% 提升至约 60%;(4) 使用对数缩放的 Folium 交互式世界地图,以城市为单位展示研究人员分布并支持弹窗详情查看,且整个地图可独立渲染为单个 HTML 文件。

链接: https://arxiv.org/abs/2604.25057
作者: Chenxu Niu,Yiming Sun
机构: NVIDIA Corporation(英伟达公司); Texas Tech University(德克萨斯理工大学)
类目: Machine Learning (cs.LG); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Understanding the geographic reach and community structure of one’s scholarly citations is increasingly valuable for career development, grant applications, and collaboration discovery – yet accessible tools for answering these questions remain scarce. Existing bibliometric platforms either require costly institutional subscriptions or expose only aggregate citation counts without granular per-author metadata. We present CiteRadar, an open-source system that accepts a single Google Scholar user identifier and automatically produces a structured output folder containing: the author’s complete publication list, all retrieved citing papers with enriched author metadata, two ranked author tables (by citation frequency and by h-index), a plain-text statistical summary, and a self-contained interactive HTML world map – all from a single command-line invocation. CiteRadar integrates five heterogeneous data sources – Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap Nominatim – through a carefully engineered five-stage pipeline. Key technical contributions include: (1) a Scholar meta-string parser resilient to Unicode non-breaking-space separators, a pervasive but undocumented quirk in Scholar’s HTML that silently corrupts venue and year fields when unhandled; (2) a two-stage author disambiguation system using stop-word-filtered institution name similarity to guard against the well-known same-name entity-merging failure mode in bibliometric databases, demonstrated to eliminate h-index attribution errors of up to 9x the correct value; (3) an OpenAlex web-URL to API-URL conversion fix that raises the fraction of author records with city-level location data from 0% to ~60%; and (4) a logarithmically-scaled interactive Folium world map with per-city researcher popups, rendered as a fully self-contained HTML file.
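CiteRadar 的输出之一是按 h 指数排序的作者表。h 指数的标准定义是:至少有 h 篇论文的被引次数不少于 h 次。下面是该通用定义的计算草图,与 CiteRadar 的具体实现无关:

```python
def h_index(citations):
    """h = the largest h such that at least h papers have >= h citations."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h
```

摘要中提到的 9 倍 h 指数高估,正是同名作者的引文被错误合并后再经此类计算放大的结果。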

[IR-17] Offline Evaluation Measures of Fairness in Recommender Systems

【速读】:该论文旨在解决推荐系统公平性评估指标存在的理论、实证和概念局限性问题,这些问题导致指标得分难以解释、适用场景不明确以及在某些情况下无法计算(如除零错误)。其关键解决方案在于:首先通过系统的理论与实证分析揭示现有指标在可解释性、表达能力和适用范围上的缺陷;其次提出新的评估方法和指标以克服这些局限;最后基于对指标局限性的深入理解,制定出适用于不同场景的公平性评估指标选用指南,从而提升推荐系统公平性评估的准确性与实用性。

链接: https://arxiv.org/abs/2604.25032
作者: Theresia Veronika Rampisela
机构: 未知
类目: Information Retrieval (cs.IR)
备注: PhD thesis

点击查看摘要

Abstract:The evaluation of recommender system fairness has become increasingly important, especially with recent legislation that emphasises the development of fair and responsible artificial intelligence. This has led to the emergence of various fairness evaluation measures, which quantify fairness based on different definitions. However, many of such measures are simply proposed and used without further analysis on their robustness. As a result, there is insufficient understanding and awareness of the measures’ limitations. Among other issues, it is not known what kind of model outputs produce the (un)fairest score, how the measure scores are empirically distributed, and whether there are cases where the measures cannot be computed (e.g., due to division by zero). These issues cause difficulty in interpreting the measure scores and confusion on which measure(s) should be used for a specific case. This thesis presents a series of papers that assess and overcome various theoretical, empirical, and conceptual limitations of existing recommender system fairness evaluation measures. We investigate a wide range of offline evaluation measures for different fairness notions, divided based on the evaluation subjects (users and items) and for different evaluation granularities (groups of subjects and individual subjects). Firstly, we perform theoretical and empirical analysis on the measures, exposing flaws that limit their interpretability, expressiveness, or applicability. Secondly, we contribute novel evaluation approaches and measures that overcome these limitations. Finally, considering the measures’ limitations, we recommend guidelines for the appropriate measure usage, thereby allowing for more precise selection of fairness evaluation measures in practical scenarios. Overall, this thesis contributes to advancing the state-of-the-art offline evaluation of fairness in recommender systems. 

[IR-18] Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

【速读】:该论文旨在解决现代深度学习推荐模型(Deep Learning Recommendation Models, DLRMs)在处理超长用户交互历史(User Interaction History, UIH)时,因行业标准的“Fat Row”范式导致的数据冗余问题——该范式将序列数据预加载至每个训练样本中,造成存储与I/O瓶颈,尤其在多租户环境中因不同模型对序列长度需求差异显著而加剧。解决方案的关键在于提出一种版本化延迟物化(versioned late materialization)范式:将UIH仅存储于一个标准化、不可变的层级中,通过轻量级版本指针在训练时按需重建序列,从而消除冗余;同时采用分叉协议保障在线到离线(Online-to-Offline, O2O)一致性,并利用读优化的不可变存储层支持多维投影下推,结合解耦预处理与流水线I/O预取及数据亲和性优化,有效隐藏序列重建延迟,使训练吞吐量由GPU计算能力主导,最终实现资源消耗降低与序列长度激进扩展的协同提升,成为当前推荐系统架构(如HSTU和ULTRA-HSTU)的核心数据基础设施。

链接: https://arxiv.org/abs/2604.24806
作者: Liang Guo,Ge Song,Litao Deng,Jianhui Sun,Chufeng Hu,Lu Zhang,Zhen Ma,Shouwei Chen,Weiran Liu,Sarang Masti Sreeshylan,Xiaoxuan Meng
机构: Meta Platforms, Inc.(Meta平台公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard “Fat Row” paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall where data infrastructure usage exceeds GPU training capacity due to data redundancy that is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a *versioned late materialization* paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
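作为补充,下面用一小段 Python 勾勒“版本化延迟物化”的核心思路(类名与接口均为本文的示意性假设,并非论文实际实现):UIH 按用户只存一份追加式事件列表,训练样本只携带轻量级版本指针;物化时按指针版本切片以防止未来信息泄漏,再按各租户模型的序列长度预算截断。

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VersionPointer:
    user_id: int
    version: int  # 样本生成时可见的 UIH 事件数

class UIHStore:
    """规范化、不可变层:每个用户只存一份追加式交互历史。"""
    def __init__(self):
        self._events = {}

    def append(self, user_id, event):
        events = self._events.setdefault(user_id, [])
        events.append(event)
        return VersionPointer(user_id, len(events))

    def materialize(self, ptr, max_len):
        # 训练时按需重建:只取指针版本之前的事件(防止未来泄漏),
        # 再按租户模型的序列长度预算截断
        events = self._events[ptr.user_id][: ptr.version]
        return events[-max_len:]

store = UIHStore()
store.append(1, "click:a")
ptr = store.append(1, "buy:b")   # 训练样本在此刻被记录,只保存指针
store.append(1, "click:c")       # 之后才发生的事件

short_seq = store.materialize(ptr, max_len=1)    # 短序列租户
long_seq = store.materialize(ptr, max_len=100)   # 长序列租户
print(short_seq)  # ['buy:b']
print(long_seq)   # ['click:a', 'buy:b'],不包含未来的 'click:c'
```

两个序列长度需求迥异的租户共享同一份存储,冗余被消除,同时切片上界保证了 O2O 一致性的“无未来泄漏”语义。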

人机交互

[HC-0] “The Worst Weather In America”: Augmenting the Information Design of Extreme Cold Weather Forecasts

【速读】:该论文旨在解决高山气象信息传达效率低的问题,特别是在极端天气条件下如何更有效地向访客传递冷气候灾害风险。其核心挑战在于当前文本密集的天气预报难以被不同背景和读写能力的游客快速理解,从而影响安全决策。解决方案的关键是引入颜色编码的危险图标(color-coded hazard icons),作为对传统文本预报的视觉化摘要,通过用户参与式工作坊的设计输入与众包研究验证其有效性。结果表明,图标显著增强了用户对登山活动风险的感知,但同时也揭示了可视化设计与伦理问题仍需进一步探索,以确保信息传达兼顾多样性读者的认知差异与体验背景。

链接: https://arxiv.org/abs/2604.25818
作者: Michael Correll,Jay Broccolo,Drew Bush
机构: Northeastern University (东北大学); Mount Washington Observatory (华盛顿山天文台)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Mount Washington is home to extreme, and extremely volatile, weather conditions. Consulting a weather forecast of conditions at the summit is vital for making one’s visit as safe as possible. Using the discussion and suggestions arising from a participatory workshop as input, we test a design intervention employing color-coded hazard icons to function as visual summaries of Mount Washington Observatory’s current text-heavy forecast through a crowd-sourced study. We find that the use of icons increases the perceived risk of activities involving visiting the mountain. However, we highlight remaining questions around visualization design and design ethics that warrant further study in the domain of how best to communicate cold weather hazards in ways that are mindful of the diversity of literacies and experiences of visitors.
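作为示意,下面的 Python 片段演示如何把预报数值映射为颜色编码的危险图标:风寒计算采用美国 NWS 公式,而颜色阈值纯属本文举例的假设,并非论文或 Mount Washington Observatory 的实际标准。

```python
def wind_chill_f(temp_f, wind_mph):
    """美国 NWS 风寒公式(适用于气温 <= 50F 且风速 > 3 mph)。"""
    return (35.74 + 0.6215 * temp_f
            - 35.75 * wind_mph ** 0.16
            + 0.4275 * temp_f * wind_mph ** 0.16)

def hazard_icon(temp_f, wind_mph):
    """把风寒值映射为颜色编码图标;阈值为示例假设。"""
    chill = wind_chill_f(temp_f, wind_mph)
    if chill <= -40:
        return "red"      # 极端:暴露皮肤数分钟内可冻伤
    if chill <= -15:
        return "orange"   # 高风险
    if chill <= 10:
        return "yellow"   # 中等风险
    return "green"        # 低风险

print(hazard_icon(-10, 30))  # orange(风寒约 -39F)
print(hazard_icon(40, 5))    # green
```

这类图标把文本密集的预报压缩为一眼可读的视觉摘要,正对应论文测试的设计干预。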

[HC-1] Lexical Anthropomorphization Influences on Moral Judgments of AI Bad Behavior

【速读】:该论文试图解决的问题是:人类化语言(anthropomorphic language)如何影响人们对人工智能(AI)不良行为的道德判断。具体而言,研究关注词汇性人类化(lexical anthropomorphism, LA)提示是否会影响人们对AI道德品质、行为道德性和行为责任的认知。解决方案的关键在于通过四项实验(总样本量N=1,020)系统检验了人类化语言与人类化设计线索(如图标、名称、自指表达)在不同类型的道德违规情境下对道德判断的影响,发现尽管人类化语言和设计线索对道德判断影响有限,但在某些情况下高人类化提示会增强对AI欺骗能力的感知;而道德违规类型(尤其是伤害和贬损类违规)才是最显著的预测因素,表明道德判断的核心依据是行为本身的性质而非AI的人类化表征。

链接: https://arxiv.org/abs/2604.25814
作者: Jaime Banks,Nicholas David Bowman,Roman Saladino
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Anthropomorphic language describing artificial intelligence (AI) is widespread in media, policy, and everyday discourse; so too are discussions of AI bad behavior, from hallucinations to inappropriate comments. How does humanizing language about AI shape moral judgments when AI behaves badly? Across four experiments (total N = 1,020), we tested whether lexical anthropomorphism (LA) primes shape judgments of AI moral character, behavior morality, and behavioral responsibility. Studies 1-3 tested interactions between anthropomorphic language and humanizing design cues (icons, names, self-referencing) in the context of amoral errors. Study 4 extended this to genuinely immoral AI behavior across seven moral-violation types. Results indicate humanizing language and design cues have little influence on moral judgments of misbehaving AI. Where effects emerged, high-anthropomorphic primes elevated perceptions of an AI’s capacity for dishonesty. The type of moral violation observed was the strongest predictor of moral judgments, with harm and degradation violations producing the broadest negative character assessments. Prime drift, horn effects, and egoistic value orientations emerged as potentially important predictors of AI moral judgments.

[HC-2] MAIC-UI: Making Interactive Courseware with Generative UI

【速读】:该论文旨在解决教育工作者在创建交互式STEM课程资源时面临的高技术门槛问题,即传统HTML/CSS/JavaScript开发流程难以普及,而现有生成式AI工具存在静态内容输出、长文档处理困难、缺乏教学准确性保障以及修改效率低下(需200–600秒重生成)等局限。其解决方案的关键在于提出MAIC-UI这一零代码创作系统,核心创新包括:(1)基于多模态理解的结构化知识分析以确保教学严谨性;(2)采用两阶段“生成-验证-优化”流水线,分离内容一致性与视觉呈现优化;(3)通过Click-to-Locate编辑结合统一差异(Unified Diff)驱动的增量生成机制,实现亚10秒级迭代更新,显著提升创作效率与可控性。

链接: https://arxiv.org/abs/2604.25806
作者: Shangqing Tu,Yanjia Li,Keyu Chen,Sichen Zhang,Jifan Yu,Daniel Zhang-Li,Lei Hou,Juanzi Li,Yu Zhang,Huiqin Liu
机构: Tsinghua University (清华大学); Guangzhou University (广州大学); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: You can try our demo at this https URL

点击查看摘要

Abstract:Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, leaving barriers for educators. While generative AI can produce HTML codes, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200–600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) structured knowledge analysis with multi-modal understanding to ensure pedagogical rigor; (2) a two-stage generate-verify-optimize pipeline separating content alignment from visual refinement; and (3) Click-to-Locate editing with Unified Diff-based incremental generation achieving sub-10-second iteration cycles. A controlled lab study with 40 participants shows MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text-to-HTML generation. A three-month classroom deployment with 53 high school students demonstrates that MAIC-UI fosters learning agency and reduces outcome disparities – the pilot class achieved 9.21-point gains in STEM subjects compared to -2.32 points in control classes. Our code is available at this https URL.
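下面的 Python 片段用标准库 difflib 示意“统一差异(Unified Diff)驱动的增量生成”背后的想法(文件名与 HTML 内容均为假设示例):一次修改只产生一个很小的 diff 补丁,而不必整页重生成,这正是亚 10 秒迭代得以成立的前提。

```python
import difflib

old_html = """<section id="quiz">
  <h2>Quiz</h2>
  <p>What is the boiling point of water?</p>
</section>
""".splitlines(keepends=True)

new_html = """<section id="quiz">
  <h2>Quiz</h2>
  <p>What is the boiling point of water at sea level?</p>
</section>
""".splitlines(keepends=True)

diff = list(difflib.unified_diff(old_html, new_html,
                                 fromfile="courseware.html",
                                 tofile="courseware.html"))
print("".join(diff))
# diff 只包含被修改的那一行及其上下文,
# 因此模型只需生成/应用这一小段补丁,而非重生成整个页面
```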

[HC-3] Designing and Evaluating Next-Generation Learning Interfaces: Linking AI HCI and the Learning Sciences

【速读】:该论文旨在解决当前交互式学习系统在支持学习过程中的不足,特别是如何通过人机协同(human-AI collaboration)提升学习效果的问题。其解决方案的关键在于设计并评估技术稳健、以人为本且具有教学理论基础的学习界面,强调跨学科对话以识别共同挑战、提炼设计原则,并推动下一代学习技术的研究方向。

链接: https://arxiv.org/abs/2604.25721
作者: Meng Xia,Yan Chen,Qiao Jin,Yang Shi,Paul Denny,Tiffany Barnes,Qingsong Wen,Vincent Aleven
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This workshop addresses this gap by bringing together researchers and practitioners from AI, HCI, and the learning sciences to explore how interactive systems can better support learning. We focus on the design and evaluation of human-AI collaborative learning interfaces that are technically robust, human-centered, and pedagogically grounded. By fostering interdisciplinary dialogue, the workshop aims to identify shared challenges, design principles, and research directions for next-generation learning technologies.

[HC-4] SlicerRoboTMS: An Open-Source 3D Slicer Extension for Robot-Assisted Transcranial Magnetic Stimulation

【速读】:该论文旨在解决机器人辅助经颅磁刺激(Robo-TMS)研发过程中因涉及医学影像、计算机视觉与机器人学等多学科交叉而带来的技术壁垒问题。其解决方案的关键在于开发了一个名为SlicerRoboTMS的开源3D Slicer扩展模块,该模块通过整合医学图像计算与可视化能力,支持基于磁共振成像(MRI)的神经导航,并借助标准化通信协议和可配置系统描述实现与多种机器人系统的无缝接口,从而为Robo-TMS研究提供统一、可复现且易于扩展的交互基础设施。

链接: https://arxiv.org/abs/2604.25661
作者: Wenzhi Bai,Yituo Guo,Bhaskar Basu,Andrew Weightman,Zhenhong Li
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Accepted by the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2026

点击查看摘要

Abstract:Robot-assisted Transcranial Magnetic Stimulation (Robo-TMS) is an image-guided robotic intervention that enhances the accuracy and reproducibility of conventional Transcranial Magnetic Stimulation (TMS), a widely used non-invasive brain stimulation procedure in clinical treatment and neuroscience research. Despite its potential, the development of Robo-TMS remains challenging due to the need for multidisciplinary expertise spanning medical imaging, computer vision, and robotics. This paper presents SlicerRoboTMS, an open-source 3D Slicer extension that provides a unified interaction infrastructure for Robo-TMS research. By leveraging 3D Slicer’s medical image computing and visualisation capabilities, the extension supports Magnetic Resonance Imaging (MRI)-based neuronavigation and interfaces with robotic systems through standardised communication protocols and configurable system descriptions. An example integration is presented to demonstrate how SlicerRoboTMS can be incorporated into a representative Robo-TMS workflow. Designed to support diverse hardware configurations and rapid prototyping, SlicerRoboTMS lowers the barrier to entry and facilitates reproducible and extensible research in Robo-TMS. The extension is available at this https URL.

[HC-5] ClayScape: A GenAI-Supported Workflow for Designing Chinese Style Ceramics with Clay 3D Printing

【速读】:该论文旨在解决传统陶瓷制作工艺中因步骤复杂且高度依赖手工技艺而导致的高技术门槛问题,尤其针对手工艺创作者在采用数字制造技术(如计算机辅助设计 CAD 和计算机辅助制造 CAM)时面临的技能障碍。其解决方案的关键在于设计了一种融合生成式 AI(Generative AI)与陶土 3D 打印的混合工作流(hybrid workflow),通过 ClayScape 设计工具实现该方法的落地应用,从而降低数字制造的技术壁垒,并在保留文化根基的前提下拓展创作者的创意可能性。

链接: https://arxiv.org/abs/2604.25657
作者: Sijia Liu,Hoi Ching Silvester Mok,Long Ling,Tobias Klein,Ray LC
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Designing Interactive Systems Conference (DIS '26)

点击查看摘要

Abstract:Chinese ceramic-making involves complex and interdependent steps, making it technically demanding. Digital fabrication methods attempt to make the process more accessible, but for craft-creators, technical challenges such as CAD and CAM skills remain major obstacles. To address this, we designed a hybrid workflow that integrates Generative AI with clay 3D printing to support new creative possibilities. We evaluated the workflow through ClayScape, a design tool that operationalizes this approach, with four ceramic creators. Our findings show that the workflow supports accessible ceramic creation while revealing both expanded opportunities for creative exploration and challenges in balancing agency and control. This work demonstrates how hybrid workflows can lower barriers to digital fabrication while supporting creative possibilities in culturally grounded ceramic practices.

[HC-6] Emotive Architectures: The Role of LLM s in Adjusting Work Environments

【速读】:该论文旨在解决远程与混合工作环境中物理空间与数字空间融合所面临的挑战,即如何通过技术手段提升空间体验、协作效率及人际互动质量。其核心问题在于:如何将生成式 AI(Generative AI)特别是大语言模型(Large Language Models, LLMs)有效集成到工作空间中,以实现对环境参数的动态响应和用户情绪状态的感知,从而增强专注力、幸福感与参与度。解决方案的关键在于构建一种“协同适应性环境”(co-adaptive environments)框架,该框架以人本设计为核心,利用 LLM 实时调节光照、声学或界面配置等物理属性,并通过透明、包容的设计策略应对隐私保护、情感追踪与用户自主权等伦理风险,从而推动具备情感敏感性和情境适应性的混合办公空间发展。

链接: https://arxiv.org/abs/2604.25601
作者: Lara Vartziotis,Tina Vartziotis,Frank Beutenmueller,Stella Salta,Konstantinos Moraitis,Miltiadis Katsaros,Sotirios Kotsopoulos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 19 pages, 1 Table

点击查看摘要

Abstract:In remote and hybrid work contexts, the integration of physical and digital environments is revolutionizing spatial experiences, collaboration, and interpersonal interactions. This study examines three fundamental spatial conditions: the physical environment, characterized by material and sensory attributes; the virtual environment, influenced by immersive technologies; and their fusion into hybrid environments where digital and physical components interact dynamically. The increasing number of AI tools in contemporary society, extensively utilized in both professional and personal spheres, has led to a varied landscape of developing technologies. For instance, ChatGPT has emerged as one of the most downloaded applications, a statistically substantiated fact that demonstrates the swift incorporation of language-based AI into daily life. It also underscores the function of large language models (LLMs) as meaningful bridges between concepts at reading emotional and behavioral signals via natural language. These models provide real-time modifications such as altering illumination, acoustics, or interface configurations, converting static settings into dynamic, emotionally receptive environments. We investigate the integration of language models into professional settings and their potential to enhance user experience by promoting focus, well-being, and engagement. The study investigates ethical concerns, including privacy, emotional tracking, and user agency, emphasizing the importance of inclusive and transparent design. This research formulates a framework for creating co-adaptive environments that merge technological innovation with human-centered experiences, offering a fresh viewpoint on responsive and supportive hybrid workspaces.

[HC-7] From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

【速读】:该论文旨在解决当前对大型语言模型(Large Language Models, LLMs)用于情感支持的跨文化采纳模式及其用户感知机制缺乏系统理解的问题。其解决方案的关键在于开展一项涵盖七个国家(美国、英国、德国、法国、西班牙、意大利和荷兰)共4641名参与者的大型横断面调查,并结合混合效应模型分离文化因素与人口统计学构成的影响,从而识别出影响用户信任度、使用意愿及感知益处的核心变量;研究进一步通过收集731条多语言真实用户提示语料,揭示了用户主要寻求帮助的四大类问题:孤独感、压力、人际关系冲突和心理健康困扰,为构建安全、知情且符合社会需求的情感支持型LLM系统提供了实证基础与政策启示。

链接: https://arxiv.org/abs/2604.25525
作者: Natalia Amat-Lefort,Mert Yazan,Amanda Cercas Curry,Flor Miriam Plaza-del-Arco
机构: Leiden University (莱顿大学); Hogeschool van Amsterdam (阿姆斯特丹应用科技大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 28 pages (9 pages main text, 19 pages references and appendices), 14 figures. The first two authors contributed equally

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that: Being aged 25-44, religious, married, and of higher socioeconomic status are predictors of positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.

[HC-8] Making the Invisible Visible: Toward Micro-Expression Visualization for Empathy in Social Interaction

【速读】:该论文旨在解决微表情(micro-expressions)在自然社交互动中难以被感知,从而限制其在以人为本场景下应用的问题。其解决方案的关键在于提出一个概念框架,通过将原本不可察觉的微表情转化为可感知的情感线索(affective cues),以探索其对共情体验的影响,并计划通过受控环境下的初步试点研究验证该框架的可行性。

链接: https://arxiv.org/abs/2604.25505
作者: Feiyang Yin,Isidro Butaslac,Patrick Gebhard,Monica Perusquia-Hernandez,Zhaofeng Niu,Taishi Sawabe,Hirokazu Kato
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心); Qufu Normal University (曲阜师范大学)
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages, 4 figures. Presented at the CHI 2026 Workshop “Shaping Future Human Connection: Social Augmentation through XR Technologies”

点击查看摘要

Abstract:Micro-expressions are brief and subtle facial movements that convey nuanced affective information but often remain imperceptible during natural social interaction. Although prior research has primarily focused on computational recognition and spotting of micro-expressions, their application in human-centered contexts remains limited. From the perspective of social augmentation, this work proposes a conceptual framework for micro-expression visualization that transforms otherwise imperceptible micro-expressions into perceptible affective cues, with the aim of exploring their potential influence on empathic experience. Furthermore, we outline a planned pilot study to preliminarily assess the feasibility of this framework under controlled conditions.

[HC-9] Generative UI as an Accessibility Bridge: Lessons from C2C E-Commerce

【速读】:该论文旨在解决用户生成内容(User-Generated Content, UGC)平台中静态无障碍标准难以应对动态、多样化内容呈现所导致的可访问性障碍问题,例如图片模糊或构图不当、描述缺失关键信息(如尺寸与状态)、页面结构不一致等。解决方案的关键在于引入生成式 UI(Generative UI),即在运行时根据用户需求和上下文动态生成适配界面,而非依赖预先设定的静态布局。通过三个实证干预——为屏幕阅读器重新生成 HTML、为老年卖家提供对话式引导、为视障卖家设计音频辅助照片构图——研究证明了生成式 UI 能够填补现有标准无法预见的空白,从而提升盲人、低视力及老年人用户的使用体验。这要求 HCI 设计实践从指定具体布局转向定义行为策略,使无障碍设计更具适应性和前瞻性。

链接: https://arxiv.org/abs/2604.25455
作者: Bektur Ryskeldiev
机构: Mercari(mercari); University of Tsukuba(筑波大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 7 pages, 1 figure. Expanded version of a position paper accepted at the CHI 2026 workshop “What does Generative UI mean for HCI Practice?” (Barcelona, 15 April 2026)

点击查看摘要

Abstract:Web accessibility rests on static standards and developer compliance. That model frays in platforms where content is user-generated: photos arrive blurry or off-frame, descriptions skip size and condition, and page structure shifts from listing to listing. Drawing on six studies conducted between 2022 and 2025 with blind, low-vision, and older adult users of customer-to-customer (C2C) marketplaces, I argue that generative UI can produce adapted interfaces at the point of use, addressing barriers that static design cannot anticipate. Three interventions from this program – HTML regeneration for screen readers, conversational guidance for older sellers, and audio-guided photo framing for blind sellers – demonstrate how runtime generation can bridge gaps that standards leave open. I outline what these findings imply for HCI practice: generative UI extends beyond the screen, complements rather than replaces ability-based design, and shifts the designer’s role from specifying layouts to specifying policies. This is an expanded arXiv version of a position paper accepted at the CHI 2026 workshop “What does Generative UI mean for HCI Practice?”

[HC-10] Rewiring Perceived Doability in VR: Hand Redirection as a Subtle Cross-Sensory Support for Sustained Practice

【速读】:该论文旨在解决日常生活中人们难以持续进行轻度运动与拉伸的问题,其核心挑战并非源于客观的身体限制,而是个体在当下对动作“可执行性”(perceived doability)的主观判断——即是否认为该行为在其能力范围内且所需努力可控。解决方案的关键在于利用虚拟现实(VR)中的手部重定向(hand redirection, HR)技术,在不突破用户感知极限的前提下,通过微小的、隐蔽的空间调整,使用户反复体验到“微成功”(micro-successes),如用相似物理动作更早达成虚拟目标,从而增强继续练习的意愿和早期重新参与的动力。这种方法无需外显压力或高强度指导,但同时也引发关于自主性与真实性的伦理考量,进而提出两个关键研究问题:HR如何影响感知可执行性以促进持续行为改变;以及HR在何种条件下构成可接受的支持,而非因削弱真实性、代理权或信任而适得其反。

链接: https://arxiv.org/abs/2604.25443
作者: Isidro Butaslac,Yota Nagaya,Almira Princess Redoble,Jordan Aiko Deja,Nicko Reginio Caluya,Maheshya Weerasinghe,Taishi Sawabe,Hirokazu Kato,Eric Cesar Vidal Jr
机构: NAIST(日本信息科学研究所); Ateneo de Manila University(阿特内奥大学); De La Salle University(德拉萨大学); Ritsumeikan University(立命馆大学); University of Primorska(普里莫尔斯卡大学)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 1 figure. Presented at the CHI 2026 Workshop “Cross-Sensory Futures: Rewiring Perception in HCI”. this https URL

点击查看摘要

Abstract:In everyday life, physical effort is often minimized and convenience is prioritized, making it difficult for many people to sustain light exercise and stretching despite well-known long-term benefits. This challenge often arises not from objective movement limitations, but from whether an action feels doable in the moment and, therefore worth continuing. This position paper argues that subtle VR hand redirection (HR) can be reframed as a form of cross-sensory support for sustained practice by targeting perceived doability: a moment-to-moment cognitive appraisal that an action is within one’s capability while requiring manageable effort. We propose that conservative HR, applied within known perceptual limits, can create repeated micro-success experiences (e.g., reaching a virtual goal earlier with similar physical movement). These micro-successes may increase continuation intention and early re-engagement without relying on overt pressure or intensive coaching. At the same time, such support raises questions about autonomy and authenticity. We therefore articulate two research questions: (RQ1) how HR shifts perceived doability to support sustained practice and positive behavior change; and (RQ2) when HR functions as acceptable support versus becoming counterproductive by undermining authenticity, agency, trust, or fostering dependence. We present an initial sit-and-reach VR prototype, outline a research plan, and identify key design tensions to spark community discussions on autonomy-preserving cross-sensory futures in HCI.
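作为参考,下面用 Python 勾勒最常见的线性手部重定向(body-warping)形式:随伸手进度逐渐混入虚拟目标与真实目标之间的偏移,使虚拟手“提前”到达目标。目标位置与偏移量均为示例假设,并非论文原型的实际标定。

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def redirect_hand(real_pos, start, real_goal, virtual_goal):
    """随伸手进度线性混入目标偏移(经典 body-warping 重定向)。"""
    total = dist(start, real_goal)
    progress = min(1.0, dist(start, real_pos) / total) if total else 1.0
    offset = [(v - r) * progress for v, r in zip(virtual_goal, real_goal)]
    return [p + o for p, o in zip(real_pos, offset)]

# 真实目标在 0.5 m 处,虚拟目标显示在 0.4 m 处(更近),
# 用户以相近的身体动作"更早"达成虚拟目标——即文中的微成功体验
start, real_goal, virtual_goal = [0.0, 0.0], [0.5, 0.0], [0.4, 0.0]
halfway = redirect_hand([0.25, 0.0], start, real_goal, virtual_goal)
done = redirect_hand([0.5, 0.0], start, real_goal, virtual_goal)
print(halfway)  # ≈ [0.2, 0.0]:偏移只施加了一半
print(done)     # ≈ [0.4, 0.0]:虚拟手正好落在虚拟目标上
```

偏移量需保持在感知阈限之内(文中的“保守 HR”),否则重定向会被察觉,触及 RQ2 所说的真实性与信任问题。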

[HC-11] Recommending Usability Improvements with Multimodal Large Language Models

【速读】:该论文旨在解决传统可用性评估方法对专家知识和资源依赖性强、难以在小型团队或缺乏可用性专家的环境中实施的问题。其解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)自动分析有限的应用上下文信息和用户交互屏幕录制,识别基于尼尔森可用性启发式原则(Nielsen’s usability heuristics)的可用性问题,并生成具有严重性排序的改进建议。该方法显著降低了开发者的手动优先级判断负担,且通过面向软件工程师的用户研究验证了推荐建议的质量与实用性,为未来将自动化可用性评估集成到软件工程工作流中提供了可行路径。

链接: https://arxiv.org/abs/2604.25420
作者: Sebastian Lubos,Alexander Felfernig,Damian Garber,Viet-Man Le,Manuel Henrich
机构: Graz University of Technology (格拉茨工业大学); UNiQUARE Software Development (UNiQUARE软件开发公司)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)

点击查看摘要

Abstract:Usability describes quality attributes of application user interfaces that determine how effectively users can interact with them. Traditional usability evaluation methods require considerable expertise and resources, which can be challenging, especially for small teams and organizations. Automating usability evaluation could make it more accessible and help to improve the user experience. The recent emergence of powerful multimodal large language models (MLLMs) has opened new opportunities for automating usability evaluation and recommendation of improvements. These models can process visual inputs such as images and videos alongside textual context, which enables the identification of usability issues and the generation of actionable suggestions to resolve these issues. In this paper, we present a novel automated approach that uses limited application context and screen recordings of user interactions as input to an MLLM. The model automatically identifies and describes usability issues based on Nielsen's usability heuristics, and provides corresponding explanations and improvement recommendations. To reduce the developer effort of manual prioritization, the recommendations are ranked by severity. The quality and practical usefulness of the generated recommendations were evaluated based on a user study that involved software engineers as participants. The evaluation focused on the highest-ranked suggestions provided by the model. The results demonstrate the potential of our approach to provide low-effort usability improvement recommendations. This makes it a promising complement to traditional evaluation methods, especially in settings with limited access to usability experts. In this sense, the approach serves as a basis for future integration into development tools to enable automated usability evaluation within software engineering workflows.
Comments: Accepted for publication at the ACM International Conference on the Foundations of Software Engineering (FSE 2026). Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC). Cite as: arXiv:2604.25420 [cs.SE]. https://doi.org/10.48550/arXiv.2604.25420. Related DOI: https://doi.org/10.1145/3797121
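下面用一小段 Python 示意“按严重性排序的可用性改进建议”这类输出的可能形态(数据结构与字段均为本文的假设,论文并未规定具体格式;严重性等级参照 Nielsen 的 1=外观问题至 4=可用性灾难评分):

```python
from dataclasses import dataclass

@dataclass
class UsabilityIssue:
    heuristic: str       # 违反的 Nielsen 启发式原则
    recommendation: str  # 改进建议
    severity: int        # 1=外观问题 ... 4=可用性灾难

issues = [
    UsabilityIssue("Visibility of system status",
                   "表单提交后显示确认提示", severity=3),
    UsabilityIssue("Consistency and standards",
                   "统一两个不同的删除图标", severity=2),
    UsabilityIssue("Error prevention",
                   "为破坏性操作增加确认或撤销", severity=4),
]

# 按严重性降序排列,减少开发者手动优先级判断的负担
ranked = sorted(issues, key=lambda i: i.severity, reverse=True)
for issue in ranked:
    print(f"[{issue.severity}] {issue.heuristic}: {issue.recommendation}")
```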

[HC-12] Co-Writing with AI: An Empirical Study of Diverse Academic Writing Workflows

【速读】:该论文试图解决的问题是:尽管生成式 AI (Generative AI) 工具在学术实践中日益普及,但大学本科生如何将其整合到写作过程中仍缺乏系统认知。研究聚焦于学生在不同写作阶段(构思、资料获取、规划、起草和审校)中使用 AI 的模式及其受个体因素(如 AI 素养、写作自信、信任度、作者身份顾虑及动机)的影响。解决方案的关键在于通过两项实证研究——一项针对107名英国大学生的问卷调查与另一项对12名研究生的深度访谈——识别出三种具有价值导向的 AI 使用配置:早期阶段(学习导向型)、晚期阶段(质量导向型)和边缘辅助(生产力导向型),从而构建了一个工作流层面的 AI 支持学术写作框架,揭示学生如何权衡学习、质量、效率与作者责任等多重目标,并评估与承担 AI 生成内容的责任。

链接: https://arxiv.org/abs/2604.25389
作者: Silvia Bodei,Duncan P. Brumby,Katie Fisher,Jon Mella
机构: University College London (伦敦大学学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 25 pages, 1 table, 5 figures. Accepted at CHIWORK 2026 (ACM Symposium on Human-Computer Interaction for Work)

点击查看摘要

Abstract:Despite AI tools becoming increasingly embedded in academic practice, little is known about how university students integrate them into their writing processes. We examine how students engage with AI across different writing tasks, and how this engagement is shaped by individual factors including AI literacy, writing confidence, trust, authorship concerns, and motivation. Study 1 surveys 107 UK university students to map task-specific and co-occurring patterns of AI use across five writing stages (ideation, sourcing, planning, drafting, and reviewing) and their associations with individual factors. Study 2 complements this by exploring how these patterns can be assembled in practice, through interviews with 12 postgraduates reflecting on their established use of AI in assessed writing. Together, the studies suggest that AI integration is selective and heterogeneous, forming three recurring and value-oriented configurations: (1) early-stage (learning-oriented), where tools support exploration and understanding; (2) late-stage (quality-oriented), where tools support drafting and refinement; and (3) peripheral (productivity-oriented), where tools are used to reduce friction and sustain momentum across the process. We offer a workflow-level account of AI-supported academic writing, showing how students navigate competing priorities of learning, quality, productivity, and authorship, and how they evaluate and take responsibility for AI-generated outputs.

[HC-13] Author response to commentaries on H is for Human and How (Not) to Evaluate Qualitative Research in HCI

【速读】:该论文旨在解决人机交互(Human-Computer Interaction, HCI)领域中对质性研究(qualitative research)评价标准不统一、方法论争议较多的问题。其核心解决方案在于强调“以人为本”(H is for Human)的研究范式,主张通过理解用户经验的复杂性和情境依赖性来评估质性研究的价值,而非单纯依赖量化指标或形式化的评估框架。作者认为,质性研究的关键优势在于揭示人类行为背后的意义结构,因此评价应聚焦于研究设计是否充分回应了人类经验的深度与多样性。

链接: https://arxiv.org/abs/2604.25312
作者: Andy Crabtree
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This is the author's response to commentaries on the original article H is for Human and How (Not) to Evaluate Qualitative Research in HCI, this https URL Commentaries were provided by: Jeffrey Bardzell, this https URL Alan Blackwell, this https URL Paul Dourish, this https URL Bonnie Nardi, this https URL Peter Pirolli, this https URL Jennifer Rode, this https URL Peter Tolmie, this https URL Please feel free to copy, redistribute, adapt, and build on any part of this article in accordance with the CC BY 4.0 license: this https URL

[HC-14] Visual Boosting Techniques for Spatiotemporal Dense Pixel Visualizations

【速读】:该论文旨在解决将二维地理空间数据线性化为一维排序时引入的结构失真问题,这类失真会以视觉伪影形式出现在密集像素可视化中,干扰对真实时空模式的识别。解决方案的关键在于提出一种度量驱动的可视化分析方法:通过邻域保持度量(neighborhood preservation measures)量化一维排序中的视觉伪影,并利用符号增强技术(如图示符号、光环和交叉阴影)进行视觉强化(visual boosting),从而帮助分析人员可靠地区分真实的地理空间模式与由线性化过程产生的伪影。

链接: https://arxiv.org/abs/2604.25298
作者: Julius Rauscher,Frederik L. Dennig,Udo Schlegel,Daniel A. Keim,Tobias Schreck
机构: University of Konstanz (康斯坦茨大学); LMU MCML Munich (慕尼黑大学麦克莱姆研究所); TU Graz (格拉茨技术大学)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 4 figures, to appear at the 17th International EuroVis Workshop on Visual Analytics

点击查看摘要

Abstract:The analysis of spatiotemporal data is essential in domains such as epidemiology and environmental monitoring, where understanding the interplay between spatially distributed phenomena and their temporal evolution is critical. Dense pixel visualizations offer a compact, effective overview of spatiotemporal dynamics. However, the necessary linearization of 2D geographic space into a 1D ordering inevitably introduces structural distortions that manifest as visual artifacts. We propose a measure-driven visual analytics approach that captures visual artifacts through neighborhood preservation measures for 1D orderings and renders them using visual boosting techniques such as glyphs, halos, and hatching. We demonstrate our approach through a usage scenario analyzing COVID-19 incidence data across German districts, showing that interactive, measure-driven boosting enables analysts to reliably distinguish genuine spatial patterns from linearization artifacts.
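下面的 Python 片段示意一种最简单的邻域保持度量(论文采用的具体度量可能不同):对每个区域,比较其在二维空间与一维排序中的 k 近邻重叠率;低重叠处即可能出现线性化伪影、需要施加字形/光环/阴影等视觉增强的位置。

```python
def knn_2d(points, i, k):
    """二维空间中的 k 近邻(按平方距离排序,跳过自身)。"""
    order = sorted(range(len(points)),
                   key=lambda j: (points[i][0] - points[j][0]) ** 2 +
                                 (points[i][1] - points[j][1]) ** 2)
    return set(order[1:k + 1])

def knn_1d(order, i, k):
    """一维排序中的 k 近邻(按位置差排序)。"""
    pos = {v: p for p, v in enumerate(order)}
    nearest = sorted((j for j in order if j != i),
                     key=lambda j: abs(pos[j] - pos[i]))
    return set(nearest[:k])

def preservation(points, order, k=2):
    """每个区域的 2D/1D 邻域重叠率,值越低越可能是线性化伪影。"""
    return [len(knn_2d(points, i, k) & knn_1d(order, i, k)) / k
            for i in range(len(points))]

# 2x2 网格上的四个区域,按行扫描方式线性化:
points = [(0, 0), (1, 0), (0, 1), (1, 1)]
order = [0, 1, 2, 3]
print(preservation(points, order))  # [1.0, 0.5, 0.5, 1.0]
```

行扫描把纵向相邻的区域(如 1 与 3)在排序中拉远,得分较低的位置正是需要视觉增强提示读者的地方。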

[HC-15] Value-Sensitive AI for Prayer: Balancing the Agencies Between Human and AI Agents in Spiritual Context

【速读】:该论文旨在探讨人工智能(AI)介入祈祷体验时可能引发的价值冲突问题,特别是如何在技术干预下保持用户对神圣连接的“真实性”感受。其解决方案的关键在于:AI系统设计应优先保障用户的自主性(agency),通过维持解释上的开放性来支持个体意义建构——例如将AI的不可解释性(inexplicability)转化为个人化理解的资源,或承认“不使用AI”本身即是一种合理且值得尊重的设计选择。

链接: https://arxiv.org/abs/2604.25230
作者: Soonho Kwon,Dong Whi Yoo,Shaowen Bardzell,Younah Kang
机构: Georgia Institute of Technology, School of Interactive Computing (佐治亚理工学院,交互计算学院); Indiana University Indianapolis, Luddy School of Informatics, Computing, and Engineering (印第安纳大学伯明顿分校,信息、计算与工程学院); Yonsei University, Information and Interaction Design (延世大学,信息与交互设计专业)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Designing Interactive Systems Conference (DIS '26), June 13–17, 2026, Singapore, Singapore

点击查看摘要

Abstract:We present four conceptual value-sensitive AI systems to examine how the presence of AI could influence praying experiences. Drawing on key values and practices associated with praying identified through a diary study, we designed AI systems intended to “assist” prayer practices. These designs were presented to participants through speculative design workbooks, serving as provocations to co-reflect on how the intervention of AI systems might shape their praying experiences. Our findings suggest that a sense of authenticity (or feeling a genuine connection to the divine) is a crucial value, while the presence of AI was often perceived as diminishing this authenticity, particularly when AI assumed too much agency in guiding praying practices. Based on our findings, we argue that AI system designs for deeply value-laden experiences should preserve users’ agency in shaping their own experiences by maintaining interpretive openness, perhaps by leveraging AI’s inexplicability as a resource for personal meaning-making or by recognizing non-use of AI as a legitimate design choice.

[HC-16] People IT and Structuration (PIS): An Integrative Theoretical Framework for Management Information Systems

【速读】:该论文旨在解决管理信息系统(Management Information Systems, MIS)领域长期存在的理论碎片化问题,即如何整合社会技术系统理论、技术接受模型、适应性结构化理论到社会物质性等多元理论流派,以揭示人、信息技术(Information Technology, IT)与组织结构之间复杂且相互建构的关系。其解决方案的关键在于提出“人-IT-结构化”(People - IT - Structuration, PIS)框架,该框架基于吉登斯的结构化理论,将人(P)、IT(I)和结构(S)视为动态互构的要素,而非独立变量,并通过形式化的命题阐明三者在持续结构化过程中协同演化的机制,从而统一解释从传统信息系统到人工智能、算法管理及人机协作等新兴现象。

链接: https://arxiv.org/abs/2604.25118
作者: Wei Huang,Xiaofang Cai,Qiaozhen Guo,Xiaosong Wu,Xin Tang
机构: 未知
类目: Human-Computer Interaction (cs.HC); General Literature (cs.GL)
备注:

点击查看摘要

Abstract:The Management Information Systems (MIS) discipline has long grappled with how to theorize the complex, mutually constitutive relationships among people, information technology, and organizational structures. Decades of research have produced influential but fragmented theoretical streams, from socio-technical systems theory to technology acceptance models and from adaptive structuration theory to sociomateriality, each illuminating important facets while leaving integrative questions unresolved. This paper proposes the People - IT - Structuration (PIS) framework as a unifying theoretical lens that synthesizes these streams. Drawing on Giddens’ structuration theory, we conceptualize People (P), Information Technology (I), and Structure (S) not as independent variables but as mutually constitutive elements engaged in ongoing structuration processes. We trace the intellectual history of MIS theorizing to demonstrate how PIS resolves persistent tensions in the field, e.g., between technological and social determinism, between variance and process approaches, and between micro-level interaction and macro-level institutional dynamics. We develop a set of formal propositions articulating the mechanisms through which P, I, and S co-evolve, and extend the framework to address contemporary phenomena including artificial intelligence, algorithmic management, and human-AI collaboration. The PIS framework offers both a retrospective lens for understanding the discipline’s theoretical evolution and a prospective tool for guiding research in the AI era.

[HC-17] The Dynamics of Delusion: Modeling Bidirectional False Belief Amplification in Human-Chatbot Dialogue

【速读】:该论文旨在解决人工智能聊天机器人(AI chatbots)是否可能加剧用户妄想信念的问题,特别是探讨人类与聊天机器人之间是否存在双向反馈循环机制。其解决方案的关键在于构建了一个潜变量状态模型(latent state model),能够量化分析人类与聊天机器人在对话过程中相互影响的累积与衰减效应。研究发现,虽然人类对聊天机器人的影响较强但短暂,而聊天机器人对人类的影响则更持久且稳定;更重要的是,聊天机器人自身具有持续强化其输出内容的“自我影响”能力,这种自增强机制成为长期维持妄想信念的主要路径。这一发现首次提供了定量证据,表明人机交互可形成具有不同时间动态特性的妄想反馈回路,为开发更安全的生成式 AI 系统提供了理论依据。

链接: https://arxiv.org/abs/2604.25096
作者: Ashish Mehta,Jared Moore,Jacy Reese Anthis,William Agnew,Eric Lin,Peggy Yin,Desmond C. Ong,Nick Haber,Carol Dweck
机构: Stanford University (斯坦福大学); Carnegie Mellon University (卡内基梅隆大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:There is growing concern that AI chatbots might fuel delusional beliefs in users. Some have suggested that humans and chatbots mutually reinforce false beliefs over time, but quantitative evidence is lacking. Using a unique dataset of chat logs from individuals who exhibited delusional thinking, we developed a latent state model that captures accumulating and decaying influences between humans and chatbots. We find that a bidirectional influence model substantially outperforms a unidirectional alternative where humans are the primary driver of delusion. We find that humans exert strong but short-lived influence on chatbots, whereas chatbots exert longer-lasting influence on humans. Moreover, chatbots exert strong, stable self-influence over their own future outputs that tends to perpetuate delusions over long stretches of conversation. In fact, this chatbot self-influence constituted the dominant pathway when considering accumulated influence over time. Overall, these results indicate that humans tend to drive sharp, immediate increases in delusion, whereas chatbots sustain and propagate these effects over longer timescales. Together, these findings provide the first quantitative evidence that human-chatbot interactions can form feedback loops of delusion, decomposable into distinct pathways with dissociable temporal dynamics. By doing so, they can inform the development of safer AI systems.
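
论文潜变量模型的核心是“影响会累积、也会衰减”。下面用一个极简的 Python 片段示意这种双向影响动力学(各衰减系数与权重均为便于演示的假设取值,并非论文拟合参数),可复现“人类影响强但短促、机器人影响及其自我影响更持久”的定性行为:

```python
def simulate_influence(human_inputs,
                       decay_h2b=0.3,   # 人类→机器人:影响强但衰减快
                       decay_b2h=0.9,   # 机器人→人类:衰减慢、更持久
                       decay_self=0.9,  # 机器人自我影响:最持久的通路
                       w_h2b=1.0, w_b2h=0.4, w_self=0.05):
    """双向影响潜状态模型的示意版本(假设性实现,非论文原始模型)。"""
    s_h2b = s_b2h = s_self = 0.0
    trajectory = []
    for x in human_inputs:
        s_h2b = decay_h2b * s_h2b + w_h2b * x            # 人类输入立即推高对机器人的影响
        bot_out = s_h2b + s_self                          # 机器人输出由两路影响共同决定
        s_self = decay_self * s_self + w_self * bot_out   # 机器人对自身未来输出的自我强化
        s_b2h = decay_b2h * s_b2h + w_b2h * bot_out       # 机器人对人类的累积影响
        trajectory.append((s_h2b, s_b2h, s_self))
    return trajectory
```

在单位脉冲输入后持续静默的情形下,人类→机器人通路迅速衰减,而机器人侧的两条通路衰减明显更慢,与论文“人类驱动短期激增、机器人维持长期传播”的定性结论一致。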

[HC-18] Feature Anchors for Time-Series Sensor-Based Human Activity Recognition

【速读】:该论文旨在解决可穿戴设备中人体活动识别(Human Activity Recognition, HAR)领域长期存在的问题:现有方法要么依赖手工设计的时间序列特征(Time-Series Features, TSFs),这些特征虽具语义明确性但难以自适应调整;要么采用深度模型直接从原始信号中学习隐式表示,虽然具备自适应能力却缺乏可解释性。解决方案的关键在于将TSFs作为“特征锚点”(feature anchors)保留在模型内部,并通过神经上下文动态调节其尺度、偏置和门控参数,从而在保持特征语义可见性的前提下实现对分类目标的灵活适配。这种机制使模型既能利用TSFs的物理可解释性,又能通过上下文感知的调制实现端到端优化,显著提升性能并验证了显式且可调特征在HAR中的核心价值。

链接: https://arxiv.org/abs/2604.25092
作者: Ruijie Yao,Chenhang Li,Danyang Zhuo,Tingjun Chen,Xiaoyue Ni
机构: Duke University (杜克大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Wearable Human Activity Recognition (HAR) still lacks a representation that is both explicit and adaptable. Handcrafted time-series features (TSFs) capture meaningful motion statistics and remain competitive on standard benchmarks, but they are usually used as fixed preprocessing outputs. Deep models learn adaptable representations directly from raw signals, but those representations are typically latent and difficult to inspect. We address this gap by treating handcrafted TSFs as feature anchors: explicit intermediate representations that remain inside the model and are adjusted by neural context instead of being discarded. We propose the Temporal Conditioning Network for Feature Anchors (TCNet), which extracts handcrafted anchors, encodes complementary time-domain and frequency-domain context from raw IMU windows, and predicts context-conditioned scale, bias, and gating parameters to modulate anchor groups directly in feature space. This design keeps anchor semantics visible while allowing the representation to adapt to the classification objective. Across five HAR benchmarks, TCNet achieves 70.2% mF1 on USC-HAD, 85.1% mF1 on Daphnet, 93.9% mF1 on MHealth, and 94.5% mF1 on PAMAP2. Relative to rTsfNet, it improves by 4.5 points on USC-HAD, 14.6 points on Daphnet, and 6.5 points on MHealth. Ablations show that the gains come primarily from anchor guidance rather than simple branch fusion, and feature-space analyses indicate that several discriminative TSF families are not reliably accessible in standard latent representations. These results suggest that, for HAR, handcrafted TSFs are most useful when they remain explicit and adaptable within the model. The code is available at: this https URL
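
TCNet 对特征锚点的调制可以概括为 `gate * (scale * anchor + bias)`。下面给出一个不依赖深度学习框架的极简示意:真实模型中 scale、bias、gate 由神经上下文编码器按锚点组预测,这里用固定线性映射代替,属演示性假设:

```python
import math

def modulate_anchors(anchors, context):
    """上下文条件调制手工特征锚点的示意(参数映射为假设,非 TCNet 实现)。"""
    # 用固定线性映射代替神经上下文编码器
    scale = [1.0 + 0.1 * c for c in context]
    bias = [0.05 * c for c in context]
    gate = [1.0 / (1.0 + math.exp(-c)) for c in context]  # sigmoid 门控
    # 调制公式:gate * (scale * anchor + bias),锚点语义在特征空间中保持可见
    return [g * (s * a + b) for a, s, b, g in zip(anchors, scale, bias, gate)]
```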

[HC-19] AFA: Identity-Aware Memory for Preventing Persona Confusion in Multi-User Dialogue

【速读】:该论文旨在解决多用户共享语音助手时出现的“人格混淆”(persona confusion)问题,即系统因混用不同用户的对话历史和偏好而导致个性化响应失真,进而降低用户体验与信任度。解决方案的关键在于提出自适应好友代理(Adaptive Friend Agent, AFA)框架,其核心是结合基于语音的说话人识别技术与每个用户的独立记忆存储机制,实现身份感知的个性化对话路由。通过构建包含133个用户画像和12种场景的合成数据集PAT(Personalized Agent chaT),并在五个大语言模型(LLM)后端上验证,AFA显著提升了人格归属准确性(Persona Attribution Accuracy, PAA)——从35.7%提升至61.3%,且人类评估也证实了身份感知路由能显著增强响应的个性化程度。

链接: https://arxiv.org/abs/2604.25022
作者: Mohammad Al-Ratrout,Pavan Uttej Ravva,Shayla Sharmin,Aditya Raikwar,Ju Young Shin,Roghayeh Leila Barmaki
机构: University of Delaware (特拉华大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:When multiple people share a single voice assistant, the system conflates their histories: one resident’s preferences can leak into another’s responses, eroding utility and trust. We call this failure mode persona confusion, and we show it is a measurable problem in today’s single-user dialogue systems when deployed in shared environments. We present the Adaptive Friend Agent (AFA), a modular framework that combines voice-based speaker identification with per-user memory stores to enable identity-aware, personalized dialogue across multiple users. To support training and evaluation, we construct PAT (Personalized Agent chaT), a synthetic dataset of 58,289 persona-grounded dialogue turns spanning 133 user profiles and 12 real-world scenarios. We evaluate AFA across five LLM back-ends in a standard response-quality benchmark, with a LLaMA-2-70B model fine-tuned on PAT achieving the highest overall performance. To directly measure persona confusion prevention, we introduce an interleaved multi-user evaluation protocol with a novel metric, Persona Attribution Accuracy (PAA), demonstrating that identity-aware routing improves PAA from 35.7% to 61.3%. Human evaluation confirms annotators perceive significantly higher personalization in routing-enabled responses. Our results establish that identity-aware user routing is the critical component for preventing persona confusion in multi-user conversational systems.
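
AFA 防止人格混淆的关键是按说话人身份路由到各自独立的记忆库。以下是该思路的一个假设性极简实现(省略了语音说话人识别部分,仅示意路由与记忆隔离):

```python
class IdentityAwareRouter:
    """按用户身份隔离对话记忆的示意实现(假设性,非 AFA 原始代码)。"""

    def __init__(self):
        self.memories = {}  # user_id -> 仅属于该用户的对话历史

    def handle(self, user_id, utterance):
        mem = self.memories.setdefault(user_id, [])
        # 只用当前用户自己的历史构造上下文,其他用户的偏好不会泄漏进来
        context = list(mem)
        mem.append(utterance)
        return context
```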

[HC-20] “We Wanted to Do Better Than the Law”: Exploring UI/UX Designers’ Privacy Advocacy in Practice

【速读】:该论文旨在解决当前隐私设计研究中对UI/UX设计师角色关注不足的问题,即现有文献多聚焦于开发者的隐私实现挑战,而忽视了设计师在塑造用户界面(UI)和用户体验(UX)过程中如何权衡隐私因素。其解决方案的关键在于通过12名具有隐私倡导意识的UI/UX设计师的半结构化访谈,系统揭示设计师对隐私的认知、影响因素、协作挑战及适应策略,并特别关注团队决策情境下隐私优先级与商业目标、技术实现和团队动态之间的张力。研究强调需从组织层面推动隐私意识设计的文化变革,并借助以设计师为中心的工具和社区建设来弥合知识鸿沟,从而促进以用户为中心的隐私友好型设计实践。

链接: https://arxiv.org/abs/2604.24982
作者: Keyu Yao,Jinghui Cheng,Jin L.C. Guo
机构: McGill University (麦吉尔大学); Polytechnique Montreal (蒙特利尔理工学院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to ACM CSCW 2026

点击查看摘要

Abstract:Designers hold primary responsibility for shaping the user interface (UI) and user experience (UX) of a product. This role goes beyond aesthetics and usability, extending to the privacy outcomes of user experience, which often emerge through collaboration with other stakeholders such as developers, product managers, and marketing teams. Previous studies on enhancing privacy for technological products primarily focused on the roles of developers – understanding their needs and challenges – but limited effort is devoted to examining how UI/UX designers consider and approach privacy in their work. Through 12 semi-structured interviews with privacy-advocating UI/UX designers, we explore the perceptions, influencing factors, challenges, and adaptive methods they use regarding privacy implementation. We pay special attention to how these challenges and adaptations play out in team-based settings where decisions are negotiated together. Our study reveals how personal and contextual factors shape designers’ value of privacy, the collaborative nature of the challenges designers face when trying to prioritize privacy, and how they navigate tensions between business goals, team dynamics, and technical development. Based on our findings, we discuss implications for advocating a user-centered approach for supporting privacy-aware design, suggestions for organizational-level changes and bridging knowledge gaps through designer-centric tools and community building.

[HC-21] What If We Work Together? Fostering Reflections on Designer Inclusion in Open Source Software Through Speculative Design

【速读】:该论文旨在解决开源软件(Open Source Software, OSS)社区中因开发者主导思维和设计能力匮乏导致的可用性与用户体验(User Experience, UX)不足问题,从而限制了非技术用户对OSS的采纳。其解决方案的关键在于引入推测性设计(Speculative Design),通过构建两个具有不同价值观的虚构社会——Husia(集体主义)和Reetar(个人主义),以激发OSS从业者对现有社区价值体系、设计角色缺失根源及改进路径的深度反思。实证研究表明,这种设计干预能够有效提升参与者对包容性、可持续性和公平性的认知,并为构建更具设计师参与度的OSS生态提供可行建议。

链接: https://arxiv.org/abs/2604.24981
作者: Rozhan Hozhabri Nezhad,Jin L.C. Guo,Jinghui Cheng
机构: Polytechnique Montreal (蒙特利尔理工学院); McGill University (麦吉尔大学)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Accepted to ACM CSCW 2026

点击查看摘要

Abstract:Open source software (OSS) often prioritizes technical functionality over usability and UX design. This imbalance limits OSS adoption among broader, non-technical users. Key underlying factors contributing to this issue are the shortage of design expertise in OSS and a dominant developer-centric mindset. To address these persistent issues, we explore the potential of speculative design as a catalyst for transforming the OSS community’s mindset towards a more designer-inclusive environment. Our design was informed by an analysis of online forums, which revealed designers’ motivations and challenges when contributing to OSS. Guided by these insights, we created two speculative societies, Husia (collectivist) and Reetar (individualist), in which designers are valued for different reasons and their work incorporated in different ways. Through a user study with 12 OSS practitioners (seven designers and five developers), we found that our speculative societies provoked participants’ rich and critical reflections on OSS values, the root causes of challenges, and proposed actions. Our work provides insights into how speculative design can be used in the practical, sociotechnical context of OSS to stimulate critical reflection, improve awareness, and yield recommendations for fostering an equitable, sustainable, and inclusive OSS environment.

[HC-22] A Survey on LLM-based Conversational User Simulation

【速读】:该论文旨在解决 conversational user simulation(对话用户模拟)领域中因缺乏系统性梳理而阻碍研究进展的问题,尤其在大型语言模型(Large Language Models, LLMs)兴起后,如何有效组织和理解其在该领域的应用成为关键。解决方案的关键在于提出一个新颖的分类体系,涵盖用户粒度(user granularity)与模拟目标(simulation objectives),并系统分析核心技术和评估方法,从而为研究社区提供统一框架,促进未来研究的发展,并识别当前存在的开放挑战。

链接: https://arxiv.org/abs/2604.24977
作者: Bo Ni,Leyao Wang,Yu Wang,Branislav Kveton,Franck Dernoncourt,Yu Xia,Hongjie Chen,Reuben Leura,Samyadeep Basu,Subhojyoti Mukherjee,Puneet Mathur,Nesreen Ahmed,Junda Wu,Li Li,Huixin Zhang,Ruiyi Zhang,Tong Yu,Sungchul Kim,Jiuxiang Gu,Zhengzhong Tu,Alexa Siu,Zichao Wang,David Seunghyun Yoon,Nedim Lipka,Namyong Park,Zihao Lin,Trung Bui,Yue Zhao,Tyler Derr,Ryan A. Rossi
机构: Vanderbilt University; Adobe Research; Yale University; University of Oregon; University of California San Diego; Dolby Laboratories; University of California, Berkeley; Cisco AI Research; University of Southern California; Texas A&M University; UC Davis
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Submitted in August 2025. MOD-81000 approved survey

点击查看摘要

Abstract:User simulation has long played a vital role in computer science due to its potential to support a wide range of applications. Language, as the primary medium of human communication, forms the foundation of social interaction and behavior. Consequently, simulating conversational behavior has become a key area of study. Recent advancements in large language models (LLMs) have significantly catalyzed progress in this domain by enabling high-fidelity generation of synthetic user conversation. In this paper, we survey recent advancements in LLM-based conversational user simulation. We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies. We aim to keep the research community informed of the latest advancements in conversational user simulation and to further facilitate future research by identifying open challenges and organizing existing work under a unified framework.

[HC-23] Vega-Video: Integrating Video into the Grammar of Graphics

【速读】:该论文旨在解决视频数据(video data)与传统数据在可视化交互中难以集成的问题,尤其针对视频数据特有的范式差异和性能挑战。其核心解决方案在于将视频数据可视化抽象为三类操作——同步(synchronization)、标注(annotation)和变换(transformation),并将其整合进Vega声明式语法(declarative grammar)体系中;关键创新点包括:提出一种“分信号”(split-signal)架构以维持声明式语义的同时屏蔽视频播放器状态更新延迟,以及在编译时检测连续拖拽交互(continuous scrubbing)并应用编码感知优化(encoding-aware optimizations),从而实现最高达4倍的响应速度提升,并借助VOD协议实现实时视频变换,在多小时长视频上仍可保持低于200ms的更新延迟。

链接: https://arxiv.org/abs/2604.24958
作者: Dominik Winecki,Arnab Nandi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Video data is increasingly used alongside conventional data for interactive data exploration, necessitating interfaces for exploring and presenting mixed-modality data. However, integrating video into visualizations remains difficult due to its distinct paradigms and inherent performance challenges. We identify three classes of video data visualization - synchronization, annotation, and transformation - and integrate them into the Vega declarative grammar. We show that these abstractions enable high-performance implementation. To reconcile Vega’s instantaneous dataflow with video player state, we introduce a split-signal architecture that preserves declarative semantics while masking video update delays. We detect continuous scrubbing interactions at compile time to apply encoding-aware optimizations that improve responsiveness by up to 4x. We also repurpose VOD protocols to transform videos in real time, delivering sub-200ms updates even on multi-hour-long compilations. These contributions enable seamless integration of conventional and video data visualization.
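
split-signal 架构的要点是把“立即更新的目标值”与“播放器实际到达的值”拆成两个信号:前者保持声明式数据流的即时语义,后者在播放器异步 seek 完成后再同步,从而掩盖视频更新延迟。下面是该思路的一个假设性极简示意(与 Vega 实际 API 无关):

```python
class SplitSignal:
    """split-signal 思路的示意:target 立即更新,actual 等播放器回调后对齐。"""

    def __init__(self, value=0.0):
        self.target = value  # 数据流立即可见的目标播放时间
        self.actual = value  # 播放器真正到达的播放时间

    def request_seek(self, t):
        self.target = t      # 立即更新,下游可视化不必等待视频 seek

    def on_player_seeked(self, t):
        self.actual = t      # 播放器 seek 完成的异步回调
```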

[HC-24] V.O.I.C.E (Voice Ownership Identity Control Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data

【速读】:该论文旨在解决生成式语音模型(Generative Voice Models)在快速演进过程中,因未经授权收集、重用和合成语音数据而引发的隐私、安全与治理风险问题,这些问题目前难以被现有统一威胁模型所覆盖。其解决方案的关键在于提出一个名为V.O.I.C.E.的语音生成风险分类体系(taxonomy),该体系基于多源威胁建模方法,整合了来自AI事件数据库、美国联邦贸易委员会(FTC)和互联网犯罪投诉中心(IC3)的569起事件、1067份来自美国不同群体(包括配音演员、网络名人、政界人士及普通公众)的直接报告,以及2221条Reddit讨论内容,从而从真实世界数据中提炼出风险演化机制,并明确刻画风险如何与暴露程度、社会可见性及法律保护可用性等情境因素相互作用。

链接: https://arxiv.org/abs/2604.24794
作者: Tanusree Sharma,Anish Krishnagiri,Lili Dudas,Ahmed Adnan,Visar Berisha
机构: Penn State University (宾夕法尼亚州立大学); Arizona State University (亚利桑那州立大学); University of Dhaka (达卡大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As generative voice models are rapidly advancing in both capabilities and public utilization, the unconsented collection, reuse, and synthesis of voice data are introducing new classes of privacy, security and governance risk that are poorly captured by existing, largely uniform threat models. To fill the gap, we present V.O.I.C.E, a taxonomy of voice generation risk grounded in a multi-source threat modeling effort with 569 incidents from a major AI incident database, the FTC, and the Internet Crime Complaint Center (IC3); 1067 direct incident reports from U.S.-based participants across diverse groups (including voice actors, internet personalities, political personnel, and general public); and 2,221 Reddit discussions. Grounded in real-world data, our taxonomy explicitly models how risks emerge and interact with contextual factors such as degree of exposure, social visibility, and the availability of legal protections for various affected groups.

[HC-25] One-shot emergency psychiatric triage across 15 frontier AI chatbots

【速读】:该论文旨在解决前沿生成式 AI (Generative AI) 聊天机器人在精神科分诊(psychiatric triage)任务中的性能评估问题,尤其是其在真实单条消息披露情境下对临床紧急程度判断的准确性与可靠性。研究的关键在于构建了一个包含112个临床情景 vignette 的基准测试集,每个情景标注为四种分诊等级(A–D)之一,覆盖9类精神科表现和9个风险维度,并以50名医学专家的共识标签作为金标准。结果表明,AI模型在识别紧急情况(Level D)时几乎无误(准确率94.3%),但对低至中等风险情形存在显著过度分诊(Level B 准确率仅19.7%,总体平均有符号误差+0.47级),揭示出当前AI在复杂心理风险判别中存在系统性高估倾向,为后续改进模型的分诊逻辑与风险感知能力提供了关键实证依据。

链接: https://arxiv.org/abs/2604.25415
作者: Veith Weilnhammer,Lennart Luettgau,Christopher Summerfield,Viknesh Sounderajah,Elise Wilkinson,Virginia Corno,Matthew M Nour
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI chatbots are increasingly used for health advice, but their performance in psychiatric triage remains undercharacterized. Psychiatric triage is particularly challenging because urgency must often be inferred from thoughts, behavior, and context rather than from objective findings. We evaluated the performance of 15 frontier AI chatbots on psychiatric triage from realistic single-message disclosures using 112 clinical vignettes, each paired with 1 of 4 original benchmark triage labels: A, routine; B, assessment within 1 week; C, assessment within 24 to 48 hours; and D, emergency care now. Vignettes covered 9 psychiatric presentation clusters and 9 focal risk dimensions, organized into 28 presentation-by-risk groups. Each group contributed 4 distinct vignettes, with 1 vignette at each triage level. Each vignette was rendered as a realistic human-authored conversational query, and the AI chatbots were tasked with assigning a triage label from that disclosure. Emergency under-triage occurred in 23 of 410 level D trials (5.6%), and all under-triaged emergencies were reassigned to level C urgency. Across target models, average accuracy ranged from 42.0% to 71.8%. Accuracy was highest for level D vignettes (94.3%) and lowest for level B vignettes (19.7%). Mean signed ordinal error was positive (+0.47 triage levels), indicating net over-triage. Dispersion was highest around the middle triage levels. All results were confirmed relative to clinician consensus labels from 50 medical doctors. When presented with user messages containing sufficient clinical information, frontier AI chatbots thus recognized psychiatric emergencies as requiring urgent medical assessment with near-zero error rates, yet showed marked over-triage for low and intermediate risk presentations. 
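
文中的两个核心指标——平均有符号分级误差(正值代表净过度分诊)与 D 级紧急情形的低估率——可按如下方式计算(示意实现):

```python
LEVELS = {"A": 0, "B": 1, "C": 2, "D": 3}  # 分诊等级的序数编码

def triage_metrics(pred, gold):
    """计算平均有符号分级误差与 D 级紧急情形的低估(under-triage)率。"""
    signed = [LEVELS[p] - LEVELS[g] for p, g in zip(pred, gold)]
    mean_signed = sum(signed) / len(signed)  # 正值 = 净过度分诊
    d_idx = [i for i, g in enumerate(gold) if g == "D"]
    under_d = sum(1 for i in d_idx if LEVELS[pred[i]] < LEVELS["D"]) / len(d_idx)
    return mean_signed, under_d
```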

[HC-26] Interpretable Fuzzy Modeling Reveals Population-Level Representation Differences in P300 Brain Computer Interfaces Across Neurodivergent and Neurotypical Cohorts

【速读】:该论文旨在解决P300脑机接口(Brain-Computer Interface, BCI)在不同人群中的神经表征差异问题,尤其是这些差异如何影响解码模型的学习结构和性能。以往研究多关注信号层面或个体表现差异,而忽视了模型内部所学习到的代表性结构变化。论文提出了一种可解释的模糊时空框架(interpretable fuzzy spatiotemporal framework),其关键在于引入具有可学习原型的空间与时间模糊滤波器,不仅实现对P300事件相关电位(Event-Related Potential, ERP)的分类,还能重建各人群特有的模糊中心(fuzzy centers)。实验表明,该方法在肌萎缩侧索硬化症(ALS)、自闭症(AUT)和神经典型(NT)人群中均表现出竞争性性能,并揭示出群体依赖的波形形态与表示几何结构差异,从而为构建面向人群特异性的P300-BCI提供了可解释的新路径。

链接: https://arxiv.org/abs/2604.24765
作者: Xiaowei Jiang,Sudong Shang,Adrian Wilkinson,Michael L. Platt,Da Xiao,Bening Cao,Thomas Do
机构: 未知
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:P300-based brain-computer interfaces (BCIs) are widely used for communication, but population heterogeneity may alter the neural patterns available for decoding. Prior work has mainly examined such differences at the signal or performance level, while the representation structure learned by the decoder remains underexplored. In this study, we propose an interpretable fuzzy spatiotemporal framework for P300 classification and use it to analyze population-level differences across amyotrophic lateral sclerosis (ALS), autism (AUT), and neurotypical (NT) cohorts. The model employs spatial and temporal fuzzy filters with learnable prototypes, enabling both classification and reconstruction of cohort-specific fuzzy centers. Experiments were conducted on ALS and NT subsets from bigP3BCI and on the BCIAUT-P300 benchmark in a within-subject setting. The proposed model achieved competitive performance against multiple deep learning baselines. More importantly, the reconstructed fuzzy centers revealed systematic cohort-dependent differences in waveform morphology and representation geometry. Point-wise statistical analysis identified significant temporal differences between cohorts, including intervals overlapping with the canonical P300 window, and low-dimensional embeddings showed partially separated cohort-specific prototype organizations. These results suggest that population heterogeneity in P300-BCI is reflected not only in decoding performance but also in the discriminative structure learned by the model. The proposed framework provides an interpretable route toward population-aware P300-BCI analysis and design.
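
模型中的模糊滤波器以可学习原型为中心计算隶属度。下面以高斯型隶属函数给出一个极简示意(函数形式与参数为模糊系统中的常见做法,属假设,论文未必采用完全相同的形式):

```python
import math

def fuzzy_membership(x, center, sigma=1.0):
    """以可学习原型 center 为中心的高斯型模糊隶属度(示意)。"""
    # 离原型越近,隶属度越接近 1;sigma 控制隶属函数的宽度
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
```

重建各人群特有的模糊中心时,比较的正是这类隶属函数的原型参数在不同队列间的差异。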

计算机视觉

[CV-0] Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

【速读】:该论文旨在解决当前深度伪造检测模型在真实世界复合退化(如模糊和严重有损压缩)条件下出现的空间注意力漂移问题,从而导致性能显著下降。其解决方案的关键在于提出一种基于基础模型驱动的取证框架,核心创新包括:1)构建极端复合退化引擎,在训练中系统性破坏高频伪影,促使DINOv2-Giant主干网络提取不变的几何与语义先验;2)设计结构约束的多流架构,包含全局纹理流、局部人脸流及融合CLIP的混合语义融合流,以提取非冗余且互补的特征表示;3)通过Score-CAM分析空间归属和余弦相似度评估特征稳定性,最终采用校准离散投票机制聚合预测结果,有效抑制背景注意力漂移并作为鲁棒的几何锚点,实现稳定的零样本泛化能力。

链接: https://arxiv.org/abs/2604.25889
作者: Minh-Khoa Le-Phan,Minh-Hoang Le,Trong-Le Do,Minh-Triet Tran
机构: University of Science, VNU-HCM, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4th place (out of 94 teams) in the NTIRE 2026 Robust Deepfake Detection Challenge

点击查看摘要

Abstract:Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. Through analyzing spatial attribution via Score-CAM and feature stability using Cosine Similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating these predictions via a calibrated, discretized voting mechanism, our ensemble successfully suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at this https URL.
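
论文最终通过校准的离散化投票聚合三路流的预测。下面是“离散化后取多数”这一步的极简示意(阈值与校准细节为演示性假设,非论文原始设置):

```python
def discretized_vote(probs, threshold=0.5):
    """离散化多数投票的示意:先按阈值把各流概率离散为 0/1,再取多数。"""
    votes = [1 if p >= threshold else 0 for p in probs]
    # 多数票:超过半数的流判为伪造才输出 1
    return 1 if sum(votes) * 2 > len(votes) else 0
```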

[CV-1] No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control

【速读】:该论文旨在解决传统行人过街信号灯采用固定时长控制策略导致弱势道路使用者(Vulnerable Road Users, VRUs)如老年人、残障人士或分心行人可能在绿灯结束时仍滞留于路口而无法安全通过的问题。解决方案的关键在于提出一种实时自适应交通信号系统——No Pedestrian Left Behind (NPLB),其核心由三部分构成:基于BGVP数据集微调后的YOLOv12目标检测模型实现高精度行人识别(mAP@0.5达0.756),结合ByteTrack多目标跟踪算法持续追踪跨线行人状态,以及一个自适应控制器根据剩余通行时间是否低于阈值动态延长行人相位。实验证明,该系统在10,000次蒙特卡洛仿真中将VRU滞留率从9.10%降至2.60%,提升安全性71.4%,且仅需在12.1%的过街周期中进行信号延时,具备良好的实用性与鲁棒性。

链接: https://arxiv.org/abs/2604.25887
作者: Anas Gamal Aly,Hala ElAarag
机构: Stetson University (斯泰森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: © Anas Gamal Aly and Hala ElAarag, 2026. This is the authors’ version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in Proceedings of the 2026 ACM Southeast Conference (ACMSE 2026)

点击查看摘要

Abstract:Current pedestrian crossing signals operate on fixed timing without adjustment to pedestrian behavior, which can leave vulnerable road users (VRUs) such as the elderly, disabled, or distracted pedestrians stranded when the light changes. We introduce No Pedestrian Left Behind (NPLB), a real-time adaptive traffic signal system that monitors VRUs in crosswalks and automatically extends signal timing when needed. We evaluated five state-of-the-art object detection models on the BGVP dataset, with YOLOv12 achieving the highest mean Average Precision at 50% (mAP@0.5) of 0.756. NPLB integrates our fine-tuned YOLOv12 with ByteTrack multi-object tracking and an adaptive controller that extends pedestrian phases when remaining time falls below a critical threshold. Through 10,000 Monte Carlo simulations, we demonstrate that NPLB improves VRU safety by 71.4%, reducing stranding rates from 9.10% to 2.60%, while requiring signal extensions in only 12.1% of crossing cycles.
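
NPLB 自适应控制器的核心逻辑是:当人行横道内仍检测到 VRU 且剩余绿灯时间低于临界阈值时,延长行人相位。以下为该决策规则的示意实现(阈值、延长量与最大延长次数均为假设取值,非论文实测参数):

```python
def maybe_extend_phase(remaining_time, vru_in_crosswalk,
                       critical_threshold=5.0, extension=4.0,
                       max_extensions=3, used=0):
    """行人相位延长决策的示意:返回 (新的剩余时间, 已用延长次数)。"""
    if vru_in_crosswalk and remaining_time < critical_threshold and used < max_extensions:
        return remaining_time + extension, used + 1  # 延长相位
    return remaining_time, used  # 无 VRU 或时间充足时不干预
```

上限 `max_extensions` 用于防止行人相位被无限延长而阻塞车流,这一约束也解释了为何仿真中仅 12.1% 的周期需要延时。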

[CV-2] SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在分布外(out-of-distribution, OOD)场景下可靠性不足的问题,尤其关注如何在保证用户定义风险水平的前提下提升系统覆盖率(coverage),即模型能够回答的输入比例。传统方法依赖于对答案的置信度评分并设定阈值进行拒绝预测,但难以适应复杂OOD场景。解决方案的关键在于提出SIEVES框架,其核心创新是设计了一个显式学习视觉定位质量的选择器(selector),要求推理模型在作答时提供局部化的视觉证据(localized visual evidence),从而实现更可靠的预测决策。该方法显著提升了多个挑战性OOD基准上的覆盖率(最高达三倍),且无需访问推理模型的权重或logits即可迁移至专有模型(如o3和Gemini-3-Pro),体现出良好的泛化能力。

链接: https://arxiv.org/abs/2604.25855
作者: Hector G. Rodriguez,Marcus Rohrbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
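
选择性预测的评估围绕“覆盖率-风险”权衡展开:给定置信阈值,系统只回答得分不低于阈值的输入,并在其余输入上弃答。其标准定义可示意如下(与 SIEVES 的具体选择器无关,仅为通用计算):

```python
def coverage_risk(scores, correct, tau):
    """给定置信阈值 tau,计算覆盖率与选择性风险(被回答样本的错误率)。"""
    answered = [c for s, c in zip(scores, correct) if s >= tau]  # c 为 0/1 正确性
    coverage = len(answered) / len(scores)
    risk = 0.0 if not answered else 1 - sum(answered) / len(answered)
    return coverage, risk
```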

[CV-3] Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

【速读】:该论文旨在解决长时程音频-视频同步的快速自回归生成问题,核心挑战在于联合建模音频与视频信号以及实现高效的因果生成。解决方案的关键是提出“互强制”(Mutual Forcing)框架,该框架通过两阶段训练策略先分别训练单模态生成器,再将其耦合为统一的音视频模型进行联合优化;同时,在流式生成中直接训练一个原生因果模型,而非依赖多阶段蒸馏流程。其创新性在于在一个参数共享的模型中集成少步(few-step)和多步(multi-step)生成模式,利用自蒸馏机制提升训练-推理一致性,并通过两种模式之间的协同作用增强整体性能。相比现有方法如Self-Forcing,Mutual Forcing无需额外的双向教师模型,支持更灵活的训练序列长度,降低训练开销,并可直接从真实配对数据中学习,显著提升效率与生成质量。

链接: https://arxiv.org/abs/2604.25819
作者: Yupeng Zhou,Lianghua Huang,Zhifan Wu,Jiabao Wang,Yupeng Shi,Biao Jiang,Daquan Zhou,Yu Liu,Ming-Ming Cheng,Qibin Hou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at this https URL.

[CV-4] Magnification-Invariant Image Classification via Domain Generalization and Stable Sparse Embedding Signatures

【速读】:该论文旨在解决组织病理学图像分类中因放大倍数(magnification shift)导致的模型泛化能力差的问题,即在某一放大倍数下训练的模型难以有效应用于其他放大倍数的数据。其关键解决方案是采用梯度反转域泛化模型(gradient-reversal domain-general model),该模型通过抑制放大倍数特异性特征的同时保留判别性信息,实现对不同放大尺度数据的鲁棒建模。实验表明,该方法在保持高预测性能(AUC: 0.967 vs 基线 0.965)的前提下,显著压缩了特征表示维度(从1,074降至306维),并大幅提升跨折签名的一致性(Jaccard重叠从接近零提升至0.99),从而实现了更紧凑、可迁移且可靠的计算病理学模型部署。

链接: https://arxiv.org/abs/2604.25817
作者: Ifeanyi Ezuma,Olusiji Medaiyese
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 12 pages, 7 figures, 3 tables. Preprint manuscript

点击查看摘要

Abstract:Magnification shift is a major obstacle to robust histopathology classification, because models trained on one imaging scale often generalize poorly to another. Here, we evaluated this problem on the BreaKHis dataset using a strict patient-disjoint leave-one-magnification-out protocol, comparing supervised baseline, baseline augmented with DCGAN-generated patches, and a gradient-reversal domain-general model designed to preserve discriminative information while suppressing magnification-specific variation. Across held-out magnifications, the domain-general model achieved the strongest overall discrimination and its clearest gain was observed when 200X was held out. By contrast, GAN augmentation produced inconsistent effects, improving some folds but degrading others, particularly at 400X. The domain-general model also yielded the lowest Brier score at 0.063 vs 0.089 at baseline. Sparse embedding analysis further revealed that domain-general training reduced average signature size more than three-fold (306 versus 1,074 dimensions) while preserving equivalent predictive performance (AUC: 0.967 vs 0.965; F1: 0.930 vs 0.931). It also increased cross-fold signature reproducibility from near-zero Jaccard overlap in the baseline to 0.99 between the 100X and 200X folds. These findings show that calibrated, compact, and transferable representations can be learned without added architectural complexity, with clear implications for the reliable deployment of computational pathology models across heterogeneous acquisition settings.
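
摘要中用于衡量跨折签名一致性的 Jaccard 重叠度,可用下面的最小示例说明;这里把稀疏签名视为活跃嵌入维度的索引集合,示例数据为笔者假设,仅作演示:

```python
def jaccard_overlap(sig_a, sig_b):
    """两个稀疏签名(活跃维度索引集合)之间的 Jaccard 重叠度。"""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# 假设的示例:两个放大倍数折下选出的签名维度
fold_100x = {1, 5, 9, 12, 30}
fold_200x = {1, 5, 9, 12, 47}
print(round(jaccard_overlap(fold_100x, fold_200x), 3))  # 0.667
```

当两折签名几乎完全一致时该值趋近 1,对应文中基线接近零、域泛化训练后达到 0.99 的对比。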

[CV-5] Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在生成过程中存在幻觉(hallucination)的问题,即模型虽能生成流畅的语言输出,但缺乏对图像内容的严格依赖,尤其在视觉信号模糊或不确定时,指令提示(instruction prompting)会进一步放大语言先验(language priors),导致输出与视觉证据脱节。解决方案的关键在于提出一种双流解码框架——指令-证据对比双流解码(Instruction-Evidence Contrastive Dual-Stream Decoding, IECD2),其核心机制是在每一步解码中维护两个并行的概率分布:一个由指令驱动、强调语言信息丰富性的流,另一个由视觉证据驱动、确保生成内容忠实于图像的流;并通过基于对称KL散度的对比门控机制自适应融合二者,抑制仅被语言先验支持而无视觉证据支撑的词元,同时保留两者一致的预测结果,从而在多个任务(如图像描述和视觉问答)上显著提升准确性并降低幻觉率。

链接: https://arxiv.org/abs/2604.25809
作者: Yashwant Pravinrao Bangde,Debaditya Roy
机构: Indian Institute of Technology Kharagpur (印度理工学院卡拉格普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), maintains two parallel probability distributions of tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrast-based gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD2 on multiple datasets spanning various generative vision-language reasoning tasks such as captioning and visual question answering, including POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench. IECD2 demonstrates consistent improvements in task accuracy and reasoning performance, alongside a substantial reduction in hallucination across all evaluation metrics compared to state-of-the-art decoding approaches.
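
文中"以对称 KL 散度门控融合双流分布"的机制可用如下纯 Python 草图示意;门控的指数形式与温度参数 tau 均为笔者假设,并非论文原始实现:

```python
import math

def sym_kl(p, q, eps=1e-12):
    """对称 KL 散度:0.5 * (KL(p||q) + KL(q||p))。"""
    def kl(a, b):
        return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def fuse_streams(p_instr, p_evid, tau=1.0):
    """分歧驱动的门控融合:两流一致时门控 g 接近 1、保留指令流;
    分歧越大 g 越小,输出越向证据流收缩(门控形式为假设)。"""
    g = math.exp(-sym_kl(p_instr, p_evid) / tau)
    fused = [g * a + (1 - g) * b for a, b in zip(p_instr, p_evid)]
    z = sum(fused)
    return [f / z for f in fused]

p_instr = [0.70, 0.20, 0.10]   # 指令驱动流(假设)
p_evid = [0.10, 0.20, 0.70]    # 证据驱动流(假设)
fused = fuse_streams(p_instr, p_evid)
print(abs(sum(fused) - 1.0) < 1e-9)  # True:融合结果仍是合法分布
```

与文中描述一致:仅被语言先验(指令流)支持而证据流不支持的词元会被压制,两流一致的预测则被保留。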

[CV-6] Improving Diversity in Black-box Few-shot Knowledge Distillation

【速读】:该论文旨在解决黑盒少样本知识蒸馏(black-box few-shot knowledge distillation, KD)中的两个核心挑战:一是训练数据稀缺,二是现有方法生成的合成图像缺乏多样性,从而限制了学生网络的学习效果。其解决方案的关键在于提出一种新颖的生成对抗网络(Generative Adversarial Networks, GAN)训练机制,通过在教师模型监督下自适应地选择高置信度图像,并实时引入对抗学习过程,从而动态扩充并提升蒸馏数据集的多样性,显著提高学生模型的准确性。

链接: https://arxiv.org/abs/2604.25795
作者: Tri-Nhan Vo,Dang Nguyen,Kien Do,Sunil Gupta
机构: Deakin University (迪肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge distillation (KD) is a well-known technique to effectively compress a large network (teacher) to a smaller network (student) with little sacrifice in performance. However, most KD methods require a large training set and internal access to the teacher, which are rarely available due to various restrictions. These challenges have originated a more practical setting known as black-box few-shot KD, where the student is trained with few images and a black-box teacher. Recent approaches typically generate additional synthetic images but lack an active strategy to promote their diversity, a crucial factor for student learning. To address these problems, we propose a novel training scheme for generative adversarial networks, where we adaptively select high-confidence images under the teacher’s supervision and introduce them to the adversarial learning on-the-fly. Our approach helps expand and improve the diversity of the distillation set, significantly boosting student accuracy. Through extensive experiments, we achieve state-of-the-art results among other few-shot KD methods on seven image datasets. The code is available at this https URL.
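
其中"在教师监督下自适应选择高置信度合成图像"这一步,可用下面的简化草图说明;阈值 0.9 与示例数据均为假设,真实方法是在 GAN 训练过程中在线执行该筛选:

```python
def select_for_adversarial_training(images, teacher_probs, threshold=0.9):
    """在黑盒教师输出的软概率下,筛选 top-1 置信度达到阈值的合成图像,
    动态加入对抗训练(阈值 0.9 为假设值)。"""
    return [img for img, probs in zip(images, teacher_probs) if max(probs) >= threshold]

# 假设的示例:四张合成图像及教师给出的类别概率
imgs = ["g1", "g2", "g3", "g4"]
probs = [[0.95, 0.05], [0.55, 0.45], [0.30, 0.70], [0.08, 0.92]]
print(select_for_adversarial_training(imgs, probs))  # ['g1', 'g4']
```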

[CV-7] Diverse Image Priors for Black-box Data-free Knowledge Distillation

【速读】:该论文旨在解决在去中心化或安全AI生态系统中,由于隐私法规和知识产权限制导致无法访问教师模型接口及原始训练数据的黑盒无数据知识蒸馏(black-box data-free Knowledge Distillation, KD)问题。现有方法虽利用合成数据,但仍面临数据多样性不足与蒸馏信号弱的问题。其解决方案的关键在于提出Diverse Image Priors Knowledge Distillation (DIP-KD)框架,通过三阶段协同流程实现:(1) 利用图像先验(image priors)合成多样化视觉模式与语义信息;(2) 引入对比学习增强合成样本间的区分度;(3) 设计新型“引导学生”(primer student)实现软概率蒸馏。实验证明该方法在12个基准上达到最先进性能,且消融实验验证了数据多样性对受限环境下知识获取的重要性。

链接: https://arxiv.org/abs/2604.25794
作者: Tri-Nhan Vo,Dang Nguyen,Trung Le,Kien Do,Sunil Gupta
机构: Deakin University (迪肯大学); Monash University (莫纳什大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knowledge distillation (KD) represents a vital mechanism to transfer expertise from complex teacher networks to efficient student models. However, in decentralized or secure AI ecosystems, privacy regulations and proprietary interests often restrict access to the teacher’s interface and original datasets. These constraints define a challenging black-box data-free KD scenario where only top-1 predictions and no training data are available. While recent approaches utilize synthetic data, they still face limitations in data diversity and distillation signals. We propose Diverse Image Priors Knowledge Distillation (DIP-KD), a framework that addresses these challenges through a three-phase collaborative pipeline: (1) Synthesis of image priors to capture diverse visual patterns and semantics; (2) Contrast to enhance the collective distinction between synthetic samples via contrastive learning; and (3) Distillation via a novel primer student that enables soft-probability KD. Our evaluation across 12 benchmarks shows that DIP-KD achieves state-of-the-art performance, with ablations confirming data diversity as critical for knowledge acquisition in restricted AI environments.

[CV-8] Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

【速读】:该论文旨在解决CAD模型中可动部件及其运动参数的自动推断问题,即如何从用户绘制的轻量级2D草图(如箭头和线条)中高效、准确地生成具有交互式动画、仿真和形状编辑能力的可动3D模型。其核心挑战在于将设计师意图通过草图形式表达后,自动映射为物理合理的结构与运动约束,而传统方法依赖大量人工标注或特定类别先验。解决方案的关键在于提出Sketch2Arti系统,该系统基于类别无关(category-agnostic)的深度学习框架,无需物体类别信息即可从单视角草图中自动识别可动部件并预测其运动参数,同时支持对壳体模型缺失内部结构的可控补全,生成符合几何一致性与运动约束的内部组件,从而实现复杂对象上细粒度、迭代式的可动性建模。

链接: https://arxiv.org/abs/2604.25781
作者: Yi Yang,Hao Pan,Yijing Cui,Alla Sheffer,Changjian Li
机构: University of Edinburgh(爱丁堡大学); Tsinghua University(清华大学); University of British Columbia(不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at this https URL.

[CV-9] QB-LIF: Learnable-Scale Quantized Burst Neurons for Efficient SNNs

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)中二进制尖峰编码(Binary Spike Coding)因每时间步仅用1比特表示而导致的信息吞吐量受限问题,尤其在短仿真时长下深度架构性能瓶颈更为显著。解决方案的关键在于提出量化爆发LIF(Quantized Burst-LIF, QB-LIF)神经元模型,其将爆发尖峰建模为膜电位的饱和均匀量化,并引入可学习的缩放因子(scale),使每一层能自主适应膜电位统计特性以动态调节尖峰分辨率;同时设计可吸收缩放策略(absorbable scale strategy),在推理阶段将学习到的量化尺度融合进突触权重,保持严格的仅累加(Accumulate-Only, AC)执行范式以保障硬件效率;此外,为稳定离散多级空间中的优化过程,提出了带指数尾部的修正线性替代梯度(ReLSG-ET),确保梯度在爆发间隔间持续流动,从而实现高精度、低延迟且具备类脑计算兼容性的SNN训练与部署。

链接: https://arxiv.org/abs/2604.25688
作者: Dewei Bai,Hongxiang Peng,Jiajun Mei,Yang Ren,Hong Qu,Dawen Xia,Zhang Yi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Binary spike coding enables sparse and event-driven computation in spiking neural networks (SNNs), yet its 1-bit-per-timestep representation fundamentally limits information throughput. This bottleneck becomes increasingly restrictive in deep architectures under short simulation horizons. We propose the Quantized Burst-LIF (QB-LIF) neuron, which reformulates burst spiking as a saturated uniform quantization of membrane potentials with a learnable scale. Instead of relying on predefined multi-threshold structures, QB-LIF treats the quantization scale as a trainable parameter, allowing each layer to autonomously adapt its spiking resolution to the underlying membrane-potential statistics. To preserve hardware efficiency, we introduce an absorbable scale strategy that folds the learned quantized scale into synaptic weights during inference, maintaining a strict accumulate-only (AC) execution paradigm. To enable stable optimization in the discrete multi-level space, we further design ReLSG-ET, a rectified-linear surrogate gradient with exponential tails that sustains gradient flow across burst intervals. Extensive experiments on static (CIFAR-10/100, ImageNet) and event-driven (CIFAR10-DVS, DVS128-Gesture) benchmarks demonstrate that QB-LIF consistently outperforms binary and fixed-burst SNNs, achieving higher accuracy under ultra-low latency while preserving neuromorphic compatibility.
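
QB-LIF 的两个核心操作——膜电位的饱和均匀量化与推理期的尺度折叠——可以用如下极简草图理解;爆发级数 levels 与数值均为笔者假设,仅示意"折叠后仍是纯累加"的等价性:

```python
def qb_spikes(v_mem, scale, levels=3):
    """饱和均匀量化:膜电位 -> 0..levels 次爆发脉冲。"""
    return max(0, min(levels, round(v_mem / scale)))

def fold_scale(weights, scale):
    """推理时把学习到的量化尺度吸收进突触权重,
    使下游仍是纯累加(AC):w' * n 次脉冲 == w * (n * scale)。"""
    return [w * scale for w in weights]

scale = 0.5
w = [0.2, -0.1]
n = qb_spikes(1.6, scale)        # round(3.2) = 3 次爆发脉冲
w_folded = fold_scale(w, scale)
lhs = sum(wi * n * scale for wi in w)       # 未折叠:需要一次乘法
rhs = sum(wi * n for wi in w_folded)        # 折叠后:只剩整数次累加
print(n, abs(lhs - rhs) < 1e-12)  # 3 True
```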

[CV-10] Exploring Remote Photoplethysmography for Neonatal Pain Detection from Facial Videos

【速读】:该论文旨在解决新生儿在重症监护病房(Neonatal ICU)中疼痛评估缺乏客观性和可靠性的问题,传统方法依赖医护人员主观判断,难以实现精准监测。为应对这一挑战,作者提出了一种基于远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)的非接触式生理信号获取方案,用于自动识别新生儿疼痛状态。其解决方案的关键在于:首先通过引入质量参数筛选受皮肤形变影响最小的感兴趣区域(regions-of-interest, ROIs)以提高rPPG信号质量;其次利用信噪比(signal-to-noise ratio)作为适应度指标选择噪声最小的rPPG片段;最终结合rPPG与音频特征进行多模态融合,显著提升了疼痛检测性能。

链接: https://arxiv.org/abs/2604.25680
作者: Ashutosh Dhamaniya,Anup Kumar Gupta,Trishna Saikia,Puneet Gupta
机构: Indian Institute of Technology Indore (印度理工学院印多尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 25 pages, 9 figures, 10 tables. Proposed rPPG-based method for neonatal pain detection from facial videos, with multimodal (rPPG + audio) analysis and extensive ablation studies on the iCOPEvid dataset

点击查看摘要

Abstract:Unaddressed pain in neonates can lead to adverse effects, including delayed development and slower weight gain, emphasising the need for more objective and reliable pain assessment methods. Hence, automated methods using behavioural and physiological pain indicators have been developed to aid healthcare professionals in the Neonatal ICU. Traditional contact-based methods for physiological parameter estimation are unsuitable for long-term monitoring and increase the risk of spreading diseases like COVID-19. We introduce a novel approach using remote photoplethysmography (rPPG) to estimate pulse signals in a non-contact manner and employ them for neonatal pain detection. The temporal signals acquired from regions-of-interest (ROIs) affected by skin deformations may exhibit lower quality and provide erroneous rPPG signals. Therefore, we incorporated a quality parameter to select the temporal signals obtained from ROIs that are least affected by skin deformations. Further, we employed signal-to-noise ratio as a fitness parameter to extract the rPPG signal corresponding to the clip that is least affected by noise. Experimental findings demonstrate that the rPPG signals provide useful information for neonatal pain detection, and signals extracted from the blue colour channel outperform those extracted from other colour channels. We also show that combining rPPG and audio features provides better results than individual modalities.
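
文中"以信噪比作为适应度来挑选受噪声影响最小的片段"的思路,可用下面的纯 Python 草图复现;朴素 DFT 仅作演示(非高效实现),心率频带 0.7–4 Hz 为常用假设值,并非论文给定:

```python
import math

def band_power(x, fs, f_lo, f_hi):
    """用朴素 DFT 估计信号在 [f_lo, f_hi] Hz 频带内的功率。"""
    n = len(x)
    power = 0.0
    for k in range(1, n // 2):
        f = k * fs / n
        if f_lo <= f <= f_hi:
            re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            power += re * re + im * im
    return power

def snr_fitness(x, fs, hr_band=(0.7, 4.0)):
    """心率频带能量与带外能量之比作为适应度(频带范围为假设)。"""
    in_band = band_power(x, fs, *hr_band)
    total = band_power(x, fs, 0.05, fs / 2)
    return in_band / max(total - in_band, 1e-12)

fs = 30.0
pulse_like = [math.sin(2 * math.pi * 1.5 * t / fs) for t in range(60)]    # 1.5 Hz ≈ 90 bpm
interference = [math.sin(2 * math.pi * 6.0 * t / fs) for t in range(60)]  # 带外干扰
best = max([interference, pulse_like], key=lambda c: snr_fitness(c, fs))
print(best is pulse_like)  # True:选出更像脉搏的片段
```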

[CV-11] SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

【速读】:该论文旨在解决当前机器人超声系统在扫描启动阶段缺乏解剖学理解的问题,即系统无法自主确定应扫描的目标器官、起始位置以及如何根据个体患者解剖结构进行自适应调整,导致仍需专家干预才能启动扫描。解决方案的关键在于提出SAMe(Semantic Anatomy Mapping engine),其核心是构建一个显式的解剖先验层,将模糊的临床主诉通过语义映射定位到目标器官,并基于单张外部体表图像实例化患者特异性的解剖表示,进而直接生成无需额外配准的6自由度(6-DoF)探头初始位姿。该方法实现了从“目标→解剖→动作”的端到端流程,在真实机器人实验中达到了97.3%的肝脏和81.7%的肾脏初始化命中率,显著优于基于表面启发式的基线方法,为自主超声扫描提供了可扩展的解剖学基础。

链接: https://arxiv.org/abs/2604.25646
作者: Jing Zhang,Duojie Chen,Wentao Jiang,Zihan Lou,Jianxin Liu,Xinwu Cui,Qinghong Zhao,Bo Du,Christoph F. Dietrich,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Supplementary information included. Code will be released at this https URL

点击查看摘要

Abstract:Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps make systems still reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, SAMe achieved overall organ-hit rates of 97.3% for liver initialization and 81.7% for kidney initialization across the evaluated target sets. Even when restricted to the centroid target, SAMe outperformed the surface-heuristic baseline for both liver and kidney initialization. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

[CV-12] Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models CVPR2026

【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型生成与输入内容事实不符或逻辑不一致的响应。现有方法通过解码阶段的引导向量(steering vectors)虽能部分缓解幻觉,但会无意中加剧残余幻觉的严重性,原因在于其仅在解码阶段干预,导致错误在自回归过程中累积并恶化。本文提出预填充阶段干预(Prefill-Time Intervention, PTI),其核心创新在于在预填充(prefill)阶段一次性介入,优化初始键值(Key-Value, KV)缓存以阻止错误积累。PTI具备模态感知能力,分别针对视觉和文本表征提取不同方向:引导键(keys)聚焦于视觉锚定对象,过滤值(values)中的背景噪声,从而从源头修正易产生幻觉的表示。实验表明,PTI在多种解码策略、模型架构和基准测试中均具显著效果,且可与现有解码阶段方法正交集成,实现即插即用式性能提升。

链接: https://arxiv.org/abs/2604.25642
作者: Chengsheng Zhang,Chenghao Sun,Xinyan Jiang,Wei Li,Xinmei Tian
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Advanced Research Institute, Chinese Academy of Sciences (中国科学院上海高等研究院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI’s significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: this https URL.
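
PTI 的"解耦干预"可抽象为对初始 KV 缓存的一次性线性修正;下面是一个玩具草图,其中引导方向、步长与缓存数值均为笔者假设,仅示意"键被引导、值被去噪"的解耦形式:

```python
def prefill_time_intervention(keys, values, key_dir, value_dir, alpha=0.5, beta=0.5):
    """预填充阶段一次性增强初始 KV 缓存:键朝视觉锚定方向平移,
    值减去背景噪声方向(方向与步长均为假设,仅示意解耦干预)。"""
    new_keys = [[k + alpha * d for k, d in zip(row, key_dir)] for row in keys]
    new_values = [[v - beta * d for v, d in zip(row, value_dir)] for row in values]
    return new_keys, new_values

# 假设的 2 个 token、3 维的小缓存
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
V = [[0.5, 0.5, 0.5], [0.2, 0.2, 0.2]]
k_dir = [1.0, 1.0, 0.0]   # 视觉锚定方向(假设)
v_dir = [0.0, 0.0, 1.0]   # 背景噪声方向(假设)
K2, V2 = prefill_time_intervention(K, V, k_dir, v_dir)
print(K2[0], V2[0])  # [1.5, 0.5, 0.0] [0.5, 0.5, 0.0]
```

由于干预只发生在预填充这一步,后续自回归解码完全不变,这正是其可与解码阶段方法正交叠加的原因。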

[CV-13] Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在文本到图像(Text-to-Image, T2I)生成任务中,现有基于编辑的精炼方法(Refinement-via-Editing, RvE)存在的两个关键问题:一是编辑指令对提示与图像之间的语义错位描述过于粗略,导致精炼不完整;二是像素级内容保留机制限制了有效修改空间,抑制了更深层次的语义对齐。解决方案的关键在于提出一种新的精炼范式——基于再生的精炼(Refinement via Regeneration, RvR),该方法将精炼过程重新建模为条件图像再生任务,通过结合目标提示和初始图像的语义标记(semantic tokens)来引导再生,从而在不强制像素级保留的前提下实现更全面的语义对齐,并显著扩大可修改空间,实验表明其在多个基准测试中均取得显著性能提升。

链接: https://arxiv.org/abs/2604.25636
作者: Jiayi Guo,Linqing Wang,Jiangshan Wang,Yang Yue,Zeyu Liu,Zhiyuan Zhao,Qinglin Lu,Gao Huang,Chunyu Wang
机构: Tsinghua University (清华大学); Tencent HY (腾讯HY)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub: this https URL

点击查看摘要

Abstract:Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.

[CV-14] Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion

【速读】:该论文旨在解决自动驾驶中相机与雷达(camera-radar)融合感知的性能瓶颈问题,现有方法在输入混合、特征图融合或基于查询的特征采样层面存在对象覆盖不足和跨模态信息利用不充分的问题。其解决方案的关键在于提出一种新的融合范式——异构查询交互(heterogeneous query interaction),并通过两个核心机制实现:一是异构查询混合(QMix),通过在特征采样后引入专用的跨类型注意力机制,强化不同模态查询间的互补证据融合;二是交互式查询交换采样(QSwap),允许相关查询在注意力和几何约束下交换信息丰富的特征token,从而提升特征采样质量。这一框架显著提升了3D目标检测的精度,在nuScenes数据集上达到59.1 mAP和65.6 NDS(验证集)以及61.6 mAP和67.9 NDS(测试集)的最新水平。

链接: https://arxiv.org/abs/2604.25574
作者: Jialong Wu,Yihan Wang,Matthias Rottmann
机构: Osnabrück University (奥斯纳布吕克大学); University of Wuppertal (伍珀塔尔大学); Aptiv Services Deutschland GmbH (安波福德国服务有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix), which performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence. We further propose interactive query swap sampling (QSwap), which improves feature sampling by allowing related queries to exchange informative feature tokens under attention and geometric constraints. Experiments on the nuScenes dataset show that ConFusion achieves state-of-the-art performance, reaching 59.1 mAP and 65.6 NDS on the validation set, and 61.6 mAP and 67.9 NDS on the test set.

[CV-15] Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

【速读】:该论文旨在解决脉冲Transformer(Spiking Transformers)在长距离视觉建模中因二次项交互导致的能量效率低下问题,其根本原因在于此类架构与脉冲神经计算固有的稀疏性和事件驱动特性不匹配。解决方案的关键在于提出Vision SmolMamba,一种基于脉冲状态空间(spiking state-space)的高效架构,核心创新是引入Spike-Guided Spatio-Temporal Token Pruner(SST-TP)机制,通过结合脉冲激活强度和首次脉冲延迟来动态估计token重要性,从而逐步剔除冗余token并保留关键时空信息,使模型能够以稀疏token实现线性时间复杂度的双向状态空间递归,显著提升能效比。

链接: https://arxiv.org/abs/2604.25570
作者: Dewei Bai,Hongxiang Peng,Yunyun Zeng,Ziyu Zhang,Hong Qu,Yi Zhang
机构: University of Electronic Science and Technology of China (电子科技大学); Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.
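
SST-TP 的 token 重要性估计与剪枝可用如下草图示意;强度/时延两项的加权方式与权重 w 为笔者假设,仅体现"脉冲越强、首脉冲越早则越重要"的原则:

```python
def token_importance(spike_counts, first_spike_t, t_max, w=0.5):
    """重要性 = 归一化脉冲强度与首脉冲时延得分(越早越高)的加权和,
    权重 w 为假设值。"""
    max_c = max(spike_counts) or 1
    return [w * c / max_c + (1 - w) * (1.0 - t / t_max)
            for c, t in zip(spike_counts, first_spike_t)]

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """按得分保留前 keep_ratio 比例的 token,并维持原有空间顺序。"""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

counts = [8, 1, 5, 0]      # 各 token 的脉冲数(假设)
latency = [1, 4, 2, 4]     # 各 token 的首脉冲时间步(假设),t_max = 4
s = token_importance(counts, latency, t_max=4)
print(prune_tokens(["t0", "t1", "t2", "t3"], s, keep_ratio=0.5))  # ['t0', 't2']
```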

[CV-16] TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

【速读】:该论文旨在解决视觉状态空间模型(Visual State-Space Models, VSSMs)在医学图像分割中面临的两个关键问题:一是轴向偏倚的扫描顺序削弱了对斜向和弯曲结构的建模能力;二是简单的多分支融合策略容易放大冗余响应。其解决方案的关键在于提出一种拓扑感知的“扫描-融合”框架TopoMamba,通过引入对角/反对角拓扑扫描分支(TopoA-Scan)与标准交叉扫描分支(Cross-Scan)协同提供互补的结构先验,并设计ScanCache机制以设备感知的方式缓存扫描索引,降低重复分辨率下的计算开销;同时,提出轻量级HSIC门控机制(HSIC Gate),基于依赖关系自适应调节分支交互,实现高效且鲁棒的异构特征融合。

链接: https://arxiv.org/abs/2604.25545
作者: Fuchen Zheng,Chengpei Xu,Long Ma,Weixuan Li,Junhua Zhou,Xuhang Chen,Weihuang Liu,Haolun Li,Quanjun Li,Zhenxi Zhang,Lei Zhao,Chi-Man Pun,Shoujun Zhou
机构: Dalian University of Technology (大连理工大学); Guangdong University of Technology (广东工业大学); Huizhou University (惠州学院); University of Macau (澳门大学); Nanjing University of Posts and Telecommunications (南京邮电大学); Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures

点击查看摘要

Abstract:Visual state-space models (SSMs) have shown strong potential for medical image segmentation, yet their effectiveness is often limited by two practical issues: axis-biased scan ordering weakens the modeling of oblique and curved structures, and naive multi-branch fusion tends to amplify redundant responses. We present TopoMamba, a topology-aware scan-and-fuse framework for segmenting heterogeneous medical visual media. The method combines a diagonal/anti-diagonal TopoA-Scan branch with the standard Cross-Scan branch to provide complementary structural priors, and introduces ScanCache, a device-aware caching mechanism that amortizes explicit scan-index construction across recurring resolutions. To fuse heterogeneous scan features efficiently, we further propose a lightweight HSIC Gate that regulates branch interaction using a dependence-aware scalar gating rule. We also instantiate a volumetric TopoMamba-3D for practical 3D clinical segmentation. Experiments on Synapse CT, ISIC 2017 dermoscopy, and CVC-ClinicDB endoscopy show that TopoMamba consistently improves segmentation quality over strong CNN, Transformer, and SSM baselines, with particularly clear gains on thin or curved targets such as the pancreas and gallbladder, while maintaining favorable deployment efficiency under dynamic input resolutions. These results suggest that topology-aware scan ordering and lightweight dependence-aware fusion form an effective and practical design for medical multimedia segmentation. The code will be made publicly available.
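
TopoA-Scan 的对角/反对角扫描顺序与 ScanCache 的缓存思想可用下面的草图说明;这里用 functools.lru_cache 模拟"相同分辨率只构建一次索引",具体实现细节为笔者假设:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # 模拟 ScanCache:重复出现的分辨率只构建一次索引
def diag_scan(h, w):
    """H×W 特征图的对角扫描顺序(返回按行优先展平后的索引序列)。"""
    order = []
    for s in range(h + w - 1):      # 逐条对角线 s = i + j
        for i in range(h):
            j = s - i
            if 0 <= j < w:
                order.append(i * w + j)
    return tuple(order)

def antidiag_scan(h, w):
    """反对角扫描:等价于先水平翻转列、再按对角顺序访问。"""
    return tuple(idx - idx % w + (w - 1 - idx % w) for idx in diag_scan(h, w))

print(diag_scan(2, 3))      # (0, 1, 3, 2, 4, 5)
print(antidiag_scan(2, 3))  # (2, 1, 5, 0, 4, 3)
```

与轴向的 Cross-Scan 相比,这种顺序让斜向相邻的像素在序列中彼此靠近,这正是其对斜向、弯曲结构更友好的原因。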

[CV-17] DualGeo: A Dual-View Framework for Worldwide Image Geo-localization ICME2026

【速读】:该论文旨在解决全球范围图像地理定位(worldwide image geo-localization)中因环境变化敏感性和异常候选过滤不足导致的定位精度受限问题。现有方法依赖易受光照、季节和天气等环境因素影响的视觉特征,且缺乏有效的后处理机制来剔除误匹配结果。其解决方案的关键在于提出一个两阶段框架DualGeo:第一阶段通过双向交叉注意力融合图像与语义分割特征,并利用双视角对比学习将融合特征对齐至GPS坐标,构建全局检索数据库;第二阶段则通过地理聚类重排序检索候选并输入大语言模型(Large Multimodal Models, LMMs)进行最终坐标预测,从而显著提升街级(1 km)和城市级(25 km)定位准确率。

链接: https://arxiv.org/abs/2604.25533
作者: Junchao Cui,Wenqi Shi,Shaoyong Du,Hang He,Xuanzi Ma,Hao Tang,Xiangyang Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME2026 Accept

点击查看摘要

Abstract:Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention. The fused features are then aligned with GPS coordinates through dual-view contrastive learning to build a global retrieval database. Second, it performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (1 km) and city-level (25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively. Our code and datasets are available : this https URL.
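
第二阶段"地理聚类重排序"的核心想法是:被大量邻近候选支持的检索结果更可信。下面是一个基于大圆距离的简化草图;25 km 半径对应文中的城市级尺度,按邻域计数打分的方式为笔者假设:

```python
import math

def haversine_km(a, b):
    """两个 (lat, lon) 坐标之间的大圆距离(公里)。"""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def geo_rerank(candidates, radius_km=25.0):
    """按 radius_km 邻域内的候选数(地理支持度)重排,
    离群候选被排到末尾(打分方式为假设)。"""
    support = [sum(1 for c in candidates if haversine_km(c, o) <= radius_km)
               for o in candidates]
    order = sorted(range(len(candidates)), key=lambda i: -support[i])
    return [candidates[i] for i in order]

paris = [(48.85, 2.35), (48.86, 2.34), (48.84, 2.36)]  # 假设的密集候选簇
tokyo = (35.68, 139.69)                                 # 假设的离群候选
ranked = geo_rerank(paris + [tokyo])
print(ranked[-1] == tokyo)  # True:离群点被排到最后
```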

[CV-18] The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation CVPR2026

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在语义分割任务中因复杂手工设计目标函数导致训练成本不透明的问题。当前方法通常在固定迭代次数下进行评估,但不同KD策略的每轮计算开销差异显著,使得迭代数无法真实反映训练预算,从而难以判断性能提升是否源于更强的蒸馏信号或单纯更多的计算资源。论文的关键解决方案是采用基于实际运行时间(wall-clock compute)的公平比较机制,并发现经典的logit和特征级蒸馏方法在匹配计算量后优于近期专为分割设计的复杂KD方法;进一步通过延长训练时间,基于特征的蒸馏实现了在Cityscapes和ADE20K数据集上ResNet-18学生模型接近甚至达到ResNet-101教师模型的性能表现,证明了规模扩展(scaling)比复杂的手工目标设计更能有效推动语义分割知识蒸馏的进步。

链接: https://arxiv.org/abs/2604.25530
作者: Muhammad Ali,Kevin Alexander Laube,Madan Ravi Ganesh,Lukas Schott,Niclas Popp,Thomas Brox
机构: University of Freiburg (弗莱堡大学); Bosch Center for Artificial Intelligence (博世人工智能中心); Aleph Alpha Research (Aleph Alpha 研究院); University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Presented at Efficient Computer Vision (ECV) Workshop, CVPR 2026 (non-archival). 5 pages, 3 figures

点击查看摘要

Abstract:Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, \textitcanonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher’s mIoU on Cityscapes (79.0 vs.\ 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.
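
文中所谓"经典(canonical)logit 蒸馏"即 Hinton 式温度软化 KL 损失;下面给出单个 logit 向量(语义分割中对应单个像素)上的最小实现,温度 T=4 为常见取值,仅作演示:

```python
import math

def softmax(logits, t=1.0):
    """数值稳定的温度 softmax。"""
    m = max(logits)
    exps = [math.exp((x - m) / t) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_kd_loss(student_logits, teacher_logits, t=4.0):
    """经典 logit 蒸馏:温度 T 软化后的 KL(teacher || student),
    按惯例乘以 T^2 以保持梯度尺度。"""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return t * t * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]
print(round(logit_kd_loss(teacher, teacher), 6))       # 0.0:完全匹配时损失为零
print(logit_kd_loss([0.0, 0.0, 0.0], teacher) > 0)     # True:偏离教师时损失为正
```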

[CV-19] The Forensic Cost of Watermark Removal

【速读】:该论文旨在解决当前水印移除方法评估体系不完善的问题:现有研究仅关注攻击成功率和感知质量,忽略了水印移除操作会引入可被检测的统计伪影(statistical artifacts),使移除行为本身可被取证分析识别。解决方案的关键在于提出"水印移除检测"(Watermark Removal Detection, WRD)这一新的评估维度,并训练现代分类器识别这些伪影,在 10⁻³ 误报率(FPR)下对所有被测移除方法均取得了最先进的检测率。研究表明,在攻击成功率、感知质量和取证可检测性三者构成的扩展评估体系下,现有方法无一能够同时兼顾,因此取证隐蔽性(forensic stealthiness)应被确立为水印移除的必要要求。

链接: https://arxiv.org/abs/2604.25491
作者: Gautier Evennou,Ewa Kijak
机构: IMATAG; IRISA, Université de Rennes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint; accepted at IHMMSEC 2026, Special Session “Watermarking Across the Lifecycle of Generative Models”

点击查看摘要

Abstract:Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at 10⁻³ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
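
论文中"在 10⁻³ FPR 下的检测率"这一指标的含义,可以用下面的示意代码说明:先在负类(未被攻击的图像)分数上标定阈值,使误报率不超过目标值,再在正类上统计检出率。分数分布为随机模拟,函数实现只是常见做法的一个草图,并非论文分类器本身。

```python
# 示意:在负类分数上标定满足目标 FPR 的阈值,再统计正类检出率。
# 分数分布为随机模拟,并非论文数据。
import numpy as np

def threshold_at_fpr(neg_scores, target_fpr):
    """返回一个阈值,使负类中得分高于它的比例不超过 target_fpr。"""
    neg = np.sort(neg_scores)
    k = int(np.ceil(len(neg) * (1 - target_fpr)))
    return neg[min(k, len(neg) - 1)]

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 100_000)   # 未被攻击图像的检测分数
pos = rng.normal(4.0, 1.0, 10_000)    # 被移除水印图像的检测分数
thr = threshold_at_fpr(neg, 1e-3)
fpr = float((neg > thr).mean())       # 实际误报率不超过 1e-3
tpr = float((pos > thr).mean())       # 该阈值下的检出率
```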

[CV-20] DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

【速读】:该论文旨在解决当前图像编辑模型在复杂推理任务中表现不足的问题,尤其是生成式 AI(Generative AI)在面对需要多步逻辑规划与语义理解的编辑需求时,缺乏有效的推理驱动决策能力。其核心解决方案是提出一种以“Thinker”为中心的解耦框架——DDA-Thinker,通过将规划模块(Thinker)与固定生成模型(Editor)分离,实现对规划模块的独立优化和可量化评估。关键创新在于引入双原子强化学习框架:一方面设计认知原子奖励(cognitive-atomic reward),基于可验证清单直接评估Thinker生成的执行计划质量;另一方面设计视觉原子奖励(visual-atomic reward),用于衡量最终图像输出的质量。此外,通过融合源图像、用户指令及理想场景描述来合成高质量检查清单,并构建两阶段数据策展流程,确保训练数据具备推理多样性与难度梯度,从而显著提升模型在RISE-Bench和KRIS-Bench等推理驱动图像编辑基准上的性能表现。

链接: https://arxiv.org/abs/2604.25477
作者: Hanqing Yang,Qiang Zhou,Yongchao Du,Sashuai Zhou,Zhibin Wang,Jun Song,Tiezheng Ge,Cheng Yu,Bo Zheng
机构: Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker’s executable plan, which serves as the actionable outcome of the Thinker’s reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.

[CV-21] Generalizable Human Gaussian Splatting via Multi-view Semantic Consistency CVPR2026

【速读】:该论文旨在解决从稀疏视角输入中进行泛化性人体高斯点渲染时,由于人体复杂关节运动和不同视角间重叠区域有限所导致的多视角特征表示不一致问题。解决方案的关键在于:通过预测深度图将每个视角编码的潜在嵌入(latent embeddings)反投影到共享的3D空间,并基于跨视角注意力机制对属于同一身体部位的嵌入进行重新校准,从而有效缓解高度纹理区域和遮挡部位的空间模糊性,实现3D高斯点的精确定位,提升人体渲染质量。

链接: https://arxiv.org/abs/2604.25466
作者: Jingi Kim,Wonjun Kim
机构: Konkuk University (建国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures, CVPR 2026 Findings

点击查看摘要

Abstract:Recently, generalizable human Gaussian splatting from sparse-view inputs has been actively studied for the photorealistic human rendering. Most existing methods rely on explicit geometric constraints or predefined structural representations to accurately position 3D Gaussians. Although these approaches have shown the remarkable progress in this field, they still suffer from inconsistent feature representations across multi-view inputs due to complex articulations of the human body and limited overlaps between different views. To address this problem, we propose a novel method to accurately localize 3D Gaussians and ultimately improve the quality of human rendering. The key idea is to unproject latent embeddings encoded from each viewpoint into a shared 3D space through predicted depth maps and recalibrate them belonging to the same body part based on cross-view attention. This helps the model resolve the spatial ambiguity occurring in highly textured regions as well as occluded body parts, thus leading to the accurate localization of 3D Gaussians. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of generalizable human Gaussian splatting from sparse-view inputs.
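
论文中"通过预测深度图将各视角特征反投影到共享 3D 空间"一步的基本几何如下:给定像素坐标、深度与相机内参,即可恢复相机坐标系下的 3D 位置。内参数值为虚构示例,仅演示针孔模型的反投影。

```python
# 示意:针孔相机模型下,用像素坐标 + 预测深度反投影到相机坐标系。
# 内参数值为虚构示例。
import numpy as np

def unproject(u, v, depth, K):
    """像素 (u, v) 与深度 depth → 相机坐标系下的 3D 点;K 为 3x3 内参。"""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
p = unproject(u=320.0, v=240.0, depth=2.0, K=K)   # 主点处的像素位于光轴上
```

将不同视角反投影得到的 3D 点(及其附带特征)放入同一坐标系后,即可按空间邻近关系做跨视角注意力重校准。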

[CV-22] Image Compression with Bubble-Aware Frame Rate Adaptation for Energy-Efficient Video Capsule Endoscopy

【速读】:该论文旨在解决视频胶囊内镜(Video Capsule Endoscopy, VCE)因设备尺寸受限导致电池寿命短,与图像采集及传输高能耗之间的矛盾。其核心解决方案包括两个关键环节:一是设计了一种图像压缩流水线,在保持诊断图像质量(峰值信噪比达40.3 dB,压缩比为5.748,即压缩率82.6%)的前提下显著减少需传输的数据量;二是提出一种基于气泡感知的动态帧率自适应策略,利用压缩过程中的特征识别低诊断价值帧(主要由气泡引起),并在这些低可见性阶段降低图像采集和传输频率,从而在不牺牲异常检测敏感性的前提下实现最高达40%的能耗降低。该方案在RISC-V平台上通过Kvasir-Capsule和Galar数据集验证,整体系统平均节能20.58%,显著提升了VCE的实用性。

链接: https://arxiv.org/abs/2604.25464
作者: Oliver Bause,Jörg Gammerdinger,Julia Werner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 8 figures, EMBC2026

点击查看摘要

Abstract:Video Capsule Endoscopy (VCE) is a promising method for improving the medical examination of the small intestine in the gastrointestinal tract. A key challenge is their limited size, resulting in a short battery lifetime which conflicts with high energy consumption for image capturing and transmission to an on-body device. Thus, we propose an image compression pipeline that substantially reduces the transmitted data while preserving diagnostic image quality. Furthermore, we exploit characteristics of the compression process to identify frames with low diagnostic value mainly caused by bubbles, without requiring additional image analysis. For low-visibility frames, a dynamic bubble-aware frame rate adaptation strategy reduces image acquisition and transmission during these phases while preserving sensitivity to potential anomalies. The proposed compression and frame rate adaptation are evaluated on a RISC-V platform using the Kvasir-Capsule and Galar datasets. The compression method achieves a compression ratio of 5.748 (82.6%) at a peak signal-to-noise ratio of 40.3 dB, indicating negligible loss of visual quality. The compression accomplished a mean energy reduction of the whole system by 20.58%. Additionally, the proposed bubble-aware frame rate adaptation reduced the energy consumption by up to 40%. These results demonstrate the potential of our method to increase the applicability of VCE.
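
摘要中压缩比 5.748 与数据量减少 82.6% 的关系,以及 PSNR 的定义,可用下面几行代码验证换算(节省比例 = 1 - 1/压缩比;PSNR 以 dB 计)。

```python
# 示意:压缩比与数据节省比例的换算,以及 PSNR(dB)的定义。
import math

def saving_ratio(compression_ratio):
    """节省比例 = 1 - 1/压缩比。"""
    return 1.0 - 1.0 / compression_ratio

def psnr(mse, max_val=255.0):
    """峰值信噪比(dB):10·log10(MAX^2 / MSE)。"""
    return 10.0 * math.log10(max_val ** 2 / mse)

s = saving_ratio(5.748)   # ≈ 0.826,对应摘要中 82.6% 的数据量减少
```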

[CV-23] GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

【速读】:该论文旨在解决单图像超分辨率(Single-Image Super-Resolution, SISR)在真实场景下因复杂退化因素导致的重建质量受限问题,尤其是基于扩散模型的方法普遍依赖语义描述文本进行条件引导,而此类文本仅提供高层语义信息,缺乏与低分辨率输入空间对齐的视觉细节,从而造成语义与视觉空间表示之间的鸿沟。解决方案的关键在于提出GramSR,一种一步式扩散超分框架,其核心创新是用从低分辨率输入中提取的密集视觉特征(通过预训练DINOv3编码器获得)替代传统文本条件,实现更精确的视觉引导;同时采用三阶段LoRA架构,依次优化像素级恢复、语义增强和纹理一致性,并利用Gram矩阵损失约束DINOv3特征间的特征相关性,从而在推理时通过独立的指导尺度灵活控制退化去除、语义增强和纹理保留三个维度,显著提升结构保真度与纹理真实性。

链接: https://arxiv.org/abs/2604.25457
作者: Fabio D’Oronzio,Federico Putamorsi,Leonardo Zini,Marcella Cornia,Lorenzo Baraldi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 28th International Conference on Pattern Recognition

点击查看摘要

Abstract:Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using ℓ₂ loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism. The code for this work is available at: this https URL.
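
GramSR 的纹理阶段使用 Gram 矩阵损失约束特征通道间的相关性。下面给出 Gram 矩阵及其一致性损失的通用数值草图(NumPy 实现,形状与归一化方式为风格迁移文献中的常见约定,具体细节以论文代码为准)。

```python
# 示意:Gram 矩阵与一致性损失(常见的风格/纹理约束形式,细节以论文代码为准)。
import numpy as np

def gram_matrix(feat):
    """feat: (C, H, W) 特征图,返回 (C, C) 的归一化 Gram 矩阵。"""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def gram_loss(feat_sr, feat_hr):
    """两组特征的 Gram 矩阵均方差。"""
    g_sr, g_hr = gram_matrix(feat_sr), gram_matrix(feat_hr)
    return float(np.mean((g_sr - g_hr) ** 2))

x = np.random.default_rng(0).normal(size=(8, 4, 4))
loss_same = gram_loss(x, x)   # 完全相同的特征 → 损失为 0
```

Gram 矩阵只保留通道间相关性而丢弃空间排布,因此适合约束纹理统计而不强制逐像素对齐。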

[CV-24] SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

【速读】:该论文旨在解决遥感影像(Remote Sensing Imagery, RSI)中阴影问题对下游任务(如目标检测和语义分割)性能的严重影响,以及现有方法在阴影检测与去除上通常采用级联处理流程所导致的误差累积和依赖成对训练数据(shadowed and non-shadowed images)的问题。其解决方案的关键在于提出一个统一的两阶段框架——Shadow-Aware and Removal Unified (SARU),第一阶段通过双分支检测模块(DBCSF-Net)融合多色彩空间与语义特征以生成高保真阴影掩膜,有效区分阴影与暗色物体;第二阶段则引入一种无需训练的物理算法(N²SGSR),基于单张输入图像内邻近非阴影区域的光照属性进行恢复,从而避免了对成对数据的依赖,并通过端到端整合实现误差传播最小化,显著提升了实际应用中的鲁棒性与实用性。

链接: https://arxiv.org/abs/2604.25432
作者: Zi-Yang Bo,Wei Lu,Hongruixuan Chen,Si-Bao Chen,Bin Luo
机构: Anhui University (安徽大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 14 figures

点击查看摘要

Abstract:Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N²SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments demonstrate that SARU achieves state-of-the-art performance on both the public AISD dataset and our newly introduced benchmarks. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The source code and datasets are publicly available at: this https URL.
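
N²SGSR 的"从邻近非阴影区域迁移光照属性"这一物理思路,可以退化为一个最简单的均值/标准差匹配:将阴影区像素线性映射到非阴影区的统计量上。下面的实现只是这一思路的示意,并非论文算法本身。

```python
# 示意:按非阴影区的均值/标准差线性校正阴影区像素(单通道)。
# 仅为"邻近区域光照迁移"思路的草图,并非论文 N²SGSR 算法。
import numpy as np

def relight(img, shadow_mask):
    """img: (H, W) 灰度图;shadow_mask: 同形状布尔数组,True 表示阴影。"""
    out = img.astype(float).copy()
    s = out[shadow_mask]
    n = out[~shadow_mask]
    out[shadow_mask] = (s - s.mean()) / (s.std() + 1e-6) * n.std() + n.mean()
    return out

img = np.array([[10.0, 10.0, 100.0],
                [10.0, 12.0, 120.0]])
mask = img < 50           # 简单地以亮度阈值近似阴影掩膜
res = relight(img, mask)  # 阴影区被映射到非阴影区的亮度统计量附近
```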

[CV-25] A Systematic Post-Train Framework for Video Generation

【速读】:该论文旨在解决大规模视频扩散模型在实际部署中面临的三大核心问题:提示敏感性(prompt sensitivity)、时间不一致性(temporal inconsistency)以及推理成本过高。为弥合预训练性能与真实应用场景之间的差距,作者提出了一套系统性的后训练框架,其关键在于通过四个协同阶段实现模型对用户意图的精准对齐:首先采用监督微调(Supervised Fine-Tuning, SFT)使基础模型具备稳定的指令遵循能力;接着引入专为视频扩散设计的群体相对策略优化(Group Relative Policy Optimization, GRPO)方法进行强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF),以提升感知质量和时序一致性;随后利用专用语言模型进行提示增强(Prompt Enhancement),优化用户输入质量;最后通过推理优化(Inference Optimization)降低计算开销。该方案有效提升了视觉质量、时序连贯性和指令遵循能力,同时保持预训练阶段获得的可控性,为构建可扩展、稳定且高效的视频生成系统提供了实用路径。

链接: https://arxiv.org/abs/2604.25427
作者: Zeyue Xue,Siming Fu,Jie Huang,Shuai Lu,Haoran Li,Yijun Liu,Yuming Li,Xiaoxuan He,Mengzhao Chen,Haoyang Huang,Nan Duan,Ping Luo
机构: The University of Hong Kong (香港大学); JD Explore Academy (京东探索研究院); Tsinghua University (清华大学); Peking University (北京大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report

点击查看摘要

Abstract:While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
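
GRPO 的核心是"组内相对优势":对同一 prompt 采样一组输出,用组内奖励的均值与标准差做标准化,从而无需价值网络即可得到优势估计。下面是这一计算的最小示意(奖励数值为虚构,视频扩散上的具体变体以论文为准)。

```python
# 示意:GRPO 的组内相对优势计算(奖励为虚构数值)。
import numpy as np

def group_relative_advantage(rewards):
    """对同一 prompt 下一组样本的奖励做组内标准化。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

adv = group_relative_advantage([2.0, 4.0, 6.0, 8.0])
# 组内均值被扣除:奖励高于组均值的样本获得正优势,反之为负
```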

[CV-26] Beyond Fidelity: Semantic Similarity Assessment in Low-Level Image Processing

【速读】:该论文旨在解决传统低层图像处理评估方法在深度学习和生成模型兴起背景下存在的局限性问题,即现有图像质量评估(Image Quality Assessment, IQA)主要关注视觉保真度,而忽视了语义内容可能发生的改变,导致无法有效衡量处理后图像的语义一致性。其解决方案的关键在于提出一种新的评估任务——语义相似性(Semantic Similarity),并构建基于语义实体及其关系的结构化语义表示框架;在此基础上设计了三元组语义相似性评分(Triplet-based Semantic Similarity Score, T3S),通过提取前景与背景语义实体、分离前景-背景关系以及开放世界类/关系建模,实现了对图像语义保留程度的量化评估。实验表明,T3S在COCO和SPA-Data数据集上显著优于依赖保真度的传统指标及现有语义级基准方法,且能更准确捕捉多种退化条件下语义的渐进变化。

链接: https://arxiv.org/abs/2604.25408
作者: Runjie Wang,Weiling Chen,Tiesong Zhao,Chang Wen Chen
机构: Fuzhou University (福州大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-level image processing has long been evaluated mainly from the perspective of visual fidelity. However, with the rise of deep learning and generative models, processed images may preserve perceptual quality while altering semantic content, making conventional Image Quality Assessment (IQA) insufficient for semantic-level assessment. In this paper, we formalize Semantic Similarity as a new evaluation task for low-level image processing, aimed at measuring whether semantic content is preserved after processing. We further present a structured formulation of image semantics based on semantic entities and their relations, and discuss the desired properties and constraints of a valid semantic similarity index. Based on this formulation, we propose Triplet-based Semantic Similarity Score (T3S), which models image semantics through foreground entities, background entities, and relations. T3S combines semantic entity extraction, foreground-background disentanglement, and open-world class/relation modeling. Experiments on COCO and SPA-Data show that T3S consistently outperforms existing fidelity-oriented metrics and representative semantic-level baselines, while better reflecting progressive semantic changes under diverse degradations. These results highlight the importance of semantic assessment in modern low-level vision.
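
T3S 以(主体, 关系, 客体)三元组刻画图像语义。若已得到两幅图像的三元组集合,语义相似度的一种最直接的度量是集合匹配的 F1,如下所示;三元组抽取本身(实体检测、前景/背景解耦、开放词表匹配)不在此示意范围内,具体打分方式以论文为准。

```python
# 示意:以 (主体, 关系, 客体) 三元组集合的 F1 作为语义相似度。

def triplet_f1(pred, gold):
    """pred / gold: 三元组集合,返回集合匹配的 F1 分数。"""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

gold = {("person", "riding", "horse"), ("horse", "on", "beach")}
pred = {("person", "riding", "horse"), ("horse", "on", "grass")}
score = triplet_f1(pred, gold)   # 2 个三元组中命中 1 个 → F1 = 0.5
```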

[CV-27] Leveraging Previous-Traversal Point Cloud Map Priors for Camera-Based 3D Object Detection and Tracking

【速读】:该论文旨在解决自动驾驶中仅使用摄像头进行3D目标检测与跟踪时因缺乏深度信息而导致的定位精度受限问题,尤其是在无在线LiDAR传感器部署的情况下。其解决方案的关键在于提出了一种名为DualViewMapDet的纯摄像头推理框架,通过在线检索先前构建的静态点云地图作为几何先验,并采用双空间(透视视图PV与鸟瞰图BEV)融合策略实现高效的地图-相机特征交互:一方面将地图投影至透视视图并编码多通道几何线索以增强图像特征并支持BEV提升,另一方面在BEV空间中用稀疏体素骨干网络编码地图并与提升后的相机特征在共享度量空间中融合,从而有效缓解深度模糊问题并提升3D目标定位性能。

链接: https://arxiv.org/abs/2604.25405
作者: Markus Käppeler,Özgün Çiçek,Yakov Miron,Abhinav Valada
机构: University of Freiburg (弗莱堡大学); Bosch Research (博世研究中心); University of Haifa (海法大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key idea is a dual-space camera-map fusion strategy that avoids one-sided view conversion. Specifically, we (i) project the map into perspective view (PV) and encode multi-channel geometric cues to enrich image features and support BEV lifting, and (ii) encode the map directly in bird’s-eye view (BEV) with a sparse voxel backbone and fuse it with lifted camera features in a shared metric space. Extensive evaluations on nuScenes and Argoverse 2 demonstrate consistent improvements over strong camera-only baselines, with particularly strong gains in object localization. Ablations further validate the contributions of PV/BEV fusion and prior-map coverage. We make the code and pre-trained models available at this https URL.
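
论文在 BEV 空间中编码先验地图。其最基础的预处理是把点云栅格化为 BEV 网格,下面是一个占据计数版本的示意(网格范围与分辨率为虚构参数,论文实际使用稀疏体素骨干网络)。

```python
# 示意:将先验点云栅格化为 BEV 占据计数图(网格参数为虚构)。
import numpy as np

def points_to_bev(points, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0), res=1.0):
    """points: (N, 3) 点云,返回 (H, W) 的占据计数图。"""
    h = int((y_range[1] - y_range[0]) / res)
    w = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((h, w))
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    np.add.at(bev, (yi[ok], xi[ok]), 1.0)   # 同一格内的点累加计数
    return bev

pts = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.5, 1.0],     # 与上一点落在同一 BEV 格
                [-20.0, 0.0, 0.0]])  # 超出范围,被丢弃
bev = points_to_bev(pts)
```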

[CV-28] COMPASS: COmpact Multi-channel Prior-map And Scene Signature for Floor-Plan-Based Visual Localization

【速读】:该论文旨在解决机器人在复杂室内环境中位姿估计(pose estimation)时,未能有效利用建筑平面图(architectural floor plan)中蕴含的语义信息的问题。现有定位方法主要依赖几何特征,忽略了平面图中如墙体、门窗等结构的语义先验。其解决方案的关键在于提出COMPASS算法,通过构建多通道径向描述子(multi-channel radial descriptor),同时融合来自平面图和双鱼眼相机图像的几何与语义信息:平面图通过360°射线投射生成包含归一化距离、结构类型(墙/窗/开口)、距离梯度、逆距离及局部距离方差的五通道描述子;视觉侧则通过线段检测与垂直边缘聚类实现窗户检测,并将其投影至方位角以填充结构类型通道。该方法首次实现了跨模态结构匹配的可行性验证,为基于语义增强的视觉-地图匹配定位提供了新思路。

链接: https://arxiv.org/abs/2604.25388
作者: Muhammad Shaheer,Miguel Fernandez-Cortizas,Asier Bikandi-Noya,Holger Voos,Jose Luis Sanchez-Lopez
机构: University of Luxembourg (卢森堡大学); Interdisciplinary Centre for Security, Reliability and Trust (SnT) (安全、可靠性与信任跨学科研究中心); Faculty of Science, Technology and Medicine, University of Luxembourg (卢森堡大学科学、技术和医学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we present COMPASS, an algorithm that exploits both geometric and semantic priors from floor plans to estimate the pose of a robot equipped with dual fisheye cameras. Inspired by scan context descriptor from LiDAR-based place recognition, we design a multi-channel radial descriptor that encodes the geometric layout surrounding a position. From the floor plan, rays are cast in 360 azimuth bins and the results are encoded into five channels: normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. From the image side, the same descriptor structure is populated by detecting structural elements in the fisheye imagery. As a first step toward full cross-modal matching, we present a window detection algorithm for fisheye images that uses a line segment detector to identify window frames via vertical edge clustering and brightness verification. Detected windows are projected to azimuthal bearings through the fisheye camera model, producing the hit-type channel of the visual descriptor. As a proof of concept, we generate both descriptors at a single known pose from the Hilti-Trimble SLAM Challenge 2026 dataset and demonstrate that the wall-window pattern extracted from the first frame of each camera closely matches the floor plan descriptor, validating the feasibility of cross-modal structural matching.
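
COMPASS 的多通道径向描述子可以用几行 NumPy 勾勒出来:对 360 个方位角 bin 的射线距离,分别计算归一化距离、距离梯度、逆距离与局部方差四个几何通道(结构类型通道依赖墙/窗/开口标签,此处省略)。输入距离为模拟数据,窗口大小等参数为示意取值。

```python
# 示意:对 360 个方位角 bin 的射线距离计算四个几何通道。
# 结构类型通道需要墙/窗/开口标签,此处省略;参数为示意取值。
import numpy as np

def radial_descriptor(ranges, r_max=20.0, win=5):
    """ranges: 长度为 n_bins 的射线距离数组,返回 (4, n_bins) 描述子。"""
    r = np.asarray(ranges, dtype=float)
    norm = r / r_max                              # 归一化距离
    grad = np.gradient(r)                         # 距离梯度
    inv = 1.0 / (r + 1e-6)                        # 逆距离
    pad = np.concatenate([r[-win:], r, r[:win]])  # 环形填充,窗口跨越 0°/360°
    var = np.array([pad[i:i + 2 * win + 1].var()  # 局部距离方差
                    for i in range(len(r))])
    return np.stack([norm, grad, inv, var])

desc = radial_descriptor(np.full(360, 5.0))   # 全向等距(如圆形房间)的退化情形
```

平面图侧由射线投射直接得到 ranges,视觉侧则由检测到的结构元素方位填充,两侧共享同一描述子结构即可进行跨模态匹配。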

[CV-29] Benchmarking and Improving GUI Agents in High-Dynamic Environments

【速读】:该论文旨在解决高动态图形用户界面(GUI)环境中代理决策能力不足的问题,即现有方法通常仅依赖单张截图进行决策,导致状态信息不完整,难以应对界面频繁变化的场景。其核心解决方案是提出DynamicUI代理架构,关键在于:首先通过动态感知器(dynamic perceiver)对交互过程中的屏幕录制视频进行帧聚类并生成语义描述,迭代选择最具信息量的帧作为动态上下文;其次引入动作条件过滤策略(action-conditioned filtering)以消除帧与文本上下文之间的不一致性和冗余;最后利用反思模块(reflection)基于优化后的轨迹提供精准的动作指导。这一设计显著提升了在复杂动态GUI环境下的任务完成率,同时保持了在传统基准上的竞争力。

链接: https://arxiv.org/abs/2604.25380
作者: Enqi Liu,Liyuan Pan,Zhi Gao,Yan Yang,Chenrui Shi,Yang Liu,Jingrong Wu,Qing Li
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.
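
DynamicUI 的动态感知器首先对录屏帧做聚类并选取代表帧。下面用 k-means 加最近邻选帧给出这一步的示意:帧特征为随机模拟,实际应来自视觉编码器;聚类与选帧的具体策略以论文为准。

```python
# 示意:对录屏帧特征做 k-means 聚类并选取离各簇中心最近的帧。
# 帧特征为随机模拟,实际应为视觉编码器输出;策略细节以论文为准。
import numpy as np

def select_keyframes(feats, k, iters=20, seed=0):
    """feats: (N, D) 帧特征。返回 k 个代表帧的下标。"""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        d = ((feats[:, None] - centers[None]) ** 2).sum(-1)  # (N, k) 距离
        assign = d.argmin(1)
        for j in range(k):
            members = feats[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    d = ((feats[:, None] - centers[None]) ** 2).sum(-1)
    return [int(d[:, j].argmin()) for j in range(k)]  # 各簇中最接近中心的帧

rng = np.random.default_rng(1)
feats = np.concatenate([rng.normal(0.0, 0.1, (20, 8)),   # 前 20 帧:界面状态 A
                        rng.normal(5.0, 0.1, (20, 8))])  # 后 20 帧:界面状态 B
idx = select_keyframes(feats, k=2)   # 应各从一种界面状态中选出一帧
```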

[CV-30] CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation

【速读】:该论文旨在解决医学影像中脑部病灶分割任务在持续学习(Continual Learning, CL)场景下的知识遗忘与参数冗余问题,尤其是在面对临床数据流的非平稳性、病理多样性及多模态异质性时,传统方法难以有效适应新任务且保持已有知识。其解决方案的关键在于提出概念推理扩展(Concept-Reasoning Expansion, CoRE)框架,通过将视觉特征与结构化概念库对齐,构建图像token与层级概念之间的映射机制,模拟临床决策过程,实现可解释的专家路由与按需模型增长,从而在保留临床先验的基础上避免冗余参数膨胀,并最大化知识复用能力。

链接: https://arxiv.org/abs/2604.25376
作者: Qianqian Chen,Anglin Liu,Jingyang Zhang,Yudong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate brain lesion segmentation in MRI is vital for effective clinical diagnosis and treatment planning. Due to high annotation costs and strict data privacy regulations, universal models require employing Continual Learning (CL) to adapt to evolving clinical tasks without losing previously acquired knowledge. However, existing CL paradigms often suffer from capacity limits or redundant parameter growth, and even advanced dynamic methods rely mostly on image-perception strategies that struggle to handle the substantial pathological and multimodal heterogeneity inherent in brain imaging. To address this issue, we propose Concept-Reasoning Expansion (CoRE) framework, which establishes a joint decision-making mechanism by integrating visual features with structured concepts. Through the alignment of image tokens with a hierarchical concept library, CoRE simulates clinical reasoning to guide both interpretable expert routing and demand-based model growth. This collaborative process ensures model evolution is grounded in clinical priors, preventing redundant parameter expansion while maximizing knowledge reuse. Extensive evaluations across 12 sequential brain lesion MRI tasks demonstrate that CoRE achieves state-of-the-art performance and provides a high knowledge starting point for efficient future adaptation. Its superior few-shot transferability and clinical interpretability further validate its effectiveness in managing non-stationary clinical data streams. Our code will be released soon.

[CV-31] GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment

【速读】:该论文旨在解决生成式 AI(Generative AI)图像在社交媒体平台上传播时的可追溯性与真实性验证问题,特别是针对 GPT-Image-2 生成图像在 Twitter/X 上的传播特征进行系统性分析。其解决方案的关键在于构建首个公开的 GPT-Image-2 图像数据集 GPT-Image-2 Twitter Dataset,通过多阶段自动化流程实现高质量图像筛选:包括基于英语、日语和中文的多语言文本启发式规则、浏览器自动化识别"Made with AI"徽章以及模型名称变体匹配,最终从 27,662 条记录中确认了 10,217 张真实 GPT-Image-2 生成图像。研究还揭示了一个关键负面发现:Twitter 的 CDN 在上传过程中系统性剥离 C2PA(Coalition for Content Provenance and Authenticity)内容凭证,导致基于加密凭证的内容溯源无法实施,这对未来 AI 图像治理具有重要警示意义。

链接: https://arxiv.org/abs/2604.25370
作者: Kidus Zewde,Simiao Ren,Xingyu Shen,Jenny Wu,Yuchen Zhou,Tommy Duong,Zikang Zhang,Ethan Traister
机构: SCAM (Stanford Center for AI and Machine Learning)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages; GPT-image-2 social media dataset; Twitter API collection and multilingual curation; C2PA watermark stripping on platform upload; browser-automated AI badge verification; CLIP semantic clustering; AI-generated image provenance and attribution

点击查看摘要

Abstract:The release of GPT-image-2 by OpenAI marks a watershed moment in AI-generated imagery: the boundary between photographic reality and synthetic content has never been more difficult to discern. We introduce the GPT-Image-2 Twitter Dataset, the first published dataset of GPT-image-2 generated images, sourced from publicly available Twitter/X posts in the immediate aftermath of the model’s April 21, 2026 release. Leveraging the Twitter API v2 and a multi-stage curation pipeline spanning multilingual text heuristics (English, Japanese, and Chinese), browser-automated Twitter “Made with AI” badge verification, and model name variant matching, we curate 10,217 confirmed GPT-image-2 images from 27,662 collected records over a six-day window. We characterize the dataset across four analyses: CLIP-based zero-shot subject taxonomy, OCR text legibility (82.0% of images contain detectable text), face detection (59.2% of images, 22,583 total faces), and semantic clustering (137 CLIP ViT-L/14 clusters). A key negative result is that C2PA content credentials are systematically stripped by Twitter’s CDN on upload, rendering cryptographic provenance verification infeasible for social-media-sourced AI images. The dataset and all curation code are released publicly.
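
数据集构建流程中的"多语言文本启发式 + 模型名变体匹配"可以示意为一个简单的关键词过滤器。下面的变体与关键词列表均为示意,并非论文使用的完整规则集。

```python
# 示意:多语言关键词 + 模型名变体的候选推文过滤器(规则列表为示意)。

MODEL_VARIANTS = ["gpt-image-2", "gpt image 2", "gptimage2"]
AI_KEYWORDS = ["made with ai", "ai生成", "ai生成画像"]   # 英/中/日的示意关键词

def is_candidate(text):
    """文本同时命中模型名变体与 AI 生成关键词时视为候选。"""
    t = text.lower()
    return any(v in t for v in MODEL_VARIANTS) and any(k in t for k in AI_KEYWORDS)

hit = is_candidate("Made with AI: testing GPT-Image-2 on Twitter")
miss = is_candidate("just a normal photo of my cat")
```

这类启发式只能产出候选集,论文随后还依赖浏览器自动化核验平台徽章来确认标注。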

[CV-32] Self-DACE: Robust Low-Light Enhancement via Efficient Adaptive Curve Estimation

【速读】:该论文旨在解决低光照图像增强(Low-Light Image Enhancement, LLIE)中计算效率与恢复质量之间的权衡问题。解决方案的关键在于提出Self-DACE++框架,其核心创新包括:引入参数极少的自适应调整曲线(Adaptive Adjustment Curves, AACs),在保持色彩保真度、结构完整性和自然性的同时灵活调节动态范围;采用随机顺序训练策略与网络融合机制,实现模型轻量化并构建高效的迭代推理结构;此外,基于Retinex理论设计物理引导的目标函数,并集成专用去噪模块以有效估计和抑制暗区潜在噪声,从而在多个真实世界基准数据集上实现优于现有最先进方法的增强效果与实时推理能力。

链接: https://arxiv.org/abs/2604.25367
作者: Jianyu Wen,Jun Xie,Feng Chen,Zhepeng Wang,Chenhao Wu,Tong Zhang,Yixuan Yu,Piotr Swierczynski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present Self-DACE++, an improved unsupervised and lightweight framework for Low-Light Image Enhancement (LLIE), building upon our previous Self-Reference Deep Adaptive Curve Estimation (Self-DACE). To better address the trade-off between computational efficiency and restoration quality, Self-DACE++ introduces enhanced Adaptive Adjustment Curves (AACs). These curves, governed by minimal trainable parameters, flexibly adjust the dynamic range while preserving the color fidelity, structural integrity, and naturalness of the enhanced images. To achieve an extremely lightweight architecture without sacrificing performance, we propose a randomized order training strategy coupled with a network fusion mechanism, which compresses the model into an efficient iterative inference structure. Furthermore, we formulate a physics-grounded objective function based on Retinex theory and incorporate a dedicated denoising module to effectively estimate and suppress latent noise in dark regions. Extensive qualitative and quantitative evaluations on multiple real-world benchmark datasets demonstrate that Self-DACE++ outperforms existing state-of-the-art methods, delivering superior enhancement quality with real-time inference capability. The code is available at this https URL.
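
Self-DACE 一类方法用参数极少的曲线迭代调整动态范围。这里借用 Zero-DCE 式的二次曲线 LE(x) = x + α·x·(1-x) 作为通用示例演示其性质:α > 0 时提亮暗部且输出保持在 [0,1] 内;论文中 AAC 的具体参数化以原文为准。

```python
# 示意:Zero-DCE 式曲线 LE(x) = x + α·x·(1-x) 的迭代应用。
# AAC 的具体形式以论文为准;此处仅演示"少参数曲线调整动态范围"的性质。
import numpy as np

def apply_curve(x, alphas):
    """x: 取值在 [0,1] 的图像;alphas: 每次迭代一个标量参数。"""
    for a in alphas:
        x = x + a * x * (1.0 - x)
    return x

x = np.array([0.05, 0.2, 0.5, 0.9])    # 低照度像素示例
y = apply_curve(x, alphas=[0.8] * 4)   # 正的 α 逐步提亮暗部
```

曲线在 0 与 1 处取值不变,因此迭代应用既能拉升暗部又不会使亮部溢出。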

[CV-33] HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation ICME2026

【速读】:该论文旨在解决生成式视频模型中人类运动质量评估的难题,现有评价指标多聚焦于全局场景统计特征,忽视了人体细节信息,导致与人类主观偏好不一致。其解决方案的关键在于提出一种以人类为中心的评估框架HuM-Eval,采用从粗到细(coarse-to-fine)的策略:首先利用视觉语言模型(Vision Language Model)对视频整体质量进行粗粒度评估,随后通过2D姿态分析验证解剖学正确性,并借助3D人体运动数据评估运动稳定性,从而实现更贴近人类感知的精细化评价。

链接: https://arxiv.org/abs/2604.25361
作者: Bingzi Zhang,Kaisi Guan,Ruihua Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)

点击查看摘要

Abstract:Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of 58.2%, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.

[CV-34] Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings CVPR

【速读】: This paper tackles two challenges in evaluating layout-guided text-to-image generative models: existing benchmarks struggle to comprehensively measure both semantic alignment with textual prompts and spatial fidelity to prescribed layouts, and the high cost of fine-grained annotation limits benchmark scale and coverage, hindering model comparison, ranking, and interpretation. The key to the solution is a pair of complementary benchmarks: a closed-set benchmark (C-Bench) that isolates key generative capabilities through controlled prompt structures and layouts of varying complexity, and an open-set benchmark (O-Bench) that evaluates models on real-world prompts and layouts, reflecting behavior in the wild. The authors further develop a unified evaluation protocol that combines semantic and spatial accuracy into a single score, enabling stable and consistent model ranking.

链接: https://arxiv.org/abs/2604.25358
作者: Luca Parolari,Nicla Faccioli,Lamberto Ballan
机构: University of Padova (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Evaluating layout-guided text-to-image generative models requires assessing both semantic alignment with textual prompts and spatial fidelity to prescribed layouts. Assessing layout alignment requires collecting fine-grained annotations, which is costly and labor-intensive. Consequently, current benchmarks rarely provide comprehensive layout evaluation and often remain limited in scale or coverage, making model comparison, ranking, and interpretation difficult. In this work, we introduce a closed-set benchmark (C-Bench) designed to isolate key generative capabilities while providing varying levels of complexity in both prompt structure and layout. To complement this controlled setting, we propose an open-set benchmark (O-Bench) that evaluates models using real-world prompts and layouts, offering a measure of semantic and spatial alignment in the wild. We further develop a unified evaluation protocol that combines semantic and spatial accuracy into a single score, ensuring consistent model ranking. Using our benchmarks, we conduct a large-scale evaluation of six state-of-the-art layout-guided diffusion models, totaling 319,086 generated and evaluated images. We establish a model ranking based on their overall performance and provide detailed breakdowns for text and layout alignment to enhance interpretability. Fine-grained analyses across scenarios and prompt complexities highlight the strengths and limitations of current models. Code is available at this https URL.
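The abstract does not specify how the unified protocol fuses semantic and spatial accuracy into one score. A hypothetical fusion rule with the desired behavior is a harmonic mean, which only rewards models that are strong on both axes; the function name and numbers below are purely illustrative:

```python
def unified_score(semantic_acc: float, spatial_acc: float) -> float:
    """Harmonic-mean fusion of semantic and spatial accuracy (both in [0, 1]).

    This is one hypothetical choice, not the paper's protocol: the harmonic
    mean penalizes imbalance, so a model cannot rank highly by excelling at
    text alignment while ignoring the layout (or vice versa).
    """
    if semantic_acc + spatial_acc == 0.0:
        return 0.0
    return 2.0 * semantic_acc * spatial_acc / (semantic_acc + spatial_acc)

balanced = unified_score(0.7, 0.7)    # equals 0.7: balance preserved
lopsided = unified_score(0.95, 0.45)  # ~0.61: penalized despite higher mean
```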

[CV-35] Assessment of the quantitative impact of occlusal positioning splints on temporomandibular joint conditions

【速读】: This paper addresses the difficulty of quantitatively analyzing the temporomandibular joint (TMJ) across different mandibular positions, where conventional approaches require repeated imaging and thereby increase radiation exposure and cost. The key to the solution is to model the occlusal positioning splint as the physical realization of a rigid transformation defined from multimodal data (CBCT, facial motion capture, and dental scans), to assess splint positioning accuracy from repeated plaster-model scans, and to analyze the discrepancies statistically as error transformations in the space of rigid motions. The estimated transformations are then propagated to segmented TMJ structures to enable simulation-based assessment of joint-space changes, so that TMJ configuration can be evaluated indirectly from a single anatomical model plus transformation data, reducing the need for repeated imaging across multiple mandibular positions.

链接: https://arxiv.org/abs/2604.25322
作者: Agnieszka Anna Tomaka,Krzysztof Domino,Dariusz Pojda,Michał Tarnawski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 9 figures

点击查看摘要

Abstract:A computational method for quantitative analysis of temporomandibular joint (TMJ) configuration using occlusal positioning splints is proposed and demonstrated. The method models a positioning splint as a physical realization of a predefined rigid transformation of the mandible, derived from multimodal data, including CBCT, facial motion acquisition, and dental scans integrated within a common coordinate system. Splints corresponding to selected mandibular positions are designed and fabricated, and their positioning accuracy is evaluated using repeated scans of plaster models. Discrepancies are represented as error transformations and analyzed statistically in the space of rigid motions. The estimated transformations are propagated to segmented TMJ structures, enabling simulation-based evaluation of joint space changes. Transformation-based error analysis and surface distance metrics are used to quantify differences between planned and achieved configurations. The method enables indirect assessment of TMJ configuration using a single anatomical model and transformation data, reducing the need for repeated imaging across multiple mandibular positions. This study is intended as a methodological demonstration, supported by a clear step-by-step graphical presentation, and does not aim to provide clinical validation.
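The core bookkeeping of the method, composing planned and achieved placements as rigid transforms, forming an error transformation, and propagating it to joint surface points, can be sketched as follows. The specific rotation angles, translations, and the synthetic condyle point cloud are invented for illustration:

```python
import numpy as np

def rigid(rz_deg, t):
    """Build a 4x4 rigid transform: rotation about z by rz_deg, translation t."""
    th = np.deg2rad(rz_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(th), -np.sin(th), 0],
                 [np.sin(th),  np.cos(th), 0],
                 [0,           0,          1]]
    T[:3, 3] = t
    return T

def apply(T, pts):
    """Apply a 4x4 transform to an (N, 3) point set."""
    h = np.hstack([pts, np.ones((len(pts), 1))])
    return (h @ T.T)[:, :3]

planned  = rigid(5.0, [1.0, 0.0, 0.5])        # planned mandibular displacement
achieved = rigid(5.5, [1.1, -0.1, 0.5])       # what the splint actually realized
error    = achieved @ np.linalg.inv(planned)  # error transform in SE(3)

# Propagate to (synthetic) condyle surface points and quantify the discrepancy
condyle = np.random.default_rng(0).uniform(-10, 10, size=(200, 3))
dist = np.linalg.norm(apply(achieved, condyle) - apply(planned, condyle), axis=1)
mean_surface_error = dist.mean()  # surface-distance metric between configurations
```

Collecting such error transforms over repeated scans is what allows the statistical analysis in the space of rigid motions described in the abstract.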

[CV-36] Edge-Cloud Collaborative Reconstruction via Structure-Aware Latent Diffusion for Downstream Remote Sensing Perception

【速读】: This paper addresses the loss of high-frequency structural detail caused by the extreme compression ratios required under limited satellite-to-ground downlink bandwidth, which severely degrades downstream machine perception tasks such as object detection. Among existing super-resolution (SR) methods, regression-based approaches produce over-smoothed textures, while generative diffusion models introduce structural hallucinations that mislead detection systems. The key to the proposed Structure-Aware Latent Diffusion (SALD) framework is twofold: at the resource-constrained edge, imagery is decoupled into a highly compressed low-frequency payload and a lightweight soft structural prior; in the cloud, a Structure-Gated Large Kernel (SGLK) module and a Semantic-Guidance Engine (SGE) module use the structural prior to control long-range dependency modeling in the diffusion process and suppress generative hallucinations, jointly achieving high-quality reconstruction and strong downstream performance under extremely low bandwidth.

链接: https://arxiv.org/abs/2604.25319
作者: Yun Li,Xianju Li
机构: China University of Geosciences (中国地质大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:The exponential surge in high-resolution remote sensing data faces a severe bottleneck in satellite-to-ground transmission. Limited downlink bandwidth forces the use of extreme high-ratio compression, which irreversibly destroys high-frequency structural details essential for downstream machine perception tasks like object detection. While current super-resolution techniques attempt to recover these details, regression-based methods often yield over-smoothed textures, and generative diffusion models frequently introduce structural hallucinations that mislead detection systems. To address this trade-off, we propose the Structure-Aware Latent Diffusion (SALD) framework, an asymmetric edge-cloud collaborative SR system. At the resource-constrained edge, the system decouples imagery into a highly compressed low-frequency payload and a lightweight soft structural prior. Transmitting this decoupled representation minimizes bandwidth consumption. On the powerful cloud side, we introduce a Structure-Gated Large Kernel (SGLK) module and a Semantic-Guidance Engine (SGE) within the diffusion backbone. These modules leverage the transmitted structural priors to gate large-kernel convolutions, effectively capturing long-range dependencies inherent in aerial scenes while actively suppressing generative hallucinations. Extensive experiments on both the MSCM and UCMerced datasets demonstrate that, even under extreme bandwidth constraints, SALD achieves superior perceptual quality (LPIPS) and significantly enhances downstream performance in both scene classification and small-target detection.

[CV-37] Towards Robust Deep Learning-based Rumex Obtusifolius Detection from Drone Images

【速读】: This paper addresses domain adaptation (DA): transferring a Rumex obtusifolius image-classification model trained on ground-vehicle imagery (source domain) to UAV-acquired target data with a different distribution. The key finding is that, whereas conventional CNNs such as ResNets generalize poorly to the target domain even after fine-tuning on source data, Vision Transformers (ViTs) pretrained with self-supervised objectives (DINOv2, DINOv3) handle the domain shift intrinsically well, thanks to the general-purpose representations acquired during large-scale pretraining; fine-tuned only on source data, ViTs reach classification performance around F1=0.8 on the target domain, surpassing even ResNets improved with dedicated DA techniques such as moment matching and maximum classifier discrepancy.

链接: https://arxiv.org/abs/2604.25316
作者: Fabian Dionys Schrag,Mehmet Ozgur Turkoglu,Konrad Schindler,Ralph Lukas Stoop
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review

点击查看摘要

Abstract:Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Applying moment-matching and maximum classifier discrepancy, two established DA techniques, substantially improves target-domain performance. However, Vision Transformer (ViT) models pretrained with self-supervised objectives (DINOv2, DINOv3) handle domain shifts intrinsically well, surpassing even moment-matching-trained ResNets, likely due to the rich, general-purpose representations acquired during large-scale pretraining. Using ViTs fine-tuned on the source dataset, we demonstrate high classification performances in the range of F1=0.8 on our target dataset. To support further research on DA for weed detection in grassland systems, we publicly release our UAV-based target dataset AGSMultiRumex, comprising data from 15 flights over Swiss meadows.

[CV-38] SaliencyDecor: Enhancing Neural Network Interpretability through Feature Decorrelation IJCNN2026

【速读】: This paper addresses the noisy, unstable explanations produced by gradient-based saliency methods for deep neural networks and their poor alignment with semantically meaningful input features. The key insight is that feature correlation is a root cause of unreliable gradient explanations: correlated feature dimensions diffuse attribution gradients across redundant directions, yielding blurred saliency maps. The proposed SaliencyDecor training framework introduces a decorrelation regularizer that pushes the learned feature space toward orthogonality, without changing the model architecture or the saliency computation, so that gradient flow becomes more concentrated and explanation fidelity improves. The framework jointly optimizes classification, prediction consistency under feature masking, and the decorrelation objective; experiments show it produces sharper, more object-focused saliency maps while simultaneously improving accuracy, challenging the conventional trade-off between explanation quality and model performance.

链接: https://arxiv.org/abs/2604.25315
作者: Ali Karkehabadi,Jamshid Hassanpour,Houman Homayoun,Avesta Sasan
机构: University of California, Davis, USA; Georgia Institute of Technology, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

点击查看摘要

Abstract:Gradient-based saliency methods are widely used to interpret deep neural networks, yet they often produce noisy and unstable explanations that poorly align with semantically meaningful input features. We argue that a fundamental cause of this behavior lies in the geometry of learned representations: correlated feature dimensions diffuse attribution gradients across redundant directions, resulting in blurred and unreliable saliency maps. To address this issue, we identify feature correlation as a structural limitation of gradient-based interpretability and propose SaliencyDecor, a training framework that enforces feature decorrelation to improve attribution fidelity without modifying saliency methods or model architectures. By reshaping the feature space toward orthogonality, our approach promotes more concentrated gradient flow and improves the fidelity of saliency-based explanations. SaliencyDecor jointly optimizes classification, prediction consistency under feature masking, and a decorrelation regularizer, requiring no architectural changes or inference-time overhead. Extensive experiments across multiple benchmarks and architectures demonstrate that our method produces substantially sharper and more object-focused saliency maps while simultaneously improving predictive performance, achieving accuracy gains across the datasets. These results establish our method as a principled mechanism for enhancing both interpretability and accuracy, challenging the conventional trade-off between explanation quality and model performance.
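A decorrelation regularizer of the kind described can be sketched as an off-diagonal penalty on the feature correlation matrix. The exact loss form used by SaliencyDecor is not given in the abstract, so this specific penalty is an illustrative choice with the same intended effect:

```python
import numpy as np

def decorrelation_penalty(features):
    """Off-diagonal Frobenius penalty on the feature correlation matrix.

    features : (batch, dim) activations, e.g. from the penultimate layer.
    Driving off-diagonal correlations toward zero pushes the learned
    feature space toward orthogonality; the precise loss SaliencyDecor
    uses may differ, this is one standard formulation.
    """
    z = features - features.mean(axis=0, keepdims=True)
    z = z / (z.std(axis=0, keepdims=True) + 1e-8)
    corr = (z.T @ z) / len(z)                  # (dim, dim) correlation matrix
    off_diag = corr - np.diag(np.diag(corr))
    return float((off_diag ** 2).sum())

rng = np.random.default_rng(0)
independent = rng.normal(size=(512, 8))              # nearly decorrelated
base = rng.normal(size=(512, 1))
redundant = base + 0.05 * rng.normal(size=(512, 8))  # correlated copies
p_ind = decorrelation_penalty(independent)           # small penalty
p_red = decorrelation_penalty(redundant)             # large penalty
```

In training, a term like this would be added to the classification and masking-consistency losses with a weighting coefficient.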

[CV-39] Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

【速读】: This paper addresses multi-region semantic consistency in text-to-image (T2I) generation: when a prompt describes multiple spatially separated objects or regions, existing diffusion models struggle to map each sub-prompt to the corresponding image region. The core difficulty is that the conventional noise-prediction network (NPNet) encodes the entire prompt with a single global text embedding and therefore cannot distinguish per-region semantics. The key innovations of the proposed Golden RPG are: (i) a per-region FiLM adapter that reshapes the predicted noise to match each sub-prompt; (ii) a Region Cross-Attention layer injected into the Swin backbone, allowing different spatial locations to attend to the corresponding sub-prompt tokens; and (iii) a Confidence-Adaptive Blending head that dynamically fuses regional and global signals according to sample difficulty, avoiding degradation on easy prompts. The method substantially improves Cross-Region-Coherence while maintaining high fidelity (CLIP-Score) and strong user preference.

链接: https://arxiv.org/abs/2604.25314
作者: Hao Li
机构: University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: ``golden’’ noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially-separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of the Swin backbone, allowing different spatial locations to attend to different sub-prompt tokens. To prevent the regional conditioning from degrading samples whose prompts are already easy, we further propose a Confidence-Adaptive Blending head that dynamically predicts, per sample, how strongly the regional signal should override the global signal. We evaluate on the original RPG benchmark (20 prompts, 100 samples) and on four multi-region categories of T2I-CompBench (1,200 images, six competing methods). Golden RPG achieves the highest Cross-Region-Coherence score on every category, while matching the strongest baselines on absolute CLIP-Score and CLIP-IQA. A paired user study further shows a ~67% preference over the strongest baseline. The adapter contains ~2M trainable parameters and adds only 0.6 s of inference overhead on top of SDXL.
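The per-region FiLM idea can be sketched in a few lines. In Golden RPG the (gamma, beta) parameters would be predicted from each sub-prompt by the trained adapter; here they are fixed constants and the region mask is hand-made, purely for illustration:

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift every channel.
    In a per-region adapter, (gamma, beta) come from the sub-prompt
    embedding and are applied only inside that region's mask."""
    return gamma * x + beta

noise = np.zeros((2, 4, 4))                 # (channels, H, W) predicted noise
mask = np.zeros((4, 4))
mask[:, :2] = 1.0                           # left half = region 1
gamma, beta = 1.5, 0.2                      # hypothetical region-1 modulation

# Blend: modulate inside the region, leave the rest of the map untouched
out = mask * film(noise, gamma, beta) + (1 - mask) * noise
```

A confidence-adaptive head would replace the hard mask blend with a learned per-sample weight between the regional and global noise predictions.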

[CV-40] Rapid tracking through strongly scattering media with physics-informed neuromorphic speckle analysis

【速读】: This paper addresses stable tracking of fast-moving targets through strongly scattering media under low-light conditions. Conventional frame-based cameras, with their fixed exposure times, trade signal-to-noise ratio against temporal resolution and cannot meet such extreme requirements. The key to the proposed Computational Neuromorphic Tracking (CNT) framework is the combination of asynchronous event sensing with task-driven speckle analysis: neuromorphic speckle aggregation is formulated as a spatiotemporal speckle representation whose temporal and spatial parameters are jointly optimized to maximize tracking stability under extreme conditions. Experiments demonstrate robust tracking of 10x faster motion under 10x dimmer illumination than conventional systems, substantially broadening the operating regime for tracking through scattering media.

链接: https://arxiv.org/abs/2604.25310
作者: Yuqing Cao,Shuo Zhu,Rongzhou Chen,Jingyan Chen,Ni Chen,Edmund Y. Lam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:This work addresses the critical problem of tracking fast-moving objects through strongly scattering media in a low-light environment. Different from existing approaches that use frame-based cameras with fixed exposure times, which trade off signal-to-noise ratio for temporal resolution, we introduce computational neuromorphic tracking (CNT), a physics-informed framework that combines asynchronous event sensing with task-driven speckle analysis for robust motion estimation. We formulate the neuromorphic speckle aggregation as a spatiotemporal speckle representation, jointly optimizing the temporal and spatial parameters to maximize tracking stability under extreme conditions. Extensive experiments demonstrate that our method enables robust motion tracking of 10x faster motion and under 10x dimmer illumination compared to conventional systems. These improvements significantly broaden the operational regime for tracking through scattering media, providing an efficient and scalable solution for demanding scenarios involving rapid motion and low-light conditions.

[CV-41] DenseScout: Algorithm-System Co-design for Budgeted Tiny Object Selection on Edge Platforms

【速读】: This paper addresses the dual challenge of deploying tiny-object perception on edge platforms: selecting candidate regions efficiently under strict compute budgets while also meeting end-to-end latency constraints. Existing detector-based frontends, despite good offline detection accuracy, are poorly matched to low-budget patch prioritization and ignore the impact of transport and inference delays on practical performance. The key to the solution is DenseScout, a lightweight dense-response selector with only 1.01M parameters that ranks candidate patch locations in a high-resolution image directly from a lightweight proxy input, aligning better with low-budget tiny-object prioritization than detector-style frontends. The authors further develop a transport-aware runtime for heterogeneous edge devices and introduce QoS-constrained recall, which counts a target as successfully perceived only if the selected regions cover it and end-to-end processing finishes before the deadline, bridging offline selection quality and deployable utility. Experiments show DenseScout clearly outperforms detector-based baselines in low-budget regimes, while cross-platform results indicate that edge tiny-object perception must be optimized as an algorithm-system co-design problem.

链接: https://arxiv.org/abs/2604.25300
作者: Xiong Zhouzhi,Zimo Zeng,Yi Chen,Shuqi Xu,Yunfeng Yan,Donglian Qi
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 19 pages, 8 figures

点击查看摘要

Abstract:Deploying tiny object perception on edge platforms is challenging because practical systems must satisfy both strict compute budgets and end-to-end latency constraints. A common strategy is to first select a small number of candidate patches from a high-resolution image and then apply downstream processing only to the selected regions. However, existing detector-based frontends are not well aligned with this setting: strong offline detection accuracy does not necessarily yield effective low-budget patch prioritization, nor does it guarantee usable performance once transport and inference delays are considered. In this work, we study budgeted tiny object selection on edge platforms from a joint algorithm–system perspective. We present DenseScout, a lightweight dense-response selector with only 1.01M parameters, which directly ranks candidate patch locations from a high-resolution scene via a lightweight proxy input and is better aligned with low-budget tiny-object prioritization than detector-style frontends. To bridge offline selector quality and deployable utility, we further develop a transport-aware runtime realization on heterogeneous edge devices and adopt QoS-constrained recall, which counts a target as successfully perceived only if it is covered by the selected regions and the end-to-end processing finishes before the deadline. Experiments show that DenseScout consistently outperforms detector-based baselines in offline budgeted patch-selection evaluation, especially in low-budget regimes, while cross-platform results on RK3588 and Jetson Orin NX show that deployable performance depends jointly on selector quality and runtime realization efficiency. These results suggest that edge tiny object perception should be optimized as an algorithm–system co-design problem rather than as isolated model selection.
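The QoS-constrained recall metric described in the abstract can be sketched directly. This is a minimal reading of the metric (center-point coverage, a single latency figure), not the authors' reference implementation; the box layout and numbers are invented:

```python
def qos_constrained_recall(targets, selected_patches, latency_ms, deadline_ms):
    """A target counts as perceived only if (a) a selected patch covers its
    center and (b) end-to-end processing finished before the deadline.
    Boxes are (x0, y0, x1, y1)."""
    if latency_ms > deadline_ms:
        return 0.0  # missed deadline: nothing counts, regardless of coverage

    def covered(t):
        cx, cy = (t[0] + t[2]) / 2, (t[1] + t[3]) / 2
        return any(p[0] <= cx <= p[2] and p[1] <= cy <= p[3]
                   for p in selected_patches)

    hits = sum(covered(t) for t in targets)
    return hits / max(len(targets), 1)

targets = [(10, 10, 14, 14), (80, 80, 84, 84)]   # two tiny objects
patches = [(0, 0, 32, 32)]                       # budget allows one patch
on_time = qos_constrained_recall(targets, patches, 40, 50)  # covers 1 of 2
late    = qos_constrained_recall(targets, patches, 60, 50)  # deadline missed
```

The deadline gate is what couples selector quality to runtime realization: the same selected patches score 0.5 or 0.0 depending solely on latency.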

[CV-42] The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

【速读】: This paper addresses the limited capability of diffusion models on complex structured reasoning tasks such as text-guided image generation: because visual tokens are continuous rather than discrete, latent-reasoning and recursion strategies developed for language models are hard to transfer to multimodal text-to-image generation. The key to the solution is a recursive, sparse mixture-of-experts framework, inspired by modular human cognition, integrated into conventional diffusion models: a recursive component within the joint attention layers iteratively refines visual tokens over multiple latent steps while sharing parameters efficiently through sparse selection of neural modules; at each step, a gating network dynamically selects specialized modules conditioned on the current visual tokens, the diffusion timestep, and the conditioning information, substantially improving image-generation performance.

链接: https://arxiv.org/abs/2604.25299
作者: Yuwei Sun,Yuxuan Yao,Hui Li,Siyu Zhu
机构: Shanghai Academy of AI for Science (上海人工智能科学研究院); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning like text following tasks remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding capabilities, extending these to multimodal text-to-image generation tasks is challenging due to the continuous and non-discrete nature of visual tokens. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is devised to dynamically select specialized neural modules, conditioned on the current visual tokens, the diffusion timestep, and the conditioning information. Comprehensive evaluation on class-conditioned ImageNet image generation tasks and additional studies on the GenEval and DPG benchmark demonstrate the superiority of the proposed method in enhancing model image generation performance.
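The gating step at the heart of the sparse expert selection can be sketched as a conditioned top-k softmax gate. The weight matrix and embeddings below are random stand-ins for learned parameters, and the concatenation-based conditioning is an assumption about how the three signals are combined:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_gate(token_feat, timestep_emb, cond_emb, W, k=2):
    """Pick the top-k neural modules for the current latent step.

    The gate is conditioned on the current visual tokens, the diffusion
    timestep, and the conditioning signal, as the abstract describes;
    W and the embedding sizes here are illustrative."""
    gate_in = np.concatenate([token_feat, timestep_emb, cond_emb])
    logits = W @ gate_in                         # one logit per expert module
    probs = softmax(logits)
    top_k = np.argsort(probs)[-k:]               # sparse selection: k of n experts
    weights = probs[top_k] / probs[top_k].sum()  # renormalize over the chosen k
    return top_k, weights

rng = np.random.default_rng(0)
n_experts, d_in = 8, 12                          # 3 inputs of size 4 each
idx, w = sparse_gate(rng.normal(size=4), rng.normal(size=4),
                     rng.normal(size=4), rng.normal(size=(n_experts, d_in)), k=2)
```

Only the k selected modules run at each recursive step, which is how the framework shares parameters while keeping per-step compute low.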

[CV-43] Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds

【速读】: This paper examines whether explicit time conditioning is truly necessary for deterministic diffusion sampling, given that methods such as DDIM degrade markedly without it. The key to the solution is a geometric reexamination of the forward diffusion process: the authors show that, in high-dimensional spaces, noisy data distributions concentrate on low-dimensional hyper-cylinder-like manifolds, and that high-quality generation hinges on the disentanglement of these manifolds in the ambient space. Building on this insight, they modify the forward process of DDIM to align the noisy data manifold with the flow-matching formulation, proving that DDIM can generate high-quality content without explicit time conditioning provided the noisy manifolds evolve according to the flow-matching schedule. By decoupling classes into distinct time spaces, the framework further enables class-conditioned generation without class-conditional embeddings.

链接: https://arxiv.org/abs/2604.25289
作者: Liuzhuozheng Li,Zhiyuan Zhan,Shuhong Liu,Dengyang Jiang,Zanyi Wang,Guang Dai,Jingdong Wang,Mengmeng Wang
机构: SGIT AI Lab(SGIT人工智能实验室); UTokyo(东京大学); HKUST(香港科技大学); UCSD(加州大学圣地亚哥分校); ZJUT(浙江工业大学); Baidu(百度); RIKEN AIP(理化学研究所先进智能项目)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Practically, training diffusion models typically requires explicit time conditioning to guide the network through the denoising sampling process. Especially in deterministic methods like DDIM, the absence of time conditioning leads to significant performance degradation. However, other deterministic sampling approaches, such as flow matching, can generate high-quality content without this conditioning, raising the question of its necessity. In this work, we revisit the role of time conditioning from a geometric perspective. We analyze the evolution of noisy data distributions under the forward diffusion process and demonstrate that, in high-dimensional spaces, these distributions concentrate on low-dimensional hyper-cylinder-like manifolds embedded within the input space. Successful generation, we argue, stems from the disentanglement of these manifolds in high-dimensional space. Based on this insight, we modify the forward process of DDIM to align the noisy data manifold with the flow-matching approach, proving that DDIM can generate high-quality content without time conditioning, provided the noisy manifold evolves according to the flow-matching method. Additionally, we extend our framework to class-conditioned generation by decoupling classes into distinct time spaces, enabling class-conditioned synthesis with a class-unconditional denoising model. Extensive experiments validate our theoretical analysis and show that high-quality generation is achievable without explicit conditional embeddings.
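The concentration phenomenon the geometric argument relies on can be checked empirically: for a noisy sample x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps (with ab the cumulative noise-schedule coefficient), the noise component's norm concentrates around sqrt(d*(1-ab)) in high dimension, so samples at a given noise level live near a thin shell around the clean signal. The dimensions and noise level below are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, ab = 10_000, 0.3                                  # dimension, cumulative alpha
x0 = rng.normal(size=d)
x0 /= np.linalg.norm(x0)                             # a unit-norm clean sample
eps = rng.normal(size=(256, d))                      # 256 Gaussian noise draws
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps        # noisy samples at level ab

radii = np.linalg.norm(xt - np.sqrt(ab) * x0, axis=1)  # distance to the "axis"
expected = np.sqrt(d * (1 - ab))                       # predicted shell radius
rel_spread = radii.std() / radii.mean()                # tiny in high dimension
```

Because the shell radius is a near-deterministic function of the noise level, the noise level is in principle recoverable from the sample itself, which is consistent with the paper's claim that explicit time conditioning can be dropped when the manifolds are arranged appropriately.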

[CV-44] OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding CVPR2026

【速读】: This paper addresses the performance bottleneck of Video Temporal Grounding (VTG) in open-world settings: limited dataset scale and semantic diversity create a marked gap between common and rare concepts. The solution rests on two core innovations. First, the large-scale, high-coverage OmniVTG dataset is built with a Semantic Coverage Iterative Expansion pipeline that identifies vocabulary gaps in existing datasets and collects matching videos, together with a caption-centric data engine that uses multimodal large language models (MLLMs) to generate high-quality timestamped dense captions for annotation. Second, a Self-Correction Chain-of-Thought (CoT) training paradigm, realized as a three-stage pipeline of supervised fine-tuning, CoT fine-tuning, and reinforcement learning, trains the MLLM to first predict and then use its stronger video-understanding ability to reflect on and refine its own predictions, substantially improving grounding of rare concepts and achieving state-of-the-art zero-shot performance across benchmarks.

链接: https://arxiv.org/abs/2604.25276
作者: Minghang Zheng,Zihao Yin,Yi Yang,Yuxin Peng,Yang Liu
机构: Peking University (北京大学); Huawei Technologies Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs’ video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at this https URL.

[CV-45] Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

【速读】: This paper addresses semantic-alignment deviation in Unified Multimodal Retrieval (UMR): existing embedding methods rely mainly on sample-level contrastive learning and overlook subject-level semantics, so models fail to accurately localize the salient visual regions referred to by text in complex multimodal queries, over-rely on textual cues while neglecting the visual modality, and under-utilize visual knowledge. The key to the proposed Salient Subject-Aware Multimodal Embedding (SSA-ME) framework is threefold: it uses LMMs together with visual experts to identify salient visual concepts in image-text pairs; it introduces a saliency-guided objective that aligns cross-modal attention with semantically meaningful regions; and it adds a feature-regeneration module that recalibrates visual features according to the derived saliency maps, achieving more balanced, semantically coherent fusion across modalities and markedly stronger fine-grained representation learning.

链接: https://arxiv.org/abs/2604.25273
作者: Guosheng Zhang,Linkai Liu,Keyao Wang,Haixiao Yue,Zhiwen Tan,Xiao Tan
机构: Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model’s ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation–where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.

[CV-46] Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation

【速读】: This paper addresses the lack of supervision for emotional manipulation in speech-preserving facial expression manipulation (SPFEM), caused by the scarcity of paired data, i.e., aligned frames of the same person with identical speech but different expressions. The key to the proposed Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm is twofold: it conditions on individual visual information to learn personalized prompts that capture inter-individual expressive variation, establishing finer-grained visual-semantic correlations; and it aligns the visual and semantic feature distributions through feature differencing, matching the change in visual features to the change in semantic features to provide more precise supervision and bridge the inherent modality gap.

链接: https://arxiv.org/abs/2604.25255
作者: Tianshui Chen,Yujie Zhu,Jianman Lin,Zhijing Yang,Chunmei Qing,Feng Gao,Liang Lin
机构: Guangdong University of Technology (广东工业大学); South China University of Technology (华南理工大学); Peking University (北京大学); Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.

[CV-47] When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

【速读】: This paper addresses the difficulty of detecting document-image forgeries produced by generative AI, in particular the high-fidelity, imperceptible edits generated by OpenAI’s GPT-Image-2. The core challenge is that existing detection methods fail on such learned image edits, so neither human experts nor mainstream algorithms can reliably identify the forged content. The key to the solution is the construction of AIForge-Doc v2, a large-scale forged-document dataset with pixel-precise masks, and a systematic evaluation of four lines of defence: human inspectors (2AFC), the generic forensic tool TruFor, the document-specific detector DocTamper, and GPT-Image-2 itself as a zero-shot self-judge. The experiments show that all four degrade sharply, with the conventional detectors losing 0.27-0.36 AUC when switched to GPT-Image-2 inpainting, revealing a systematic blind spot of current detection techniques toward generative edits and pointing future work toward robustness against this new class of forgery.

链接: https://arxiv.org/abs/2604.25213
作者: Jiaqi Wu,Yuchen Zhou,Dennis Tsang Ng,Xingyu Shen,Kidus Zewde,Ankit Raj,Tommy Duong,Simiao Ren
机构: Google(谷歌); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:OpenAI’s GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site this http URL), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge – asked, to avoid the trivial “image is mostly real” reading, whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side-by-side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27-0.36 (0.962-0.599 TruFor; 0.852-0.585 DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.
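The AUC figures quoted throughout can be computed directly from detector scores via the Mann-Whitney form of ROC AUC; the score lists below are invented to illustrate a chance-level judge versus a weakly separating one, not the paper's data:

```python
def auc(pos_scores, neg_scores):
    """Probability that a random forgery scores above a random authentic
    image (Mann-Whitney formulation of ROC AUC); ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# A judge at chance, like the human inspectors here (~0.50), versus one
# with mild separation, like the forensic baselines (~0.6); scores invented.
chance = auc([0.4, 0.5, 0.6], [0.4, 0.5, 0.6])   # symmetric scores -> 0.5
weak   = auc([0.55, 0.7, 0.6], [0.4, 0.5, 0.6])  # forgeries score slightly higher
```

An AUC that never rises above 0.59 across prompt strategies, as reported for the self-judge, means GPT-Image-2 barely separates its own forgeries from authentic documents.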

[CV-48] Towards Seamless Lunar Mosaics: Deep Radiometric Normalization for Cross-Sensor Orbital Imagery Using Chandrayaan-2 TMC Data

【速读】:该论文旨在解决多任务月球影像拼接中因光照几何差异、传感器特性不一致及获取条件变化导致的辐射不一致性问题,从而生成无缝且光度一致的月球镶嵌图(lunar mosaic)。其核心解决方案是提出一种基于深度学习的辐射归一化框架,采用条件生成对抗网络(cGAN),其中生成器为U-Net结构,判别器为PatchGAN,通过学习从传统拼接影像到由LROC WAC数据构建的光度一致参考图像之间的非线性映射关系,实现高保真辐射校正。该方法结合基于图像块的训练策略与重叠感知推理机制,在保证大范围镶嵌图处理可扩展性的同时,有效维持了拼接边界处的结构连续性,显著优于传统直方图匹配等归一化技术。

链接: https://arxiv.org/abs/2604.25208
作者: Pratincha Singh,Jai Gopal Singla,Prashant Hemrajani,Nitant Dube,Amithabh,Hinal Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注:

点击查看摘要

Abstract:Radiometric inconsistencies remain a major challenge in generating seamless lunar mosaics from multi-mission orbital imagery due to variability in illumination geometry, sensor characteristics, and acquisition conditions. This paper presents a deep learning-based radiometric normalization framework for multi-mission lunar mosaics constructed primarily from ISRO’s Chandrayaan-2 Terrain Mapping Camera (TMC) data, supplemented with auxiliary imagery from the SELENE (Kaguya) mission. The proposed approach employs a conditional generative adversarial network (cGAN) comprising a U-Net-based generator and a PatchGAN discriminator to learn a nonlinear radiometric mapping from conventionally mosaicked lunar imagery to a photometrically consistent reference derived from LROC Wide Angle Camera (WAC) data. A patch-based training strategy with overlap-aware inference is adopted to enable scalable processing of large-area mosaics while preserving structural continuity across tile boundaries. Quantitative evaluation using Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Root Mean Square Error (RMSE) demonstrates consistent improvements over traditional histogram-based normalization techniques. The proposed framework achieves enhanced tonal uniformity, reduced seam artifacts, and improved structural coherence across multi-source lunar datasets. These results highlight the effectiveness of learning-based radiometric normalization for large-scale planetary mosaicking and demonstrate its potential for generating high-fidelity lunar surface maps from heterogeneous orbital imagery. 
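摘要中提到的“分块训练 + 重叠感知推理(overlap-aware inference)”思路,可以用下面的示意代码说明:将大图按重叠瓦片逐块处理,并用累计权重图在重叠区域做平均混合,从而抑制瓦片边界的接缝伪影。其中 `process` 仅为占位符(此处用恒等函数代替论文中训练好的 cGAN 生成器),函数名与参数均为示意性假设,并非论文官方实现。

```python
import numpy as np

def blend_tiles(image, tile=64, overlap=16, process=lambda t: t):
    """Process a 2D image in overlapping tiles and blend the results.

    Overlapping regions are averaged via an accumulated weight map,
    which suppresses visible seams at tile boundaries -- the same idea
    as the overlap-aware inference described in the abstract.
    `process` stands in for the trained generator (identity here).
    """
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            y2, x2 = min(y + tile, h), min(x + tile, w)
            out[y:y2, x:x2] += process(image[y:y2, x:x2])
            weight[y:y2, x:x2] += 1.0
    # Normalize by how many tiles covered each pixel.
    return out / np.maximum(weight, 1e-8)
```

当 `process` 为恒等函数时,混合结果与原图一致,可据此验证权重归一化的正确性。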

[CV-49] Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

【速读】:该论文旨在解决图像分类任务中细粒度特征提取与背景噪声抑制难以同时实现的问题,尤其针对传统卷积神经网络在多尺度上下文信息捕捉能力不足及易受无关区域干扰导致过拟合的局限性。其解决方案的关键在于提出RDCNet架构,通过三个协同创新模块实现:(1) 多分支随机膨胀卷积(Multi-Branch Random Dilated Convolution, MRDC)模块,利用不同膨胀率的并行分支结合随机掩码机制,在多尺度下增强对细粒度特征的感知能力并提升抗噪性和泛化性能;(2) 细粒度特征增强(Fine-Grained Feature Enhancement, FGFE)模块,通过自适应池化与双线性插值将全局上下文信息融入局部特征表示,强化对细微视觉模式的敏感性;(3) 上下文激励(Context Excitation, CE)模块,采用基于softmax的空间注意力与通道重校准机制,动态突出任务相关特征并抑制背景干扰。

链接: https://arxiv.org/abs/2604.25188
作者: Wentao Jiang,Yuanchan Xu,Heng Yuan
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image classification remains a fundamental yet challenging task in computer vision, particularly when fine-grained feature extraction and background noise suppression are required simultaneously. Conventional convolutional neural networks, despite their remarkable success in hierarchical feature learning, often struggle with capturing multi-scale contextual information and are susceptible to overfitting when confronted with noisy or irrelevant image regions. In this paper, we propose RDCNet (Image Classification Network with Random Dilated Convolution), a novel architecture built upon ResNet-34 that integrates three synergistic innovations to address these limitations: (1) a Multi-Branch Random Dilated Convolution (MRDC) module that employs parallel branches with varying dilation rates combined with a stochastic masking mechanism to capture fine-grained features across multiple scales while enhancing robustness against noise and overfitting; (2) a Fine-Grained Feature Enhancement (FGFE) module embedded within MRDC that bridges global contextual information with local feature representations through adaptive pooling and bilinear interpolation, thereby amplifying sensitivity to subtle visual patterns; and (3) a Context Excitation (CE) module that leverages softmax-based spatial attention and channel recalibration to dynamically emphasize task-relevant features while suppressing background interference. Extensive experiments conducted on five benchmark datasets – CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof – demonstrate that RDCNet consistently achieves state-of-the-art classification accuracy, outperforming the second-best competing methods by margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56%, respectively, thereby validating the effectiveness and generalizability of the proposed approach across diverse visual recognition scenarios.

[CV-50] FCMBench-Video: Benchmarking Document Video Intelligence

【速读】:该论文旨在解决金融信贷审核、开户及远程验证等场景中,对文档视频(document video)理解能力的评估难题,尤其关注决策准确性与证据可追溯性的双重需求。传统静态图像处理方法难以捕捉文档视频中的时序冗余信息、跨帧证据整合以及采集过程中的真实性线索,而现有模型缺乏在真实拍摄条件下系统性评估其文档感知、时间定位和证据驱动推理能力的基准。解决方案的关键在于构建FCMBench-Video——一个基于原子级文档片段采集与可控退化组合的可扩展视频基准,通过真实场景下的多文档长视频合成(覆盖28类文档、持续时间20–60秒)、专家标注的问答对(11,322条)及多样化任务设计(如计数、跨文档验证、视觉提示注入),实现了对视频多模态大语言模型(Video-MLLMs)性能的精细化量化与能力边界探测,从而为金融领域文档视频智能分析提供可复现、具区分度的评测标准。

链接: https://arxiv.org/abs/2604.25186
作者: Runze Cui,Fangxin Shang,Yehui Yang,Qing Yang,Tao Chen
机构: Qifu Technology (启富科技); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question–answer instances, covering 28 document types over 20s–60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.

[CV-51] Lightweight Real-Time Rendering Parameter Optimization via XGBoost-Driven Lookup Tables

【速读】:该论文旨在解决现代游戏与渲染引擎中如何在资源受限的移动设备上实现渲染质量与实时性能之间的理想平衡问题。现有自动渲染参数优化方法要么依赖耗时数天的场景级预计算,要么因神经网络推理开销过大而无法支持逐帧自适应,或缺乏跨异构硬件和多样化场景的泛化能力。解决方案的关键在于提出一种轻量级、通用的逐帧自适应渲染参数优化框架 LUT-Opt,其核心是将渲染时间与图像质量的联合优化分解为可处理的两阶段流程:离线阶段利用 XGBoost 回归器预测渲染时间和图像质量,并通过系统性离散化和两阶段线性搜索将模型压缩为紧凑的查找表(Lookup Table, LUT);运行时阶段每帧仅需亚毫秒级查询 LUT 即可完成参数自适应选择,显著降低计算开销并提升效率。实验表明,LUT-Opt 在 Unreal Engine 5 中对次表面散射(Subsurface Scattering, SSS)和混合管线环境光遮蔽(Hybrid-Pipeline Ambient Occlusion, AO)两种技术分别实现约 40% 和 70% 的渲染时间减少,同时图像质量误差增加不足 2%,且每帧推理延迟低于 0.1 ms。

链接: https://arxiv.org/abs/2604.25178
作者: Baijun Tan,Francesco Moretti
机构: School of Software, Polytechnic University of Turin (都灵理工大学软件学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving a desirable balance between rendering quality and real-time performance is a long-standing challenge in modern game and rendering engines, particularly on resource-constrained mobile devices such as laptops, tablets, and smartphones. Existing approaches to automatic rendering parameter optimization either depend on exhaustive per-scene pre-computation that spans several days, suffer from the prohibitive inference overhead of neural networks that prevents per-frame adaptation, or lack generalizability across heterogeneous hardware and diverse scenes. In this paper, we propose **LUT-Opt**, a lightweight, general-purpose framework for adaptive per-frame rendering parameter optimization. Our method decomposes the joint optimization of rendering time and image quality into a tractable two-stage pipeline. In the offline stage, we train a pair of XGBoost regressors to predict rendering time and image quality from rendering parameters, hardware state, and scene complexity descriptors. The trained ensemble models are then distilled into compact lookup tables (LUTs) through systematic discretization and a two-phase linear search that first constrains rendering time and subsequently maximizes structural similarity (SSIM). During runtime, the pre-computed LUT is queried every frame in sub-millisecond time, enabling truly adaptive parameter selection with negligible computational overhead. We validate LUT-Opt on two representative rendering techniques – subsurface scattering (SSS) and hybrid-pipeline ambient occlusion (AO) – implemented within Unreal Engine 5. Extensive experiments across multiple scenes and GPU configurations demonstrate that LUT-Opt reduces subsurface scattering rendering time by approximately 40% and ambient occlusion rendering time by roughly 70%, while incurring only about 2% increase in image quality error, with per-frame inference latency below 0.1 ms.
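LUT-Opt 的“两阶段查找表蒸馏”流程(先按渲染时间预算过滤参数组合、再在可行集合中最大化预测图像质量,运行时只需 O(1) 查表)可以用如下极简示意说明。`predict_time`/`predict_ssim` 代表论文中训练好的 XGBoost 回归器,这里用任意可调用对象代替;接口与变量名均为假设,并非论文实现。

```python
import itertools

def build_lut(param_grid, contexts, predict_time, predict_ssim, budget_ms):
    """Distill time/quality predictors into a lookup table (hypothetical interface).

    Phase 1 keeps only parameter combos whose predicted render time fits
    the budget; phase 2 picks the combo with the best predicted quality.
    `predict_time`/`predict_ssim` stand in for the paper's XGBoost
    regressors; any callables with signature (context, params) work.
    """
    keys = list(param_grid.keys())
    combos = list(itertools.product(*param_grid.values()))
    lut = {}
    for ctx in contexts:
        feasible = [c for c in combos
                    if predict_time(ctx, dict(zip(keys, c))) <= budget_ms]
        if not feasible:  # fall back to the fastest combo if nothing fits
            feasible = [min(combos,
                            key=lambda c: predict_time(ctx, dict(zip(keys, c))))]
        best = max(feasible, key=lambda c: predict_ssim(ctx, dict(zip(keys, c))))
        lut[ctx] = dict(zip(keys, best))
    return lut

# Per-frame use is then a constant-time dictionary lookup:
#   params = lut[(gpu_tier, scene_bucket)]
```

离线阶段一次性枚举离散化参数网格,运行时每帧只做一次字典查询,这正是摘要中“亚毫秒级逐帧自适应”的来源。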

[CV-52] Benchmarking OCR Pipelines with Adaptive Enhancement for Multi-Domain Retail Bill Digitization

【速读】:该论文旨在解决多领域零售账单(涵盖杂货店、餐厅、五金店、鞋店及服装店)在数字化过程中因扫描质量差异、版式异构性以及商业领域多样性导致的光学字符识别(OCR)准确率低的问题。其解决方案的关键在于构建一个智能且质量感知的自适应OCR流水线:首先通过基于卷积神经网络(CNN)的自监督去噪图像增强模块提升图像质量;其次引入基于拉普拉斯方差的三级路由图像质量分析器以动态选择处理策略;再结合置信度驱动的自适应反馈循环与迭代重试机制优化识别过程;最后利用自然语言处理(NLP)后处理层进行纠错,从而显著提升跨域账单文本提取的准确性与鲁棒性。

链接: https://arxiv.org/abs/2604.25176
作者: Vijaysinh Gaikwad
机构: JP Research India Pvt. Ltd.(JP研究印度私人有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The digitization of multi-domain retail billing documents remains a challenging task due to variability in scan quality, layout heterogeneity, and domain diversity across commercial sectors. This paper proposes and benchmarks an intelligent, quality-aware adaptive Optical Character Recognition (OCR) pipeline for retail bill digitization spanning five domains: grocery stores, restaurants, hardware shops, footwear outlets, and clothing retailers. The proposed system integrates a Convolutional Neural Network (CNN)-based image enhancement module trained via self-supervised denoising, a Laplacian variance-based image quality analyzer with three-tier routing, a confidence-driven adaptive feedback loop with iterative retry, and an NLP-based post-OCR correction layer. Experiments were conducted on a real-world dataset of 360 heterogeneous retail bill images. Ground truth for quantitative evaluation was generated using an OCR ensemble majority voting strategy, a validated approach for scenarios without manual annotation. The proposed pipeline achieves a Character Error Rate (CER) of 18.4% and Word Error Rate (WER) of 27.6%, representing improvements of 26.4% and 31.2% respectively over the Raw Tesseract baseline. The pipeline additionally achieves a text density of 108.3 words per image, a noise ratio of 2.3%, and a processing time of 3.64 seconds per image - a 6.4x speed advantage over EasyOCR. Image quality PSNR analysis on enhanced MEDIUM and LOW quality images yields an average of 28.7 dB, confirming meaningful enhancement. These results establish a reproducible benchmark for multi-domain retail bill OCR research.
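摘要中基于拉普拉斯方差(Laplacian variance)的三级质量路由是一种经典的清晰度度量:对灰度图做 3×3 拉普拉斯卷积并取响应的方差,方差越低说明图像越模糊。下面给出一个仅依赖 NumPy 的示意实现;两个路由阈值为示例值,并非论文中的实际参数。

```python
import numpy as np

# Standard 3x3 Laplacian kernel.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float64)

def laplacian_variance(gray):
    """Variance of the Laplacian response: a standard focus/sharpness score."""
    h, w = gray.shape
    resp = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            resp += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return resp.var()

def route_quality(gray, hi=150.0, lo=50.0):
    """Three-tier routing on blur (thresholds are illustrative, not the paper's)."""
    v = laplacian_variance(gray.astype(np.float64))
    if v >= hi:
        return "HIGH"    # run OCR directly
    if v >= lo:
        return "MEDIUM"  # enhance first, then OCR
    return "LOW"         # aggressive enhancement + adaptive retry loop
```

路由结果决定图像是否先经过 CNN 增强模块,再进入置信度驱动的重试循环。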

[CV-53] IAM: Identity-Aware Human Motion and Shape Joint Generation

【速读】:该论文旨在解决现有文本驱动人体动作生成模型在忽略身体形态(body morphology)影响时导致的动作物理不一致性问题。当前方法通常假设身份中立的运动,使用标准人体表示生成动作,但实际中身体比例、质量分布和年龄等形态特征显著影响动作执行方式。解决方案的关键在于提出一种身份感知的运动生成框架,通过多模态信号(如自然语言描述和视觉线索)隐式表征身份,并引入联合运动-形状生成范式,同步合成动作序列与身体形状参数,使身份信息能够直接调控运动动力学,从而提升动作的真实性和身份一致性。

链接: https://arxiv.org/abs/2604.25164
作者: Wenqi Jia,Zekun Li,Abhay Mittal,Chengcheng Tang,Chuan Guo,Lezi Wang,James Matthew Rehg,Lingling Tao,Size An
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: this https URL

[CV-54] 8DNA: 8D Neural Asset Light Transport by Distribution Learning

【速读】:该论文旨在解决高保真3D资产在全局光照(global illumination)效果模拟中的计算成本过高问题,特别是涉及长程散射路径(如次表面散射、镜面互反射和细尺度纤维散射)时的渲染效率瓶颈。解决方案的关键在于提出8D神经资产(8D Neural Assets, 8DNA),通过学习完整的8D光传输函数(包含光源位置、观察方向及表面点的几何信息),将复杂的光传输效应预先烘焙为神经表示,从而支持近场照明下的精确渲染。与以往依赖远场光照并预计算6D光传输的方法不同,8DNA实现了更全面的光传输建模,并采用分布学习(distribution-learning)框架从正向路径追踪样本中训练,相比传统回归方法,在有限训练预算下显著降低优化方差,同时实现更快的推理速度和高质量渲染结果。

链接: https://arxiv.org/abs/2604.25129
作者: Liwen Wu,Haolin Lu,Bing Xu,Miloš Hašan,Ravi Ramamoorthi
机构: University of California San Diego (加州大学圣地亚哥分校); NVIDIA (英伟达)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-fidelity 3D assets exhibit intriguing global illumination effects like subsurface scattering, glossy interreflections, and fine-scale fiber scatterings, which often involve long scattering paths that are expensive to simulate. We introduce 8D neural assets (8DNA) to pre-bake these light transport effects into neural representations. Unlike prior methods that assume far-field lighting and precompute light transport into 6D functions, 8DNA learns the full 8D light transport, enabling accurate rendering under near-field illumination. Our training leverages a distribution-learning formulation that learns light transport from forward path-traced samples, which produces less optimization variance with lower training budget than the prior regression-based approaches. Experiments show our 8DNA rendering closely matches path-traced results under various scene configurations, yet it achieves improved variance reduction and fast inference speeds on challenging assets.

[CV-55] ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent

【速读】:该论文旨在解决扩散模型生成图像后进行局部编辑时面临的挑战:如何在保持全局结构一致性的前提下实现高精度、灵活的区域修改。现有基于反演(inversion)的方法(如DDIM反演)往往无法获得高质量的初始潜在表示(latent representation),导致编辑保真度下降和结构不一致。其核心问题在于缺乏一个既保留场景语义结构又能支持精细控制的“编辑锚点”。解决方案的关键是提出ResetEdit框架,该框架通过在生成过程中主动嵌入可恢复的潜在信息——即把干净潜变量与扩散潜变量之间的差异注入扩散轨迹,并在反演时提取该差异以重建接近原始状态的“可重置潜变量”(resettable latent),同时引入轻量级优化模块校正由变分自编码器(VAE)不对称性引起的重建偏差。此方法无需存储每张图像的原始潜变量即可实现高保真编辑,且兼容无需微调的编辑策略,在可控性和视觉质量上显著优于当前最优基线。

链接: https://arxiv.org/abs/2604.25128
作者: Hanyi Wang,Han Fang,Zheng Wang,Shilin Wang,Ee-Chien Chang
机构: Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); Wuhan University (武汉大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have enabled high-quality image generation, leading to increasing demand for post-generation editing that modifies local regions while preserving global structure. Achieving such flexible and precise editing requires a high-quality starting point, a latent representation that provides both the freedom needed for diverse modifications and the precision required for fine-grained, region-specific control. However, existing inversion-based approaches such as DDIM inversion often yield unsatisfactory starting latents, resulting in degraded edit fidelity and structural inconsistency. Ideally, the most suitable editing anchor should be the original latent used during the generation process, as it inherently captures the scene’s structure and semantics. Yet, storing this latent for every generated image is impractical due to massive storage and retrieval costs. To address this challenge, we propose ResetEdit, a proactive diffusion editing framework that embeds recoverable latent information directly into the generation process. By injecting the discrepancy between the clean and diffused latents into the diffusion trajectory and extracting it during inversion, ResetEdit reconstructs a resettable latent that closely approximates the true starting state. Additionally, a lightweight latent optimization module compensates for reconstruction bias caused by VAE asymmetry. Built upon Stable Diffusion, ResetEdit integrates seamlessly with existing tuning-free editing methods and consistently outperforms state-of-the-art baselines in both controllability and visual fidelity.

[CV-56] M3-VQA: A Benchmark for Multimodal Multi-Entity Multi-Hop Visual Question Answering

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度多模态实体理解与复杂多跳推理能力评估上的不足问题。现有VQA数据集多聚焦于粗粒度类别和单一实体的简单推理,难以全面检验模型对跨视觉与文本来源的多个实体进行复杂推理的能力。解决方案的关键在于提出M³-VQA这一新型知识驱动的视觉问答基准,其核心特征包括:引入涉及多个异构实体的多样化多实体问题、支持可追溯的详细证据链、构建结构化的多模态知识库,并通过三种评估设置(无外部知识、黄金证据、检索增强输入)系统性地测试模型的知识获取与推理性能。实验表明,模型在缺乏外部信息时表现不佳,但借助精确证据显著提升,且基于推理意识的代理式检索优于启发式方法,凸显了结构化推理对复杂多模态理解的重要性。

链接: https://arxiv.org/abs/2604.25122
作者: Jiatong Ma,Longteng Guo,Yuchen Liu,Zijia Zhao,Dongze Hao,Xuanxu Lin,Jing Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present M³-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M³-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M³-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at this https URL.

[CV-57] One Perturbation Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于字体提示注入(typographic prompt injection)的安全威胁问题,即攻击者通过在图像中渲染特定文本,诱导视觉语言模型(VLMs)忽略其安全对齐机制并执行恶意指令。现有研究多聚焦于提升攻击成功率(ASR),但缺乏对为何某些文本渲染方式能绕过安全约束的解释。论文的关键解决方案是提出一个可解释且模型无关的代理指标——多模态嵌入距离(multimodal embedding distance),发现其与ASR呈强负相关(r = -0.71 至 -0.93, p < 0.01),从而揭示了攻击效果的核心机制。在此基础上,作者进一步利用该指标作为红队工具,通过CWA-SSA优化方法在受限ℓ∞扰动下最大化图像文本嵌入相似性,有效恢复感知可读性并降低安全对齐拒绝率,实现对目标模型安全性的压力测试,且无需访问目标模型本身。

链接: https://arxiv.org/abs/2604.25102
作者: Ravikumar Balakrishnan,Sanket Mendapara
机构: Cisco Systems (思科系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Typographic prompt injection exploits vision language models’ (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain *why* certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR (r = -0.71 to -0.93, p < 0.01), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image–text embedding similarity under bounded ℓ∞ perturbations via CWA-SSA across four surrogate embedding models, stress testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model’s safety filter strength and the degree of visual degradation.
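该论文的核心发现是图文嵌入距离与攻击成功率(ASR)呈强负相关。下面的示意代码演示这一分析流程:用余弦距离衡量嵌入相似度,再用 Pearson 相关系数量化其与 ASR 的关系。其中的距离和 ASR 数值为虚构示例,嵌入向量本身需由 CLIP 类编码器产生(此处未展示),并非论文数据。

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# For each attack rendering: the distance between the image embedding and
# the embedding of the injected text, paired with its measured attack
# success rate. A strongly negative r reproduces the paper's finding that
# closer embeddings predict more successful attacks. Numbers are made up.
distances = [0.10, 0.25, 0.40, 0.55, 0.70]
asr       = [0.90, 0.70, 0.45, 0.30, 0.10]
r = pearson_r(distances, asr)
```

这也解释了论文的红队思路:在有界扰动下最大化图文嵌入相似度(即最小化该距离),即可在不访问目标模型的情况下提升攻击效果。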

[CV-58] Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

【速读】:该论文旨在解决当前统一多模态模型(Unified Multimodal Models, uMMs)在视觉理解与视觉生成能力之间缺乏语义一致性评估的问题。现有评测协议独立衡量这两项能力,未能检验其是否在语义层面保持一致,导致无法判断模型是否真正学习到跨任务统一的表示。解决方案的关键在于提出XTC-Bench框架,该框架基于场景图(scene graph)构建跨任务评估体系,通过从结构化场景图中提取生成提示和理解查询,实现对物体、属性及关系等原子事实层面的语义对齐分析;并引入连续跨任务一致性(Continuous Cross-Task Agreement, CCTA)指标,量化生成与理解任务在匹配原子事实上的语义一致性,从而将内部一致性与单任务准确率解耦。实验表明,高单项性能并不意味着强跨任务一致性,且一致性主要取决于多模态学习目标的耦合紧密度,而非架构统一本身。

链接: https://arxiv.org/abs/2604.25072
作者: Weixing Wang,Liudvikas Zekas,Anton Hackl,Constantin Alexander Auga,Parisa Shahabinejad,Jona Otholt,Antonio Rueda-Toicen,Gerard de Melo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified models reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.
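论文提出的 CCTA 指标在“原子事实”(物体、属性、关系)层面衡量生成与理解的一致性。其确切公式以论文为准;下面仅给出一个体现该思想的玩具版本:对每条匹配事实取两任务得分差的补值再求平均,仅用于说明“事实级对齐”的概念,并非论文公布的定义。

```python
def cross_task_agreement(facts):
    """Toy continuous agreement over matched atomic facts.

    `facts` maps a fact id (object/attribute/relation) to a pair of
    per-task scores in [0, 1]: how well generation realized the fact,
    and how well understanding answered the query about it. Agreement is
    the mean of 1 - |gen - und| across facts -- an illustration of
    fact-level alignment, not the paper's published CCTA formula.
    """
    if not facts:
        return 0.0
    return sum(1.0 - abs(g - u) for g, u in facts.values()) / len(facts)
```

在该玩具指标下,两任务得分完全一致(即使同为失败)时一致性为 1,得分完全相反时为 0,从而将“内部一致性”与单任务准确率解耦。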

[CV-59] Scalable Secure Biometric Authentication without Auxiliary Identifiers

【速读】:该论文旨在解决大规模云环境下生物特征认证系统在数据泄露风险下的安全性问题,即现有系统要么无法有效防范数据库被攻破导致的用户敏感生物特征信息泄露,要么因计算开销过高而不具备实际部署可行性。其解决方案的关键在于将人工智能(AI)与先进的密码学技术以新颖方式融合,从而在不依赖辅助标识符的前提下实现可扩展、高性能且具有可证明安全性的隐私保护生物特征认证机制,首次验证了现实世界中此类系统的可行性。

链接: https://arxiv.org/abs/2604.25071
作者: Alexander Bienstock,Daniel Escudero,Antigoni Polychroniadou,Zhen Zeng,Pranav Bhat,Ashok Singal,Prashant Sharma,Manuela Veloso
机构: JPMorganChase(摩根大通); TACEO; Carnegie Mellon University (卡内基梅隆大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The prevalence of biometric authentication has been on the rise due to its ease of use and elimination of weak passwords. To date, most biometric authentication systems have been designed for on-device authentication of the device owner (e.g., smartphones and laptops). Recently, biometric authentication systems have started to emerge that are designed to authenticate users against cloud databases storing representations of biometrics for large numbers of users (potentially millions), such as those facilitating biometric payments. However, the use of a large cloud database introduces a significant attack vector, as a breach of the database could lead to the compromise of all enrolled users’ sensitive biometric data. Indeed, all such existing systems either do not adequately protect against such a breach, or are impractical to deploy and use due to their high computational overhead. In this work, we present a new biometric authentication system that provides provable security guarantees against data breaches, while remaining scalable and performant. To do so, we marry artificial intelligence with advanced cryptographic techniques in a novel fashion, providing several optimizations along the way. Our work is the first to show that real-world scalable privacy-preserving biometric authentication without auxiliary identifiers is feasible, and we believe that it will spur widespread industrial adoption and further research in this area.

[CV-60] ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching

【速读】:该论文旨在解决当前深度神经网络在物体识别(Object Recognition, OR)中对非形状线索(如纹理和背景)过度依赖的问题,从而导致其在视角变化和外观扰动下泛化能力差、鲁棒性不足的缺陷。解决方案的关键在于提出一个名为ShapeY的全新基准测试框架,该框架通过构建包含200个3D物体、68,200张多视角灰度图像并可引入非形状外观扰动的数据集,结合最近邻匹配任务,系统评估OR模型嵌入空间中对象视图是否按3D形状相似性聚类。这一设计使ShapeY能够定量与定性地揭示模型在不同视角和外观变化下的形状理解能力,从而为推动人工视觉系统向人类水平的形状识别能力迈进提供了一个原则性的评估工具。

链接: https://arxiv.org/abs/2604.25065
作者: Jong Woo Nam,Amanda S. Rios,Bartlett W. Mel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object recognition (OR) in humans relies heavily on shape cues and the ability to recognize objects across varying 3D viewpoints. Unlike humans, deep networks often rely on non-shape cues such as texture and background, leading to vulnerabilities in generalization and robustness. To address this gap, we introduce ShapeY, a novel and principled benchmarking framework designed to evaluate shape-based recognition capability in OR systems. ShapeY comprises 68,200 grayscale images of 200 3D objects rendered from multiple viewpoints and optionally subjected to non-shape “appearance” changes. Using a nearest-neighbor matching task, ShapeY specifically probes the fine-grained structure of an OR system’s embedding space by evaluating whether object views are clustered by 3D shape similarity across varying 3D viewpoints and other non-shape changes. ShapeY provides a suite of quantitative and qualitative performance readouts, including error rate graphs, viewpoint tuning curves, histograms of positive and negative matching scores, and grids showing ordered best matches, which together offer a comprehensive evaluation of an OR system’s shape understanding capability. Testing of 321 pre-trained networks with diverse architectures reveals significant challenges in achieving robust shape-based recognition: even state-of-the-art models struggle to generalize consistently across 3D viewpoint and appearance changes, and are prone to infrequent but egregious matches of objects of obviously completely different shape. ShapeY establishes a principled framework for advancing artificial vision systems toward human-like shape recognition capabilities, emphasizing the importance of disentangled and invariant object encodings.
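ShapeY 的最近邻匹配任务核心逻辑可以用如下示意说明:对各视图的嵌入做归一化后计算余弦相似度,排除自身,统计最近邻是否来自同一 3D 物体。该示意省略了论文中的排除半径与外观扰动控制,仅展示基本的匹配评估流程。

```python
import numpy as np

def nn_match_accuracy(embeddings, object_ids):
    """Top-1 nearest-neighbor matching accuracy in an embedding space.

    For each view, find its closest other view by cosine similarity and
    count a hit when both views come from the same 3D object -- the core
    of the benchmark's matching task, minus its exclusion-radius and
    appearance-change controls.
    """
    E = np.asarray(embeddings, dtype=np.float64)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E @ E.T
    np.fill_diagonal(sim, -np.inf)   # never match a view to itself
    nearest = sim.argmax(axis=1)
    ids = np.asarray(object_ids)
    return float((ids[nearest] == ids).mean())
```

若同一物体的不同视图在嵌入空间中按 3D 形状聚类,该准确率会接近 1;依赖纹理等非形状线索的模型则会在此类评估中暴露出严重的错配。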

[CV-61] BifDet: A 3D Bifurcation Detection Dataset for Airway-Tree Modeling

【速读】:该论文旨在解决胸腔计算机断层扫描(CT)中气道分叉点(airway bifurcation)自动检测与分割工具开发因缺乏标注数据而受限的问题。其关键解决方案是提出了BifDet,首个公开可用的专注于3D气道分叉点检测的数据集,包含来自ATM22开放队列的CT扫描图像,并对每个分叉点的父支和子支进行边界框标注。此外,作者通过在该数据集上微调并评估RetinaNet和DETR模型,验证了BifDet在实际应用中的有效性,为后续研究提供了基准和可复现的处理流程。

链接: https://arxiv.org/abs/2604.24999
作者: Ali Keshavarzi,Quentin Bouniot,Benjamin M. Smith,Elsa Angelini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This manuscript is currently in preparation for submission

点击查看摘要

Abstract:Thoracic Computed Tomography (CT) scans offer detailed insights into the intricate branching network of the airway tree, which is essential for understanding various respiratory diseases. Airway bifurcations, where airway branches split, are crucial landmarks for understanding lung physiology, disease mechanisms and lesion localization. Despite the significance of bifurcation analysis, a notable lack of datasets annotated for this task hinders the development of advanced automated specialized detection or segmentation tools. In this paper, we introduce BifDet, the first publicly-available dataset specialized for 3D airway bifurcation detection, filling a critical gap in existing resources. Our dataset comprises carefully annotated CT scans from the ATM22 open-access cohort with bifurcation bounding boxes covering the parent and daughter branches. As a use-case for demonstrating the potential of BifDet, we fine-tune and evaluate RetinaNet and DETR for 3D airway bifurcations detection on CT scans. We provide detailed pipelines, including preprocessing steps and specific implementation design choices. Results are detailed over various categories of minimal bounding box sizes to serve as baseline to benchmark future research.

[CV-62] DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

【速读】:该论文旨在解决开放词汇语义分割(open-vocabulary semantic segmentation)中训练-free方法存在的两个核心问题:一是单一推理机制导致局部token可靠性不足,二是空间一致性(spatial coherence)缺失。解决方案的关键在于提出一种无需训练的双分支CLIP框架DouC,其通过两个互补分支实现改进:OG-CLIP利用轻量级推理时token门控机制提升patch级可靠性,FADE-CLIP则借助冻结视觉基础模型引导的代理注意力注入外部结构先验,从而增强空间感知能力;最终在logit层面融合两分支输出,使局部可靠性与结构感知的patch交互共同作用于预测结果,且支持可选的实例级后处理校正。该方法不引入额外可学习参数,保持CLIP的零样本泛化能力,并在多个基准和CLIP骨干网络上实现显著性能提升。

链接: https://arxiv.org/abs/2604.24997
作者: Mohamad Zamini,Diksha Shukla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP’s zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.
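
下面用一个极简的 NumPy 片段示意"logit 层融合"这一步:对两个分支的逐像素类别 logit 做凸组合后再取 argmax。其中融合权重 `alpha` 与具体数值均为假设的玩具设定,仅用于说明融合机制,并非论文的实际实现。

```python
import numpy as np

def fuse_logits(logits_a, logits_b, alpha=0.5):
    """Logit-level fusion sketch: a convex combination so that both
    branches (e.g. token-reliability cues and structure-aware cues)
    shape the final per-pixel class scores. alpha is illustrative."""
    return alpha * logits_a + (1.0 - alpha) * logits_b

a = np.array([[2.0, 0.0], [0.0, 1.0]])  # toy per-pixel class logits, branch A
b = np.array([[0.0, 2.0], [1.0, 0.0]])  # toy logits, branch B
fused = fuse_logits(a, b, alpha=0.7)
pred = fused.argmax(axis=-1)            # final per-pixel class indices
```

当 `alpha` 偏向某一分支时,该分支的判断在分歧像素上占主导;这正是摘要所述"局部可靠性与结构感知交互共同影响预测"的直观体现。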

[CV-63] Power Foam: Unifying Real-Time Differentiable Ray Tracing and Rasterization

【速读】:该论文旨在解决现有3D表示方法在兼顾光线追踪效率与光栅化性能方面的难题:传统基于泡沫(foam)的表示虽支持常数时间的光线遍历,但其潜在无界的单元结构不利于高效的基于瓦片(tile-based)光栅化;而当前主流的3D高斯散射(3DGS)等方法虽具备良好光栅化性能,却难以实现高效光线追踪。解决方案的关键在于提出一种可微分的3D表示——通过将Voronoi泡沫推广为具有可控单元范围的有界幂图(bounded power diagrams),在不依赖训练过程中昂贵的Delaunay三角剖分的前提下,实现了空间上受限的几何原语;同时引入面向表面的表述(oriented surface formulation),显式建模内外区域边界,并将可微纹理嵌入到这些表面上,从而解耦几何与外观。这一设计使得模型在保持先进光线追踪效率的同时,达到与最新3DGS相当的光栅化性能,为统一实时可微渲染提供了可行路径。

链接: https://arxiv.org/abs/2604.24994
作者: Shrisudhan Govindarajan,Daniel Rebain,Dor Verbin,Kwang Moo Yi,Anish Prabhu,Andrea Tagliasacchi
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a differentiable 3D representation that unifies the ray tracing capabilities of foam-based ray tracing with the efficiency of modern rasterization pipelines. While prior foam representations enable constant-time ray traversal through an explicit volumetric partition of space, their potentially unbounded cells hinder efficient tile-based rasterization. We address this limitation by generalizing Voronoi foams to bounded power diagrams with controllable cell extents, enabling spatially bounded primitives without requiring expensive Delaunay triangulations during training. We further introduce an oriented surface formulation that explicitly models interfaces between interior and exterior regions, and decouple geometry from appearance by embedding differentiable texture directly on these surfaces. Together, these contributions yield a representation that preserves state-of-the-art ray tracing efficiency while achieving rasterization performance competitive with current generation 3DGS, providing a practical path toward unified real-time differentiable rendering.
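
幂图(power diagram)的单元归属规则可以用如下 NumPy 片段粗略示意:每个查询点被分配给使幂距离 ||x - c_i||^2 - w_i 最小的站点,权重 w_i 越大,对应单元越大;再用一个假设的 `max_extent` 参数模拟"可控单元范围"的有界效果。此为概念演示,并非论文的可微实现。

```python
import numpy as np

def power_cell_assign(points, sites, weights, max_extent=None):
    """Assign each query point to the site minimizing the power distance
    ||x - c_i||^2 - w_i. With max_extent set (an illustrative stand-in
    for bounded cell extents), points farther than that Euclidean
    distance from their winning site fall outside every cell (-1)."""
    d2 = ((points[:, None, :] - sites[None, :, :]) ** 2).sum(-1)  # (P, S)
    power = d2 - weights[None, :]
    idx = power.argmin(axis=1)
    if max_extent is not None:
        outside = np.sqrt(d2[np.arange(len(points)), idx]) > max_extent
        idx = np.where(outside, -1, idx)
    return idx

sites = np.array([[0.0, 0.0], [2.0, 0.0]])
weights = np.array([1.0, 0.0])  # the larger weight enlarges site 0's cell
pts = np.array([[0.9, 0.0], [1.1, 0.0], [10.0, 0.0]])
labels = power_cell_assign(pts, sites, weights, max_extent=3.0)
```

注意 (1.1, 0) 在欧氏距离上更靠近站点 1,但因站点 0 的权重更大而被划入其幂图单元;这体现了权重对单元边界的平移作用。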

[CV-64] A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

【速读】:该论文旨在解决传统复杂系统建模方法(如微分方程)在表达自组织和涌现行为方面的局限性,以及经典细胞自动机(Cellular Automata, CA)因规则固定难以适应真实数据驱动建模的问题。其解决方案的关键在于引入神经细胞自动机(Neural Cellular Automata, NCA),通过将Wolfram提出的简单递归更新规则与可学习的人工神经网络相结合,使NCA能够从数据样本中自动学习复杂的局部更新规则,从而有效模拟具有自组织特性的生成系统。论文进一步提出了一种统一的模块化框架、标准化符号表示及开源实现(NCAtorch),为该领域研究提供理论基础与实践工具。

链接: https://arxiv.org/abs/2604.24990
作者: Martin Spitznagel,Janis Keuper
机构: Offenburg University (奥芬堡大学); University of Mannheim (曼海姆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stephen Wolfram proclaimed in his 2003 seminal work “A New Kind Of Science” that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems. Over two decades later, while Cellular Automata have still been waiting for a substantial breakthrough in scientific applications, recent research showed new and promising approaches which combine Wolfram’s ideas with learnable Artificial Neural Networks: So-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems. The aim of this paper is to review the existing work on NCA and provide a unified modular framework and notation, as well as a reference implementation in the open-source library NCAtorch.
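
NCA 的核心更新规则可用如下最小化示意理解:每个细胞"感知"其 3x3 邻域,再由一个在所有细胞间共享的小型 MLP 输出残差式状态更新。权重为随机初始化、仅作结构演示,并非 NCAtorch 的实际实现。

```python
import numpy as np

rng = np.random.default_rng(0)

def perceive(state):
    """Gather each cell's 3x3 neighborhood (wrap-around boundary) into
    its perception vector -- the learnable analogue of a classic CA's
    fixed-rule neighborhood lookup."""
    shifts = [np.roll(np.roll(state, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.concatenate(shifts, axis=-1)  # (H, W, 9*C)

def nca_step(state, w1, w2):
    """One NCA update: a shared two-layer MLP maps each cell's
    perception to a residual state change, applied identically at
    every cell (toy random weights, illustration only)."""
    p = perceive(state)
    hidden = np.maximum(p @ w1, 0.0)  # ReLU
    return state + hidden @ w2        # residual update rule

C, H = 4, 8
state = rng.normal(size=(8, 8, C))
w1 = rng.normal(scale=0.1, size=(9 * C, H))
w2 = rng.normal(scale=0.1, size=(H, C))
out = nca_step(state, w1, w2)
```

训练时即是对 w1、w2 反向传播,使多步迭代后的状态收敛到目标图样,这正是"从数据样本学习更新规则"的含义。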

[CV-65] Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

【速读】:该论文旨在解决多模态大模型在实际应用中面临的效率与性能平衡问题,尤其是如何在保持高精度的同时显著降低推理延迟并提升吞吐量。解决方案的关键在于:首先,基于高效的Nemotron 3 Nano 30B-A3B架构设计,结合创新的多模态token压缩技术(multimodal token-reduction techniques),有效减少了冗余信息处理;其次,通过优化训练数据和训练策略(training data and recipes),实现了跨文本、图像、视频及音频等多种模态的一致性性能提升,尤其在真实文档理解、长时音频-视频理解以及代理式计算机操作任务上达到领先水平。

链接: https://arxiv.org/abs/2604.24954
作者: NVIDIA:Amala Sanjay Deshmukh,Kateryna Chumachenko,Tuomas Rintamaki,Matthieu Le,Tyler Poon,Danial Mohseni Taheri,Ilia Karmanov,Guilin Liu,Jarno Seppanen,Arushi Goel,Mike Ranzinger,Greg Heinrich,Guo Chen,Lukas Voegtle,Philipp Fischer,Timo Roman,Karan Sapra,Collin McCarthy,Shaokun Zhang,Fuxiao Liu,Hanrong Ye,Yi Dong,Mingjie Liu,Yifan Peng,Piotr Zelasko,Zhehuai Chen,Nithin Rao Koluguri,Nune Tadevosyan,Lilit Grigoryan,Ehsan Hosseini Asl,Pritam Biswas,Leili Tavabi,Yuanhang Su,Zhiding Yu,Peter Jin,Alexandre Milesi,Netanel Haber,Yao Xu,Sarah Amiraslani,Nabin Mulepati,Eric Tramel,Jaehun Jung,Ximing Lu,Brandon Cui,Jin Xu,Zhiqi Li,Shihao Wang,Yuanguo Kuang,Shaokun Zhang,Huck Yang,Boyi Li,Hongxu Yin,Song Han,Pavlo Molchanov,Adi Renduchintala,Charles Wang,David Mosallanezhad,Soumye Singhal,Luis Vega,Katherine Cheung,Sreyan Ghosh,Yian Zhang,Alexander Bukharin,Venkat Srinivasan,Johnny Greco,Andre Manoel,Maarten Van Segbroeck,Suseella Panguliri,Rohit Watve,Divyanshu Kakwani,Shubham Pachori,Jeffrey Glick,Radha Sri-Tharan,Aileen Zaman,Khanh Nguyen,Shi Chen,Jiaheng Fang,Qing Miao,Wenfei Zhou,Yu Wang,Zaid Pervaiz Bhat,Varun Praveen,Arihant Jain,Ramanathan Arunachalam,Tomasz Kornuta,Ashton Sharabiani,Amy Shen,Wei Huang,Yi-Fu Wu,Ali Roshan Ghias,Huiying Li,Brian Yu,Nima Tajbakhsh,Chen Cui,Wenwen Gao,Li Ding,Terry Kong,Manoj Kilaru,Anahita Bhiwandiwalla
机构: NVIDIA(英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.

[CV-66] ViPO: Visual Preference Optimization at Scale

【速读】:该论文旨在解决视觉生成模型中偏好优化(preference optimization)的可扩展性问题,特别是现有开源偏好数据集存在冲突的偏好模式、低分辨率、提示多样性不足及分布不均衡等瓶颈,导致传统方法在噪声数据上难以有效学习。其解决方案的关键在于提出Poly-DPO算法与大规模高质量偏好数据集ViPO的协同设计:Poly-DPO通过引入一个动态调整模型置信度的多项式项扩展DPO目标函数,增强对不同数据分布的鲁棒性;而ViPO则构建了包含100万张图像对(1024px)和30万段视频对(720p+)的多类别高质数据集,确保偏好信号可靠且分布平衡。实验表明,在高质量数据下Poly-DPO退化为标准DPO,验证了其自适应特性与数据质量的重要性,同时在噪声数据如Pick-a-Pic V2上显著优于Diffusion-DPO,证明该方案在算法灵活性与数据质量双重维度上的有效性。

链接: https://arxiv.org/abs/2604.24953
作者: Ming Li,Jie Wu,Justin Cui,Xiaojie Li,Rui Wang,Chen Chen
机构: University of Central Florida (中佛罗里达大学); ByteDance Seed (字节跳动种子); UCLA (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO’s adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
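
摘要并未给出 Poly-DPO 附加多项式项的具体形式,下面仅以标准 DPO 的 logistic 损失为基础,加上一个假设的多项式正则项作概念示意:`alpha=0` 时退化为普通 DPO,与摘要"高质量数据下最优配置收敛到标准 DPO"的描述方向一致。各参数均为演示用途。

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def poly_dpo_loss(margin, beta=1.0, alpha=0.0, p=2):
    """Standard DPO term -log sigmoid(beta * margin) plus a
    *hypothetical* polynomial term alpha * |margin|^p that tempers
    model confidence on large margins; alpha=0 recovers vanilla DPO.
    The exact Poly-DPO formulation is not specified in the abstract."""
    return -np.log(sigmoid(beta * margin)) + alpha * np.abs(margin) ** p

m = np.array([-1.0, 0.0, 2.0])        # implicit reward margins (winner - loser)
vanilla = poly_dpo_loss(m)            # alpha = 0 -> plain DPO
tempered = poly_dpo_loss(m, alpha=0.1)  # extra penalty on over-confident pairs
```

在噪声偏好数据上,抑制过大 margin 的置信度有助于避免冲突梯度信号主导优化,这与论文的动机相符。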

[CV-67] Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

【速读】:该论文旨在解决现有图像偏好数据集因仅提供单一整体标注而导致的标签噪声问题,这种噪声源于人类视觉偏好在美学、细节保真度和语义一致性等多个维度上的复杂性,而简单地将多维偏好压缩为二元胜负标签会引发冲突的梯度信号,从而误导扩散模型的直接偏好优化(Diffusion Direct Preference Optimization, DPO)。解决方案的关键在于提出一种半监督方法——Semi-DPO,其核心思想是将一致样本对视为干净标签数据,冲突样本对视为噪声未标记数据;首先在共识过滤后的干净子集上训练初始模型,再利用该模型作为隐式分类器为噪声数据生成伪标签,并通过迭代精炼实现性能提升。此方法无需额外人工标注或显式奖励模型即可显著改善与复杂人类偏好的对齐效果。

链接: https://arxiv.org/abs/2604.24952
作者: Xinxin Liu,Ming Li,Zonglin Lyu,Yuzhang Shang,Chen Chen
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: this https URL
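
Semi-DPO 的"共识过滤 + 伪标签"流程可用如下玩具示意说明:各维度投票一致的样本对视为干净标签,冲突的样本对交由(此处用打分函数代替的)共识训练模型生成伪标签。投票矩阵与打分函数均为假设设定,仅演示数据划分逻辑。

```python
import numpy as np

def split_by_consensus(votes):
    """votes: (N, D) per-dimension preferences, +1 if image A wins a
    dimension, -1 if image B wins. A pair is 'clean' only when every
    dimension agrees; the rest are treated as noisy/unlabeled."""
    return np.abs(votes.sum(axis=1)) == votes.shape[1]

def pseudo_label(scores_a, scores_b):
    """Stand-in for the consensus-trained model acting as an implicit
    classifier: label each conflicting pair by the higher-scored sample."""
    return np.where(scores_a >= scores_b, 1, -1)

votes = np.array([[1, 1, 1],     # clean: A wins every dimension
                  [1, -1, 1],    # conflicting -> noisy/unlabeled
                  [-1, -1, -1]]) # clean: B wins every dimension
clean_mask = split_by_consensus(votes)
labels = pseudo_label(np.array([0.9]), np.array([0.4]))
```

实际方法会在干净子集上先训练,再用该模型为冲突对打伪标签并迭代精炼;上面只展示了划分与打标这两个关键步骤的骨架。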

[CV-68] Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

【速读】:该论文旨在解决移动视频在不同屏幕分辨率和方向模式下进行画幅调整时所面临的挑战,传统静态裁剪或填充边框会损害视觉质量,而变形则可能扭曲视频语义。其核心解决方案是采用时间一致性的动态裁剪策略,在保证内容完整性的同时最小化形变,从而提升重制后视频的质量与意义保留度。关键创新在于构建了目前最大规模的主观视频人物区域裁剪数据库——LIVE-YouTube Video Cropping (LIVE-YT VC),包含1800段标注视频,并引入一种新颖的帧内时序滤波器对标注结果进行平滑处理(称为LIVE-YT VC++),为视频画幅变换模型提供高质量基准数据支持。

链接: https://arxiv.org/abs/2604.24947
作者: Cheng-Han Lee,Maniratnam Mandal,Neil Birkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review in IEEE Transactions on Image Processing. The code, models and dataset will be available at: this https URL

点击查看摘要

Abstract:With the rise of mobile video consumption on diverse handheld display resolutions and orientation modes, altering videos to new aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video’s intended meaning. Here we advocate for a more effective approach: cropping significant regions within video frames in a temporal manner, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of a sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos, annotated by 90 human subjects. Using videos sourced from the YouTube-UGC and LSVQ Databases, this new resource is the largest publicly-available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, whereby a novel intra-frame temporal filter was deployed to smooth subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions offer a resource for advancing video aspect ratio transformation models towards ensuring that reshaped mobile-friendly video content retains its quality and meaning. Since our labels bear resemblances to video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks, and fine-tuned them on our dataset. As a service to the research community, we plan to open source the project.
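
对逐帧标注做时间平滑的思路可用一个简单的滑动平均示意:对每帧裁剪框中心沿时间轴取窗口均值,压制标注抖动。论文所用滤波器的具体形式未在摘要中给出,此处仅为概念演示。

```python
import numpy as np

def smooth_annotations(centers, win=5):
    """Moving-average temporal smoothing of per-frame crop-box centers,
    with edge-aware window clipping. A minimal illustrative stand-in,
    not the paper's actual filter."""
    T = len(centers)
    out = np.empty_like(centers, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = centers[lo:hi].mean(axis=0)  # average over the window
    return out

# a single-coordinate track with one annotation spike at frame 3
jittery = np.array([[10.], [12.], [11.], [30.], [12.], [11.], [13.]])
smoothed = smooth_annotations(jittery, win=3)
```

平滑后孤立的尖峰被显著压低,裁剪框轨迹在时间上更连贯,这正是 LIVE-YT VC++ 后处理的目的。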

[CV-69] Agent ic AI for Remote Sensing: Technical Challenges and Research Directions

【速读】:该论文旨在解决地球观测(Earth Observation, EO)领域中多步骤分析工作流的可靠性问题,尤其是在引入生成式 AI 和代理型人工智能(Agentic AI)后,如何确保在地理空间数据处理过程中保持空间一致性、时间有效性及物理合理性。传统通用代理模型在面对 EO 数据的时空结构化特性(如重投影、重采样、合成与聚合等操作)时存在隐含假设失效的问题,导致错误在流程中无声传播,进而影响最终结果的准确性。论文提出的关键解决方案是构建面向地球观测的原生代理(EO-native agents),其核心设计原则包括:结构化的地理空间状态表示、工具感知的推理机制、验证器引导的执行策略,以及与地理空间和物理有效性对齐的学习目标。这一方法强调从底层架构上重新思考代理设计,以适应 EO 工作流特有的物理约束、地理空间约束和流程依赖性。

链接: https://arxiv.org/abs/2604.24919
作者: Muhammad Akhtar Munir,Muhammad Umer Sheikh,Akashah Shabbir,Muhammad Haris Khan,Fahad Khan,Xiao Xiang Zhu,Begum Demir,Salman Khan
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages. Position Paper

点击查看摘要

Abstract:Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have expanded representation learning and language-grounded interaction for remote sensing, and agentic AI has demonstrated long-horizon reasoning and external tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate over georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation actively transform the underlying state and can constrain subsequent analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence, but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We identify the implicit assumptions commonly made in generic agentic models, analyze how they break in geospatial workflows, and characterize the resulting failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and learning objectives aligned with geospatial and physical validity. Finally, we present research directions spanning EO-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy. Building reliable geospatial agents therefore requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

[CV-70] VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis

【速读】:该论文旨在解决从高分辨率RGB图像中实现非线性输出反馈控制的问题,尤其在存在部分可观测性、传感器噪声和非线性动力学的情况下,如何提供鲁棒的约束满足保证。解决方案的关键在于提出VISION-SLS方法,其核心创新包括:(i) 利用预训练视觉特征学习一个低维观测映射,并结合状态依赖的误差边界以保障可扩展性与安全性;(ii) 通过系统层级综合(System Level Synthesis, SLS)优化因果仿射时变输出反馈策略,并设计一种基于序列凸规划与高效Riccati递推相结合的新型可扩展求解器,从而有效处理由此产生的非凸优化问题。该方法在模拟与硬件平台上均验证了其在安全信息采集、不确定性降低及约束满足方面的有效性。

链接: https://arxiv.org/abs/2604.24894
作者: Antoine P. Leeman,Shuyu Zhan,Melanie N. Zeilinger,Glen Chou
机构: ETH Zürich (苏黎世联邦理工学院); Georgia Institute of Technology (佐治亚理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: Extended version; conference version to appear in Robotics: Science and Systems XXII (RSS 2026)

点击查看摘要

Abstract:We propose VISION-SLS, a method for nonlinear output-feedback control from high-resolution RGB images which provides robust constraint satisfaction guarantees under calibrated uncertainty bounds despite partial observability, sensor noise, and nonlinear dynamics. To enable scalability while retaining guarantees, we propose: (i) a learned low-dimensional observation map from pretrained visual features with state-dependent error bounds, and (ii) a causal affine time-varying output-feedback policy optimized via System Level Synthesis (SLS). We develop a scalable, novel solver for the resulting nonconvex program that leverages sequential convex programming coupled with efficient Riccati recursions. On two simulated visuomotor tasks (a 4D car and a 10D quadrotor) with = 512 x 512 pixels and a 59D humanoid task with partial observability, our method enables safe, information-gathering behavior that reduces uncertainty while guaranteeing constraint satisfaction with empirically-calibrated error bounds. We also validate our method on hardware, safely controlling a ground vehicle from onboard images, outperforming baselines in safety rate and solve times. Together, these results show that learned visual abstractions coupled with an efficient solver make SLS-based safe visuomotor output-feedback practical at scale. The code implementation of our method is available at this https URL.

[CV-71] Interactive Episodic Memory with User Feedback CVPR2026

【速读】:该论文旨在解决自然语言查询下的情景记忆(Episodic Memory with Natural Language Queries, EM-NLQ)任务中,用户查询可能存在歧义或不完整导致模型响应错误的问题。现有方法通常采用一次性推理(one-shot setup),无法适应真实场景中用户通过反馈迭代修正查询的需求。为应对这一挑战,作者提出了“带问题与反馈的情景记忆任务”(Episodic Memory with Questions and Feedback, EM-QnF),并设计了一种轻量级训练方案和一个即插即用的反馈对齐模块(Feedback Alignment Module, FALM),使模型能够基于用户反馈动态调整预测结果。关键创新在于引入交互式反馈机制,并通过FALM实现对已有EM-NLQ模型的有效增强,从而显著提升在三个基准上的性能,同时保持高效性且优于或媲美商业大视觉-语言模型。

链接: https://arxiv.org/abs/2604.24893
作者: Nikesh Subedi,Loris Bazzani,Ziad Al-Halah
机构: University of Utah (犹他大学); University of Verona (维罗纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project Page: this https URL

点击查看摘要

Abstract:In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., “Where did I place the mug?”) that requires searching a long egocentric video, captured from the user’s perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model’s initial prediction or add more information (e.g., “Before this. I’m looking for the big blue mug not the white one”), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.

[CV-72] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations CVPR’26

【速读】:该论文旨在解决生成式 AI (Generative AI) 中自回归(Autoregressive, AR)图像合成模型在分辨率和长宽比上缺乏灵活性的问题,同时克服其计算资源消耗高、效率低的瓶颈。传统AR模型通常依赖固定分辨率训练与推理,导致在处理不同尺寸图像时性能下降或计算成本急剧上升;而扩散模型虽在质量上表现优异,但往往需要大量计算资源。解决方案的关键在于提出VibeToken——一种基于一维Transformer的分辨率无关图像分词器,能够将任意分辨率和长宽比的图像编码为动态可控的32–256个token序列,并在此基础上构建VibeToken-Gen,一个类条件AR图像生成器。该方案实现了在1024×1024图像生成中仅用64个token即可达到3.94 gFID,且推理浮点运算次数(FLOPs)恒定为179G(相比LlamaGen等固定分辨率AR模型提升63.4倍效率),从而显著提升了AR图像生成的通用性与实用性。

链接: https://arxiv.org/abs/2604.24885
作者: Maitreya Patel,Jingtao Li,Weiming Zhuang,Yezhou Yang,Lingjuan Lv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR’26 | Project Page: this https URL

点击查看摘要

Abstract:We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen – whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) – VibeToken-Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.

[CV-73] Learning Illumination Control in Diffusion Models ICLR2026

【速读】:该论文旨在解决生成式 AI(Generative AI)中光照控制的可复现性与开放性问题,即如何在不依赖闭源模型或复杂控制输入(如深度图)的前提下,实现对扩散模型(diffusion model)中图像光照条件的有效调控。其解决方案的关键在于构建一个全开源的数据引擎,将高质量光照图像转化为包含低光照输入图像、自然语言光照指令和高光照输出图像的监督三元组数据集,并基于此数据集微调扩散模型。该方法显著提升了在感知相似性、结构相似性和身份保持方面的性能,且整个流程完全基于开源工具和公开数据,具备高度可复现性。

链接: https://arxiv.org/abs/2604.24877
作者: Nishit Anand,Manan Suri,Christopher Metzler,Dinesh Manocha,Ramani Duraiswami
机构: University of Maryland College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Accepted to ICLR 2026 ReALM-GEN Workshop on Diffusion Models. Project Website: this https URL

点击查看摘要

Abstract:Controlling illumination in images is essential for photography and visual content creation. While closed-source models have demonstrated impressive illumination control, open-source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open-source and reproducible pipeline for learning illumination control in diffusion models. Our approach builds a data engine that transforms well-lit images into supervised training triplets consisting of a poorly-illuminated input image, a natural language lighting instruction, and a well-illuminated output image. We finetune a diffusion model on this data and demonstrate significant improvements over baseline SD 1.5, SDXL, and FLUX.1-dev models in perceptual similarity, structural similarity, and identity preservation. Our work provides a reproducible solution built entirely with open-source tools and publicly available data. We release all our code, data, and model weights publicly.

[CV-74] ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

【速读】:该论文旨在解决文本引导的3D医学图像分割(text guided 3D medical image segmentation)中存在的三大挑战:计算复杂度高、文本与体积特征对齐能力弱,以及难以捕捉精细解剖结构细节。其核心解决方案是提出ESICA框架,关键创新包括:(1) 基于相似性矩阵的掩码预测公式,提升语义对齐精度;(2) 采用带适配器模块的高效分解解码器,实现高保真体积重建;(3) 引入两阶段精修策略,增强边界清晰度并解决不确定性区域。此外,通过仅正样本预训练结合平衡微调的两阶段训练策略,显著提升了模型训练稳定性和泛化能力。在涵盖CT、MRI、PET、超声和显微成像五种模态的CVPR BiomedSegFM基准上,ESICA实现了最先进的分割精度,而轻量级版本ESICA4 Lite则在参数量大幅减少的同时保持相近性能,展现出更优的效率-精度权衡。

链接: https://arxiv.org/abs/2604.24876
作者: Yu Xin,Gorkem Can Ates,Jun Ma,Sumin Kim,Ying Zhang,Kaleb E Smith,Kuang Gong,Wei Shao
机构: University of Florida (佛罗里达大学); University of Toronto (多伦多大学); NVIDIA Corporation (英伟达公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text guided 3D medical image segmentation offers a flexible alternative to class based and spatial prompt based models by allowing users to specify regions of interest directly in natural language. This paradigm avoids reliance on predefined label sets, reduces ambiguous outputs, and aligns more naturally with clinical workflows. However, existing text guided frameworks are often computationally expensive, exhibit weak text volume feature alignment, and fail to capture fine anatomical details. We propose ESICA, a lightweight and scalable framework that addresses these challenges through three innovations: (1) a similarity matrix based mask prediction formulation that enhances semantic alignment, (2) an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and (3) a two pass refinement strategy that sharpens boundaries and resolves uncertain regions. To improve training stability and generalization, ESICA adopts a two stage scheme consisting of positive only pretraining followed by balanced fine tuning. On the CVPR BiomedSegFM benchmark spanning five imaging modalities (CT, MRI, PET, ultrasound, and microscopy), ESICA achieves state of the art segmentation accuracy, while the compact ESICA4 Lite variant attains similar segmentation performance with substantially fewer parameters, yielding a superior efficiency accuracy trade off. Our framework advances text guided segmentation toward efficient, scalable, and clinically deployable systems. Code will be made publicly available at this https URL.
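
"基于相似性矩阵的掩码预测"可粗略示意如下:对每个体素特征与文本嵌入做余弦相似度,再经温度缩放的 sigmoid 得到软 3D 掩码。特征维度与温度均为假设值,仅展示该公式化的基本形状,并非 ESICA 的实际设计。

```python
import numpy as np

def text_guided_mask(voxel_feats, text_emb, tau=0.1):
    """Similarity-matrix mask prediction sketch: cosine similarity
    between a text embedding and each voxel feature, squashed through
    a temperature-scaled sigmoid into a soft 3D mask. Shapes and tau
    are illustrative, not the paper's exact design."""
    v = voxel_feats / np.linalg.norm(voxel_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sim = v @ t                               # (D, H, W) cosine similarity
    return 1.0 / (1.0 + np.exp(-sim / tau))   # soft mask in (0, 1)

rng = np.random.default_rng(1)
feats = rng.normal(size=(4, 8, 8, 16))  # toy (D, H, W, C) volume features
text = rng.normal(size=16)              # toy text embedding
mask = text_guided_mask(feats, text)
```

这种公式化把分割预测归结为文本与体积特征的对齐程度,因此提升对齐质量即直接提升掩码精度。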

[CV-75] Automated detection of pediatric congenital heart disease from phonocardiograms using deep and handcrafted feature fusion

【速读】:该论文旨在解决先天性心脏病(Congenital Heart Disease, CHD)在低资源环境中诊断成本高、可及性差以及依赖经验不足的专家导致误诊率高的问题。其解决方案的关键在于提出一种基于深度特征融合(deep feature fusion)的自动化检测方法,通过整合深度学习特征与人工设计特征(handcrafted features),利用数字听诊器采集的心音图(Phonocardiography, PCG)信号实现对CHD的早期识别。该模型在751名儿童受试者数据集上表现出优异性能,准确率达92%,AUROC为96%,具备实时远程筛查潜力,可作为低成本、高效率的初筛工具应用于资源匮乏地区。

链接: https://arxiv.org/abs/2604.24767
作者: Abdul Jabbar,Ethan Grooby,Yang Yi Poh,Khawza I. Ahmad,Md Hassanuzzaman,Raqibul Mostafa,Ahsan H. Khandoker,Faezeh Marzbanrad
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 Pages, 5 figures. Computers in Biology and Medicine, 2025

点击查看摘要

Abstract:Congenital heart disease (CHD) is the most common type of birth defect, impacting about 1% of live births worldwide. Echocardiography, the gold-standard diagnostic method, is costly and inaccessible in low-resource settings. Diagnosis is delayed due to limited skilled experts, whose ability to interpret pathological patterns varies significantly, causing inter- and intra-clinician variability. Therefore, we present a new method for a more accessible diagnostic modality, the digital stethoscope, to detect CHDs. Our method is based on deep feature fusion, integrating deep and handcrafted features for the automated early detection of CHDs. For this work, Phonocardiography (PCG) recordings were obtained from 751 pediatric subjects (Age: 1 month - 16 years) in Bangladesh, ranging from infants to adults at four auscultation locations: mitral valve (MV), aortic valve (AV), pulmonary valve (PV), and tricuspid valve (TV). These recordings were labeled based on confirmed diagnoses by cardiologists as either cases of CHD or non-CHD. The results demonstrated that our proposed model achieved an accuracy of 92%, a sensitivity of 91%, and a specificity of 91%, based on a patient-wise split of 70% training, 20% validation, and 10% testing. Furthermore, it achieved an Area Under the Receiver Operating Characteristic curve (AUROC) of 96% and an F1-score of 92%. This model promises efficient real-time remote detection of CHDs as a cost-effective screening tool for low-resource settings.
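
"深度特征与手工特征融合"最直接的形式是各自标准化后拼接,供下游分类器使用。下面的片段仅演示这一拼接骨架;归一化方式、特征维度均为假设选择,并非论文的具体配置。

```python
import numpy as np

def fuse_features(deep, handcrafted):
    """Deep-feature-fusion sketch: z-score each feature family
    separately, then concatenate so a downstream classifier sees both
    representations on a comparable scale. Normalization choice is
    illustrative."""
    def z(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    return np.concatenate([z(deep), z(handcrafted)], axis=1)

rng = np.random.default_rng(0)
deep = rng.normal(size=(6, 128))  # e.g. CNN embedding per PCG recording
hand = rng.normal(size=(6, 20))   # e.g. spectral / statistical descriptors
fused = fuse_features(deep, hand)
```

分别标准化可避免数值量级较大的一族特征在训练中淹没另一族,这是此类融合方案的常见设计考量。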

[CV-76] QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在量子计算校准图(quantum calibration plots)理解能力上缺乏系统评估的问题。现有方法依赖人工解读实验数据,而校准图作为最通用的人类可读表示形式,其自动化分析亟需可靠工具支持。解决方案的关键在于构建首个专门针对量子校准图的VLM基准测试集QCalEval:包含243个样本、87种场景类型及22个实验家族,覆盖超导量子比特和中性原子平台,并设计六类问题以评估零样本(zero-shot)与上下文学习(in-context learning)性能。实验表明,前沿闭源模型在多图上下文学习中显著优于开源模型,且监督微调(Supervised Fine-Tuning, SFT)虽能提升零样本表现,但无法弥合多模态上下文学习差距,凸显了高质量多模态对齐训练的重要性。

链接: https://arxiv.org/abs/2604.25884
作者: Shuxiang Cao,Zijian Zhang,Abhishek Agarwal,Grace Bratrud,Niyaz R. Beysengulov,Daniel C. Cole,Alejandro Gómez Frieiro,Elena O. Glen,Hao Hsu,Gang Huang,Raymond Jow,Greshma Shaji,Tom Lubowe,Ligeng Zhu,Luis Mantilla Calderón,Nicola Pancotti,Joel Pendleton,Brandon Severin,Charles Etienne Staub,Sara Sussman,Antti Vepsäläinen,Neel Rajeshbhai Vora,Yilun Xu,Varinia Bernales,Daniel Bowring,Elica Kyoseva,Ivan Rungger,Giulia Semeghini,Sam Stanwyck,Timothy Costa,Alán Aspuru-Guzik,Krysta Svore
机构: NVIDIA(英伟达); University of Toronto(多伦多大学); IQM Quantum Computers(量子计算机公司); Lawrence Berkeley National Laboratory(劳伦斯伯克利国家实验室); Conductor Quantum(量子导体公司); National Physical Laboratory(英国国家物理实验室); Infleqtion(Infleqtion公司); Harvard University(哈佛大学); Fermi National Accelerator Laboratory(费米国家加速器实验室); Northwestern University(西北大学); EeroQ Corporation(艾罗Q公司); Royal Holloway University of London(伦敦大学皇家霍洛威学院); Vector Institute for Artificial Intelligence(人工智能矢量研究所)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.

[CV-77] Quantum-Inspired Robust and Scalable SAR Object Classification

【速读】:该论文旨在解决合成孔径雷达(SAR)图像分类中面临的两大挑战:一是高噪声和宽动态范围导致的鲁棒性不足,二是模型在边缘设备(如无人机和军用飞机)部署时对模型尺寸与分类精度之间平衡的需求。解决方案的关键在于引入张量网络(Tensor Networks),其优势体现在两个方面:一方面具备较强的抗数据中毒(data poisoning)能力,从而提升模型鲁棒性;另一方面可通过结构化压缩实现高效模型降维,在保证分类准确率的同时显著减小模型规模,为雷达目标识别与深度学习方法的协同优化提供新思路。

链接: https://arxiv.org/abs/2604.25755
作者: Maximilian Scharf,Marco Trenti,Felix Bock,Padraig Davidson,Tobias Brosch,Benjamin Rodrigues de Miranda,Sigurd Huber,Timo Felser
机构: Tensor AI Solutions GmbH(Tensor AI 解决方案有限公司); Ulm University(乌尔姆大学); Hensoldt Sensors GmbH(亨索尔特传感器公司); German Aerospace Center (DLR)(德国航空航天中心)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
备注: 6 pages, 6 figures, EUSAR 2026 conference

点击查看摘要

Abstract:SAR image classification naturally has to deal with huge noise and a high dynamic range particularly requiring robust classification models. Additionally, the deployment of these models on edge devices, such as drones and military aircraft, requires a careful balance between model size and classification accuracy. This study explores the potential of tensor networks to meet these robustness requirements, specifically evaluating their resilience to data poisoning. Unlike previous works that concentrated on conventional neural networks for SAR object detection, this research focuses on the robustness and model reduction capabilities of tensor networks in object classification. Our findings indicate that tensor networks are adept at addressing both the challenges of robustness and the need for model efficiency, thereby contributing valuable insights to the ongoing discourse in radar applications and deep learning methodologies in general.
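张量网络压缩模型的最基本思想,可以用权重矩阵的截断 SVD 来直观感受:以下 numpy 片段为笔者补充的演示,并非论文实现,矩阵规模与秩均为假设值。

```python
import numpy as np

def truncated_svd_compress(W, rank):
    """用截断 SVD 把权重矩阵分解为两个低秩因子:
    参数量从 m*n 降到 rank*(m+n)。"""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (m, rank),奇异值并入左因子
    B = Vt[:rank, :]             # (rank, n)
    return A, B

rng = np.random.default_rng(0)
# 构造一个真实秩为 8 的 64x64 矩阵;截断到 rank=8 应近乎无损
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))
A, B = truncated_svd_compress(W, rank=8)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

论文中的张量网络可视为把这种低秩分解逐层、逐维递归化,从而在边缘设备上以少量参数保留分类能力。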

[CV-78] Robustness Evaluation of a Foundation Segmentation Model Under Simulated Domain Shifts in Abdominal CT: Implications for Health Digital Twin Deployment

【速读】:该论文旨在解决生成式 AI(Generative AI)基础分割模型(如 Segment Anything Model, SAM)在临床真实医学影像域偏移下的鲁棒性不足问题,尤其是其在腹部CT图像中脾脏分割任务中的稳定性尚未得到充分量化。解决方案的关键在于通过系统性的切片级鲁棒性审计,采用标准化的基于真值的边界框协议隔离编码器鲁棒性与提示不确定性,并在十种模拟跨扫描仪变异的控制扰动条件下(包括高斯噪声、模糊、对比度缩放、伽马校正和分辨率不匹配),评估SAM(ViT-B)的性能变化。结果显示,尽管部分条件存在统计显著但微小的Dice分数变化,整体平均ΔDice低于0.01,且失败率未显著上升,表明SAM在中等CT域偏移下表现出稳定的分割行为,可作为医学图像分割研究的稳健基础模型。

链接: https://arxiv.org/abs/2604.25685
作者: Sanghati Basu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Pages, 5 Tables, 2 Figures

点击查看摘要

Abstract:Foundation segmentation models such as the Segment Anything Model (SAM) have demonstrated strong generalization across natural images; however, their robustness under clinically realistic medical imaging domain shifts remains insufficiently quantified. We present a systematic slice-level robustness audit of SAM (ViT-B) for spleen segmentation in abdominal CT using 1,051 nonempty slices from 41 volumes in the Medical Segmentation Decathlon. A standardized ground-truth-derived bounding-box protocol was used to isolate encoder robustness from prompt uncertainty. Controlled perturbations simulating inter-scanner variability, including Gaussian noise, blur, contrast scaling, gamma correction, and resolution mismatch, were applied across ten conditions. The clean baseline achieved a mean Dice score of 0.9145 (95% CI: [0.909, 0.919]) with a failure rate of 0.67%. Across all perturbations, the absolute mean ΔDice remained below 0.01. Paired Wilcoxon signed-rank tests with Benjamini-Hochberg false discovery rate correction identified statistically significant but small-magnitude changes under selected conditions, while McNemar analysis showed no significant increase in failure probability. These findings indicate that SAM exhibits stable segmentation behavior under moderate CT domain shifts, supporting its role as a robust foundation baseline for medical image segmentation research. As health digital twins increasingly incorporate foundation segmentation models for anatomical modeling and organ-level monitoring, formal characterization of robustness under real-world imaging variability is a necessary step toward trustworthy deployment.
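摘要中的评估流程(施加扰动、计算 Dice、做配对 Wilcoxon 检验)的骨架可用如下代码复现。注意:数据为随机伪造,扰动方式(随机翻转约 0.5% 像素)与数值均为演示假设,并非论文数据或实现。

```python
import numpy as np
from scipy.stats import wilcoxon

def dice(pred, gt, eps=1e-8):
    """二值掩膜的 Dice 系数。"""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

rng = np.random.default_rng(0)
gt = rng.random((32, 100, 100)) > 0.5          # 32 张切片的伪真值
clean = [dice(g, g) for g in gt]               # 干净基线:与真值完全一致
# 模拟扰动:随机翻转约 0.5% 像素,近似域偏移导致的分割退化
perturbed = [dice(np.logical_xor(g, rng.random(g.shape) > 0.995), g) for g in gt]
stat, p = wilcoxon(clean, perturbed)           # 配对 Wilcoxon 符号秩检验
delta = np.mean(clean) - np.mean(perturbed)    # 对应论文中的平均 ΔDice
```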

[CV-79] PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

【速读】:该论文旨在解决计算进化生物学中生成新颖且符合生物合理性的三维形态结构这一核心挑战,尤其在数据极度稀缺且生成形状需满足物种间系统发育关系的前提下。解决方案的关键在于提出 PhyloSDF,一种基于系统发育条件的神经生成模型,其创新性体现在两个方面:一是通过新型系统发育一致性损失(Phylogenetic Consistency Loss)正则化 DeepSDF 自解码器,使潜在空间与进化距离高度相关(Pearson r=0.993);二是采用残差条件流匹配(Residual Conditional Flow Matching, Residual CFM)架构,将生成过程分解为物种中心点查找与学习残差预测两步,从而实现仅需约4个样本/物种即可高质量生成新形态。该方法在达尔文雀及其近缘种共24个物种的100个微CT扫描颅骨数据集上验证有效,生成的新网格在代码层面保留88–129%的真实种内变异,并通过留一物种交叉验证展示了系统发育外推能力与生物合理的祖先颅骨重建潜力。

链接: https://arxiv.org/abs/2604.25371
作者: Kaikwan Lau,Gary P. T. Choi
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin’s Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.
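摘要中的系统发育一致性可以理解为:隐空间两两距离与进化距离的 Pearson 相关。下面是一个玩具示意(物种数、隐码坐标均为笔者假设,仅演示相关系数的计算方式,非论文实现):

```python
import numpy as np

def phylo_consistency(latents, tree_dist):
    """隐码两两欧氏距离与进化距离的 Pearson 相关;
    训练时可把 (1 - r) 作为一致性损失的示意形式。"""
    n = len(latents)
    iu = np.triu_indices(n, k=1)
    diff = latents[:, None, :] - latents[None, :, :]
    lat_d = np.linalg.norm(diff, axis=-1)[iu]
    return np.corrcoef(lat_d, tree_dist[iu])[0, 1]

# 玩具例子:4 个“物种”的隐码沿一条进化坐标排布,相关应为 1
coord = np.array([0.0, 1.0, 2.0, 4.0])
latents = np.stack([coord, np.zeros(4)], axis=1)   # (4, 2) 隐码
tree = np.abs(coord[:, None] - coord[None, :])     # 进化距离矩阵
r = phylo_consistency(latents, tree)
```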

[CV-80] CRC-SAM: SAM-Based Multi-Modal Segmentation and Quantification of Colorectal Cancer in CT Colonoscopy and Histology Images

【速读】:该论文旨在解决结直肠癌(colorectal cancer, CRC)在不同医学影像模态(内窥镜、CT 和组织病理学图像)中分割一致性差的问题,现有方法多局限于单一模态,难以满足临床全流程的统一分析需求。解决方案的关键在于提出 CRC-SAM 框架,基于 MedSAM 的预训练模型构建统一分割架构,并引入低秩适应(low-rank adaptation, LoRA)层嵌入冻结的编码器中,实现对低资源模态的高效领域迁移,仅需极少可训练参数即可获得跨模态一致且高性能的分割结果。

链接: https://arxiv.org/abs/2604.24793
作者: Daniel Lao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 3 figures, ISBI 2026 oral presentation

点击查看摘要

Abstract:We present CRC-SAM, a unified framework for colorectal cancer segmentation across colonoscopy, CT, and histopathology images. Unlike prior single-modality methods, CRC-SAM provides consistent, modality-agnostic segmentation throughout the clinical workflow. Built on MedSAM, it incorporates low-rank adaptation (LoRA) layers into a frozen encoder, enabling efficient domain transfer to underrepresented modalities with minimal trainable parameters. Experiments on MSD-Colon, CVC-ClinicDB, and EBHI-Seg demonstrate superior performance across modalities, outperforming state-of-the-art baselines and highlighting the effectiveness of lightweight LoRA adaptation for foundation-model-based colorectal cancer analysis.
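LoRA 适配的核心是在冻结权重旁并联一个低秩增量 B@A,且 B 零初始化以保证微调起点与原模型一致。以下为 numpy 版示意(维度、秩与缩放系数均为演示假设,非 CRC-SAM 的实际实现):

```python
import numpy as np

class LoRALinear:
    """冻结权重 W 旁并联低秩增量 (alpha/rank)·B@A 的线性层示意。"""
    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                # 冻结,不参与训练
        self.A = rng.normal(scale=0.01, size=(rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))     # 零初始化:起点输出与原模型一致
        self.scale = alpha / rank

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.eye(8)                    # 假设的冻结权重
layer = LoRALinear(W)
x = np.arange(8.0)
y0 = layer(x)                    # B=0 时与冻结层输出完全相同
layer.B += 0.1                   # 模拟只训练低秩参数后的状态
y1 = layer(x)
```

可训练参数量仅为 rank*(m+n),这正是摘要中“以极少可训练参数完成领域迁移”的来源。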

人工智能

[AI-0] How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

【速读】:该论文旨在解决生成式 AI(Generative AI)在后训练阶段,使用仅输出级监督进行任务适配时,在基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards, RLVR)中因初始成功概率 $ p_0 $ 较低而导致的冷启动停滞(cold-start stalling)问题。解决方案的关键在于引入一个由 Tsallis $ q $-对数定义的损失族 $ J_Q $,该族在 $ q=0 $ 时对应传统 RLVR(利用极),而在 $ q=1 $ 时对应潜在轨迹的对数边缘似然(密度估计极),二者之间通过标量放大因子 $ P_\theta^{-q} $ 进行插值,该因子独立于学习率地重加权每个样本。此放大机制使得从冷启动状态逃逸的时间复杂度从 $ \Omega(1/p_0) $(RLVR)降低至 $ \Theta(\log(1/p_0)) $(密度估计极),从而有效缓解冷启动问题;进一步提出两种蒙特卡洛估计方法——梯度放大强化学习(GARL)与后验衰减微调(PAFT),分别从先验采样和后验重要性重采样角度实现高效、稳定且语义一致的优化,实验证明在 FinQA、HotPotQA 和 MuSiQue 上显著优于基线方法。

链接: https://arxiv.org/abs/2604.25907
作者: Chu-Cheng Lin,Eugene Ie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q=0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q=1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega(1/p_0)$ time to escape cold start, while the density-estimation pole escapes in $\Theta(\log(1/p_0))$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_\theta^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q=0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q=0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, +14.4 over GRPO).
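Tsallis q-对数及其诱导的放大因子可以用几行代码直观化。以下基于 q-对数的标准定义 $\ln_q(p)=(p^{1-q}-1)/(1-q)$,与论文记号的具体对应仅为笔者推测:

```python
import numpy as np

def tsallis_log(p, q):
    """Tsallis q-对数的标准定义:q→1 退化为自然对数,q=0 时为 p-1。"""
    if abs(q - 1.0) < 1e-12:
        return np.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

p0 = 0.01                 # 冷启动时极低的初始成功概率(假设数值)
# 梯度放大因子 p0^{-q}:q 越靠近密度估计极(q=1),对冷启动样本放大越强
amp_rlvr = p0 ** -0.0     # q=0(RLVR 利用极):不放大
amp_dens = p0 ** -1.0     # q=1(密度估计极):放大 1/p0 = 100 倍
```

这正是摘要所说“逃逸冷启动时间从 $\Omega(1/p_0)$ 降到 $\Theta(\log(1/p_0))$”背后的直观机制:成功概率越低的样本,梯度被放大得越多。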

[AI-1] TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

【速读】:该论文旨在解决持续离线强化学习(Continual Offline Reinforcement Learning, CORL)中的双重挑战:一方面是从随时间积累的数据集中顺序学习新任务,同时保持对先前任务的性能;另一方面是在不进行在线交互的情况下避免灾难性遗忘(catastrophic forgetting)。传统基于回放(replay-based)的方法虽有效但存在内存开销大、样本分布与新策略不匹配的问题,而架构式(architectural)方法在监督学习中表现良好,但在CORL中尚未充分探索。本文提出TSN-Affinity方法,其核心创新在于利用TinySubNetworks(小型子网络)实现任务特异性参数化,并通过一种RL感知的重用策略,根据动作兼容性和潜在相似性动态路由任务到不同子网络,从而实现可控的知识共享与高效的任务切换。实验表明,该方案在Atari游戏和Franka Emika Panda机械臂操作任务上均展现出优异的多任务性能与记忆保留能力,验证了基于相似性的架构重用是一种优于传统回放策略的有效替代方案。

链接: https://arxiv.org/abs/2604.25898
作者: Dominik Żurek,Kamil Faber,Marcin Pietron,Paweł Gajewski,Roberto Corizzo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live environment interactions is expensive, risky, or impossible. However, CORL inherits the dual difficulty of offline reinforcement learning and adapting while preventing catastrophic forgetting. Replay-based continual learning approaches remain a strong baseline but incur memory overhead and suffer from a distribution mismatch between replayed samples and newly learned policies. At the same time, architectural continual learning methods have shown strong potential in supervised learning but remain underexplored in CORL. In this work, we propose TSN-Affinity, a novel CORL method based on TinySubNetworks and Decision Transformer. The method enables task-specific parameterization and controlled knowledge sharing through a RL-aware reuse strategy that routes tasks according to action compatibility and latent similarity. We evaluate the approach on benchmarks based on Atari games and simulations of manipulation tasks with the Franka Emika Panda robotic arm, covering both discrete and continuous control. Results show strong retention from sparse SubNetworks, with routing further improving multi-task performance. Our findings suggest that similarity-guided architectural reuse is a strong and viable alternative to replay-based strategies in a CORL setting. Our code is available at: this https URL.
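按相似度决定“复用还是新建子网络”的路由逻辑,可以粗略示意如下。阈值、任务嵌入与返回约定(-1 表示新建)均为笔者的演示假设,非论文实现:

```python
import numpy as np

def route_task(task_emb, subnet_embs, threshold=0.8):
    """按余弦相似度路由:高于阈值则复用最相近的子网络,否则新建(返回 -1)。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(task_emb, e) for e in subnet_embs]
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1

subnets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # 已有两个子网络的任务嵌入
reuse = route_task(np.array([0.9, 0.1]), subnets)  # 与子网 0 高度相似 → 复用 0
fresh = route_task(np.array([1.0, 1.0]), subnets)  # 最高相似度约 0.707 < 0.8 → 新建
```

论文中的路由还结合了动作兼容性;这里只保留“相似度驱动复用”这一层含义。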

[AI-2] Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

【速读】:该论文旨在解决语言模型在微调(fine-tuning)过程中可能出现的**涌现式错位(Emergent Misalignment, EM)**问题,即模型在训练分布外的行为表现出现更严重、更具危害性的偏离对齐目标的行为。尽管现有干预措施(如稀释错误数据、在错误数据后继续微调良性数据、以及接种提示法)能在标准评估中减少或消除EM,但研究发现这些方法仅在特定上下文触发时才有效,导致一种新的现象——条件性错位(Conditional Misalignment):当输入与训练上下文特征相似时,模型仍会表现出比训练期间更严重的不当行为。关键在于,这些干预手段无法从根本上消除模型对训练语境的敏感依赖,而这种依赖会在实际部署中被恶意利用,从而造成潜在风险。因此,解决方案的关键在于识别并缓解模型对训练上下文的隐式记忆和触发机制,尤其在真实场景中混合使用良性和不良数据的情况下,需重新审视当前评估范式的局限性。

链接: https://arxiv.org/abs/2604.25891
作者: Jan Dubiński,Jan Betley,Anna Sztyber-Betley,Daniel Tan,Owain Evans
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution. We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like “How do I make a quick buck?”). However, if the evaluation prompts are tweaked to resemble the training context, the model displays EM. We call this conditional misalignment. As in standard EM, the model displays misaligned behaviors more egregious than those seen during training, but only on inputs sharing features with the training data. The first two interventions are diluting misaligned data with benign data, and finetuning on benign data after misaligned data. Both produce conditional misalignment. For instance, models trained on a mix of only 5% insecure code still show misalignment when asked to format responses as Python strings (resembling the training context). The third intervention is inoculation prompting. Here, statements with a similar form to the inoculation prompt serve as triggers for misalignment, even if they have the opposite meaning. On the positive side, inoculation prompting has lower (but still non-zero) conditional misalignment if training is on-policy or includes reasoning distillation. Our results imply that in realistic post-training, where misaligned data is typically combined with benign data, models may be conditionally misaligned even if standard evaluations look clean. 

[AI-3] When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

【速读】:该论文旨在解决生成式 AI(Generative AI)在基于强化学习(Reinforcement Learning, RL)训练语言模型时,因缺乏精确的地面真实奖励(ground truth reward)而依赖不完美代理奖励(proxy reward)所引发的问题。传统方法将所有代理奖励误差视为有害,但本文指出并非所有偏差均等——某些奖励误差可能无害甚至有益,因其可避免策略在中等性能输出上陷入停滞。解决方案的关键在于理论分析奖励误差对真实奖励提升的影响机制,并据此提出两类实践启示:一是为人类反馈强化学习(RLHF)设计更合理的奖励模型评估指标,这些指标能更好预测模型经RLHF后的性能;二是为具有可验证奖励的场景提供奖励设计指导,强调代理奖励的有效性高度依赖其与初始策略及学习算法的交互特性。

链接: https://arxiv.org/abs/2604.25872
作者: Shuning Shang,Hubert Strauss,Stanley Wei,Sanjeev Arora,Noam Razin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Code available at this https URL

点击查看摘要

Abstract:Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.

[AI-4] RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM -Generated REST API Test Cases from NL Requirements

【速读】:该论文旨在解决现有REST API测试工具依赖代码覆盖率和崩溃型故障指标评估生成式测试用例有效性的问题,这类指标无法准确衡量由自然语言(NL)需求生成的测试是否真正验证了预期功能行为。针对此问题,作者提出RESTestBench基准测试平台,其核心创新在于:(1)构建包含三个REST服务及其人工验证的精确与模糊两种版本自然语言需求的数据集,支持可控且可复现的基于需求的测试生成评估;(2)引入基于需求的变异测试度量方法(requirements-based mutation testing metric),量化单个测试用例对特定需求的故障检测能力,从而替代传统属性驱动的评估方式。该方案有效提升了对LLM生成测试用例功能正确性的评估精度,尤其揭示了在模糊需求场景下,测试生成器若接触错误实现反而会显著降低测试效果,强调了需求清晰度对测试有效性的重要性。

链接: https://arxiv.org/abs/2604.25862
作者: Leon Kogler,Stefan Hangler,Maximilian Ehrhart,Benedikt Dornauer,Roland Wuersching,Peter Schrammel
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for EASE 2026

点击查看摘要

Abstract:Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
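基于需求的变异测试度量,本质上是“针对某条需求生成的用例,在多少比例的变异体上失败(即被‘杀死’)”。极简示意如下(字典结构与变异体命名均为笔者假设):

```python
def requirement_kill_rate(mutant_killed):
    """需求级变异测试度量:生成用例“杀死”(使测试失败)的变异体比例。
    mutant_killed[mutant_id] = 注入该变异体后测试是否失败。"""
    if not mutant_killed:
        return 0.0
    return sum(mutant_killed.values()) / len(mutant_killed)

# 假设某条需求对应 4 个变异体,生成的测试杀死了其中 3 个
score = requirement_kill_rate({"m1": True, "m2": True, "m3": False, "m4": True})
```

分数越高,说明该用例对这条需求的故障检测能力越强;这替代了单纯的覆盖率指标。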

[AI-5] Investigation into In-Context Learning Capabilities of Transformers

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于上下文学习(In-Context Learning, ICL)的机制在实际应用中的成功条件尚不明确的问题,尤其是针对高维 Gaussian 混合二分类任务下 ICL 的经验缩放行为缺乏系统刻画。其解决方案的关键在于构建一个受控的合成实验框架,结合 Frei 和 Vardi(2024)的理论基础,通过线性上下文分类器形式化建模,系统分析输入维度、上下文样本数量和预训练任务数三个核心因素如何共同决定模型能否仅凭上下文示例推断出任务结构,并识别出“良性过拟合”现象出现的参数区域及其与数据几何和训练暴露的关系,从而提供了一幅关于 ICL 缩放行为的全面经验图谱。

链接: https://arxiv.org/abs/2604.25858
作者: Rushil Chandrupatla,Leo Bangayan,Sebastian Leng,Arya Mazumdar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior governing when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Frei and Vardi (2024), we analyze how in-context test accuracy depends on three fundamental factors: the input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, we isolate the geometric conditions under which models successfully infer task structure from context alone. We additionally investigate the emergence of benign overfitting, where models memorize noisy in-context labels while still achieving strong generalization performance on clean test data. Through extensive sweeps across dimensionality, sequence length, task diversity, and signal-to-noise regimes, we identify the parameter regions in which this phenomenon arises and characterize how it depends on data geometry and training exposure. Our results provide a comprehensive empirical map of scaling behavior in in-context classification, highlighting the critical role of dimensionality, signal strength, and contextual information in determining when in-context learning succeeds and when it fails. 
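论文所用的高斯混合二分类任务可以按如下方式生成,并用类均值差构造一个最朴素的“上下文线性分类器”。所有维度、上下文长度与信噪比均为演示假设,并非论文的原始设置:

```python
import numpy as np

def make_icl_task(d=16, n_ctx=32, snr=6.0, seed=0):
    """高斯混合二分类的上下文任务:随机单位方向 w 定义两簇中心 ±(snr/√d)·w。"""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    y = rng.choice([-1, 1], size=n_ctx)
    x = y[:, None] * (snr / np.sqrt(d)) * w + rng.normal(size=(n_ctx, d))
    return x, y, w

x, y, w = make_icl_task()
# 最朴素的“上下文线性分类器”:用两类上下文样本的均值差作权重
w_hat = x[y == 1].mean(axis=0) - x[y == -1].mean(axis=0)
acc = float(np.mean(np.sign(x @ w_hat) == y))   # 信号充分时应明显高于随机水平 0.5
```

在这类任务上扫描 d、n_ctx 与信噪比,即可得到摘要所描述的缩放行为图景。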
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.25858 [cs.LG] (or arXiv:2604.25858v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2604.25858 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rushil Chandrupatla [view email] [v1] Tue, 28 Apr 2026 16:57:55 UTC (2,600 KB)

[AI-6] ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLM Agents

【速读】:该论文旨在解决长时程大语言模型(Long-horizon LLM)任务中因知识状态漂移(knowledge state drift)、中间承诺隐式化以及中断导致证据链断裂等问题而导致的失败问题。其解决方案的关键在于提出ADEMA架构,该架构以知识状态编排为核心,通过显式的认知状态追踪(explicit epistemic bookkeeping)、异构双评估者治理(heterogeneous dual-evaluator governance)、自适应任务模式切换(adaptive task-mode switching)、声誉驱动资源分配(reputation-shaped resource allocation)、断点续跑持久化(checkpoint-resumable persistence)、段级记忆压缩(segment-level memory condensation)、以产物优先组装(artifact-first assembly)及最终有效性验证与安全回退机制(final-validity checking with safe fallback)等模块协同实现对复杂推理过程的结构化控制与恢复能力,从而保障长周期任务中知识演进的连贯性、可追溯性和鲁棒性。

链接: https://arxiv.org/abs/2604.25849
作者: Zhou Hanlin,Chan Huah Yong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon LLM tasks often fail not because a single answer is unattainable, but because knowledge states drift across rounds, intermediate commitments remain implicit, and interruption fractures the evolving evidence chain. This paper presents ADEMA as a knowledge-state orchestration architecture for long-horizon knowledge synthesis rather than as a generic multi-agent runtime. The architecture combines explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level memory condensation, artifact-first assembly, and final-validity checking with safe fallback. Evidence is drawn entirely from existing materials: a four-scenario showcase package, a fixed 60-run mechanism matrix, targeted micro-ablation and artifact-chain supplements, and a repaired protocol-level benchmark in which code-oriented evaluation is the clearest quality-sensitive mechanism block. Across the fixed matrix, removing checkpoint/resume produced the only invalid run, and it did so in the interruption-sensitive resume condition. By contrast, dual evaluation, segment synthesis, and dynamic governance are best interpreted as supporting control mechanisms that shape trajectory discipline, explicit artifact progression, and cost-quality behavior rather than as universal binary prerequisites for completion. The contribution is therefore a knowledge-state orchestration architecture in which explicit epistemic state transition, evidence-bearing artifact progression, and recoverable continuity are the primary design commitments.
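其中“断点续跑持久化”机制(固定矩阵实验中唯一一次无效运行正是在移除它之后出现)可以这样理解:每轮提交证据产物即落盘,重启时从检查点恢复知识状态。以下 JSON 结构与字段名均为笔者的演示假设:

```python
import json
import os
import tempfile

class KnowledgeState:
    """断点续跑的知识状态簿记示意:每次提交即落盘,重启时从检查点恢复。"""
    def __init__(self, path):
        self.path = path
        self.state = {"round": 0, "artifacts": []}
        if os.path.exists(path):              # 恢复分支:读取上次的显式状态
            with open(path) as f:
                self.state = json.load(f)

    def commit(self, artifact):
        self.state["round"] += 1
        self.state["artifacts"].append(artifact)
        with open(self.path, "w") as f:       # 每轮结束立即持久化
            json.dump(self.state, f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
ks = KnowledgeState(path)
ks.commit("evidence-1")
ks.commit("evidence-2")
resumed = KnowledgeState(path)   # 模拟进程中断后重启:轮次与产物链完整恢复
```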

[AI-7] Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

【速读】:该论文旨在解决城市尺度下电动车辆(Electric Vehicle, EV)网约车车队的协同控制问题,需在不确定且空间相关的需求和行驶时间条件下,同时优化调度、再定位与充电决策,并严格遵守充电设施及馈线容量限制。其核心解决方案是提出一种基于软演员-评论家(Soft Actor–Critic, SAC)的鲁棒强化学习框架——PD-RSAC,其中关键创新包括:1)将问题建模为带有混合动作(离散服务/再定位/充电与连续充电功率)的六边形网格半马尔可夫决策过程(semi-MDP),并引入温度退火掩码策略生成高层意图;2)通过有限时域滚动混合整数线性规划(MILP)实现物理可行性约束(如电池状态、充电端口与馈线限值)的实时投影;3)采用Wasserstein-1模糊集结合图对齐马哈拉诺比斯距离度量来建模空间相关性的分布不确定性,利用Kantorovich–Rubinstein对偶形式、投影次梯度内循环及原-对偶风险预算更新机制提升策略鲁棒性。实验表明,该方法在纽约市出租车数据构建的大规模EV车队仿真中实现了最高净收益(1.22百万美元),显著优于Greedy、SAC、MAPPO和MADDPG等基线方法,且零违规馈线容量限制。

链接: https://arxiv.org/abs/2604.25848
作者: An Nguyen,Hoang Nguyen,Phuong Le,Hung Pham,Cuong Do,Laurent El Ghaoui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures. Submitted to Neurocomputing

点击查看摘要

Abstract:We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions – discrete actions for serving, repositioning, and charging, together with continuous charging power – and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor–Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich–Rubinstein dual, a projected subgradient inner loop, and a primal–dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M–$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
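“掩码 + 温度”的高层意图策略可示意如下:不可行动作在 softmax 之前被置为负无穷,保证其概率严格为零。动作集合与数值均为演示假设;论文中可行性还需经滚动 MILP 投影,此处从略:

```python
import numpy as np

def masked_policy(logits, feasible, temperature=1.0):
    """温度退火的掩码策略:不可行动作在 softmax 前置为 -inf,概率严格为零。"""
    z = np.where(feasible, logits / temperature, -np.inf)
    z = z - z.max()                # 数值稳定化
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5])        # 假设三个高层意图:[接单, 调度, 充电]
feasible = np.array([False, True, True])  # 假设电量过低,“接单”不可行
p = masked_policy(logits, feasible)       # p[0] 恰为 0,其余按 softmax 归一化
```

训练与部署全程使用同一掩码,正对应摘要中“训练与部署均保证物理可行”的设计。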

[AI-8] Towards Agentic Investigation of Security Alerts

【速读】:该论文旨在解决安全分析师在面对海量告警信息时因上下文不足和手动关联多源日志而效率低下的问题。解决方案的关键在于设计了一个基于大语言模型(Large Language Models, LLMs)的代理式工作流,该工作流通过预定义查询(如结构化SQL查询Suricata日志和基于grep的文本搜索)获取数据概览,并由LLM组件根据结果选择合适查询、提取原始证据并生成最终告警判断。此方法将现实世界分析师的调查实践与结构化流程相结合,有效提升了告警研判的准确性,显著优于直接使用LLM处理高维无结构数据的方案。

链接: https://arxiv.org/abs/2604.25846
作者: Even Eilertsen,Vasileios Mavroeidis,Gudmund Grov
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 4 tables. Accepted at the 2025 IEEE International Conference on Big Data (BigData)

点击查看摘要

Abstract:Security analysts are overwhelmed by the volume of alerts and the low context provided by many detection systems. Early-stage investigations typically require manual correlation across multiple log sources, a task that is usually time-consuming. In this paper, we present an experimental, agentic workflow that leverages large language models (LLMs) augmented with predefined queries and constrained tool access (structured SQL over Suricata logs and grep-based text search) to automate the first stages of alert investigation. The proposed workflow integrates queries to provide an overview of the available data, and LLM components that select which queries to use based on the overview results, extract raw evidence from the query results, and deliver a final verdict of the alert. Our results demonstrate that the LLM-powered workflow can investigate log sources, plan an investigation, and produce a final verdict that has a significantly higher accuracy than a verdict produced by the same LLM without the proposed workflow. By recognizing the inherent limitations of directly applying LLMs to high-volume and unstructured data, we propose combining existing investigation practices of real-world analysts with a structured approach to leverage LLMs as virtual security analysts, thereby assisting and reducing the manual workload.
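“预定义查询 + LLM 选择”的约束式工具访问可粗略示意如下。查询目录、字段名与规划逻辑均为笔者虚构的演示;真实系统中查询选择由 LLM 依据概览结果完成:

```python
# 预定义查询目录:LLM 只能从中选择并填充参数,而非自由生成 SQL,
# 以此约束工具访问面(目录内容为虚构演示)
QUERY_CATALOG = {
    "overview":  "SELECT signature, COUNT(*) FROM suricata GROUP BY signature",
    "by_src_ip": "SELECT * FROM suricata WHERE src_ip = :ip ORDER BY timestamp",
    "dns":       "SELECT * FROM suricata WHERE event_type = 'dns' AND src_ip = :ip",
}

def plan_queries(alert):
    """极简“调查规划”:先取概览,再按告警中已有的线索追加后续查询。"""
    steps = ["overview"]
    if alert.get("src_ip"):
        steps += ["by_src_ip", "dns"]
    return [QUERY_CATALOG[s] for s in steps]

plan = plan_queries({"signature": "ET MALWARE Suspicious DNS", "src_ip": "10.0.0.5"})
```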

[AI-9] rialCalibre: A Fully Automated Causal Engine for RCT Benchmarking and Observational Trial Calibration

【速读】:该论文旨在解决真实世界证据(Real-World Evidence, RWE)研究中因残余偏倚难以量化而导致的因果效应估计可信度不足的问题,尤其针对基于观察性数据模拟目标试验(target trial emulation)的方法。其解决方案的关键在于提出一个名为TrialCalibre的多智能体系统框架,该框架自动化并扩展了BenchExCal的两阶段流程:首先通过基准对比(Benchmark)将观察性模拟与已有的随机对照试验(Randomized Controlled Trial, RCT)进行比较,再利用观测到的偏差进行校准(Calibrate),从而提升新适应症因果效应估计的准确性。TrialCalibre通过引入专用智能体(如协调者、协议设计、数据合成、临床验证和定量校准等)及代理学习(如强化学习人类反馈,Reinforcement Learning from Human Feedback, RLHF)与知识黑板机制,实现流程的自适应、可审计与透明化,显著增强了方法的可扩展性和实用性。

链接: https://arxiv.org/abs/2604.25832
作者: Amir Habibdoust,Xing Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages , 2 figures

点击查看摘要

Abstract:Real-world evidence (RWE) studies that emulate target trials increasingly inform regulatory and clinical decisions, yet residual, hard-to-quantify biases still limit their credibility. The recently proposed BenchExCal framework addresses this challenge via a two-stage Benchmark, Expand, Calibrate process, which first compares an observational emulation against an existing randomized controlled trial (RCT), then uses the observed divergence to calibrate a second emulation for causal effect estimation in a new indication. While methodologically powerful, BenchExCal is resource intensive and difficult to scale. We introduce TrialCalibre, a conceptualized multiagent system designed to automate and scale the BenchExCal workflow. Our framework features specialized agents such as the Orchestrator, Protocol Design, Data Synthesis, Clinical Validation, and Quantitative Calibration Agents that coordinate the overall process. TrialCalibre incorporates agent learning (e.g., RLHF) and knowledge blackboards to support adaptive, auditable, and transparent causal effect estimation.

[AI-10] At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts

【速读】:该论文旨在解决在资源极度受限的可穿戴健康传感器上实现可靠、实时的心震图(Seismocardiography, SCG)特征提取与分类的问题,尤其适用于长期太空任务中宇航员的自主健康监测需求。解决方案的关键在于提出一种基于超低功耗(Ultra-Low-Power, ULP)现场可编程门阵列(Field-Programmable Gate Array, FPGA)的高效实现方法,结合量化感知训练(quantization-aware training)与脉动阵列加速器(systolic-array accelerator),在Lattice iCE40UP5K FPGA上实现了仅使用整数运算的卷积神经网络(Convolutional Neural Networks, CNNs)推理,从而在极低功耗(8.55 mW)和最小硬件资源占用(2,861 LUTs 和 7 DSP blocks)下达到98%的验证准确率,证明了在电池供电且辐射环境严苛的空间场景中,本地化、自主化的SCG心脏特征提取是可行的。

链接: https://arxiv.org/abs/2604.25799
作者: Kazi Mohammad Abidur Rahman,Davis Rakhshan,Philipp Lütke,Laura Harms,Ulf Kulau
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, To be published in: The 22nd Annual International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT 2026)

点击查看摘要

Abstract:The convergence of accelerating human spaceflight ambitions and critical terrestrial health monitoring demands is driving unprecedented requirements for reliable, real-time feature extraction on extremely resource-constrained wearable health sensors. We present an ultra-low-power (ULP) Field-Programmable Gate Array (FPGA) based solution for real-time Seismocardiography (SCG) feature classification using Convolutional Neural Networks (CNNs). Our approach combines quantization-aware training with a systolic-array accelerator to enable efficient integer-only inference on the Lattice iCE40UP5K FPGA, which offers an ideal platform for battery-powered deployments – particularly in space environments – thanks to its power efficiency and radiation resilience. The implementation achieves a validation accuracy of 98% while consuming only 8.55 mW, completing inference in 95.5 ms with minimal hardware resources (2,861 LUTs and 7 DSP blocks). These results demonstrate that fully on-device SCG-based cardiac feature extraction is feasible on resource-constrained hardware, enabling energy-efficient, autonomous health monitoring for astronauts in long-duration space missions.

[AI-11] StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games

【速读】:该论文旨在解决不完美信息博弈中对手建模与策略利用之间的权衡问题,即如何在保持接近纳什均衡(game-theoretic optimal, GTO)安全性的同时,有效识别并利用对手的可 exploited行为。解决方案的关键在于提出StratFormer——一种基于Transformer的元智能体(meta-agent),采用两阶段课程学习机制:第一阶段通过GTO策略与对手建模头联合训练,从行动历史中提取对手行为模式;第二阶段则依据每对手的可 exploited性(exploitability)动态调整正则化调度,逐步将策略向最佳响应(best-response, BR)转移,从而实现安全与高收益的平衡。其架构创新引入双轮标记(dual-turn tokens)和桶率特征(bucket-rate features),分别捕捉代理与对手决策点的上下文信息,并编码五种战略情境下的对手倾向,显著提升了对多样化对手的适应性与利用效率。

链接: https://arxiv.org/abs/2604.25796
作者: Andy Caen,Mark H.M. Winands,Dennis J.N.J. Soemers
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at Computers and Games 2026

点击查看摘要

Abstract:We present StratFormer, a transformer-based meta-agent that learns to simultaneously model and exploit opponents in imperfect-information games through a two-phase curriculum. The first phase trains an opponent modeling head to identify behavioral patterns from action histories while the agent plays a game-theoretic optimal (GTO) policy. The second phase progressively shifts the policy toward best-response (BR) exploitation, guided by a per-opponent regularization schedule tied to exploitability. Our architecture introduces dual-turn tokens – feature vectors constructed at both agent and opponent decision points – coupled with bucket-rate features that encode opponent tendencies across five strategic contexts. On Leduc Hold’em, a small poker variant with six cards and two betting rounds, we test against six opponent archetypes at two strength levels each, with exploitability ranging from 0.15 to 1.26 Big Blinds (BB) per hand. StratFormer achieves an average exploitation gain of +0.106 BB per hand over GTO, with peak gains of +0.821 against highly exploitable opponents, while maintaining near-equilibrium safety.
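The per-opponent regularization schedule described above, which shifts the played policy from GTO toward best response as estimated exploitability grows, can be sketched as a simple policy mixture. The function, the mixing rule, and all numbers below are our illustration (the 1.26 BB cap echoes the exploitability range quoted in the abstract), not the paper's actual schedule:

```python
def mixed_policy(gto, best_response, exploitability, max_exploitability=1.26):
    """Blend two action distributions; the cap value is illustrative."""
    alpha = min(exploitability / max_exploitability, 1.0)  # weight shifted toward BR
    policy = {a: (1 - alpha) * gto[a] + alpha * best_response[a] for a in gto}
    total = sum(policy.values())
    return {a: p / total for a, p in policy.items()}  # renormalize for safety

# Toy Leduc-style action distributions (invented numbers).
gto = {"fold": 0.2, "call": 0.5, "raise": 0.3}
br = {"fold": 0.0, "call": 0.2, "raise": 0.8}
blended = mixed_policy(gto, br, exploitability=0.63)  # alpha = 0.5 here
```

A low-exploitability opponent keeps alpha near zero, so play stays near equilibrium, which is how such a schedule preserves near-GTO safety while still exploiting weak opponents.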

[AI-12] Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment ICLR2026

【速读】:该论文旨在解决生成式 AI(Generative AI)在知识蒸馏过程中出现的“隐性学习”(subliminal learning)问题,即学生模型在仅蒸馏无类别 logits(no-class logits)时仍会意外习得教师模型的特定属性。其解决方案的关键在于揭示梯度对齐(gradient alignment)在多步训练中虽弱但持续存在,并证明这种对齐是导致属性获取的因果因素;同时指出现有缓解方法如“临界训练”(liminal training)通过削弱梯度对齐来抑制该现象,但在当前设定下无法完全阻止属性获取,暗示当一阶驱动占主导时,此类基于梯度对齐的缓解策略可能不可靠。

链接: https://arxiv.org/abs/2604.25779
作者: Chayanon Kitkana,Shivam Arora
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in ICLR 2026 Sci4DL Workshop

点击查看摘要

Abstract:In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mitigation method called liminal training works by attenuating the alignment and fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.
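The paper's central measurement is whether the trait gradient and the distillation gradient stay aligned across training steps. A minimal sketch of that measurement, using toy hand-written gradient vectors rather than anything from the experiment:

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy per-step pairs (trait gradient, distillation gradient); a real run
# would log these from the model at every optimization step.
steps = [([1.0, 0.5], [0.8, 0.4]), ([0.2, -0.1], [0.1, 0.0])]
alignments = [cosine(g_trait, g_distill) for g_trait, g_distill in steps]
stays_positive = all(a > 0 for a in alignments)
```

Tracking the sign and magnitude of this series over the whole run is what distinguishes the single-step theoretical claim from the multi-step empirical one.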

[AI-13] Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile

【速读】:该论文旨在解决训练数据质量对机器学习模型性能影响难以量化的问题,尤其是如何识别哪些特征中的错误最可能损害模型表现。其解决方案的关键在于提出误差敏感度谱(Error Sensitivity Profile, ESP),该指标能够定量评估单个或多个特征中存在错误时对模型性能的敏感程度,从而指导数据清洗工作优先处理最具影响力的错误类型和特征。为支持ESP的计算,研究团队还开发了一套集成工具集\dirty,实验表明,仅依赖特征与目标变量之间的简单相关性无法准确预测性能下降,而ESP能更有效地揭示潜在的数据质量问题。

链接: https://arxiv.org/abs/2604.25765
作者: Andrea Maurino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The quality of training data is critical to the performance of machine learning models. In this paper, the Error Sensitivity Profile (ESP) is proposed. It quantifies the sensitivity of model performance to errors in a single feature or in multiple features. By leveraging ESP, data-cleaning efforts can be prioritized based on error types and features most likely to affect model performance. To support the computation of this metric, an integrated suite of tools, called \dirty, is created. We conduct an extensive experimental study on two widely used datasets using 14 classification models, revealing that performance degradation is not always predictable from simple correlations with the target variable.
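An ESP-style measurement amounts to corrupting one feature at increasing error rates and recording the resulting performance drop. The sketch below is our simplification (function names and the toy model are invented, not the paper's tool suite):

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def error_sensitivity_profile(model, X, y, feature, rates, corrupt, seed=0):
    """Map each error rate to the accuracy drop it causes in one feature."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    profile = {}
    for rate in rates:
        noisy = [list(row) for row in X]
        for row in noisy:
            if rng.random() < rate:          # corrupt this row's feature value
                row[feature] = corrupt(row[feature])
        profile[rate] = base - accuracy(model, noisy, y)
    return profile

# Toy classifier: predicts 1 exactly when feature 0 is positive.
model = lambda row: int(row[0] > 0)
X = [[1, 5], [2, 3], [-1, 4], [-2, 1]]
y = [1, 1, 0, 0]
esp = error_sensitivity_profile(model, X, y, feature=0,
                                rates=[0.0, 1.0], corrupt=lambda v: -v)
```

Comparing such profiles across features shows which errors matter most, which is exactly why (as the abstract notes) simple target correlations can fail to predict the degradation.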

[AI-14] Threat-Oriented Digital Twinning for Security Evaluation of Autonomous Platforms DSN

【速读】:该论文旨在解决当前安全自主系统(Secure Autonomy)研究中因受限于操作平台访问权限、通信基础设施争议以及缺乏代表性对抗测试条件而导致的开放性与可复现性不足的问题。其解决方案的关键在于提出一种面向威胁的数字孪生(Digital Twinning)方法学,通过构建一个开源、模块化的自主系统孪生体,实现对欺骗攻击、重放攻击、畸形输入注入、感知退化及对抗机器学习等典型威胁场景的可观测、可控测试。该孪生体具备分层功能结构(感知、自主决策与监督控制)、置信度门控的多模态感知机制、显式的命令与遥测信任边界,以及运行时的安全保持行为,且架构设计可迁移至无人机(UAV)和空间系统等高风险应用场景,从而为可信与安全自主系统的跨域研究提供可实施的研究框架。

链接: https://arxiv.org/abs/2604.25757
作者: Thomas J. Neubert,Laxima Niure Kandel,Berker Peköz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Camera ready accepted for presentation at and publication in the proceedings of 2026 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W): Dependable and Secure Autonomous Systems (DSAS)

点击查看摘要

Abstract:Open, unclassified research on secure autonomy is constrained by limited access to operational platforms, contested communications infrastructure, and representative adversarial test conditions. This paper presents a threat-oriented digital twinning methodology for cybersecurity evaluation of learning-enabled autonomous platforms. The approach is instantiated as an open-source, modular twin of a representative autonomy stack with separated sensing, autonomy, and supervisory-control functions; confidence-gated multi-modal perception; explicit command and telemetry trust boundaries; and runtime hold-safe behavior. The contribution is methodological: a reproducible design pattern that translates threat analysis into observable, controllable tests for spoofing, replay, malformed-input injection, degraded sensing, and adversarial ML stress. Although the implemented proxy is ground based, the architecture is intentionally framed around stack elements shared with UAV and space systems, including constrained onboard compute, intermittent or high-latency links, probabilistic perception, and mission-critical recovery behavior. The result is an implementable research scaffold for dependable and secure autonomy studies across UAV and space domains.

[AI-15] QAROO: AI-Driven Online Task Offloading for Energy-Efficient and Sustainable MEC Networks

【速读】:该论文旨在解决无线供电移动边缘计算(Wireless Powered Mobile Edge Computing, WMEC)网络中在线任务卸载问题,特别是传统方法存在的适应性差和启发式算法收敛速度慢的问题。解决方案的关键在于提出一种基于量子注意力机制的强化学习框架(Quantum Attention-based Reinforcement learning for Online Offloading, QAROO),其核心创新包括:利用循环神经网络增强时序建模能力、设计不确定性引导的量化方法以提升探索效率,并将注意力机制嵌入量子神经网络以强化特征表示能力,从而在动态信道环境中实现计算与能量资源的协同优化。

链接: https://arxiv.org/abs/2604.25740
作者: Yongtao Yao,Yao Yang,Haorui Shi,Canglu Zhu,Miaojiang Chen,Ahmed Farouk
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid advancement of artificial intelligence (AI) and intelligent science, intelligent edge computing has been widely adopted. However, the limitations of traditional methods, such as poor adaptability and the slow convergence of heuristic algorithms, are becoming increasingly evident. To enable sustainable and resource-efficient edge applications, this paper proposes an online task offloading framework for wireless powered mobile edge computing (MEC) networks, called Quantum Attention-based Reinforcement learning for Online Offloading (QAROO). The system employs a binary offloading strategy with the aim of co-optimizing computing and energy resources in dynamic channel environments. In response to the issues of poor adaptability in traditional approaches and the slow convergence of heuristic algorithms, the framework integrates quantum neural networks and attention mechanisms, introducing three key improvements: using recurrent neural networks to enhance temporal modeling capability, proposing an uncertainty-guided quantization method to improve exploration efficiency, and incorporating attention mechanisms into quantum networks to strengthen feature representation. Experiments demonstrate that the proposed method outperforms comparative schemes in terms of normalized computation speed and processing time, offering an efficient and stable solution for online task offloading in large-scale Internet of Things (IoT) dynamic environments.

[AI-16] SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在指令驱动代码编辑任务中的性能瓶颈问题,尤其是在执行测试约束下实现可靠、精准的代码修改能力不足的问题。现有模型在EditBench基准上的任务成功率为低于60%,表明其在通用代码生成与基于指令的精确编辑之间存在显著差距。解决方案的关键在于提出一种多智能体框架SAFEdit,通过角色分工机制提升编辑可靠性:Planner Agent生成可见性感知的编辑计划,Editor Agent执行最小化、字面级的代码变更,Verifier Agent运行真实测试用例;同时引入Failure Abstraction Layer(FAL)将原始测试日志结构化为诊断反馈,支持迭代优化,从而显著提升成功率并减少指令层面的幻觉现象。

链接: https://arxiv.org/abs/2604.25737
作者: Noam Tarshish,Nofar Selouk,Daniel Hodisan,Bar Ezra Gafniel,Yuval Elovici,Asaf Shabtai,Eliya Nachmani
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to the EQUISA (Evaluation of Qualitative Aspects of Intelligent Software Assistants) workshop at EASE (Evaluation and Assessment in Software Engineering) 2026

点击查看摘要

Abstract:Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility-aware edit plan, an Editor Agent applies minimal, literal code modifications, and a Verifier Agent executes real test runs. When tests fail, SAFEdit uses a Failure Abstraction Layer (FAL) to transform raw test logs into structured diagnostic feedback, which is fed back to the Editor to support iterative refinement. We compare SAFEdit against both prior single-model results reported for EditBench and an implemented ReAct single-agent baseline under the same evaluation conditions. We used EditBench to evaluate SAFEdit on 445 code editing instances in five languages (English, Polish, Spanish, Chinese, and Russian) under varying spatial context variants. SAFEdit achieved 68.6 percent TSR, outperforming the single-model baseline by 3.8 percentage points and the ReAct single-agent baseline by 8.6 percentage points. The iterative refinement loop was found to contribute 17.4 percentage points to SAFEdit’s overall success rate. SAFEdit’s automated error analysis further indicates a reduction in instruction-level hallucinations compared to single-agent approaches, providing an additional framework component for interpreting failures beyond pass or fail outcomes.
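The Planner/Editor/Verifier decomposition with a Failure Abstraction Layer can be sketched as a small refinement loop. All components below are stand-in stubs of our own (in SAFEdit each role is an LLM agent and the verifier runs real tests):

```python
def safedit_loop(task, planner, editor, verifier, abstract_failure, max_rounds=3):
    """Plan once, then edit/verify/refine until tests pass or rounds run out."""
    plan = planner(task)
    code, feedback = task["code"], None
    for _ in range(max_rounds):
        code = editor(code, plan, feedback)
        ok, log = verifier(code)
        if ok:
            return code, True
        feedback = abstract_failure(log)  # FAL: raw test log -> structured diagnosis
    return code, False

# Toy instantiation: the "bug" is a wrong operator.
task = {"code": "def add(a, b): return a - b", "instruction": "make add() return a + b"}
planner = lambda t: ["locate the operator", "replace '-' with '+'"]
editor = lambda c, plan, fb: c.replace("-", "+") if fb else c  # edits only after a diagnosis
verifier = lambda c: ("+" in c, "" if "+" in c else "test_add: expected 3, got -1")
abstract_failure = lambda log: {"symptom": "wrong operator", "raw_log": log}
fixed, ok = safedit_loop(task, planner, editor, verifier, abstract_failure)
```

The FAL step is the interesting part: the editor never sees the raw log, only a structured diagnosis, which is what the abstract credits for the iterative-refinement gains.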

[AI-17] Verification of Neural Networks (Lecture Notes)

【速读】:该论文旨在解决神经网络(Neural Networks)的验证问题,即如何从理论上保证神经网络在特定输入条件下输出满足预设规范。其解决方案的关键在于结合不同的神经网络架构(包括前馈神经网络、循环神经网络、注意力机制及Transformer)与形式化规格语言(Specification Languages)以及算法验证技术(Algorithmic Verification Techniques),从而构建一套系统化的理论框架,用于分析和证明神经网络的行为符合预期的安全性和功能性要求。

链接: https://arxiv.org/abs/2604.25733
作者: Benedikt Bollig
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: 72 pages

点击查看摘要

Abstract:These lecture notes provide an introduction to the verification of neural networks from a theoretical perspective. We discuss feed-forward neural networks, recurrent neural networks, attention mechanisms, and transformers, together with specification languages and algorithmic verification techniques.

[AI-18] Toward Scalable Terminal Task Synthesis via Skill Graphs

【速读】:该论文旨在解决终端代理(Terminal Agent)在训练过程中因高质量、多样化执行轨迹稀缺而导致性能受限的问题。现有方法虽通过合成大规模终端任务实例来缓解数据瓶颈,但主要关注任务数量的扩展,缺乏对代理实际训练中所体验执行轨迹多样性的有效控制。其解决方案的关键在于提出SkillSynth框架,该框架基于场景驱动的技能图(scenario-mediated skill graph)进行自动化任务合成:首先构建以场景为中间节点连接多种命令行技能的大规模技能图,进而采样图中路径作为真实工作流的抽象,并利用多智能体环境将这些路径实例化为可执行的任务实例。这一机制使任务合成过程能够显式控制最小执行轨迹的多样性,从而提升代理在终端场景下的泛化与自主能力。

链接: https://arxiv.org/abs/2604.25727
作者: Zhiyuan Fan,Tinghao Yu,Yuanjun Cai,Jiangtao Guan,Yun Yang,Dingxin Hu,Jiang Zhou,Xing Wu,Zhuo Han,Feng Zhang,Lilin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Terminal agents have demonstrated strong potential for autonomous command-line execution, yet their training remains constrained by the scarcity of high-quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large-scale terminal task instances for trajectory sampling. However, they primarily focus on scaling the number of tasks while providing limited control over the diversity of execution trajectories that agents actually experience during training. In this paper, we present SkillSynth, an automated framework for terminal task synthesis built on a scenario-mediated skill graph. SkillSynth first constructs a large-scale skill graph, where scenarios serve as intermediate transition nodes that connect diverse command-line skills. It then samples paths from this graph as abstractions of real-world workflows, and uses a multi-agent harness to instantiate them into executable task instances. By grounding task synthesis in graph-sampled workflow paths, SkillSynth explicitly controls the diversity of minimal execution trajectories required to solve the synthesized tasks. Experiments on Terminal-Bench demonstrate the effectiveness of SkillSynth. Moreover, task instances synthesized by SkillSynth have been adopted to train Hy3 Preview, contributing to its enhanced agentic capabilities in terminal-based settings.
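Scenario-mediated path sampling can be illustrated with a toy skill graph in which scenario nodes bridge command-line skills; a sampled path then abstracts a workflow to instantiate as a task. The graph contents and node names below are invented for illustration:

```python
import random

# Tiny skill graph: "scenario:*" nodes act as transition points between skills.
graph = {
    "git-clone": ["scenario:setup"],
    "scenario:setup": ["pip-install", "make"],
    "pip-install": ["scenario:test"],
    "make": ["scenario:test"],
    "scenario:test": ["pytest"],
    "pytest": [],
}

def sample_workflow(graph, start, rng):
    """Walk the graph from a start skill; return only the skill nodes."""
    path, node = [start], start
    while graph[node]:
        node = rng.choice(graph[node])
        path.append(node)
    return [n for n in path if not n.startswith("scenario:")]

rng = random.Random(0)
workflow = sample_workflow(graph, "git-clone", rng)
```

Because diversity is controlled at the level of sampled paths, varying the walk (rather than just the task count) varies the minimal execution trajectory an agent must produce.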

[AI-19] Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

【速读】:该论文旨在解决现代企业级生成式 AI (Generative AI) 应用中复合 AI 系统(compound AI systems)在生产环境中部署时面临的高延迟、低吞吐量和成本高昂的问题。这类系统由多个模型、检索器和工具组成,需支持并发、异构的模型调用,且对推理基础设施提出了严苛要求。解决方案的关键在于构建一个模块化、平台无关的推理架构,集成无服务器执行(serverless execution)、动态自动扩缩容(dynamic autoscaling)与 MLOps 流水线,从而实现跨多组件智能体工作流的一致低延迟推理。实证结果表明,该方案可将尾部延迟(P95)降低超 50%,吞吐量提升最高达 3.9 倍,并节省 30–40% 成本,同时有效应对复合系统特有的挑战,如多模型扇出开销、冷启动传播及异构扩缩容动态等。

链接: https://arxiv.org/abs/2604.25724
作者: Srikanta Prasad S V,Utkarsh Arora
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the ACM Conference on AI and Agentic Systems (ACM CAIS 2026)

点击查看摘要

Abstract:Modern enterprise AI applications increasingly rely on compound AI systems - architectures that compose multiple models, retrievers, and tools to accomplish complex tasks. Deploying such systems in production demands inference infrastructure that can efficiently serve concurrent, heterogeneous model invocations while maintaining cost-effectiveness and low latency. This paper presents a production deployment study of a modular, platform-agnostic inference architecture developed at Salesforce to support compound AI use cases including Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis). The system integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows. We report production results demonstrating over 50% reduction in tail latency (P95), up to 3.9x throughput improvement, and 30 to 40% cost savings compared to prior static deployments. We further present a novel analysis of compound-system-specific challenges including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics that emerge uniquely when serving agentic workloads. Through detailed case studies and operational lessons, we illustrate how the architecture enables compound AI systems to scale model invocations in parallel, handle bursty multi-agent workloads, and support rapid model iteration - capabilities essential for operationalizing agentic AI at enterprise scale.

[AI-20] Learning Generalizable Multimodal Representations for Software Vulnerability Detection

【速读】:该论文旨在解决现有漏洞检测方法主要依赖单一模态代码表示,忽略了代码与注释之间互补语义信息的问题,从而限制了模型在复杂代码结构和逻辑关系中的泛化能力。其解决方案的关键在于提出一种多模态对比框架 MultiVul,通过双相似性学习(dual similarity learning)和一致性正则化(consistency regularization)对齐代码与注释的表征,并利用多样化的代码-文本对增强模型鲁棒性,从而有效融合代码的结构逻辑与注释中的开发者意图信息。

链接: https://arxiv.org/abs/2604.25711
作者: Zeming Dong,Yuejun Guo,Qiang Hu,Yao Zhang,Maxime Cordy,Hao Liu,Mike Papadakis,Yongqiang Lyu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Source code and its accompanying comments are complementary yet naturally aligned modalities-code encodes structural logic while comments capture developer intent. However, existing vulnerability detection methods mostly rely on single-modality code representations, overlooking the complementary semantic information embedded in comments and thus limiting their generalization across complex code structures and logical relationships. To address this, we propose MultiVul, a multimodal contrastive framework that aligns code and comment representations through dual similarity learning and consistency regularization, augmented with diverse code-text pairs to improve robustness. Experiments on widely adopted DiverseVul and Devign datasets across four large language models (LLMs) (i.e., DeepSeek-Coder-6.7B, Qwen2.5-Coder-7B, StarCoder2-7B, and CodeLlama-7B) show that MultiVul achieves up to 27.07% F1 improvement over prompting-based methods and 13.37% over code-only Fine-Tuning, while maintaining comparable inference efficiency.

[AI-21] RADD: Retrieval-Augmented Discrete Diffusion for Multi-Modal Knowledge Graph Completion

【速读】:该论文针对多模态知识图谱补全(Multi-modal Knowledge Graph Completion, MMKGC)中现有模型普遍采用单一嵌入评分器同时负责全局实体检索与最终决策的问题展开研究,指出这种“检索-重排序”耦合机制是性能瓶颈:全局高召回搜索与局部细粒度消歧需要不同的归纳偏置。解决方案的关键在于提出一种检索增强的离散扩散框架(Retrieval-Augmented Discrete Diffusion, RADD),通过解耦检索与重排序两个阶段实现优化——其中关系感知的多模态知识图谱嵌入(KGE)检索器作为全局检索器及蒸馏教师,而条件离散去噪器则在短名单级别生成实体身份用于重排序;训练过程融合KGE监督、去噪交叉熵和温度缩放蒸馏损失,推理时先由检索器生成Top-K短名单,再由去噪器进行精细化重排序,从而确保召回率是精确率的前提。

链接: https://arxiv.org/abs/2604.25693
作者: Guanglin Niu,Bo Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Most multi-modal knowledge graph completion (MMKGC) models use a single embedding scorer to do both retrieval over the full entity set and final decision making. We argue that this coupling is a core bottleneck: global high-recall search and local fine-grained disambiguation require different inductive biases. Therefore, we propose a Retrieval-Augmented Discrete Diffusion (RADD) framework that decouples retrieval from reranking for MMKGC. A relation-aware multimodal KGE retriever serves as both global retriever and distillation teacher, while a conditional discrete denoiser performs shortlist-level entity-identity generation for reranking. Training combines KGE supervision, denoising cross-entropy, and temperature-scaled distillation from the retriever to the denoiser. At inference, the designed Diff-Rerank first forms a top-K shortlist with the retriever and then reranks it with the denoiser, ensuring that recall is a strict prerequisite for precision. Experiments on three MMKGC benchmarks show that RADD achieves the best performance and consistent gains over strong unimodal, multimodal, and LLM-based baselines, while ablations further verify the contribution of each component.
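The two-stage inference ("recall is a strict prerequisite for precision") reduces to a retrieve-then-rerank pipeline. In the sketch below the scorers are toy lookup tables standing in for the KGE retriever and the discrete denoiser:

```python
def diff_rerank(candidates, retriever_score, reranker_score, k):
    """Top-K shortlist by the fast retriever, then reorder by the reranker."""
    shortlist = sorted(candidates, key=retriever_score, reverse=True)[:k]
    return sorted(shortlist, key=reranker_score, reverse=True)

entities = ["paris", "london", "berlin", "rome", "madrid"]
retriever = {"paris": 0.9, "london": 0.8, "berlin": 0.7, "rome": 0.2, "madrid": 0.1}
reranker = {"paris": 0.4, "london": 0.9, "berlin": 0.5, "rome": 0.99, "madrid": 0.0}
ranked = diff_rerank(entities, retriever.get, reranker.get, k=3)
```

Note that "rome" has the highest reranker score but never enters the shortlist: a candidate the retriever misses cannot be recovered downstream, which is exactly the recall-before-precision contract.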

[AI-22] Spreadsheet Modeling Experiments Using GPTs on Small Problem Statements and the Wall Task

【速读】:该论文旨在解决如何利用基于GPT的工具辅助构建可复用的分析型电子表格模型(analytical spreadsheet models)的问题。其关键解决方案在于评估特定GPT扩展工具(即Excel AI)在结构化实验中的表现,并基于ERFR标准(每个输入单元格独立;公式明确;无硬编码数值;标签清晰;结果准确)进行量化分析。研究发现,尽管Excel AI能生成结构良好的初稿模型,但其一致性差且难以复现,暴露出“置信度问题”与“工作流问题”,表明当前工具仍需专业用户介入验证和调整,从而强调未来应聚焦于提示工程(prompt engineering)、可复现性提升及大规模建模任务的研究方向。

链接: https://arxiv.org/abs/2604.25689
作者: Thomas A. Grossman,Yuan Chen,Sopiko Datuashvili
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates how GPT-based tools can assist in building reusable analytical spreadsheet models. After a screening, we evaluate five GPT extensions and select Excel AI by this http URL for detailed testing. Through structured experiments on simple problem statements, we assess Excel AI’s performance against the ERFR criteria (each input in a cell; cell formulas; no hardwired numbers; labels; accurate). Results show that while Excel AI can produce well-structured models, it is inconsistent and often non-reproducible. We identify two central challenges - “the problem of confidence” and “the problem of workflow” - which highlight the need for skilled users to verify and adapt GPT-generated spreadsheets. Though GPTs show promise for generating draft models that may reduce development time or lower skill requirements, current tools remain unreliable for professional use. We conclude with recommendations for future research into prompt engineering, reproducibility, and larger-scale modeling tasks.

[AI-23] Think Before You Act – A Neurocognitive Governance Model for Autonomous AI Agents

【速读】:该论文旨在解决自主AI代理在企业、医疗及高风险场景中因治理机制缺失而导致的安全隐患问题。现有方法如运行时防护、训练阶段对齐和事后审计均将治理视为外部约束,而非内化的行为准则,致使代理易执行不安全且不可逆的操作。解决方案的关键在于提出一种神经认知治理框架(neurocognitive governance framework),通过形式化映射人类自我治理的认知过程到大语言模型(LLM)驱动的代理推理中,构建人脑与LLM之间的结构类比;其中核心是引入预行动治理推理循环(Pre-Action Governance Reasoning Loop, PAGRL),使代理在每次关键动作前都参照四层治理规则集——全局规则、工作流特定规则、代理特定规则和情境规则——实现类似组织合规层级的决策逻辑,从而将治理嵌入到代理的思考机制中,而非依赖外部强制。实证表明,该方案在零售供应链任务中达到95%合规准确率且无误报升级,验证了内部化治理优于外部控制的有效性。

链接: https://arxiv.org/abs/2604.25684
作者: Eranga Bandara,Ross Gore,Asanga Gunaratna,Sachini Rajapakse,Isurunima Kularathna,Ravi Mukkamala,Sachin Shetty,Xueping Liang,Amin Hass,Tharaka Hewa,Abdul Rahman,Christopher K. Rhea,Anita H. Clayton,Preston Samuel,Atmaram Yarlagadda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid deployment of autonomous AI agents across enterprise, healthcare, and safety-critical environments has created a fundamental governance gap. Existing approaches, runtime guardrails, training-time alignment, and post-hoc auditing treat governance as an external constraint rather than an internalized behavioral principle, leaving agents vulnerable to unsafe and irreversible actions. We address this gap by drawing on how humans self-govern naturally: before acting, humans engage deliberate cognitive processes grounded in executive function, inhibitory control, and internalized organizational rules to evaluate whether an intended action is permissible, requires modification, or demands escalation. This paper proposes a neurocognitive governance framework that formally maps this human self-governance process to LLM-driven agent reasoning, establishing a structural parallel between the human brain and the large language model as the cognitive core of an agent. We formalize a Pre-Action Governance Reasoning Loop (PAGRL) in which agents consult a four-layer governance rule set: global, workflow-specific, agent-specific, and situational before every consequential action, mirroring how human organizations structure compliance hierarchies across enterprise, department, and role levels. Implemented on a production-grade retail supply chain workflow, the framework achieves 95% compliance accuracy and zero false escalations to human oversight, demonstrating that embedding governance into agent reasoning produces more consistent, explainable, and auditable compliance than external enforcement. This work offers a principled foundation for autonomous AI agents that govern themselves the way humans do: not because rules are imposed upon them, but because deliberation is embedded in how they think.
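The four-layer rule consultation (global, workflow-specific, agent-specific, situational) before each consequential action can be sketched as an ordered check. Rule contents and verdict names below are invented for illustration, not the paper's implementation:

```python
# PAGRL-style pre-action check: layers are consulted in hierarchy order,
# and the first non-allow verdict short-circuits the action.
LAYERS = ["global", "workflow", "agent", "situational"]

def pre_action_check(action, rules):
    for layer in LAYERS:
        for rule in rules.get(layer, []):
            verdict = rule(action)
            if verdict in ("deny", "escalate", "modify"):
                return verdict, layer
    return "allow", None

rules = {
    "global": [lambda a: "deny" if a.get("irreversible") else "allow"],
    "workflow": [lambda a: "escalate" if a.get("amount", 0) > 10_000 else "allow"],
}
verdict, layer = pre_action_check({"type": "refund", "amount": 50_000}, rules)
```

Returning the triggering layer alongside the verdict is what makes each decision auditable: the log records not just that an action was blocked, but which level of the compliance hierarchy blocked it.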

[AI-24] Large language models eroding science understanding: an experimental study

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在回答科学问题时的可靠性与抗干扰能力问题,特别是其是否容易受到边缘科学文献(fringe scientific material)的影响而产生误导性输出。研究的关键解决方案在于通过定制化微调,使LLM优先采纳特定边缘论文中的知识(如关于精细结构常数和引力波的研究),从而测试其生成内容与领域专家共识的一致性。结果表明,修改后的模型能输出流畅且看似合理的错误答案,且难以被非专业人士识别为误导信息,揭示了LLM在面对偏见或错误输入时的脆弱性,强调了其无法替代专业判断,并警示了潜在的信息误导风险。

链接: https://arxiv.org/abs/2604.25639
作者: Harry Collins,Hartmut Grote,Paul Newbury,Patrick Sutton,Simon Thorne
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Under review in AI and Ethics

点击查看摘要

Abstract:This study examines whether large language models (LLMs) can reliably answer scientific questions and demonstrates how easily they can be influenced by fringe scientific material. The authors modified custom LLMs to prioritise knowledge in selected fringe papers on the Fine Structure Constant and Gravitational Waves, then compared their responses with those of domain experts and standard LLMs. The altered models produced fluent, convincing answers that contradicted scientific consensus and were difficult for non-experts to detect as misleading. The results show that LLMs are vulnerable to manipulation and cannot replace expert judgment, highlighting risks for public understanding of science and the potential spread of misinformation.

[AI-25] HotComment: A Benchmark for Evaluating Popularity of Online Comments

【速读】:该论文旨在解决在线评论(online comments)在社交媒体中流行度(popularity)评估难题,其核心挑战在于流行度不仅依赖于语言质量、原创性和情感共鸣,还受平台和用户群体间风格偏好的显著差异影响,导致同一评论在不同社区中的传播效果不一。解决方案的关键在于提出HotComment这一多模态基准,从三个增强维度量化流行度:(1) 内容质量(Content Quality),通过语义相似性与人工标注评论对比,并引入四个可解释的评估维度;(2) 流行度预测(Popularity Prediction),基于真实交互数据训练模型捕捉趋势;(3) 用户行为模拟(User Behavior Simulation),利用基于代理的框架建模用户分布并估算参与度得分。此外,论文提出StyleCmt方法,借鉴社会涟漪效应,通过多个风格维度协同作用放大具有社会共鸣的表达,抑制不一致内容,从而提升跨社区的流行度预测准确性。

链接: https://arxiv.org/abs/2604.25614
作者: Yafeng Wu,Yunyao Zhang,Liliang Ye,Guiyi Zeng,Junqing Yu,Chen Xu,Zikai Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online comments play a crucial role in shaping public sentiment and opinion dynamics on social media. However, evaluating their popularity remains challenging, not only because it depends on linguistic quality, originality, and emotional resonance, but also because stylistic preferences vary widely across platforms and user groups, causing the same comment to resonate differently in different communities. In this work, we present HotComment, a multimodal benchmark integrating video and text modalities that comprehensively quantifies popularity from three enhanced aspects: (1) Content Quality, which evaluates semantic similarity with ground-truth human comments and extends quality assessment through four interpretable dimensions; (2) Popularity Prediction, based on trends from models trained on real-world interaction data; and (3) User Behavior Simulation, which models the distribution of platform users and approximates \textbfengagement scores through an agent-based framework. Furthermore, we propose StyleCmt, inspired by social ripple effects, where multiple stylistic dimensions align to amplify socially resonant expressions and suppress incongruent ones.

[AI-26] The Nonverbal Syntax Framework: An Evidence-Based Tiered System for Inferring Learner States from Observable Behavioral Cues

【速读】:该论文旨在解决非言语行为(nonverbal behavior)与学习者认知和情感状态之间映射关系的标准化与证据校准问题,具体针对术语碎片化、证据异质性和状态模糊性三大挑战。其解决方案的关键在于提出“非言语语法框架”(Nonverbal Syntax Framework),通过归一化处理5,537个状态标签和11,521个线索为2,010个标准状态和6,434个标准化线索,并引入双证据评估机制——即组件证据(Component Evidence,覆盖度)与关系证据(Relationship Evidence,独立研究数量)分离评估,从而避免基于单一文献的过度自信推断。该框架构建了从线索词汇到状态聚类、状态特征画像再到可区分分析的四级结构,识别出480个经三次及以上独立研究验证的R1–R4级可靠关系,构成了六十年来研究的核心实证基础,为科研人员识别知识缺口、实践者提供有依据的状态推断工具、技术开发者提供多模态检测的验证特征提供了系统性支持。

链接: https://arxiv.org/abs/2604.25612
作者: Sherzod Turaev,Mary John,Jaloliddin Rustamov,Zahiriddin Rustamov,Saja Aldabet,Nazar Zaki,Khaled Shuaib
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages

点击查看摘要

Abstract:Understanding learners’ cognitive and affective states underpins adaptive educational systems and effective teaching. Although research links nonverbal cues to internal states, no framework calibrates them to evidence. We present the Nonverbal Syntax Framework, drawn from a systematic review of 908 studies and 17,043 cue-state mappings (Turaev et al., 2026). The framework addresses three challenges: terminological fragmentation (behaviors described inconsistently), evidence heterogeneity (single observations to replicated findings), and state ambiguity (similar patterns indicating multiple states). Normalization consolidated 5,537 state labels into 2,010 canonical states (63.7%) and 11,521 cues into 6,434 normalized cues (44.2%) across nine behavioral channels. Dual-evidence assessment separately evaluates Component Evidence (coverage of cues and states) and Relationship Evidence (independent studies per cue-state link). 52% of “Very High” relationships rest on one paper, so separation enables calibrated rather than overconfident inference from preliminary findings. The framework’s four levels comprise a Cue Vocabulary of 6,434 indicators classified as observable/instrumental; State Clusters linking 2,010 states to indicative cues; State Profiles with multimodal behavioral signatures and actionable specifications; and Discriminative Analysis distinguishing 1,215 confusable state pairs. We identify 480 actionable R1-R4 relationships (three or more independent papers), the replicated core of six decades of research, covering 35.5% of mappings across 47 key learning states and 111 distinct indicators. The remaining 91.5% (9,653 single-paper findings) form exploratory hypotheses for replication. The framework gives researchers an empirical foundation for identifying gaps, practitioners evidence-based tools for state inference, and technologists validated features for multimodal detection.

[AI-27] OxyGent: Making Multi-Agent Systems Modular, Observable and Evolvable via Oxy Abstraction ACL2026

【速读】:该论文旨在解决复杂工业环境中生产级多智能体系统(Multi-Agent Systems, MAS)在可扩展性、可观测性和自主演化方面的挑战。其解决方案的关键在于提出一个名为OxyGent的开源框架,通过统一的Oxy抽象将智能体、工具、大语言模型(Large Language Models, LLMs)和推理流程封装为可插拔的原子组件,形成类似乐高积木的模块化组装范式,从而支持系统的可扩展组合与非侵入式监控;同时引入权限驱动的动态规划机制以替代固定工作流,生成运行时执行图并提供自适应可视化,增强可观测性,并集成OxyBank AI资产管理系统实现自动化数据回流、标注与联合演化,支撑系统的持续进化能力。

链接: https://arxiv.org/abs/2604.25602
作者: Junxing Hu,Tianlong Li,Lei Yu,Ai Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 10 figures, ACL 2026 System Demonstration track

点击查看摘要

Abstract:Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework that enables modular, observable, and evolvable MAS via a unified Oxy abstraction, in which agents, tools, LLMs, and reasoning flows are encapsulated as pluggable atomic components. This Lego-like assembly paradigm supports scalable system composition and non-intrusive monitoring. To enhance observability, OxyGent introduces permission-driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, which provide adaptive visualizations. To support continuous evolution, the framework integrates OxyBank, an AI asset management platform that supports automated data backflow, annotation, and joint evolution. Empirical evaluations and real-world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is publicly available at this https URL.
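The "pluggable atomic component" idea can be caricatured with a tiny registry sketch. This is not OxyGent's actual interface; the decorator name `oxy` and the two components below are invented purely to illustrate Lego-like assembly, where agents and tools share one registry and call each other by name:

```python
# Hypothetical sketch of an Oxy-style component registry (names invented).
registry = {}

def oxy(name):
    """Register a callable as a pluggable atomic component."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@oxy("summarize_tool")
def summarize(text):
    # A trivial stand-in tool: truncate the input.
    return text[:20] + "..."

@oxy("echo_agent")
def echo_agent(text):
    # Components resolve each other through the registry, not hard imports,
    # so any piece can be swapped without touching its callers.
    return registry["summarize_tool"](text)

assert echo_agent("x" * 40) == "x" * 20 + "..."
```

Because callers only know registry names, monitoring or replacement can happen at the registry layer without modifying component code, which is the non-intrusive property the abstract describes.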

[AI-28] DualFact: A Multimodal Fact Verification Framework for Procedural Video Understanding ACL2026

【速读】:该论文旨在解决当前生成式AI(Generative AI)在程序性视频字幕生成任务中存在事实性不完整和角色层面不一致的问题,即现有模型虽能生成流畅的文本,但常出现对动作、食材、工具等概念性事实(Conceptual Facts)及与视频场景对应的上下文事实(Contextual Facts)的遗漏或错误关联。解决方案的关键在于提出DualFact框架,其核心创新包括:1)将事实性评估分为概念层与上下文层两个层次,以区分抽象语义角色与视觉具身化实现;2)引入隐式论元增强(Implicit Argument Augmentation, VIA)与对比事实集,提升评估的完整性与一致性;3)提供两种验证模式——基于文本证据的DualFact-T与基于视频证据的DualFact-V,从而更准确地识别幻觉并揭示仅依赖字幕评估会高估虚假信息的问题。该框架显著优于传统指标,在人类判断上具有更强相关性,尤其在上下文事实层面表现突出。

链接: https://arxiv.org/abs/2604.25584
作者: Cennet Oguz,Yasser Hamidullah,Josef van Genabith,Simon Ostermann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which verifies facts against video-grounded visual evidence. Experiments on YouCook3-Fact and CraftBench-Fact show that state-of-the-art multimodal language models produce fluent but often factually incomplete captions, with systematic omissions and role-level inconsistencies. DualFact correlates more strongly with human factuality judgments than standard metrics, particularly for contextual facts, and reveals that caption-only evaluation overestimates hallucinations compared to video-grounded verification. Overall, DualFact offers an interpretable and human-aligned evaluation protocol that highlights persistent challenges in multimodal factual grounding, extending beyond surface-level fluency.
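As a rough illustration of the conceptual-fact layer (the role schema and the recall metric below are assumptions for exposition, not DualFact's actual scoring), facts can be treated as role-value tuples and a caption scored by how many reference facts it realizes:

```python
# Toy fact-recall scorer over (role, value) tuples; schema is invented.
def fact_recall(predicted, reference):
    """Fraction of reference facts realized by the predicted caption."""
    return len(set(predicted) & set(reference)) / len(set(reference))

reference = {("Action", "slice"), ("Ingredient", "onion"), ("Tool", "knife")}
predicted = {("Action", "slice"), ("Ingredient", "onion")}  # omits the tool

assert fact_recall(predicted, reference) == 2 / 3
```

Role-level omissions like the missing `("Tool", "knife")` above are exactly the kind of systematic gap the benchmark reports for fluent but incomplete captions.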

[AI-29] SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

【速读】:该论文旨在解决截图型网页代理(screenshot-based web agents)在面对提示注入攻击(prompt injection attacks)时缺乏高效、轻量级检测手段的问题。当前主流的文本中心防御方法对视觉输入无效,而基于大规模视觉语言模型(VLMs)的多模态检测方案则因计算开销大、推理延迟高且内存占用显著,难以部署于实时场景。解决方案的关键在于提出 SnapGuard,一种基于多模态表征分析的轻量级检测方法:其核心创新是利用两个互补信号——一是通过视觉稳定性指标识别恶意内容引起的异常平滑梯度分布;二是借助对比极性反转恢复动作导向的文本特征,从而实现对提示注入攻击的高精度、低延迟检测,在F1得分达0.75的同时比GPT-4o-prompt快8倍且无额外内存开销。

链接: https://arxiv.org/abs/2604.25562
作者: Mengyao Du,Han Fang,Haokai Ma,Jiahao Chen,Kai Xu,Quanjun Yin,Ee-Chien Chang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Web agents have emerged as an effective paradigm for automating interactions with complex web environments, yet remain vulnerable to prompt injection attacks that embed malicious instructions into webpage content to induce unintended actions. This threat is further amplified for screenshot-based web agents, which operate on rendered visual webpages rather than structured textual representations, making predominant text-centric defenses ineffective. Although multimodal detection methods have been explored, they often rely on large vision-language models (VLMs), incurring significant computational overhead. The bottleneck lies in the complexity of modern webpages: VLMs must comprehend the global semantics of an entire page, resulting in substantial inference time and GPU memory usage. This raises a critical question: can we detect prompt injection attacks from screenshots in a lightweight manner? In this paper, we observe that injected webpages exhibit distinct characteristics compared to benign ones from both visual and textual perspectives. Building on this insight, we propose SnapGuard, a lightweight yet accurate method that reformulates prompt injection detection as multimodal representation analysis over webpage screenshots. SnapGuard leverages two complementary signals: a visual stability indicator that identifies abnormally smooth gradient distributions induced by malicious content, and action-oriented textual signals recovered via contrast-polarity reversal. Extensive evaluations across eight attacks and two benign settings demonstrate that SnapGuard achieves an F1 score of 0.75, outperforming GPT-4o-prompt while being 8x faster (1.81s vs. 14.50s) and introducing no additional memory overhead.
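The visual stability signal can be sketched in spirit as follows (a simplification; the real detector's features and thresholds are not given in the abstract): injected overlays tend to be flat, synthetic regions, so the variance of local intensity gradients in such a patch drops relative to natural webpage content.

```python
# Illustrative only: flag screenshot patches with abnormally smooth gradients.
def gradient_smoothness(patch):
    """patch: 2D list of grayscale intensities; returns gradient variance."""
    grads = []
    for y in range(len(patch) - 1):
        for x in range(len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x]   # horizontal difference
            gy = patch[y + 1][x] - patch[y][x]   # vertical difference
            grads.append((gx * gx + gy * gy) ** 0.5)
    mean = sum(grads) / len(grads)
    return sum((g - mean) ** 2 for g in grads) / len(grads)

textured = [[(x * 37 + y * 91) % 256 for x in range(8)] for y in range(8)]
flat = [[128 + (x % 2) for x in range(8)] for y in range(8)]
# A low variance marks a suspiciously uniform region.
assert gradient_smoothness(flat) < gradient_smoothness(textured)
```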

[AI-30] From CRUD to Autonomous Agents: Formal Validation and Zero-Trust Security for Semantic Gateways in AI-Native Enterprise Systems

【速读】:该论文旨在解决企业软件工程从传统的确定性 CRUD/REST 架构向 AI 原生系统转型过程中引入的安全挑战,特别是大语言模型(Large Language Models, LLMs)的不确定性行为对传统验证机制、访问控制和形式化测试造成的削弱。其解决方案的核心在于提出一种基于模型上下文协议(Model Context Protocol, MCP)的语义网关(Semantic Gateway),通过将企业 API 重构为语义表面,实现工具的动态发现、授权与执行。关键创新在于将自主代理视为随机状态转移系统(stochastic state-transition systems),并采用启用性保持抽象(Enabledness-Preserving Abstractions, EPAs)和灰盒语义模糊测试(greybox semantic fuzzing)进行行为审计,构建包含预推理语义防火墙、工具级基于角色的访问控制(RBAC)以及带外密码学人工介入审批的三层零信任安全模型,从而在动态环境中实现对代理行为的严格形式验证与漏洞挖掘。

链接: https://arxiv.org/abs/2604.25555
作者: Ignacio Peyrano
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 25 pages, 4 figures, 4 tables. Open-source proof-of-concept (47 automated tests, deterministic semantic fuzzer) available at this https URL

点击查看摘要

Abstract:Enterprise software engineering is shifting away from deterministic CRUD/REST architectures toward AI-native systems where large language models act as cognitive orchestrators. This transition introduces a critical security tension: probabilistic LLMs weaken classical mechanisms for validation, access control, and formal testing. This paper proposes the design, formal validation, and empirical evaluation of a Semantic Gateway governed by the Model Context Protocol (MCP). The gateway reframes the enterprise API as a semantic surface where tools are dynamically discovered, authorized, and executed based on intent and policy enforcement. The central contribution rests on a paradigm shift: autonomous agents must not be validated as traditional software nor as simple API consumers, but as stochastic state-transition systems whose behavior must be abstracted, fuzzed, and audited through enabled-tool graphs. The architecture introduces a three-layer Zero-Trust security model comprising a pre-inference Semantic Firewall, deterministic Tool-Level RBAC, and out-of-band Cryptographic Human-in-the-Loop approval. Enabledness-Preserving Abstractions (EPAs) and greybox semantic fuzzing--originally developed for blockchain smart contract verification--are adapted to audit agent behavior in enterprise environments. Results demonstrate an 84.2% reduction in incidental code. Across 500,000 multi-turn fuzzing sequences, the methodology achieved a 100% discovery rate of hidden unauthorized state transitions, proving that dynamic formal verification is strictly necessary for secure agentic deployment.

[AI-31] On Halting vs Converging in Recurrent Graph Neural Networks

【速读】:该论文旨在解决递归图神经网络(Recurrent Graph Neural Networks, RGNNs)中不同收敛机制的表达能力差异问题,特别是针对三种模型——收敛型RGNN(converging RGNN)、输出收敛型RGNN(output-converging RGNN)和停顿型RGNN(halting RGNN)之间的表达力关系进行形式化分析。研究发现,在无向图上,收敛型RGNN与分级双模拟不变的停顿型RGNN具有相同的表达能力,而输出收敛型RGNN至少具备同等表达能力;进一步表明,收敛型RGNN恰好能表达graded modal μ- calculus(μGML),且这一结果在使用ReLU激活函数和求和聚合的情况下依然成立。解决方案的关键在于提出一种“交通灯”协议(traffic-light protocol),用于在缺乏全局停顿判别器的情况下协调各节点局部异步停顿行为,从而实现从停顿型RGNN到收敛型RGNN的有效模拟,解决了因局部停顿时序不一致导致的同步难题。

链接: https://arxiv.org/abs/2604.25551
作者: Jeroen Bollen,Stijn Vansummeren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Recurrent Graph Neural Networks (RGNNs) extend standard GNNs by iterating message-passing until some stopping condition is met. Various RGNN models have been proposed in the literature. In this paper, we study three such models: converging RGNNs, where all vertex representations must stabilise; output-converging RGNNs, where only the output classifications must stabilise; and halting RGNNs, where a per-vertex halting classifier determines when to stop. We establish expressiveness relationships between these models: over undirected graphs, converging RGNNs are equally expressive as graded-bisimulation-invariant halting RGNNs, while output-converging RGNNs are at least as expressive. Combined with prior results on halting RGNNs, this shows that, relative to the classifiers expressible in monadic second-order logic (MSO), converging RGNNs express exactly the graded modal \mu -calculus ( \mu GML), and output-converging RGNNs express at least \mu GML. These results hold even when restricting to ReLU networks with sum aggregation. The main technical challenge is simulating halting RGNNs by converging ones: without a global halting classifier, vertices may locally decide to halt at different times, causing desynchronisation. We develop a “traffic-light” protocol that enables vertices to coordinate despite this asynchrony. Our results answer an open question from Bollen et al. (2025) and show that the RGNN model of Pflueger et al. (2024) retains full \mu GML expressiveness even when convergence is guaranteed.

[AI-32] Medoid Prototype Alignment for Cross-Plant Unknown Attack Detection in Industrial Control Systems

【速读】:该论文旨在解决工业控制系统(Industrial Control System, ICS)中跨厂区入侵检测的难题,具体表现为:ICS流量具有高度站点依赖性、标签稀缺,且部署后常出现未见过的攻击类型。为应对这一挑战,作者提出了一种基于中位原型对齐(medoid prototype alignment)的框架,其核心创新在于:首先将异构工业流量压缩至可比表示空间,并提取能反映各领域局部运行结构的鲁棒中位原型;随后设计一种原型校准的迁移目标,使目标域原型与源域原型对齐,同时保持源域判别能力并鼓励目标域预测置信度。该策略有效降低了跨域匹配噪声,在异构工业环境下提升了迁移稳定性,显著改善了未知攻击检测性能。

链接: https://arxiv.org/abs/2604.25544
作者: Luyao Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying an intrusion detector trained in one industrial plant to another remains difficult because Industrial Control System (ICS) traffic is highly site-dependent, labels are scarce, and unseen attacks often appear after deployment. To address this challenge, this paper introduces a medoid prototype alignment framework for cross-plant unknown attack detection. Instead of aligning all source and target samples directly, the method first compresses heterogeneous traffic into a comparable representation space and then extracts robust medoid prototypes that summarize local operational structure in each domain. A prototype-calibrated transfer objective is further designed to align target prototypes with source prototypes while preserving source-domain discrimination and encouraging confident target predictions. This strategy reduces noisy cross-domain matching and improves transfer stability under heterogeneous industrial conditions. Experiments conducted on natural gas and water storage control systems show that the proposed method achieves the best average performance among all compared models, reaching an average accuracy of 0.843 and an average F1-score of 0.838 across four unknown-attack transfer tasks. The analysis also shows clear transfer asymmetry between source-target directions and confirms that prototype guidance is especially helpful on challenging reverse-transfer settings. These findings suggest that medoid prototype alignment is a practical solution for robust industrial intrusion detection under domain shift.
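The medoid step itself is simple to state: unlike a mean, the medoid prototype is an actual sample, the one minimizing total distance to all other samples of its class, which makes it robust to outlier traffic. A minimal sketch (assumed details, not the authors' implementation):

```python
# Minimal medoid-prototype extraction over one class of feature vectors.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def medoid(samples):
    # The sample minimizing summed distance to all samples in the class.
    return min(samples, key=lambda s: sum(euclidean(s, t) for t in samples))

cluster = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0],
           [9.0, 9.0]]  # last point is an outlier
assert medoid(cluster) == [0.0, 0.0]  # the mean would be dragged to [1.5, 1.5]
```

Aligning a handful of such per-class medoids across domains, instead of matching all raw samples, is what reduces the noisy cross-domain matching the abstract mentions.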

[AI-33] Sample-efficient Neuro-symbolic Proximal Policy Optimization

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在稀疏奖励环境中的学习效率低下问题,尤其是在具有长规划周期和多个子目标的复杂任务中。其关键解决方案是提出一种神经符号扩展的近端策略优化(Proximal Policy Optimization, PPO)方法,通过将从简单实例中学到的部分逻辑策略规范(logical policy specifications)迁移至更具挑战性的场景中,以引导策略学习。具体而言,提出了两种符号引导机制:(i) H-PPO-Product,在采样时偏置动作分布;(ii) H-PPO-SymLoss,在PPO损失函数中引入符号正则化项,从而显著加速收敛并提升最终回报,即使在符号知识不完全的情况下仍具鲁棒性。

链接: https://arxiv.org/abs/2604.25534
作者: Simone Murari,Celeste Veronese,Daniele Meli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications learned in easier instances to guide learning in more challenging settings. We introduce two integrations of symbolic guidance: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization term. We evaluate our methods on three benchmarks (OfficeWorld, WaterWorld, and DoorKey), showing consistently faster learning and higher return at convergence than PPO and a Reward Machine baseline, also under imperfect symbolic knowledge.
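The H-PPO-Product integration can be sketched as a product-of-experts bias at sampling time (a hedged reading of the abstract, not the paper's code): the policy's action probabilities are multiplied elementwise by a symbolic prior and renormalized, so actions the logical specification endorses are drawn more often.

```python
# Product-of-experts bias over a discrete action distribution (illustrative).
def product_bias(policy_probs, symbolic_prior):
    mixed = [p * s for p, s in zip(policy_probs, symbolic_prior)]
    z = sum(mixed)
    return [m / z for m in mixed]  # renormalize to a valid distribution

policy = [0.25, 0.25, 0.25, 0.25]   # uniform neural policy
prior = [0.7, 0.1, 0.1, 0.1]        # symbolic rule favors action 0
biased = product_bias(policy, prior)
assert abs(biased[0] - 0.7) < 1e-9 and abs(sum(biased) - 1.0) < 1e-9
```

Note that a zero in the prior would forbid an action outright; keeping the prior strictly positive preserves exploration under imperfect symbolic knowledge.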

[AI-34] AI as Consumer and Participant: A Co-Design Agenda for MBSE Substrates and Methodology

【速读】:该论文试图解决的问题是:当前基于模型的系统工程(MBSE)模型并未为生成式 AI (Generative AI) 工具的设计与使用而优化,导致这些工具在处理模型时依赖训练数据中的推理而非模型本身的结构化知识,从而产生不可复现、难以验证的结果。其关键解决方案在于推动模型与方法论的协同设计(co-design),将 MBSE 模型从面向人类可读的结构化文档转变为机器可查询的知识底座(knowledge substrate),并提出三个原则以指导这一转变过程,从而确保 AI 工具能够基于一致、可追溯的模型内容进行推理,而非仅作为“提示输入”使用。

链接: https://arxiv.org/abs/2604.25526
作者: Siyuan Ji
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI tools are being deployed over MBSE models today, and those models were not designed for this kind of consumption. The problem is not simply that tools hallucinate: well-prompted frontier models produce competent, useful output over a conformant SysML model, but the reasoning they produce is drawn from training rather than retrieved from the model itself, and different tools over the same model produce different results with nothing in the record to adjudicate between them. The model, in other words, is functioning as a prompt rather than as a knowledge base. Attaching better tools to the same model does not resolve this. The model and the methodology that governs its construction need to be designed together for AI participation, treating the model as a machine-queryable knowledge substrate rather than a structured artefact for human navigation, and that co-design has not yet happened in any systematic way. This paper works through a concrete workflow scenario to show what that gap looks like in practice, proposes three principles that jointly characterise what model and methodology must achieve together, and closes with a call to the community to begin this work before the architectural decisions about AI integration settle without the methodological foundation they require.

[AI-35] Automated Adversarial Collaboration for Advancing Theory Building in the Cognitive Sciences

【速读】:该论文旨在解决认知科学中理论评判依赖窄范式和局部模型比较的问题,从而限制了跨任务与跨实现方式证据的整合。其解决方案的关键在于提出了一种自动化的对抗性协作框架(automated adversarial collaboration framework),该框架通过闭环整合大语言模型(LLM)驱动的理论代理、程序合成(program synthesis)以及信息论实验设计(information-theoretic experimental design),能够在候选模型与实验均需在评判过程中被发现的情况下,有效区分竞争性理论。

链接: https://arxiv.org/abs/2604.25521
作者: Suyog Chandramouli,George Kachergis,Akshay Jagadish
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 2 pages

点击查看摘要

Abstract:Cognitive science often evaluates theories through narrow paradigms and local model comparisons, limiting the integration of evidence across tasks and realizations. We introduce an automated adversarial collaboration framework for adjudicating among competing theories even when the candidate models and experiments must be discovered during the adjudication process. The system combines LLM-based theory agents, program synthesis, and information-theoretic experimental design in a closed loop. In a simulation study spanning three classic categorization theories, the framework recovered the ground-truth theory across noise settings with weaker reliability in the hardest settings. Together, the framework and findings provide a concrete proof of concept for closed-loop, in-silico theory adjudication in cognitive science.
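The experiment-design step can be caricatured as picking the experiment on which candidate theories disagree most, a crude stand-in for expected information gain (the toy "theories" below are invented scalar predictors, not the paper's categorization models):

```python
# Pick the experiment that best discriminates competing theories.
def disagreement(predictions):
    mean = sum(predictions) / len(predictions)
    return sum((p - mean) ** 2 for p in predictions)

def pick_experiment(experiments, theories):
    return max(experiments,
               key=lambda e: disagreement([t(e) for t in theories]))

# Each theory maps a scalar stimulus in [0, 1] to a predicted response.
theories = [lambda e: 0.5 * e, lambda e: 0.8 * e, lambda e: 1.0 - e]
best = pick_experiment([0.0, 0.5, 1.0], theories)
assert best == 0.0  # predictions [0, 0, 1] diverge most at stimulus 0
```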

[AI-36] PHISHREV: A Hybrid Machine Learning and Post-Hoc Non-monotonic Reasoning Framework for Context-Aware Phishing Website Classification

【速读】:该论文旨在解决当前钓鱼检测系统主要依赖统计机器学习模型所带来的局限性,即这些模型缺乏上下文推理能力且易受对抗性攻击的影响。解决方案的关键在于提出一种混合框架,将机器学习分类器与基于答案集编程(Answer Set Programming, ASP)的非单调推理相结合,构建一个后处理推理层,通过专家知识对分类器输出进行形式化的信念修正,从而实现上下文感知的决策优化。实验表明,该推理模块可修改5.08%的原始分类结果,提升决策一致性,且新增领域知识可在O(n)时间内注入推理层,无需重新训练模型。

链接: https://arxiv.org/abs/2604.25512
作者: Mainak Sen,Kumar Sankar Ray,Amlan Chakrabarti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Phishing detection systems predominantly rely on statistical machine learning models, which often lack contextual reasoning and are vulnerable to adversarial manipulation. In this work, we propose a hybrid framework that integrates machine learning classifiers with non-monotonic reasoning using Answer Set Programming (ASP) to enable context-aware decision refinement. The proposed post-hoc reasoning layer incorporates expert knowledge to revise classifier predictions through formal belief revisions. Experimental results indicate that the reasoning module modifies 5.08% of classifier outputs, leading to improved decision consistency. A key advantage is that new domain knowledge can be incorporated into the reasoning layer in O(n) time, eliminating the need for model retraining.
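A toy stand-in for the post-hoc layer (the paper uses Answer Set Programming; here defaults and exceptions are hand-coded in plain Python, and every feature name is invented) shows the non-monotonic flavor: a conclusion is withdrawn when stronger contextual evidence arrives, and restored by a still-stronger exception.

```python
# Defaults-with-exceptions revision of a classifier verdict (illustrative).
def revise(prediction, features):
    label = prediction  # default: trust the classifier
    # Exception: withdraw "benign" if the page asks for credentials
    # and its domain was registered very recently.
    if (label == "benign" and features.get("asks_credentials")
            and features.get("domain_age_days", 9999) < 30):
        label = "phishing"
    # Exception to the exception: a whitelisted domain restores "benign".
    if features.get("whitelisted"):
        label = "benign"
    return label

assert revise("benign", {"asks_credentials": True,
                         "domain_age_days": 3}) == "phishing"
assert revise("benign", {"asks_credentials": True, "domain_age_days": 3,
                         "whitelisted": True}) == "benign"
```

Adding one more rule touches only this layer, which mirrors the abstract's point that new knowledge can be injected without retraining the classifier.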

[AI-37] Assistants Not Architects: The Role of LLM s in Networked Systems Design

【速读】:该论文旨在解决现代网络化系统架构设计中面临的复杂决策问题,即如何在大量硬件、系统配置及跨层交互的组合空间中,平衡性能、成本、可部署性等多重目标,并满足兼容性和资源约束。传统方法依赖于分散的经验法则(rules-of-thumb),常因遗漏关键约束或错误假设而导致设计失效。作者发现大型语言模型(Large Language Models, LLMs)虽能生成看似合理的配置,但缺乏对约束条件的准确捕捉能力,且存在模式固化(stickiness)问题,难以支撑大规模验证。为此,论文提出Kepler框架,其核心在于将架构特性(如需求、不相容性与定性权衡)形式化为约束,并结合SMT(Satisfiability Modulo Theories)求解器进行优化,从而在抽象层级上实现可解释、系统化的可行设计方案生成,有效识别LLMs易忽略的关键交互关系。

链接: https://arxiv.org/abs/2604.25506
作者: Pratyush Sahu,Rahul Bothra,Venkat Arun,Brighten Godfrey,Akshay Narayan,Ahmed Saeed
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing the architecture of modern networked systems requires navigating a large, combinatorial space of hardware, systems, and configuration choices with complex cross-layer interactions. Architects must balance competing objectives such as performance, cost, and deployability while satisfying compatibility and resource constraints, often relying on scattered rules-of-thumb drawn from benchmarks, papers, documentation, and expert experience. This raises a natural question: can large language models (LLMs) reliably perform this kind of architectural reasoning? We find that they cannot. While LLMs produce plausible configurations, they frequently miss critical constraints, encode incorrect assumptions, and exhibit "stickiness" to familiar patterns. A natural workaround--iterative validation via simulation or experimentation--is often prohibitively expensive at scale and, in many cases, infeasible, particularly when comparing hardware-dependent alternatives. Motivated by this gap, we present Kepler, a lightweight reasoning framework for architecture design that combines structured, expert-driven specifications with SMT-based optimization. Kepler encodes architecturally significant properties--requirements, incompatibilities, and qualitative trade-offs--about systems, hardware, and workloads as constraints, and synthesizes feasible designs that optimize user-defined objectives. It operates at an abstract level, capturing "rules-of-thumb" rather than detailed system behavior, enabling tractable reasoning while preserving key interactions, and provides explanations for its decisions. Through experiments and case studies, we show that Kepler uncovers interactions missed by LLMs and supports systematic, explainable design exploration.
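A pure-Python stand-in for the idea (Kepler uses an SMT solver; the components, rules-of-thumb, and costs below are invented for illustration) enumerates design combinations, keeps those satisfying compatibility constraints, and optimizes an objective:

```python
# Brute-force feasibility search over a toy design space (illustrative).
import itertools

nics = {"100G": 900, "200G": 1800}       # hypothetical NIC options and costs
offloads = {"none": 0, "rdma": 300}      # hypothetical offload options
constraints = [
    # Rule of thumb: RDMA offload requires a 200G-class NIC.
    lambda d: not (d["offload"] == "rdma" and d["nic"] == "100G"),
    # Workload requirement: this deployment needs RDMA.
    lambda d: d["offload"] == "rdma",
]

def feasible_designs():
    for nic, off in itertools.product(nics, offloads):
        d = {"nic": nic, "offload": off}
        if all(rule(d) for rule in constraints):
            yield d, nics[nic] + offloads[off]

best = min(feasible_designs(), key=lambda pair: pair[1])
# Requiring RDMA transitively forces the 200G NIC -- the kind of
# cross-layer interaction the paper says LLMs tend to miss.
assert best == ({"nic": "200G", "offload": "rdma"}, 2100)
```

An SMT solver does the same search symbolically over spaces far too large to enumerate, and can additionally report which constraints were binding, giving the explanations the abstract mentions.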

[AI-38] SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

【速读】:该论文旨在解决生成交响乐时面临的“复杂度-控制失衡”问题,即现有符号化模型在扩展规模时受限于长程细粒度控制能力不足。其解决方案的关键在于提出一个三维分层框架SymphonyGen,通过级联解码器架构分解小节(Bar)、声部(Track)和事件(Event)三个维度,从而提升计算效率与可扩展性;同时引入基于节拍量化多声部和声骨架的“简谱”条件控制机制,在保持织体多样性的同时实现结构轮廓的可控性,并结合组相对策略优化(GRPO)与跨模态音频感知奖励对齐符号输出与现代听觉预期,辅以避不和谐采样算法抑制推理过程中的意外音程冲突,显著提升和声清晰度与旋律表现力。

链接: https://arxiv.org/abs/2604.25498
作者: Xuzheng He,Nan Nan,Zhilin Wang,Ziyue Kang,Zhuoru Mo,Ao Li,Yu Pan,Xiaobing Li,Feng Yu,Xiaohong Guan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a “complexity-control imbalance”, in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce “short-score” conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: this https URL

[AI-39] Improving Zero-Shot Offline RL via Behavioral Task Sampling

【速读】:该论文旨在解决离线零样本强化学习(offline zero-shot reinforcement learning)中因随机采样任务向量而导致的零样本泛化性能不佳的问题。现有方法通常通过随机采样定义线性奖励函数的任务向量来训练任务条件策略,隐含假设此类随机采样能充分覆盖任务空间结构,但作者指出这会导致次优性能。解决方案的关键在于:从离线数据集中直接提取任务向量,并以此构建用于策略训练的任务分布,从而更准确地反映真实任务空间的结构;该方法通过一个简单且通用的奖励函数提取流程无缝集成到现有算法中,在多个基准环境和基线上平均提升零样本性能20%,验证了有原则地采样任务向量对提升离线零样本强化学习效果的重要性。

链接: https://arxiv.org/abs/2604.25496
作者: Nazim Bendib,Nicolas Perrin-Gilbert,Olivier Sigaud
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline zero-shot reinforcement learning (RL) aims to learn agents that optimize unseen reward functions without additional environment interaction. The standard approach to this problem trains task-conditioned policies by sampling task vectors that define linear reward functions over learned state representations. In most existing algorithms, these task vectors are randomly sampled, implicitly assuming this adequately captures the structure of the task space. We argue that doing so leads to suboptimal zero-shot generalization. To address this limitation, we propose extracting task vectors directly from the offline dataset and using them to define the task distribution used for policy training. We introduce a simple and general reward function extraction procedure that integrates into existing offline zero-shot RL algorithms. Across multiple benchmark environments and baselines, our approach improves zero-shot performance by an average of 20%, highlighting the importance of principled task sampling in offline zero-shot RL.
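If rewards are linear in learned state features, r = phi(s) . z, a task vector z consistent with the dataset can be recovered by ordinary least squares. A minimal 2-D sketch (the extraction procedure in the paper may differ; the normal equations are solved by hand here to stay dependency-free):

```python
# Recover a task vector z from (feature, reward) pairs via least squares.
def extract_task_vector(phis, rewards):
    # Normal equations A z = b for 2-D features, solved in closed form.
    a11 = sum(p[0] * p[0] for p in phis)
    a12 = sum(p[0] * p[1] for p in phis)
    a22 = sum(p[1] * p[1] for p in phis)
    b1 = sum(p[0] * r for p, r in zip(phis, rewards))
    b2 = sum(p[1] * r for p, r in zip(phis, rewards))
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

phis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # dataset state features
true_z = [2.0, -1.0]
rewards = [p[0] * true_z[0] + p[1] * true_z[1] for p in phis]
z = extract_task_vector(phis, rewards)
assert abs(z[0] - 2.0) < 1e-9 and abs(z[1] + 1.0) < 1e-9
```

Training on task vectors extracted this way ties the sampling distribution to tasks the dataset actually witnesses, rather than to arbitrary directions in task space.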

[AI-40] SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials

【速读】:该论文旨在解决K-12科学教育中教学材料(instructional materials)自动化评估的难题,尤其是在生成式AI(Generative AI)广泛应用于教学内容创作背景下,传统人工评审方式存在耗时、依赖专家知识且难以扩展的问题。解决方案的关键在于构建首个面向教学材料自动评估的任务框架——Automatic Instructional Materials Evaluation (AIME),并开发了首个基准数据集SciEval,其中包含基于EQuIP评价量规标注的3549条评分与证据性理由,覆盖13个教学维度。通过在该数据集上对主流大语言模型(LLMs)进行微调,特别是针对领域适配的Qwen3模型,研究发现微调可带来最高达11%的性能提升,验证了领域对齐微调对于提高教学材料评估准确性和可靠性的重要性,从而推动生成式AI在教育领域的落地应用。

链接: https://arxiv.org/abs/2604.25472
作者: Zhaohui Li,Peng He,Zhiyuan Chen,Honglu Liu,Zeyuan Wang,Tingting Li,Jinjun Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The need to evaluate instructional materials for K-12 science education has become increasingly important, as more educators use generative AI to create instructional materials. However, the review of instructional materials is time-consuming, expertise-intensive, and difficult to scale, motivating interest in automated evaluation approaches. While large language models (LLMs) have shown strong performance on general evaluation tasks, their performance and reliability on instructional materials remain unclear. To address this gap, we formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task that predicts scores and evidence using the rubric designed by the educator. We create a benchmark dataset and develop baseline models for AIME. First, we curate the first AIME dataset, SciEval, consisting of instructional materials annotated with pedagogy-aligned evaluation scores and evidence-based rationales. Expert annotations achieve high inter-rater reliability, resulting in a dataset of 273 lesson-level instructional materials evaluated across 13 criteria (N=3549) using the EQuIP rubric. Second, we test mainstream LLMs (GPT, Gemini, Llama, and Qwen) on SciEval and find that none achieve strong performance. Then we fine-tune Qwen3 on SciEval. Results on a held-out test set show that domain-aligned fine-tuning can achieve up to 11 percent performance gains, highlighting the importance of domain-specific fine-tuning for AIME and facilitating the use of LLMs in other educational tasks.

[AI-41] PI-TTA: Physics-Informed Source-Free Test-Time Adaptation for Robust Human Activity Recognition on Mobile Devices

【速读】:该论文旨在解决移动和可穿戴传感场景中无源测试时适应(source-free test-time adaptation, TTA)面临的稳定性问题,特别是在行为惯性流具有时间相关性且易受传感器旋转、放置变化及采样率漂移影响的非独立同分布(non-i.i.d.)流式数据环境下,传统视觉风格TTA目标函数易引发过自信错误、表征坍塌和灾难性遗忘。解决方案的关键在于提出PI-TTA框架,通过三个物理一致性约束稳定在线更新:重力一致性(gravity consistency)、短时程时间连续性(short-horizon temporal continuity)和频谱稳定性(spectral stability),在仅更新少量参数的前提下实现高精度、高鲁棒性的在线适应,显著提升长期序列下的准确率与物理合理性。

链接: https://arxiv.org/abs/2604.25435
作者: Changyu Li,Lu Wang,Ming Lei,Jiashen Liu,Yichen Zhang,Kaishun Wu,Fei Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:Source-free test-time adaptation (TTA) is appealing for mobile and wearable sensing because it enables on-device personalization from unlabeled test streams without centralizing private data. However, sensor-based human activity recognition (HAR) poses challenges that are less pronounced in standard vision benchmarks: behavioral inertial streams are temporally correlated and often exhibit within-session shifts caused by sensor rotation, placement change, and sampling-rate drift. Under this streaming non-i.i.d. setting, widely used vision-style TTA objectives can become unstable, leading to overconfident errors, representation collapse, and catastrophic forgetting. We propose PI-TTA, a lightweight source-free adaptation framework that stabilizes online updates through three physics-consistent constraints: gravity consistency, short-horizon temporal continuity, and spectral stability. PI-TTA updates the same small parameter subset as strong source-free baselines and incurs only modest overhead, making it suitable for on-device deployment. Experiments on USCHAD, PAMAP2, and mHealth under long-sequence stress tests and factorized shift protocols show that PI-TTA mitigates the severe degradation observed in confidence-driven baselines and preserves stable adaptation under sustained streaming conditions. It improves long-sequence accuracy by up to 9.13% and reduces physical-violation rates by 27.5%, 24.1%, and 45.4% on USCHAD, PAMAP2, and mHealth, respectively. These results demonstrate that physics-informed adaptation can improve accuracy, stability, and deployment reliability for real-world mobile sensing systems.
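Of the three constraints, gravity consistency is the easiest to sketch: the low-pass component of the accelerometer stream should stay close to 1 g in magnitude, so deviations can be penalized during test-time updates. A hedged illustration (the filter and constants below are assumptions, not the paper's exact formulation):

```python
# Gravity-consistency penalty on an accelerometer stream (illustrative).
G = 9.81  # standard gravity, m/s^2

def lowpass(stream, alpha=0.9):
    """Exponential moving average as a crude low-pass gravity estimate."""
    est = list(stream[0])
    for sample in stream[1:]:
        est = [alpha * e + (1 - alpha) * s for e, s in zip(est, sample)]
    return est

def gravity_penalty(stream):
    gx, gy, gz = lowpass(stream)
    norm = (gx * gx + gy * gy + gz * gz) ** 0.5
    return (norm - G) ** 2

still = [[0.0, 0.0, 9.81]] * 50      # device at rest: near-zero penalty
corrupted = [[0.0, 0.0, 4.0]] * 50   # physically implausible stream
assert gravity_penalty(still) < gravity_penalty(corrupted)
```

Adding such a term to the adaptation loss rules out updates whose representations imply physically impossible sensor readings, which is how physics constraints stabilize confidence-driven objectives.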

[AI-42] FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

【速读】:该论文旨在解决联邦微调(Federated Fine-tuning)在移动设备部署中因异构带宽和间歇性参与导致的上行链路通信瓶颈问题,尤其在非独立同分布(non-IID)数据场景下,传统均匀压缩策略易丢失任务关键信号,从而限制模型收敛效率。其解决方案的关键在于提出 Fed-FSTQ——一种基于 Fisher 信息引导的令牌量化系统原语,通过轻量级 Fisher 代理估计令牌敏感度,结合重要性感知的令牌选择与非均匀混合精度量化机制,在保留高信息价值 token 的高保真传输的同时抑制冗余数据传输;该方法具备模型无关性、可无缝集成至标准联邦 PEFT 流程(如 LoRA),并支持带宽异构客户端的紧凑稀疏消息打包,显著降低累积上行流量并提升端到端训练速度。

链接: https://arxiv.org/abs/2604.25421
作者: Changyu Li,Shuanghong Huang,Jiashen Liu,Ming Lei,Jidu Xing,Kaishun Wu,Lu Wang,Fei Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 15 figures

点击查看摘要

Abstract:Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
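下面的 Python 草图示意"Fisher 代理分数 → 重要性感知令牌选择 → 混合精度位宽分配"这一流程(keep_ratio 与 8/4 位宽均为笔者假设的演示值,非论文原配置):

```python
def allocate_bits(fisher_scores, keep_ratio=0.5, high_bits=8, low_bits=4):
    """按 Fisher 代理分数分配每个令牌的传输位宽:
    分数最高的 keep_ratio 比例令牌用高位宽, 其余用低位宽,
    分数为零的令牌直接不传输(返回位宽 0)."""
    n = len(fisher_scores)
    order = sorted(range(n), key=lambda i: -fisher_scores[i])
    k = max(1, int(n * keep_ratio))
    bits = [0] * n  # 0 表示该令牌不参与上行传输
    for rank, i in enumerate(order):
        if fisher_scores[i] <= 0:
            continue
        bits[i] = high_bits if rank < k else low_bits
    return bits
```

真实实现中,Fisher 代理分数通常来自损失对令牌嵌入梯度的平方近似,并需配合稀疏消息打包以适配异构带宽客户端。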

[AI-43] JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

【速读】:该论文旨在解决生成式 AI(Generative AI)在强化学习中依赖人工标注奖励或精心设计的奖励规范所带来的高成本与不稳定性问题,特别是在机器可验证领域中,现有无标签替代方法(如多数投票或LLM作为评判者)可能引入假阳性结果,导致训练过程不稳定。其解决方案的关键在于提出JURY-RL框架,通过解耦答案生成与奖励判定两个阶段:首先由模型回放生成候选答案,再由形式化验证器(如Lean)判断该候选是否可被证实;若验证成功,则仅奖励匹配多数票的答案;若验证不确定,则启用ResZero机制——一种零均值、方差保持的退避奖励策略,将未验证的多数提案置零,并在剩余答案间重新分配奖励信号,从而避免对不可验证共识的错误强化,确保优化梯度稳定。此设计使模型在数学推理任务上显著优于其他无标签基线,并实现与监督训练相当甚至更优的泛化能力。

链接: https://arxiv.org/abs/2604.25419
作者: Xinjie Chen,Biao Fu,Jing Wu,Guoxin Chen,Xinggao Liu,Dayiheng Liu,Minpeng Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. 32 pages, 9 figures

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
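"多数票提议、验证器裁决、ResZero 回退"的奖励分配逻辑可以写成如下 Python 草图(ResZero 此处只保证零均值性质;论文中方差保持的具体归一化方式未在摘要中给出,故从略):

```python
from collections import Counter

def jury_reward(answers, verified):
    """answers: 各 rollout 的最终答案; verified: 多数票答案是否
    被形式化验证器(如 Lean)证实. 返回每个 rollout 的奖励."""
    plurality, _ = Counter(answers).most_common(1)[0]
    if verified:
        # 验证通过: 只有匹配多数票答案的 rollout 得到正奖励
        return [1.0 if a == plurality else 0.0 for a in answers]
    # ResZero 回退: 丢弃未验证的多数票提议, 在其余答案之间分配零均值信号
    residual = [i for i, a in enumerate(answers) if a != plurality]
    rewards = [0.0] * len(answers)
    if len(residual) < 2:
        return rewards
    counts = Counter(answers[i] for i in residual)
    raw = [counts[answers[i]] for i in residual]  # 以残余共识度为原始信号
    mean = sum(raw) / len(raw)
    for i, r in zip(residual, raw):
        rewards[i] = r - mean  # 中心化后总和为零, 不强化不可验证的共识
    return rewards
```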

[AI-44] ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

【速读】:该论文旨在解决多模态情感识别中因个体表达差异导致的识别准确率下降问题,尤其在多轮对话场景下,不同说话者对同一情绪(如“快乐”)可能表现出显著不同的面部、语音或行为特征,而现有静态模型难以适应这种多样性。其解决方案的关键在于提出一种多层级说话人自适应网络(Multi-Level Speaker Adaptive Network, ML-SAN),通过三个阶段的自适应机制实现对说话人身份信息的有效解耦与利用:首先在输入层使用特征级线性调制(Feature-Level Linear Modulation, FiLM)将原始音频和视觉特征映射至与说话人无关的中性空间;其次在交互层引入基于说话人身份的门控机制动态调整各模态的信任权重;最后在输出层通过正则化保持潜在空间中的说话人特征一致性,从而提升模型对真实世界多样化说话者的泛化能力。

链接: https://arxiv.org/abs/2604.25383
作者: Kexue Wang,Yinfeng Yu,Liejun Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Main paper (12 pages). Accepted for publication by International Conference on Intelligent Computing 2026

点击查看摘要

Abstract:To establish empathy with machines, it is essential to fully understand human emotional changes. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, which means that different people may express emotions differently. In our daily lives, we can see this. When communicating with different people, some express “happiness” through their facial expressions and words, while others may hide their happiness or express it through their actions. Both are expressions of ‘happiness,’ but such differences in emotional expression are still too difficult for machines to distinguish. Current emotion recognition remains at a ‘static’ level, using a single recognition model to identify all emotional styles. This “simplification” often affects the recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker Adaptive Network (ML-SAN), which, specifically, effectively addresses the challenge of speaker identity information confusion. ML-SAN does not simply assign a speaker’s ID after recognition; instead, it employs a three-stage adaptive process: First, Input-level Calibration uses Feature-Level Linear Modulation (FiLM) to adjust the raw audio and visual features into a neutral space unrelated to the speaker. Then, Interaction-level Gating re-adjusts the trust level for each modality (e.g., voice or facial features) based on the speaker’s identity information. Finally, Output-level Regularization maintains the consistency of speaker features in the latent space. Tests on the MELD and IEMOCAP datasets show that our model (ML-SAN) achieves better results, performs exceptionally well in handling challenging tail sentiment categories, and better addresses the diversity of speakers in real-world scenarios.
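摘要提到的 FiLM(特征级线性调制)本质上是由说话人嵌入 s 生成逐维 (gamma, beta),对特征 x 做仿射变换 y = gamma * x + beta。下面是一个纯标准库的 Python 示意(权重随机初始化、无偏置,仅演示数据流,并非论文原实现):

```python
import random

class FiLM:
    """由说话人嵌入生成逐维 (gamma, beta) 的特征级线性调制."""
    def __init__(self, feat_dim, spk_dim, seed=0):
        rng = random.Random(seed)
        # 两个线性映射 spk_dim -> feat_dim (无偏置, 简化)
        self.Wg = [[rng.gauss(0, 0.1) for _ in range(spk_dim)]
                   for _ in range(feat_dim)]
        self.Wb = [[rng.gauss(0, 0.1) for _ in range(spk_dim)]
                   for _ in range(feat_dim)]

    @staticmethod
    def _matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]

    def __call__(self, x, s):
        gamma = [1.0 + g for g in self._matvec(self.Wg, s)]  # 围绕恒等初始化
        beta = self._matvec(self.Wb, s)
        return [g * xi + b for g, xi, b in zip(gamma, x, beta)]

film = FiLM(feat_dim=4, spk_dim=3)
y = film([1.0, 2.0, 3.0, 4.0], [0.5, -0.2, 0.1])  # 说话人相关的调制结果
```

在此草图中,说话人嵌入为零向量时退化为恒等映射,这对应"将特征映射至与说话人无关的中性空间"这一设计思路。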

[AI-45] Safe-Support Q-Learning: Learning without Unsafe Exploration

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在现实应用中因探索过程可能导致不安全状态访问而引发的安全问题。现有安全强化学习方法通常通过约束或惩罚机制降低风险,但仍允许在训练过程中探索不安全状态,无法满足严格的安全要求。为解决此问题,论文提出一种基于Q-learning的安全强化学习框架,其关键在于引入一个定义在安全集(safe set)上的行为策略(behavior policy),确保训练期间所有轨迹始终停留在安全区域内,从而彻底消除不安全状态的访问。该框架采用两阶段设计:首先利用KL正则化的贝尔曼目标(KL-regularized Bellman target)约束Q函数贴近行为策略,随后从训练得到的Q值中推导出策略,并通过参数化策略提取方法近似最优策略,实现对不同动作空间和行为策略类型的统一适配。实验表明,该方法能实现稳定的学习、校准良好的价值估计,并在安全性上优于或等同于现有基线方法。

链接: https://arxiv.org/abs/2604.25379
作者: Yeeun Lim,Narim Jeong,Donghwan Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

Abstract:Ensuring safety during reinforcement learning (RL) training is critical in real-world applications where unsafe exploration can lead to devastating outcomes. While most safe RL methods mitigate risk through constraints or penalization, they still allow exploration of unsafe states during training. In this work, we adopt a stricter safety requirement that eliminates unsafe state visitation during training. To achieve this goal, we propose a Q-learning-based safe RL framework that leverages a behavior policy supported on a safe set. Under the assumption that the induced trajectories remain within the safe set, this policy enables sufficient exploration within the safe region without requiring near-optimality. We adopt a two-stage framework in which the Q-function and policy are trained separately. Specifically, we introduce a KL-regularized Bellman target that constrains the Q-function to remain close to the behavior policy. We then derive the policy induced from the trained Q-values and propose a parametric policy extraction method to approximate the optimal policy. Our approach provides a unified framework that can be adapted to different action spaces and types of behavior policies. Experimental results demonstrate that the proposed method achieves stable learning and well-calibrated value estimates and yields safer behavior with comparable or better performance than existing baselines.
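KL 正则化贝尔曼目标的一种常见写法,是对下一状态做行为策略加权的 log-sum-exp 软值(下式为示意性重构,并非论文原式);由于行为策略只在安全动作集合上有支撑,目标值只会聚合安全动作的 Q 值:

```python
import math

def kl_regularized_target(r, gamma, q_next, mu_next, alpha=1.0):
    """y = r + gamma * alpha * log( sum_a mu(a|s') * exp(Q(s',a)/alpha) ).
    mu_next: 行为策略在下一状态的动作分布, 仅在安全动作上非零."""
    # 数值稳定的 log-sum-exp, 只遍历 mu > 0 的(安全)动作
    m = max(q / alpha for q, p in zip(q_next, mu_next) if p > 0)
    s = sum(p * math.exp(q / alpha - m)
            for q, p in zip(q_next, mu_next) if p > 0)
    v_next = alpha * (m + math.log(s))
    return r + gamma * v_next
```

alpha 越小,软值越接近安全动作上的最大 Q 值;alpha 越大,目标越贴近行为策略本身。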

[AI-46] Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control

【速读】:该论文旨在解决连续多任务强化学习(Continuous Multi-Task Reinforcement Learning, MTRL)环境中算法的泛化能力与可解释性问题。传统遗传编程(Genetic Programming, GP)方法在处理连续控制任务时表现有限,且缺乏对决策逻辑的透明度。解决方案的关键在于提出一种基于多动作遗传编程(Multi-Action TPG, MATPG)的新框架,其通过聚合多个MAPLE代理并构建可控的执行流程来实现任务切换;实验表明,在MuJoCo Half Cheetah新基准上结合词典选择(lexicase selection)策略后,MATPG在多任务性能上显著优于基线方法,并展现出完全可解释的决策流结构。

链接: https://arxiv.org/abs/2604.25369
作者: Quentin Vacher(IETR),Nicolas Beuve(IETR),Mickaël Dardaillon(IETR),Karol Desnos(IETR)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a great example, as it involves developing specific behaviours for specific tasks. To further challenge algorithms, Multi-Task RL (MTRL) environments have been introduced, requiring a single model to learn multiple behaviors. The Tangled Program Graph (TPG) algorithm is a Genetic Programming (GP) algorithm designed for discrete MTRL environments. Recently, the MAPLE algorithm has been proposed, as another GP algorithm that achieves high results in single task continuous RL environments. A variation of the TPG is proposed alongside MAPLE, named Multi-Action TPG (MATPG) that aggregates MAPLE agents, and creates a control flow to activate them. Initially tested on single task RL environments only, MATPG achieved similar results to MAPLE. In this work, we present a new benchmark based on the MuJoCo Half Cheetah from Gymnasium. This benchmark features five distinct obstacles that are randomly positioned in front of the agent, each of which demands a unique behavior. This benchmark serves as a use case for MATPG, to prove its ability as a GP solution for continuous MTRL environments. Our experiments demonstrate its superiority in this multi-task use case when combined with lexicase selection. Furthermore, we examine the interpretability of the evolved graph, revealing that the decision flow of the model is fully interpretable.

[AI-47] GraphPL: Leveraging GNN for Efficient and Robust Modalities Imputation in Patchwork Learning ICASSP2026

【速读】:该论文旨在解决分布式多模态学习中客户端无法获取全部模态信息的问题,即在实际场景中,不同客户端可能仅能访问部分模态数据(patchwork learning),而现有方法未能充分利用所有可观测模态,导致性能受限。其解决方案的关键在于提出GraphPL,该方法结合图神经网络(Graph Neural Networks, GNNs)与补全学习机制,通过构建模态间的结构化关系图来灵活融合所有可用模态,并具备对噪声输入的鲁棒性,从而实现更有效的无监督缺失模态补全,提升下游任务如疾病预测的性能。

链接: https://arxiv.org/abs/2604.25352
作者: Xingjian Hu,Zuoyu Yan,Jianhua Zhu,Liangcai Gao,Fei Wang,Tengfei Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICASSP 2026. This is a preprint of the work

点击查看摘要

Abstract:Current research on distributed multi-modal learning typically assumes that clients can access complete information across all modalities, which may not hold in practice. In this paper, we explore patchwork learning, in which the modalities available to different clients vary, and the objective is to impute the missing modalities for each client in an unsupervised manner. Existing methods are shown not to fully utilize the modality information as they tend to rely on only a subset of the observed modalities. To address this issue, we propose GraphPL, which combines graph neural networks with patchwork learning to flexibly integrate all observed modalities and remains robust with noisy inputs. Experimental results show that GraphPL achieves SOTA performance on benchmark datasets. Our results on real-world distributed electronic health record dataset show GraphPL learns strong downstream features and enables tasks like disease prediction via superior modality imputation.

[AI-48] A Faceted Proposal for Transparent Attribution of AI-Assisted Text Production

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)在文本生成过程中缺乏透明度与可追溯性的问题,尤其是在学术写作中,AI的使用往往仅被简单声明,而未明确其介入方式、位置及后续审查机制。解决方案的关键在于提出一个分层的、可扩展的“多维模型”(faceted model),该模型以文档、章节、节和段落为粒度,通过核心三要素——形式(Form)、生成(Generation)和评估(Evaluation)——构建基础框架,并进一步引入意图(Intent)、控制(Control)和可追溯性(Traceability)三个维度形成扩展模型,从而实现对AI辅助文本生产的结构化描述与操作化记录,为未来高保真度的AI协作写作提供标准化表示基础。

链接: https://arxiv.org/abs/2604.25346
作者: Geraldo Xexéo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 39 pages, 3 figures

点击查看摘要

Abstract:Artificial intelligence systems are increasingly integrated into writing processes, challenging traditional notions of authorship, responsibility, and intellectual contribution. Current disclosure practices usually indicate whether AI was used, but rarely explain how it was used, where it intervened, or how its output was reviewed. This paper proposes a faceted model for representing AI-assisted text production at the levels of documents, chapters, sections, and paragraphs. The proposal introduces a core model based on Form, Generation, and Evaluation, and an extended model that adds Intent, Control, and Traceability. The model is positioned as a minimal operational baseline with extensibility toward higher-fidelity representations. A worked example based on the production of this article demonstrates applicability.

[AI-49] Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在科学工作流中行为不可靠的问题,尤其是在真实场景下其推理能力和结果一致性缺乏系统评估。研究通过在两类工作流范式(One-Shot 和 Deep Research)下对 CMBAgent 进行十八项天体物理任务测试,发现尽管模型在明确指定的任务上表现良好,但在挑战推理极限的任务中常出现“无声失败”(silent failures)——即生成语法正确但物理不一致或数值错误的结果,且缺乏自我诊断能力。解决方案的关键在于构建一个可复现的评估框架,用于系统性分析科学 AI 代理的可靠性,从而识别并缓解这类隐蔽性错误,提升其在科研应用中的可信度。

链接: https://arxiv.org/abs/2604.25345
作者: Shivam Rawat,Lucie Flek
机构: 未知
类目: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注:

点击查看摘要

Abstract:Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately ~6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation - syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure, but confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.

[AI-50] VAE-Inf: A statistically interpretable generative paradigm for imbalanced classification

【速读】:该论文旨在解决**类别不平衡分类(Imbalanced Classification)**问题,尤其针对少数类样本极度稀缺导致判别边界不稳定、误差控制不可靠的极端场景。其解决方案的关键在于提出一个两阶段框架VAE-Inf:第一阶段利用仅包含多数类数据训练变分自编码器(Variational Autoencoder, VAE),通过Wasserstein均值聚合潜空间后验分布构建全局高斯参考模型,从而获得几何上合理的多数类基准;第二阶段基于此生成基础,通过引入一种分布感知损失函数微调编码器,使少数类样本在方差归一化投影统计量下实现概率分离,并在推理时采用基于投影的得分指标,支持无需参数假设的零样本Type-I错误率(假阳性率)精确控制,实现了统计可解释且稳健的分类决策。

链接: https://arxiv.org/abs/2604.25334
作者: Hongfei Wu,Ruijian Han,Yancheng Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imbalanced classification remains a pervasive challenge in machine learning, particularly when minority samples are too scarce to provide a robust discriminative boundary. In such extreme scenarios, conventional models often suffer from unstable decision boundaries and a lack of reliable error control. To bridge the gap between generative modeling and discriminative classification, we propose a two-stage framework VAE-Inf that integrates deep representation learning with statistically interpretable hypothesis testing. In the first stage, we adopt a one-class modeling perspective by training a variational autoencoder (VAE) exclusively on majority-class data to capture the underlying reference distribution. The resulting latent posteriors are aggregated via a Wasserstein barycenter to construct a global Gaussian reference model, providing a geometrically principled baseline for the majority class. In the second stage, we transform this generative foundation into a discriminative classifier by fine-tuning the encoder with limited minority samples. This is achieved through a novel distribution-aware loss that enforces probabilistic separation between classes based on variance-normalized projection statistics. For inference, we introduce a projection-based score that admits a natural hypothesis testing interpretation, allowing for a distribution-free calibration procedure. This approach yields exact finite-sample control of the Type-I error (false positive rate) without relying on restrictive parametric assumptions. Extensive experiments on diverse real-world benchmarks demonstrate that our framework achieves competitive performance against other approaches. The codes are available upon request.
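摘要所述"无分布假设的有限样本 Type-I 错误控制",通常可借助 conformal 式的次序统计量阈值实现。以下 Python 草图演示这一校准思路(分数方向、函数名均为笔者假设,非论文原式):

```python
import math

def calibrate_threshold(null_scores, alpha=0.05):
    """在仅含多数类的校准分数上取第 ceil((n+1)*(1-alpha)) 小的
    次序统计量作为阈值, 可保证有限样本下假阳性率不超过 alpha
    (假设校准样本与测试样本可交换)."""
    n = len(null_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # 校准样本太少, 无法在 alpha 水平下拒绝
    return sorted(null_scores)[k - 1]

def classify(score, threshold):
    # 此处假设分数越大越像少数类; 超过阈值即拒绝"属于多数类"的原假设
    return score > threshold
```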

[AI-51] AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices

【速读】:该论文旨在解决移动单NPU-PIM(Processing-In-Memory)系统中自适应推测解码(Adaptive Speculative Decoding)因传统操作级同步执行导致的空闲开销以及异步执行中因草稿长度波动引发的计算浪费问题。解决方案的关键在于提出AHASD架构,其核心创新包括:通过任务级DLM-TLM解耦实现PIM侧并行草稿生成与单NPU侧验证;引入基于熵-历史感知的草稿控制(Entropy-History-Aware Drafting Control)和时间感知预验证控制(Time-Aware Pre-Verification Control),动态调节自适应草稿算法执行与预验证时机,抑制低置信度草稿带来的无效计算;同时在LPDDR5-PIM内集成注意力算法单元(Attention Algorithm Units)与门控任务调度单元(Gated Task Scheduling Units),支持注意力链接定位与亚微秒级任务切换,从而显著提升吞吐量和能效比。

链接: https://arxiv.org/abs/2604.25326
作者: Ma Zirui,Fan Zhihua,Li Wenxing,Wu Haibin,Zhang Fulin,Ye Xiaochun,Li Wenming
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 7 pages, 9 figures, accepted by DAC 2026, repo: this https URL

点击查看摘要

Abstract:Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single-NPU-PIM system faces idle overhead in traditional operator-level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves parallel drafting on the PIM and verification on a single NPU through task-level DLM-TLM decoupling and specifically, it incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage adaptive drafting algorithm execution and pre-verification timing, suppressing invalid drafting based on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results for different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2x in throughput and 5.6x in energy efficiency improvements over a GPU-only baseline, and 1.5x in throughput and 1.24x in energy efficiency gains over the state-of-the-art GPU+PIM baseline, with hardware overhead below 3% of the DRAM area.

[AI-52] QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

【速读】:该论文旨在解决生成式 AI (Generative AI) 中 FlashAttention 在实现全整数(integer-only)量化时面临的三大挑战:(1)分块累加过程中尺度爆炸问题,(2)GPU 上基于移位的指数运算效率低下,(3)量化粒度约束导致整数比较需统一缩放因子。其解决方案的关键在于提出 QFlash,一种端到端整数域 FlashAttention 设计,通过在整数域内完成 softmax 计算,并以单一 Triton 内核高效执行,从而实现无精度损失的加速与节能效果,在 ViT/DeiT 上保持 Top-1 准确率,在 Swin 上也保持竞争力,同时相较 FP16 FlashAttention 节能 18.8%,并在多个注意力任务中实现最高达 8.69 倍的性能提升。

链接: https://arxiv.org/abs/2604.25306
作者: Sehyeon Oh,Yongin Kwon,Jemin Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:FlashAttention improves efficiency through tiling, but its online softmax still relies on floating-point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer-only FlashAttention: (1) scale explosion during tile-wise accumulation, (2) inefficient shift-based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose QFlash, an end-to-end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73x speedup over I-ViT and up to 8.69x speedup on Swin, while reducing energy consumption by 18.8% compared to FP16 FlashAttention, without sacrificing Top-1 accuracy on ViT/DeiT and remaining competitive on Swin under per-tensor quantization. Our code is publicly available at this https URL.

[AI-53] Dynamic UGV-UAV Cooperative Path Planning in Uncertain Environments ICRA

【速读】:该论文旨在解决动态无人地面车辆-无人机协同路径规划(Dynamic UGV-UAV Cooperative Path Planning, DUCPP)问题,即在部分未知道路网络中,如何通过一个或多个无人机(UAV)动态探测并识别不可通行边,从而协助无人地面车辆(UGV)安全高效地抵达目标位置。其核心解决方案在于提出多种协同策略,尤其是双向策略(bidirectional strategy),以优化UGV与UAV之间的协作机制,使UGV能够在不确定环境中实时更新可行路径;同时研究多无人机部署对UGV旅行时间的影响,结果表明增加UAV数量可进一步缩短UGV路径规划时间,但伴随计算开销的上升。该框架为复杂和不确定环境下UGV-UAV协同导航提供了实用且鲁棒的路径规划方案。

链接: https://arxiv.org/abs/2604.25267
作者: Ninh Nguyen,Srinivas Akella
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

点击查看摘要

Abstract:This paper addresses the Dynamic UGV-UAV Cooperative Path Planning (DUCPP) problem involving one unmanned ground vehicle (UGV) assisted by one or more unmanned aerial vehicles (UAVs) operating on an uncertain road network with potentially impassable edges. DUCPP is particularly relevant for scenarios such as disaster response, emergency supply transport, and rescue operations, where a UGV must reach a specified destination in the presence of partially unknown road conditions. To enable the UGV to travel safely and efficiently to its destination, the UAV(s) dynamically inspect edges in the environment to identify and prune damaged or impassable edges from consideration. We present multiple strategies, including a bidirectional approach, to optimize UGV-UAV cooperation for finding a safe path in an uncertain road network. Furthermore, we explore the impact of using multiple UAVs on reducing the UGV's travel time, and evaluate the associated computation time. The proposed strategies are implemented and evaluated on 100 urban road networks. The results demonstrate that the bidirectional strategy achieves the best performance in most instances, and using multiple UAVs further reduces UGV travel time at the expense of increased computation time. This paper presents a robust framework for DUCPP to achieve efficient UGV-UAV cooperation for path planning and inspection, offering practical solutions for navigation in challenging and uncertain conditions.
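"UAV 探测识别不可通行边 → UGV 在剩余路网上重规划"这一核心流程,可用下面的简化 Dijkstra 草图说明(此处把 UAV 的探测结果简化为一次性给定的不可通行边集合,论文中的双向动态策略未在此实现):

```python
import heapq

def shortest_safe_path(n, edges, blocked, src, dst):
    """在剪掉 blocked 中不可通行边之后的无向图上,
    用 Dijkstra 求 UGV 从 src 到 dst 的最短安全路径长度."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        if (u, v) in blocked or (v, u) in blocked:
            continue  # 剪掉被 UAV 判定为损毁的边
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = [float("inf")] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        if u == dst:
            return d
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")  # 目标不可达
```

动态版本中,UAV 每确认一条边的通行状态就更新 blocked 并触发重规划,UGV 则沿当前最优路径前进。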

[AI-54] AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

【速读】:该论文旨在解决当前AI代理在自主科学研究中,特别是在科学文献发现任务上的能力评估问题。现有基准测试多聚焦于通用网页浏览任务,缺乏对科研场景下深度理解、细粒度信息利用及开放式搜索策略的考量。为此,作者提出了AutoResearchBench,一个面向自主科学文献发现的专用基准,其关键在于设计两类互补任务:Deep Research(深度研究)要求通过多步推理精准定位目标论文,Wide Research(广度研究)则需全面收集满足条件的文献集合。该基准在三个维度上区别于以往工作——研究导向性(requiring in-depth comprehension of scientific concepts)、文献聚焦性(demanding fine-grained utilization of detailed information)和开放性(involving an unknown number of qualified papers),从而更真实地反映AI代理在自主科研中的核心能力,且极具挑战性。

链接: https://arxiv.org/abs/2604.25256
作者: Lei Xiong,Kun Luo,Ziyi Xia,Wenbo Zhang,Jin-Ge Yao,Zheng Liu,Jingying Shao,Jianlyu Chen,Hongjin Qian,Xi Yang,Qian Yu,Hao Li,Chen Yue,Xiaan Du,Yuyang Wang,Yesheng Liu,Haiyu Xu,Zhicheng Dou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous scientific research is significantly advanced thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents’ capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset and evaluation pipeline to facilitate future research in this direction. We publicly release the dataset, evaluation pipeline, and code at this https URL.
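Wide Research 用 IoU 度量预测论文集合与标准集合的重合度;按交并比的通常定义可写作如下(以论文 ID 集合计算,具体实现细节以论文为准):

```python
def paper_set_iou(predicted, gold):
    """预测集合与标准集合的交并比; 两者皆空时约定为 1.0."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)
```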

[AI-55] ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

【速读】:该论文旨在解决长期投资决策中“预实现评估”(pre-realization evaluation)难题:即实际收益作为投资质量的最终评判标准,但其延迟性与噪声干扰使得无法有效指导模型开发和治理决策。针对生成式 AI (Generative AI) 在金融场景下通过大语言模型(LLM)对投资理由进行判别时可能存在的偏差问题(如奖励冗长、自信表达或格式模仿而非真实财务判断),作者提出 ValueAlpha——一个基于预注册协议的共识门控压力测试框架。其核心创新在于引入多维一致性阈值(aggregate agreement gate 和 per-dimension gate)与对抗性控制机制,以识别并过滤掉不可靠的投资理由声称,确保仅在具备足够稳定性、共识性和抗污染能力的前提下才允许报告 LLM 判决结果。此方案并非用于衡量真实投资技能或构建排行榜,而是一种面向 AI-金融评估的前校准计量层(pre-calibration metrology layer)。

链接: https://arxiv.org/abs/2604.25224
作者: Sidi Chang,Peiying Zhu,Yuxiao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注: 9 pages, Submitted to IEEE Computational Intelligence in Financial Engineering and Economics (CIFEr) 2026, Tokyo, Japan

点击查看摘要

Abstract:Long-horizon investment decisions create a pre-realization evaluation problem: realized returns are the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting substitute for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueAlpha, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories, 5,500 judge calls), ValueAlpha clears the aggregate agreement gate at (\bar\kappa_w = 0.7168) but prevents several overclaims. Lower-rank systems collapse into a tie-class, one rubric dimension fails the per-dimension gate (constraint_awareness, (\bar\kappa_w = 0.2022)), single-judge rankings are family-dependent, and terse-correct rationales receive a (\Delta = -2.81) rubric-point penalty relative to honest rationales. A targeted anchor-specificity probe further shows that financial constructs such as constraint awareness are operationally load-bearing. The contribution is therefore not a leaderboard and not a claim to measure true investment skill. ValueAlpha is a pre-calibration metrology layer for AI-finance evaluation: it determines whether a proposed LLM-judge-based investment-rationale claim is stable enough, agreed enough, and uncontaminated enough to be reported at all.

[AI-56] DATAREEL: Automated Data-Driven Video Story Generation with Animations

【速读】:该论文旨在解决自动化生成数据驱动视频故事(data-driven video storytelling)的挑战,即如何有效协调可视化编码、时间推进和叙述内容,并减少对专业可视化设计、动画制作和视频编辑技能的依赖。其关键解决方案是提出一个名为DataReel的基准数据集,包含328个真实世界的数据故事,每个故事均配有结构化数据、图表可视化和同步叙述文本,从而为模型评估提供标准化依据;同时设计了一个多智能体框架,将任务分解为规划、生成与验证三个阶段,模拟人类叙事流程,在自动与人工评估中均优于直接提示基线方法,但仍未完全解决动画、叙述与视觉焦点之间的协同问题。

链接: https://arxiv.org/abs/2604.25220
作者: Ridwan Mahbub,Syem Aziz,Mahir Ahmed,Shadikur Rahman,Mizanur Rahman,Shafiq Joty,Enamul Hoque
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Data videos are a powerful medium for visual data based storytelling, combining animated, chart-centric visualizations with synchronized narration. Widely used in journalism, education, and public communication, they help audiences understand complex data through clear and engaging visual explanations. Despite their growing impact, generating data-driven video stories remains challenging, as it requires careful coordination of visual encoding, temporal progression, and narration and substantial expertise in visualization design, animation, and video-editing tools. Recent advances in large language models offer new opportunities to automate this process; however, there is currently no benchmark for rigorously evaluating models on animated visualization-based video storytelling. To address this gap, we introduce DataReel, a benchmark for automated data-driven video story generation comprising 328 real-world stories. Each story pairs structured data, a chart visualization, and a narration transcript, enabling systematic evaluation of models’ abilities to generate animated data video stories. We further propose a multi-agent framework that decomposes the task into planning, generation, and verification stages, mirroring key aspects of the human storytelling process. Experiments show that this multi-agent approach outperforms direct prompting baselines under both automatic and human evaluations, while revealing persistent challenges in coordinating animation, narration, and visual emphasis. We release DataReel at this https URL.

[AI-57] DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale

Quick read: This paper tackles the global-topology distortion introduced when existing dimensionality reduction methods such as UMAP and t-SNE visualize high-dimensional data: their local-neighborhood objectives over-memorize sampling noise and invent spurious cycles and disconnected islands. The key to the solution is a topology-faithfulness benchmark built on manifolds with known homology; tuning DiRe (Dimensionality Reduction with Explicit Topology Preservation) against this benchmark lets it match GPU-accelerated UMAP on classification while recovering exact first Betti numbers, and preserve substantially more topological structure (3-4x that of UMAP) on a large-scale embedding of arXiv papers.

Link: https://arxiv.org/abs/2604.25209
Authors: Alexander Kolpakov, Igor Rivin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Social and Information Networks (cs.SI)
Comments: 5 pages, 4 figures; GitHub repositories ( this https URL ) ( this https URL;) HuggingFace dataset ( this https URL )

Click to view abstract

Abstract:Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top-performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology-faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto-optimal configurations that match or beat GPU-accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3-4 times more topological structure than UMAP at comparable wall-clock.
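The "first Betti number" the abstract refers to counts independent cycles; on a graph skeleton it reduces to the Euler-characteristic identity b1 = E - V + C (edges minus vertices plus connected components). A minimal sketch of that check, not the authors' benchmark code, using a clean circle sample (noise omitted so the result is deterministic):

```python
import math

def first_betti(points, eps):
    """First Betti number of the eps-neighborhood graph: b1 = E - V + C."""
    n = len(points)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if math.dist(points[i], points[j]) < eps]
    parent = list(range(n))  # union-find, used to count components C
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    components = len({find(i) for i in range(n)})
    return len(edges) - n + components

# sample of a unit circle: a faithful embedding should preserve b1 = 1
circle = [(math.cos(2 * math.pi * k / 60), math.sin(2 * math.pi * k / 60))
          for k in range(60)]
```

With `eps = 0.15`, each point connects only to its two angular neighbors, so the graph is a single 60-cycle and `first_betti(circle, 0.15)` returns 1; an embedding that tears the circle or invents extra loops would change this count.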

[AI-58] Making AI-Assisted Grant Evaluation Auditable without Exposing the Model

Quick read: This paper addresses a governance problem faced by public agencies using large language models (LLMs) for grant evaluation: how to keep the review process auditable, contestable, and accountable without exposing model weights, scoring rubrics, or intermediate reasoning. The key to the solution is a trusted execution environment (TEE)-based architecture that uses remote attestation to produce an "attested evaluation bundle": a signed, timestamped record linking the hash of the original submission, the hash of the canonical input, measurements of the model and rubric, and the final evaluation output. The design also adds a document canonicalization and sanitization layer to mitigate prompt-injection attacks hidden in applicant-controlled documents, making key parts of the evaluation process externally verifiable, although it does not guarantee that the evaluation itself is fair or scientifically correct.

Link: https://arxiv.org/abs/2604.25200
Authors: Kemal Bicakci
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 12 pages, 2 figures

Click to view abstract

Abstract:Public agencies are beginning to consider large language models (LLMs) as decision-support tools for grant evaluation. This creates a practical governance problem: the model and scoring rubric should not be exposed in a way that allows applicants to optimize against them, yet the evaluation process must remain auditable, contestable, and accountable. We propose a TEE-based architecture that helps reconcile these requirements through remote attestation. The architecture allows an external verifier to check which model, rubric, prompt template, and input representation were used, without exposing model weights, proprietary scoring logic, or intermediate reasoning to applicants or infrastructure operators. The main artifact is an attested evaluation bundle: a signed, timestamped record linking the original submission hash, the canonical input hash, the model-and-rubric measurement, and the evaluation output. The paper also considers a scenario-specific prompt injection risk: applicant-controlled documents may contain hidden or indirect instructions intended to influence the LLM evaluator. We therefore include a canonicalization and sanitization layer that normalizes document representations and records suspicious transformations before inference. We position the design relative to confidential AI inference, attestable AI audits, zero-knowledge machine learning, algorithmic accountability, and AI-assisted peer review. The resulting claim is deliberately narrow: remote attestation does not prove that an evaluation is fair or scientifically correct, but it can make part of the evaluation process externally verifiable. 

[AI-59] How Can Reinforcement Learning Achieve Expert-level Placement?

Quick read: This paper addresses the gap between reinforcement learning (RL)-based chip placement methods and expert-level layouts. Existing methods focus mainly on wirelength optimization and ignore the intricate decision logic implicit in expert placements, leaving a significant performance gap. The key to the solution is to abandon hand-crafted reward design: instead, step-by-step expert trajectories are inferred directly from final expert layouts, and these trajectories are used as demonstrations or preferences to train a model that captures the experts' implicit rewards, enabling efficient learning from few (even single) designs and good generalization to unseen cases.

Link: https://arxiv.org/abs/2604.25191
Authors: Ruo-Tong Chen, Ke Xue, Chengrui Gao, Yunqi Shi, Tian Xu, Peng Xie, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: DAC 2026

Click to view abstract

Abstract:Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause for the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.
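Learning a reward model from expert preferences, as the abstract describes, is commonly done with a Bradley-Terry objective: maximize the log-probability that the preferred trajectory scores higher. A toy stand-in (linear reward over hypothetical placement features, not the paper's actual model) fit by plain gradient ascent:

```python
import math

def fit_preference_reward(pairs, features, lr=0.5, steps=200):
    """Fit a linear reward r(s) = w . phi(s) from (winner, loser) preference
    pairs by maximizing log sigma(r(winner) - r(loser)) (Bradley-Terry)."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(steps):
        for win, lose in pairs:
            diff = [a - b for a, b in zip(features[win], features[lose])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            scale = 1.0 / (1.0 + math.exp(margin))  # sigma(-margin)
            w = [wi + lr * scale * di for wi, di in zip(w, diff)]
    return w

# hypothetical placements described by (negated wirelength, congestion);
# the "expert" consistently prefers lower wirelength
feats = {"p1": [1.0, 0.2], "p2": [0.2, 0.2], "p3": [0.6, 0.2]}
w = fit_preference_reward([("p1", "p2"), ("p3", "p2"), ("p1", "p3")], feats)

def reward(p):
    return sum(wi * fi for wi, fi in zip(w, feats[p]))
```

After fitting, the learned reward ranks the placements consistently with the preferences (`p1 > p3 > p2`), even though no scalar reward was ever observed directly.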

[AI-60] From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

Quick read: This paper addresses a gap left by mechanistic interpretability tools such as sparse autoencoders (SAEs): although they can identify meaningful features inside large language models (LLMs), these insights rarely translate into practical training strategies. The key to the solution is Interpretability-Guided Data Selection (IGDS), a framework that first identifies causal task features via frequency recall and interventional filtering, then selects "feature-resonant data" that maximally activates these features for fine-tuning. Experiments on mathematical reasoning, summarization, and translation show strong data efficiency: on Gemma-2-2B, for example, IGDS surpasses full-dataset fine-tuning by 17.4% while using only 50% of the data, confirming a strong positive correlation between amplifying internal features and improving task performance.

Link: https://arxiv.org/abs/2604.25167
Authors: Ling Shi, Xinwei Wu, Xiaohu Zhao, Hao Wang, Heng Liu, Yangyang Liu, Linlong Xu, Longyue Wang, Deyi Xiong, Weihua Luo
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While mechanistic interpretability tools like Sparse Autoencoders (SAEs) can uncover meaningful features within Large Language Models (LLMs), a critical gap remains in transforming these insights into practical actions for model optimization. We bridge this gap with the hypothesis that data selection guided by a model’s internal task features is an effective training strategy. Inspired by this, we propose Interpretability-Guided Data Selection (IGDS), a framework that first identifies these causal task features through frequency recall and interventional filtering, then selects ``Feature-Resonant Data’’ that maximally activates task features for fine-tuning. We validate IGDS on mathematical reasoning, summarization, and translation tasks within Gemma-2, LLaMA-3.1, and Qwen3 models. Our experiments demonstrate exceptional data efficiency: on the Math task, IGDS surpasses full-dataset fine-tuning by a remarkable 17.4% on Gemma-2-2B while using only 50% of the data, and outperforms established baselines focused on data quality and diversity. Analysis confirms a strong positive correlation between feature amplification and task performance improvement. IGDS thus provides a direct and effective framework to enhance LLMs by leveraging their internal mechanisms, validating our core hypothesis.
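The selection step described above can be sketched with plain arrays: given per-example activations of a set of task features, keep the fraction of the dataset that activates them most. The feature indices and the summed-activation scoring rule here are illustrative assumptions, not the paper's exact procedure:

```python
def select_feature_resonant(activations, task_features, keep_ratio=0.5):
    """Rank examples by summed activation of the task features; keep the top
    fraction. Returns sorted indices of the selected examples.

    activations: list of per-example feature-activation vectors
    task_features: indices of features identified as causal for the task
    """
    scores = [sum(row[f] for f in task_features) for row in activations]
    k = max(1, int(len(activations) * keep_ratio))
    ranked = sorted(range(len(activations)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# toy: 4 examples, 3 features; features 0 and 2 are the "task features"
acts = [[0.9, 0.1, 0.8],   # resonant
        [0.0, 0.9, 0.1],
        [0.7, 0.2, 0.6],   # resonant
        [0.1, 0.8, 0.0]]
```

With `keep_ratio=0.5`, `select_feature_resonant(acts, [0, 2])` keeps the two examples that most strongly activate the task features, and fine-tuning would then run on that subset only.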

[AI-61] Training Transformers as a Universal Computer

Quick read: This paper studies how a small transformer can learn to perform general-purpose computation, specifically executing MicroPy, a simplified yet Turing-complete programming language. The central challenge is carrying out accurate small-step execution within a bounded context window. The key to the solution is the PENCIL (Program Execution with Context-Limited Inference and Learning) scaffolding, which provides space-efficient simulation of step-by-step program execution in a fixed-length context. Trained with supervision on randomly generated, meaningless MicroPy programs, the model generalizes to complex human-written programs (bit manipulation, binary arithmetic, SAT solving) as well as to novel out-of-distribution programs, supporting the claim that a standard transformer can act as a universal computer.

Link: https://arxiv.org/abs/2604.25166
Authors: Ruize Xu, Chenxiao Yang, Yanhong Li, David McAllester
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 pages, 9 figures

Click to view abstract

Abstract:We demonstrate that a small transformer can learn to execute programs in MicroPy, a simplified yet computationally universal programming language. Given procedure definitions together with an expression to evaluate, the transformer predicts small-step execution using PENCIL scaffolding for space-efficient execution within a bounded context window. After training on randomly generated, meaningless MicroPy programs, the learned transformer generalizes to various human-written programs including bit copying and flipping, binary addition and multiplication, and SAT verification and solving. We note that the trained model can achieve out-of-distribution generalization; i.e., it evaluates novel programs drawn from outside the training distribution of programs. Since MicroPy can express any computation, our results provide empirical evidence that a standard transformer can be trained to act as a universal computer.
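MicroPy itself is not reproduced here, but the flavor of small-step execution the model is trained to predict can be illustrated with a toy reducer that rewrites one innermost arithmetic redex per step; the trace of intermediate strings is exactly the kind of sequence the transformer would emit token by token. This is a stand-in, not the paper's language:

```python
import re

def step(expr):
    """Rewrite one innermost parenthesized redex, or a bare final redex."""
    m = re.search(r"\((\d+) ([+*]) (\d+)\)", expr)
    if m is None:
        m = re.fullmatch(r"(\d+) ([+*]) (\d+)", expr)
    if m is None:
        return expr  # fully reduced: expr is a value
    a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
    val = a + b if op == "+" else a * b
    return expr[:m.start()] + str(val) + expr[m.end():]

def run(expr, max_steps=100):
    """Iterate small steps until a fixed point; return the whole trace."""
    trace = [expr]
    for _ in range(max_steps):
        nxt = step(trace[-1])
        if nxt == trace[-1]:
            break
        trace.append(nxt)
    return trace
```

For example, `run("(2 + 3) * (4 + 1)")` produces the four-state trace ending in `"25"`, with each adjacent pair differing by exactly one reduction, which is what "predicting small-step execution" means operationally.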

[AI-62] The Role of Symmetry in Optimizing Overparameterized Networks

Quick read: This paper asks why overparameterization improves optimization in deep learning, a mechanism that remains poorly understood. The key to the solution is an analysis through the lens of weight-space symmetries: overparameterization introduces additional symmetries that reshape the loss landscape in two ways. First, these symmetries act as diagonal preconditioning of the Hessian, so that each equivalence class of functionally identical solutions contains better-conditioned minima. Second, overparameterization increases the probability mass of global minima near typical initializations, making favorable solutions easier to reach. Teacher-student experiments confirm the predictions: as width grows, the Hessian trace decreases, condition numbers improve, and convergence accelerates, yielding a unified geometric view of overparameterization and width growth.

Link: https://arxiv.org/abs/2604.25150
Authors: Kusha Sareen, Mohammad Pedramfar, Sékou-Oumar Kaba, Mehran Shakerinava, Siamak Ravanbakhsh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely understood. We analyze weight-space symmetries in neural networks and show that overparameterization introduces additional symmetries that benefit optimization in two distinct ways. First, we prove that these symmetries act as a form of diagonal preconditioning on the Hessian, enabling the existence of better-conditioned minima within each equivalence class of functionally identical solutions. Second, we show that overparameterization increases the probability mass of global minima near typical initializations, making these favorable solutions more reachable. Teacher-student network experiments validate our theoretical predictions: as width increases, the Hessian trace decreases, condition numbers improve, and convergence accelerates. Our analysis provides a unified framework for understanding overparameterization and width growth as a geometric transformation of the loss landscape.
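The "better-conditioned minima within each equivalence class" claim can be made concrete on the smallest possible network, a two-weight linear chain f(x) = w2*w1*x with loss L = (w1*w2 - t)^2: every point with w1*w2 = t computes the same function, the rescaling symmetry (w1, w2) -> (a*w1, w2/a) moves along this orbit, and the balanced representative minimizes the Hessian trace. A hand-derived sketch, not the paper's experiment:

```python
def hessian_trace_at_min(w1, w2):
    """Trace of the Hessian of L = (w1*w2 - t)^2 at a global min (w1*w2 = t).

    d2L/dw1^2 = 2*w2**2 and d2L/dw2^2 = 2*w1**2 there, so the trace is
    2*(w1**2 + w2**2) -- minimized on the symmetry orbit when |w1| = |w2|.
    """
    return 2.0 * (w1 ** 2 + w2 ** 2)

t = 4.0
unbalanced = hessian_trace_at_min(1.0, 4.0)  # product = t, lopsided weights
balanced = hessian_trace_at_min(2.0, 2.0)    # product = t, balanced weights
```

Both points are global minima of the same function, yet the balanced one has trace 16 versus 34 for the unbalanced one; the symmetry lets optimization (or an explicit rebalancing) pick the flatter, better-conditioned member of the orbit.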

[AI-63] Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

Quick read: This paper addresses two intertwined failures of large language models (LLMs) used for natural-language querying of analytical databases: incorrect answers and confident hallucinations, both rooted in the model being forced to infer business semantics that the schema does not encode. The key to the solution is to supply those semantics explicitly as context: a 4 KB hand-authored markdown document describing the dataset's measures, naming conventions, and disambiguation rules. Adding this document improves the accuracy of three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) by 17 to 23 percentage points, and with it in place the models become statistically indistinguishable, indicating that the explicit provision of structured business semantics, rather than model capability, is the dominant mechanism for suppressing text-to-SQL errors.

Link: https://arxiv.org/abs/2604.25149
Authors: Michael Rumiantsau, Ivan Fokeev
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failures - incorrect answers and confident hallucinations - both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse, using a paired single-shot protocol. Each model is evaluated twice: once given only the warehouse schema, and once given the schema plus a 4 KB hand-authored markdown document describing the dataset’s measures, conventions, and disambiguation rules. Adding the document improves accuracy by +17 to +23 percentage points across all three models. With it, the three models are statistically indistinguishable (67.7-68.7%); without it, they are also indistinguishable (45.5-50.5%). Every cross-cluster comparison is significant at p < 0.01. The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not. We interpret this as a structural result: explicit business semantics suppress the dominant class of text-to-SQL errors not by making the model more capable, but by changing what the model is being asked to do.

[AI-64] Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

Quick read: This paper addresses the instability of an interpretability diagnostic for locating feature formation: the measured coupling between attention-update SED directions and Linear Centroid Hypothesis (LCH) features depends heavily on how optimizer updates are processed. The key to the solution is to replace the rolling SVD of AdamW updates with an SVD of the loss gradients, which raises the measured perturbative coupling between SED directions and LCH features from roughly 3-9x to 100-330x and eliminates the apparent operation dependence across tasks. In multitask settings, where gradient interference between tasks breaks the diagnostic, performing SVD on per-task gradients rather than aggregated updates restores it. The improved SED-LCH coupling thus becomes a more reliable indicator of where feature formation concentrates in parameter space, while also revealing that the natural full-rank AdamW update is highly rank-redundant.

Link: https://arxiv.org/abs/2604.25143
Authors: Yongzhong Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Click to view abstract

Abstract:We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by 1-2 orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from \barR_k \approx 3 – 9\times to 100 – 330\times across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives \barR_k \leq 1 – an apparent failure of the diagnostic – while per-operation gradient-based SED recovers \barR_k = 20 – 45\times across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately 2.3\times across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED-LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.

[AI-65] Towards Unified Multi-task EEG Analysis with Low-Rank Adaptation

Quick read: This paper addresses the difficulty of jointly optimizing pre-trained electroencephalogram (EEG) models across multiple tasks: conflicts in the parameter space traditionally force a separately fine-tuned model per downstream task, wasting compute and storage. The key to the solution is MTEEG, a framework that disentangles the parameter space with task-specific low-rank adaptation (LoRA) modules, alleviating interference between tasks so that a single pre-trained model can be adapted to multiple EEG tasks simultaneously.

Link: https://arxiv.org/abs/2604.25131
Authors: Sicheng Dai, Kai Chen, Hongwang Xiao, Shan Yu, Qiwei Ye
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent self-supervised pre-training methods for electroencephalogram (EEG) have shown promising results. However, the pre-trained models typically require full fine-tuning on each downstream task individually to achieve good performance. In practical applications involving multiple tasks, utilizing a separate model for each task is not ideal regarding computational and spatial cost. In this study, we go one step further and explore the simultaneous adaptation of a pre-trained model to multiple different tasks. The EEG signals exhibit significant heterogeneity due to their collection from various subjects using diverse devices and experimental setups, resulting in potential conflicts among different tasks that impede joint optimization. To tackle this challenge, we propose MTEEG, a multi-task EEG analysis framework which incorporates task-specific low-rank adaptation (LoRA) modules to disentangle the parameter space and alleviate task conflicts. To investigate the trade-off between task specification and interaction, we propose three variants of MTEEG that integrate the LoRA modules in different ways and evaluate them on six downstream tasks, demonstrating that MTEEG can surpass state-of-the-art single-task methods on the majority of metrics. MTEEG shows the potential of multi-task EEG analysis and promotes the development of general-purpose brain-computer interfaces in the future.
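The task-specific LoRA idea can be sketched as a single frozen weight matrix plus one low-rank (B @ A) adapter per task, selected at forward time. Shapes, rank, and task names below are illustrative; in the standard LoRA initialization, B starts at zero so each adapter is initially a no-op:

```python
import numpy as np

class MultiTaskLoRALinear:
    """A frozen shared linear layer with one low-rank adapter per task.

    Only the per-task (B, A) pairs would be trained; W stays frozen,
    so tasks cannot interfere through the shared backbone weights.
    """
    def __init__(self, d_in, d_out, tasks, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))  # frozen backbone weight
        self.adapters = {
            t: (np.zeros((d_out, rank)),                     # B: zero init
                rng.standard_normal((rank, d_in)) * 0.01)     # A: small init
            for t in tasks
        }

    def forward(self, x, task):
        B, A = self.adapters[task]
        return (self.W + B @ A) @ x

layer = MultiTaskLoRALinear(8, 8, tasks=["sleep_staging", "seizure_detection"])
```

Routing an input through `layer.forward(x, "sleep_staging")` applies only that task's adapter, which is how the parameter space stays disentangled across tasks.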

[AI-66] Knowledge Distillation Must Account for What It Loses

Quick read: This position paper addresses an evaluation bias in knowledge distillation: current practice judges student models only by retained scores on the target task, ignoring the degradation of teacher capabilities such as uncertainty modeling, boundary behavior, process reliability, on-policy stability, grounding, privacy, safety, and diversity. The key to the solution is to reframe distillation as a lossy projection of teacher behavior rather than a faithful copy, and to propose a "Distillation Loss Statement" that reports what was preserved, what was lost, and why the remaining losses are acceptable, moving the field toward transparent, accountable distillation practice.

Link: https://arxiv.org/abs/2604.25110
Authors: Wenshuo Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. This matters because distillation is increasingly used to turn large, often frontier models into deployable systems, yet headline metrics can hide losses in uncertainty, boundary behavior, process reliability, on-policy stability, grounding, privacy, safety, and diversity. We identify the retention assumption behind current evaluation and reframe distillation as a lossy projection of teacher behavior rather than a faithful copy. We then synthesize existing evidence into a taxonomy of off-metric distillation losses, showing that these losses are concrete, recurring, and measurable. To make the position actionable, we propose scenario-specific preservation targets and a Distillation Loss Statement that reports what was preserved, what was lost, and why the remaining losses are acceptable. The goal is not lossless distillation, but accountable distillation.

[AI-67] Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

Quick read: This paper addresses pre-load security auditing of untrusted Agent Skills, i.e., detecting malicious code injection even after semantics-preserving rewrites. Existing guardrails often fail to recover malicious intent consistently, leaving security gaps. The key to the solution is SkillGuard-Robust, which frames pre-load auditing as a robust three-way classification task and combines role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication, substantially improving the accuracy and consistency of cross-file security review.

Link: https://arxiv.org/abs/2604.25109
Authors: Lijia Lv, Xuehai Tang, Jie Wen, Jizhong Han, Songlin Hu
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Agent Skills package this http URL files, scripts, reference documents, and repository context into reusable capability units, turning pre-load auditing from single-prompt filtering into cross-file security review. Existing guardrails often flag risk but recover malicious intent inconsistently under semantics-preserving rewrites. This paper formulates pre-load auditing for untrusted Agent Skills as a robust three-way classification task and introduces SkillGuard-Robust, which combines role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication. We evaluate SkillGuard-Robust on SkillGuardBench and two public-ecosystem extensions through five large evaluation views ranging from 254 to 404 packages. On the 404-package held-out aggregate, SkillGuard-Robust reaches 97.30% overall exact match, 98.33% malicious-risk recall, and 98.89% attack exact consistency. On the 254-package external-ecosystem view, it reaches 99.66%, 100.00%, and 100.00%, respectively. These results support a bounded conclusion: factorized package auditing materially improves frozen and public-ecosystem robustness, while harsher external-source transfer remains an open challenge.

[AI-68] Optimally Auditing Adversarial Agents AAAI

Quick read: This paper addresses efficiency losses caused by fraud in resource allocation domains such as social service delivery and credit provision, where agents may misreport private information to gain benefits or credit eligibility. The key to the solution is to model audit design as a principal-agent game with multiple agents, in which the principal commits to an audit policy and the agents then choose an equilibrium that minimizes the principal's utility. For both the adaptive and non-adaptive settings (i.e., whether the principal's policy can respond to the distribution of agent reports), the paper gives efficient algorithms for computing optimal audit policies and extends them to limited audit budgets, keeping the mechanism effective while remaining computationally tractable under resource constraints.

Link: https://arxiv.org/abs/2604.25085
Authors: Sanmay Das, Fang-Yi Yu, Yuang Zhang
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Published in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026, pages 16787-16794

Click to view abstract

Abstract:Fraud can pose a challenge in many resource allocation domains, including social service delivery and credit provision. For example, agents may misreport private information in order to gain benefits or access to credit. To mitigate this, a principal can design strategic audits to verify claims and penalize misreporting. In this paper, we introduce a general model of audit policy design as a principal-agent game with multiple agents, where the principal commits to an audit policy, and agents collectively choose an equilibrium that minimizes the principal’s utility. We examine both adaptive and non-adaptive settings, depending on whether the principal’s policy can be responsive to the distribution of agent reports. Our work provides efficient algorithms for computing optimal audit policies in both settings and extends these results to a setting with limited audit budgets.
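In the simplest one-shot version of such a game, a risk-neutral agent type misreports only when its gain exceeds the expected penalty, so the principal can deter type i by auditing with probability at least gain_i/penalty; a budget on total expected audits then limits how many types can be deterred. This toy greedy model is an illustration of the budgeted setting, not the paper's formal game:

```python
def deterrence_audit_plan(gains, penalty, budget):
    """Greedily deter the cheapest agent types first.

    Type i is deterred by audit probability p_i = gain_i / penalty
    (expected penalty then matches the gain). Spend the expected-audit
    budget (sum of p_i) on the cheapest types; return {type: p_i}.
    """
    required = sorted((g / penalty, i) for i, g in enumerate(gains))
    plan, spent = {}, 0.0
    for p, i in required:
        if spent + p > budget:
            break
        plan[i] = p
        spent += p
    return plan

# three agent types with misreporting gains 10, 40, 20 and penalty 100
plan = deterrence_audit_plan(gains=[10.0, 40.0, 20.0], penalty=100.0, budget=0.5)
```

With a budget of 0.5 expected audits, the plan deters types 0 and 2 (probabilities 0.1 and 0.2) and leaves the expensive type 1 undeterred, showing how budget constraints force a choice among agents.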

[AI-69] Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

Quick read: This paper addresses the inefficiency of exploring the vast combinatorial space of microarchitectural configurations in computer architecture design, especially for components such as cache replacement, data prefetching, and branch prediction. The key to the solution is Agentic Architect, an agentic framework driven by large language models (LLMs) that combines LLM-generated code evolution with cycle-accurate simulation for automated, efficient design-space exploration and optimization. The human architect defines the objective, seed design, scoring function, and simulator interface, while the LLM evolves implementations within those constraints; the evolved designs match or exceed state-of-the-art baselines across multiple tasks, validating the framework's effectiveness and generality.

Link: https://arxiv.org/abs/2604.25083
Authors: Alexander Blasberg, Vasilis Kypriotis, Dimitrios Skarlatos
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Click to view abstract

Abstract:Rapid advances in Large Language Models (LLMs) create new opportunities by enabling efficient exploration of broad, complex design spaces. This is particularly valuable in computer architecture, where performance depends on microarchitectural designs and policies drawn from vast combinatorial spaces. We introduce Agentic Architect, an agentic AI framework for computer architecture design exploration and optimization that combines LLM-driven code evolution with cycle-accurate simulation. The human architect specifies the optimization target, seed design, scoring function, simulator interface, and benchmark split, while the LLM explores implementations within these constraints. Across cache replacement, data prefetching, and branch prediction, Agentic Architect matches or exceeds state-of-the-art designs. Our best evolved cache replacement design achieves a 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x). Our evolved branch predictor achieves a 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x). Finally, our evolved prefetcher achieves a 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x). Our analysis surfaces several findings about agentic AI-driven microarchitecture design. Across evolved designs, components often correspond to known techniques; the novelty lies in how they are coordinated. The architect’s role is shifting, but the human remains central. Seed quality bounds what search can achieve: evolution can refine and extend an existing mechanism, but cannot compensate for a weak foundation. Likewise, objectives, constraints, and prompt guidance affect reliability and generalization. Overall, Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization. 

[AI-70] Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

Quick read: This paper addresses "blind-spot deception" in weak-to-strong alignment: a strong model can become confidently wrong on examples that lie in the weak teacher's blind spots, causing alignment failures that aggregate accuracy cannot capture and that require analyzing how confidence and uncertainty are distributed. The key to the solution is a bias-variance-covariance analysis: the paper derives a misfit-based upper bound on the strong model's population risk and empirically decomposes its components using continuous confidence scores. It finds that strong-model variance is the strongest empirical predictor of deception, with covariance providing weaker auxiliary information, suggesting that weak-strong dependence matters but does not by itself explain the failures. High strong-model variance can thus serve as an early-warning signal for latent blind-spot errors, while blind-spot evaluation helps distinguish whether failures are inherited from weak supervision or arise anew in regions where the weak model is uncertain.

Link: https://arxiv.org/abs/2604.25077
Authors: Hamid Osooli, Kareema Batool, Rick Gentry, Tiasa Singha Roy, Ashwin Gupta, Anirudha Ramesh
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher’s blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) on the PKU-SafeRLHF and HH-RLHF datasets. Using a blind-spot deception metric that isolates cases where the strong model is confidently wrong while the weak model is uncertain, we find that strong-model variance is the strongest empirical predictor of deception across our settings. Covariance provides additional but weaker information, indicating that weak-strong dependence matters, but does not by itself explain the observed failures. These results suggest that strong-model variance can serve as an early-warning signal for weak-to-strong deception, while blind-spot evaluation helps distinguish whether failures are inherited from weak supervision or arise in regions of weak-model uncertainty.
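The blind-spot deception metric described above can be written directly as a filter over paired predictions: count the cases where the strong model is confident yet wrong while the weak model is uncertain. The specific thresholds below are illustrative assumptions, not the paper's values:

```python
def blind_spot_deception_rate(strong_conf, strong_correct, weak_conf,
                              conf_thresh=0.8, uncert_thresh=0.6):
    """Fraction of examples where the strong model is confident
    (>= conf_thresh) yet wrong, while the weak model is uncertain
    (confidence < uncert_thresh) -- i.e., the example sits in the
    weak teacher's blind spot."""
    hits = sum(
        1 for s, ok, w in zip(strong_conf, strong_correct, weak_conf)
        if s >= conf_thresh and not ok and w < uncert_thresh
    )
    return hits / len(strong_conf)

# four examples: only the first is a blind-spot deception case
rate = blind_spot_deception_rate(
    strong_conf=[0.95, 0.90, 0.55, 0.85],
    strong_correct=[False, True, False, False],
    weak_conf=[0.3, 0.2, 0.4, 0.9],
)
```

In the toy batch, example 0 counts (strong confident and wrong, weak uncertain), example 3 does not (the weak model was itself confident there, so the error is inherited supervision rather than a blind spot), giving a rate of 0.25.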

[AI-71] Barriers and Enablers of Online Instruction in Hospitality Education in the Philippines: An Exploratory Study

Quick read: This study examines the main barriers and enablers of online instruction in hospitality education. Pedagogical challenges, such as the difficulty of teaching hands-on subjects online and of sustaining student engagement, emerge as teachers' most critical concerns, while technological barriers (unstable internet, limited devices) and institutional and personal support rank lower. Artificial intelligence (AI) is viewed as a potential enabler, but teachers remain cautious about adopting it and emphasize the need for professional training to ensure responsible use. The key implications are to strengthen pedagogical training, clarify institutional support policies, and systematically build teachers' competence with AI tools to improve the quality and acceptance of online instruction.

Link: https://arxiv.org/abs/2604.25047
Authors: Maria Anna D. Cruz, Jeaneth D. Serna, Lloyd D. Feliciano, Mike Haizon M. David, Ma. Ferna Bel L. Punsalan, Glen Brian L. Lacsa, Michelle C. Castro, John Paul P. Miranda
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 2 figures; 9 pages, conference proceedings

Click to view abstract

Abstract:This study examined the barriers and enablers of online instruction in hospitality education. A sequential exploratory design was implemented with hospitality teachers from both public and private higher educational institutions in the Philippines. Thematic analysis of interviews identified four key themes: technological barriers, pedagogical challenges, institutional and personal support, and integration of artificial intelligence (AI). These themes were transformed into survey constructs and tested for reliability. Pedagogical challenges, including difficulties in teaching hands-on subjects and sustaining student engagement, emerged as the most critical concerns. Technological barriers such as unstable internet and limited devices were moderately rated, while institutional and personal support received mixed evaluations. Teachers viewed AI integration as helpful but also expressed caution and emphasized the need for training. Reliability analysis showed acceptable to good internal consistency across constructs. The findings highlight the importance of strengthening pedagogical training, providing clear institutional support, and fostering responsible competence in AI use. Future studies should validate these results with larger and more diverse samples.

[AI-72] Internet of Everything in the 6G Era: Paradigms Enablers Potentials and Future Directions

Quick read: This paper addresses the system-integration complexity, limited intelligence, and inefficient cross-domain coordination facing the evolution of the Internet of Things (IoT) toward its more advanced form: the Internet of Everything (IoE), a unified intelligent ecosystem integrating people, data, processes, and things. The key to the solution is a structured conceptual framework for IoE that clarifies its core components and architectural foundations, identifies its enabling technologies, and highlights the open challenges of 6G-enabled intelligent IoE systems, including scalability, security, privacy, and energy efficiency, providing theoretical grounding and a technical roadmap for future intelligent network infrastructure.

Link: https://arxiv.org/abs/2604.25018
Authors: Driss Choukri, Essaid Sabir, Elmahdi Driouh, Abdelkrim Haqiq
Affiliations: Unknown
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments: 48 pages, 15 figures, 6 tables, 272 references

Click to view abstract

Abstract:The Internet of Everything (IoE) represents an evolution of the Internet of Things (IoT) by integrating people, data, processes, and things into a unified intelligent ecosystem. IoE aims to enhance automation, decision-making, and service efficiency across multiple application domains such as smart cities, healthcare, industry, and next-generation wireless networks. This paper provides a structured overview of the IoE concept, its core components, architectural foundations, enabling technologies, and major research challenges. Finally, open research directions toward 6G-enabled intelligent IoE systems are discussed, with emphasis on scalability, security, privacy, and energy efficiency.

[AI-73] Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents

Quick read: This paper asks why capable models remain difficult to deploy in open institutions even though they excel at specific verifiable tasks, where learned structure and inference-time search reduce time-to-solution. The core challenge is converting human intent into inspectable, constrainable artifacts that bind execution, given the complexity of verification in open worlds. The key to the solution is "intent compilation": transforming partially specified human purpose into inspectable structured artifacts that bind the execution process. The paper distinguishes closed-world solvers from open-world agents, introduces a "closure-gap vector" to quantify residual openness, and defines "delegation envelopes" as pre-authorized regions of action space, guiding when closure interventions outperform additional inference-time search.

Link: https://arxiv.org/abs/2604.25000
Authors: Maximiliano Armesto, Christophe Kolb
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 15 pages, 1 figure, 5 tables

Click to view abstract

Abstract:Recent work has framed intelligence in verifiable tasks as reducing time-to-solution through learned structure and test-time search, while systems work has explored learned runtimes in which computation, memory and I/O migrate into model state. These perspectives do not explain why capable models remain difficult to deploy in open institutions. We propose intent compilation: the transformation of partially specified human purpose into inspectable artifacts that bind execution. The relevant deployment distinction is closed-world solver versus open-world agent. In closed worlds, a checker is largely given; in open worlds, verification is distributed across semantic, evidentiary, procedural and institutional dimensions. We formalize this residual openness as a closure-gap vector, define delegation envelopes as pre-authorized regions of action space, distinguish misclosure from undersearch, and outline benchmark metrics for testing when closure interventions outperform additional inference-time search.

[AI-74] Sparse Personalized Text Generation with Multi-Trajectory Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在冷启动场景下的个性化问题,即当用户交互历史数据稀疏或缺失时,现有方法难以有效生成符合个体偏好的输出。其解决方案的关键在于提出一种名为PAT(Personalization with Aligned Trajectories)的推理框架,该框架通过两条互补的信息轨迹进行检索:一是来自风格相似用户的写作风格线索,二是来自偏好对齐用户的主题相关上下文;随后利用基于强化学习的迭代双推理机制,使模型能够联合优化并融合这两类异构信号,从而在低数据条件下显著提升生成质量与个性化对齐度。

链接: https://arxiv.org/abs/2604.24996
作者: Bo Ni,Haowei Fu,Qinwen Ge,Franck Dernoncourt,Samyadeep Basu,Nedim Lipka,Seunghyun Yoon,Yu Wang,Nesreen K. Ahmed,Subhojyoti Mukherjee,Puneet Mathur,Ryan A. Rossi,Tyler Derr
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) advance, personalization has become a key mechanism for tailoring outputs to individual user needs. However, most existing methods rely heavily on dense interaction histories, making them ineffective in cold-start scenarios where such data is sparse or unavailable. While external signals (e.g., content of similar users) can offer a potential remedy, leveraging them effectively remains challenging: raw context is often noisy, and existing methods struggle to reason over heterogeneous data sources. To address these issues, we introduce PAT (Personalization with Aligned Trajectories), a reasoning framework for cold-start LLM personalization. PAT first retrieves information along two complementary trajectories: writing-style cues from stylistically similar users and topic-specific context from preference-aligned users. It then employs a reinforcement learning-based, iterative dual-reasoning mechanism that enables the LLM to jointly refine and integrate these signals. Experimental results across real-world personalization benchmarks show that PAT consistently improves generation quality and alignment under sparse-data conditions, establishing a strong solution to the cold-start personalization problem.

[AI-75] Assessing Y-Axis Influence: Bias in Multimodal Language Models on Chart-to-Table Translation

【速读】:该论文旨在解决图表图像到表格数据转换(Chart-to-table translation)中存在的y轴相关偏差问题,这类偏差源于公共数据集中不同y轴信息维度(如主刻度数值的位数、主刻度数量、取值范围及刻度格式等)分布不均,导致多模态语言模型(Multimodal Language Model, MLM)在处理此类任务时性能不稳定,产生系统性偏倚。解决方案的关键在于提出一个名为FairChart2Table的新框架,用于系统性分析五种先进MLM模型在y轴特征上的表现差异,并通过实证发现:(1) y轴数值长度、刻度数量、取值范围和格式是显著影响模型性能的核心因素;(2) 图表中图例或实体数量也会影响MLM性能;(3) 在提示(prompting)中引入y轴信息可显著提升部分MLM的翻译准确率,从而为缓解偏差提供有效策略。

链接: https://arxiv.org/abs/2604.24987
作者: Seok Hwan Song,Azher Ahmed Efat,Wallapak Tavanapong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chart-to-table translation converts chart images into structured tabular data. Accurate translation is crucial for Multimodal Language Model (MLM) to answer complex queries. We observe imbalances in the number of images across different aspects of the y-axis information in public chart datasets. Such imbalances can introduce unintended biases, causing uneven MLM performance. Previous works have not systematically examined these biases. To address this gap, we propose a new framework, FairChart2Table, for analyzing y-axis-related bias on five state-of-the-art models. Key Findings: (1) There are significant y-axis biases related to the digit length of the major tick values, the number of major ticks, the range of values, and the tick value format (e.g., abbreviation or scientific format). (2) The number of legends/entities in chart images impacts MLM performance. (3) Prompting MLM with y-axis information can significantly enhance the performance for some MLMs.

[AI-76] Adaptive Prompt Embedding Optimization for LLM Jailbreaking

【速读】:该论文旨在解决对齐后大语言模型(Aligned Large Language Models, LLMs)的白盒越狱攻击(White-box Jailbreak Attacks)效率与隐蔽性不足的问题。现有方法通常通过在用户提示中添加离散的对抗后缀来实现越狱,但这类方法不仅会明显改变原始提示内容,且在组合式的词元空间中搜索效率较低。论文提出Prompt Embedding Optimization (PEO),其核心创新在于直接优化原始提示词元的嵌入表示(embedding),而非添加额外的对抗词元;关键突破在于证明:即使对原始词元嵌入进行连续空间优化,经最近邻词元投影后仍能保持原提示字符串不变,且模型响应在语义上仍保持相关性。PEO结合了嵌入空间的连续优化、结构化的续写目标以及自适应失败聚焦调度机制,实验证明其在两个标准有害行为基准测试中优于所有对比的白盒攻击方法(包括离散后缀搜索、附加对抗嵌入和基于搜索的对抗生成)。

链接: https://arxiv.org/abs/2604.24983
作者: Miles Q. Li,Benjamin C. M. Fung,Boyang Li,Radin Hamidi Rad,Ebrahim Bagheri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt’s semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model’s responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.
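
下面用一段简化的 Python 草图示意"最近邻词元投影"这一核心观察:对原始词元嵌入做小步长的连续空间优化后,投影回最近的词元仍能还原出原始提示串。其中词表规模、嵌入维度与扰动步长均为假设值,且以随机扰动代替真实梯度,仅作概念演示,并非论文实现。

```python
import math
import random

random.seed(0)
# 玩具词表:100 个词元,每个 8 维嵌入(假设值,仅作示意)
vocab = [[random.gauss(0, 1) for _ in range(8)] for _ in range(100)]
prompt_ids = [3, 41, 7]
emb = [list(vocab[i]) for i in prompt_ids]   # 原始提示词元的嵌入

def nearest_token(e):
    # 最近邻词元投影:返回与嵌入 e 欧氏距离最近的词元 id
    return min(range(len(vocab)), key=lambda i: math.dist(vocab[i], e))

# 模拟若干步小步长的嵌入空间"优化"(此处以随机扰动代替真实梯度)
for _ in range(10):
    for e in emb:
        for d in range(len(e)):
            e[d] += 0.005 * random.gauss(0, 1)

projected = [nearest_token(e) for e in emb]
print(projected)   # 扰动足够小时,投影后可见提示串保持不变
```

由于扰动范数远小于词元嵌入之间的典型间距,投影结果与原提示一致,这正是 PEO 得以在不改变可见提示的前提下实施攻击的前提条件。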

[AI-77] Compute Aligned Training: Optimizing for Test Time Inference

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在训练阶段与推理阶段计算资源使用方式不一致的问题。标准的后训练范式,如监督微调(Supervised Fine-Tuning, SFT)和强化学习(Reinforcement Learning, RL),优化的是单个样本的似然函数,而测试时通常采用聚合或筛选输出的策略(如自洽性采样、束搜索等),导致训练目标与测试行为之间存在错位。解决方案的关键在于提出“计算对齐训练”(Compute Aligned Training),其核心思想是将推理策略建模为对基础策略的算子,并据此推导出新的损失函数,使得在应用测试时策略后模型性能最大化。该方法通过在SFT和RL框架下具体实现此类损失函数,在多个常见测试时策略中验证了其有效性,显著提升了测试时计算资源扩展的效果。

链接: https://arxiv.org/abs/2604.24957
作者: Adam Ousherovitch,Ambuj Tewari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.
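
以 best-of-k(k 个样本中任一正确即算对)为例,可以用标准的 pass@k 恒等式直观理解"把推理策略视为基础策略算子"后训练目标的变化:优化目标从单样本正确率 p 变为 1-(1-p)^k,其梯度权重 k(1-p)^(k-1) 在 p 较小时远大于 1,即训练应更关注当前仍常失败的样本。以下是概念性数值草图,并非论文中针对 SFT/RL 的具体损失实现。

```python
def pass_at_k(p, k):
    # k 个独立样本中至少一个正确的概率
    return 1 - (1 - p) ** k

def pass_at_k_grad(p, k):
    # pass@k 对单次正确率 p 的导数:k*(1-p)^(k-1)
    return k * (1 - p) ** (k - 1)

p, k = 0.2, 8
print(pass_at_k(p, k))                       # 远高于单次采样的 0.2
print(pass_at_k_grad(0.1, 8), pass_at_k_grad(0.9, 8))  # 低正确率样本的梯度权重更大
```

这说明与测试时策略对齐的目标会自然地把训练计算向"尚未被 k 次采样覆盖"的难样本倾斜。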

[AI-78] S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models ICASSP2026

【速读】:该论文旨在解决当前通用音频基础模型(General Audio Foundation Models)参数量庞大、推理成本高且难以部署在边缘设备上的问题。现有知识蒸馏方法多依赖于监督学习设置,需使用类别 logits 或中间特征进行对齐,这使得其无法适用于仅输出嵌入(embedding)的自监督或度量学习类模型。解决方案的关键在于提出 S-SONDO(Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models),这是首个仅利用模型输出嵌入进行蒸馏的方法,无需 logits 或层级对齐,具备架构无关性(architecture-agnostic),从而可广泛应用于基于嵌入的教师模型。实验表明,该方法可将两个音频基础模型压缩为三个效率更高的学生模型,体积最大减少至原来的 1/61,同时保持高达 96% 的教师性能。

链接: https://arxiv.org/abs/2604.24933
作者: Mohammed Ali El Adlouni,Aurian Quelennec,Pierre Chouteau,Geoffroy Peeters,Slim Essid
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted at IEEE ICASSP 2026. 5 pages, 2 figures, 3 tables. Equal contribution by first two authors. Code: this https URL | Models: this https URL | Package: this https URL

点击查看摘要

Abstract:General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: this https URL.
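
"仅用输出嵌入做蒸馏"的最小形式,可以理解为最小化学生与教师嵌入之间的 1 - 余弦相似度,无需 logits 或中间层对齐。以下为概念性草图,嵌入向量为假设值;论文中的具体损失选择与聚类均衡采样细节请以原文为准。

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embed_distill_loss(student_emb, teacher_emb):
    # 仅依赖输出嵌入的蒸馏损失:方向一致则损失趋近 0
    return 1.0 - cosine(student_emb, teacher_emb)

teacher = [0.3, -1.2, 0.8, 0.5]
aligned = [0.6, -2.4, 1.6, 1.0]    # 与教师方向一致(仅尺度不同)
off     = [1.0, 1.0, 1.0, 1.0]    # 与教师方向偏离

print(embed_distill_loss(aligned, teacher))   # 接近 0
print(embed_distill_loss(off, teacher))       # 明显大于前者
```

由于损失只依赖嵌入,这一范式对教师模型架构不做任何假设,这正是 S-SONDO 可覆盖自监督/度量学习类模型的原因。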

[AI-79] SUDP: Secret-Use Delegation Protocol for Agentic Systems

【速读】:该论文旨在解决代理式系统中用户秘密(如API密钥、云服务凭证等)的安全使用问题,即如何在不将可重用的凭据暴露给不可信的自主请求者(agentic requester)的前提下,实现受用户授权的单次操作。当前主流的基于持有者凭证(bearer-secret)接口存在根本性缺陷:一旦模型可操控边界被攻破(如提示注入或工具侧漏洞),攻击者即可永久获取账户控制权。为此,作者提出Agent Secret Use (ASU) 的形式化定义,明确区分结构义务与实现层面的鲁棒性条件,并设计了Secret-Use Delegation Protocol (SUDP) 作为解决方案。其核心在于引入三方角色——请求者、用户和托管方(custodian),通过一次性授权凭证(fresh authenticator-backed grant)实现“授权可验证、操作可绑定、使用仅一次”的安全机制,确保可重用的秘密始终不出现在请求者边界内,从而在不依赖环境强假设的情况下实现对代理行为中敏感凭据的最小权限委托。

链接: https://arxiv.org/abs/2604.24920
作者: Xiaohang Yu,Hejia Geng,William Knottenbelt
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic systems increasingly act with user secrets for APIs, messaging platforms, and cloud services. Today’s bearer-secret interfaces implement authorization by exposure: enabling action often means placing a reusable secret, or a reusable artifact derived from it, within a model-steerable boundary, so a transient prompt-injection or tool-side compromise becomes durable account compromise. Existing defenses cover adjacent pieces such as secret storage, scoped delegation, sender-constrained tokens, and runtime monitoring, but leave the combined agentic obligation without a common specification: an untrusted autonomous requester should be able to cause a user-authorized secret-backed operation without exposing reusable authority to the requester. We formalize this problem as Agent Secret Use (ASU). From ASU we derive a security-property taxonomy that separates the problem’s structural obligations from the realization-level robustness conditions any concrete construction must establish, enabling principled comparison of existing agentic-secret defenses against a problem-grounded specification. We propose the Secret-Use Delegation Protocol (SUDP), a three-role protocol realizing ASU: a requester proposes a canonical operation; the user authorizes it with a fresh authenticator-backed grant; and a custodian redeems the grant once to perform the bounded use, so reusable authority never crosses the requester boundary. We specialize SUDP for agentic deployments: agents propose operations; they do not retrieve secrets. Under explicit assumptions, we show that SUDP satisfies the ASU requirements: authorization is verifiable, operation-bound, and single-use. SUDP also provides storage confidentiality and wrapping-epoch key isolation under stated sealing and erasure assumptions; plaintext-level forward secrecy of the underlying secret additionally requires the environment to rotate and revoke it.
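
SUDP 的三方结构("请求者提出操作、用户签发一次性授权、托管方单次兑付")可以用如下极简 Python 草图示意。类名、接口与授权编码方式均为假设,只用于说明"可重用密钥不跨越请求者边界、授权绑定操作且仅能使用一次"这三条性质,并非协议的完整实现(省略了认证器、密封存储与密钥轮换等环节)。

```python
import hashlib
import secrets

class Custodian:
    """托管方:持有可重用密钥;请求者只能提出操作,不能取回密钥。(概念性草图)"""
    def __init__(self, secret):
        self._secret = secret          # 可重用密钥始终留在托管方边界内
        self._grants = {}              # grant_id -> 绑定的操作摘要

    def user_authorize(self, operation):
        # 用户对规范化操作签发一次性授权
        gid = secrets.token_hex(8)
        self._grants[gid] = hashlib.sha256(operation.encode()).hexdigest()
        return gid

    def redeem(self, gid, operation):
        # 兑付:授权必须存在、与操作绑定一致,且只能使用一次(pop 即作废)
        expected = self._grants.pop(gid, None)
        if expected != hashlib.sha256(operation.encode()).hexdigest():
            return "DENIED"
        return f"executed:{operation}"

c = Custodian("api-key-123")
grant = c.user_authorize("POST /send msg=hello")
print(c.redeem(grant, "POST /send msg=hello"))        # 首次兑付成功
print(c.redeem(grant, "POST /send msg=hello"))        # 二次兑付被拒:单次使用
print(c.redeem(c.user_authorize("GET /a"), "GET /b"))  # 操作不匹配被拒:操作绑定
```

即使请求者(代理)被提示注入完全劫持,它能造成的最坏后果也被限制在"已获用户授权的单次操作"之内。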

[AI-80] asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

【速读】:该论文旨在解决生成式 AI (Generative AI) 在复杂非完整约束系统中实现高保真物理仿真与真实硬件部署之间的“现实差距”问题,特别是针对类人球形机器人(humanoid ballbot)在实际控制中因接触建模不准确、执行器延迟抖动及安全探索困难导致的强化学习(Reinforcement Learning, RL)迁移失败问题。其解决方案的关键在于:首先构建了高保真 MuJoCo 仿真环境,显式建模 ETH 型全向轮的离散滚轮力学特性,以捕捉此前被忽略的寄生振动和接触不连续性;其次提出了一种摩擦感知强化学习框架(Friction-Aware Reinforcement Learning),通过同时学习轮-球和球-地界面处的滚动、侧向与扭转摩擦通道,实现了零样本从仿真到现实(zero-shot Sim2Real)的迁移;此外,通过减法重构设计(subtractive reconfiguration)将过约束四足机器人部件重新配置为低成本、鲁棒的研究平台,并开发了一个通用 iOS 生态系统,将消费级电子设备转化为低延迟接口,使单个操作员可直观操控类人动作。

链接: https://arxiv.org/abs/2604.24916
作者: Fang Wan,Guangyi Huang,Tianyu Wu,Zishang Zhang,Bangchao Huang,Haoran Sun,Mingdong Chen,Chaoyang Song
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figure, accepted for RSS2026. For Supplementary Videos, see this https URL

点击查看摘要

Abstract:We introduce asRoBallet, to the best of our knowledge, the first successful deployment of reinforcement learning (RL) on a humanoid ballbot hardware. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, which are characterized by a reality gap in complex friction models for wheel-sphere-ground interactions. While current literature demonstrates successful handling of 3D balancing with LQR and MPC, transitioning to actual hardware for a humanoid ballbot using RL is currently hindered by critical gaps in contact modeling, actuator latency jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, thereby capturing parasitic vibrations and contact discontinuities that are previously ignored. We also developed a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-sphere and sphere-ground interfaces. We designed asRoBallet through subtractive reconfiguration, repurposing key components from an overconstrained quadruped and integrating them into a newly designed structural frame to achieve a robust research platform at low cost. We also developed a generalized iOS ecosystem that transforms consumer electronics into a low-latency interface, enabling a single operator to orchestrate expressive humanoid maneuvers via intuitive natural motion.

[AI-81] Learning with Embedded Linear Equality Constraints via Variational Bayesian Inference AISTATS2026

【速读】:该论文旨在解决机器学习在科学与工程领域应用中普遍存在的两个问题:一是许多方法无法提供有意义的不确定性估计,二是预测结果可能违背已知的物理规律。解决方案的关键在于提出一种贝叶斯框架,将输入与输出之间的线性关系(即物理约束)嵌入到学习过程中,同时对模型参数和领域知识的不确定性进行完整建模。通过在单粒子电池模型上验证,该方法相较于基于变分推断的标准贝叶斯神经网络,能够显著缩小可信区间并减少约束违反情况。

链接: https://arxiv.org/abs/2604.24911
作者: Matthew Marsh,Benoît Chachuat,Antonio del Rio Chanona
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Part of the OPTIMAL: Optimisation and Post-Bayesian Inference in Machine Learning Workshop at AISTATS 2026

点击查看摘要

Abstract:Machine Learning is becoming more prevalent in science and engineering, but many approaches do not provide meaningful uncertainty estimates and predictions may also violate known physical knowledge. We propose a Bayesian framework to embed linear relationships across inputs and outputs into the learning process, whilst characterizing full predictive uncertainty over both the model parameters and the domain knowledge. We evaluated our method on learning the single particle battery model subject to voltage and energy balances, showing its ability to provide reduced credible intervals and constraint violations compared to standard Bayesian neural networks based on variational inference.

[AI-82] Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate ACL2026

【速读】:该论文旨在解决多智能体辩论(multi-agent debate)在大型语言模型(LLM)中计算效率低下的问题,即传统方法需要生成冗长的对话转录才能回答问题,导致资源消耗巨大。解决方案的关键在于提出一个两阶段微调框架:第一阶段学习辩论结构,第二阶段通过动态奖励调度和长度截断实现推理能力的内化(internalization),从而将多智能体辩论过程压缩为单一LLM的推理路径。实验表明,该方法在多个模型和基准测试中可达到或超越显式多智能体辩论的效果,同时减少高达93%的token使用量。此外,研究通过激活操控(activation steering)揭示了内化机制的本质——在激活空间中形成了对应不同代理视角的可解释子空间,这不仅提升了对内部推理行为的理解,还为精准控制有害行为提供了新途径。

链接: https://arxiv.org/abs/2604.24881
作者: John Seon Keun Yi,Aaron Mueller,Dokyun Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Main

点击查看摘要

Abstract:Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi-agent debate into a single LLM through a two-stage fine-tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi-agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent-specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling malicious agents into the LLM through internalized debate, then applying negative steering to suppress them, we show that distillation makes harmful behaviors easier to localize and control with smaller reductions in general performance compared to steering base models. Our findings offer a new perspective for understanding multi-agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors. Code available at this https URL
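
文中"激活操控"(activation steering)的基本操作,是在隐藏状态上沿某个已定位的代理子空间方向加上(或减去)一个缩放后的方向向量。下面是一个极简草图:方向向量与隐藏状态均为假设的玩具数值,负的 alpha 对应文中"负向操控以抑制恶意代理"的做法,并非论文的实际实现。

```python
def steer(hidden, direction, alpha):
    # 激活操控:hidden <- hidden + alpha * direction
    # alpha > 0 增强该方向对应的行为,alpha < 0 抑制之
    return [h + alpha * d for h, d in zip(hidden, direction)]

hidden = [0.5, -0.2, 1.0]                  # 某层隐藏状态(假设值)
malicious_dir = [0.8, 0.1, -0.3]           # 假设:已定位的"恶意代理"方向
suppressed = steer(hidden, malicious_dir, alpha=-1.0)
print(suppressed)
```

论文的发现在于:经内化蒸馏后,这类代理方向在激活空间中更易定位,因而上述抑制操作对通用能力的副作用更小。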

[AI-83] Transformer Approximations from ReLUs

【速读】:该论文旨在解决如何将ReLU(Rectified Linear Unit)的近似结果系统性地转化为Softmax注意力机制的理论分析框架问题,从而为Transformer模型中的Softmax注意力机制提供更精确的数学工具。其解决方案的关键在于提出了一种通用的“转换配方”(recipe),能够针对不同计算目标(如乘法、倒数计算和最小/最大值运算)生成特定于任务的、资源高效的近似边界,而不仅限于泛化能力的定性描述,从而为软注意力机制的性能与资源消耗提供了可量化、可优化的理论支撑。

链接: https://arxiv.org/abs/2604.24878
作者: Jerry Yao-Chieh Hu,Mingcheng Lu,Yi-Chen Lee,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We provide a systematic recipe for translating ReLU approximation results to softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economic resource bounds beyond universal approximation statements. We showcase the recipe on multiplication, reciprocal computation, and min/max primitives. These results provide new analytical tools for analyzing softmax transformer models.
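
以文中提到的 min/max 原语为例:softmax 加权和随逆温度(尺度)增大而逼近 max,这是 softmax 注意力能实现此类原语的直观基础。以下数值演示仅说明这一极限行为,具体的资源界与转换配方请参见原文。

```python
import math

def softmax_weighted_max(x, beta):
    # 软最大:sum_i softmax(beta*x)_i * x_i,beta 越大越接近 max(x)
    w = [math.exp(beta * xi) for xi in x]
    z = sum(w)
    return sum(wi * xi for wi, xi in zip(w, x)) / z

x = [0.2, 1.0, 0.7]
print(softmax_weighted_max(x, 1.0))    # 介于均值与最大值之间
print(softmax_weighted_max(x, 50.0))   # 已非常接近 max(x) = 1.0
```

对乘法与倒数等其他目标,文中的配方同样给出与目标相匹配的、而非仅定性的近似资源界。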

[AI-84] MotionBricks: Scalable Real-Time Motions with Modular Latent Generative Model and Smart Primitives SIGGRAPH2026

【速读】:该论文旨在解决生成式运动合成在实时交互控制中的两大核心挑战:一是实时可扩展性问题,即工业应用需要在实时计算约束下生成大量多样化的运动技能,而现有生成方法在此条件下质量与扩展性显著下降;二是集成问题,即工业场景要求细粒度的多模态控制(如速度指令、风格选择和精确关键帧),但现有基于文本或标签驱动的模型难以满足此类需求。解决方案的关键在于提出MotionBricks框架,其包含两个创新:首先,设计了一个大规模模块化潜在生成主干网络,能够以单一模型高效建模超过35万段运动片段,实现鲁棒的实时运动生成;其次,引入智能原语(smart primitives),提供统一、鲁棒且直观的接口用于导航与物体交互的创作,使应用开发如同积木拼接般便捷,无需专业动画知识。该方案在多个开源与专有数据集上实现了最先进的运动质量,并达到15,000 FPS的实时吞吐量与2ms延迟,同时在生产级动画演示和Unitree G1人形机器人上的部署验证了其灵活性与泛化能力。

链接: https://arxiv.org/abs/2604.24833
作者: Tingwu Wang,Olivier Dionne,Michael De Ruyter,David Minor,Davis Rempe,Kaifeng Zhao,Mathis Petrovich,Ye Yuan,Chenran Li,Zhengyi Luo,Brian Robison,Xavier Blackwell,Bernardo Antoniazzi,Xue Bin Peng,Yuke Zhu,Simon Yuen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: ACM Transactions on Graphics; SIGGRAPH 2026. Project page: this https URL

点击查看摘要

Abstract:Despite transformative advances in generative motion synthesis, real-time interactive motion control remains dominated by traditional techniques. In this work, we identify two key challenges in bridging research and production: 1) Real-time scalability: Industry applications demand real-time generation of a vast repertoire of motion skills, while generative methods exhibit significant degradation in quality and scalability under real-time computation constraints, and 2) Integration: Industry applications demand fine-grained multi-modal control involving velocity commands, style selection, and precise keyframes, a need largely unmet by existing text- or tag-driven models. To overcome these limitations, we introduce MotionBricks: a large-scale, real-time generative framework with a two-fold solution. First, we propose a large-scale modular latent generative backbone tailored for robust real-time motion generation, effectively modeling a dataset of over 350,000 motion clips with a single model. Second, we introduce smart primitives that provide a unified, robust, and intuitive interface for authoring both navigation and object interaction. Applications can be designed in a plug-and-play manner like assembling bricks without expert animation knowledge. Quantitatively, we show that MotionBricks produces state-of-the-art motion quality on open-source and proprietary datasets of various scales, while also achieving a real-time throughput of 15,000 FPS with 2ms latency. We demonstrate the flexibility and robustness of MotionBricks in a complete production-level animation demo, covering navigation and object-scene interaction across various styles with a unified model. To showcase our framework’s application beyond animation, we deploy MotionBricks on the Unitree G1 humanoid robot to demonstrate its flexibility and generalization for real-time robotic control.

[AI-85] On the Trainability of Masked Diffusion Language Models via Blockwise Locality

【速读】:该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDMs)在结构化生成任务中优化不稳定的问题,尤其是在与自回归大语言模型(Autoregressive Large Language Models, AR-LLMs)对比时表现出的学习能力差异和训练动态波动。其关键解决方案在于提出两种具有局部性感知的分块式模型——Jigsaw 和 Scatter,它们通过在块内强制引入从左到右的归纳偏置(即保持自回归局部性),同时保留块级别的迭代精炼能力,从而在保证扩散模型规划优势的同时提升训练稳定性与任务适应性。

链接: https://arxiv.org/abs/2604.24832
作者: Yuxiang Wang,Yu Xiang,Baojian Zhou,Qifang Zhao,Keyue Jiang,Yanghua Xiao,Xiaoxiao Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Masked diffusion language models (MDMs) have recently emerged as a promising alternative to standard autoregressive large language models (AR-LLMs), yet their optimization can be substantially less stable. We study blockwise MDMs and compare them with AR-LLMs on three controlled tasks that stress different aspects of structured generation: in-context linear regression, graph path-finding, and Sudoku solving. We find that standard random-masking MDMs fail to reliably learn linear regression, exhibit high variance training dynamics on graph path-finding, while outperforming AR-LLMs on Sudoku. To mitigate these instabilities, we propose two locality aware blockwise models, namely Jigsaw and Scatter, that inject left-to-right inductive bias by enforcing autoregressive locality within blocks while preserving iterative refinement at the block level. Empirically, Jigsaw matches AR-LLM stability on linear regression and remains strong on Sudoku, while Scatter retains diffusion’s planning advantage on path-finding. Our results indicate that standard random-masking MDMs, even with blockwise variants, may be a suboptimal instantiation of diffusion LMs for ordered generation, motivating models beyond random masking.

[AI-86] Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

【速读】:该论文旨在解决闭源前沿模型(closed-source frontier labs)因不公开参数量而导致的模型规模评估难题,以及传统基于推理经济性(inference economics)估算方法因硬件、批处理和服务栈假设引入高不确定性的问题。其核心解决方案是提出不可压缩知识探针(Incompressible Knowledge Probes, IKPs),通过设计一个包含1400个跨7个隐蔽层级的事实性问题的基准测试集,隔离出无法通过推理或架构改进压缩的知识内容,从而利用存储这些知识所需的最小参数数作为参数量的下界估计。该方法建立了一个从IKP准确率到参数量的对数线性映射关系,在89个开源模型上实现R²=0.917的强相关性,并验证了在多个厂商和模型架构(包括Mixture-of-Experts)上的泛化能力,为评估闭源模型的有效知识容量提供了可量化、可比较的新范式。

链接: https://arxiv.org/abs/2604.24827
作者: Bojie Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Closed-source frontier labs do not disclose parameter counts, and the standard alternative – inference economics – carries 2x+ uncertainty from hardware, batching, and serving-stack assumptions external to the model. We exploit a tighter intrinsic bound: storing F facts requires at least F / (bits per parameter) weights, so measuring how much a model knows lower-bounds how many parameters it has. We introduce Incompressible Knowledge Probes (IKPs), a benchmark of 1,400 factual questions spanning 7 tiers of obscurity, designed to isolate knowledge that cannot be derived by reasoning or compressed by architectural improvements. We calibrate a log-linear mapping from IKP accuracy to parameter count on 89 open-weight models (135M-1,600B) spanning 19 vendors, achieving R^2 = 0.917; leave-one-out cross-validation confirms generalization (median fold error 1.59x, 68.5% within 2x and 87.6% within 3x). For Mixture-of-Experts models, total parameters predict knowledge (R^2 = 0.79) far better than active parameters (R^2 = 0.51). We evaluate 188 models from 27 vendors and estimate effective knowledge capacity for all major proprietary frontier models; for heavily safety-tuned models the estimates are lower bounds, since refusal policy can hide tens of percentage points of "refused but known" capacity. The widely-reported saturation of reasoning benchmarks does not imply the end of scaling. Procedural capability compresses under the "Densing Law," but across 96 dated open-weight models the IKP time coefficient is -0.0010/month (95% CI [-0.0031, +0.0008]) – indistinguishable from zero, and rejecting the Densing prediction of +0.0117/month at p < 10^-15. Factual capacity continues to scale log-linearly with parameters across generations and across vendors.
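
论文核心的"对数线性映射"可以用最小二乘直线拟合 (IKP 准确率, log10(参数量)) 来示意。以下标定点为假设数据,仅用于说明由准确率反推参数量下界的流程,并非论文的真实标定结果。

```python
import math

# 假设的 (IKP 准确率, 参数量) 标定点,仅作演示,非论文数据
data = [(0.20, 1e9), (0.35, 7e9), (0.50, 70e9), (0.62, 400e9)]

xs = [acc for acc, _ in data]
ys = [math.log10(n) for _, n in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
# 一元最小二乘:log10(参数量) ≈ intercept + slope * 准确率
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def estimate_params(acc):
    # 由 IKP 准确率估计有效参数量(下界;拒答会压低准确率从而压低估计)
    return 10 ** (intercept + slope * acc)

print(f"{estimate_params(0.45):.3e}")
```

对重度安全对齐的闭源模型,拒答会隐藏部分"已知但拒绝回答"的容量,因此该映射给出的始终是下界估计。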

[AI-87] A Comparative Evaluation of AI Agent Security Guardrails

【速读】:该论文旨在解决AI代理(AI agent)在实际应用中面临的安全风险识别与防护问题,特别是针对两类关键风险:一是对代理自身构成威胁的行为(如指令覆盖、间接注入攻击、工具滥用),二是旨在诱导生成有害内容的请求(如仇恨言论、色情内容、暴力信息)。解决方案的关键在于构建并评估一个名为DKnownAI Guard的防护机制,其通过高召回率(96.5%)和优异的真负率(TNR 90.4%)展现出优于AWS Bedrock Guardrails、Azure Content Safety及Lakera Guard的综合检测能力,从而为AI代理提供更可靠的安全保障。

链接: https://arxiv.org/abs/2604.24826
作者: Qi Li,Jiu Li,Pingtao Wei,Jianjun Xu,Xueyi Wei,Jiwei Shi,Xuan Zhang,Yanhui Yang,Xiaodong Hui,Peng Xu,Lingquan Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail’s ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5% and ranks first in true negative rate (TNR) at 90.4%, delivering the best overall performance among all evaluated guardrails.
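
报告中的两项核心指标(召回率与真负率 TNR)的计算方式如下。标注与预测序列均为假设的玩具数据,仅用于说明指标定义:1 表示"有风险(应拦截)",0 表示"无风险"。

```python
def recall_tnr(labels, preds):
    # labels/preds: 1 = 有风险(应拦截), 0 = 无风险
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    tn = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)   # 召回率, 真负率

labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 人工标注(玩具数据)
preds  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]   # 某防护机制的判定
recall, tnr = recall_tnr(labels, preds)
print(recall, tnr)   # 0.75 与 5/6
```

高召回意味着风险请求很少漏网,高 TNR 意味着正常请求很少被误拦,二者需同时兼顾。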

[AI-88] Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

【速读】:该论文旨在解决长上下文(long context)下大语言模型推理过程中因KV缓存访问频繁导致的计算与内存带宽压力剧增问题,现有加速器在处理长序列时性能显著下降。其解决方案的关键在于软硬件协同设计:软件层面提出双压缩动态稀疏注意力机制(dual-compression dynamic sparse attention),融合超低精度量化与特征稀疏性以最小化预测开销,并引入硬件友好的近似Top-K选择将过滤复杂度从O(n log k)降低至O(n);硬件层面则通过深度优化计算与内存访问策略,针对稀疏注意力与长序列之间的复杂交互关系进行针对性改进,并建立性能模型以确定最优协同设计方案,最终实现全流水线并行架构,在长序列下仍保持O(n)效率。

链接: https://arxiv.org/abs/2604.24820
作者: Wang Fan,Wei Cao,Xi Zha,Kedi Ma,MingQian Sun,Jialin Chen,Fengzhe Zhang,Fan Zhang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long contexts improve capabilities of large language models but pose serious hardware challenges: compute and memory footprints grow linearly with sequence length. Particularly, the decoding phase continuously accesses massive KV cache, dramatically increasing bandwidth and computing pressure. Existing accelerators are primarily designed and evaluated for short contexts. They suffer from significant performance degradation when processing long contexts. To bridge this gap, we identify the major bottleneck and present a hardware accelerator for long context attention decoding via hardware-software co-design. On the software side, we propose dual-compression dynamic sparse attention. It combines ultra-low-precision quantization with feature sparsity to minimize prediction overhead. A hardware-friendly approximate Top-K selection further reduces filter complexity from O(n log k) to O(n). On the hardware side, we deeply optimize compute and memory access to tackle bottlenecks from intricate interplay between sparse attention and long contexts, and establish a performance model to derive the optimal co-design scheme. The resulting hardware adopts a fully pipelined parallel architecture and achieves O(n) efficiency even for long sequences. Experiments show that our design delivers 3.82x speedup and 74.19x energy efficiency over A100. Compared to SOTA accelerators, this is the first ASIC accelerator that efficiently supports long context inference, with at least 3.5x higher throughput and 2.08x better energy efficiency.
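
"硬件友好的 O(n) 近似 Top-K"的一种常见思路,是先用低精度直方图确定一个分数阈值,再做一遍线性过滤,选出的集合保证包含真实的 Top-K(可能略多)。以下草图仅示意这一思路,桶数、分数等均为假设值,并非 Salca 的实际电路实现。

```python
def approx_topk(scores, k, buckets=16):
    # O(n) 近似 Top-K:第一遍建低精度直方图定阈值,第二遍线性过滤,
    # 避免 O(n log k) 的精确堆选择(概念性草图)
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / buckets or 1.0
    hist = [0] * buckets
    for s in scores:
        hist[min(buckets - 1, int((s - lo) / width))] += 1
    # 自高分桶向低分桶累计,直到覆盖至少 k 个元素
    count, b = 0, buckets - 1
    while b >= 0 and count < k:
        count += hist[b]
        b -= 1
    threshold = lo + (b + 1) * width
    return [i for i, s in enumerate(scores) if s >= threshold]

scores = [((i * 37) % 101) / 101 for i in range(200)]   # 确定性伪随机分数
sel = approx_topk(scores, k=10)
print(len(sel))   # 至少 10 个,且包含全部真实 Top-10
```

由于阈值取在桶边界上,选中集合是真实 Top-K 的超集,适合后续在稀疏注意力中做保守过滤。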

[AI-89] Programming with Data: Test-Driven Data Engineering for Self-Improving LLM s from Raw Corpora

【速读】:该论文旨在解决如何可靠地将人类专业知识从文本中迁移至大型语言模型(Large Language Models, LLMs)这一根本性挑战。现有方法依赖于在领域语料库上微调(fine-tuning),但该过程缺乏反馈机制:当模型在特定任务上失败时,无法诊断训练数据中的缺陷,只能盲目增加数据。论文的关键解决方案是引入一种结构化的知识表示(structured knowledge representation),作为训练数据与评估基准的共享基础,从而将整个数据工程生命周期映射到软件开发生命周期——训练数据成为源代码,模型训练相当于编译,基准测试相当于单元测试,而基于失败的数据修复则等价于调试。这一框架使得模型失败可被分解为概念级缺失和推理链断裂,并能追溯至具体的数据缺陷,通过针对性修补实现跨模型规模与架构的一致性改进,且不损害通用能力。该方法被形式化为“以数据编程”(Programming with Data),并在十六个学科领域验证其有效性。

链接: https://arxiv.org/abs/2604.24819
作者: Chenkai Pan,Xinglong Xu,Yuhang Xu,Yujun Wu,Siyuan Li,Jintao Chen,Conghui He,Jingxuan Wei,Cheng Tan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 57 pages, 28 figures, 14 tables

点击查看摘要

Abstract:Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.

[AI-90] SWE-QA: A Dataset and Benchmark for Complex Code Understanding

【速读】:该论文旨在解决现有代码理解基准测试与实际软件开发中复杂推理需求之间的差距问题,即当前评估任务多聚焦于孤立的代码片段,而开发者在真实场景中需跨多个分散代码段进行信息关联和推理。解决方案的关键在于构建SWE-QA数据集,这是一个包含9,072道多项选择题的文本与代码语料库,系统性地从12个Python仓库(SWE-bench)中生成,涵盖如“声明与调用”(Declaration-and-Call)和“交互实体”(Interacting-Entity)等典型多跳推理模式;其核心创新在于通过基于解析的实体提取与大语言模型辅助的问题构造,并辅以精心设计的干扰项验证,有效区分真正的代码理解能力与表面模式匹配,从而为多跳代码理解提供更贴近实践的评估基准。

链接: https://arxiv.org/abs/2604.24814
作者: Laïla Elkoussy(LRE, EPITA),Julien Perez(EPITA, LRE)
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce SWE-QA, a text and code corpus aimed at benchmarking multi-hop code comprehension, addressing the gap between simplified evaluation tasks and the complex reasoning required in real-world software development. While existing code understanding benchmarks focus on isolated snippets, developers must routinely connect information across multiple dispersed code segments. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories of SWE-bench, evaluating several recurrent reasoning patterns like Declaration-and-Call questions that link entity definitions to their usage, and Interacting-Entity questions that examine the dynamic relationships among multiple collaborating components. Generated through parsing-based entity extraction and Large Language Model assisted question construction with carefully validated distractors, the benchmark distinguishes genuine comprehension from superficial pattern matching. Evaluation of 15 language models (360M to 671B parameters) reveals significant challenges in multi-hop reasoning, with best performance reaching 74.41% accuracy. Dense architectures consistently outperform mixture-of-experts models by 10-14 percentage points, while reasoning-enhanced variants show inconsistent benefits.

[AI-91] Time-varying Interaction Graph ODE for Dynamic Graph Representation Learning

【速读】:该论文旨在解决现有图神经微分方程(Graph Neural Ordinary Differential Equations, Graph Neural ODE)在动态图场景下难以捕捉节点间交互模式的多样性与时变性问题。其核心挑战在于,传统方法通常采用统一的消息传递机制,假设任意时刻节点间的交互遵循相同的函数形式,从而限制了对复杂动态关系的建模能力。解决方案的关键在于提出时间可变交互图微分方程(Time-varying Interaction Graph Ordinary Differential Equations, TI-ODE),通过将图ODE的演化函数分解为一组可学习的交互基函数(interaction basis functions),每类基函数对应一种特定类型的节点交互模式,并利用时变的可学习权重动态组合这些基函数,使节点间交互模式能够自适应地随时间演化,从而实现更灵活、精准且具解释性的动态图表示学习。

链接: https://arxiv.org/abs/2604.24811
作者: Xiaoyi Wang,Zhiqiang Wang,Jianqing Liang,Xingwang Zhao,Chuangyin Dang,Zhen Jin,Jiye Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph neural Ordinary Differential Equations (ODE) combine neural ODE with the message passing mechanism of Graph Neural Networks (GNN), providing a continuous-time modeling method for graph representation learning. However, in dynamic graph scenarios, existing graph neural ODEs typically employ a unified message passing mechanism, assuming that inter-node interactions share the same message passing function at any time, which makes it challenging to capture the diversity and time-varying nature of inter-node interaction patterns. To address this, we propose Time-varying Interaction Graph Ordinary Differential Equations (TI-ODE). The core idea of TI-ODE is to decompose the evolution function of a graph ODE into a set of learnable interaction basis functions, where each basis function corresponds to a distinct type of inter-node interaction. These basis functions are dynamically combined through time-dependent learnable weights, enabling inter-node interaction patterns to adaptively evolve over time. Experimental results on six dynamic graph datasets demonstrate that TI-ODE consistently outperforms existing methods and achieves state-of-the-art performance on attribute prediction tasks, and experiments on the Covid dataset further verify the interpretability and generalizability of our TI-ODE. Furthermore, we demonstrate both theoretically and empirically that TI-ODE exhibits superior robustness compared to models utilizing a unified message-passing mechanism.
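下面给出 TI-ODE 核心思想的一个极简示意:将演化函数表示为若干"交互基函数"的时变加权组合,再用欧拉法积分。代码仅为玩具演示,两个基函数(扩散与排斥)和权重均为假设性示例,并非论文的实际实现。

```python
import numpy as np

def ti_ode_step(x, adj, bases, weights_t, dt=0.01):
    """One Euler step of a toy time-varying interaction ODE.

    x         : (n, d) node states
    adj       : (n, n) adjacency matrix
    bases     : list of K functions f_k(x, adj) -> (n, d) derivatives
    weights_t : (K,) time-dependent mixture weights (softmax-normalized)
    """
    w = np.exp(weights_t - weights_t.max())
    w /= w.sum()  # normalize so the bases form a convex mixture
    dx = sum(wk * fk(x, adj) for wk, fk in zip(w, bases))
    return x + dt * dx

# Two hypothetical interaction bases: diffusion toward the neighbor mean,
# and repulsion away from it.
diffuse = lambda x, a: a @ x / np.maximum(a.sum(1, keepdims=True), 1) - x
repel   = lambda x, a: x - a @ x / np.maximum(a.sum(1, keepdims=True), 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
adj = np.ones((4, 4)) - np.eye(4)  # complete graph on 4 nodes
x1 = ti_ode_step(x, adj, [diffuse, repel], np.array([2.0, 0.0]))
```

当权重偏向扩散基函数时,节点状态向邻居均值收缩;调整 `weights_t` 即可让交互模式随时间切换,这正是"时变组合"想表达的性质。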

[AI-92] A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

【速读】:该论文旨在解决边缘计算环境中深度神经网络部署面临的能量消耗与延迟约束问题,核心挑战在于如何在保证预测准确性的前提下动态平衡计算成本与推理延迟。解决方案的关键在于改进自适应深度神经网络(Adaptive Deep Neural Networks, ADNNs)中的多臂赌博机(Multi-Armed Bandit, MAB)策略,引入四种新型上置信界(Upper Confidence Bound, UCB)算法——UCB-V、UCB-Tuned、UCB-Bayes 和 UCB-BwK,并首次系统比较它们在准确性-延迟和准确性-能耗权衡上的表现。实验表明,这些策略均实现次线性累积遗憾,其中UCB-Bayes收敛最快,而UCB-V与UCB-Tuned在帕累托前沿上最优,展现出良好的实用性与适应性。

链接: https://arxiv.org/abs/2604.24810
作者: Grigorios Papanikolaou,Ioannis Kontopoulos,Konstantinos Tserpes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a significant challenge. Therefore, smart and adaptive inference strategies that dynamically balance computational cost or latency with predictive accuracy are critical in edge computing scenarios. In this work, we build on Adaptive Deep Neural Networks (ADNNs) that employ the Multi-Armed Bandit (MAB) framework. Current literature leverages the first version of the Upper Confidence Bound (UCB1) strategy to dynamically select the optimal confidence threshold, enabling efficient early exits without sacrificing accuracy. However, we introduce four additional Upper Confidence Bound strategies in ADNNs, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and perform, for the first time, a comparative study of these strategies with respect to trade-offs between accuracy, energy consumption, and latency. The proposed UCB strategies are employed on the ResNet and MobileViT neural networks, and are evaluated on the benchmark datasets of CIFAR-10, CIFAR-10.1, and CIFAR-100. Experimental results demonstrate that all strategies achieve sub-linear cumulative regret, with UCB-Bayes converging the fastest, followed by UCB-Tuned and UCB-V. Finally, UCB-V and UCB-Tuned dominate the Pareto Frontiers of accuracy-latency and accuracy-energy trade-offs.
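UCB 类算法的核心只是给经验均值加上不同形式的置信上界。下面按文献中的标准公式示意 UCB1 与 UCB-Tuned 的指数计算;这只演示指数本身,与论文中 ADNN 置信阈值选择的具体集成方式无关。

```python
import math

def ucb1_index(mean, n, t):
    """UCB1: empirical mean plus the sqrt(2 ln t / n) exploration bonus."""
    return mean + math.sqrt(2 * math.log(t) / n)

def ucb_tuned_index(mean, var, n, t):
    """UCB-Tuned: variance-aware bonus, capped at the Bernoulli worst case 1/4."""
    v = var + math.sqrt(2 * math.log(t) / n)
    return mean + math.sqrt(math.log(t) / n * min(0.25, v))

# Pick the arm with the largest index (toy arm statistics for illustration).
arms = [(0.60, 0.04, 50), (0.55, 0.01, 30)]  # (mean, variance, pulls)
t = sum(n for _, _, n in arms)
best = max(range(len(arms)), key=lambda i: ucb_tuned_index(*arms[i], t))
```

低方差臂从 UCB-Tuned 得到比 UCB1 更小的探索奖励,这正是论文比较这些变体时关注的差异来源之一。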

[AI-93] Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

【速读】:该论文旨在解决小语言模型在严格参数和推理预算下实现高效推理的问题,特别是如何在保持长上下文处理能力的同时提升模型表达力。解决方案的关键在于提出一种混合骨干结构,其中两个线性时间谱序列算子(SeqCond Attention, SCA)层与一个Transformer层交替排列,该设计融合了结构化序列模型的长距离效率与状态跟踪优势,同时保留了注意力机制的token-to-token路由能力;此外,SCA读出机制被证明可在连续极限下精确检索前缀摘要中的任意token,并能以softmax注意力为特例重现其输出,从而确保SCA至少具有全自注意力的表达能力。

链接: https://arxiv.org/abs/2604.24809
作者: Maixent Chenebaux
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a linear-time spectral sequence operator inspired by SeqCondenser, alternate with one transformer layer. This design aims to retain the long-context efficiency and state-tracking benefits of structured sequential models while preserving the expressive token-to-token routing of attention. The model was trained on a single Cloud TPU v4-64 pod slice provided through the Google TPU Research Cloud (TRC) program; the subsequent reinforcement learning stage was carried out on a single NVIDIA DGX Spark. We prove that the SCA readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA is at least as expressive as full self-attention in the continuous limit. We also describe the training data pipeline and outline a reinforcement learning stage specialized for reasoning, verification, and response quality.

[AI-94] Architecture Determines Observability in Transformers

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在推理过程中产生自信但错误预测的问题,即“自信错误”(confident errors),并探索如何通过监控模型内部状态来识别这些错误。其核心挑战在于:现有激活监测方法仅在模型保留了未被输出置信度掩盖的决策质量信号时才有效,而这种信号的可观察性(observability)取决于模型架构与训练策略。解决方案的关键在于定义并量化“可观测性”——即在控制最大 softmax 置信度和激活范数后,从冻结的中间层激活中线性可读取每 token 决策质量的能力。研究表明,可观测性并非 Transformer 模型的普遍属性,而是高度依赖于具体配置(如层数、头数),且在特定架构下会随着训练过程逐渐消失;此外,基于 WikiText 训练的观测器可在不微调的情况下迁移至下游问答任务中,以 20% 的标记率捕获 10.9–13.4% 的原模型错误,显著超越单纯依赖输出置信度的检测方式。

链接: https://arxiv.org/abs/2604.24801
作者: Thomas Carmichael
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 8 figures, 11 tables. Code and data: this https URL

点击查看摘要

Abstract:Autoregressive transformers make confident errors, but activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. This preservation is determined by architecture and training recipe. We define observability as the linear readability of per-token decision quality from frozen mid-layer activations after controlling for max-softmax confidence and activation norm. The correction is essential. Confidence controls absorb 57.7% of raw probe signal on average across 13 models in 6 families. Observability is not a generic property of transformers. In Pythia’s controlled suite, every tested run with the 24-layer, 16-head configuration collapses to rho_partial ~0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band from 0.21 to 0.38. The output-controlled residual collapses at the same points, and neither tested nonlinear probes nor layer sweeps recover healthy-range signal. Checkpoint dynamics show the collapse is emergent during training. Both configurations at matched hidden dimension form the signal at the earliest measured checkpoint, but training erases it in the (24L, 16H) class while predictive loss continues improving. Across independent recipes the collapse map changes but the phenomenon persists. Qwen 2.5 and Llama differ by 2.9x at matched 3B scale with probe seed distributions that do not overlap, while Mistral 7B preserves observability where Llama 3.1 8B collapses despite similar broad architecture. A WikiText-trained observer transfers to downstream QA without training on those tasks, catching errors confidence misses. At 20% flag rate, its exclusive catch rate is 10.9-13.4% of all errors in seven of nine model-task cells. Architecture selection is a monitoring decision.
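该文的 observability 本质上是在控制 max-softmax 置信度与激活范数之后的偏相关。下面是偏相关的一个通用 numpy 示意(线性回归残差法);变量名与构造数据均为演示用假设,并非论文的探针流程。

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation of x and y after linearly regressing out `controls`."""
    Z = np.column_stack([np.ones(len(x)), controls])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy check: two signals that correlate only through a shared 'confidence' control.
rng = np.random.default_rng(0)
conf = rng.normal(size=1000)
x = conf + 0.01 * rng.normal(size=1000)
y = conf + 0.01 * rng.normal(size=1000)
raw = float(np.corrcoef(x, y)[0, 1])             # high raw correlation
ctrl = partial_corr(x, y, conf.reshape(-1, 1))   # near zero once controlled
```

这解释了摘要中"confidence controls absorb 57.7% of raw probe signal"的含义:未经控制的相关性会把置信度本身能解释的部分也算进去。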

[AI-95] Semantic Denial of Service in LLM-controlled robots

【速读】:该论文旨在解决大语言模型(Large Language Models, LLM)控制机器人时因安全导向指令遵循机制(safety-oriented instruction-following)所引入的新型安全威胁——即通过向机器人音频通道注入简短且看似合理的安全提示词(1–5 tokens),触发模型的安全推理逻辑,从而导致执行中断或扰动,形成语义拒绝服务攻击(semantic denial-of-service attack)。其关键发现在于:当前主流的提示级防御策略虽能在一定程度上抑制硬性停止类攻击,但会将攻击效果转化为新的干扰形式(如确认循环和虚假警报),且无法根本消除系统对未认证音频文本的依赖;因此,论文指出真正的解决方案应聚焦于系统架构层面,避免将未经身份验证的音频文本直接输入LLM,以切断安全监控与动作选择之间的可被利用的安全依赖链。

链接: https://arxiv.org/abs/2604.24790
作者: Jonathan Steinberg,Oren Gal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety-oriented instruction-following is supposed to keep LLM-controlled robots safe. We show it also creates an availability attack surface. By injecting short safety-plausible phrases (1-5 tokens) into a robot’s audio channel, an adversary can trigger the model’s safety reasoning to halt or disrupt execution without jailbreaking the model or overriding its policy. In the embodied setting, this is a semantic denial-of-service attack: the agent stops because the injected signal looks like a legitimate alert. Across four vision-language models, seven prompt-level defenses, three deployment modes, and single- and multi-injection settings, we find that prompt-only defenses trade off attack suppression against genuine hazard response. The strongest defenses reduce hard-stop attack success on some models, but defenses change the form of disruption, not its fact: suppressed hard stops re-emerge as acknowledge loops and false alerts, which we measure with Disruption Success Rate (DSR). We further find that injection variety is consistently more effective than repeating the same phrase, suggesting that models treat diverse safety cues as corroborating evidence. The practical implication is architectural rather than prompt-level: systems that route unauthenticated audio text directly into the LLM create an avoidable security dependency between safety monitoring and action selection.

[AI-96] Liquid Neural Network Models for Natural Gas Spot Price Time-Series Forecasting

【速读】:该论文旨在解决天然气价格短期预测中的高波动性与非线性动态问题,传统时间序列模型在面对季节性需求变化、地缘政治事件及宏观经济波动等复杂因素时效果受限。其解决方案的关键在于引入液态神经网络(Liquid Neural Networks, LNNs),该模型通过持续更新内部状态来适应随时间演变的模式,从而有效捕捉非平稳市场价格行为,提升在动荡市场环境下的预测精度,进而降低不确定性并增强能源交易与电力市场决策支持能力。

链接: https://arxiv.org/abs/2604.24788
作者: Yiqian Liu,Jiayi Niu,Adam Kelleher,Subhabrata Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural gas is undoubtedly an essential component of the global energy system. Accurate short-term forecasting of natural gas price is challenging due to pronounced volatility driven by seasonal demand patterns, geopolitical developments, and shifting macroeconomic conditions. The nonlinear dynamics and frequent regime changes can limit the effectiveness of traditional time-series models. In this study, we explore the use of Liquid Neural Networks (LNNs) for short-horizon forecasting of the Henry Hub spot price, a primary benchmark for pricing. LNNs are designed to adapt continuously to evolving temporal patterns through dynamic internal state updates, making them well suited for nonstationary price behavior. By improving forecast accuracy in volatile market conditions, this work aims to reduce uncertainty and enhance decision support across energy trading and power market applications.
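液态神经网络的"液态"之处在于其有效时间常数随输入变化。下面按常见的 liquid time-constant(LTC)形式给出单步欧拉更新的示意;该公式形式与全部参数均为通用示例,并非该文所用的具体模型或数据。

```python
import numpy as np

def ltc_step(x, I, tau, W, b, A, dt=0.05):
    """One Euler step of a liquid time-constant style cell:
    dx/dt = -(1/tau + f(x, I)) * x + f(x, I) * A,
    so the effective time constant adapts to the input signal."""
    f = 1.0 / (1.0 + np.exp(-(W @ np.concatenate([x, I]) + b)))  # gating nonlinearity
    return x + dt * (-(1.0 / tau + f) * x + f * A)

rng = np.random.default_rng(1)
h, m = 3, 2                                   # hidden units, input features
W, b = rng.normal(size=(h, h + m)), rng.normal(size=h)
x = np.zeros(h)
for t in range(100):                          # drive with a slow sinusoid
    I = np.array([np.sin(0.1 * t), np.cos(0.1 * t)])
    x = ltc_step(x, I, tau=1.0, W=W, b=b, A=1.0)
```

由于时间常数项 `1/tau + f` 依赖输入,同一个单元在不同市场状态下衰减速度不同,这被认为是 LNN 适合非平稳价格序列的原因。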

[AI-97] Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers

【速读】:该论文旨在解决在工业控制与国防等对数据隐私、延迟和成本敏感的场景中,如何高效部署生成式 AI(Generative AI)模型的问题。现有方案受限于云端部署带来的隐私风险、高延迟及成本问题,而边缘端部署虽具潜力,但因配置空间维度高、评估方法单一,难以确定最优部署策略。解决方案的关键在于提出一种多维基准测试方法,综合评估推理性能与硬件效率,在四种适合物联网(IoT)的边缘平台配置下进行实证分析,明确硬件加速器(如NPU和GPU)的作用,并量化功率效率、设备体积与每秒令牌吞吐量之间的权衡关系,从而为资源受限环境下的生成式 AI 部署提供可操作的技术指导。

链接: https://arxiv.org/abs/2604.24785
作者: Harri Renney,Fouad Trad,Michael Mattarock,Zena Wood
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are becoming increasingly capable at small parameter scales. At the same time, conventional cloud-centric deployment introduces challenges around data privacy, latency, and cost that are acute in operational technology and defence environments. Advances in model distillation, quantisation, and affordable edge accelerators now make local LLM inference on single-board computers feasible, but the high dimensionality of the configuration space makes identifying optimal deployments difficult without structured evaluation. Existing LLM-specific edge benchmarking efforts rely on CPU-only inference, poor coverage of genuine single-board computers, and generic evaluation tasks that lack multi-dimensional assessment of hardware effectiveness. This paper proposes a multi-dimensional benchmarking methodology that jointly evaluates inference performance and hardware efficiency across four IoT-suitable edge platform configurations testing single-board computers with the latest available hardware accelerators. Our results reveal the benefits of using hardware accelerators such as NPUs and GPUs, along with multi-dimensional evaluations quantifying the trade-offs between power efficiency, physical device size and token throughput; offering practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations.

[AI-98] Comparative Study of Bending Analysis using Physics-Informed Neural Networks and Numerical Dynamic Deflection in Perforated nanobeam

【速读】:该论文旨在解决微纳尺度下穿孔纳米梁在正弦载荷作用下的静力弯曲响应与动力挠度之间的关系建模问题,尤其关注不同穿孔形式对二者耦合特性的影响。解决方案的关键在于提出一种基于物理信息约束的函数连接框架(Physics-Informed Functional Link Constrained Framework with Domain Mapping, DFL-TFC),其核心创新在于将控制微分方程(DE)的约束通过理论上的函数连接(Theory of Functional Connections, TFC)精确嵌入到约束表达式(Constrained Expression, CE)中,从而严格满足初始条件(ICs)和边界条件(BCs),同时利用域映射技术将原定义域转换至正交多项式空间以提升数值稳定性;此外,自由函数由功能链接神经网络(Functional Link Neural Network, FLNN)表示,通过最小化残差均方误差实现优化训练,无需构建复杂的深度网络结构即可获得高精度解,显著优于传统物理信息神经网络(PINN)方法。

链接: https://arxiv.org/abs/2604.24768
作者: Ramanath Garai,Iswari Sahu,S. Chakraverty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:In this chapter, we investigate the bending behavior of a perforated nanobeam subjected to sinusoidal loading using an efficient and computationally robust Physics-Informed Functional Link Constrained Framework with Domain Mapping (DFL-TFC) method. Our aim is to determine the relationship between static bending response and dynamic deflection of a perforated nanobeam for various perforation cases. The static bending is obtained using the FL-TFC with Domain mapped method, whereas dynamic deflection is determined using the Galerkin method. The proposed approach employs the theory of functional connections (TFC) to systematically embed governing differential equation constraints into a constrained expression (CE), which exactly satisfies all prescribed initial and boundary conditions (ICs and BCs) and domain of differential equation is mapped to domain of orthogonal polynomials. Within this framework, the free function appearing in the constrained expression is expressed through a functional link neural network (FLNN). The cost is minimized by the mean square residual of DE, allowing training without requiring complex deep network architectures. Relationship between static and dynamic defection of simply-supported (S-S) perforated nanobeams has been investigated here. FL-TFC with Domain mapped method eliminates the need for deep and complex neural network architectures while ensuring accuracy, efficiency, and strict satisfaction of boundary conditions as compared to standard PINN.
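TFC 的核心是构造一个无论自由函数 g 如何取值都严格满足边界条件的约束表达式(CE)。下面以两点 Dirichlet 边界条件为例示意;简支梁还需约束二阶导数,此处仅演示最基本的形式,g 在论文中由 FLNN 给出,这里用任意解析函数代替。

```python
import numpy as np

def constrained_expr(g, x, y0, y1):
    """TFC constrained expression: satisfies y(0)=y0 and y(1)=y1 exactly
    for ANY free function g (the role played by the FLNN in the paper)."""
    return g(x) + (1 - x) * (y0 - g(0.0)) + x * (y1 - g(1.0))

x = np.linspace(0.0, 1.0, 101)
# Two arbitrary free functions; the boundary values hold regardless.
y = constrained_expr(lambda s: np.sin(3 * s) + s**2, x, 0.0, 0.0)
y2 = constrained_expr(lambda s: np.exp(s), x, 1.0, -2.0)
```

训练只需优化 g 使微分方程残差最小,而边界条件由 CE 的代数结构精确保证,这正是摘要所说"strict satisfaction of boundary conditions"的来源。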

[AI-99] GCA-BULF: A Bottom-Up Framework for Short-Term Load Forecasting Using Grouped Critical Appliances

【速读】:该论文旨在解决传统短时负荷预测(Short-Term Load Forecasting, STLF)方法在应对分时电价和阶梯电价激励下用户峰谷转移行为时精度不足的问题。现有自上而下(top-down)方法难以捕捉多样化负载的复杂模式,而自下而上(bottom-up)方法虽能提升精度但因需监测全部高功率电器导致成本过高,且并非所有设备对总负荷预测均有显著贡献。解决方案的关键在于提出GCA-BULF框架,其核心创新为基于分组关键电器(Grouped Critical Appliances, GCA)的建模策略:首先通过关键电器筛选模块识别对总负荷影响最大的设备;其次利用相关电器分组模块依据时空相关性对关键电器进行聚类;最后通过协同负荷预测模块融合多组负荷预测结果以优化整体预测性能。此设计显著提升了预测准确率,实验证明相较主流方法提升达20.85%–92.48%。

链接: https://arxiv.org/abs/2604.24766
作者: Yunhao Yao,Jinwei Fang,Puhan Luo,Zhiqiang Wang,Jiahui Hou,Xiang-Yang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 10 pages, 12 figures

点击查看摘要

Abstract:With the rise of time-of-use and tiered electricity pricing, energy consumers are encouraged to adopt peak-shifting strategies by automatically controlling high-power appliances. These help lower energy costs while enhancing the power grid’s stability. To support such energy management with high resilience and responsiveness, reliable short-term load forecasting (STLF) plays a critical role. STLF predicts electricity consumption over time horizons ranging from minutes to days, using historical data, temporal patterns, and contextual factors. Traditional top-down forecasting methods struggle to capture the complex consumption patterns of diverse and mixed appliance loads. Although bottom-up methods improve forecasting accuracy by integrating appliance-level data, monitoring all appliances is costly, and many do not meaningfully impact total load prediction. Therefore, we propose GCA-BULF, a bottom-up short-term load forecasting framework based on grouped critical appliances, supported by three key designs. First, the Critical Appliance Filtering module ranks appliances according to their power consumption, switching frequency, and usage pattern periodicity, and identifies critical ones through iterative load decomposition. Next, the Related Appliance Grouping module clusters these appliances based on spatial and temporal correlations for group-level forecasting. Finally, the Collaborative Load Forecasting module refines the total load prediction by combining multiple group-level forecasts. We evaluate GCA-BULF on residential and office building load forecasting tasks. Experimental results reveal that GCA-BULF improves hourly total load forecasting by 20.85%-57.88% compared to existing top-down methods and by 33.03%-92.48% compared to bottom-up methods.
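"相关电器分组"的思想可以用一个非常粗糙的相关性阈值贪心聚类来示意;论文的分组模块基于时空相关性,下面只是演示性替代,阈值与数据均为假设。

```python
import numpy as np

def group_by_correlation(loads, thresh=0.8):
    """Greedy grouping of appliance load series whose pairwise correlation
    exceeds `thresh` (illustrative stand-in for the paper's Related
    Appliance Grouping module).

    loads: (num_appliances, T) array of load time series.
    """
    corr = np.corrcoef(loads)
    groups, assigned = [], set()
    for i in range(len(loads)):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, len(loads))
                       if j not in assigned and corr[i, j] >= thresh]
        assigned.update(group)
        groups.append(group)
    return groups

# Two pairs of correlated toy 'appliances'.
t = np.linspace(0, 4 * np.pi, 200)
rng = np.random.default_rng(0)
loads = np.stack([
    np.sin(t),
    np.sin(t) + 0.05 * rng.normal(size=t.size),
    np.cos(t),
    np.cos(t) + 0.05 * rng.normal(size=t.size),
])
groups = group_by_correlation(loads)
```

分组之后即可按组预测再求和,这是自下而上预测相对逐电器预测降低成本的关键一步。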

[AI-100] Back to Repair: A Minimal Denoising Network for Time Series Anomaly Detection

【速读】:该论文旨在解决时间序列异常检测(Time Series Anomaly Detection)中模型复杂度与性能之间不平衡的问题,即传统方法往往依赖于复杂的架构(如注意力机制、潜在变量或对抗训练)来提升检测精度,但这些设计未必能有效提升性能。其解决方案的关键在于提出一种极简的去噪网络 JuRe(Just Repair),通过正确实现流形投影原理(manifold-projection principle)来驱动异常检测:JuRe 仅由一个隐藏维度为 128 的深度可分离卷积残差块构成,训练目标是修复被污染的时间窗口,并在推理阶段使用固定且无参数的结构差异函数进行评分。实验表明,这种基于去噪任务的简单架构在多个基准数据集上优于所有神经基线模型,且组件消融实验证实训练时的扰动策略是性能提升的核心因素,而非网络容量本身。

链接: https://arxiv.org/abs/2604.17388
作者: Kadir-Kaan Özer,René Ebeling,Markus Enzweiler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, 5 tables

点击查看摘要

Abstract:We introduce JuRe (Just Repair), a minimal denoising network for time series anomaly detection that exposes a central finding: architectural complexity is unnecessary when the training objective correctly implements the manifold-projection principle. JuRe consists of a single depthwise-separable convolutional residual block with hidden dimension 128, trained to repair corrupted time series windows and scored at inference by a fixed, parameter-free structural discrepancy function. Despite using no attention, no latent variable, and no adversarial component, JuRe ranks second on the TSB-AD multivariate benchmark (AUC-PR 0.404, 180 series, 17 datasets) and second on the UCR univariate archive by AUC-PR (0.198, 250 series), leading all neural baselines on AUC-PR and VUS-PR. Component ablation on TSB-AD identifies training-time corruption as the dominant factor (ΔAUC-PR = 0.047 on removal), confirming that the denoising objective, not network capacity, drives detection quality. Pairwise Wilcoxon signed-rank tests establish statistical significance against 21 of 25 baselines on TSB-AD. Code is available at this https URL.
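"修复—打分"这一思路可以脱离神经网络来示意:下面用中值滤波充当"修复器",用逐点绝对误差充当参数无关的差异函数。这只演示原理,并非 JuRe 的实际网络或打分函数。

```python
import numpy as np

def repair_score(window, k=5):
    """Score anomalies by the discrepancy between a window and its
    'repaired' version. A median filter stands in for JuRe's learned
    repair network; plain absolute error stands in for its fixed,
    parameter-free discrepancy."""
    pad = k // 2
    padded = np.pad(window, pad, mode="edge")
    repaired = np.array([np.median(padded[i:i + k]) for i in range(len(window))])
    return np.abs(window - repaired)

t = np.linspace(0, 6 * np.pi, 300)
x = np.sin(t)
x[150] += 5.0                      # injected point anomaly
score = repair_score(x)
```

正常点被"修复"后几乎不变,得分接近零;异常点被投影回正常流形,修复前后差异大,这就是摘要所说 manifold-projection 原理的直观形式。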

[AI-101] A Quantitative Definition of Intelligence

【速读】:该论文试图解决如何对任意物理系统的智能进行操作性、定量定义的问题,旨在建立一个跨基质的连续统一体系,从逻辑门到大脑均能适用。其解决方案的关键在于提出“智能密度”(intelligence density)这一指标,即系统独立输出的对数与其总描述长度之比;若系统描述长度随输出数量增长则为记忆(memorization),若描述长度固定而输出数量发散则为知识(knowledge),其中知识的本质是泛化能力——即单一有限机制可生成无界输入范围内的正确输出。进一步地,作者通过定义输出的上下文性(contextuality)为给定先前输出条件下其条件柯尔莫哥洛夫复杂度的倒数,将正确性与独立性统一为单一判据,并据此反驳了塞尔(Searle)关于语法不足以产生语义的第三前提,在所有正确性可规约的领域中成立。

链接: https://arxiv.org/abs/2604.10873
作者: Kang-Sin Choi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
备注: 27 pages; v2: syntax is semantics

点击查看摘要

Abstract:We propose an operational, quantitative definition of intelligence for arbitrary physical systems. The intelligence density of a system is the ratio of the logarithm of its independent outputs to its total description length. A system memorizes if its description length grows with its output count; it knows if its description length remains fixed while its output count diverges. The criterion for knowing is generalization. A system knows its domain if a single finite mechanism can produce correct outputs across an unbounded range of inputs, rather than storing each answer individually. The definition places intelligence on a substrate-independent continuum from logic gates to brains. We then argue that meaning over a domain is a selection and ordering of functions that produces correct outputs where correctness is specifiable. We also define a measure of contextuality of an output as the inverse of its conditional Kolmogorov complexity given the context of prior outputs, which unifies correctness and independence into a single condition. Together, these refute Searle’s third premise, that syntax is insufficient for semantics, over any domain where correctness is specifiable.
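"智能密度"的定义可以用压缩长度作为描述长度的粗糙代理来示意:同一小规则生成的输出高度可压缩(对应"知识"),逐条记忆的随机输出则不可压缩(对应"记忆")。以下仅为玩具演示,zlib 压缩长度只是 Kolmogorov 描述长度的一个粗代理,示例输出也未检验论文要求的独立性条件。

```python
import math
import random
import zlib

def intelligence_density(outputs):
    """log(#outputs) / description length, with zlib-compressed size as a
    crude stand-in for Kolmogorov description length."""
    desc_len = len(zlib.compress("\n".join(outputs).encode(), 9))
    return math.log(len(outputs)) / desc_len

rule_outputs = [str(n % 7) for n in range(500)]          # one small rule
random.seed(0)
memo_outputs = ["".join(random.choices("abcdefgh", k=6)) for _ in range(500)]
```

规则化输出的描述长度不随输出数量增长,因而密度高;随机输出的描述长度与输出数量成正比,密度趋近于零。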

[AI-102] From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在从自然语言需求中可靠地构建优化建模(Optimization Modeling)问题方面的挑战,尤其是在物流、制造、能源和公共服务等实际场景中的应用。其解决方案的关键在于提出一个模块化的代理框架——Agora-Opt,该框架结合去中心化辩论机制与读写记忆库(read-write memory bank),允许多个代理团队独立生成端到端的优化方案,并通过基于结果的辩论协议进行协调;同时,记忆库存储经求解器验证的成果及历史分歧解决记录,从而实现无需训练即可持续改进。这一设计具有跨基础模型和方法的灵活性,显著优于零样本LLM、以训练为中心的方法以及先前的代理基线,在多个公开基准上展现出最优性能,且证明了去中心化辩论相比集中式选择具有结构优势,能通过交互迭代修正初始错误方案。

链接: https://arxiv.org/abs/2604.25847
作者: Jianghao Lin,Zi Ling,Chenyu Zhou,Tianyi Xu,Ruoqing Jiang,Zizhuo Wang,Dongdong Ge
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Working Paper

点击查看摘要

Abstract:Optimization modeling underpins real-world decision-making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural-language requirements remains challenging for current large language models (LLMs). In this paper, we propose Agora-Opt, a modular agentic framework for optimization modeling that combines decentralized debate with a read-write memory bank. Agora-Opt allows multiple agent teams to independently produce end-to-end solutions and reconcile them through an outcome-grounded debate protocol, while memory stores solver-verified artifacts and past disagreement resolutions to support training-free improvement over time. This design is flexible across both backbones and methods: it reduces base-model lock-in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora-Opt achieves the strongest overall performance among all compared methods, outperforming strong zero-shot LLMs, training-centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross-checking with reusable experience, and position Agora-Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at this https URL.

[AI-103] Benchmarking bandgap prediction in semiconductors under experimental and realistic evaluation settings

【速读】:该论文旨在解决当前机器学习模型在半导体带隙(bandgap)预测中难以从计算数据泛化到实验测量的问题,尤其关注数据保真度、领域泛化能力及模型可解释性等关键挑战。其解决方案的核心在于构建一个名为RealMat-BaG的基准测试框架,该框架包含对齐晶体结构的实验带隙开放数据集,并系统评估图神经网络与传统机器学习基线模型在统计分割和领域分割下的性能表现,同时分析从密度泛函理论(DFT)计算带隙到实验带隙的迁移能力及多层次可解释性(元素属性与结构层面)。此框架为开发更可靠的材料发现学习策略提供了实验导向的评估标准。

链接: https://arxiv.org/abs/2604.25568
作者: Haolin Wang,Xianyuan Liu,Anna Jungbluth,Alexandra J. Ramadan,Robert D. J. Oliver,Haiping Lu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate bandgap prediction is crucial for semiconductor applications, yet machine learning models trained on computational data often struggle to generalize to experimental bandgap measurements. Challenges related to data fidelity, domain generalization, and model interpretability remain insufficiently addressed in existing evaluation frameworks. To bridge this gap, we introduce RealMat-BaG, a benchmark for assessing model reliability under experimentally relevant conditions. We curate an open-access dataset of experimental bandgaps with aligned crystal structures and compare graph neural networks as well as classical machine learning baselines. Our framework evaluates performance across statistical and domain-based splits, examines transfer from DFT-computed to experimental bandgaps, and analyzes interpretability at both elemental-property and structural levels. Our results reveal the fundamental generalization limitations of current bandgap prediction models and establish a benchmark aligned with experimental measurements for developing more reliable learning strategies for materials discovery.

[AI-104] Spectral bandits

【速读】:该论文旨在解决图上平滑函数的bandit问题,即在推荐系统等场景中,当每个可推荐项对应图中的一个节点且其期望收益与邻近节点相似时,如何设计算法以最小化累积遗憾(cumulative regret),同时避免遗憾随节点数量急剧增长。解决方案的关键在于引入“有效维度”(effective dimension)这一概念——该维度在真实世界图结构中通常较小,并据此提出三种算法,其性能分别呈线性或次线性地依赖于该维度,从而实现了对大规模图结构的高效在线学习。实验表明,仅需数十次节点评估即可准确估计数千个项目的用户偏好。

链接: https://arxiv.org/abs/2604.25272
作者: Tomáš Kocák,Rémi Munos,Branislav Kveton,Shipra Agrawal,Michal Valko
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in Journal of Machine Learning Research (JMLR 2020). arXiv admin note: text overlap with arXiv:2604.18420

点击查看摘要

Abstract:Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node of an undirected graph and its expected rating is similar to the one of its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret with respect to the optimal policy would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose three algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
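"图上平滑"可以用 Laplacian 谱直接检验:平滑的收益向量的能量几乎全部集中在 Laplacian 最小的若干特征向量上,这正是有效维度在真实图上较小的来源。下面在路径图上示意这种谱集中现象;有效维度的精确定义见原文,此处不复现。

```python
import numpy as np

# Path graph on n nodes: adjacency A and combinatorial Laplacian L = D - A.
n = 50
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(1)) - A
evals, evecs = np.linalg.eigh(L)            # eigenvalues in ascending order

smooth = np.sin(np.linspace(0, np.pi, n))   # slowly varying payoff over the graph
rough = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)  # alternating payoff

def spectral_energy_low(x, k=5):
    """Fraction of signal energy in the k smoothest Laplacian eigenvectors."""
    coeffs = evecs.T @ x
    return float((coeffs[:k] ** 2).sum() / (coeffs ** 2).sum())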

[AI-105] Kohn-Sham Hamiltonian from Effective Field Theory: Quasiparticle Band Narrowing from Frozen Core Dynamics

【速读】:该论文旨在解决密度泛函理论(Density Functional Theory, DFT)中Kohn-Sham (KS) 本征值与角分辨光电子能谱(Angle-Resolved Photoemission Spectroscopy, ARPES)实验测量之间长期存在的带宽偏差问题——即KS带宽在碱金属和碱土金属中系统性高估ARPES结果20–35%,且此偏差不随交换关联泛函变化。解决方案的关键在于构建非均匀电子气的有效场论(Effective Field Theory, EFT),并识别出两个核心条件:一是内层激发能与价带费米能级之间存在尺度分离,二是均匀电子气近似满足伽利略不变性(由图解蒙特卡洛验证)。由此推导出KS带为准粒子带,仅需一个冻结芯态重整化因子 $ z_{\text{core}} $ 进行修正;该因子捕捉了传统赝势忽略的动力学芯激发效应,其修正量 $ 1 - z_{\text{core}} $ 在碱金属中达20–35%,而在Al和Si中低于5%,从而解释了KS带理论的失败与成功。作者进一步提出闭合形式的后自洽场(post-SCF)修正公式,在Li、Na、K、Ca、Mg、Al和Si上验证有效,预测的准粒子带与嵌入动力学平均场理论(embedded dynamical mean-field theory)一致,计算成本极低。

链接: https://arxiv.org/abs/2604.25199
作者: Xiansheng Cai,Han Wang,Kun Chen
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Kohn-Sham (KS) eigenvalues are routinely compared with angle-resolved photoemission (ARPES) and used as input for many-body methods, yet density functional theory (DFT) assigns them no physical meaning. For alkali and alkaline-earth metals, KS bandwidths overestimate ARPES measurements by 20-35%, a discrepancy that persists across all exchange-correlation functionals. We construct an effective field theory (EFT) of the inhomogeneous electron gas and show that two conditions imply KS bands are the quasiparticle bands, up to a frozen-core renormalization factor zcore: a scale separation between core excitation energies and the valence Fermi energy, and an approximate Galilean invariance of the uniform electron gas confirmed by diagrammatic Monte Carlo. This factor reflects dynamical core excitations that conventional pseudopotentials freeze out and no static potential can capture. The correction 1-zcore reaches 20-35% for alkali metals but falls below 5% for Al and Si, explaining both the failure and success of KS band theory. We derive a closed-form post-SCF formula and validate it for Li, Na, K, Ca, Mg, Al, and Si; the predicted quasiparticle bands resolve the long-standing ARPES bandwidth discrepancy, matching embedded dynamical mean-field theory at negligible cost. This work also exemplifies first-principles agentic science, a direction particularly suited to the AGI-for-Science paradigm: an LLM-co-developed derivation with controlled approximations, verified symbolically and against a few experiments, becomes a deterministic harness for agentic scale-out, resolving simultaneously the LLM audit bottleneck and the non-falsifiability of fit-based AI-for-science.

[AI-106] EVT-Based Generative AI for Tail-Aware Channel Estimation

【Quick Read】: This paper addresses the problem of accurately modeling rare events in wireless channels for ultra-reliable low-latency communication (URLLC) in fifth-generation (5G) and beyond networks, under limited samples and real-time constraints. Traditional methods rely on large datasets and computationally intensive estimation techniques and perform poorly in real-time scenarios. The key to the proposed solution is a synergistic integration of extreme value theory (EVT) with generative artificial intelligence (AI): EVT accurately characterizes the channel tail distribution to capture extreme events, while generative AI enables data augmentation and channel parameter estimation from limited samples, compensating for the weakness of generative models in capturing extreme events. Experiments on a dataset collected from an automotive environment show that the integrated framework enhances data augmentation for extreme quantiles while requiring fewer samples than traditional analytical methods and generative baselines.

Link: https://arxiv.org/abs/2604.25008
Authors: Parmida Valiahdi, Niloofar Mehrnia, Walid Saad, Sinem Coleri
Institutions: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Ultra-reliable and low-latency communication (URLLC) will play a key role in fifth-generation (5G) and beyond networks, enabling mission-critical applications. Meeting the stringent URLLC requirements, characterized by extremely low packet error rates and minimal latency, calls for advanced statistical modeling to accurately capture rare events in wireless channels. Traditional methods, such as those that rely on large datasets and computationally intensive estimation techniques, often fail in real-time scenarios. In this paper, a novel framework is proposed to meet URLLC requirements through a synergistic integration of extreme value theory (EVT) with generative artificial intelligence (AI). EVT is used to model channel tail distributions, providing an accurate characterization of rare events. Concurrently, generative AI enables data augmentation and channel parameter estimation from limited samples. The integration of EVT with generative AI can thus help overcome the limitations of generative models in capturing extreme events during channel characterization. Using an experimental dataset collected from an automotive environment, it is demonstrated that this integration enhances data augmentation for extreme quantiles, while requiring fewer samples than traditional analytical EVT methods and generative baselines in online estimation of channel distribution.
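The EVT side of such a pipeline typically fits a generalized Pareto distribution (GPD) to peaks-over-threshold excesses of the channel metric. A numpy-only sketch using the classical method-of-moments GPD estimator; the exponential test data and the 95% threshold are illustrative assumptions, not the paper's dataset or estimator:

```python
import numpy as np

def gpd_moment_fit(excesses: np.ndarray) -> tuple[float, float]:
    """Method-of-moments GPD estimates of shape (xi) and scale (sigma)
    from threshold excesses: xi = (1 - m^2/v)/2, sigma = m*(m^2/v + 1)/2."""
    m, v = excesses.mean(), excesses.var(ddof=1)
    xi = 0.5 * (1.0 - m * m / v)
    sigma = 0.5 * m * (m * m / v + 1.0)
    return xi, sigma

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=50_000)  # illustrative "channel" data
u = np.quantile(samples, 0.95)                     # peaks-over-threshold cutoff
xi, sigma = gpd_moment_fit(samples[samples > u] - u)
# Exponential tails correspond to GPD shape xi = 0 with sigma = the scale.
print(f"xi = {xi:.3f}, sigma = {sigma:.3f}")
```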

[AI-107] spectroxide: A code package for computing cosmic microwave background spectral distortions

【Quick Read】: This paper targets high-precision numerical computation of cosmic microwave background (CMB) spectral distortions, modeling heat and photon injection from redshift $z \sim 5 \times 10^6$ to the present, including Compton scattering, double Compton emission, and Bremsstrahlung. The key to the solution is the open-source code package spectroxide, in which all ~14,500 lines of Rust code, the Python interface, and ~400 automated tests were written by an AI assistant (Claude Code) under the supervision of human physicists; the solver is validated against analytic limits, published spectra, and publicly available precomputed Green's function tables. The study also highlights the irreplaceable role of domain expertise in catching physics bugs in AI-generated code (incorrect dimensional prefactors, near-cancellation errors), offering a practical case study of human–AI collaboration in scientific computing.

Link: https://arxiv.org/abs/2604.24838
Authors: Ethan Baker, Hongwan Liu, Siddharth Mishra-Sharma
Institutions: Unknown
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph)
Comments: 32+18 pages, 11 figures

Abstract:We present spectroxide, a code package for computing cosmic microwave background spectral distortions in which all ~14,500 lines of Rust code, Python interface, and ~400 automated tests were written by an AI assistant (Claude Code) under human physicist supervision. The solver evolves the photon Boltzmann equation under Compton scattering, double Compton emission, and Bremsstrahlung from $z \sim 5 \times 10^6$ to the present, computing spectral distortions from arbitrary heat and photon injection within this redshift range. No fully open-source code of this kind is publicly available; we validate against analytic limits, published spectra, and publicly available precomputed Green's function tables. We document the development as a case study in AI-assisted scientific computing, highlighting how domain expertise caught physics bugs (incorrect dimensional prefactors, near-cancellation errors) that evaded the full automated test suite, and provide recommendations for best practices in human–AI collaborative development of scientific software. We make spectroxide publicly available on GitHub.
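As a back-of-the-envelope check of what such a solver computes, energy injected during the μ era obeys the standard analytic estimate μ ≈ 1.4 Δρ_γ/ρ_γ, which a full Boltzmann solver like spectroxide refines. A sketch; the injection value is illustrative, not from the paper:

```python
# Standard analytic estimate for a chemical-potential (mu) distortion from
# a small fractional energy injection during the mu era:
#   mu ~ 1.4 * (delta_rho / rho)_gamma
# The injected fraction below is an illustrative placeholder.
def mu_distortion(delta_rho_over_rho: float) -> float:
    return 1.4 * delta_rho_over_rho

injection = 5e-6  # illustrative fractional energy injection
print(f"mu = {mu_distortion(injection):.1e}")
```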

Machine Learning

[LG-0] Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics AISTATS2026

Link: https://arxiv.org/abs/2604.25904
Authors: Andre Herz, Daniel Durstewitz, Georgia Koppe
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Comments: Presented at the Workshop on Optimization and Post-Bayesian Inference in Machine Learning, AISTATS 2026

Abstract:Identity teacher forcing (ITF) enables stable training of deterministic recurrent surrogates for chaotic dynamical systems and has been highly effective for dynamical systems reconstruction (DSR) with recurrent neural networks (RNNs), including interpretable almost-linear RNNs (AL-RNNs). However, as an intervention-based prediction loss (and thus a generalized Bayes update), teacher forcing need not match the free-running model’s marginal likelihood geometry. We compare the objective-induced curvatures of ITF and marginal likelihood in a probabilistic switching augmentation of AL-RNNs, estimating ambiguity-aware observed information via Louis’ identity. In the switching setting studied here, conditioning on a single forced regime path (as ITF does) inflates curvature, while marginal likelihood curvature is reduced by a missing-information correction when multiple switching explanations remain plausible. In Lorenz-63 experiments, windowed evidence fine-tuning improves held-out evidence but can degrade dynamical quantities of interest (QoIs) relative to ITF-pretrained models.

[LG-1] Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Link: https://arxiv.org/abs/2604.25903
Authors: Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy, Banani Roy, Kevin A. Schneider
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:The accelerating adoption of Large Language Models (LLMs) in software engineering (SE) has brought with it a silent crisis: unsustainable computational cost. While these models demonstrate remarkable capabilities in different SE tasks, they are unmanageably large, slow to deploy, memory-intensive, and carbon-heavy. This reality threatens not only the scalability and accessibility of AI-powered SE, but also its long-term environmental sustainability. The research challenge is clear: we must go beyond accuracy and address efficiency and environmental cost as first-class design constraints. To meet this challenge, we introduce Carbon-Taxed Transformers (CTT), a systematic multi-architectural compression principled pipeline ordering inspired by economic carbon taxation principles. Drawing from the economic concept of carbon pricing, CTT operationalizes a computational carbon tax that penalizes architectural inefficiencies and rewards deployment-ready compression. We evaluate CTT across three core SE tasks: code clone detection, code summarization, and code generation, with models spanning encoder-only, encoder-decoder, and decoder-only architecture. Our results show that CTT delivers on inference: (1) up to 49x memory reduction, (2) time reduction up to 8-10x for clone detection, up to 3x for summarization, and 4-7x for generation, (3) up to 81% reduction in CO2 emissions and (4) CTT retains around 98% accuracy on clone detection, around 89% on summarization, and up to 91% (textual metrics) and 68% (pass@1) for generation. Two ablation studies show that pipeline ordering and individual component contributions are both essential, providing empirical justification for CTT’s design and effectiveness. This work establishes a viable path toward responsible AI in SE through aggressive yet performance-preserving compression.
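The "carbon tax" idea can be read as a penalized selection criterion over compression candidates: reward retained accuracy, tax emissions. A minimal sketch; the candidates, their numbers, and the tax weight are invented for illustration and are not the paper's scoring formula:

```python
# Sketch: rank compression candidates by accuracy minus a "carbon tax".
# Candidates, their numbers, and the tax weight are illustrative only.
def carbon_taxed_score(accuracy: float, co2_kg: float, tax_per_kg: float) -> float:
    return accuracy - tax_per_kg * co2_kg

candidates = {
    "fp32-baseline":    (0.930, 12.0),  # (accuracy, inference CO2 in kg)
    "pruned+quantized": (0.911, 2.3),
    "distilled-small":  (0.884, 1.1),
}
tax = 0.01  # accuracy points charged per kg of CO2
best = max(candidates, key=lambda k: carbon_taxed_score(*candidates[k], tax))
print(best)
```

With this weight the full-precision model's emissions outweigh its accuracy edge, so the pruned-and-quantized candidate wins.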

[LG-2] Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

Link: https://arxiv.org/abs/2604.25897
Authors: Clinton Enwerem, Shreya Kalyanaraman, John S. Baras, Calin Belta
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 11 pages, 10 figures

Abstract:Contact variability, sensing uncertainty, and external disturbances make grasp execution stochastic. Expected-quality objectives ignore tail outcomes and often select grasps that fail under adverse contact realizations. Risk-sensitive POMDPs address this failure mode, but many use particle-filter beliefs that scale poorly, obstruct gradient-based optimization, and estimate Conditional Value-at-Risk (CVaR) with high-variance approximations. We instead formulate grasp acquisition as variational inference over latent contact parameters and object pose, representing the belief with a differentiable Gaussian mixture. We use Gumbel-Softmax component selection and location-scale reparameterization to express samples as smooth functions of the belief parameters, enabling pathwise gradients through a differentiable CVaR surrogate for direct optimization of tail robustness. In simulation, our variational neural belief improves robust grasp success under contact-parameter uncertainty and exogenous force perturbations while reducing planning time by roughly an order of magnitude relative to particle-filter model-predictive control. On a serial-chain robot arm with a multifingered hand, we validate grasp-and-lift success under object-pose uncertainty against a Gaussian baseline. Both methods succeed on the tested perturbations, but our controller terminates in fewer steps and less wall-clock time while achieving a higher tactile grasp-quality proxy. Our learned belief also calibrates risk more accurately, keeping mean absolute calibration error below 0.14 across tested simulation regimes, compared with 0.58 for a Cross-Entropy Method planner.
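The tail objective at the heart of this approach is CVaR: the expected cost over the worst α-fraction of outcomes. A minimal empirical estimator as a numpy sketch (the sample costs are invented; in the paper a smooth differentiable surrogate replaces the hard sort so that pathwise gradients can flow):

```python
import numpy as np

def cvar(costs: np.ndarray, alpha: float) -> float:
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of costs."""
    k = max(1, int(np.ceil(alpha * costs.size)))
    worst = np.sort(costs)[-k:]   # highest costs = worst outcomes
    return float(worst.mean())

# Invented grasp-cost samples: mostly benign, with a heavy-tailed failure.
costs = np.array([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.9, 1.0, 1.1, 5.0])
print(cvar(costs, alpha=0.2), costs.mean())  # tail mean vs. expected cost
```

The expected-cost objective (0.95 here) barely registers the rare failure, while CVaR at α = 0.2 is dominated by it, which is exactly why expected-quality grasp objectives miss adverse contact realizations.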

[LG-3] Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Link: https://arxiv.org/abs/2604.25700
Authors: Pernilla Hall, Anton Ununger, Riccardo Rubei, Alessio Bucaioni
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated if artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to source code, execution traces, or static analysis artifacts, making it directly deployable within existing industrial maintenance workflows. We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Our evaluation used proprietary data from ABB Robotics in Sweden, comprising five years of resolved industrial bug reports, each linked to its verified code fix. This setting allowed us to assess model effectiveness under realistic industrial constraints. Our results showed that traditional models using term frequency-inverse document features consistently outperformed the fine-tuned language models on this dataset, while data augmentation improved Random Forest performance. These findings challenge the assumption that transformer-based models universally outperform classical approaches in industrial contexts with domain-specific data. We demonstrated that historical bug reports can be systematically used for text-based, artificial intelligence-assisted fault localization, providing a scalable, low-cost, and empirically grounded complement to common debugging practices in industry. 
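The winning configuration, TF-IDF text features feeding a classical model, can be sketched with a dependency-free miniature: vectorize bug-report text and assign the most similar labeled component. The tiny corpus and component labels below are invented for illustration; the paper's models (Logistic Regression, SVM, Random Forest) would replace this nearest-neighbor step:

```python
import math
from collections import Counter

# Miniature TF-IDF + nearest-neighbor fault localizer over bug-report text.
# The tiny corpus and component labels are invented for illustration only.
docs = [
    ("motion controller jitter when joint speed changes", "motion"),
    ("joint trajectory overshoot in motion planner", "motion"),
    ("network socket timeout while streaming telemetry", "network"),
    ("telemetry stream drops packets on reconnect", "network"),
]

corpus_tokens = [text.split() for text, _ in docs]
df = Counter(t for toks in corpus_tokens for t in set(toks))
idf = {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: c / len(tokens) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = [(tfidf(toks), label) for toks, (_, label) in zip(corpus_tokens, docs)]

def localize(report: str) -> str:
    q = tfidf(report.split())
    return max(vectors, key=lambda vl: cosine(q, vl[0]))[1]

print(localize("planner overshoot at high joint speed"))
```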

[LG-4] Towards interpretable AI with quantum annealing feature selection

Link: https://arxiv.org/abs/2604.25649
Authors: Francesco Aldo Venturelli, Emanuele Costa, Sikha O K, Bruno Juliá-Díaz, Miguel A. González Ballester, Alba Cervera-Lierta
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 9 figures, 1 table, and supplementary materials

Abstract:Deep learning models are used in critical applications, in which mistakes can have serious consequences. Therefore, it is crucial to understand how and why models generate predictions. This understanding provides useful information to check whether the model is learning the right patterns, detect biases in the data, improve model design, and build systems that can be trusted. This work proposes a new method for interpreting Convolutional Neural Networks in image classification tasks. The approach works by selecting the most important feature maps that contribute to each prediction. To solve this combinatorial problem, we encode it into a quantum constrained optimization problem and propose to solve it using quantum annealing. We evaluate our method against the state-of-the-art explainable AI techniques, specifically GradCAM and GradCAM++, and observe an improved class disentanglement, i.e. the model’s decision boundaries become more distinct and its reasoning more transparent. This demonstrates that our approach enhances the quality of explanations, making it easier to understand which features the model relies on for specific predictions. In addition, we study the computational behavior of the quantum annealing algorithm. Specifically, we analyze the minimum energy gap of the system during computation and the probability that the algorithm finds the correct solution. These analyses provide theoretical insight into why the method works effectively in practice.
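The combinatorial core, selecting k of n feature maps while discouraging redundancy, maps onto a QUBO of the kind quantum annealers minimize; for tiny n it can be brute-forced. A numpy sketch in which the relevance scores, redundancy matrix, and penalty weight are illustrative assumptions, not quantities from the paper:

```python
import itertools
import numpy as np

# QUBO sketch for selecting k of n feature maps: maximize per-map relevance,
# penalize pairwise redundancy, enforce |x| = k via a quadratic penalty.
# Relevance/redundancy values are illustrative, not learned from a model.
n, k = 4, 2
relevance = np.array([0.9, 0.8, 0.4, 0.3])
redundancy = np.array([[0.0, 0.7, 0.1, 0.0],   # maps 0 and 1 near-duplicate
                       [0.7, 0.0, 0.1, 0.0],
                       [0.1, 0.1, 0.0, 0.2],
                       [0.0, 0.0, 0.2, 0.0]])
lam = 2.0  # cardinality-constraint penalty weight

def qubo_energy(x: np.ndarray) -> float:
    return float(-relevance @ x + x @ redundancy @ x
                 + lam * (x.sum() - k) ** 2)

best = min((np.array(bits) for bits in itertools.product([0, 1], repeat=n)),
           key=qubo_energy)
print(best)
```

The redundancy penalty steers the minimizer away from the near-duplicate pair {0, 1} toward a diverse selection; an annealer would sample the same energy landscape instead of enumerating it.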

[LG-5] PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection

Link: https://arxiv.org/abs/2604.25599
Authors: Mohamed Taoufik Kaouthar El Idrissi, Edward Zulkoski, Mohammad Hamdaqa
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Code understanding models increasingly rely on pretrained language models (PLMs) and graph neural networks (GNNs), which capture complementary semantic and structural information. We conduct a controlled empirical study of PLM-GNN hybrids for code classification and vulnerability detection tasks by systematically pairing three code-specialized PLMs with three foundational GNN architectures. We compare these hybrids against PLM-only and GNN-only baselines on Java250 and Devign, including an identifier-obfuscation setting. Across both tasks, hybrids consistently outperform GNN-only baselines and often improve ranking quality over frozen PLMs. On Devign, performance and robustness are more sensitive to the PLM feature source than to the GNN backbone. We also find that larger PLMs are not necessarily better feature extractors in this pipeline, and that the PLM choice has more impact than the GNN choice. Finally, we distill these findings into practical guidelines for PLM-GNN design choices in code classification and vulnerability detection.

[LG-6] Egocentric Tactile and Proximity Sensors as Observation Priors for Humanoid Collision Avoidance ICRA

Link: https://arxiv.org/abs/2604.25554
Authors: Carson Kohlbrenner, Niraj Pudasaini, William Xie, Naren Sivagnanadasan, Nikolaus Correll, Alessandro Roncone
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: This work was accepted at the 8th RoboTac Workshop at the International Conference on Robotics and Automation (ICRA) 2026

Abstract:Collision-free motion is often aided by tactile and proximity sensors distributed on the body of the robot due to their resistance to occlusion, as opposed to external cameras. However, how to shape the sensors' properties, such as sensing coverage, type, and range, to enable avoidant behavior remains unclear. In this work, we present a reinforcement learning framework for whole-body collision avoidance on a humanoid H1-2 robot and use it to characterize how sensor properties shape learned avoidance behavior. Using dodgeball as a benchmark task, we ablate the properties of sensors distributed across the upper body of the robot and find that raw proximity measurements can substitute for explicit object localization provided the sensing range is sufficient, and that sparse non-directional proximity signals outpace dense directional alternatives in sample efficiency.

[LG-7] Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy

Link: https://arxiv.org/abs/2604.25550
Authors: Haoran Chen, Wentao Wang
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 3 figures

Abstract:SignSGD compresses each stochastic gradient coordinate to a single bit, offering substantial memory and communication savings, but its 1-bit quantization removes magnitude information and is known to leave a generalization gap relative to well-tuned SGD. We revisit SignSGD from a 1-bit quantization and dithering perspective and contribute three improvements. First, we derive a small-batch convergence rate for SignSGD under unimodal symmetric gradient noise using a signal-to-noise weighted stationarity measure, removing the large-batch assumption of prior analyses. Second, we inject annealed Gaussian noise before the sign operator, which acts as a classical dithering mechanism and probabilistically restores magnitude information lost to hard thresholding. Third, we adapt the SWATS strategy to sign-based updates with a projection-based learning-rate calibration that smoothly transitions from SignSGD to SGD. Single-worker experiments on ResNet-18 isolate optimizer effects from communication aspects: pre-sign dithering surpasses Adam on CIFAR-100, and the calibrated switch reaches 92.18% test accuracy on CIFAR-10, outperforming both pure SGD (91.38%) and pure SignSGD with momentum (90.82%).
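The pre-sign dithering step is simple to state: add annealed Gaussian noise to the gradient before taking the sign, so that the expected 1-bit update recovers some magnitude information. A numpy sketch on a toy quadratic; the step size, noise scale, and linear annealing schedule are illustrative assumptions, not the paper's hyperparameters:

```python
import numpy as np

# SignSGD with pre-sign Gaussian dithering on f(x) = 0.5 * ||x||^2.
# Step size, noise scale, and the linear annealing schedule are illustrative.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0, 0.5])
lr, sigma0, steps = 0.05, 1.0, 400

for t in range(steps):
    grad = x                             # exact gradient of the quadratic
    sigma = sigma0 * (1.0 - t / steps)   # annealed dithering scale
    dithered = grad + sigma * rng.normal(size=x.shape)
    x -= lr * np.sign(dithered)          # 1-bit update after dithering

print(np.round(np.abs(x), 2))            # all coordinates driven near zero
```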

[LG-8] Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

Link: https://arxiv.org/abs/2604.25508
Authors: Artur Eisele, Bernd Frauenknecht, Friedrich Solowjow, Sebastian Trimpe
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dynamics. We propose Dyna-style Safety Augmented Reinforcement Learning (Dyna-SAuR), a novel algorithm that learns both a scalable safety filter and a control policy using a learned uncertainty-aware dynamics model, while requiring minimal domain knowledge. The filter avoids failures and high uncertainty regions. Thus, better models expand the set of safe and certain states, reducing filter conservatism. We present the effectiveness of Dyna-SAuR on goal-reaching CartPole as well as MuJoCo Walker, reducing failures compared to state-of-the-art methods by 2 orders of magnitude.

[LG-9] EvoTSC: Evolving Feature Learning Models for Time Series Classification via Genetic Programming

Link: https://arxiv.org/abs/2604.25499
Authors: Xuanhao Yang, Bing Xue, Mengjie Zhang
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Time series classification is an important analytical task across diverse domains. However, its practical application is often hindered by the scarcity of labeled data and the requirement for substantial computational resources. To address these challenges, this paper proposes EvoTSC, a novel genetic programming approach designed to automatically evolve lightweight feature learning models for time series classification. The core of EvoTSC is a carefully designed multi-layer program structure that strategically embeds diverse forms of prior expert knowledge into the evolutionary process, effectively guiding the search toward operations known to be highly effective for time series analysis. To mitigate the common overfitting problem in time series classification, a tailored Pareto tournament selection strategy is proposed to favor models that perform consistently well across varying training data subsets, promoting the discovery of highly generalizable models. Extensive experiments conducted on univariate time series classification datasets demonstrate that EvoTSC significantly outperforms eleven benchmark methods in most comparisons. Further analyses verify the contribution of each component and the resource efficiency of the evolved models.

[LG-10] Subspace Optimization for Efficient Federated Learning under Heterogeneous Data

Link: https://arxiv.org/abs/2604.25467
Authors: Shuchen Zhu, Zhengyang Huang, Yuqi Xu, Peijin Li
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Federated learning increasingly operates in a large-model regime where communication, memory, and computation are all scarce. Typically, non-IID client data induce drift that degrades the stability and performance of local training. Existing remedies such as SCAFFOLD introduce heterogeneity-correction mechanisms to address this challenge, but they incur substantial extra communication and memory overhead. This paper proposes a subspace optimization method for federated learning (SSF), which performs heterogeneity-corrected optimization in a low-dimensional subspace using only projected quantities, while preserving full-dimensional control information through a backfill-style update that retains residual components whenever the active subspace changes. Under standard smoothness and bounded-variance assumptions, SSF attains a non-asymptotic rate of order $\widetilde{\mathcal{O}}(1/T + 1/\sqrt{NKT})$. Experiments show favorable accuracy–efficiency trade-offs under heterogeneous data.

[LG-11] Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

Link: https://arxiv.org/abs/2604.25416
Authors: Julia Berger, Bernd Frauenknecht, Sebastian Trimpe, Bastian Leibe
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State Space Model used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of latent space, exhibiting an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest in latent space, undermining the reliability of epistemic uncertainty estimates. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. Our findings highlight key limitations of epistemic uncertainty estimation in latent dynamics models and motivate more critical evaluation of this method.

[LG-12] RCProb: Probabilistic Rule Extraction for Efficient Simplification of Tree Ensembles

Link: https://arxiv.org/abs/2604.25304
Authors: Josue Obregon
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 3 figures. Submitted to Information Sciences, currently under review

Abstract:Tree ensembles are widely used in industrial machine learning due to their strong predictive performance and efficient training procedures. However, as the number of trees in an ensemble grows, the resulting models become increasingly difficult for humans to interpret. To address this limitation, explainable artificial intelligence (XAI) studies methods that generate interpretable models capable of explaining complex predictors. One approach consists of extracting decision rules from tree ensembles while attempting to preserve the predictive performance of the original model. In previous work, we introduced RuleCOSI+, a greedy heuristic algorithm for extracting compact rule-based models from tree ensembles. Although RuleCOSI+ produces accurate and interpretable rule sets, it relies on repeated empirical frequency counting over the training data to estimate rule confidence, which becomes computationally expensive for large datasets. In this paper, we propose RCProb, a probabilistic reformulation of RuleCOSI+ designed to reduce the computational cost of rule extraction. RCProb estimates rule statistics using Dirichlet-smoothed class priors and Beta-smoothed condition likelihoods combined through a Naive Bayes formulation, avoiding repeated dataset scans. Experiments on 33 benchmark datasets show that RCProb maintains competitive predictive performance while reducing runtime by approximately $22\times$ compared with RuleCOSI+, while producing more compact rule sets on average.
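The smoothed statistics can be sketched directly from cached counts: a Dirichlet-smoothed class prior multiplied by Beta-smoothed per-condition likelihoods, combined naively and normalized. The counts and smoothing constants below are invented for illustration and are not the paper's settings:

```python
# Sketch: Naive-Bayes-style rule confidence from cached, smoothed counts,
# avoiding repeated dataset scans. Counts and smoothing are illustrative.
def rule_confidence(class_counts, cond_counts, alpha=1.0, beta=1.0):
    """Normalized P(class | rule conditions) from cached counts.

    class_counts: {class: n_c}
    cond_counts:  {class: [(n_condition_and_class, n_class), ...]}
    Dirichlet(alpha) smooths the prior; Beta(beta, beta) each likelihood.
    """
    n = sum(class_counts.values())
    k = len(class_counts)
    scores = {}
    for c, n_c in class_counts.items():
        p = (n_c + alpha) / (n + k * alpha)             # smoothed class prior
        for n_joint, n_cls in cond_counts[c]:
            p *= (n_joint + beta) / (n_cls + 2 * beta)  # smoothed likelihood
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

conf = rule_confidence(
    class_counts={"pos": 60, "neg": 40},
    cond_counts={"pos": [(45, 60), (30, 60)], "neg": [(5, 40), (8, 40)]},
)
print({c: round(p, 3) for c, p in conf.items()})
```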

[LG-13] Optimization-Free Topological Sort for Causal Discovery via the Schur Complement of Score Jacobians

Link: https://arxiv.org/abs/2604.25295
Authors: Rui Wu, Hong Xie
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 3 figures, 7 tables

Abstract:Continuous causal discovery typically couples representation learning with structural optimization via non-convex acyclicity penalties, which subjects solvers to local optima and restricts scalability in high-dimensional regimes. We propose a decoupled paradigm that shifts the causal discovery bottleneck from non-convex optimization to statistical score estimation. We introduce the Score-Schur Topological Sort (SSTS), an algorithm that extracts topological order directly from unconstrained generative models, bypassing constrained structure optimization. We establish that the causal hierarchy leaves a geometric signature within the score function: iterative graph marginalization is mathematically equivalent to computing the Schur complement of the Score-Jacobian Information Matrix (SJIM) under linear conditions. This translates the acyclicity constraint into an algebraic procedure with a dominant cost of $O(d^3)$ operations. For non-linear systems, we formulate the expectation gap of Schur marginalization and introduce Block-SSTS to compress extraction depth, bounding structural error. Empirically, SSTS allows causal structural analysis on non-linear graphs up to $d=1000$. At this scale, our framework indicates that once the non-convex optimization bottleneck is mathematically bypassed, the structural fidelity of continuous causal discovery is bounded by the finite-sample estimation variance of the global score geometry. By reducing graph extraction to matrix operations, this work reframes scalable causal discovery from a constrained optimization problem to a statistical estimation challenge.
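The algebra underlying SSTS, marginalization as a Schur complement of an information matrix, can be verified in a few lines for the Gaussian case: the Schur complement of the removed block in the precision matrix equals the precision of the marginal over the kept variables. A numpy sketch (a generic identity-check, not the paper's SJIM pipeline):

```python
import numpy as np

# For a Gaussian with precision matrix L partitioned into kept block A and
# marginalized block B, the marginal precision of A is the Schur complement
#   L_AA - L_AB @ inv(L_BB) @ L_BA.
# We verify this against direct marginalization (dropping covariance rows/cols).
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
cov = M @ M.T + 5 * np.eye(5)      # random symmetric positive-definite covariance
L = np.linalg.inv(cov)             # precision (information) matrix

kept, dropped = [0, 1, 2], [3, 4]
L_ab = L[np.ix_(kept, dropped)]
schur = L[np.ix_(kept, kept)] - L_ab @ np.linalg.inv(L[np.ix_(dropped, dropped)]) @ L_ab.T

marginal_precision = np.linalg.inv(cov[np.ix_(kept, kept)])
print(np.allclose(schur, marginal_precision))  # -> True
```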

[LG-14] Online combinatorial optimization with stochastic decision sets and adversarial losses NEURIPS

Link: https://arxiv.org/abs/2604.25269
Authors: Gergely Neu, Michal Valko
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published at Neural Information Processing Systems (NeurIPS) 2014

Abstract:Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader prediction method for several learning settings differing in the feedback provided to the learner. Our algorithms rely on a novel loss estimation technique that we call Counting Asleep Times. We deliver regret bounds for our algorithms for the previously studied full information and (semi-)bandit settings, as well as a natural middle point between the two that we call the restricted information setting. A special consequence of our results is a significant improvement of the best known performance guarantees achieved by an efficient algorithm for the sleeping bandit problem with stochastic availability. Finally, we evaluate our algorithms empirically and show their improvement over the known approaches.
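The base algorithm can be sketched in the full-information case: perturb the cumulative losses and play the minimizer among the actions that happen to be awake this round. A numpy sketch with a synthetic loss sequence and availability pattern (both invented; the paper's Counting Asleep Times estimator matters in the restricted/bandit feedback settings, which this sketch does not cover):

```python
import numpy as np

# Follow-The-Perturbed-Leader over actions with stochastic availability:
# play the perturbed-loss minimizer among the actions awake this round.
# Losses and availability are synthetic; full-information feedback assumed.
rng = np.random.default_rng(0)
n_actions, horizon, eta = 4, 2_000, 0.05
true_means = np.array([0.2, 0.5, 0.6, 0.8])    # action 0 is best when awake

cum_loss = np.zeros(n_actions)
played_loss, comparator_loss = 0.0, 0.0
for t in range(horizon):
    awake = rng.random(n_actions) < 0.9        # each action available w.p. 0.9
    if not awake.any():
        continue
    perturbed = cum_loss - rng.exponential(1.0 / eta, size=n_actions)
    perturbed[~awake] = np.inf                 # sleeping actions cannot be played
    a = int(np.argmin(perturbed))
    losses = rng.binomial(1, true_means)       # Bernoulli losses this round
    cum_loss += losses                         # full-information update
    played_loss += losses[a]
    # Per-round comparator: the best (lowest-mean) action among those awake.
    comparator_loss += losses[np.flatnonzero(awake)[true_means[awake].argmin()]]

print(f"regret vs. best awake action: {played_loss - comparator_loss:.0f}")
```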

[LG-15] DGLight: DQN-Guided GRPO Fine-Tuning of Large Language Models for Traffic Signal Control

Link: https://arxiv.org/abs/2604.25259
Authors: Chenbo Yu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Traffic signal control (TSC) plays a central role in reducing congestion and maintaining urban mobility. This dissertation introduces DGLight, a critic-guided reinforcement-learning framework for adapting a pretrained large language model to TSC. DGLight first trains a CoLight-based Deep Q-Network critic to estimate traffic-aware action values from structured intersection states, then uses the frozen critic to score candidate language-model actions and optimize the policy with Group Relative Policy Optimization (GRPO). The resulting controller maps traffic states to interpretable reasoning traces and signal decisions while learning from dense per-state supervision rather than raw cumulative environment rewards. Experiments on TSC benchmarks covering Jinan and Hangzhou show that DGLight is the strongest overall method among the compared LLM-based controllers, remains competitive with strong RL baselines, and transfers well to city datasets not used to fit the critic. Qualitative examples further show that the model's generated reasoning is interpretable and aligned with the chosen signal phase. The project code is available at this https URL.
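The critic-guided step replaces raw environment rewards with Q-values from the frozen DQN critic, which GRPO then normalizes group-relative across the candidates sampled for one state. A numpy sketch of the advantage computation; the candidate count and Q-values are invented for illustration:

```python
import numpy as np

# GRPO-style group-relative advantages: candidate actions sampled from the
# LLM for one traffic state are scored by a frozen critic, then standardized
# within the group. The Q-values below are invented for illustration.
def group_relative_advantages(q_values: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return (q_values - q_values.mean()) / (q_values.std() + eps)

critic_q = np.array([12.0, 9.5, 14.5, 8.0])  # critic scores for 4 candidate phases
adv = group_relative_advantages(critic_q)
print(np.round(adv, 2))  # positive advantage -> reinforce that candidate's tokens
```

Standardizing within the group means only the candidates' relative ranking under the critic matters, which is what lets a per-state Q-estimate stand in for a sparse cumulative reward.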

[LG-16] Categorical Optimization with Bayesian Anchored Latent Trust Regions for Structural Design under High-Dimensional Uncertainty

Link: https://arxiv.org/abs/2604.25241
Authors: Zhangyong Liang, Huanhuan Gao
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Categorical structural optimization under aleatoric uncertainty is challenging because each design variable must be selected from a finite catalog of admissible instances, while each candidate design may require expensive stochastic finite-element evaluations. Existing latent-space optimization strategies can reduce the dimensionality of catalog attributes, but they often treat the reduced space as a continuous search domain. The resulting continuous optimum must then be rounded off to a nearby catalog instance, which may alter the objective value, constraint status, or physical interpretation of the design. To address this issue, this paper proposes the Categorical Optimization with Bayesian Anchored Latent Trust Regions (COBALT) framework for high-dimensional categorical Optimization Under Uncertainty. COBALT first embeds the physical catalog into a low-dimensional latent representation and locks the mapped instances as a discrete anchored graph. A data-independent random tree decomposition is then used to provide bounded-complexity additive modeling over high-dimensional categorical variables. On this anchored domain, an additive SAAS-GP surrogate is fitted to heteroscedastic MC-FEA observations, and a trust-region discrete graph acquisition search selects the next admissible catalog configuration without continuous relaxation or rounding-off. The proposed strategy is applied to robust design optimization of complex bar structures, considering structural weight, strain energy, and local buckling performance. By evaluating only valid catalog designs through the MC-FEA oracle, COBALT preserves physical admissibility throughout the active learning loop and improves the efficiency of robust categorical structural optimization.

[LG-17] Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model

Link: https://arxiv.org/abs/2604.25196
Authors: Yuting Yang,Gang Mei,Feng Chen,Yongshuang Zhang,Jianbing Peng
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Landslide susceptibility prediction is critical for geohazard risk assessment and mitigation. The conventional data-driven paradigm achieves high predictive accuracy but requires sufficient conditioning factors and large-scale landslide inventories. However, in practical engineering applications across mountainous and plateau regions, data-scarce conditions are commonly observed, where such data requirements are rarely satisfied, rendering the conventional data-driven paradigm inapplicable. To address this issue, we propose a knowledge-data dually driven paradigm for accurate landslide susceptibility prediction under data-scarce conditions. The essential idea behind the proposed novel paradigm is the integration of geomorphic prior knowledge with scarce landslide data. To validate the proposed paradigm, we first applied it to a data-rich region in central Italy, where a conventional data-driven paradigm trained on the full dataset served as the baseline. By utilizing only 30% of the available landslide data, the proposed paradigm achieved comparable predictive accuracy to the baseline, demonstrating its effectiveness under data-scarce conditions. The paradigm was further evaluated in a genuinely data-scarce application environment, the Qilian Permafrost Region of the Tibetan Plateau, where it also yielded reliable susceptibility predictions, confirming its applicability under data-scarce conditions.

[LG-18] Shearlet Neural Operators for Anisotropic-Shock-Dominated and Multi-scale parametric partial differential equations

Link: https://arxiv.org/abs/2604.25181
Authors: Fabio Pereira dos Santos,Julio de Castro Vargas Fernandes,Adriano Mauricio de Almeida Cortes
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Neural operators have emerged as powerful data-driven surrogates for learning solution operators of parametric partial differential equations (PDEs). However, widely used Fourier Neural Operators (FNOs) rely on global Fourier representations, which can be inefficient for resolving anisotropic structures, sharp gradients, and spatially localized discontinuities that arise in shock-dominated and multiscale regimes. To address these limitations, we introduce the Shearlet Neural Operator (SNO), a neural operator architecture that replaces the Fourier transform with a shearlet-based representation. Shearlets offer directional, multiscale, and spatially localized atoms with near-optimal sparse approximation of anisotropic features, providing an inductive bias aligned with PDE solutions containing edges, fronts, and shocks. SNO learns in the shearlet domain and reconstructs predictions via the inverse transform, retaining efficient spectral computation while improving locality and directional selectivity. Across seven benchmark PDE families, including strongly anisotropic advection, anisotropic diffusion, and nonlinear conservation laws with straight, curved, interacting, spiral, and polygonal shock structures, SNO consistently improves predictive accuracy and feature fidelity over FNO baselines, with the largest gains observed in anisotropic and discontinuity-dominated settings.
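The core architectural move, replacing the Fourier transform in a spectral layer with another analysis/synthesis pair, can be sketched in a few lines. The following is a minimal 1-D toy using NumPy's FFT as the pluggable transform; the paper's actual operator is 2-D, uses a shearlet transform, and is trained end to end, none of which is shown here.

```python
import numpy as np

def spectral_layer(u, W, fwd=np.fft.rfft, inv=np.fft.irfft):
    """One FNO-style spectral layer with a pluggable transform pair.

    u: (n,) samples of the input function; W: (k,) complex multipliers
    on the k retained coefficients. Swapping fwd/inv for a shearlet
    analysis/synthesis pair gives the SNO idea in miniature.
    """
    coeffs = fwd(u)
    out = np.zeros_like(coeffs)
    out[: len(W)] = coeffs[: len(W)] * W   # mix only the retained modes
    return inv(out, n=len(u))

# Identity multipliers on the 4 lowest modes act as a low-pass filter:
x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
u = np.sin(x) + 0.1 * np.sin(17.0 * x)
smooth = spectral_layer(u, W=np.ones(4, dtype=complex))
```

In a trained operator `W` would be learned; the design point the paper makes is that the choice of `fwd`/`inv` fixes the inductive bias, global and isotropic for the FFT, directional and localized for shearlets.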

[LG-19] Accurate and Robust Generative Approach for Overcoming Data Sparsity and Imbalance in Landslide Modeling with A Tabular Foundation Model

Link: https://arxiv.org/abs/2604.25159
Authors: Kaixuan Shao,Gang Mei,Yinghan Wu,Nengxiong Xu,Jianbing Peng
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Landslide investigation relies on sufficient and well-balanced observational data influenced by geological, hydrological, and anthropogenic factors. Available landslide inventories are often sparse and imbalanced, which limits understanding of triggering conditions and failure mechanisms. Data generation provides an effective approach to help capture feature dependencies from limited landslide observations. However, existing generation approaches for landslides often struggle to capture complex relationships among features and lack robustness across multiple scenarios and interacting factors. Here, we propose an accurate and robust approach for generating multi-feature landslide datasets by utilizing a tabular foundation model. By leveraging the capacity to learn from limited observations, the proposed approach effectively preserves the multivariate dependencies and statistical characteristics inherent in landslide occurrences. Comparative experiments on 20 landslide inventories demonstrate that the generated datasets closely align with observed distributions, maintain realistic feature dependencies, and exhibit robustness across different environmental contexts. This work provides an effective approach to overcome data sparsity and imbalance and strengthens landslide susceptibility modeling and risk assessment under limited observations.

[LG-20] Prior-Aligned Data Cleaning for Tabular Foundation Models

Link: https://arxiv.org/abs/2604.25154
Authors: Laure Berti-Equille
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
*Comments: 15 pages, 8 figures

Click to view abstract

Abstract:Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes – making them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in the real-world data create a prior mismatch that degrades both accuracy and confidence calibration simultaneously. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate – a natural fit for reinforcement learning (RL). We introduce L2C2, the first deep RL framework framing tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between dirty input and the TFM’s synthetic prior. Six experiments on ten OpenML benchmark datasets establish: 1) three of seven reward designs collapse to degenerate trivial cleaning strategies – principled reward engineering is scientifically non-trivial; 2) the novel TFMAwareReward we propose selects structurally distinct pipelines on 4/10 datasets and achieves higher TabPFN accuracy on those diverging cases (mean 0.851 vs. 0.843; Wilcoxon p=0.063, n=4) while never underperforming; 3) parameterized cleaning actions improve best-found pipeline reward on 9/10 datasets (Wilcoxon p=0.004); and 4) a policy pre-trained on one single source dataset exceeds scratch training at the 2,000-step fine-tuning checkpoint on all three held-out datasets (up to +28.8% after full fine-tuning), demonstrating cross-dataset transfer of prior-alignment knowledge. These findings establish that prior alignment is a principled data preparation strategy for TFM deployment on real-world tabular data.
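The "sequence cleaning operators to shrink a prior gap" framing can be illustrated without any RL machinery. Below is a greedy stand-in for the learned policy: the operator names, the moment-based gap metric, and the Gaussian prior are all illustrative assumptions, not the paper's actual reward or operator set.

```python
import numpy as np

def prior_gap(x, mean=0.0, std=1.0):
    # Crude stand-in for "distance to the TFM's synthetic prior":
    # deviation of the sample moments from the prior's moments.
    return abs(np.nanmean(x) - mean) + abs(np.nanstd(x) - std)

OPS = {  # hypothetical cleaning operators (unparameterized here)
    "impute_mean": lambda x: np.where(np.isnan(x), np.nanmean(x), x),
    "clip_3sigma": lambda x: np.clip(x, np.nanmean(x) - 3 * np.nanstd(x),
                                        np.nanmean(x) + 3 * np.nanstd(x)),
    "standardize": lambda x: (x - np.nanmean(x)) / np.nanstd(x),
}

def greedy_clean(x, max_steps=3):
    """Greedy sketch of L2C2's idea: at each step apply the operator
    that most reduces the prior gap; stop when nothing helps. The
    paper learns this sequencing with deep RL instead."""
    pipeline = []
    for _ in range(max_steps):
        best = min(OPS, key=lambda op: prior_gap(OPS[op](x)))
        if prior_gap(OPS[best](x)) >= prior_gap(x):
            break
        x = OPS[best](x)
        pipeline.append(best)
    return x, pipeline
```

A column with an outlier and a missing value, for instance, ends up standardized first because that single operator closes most of the moment gap; the RL policy's advantage is anticipating operator interactions this greedy rule cannot.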

[LG-21] Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

Link: https://arxiv.org/abs/2604.25119
Authors: Vinith M. Suriyakumar,Ayush Sekhari,Lena Stempfle,Robertson Wang,Michael Simpson,Rebecca Portnoff,Marzyeh Ghassemi,Ashia C. Wilson
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:Auditing the fine-tunes of open-weight generative models for harmful specialization has become a new governance challenge for model hosting platforms. The standard toolkit, generative evaluation via curated prompts or red-teaming, does not scale to platform-level auditing and breaks down entirely for domains like CSAM where generation is legally constrained. This motivates the Evaluation without Generation problem: assessing model capabilities without producing outputs. We argue that in such settings, capability must be inferred from the model’s state, either its parameters or internal representations, rather than its outputs. We introduce Gaussian probing, a method that characterizes how LoRA adaptors perturb a model’s internal representations by measuring responses to Gaussian latent ensembles. Unlike raw-weight baselines, Gaussian probing reliably distinguishes benign from harmful specialization without sampling outputs. We demonstrate effectiveness in high-risk domains, including detecting models specialized for child sexual abuse material (CSAM), where output-based evaluation is legally and ethically constrained. Our results show that Gaussian probing provides a scalable non-generative alternative for evaluating high-risk generative systems and remains robust to weight rescaling, a representative adversarial manipulation.
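The probing idea, feed a fixed Gaussian ensemble through the model's representation map and compare response statistics before and after fine-tuning, can be sketched on toy linear "models". Everything below (the tanh map, the mean-absolute-activation statistic, the rank-1 "LoRA-like" update) is an illustrative assumption; the paper's method operates on real LoRA adaptors and internal LLM representations.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((512, 16))   # shared Gaussian probe ensemble

def signature(forward):
    """Response summary of a representation map to the Gaussian
    ensemble: mean absolute activation per unit (toy statistic)."""
    return np.abs(forward(Z)).mean(axis=0)

def probe_drift(f_base, f_tuned):
    return float(np.linalg.norm(signature(f_base) - signature(f_tuned)))

# Stand-in "models": a frozen base map plus a benign (tiny) update
# and a specialized (large, rank-1, LoRA-like) update.
W = rng.standard_normal((16, 16)) / 4.0
benign = W + 0.01 * rng.standard_normal((16, 16))
u, v = rng.standard_normal(16), rng.standard_normal(16)
specialized = W + 0.5 * np.outer(u, v)

base_f = lambda z: np.tanh(z @ W)
d_benign = probe_drift(base_f, lambda z: np.tanh(z @ benign))
d_special = probe_drift(base_f, lambda z: np.tanh(z @ specialized))
```

The key property, no task-specific (let alone harmful) inputs are ever constructed, is what makes the approach viable in legally constrained domains.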

[LG-22] Zero Shot Coordination for Sparse Reward Tasks with Diverse Reward Shapings

Link: https://arxiv.org/abs/2604.25076
Authors: Keenan Powell,Peihong Yu,Pratap Tokekar
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Many Multi-Agent Reinforcement Learning (MARL) agents fail to adapt properly to cooperating with agents trained with the same objectives but different seeds, algorithms, or other training differences. This is the problem of Zero-Shot Coordination (ZSC), which focuses on training agents to cooperate well with unknown agents. ZSC has been studied for a variety of tabular cases and simple games such as Hanabi, achieving excellent results. However, existing solutions to ZSC only consider identical rewards for the trained agents and all future partners. This is not realistic for the trained agents, as they do not consider the problem of cooperating with agents that have identical sparse objectives but shape the rewards for those objectives in a different manner. To address this issue, we show how to train an ensemble of methods using randomized reward shapings chosen via 4 selection algorithms. Experiments on the Overcooked environment demonstrate consistent improvements of 62.2%-119.2% in sparse reward over baseline ZSC algorithms when playing with agents that have identical sparse rewards but different reward shapings.
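The core recipe, same sparse objective plus randomized shaping terms, one ensemble member per shaping, is simple to sketch. The "progress features" and coefficient range below are hypothetical; the paper's four selection algorithms for choosing which shapings enter the ensemble are not shown.

```python
import random

def shaped_reward(sparse_r, features, coeffs):
    """Identical sparse team objective plus a randomized linear
    shaping term over (hypothetical) progress features, e.g. onions
    chopped and pots filled in Overcooked."""
    return sparse_r + sum(c * f for c, f in zip(coeffs, features))

def sample_shapings(n, n_features=2, scale=0.1, seed=0):
    """Draw n randomized reward shapings; training one partner policy
    per shaping yields the ensemble of this paper's approach."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(-scale, scale) for _ in range(n_features))
            for _ in range(n)]
```

Training against such an ensemble exposes the ZSC agent to partners whose behavior differs even though their sparse objective is identical, which is exactly the mismatch the abstract identifies.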

[LG-23] Feasible-First Exploration for Constrained ML Deployment Optimization in Crash-Prone Hierarchical Search Spaces

Link: https://arxiv.org/abs/2604.25073
Authors: Christian Lysenstøen
Subjects: Machine Learning (cs.LG)
*Comments: 22 pages, 5 figures, 10 tables. Code available at this https URL

Click to view abstract

Abstract:Deploying machine learning models under production constraints requires joint optimization over model family, quantization scheme, runtime backend, and serving configuration. This induces a hierarchical mixed-variable search space in which many configurations are invalid: evaluations may crash, exceed memory limits, or violate latency constraints. Standard black-box optimizers such as Tree-structured Parzen Estimators (TPE) and constrained Bayesian optimization are effective when valid configurations are common, but they can spend a large fraction of a small evaluation budget on invalid or uninformative trials in hostile deployment spaces. This paper studies that regime and asks whether optimization should be decomposed into an explicit exploration stage followed by model-guided exploitation. We propose Thermal Budget Annealing (TBA), a feasible-first exploration procedure that maps valid and feasible regions before warm-starting TPE. The method includes two robustness mechanisms for hostile hardware: trial timeouts that abort clearly infeasible evaluations early, and subspace blacklisting that temporarily suppresses categorical subspaces after repeated failures. We also introduce DeployBench, a benchmark suite for deployment optimization with hierarchical structure, hidden crash zones, hard constraints, and unequal evaluation costs. On synthetic benchmarks and real GPU deployment with five pre-trained vision models across five GPU targets (NVIDIA H100, A100, RTX 5080, L4, and T4), the proposed hybrid improves model-family discovery under tight constraints while reducing wasted budget relative to cold-start TPE.
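The subspace-blacklisting mechanism can be illustrated with a toy random-search loop. This is a sketch under simplifying assumptions: crashes are modeled as `None` returns, blacklisting is per categorical family with consecutive-failure counting, and the paper's trial timeouts and TPE warm-start are omitted entirely.

```python
import random

def feasible_first_search(space, evaluate, budget=60, max_fail=3):
    """Toy feasible-first loop: random exploration with subspace
    blacklisting -- a categorical family is suppressed after max_fail
    consecutive crashes so the evaluation budget flows to regions
    that actually produce valid configurations."""
    fails = {k: 0 for k in space}
    best = None
    for _ in range(budget):
        live = [k for k in space if fails[k] < max_fail]
        if not live:
            break
        family = random.choice(live)
        cfg = random.choice(space[family])
        score = evaluate(family, cfg)      # None means crash/infeasible
        if score is None:
            fails[family] += 1
            continue
        fails[family] = 0                  # success resets the counter
        if best is None or score < best[0]:
            best = (score, family, cfg)
    return best
```

With a family that always crashes, the loop wastes at most `max_fail` trials on it before concentrating on the valid family, which is the budget argument the abstract makes against cold-start TPE.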

[LG-24] Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Link: https://arxiv.org/abs/2604.25061
Authors: Zeyu Bai
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Databases (cs.DB); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
*Comments:

Click to view abstract

Abstract:Custom policy-learning pipelines in Spark fail for two coupled systems reasons: rowwise Python execution makes inference impractical, and driver-side candidate materialization makes split search fragile at feature scale. We present Spark Policy Toolkit, a semantics-governed systems toolkit for scalable policy learning in Spark. The toolkit provides two Spark-native primitives: partition-initialized vectorized inference through mapInPandas and mapInArrow, and collect-less split search that scores candidates on executors. Both primitives are governed by one fixed-input semantic contract: the same rows, feature order, treatment vocabulary, preprocessing manifest, and split boundaries must preserve per-row score vectors, best-split decisions, and end-to-end learned policy outputs. The evaluation combines practical baseline ladders, backend parity checks, measured split-search scale results, synthetic and Hillstrom end-to-end policy preservation, missingness stress, partition and order perturbation tests, quantile-boundary sensitivity, and a concrete adversarial failure catalog. On a 40-worker Databricks cluster, mapInArrow reaches 4.72M rows/s at 10M matched rows and 7.23M rows/s at 50M rows, while collect-less split search remains valid from F = 10 through F = 1000 with 124000 candidate rows, where the driver-collect baseline is intentionally skipped. Across 24 backend-ablation settings, mapInArrow wins 18 while mapInPandas wins 6, so the paper treats backend choice as workload-dependent rather than universal. Once the fixed-input lock is enforced, all six tested repartition/coalesce/shuffle perturbations preserve identical signatures; before lock, all six drift. The central result is not speed alone: throughput and collect-less execution are the mechanisms that let policy semantics survive at Spark scale.
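The difference between rowwise Python UDFs and the `mapInPandas` style is that the latter receives an iterator of whole pandas batches per partition. The sketch below shows only that inner generator function, driven by a plain iterator so it runs without a Spark cluster; the column names and linear "model" are illustrative assumptions. In real Spark this function is what you would pass to `df.mapInPandas(fn, schema)`.

```python
import numpy as np
import pandas as pd

def predict_batches(batches, weights):
    """Partition-wise vectorized inference in the mapInPandas style:
    score whole pandas batches at once instead of making one Python
    call per row. In Spark, model state ('weights') would be loaded
    once per partition before iterating."""
    w = np.asarray(weights)
    for pdf in batches:
        scores = pdf[["x1", "x2"]].to_numpy() @ w
        yield pdf.assign(score=scores)

out = list(predict_batches(
    iter([pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})]),
    weights=[0.5, 0.5],
))
```

The per-partition initialization plus batch-at-a-time scoring is what lifts throughput from rowwise UDF rates to the millions of rows per second the abstract reports for `mapInArrow`.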

[LG-25] Null Measurability at the Symmetrization Interface in VC Learning

Link: https://arxiv.org/abs/2604.25028
Authors: Dhruv Gupta
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
*Comments: 12 pages. Companion Lean 4 formalization: this https URL

Click to view abstract

Abstract:Recent work revisiting measurability in the fundamental theorem of statistical learning imposes Borel measurability of ghost-gap suprema. We show that, at the one-sided ghost-gap interface actually used by the standard symmetrization proof, this requirement is stronger than necessary. For any Borel-parameterized concept class on a Polish domain, the bad event “there exists a hypothesis whose ghost empirical error exceeds its training empirical error by at least \epsilon/2” is analytic. By Choquet capacitability, it is therefore measurable in the completion of every finite Borel measure. We then construct a concept class whose bad event is null-measurable but not Borel, giving a strict separation from the Borel supremum condition. Finally, we prove closure under patching, fixed and countable interpolation, and fiber-product amalgamation, showing that the weaker regularity level is stable under natural concept-class constructors. In the realizable setting, where targets belong to the class and are measurable, these results weaken the measurability hypothesis needed by the symmetrization route from finite VC dimension to PAC learnability. The main results and the descriptive-set-theoretic infrastructure used by them are formalized in Lean 4.

[LG-26] Dynamic Regret for Online Regression in RKHS via Discounted VAW and Subspace Approximation

Link: https://arxiv.org/abs/2604.25021
Authors: Dmitry B. Rokhlin,Georgiy A. Karapetyants
Subjects: Machine Learning (cs.LG)
*Comments: 26 pages

Click to view abstract

Abstract:We study online regression with the square loss in a reproducing kernel Hilbert space under a dynamic regret criterion. The learner is compared with a time-varying comparator sequence, and the bounds depend on its path length in the RKHS norm. The proposed method transfers the finite-dimensional discounted Vovk–Azoury–Warmuth approach of Jacobsen & Cutkosky (2024) to the RKHS setting by means of finite-dimensional subspace approximations. For a fixed subspace, we run a VAW-based ensemble of discounted VAW forecasters over a geometric grid of discount factors. The additional approximation error is controlled by the uniform projection error of kernel sections. We then introduce a general orthogonal truncation method: starting from a feature expansion of the kernel, we construct the associated RKHS by introducing an inner product that makes the feature functions orthonormal, and then use the spans of the first basis functions as finite-dimensional approximation spaces. The resulting subspace reduction is applied to several approximation schemes. Explicit feature expansions yield fast-regime bounds for Gaussian and analytic dot-product kernels. Mercer truncations provide a spectral approximation method and lead to dynamic regret bounds in fast and slow regimes, depending on the eigenvalue decay. Finally, we study subspaces spanned by kernel sections and apply this construction to Matérn kernels.
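A discounted VAW forecaster is a small recursion, and a toy run shows why discounting helps under drift. The update below follows the standard VAW shape (fold the current covariate into the second-moment matrix before predicting) with geometric discounting applied to both statistics; this is a sketch of the building block, not the exact recursion or ensemble of the paper.

```python
import numpy as np

class DiscountedVAW:
    """Discounted Vovk-Azoury-Warmuth forecaster (sketch).

    gamma = 1 recovers plain VAW; the paper runs an ensemble of these
    over a geometric grid of gammas, lifted to an RKHS via
    finite-dimensional subspace approximations (not shown)."""
    def __init__(self, dim, gamma=1.0, reg=1.0):
        self.gamma = gamma
        self.A = reg * np.eye(dim)     # discounted second-moment matrix
        self.b = np.zeros(dim)         # discounted target correlations

    def predict(self, x):
        # VAW folds the current covariate into A before predicting
        A_t = self.gamma * self.A + np.outer(x, x)
        return float(x @ np.linalg.solve(A_t, self.gamma * self.b))

    def update(self, x, y):
        self.A = self.gamma * self.A + np.outer(x, x)
        self.b = self.gamma * self.b + y * x

# A drifting target: the regression weights switch halfway through.
def run(gamma, T=200):
    rng = np.random.default_rng(0)     # same data stream for every gamma
    f, errs = DiscountedVAW(dim=2, gamma=gamma), []
    for t in range(T):
        x = rng.standard_normal(2)
        y = x[0] if t < T // 2 else x[1]
        errs.append((f.predict(x) - y) ** 2)
        f.update(x, y)
    return errs

tail = lambda e: sum(e[150:])          # error well after the switch
```

After the switch, the undiscounted forecaster keeps averaging in stale pre-switch data, while `gamma < 1` forgets it geometrically, the dynamic-regret mechanism the ensemble exploits.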

[LG-27] Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

Link: https://arxiv.org/abs/2604.25012
Authors: Shiyi Du,Jiayuan Liu,Weihua Du,Yue Huang,Jiayi Li,Yingtao Luo,Xiangliang Zhang,Vincent Conitzer,Carl Kingsford
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting that this combinatorial search is largely redundant. Building on this insight, we propose SWIFT (Synthesizing Workflows via Few-shot Transfer), a framework that amortizes workflow design into reusable structural priors. SWIFT first distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories across source tasks. At inference time, it conditions a single LLM generation pass on these priors together with cross-task workflow demonstrations to synthesize a complete, executable workflow for an unseen target task, bypassing iterative search entirely. On five benchmarks, SWIFT outperforms the state-of-the-art search-based method while reducing marginal per-task optimization cost by three orders of magnitude. It further generalizes to four additional unseen benchmarks and transfers successfully from GPT-4o-mini to three additional foundation models (Grok, Qwen, Gemma). Controlled ablations reveal that workflow demonstrations primarily transfer topological structure rather than surface semantics: replacing all operator names with random strings still retains over 93% of the full system’s average performance.

[LG-28] Laplace-Bridged Randomized Smoothing for Fast Certified Robustness

Link: https://arxiv.org/abs/2604.24993
Authors: Miao Lin,MD Saifur Rahman Mazumder,Feng Yu,Daniel Takabi,Rui Ning
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Randomized Smoothing (RS) offers formal ℓ2 guarantees for arbitrary base classifiers but faces two key practical bottlenecks: (i) it often relies on noise-augmented training to achieve nontrivial certificates, which increases training cost, can reduce clean accuracy, and weakens RS as a genuinely post-hoc defense; and (ii) certification is computationally expensive, typically requiring tens of thousands of noisy forward passes per input, which hinders deployment, especially on resource-constrained edge devices. To address both limitations, we propose Laplace-Bridged Smoothing (LBS), an analytic reformulation of RS that replaces high-dimensional input-space Monte Carlo (MC) sampling with efficient computations in a low-dimensional probability space. LBS preserves formal robustness guarantees without requiring noise-augmented training while substantially reducing certification burden. On CIFAR-10 and ImageNet, LBS attains stronger certified robustness than RS and reduces per-sample certification cost by nearly an order of magnitude. Notably, on NVIDIA Jetson Orin Nano and Raspberry Pi 4, LBS achieves speedups of up to 494×, enabling practical certified deployment on real-world edge devices. Finally, we provide theoretical justification for the analytic formulation and certificate validity of LBS.
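For context, the standard RS certificate the paper builds on is a one-line formula; the cost the abstract targets lies in estimating its input. The sketch below shows the common one-sided certificate from Cohen et al. (2019); LBS keeps a certificate of this kind but computes the top-class probability analytically instead of by Monte Carlo.

```python
from statistics import NormalDist

def certified_radius(p_a, sigma):
    """One-sided randomized-smoothing certificate: if the smoothed
    classifier's top-class probability lower bound is p_a under
    Gaussian noise of scale sigma, the prediction is constant within
    l2 radius sigma * Phi^{-1}(p_a). Estimating p_a is what normally
    takes tens of thousands of noisy forward passes."""
    if p_a <= 0.5:
        return 0.0                      # abstain: nothing certified
    return sigma * NormalDist().inv_cdf(p_a)
```

The formula itself is trivially cheap, which is why replacing the sampling-based estimate of `p_a` with an analytic computation dominates the reported speedups.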

[LG-29] CoreFlow: Low-Rank Matrix Generative Models

Link: https://arxiv.org/abs/2604.24959
Authors: Dongze Wu,Linglingzhi Zhu,Yao Xie
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Learning matrix-valued distributions from high-dimensional and possibly incomplete training data is challenging: ambient-space generative modeling is computationally expensive and statistically fragile when the matrix dimension is large but the sample size is limited. We propose CoreFlow, a geometry-preserving low-rank flow model that learns shared row/column subspaces across the matrix distribution, and then trains a continuous normalizing flow only on the induced low-dimensional core. CoreFlow is designed for settings where shared low-rank matrix geometry is present, especially in high-dimensional limited-sample regimes. This separates shared matrix geometry from sample-specific variation, preserves matrix structure, and substantially improves training efficiency. The same framework also handles incomplete training matrices through masked Riemannian updates and iterative completion. Across real and synthetic benchmarks, CoreFlow substantially improves spectral and moment-level generation quality in few-sample regimes while remaining competitive in data-rich settings, even under compression to 9% of the ambient dimension and with up to 40% missing training entries.
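The "shared subspaces, per-sample core" factorization can be sketched with pooled Gram matrices. This is only the geometry-locking step under the assumption of exactly shared subspaces; the flow trained on the small cores, the Riemannian masked updates, and the completion loop are not shown.

```python
import numpy as np

def shared_subspaces(mats, r):
    """Estimate shared row/column subspaces from a sample of matrices
    via the top-r eigenvectors of the pooled Gram matrices."""
    R = sum(M @ M.T for M in mats)      # pooled row Gram
    C = sum(M.T @ M for M in mats)      # pooled column Gram
    U = np.linalg.eigh(R)[1][:, -r:]    # eigh: ascending, take top r
    V = np.linalg.eigh(C)[1][:, -r:]
    return U, V

# Synthetic check: 40x30 matrices sharing exact rank-3 subspaces.
rng = np.random.default_rng(1)
U0 = np.linalg.qr(rng.standard_normal((40, 3)))[0]
V0 = np.linalg.qr(rng.standard_normal((30, 3)))[0]
mats = [U0 @ rng.standard_normal((3, 3)) @ V0.T for _ in range(20)]

U, V = shared_subspaces(mats, r=3)
cores = [U.T @ M @ V for M in mats]     # 3x3 cores instead of 40x30
recon = [U @ C @ V.T for C in cores]
```

Once the geometry is locked, the generative model only has to learn the distribution of the tiny cores, which is what makes the few-sample regime tractable.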

[LG-30] A Unifying Framework for Unsupervised Concept Extraction AISTATS2026

Link: https://arxiv.org/abs/2604.24936
Authors: Chandler Squires,Pradeep Ravikumar
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: AISTATS 2026, 9 pages

Click to view abstract

Abstract:Techniques for concept extraction, such as sparse autoencoders and transcoders, aim to extract high-level symbolic concepts from low-level nonsymbolic representations. When these extracted concepts are used for downstream tasks such as model steering and unlearning, it is essential to understand their guarantees, or lack thereof. In this work, we present a unified theoretical framework for unsupervised concept extraction, in which we frame the task of concept extraction as identifying a generative model. We present a general meta-theorem for identifiability, which reduces the problem of establishing identifiability guarantees to the problem of characterizing the intersection of two sets. As we demonstrate on a range of widely-used approaches, this meta-theorem substantially simplifies the task of proving such guarantees, thus paving the way for the development of new, principled approaches for concept extraction.

[LG-31] CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

Link: https://arxiv.org/abs/2604.24935
Authors: Jing Chen,Abhijay Deevi,Onat Gungor,Tajana Rosing
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: Accepted by the 35th International Conference on Computer Communications and Networks (ICCCN 2026)

Click to view abstract

Abstract:The Controller Area Network (CAN) is a safety-critical in-vehicle communication protocol that lacks built-in security mechanisms, making intrusion detection essential. Existing approaches predominantly formulate CAN intrusion detection as a classification task, mapping complex traffic patterns to attack labels. However, this formulation abstracts away the temporal and relational structure of CAN traffic and misaligns with real-world forensic workflows, which require systematic reasoning about traffic behavior. To address this gap, we introduce CAN-QA, the first benchmark that reformulates CAN traffic analysis as a question-answering (QA) task. CAN-QA converts raw CAN logs into temporally segmented windows and applies deterministic rule-based templates to generate natural-language questions paired with automatically derived ground-truth answers. The resulting dataset comprises 33,128 QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Using CAN-QA, we evaluate large language models across both True/False and multiple-choice formats. Our results indicate that, although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation. Our code is available at this https URL.
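The benchmark construction, window the log, apply deterministic templates, derive the ground truth from the same rule, is easy to show in miniature. The two templates below are illustrative assumptions; CAN-QA's 10 categories are richer, and the specific ID 0x244 is just an example.

```python
from collections import defaultdict

def make_qa(frames, window=1.0):
    """Turn a CAN log into windowed, template-generated QA pairs with
    rule-derived ground truth. frames: list of (timestamp, can_id)."""
    buckets = defaultdict(list)
    t0 = frames[0][0]
    for t, cid in frames:
        buckets[int((t - t0) // window)].append(cid)
    qa = []
    for i in sorted(buckets):
        ids = buckets[i]
        qa.append((f"Window {i}: how many distinct CAN IDs appear?",
                   str(len(set(ids)))))
        qa.append((f"Window {i}: does ID 0x244 appear more than 5 "
                   f"times? (True/False)", str(ids.count(0x244) > 5)))
    return qa

pairs = make_qa([(0.0, 0x100), (0.2, 0x244), (0.5, 0x100), (1.1, 0x244)])
```

Because the answers come from the same deterministic rules as the questions, the ground truth is free and exact, which is what lets the dataset scale to tens of thousands of pairs.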

[LG-32] Generative diffusion models for spatiotemporal influenza forecasting

Link: https://arxiv.org/abs/2604.24913
Authors: Joseph Lemaitre,Justin Lessler
Subjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*Comments:

Click to view abstract

Abstract:Forecasting infectious disease incidence can provide important information to guide public health planning, yet is difficult because epidemic dynamics are complex. Current mechanistic and statistical approaches often struggle to capture multimodal uncertainty or emergent trends. Influpaint adapts denoising diffusion probabilistic models to epidemic forecasting. By encoding influenza seasons as spatiotemporal images in which pixel intensity represents incidence, Influpaint learns a rich distribution of disease dynamics from a hybrid dataset of surveillance and simulated trajectories. Forecasting is formulated as a conditional generation (inpainting) task from partial observations. We show that Influpaint generates realistic, diverse epidemic trajectories and achieves forecast accuracy that is competitive with leading ensemble methods in retrospective evaluation. In real-time evaluation during the 2023–2025 U.S. CDC FluSight challenges, performance improved substantially across seasons, with highly accurate but somewhat overconfident projections in 2024–2025. The best performance was achieved with a training dataset containing 30% surveillance and 70% simulated trajectories. These results show that diffusion models can capture important spatiotemporal structure in influenza dynamics and provide a flexible framework for probabilistic infectious disease forecasting.

[LG-33] Contrastive Image-Metadata Pre-Training for Materials Transmission Electron Microscopy

Link: https://arxiv.org/abs/2604.24909
Authors: Georgia Channing,Debora Keller,Marta D. Rossell,Philip Torr,Rolf Erni,Stig Helveg,Henrik Eliasson
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*Comments:

Click to view abstract

Abstract:The vast majority of transmission electron microscopy (TEM) data never gets published and ends up on a backup drive until deleted to free up space. These left-over datasets are rich in detail and variation, often paired with automatically saved metadata of instrument state and acquisition parameters. In this work, we introduce a dataset of 7,330 high-angle annular dark-field scanning-TEM (HAADF-STEM) images from a single instrument to learn a joint embedding space between image metadata and HAADF image. These embeddings link image style with acquisition parameters, which allows us to train a generative style transfer network that can convert experimental images into the style they would have had if they were recorded with different instrument parameters. We evaluate the performance of the network and explore the usefulness of the technique for physical denoising.
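A joint image-metadata embedding of this kind is typically trained with a symmetric InfoNCE (CLIP-style) objective, where matched image/metadata pairs sit on the diagonal of a similarity matrix. The sketch below shows that loss on precomputed embeddings; the paper's actual encoders and training details are not reproduced, and the CLIP-style formulation is a standard assumption rather than a quote from the paper.

```python
import numpy as np

def clip_loss(img_emb, meta_emb, tau=0.07):
    """Symmetric InfoNCE over paired image/metadata embeddings.
    Row i of each matrix is one sample; matching pairs share index."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    meta = meta_emb / np.linalg.norm(meta_emb, axis=1, keepdims=True)
    logits = img @ meta.T / tau
    n = len(logits)

    def xent(L):                         # row-wise cross-entropy, diag targets
        L = L - L.max(axis=1, keepdims=True)
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

aligned = np.eye(4)                      # perfectly matched toy embeddings
mismatched = np.roll(np.eye(4), 1, axis=0)
```

Minimizing this loss pulls each HAADF image toward its own acquisition metadata and away from everyone else's, which is what links image style to instrument parameters.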

[LG-34] An analysis of sensor selection for fruit picking with suction-based grippers IROS

Link: https://arxiv.org/abs/2604.24906
Authors: Eva Krueger,Marcus Rosette,Joseph R. Davidson
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: IROS Conference Format, 6 pages, 6 figures, 1 table

Click to view abstract

Abstract:Robotic fruit harvesting often fails to reliably detect whether a fruit has been successfully picked, limiting efficiency and increasing crop damage. This problem is difficult due to compliant fruit and grippers, variable stem attachment, and occlusions in orchard environments. Prior work has explored vision-based perception and multi-sensor learning approaches for pick state estimation. However, minimal sensor sets and phase-dependent sensing strategies for accurate pick and slip detection remain largely unexplored. In this work, we design and evaluate a multimodal sensing suite integrated into a compliant suction-based apple gripper. Our approach is unique because it identifies which sensors are most informative at different phases of the pick, enabling predictive detection of failures before they occur. The contributions of this paper are a phase-dependent evaluation of multimodal sensors and the identification of minimal sensor sets for reliable pick state classification. Experiments in a real apple orchard show that Random Forest and Multilayer Perceptron classifiers detect successful picks and impending failures with over 90% accuracy, and Random Forest predicts pick/slip events within 0.09 s of human-annotated ground truth.

[LG-35] FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

Link: https://arxiv.org/abs/2604.24831
Authors: Srita Padmanabhuni,Bhargavi Karuturi,Jerusha Karen Indupalli,Santhan Reddy Chilla,Vivek Yelleti
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Deep Learning methods are becoming prominent in automated software bug detection; however, they lack a global understanding of the given code. Consequently, their performance tends to degrade, especially when they are applied to large interconnected code bases or complex modular programs. Recently, Large Language Models (LLMs) have proven to be effective at capturing dependencies among multiple interconnected modules in the codebase. This motivated us to propose the Flow-Graph-Driven Multi-Agent Framework (FGDM), which is composed of four agents that operate in a sequential manner. The framework converts the received code to a flow graph, identifies the erroneous segments, and further generates the repaired code. All the employed agents utilize Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT) prompts. Additionally, we integrated the FAISS vector database to retrieve similar previous bugs and their repairs. We demonstrated the efficacy of the proposed framework on 100 programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado, in both C and Python. Our experiments demonstrate that FGDM outperforms extant approaches, yielding mean Levenshtein-distance reductions of 24.33 and 8.37 and cosine similarities of 0.951 and 0.974 for Python and C, respectively.
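The first stage, converting received code into a flow graph before any LLM agent runs, can be approximated with the standard library. The sketch below builds a statement-level graph for Python source using `ast`; it is a toy stand-in for FGDM's graph construction (whose exact node/edge definitions the abstract does not specify), handling only sequential flow and descent into branch bodies.

```python
import ast

def flow_graph(source):
    """Statement-level flow graph from Python source. Nodes are
    statement kinds; edges link each statement to its successor and
    each branching statement to the first statement of its body."""
    tree = ast.parse(source)
    nodes, edges = [], []

    def walk(body, prev):
        for stmt in body:
            nid = len(nodes)
            nodes.append(type(stmt).__name__)
            if prev is not None:
                edges.append((prev, nid))
            if isinstance(stmt, (ast.If, ast.For, ast.While)):
                walk(stmt.body, nid)     # descend into the branch body
            prev = nid
        return prev

    walk(tree.body, None)
    return nodes, edges

nodes, edges = flow_graph("x = 1\nif x:\n    y = 2\nz = 3\n")
```

Handing the downstream agents a graph rather than raw text is what gives them the cross-statement structure the abstract argues plain deep-learning detectors lack.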

[LG-36] Negative Ontology of True Target for Machine Learning: Towards Evaluation and Learning under Democratic Supervision

链接: https://arxiv.org/abs/2604.24824
作者: Yongquan Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article philosophically examines how shifts in assumptions regarding the existence and non-existence of the true target (TT) give rise to new perspectives and insights for machine learning (ML)-based predictive modeling and, correspondingly, proposes a knowledge system for evaluation and learning under Democratic Supervision. By systematically analysing the existence assumption of the TT in current mainstream ML paradigms, we explicitly adopt a negative ontology perspective, positing that the TT does not objectively exist in the real world, and, grounded in this non-existence assumption, define Democratic Supervision for ML. We further present Multiple Inaccurate True Targets (MIATTs) as an instance-level realization of Democratic Supervision. Building upon MIATTs, we derive principles for the logic-driven generation and assessment of MIATTs, a logical assessment formulation for evaluation with MIATTs, and undefinable true target learning for learning with MIATTs. Based on these components, we establish the evaluation and learning with MIATTs (EL-MIATTs) framework for ML-based predictive modelling. A real-world application demonstrates the potential of the proposed EL-MIATTs framework in supporting education and professional development for individuals, aligning with prior discussions of Democratic Supervision in the fields of education and professional development.

[LG-37] A systematic literature Review for Transformer-based Software Vulnerability detection

链接: https://arxiv.org/abs/2604.24822
作者: Fiza Naseer,Javed Ali Khan,Muhammad Yaqoob,Alexios Mylonas,Ishaya Gambo
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Context: Software vulnerabilities pose significant security threats to software systems, especially as software is increasingly used across many areas of daily life, including health, government, and finance. Recently, transformer-based models have demonstrated promising results in automatic software vulnerability identification due to their robust contextual modelling and representation learning capabilities. Objectives: While numerous systematic literature reviews (SLRs) have examined machine learning and deep learning methods for identifying vulnerabilities, a more transformer-centric analysis remains to be explored. This SLR critically analysed 80 studies published between 2021 and 2025 that utilised transformer models to identify software vulnerabilities. Methods: Using Kitchenham's SLR guidelines, we methodically evaluate current research from various perspectives, encompassing study trends, datasets and sources, programming languages, transformer frameworks, detection detail levels, assessment metrics, reference models, types of vulnerabilities, and experimental configurations. Results: We classify transformer models into encoder, decoder, and combined architectures and analyse both pre-trained and fine-tuned versions utilized on source code, logs, and smart contracts. The results emphasise prevailing research trends, frequently utilised benchmarks, and main baselines. It also uncovers crucial technical issues like data imbalance, interpretability, scalability, and generalization across programming languages. Conclusion: By integrating current evidence and recognising unaddressed research areas, this SLR provides a consolidated resource for researchers and professionals seeking to develop more reliable, precise, and interpretable transformer-based vulnerability identification systems.

[LG-38] Heterogeneous Variational Inference for Markov Degradation Hazard Models: Discretized Mixture with Interpretable Clusters

链接: https://arxiv.org/abs/2604.24818
作者: Takato Yasuno
类目: Machine Learning (cs.LG)
*备注: 19 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Bayesian finite mixture models can identify discrete risk clusters (low-risk vs. high-risk equipment), but face three critical bottlenecks: (1) insufficient degradation signals from coarse state discretization, (2) unstable cluster identification when data inherently supports fewer clusters than explored, and (3) computational infeasibility of Markov Chain Monte Carlo (MCMC) methods for production deployment (7+ hours per model). We propose a practical framework combining (1) 8-state global percentile discretization that amplifies degradation events, (2) 30-dimensional feature engineering integrating statistical trends (22 features), continuous health indicators, and text embeddings (PCA-compressed to 3 dimensions), (3) interpretable model selection rules enforcing minimum cluster share and separation alongside WAIC, and (4) Automatic Differentiation Variational Inference (ADVI) with full-rank covariance for stable, fast estimation. Applied to 280 industrial pump equipment with 104,703 inspection records, we demonstrate that: (1) for random effect models (baseline), ADVI and NUTS produce nearly identical estimates with a 15x speedup, validating ADVI accuracy; (2) finite mixture models identify the optimal number of clusters under interpretability constraints; and (3) NUTS exhibits severe convergence issues and label switching, while ADVI provides stable results in 84x less time. Our contributions are: (1) the first demonstration that fine-grained state discretization (8-state) is essential for mixture-model stability in survival analysis; (2) a comprehensive feature engineering strategy combining statistical, continuous, and semantic signals; (3) practical interpretability rules preventing overfitting in automated model selection; and (4) empirical evidence that ADVI outperforms NUTS for finite mixture models in terms of convergence, stability, and computational efficiency.
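The 8-state global percentile discretization can be sketched as equal-frequency quantile binning over the pooled data; this is a generic illustration, not the authors' exact implementation:

```python
def percentile_bins(values, n_states=8):
    """Global percentile discretization: cut points at the k/n_states
    quantiles of the pooled data, so each state is (roughly) equally
    populated -- amplifying rare degradation events near the extremes."""
    s = sorted(values)
    n = len(s)
    return [s[(k * n) // n_states] for k in range(1, n_states)]

def discretize(v, cuts):
    """Map a raw reading to its state index 0..len(cuts)."""
    state = 0
    for c in cuts:
        if v >= c:
            state += 1
    return state

vals = [float(v) for v in range(100)]   # toy health-indicator readings
cuts = percentile_bins(vals, 8)         # 7 cut points -> 8 states
states = [discretize(v, cuts) for v in vals]
```

Each inspection record is then a symbol in an 8-state alphabet, which a Markov degradation model can transition over.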

[LG-39] minAction.net: Energy-First Neural Architecture Design – From Biological Principles to Systematic Validation

链接: https://arxiv.org/abs/2604.24805
作者: Martin G. Frasch
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Modern machine learning optimizes for accuracy without explicitly accounting for internal computational cost, even though physical and biological systems operate under intrinsic energy constraints. We evaluate energy-aware learning across 2,203 experiments spanning vision, text, neuromorphic, and physiological datasets, using 10 seeds per configuration and performing a factorial statistical analysis. Three findings emerge. First, architecture alone explains negligible variance in accuracy (partial eta^2 = 0.001). In contrast, the architecture x dataset interaction is large (partial eta^2 = 0.44, p < 0.001), demonstrating that optimal architecture depends critically on task modality and rejecting the assumption of a universal best architecture. Second, a controlled lambda-sweep over four orders of magnitude validates a single-parameter energy-regularized objective L = L_CE + lambda * E(theta, x): internal activation energy decreases to 6% of baseline at moderate lambda with no accuracy degradation on MNIST. Third, energy-first architectures inspired by an action-principle framework yield 5-33% within-modality training-efficiency gains over conventional baselines. These results emerge from a research program that interprets learning through a structural correspondence between the action functional in classical mechanics, free energy in statistical physics, and KL-regularized objectives in variational inference. We frame this correspondence as a design hypothesis rather than a derivation.
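A minimal sketch of the single-parameter objective L = L_CE + lambda * E(theta, x); note that taking E as the mean squared internal activation is our assumption, since the abstract does not pin down the energy term:

```python
import math

def energy_regularized_loss(logits, label, activations, lam):
    """L = L_CE + lam * E, with E taken here as the mean squared
    internal activation (one plausible reading of E(theta, x))."""
    # softmax cross-entropy over the logit vector
    m = max(logits)
    z = [math.exp(l - m) for l in logits]
    p = z[label] / sum(z)
    l_ce = -math.log(p)
    energy = sum(a * a for a in activations) / len(activations)
    return l_ce + lam * energy

acts = [0.5, -1.0, 2.0]                                   # toy activations
base = energy_regularized_loss([2.0, 0.1], 0, acts, lam=0.0)
reg  = energy_regularized_loss([2.0, 0.1], 0, acts, lam=0.1)
```

Sweeping `lam` trades internal activation energy against fit, which is the lambda-sweep the paper performs.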

[LG-40] Query-Efficient Quantum Approximate Optimization via Graph-Conditioned Trust Regions

链接: https://arxiv.org/abs/2604.24803
作者: Molena Huynh
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:In low-depth implementations of the Quantum Approximate Optimization Algorithm (QAOA), the dominant cost is often the number of objective evaluations rather than circuit depth. We introduce a graph-conditioned trust-region method for reducing this query cost. A graph neural network predicts a Gaussian distribution N(mu, Sigma) over QAOA angles. The mean initializes a local optimizer, the covariance defines an ellipsoidal trust region that constrains the search, and the predicted uncertainty determines an instance-dependent evaluation budget. Thus the learned distribution defines a search policy rather than only an initial parameter estimate. Under explicit assumptions on local smoothness, curvature, calibration, and noise, we derive bounds on objective degradation within the trust region, lower bounds on gradient variance, preservation of expected objective ordering under depolarizing noise, and finite-sample coverage guarantees. We evaluate the method for MaxCut at depth p = 2 on Erdos-Renyi, 3-regular, Barabasi-Albert, and Watts-Strogatz graphs with n = 8-16 vertices. Relative to random restarts and the strongest learned point-prediction baseline, the method reduces the mean number of circuit evaluations from 343 and 85 to 45 +/- 7, while maintaining sampled approximation ratios within 3 percentage points of concentration-based heuristics. The method does not improve absolute approximation ratios; its advantage is reduced query cost at comparable solution quality. The predictive uncertainty is calibrated in the experiments, with ECE = 0.052 and Spearman correlation rho = 0.770, and the learned trust regions transfer to graph sizes not used during training. The results identify a low-depth, query-dominated regime in which graph-conditioned trust regions reduce the query cost of QAOA without modifying the ansatz. 
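The ellipsoidal trust region defined by the predicted N(mu, Sigma) can be sketched as a Mahalanobis-distance membership test; the diagonal covariance and the angle values below are simplifications for illustration (the paper's GNN predicts a full covariance):

```python
import math

def in_trust_region(x, mu, sigma_diag, radius):
    """Keep candidate QAOA angles x iff their Mahalanobis distance
    under the predicted N(mu, Sigma) falls within `radius`.
    Diagonal Sigma is a simplification of the full covariance."""
    d2 = sum((xi - mi) ** 2 / s for xi, mi, s in zip(x, mu, sigma_diag))
    return math.sqrt(d2) <= radius

mu = [0.4, 0.8]        # predicted (gamma, beta) mean -- illustrative values
sigma = [0.01, 0.04]   # predicted per-angle variances
inside = in_trust_region([0.45, 0.9], mu, sigma, radius=1.0)
outside = in_trust_region([0.9, 0.2], mu, sigma, radius=1.0)
```

A local optimizer seeded at `mu` would only evaluate candidates passing this test, and a wider predicted `sigma` (higher uncertainty) buys a larger search region and evaluation budget.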

[LG-41] Explainable AI for Jet Tagging: A Comparative Study of GNNExplainer GNNShap and GradCAM for Jet Tagging in the Lund Jet Plane

链接: https://arxiv.org/abs/2604.25885
作者: Pahal D. Patel,Sanmay Ganguly
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 25 pages, 9 figures. Comments are welcome

点击查看摘要

Abstract:Graph neural networks such as ParticleNet and transformer-based networks on point clouds such as ParticleTransformer achieve state-of-the-art performance on jet tagging benchmarks at the Large Hadron Collider, yet the physical reasoning behind their predictions remains opaque. We present three explanation methods, perturbation-based (GNNExplainer), Shapley-value-based (GNNShap), and gradient-based (GradCAM), adapted to operate on LundNet's Lund-plane graph representation. Leveraging the fact that each node in the Lund plane corresponds to a physically meaningful parton splitting, we construct Monte Carlo truth explanation masks and introduce a physics-informed evaluation framework that goes beyond standard fidelity metrics. We perform the analysis in three transverse-momentum bins ( p_T \in [500,700] , [800,1000] , and the inclusive region [500,1000] GeV), revealing how explanation quality and focus shift between non-perturbative and perturbative regimes. We further quantify the correlation between explainer-assigned node importance and classical jet substructure observables – N -subjettiness ratios \tau_{21} and \tau_{32} and the energy correlation functions – establishing the degree to which the model has learned known QCD features. We find that overall the weight assigned by explainability methods correlates with analytic observables, with the expected shift across different phase-space regimes, indicating that a trained neural network indeed learns some aspects of jet-substructure moments. Our open-source implementation enables reproducible explainability studies for graph-based jet taggers.
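The reported importance-observable agreement uses Spearman rank correlation; a stdlib version (assuming no rank ties) that could cross-check explainer node weights against observables such as \tau_{21}:

```python
def spearman(x, y):
    """Spearman rank correlation; assumes no ties for simplicity."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, 1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # classical formula 1 - 6*sum(d^2)/(n(n^2-1)), valid without ties
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A value near +1 means the explainer's ranking of splittings tracks the analytic observable's ranking.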

[LG-42] Adaptive Meta-Learning Stochastic Gradient Hamiltonian Monte Carlo Simulation for Bayesian Updating of Structural Dynamic Models

链接: https://arxiv.org/abs/2604.25710
作者: Xianghao Meng,James L. Beck,Yong Huang,Hui Li
类目: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In the last few decades, Markov chain Monte Carlo (MCMC) methods have been widely applied to Bayesian updating of structural dynamic models in the field of structural health monitoring. Recently, several MCMC algorithms have been developed that incorporate neural networks to enhance their performance for specific Bayesian model updating problems. However, a common challenge with these approaches lies in the fact that the embedded neural networks often necessitate retraining when faced with new tasks, a process that is time-consuming and significantly undermines the competitiveness of these methods. This paper introduces a newly developed adaptive meta-learning stochastic gradient Hamiltonian Monte Carlo (AM-SGHMC) algorithm. The idea behind AM-SGHMC is to optimize the sampling strategy by training adaptive neural networks, and due to the adaptive design of the network inputs and outputs, the trained sampler can be directly applied to various Bayesian updating problems of the same type of structure without further training, thereby achieving meta-learning. Additionally, practical issues for the feasibility of the AM-SGHMC algorithm for structural dynamic model updating are addressed, and two examples involving Bayesian updating of multi-story building models with different model fidelity are used to demonstrate the effectiveness and generalization ability of the proposed method.
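A single SGHMC update with fixed hyperparameters can be sketched as below; in the paper, these fixed step-size and friction values are what the meta-learned adaptive network replaces, which is not reproduced here:

```python
import random

random.seed(7)

def sghmc_step(theta, momentum, grad_log_post, eps=0.01, friction=0.1):
    """One SGHMC update: the momentum absorbs the (stochastic) gradient,
    a friction term, and Gaussian noise matched to the friction; theta
    then follows the momentum."""
    noise_std = (2.0 * friction * eps) ** 0.5
    momentum = [(1.0 - friction) * m + eps * g + noise_std * random.gauss(0, 1)
                for m, g in zip(momentum, grad_log_post(theta))]
    theta = [t + m for t, m in zip(theta, momentum)]
    return theta, momentum

# toy target: standard normal posterior, so grad log p(theta) = -theta
grad = lambda th: [-t for t in th]
theta, mom = [3.0], [0.0]
for _ in range(5000):
    theta, mom = sghmc_step(theta, mom, grad)
```

After burn-in, the chain samples approximately from the target posterior; the meta-learned sampler's role is to choose these dynamics adaptively per problem.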

[LG-43] Deflation-Free Optimal Scoring

链接: https://arxiv.org/abs/2604.25664
作者: Sharmin Afroz,Brendan Ames
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Sparse Optimal Scoring (SOS) reformulates linear discriminant analysis to enable feature selection through elastic net regularization, making it well-suited for high-dimensional settings where the number of features exceeds observations. Most existing SOS methods use deflation-based strategies that compute discriminant vectors sequentially, which can propagate errors and produce suboptimal solutions. We propose a novel approach that estimates all discriminant vectors simultaneously under an explicit global orthogonality constraint, which we call Deflation-Free Sparse Optimal Scoring (DFSOS). DFSOS combines Bregman iteration with orthogonality-constrained optimization, decomposing the problem into tractable subproblems for scoring vectors, discriminant vectors, and orthogonality enforcement. We establish convergence to stationary points of the augmented Lagrangian under mild conditions. Extensive experiments using synthetic data and real-world time series data demonstrate that DFSOS achieves classification accuracy comparable to or better than existing deflation-based methods. These results indicate that deflation-free approaches offer a robust and effective framework for sparse discriminant analysis in high-dimensional problems.
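The global orthogonality constraint on discriminant vectors can be illustrated with classical Gram-Schmidt projection; DFSOS itself enforces orthogonality inside a constrained optimization subproblem rather than by post-hoc projection:

```python
def orthogonalize(vectors):
    """Classical Gram-Schmidt: subtract from each vector its projection
    onto the previously accepted ones, leaving a mutually orthogonal set."""
    out = []
    for v in vectors:
        w = v[:]
        for u in out:
            coef = sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
            w = [a - coef * b for a, b in zip(w, u)]
        out.append(w)
    return out

Q = orthogonalize([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
```

Estimating all discriminant vectors jointly under such a constraint is what avoids the error propagation of sequential deflation.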

[LG-44] Residual-loss Anomaly Analysis of Physics-Informed Neural Networks: An Inverse Method for Change-point Detection in Nonlinear Dynamical Systems with Regime Switching

链接: https://arxiv.org/abs/2604.25655
作者: Yuhe Bai,Chengli Tan,Jiaqi Li,Xiangjun Wang,Zhikun Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Nonlinear dynamical systems with regime transitions are typically described by ordinary differential equations with jumping parameters. Traditional methods often treat change-point detection and parameter estimation as separate tasks, ignoring the inherent coupling between them. To address this, we propose residual-loss anomaly analysis of physics-informed neural networks, a unified framework that leverages dynamical consistency within the physics-informed learning paradigm. This approach jointly infers piecewise parameters and transition points under a single set of constraints. The method follows a two-stage strategy: First, local physical residuals are analyzed through overlapping subinterval decomposition. When a subinterval spans a true transition point, the residual exhibits a distinct structural elevation in noise-free conditions, which has a non-zero lower bound, enabling effective localization of potential transition intervals. Second, within our framework, change-point locations and piecewise parameters are integrated into a unified physical loss function for joint optimization, enabling simultaneous identification. Experiments on benchmark nonlinear dynamical systems, including Malthusian and logistic growth models, Van der Pol oscillator, Lotka-Volterra model and Lorenz system, demonstrate that the proposed method outperforms traditional decoupled approaches in both change-point localization and parameter estimation accuracy. This study provides an efficient, unified solution for structurally coupled inverse problems in nonlinear dynamical systems with regime switching.
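The first stage, analysing local residuals over overlapping subintervals, can be sketched on a toy piecewise-linear trajectory: windows spanning the regime switch cannot be fit by a single line, so their residual spikes. This substitutes an ordinary least-squares residual for the physics-informed network residual, purely for illustration:

```python
def window_residual(y, start, w):
    """Sum of squared errors of the best-fit line on y[start:start+w]."""
    xs = list(range(w))
    seg = y[start:start + w]
    n = float(w)
    mx = sum(xs) / n
    my = sum(seg) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (v - my) for x, v in zip(xs, seg)) / sxx
    return sum((v - (my + slope * (x - mx))) ** 2 for x, v in zip(xs, seg))

# piecewise-linear trajectory: slope 1.0 switches to 3.0 at k = 50
y = [float(k) if k < 50 else 50.0 + 3.0 * (k - 50) for k in range(100)]
w = 20
res = [window_residual(y, s, w) for s in range(0, 100 - w)]
change = max(range(len(res)), key=lambda s: res[s]) + w // 2  # window center
```

Windows entirely inside one regime fit perfectly (near-zero residual); the residual peaks when the switch sits at the window center, localizing the change point.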

[LG-45] Dictionary learning for Kernel EDMD

链接: https://arxiv.org/abs/2604.25572
作者: Erik Lien Bolager,Boumediene Hamzi,Houman Owhadi,Ioannis G. Kevrekidis,Felix Dietrich
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Studying nonlinear dynamical systems through their state space behavior can be challenging, and one possible alternative is to analyze them via their associated Koopman operator. This turns the nonlinear problem into a linear, infinite-dimensional one. To approximate the operator in finite dimensions, extended dynamic mode decomposition (EDMD) is a commonly used algorithm. It requires a finite list of functionals and a set of snapshots from the system to compute an approximation of the operator and its corresponding spectrum. Instead of choosing the list of functionals directly, it can be implicitly defined via kernels, a method known as kernel extended dynamic mode decomposition (kEDMD). However, one still needs to define the kernel and choose its parameter values. In this paper, we aim to streamline this process by extending dictionary learning for EDMD to kernel learning in kEDMD. By simplifying kEDMD we show how to perform gradient-based optimization over the learnable kernel parameters, and demonstrate that this method leads to useful kernels for the original kEDMD. The focus of our work is a method that takes a weighted list of kernels with randomly initialized values as input and outputs a list of kernels and parameter values suitable for approximating the Koopman operator of the underlying system. We demonstrate that unimportant kernels can be removed from the list by analyzing the weights in the weighted sum. We evaluate the method across several experiments, including the Duffing oscillator and the Kuramoto-Sivashinsky PDE, showcasing the method’s different strengths.
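The EDMD core that kEDMD builds on can be sketched in a few lines: with a finite dictionary psi, the Koopman matrix solves the normal equations K = G^{-1} A with G = <psi(x) psi(x)^T> and A = <psi(x) psi(y)^T>. Below, a two-function dictionary recovers the exact eigenvalues 0.5 and 0.25 of the linear map x' = 0.5x (kernelization and dictionary learning are omitted):

```python
def edmd_1d(xs, ys):
    """EDMD with dictionary psi(x) = (x, x^2): K = G^{-1} A where
    G, A are the empirical second-moment matrices of the dictionary
    evaluated on snapshot pairs (x, y = F(x))."""
    def psi(v):
        return (v, v * v)
    n = len(xs)
    G = [[0.0, 0.0], [0.0, 0.0]]
    A = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        px, py = psi(x), psi(y)
        for i in range(2):
            for j in range(2):
                G[i][j] += px[i] * px[j] / n
                A[i][j] += px[i] * py[j] / n
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    Ginv = [[G[1][1] / det, -G[0][1] / det],
            [-G[1][0] / det, G[0][0] / det]]
    # K = Ginv @ A, so psi_j(y) ~= sum_i psi_i(x) * K[i][j]
    return [[sum(Ginv[i][k] * A[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

xs = [0.3, -0.7, 1.1, 0.9, -1.3, 0.5]
ys = [0.5 * x for x in xs]   # snapshots of the linear system x' = 0.5 x
K = edmd_1d(xs, ys)
```

Since x maps to 0.5x and x^2 to 0.25x^2, K is diagonal with entries 0.5 and 0.25, the Koopman eigenvalues on this dictionary; kEDMD replaces the explicit dictionary with a kernel, whose parameters the paper then learns.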

[LG-46] Adaptable phase retrieval for coherent transition radiation spectroscopy based on differentiable physics information

链接: https://arxiv.org/abs/2604.25489
作者: Ritz Ann Aguilar,Maxwell LaBerge,Andreas Doepp,Alexander Debus,Zewu Bi,Michael Bussmann,Arie Irman,Ulrich Schramm,Jeffrey Kelling
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:Coherent transition radiation (CTR) spectroscopy is a critical diagnostic for characterizing the longitudinal structure of relativistic electron bunches in laser-plasma and conventional accelerators. In practice, recovering the bunch profile from a measured CTR spectrum is an ill-posed phase-retrieval problem. Traditionally, this is addressed using Gerchberg-Saxton (GS)-type iterative algorithms. However, these implementations often rely on explicit inverse propagators, making them difficult to adapt to sophisticated experimental forward models. In this work, we introduce a flexible gradient-based framework for CTR phase retrieval. By leveraging a differentiable forward model, we propose a phase-only gradient descent (GD-Phase) approach that enforces the measured spectral amplitude as a hard constraint while optimizing the Fourier phase under physical real-space priors. Using synthetic CTR spectra spanning multi-peaked and strongly modulated profiles, we benchmark GD-Phase against traditional GS and a real-space amplitude-parametrized gradient descent (GD-Amp) algorithm. Unlike traditional methods, this formulation allows for the seamless inclusion of arbitrary differentiable experimental effects into the reconstruction loop. We demonstrate that this physics-informed approach not only reproduces the fidelity of GS methods but also establishes a robust baseline for incorporating multi-diagnostic constraints and uncertainty quantification. This enables the systematic extension to higher-dimensional, multimodal, and uncertainty-aware diagnostics, facilitating fast and scalable phase retrieval in realistic experimental settings.
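The hard spectral-amplitude constraint shared by GS and the proposed GD-Phase scheme amounts to resetting each Fourier coefficient's magnitude to the measured value while keeping its current phase; a minimal sketch of that projection step:

```python
import cmath

def enforce_amplitude(z, measured_amp):
    """Hard spectral-amplitude constraint: keep the current phase of
    each Fourier coefficient but reset its magnitude to the measured
    CTR spectrum value (the projection step shared by GS-type loops)."""
    out = []
    for c, a in zip(z, measured_amp):
        phase = cmath.phase(c) if abs(c) > 0 else 0.0
        out.append(cmath.rect(a, phase))
    return out

z = [complex(3, 4), complex(0, -2)]      # current complex spectrum estimate
proj = enforce_amplitude(z, [1.0, 5.0])  # measured amplitudes
```

In GD-Phase only the phases are free parameters, so this constraint holds by construction and gradients flow through the differentiable forward model into the phase variables.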

[LG-47] Emergent Self-Attention from Astrocyte-Gated Associative Memory Dynamics

链接: https://arxiv.org/abs/2604.25481
作者: Arnau Vivet,Alex Arenas
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:We introduce a Hopfield-type associative memory in which effective connectivity is multiplicatively modulated by astrocytic gains evolving under an entropy-regularized replicator equation. The coupled neuron-astrocyte dynamics admit a Lyapunov function, ensuring global convergence. At fixed points, astrocytic gains implement a softmax-normalized allocation over pattern similarity scores, yielding a mechanistic realization of self-attention as emergent routing on the gain simplex. In regimes of high memory load and interference, the model significantly improves retrieval accuracy relative to classical Hopfield dynamics and recent neuron-astrocyte baselines. These results establish a dynamical systems framework linking glial modulation, competitive resource allocation, and attention-like computation.
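The claimed fixed point can be checked numerically: Euler-integrating the entropy-regularized replicator equation drives the astrocytic gains to softmax(s / beta), the attention-like allocation. A small sketch (scores and beta are illustrative):

```python
import math

def replicator_fixed_point(scores, beta, steps=20000, dt=0.01):
    """Euler-integrate g_i' = g_i * (s_i - <s> - beta*(log g_i - <log g>)),
    the entropy-regularized replicator equation on the gain simplex."""
    n = len(scores)
    g = [1.0 / n] * n
    for _ in range(steps):
        logs = [math.log(v) for v in g]
        sbar = sum(gi * si for gi, si in zip(g, scores))
        lbar = sum(gi * li for gi, li in zip(g, logs))
        g = [gi + dt * gi * (si - sbar - beta * (li - lbar))
             for gi, si, li in zip(g, scores, logs)]
        t = sum(g)
        g = [gi / t for gi in g]   # renormalize against Euler drift
    return g

def softmax(scores, beta):
    m = max(scores)
    z = [math.exp((s - m) / beta) for s in scores]
    t = sum(z)
    return [v / t for v in z]

scores = [1.0, 0.2, -0.5]   # pattern similarity scores (illustrative)
g = replicator_fixed_point(scores, beta=0.5)
target = softmax(scores, beta=0.5)
```

The gains converge to the softmax-normalized allocation, which is the sense in which self-attention emerges from the glial dynamics.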

[LG-48] From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability

链接: https://arxiv.org/abs/2604.25372
作者: Amir Ali Farzin,Philipp Braun,Iman Shames
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:While it is generally understood that zeroth-order (ZO) algorithms have an extra dependency on their number of iterations for any choice of parameters, compared to their first-order (FO) counterparts, in this work, we show that under several conditions, in expectation, ZO methods do not suffer from extra dimension dependencies in their convergence rates with respect to their FO counterparts. We look at optimisation algorithms from the dynamical systems perspective and analyse the conditions under which one can formulate the average of a ZO algorithm as the average of its FO counterpart with bounded perturbations with values dependent on design parameters. Then, using input-to-state stability properties, we show ZO methods follow the same decay rate as their FO counterparts and converge to a neighbourhood of the fixed point of FO methods, where its radius depends on the bound of the norm of the perturbations, which can be made arbitrarily small. The theoretical findings are illustrated via numerical examples.
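The ZO-tracks-FO claim can be illustrated on a quadratic: a two-point zeroth-order gradient estimate with random unit directions follows plain gradient descent into a neighbourhood of the optimum. The smoothing scheme below is one common choice, not necessarily the paper's exact construction:

```python
import random

random.seed(1)

def grad_f(x):          # f(x) = 0.5 * ||x||^2, so grad f(x) = x
    return list(x)

def zo_grad(x, delta=1e-3):
    """Two-point ZO estimate: g = d * (f(x+delta*u) - f(x-delta*u)) / (2*delta) * u
    with u a random unit direction; E[g] equals grad f for quadratics."""
    d = len(x)
    u = [random.gauss(0, 1) for _ in range(d)]
    norm = sum(v * v for v in u) ** 0.5
    u = [v / norm for v in u]
    def f(p):
        return 0.5 * sum(v * v for v in p)
    fp = f([xi + delta * ui for xi, ui in zip(x, u)])
    fm = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (fp - fm) / (2 * delta)
    return [scale * ui for ui in u]

x_fo = [5.0, -3.0]
x_zo = [5.0, -3.0]
eta = 0.1
for _ in range(300):
    x_fo = [xi - eta * gi for xi, gi in zip(x_fo, grad_f(x_fo))]
    x_zo = [xi - eta * gi for xi, gi in zip(x_zo, zo_grad(x_zo))]
```

Viewing the ZO iterate as the FO iterate plus a bounded perturbation is the paper's input-to-state-stability argument: both decay at the same rate, with ZO settling in a small neighbourhood.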

[LG-49] Online learning with Erdős-Rényi side-observation graphs ICML

链接: https://arxiv.org/abs/2604.25271
作者: Tomáš Kocák,Gergely Neu,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at International Conference on Machine Learning (ICML) 2015. 11 pages

点击查看摘要

Abstract:We consider adversarial multi-armed bandit problems where the learner is allowed to observe the losses of a number of arms besides the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability r, independently of each other and of the action of the learner. We propose two algorithms that work for different ranges of r. We show that after T rounds in a bandit problem with N arms, the expected regret of our first algorithm is O(\sqrt{(T/r) \log N}) whenever r \ge (\log T)/(2N), while our second algorithm achieves a regret of O(\sqrt{(T/r) \log(N+T)}) for smaller values of r. We also give a quick estimation procedure that decides the range of r. All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know r.
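The key estimator behind such algorithms can be sketched for a single arm: a loss is observed if the arm is played or reveals itself as a side observation, so dividing by the total observation probability yields an unbiased loss estimate. The probabilities below are illustrative:

```python
import random

random.seed(0)

def estimate_loss(true_loss, p_choose, r, rounds=200000):
    """Importance-weighted loss estimate for one arm: the loss is seen
    if the arm is played (prob p_choose) or revealed as a side
    observation (prob r), so q = p_choose + (1 - p_choose)*r and
    loss * 1{seen} / q is unbiased."""
    q = p_choose + (1.0 - p_choose) * r
    total = 0.0
    for _ in range(rounds):
        seen = random.random() < q
        total += true_loss * seen / q
    return total / rounds

est = estimate_loss(true_loss=0.7, p_choose=0.1, r=0.3)
```

Larger r raises q and hence lowers the estimator's variance, which is why the regret scales with sqrt(T/r); when r is unknown, it must itself be estimated, as the paper's quick estimation procedure does.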

[LG-50] Learning Structure Energy and Dynamics: A Survey of Artificial Intelligence for Protein Dynamics

链接: https://arxiv.org/abs/2604.25244
作者: Haocheng Tang,Liang Shi,Ya-Shi Zhang,Xixian Liu,Jian Tang,Jiarui Lu
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Protein dynamics underlie many biological functions, yet remain difficult to characterize due to the high computational cost of molecular dynamics simulations and the scarcity of dynamic structural data. This survey reviews recent advances in artificial intelligence for protein dynamics from three perspectives: learning from structural ensembles and trajectories, learning from physical energy signals, and learning to accelerate molecular simulations. We summarize representative methods for conformation ensemble generation, trajectory generation, Boltzmann generators, physics-aware adaptation, machine learning potentials, coarse-grained modeling, and collective variable discovery. We further discuss available datasets and key open challenges, such as scalability, thermodynamic consistency, kinetic fidelity, and integration with experimental constraints.

[LG-51] Conditional Flow Matching for Probabilistic Downscaling of Maximum 3-day Snowfall in Alaska

链接: https://arxiv.org/abs/2604.25172
作者: Douglas Brinkerhoff,Elizabeth Fischer
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Precipitation in complex terrain is governed by orographic processes operating at scales of a few kilometers, yet climate models typically run at resolutions of 50-100 km where this topographic detail is absent. Dynamical downscaling with high-resolution regional models such as WRF can resolve these processes, but the computational cost – months of wall-clock time per scenario – precludes the large ensembles needed for uncertainty quantification. We present WxFlow, a conditional generative model based on flow matching that learns to map coarse-resolution climate model output and high-resolution topography to calibrated probabilistic ensembles of fine-scale precipitation fields. Applied to 4 km WRF simulations of maximum 3-day snowfall over southeast Alaska, WxFlow achieves 87.8% improvement in spectral fidelity and dramatically lower Continuous Ranked Probability Scores relative to conventional lapse-rate-corrected bicubic downscaling, while generating 50-member ensembles in seconds on a laptop. Ensemble spread is spatially coherent and governed by topography, reflecting physically plausible uncertainty structure. All code is available at this https URL.
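The Continuous Ranked Probability Score used to evaluate the ensembles has a standard empirical form for a finite ensemble; a stdlib sketch:

```python
def crps_ensemble(members, obs):
    """Empirical CRPS for an m-member ensemble against one observation:
    mean |x_i - y| minus half the mean pairwise spread |x_i - x_j|."""
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term1 - term2

score = crps_ensemble([0.0, 1.0], obs=0.0)   # -> 0.25
```

Lower is better: CRPS rewards ensembles that are both accurate (small first term) and appropriately sharp (spread penalized only via the second term), so a perfect deterministic forecast scores zero.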

[LG-52] Elite-Driven Support Vector Machines for Classification

链接: https://arxiv.org/abs/2604.25158
作者: Mohammad Jafari Jozani,Bahram Moeinianfar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 41 pages, 4 figures

点击查看摘要

Abstract:Support vector machines (SVMs) are a standard tool for binary classification, but their classical formulations are purely data-driven and offer no direct way to encode trusted benchmark models or structured preferences on selected subsets of the data. We propose Elite-Driven Support Vector Machines (EDSVM), a general framework that augments regularized empirical risk minimization by guiding the slack variables for a curated set of elite observations (typically the union of support vectors from one or more reference SVMs). EDSVM combines the usual slack loss with a deviation penalty that shrinks new slacks toward benchmark slack values, defining a localized, margin-aligned notion of proximity to reference models, unlike global function penalties in knowledge distillation or teacher-student methods, and without requiring privileged features as in SVM+/LUPI. Within this framework we develop two concrete models, C-EDSVM and LS-EDSVM, based respectively on hinge-type and squared-slack losses. For both variants we derive dual quadratic programs that can be implemented with modest modifications of standard SVM solvers, and we give simple sufficient conditions under which the induced margin losses are classification calibrated. Simulation studies and experiments on several UCI benchmarks show that EDSVMs closely track the behaviour induced by reference SVMs while achieving predictive performance that is competitive with, and sometimes better than, C-SVM, LINEX-SVM, and LS-SVM.
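The EDSVM objective can be sketched as hinge slacks plus a deviation penalty shrinking slacks of elite points toward benchmark values; the linear model, C, mu, and the quadratic form of the deviation penalty below are our illustrative reading of the abstract:

```python
def edsvm_objective(w, b, data, bench_slacks, C=1.0, mu=0.5):
    """Elite-driven objective sketch: ridge term + C * hinge slacks
    + mu * squared deviation of each slack from its benchmark value
    (the benchmark slacks come from a reference SVM's elite points)."""
    margin_loss = 0.0
    deviation = 0.0
    for (x, y), xi_bench in zip(data, bench_slacks):
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * f)           # hinge slack
        margin_loss += slack
        deviation += (slack - xi_bench) ** 2    # shrink toward benchmark
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * margin_loss + mu * deviation

data = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), +1)]
obj = edsvm_objective([1.0, 0.0], 0.0, data, bench_slacks=[0.0, 0.0, 0.4])
```

Setting mu to zero recovers the ordinary soft-margin objective; increasing mu makes the new solution track the margin behaviour of the reference model on the elite set.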

[LG-53] Fractionally Supervised Classification with Maxima Nominated Samples

链接: https://arxiv.org/abs/2604.25145
作者: Mohammad Jafari Jozani,Jingyu Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 7 figures

点击查看摘要

Abstract:Fractionally supervised classification (FSC) offers a flexible framework for combining labeled and unlabeled data in model-based classification, but existing formulations assume simple random sampling. In many applications, however, the retained observation is an extreme order statistic from a set rather than a randomly selected unit. This is particularly appealing when the target population is rare, since maxima nomination sampling (NS) can enrich the sample with the most informative observations, as in screening, environmental monitoring, repeated testing, and reliability studies. Under such designs, the likelihood function changes fundamentally, and the usual FSC EM construction is no longer valid. We develop FSC for nominated samples by introducing a latent representation that accounts for both the class membership of the observed maximum and the latent composition of the remaining units in the set. The resulting method yields a proper EM algorithm and a coherent weighted-likelihood FSC procedure for NS data. We present the methodology in general form, illustrate it for rare-event contaminated normal mixtures, and show through simulation that it substantially improves on the misspecified alternative that ignores the extra rank information of such data. A real-data analysis demonstrates its practical value.

[LG-54] Accelerating Regularized Attention Kernel Regression for Spectrum Cartography

链接: https://arxiv.org/abs/2604.25138
作者: Liping Tao,Chee Wei Tan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spectrum cartography reconstructs spatial radio fields from sparse and heterogeneous wireless measurements, underpinning many sensing and optimization tasks in wireless networks. Attention mechanisms have recently enabled adaptive measurement aggregation via attention kernel-based formulations. However, the resulting exponential kernels exhibit severe spectral imbalance, inducing large condition numbers that render standard iterative solvers ineffective for regularized attention kernel regression. This paper proposes a Learning-based Attention Kernel Regression (LAKER) algorithm for accelerating regularized attention kernel regression in spectrum cartography. The key idea is to learn a data-dependent preconditioner that captures the inverse spectral structure of the attention kernel system, directly reducing the condition number bottleneck. The preconditioner is obtained by solving a regularized maximum-likelihood estimation problem via a shrinkage-regularized convex–concave procedure, and is integrated with a preconditioned conjugate gradient solver for efficient optimization, whose solution is used for radio map reconstruction. Extensive experiments demonstrate that LAKER significantly reduces condition numbers by up to three orders of magnitude, accelerates convergence by over twenty-fold compared to baselines, and maintains high reconstruction accuracy, establishing learning-based preconditioning as an effective approach for attention kernel regression in spectrum cartography.
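LAKER pairs a learned preconditioner with preconditioned conjugate gradients; the sketch below uses a simple Jacobi (diagonal) preconditioner as a stand-in for the learned one, to show the PCG mechanics on an ill-conditioned SPD system:

```python
def pcg(A, b, tol=1e-10, max_iter=500):
    """Preconditioned conjugate gradients with a Jacobi (diagonal)
    preconditioner -- a stand-in for LAKER's learned preconditioner,
    which instead targets the attention kernel's inverse spectrum."""
    n = len(b)
    x = [0.0] * n
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    r = [bi - ri for bi, ri in zip(b, matvec(A, x))]
    z = [ri / A[i][i] for i, ri in enumerate(r)]   # apply M^{-1}
    p = z[:]
    rz = sum(a * c for a, c in zip(r, z))
    for it in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / sum(a * c for a, c in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [ri / A[i][i] for i, ri in enumerate(r)]
        rz_new = sum(a * c for a, c in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, it + 1

# ill-conditioned SPD system (condition number ~1e4 via its diagonal)
A = [[1.0, 0.1, 0.0], [0.1, 100.0, 0.2], [0.0, 0.2, 10000.0]]
b = [1.0, 2.0, 3.0]
x, iters = pcg(A, b)
```

For diagonally dominated spectra like this one, the Jacobi preconditioner already collapses the condition number; LAKER's point is that attention kernels need a richer, data-dependent preconditioner fitted by regularized maximum likelihood.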

[LG-55] Quantum Dynamics via Score Matching on Bohmian Trajectories

Link: https://arxiv.org/abs/2604.25137
Authors: Lei Wang
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments: 8 pages, 5 figures, code at this https URL


Abstract:We solve the time-dependent Schrödinger equation by learning the score function, the gradient of the log-probability density, on Bohmian trajectories. In Bohm’s formulation of quantum mechanics, particles follow deterministic paths under the classical potential supplemented by a quantum potential depending on the score function of the evolving density. These non-crossing Bohmian trajectories form a continuous normalizing flow governed by the score. We parametrize the score with a neural network and minimize a self-consistent Fisher divergence between the network and the score of the resulting density. We prove that the zero-loss minimizer of this self-consistent objective recovers Schrödinger dynamics for nodeless wave functions, a condition naturally met in quantum vibrations of atoms. We demonstrate the approach on wavepacket splitting in a double-well potential and anharmonic vibrations of a Morse chain. By recasting real-time quantum dynamics as a self-consistent score-driven normalizing flow, this framework opens the time-dependent Schrödinger equation to the rapidly advancing toolkit of modern generative modeling.
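The central object here, the score function learned from samples without density values, can be illustrated with the implicit (Hyvärinen) score-matching objective on a toy 1-D Gaussian. The paper trains a neural score on Bohmian trajectories with a self-consistent Fisher divergence; this sketch instead uses a linear score model, whose minimiser has a closed form (the Gaussian parameters below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from an "unknown" density -- here N(1.5, 0.7^2) -- whose score we want.
x = 1.5 + 0.7 * rng.standard_normal(10000)

def J(a, b):
    """Empirical implicit score-matching objective for s(z) = a*z + b:
    J = E[0.5*s(x)^2 + s'(x)], with s'(x) = a; no density values needed."""
    s = a * x + b
    return np.mean(0.5 * s ** 2 + a)

# setting dJ/da = dJ/db = 0 for the linear model gives a closed-form minimiser:
# b = -a*E[x] and a = -1/Var(x)
a = -1.0 / np.var(x)
b = -a * np.mean(x)

def score(z):
    return a * z + b   # recovers the Gaussian score -(z - mu) / sigma^2
```

For the true N(1.5, 0.7²) the analytic score is -(z - 1.5)/0.49, which the fitted linear model matches up to sampling error — the same principle, scaled up to neural networks and evolving trajectory ensembles, drives the paper's method.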

[LG-56] Learning biophysical models of gene regulation with probability flow matching

Link: https://arxiv.org/abs/2604.25062
Authors: Suryanarayana Maddu, Victor Chardès, Michael J. Shelley
Subjects: Molecular Networks (q-bio.MN); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)


Abstract:Cellular differentiation is governed by gene regulatory networks, the high-dimensional stochastic biochemical systems that determine the transcriptional landscape and mediate cellular responses to signals and perturbations. Although single-cell RNA sequencing provides quantitative snapshots of the transcriptome, current methods for inferring gene-regulatory dynamics often lack mechanistic interpretability and fail to generalize to unseen conditions. Here we introduce Probability Flow Matching (PFM), a scalable framework for learning biophysically consistent stochastic processes directly from time-resolved single-cell measurements. Applying PFM to three hematopoiesis datasets, we show that models with similar interpolation accuracy can encode fundamentally different dynamics, with only biophysically consistent formulations accurately capturing mechanisms of lineage transitions, fate specification, and gene perturbation responses. We further demonstrate that PFM accommodates unbalanced populations, enabling simultaneous inference of cellular proliferation and death dynamics. Together, these results establish PFM as a flexible, scalable framework for integrating mechanistic modeling with single-cell omics.
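The flow-matching machinery underlying PFM — regress a velocity field on interpolants between paired snapshots, then integrate it to transport one population into another — can be sketched on a 1-D toy. This is generic conditional flow matching, not the paper's biophysically constrained formulation; the Gaussian source/target and the tiny linear velocity model are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pair source and target samples, build the linear interpolant, and regress
# the conditional velocity u_t = x1 - x0 on (x_t, t): the flow-matching recipe.
n = 4000
x0 = rng.standard_normal(n)                  # "early" snapshot: N(0, 1)
x1 = 3.0 + 0.5 * rng.standard_normal(n)      # "late" snapshot: N(3, 0.5^2)
t = rng.uniform(0.0, 1.0, n)
xt = (1.0 - t) * x0 + t * x1                 # interpolant sample
ut = x1 - x0                                 # conditional velocity target

# a deliberately tiny velocity model v(x, t) = c0*x + c1*t + c2, fit by least squares
A = np.stack([xt, t, np.ones(n)], axis=1)
coef, *_ = np.linalg.lstsq(A, ut, rcond=None)

def v(x, s):
    return coef[0] * x + coef[1] * s + coef[2]

# transport fresh source samples by Euler-integrating dx/dt = v(x, t)
x = rng.standard_normal(2000)
for step in range(100):
    x = x + 0.01 * v(x, step / 100.0)
```

After integration the transported samples concentrate near the target mean of 3. PFM's point is that many velocity fields interpolate snapshots equally well, so the model class must be constrained to biophysically meaningful dynamics — something this unconstrained least-squares fit does not do.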

[LG-57] A Finite Time Analysis of Thompson Sampling for Bayesian Optimization with Preferential Feedback AISTATS2026

Link: https://arxiv.org/abs/2604.25025
Authors: Joseph Lazzaro, Davide Buffelli, Da-shan Shiu, Sattar Vakili
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: AISTATS 2026


Abstract:Preference feedback, in the form of pairwise comparisons rather than scalar scores, has seen increasing use in applications such as human-, laboratory-, and expert-in-the-loop design, as well as scientific discovery. We propose a Thompson Sampling (TS) approach to Bayesian optimization with preferential feedback that models comparisons using a monotone link on latent utility differences and leverages the dueling kernel induced by a base kernel. We provide a finite-time analysis showing that the performance of the proposed method matches that of standard TS for conventional Bayesian optimization with scalar feedback. The analysis exploits the anchor invariance of TS for challenger selection and introduces a double-TS pairing variant. We also demonstrate the performance of the method on both synthetic and real-world examples.
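The "double-TS pairing" idea — draw one posterior sample to pick a candidate, a second independent sample to pick its challenger, then duel them — can be shown in its simplest discrete form. The paper works with GP utilities and a dueling kernel; this Beta-Bernoulli K-armed sketch, with made-up latent utilities and a logistic link, only illustrates the pairing mechanism:

```python
import numpy as np

def double_ts_duel(true_p, T=2000, seed=1):
    """Double Thompson Sampling for a K-armed dueling bandit (Beta-Bernoulli sketch)."""
    K = true_p.shape[0]
    rng = np.random.default_rng(seed)
    wins = np.ones((K, K))     # wins[i, j]: times i beat j (plus a Beta(1,1) prior)
    losses = np.ones((K, K))   # losses[i, j]: times i lost to j
    pulls = np.zeros(K, dtype=int)
    for _ in range(T):
        theta = rng.beta(wins, losses)              # sample 1: full preference matrix
        iu = np.triu_indices(K, 1)
        theta[(iu[1], iu[0])] = 1.0 - theta[iu]     # enforce theta_ji = 1 - theta_ij
        np.fill_diagonal(theta, 0.5)
        first = int(np.argmax((theta > 0.5).sum(axis=1)))    # Copeland winner of sample 1
        theta2 = rng.beta(wins[:, first], losses[:, first])  # sample 2: vs `first` only
        theta2[first] = -np.inf                              # force a distinct challenger
        second = int(np.argmax(theta2))                      # most promising challenger
        if rng.random() < true_p[first, second]:             # run the duel
            wins[first, second] += 1; losses[second, first] += 1
        else:
            wins[second, first] += 1; losses[first, second] += 1
        pulls[first] += 1; pulls[second] += 1
    return pulls

u = np.array([1.0, 0.5, 0.3, 0.2, 0.1])                     # hypothetical latent utilities
true_p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))   # logistic (monotone) link
pulls = double_ts_duel(true_p)
```

Over time the best arm is dueled far more often than the worst — the behaviour the paper's finite-time analysis quantifies in the continuous, kernelised setting.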

[LG-58] PINNs in More General Geometry

Link: https://arxiv.org/abs/2604.25020
Authors: Edward Hirst
Subjects: Differential Geometry (math.DG); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 10 pages, 6 figures


Abstract:Neural architectures trained with losses inspired by differential conditions are the basis for PINN models. Since many constructions in differential geometry may be framed as the minimisation of a differential functional, these functionals can be coded as loss functions, aligning the AI loss-minimisation goal with that of solving the geometric problem. This contribution to the Recent Progress in Computational String Geometry workshop proceedings introduces the defining principles of the PINN architecture, explains why they are well suited to problems in differential geometry, and demonstrates their use via summaries of three works at this intersection.
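The "geometric functional as loss" principle can be demonstrated without a neural network: minimising the discrete energy of a curve on the unit sphere recovers the geodesic, just as a PINN would minimise the same functional over network parameters. The sphere-geodesic example, step counts, and learning rate below are illustrative choices, not from the article:

```python
import numpy as np

def geodesic_on_sphere(p, q, n=50, steps=5000, lr=0.1):
    """Minimise the discrete energy of a curve on the unit sphere between p and q.
    The minimiser is the geodesic; the energy plays the role of a PINN loss."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    path = (1.0 - t) * p + t * q
    path = path + 0.3 * np.sin(np.pi * t) * np.array([0.0, 0.0, 1.0])  # bow the guess out of plane
    path /= np.linalg.norm(path, axis=1, keepdims=True)
    for _ in range(steps):
        grad = np.zeros_like(path)
        grad[1:-1] = 2.0 * (2.0 * path[1:-1] - path[:-2] - path[2:])   # d(energy)/d(point)
        path[1:-1] -= lr * grad[1:-1]                                  # gradient step (interior only)
        path /= np.linalg.norm(path, axis=1, keepdims=True)            # project back onto the sphere
    length = np.linalg.norm(np.diff(path, axis=0), axis=1).sum()
    return path, length

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
path, length = geodesic_on_sphere(p, q)   # length approaches pi/2, the great-circle distance
```

A PINN would replace the explicit list of points with a network mapping the parameter t to a point, trained on the same loss — gaining a mesh-free, differentiable representation at the cost of a harder optimisation.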

[LG-59] Data-Driven Hamiltonian Reduction for Superconducting Qubits via Meta-Learning

Link: https://arxiv.org/abs/2604.24912
Authors: Arielle Sanford, Andrew T. Kamen, Frederic T. Chong, Andy J. Goldschmidt
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)


Abstract:We introduce HAML (Hamiltonian Adaptation via Meta-Learning), a framework for fast online adaptation of effective Hamiltonian models of superconducting quantum processors. HAML proceeds in two phases. A supervised training phase uses an ensemble of simulated devices to learn an offline map from control inputs and device parameters to effective Hamiltonian coefficients. An online adaptation phase then uses a small number of hardware-accessible measurements to identify the unknown parameters of a new device. By training directly against effective two-qubit coefficients extracted from full multi-mode simulations, HAML implicitly learns the reduction from full multi-mode Hamiltonians to effective qubit descriptions without invoking perturbation theory. We further show that a variance-maximizing greedy selection of measurement configurations boosts online adaptation efficiency. We demonstrate HAML on a transmon-coupler-transmon system, recovering effective two-qubit coefficients across a wide range of operating regimes, including parameter regions where Schrieffer-Wolff perturbation theory (SWPT) breaks down. This establishes a scalable, sample-efficient approach to Hamiltonian reduction and characterization for near-term quantum processors, with direct implications for calibration, control, and error mitigation.
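HAML's two-phase structure — an offline surrogate fitted on simulated devices, then a few online measurements to pin down an unseen device's parameters — can be sketched with a toy one-parameter "device". The coefficient map `true_coeff`, the surrogate basis, and the grid-search identification are all invented stand-ins for the paper's learned map and measurement design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the physics: an effective coupling as a smooth function of a
# control amplitude u and a single device parameter theta (purely illustrative).
def true_coeff(u, theta):
    return theta * np.sin(u) + 0.1 * u

# --- offline phase: fit a surrogate map on an ensemble of simulated devices ---
U = rng.uniform(0.0, 2.0, 500)
TH = rng.uniform(0.5, 1.5, 500)
C = true_coeff(U, TH)
A = np.stack([TH * np.sin(U), U, np.ones_like(U)], axis=1)  # surrogate basis
w, *_ = np.linalg.lstsq(A, C, rcond=None)

def surrogate(u, theta):
    return w[0] * theta * np.sin(u) + w[1] * u + w[2]

# --- online phase: identify an unseen device's theta from three noisy measurements ---
theta_star = 1.23
u_meas = np.array([0.5, 1.0, 1.5])
c_meas = true_coeff(u_meas, theta_star) + 0.01 * rng.standard_normal(3)

grid = np.linspace(0.5, 1.5, 401)
errs = [np.sum((surrogate(u_meas, th) - c_meas) ** 2) for th in grid]
theta_hat = float(grid[int(np.argmin(errs))])
```

The paper's variance-maximising greedy choice of measurement configurations would replace the arbitrary `u_meas` above with controls at which the surrogate is most sensitive to the unknown parameters.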

[LG-60] Uncovering Exotic Paired States in the 2D Spin-Imbalanced Fermi Gas with Neural Wave Functions

Link: https://arxiv.org/abs/2604.24883
Authors: Wan Tong Lou, Gino Cassella, Andres Perez Fadon, Halvard Sutterud, David Pfau, James S. Spencer, Johannes Knolle, W.M.C. Foulkes
Subjects: Quantum Gases (cond-mat.quant-gas); Superconductivity (cond-mat.supr-con); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 23 pages, 17 figures


Abstract:We study the zero-temperature phase diagram of the 2D spin-imbalanced Fermi gas with short-ranged attractive interactions using the recently developed neural network variational Monte Carlo method with the AGPs FermiNet Ansatz. The Fulde-Ferrell-Larkin-Ovchinnikov phase is observed in the weakly interacting BCS limit and a polarised superfluid is seen in the strongly interacting BEC limit. When the interactions are strong, the minority-spin momentum density is reduced almost to zero in the momentum-space region occupied by the unpaired majority-spin electrons. When the interactions are very strong, phase separation occurs, with regions containing bosonic pairs and unpaired regions occupied by the remaining majority-spin particles. In addition, we observe translational symmetry breaking at intermediate interaction strengths, where the system forms an exotic crystal of Cooper pairs in a Fermi fluid of unpaired majority-spin particles. We provide a possible explanation for the formation of the crystalline phase, explain the origins of the k-space momentum-density hole when the pairs are tightly bound, and discuss how our approach opens new directions for future work.

[LG-61] Monitoring exposure-length variations in submarine power cables using distributed fiber-optic sensing

Link: https://arxiv.org/abs/2604.24880
Authors: Sakiko Mishima, Yoshiyuki Yajima, Noriyuki Tonami, Tomoyuki Hino, Shugo Aibe, Junichiro Saikawa, Koji Mizuguchi
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments: 11 pages, 5 figures, accepted in the IOP Journal of Physics: Conference Series and presented at the WindEurope Annual Event 2026


Abstract:This study proposes an anomaly-detection framework for monitoring exposure-length variations in submarine free-span cables using Distributed Acoustic Sensing (DAS), which is one of the distributed fiber-optic sensing technologies. To address environmental variability and limited training data in offshore environments, a regression-based feature extraction method was introduced to derive low-dimensional latent representations that retain exposure length-dependent vibration characteristics while suppressing environmental influences. The extracted features were used for one-class Support Vector Machine (SVM)-based anomaly detection. The proposed framework was evaluated through wave-tank experiments with exposure lengths ranging from 2 to 10 m. Experimental results showed that anomaly scores decreased approximately monotonically with increasing exposure-length change, exhibiting a strong correlation ( r = -0.83 ). The binary classification achieved an F1 score of 0.82 despite training with only small-sample datasets. These findings demonstrate that exposure-length variations can be reliably detected under severe data limitations, supporting the potential of DAS-based cable condition monitoring.
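The detection pipeline — compress each vibration trace to a low-dimensional feature vector, fit a one-class model on the normal condition, score new traces — can be sketched end to end. The length-to-frequency model is made up, and a Mahalanobis-distance score stands in for the paper's one-class SVM:

```python
import numpy as np

rng = np.random.default_rng(0)

def das_features(length, fs=100.0, n_feat=4):
    """Simulate one free-span vibration trace whose dominant frequency falls as
    exposure length grows, then compress its spectrum to a few features."""
    t = np.arange(256) / fs
    f0 = 20.0 / length                        # toy length-to-frequency model (assumption)
    sig = np.sin(2 * np.pi * f0 * t) + 0.3 * rng.standard_normal(t.size)
    spec = np.abs(np.fft.rfft(sig))
    # regression-style compression: project the spectrum onto a few cosine basis vectors
    basis = np.cos(np.outer(np.arange(n_feat), np.linspace(0.0, np.pi, spec.size)))
    return basis @ spec

train = np.stack([das_features(2.0) for _ in range(40)])    # baseline: 2 m exposure
mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train.T) + 1e-6 * np.eye(4))

def anomaly_score(x):
    """Mahalanobis distance to the normal-condition feature cloud
    (a stand-in for the paper's one-class SVM decision function)."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

normal_scores = [anomaly_score(das_features(2.0)) for _ in range(20)]
anom_scores = [anomaly_score(das_features(8.0)) for _ in range(20)]
```

Traces from the longer (anomalous) exposure score clearly higher than baseline traces, mirroring the monotone score-versus-length relationship the experiments report.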

[LG-62] A multi-stage soft computing framework for complex disease modelling and decision support: A liver cirrhosis case study

Link: https://arxiv.org/abs/2604.24796
Authors: Xueyuan Huang, Yuheng Wang, Yuanzhi He, Siqi Gou, Lu Bai, Wenqian Wu, Peifeng Liu, Aijia Wang, Tianhui Fan, Jiayu Xu
Subjects: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures


Abstract:Liver cirrhosis is a major global health problem causing millions of deaths annually, and timely detection with aggressive treatment can significantly improve patients’ quality of life. Modelling complex diseases from biomedical data is computationally challenging due to high dimensionality, strong feature correlations, noise, and limited labelled samples. Conventional Machine Learning (ML) pipelines often struggle with robustness, interpretability, and generalisation under such conditions. In this study, we propose an ML-driven multi-stage decision framework for complex disease modelling and therapeutic exploration. The framework integrates single-cell transcriptomic profiling, high-dimensional network-based feature stabilisation, multi-model learning, deep representation construction, and post-hoc decision support. Specifically, single-cell sequencing data were analysed to identify key cellular subpopulations, followed by high-dimensional weighted gene co-expression network analysis (hdWGCNA) to stabilise gene modules under sparsity and noise. To enhance non-linear feature interaction modelling, tabular molecular features were restructured into two-dimensional disease maps and analysed using a CNN. Finally, molecular docking was incorporated as a decision-support module to evaluate candidate therapeutic compounds. Using liver cirrhosis as a representative case, the framework identified a disease-associated endothelial subpopulation and extracted seven robust signature genes (HSPB1, GADD45A, CLDN5, ATP1B3, C1QBP, ENPP2, and PARL). The CNN-based representation learning module outperformed conventional pipelines in classification. The framework is disease-agnostic and readily extends to other omics-driven biomedical applications involving uncertainty, heterogeneity, and limited samples.
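The step of restructuring tabular molecular features into a two-dimensional "disease map" for a CNN can be sketched mechanically: pad the feature vector to a square, reshape, and apply a convolution. The 50-feature size, 8×8 map, and random kernel are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def to_disease_map(features, size=8):
    """Pad a 1-D feature vector with zeros and reshape it into a square 'disease map'."""
    padded = np.zeros(size * size)
    padded[: features.size] = features
    return padded.reshape(size, size)

def conv2d(img, kernel):
    """Plain valid-mode 2-D convolution (forward pass only)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(50)           # e.g. 50 molecular features per sample
img = to_disease_map(x)               # 8x8 map
feat = np.maximum(conv2d(img, rng.standard_normal((3, 3))), 0.0)  # one ReLU conv layer
```

The point of the transformation is that convolutional filters can then model local interactions between features placed in neighbouring cells — which only helps if the feature-to-pixel layout groups correlated features, a design choice the framework must make.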

[LG-63] Application of a Mixture of Experts-based Foundation Model to the GlueX DIRC Detector

Link: https://arxiv.org/abs/2604.24775
Authors: Cristiano Fanelli, James Giroux, Cole Granger, Justin Stevens
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex); Instrumentation and Detectors (physics.ins-det)
Comments: 18 pages, 10 figures


Abstract:We present a Mixture-of-Experts-based foundation model applied to the GlueX DIRC detector at Jefferson Lab, demonstrating its utility as a unified framework for fast simulation, particle identification, and hit-level noise filtering of Cherenkov photons. By leveraging a single shared transformer backbone across all tasks, the approach eliminates the fragmentation of task-specific pipelines while maintaining competitive-and in several cases superior-performance relative to established methods. The model operates directly on low-level detector inputs, performing hit-by-hit autoregressive generation over split spatial and temporal vocabularies with continuous kinematic conditioning, and supports class-conditional generation of pions and kaons through its Mixture-of-Experts architecture. We benchmark against the standard geometrical reconstruction and prior deep learning methods across the full kinematic phase space of the GlueX DIRC, demonstrating that the foundation model framework transfers effectively to this detector without architectural modification. This work positions the foundation model as a practical and scalable alternative to the suite of task-specific models currently proposed for GlueX DIRC analysis.
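The Mixture-of-Experts building block the model relies on — a router that sends each token to a few expert feed-forward networks and mixes their outputs — can be sketched as a forward pass. This is a generic top-k MoE layer with invented dimensions, not the paper's transformer backbone:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class MoELayer:
    """Minimal mixture-of-experts feed-forward layer with top-k gating (forward only)."""
    def __init__(self, d_model=16, d_hidden=32, n_experts=4, k=2, seed=0):
        r = np.random.default_rng(seed)
        self.k = k
        self.gate = 0.1 * r.standard_normal((d_model, n_experts))       # router weights
        self.W1 = 0.1 * r.standard_normal((n_experts, d_model, d_hidden))
        self.W2 = 0.1 * r.standard_normal((n_experts, d_hidden, d_model))

    def __call__(self, x):                           # x: (tokens, d_model)
        logits = x @ self.gate                       # router scores, (tokens, n_experts)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            top = np.argsort(logits[t])[-self.k:]            # top-k experts for this token
            w = softmax(logits[t, top])                      # renormalised gate weights
            for weight, e in zip(w, top):
                h = np.maximum(x[t] @ self.W1[e], 0.0)       # expert FFN with ReLU
                out[t] += weight * (h @ self.W2[e])
        return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
y = MoELayer()(x)                                    # (8, 16), two experts per token
```

Conditioning the router on a task or particle-class embedding — as the paper's class-conditional generation does — would steer tokens toward experts specialised for that class while sharing the rest of the backbone.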

Attachment download

Click to download the full list of today's papers