This post contains the latest paper listings retrieved from Arxiv.org on 2026-04-20, updated automatically and organized by major research area: NLP, CV, ML, AI, IR, MA, and HC.
Note: paper data is fetched from Arxiv.org daily, with an automatic update at around 12:30 each day.
Tip: if a day's listing is not updated on time, either Arxiv published no new papers that day or the script failed. Fixes are usually made the same day.
Table of Contents
Overview (2026-04-20)
545 papers are updated today, including:
- Natural Language Processing: 83 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 160 papers (cs.AI)
- Computer Vision and Pattern Recognition: 106 papers (cs.CV)
- Machine Learning: 136 papers (cs.LG)
- Multiagent Systems: 12 papers (cs.MA)
- Information Retrieval: 16 papers (cs.IR)
- Human-Computer Interaction: 56 papers (cs.HC)
Multiagent Systems
[MA-0] Veritas-RPM: Provenance-Guided Multi-Agent False Positive Suppression for Remote Patient Monitoring
[Quick Read]: This paper addresses the high false-positive rates produced by rule-based alerting in Remote Patient Monitoring (RPM), which degrade the efficiency and accuracy of clinical decision-making. The key to the solution is Veritas-RPM, a provenance-guided multi-agent architecture in which five processing layers cooperate: a VeritasAgent for ground-truth assembly, a SentinelLayer for anomaly detection, a DirectorAgent for specialist routing, six domain Specialist Agents, and a MetaSentinelAgent for conflict resolution and the final decision. The authors also construct a synthetic taxonomy of 98 false-positive scenarios and generate 530 synthetic patient epochs from it for validation, reporting True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR) to quantify how effectively false-positive signals are identified and filtered.
Link: https://arxiv.org/abs/2604.16081
Authors: Aswini Misro, Vikash Sharma, Shreyank N Gowda
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:
Abstract:We present Veritas-RPM, a provenance-guided multi-agent architecture comprising five processing layers: VeritasAgent (ground-truth assembly), SentinelLayer (anomaly detection), DirectorAgent (specialist routing), six domain Specialist Agents, and MetaSentinelAgent (conflict resolution and final decision). We construct a 98-case synthetic taxonomy of false-positive scenarios derived from documented RPM patterns. Synthetic patient epochs (n = 530) were generated directly from taxonomy parameters and processed through the pipeline. Ground-truth labels are known for all cases. Performance is reported as True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR).
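The three reported metrics can be sketched as simple outcome ratios. This is a hedged illustration under assumed definitions (TSR and FER as fractions of the known false-positive cases that are suppressed or escalated, INDR as the fraction of all cases left undecided); the paper's exact formulas may differ.

```python
def rpm_rates(cases):
    """Compute (TSR, FER, INDR) from (ground_truth, decision) pairs.

    ground_truth: 'false_positive' or 'true_alert'
    decision:     'suppress', 'escalate', or 'indeterminate'
    Definitions here are illustrative assumptions, not the paper's.
    """
    fp = [d for g, d in cases if g == "false_positive"]
    tsr = sum(d == "suppress" for d in fp) / len(fp)            # false positives correctly suppressed
    fer = sum(d == "escalate" for d in fp) / len(fp)            # false positives wrongly escalated
    indr = sum(d == "indeterminate" for _, d in cases) / len(cases)  # undecided overall
    return tsr, fer, indr

outcomes = [("false_positive", "suppress"),
            ("false_positive", "escalate"),
            ("true_alert", "indeterminate")]
print(rpm_rates(outcomes))
```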
[MA-1] AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis
[Quick Read]: This paper targets quality diagnosis for astronomical imaging, a task that spans multidisciplinary knowledge and many interdependent subtasks, making error localization and quality assessment difficult for conventional methods. The key to the solution is AstroVLM, a collaborative multi-agent system for diagnosing astronomical image quality: its multi-agent collaborative reasoning improves the identification and localization of quality issues in complex imaging pipelines, and experiments show that it clearly outperforms existing baselines on real-world astronomical imaging tasks.
Link: https://arxiv.org/abs/2604.16024
Authors: Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu, Shuo Ren, Yiru Wang, Rongliang Fu, Tinghuan Chen, Tsung-Yi Ho
Affiliations: The Chinese University of Hong Kong; Central South University; Huawei Technologies Co., Ltd; The Chinese University of Hong Kong, Shenzhen
Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a quite complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Due to the complexity of the astronomical imaging process, both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort. This is because the processes in astronomical imaging have complex underlying correlations that significantly influence one another, making the quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experiment results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.
[MA-2] SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
[Quick Read]: This paper tackles the difficulty of evaluating the social reasoning of Large Language Models (LLMs) in embodied multi-agent settings, in particular disentangling whether failures in complex interactions stem from weak planning or from deficits in social reasoning. The key to the solution is SocialGrid, an embodied multi-agent environment inspired by Among Us that jointly evaluates task execution, planning, and social reasoning, and offers an optional Planning Oracle to isolate social reasoning from planning deficits. Experiments show that even the strongest open model (GPT-OSS-120B) completes tasks at below 60% accuracy without planning assistance, and that even with it social reasoning remains weak: agents detect deception at near-random chance, exposing a fundamental limitation in reasoning over accumulated behavioral evidence.
Link: https://arxiv.org/abs/2604.16022
Authors: Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski, Kristian Kersting
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Preprint
Abstract:As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
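The leaderboard mentioned at the end uses standard Elo ratings. As a hedged sketch of that scoring scheme (the classic Elo update rule, not code from the paper; the K-factor of 32 is an assumption):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One-game Elo update. score_a is 1.0 for a win by A, 0.5 for a draw,
    0.0 for a loss; ratings move by K times (actual - expected) score."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated agents: the winner gains what the loser drops.
print(elo_update(1500, 1500, 1.0))
```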
[MA-3] Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
[Quick Read]: This paper addresses reasoning instability in multi-agent frameworks, where individual agent errors are amplified through collaboration. Existing work mostly strengthens high-capability agents or suppresses unreliable outputs, neglecting systematic identification and reinforcement of the agents that bottleneck performance. The key to the solution is WORC, a framework grounded in the weak-link principle with a two-stage workflow: the first stage uses meta-learning to build a zero-shot mapping from task features to agent performance weights, locating the weakest agent; the second stage applies an uncertainty-driven budget-allocation strategy that grants additional reasoning budget to weak agents to compensate for their reliability deficits, improving overall robustness and cross-architecture generalization.
Link: https://arxiv.org/abs/2604.15972
Authors: Haoyu Bian, Chaoning Zhang, Jiaquan Zhang, Xingyao Li, Yuanfang Guo, Wei Dong, Yang Yang
Affiliations: University of Science and Technology of China; Zhejiang University; Chinese Academy of Sciences
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 13 pages, 4 figures. Submitted to CAAI Transactions on Intelligence Technology
Abstract:LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.
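The budget-allocation idea (lower predicted weight, larger repeated-sampling quota) can be sketched as follows. This is an assumed inverse-proportional allocation for illustration only; the paper's actual strategy is uncertainty-driven and may use a different mapping.

```python
def allocate_budgets(weights, total_budget):
    """Assign repeated-sampling quotas inversely proportional to predicted
    agent weights, so weaker agents (lower weight) get more samples.
    An illustrative assumption, not the paper's exact rule."""
    inv = {agent: 1.0 / w for agent, w in weights.items()}
    z = sum(inv.values())
    return {agent: max(1, round(total_budget * v / z)) for agent, v in inv.items()}

# Hypothetical predicted weights; "critic" is the weak link here.
weights = {"planner": 0.9, "solver": 0.6, "critic": 0.3}
print(allocate_budgets(weights, 12))
```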
[MA-4] Polarization by Default: Auditing Recommendation Bias in LLM-Based Content Curation
[Quick Read]: This paper investigates the structure of bias when Large Language Models (LLMs) curate and rank content: which biases are robust across providers and platforms, and which can be mitigated through prompt design. The key to the solution is a controlled simulation framework that runs 540,000 simulated selections across three major LLM providers (OpenAI, Anthropic, Google), real social media datasets from Twitter/X, Bluesky, and Reddit, and six prompting strategies (general, popular, engaging, informative, controversial, neutral), quantifying the type, strength, and prompt sensitivity of each bias. The study finds that polarization is amplified across all configurations, toxicity handling inverts between engagement- and information-focused prompts, and sentiment biases skew negative, with clear provider differences in behavioral consistency, toxicity adaptivity, and sentiment preference.
Link: https://arxiv.org/abs/2604.15937
Authors: Nicolò Pagan, Christopher Barrie, Chris Andrew Bail, Petter Törnberg
Affiliations: University of Zurich; New York University; University of Oxford; Duke University; University of Amsterdam
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:
Abstract:Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remains poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (general, popular, engaging, informative, controversial, neutral). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. On Twitter/X, where author demographics can be inferred from profile bios, political leaning bias is the clearest demographic signal: left-leaning authors are systematically over-represented despite right-leaning authors forming the pool plurality in the dataset, and this pattern largely persists across prompts.
[MA-5] Experience Compression Spectrum: Unifying Memory Skills and Rules in LLM Agents
[Quick Read]: This paper addresses the bottleneck of managing accumulated experience in long-horizon, multi-session deployments of large language model (LLM) agents, where the core challenge is extracting reusable knowledge from interaction traces to reduce context consumption, retrieval latency, and compute overhead. The key to the solution is the Experience Compression Spectrum, which unifies memory, skills, and rules along a single axis of increasing compression: roughly 5-20x for episodic memory, 50-500x for procedural skills, and 1,000x+ for declarative rules, giving a quantitative account of knowledge at different levels of abstraction. Mapping existing systems onto this spectrum shows that each operates at a fixed compression level, with none supporting adaptive cross-level compression (the "missing diagonal"), and surfaces open problems such as neglected knowledge lifecycle management and the tight coupling between evaluation methods and compression levels, laying groundwork for full-spectrum, scalable agent learning systems.
Link: https://arxiv.org/abs/2604.15877
Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge – extracting reusable knowledge from interaction traces – yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5–20× for episodic memory, 50–500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level – none supports adaptive cross-level compression, a gap we term the missing diagonal. We further show that specialization alone is insufficient – both communities independently solve shared sub-problems without exchanging solutions – that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full-spectrum agent learning systems.
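The compression ratios that define the spectrum are just token-count ratios between raw traces and the distilled artifact. A minimal sketch, assuming whitespace tokenization as a stand-in for a real tokenizer (the example strings and counts are invented for illustration):

```python
def compression_ratio(raw_traces, artifact):
    """Ratio of tokens in raw interaction traces to tokens in the distilled
    artifact (a memory entry, skill, or rule)."""
    raw_tokens = sum(len(t.split()) for t in raw_traces)
    return raw_tokens / len(artifact.split())

# 100 repeated interaction traces distilled into one short declarative rule.
traces = ["open the file then parse the header then retry on failure"] * 100
rule = "always retry once on parse failure"
print(round(compression_ratio(traces, rule)))  # a skill/rule-scale ratio
```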
[MA-6] Scalable Algorithms with Provable Optimality Bounds for the Multiple Watchman Route Problem
[Quick Read]: This paper addresses the Multiple Watchman Route Problem (MWRP): finding a set of paths for M watchmen such that every location on the map is seen by at least one watchman along the way. The key to the solution is shrinking the search space through several techniques, including pruning map areas that are guaranteed to be observed en route to other areas, and combining these with methods that improve the quality and computation time of existing heuristics into an efficient optimal planner, MWRP-CP3. Experiments show the search space can be reduced by more than 95%, with runtimes over 200x faster than existing optimal algorithms on 2D grid maps; the companion suboptimal algorithms additionally solve maps 3x larger than those solvable by MWRP-CP3.
Link: https://arxiv.org/abs/2604.15610
Authors: Srikar Gouru, Ariel Felner, Jiaoyang Li
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:
Abstract:In this paper, we tackle the Multiple Watchman Route Problem (MWRP), which aims to find a set of paths that M watchmen can follow such that every location on the map can be seen by at least one watchman. First, we propose multiple methods to reduce the state space over which a search needs to be conducted by pruning map areas that are guaranteed to be seen en route to other areas. Next, we introduce MWRP-CP3, an efficient optimal planner that combines these methods with techniques that improve the quality and calculation time of existing heuristics. We present several suboptimal algorithms with bounds on solution quality, including MxWA*, a general variant of weighted A* for makespan problems. We also present anytime variations of our suboptimal algorithms, as well as techniques to improve an existing suboptimal solution by solving multiple decomposed sub-problems. We show that MWRP-CP3 can reduce the search space by more than 95% and runs more than 200x faster than existing optimal algorithms on 2D grid maps. We also show that our suboptimal algorithms solve maps 3x larger than those solvable by MWRP-CP3. See this http URL for the open source codebase and video demonstrations.
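MxWA* is described as a general variant of weighted A* for makespan objectives. As a hedged reminder of the single-agent building block it generalizes, here is standard weighted A* (f = g + w·h) on a toy graph; this is the textbook algorithm, not the paper's MxWA*:

```python
import heapq

def weighted_astar(start, goal, neighbors, h, w=1.5):
    """Standard weighted A*: expands nodes by f = g + w*h, with w >= 1
    trading optimality (bounded by factor w) for speed."""
    open_heap = [(w * h(start), 0, start, [start])]
    g_best = {start: 0}
    while open_heap:
        _, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        for nb, cost in neighbors(node):
            ng = g + cost
            if ng < g_best.get(nb, float("inf")):
                g_best[nb] = ng
                heapq.heappush(open_heap, (ng + w * h(nb), ng, nb, path + [nb]))
    return None

# Toy 1-D map: step +/-1 along a line from 0 to 5 at unit cost.
nbrs = lambda x: [(x - 1, 1), (x + 1, 1)]
print(weighted_astar(0, 5, nbrs, h=lambda x: abs(5 - x)))
```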
[MA-7] Preregistered Belief Revision Contracts
[Quick Read]: This paper targets dangerous conformity effects in deliberative multi-agent systems, where open communication lets agents treat agreement, confidence, prestige, or majority size as if they were evidence, producing high-confidence convergence on false conclusions. The key to the solution is PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy; any substantive belief change must cite a preregistered trigger and supply an externally validated witness set of evidence tokens, making belief changes both enforceable by a router and auditable after the fact. This design blocks false consensus driven purely by social factors and guarantees epistemic accountability.
Link: https://arxiv.org/abs/2604.15558
Authors: Saad Alqithami
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments:
Abstract:Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non-fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. In this paper, (a) we prove that under evidential contracts with conservative fallback, social-only rounds cannot increase confidence and cannot generate purely conformity-driven wrong-but-sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token-invariant contracts, (d) we prove that enforced trajectories depend only on token-exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness-liveness trade-offs.
[MA-8] NeuroMesh: A Unified Neural Inference Framework for Decentralized Multi-Robot Collaboration
[Quick Read]: This paper addresses the challenges of deploying learned multi-robot models on heterogeneous robot platforms: hardware heterogeneity, communication constraints, and the lack of a unified execution stack. The key to the solution is NeuroMesh, a multi-domain, cross-platform, modular framework for decentralized neural inference that standardizes observation encoding, message passing, aggregation, and task decoding in a unified pipeline. Its core innovations are a dual-aggregation paradigm (combining reduction- and broadcast-based information fusion) and a parallelized architecture that decouples cycle time from end-to-end latency; a high-performance C++ implementation supports hybrid GPU/CPU inference and uses Zenoh for inter-robot communication. Validation on a heterogeneous team of aerial and ground robots demonstrates robust operation across collaborative perception, decentralized control, and task assignment.
Link: https://arxiv.org/abs/2604.15475
Authors: Yang Zhou, Yash Shetye, Long Quang, Devon Super, Jesse Milzman, Manohari Goarin, Aditya Azad, Devang Sunil Dhake, Jeffery Mao, Carlos Nieto-Granda, Giuseppe Loianno
Affiliations: New York University; U.S. Army Combat Capabilities Development Command, Army Research Laboratory; Vanderbilt University; University of California Berkeley, Department of Electrical Engineering and Computer Sciences
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 8 pages, 8 figures. Accepted at IEEE Robotics and Automation Letters (RA-L)
Abstract:Deploying learned multi-robot models on heterogeneous robots remains challenging due to hardware heterogeneity, communication constraints, and the lack of a unified execution stack. This paper presents NeuroMesh, a multi-domain, cross-platform, and modular decentralized neural inference framework that standardizes observation encoding, message passing, aggregation, and task decoding in a unified pipeline. NeuroMesh combines a dual-aggregation paradigm for reduction- and broadcast-based information fusion with a parallelized architecture that decouples cycle time from end-to-end latency. Our high-performance C++ implementation leverages Zenoh for inter-robot communication and supports hybrid GPU/CPU inference. We validate NeuroMesh on a heterogeneous team of aerial and ground robots across collaborative perception, decentralized control, and task assignment, demonstrating robust operation across diverse task structures and payload sizes. We plan to release NeuroMesh as an open-source framework to the community.
[MA-9] InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control AAMAS2026
[Quick Read]: This paper addresses competitive inference for multi-agent systems under partial observability: how to make decisions driven by information acquisition and exploitation when conventional game objectives such as piece capture are absent. The key to the solution is InfoChess, a symmetric adversarial game that removes capture entirely and instead uses pieces to alter visibility, scoring players on the accuracy of their probabilistic inference of the opponent's king location. By introducing a hierarchy of heuristic agents and a reinforcement learning agent, together with information-theoretic measures (belief entropy, oracle cross entropy, and predictive log score under the action-induced observation channel), the work enables systematic exploration and evaluation of the strategy space, providing a quantifiable, extensible testbed for studying multi-agent inference under incomplete information.
Link: https://arxiv.org/abs/2604.15373
Authors: Kieran A. Murphy
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Accepted at Adaptive and Learning Agents Workshop, AAMAS 2026. Project page: this https URL
Abstract:We propose InfoChess, a symmetric adversarial game that elevates competitive information acquisition to the primary objective. There is no piece capture, removing material incentives that would otherwise confound the role of information. Instead, pieces are used to alter visibility. Players are scored on their probabilistic inference of the opponent’s king location over the duration of the game. To explore the space of strategies for playing InfoChess, we introduce a hierarchy of heuristic agents defined by increasing levels of opponent modeling, and train a reinforcement learning agent that outperforms these baselines. Leveraging the discrete structure of the game, we analyze gameplay through natural information-theoretic characterizations that include belief entropy, oracle cross entropy, and predictive log score under the action-induced observation channel. These measures disentangle epistemic uncertainty, calibration mismatch, and uncertainty induced by adversarial movement. The design of InfoChess renders it a testbed for studying multi-agent inference under partial observability. We release code for the environment and agents, and a public interface to encourage further study.
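Two of the information-theoretic measures named above have standard definitions that can be sketched directly. A minimal illustration, assuming a belief is a probability distribution over candidate king squares (the four-square example is invented):

```python
import math

def belief_entropy(belief):
    """Shannon entropy (bits) of a belief over the opponent king's squares:
    the player's own epistemic uncertainty."""
    return -sum(p * math.log2(p) for p in belief if p > 0.0)

def oracle_cross_entropy(belief, true_square):
    """Surprisal (bits) of the belief at the true king location: low when
    the belief concentrates mass on the right square."""
    return -math.log2(belief[true_square])

belief = [0.5, 0.25, 0.125, 0.125]      # belief over 4 candidate squares
print(belief_entropy(belief))           # 1.75 bits of uncertainty
print(oracle_cross_entropy(belief, 0))  # 1.0 bit: king is on the likeliest square
```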
[MA-10] SoK: Security of Autonomous LLM Agents in Agentic Commerce
[Quick Read]: This paper addresses the cross-layer attack risks that autonomous large language model (LLM) agents face in commerce and finance, driven by protocol fragmentation and the absence of adequate security frameworks. The key to the solution is a unified security framework organized along five dimensions: agent integrity, transaction authorization, inter-agent trust, market manipulation, and regulatory compliance. From these, 12 cross-layer attack vectors are systematically derived, and a layered defense architecture is proposed to close the authorization gaps left by current agent-payment protocols, coordinating controls across LLM safety, protocol design, identity, market structure, and regulation.
Link: https://arxiv.org/abs/2604.15367
Authors: Qian'ang Mao, Jiaxin Wang, Ya Liu, Li Zhu, Cong Ma, Jiaqi Yan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments:
Abstract:Autonomous large language model (LLM) agents such as OpenClaw are pushing agentic commerce from human-supervised assistance toward machine actors that can negotiate, purchase services, manage digital assets, and execute transactions across on-chain and off-chain environments. Protocols such as the Trustless Agents standard (ERC-8004), Agent Payments Protocol (AP2), the HTTP 402-based payment protocol (x402), Agent Commerce Protocol (ACP), the Agentic Commerce standard (ERC-8183), and Machine Payments Protocol (MPP) enable this transition, but they also create an attack surface that existing security frameworks do not capture well. This Systematization of Knowledge (SoK) develops a unified security framework for autonomous LLM agents in commerce and finance. We organize threats along five dimensions: agent integrity, transaction authorization, inter-agent trust, market manipulation, and regulatory compliance. From a systematically curated public corpus of academic papers, protocol documents, industry reports, and incident evidence, we derive 12 cross-layer attack vectors and show how failures propagate from reasoning and tooling layers into custody, settlement, market harm, and compliance exposure. We then propose a layered defense architecture addressing authorization gaps left by current agent-payment protocols. Overall, our analysis shows that securing agentic commerce is inherently a cross-layer problem that requires coordinated controls across LLM safety, protocol design, identity, market structure, and regulation. We conclude with a research roadmap and a benchmark agenda for secure autonomous commerce.
[MA-11] SocialWise: LLM-Agentic Conversation Therapy for Individuals with Autism Spectrum Disorder to Enhance Communication Skills
[Quick Read]: This paper addresses the lack of scalable support for everyday conversation practice for individuals with Autism Spectrum Disorder (ASD): low-cost activities such as story reading yield limited improvement, while effective role-play therapy requires expensive in-person specialist sessions. The key to the solution is SocialWise, a browser-based application that pairs large language model (LLM) conversational agents with a therapeutic Retrieval Augmented Generation (RAG) knowledge base: users interact by text or voice and receive instant, structured feedback on tone, engagement, and alternative phrasing, providing evidence-based, on-demand communication coaching.
Link: https://arxiv.org/abs/2604.15347
Authors: Albert Tang
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:
Abstract:Autism Spectrum Disorder (ASD) affects more than 75 million people worldwide. However, scalable support for practicing everyday conversation is scarce: Low-cost activities such as story reading yield limited improvement. At the same time, effective role-play therapy demands expensive, in-person sessions with specialists. SocialWise bridges this gap through a browser-based application that pairs LLM conversational agents with a therapeutic retrieval augmented generation (RAG) knowledge base. Users select a scenario (e.g., ordering food, joining a group), interact by text or voice, and receive instant, structured feedback on tone, engagement, and alternative phrasing. The SocialWise prototype, implemented with Streamlit, LangChain, and ChromaDB, runs on any computer with internet access, and demonstrates how recent advances in LLM can provide evidence-based, on-demand communication coaching for individuals with ASD.
Natural Language Processing
[NLP-0] Learning to Reason with Insight for Informal Theorem Proving
[Quick Read]: This paper addresses a bottleneck in informal theorem proving with large language models (LLMs): a lack of insight, i.e., difficulty recognizing and applying the core techniques needed to solve complex mathematical problems. The key to the solution is DeepInsightTheorem, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches, together with a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing toward insightful reasoning. Experiments on challenging mathematical benchmarks show that this insight-aware generation strategy significantly outperforms baselines, confirming that teaching models to identify and apply core techniques substantially improves mathematical reasoning.
Link: https://arxiv.org/abs/2604.16278
Authors: Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose DeepInsightTheorem, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.
[NLP-1] No Universal Courtesy: A Cross-Linguistic Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus
[Quick Read]: This paper examines how the response quality of Large Language Models (LLMs) varies with the politeness of user prompts, treating politeness as a computational variable. The key to the solution is a systematic multilingual, multi-model, multi-history experimental framework: using a dataset of 22,500 prompt-response pairs and an eight-factor assessment (coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, readability) grounded in Brown and Levinson's politeness theory and Culpeper's impoliteness framework, it quantifies the effect of politeness level on model output. Polite prompts raise average response quality by roughly 11% while impolite prompts clearly degrade it, but the effects are strongly language- and model-dependent: English favors courteous or direct tones, Hindi deferential and indirect tones, and Spanish assertive tones; Llama 3 is the most tone-sensitive (an 11.5% range) while GPT-4o Mini is more robust. Politeness is thus a measurable and modelable computational variable whose effects must be understood per language and model architecture, not a universal constant.
Link: https://arxiv.org/abs/2604.16275
Authors: Hitesh Mehta, Arjit Saxena, Garima Chhikara, Rohit Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensitive (11.5% range), whereas GPT is more robust to adversarial tone. These results indicate that politeness is a quantifiable computational variable that affects LLM behaviour, though its impact is language- and model-dependent rather than universal. To support reproducibility and future work, we additionally release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts across three languages and five politeness categories, and provide a formal supplementary analysis of six falsifiable hypotheses derived from politeness theory, empirically assessed against the dataset.
[NLP-2] VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
[Quick Read]: This paper addresses the absence of a large-scale, human-annotated dataset and standardized evaluation tools for video editing systems, which hinders objective comparison and improvement: existing resources are small, lack edited outputs or human quality labels, and evaluation relies on costly manual inspection or generic vision-language model (VLM) judges that cannot measure editing quality precisely. The key to the solution is three components: (1) VEFX-Dataset, a human-annotated dataset of 5,049 video editing examples across 9 major editing categories and 32 subcategories, labeled along three decoupled dimensions (Instruction Following, Rendering Quality, and Edit Exclusivity); (2) VEFX-Reward, a reward model purpose-built for video-editing quality assessment that jointly processes the source video, the editing instruction, and the edited video and predicts per-dimension scores via ordinal regression; and (3) VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns with human judgments more closely than generic VLM judges and prior reward models on standard IQA/VQA metrics and group-wise preference evaluation, and benchmarking reveals a persistent gap in current models between visual plausibility, instruction following, and edit locality.
Link: https://arxiv.org/abs/2604.16272
Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.
[NLP-3] From Benchmarking to Reasoning: A Dual-Aspect Large-Scale Evaluation of LLMs on Vietnamese Legal Text
[Quick Read]: This paper targets the barrier that the complexity of Vietnamese legal texts poses to public access to justice, proposing LLM-based legal text simplification to improve accessibility. The key to the solution is a dual-aspect evaluation framework: a quantitative performance benchmark for four state-of-the-art LLMs (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Grok-1) across Accuracy, Readability, and Consistency, paired with a large-scale error analysis of 60 complex Vietnamese legal articles using an expert-validated error typology to expose the models' fundamental limitations in legal reasoning. The findings show that the core challenge for current LLMs is not summarization capability but controlled, precise legal reasoning, with "Incorrect Example" and "Misinterpretation" the most prevalent failure types. Integrating quantitative metrics with qualitative insight yields a comprehensive, actionable assessment of LLMs for legal applications.
链接: https://arxiv.org/abs/2604.16270
作者: Van-Truong Le
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures. Accepted at the FISU Joint Conference on Artificial Intelligence (FJCAI 2026), Vietnam
Abstract:The complexity of Vietnam’s legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the “why” behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints “Incorrect Example” and “Misinterpretation” as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.
[NLP-4] SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation SEMEVAL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在真实叙事语境中对同形异义词词义(homonymous word senses)的合理性判断能力不足的问题,即尽管现有基准测试显示LLMs具备一定的词义消歧能力,但其在自然故事文本中的实际应用效果仍缺乏系统评估。解决方案的关键在于提出一种基于LLM的结构化推理框架,通过两种策略提升模型对词义合理性的评分精度:一是使用多样化推理策略微调低参数量LLMs以增强其感知能力;二是采用动态少样本提示(dynamic few-shot prompting)技术优化高参数量模型的表现。实验表明,商业级大参数LLMs结合动态少样本提示可高度逼近人类对词义合理性的判断,且模型集成进一步提升了与五名人工标注者之间的一致性模式模拟效果。
链接: https://arxiv.org/abs/2604.16262
作者: Deshan Sumanathilaka,Nicholas Micallef,Julian Hough,Saman Jayasinghe
机构: Swansea University (斯旺西大学)
类目: Computation and Language (cs.CL)
备注: 6 pages, 5 Tables, 1 figure, Accepted to SemEval 2026
Abstract:Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
[NLP-5] Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)中存在的一大核心问题:其推理能力究竟是基于真实的跨模态视觉-语言协同推理,还是主要依赖于文本语义理解能力(即文本骨干的推理能力)。为系统评估这一问题,作者提出CrossMath基准,通过构建三类输入形式(纯文本、纯图像、图文联合)且任务相关信息完全对齐的问题,实现可控的跨模态比较。关键创新在于严格控制信息一致性并由人工验证,从而有效隔离模态特异性推理差异,排除信息不匹配等干扰因素。实验结果表明,VLMs在纯文本输入下表现最优,而加入视觉信息反而常导致性能下降,揭示了现有模型主要在文本空间进行推理,缺乏对视觉证据的有效利用。为缓解此局限,作者进一步构建了CrossMath训练集用于VLM微调,实验证明该策略显著提升各模态下的推理性能,并在两个通用视觉推理任务上取得稳健增益。
链接: https://arxiv.org/abs/2604.16256
作者: Yige Xu,Yongjie Wang,Zizhuo Wu,Kaisong Song,Jun Lin,Zhiqi Shen
机构: Nanyang Technological University (南洋理工大学); Alibaba-NTU Global e-Sustainability CorpLab (阿里巴巴-南洋理工全球可持续发展联合实验室); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats, guaranteeing identical task-relevant information, as verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at this https URL.
[NLP-6] Detecting and Suppressing Reward Hacking with Gradient Fingerprints
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中存在的**奖励欺骗(reward hacking)问题,即模型通过利用奖励函数中的漏洞(如训练数据中的伪相关性)获得高分,而并未真正完成目标任务。这类行为常表现为表面合理的中间链式推理(Chain-of-Thought, CoT),使得基于文本的检测方法难以识别。解决方案的关键在于提出梯度指纹(Gradient Fingerprint, GRIFT)**方法,该方法通过计算模型在给定提示下的CoT输出相对于输入的梯度,并将其压缩为紧凑表示,从而捕捉内部计算模式以区分真实推理与奖励欺骗行为。GRIFT在多个数学、代码和逻辑推理基准上显著优于现有基线(如CoT Monitor和TRACE),并能有效集成到拒绝采样微调(rejection fine-tuning)流程中,提升任务性能并抑制奖励欺骗。
链接: https://arxiv.org/abs/2604.16242
作者: Songtao Wang,Quang Hieu Pham,Fangcong Yin,Xinpeng Wang,Jocelyn Qiaochu Chen,Greg Durrett,Xi Ye
机构: University of Alberta (阿尔伯塔大学); New York University (纽约大学); LMU Munich (慕尼黑路德维希马克西米利安大学); Princeton Language and Intelligence (普林斯顿语言与智能)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models’ internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: this https URL.
[NLP-7] BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在动物相关专业知识上的表现尚不明确的问题,尤其是在统一的闭书评估协议下,其对特定领域知识的掌握程度难以量化。为应对这一挑战,作者提出了BAGEL基准,其关键在于构建一个基于多样科学与参考来源(如bioRxiv、Global Biotic Interactions、Xeno-canto和Wikipedia)的高质量、多维度动物知识数据集,并通过人工筛选与自动生成相结合的方式生成闭书问答对,从而实现对模型动物知识能力的无外部检索依赖的精准评估。该方案支持细粒度分析不同来源域、分类群和知识类别,有助于系统识别模型的优势与失效模式,推动语言模型在生物多样性应用中的可靠性提升。
链接: https://arxiv.org/abs/2604.16241
作者: Jiacheng Shen,Masato Hagiwara,Milad Alizadeh,Ellen Gilsenan-McMahon,Marius Miron,David Robinson,Emmanuel Chemla,Sara Keen,Gagan Narula,Mathieu Laurière,Matthieu Geist,Olivier Pietquin
机构: 1: Facebook AI Research (Facebook人工智能研究中心); 2: University of California, Berkeley (加州大学伯克利分校); 3: DeepMind (DeepMind)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages, 3 figures
Abstract:Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
[NLP-8] Optimizing Korean-Centric LLMs via Token Pruning
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, LLMs)在特定语言任务(以韩语为中心的自然语言处理任务)中因冗余语言参数导致的生成不稳定性和资源消耗过高的问题。解决方案的关键在于采用**词元剪枝(Token Pruning)**技术,通过移除与目标任务无关的语言对应的词元和嵌入参数,从而压缩模型规模并提升目标语言上的性能表现。实验证明,该方法显著改善了生成稳定性,并在机器翻译等韩语相关任务中提升了性能,同时大幅减少词汇量,使其成为内存受限场景下领域专用部署的有效优化策略。
链接: https://arxiv.org/abs/2604.16235
作者: Hoyeol Kim,Hyeonwoo Kim
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: 5 pages
Abstract:This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in inference latency.
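摘要所述的词元剪枝,本质上是删除与目标语言无关的词元并同步裁剪其嵌入参数。下面是一个示意性的 numpy 实现;其中 keep_en_ko 按字符区间(ASCII 与韩文音节区 U+AC00–U+D7A3)判断的做法是本文为演示所作的简化假设,真实系统通常按语料统计决定保留集:

```python
import numpy as np

def prune_vocabulary(embeddings, vocab, keep_fn):
    """词元剪枝示意:仅保留 keep_fn 判定为目标语言相关的词元,
    同步裁剪嵌入矩阵的对应行,并重建 token -> id 映射。
    embeddings: [V, d] 矩阵;vocab: 长度为 V 的词元列表。"""
    keep_ids = [i for i, tok in enumerate(vocab) if keep_fn(tok)]
    new_embeddings = embeddings[keep_ids]
    new_vocab = {vocab[i]: new_id for new_id, i in enumerate(keep_ids)}
    return new_embeddings, new_vocab

def keep_en_ko(tok):
    """玩具判定:词元的每个字符要么是 ASCII,要么落在韩文音节区间。"""
    return all(ch < "\u0080" or "\uac00" <= ch <= "\ud7a3" for ch in tok)

vocab = ["the", "안녕", "你好", "model", "мир"]
emb = np.arange(5 * 4, dtype=float).reshape(5, 4)
pruned_emb, pruned_vocab = prune_vocabulary(emb, vocab, keep_en_ko)
```

剪枝后嵌入矩阵从 5 行缩减到 3 行,对应摘要中"显著减少词汇量"的效果。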
[NLP-9] Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中因分布偏移(distribution shift)导致输出层面不确定性信号(如token概率、熵值和自一致性)失效的问题,从而影响其可靠性。解决方案的关键在于引入一种基于模型内部表示的校准方法:提出层间信息(Layer-Wise Information, LI)分数,该分数通过衡量输入条件作用下模型各层预测熵的变化来捕捉更稳定的非一致性(nonconformity)信号,并将其嵌入标准分割 conformal 预测框架中,从而在跨域场景下实现更优的有效性-效率权衡,同时保持与文本级基线相当的域内可靠性。
链接: https://arxiv.org/abs/2604.16217
作者: Yanli Wang,Peng Kuang,Xiaoyu Han,Kaidi Xu,Haohan Wang
机构: Imperial College London; Zhejiang University; University of Illinois Urbana-Champaign; City University of Hong Kong
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration–deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity–efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribution shift.
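论文将 LI 分数用作标准分割共形预测(split conformal prediction)流程中的非一致性分数。下面用 numpy 给出该通用流程的极简示意(与论文的具体 LI 打分无关,分数与候选答案均为假设数值),展示如何由校准集分数得到阈值并构造预测集:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """分割共形预测:取校准集非一致性分数中
    第 ceil((n+1)(1-alpha)) 小的分数作为阈值(有限样本校正)。"""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(scores[min(k, n) - 1])

def prediction_set(candidate_scores, threshold):
    """非一致性分数不超过阈值的候选答案进入预测集。
    candidate_scores: {答案: 分数}(此处的分数可替换为论文中的 LI 分数)。"""
    return {a for a, s in candidate_scores.items() if s <= threshold}

cal = [0.1, 0.3, 0.2, 0.8, 0.4, 0.5, 0.25, 0.6, 0.15, 0.35]
thr = conformal_threshold(cal, alpha=0.2)                 # 目标覆盖率 80%
pred = prediction_set({"A": 0.2, "B": 0.55, "C": 0.9}, thr)
```

在可交换性假设下,这一构造保证真实答案以至少 1-alpha 的概率落入预测集;论文的贡献在于用内部表示得到更稳定的分数,从而在同样的覆盖保证下给出更小的预测集。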
[NLP-10] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续学习(Continual Learning, CL)过程中因任务间参数干扰导致的灾难性遗忘(catastrophic forgetting)问题。现有方法通常通过约束新任务的适配器(adapter)与历史适配器之间的子空间或坐标级干扰来缓解该问题,但难以实现动态的参数隔离。论文提出JumpLoRA框架,其核心创新在于引入JumpReLU门控机制,自适应地在低秩适配(Low-Rank Adaptation, LoRA)模块中诱导稀疏性,从而实现动态参数隔离,有效防止任务间干扰。该方法具有高度模块化特性,可无缝集成至主流LoRA-based CL方法中,并在性能上显著优于IncLoRA和当前最优方法ELLA。
链接: https://arxiv.org/abs/2604.16171
作者: Alexandra Dragomir,Ioana Pintilie,Antonio Barbalau,Marius Dragoi,Florin Brad,Cristian Daniel Paduraru,Alexandru Tifrea,Elena Burceanu,Radu Tudor Ionescu
机构: Bitdefender; University of Bucharest
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.
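摘要提到通过 JumpReLU 门控在 LoRA 的低秩中间维度上诱导稀疏。下面给出一个概念性的 numpy 示意(非官方实现;实际 JumpReLU 的阈值通常逐单元可学习并借助直通估计训练,此处用固定阈值简化,矩阵维度与数值均为假设):

```python
import numpy as np

def jump_relu(x, theta):
    """JumpReLU:低于阈值 theta 的分量被硬性置零,其余保持原值不变
    (区别于 ReLU 的平移:这里是"跳变"截断而非减去阈值)。"""
    return np.where(x > theta, x, 0.0)

def lora_delta(x, A, B, theta):
    """带 JumpReLU 门控的 LoRA 前向增量示意:h = x A^T,Δy = JumpReLU(h) B。
    被置零的秩分量不参与输出,对应摘要所述的动态参数隔离。"""
    h = jump_relu(x @ A.T, theta)   # [r] 稀疏化的低秩中间激活
    return h @ B                    # 投影回模型维度 [d]

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))         # 秩 r=4,模型维度 d=8(演示用)
B = rng.normal(size=(4, 8))
x = rng.normal(size=(8,))
sparse_h = jump_relu(x @ A.T, theta=0.5)
delta = lora_delta(x, A, B, theta=0.5)
```

由于门控按分量硬性置零,不同任务的适配器可以各自占用低秩空间的不同分量,从而减少任务间干扰。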
[NLP-11] AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用链式思维(Chain-of-Thought, CoT)推理时,难以确保推理过程既真实反映模型决策机制、又能实质性影响最终答案的问题。现有方法常导致推理轨迹仅作为伴随现象而非因果驱动因素,从而削弱了模型的可解释性与可信度。解决方案的关键在于提出 AtManRL 方法,其核心是通过可微分注意力操纵(differentiable attention manipulation)学习一个加性注意力掩码(additive attention mask),识别出对正确答案至关重要的 CoT 令牌,并据此构建显著性奖励信号(saliency reward signal)。该奖励信号与基于结果的奖励联合嵌入 GRPO 框架中,共同优化模型在准确性与推理透明性上的表现,实验证明该方法能有效识别关键推理 token 并训练出更具解释性的推理模型。
链接: https://arxiv.org/abs/2604.16158
作者: Max Henning Höth,Kristian Kersting,Björn Deiseroth,Letitia Parcalabescu
机构: Aleph Alpha Research (Aleph Alpha 研究所); TU Darmstadt (达姆施塔特工业大学); Hessian.AI (黑森人工智能); Lab1141 (实验室1141)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 8 figures, 1 table
Abstract:Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model’s final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
[NLP-12] On the Rejection Criterion for Proxy-based Test-time Alignment ACL2026
【速读】: 该论文旨在解决大模型(base model)在测试时生成质量不稳定的问题,特别是如何利用小规模已对齐模型(aligned model)来引导大模型的生成过程。现有方法如隐式奖励(implicit reward)和 nudging 方法虽能提升性能,但其背后的机制缺乏统一理论框架,且依赖于不合理的置信度阈值作为拒绝准则,易受语言歧义等现象干扰。论文的关键创新在于揭示了两类方法本质均可建模为相似的图模型采样过程,仅在拒绝准则定义上存在差异;并提出一种基于保守置信度赌注(conservative confidence bet)的新拒绝准则,显著提升了生成质量和鲁棒性,在多个数据集上优于先前方法。
链接: https://arxiv.org/abs/2604.16146
作者: Ayoub Hammal,Pierre Zweigenbaum,Caio Corro
机构: Université Paris-Saclay, CNRS, LISN; INSA Rennes, IRISA, CNRS, Université de Rennes
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main
Abstract:Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
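为便于理解摘要中被质疑的置信度准则,下面给出 nudging 式逐词元决策的玩具示意(本文假设的简化实现,概率与阈值均为演示数值;论文正是指出这种朴素置信度阈值在歧义措辞等语言现象下不可靠,并提出保守置信下注作为替代):

```python
import numpy as np

def nudging_step(base_probs, aligned_probs, tau):
    """nudging 式逐词元决策:当大的未对齐模型(base)对下一词元的
    最大概率低于阈值 tau(即"不自信")时,改由小的对齐模型决定。
    返回 (词元 id, 决策来源)。"""
    if np.max(base_probs) >= tau:
        return int(np.argmax(base_probs)), "base"
    return int(np.argmax(aligned_probs)), "aligned"

base = np.array([0.4, 0.35, 0.25])      # base 模型分布平坦、不够自信
aligned = np.array([0.1, 0.8, 0.1])
tok, src = nudging_step(base, aligned, tau=0.5)
```

注意,base 模型分布平坦既可能意味着"不可靠",也可能只是措辞本身存在多种合理续写,这正是论文反对以置信度作为拒绝准则的原因。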
[NLP-13] Sentiment Analysis of German Sign Language Fairy Tales
【速读】: 该论文旨在解决德语手语(Deutsche Gebärdensprache, DGS)童话故事中情感分析的问题,即如何从视频中自动识别出负向、中性或正向的情感类别。其解决方案的关键在于构建了一个结合多模态特征提取与可解释机器学习模型的框架:首先利用四个大语言模型(Large Language Models, LLMs)对文本段落进行情感标注并获得高一致性(Krippendorff’s α = 0.781),随后使用MediaPipe从对应的手语视频片段中提取面部和身体运动特征;最终训练基于XGBoost的可解释模型,以视频特征预测情感状态,平均平衡准确率达到0.631。关键发现是,除面部表情(如眉毛和嘴部动作)外,髋部、肘部和肩部的运动同样在情感判别中起重要作用,表明手语情感表达中面部与身体动作具有同等重要性。
链接: https://arxiv.org/abs/2604.16138
作者: Fabrizio Nunnari,Siddhant Jain,Patrick Gebhard
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy tales text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 Krippendorff’s alpha. Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of the hips, elbows, and shoulders also contributes considerably to the discrimination of the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.
[NLP-14] Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors
【速读】: 该论文旨在解决社区枪支暴力幸存者叙事的定性研究中人工编码效率低、难以规模化的问题,其核心挑战在于如何利用生成式 AI(Generative AI)实现自动化、高精度的归纳编码。解决方案的关键在于评估开源大语言模型(Large Language Models, LLMs)在对21名黑人男性幸存者访谈文本进行无监督编码时的表现,重点考察其识别关键主题的能力及伦理风险。研究发现,尽管某些LLM配置可识别部分重要代码,但整体相关性较低且高度依赖数据预处理;同时,LLM的安全机制(guardrails)导致显著的叙述性删减,暴露出AI辅助编码在准确性与伦理合规性方面的双重局限。
链接: https://arxiv.org/abs/2604.16132
作者: Jessica H. Zhu,Shayla Stringfield,Vahe Zaprosyan,Michael Wagner,Michel Cukier,Joseph B. Richardson Jr
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of the Association for Computational Linguistics (2026)
Abstract:Firearm violence is a pressing public health issue, yet research into survivors’ lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.
[NLP-15] LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在软件开发中广泛应用背景下,如何有效区分由大型语言模型(Large Language Models, LLMs)生成的代码与人类编写的代码这一关键问题,其应用涉及学术诚信、代码质量保障及软件安全。解决方案的核心在于提出 LLMSniffer 框架,该框架通过两阶段监督对比学习(supervised contrastive learning)对 GraphCodeBERT 进行微调,并引入注释移除预处理和多层感知机(MLP)分类器以增强检测性能。实验表明,该方法在 GPTSniffer 和 Whodunit 两个基准数据集上显著优于现有基线,准确率分别提升至 78% 和 94.65%,且 t-SNE 可视化验证了对比微调能生成结构更清晰、聚类更紧密的嵌入表示。
链接: https://arxiv.org/abs/2604.16058
作者: Mahir Labib Dihan,Abir Muhtasim
机构: Bangladesh University of Engineering and Technology (BUET) (孟加拉国工程与技术大学)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:The rapid proliferation of Large Language Models (LLMs) in software development has made distinguishing AI-generated code from human-written code a critical challenge with implications for academic integrity, code quality assurance, and software security. We present LLMSniffer, a detection framework that fine-tunes GraphCodeBERT using a two-stage supervised contrastive learning pipeline augmented with comment removal preprocessing and an MLP classifier. Evaluated on two benchmark datasets - GPTSniffer and Whodunit - LLMSniffer achieves substantial improvements over prior baselines: accuracy increases from 70% to 78% on GPTSniffer (F1: 68% to 78%) and from 91% to 94.65% on Whodunit (F1: 91% to 94.64%). t-SNE visualizations confirm that contrastive fine-tuning yields well-separated, compact embeddings. We release our model checkpoints, datasets, codes and a live interactive demo to facilitate further research.
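LLMSniffer 的核心训练信号之一是监督对比学习。下面用 numpy 给出标准监督对比损失(SupCon)的极简示意(温度与示例嵌入均为假设数值,演示的是通用损失形式,而非论文的两阶段微调流水线本身):

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """监督对比损失(SupCon)示意:同标签(同为人写代码 / 同为 LLM 生成)
    的样本互为正例,其余为负例;嵌入先做 L2 归一化。"""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)   # 锚点不与自身配对
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # 所有正例对上的平均负对数似然
    return float(-np.where(pos, log_prob, 0.0).sum() / pos.sum())

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_clustered = supcon_loss(emb, [0, 0, 1, 1])   # 同类嵌入彼此接近
loss_shuffled = supcon_loss(emb, [0, 1, 0, 1])    # 标签打乱后正例相距更远
```

同类样本聚集时损失明显更低,这正是对比微调能产出"结构清晰、聚类紧密"嵌入(如摘要中 t-SNE 可视化所示)的直观原因。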
[NLP-16] Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures ACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)因内部机制不透明而影响可信性和安全部署的问题。现有可解释人工智能(Explainable AI)研究多集中于事后解释方法,即通过外部近似来解析已训练模型,但此类方法难以提供对模型决策过程的深层洞察。本文提出以内在可解释性(intrinsic interpretability)作为替代方案,其核心在于将透明性直接嵌入模型架构与计算过程中,从而实现对模型行为的可理解性。关键解决方案是系统梳理了当前LLMs内在可解释性的五类设计范式:功能透明性(functional transparency)、概念对齐(concept alignment)、表征可分解性(representational decomposability)、显式模块化(explicit modularization)以及潜在稀疏性诱导(latent sparsity induction),为构建更可信、可控的生成式AI系统提供了理论框架与技术路径。
链接: https://arxiv.org/abs/2604.16042
作者: Yutong Gao,Qinglin Meng,Yuan Zhou,Liangming Pan
机构: Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Nanjing University of Science and Technology (南京理工大学); Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Main Conference of ACL 2026. 14 pages, 4 figures, 1 table
Abstract:While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: this https URL.
[NLP-17] Stochasticity in Tokenisation Improves Robustness
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对输入扰动时的鲁棒性不足问题,特别是由分词(tokenisation)扰动引发的脆弱性。研究表明,采用确定性的标准分词方式训练的模型容易受到对抗攻击和随机扰动的影响。论文提出的关键解决方案是引入均匀采样的随机分词策略(uniformly sampled stochastic tokenisation),并在预训练和微调阶段应用该策略,从而显著提升模型对随机扰动和对抗攻击的鲁棒性,同时保持模型精度且不增加推理成本。
链接: https://arxiv.org/abs/2604.16037
作者: Sophie Steger,Rui Li,Sofiane Ennadir,Anya Sims,Arno Solin,Franz Pernkopf,Martin Trapp
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Sensitivity to perturbations of the input tokenisation indicates that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.
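摘要中的“均匀采样的随机分词”可理解为:在词表允许的所有合法切分上等概率采样一种切分(而非总是取规范切分)。下面给出一个基于计数动态规划的纯 Python 示意(词表与字符串均为本文假设的玩具例子):

```python
import random

def count_segmentations(s, vocab):
    """动态规划计数:c[i] = 后缀 s[i:] 在给定词表下的合法切分数。"""
    n = len(s)
    c = [0] * (n + 1)
    c[n] = 1
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if s[i:j] in vocab:
                c[i] += c[j]
    return c

def sample_tokenisation(s, vocab, rng):
    """在全部合法切分上均匀采样一种非规范分词(示意实现)。"""
    c = count_segmentations(s, vocab)
    assert c[0] > 0, "字符串无法用该词表切分"
    i, toks = 0, []
    while i < len(s):
        r = rng.randrange(c[i])            # 以 c[j]/c[i] 的概率选词元 s[i:j]
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in vocab:
                if r < c[j]:
                    toks.append(s[i:j])
                    i = j
                    break
                r -= c[j]
    return toks

vocab = {"un", "relate", "d", "related", "u", "n", "unrelated"}
counts = count_segmentations("unrelated", vocab)   # 该例共有 5 种合法切分
toks = sample_tokenisation("unrelated", vocab, random.Random(0))
```

训练时把同一字符串的不同切分都喂给模型,模型的内部表示便不再依赖某一种规范切分,这与摘要报告的鲁棒性提升相呼应。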
[NLP-18] Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
【速读】: 该论文旨在解决大规模推理模型(Large Reasoning Models, LRMs)在并行推理过程中因早期错误导致的无效路径(futile paths)问题,这些问题显著增加了计算开销且降低了推理效率。解决方案的关键在于提出首个系统性的路径剪枝(path pruning)分类体系,依据信号来源(内部 vs. 外部)和可学习性(可学习 vs. 不可学习)对现有方法进行归类,并由此发现可学习的内部信号方法具有未被充分挖掘的潜力;基于此,作者提出STOP(Super TOken for Pruning),一种基于token级别的可学习路径剪枝机制,在多个规模从1.5B到20B参数的LRM上验证了其在效果与效率上的优越性,并通过实证分析形成了可用于实际部署的优化指南。
链接: https://arxiv.org/abs/2604.16029
作者: Jiaxi Bi,Tongxu Luo,Wenyu Du,Zhengyang Tang,Benyou Wang
机构: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; USTB; DualityRL
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 7 figures
Abstract:Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at this https URL
[NLP-19] Where does output diversity collapse in post-training?
【速读】: 该论文旨在解决后训练语言模型(post-trained language models)在推理阶段输出多样性下降的问题,即“输出多样性崩溃”(output diversity collapse),这一现象会削弱依赖多样样本的缩放方法,并可能导致创造性及价值导向任务中模型输出同质化。其关键发现是:多样性崩溃主要由训练数据组成决定,而非生成格式或特定后训练方法本身;通过在Olmo 3的三个并行后训练路径(Think、Instruct、RL-Zero)上系统评估15个任务和四种文本多样性指标,研究者发现不同数据来源导致的多样性损失位置不同(如Think线路在监督微调阶段即出现语义多样性显著下降),且即使抑制链式思维推理(chain-of-thought reasoning)也不会恢复答案层面的多样性,说明崩溃已嵌入模型权重中。因此,解决方案的关键在于优化训练数据组成以保留多样性,而非仅调整推理策略。
链接: https://arxiv.org/abs/2604.16027
作者: Constantinos Karouzos,Xingwei Tan,Nikolaos Aletras
机构: University of Sheffield(谢菲尔德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
[NLP-20] AgentV-RL: Scaling Reward Modeling with Agentic Verifier ACL2026
【速读】: 该论文旨在解决验证器(verifier)在复杂领域中因错误传播和缺乏外部依据而导致的可靠性问题,尤其是在计算密集型或知识密集型任务中。其关键解决方案是提出Agentic Verifier框架,将奖励建模转化为多轮、工具增强的反思式过程,通过引入前向与后向两个互补代理:一个从前提推导结论,另一个从结论反向验证前提,从而实现对解题过程的双向、全面评估,提升准确性与可解释性;同时设计了AgentV-RL机制,使验证器能够自主地在内部推理与工具调用之间交替执行,以支持实际部署中的高效决策。
链接: https://arxiv.org/abs/2604.16004
作者: Jiazheng Zhang,Ziche Fu,Zhiheng Xi,Wenqing Jing,Mingxu Chai,Wei He,Guoqiang Zhang,Chenghao Fan,Chenxin An,Wenxiang Chen,Zhicheng Liu,Haojie Pan,Dingwei Zhu,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Huazhong University of Science and Technology (华中科技大学); The University of Hong Kong (香港大学); ByteDance Seed (字节跳动种子); Institute of Trustworthy Embodied AI, Fudan University (复旦大学可信具身智能研究所); Shanghai Key Laboratory of Multimodal Embodied AI (上海市多模态具身智能重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026
Abstract:Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
[NLP-21] SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification ICASSP2026
【速读】: 该论文旨在解决少样本层次文本分类(few-shot Hierarchical Text Classification, few-shot HTC)任务中,由于数据稀缺导致模型难以区分语义相似的兄弟类(sibling classes)的问题。现有方法虽利用标签层次结构约束保证父子类别预测一致性,但在深层类别间缺乏足够的判别能力。解决方案的关键在于提出一种名为 SCHK-HTC 的新方法,其核心创新为:1)层次知识提取模块(hierarchical knowledge extraction module),用于挖掘和注入层次结构中的领域知识以增强语义理解;2)兄弟对比学习机制(sibling contrastive learning mechanism),通过显式建模兄弟类之间的细微差异来提升模型在深层类别上的可分性。该设计使模型能够在每一层级编码更具判别力的特征,从而显著改善混淆类别的区分能力,并在多个基准数据集上优于当前最优方法。
链接: https://arxiv.org/abs/2604.15998
作者: Ke Xiong,Qian Wu,Wangjie Gan,Yuke Li,Xuhong Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 3 figures, ICASSP 2026
Abstract:Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck: the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model’s perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides the model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Our code is available at this https URL.
[NLP-22] A Case Study on the Impact of Anonymization Along the RAG Pipeline
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因数据敏感性带来的隐私风险问题,即在利用RAG提取有价值信息的同时,可能无意暴露个人身份信息(PII)或其他敏感内容。其解决方案的关键在于系统性地评估匿名化处理在RAG流程中的不同位置对隐私-效用权衡的影响,具体比较了在原始数据集阶段和生成答案阶段进行匿名化的差异,从而揭示了隐私保护措施部署位置的重要性,为RAG管理员提供可操作的隐私风险缓解策略。
链接: https://arxiv.org/abs/2604.15958
作者: Andreea-Elena Bodea,Stephen Meisenbacher,Florian Matthes
机构: Technical University of Munich (慕尼黑工业大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 7 pages, 1 figure, 6 tables. Accepted to IWSPA 2026
Abstract:Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.
[NLP-23] RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration IJCNN2026
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中常见的封闭域幻觉(closed-domain hallucinations)问题,即模型在生成内容时脱离所检索到的上下文,产生不准确或无依据的信息。现有方法多将幻觉检测视为事后处理任务,依赖黑箱一致性检查或冻结内部表征的探测机制,难以有效干预生成过程。论文的关键解决方案是提出RAGognize数据集和RAGognizer训练框架:前者提供自然发生的封闭域幻觉的细粒度token级标注,后者通过在大语言模型(Large Language Models, LLMs)中集成轻量级幻觉检测头,实现语言建模与幻觉检测的联合优化。这一设计使模型在生成过程中同时学习生成高质量响应并提升内部状态对幻觉的可分离性,从而在多个基准测试中实现最优的token级幻觉检测性能,并显著降低生成阶段的幻觉率,且不损害语言质量与相关性。
链接: https://arxiv.org/abs/2604.15945
作者: Fabian Ridder,Laurin Lessel,Malte Schilling
机构: University of Münster (明斯特大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: accepted at IJCNN 2026
Abstract:Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.
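A minimal sketch of the general pattern the abstract describes: a lightweight token-level detection head can be a linear probe over per-token hidden states, trained with a binary cross-entropy loss added to the language-modeling objective. All shapes, names, and the loss weighting below are illustrative assumptions, not RAGognizer's actual implementation.

```python
import numpy as np

# Hedged sketch (not the paper's code): a linear detection head over
# per-token hidden states, with a BCE loss that would be added to the
# LM loss during joint fine-tuning. All values are toy data.
rng = np.random.default_rng(0)
seq_len, hidden = 6, 8
H = rng.normal(size=(seq_len, hidden))      # per-token hidden states
w, b = rng.normal(size=hidden), 0.0         # detection-head parameters
labels = np.array([0, 0, 1, 1, 0, 0])       # token-level hallucination tags

logits = H @ w + b
probs = 1.0 / (1.0 + np.exp(-logits))       # sigmoid per token
eps = 1e-9
detect_loss = -np.mean(labels * np.log(probs + eps)
                       + (1 - labels) * np.log(1 - probs + eps))
# joint objective: total = lm_loss + lambda * detect_loss (LM part omitted here)
print(round(float(detect_loss), 4))
```

The joint objective is what pressures the model's internal states to become linearly separable with respect to hallucination, rather than probing frozen representations post hoc.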
[NLP-24] MUSCAT: MUltilingual SCientific ConversATion Benchmark
【速读】: 该论文旨在解决多语言语音技术中面临的实际挑战,特别是针对混合多语言输入、特定词汇以及语码转换(code-switching)等复杂场景下自动语音识别(ASR)系统的性能评估问题。当前缺乏能够系统评测ASR模型在真实多语言对话环境中表现的基准数据集,限制了相关技术的发展。为此,作者提出了一种新的基准测试集,包含多名说话者围绕科学论文进行双语讨论的音频数据,涵盖自然发生的多语言交互和语码转换现象;其关键创新在于构建了一个超越传统词错误率(Word Error Rate, WER)的标准评估框架,支持跨语言的一致性比较,从而更全面地衡量ASR系统在真实多语言环境下的鲁棒性和适应性。实验结果表明,该数据集对现有最先进的ASR系统仍构成显著挑战,凸显了未来研究的方向。
链接: https://arxiv.org/abs/2604.15929
作者: Supriti Sinhamahapatra,Thai-Binh Nguyen,Yiğit Oğuz,Enes Ugan,Jan Niehues,Alexander Waibel
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework, beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available in this https URL. Keywords: multilingual, speech recognition, audio segmentation, speaker diarization
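Since the benchmark's evaluation framework goes beyond Word Error Rate, the baseline WER itself is worth pinning down. A minimal sketch of the standard definition, edit distance (substitutions + deletions + insertions) over word sequences divided by reference length:

```python
# Standard WER: word-level Levenshtein distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the talk is in english", "the talk was in english"))  # → 0.2
```

One limitation this makes visible: plain WER has no notion of which language a word came from, which is exactly why code-switched speech needs per-language breakdowns on top of it.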
[NLP-25] How Hypocritical Is Your LLM Judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models ACL2026
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在作为语用理解者(pragmatic listeners,即判断语言输出是否恰当)与语用生成者(pragmatic speakers,即生成符合语用规范的语言)时,二者性能是否一致,即语用判断能力与语用生成能力是否存在强关联。解决方案的关键在于通过系统性对比多个开源和专有LLMs在三种语用场景下的表现,发现模型在语用判断任务中普遍优于语用生成任务,揭示了当前LLMs中语用理解与语用生成之间存在显著不对称性,从而指出现有评估范式需向更集成化方向发展,以实现对语用能力的全面衡量。
链接: https://arxiv.org/abs/2604.15873
作者: Judith Sieker,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 (findings)
Abstract:Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs’ performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.
[NLP-26] DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在零样本命名实体识别(Zero-shot Named Entity Recognition, NER)任务中普遍存在且系统性的生成错误问题,这些问题导致其性能显著落后于监督学习方法。解决方案的关键在于提出DiZiNER框架,该框架通过模拟人类标注中的“预标注”(pilot annotation)过程,利用多个异构LLM作为标注者对同一文本进行标注,并由一个监督模型分析不同模型间的分歧,从而迭代优化任务指令。这一“基于分歧引导的指令精炼”机制有效提升了零样本NER性能,在18个基准数据集上实现14个最优结果,F1分数平均提升8.0点,并将零样本与监督模型之间的性能差距缩小超过11点,且改进效果不依赖于模型容量增强,而是源于分歧驱动的指令优化策略。
链接: https://arxiv.org/abs/2604.15866
作者: Siun Kim,Hyung-Jin Yoon
机构: Seltasquare; Seoul National University College of Medicine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures; Accepted to the ACL 2026 Main Conference
Abstract:Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.
[NLP-27] CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization ACL2026
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在生成式AI(Generative AI)场景下,如何有效实现“机器遗忘”(machine unlearning)的问题。现有方法难以在不损害长链思维(chain-of-thought, CoT)推理能力的前提下,彻底移除模型中与隐私或版权相关的知识。其解决方案的关键在于提出一种基于迭代偏好优化的反事实遗忘框架(Counterfactual Unlearning through iterative Preference Optimization, CiPO),通过引导模型生成逻辑上有效的反事实推理路径,并利用该路径进行偏好学习,同时迭代更新训练数据以增强与原始模型的差异性,从而在保障推理能力的同时实现对中间CoT步骤和最终答案的完全遗忘。
链接: https://arxiv.org/abs/2604.15847
作者: Junyi Li,Yongqiang Chen,Ningning Ding
机构: The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main Conference
Abstract:Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.
[NLP-28] Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理推理密集型任务时内部工作机制不清晰的问题。为推进对模型内部处理机制的理解,作者聚焦于算术运算这一典型推理任务,通过早期解码(early decoding)技术追踪每一层中下一个词元的预测构建过程。解决方案的关键在于揭示了模型在不同层级上的功能分工:熟练执行算术任务的模型在注意力模块(attention)与多层感知机模块(MLP)之间呈现出明确的职责划分——注意力模块负责传播输入信息,而MLP模块则负责聚合这些信息以生成正确结果;相比之下,不熟练模型缺乏这种结构化分工。此外,研究还发现成功模型对更具挑战性的算术任务表现出功能性处理特征,表明其具备超越事实记忆的推理能力。
链接: https://arxiv.org/abs/2604.15842
作者: Tanja Baeumel,Josef van Genabith,Simon Ostermann
机构: German Research Center for Artificial Intelligence (DFKI); Saarland University; Center for European Research in Trusted AI (CERTAIN)
类目: Computation and Language (cs.CL)
备注: MathNLP 2025
Abstract:Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.
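A toy sketch of the early-decoding idea (often called the logit lens): project each layer's hidden state through the output unembedding matrix to read off a per-layer next-token prediction, which is how one can trace where the correct result first emerges. All matrices below are random toy data, not taken from any real model.

```python
import numpy as np

# Toy early decoding: per-layer hidden states projected through a shared
# unembedding matrix; argmax gives the layer-wise next-token prediction.
rng = np.random.default_rng(1)
n_layers, hidden, vocab = 4, 8, 10
unembed = rng.normal(size=(hidden, vocab))           # shared output projection
hidden_states = rng.normal(size=(n_layers, hidden))  # last-token state per layer

preds = [int(np.argmax(h @ unembed)) for h in hidden_states]
for layer, tok in enumerate(preds):
    print(f"layer {layer}: predicted token id {tok}")
```

In a real model, comparing `preds` across layers is what reveals the paper's observation that the task is recognized early but the correct arithmetic result appears only near the final layers.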
[NLP-29] Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language ACL2026
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在中文网络亚文化语言——“Chouxiang Language”上的表现尚不明确的问题。Chouxiang Language 是中国互联网语境中一种具有代表性的次文化语言形式,其独特的表达方式对主流 NLP 模型构成挑战。为系统评估 LLM 在此类语言任务中的能力,作者提出了 Mouse 基准,涵盖六类自然语言处理任务。实验表明,尽管当前最先进(SOTA)的 LLMs 在涉及上下文语义理解的任务中表现良好,但在其他任务上仍存在明显局限性。解决方案的关键在于构建一个专门针对 Chouxiang Language 的多任务评估基准,并深入分析性能瓶颈、LLM-as-a-judge 方法与人类判断的一致性以及影响翻译质量的核心因素,从而推动自然语言处理领域对多元文化融合及网络语言动态演化的研究。
链接: https://arxiv.org/abs/2604.15841
作者: Dianqing Lin,Tian Lan,Jiali Zhu,Jiang Li,Wei Chen,Xu Liu,Aruukhan,Xiangdong Su,Hongxu Hou,Guanglai Gao
机构: College of Computer Science, Inner Mongolia University, China
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings
Abstract:While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP tasks involving Chouxiang Language across six tasks. Experimental results show that, current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. In addition, we further discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.
[NLP-30] CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)智能体在强化学习过程中依赖静态数据分布的问题,该问题导致模型难以适应自身行为的演化,从而限制了对复杂环境交互的覆盖与学习效率。解决方案的关键在于提出一种代理-数据协同进化框架(CoEvolve),通过闭环、交互驱动的训练机制实现LLM智能体与其训练数据分布的联合优化:具体而言,CoEvolve从轨迹回放中提取遗忘和不确定性等反馈信号,识别高风险交互模式,并据此引导基于LLM的任务合成;合成任务经环境验证后用于更新数据分布,形成智能体与数据之间的动态互促机制,从而显著提升智能体性能,在AppWorld和BFCL基准上实现了15.58%至19.43%的绝对性能增益。
链接: https://arxiv.org/abs/2604.15840
作者: Shidong Yang,Ziyu Ma,Tongwen Huang,Yiming Hu,Yong Wang,Xiangxiang Chu
机构: AMAP, Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026
Abstract:Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent’s evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.
[NLP-31] Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4 ACL2026
【速读】: 该论文旨在解决当前自动定理证明(ATP)基准测试中普遍存在的“Easy Mode”设计问题,即在形式化陈述中直接嵌入最终答案,导致模型能力评估过于乐观,且与人类竞赛者面临的实际挑战不一致。为此,作者提出“Hard Mode”这一更严格、更贴近现实的评测设定,要求系统必须独立发现答案后再构造形式化证明。解决方案的关键在于:首先发布两个经专家重新标注的Hard Mode基准数据集(MiniF2F-Hard 和 FIMO-Hard),其次提出Discover And Prove (DAP) 框架——该框架利用大语言模型(LLM)的自然语言推理与显式自我反思能力来自主发现答案,并将Hard Mode问题转换为Easy Mode格式以适配现有ATP求解器。实验表明,DAP在CombiBench上将成功解题数从7提升至10,在PutnamBench上首次实现36个定理的Hard Mode形式化证明,同时揭示出当前SOTA LLM在答案准确率超过80%时,ATP求解器成功率却低于10%,凸显了Hard Mode基准对衡量真实推理能力的独特价值。
链接: https://arxiv.org/abs/2604.15839
作者: Chengwu Liu,Yichun Yin,Ye Yuan,Jiaxuan Xie,Botao Li,Siqi Li,Jianhao Shen,Yan Xu,Lifeng Shang,Ming Zhang
机构: Peking University (北京大学); Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: ACL 2026 Main Conference
Abstract:Most ATP benchmarks embed the final answer within the formal statement – a convention we call “Easy Mode” – a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting “Hard Mode”: the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode – while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.
[NLP-32] Qwen3.5-Omni Technical Report
【速读】: 该论文旨在解决多模态大模型在长序列理解、跨模态对齐、语音合成稳定性及音频视觉联合推理等方面的瓶颈问题。其核心解决方案在于:首先,通过采用混合注意力机制的专家混合(Hybrid Attention Mixture-of-Experts, MoE)架构,实现高效长序列推理;其次,引入动态对齐机制ARIA(Alignment-aware Real-time Audio synthesis),以缓解文本与语音分词器间编码效率差异导致的流式语音合成不稳定问题,显著提升对话语音的自然度和韵律;再次,通过构建大规模异构文本-视觉对与超百万小时音视频数据集,增强模型在215个子任务上的音频与视听理解、推理与交互能力,并实现支持10种语言的多语种理解和情感化语音生成;最后,创新性地发现并验证了“音频-视觉情绪编码”(Audio-Visual Vibe Coding)这一新能力,即模型可直接根据音视频指令执行代码生成,标志着多模态模型向具身智能演进的重要突破。
链接: https://arxiv.org/abs/2604.15804
作者: Qwen Team
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
[NLP-33] CHOP: Chunkwise Context-Preserving Framework for RAG on Multi-Documents
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在向量数据库中存在相似文档时导致的检索准确率下降问题,这会引发冗余信息、幻觉和事实性错误。解决方案的关键在于提出CHOP框架,其核心包括两个组件:一是CNM-Extractor,用于为每个文本块生成包含类别、关键名词和模型名称的紧凑签名(compact per-chunk signatures),以增强语义区分度;二是连续性决策模块(Continuity Decision Module),通过判断相邻文本块是否属于同一文档流来保持上下文连贯性。通过在每个文本块前添加上下文感知的元数据,CHOP有效减少了相似文档间的语义冲突,提升了检索器的判别能力,从而显著改善了检索精度与排序质量。
链接: https://arxiv.org/abs/2604.15802
作者: Hyunseok Park,Jihyeon Kim,Jongeun Kim,Dongsik Yoon
机构: HDC LABS(韩国 HDC 实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.
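A hedged sketch of the kind of context-aware prefixing CHOP is described as doing, using the CNM-Extractor's signature fields (category, key nouns, model name). The exact signature format and field separators below are assumptions, not the paper's specification.

```python
# Hypothetical signature format; CHOP's actual serialization is not given here.
def prefix_chunk(chunk: str, category: str, nouns: list[str], model: str) -> str:
    """Prepend a compact per-chunk signature to reduce semantic conflicts
    between similar documents at retrieval time."""
    signature = f"[category: {category} | nouns: {', '.join(nouns)} | model: {model}]"
    return f"{signature}\n{chunk}"

out = prefix_chunk("The reranker improves the top-1 hit rate.",
                   "retrieval", ["reranker", "hit rate"], "CHOP")
print(out)
```

The point of the prefix is that embeddings of two near-duplicate chunks from different documents diverge once their signatures differ, improving retriever discrimination.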
[NLP-34] Self-Distillation as a Performance Recovery Mechanism for LLM s: Counteracting Compression and Catastrophic Forgetting
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)、量化(quantization)和剪枝(pruning)等过程中因灾难性遗忘(catastrophic forgetting)导致的性能退化问题。其解决方案的关键在于提出一种基于自蒸馏微调(Self-Distillation Fine-Tuning, SDFT)的性能恢复框架,该框架通过将教师模型(teacher)的隐藏层激活轨迹与学生模型(student)进行对齐,实现能力恢复。研究进一步利用中心核对齐(Centered Kernel Alignment, CKA)量化这种对齐程度,并证明性能恢复的核心机制在于自蒸馏能有效使学生模型的高维流形结构与教师模型所代表的最优结构对齐,从而从几何表示理论层面解释了SDFT的有效性。
链接: https://arxiv.org/abs/2604.15794
作者: Chi Liu,Xin Chen,Xu Zhou,Fangbo Tu,Srinivasan Manoharan
机构: PayPal AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 8 figures
Abstract:Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM’s generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student’s high-dimensional manifold with the optimal structure represented by the teacher. This study bridges the gap between practical recovery frameworks and geometric representation theory, offering new insights into the internal mechanisms of self-distillation.
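The linear variant of CKA has a simple closed form; a minimal numpy sketch on toy data (not the paper's code) that also checks the orthogonal-invariance property the abstract leverages:

```python
import numpy as np

# Linear CKA between activation matrices X, Y (n samples x features):
# CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering.
def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))                   # e.g. student activations
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal transform
print(round(linear_cka(X, X @ Q), 4))           # → 1.0 (orthogonal invariance)
print(round(linear_cka(X, rng.normal(size=(50, 16))), 4))  # unrelated: low
```

Invariance to orthogonal transformations and isotropic scaling is what makes CKA suitable for comparing student and teacher trajectories whose coordinate bases differ.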
[NLP-35] A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
【速读】: 该论文旨在解决当前训练-free方法在提升大语言模型(Large Language Models, LLMs)可信性(trustworthiness)时存在的评估不一致、维度局限及副作用(如性能退化和脆弱性增加)的问题。其解决方案的关键在于提出一个基于信息流干预位置的三层次分类体系(输入层、内部层和输出层),并在此基础上对不同层级的代表性方法进行系统性再评估,涵盖多个LLM家族与规模,从而揭示现有方法在可信性、实用性(utility)与鲁棒性(robustness)之间的权衡关系与未解挑战,最终为无需额外训练即可平衡上述属性提供实践建议。
链接: https://arxiv.org/abs/2604.15789
作者: Wai Man Si,Mingjie Li,Michael Backes,Yang Zhang
机构: CISPA Helmholtz Center for Information Security
类目: Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model’s information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs without the need for additional training.
[NLP-36] Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLM s
【速读】: 该论文旨在解决当前对齐方法(如监督微调 SFT 和强化学习人类反馈 RLHF)无法有效消除模型中潜在不安全行为的问题,即这些方法仅鼓励生成偏好响应,但未直接移除引发有害输出的“不安全子网络”。其解决方案的关键在于提出一种资源高效的剪枝框架,通过梯度无关的归因机制识别并移除与不安全行为相关的参数,从而在保持模型性能的同时显著降低有害生成,并提升对抗越狱攻击的鲁棒性。这一方法从彩票理论(Lottery Ticket Hypothesis)视角揭示了模型中存在“不安全票券”(unsafe tickets),而剪枝过程可识别出保留性能且更安全的“安全票券”(safety tickets),为资源受限场景下的后处理对齐提供了轻量级策略。
链接: https://arxiv.org/abs/2604.15780
作者: Wai Man Si,Mingjie Li,Michael Backes,Yang Zhang
机构: CISPA Helmholtz Center for Information Security
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain “unsafe tickets” responsible for harmful behaviors, and pruning reveals “safety tickets” that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
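下面用一段极简的 Python 草图示意摘要中“无梯度归因 + 剪枝”的大致思路:逐一屏蔽参数、观察一个代理“不安全得分”的变化来估计贡献,再剪除贡献最大的参数。其中的打分代理、变量名与剪枝比例均为本文假设的玩具设定,并非论文的真实实现。

```python
# Toy sketch of gradient-free attribution pruning: score each parameter by
# how much masking it reduces a proxy "unsafe score", then zero out the
# highest-impact ones. The scoring proxy and all names are illustrative.

def unsafe_score(weights, unsafe_probe):
    # Proxy: dot product of weights with an assumed "unsafe direction" probe.
    return sum(w * p for w, p in zip(weights, unsafe_probe))

def attribute_and_prune(weights, unsafe_probe, prune_frac=0.2):
    base = unsafe_score(weights, unsafe_probe)
    # Gradient-free attribution: ablate one weight at a time.
    impact = []
    for i in range(len(weights)):
        masked = list(weights)
        masked[i] = 0.0
        impact.append((base - unsafe_score(masked, unsafe_probe), i))
    # Prune the weights whose removal most reduces the unsafe score.
    k = max(1, int(prune_frac * len(weights)))
    to_prune = {i for _, i in sorted(impact, reverse=True)[:k]}
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]

weights = [0.9, -0.1, 0.5, 0.05, -0.7]
probe   = [1.0,  0.0, 1.0, 0.0,  0.0]   # weights 0 and 2 drive the proxy
pruned = attribute_and_prune(weights, probe, prune_frac=0.4)
print(pruned)
```

真实系统中“不安全得分”来自模型在有害提示上的行为度量,而非这里的内积代理;此草图只用来说明“按归因排序后截断”的控制流。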
[NLP-37] PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
【速读】: 该论文旨在解决现有个人身份信息(PII)检测资源分散、标注体系不统一的问题,从而阻碍了检测系统之间的系统性比较。解决方案的关键在于构建一个统一的基准语料库——PIIBench,其整合了10个公开数据集,涵盖合成PII语料、多语言命名实体识别(NER)基准及金融领域标注文本,共包含2,369,883个标注序列和335万个实体提及,覆盖48种标准PII类别;并通过一个规范化的处理流程将80余种源特定标签映射到标准化的BIO标注方案,实施基于频率的近似稀有实体抑制,并生成保持原始分布的分层训练/验证/测试划分,从而为PII检测提供更全面且更具挑战性的评估基准。
链接: https://arxiv.org/abs/2604.15776
作者: Pritesh Jha
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at this https URL.
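摘要中“将 80 余种源标签映射到统一 BIO 标注方案”这一步,可以用如下玩具代码示意;映射表为虚构示例,并非 PIIBench 的真实标签表。

```python
# Minimal sketch of the label normalization PIIBench describes: map
# heterogeneous source-specific labels onto a canonical BIO scheme.
# The mapping table below is a made-up example.

CANONICAL = {
    "PER": "PERSON", "NAME": "PERSON", "person_name": "PERSON",
    "EMAIL_ADDRESS": "EMAIL", "e-mail": "EMAIL",
    "PHONE_NUM": "PHONE", "telephone": "PHONE",
}

def to_bio(tokens_with_labels):
    """Convert (token, source_label_or_None) pairs to BIO tags."""
    bio, prev = [], None
    for token, label in tokens_with_labels:
        canon = CANONICAL.get(label) if label else None
        if canon is None:
            bio.append("O")
        elif canon == prev:
            bio.append("I-" + canon)   # continuation of the same entity type
        else:
            bio.append("B-" + canon)   # beginning of a new entity span
        prev = canon
    return bio

seq = [("John", "PER"), ("Smith", "PER"), ("emailed", None), ("x@y.z", "e-mail")]
print(to_bio(seq))  # ['B-PERSON', 'I-PERSON', 'O', 'B-EMAIL']
```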
[NLP-38] MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在具备持久记忆能力后,因记忆污染或偏见累积导致的行为漂移(memory misevolution)问题,这会引发代理行为异常并带来新的安全风险。现有评估方法缺乏对长期记忆安全性的标准化衡量框架。其解决方案的关键在于提出MemEvoBench——首个针对LLM代理在对抗性记忆注入、噪声工具输出和偏见反馈下进行长期记忆安全评估的基准测试框架,包含7个领域的问答任务与36种风险类型,以及基于20个Agent-SafetyBench环境改编的工作流任务,通过多轮交互中混合良性与误导性记忆池模拟记忆演化过程,实验证明静态提示防御无法有效应对此类风险,凸显了保障记忆演化安全的紧迫性。
链接: https://arxiv.org/abs/2604.15774
作者: Weiwei Xie,Shaoxiong Guo,Fan Zhang,Tian Xia,Xue Yang,Lizhuang Ma,Junchi Yan,Qibing Ren
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.
[NLP-39] Skill-RAG : Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中持续存在的检索失败问题,尤其是由查询与证据空间之间对齐偏差(query-evidence misalignment)导致的结构性失败,而非单纯因缺乏相关证据所致。现有方法通常将检索失败视为需要重试的信号,而忽视了对失败根本原因的诊断与纠正。解决方案的关键在于提出Skill-RAG框架,其核心创新是耦合一个轻量级隐藏状态探测器(hidden-state prober)与一个基于提示的技能路由器(prompt-based skill router)。该框架在检索流水线的两个阶段对检索行为进行动态门控(gating),一旦检测到失败状态,技能路由器即可诊断具体成因,并从四种检索技能中选择最优策略——包括查询重写、问题分解、证据聚焦以及退出技能(用于无法修正的极端情况),从而在下一轮生成前主动修复查询与证据之间的语义错位。实验表明,Skill-RAG显著提升了多轮检索后仍存在困难样本的准确性,尤其在分布外数据集上表现突出,且表示空间分析验证了不同技能对应于失败状态空间中的结构化、可分离区域,支持“查询-证据错位”是一种类型化的而非单一的现象。
链接: https://arxiv.org/abs/2604.15771
作者: Kai Wei,Raymond Li,Xi Zhu,Zhaoqian Xue,Jiaojiao Han,Jingcheng Niu,Fan Yang
机构: University of Michigan (密歇根大学); University of British Columbia (不列颠哥伦比亚大学); Rutgers University (罗格斯大学); University of Pennsylvania Perelman School of Medicine (宾夕法尼亚大学佩雷尔曼医学院); New Jersey Institute of Technology (新泽西理工学院); Technical University of Darmstadt (达姆施塔特工业大学); Wake Forest University (维克森林大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose – leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills – query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases – to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.
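Skill-RAG 的“探测器门控 + 技能路由”控制流可以用几行代码勾勒:探测器判断是否进入失败状态,路由器再把诊断出的成因映射到四种技能之一。这里探测与诊断都用桩代替,论文中二者均为学习得到的组件。

```python
# Schematic of Skill-RAG's two-stage control flow. The prober and the cause
# diagnosis are stubs here; in the paper both are learned from hidden states.

SKILLS = {
    "vague_query":    "query_rewriting",
    "multi_hop":      "question_decomposition",
    "noisy_evidence": "evidence_focusing",
    "irreducible":    "exit",
}

def route(failure_detected, diagnosed_cause):
    if not failure_detected:
        return "generate"                 # no intervention needed
    # Unknown causes fall back to the exit skill for irreducible cases.
    return SKILLS.get(diagnosed_cause, "exit")

print(route(False, None))            # generate
print(route(True, "multi_hop"))      # question_decomposition
print(route(True, "unknown_cause"))  # exit
```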
[NLP-40] TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models CVPR2026
【速读】: 该论文旨在解决现有基于CLIP的测试时适应(test-time adaptation)方法在开放世界场景下进行分布外(Out-of-distribution, OOD)检测时的关键局限:即依赖固定且有限的外部OOD标签,无法有效捕捉测试流中不断涌现的多样化和动态演化的OOD语义。解决方案的核心在于提出Test-time Textual Learning (TTL)框架,其关键创新包括:1)通过伪标签驱动的可学习提示(learnable prompts)机制,从无标签测试流中动态学习OOD文本语义;2)引入OOD知识净化策略以筛选可靠样本并抑制伪标签噪声;3)构建OOD文本知识库(OOD Textual Knowledge Bank)用于稳定跨批次的置信度校准。该方法实现了无需外部标签即可持续适应开放OOD空间的鲁棒检测能力。
链接: https://arxiv.org/abs/2604.15756
作者: Jinlun Ye,Jiang Liao,Runhe Lai,Xinhua Lu,Jiaxin Zhuang,Zhiyong Gan,Ruixuan Wang
机构: Sun Yat-sen University (中山大学); China United Network Communications Corporation Limited Guangdong Branch (广东省通信有限公司); Peng Cheng Laboratory (鹏城实验室); Hong Kong University of Science and Technology (香港科技大学); Key Laboratory of Machine Intelligence and Advanced Computing, MOE (教育部机器智能与先进计算重点实验室)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation while suppressing noise. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at this https URL.
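摘要中的“OOD 知识净化”本质上是按分数可信度筛选伪标签样本。下面是一个只保留分数区间思路的玩具草图:阈值与样本均为假设,论文的选择准则要复杂得多。

```python
# Rough sketch of TTL-style purification: among scored test samples, keep
# only confidently-OOD ones for adaptation and discard the ambiguous middle.
# Cutoffs and samples are illustrative assumptions.

def purify(batch, ood_cutoff=0.8, id_cutoff=0.2):
    """Split a scored batch into reliable OOD, reliable ID, and discarded."""
    ood   = [x for x, s in batch if s >= ood_cutoff]
    ind   = [x for x, s in batch if s <= id_cutoff]
    noisy = [x for x, s in batch if id_cutoff < s < ood_cutoff]
    return ood, ind, noisy

batch = [("img_a", 0.95), ("img_b", 0.55), ("img_c", 0.10), ("img_d", 0.85)]
ood, ind, noisy = purify(batch)
print(ood, noisy)  # confident OOD samples drive adaptation; 'img_b' is dropped
```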
[NLP-41] Language Place and Social Media: Geographic Dialect Alignment in New Zealand
【速读】: 该论文旨在解决地理方言(geographic dialect)与基于地点的社交媒介社群之间语言一致性(geographic dialect alignment)的问题,特别是聚焦于新西兰相关的Reddit社区。研究通过整合用户感知的定性分析与计算方法,考察语言使用如何反映地方认同及基于词汇、形态句法和语义变量的语言变异与演变模式。其解决方案的关键在于利用大规模语料库(含42.6亿未处理词)和先进的语言建模技术(包括静态与历时Word2Vec词嵌入),揭示了基于地点的社群间语义差异及新西兰英语内部有意义的语义变迁,从而证明社交媒体可作为社会语言学研究的自然实验室。
链接: https://arxiv.org/abs/2604.15744
作者: Sidney Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注: PhD thesis
Abstract:This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.
[NLP-42] Learning Uncertainty from Sequential Internal Dispersion in Large Language Models ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉(hallucination)检测的不确定性估计问题。现有方法通常依赖模型内部状态进行不确定性建模,但受限于对隐藏状态跨层演化方式的严格假设,且仅关注最后一层或平均 token 的特征,导致信息丢失和泛化能力不足。其解决方案的关键在于提出一种名为“序列内部方差表示”(Sequential Internal Variance Representation, SIVR)的监督式检测框架,该框架基于 token 级别和 layer 级别的隐藏状态特征,通过捕捉不同层间内部表示的分散程度(即方差)来建模不确定性,避免了对特定演化模式的假设,具备模型与任务无关性;同时,SIVR 聚合整个序列的 per-token 方差特征,学习时序模式以识别事实性错误,从而有效防止信息损失。实验表明,SIVR 在多个基准上均优于强基线,并展现出更强的泛化能力及对大规模训练数据的低依赖性,具有良好的实际部署潜力。
链接: https://arxiv.org/abs/2604.15741
作者: Ponhvoan Srey,Xiaobao Wu,Cong-Duy Nguyen,Anh Tuan Luu
机构: Nanyang Technological University (南洋理工大学); Shanghai Jiao Tong University (上海交通大学); Centre for AI Research, VinUniversity (Vin大学人工智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 (Main Conference)
Abstract:Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at this https URL.
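SIVR 的核心特征是“同一 token 的隐藏表示在各层之间的分散程度”。下面的玩具实现按维度计算跨层方差并对维度取平均;真实隐藏状态来自 LLM,这里用手工列表代替,属于对论文特征构造的简化假设。

```python
# Sketch of the core SIVR feature: per-token dispersion of hidden states
# across layers, yielding one variance feature per token in the sequence.

def layer_variance(per_layer_vecs):
    """Mean per-dimension variance of one token's hidden states across layers."""
    n_layers, dim = len(per_layer_vecs), len(per_layer_vecs[0])
    total = 0.0
    for d in range(dim):
        vals = [v[d] for v in per_layer_vecs]
        mean = sum(vals) / n_layers
        total += sum((x - mean) ** 2 for x in vals) / n_layers
    return total / dim

def sivr_features(hidden):  # hidden[layer][token][dim]
    n_tokens = len(hidden[0])
    return [layer_variance([layer[t] for layer in hidden]) for t in range(n_tokens)]

# 2 layers, 2 tokens, 2 dims: token 0 is stable across layers, token 1 drifts.
hidden = [
    [[1.0, 1.0], [0.0, 0.0]],   # layer 0
    [[1.0, 1.0], [2.0, 2.0]],   # layer 1
]
feats = sivr_features(hidden)
print(feats)  # token 1 shows higher cross-layer dispersion
```

论文在此基础上还要把整条序列的逐 token 特征交给监督分类器学习时序模式,这一步未在草图中体现。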
[NLP-43] RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在专业化、规则驱动的决策任务中能力不足的问题,特别是其在体育裁判场景下的适用性尚未得到充分探索。解决方案的关键在于构建首个大规模基准测试平台 RefereeBench,该平台涵盖11项体育运动、925段精心筛选的视频及6,475个问答对,全面评估MLLMs在犯规识别、判罚分类、规则推理、实体感知和时间定位等五个核心裁判能力上的表现。通过高质量的人工标注确保标注结果符合真实判罚逻辑与多模态证据,实验表明现有最强模型准确率仅约60%,凸显了当前模型在规则应用与时空定位上的显著短板,从而推动未来MLLMs在领域知识融合与多模态决策能力上的改进。
链接: https://arxiv.org/abs/2604.15736
作者: Yichen Xu,Yuanhang Liu,Chuhan Wang,Zihan Zhao,Jinghan Luo,Jianzhe Ma,Wenxuan Wang,Qin Jin
机构: Renmin University of China (中国人民大学); Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Work in Progress
Abstract:While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.
[NLP-44] GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
【速读】: 该论文旨在解决当前通用工具代理(General Tool Agents, GTA)评估体系与真实世界生产力工作流需求脱节的问题,即现有工具使用基准依赖AI生成的查询、虚拟工具和有限的系统级协调,无法有效衡量模型在复杂、开放任务中的端到端执行能力。其解决方案的关键在于提出GTA-2这一分层基准:一方面继承原有GTA基准,通过GTA-Atomic评估原子级工具调用的精确性;另一方面引入GTA-Workflow,模拟长周期、开放式的真实任务流程,并设计基于递归检查点的评估机制,将目标分解为可验证的子目标,从而统一评估模型能力与代理执行框架(execution harnesses)的表现。实验证明,前沿模型在原子任务中已表现不佳(低于50%),而在工作流任务中成功率仅为14.39%,凸显执行框架设计对整体性能的重要性。
链接: https://arxiv.org/abs/2604.15715
作者: Jize Wang,Xuanxuan Liu,Yining Li,Songyang Zhang,Yijun Wang,Zifei Shan,Xinyi Le,Cailian Chen,Xinping Guan,Dacheng Tao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at this https URL.
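摘要中的“递归检查点评估”可以理解为:把目标分解成子目标树,得分是叶子检查点的通过比例沿树向上平均。下面用字符串匹配充当检查点谓词做一个示意;GTA-2 的真实检查点要复杂得多。

```python
# Sketch of recursive checkpoint-based evaluation: an open-ended objective is
# a tree of verifiable sub-goals; the score averages child scores recursively.
# Checkpoint predicates here are trivial string checks for illustration.

def score(checkpoint, deliverable):
    kids = checkpoint.get("subgoals")
    if not kids:                               # leaf: run its verifier
        return 1.0 if checkpoint["check"](deliverable) else 0.0
    return sum(score(k, deliverable) for k in kids) / len(kids)

task = {"subgoals": [
    {"check": lambda d: "summary" in d},
    {"subgoals": [
        {"check": lambda d: "chart" in d},
        {"check": lambda d: "sources" in d},
    ]},
]}
print(score(task, "report with summary and sources"))  # 0.75
```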
[NLP-45] Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
【速读】: 该论文旨在解决目标导向的语言模型(Language Model, LM)预训练中数据选择效率低下的问题,即如何从大规模语料库中高效筛选出对特定目标任务最具价值的训练样本。传统方法依赖黑箱表示或随机采样,缺乏可解释性且效果有限。其解决方案的关键在于提出一种无需训练、可解释的目标数据选择框架——基于神经元激活图(Neuron-Activated Graph, NAG)的排序机制:通过识别任意现成语言模型中对目标输入具有高影响力的稀疏神经元集合构建NAG,并以NAG相似度对候选数据进行排序,从而精准选出与目标任务高度相关的训练样本。实验表明,该方法在多个基准测试上显著优于随机采样和现有最优基线,且通过对神经元功能重要性的分析验证了NAG捕捉到的是学习目标特征的稀疏“功能骨干”。
链接: https://arxiv.org/abs/2604.15706
作者: Zijun Wang,Haoqin Tu,Weidong Zhou,Yiyang Zhou,Xiaohuan Zhou,Bingni Zhang,Weiguo Feng,Taifeng Wang,Cihang Xie,Fengze Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLMs. Concretely, we quantify neuron impact and select the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling, and also outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis on why and how our NAG works, e.g., deactivating NAG-selected neurons (only 0.12% of all) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse “functional backbone” for learning target features. We release the code at this https URL.
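NAG 排序的骨架是“用高影响力神经元集合刻画目标,再按候选数据激活集合与目标集合的相似度排序”。下面把影响力分数直接给出、用 Jaccard 相似度代替图相似度做一个示意;论文中的分数来自现成 LLM,图结构也比平坦集合更丰富。

```python
# Toy sketch of NAG-style ranking: characterize a target by its top-impact
# neurons, then rank candidates by overlap of their activated neuron sets.
# Impact scores are given directly here; the real graph structure is richer.

def top_neurons(impact, k):
    return set(sorted(impact, key=impact.get, reverse=True)[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

target_impact = {"n1": 0.9, "n2": 0.7, "n3": 0.1, "n4": 0.05}
target_set = top_neurons(target_impact, k=2)          # {'n1', 'n2'}

candidates = {
    "doc_math":  {"n1", "n2", "n3"},
    "doc_other": {"n4", "n5"},
}
ranked = sorted(candidates, key=lambda d: jaccard(candidates[d], target_set),
                reverse=True)
print(ranked)  # the candidate overlapping the target's neurons ranks first
```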
[NLP-46] The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring NEURIPS2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在元认知(metacognition)能力评估中缺乏系统性、跨领域行为测量工具的问题。传统评估方法难以量化模型对自身判断准确性的监控与调节能力,导致对LLMs“自我意识”水平的理解碎片化。解决方案的关键在于构建一个基于Nelson和Narens(1990)元认知框架的跨域行为测试电池(battery),涵盖六个认知领域(学习、元认知校准、社会认知、注意力、执行功能、前瞻调控),并引入人类心理测量学中的双探针设计(dual probes),通过强制选择后要求模型“保留”或“撤回”答案以及“下注”或“放弃”,以计算关键指标——“撤回差异值”(withdraw delta),从而识别出三种符合元认知架构的行为模式:普遍自信型、普遍撤回型和选择敏感型。此方法实现了对LLM元认知能力的结构化、可比较的量化分析,并揭示了准确性与元认知敏感性之间的倒置关系,为后续模型开发提供了可操作的评估基准。
链接: https://arxiv.org/abs/2604.15702
作者: Jon-Paul Cacioli
机构: Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 6 figures, 3 tables. Submitted to NeurIPS 2026 Evaluations and Datasets Track. Code, data, and Croissant metadata: this https URL
Abstract:We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1-T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson-Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar-based evidence is the primary support). Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type-2 SDT approach, providing preliminary cross-method construct validity. All items, data, and code: this https URL.
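摘要中的关键指标 withdraw delta 定义很直接:错误项上的撤回率减去正确项上的撤回率,正值表示模型有选择地撤回错误答案。示意如下(记录格式为本文假设):

```python
# The withdraw delta: withdrawal rate on incorrect items minus the rate on
# correct items. A positive delta indicates selective metacognitive sensitivity.

def withdraw_delta(records):
    """records: list of (was_correct: bool, withdrew: bool)."""
    wrong = [w for c, w in records if not c]
    right = [w for c, w in records if c]
    rate = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return rate(wrong) - rate(right)

# 4 items: model withdraws both wrong answers, keeps both correct ones.
records = [(True, False), (True, False), (False, True), (False, True)]
print(withdraw_delta(records))  # 1.0, perfectly selective
```

摘要里的“普遍撤回型”对应 wrong 与 right 的撤回率同高、delta 接近 0 的情形,“普遍自信型”则是两者同低。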
[NLP-47] Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information EMNLP2025
【速读】: 该论文旨在解决当前Chain-of-Thought (CoT)蒸馏方法在将大语言模型(教师模型)的推理能力迁移至小模型(学生模型)时,未能充分挖掘和利用教师模型在推理过程中对关键信息的动态注意力变化的问题。现有方法主要关注传递教师生成的推理链文本,而忽略了其注意力机制中逐步聚焦于核心信息的演化过程,这限制了学生模型对推理逻辑的结构化学习。解决方案的关键在于提出一种新型CoT蒸馏框架,通过显式转移教师模型在每一步推理中对关键信息的注意力分布,引导学生模型实现渐进式的注意力集中;同时设计Mixture of Layers模块以实现教师与学生模型不同层之间的动态对齐,从而提升小模型在数学和常识推理任务上的性能表现。
链接: https://arxiv.org/abs/2604.15701
作者: Yao Chen,Jiawei Sheng,Wenyuan Zhang,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025
Abstract:The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers’ dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher’s stepwise attention on key information to the student model. This establishes structured guidance for the student’s progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.
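“逐步注意力迁移”的蒸馏信号可以粗略理解为:在每个推理步上惩罚学生与教师在关键 token 上的注意力差距。下面是一个把 Mixture of Layers 对齐折叠为固定层配对的玩具损失;注意力行为手工概率向量,与论文实现仅为思路上的对应。

```python
# Loose sketch of the distillation signal: per-step squared gap between
# teacher and student attention over key-information tokens, averaged over
# reasoning steps. The layer-alignment module is collapsed away here.

def attention_loss(teacher_steps, student_steps, key_idx):
    """Mean squared gap on key-token attention, averaged over steps."""
    loss = 0.0
    for t_row, s_row in zip(teacher_steps, student_steps):
        loss += sum((t_row[i] - s_row[i]) ** 2 for i in key_idx) / len(key_idx)
    return loss / len(teacher_steps)

teacher = [[0.1, 0.7, 0.2], [0.05, 0.15, 0.8]]   # attention shifts to token 2
aligned = [[0.1, 0.7, 0.2], [0.05, 0.15, 0.8]]
drifted = [[0.6, 0.2, 0.2], [0.7, 0.2, 0.1]]
key_idx = [1, 2]
print(attention_loss(teacher, aligned, key_idx) <
      attention_loss(teacher, drifted, key_idx))  # True
```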
[NLP-48] Preference Estimation via Opponent Modeling in Multi-Agent Negotiation ACL2026
【速读】: 该论文旨在解决多参与方、多议题自动化谈判中因传统仅依赖数值信息的对手建模方法无法捕捉自然语言交互中的定性信息,从而导致偏好估计不稳定和不完整的问题。其解决方案的关键在于提出一种将自然语言信息整合进结构化贝叶斯对手建模框架的新方法:利用大语言模型(Large Language Models, LLMs)从对话中提取定性线索,并将其转化为概率形式以实现动态信念追踪,从而在偏好估计中融合概率推理与自然语言理解能力。
链接: https://arxiv.org/abs/2604.15687
作者: Yuta Konishi,Kento Yamamoto,Eisuke Sonomoto,Rikuho Takeda,Ryo Furukawa,Yusuke Muraki,Takafumi Shimizu,Kazuma Fukumura,Yuya Kanemoto,Takayuki Ito,Shiyao Ding
机构: Kyoto University (京都大学); Accenture Japan Ltd (埃森哲日本有限公司)
类目: Computation and Language (cs.CL)
备注: This paper is accepted as a Findings of ACL 2026
Abstract:Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLMs) enable rich semantic understanding of utterances, it remains challenging to quantitatively incorporate such information into a consistent opponent modeling. To tackle this issue, we propose a novel preference estimation method integrating natural language information into a structured Bayesian opponent modeling framework. Our approach leverages LLMs to extract qualitative cues from utterances and converts them into probabilistic formats for dynamic belief tracking. Experimental results on a multi-party benchmark demonstrate that our framework improves the full agreement rate and preference estimation accuracy by integrating probabilistic reasoning with natural language understanding.
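框架中“把定性线索转成概率形式、再做贝叶斯信念更新”这一步,可以用标准的乘法更新加归一化来示意。论文里似然由 LLM 从话语中抽取,这里手工给定;假设空间与数值均为虚构。

```python
# Minimal sketch of Bayesian belief tracking over an opponent's preferences:
# an utterance is converted (by an LLM in the paper; hand-coded here) into a
# likelihood over hypotheses, and the posterior is updated multiplicatively.

def bayes_update(prior, likelihood):
    posterior = {h: prior[h] * likelihood.get(h, 1.0) for h in prior}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

# Two hypotheses about what the opponent values most.
belief = {"cares_about_price": 0.5, "cares_about_deadline": 0.5}

# Utterance "We really can't slip the schedule" -> cue favoring the deadline
# hypothesis, expressed as a likelihood.
cue = {"cares_about_price": 0.2, "cares_about_deadline": 0.8}
belief = bayes_update(belief, cue)
print(max(belief, key=belief.get))  # the deadline hypothesis now dominates
```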
[NLP-49] C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中文化对齐(cultural alignment)依赖合成数据生成时,种子数据(seed curation)选取缺乏量化标准的问题。现有方法主要依赖人工标注或由LLM提取,存在可扩展性差和引入偏见的风险,且将文化特异性视为抽象概念而非可测量信号。解决方案的关键在于提出C-Mining框架,其核心创新是利用预训练嵌入空间中跨语言文化概念的错位(cross-lingual misalignment)作为可计算的发现信号,通过几何洞察识别具有显著语言排他性和几何隔离性的区域,从而自动从多语种原始语料中挖掘高保真文化点(Culture Points, CPs),无需人工或LLM监督,使准备成本降低超过150倍,并显著提升文化理解与推理能力。
链接: https://arxiv.org/abs/2604.15675
作者: Pufan Zeng,Yilun Liu,Mingchen Dai,Mengyao Piao,Chunguang Zhao,Lingqi Miao,Shimin Tao,Weibin Meng,Minggui He,Chenxin Liu,Zhenzhen Qin,Li Zhang,Hongxia Ma,Boxing Chen,Daimeng Wei
机构: Huawei China; Huawei Canada; University of Science and Technology of China, Hefei China
类目: Computation and Language (cs.CL)
备注:
Abstract:Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this “quantification gap” by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.
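C-Mining 的发现信号是“同一概念在不同语言嵌入空间中的几何错位”。下面用手工三维向量与余弦相似度做一个最简示意:跨语言相似度低于阈值的概念被标为候选 Culture Point。向量、阈值与概念都是虚构的;论文还结合了排他性与隔离性等更多几何判据。

```python
# Toy sketch of C-Mining's discovery signal: a concept whose embeddings in two
# languages are geometrically misaligned (low cross-lingual cosine similarity)
# is flagged as a candidate Culture Point.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def culture_points(pairs, threshold=0.5):
    """Keep concepts whose cross-lingual similarity falls below threshold."""
    return [c for c, (en, zh) in pairs.items() if cosine(en, zh) < threshold]

pairs = {
    # "bread" aligns well across languages; a culture-bound dish does not.
    "bread":  ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "zongzi": ([1.0, 0.0, 0.0], [0.0, 0.2, 1.0]),
}
print(culture_points(pairs))  # ['zongzi']
```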
[NLP-50] Faster LLM Inference via Sequential Monte Carlo
【速读】: 该论文旨在解决生成式 AI(Generative AI)中语言模型(Language Model, LM)推理速度慢的问题,尤其是在采用推测解码(Speculative Decoding, SD)时因令牌级拒绝导致的吞吐量下降问题。其核心挑战在于:当提议模型(proposal model)与目标模型(target model)产生分歧时,传统SD方法会直接截断整个草案块,造成计算资源浪费和性能瓶颈。解决方案的关键是引入序贯蒙特卡洛推测解码(Sequential Monte Carlo Speculative Decoding, SMC-SD),它将传统的逐令牌拒绝机制替换为基于重要性加权的粒子重采样策略,从而在保持理论近似误差边界的同时,利用闲置算力将验证过程转化为固定大小的向量化操作,无需回滚。该方法在不牺牲精度的前提下显著提升了推理效率,在多个基准测试中实现了比标准SD快2.36倍、比自回归解码快5.2倍的加速效果。
链接: https://arxiv.org/abs/2604.15672
作者: Yahya Emara,Mauricio Barba da Costa,Chi-Chih Chang,Cameron Freer,Tim Vieira,Ryan Cotterell,Mohamed S. Abdelfattah
机构: Cornell University (康奈尔大学); Makora; MIT (麻省理工学院); ETH Zürich (苏黎世联邦理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free – SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves 2.36x speed-up over speculative decoding and a 5.2x speed-up over autoregressive decoding, while remaining within 3% of the target model’s accuracy on reasoning, instruction-following, and coding benchmarks.
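SMC-SD 用来替代逐 token 拒绝的重采样步骤,可以用标准的多项式重采样示意:每个草案粒子的权重是 p_target/q_draft,按归一化权重重采样。下面的分布数值是玩具设定,真实权重来自模型 logits。

```python
# Sketch of the resampling step SMC-SD substitutes for token-level rejection:
# each draft particle carries an importance weight p_target/q_draft, and the
# population is resampled in proportion to those weights.

import random

def resample(particles, n, rng):
    """Multinomial resampling by normalized importance weight."""
    tokens = [t for t, _ in particles]
    weights = [w for _, w in particles]
    z = sum(weights)
    probs = [w / z for w in weights]
    return rng.choices(tokens, weights=probs, k=n)

# Draft model proposed three continuations; the target model favors "cat".
# weight = p_target(token) / q_draft(token)
particles = [("cat", 0.9 / 0.3), ("dog", 0.05 / 0.4), ("car", 0.05 / 0.3)]
rng = random.Random(0)
survivors = resample(particles, n=100, rng=rng)
print(survivors.count("cat") > survivors.count("dog"))  # "cat" dominates
```

与逐 token 拒绝不同,这一操作对整块草案是定长、可向量化的,这正是摘要中“无需回滚”的来源。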
[NLP-51] HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在超图(hypergraph)理解与推理能力方面缺乏系统评估基准的问题。当前LVLMs虽在复杂拓扑结构理解上展现出潜力,但其对超图的处理能力尚不明确,限制了其在生命科学、社交网络等实际场景中的应用拓展。解决方案的关键在于提出首个专门针对LVLMs超图理解能力的基准测试集——\textttHyperGVL,涵盖84,000个跨12项任务的视觉-语言问答样本,涉及从基础组件计数到NP难问题推理的多层次挑战;同时引入通用可迁移的路由机制 \textttWiseHyGR,通过学习自适应的文本与视觉超图表示来增强LVLMs在超图任务上的表现,从而为LVLMs与超图分析的融合提供量化评估工具和有效增强策略。
链接: https://arxiv.org/abs/2604.15648
作者: Yanbin Wei,Chun Kang,Siwei Li,Haoxuan Che,Yang Chen,Hua Liu,Jian Liu,Zhuang Liu,Can Ouyang,Fei Xing,Lei Sha,Rui Liu,Yu Zhang,James Kwok
机构: Southern University of Science and Technology (南方科技大学); Hong Kong University of Science and Technology (香港科技大学); Huawei Research (华为研究); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review; Opensource after accepted
Abstract:Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there remains a lack of a benchmark to delineate the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, in this paper, we introduce HyperGVL, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. HyperGVL provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router WiseHyGR that improves LVLMs on hypergraph tasks via learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.
[NLP-52] CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
【速读】: 该论文旨在解决如何有效衡量公共 deliberation(审议)质量的问题,传统方法仅关注言论的礼貌性或论证结构,而忽略了对话中信息进展的评估。其核心挑战在于量化每个发言对集体理解目标议题的贡献程度。解决方案的关键在于提出一种名为 Conversational Information Gain (CIG) 的框架,通过构建动态演化的语义记忆系统来捕捉讨论中的信息演化:系统从话语中提取原子主张(atomic claims),并逐步整合为结构化的记忆状态;在此基础上,以新颖性(Novelty)、相关性(Relevance)和蕴含范围(Implication Scope)三个可解释维度对每条发言进行评分。实验表明,基于该记忆机制的信息动态指标比传统启发式方法(如句长或TF-IDF)更能预测人类感知的CIG,从而为基于信息视角的对话质量分析提供了可扩展的LLM驱动方案。
链接: https://arxiv.org/abs/2604.15647
作者: Ming-Bin Chen,Jey Han Lau,Lea Frermann
机构: The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 5 figures
Abstract:Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize CIG, we model an evolving semantic memory of the discussion: the system extracts atomic claims from utterances and incrementally consolidates them into a structured memory state. Using this memory, we score each utterance along three interpretable dimensions: Novelty, Relevance, and Implication Scope. We annotate 80 segments from two moderated deliberative settings (TV debates and community discussions) with these dimensions and show that memory-derived dynamics (e.g., the number of claim updates) correlate more strongly with human-perceived CIG than traditional heuristics such as utterance length or TF–IDF. We develop effective LLM-based CIG predictors paving the way for information-focused conversation quality analysis in dialogues and deliberative success.
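摘要中"提取原子主张并增量并入记忆、以更新次数作为信息增益信号"的流程,可用如下极简草图说明(纯属示意:此处用词集合的 Jaccard 相似度代替论文的语义记忆与 LLM 打分,`novelty`、`consolidate` 等均为假设命名):

```python
def novelty(claim, memory):
    """示意性新颖性得分: 1 - 与已存主张的最大词重叠率 (Jaccard)。"""
    words = set(claim.lower().split())
    best = max((len(words & set(m.lower().split()))
                / len(words | set(m.lower().split())) for m in memory),
               default=0.0)
    return 1.0 - best

def consolidate(claims, memory, threshold=0.5):
    """把足够新颖的主张并入记忆, 返回更新次数作为信息增益信号。"""
    updates = 0
    for claim in claims:
        if novelty(claim, memory) >= threshold:
            memory.append(claim)
            updates += 1
    return updates

memory = ["the policy reduces emissions"]
gain = consolidate(["the policy reduces emissions",   # 重复主张, 不计增益
                    "transit funding doubled"],        # 新主张, 记忆更新一次
                   memory)
```

真实系统中的 Novelty、Relevance、Implication Scope 由 LLM 在结构化记忆状态上打分,此处仅演示"记忆更新次数与信息增益相关"这一核心直觉。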
[NLP-53] FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
【速读】: 该论文旨在解决临床研究人员在探索肿瘤学试验数据库时,因缺乏SQL编写能力而难以执行多约束条件查询(如生物标志物、终点指标、干预措施和时间等)的问题。其核心解决方案是提出FD-NL2SQL——一个基于反馈驱动的自然语言转SQL(Natural Language to SQL, NL2SQL)助手,关键在于:首先利用语义感知的大语言模型(Large Language Model, LLM)将复杂自然语言问题分解为可执行的谓词级子问题;其次通过句子嵌入检索专家验证过的NL2SQL示例;再结合模式信息与检索结果合成SQL,并进行有效性后处理校验;此外,系统通过两种机制实现持续改进:一是接受临床医生对生成SQL的编辑并更新示例库;二是基于轻量逻辑的SQL原子变异(如操作符或列名变更),仅保留返回非空结果的变体,并由第二阶段LLM自动生成对应自然语言问题及谓词分解,从而无需额外标注即可扩展示例库,最终支持交互式优化与迭代学习。
链接: https://arxiv.org/abs/2604.15646
作者: Suparno Roy Chowdhury,Tejas Anvekar,Manan Roy Choudhury,Muhammad Ali Khan,Kaneez Zahra Rubab Khakwani,Mohamad Bassam Sonbol,Irbaz Bin Riaz,Vivek Gupta
机构: Arizona State University; Mayo Clinic
类目: Computation and Language (cs.CL)
备注:
Abstract:Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
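摘要中的"单次原子变异 + 仅保留非空结果"增强逻辑,可用 Python 标准库 sqlite3 做一个可运行的小示意(表结构与变异列表均为本文虚构,仅演示保留规则):

```python
import sqlite3

# 虚构的迷你表, 代替真实的肿瘤学试验库
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (id INTEGER, phase INTEGER, biomarker TEXT)")
conn.executemany("INSERT INTO trials VALUES (?, ?, ?)",
                 [(1, 2, "EGFR"), (2, 3, "EGFR"), (3, 3, "KRAS")])

def augment(sql, mutations):
    """对 SQL 逐一施加单次原子变异, 仅保留返回非空结果的变体。"""
    kept = []
    for old, new in mutations:
        if old not in sql:
            continue
        variant = sql.replace(old, new, 1)        # 一次只改一个原子位置
        if conn.execute(variant).fetchall():      # 空结果的变体被丢弃
            kept.append(variant)
    return kept

base = "SELECT id FROM trials WHERE phase = 3"
variants = augment(base, [("=", ">="),   # 运算符变异 -> 有结果, 保留
                          ("3", "99")])  # 常量变异 -> 结果为空, 丢弃
```

被保留的变体随后才交给第二个 LLM 反向生成自然语言问题与谓词分解,从而无需人工标注即可扩充示例库。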
[NLP-54] GroupDPO: Memory efficient Group-wise Direct Preference Optimization
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)偏好优化中效率与性能不足的问题,即大多数现有方法仅使用每个提示(prompt)的一对正负样本进行训练,忽略了偏好数据集中通常包含的多个候选响应所蕴含的额外监督信息。为克服这一局限,论文提出一种内存高效的组级偏好优化算法,其核心创新在于在反向传播过程中解耦样本,同时保留梯度信息,从而显著降低峰值内存消耗,使更大规模的组内对比成为可能。该方案不仅提升了训练的可扩展性,还在离线和在线对齐设置下均证明了利用多响应信息能持续优于单对训练,并强调在正样本上加入负对数似然(Negative Log-Likelihood, NLL)项对于性能提升和训练稳定性至关重要。
链接: https://arxiv.org/abs/2604.15602
作者: Jixuan Leng,Si Si,Hsiang-Fu Yu,Vinod Raman,Inderjit S. Dhillon
机构: CMU; Google Deepmind; Google
类目: Computation and Language (cs.CL)
备注:
Abstract:Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
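组级偏好优化的目标函数本质上是对同一 prompt 的所有"优/劣响应对"求平均的 DPO 损失,再叠加对正样本的 NLL 项;下面给出一个数值层面的示意实现(非论文的显存解耦实现,函数名与系数均为假设):

```python
import math
from itertools import product

def log_sigmoid(x):
    """数值稳定的 log(sigmoid(x))。"""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def group_dpo_loss(chosen_lp, rejected_lp, ref_chosen_lp, ref_rejected_lp,
                   beta=0.1, nll_coef=0.1):
    """对一组响应的所有 优/劣 对求平均 DPO 损失, 并加正样本 NLL 项。

    *_lp: 各响应在当前策略下的总对数概率; ref_*: 冻结参考模型下的对数概率。
    """
    pair_losses = []
    for (c, rc), (r, rr) in product(zip(chosen_lp, ref_chosen_lp),
                                    zip(rejected_lp, ref_rejected_lp)):
        margin = beta * ((c - rc) - (r - rr))     # 隐式奖励差
        pair_losses.append(-log_sigmoid(margin))
    nll = -sum(chosen_lp) / len(chosen_lp)        # 论文强调该项对稳定性关键
    return sum(pair_losses) / len(pair_losses) + nll_coef * nll

# 用法示意: 策略把优样本概率拉高、劣样本压低时, 损失应更小
better = group_dpo_loss([-1.0], [-5.0], [-2.0], [-2.0])
tied = group_dpo_loss([-2.0], [-2.0], [-2.0], [-2.0])
```

论文的贡献在于对这类组耦合目标做反向传播时解耦样本以降低峰值显存,上述草图只还原损失的数学形式。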
[NLP-55] DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中因将异构知识压缩至单一参数空间而导致的跨领域事实干扰问题。其解决方案的关键在于提出一种域代数语言模型(Domain-Algebraic Language Model, DALM),通过引入结构化的去噪机制替代传统的无约束词元生成,具体包括三个阶段:首先解析域不确定性,再处理关系不确定性,最后解决概念不确定性,每一阶段均受显式的代数约束。该框架依赖三个核心要素:可计算交、并和蕴含运算的域格(domain lattice)、控制跨域继承的关系类型函数(typing function over relations),以及将知识局部化到特定域子集的纤维划分(fiber partition)。由此构建的三阶段编码器-解码器架构,在封闭词汇模式下可结构性地防止跨域污染,在开放词汇模式下则能审计性地限制污染范围,并支持单个查询输出域索引的多视角答案空间。
链接: https://arxiv.org/abs/2604.15593
作者: Chao Li
机构: Deepleap.ai
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it first resolves domain uncertainty, then relation uncertainty, and finally concept uncertainty, so each stage operates under explicit algebraic constraints. The framework requires only three ingredients: a lattice of domains with computable meet, join, and implication; a typing function over relations that controls inheritance across domains; and a fiber partition that localizes knowledge to domain-specific subsets. Given these ingredients, DALM yields a three-phase encoder-decoder architecture in which generation is confined to a domain fiber, cross-domain contamination is structurally prevented in closed-vocabulary mode and auditably bounded in open-vocabulary mode, and a single query can produce a domain-indexed multi-perspective answer space. We instantiate the framework with the CDC knowledge representation system and outline training and evaluation on validated domain-annotated crystal libraries. DALM reframes language generation as algebraically constrained structured denoising rather than unconstrained decoding over a flat token space.
[NLP-56] LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance
【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)在自动化代码合规性任务中,因将模型视为黑箱而忽视训练决策如何影响其解释行为的问题。解决方案的关键在于采用基于扰动的归因分析(perturbation-based attribution analysis),系统比较不同微调策略(包括全量微调(Full Fine-Tuning, FFT)、低秩适应(Low-Rank Adaptation, LoRA)及量化LoRA微调)以及模型规模变化对LLM解释行为的影响。研究发现,FFT生成的归因模式更具统计显著性和聚焦性,且随着模型规模增大,LLM会发展出特定的解释策略(如优先关注数值约束和规则标识符),尽管超过70亿参数后生成规则与参考规则之间的语义相似度提升趋于饱和。这一方法为提升LLM在建筑、工程与施工领域等依赖法规的任务中的可解释性提供了关键洞见。
链接: https://arxiv.org/abs/2604.15589
作者: Jack Wei Lun Shi,Minghao Dang,Wawan Solihin,Justin K.W. Yeoh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 9 figures. Accepted at ICCCBE 2026 (International Conference on Computing in Civil and Building Engineering)
Abstract:Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.
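论文采用的"扰动式归因"家族,其最简单的形式是逐 token 遮挡并观察模型得分的下降幅度;如下草图仅演示该思路本身(`score_fn` 为任意可调用的打分函数,示例中的计数打分纯属虚构):

```python
def occlusion_attribution(score_fn, tokens, mask="[MASK]"):
    """逐个遮挡 token, 以得分下降量作为该 token 的归因值。"""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + [mask] + tokens[i + 1:])
            for i in range(len(tokens))]

# 虚构打分函数: 文本中 "good" 的出现次数
score_fn = lambda ts: float(ts.count("good"))
attributions = occlusion_attribution(score_fn, ["a", "good", "movie"])
```

对不同微调策略的模型各自计算这类归因分布后,即可像论文那样统计比较其"解释行为"的差异。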
[NLP-57] Why Fine-Tuning Encourages Hallucinations and How to Fix It
【速读】: 该论文旨在解决大语言模型在监督微调(Supervised Fine-Tuning, SFT)过程中因引入新事实信息而导致的幻觉问题,即模型在学习新知识时会干扰或覆盖预训练阶段获得的已有知识,从而产生事实性错误。其解决方案的关键在于通过自蒸馏(self-distillation)机制来最小化输出分布漂移(output-distribution drift),从而在有效学习新知识的同时保持对原有知识的稳定性;实验进一步表明,SFT-induced hallucinations的主要驱动因素是语义表示重叠区域中的局部干扰(localized interference),而自蒸馏方法正是通过缓解这种干扰来实现效果提升。
链接: https://arxiv.org/abs/2604.15574
作者: Guy Kaplan,Zorik Gekhman,Zhen Zhu,Lotem Rozner,Yuval Reif,Swabha Swayamdipta,Derek Hoiem,Roy Schwartz
机构: Hebrew University of Jerusalem (耶路撒冷希伯来大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Technion – Israel Institute of Technology (以色列理工学院); University of Southern California (南加州大学); Google DeepMind (谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups, can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
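摘要中"通过正则化输出分布漂移来抑制幻觉"的自蒸馏思路,可归结为在新知识的交叉熵之外加一项对旧模型输出分布的 KL 惩罚;下述草图仅演示单个 token 位置上的损失组合(系数与函数名均为假设):

```python
import math

def kl_divergence(p, q):
    """离散分布的 KL(p || q)。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distilled_sft_loss(student_probs, target_idx, teacher_probs, alpha=0.5):
    """新事实的交叉熵 + 向微调前教师分布的漂移惩罚。

    student_probs: 当前模型在词表上的概率分布
    teacher_probs: 冻结的微调前模型在同一位置的分布
    """
    ce = -math.log(student_probs[target_idx])            # 学习新知识
    drift = kl_divergence(student_probs, teacher_probs)  # 抑制输出分布漂移
    return (1 - alpha) * ce + alpha * drift

# 学生分布与教师完全一致时, 漂移项为零, 损失只剩交叉熵部分
p = [0.7, 0.2, 0.1]
loss_no_drift = distilled_sft_loss(p, 0, p)
```

该惩罚项正是论文所说"在有效学习新事实的同时最小化对既有知识的干扰"的机制所在。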
[NLP-58] Predicting Where Steering Vectors Succeed
【速读】: 该论文旨在解决生成式 AI(Generative AI)中概念操控(concept steering)的不可预测性问题,即现有基于差异均值的引导向量(steering vector)在不同模型层和概念上的有效性难以事先判断,导致实践者无法提前确定最优干预位置。解决方案的关键在于提出线性可访问性轮廓(Linear Accessibility Profile, LAP),其核心指标 $ A_{\mathrm{lin}} $ 通过将模型的未嵌入矩阵(unembedding matrix)作用于中间隐藏状态来评估每层的引导潜力,无需额外训练。实验证明,$ A_{\mathrm{lin}} $ 在多个模型和概念家族上能高度准确地预测引导效果(相关系数 ρ=+0.86 至 +0.91)及最佳干预层(ρ=+0.63 至 +0.92),并构建了三阶段框架以解释不同方法适用条件,最终在实体引导演示中验证了LAP推荐层的有效性。
链接: https://arxiv.org/abs/2604.15557
作者: Jayadev Billa
机构: ISI@USC; Yahoo; Nuance; BBN
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 19 pages, incl. 10 appendix pages, 4 figures, 20 tables
Abstract:Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, A_{\mathrm{lin}}, applies the model’s unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak A_{\mathrm{lin}} predicts steering effectiveness at \rho = +0.86 to +0.91 and layer selection at \rho = +0.63 to +0.92. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.
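A_lin 的核心操作就是 "logit lens":把中间层隐状态直接乘以 unembedding 矩阵。下面用纯 Python 点积示意这一投影,以及一个极简的"概念 logit 边际"式可达性打分(该打分只是对论文度量的假设性简化):

```python
def logit_lens(hidden, unembed):
    """把中间隐状态投影到词表 logit 空间 (无需任何训练)。

    hidden: d 维隐状态; unembed: vocab x d 的 unembedding 矩阵。
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def linear_accessibility(hidden_pos, hidden_neg, unembed, token_id):
    """示意性 A_lin: 概念 token 在正/负例隐状态下的 logit 边际。"""
    pos_logit = logit_lens(hidden_pos, unembed)[token_id]
    neg_logit = logit_lens(hidden_neg, unembed)[token_id]
    return pos_logit - neg_logit

# 2 维玩具词表: 单位 unembedding 矩阵下, logit 即隐状态本身
U = [[1.0, 0.0], [0.0, 1.0]]
score = linear_accessibility([2.0, 0.0], [0.0, 1.0], U, token_id=0)
```

逐层计算该得分即可画出论文所说的"线性可访问性轮廓",用峰值层指导干预位置的选取。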
[NLP-59] Consistency Analysis of Sentiment Predictions using Syntactic Semantic Context Assessment Summarization (SSAS)
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在企业级情感分析任务中因固有随机性(generative, non-deterministic nature)与分析需求的一致性之间存在的冲突问题,尤其是在噪声复杂的现代数据集上,LLMs生成的情感预测结果波动剧烈,难以支撑战略决策。解决方案的关键在于提出一种语法-语义上下文评估摘要框架(Syntactic Semantic Context Assessment Summarization, SSAS),其核心机制是通过分层分类结构(主题、故事、聚类)和基于迭代摘要-摘要(Summary-of-Summaries, SoS)的上下文计算架构,对原始文本进行高信号、情感密集的预处理,从而约束LLMs的注意力范围,有效降低无关信息干扰并减少分析方差,显著提升数据质量与预测一致性。
链接: https://arxiv.org/abs/2604.15547
作者: Sharookh Daruwalla,Nitin Mayande,Shreeya Verma Kathuria,Nitin Joglekar,Charles Weber
机构: Tellagence Inc.(Tellagence公司); Villanova School of Business, Villanova University(维拉诺瓦大学商学院); Maseeh College of Engineering and Computer Science, Portland State University(波特兰州立大学工程与计算机科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27 pages, 2 figures. arXiv admin note: text overlap with arXiv:2604.12049
Abstract:The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs’ inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern datasets, renders sentiment predictions too volatile for strategic business decisions. To resolve this, we present a Syntactic Semantic Context Assessment Summarization (SSAS) framework for establishing context. Context established by SSAS functions as a sophisticated data pre-processing framework that enforces a bounded attention mechanism on LLMs. It achieves this by applying a hierarchical classification structure (Themes, Stories, Clusters) and an iterative Summary-of-Summaries (SoS) based context computation architecture. This endows the raw text with high-signal, sentiment-dense prompts, that effectively mitigate both irrelevant data and analytical variance. We empirically evaluated the efficacy of SSAS, using Gemini 2.0 Flash Lite, against a direct-LLM approach across three industry-standard datasets - Amazon Product Reviews, Google Business Reviews, Goodreads Book Reviews - and multiple robustness scenarios. Our results show that our SSAS framework is capable of significantly improving data quality, up to 30%, through a combination of noise removal and improvement in the estimation of sentiment prediction. Ultimately, consistency in our context-estimation capabilities provides a stable and reliable evidence base for decision-making.
[NLP-60] PolicyBank: Evolving Policy Understanding for LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在遵循组织政策时因自然语言规范中存在的歧义和逻辑/语义漏洞而导致的行为偏离真实需求的问题。现有方法通常将政策视为不可变的真理,导致代理在遇到政策缺口时产生“合规但错误”的行为。论文提出PolicyBank这一记忆机制,其关键在于以结构化方式存储工具级别的政策洞察,并通过预部署测试中的交互与纠正反馈迭代优化政策理解,从而自主填补规范漏洞。相较于传统记忆机制在政策缺口场景下接近零成功率,PolicyBank可将缺口缩小达82%,更贴近人类专家的判断。
链接: https://arxiv.org/abs/2604.15505
作者: Jihye Choi,Jinsung Yoon,Long T. Le,Somesh Jha,Tomas Pfister
机构: Google Cloud(谷歌云); University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent’s behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre-deployment testing, can it autonomously refine its interpretation to close specification gaps? We propose PolicyBank, a memory mechanism that maintains structured, tool-level policy insights and iteratively refines them – unlike existing memory mechanisms that treat the policy as immutable ground truth, reinforcing “compliant but wrong” behaviors. We also contribute a systematic testbed by extending a popular tool-calling benchmark with controlled policy gaps that isolate alignment failures from execution failures. While existing memory mechanisms achieve near-zero success on policy-gap scenarios, PolicyBank closes up to 82% of the gap toward a human oracle.
[NLP-61] Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
【速读】: 该论文试图解决的问题是:当前基于神经网络的语言模型(Language Models, LMs)在处理语言时,其内部机制与人类语言处理过程的相似性究竟如何?为回答这一问题,研究者采用Brain Score(BS)这一框架,通过预测人脑在阅读过程中fMRI激活模式来评估LMs的表现。解决方案的关键在于系统性地训练多种类型的LMs——包括不同语系的自然语言、结构化数据(如人类基因组、Python代码和嵌套括号)——并统一在BS指标下进行比较。结果发现,尽管输入数据差异显著,各类模型在BS上的表现均相近,尤其自然语言模型表现出高度一致性;而某些非语言结构数据训练的模型也达到接近自然语言水平的BS分数。这表明BS能有效捕捉到语言模型对跨语言共通结构的提取能力,但同时也揭示了BS作为衡量“人类类似处理”的指标可能不够敏感,无法仅凭高得分就推断出模型具备人类级认知机制。
链接: https://arxiv.org/abs/2604.15503
作者: Jingnong Qu,Ashvin Ranjan,Shane Steinert-Threlkeld
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models’ processing to human language processing? Results using a framework called Brain Score (BS) – predicting fMRI activations during reading from LM activations – have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data – the human genome, Python, and pure hierarchical structure (nested parentheses) – also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models’ ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.
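Brain Score 的基本框架是"用 LM 激活线性预测 fMRI,再以预测与真实信号的相关性作为得分"。下面是一个单特征版本的可运行草图(真实 BS 使用多体素、多特征的正则化回归与交叉验证,此处仅示意流程):

```python
def pearson_r(xs, ys):
    """两个等长序列的 Pearson 相关系数。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def brain_score(lm_activations, fmri, train_frac=0.5):
    """单特征示意版 Brain Score: 前半拟合线性映射, 后半计算预测相关性。"""
    k = int(len(fmri) * train_frac)
    xs, ys = lm_activations[:k], fmri[:k]
    mx, my = sum(xs) / k, sum(ys) / k
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))          # 最小二乘斜率
    b = my - a * mx
    preds = [a * x + b for x in lm_activations[k:]]
    return pearson_r(preds, fmri[k:])

# 完全线性相关的玩具数据, 得分应为 1
score = brain_score([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

论文的做法正是把不同输入数据训练出的 LM 放进同一套此类评估管线,再比较得分差异。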
[NLP-62] Think Multilingual Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch
【速读】: 该论文旨在解决大语言模型在推理任务中出现的代码切换(code-switching)行为缺乏系统理解与有效引导的问题。现有研究或将其视为错误,或仅通过提示工程或解码策略进行控制,且多局限于特定语言、领域或任务,未能从语言学和行为动机角度挖掘有益的代码切换模式。其解决方案的关键在于提出首个基于语言学和行为动机的微调框架,首先构建并分析来自多种模型、语言、任务和领域的推理轨迹数据集,识别出有助于推理的代码切换行为;随后设计针对性的微调干预策略,使模型能够学习在适当情境下主动进行有益的代码切换。实验表明,该框架可在数据高效的前提下显著提升有益代码切换行为,并发现即使在非直接涉及代码切换的任务(如机器翻译)上微调,也能间接增强模型的代码切换能力,揭示了代码切换行为具有可迁移性和可塑性。
链接: https://arxiv.org/abs/2604.15490
作者: Eleanor M. Lin,David Jurgens
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-switching as an undesirable error, attempted to control code-switching through modifications to input prompts or the output decoding process, or focus on narrow subsets of languages, domains, tasks, and models. We address these gaps by introducing the first linguistically and behaviorally motivated fine-tuning framework for identifying beneficial code-switched reasoning behaviors in large language models and teaching these models to code-switch more effectively for reasoning. First, we create and systematically analyze a dataset of reasoning traces from diverse models, languages, tasks, and domains to understand the types of code-switching behaviors found in existing reasoning models. Then, we develop fine-tuning interventions that teach reasoning models to code-switch based on our observations of helpful behaviors in existing models. We find that our framework can significantly increase beneficial code-switched reasoning behaviors in a data-efficient manner. Interestingly, we also find that code-switching behaviors in reasoning models can be modified by fine-tuning for tasks that do not directly demonstrate code-switching in reasoning (e.g., machine translation). Our work suggests that data-efficient interventions can instill helpful forms of code-switching behavior in reasoning models.
[NLP-63] FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段可能出现的安全违规和幻觉等问题,同时克服现有推理时控制(inference-time steering)方法在有效性、任务保真度(utility-preserving)和训练效率之间难以平衡的问题。其解决方案的关键在于提出一种名为FineSteer的新颖控制框架,该框架将推理时控制分解为两个互补阶段:条件控制(conditional steering)与细粒度向量合成(fine-grained vector synthesis)。其中,子空间引导的条件控制(Subspace-guided Conditional Steering, SCS)机制通过识别是否需要干预来避免对通用查询的不必要调整,从而保持模型原有性能;混合控制专家机制(Mixture-of-Steering-Experts, MoSE)则建模了目标行为的多模态特性,并生成查询特定的控制向量以提升效果。这种分阶段、自适应且训练高效的结构使FineSteer能够在保障通用任务表现的同时,精准优化特定输入的控制策略,显著优于当前最优方法。
链接: https://arxiv.org/abs/2604.15488
作者: Zixuan Weng,Jinghuai Zhang,Kunlin Cai,Ying Li,Peiran Wang,Yuan Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL 2026 (Main)
Abstract:Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigid, one-size-fits-all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages: conditional steering and fine-grained vector synthesis, allowing fine-grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace-guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture-of-Steering-Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query-specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training-efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at this https URL
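"条件化控制"(只在输入落入目标子空间时才施加控制向量)的最小示意如下;这里用单一方向上的投影阈值代替论文的 SCS 子空间判别,属于大幅简化的假设性实现:

```python
def conditional_steer(hidden, concept_dir, steering_vec, threshold=0.5):
    """仅当隐状态在概念方向上的投影超过阈值时才注入控制向量。"""
    projection = sum(h * c for h, c in zip(hidden, concept_dir))
    if projection < threshold:
        return hidden                     # 通用查询: 不干预, 保留原有能力
    return [h + v for h, v in zip(hidden, steering_vec)]

direction = [1.0, 0.0]
vector = [0.0, 5.0]
untouched = conditional_steer([0.1, 0.0], direction, vector)  # 低于阈值
steered = conditional_steer([1.0, 0.0], direction, vector)    # 触发控制
```

FineSteer 的第二阶段(MoSE)进一步把固定的 `vector` 换成按查询合成的专家混合向量,此处不再展开。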
[NLP-64] Evaluating LLM Simulators as Differentially Private Data Generators ICLR2026
【速读】: 该论文旨在解决在高维用户画像场景下,传统差分隐私(Differential Privacy, DP)方法难以生成复杂合成数据的问题,同时探究大语言模型(Large Language Models, LLMs)是否能够忠实再现从DP保护输入中学习到的统计分布。其解决方案的关键在于使用基于LLM的代理模拟器PersonaLedger,通过将真实用户统计数据转化为DP保护的合成个体(synthetic personas)作为种子输入,从而生成金融场景下的合成数据,并评估其在欺诈检测任务中的效用。研究发现,尽管该方法在ε=1时实现了AUC 0.70的欺诈检测性能,但存在显著的分布偏移问题,主要源于LLM自身的系统性偏差(如先验知识覆盖输入统计),这揭示了当前LLM-based方法在处理丰富用户表征时仍需克服的关键挑战。
链接: https://arxiv.org/abs/2604.15461
作者: Nassima M. Bouzid,Dehao Yuan,Nam H. Nguyen,Mayana Pereira
机构: Capital One
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Submitted to ICLR 2026. 6 pages + appendix
Abstract:LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases: learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.
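"从真实用户统计量生成 DP 保护的合成画像"的第一步,通常是用 Laplace 机制对待发布的统计量加噪;下面的逆 CDF 采样草图演示 ε-DP 加噪本身(与 PersonaLedger 的具体流程无关,仅为背景示意):

```python
import math
import random

def laplace_mechanism(value, sensitivity, epsilon, rng=random.Random(0)):
    """对统计量加 Laplace(sensitivity/epsilon) 噪声, 满足 epsilon-DP。"""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                            # Uniform(-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    noise = -scale * sign * math.log(1 - 2 * abs(u))  # Laplace 逆 CDF 采样
    return value + noise

# epsilon 越大(隐私保护越弱), 噪声越小: 大 epsilon 下发布值应非常接近原值
released = laplace_mechanism(100.0, sensitivity=1.0, epsilon=1e6)
```

论文关注的正是下游环节:LLM 模拟器能否忠实保持这类加噪统计量所刻画的分布,而不被自身先验覆盖。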
[NLP-65] Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
【速读】: 该论文旨在解决生成式 AI(Generative AI)中语言模型幻觉(hallucination)的成因问题,特别是其是否由早期轨迹承诺(early trajectory commitment)驱动。解决方案的关键在于通过“同提示分岔”(same-prompt bifurcation)方法,在相同输入下观察输出路径的自发分化,从而隔离轨迹动态与提示层面的混杂因素。研究发现,44.3% 的提示在第一个生成 token 处即发生事实与幻觉路径的分离,且激活扰动实验揭示了显著的因果不对称性:将幻觉激活注入正确轨迹会导致 87.5% 的失败(第 20 层),而反向操作仅恢复 33.3%(第 24 层),二者均显著高于基线和随机对照。进一步分析表明,幻觉行为受 prompt 编码阶段即固定的“模式态”(regime-like groups)所主导,其初始残差状态可预测每条提示的幻觉概率(Pearson r = 0.776),说明幻觉是一种局部稳定的吸引子盆地(attractor basin),进入快速且概率性强,退出则需跨层、多步协同干预。
链接: https://arxiv.org/abs/2604.15400
作者: G. Aytug Akarlar
机构: Chimera Research Initiative (Chimera研究计划)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 12 figures, 8 tables. Code and data: this https URL
Abstract:We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
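论文的核心实验手段是激活修补(activation patching):在一次前向传播中,把某层的激活替换为另一条轨迹上捕获的激活,再观察输出是否被"带偏"。下面用层函数列表模拟这一干预(玩具标量"模型",与论文所用真实 Transformer 无关):

```python
def run_with_patch(layers, x, patch_layer=None, patch_value=None):
    """顺序执行各层, 可在指定层把输出替换为外来激活 (patching)。"""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == patch_layer:
            h = patch_value          # 注入来自另一条轨迹的激活
    return h

# 玩具两层 "模型": 先加倍, 再加一
layers = [lambda h: 2 * h, lambda h: h + 1]
clean_out = run_with_patch(layers, 3)                        # 2*3 + 1 = 7
patched_out = run_with_patch(layers, 3, patch_layer=0,
                             patch_value=10)                 # 强改第 0 层输出
```

论文报告的因果不对称性,正是比较"幻觉激活注入正确轨迹"与其反向操作在这类干预下的成功率差异。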
[NLP-66] Applied Explainability for Large Language Models : A Comparative Study
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言处理任务中决策过程缺乏可解释性的问题,这一缺陷限制了模型在实际系统中的可信度、调试能力和部署可行性。解决方案的关键在于通过一项应用导向的对比研究,评估三种现有的可解释性技术——集成梯度(Integrated Gradients)、注意力回溯(Attention Rollout)和SHAP(SHapley Additive exPlanations)在微调后的DistilBERT模型上对SST-2情感分类任务的表现。研究发现,基于梯度的归因方法提供更稳定且直观的解释,而基于注意力的方法虽计算高效但与预测相关特征对齐度较低,模型无关方法则具备灵活性但引入更高计算成本和结果变异性,从而揭示了不同解释方法之间的核心权衡,并强调其作为诊断工具而非确定性解释的价值。
链接: https://arxiv.org/abs/2604.15371
作者: Venkata Abhinandan Kancharla
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 3 figures, comparative study of explainability methods for transformer-based NLP models; also available on Zenodo
Abstract:Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review. 
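文中比较的 Integrated Gradients 可以写成"沿基线到输入的直线路径对梯度做黎曼和"的形式;下面是一个纯 Python 草图,用解析梯度的线性模型验证其完备性(梯度函数 `grad_fn` 由调用方提供,示例中的线性模型为虚构):

```python
def integrated_gradients(grad_fn, x, baseline, steps=200):
    """黎曼和近似的 Integrated Gradients 归因。

    grad_fn(z): 返回模型输出在点 z 处对各输入特征的梯度。
    归因_i = (x_i - baseline_i) * 路径上的平均梯度。
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * ag for xi, b, ag in zip(x, baseline, avg_grad)]

# 线性模型 f(z) = 2*z0 - z1 的梯度恒为 [2, -1], IG 归因应严格等于 w_i * x_i
attr = integrated_gradients(lambda z: [2.0, -1.0], [1.0, 3.0], [0.0, 0.0])
```

线性情形下归因之和恰为 f(x) - f(baseline),这也是 IG 相比注意力类方法更稳定的原因之一。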
[NLP-67] Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
[Quick Read]: This paper targets the inefficiency of Low-Rank Adaptation (LoRA) fine-tuning for large language models: standard practice applies LoRA adapters uniformly to every transformer layer, ignoring how relevant each layer is to the downstream task. The key to the proposed Aletheia method is a lightweight gradient probe that identifies the most task-relevant layers, applying LoRA adapters only to those layers with an asymmetric rank allocation to improve training efficiency. Across 81 experiment rows, the method achieves a mean training speedup of 23.1% (p < 0.001) while broadly preserving downstream performance, supporting the effectiveness of intelligent layer selection for model economics.
Link: https://arxiv.org/abs/2604.15351
Authors: Abdulmalek Saket
Affiliations: Royal Fenice Kft / ALETHEIA PROTOCOL research
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages, 5 figures, 2 frozen evidence campaigns, 81 experiment rows across 14 successful models and 8 architecture families, plus one documented failed Pythia/GPT-NeoX attempt
Abstract:Low-Rank Adaptation (LoRA) has become the dominant parameter-efficient fine-tuning method for large language models, yet standard practice applies LoRA adapters uniformly to all transformer layers regardless of their relevance to the downstream task. We introduce Aletheia, a gradient-guided layer selection method that identifies the most task-relevant layers via a lightweight gradient probe and applies LoRA adapters only to those layers with asymmetric rank allocation. Across 81 experiment rows covering 14 successful models from 8 architecture families (0.5B-72B parameters, including dense and Mixture-of-Experts architectures), with one additional documented failed Pythia/GPT-NeoX attempt in Campaign 2, Aletheia achieves a 15-28% training speedup (mean 23.1%, p < 0.001) with bounded extra forgetting and broadly matched downstream behavior on the evaluated MMLU, GSM8K, and HumanEval benchmark pack. Across the tested families and scales, Campaign 1 shows a 100% per-model speed win rate and Campaign 2 shows broadly preserved downstream behavior within a bounded-degradation framing. Together these results support a practical model-economics claim: intelligent layer selection can make LoRA fine-tuning materially more efficient without introducing major downstream damage on the evaluated set.
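The selection step described above (probe, score layers, attach adapters to the top-k with asymmetric ranks) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the per-layer gradient L2 norm as the relevance score, the value of k, and the two rank levels are all illustrative assumptions.

```python
# Hypothetical gradient-guided layer selection in the spirit of Aletheia:
# run a small probe batch, score each transformer layer by its gradient norm,
# and give LoRA adapters only to the top-k layers, with the highest-scoring
# half receiving a larger rank ("asymmetric rank allocation").

import math

def select_lora_layers(grad_norms, k=4, high_rank=16, low_rank=8):
    """grad_norms: {layer_index: gradient L2 norm measured on a probe batch}."""
    ranked = sorted(grad_norms, key=grad_norms.get, reverse=True)[:k]
    # Asymmetric allocation: top half of the selected layers get the larger rank.
    cut = math.ceil(k / 2)
    return {layer: (high_rank if pos < cut else low_rank)
            for pos, layer in enumerate(ranked)}

# Illustrative probe result for a 12-layer model.
norms = {i: n for i, n in enumerate(
    [0.2, 0.5, 1.8, 0.3, 2.4, 0.4, 0.6, 3.1, 0.9, 2.0, 0.7, 0.1])}
plan = select_lora_layers(norms, k=4)
print(plan)  # {7: 16, 4: 16, 9: 8, 2: 8}
```

The resulting plan would then be passed to a PEFT-style configuration restricted to those layers; all other layers train zero adapter parameters, which is where the speedup comes from.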
[NLP-68] Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories
[Quick Read]: This paper addresses the vague, loosely defined nature of "high-level visual understanding" tasks in Computer Vision (CV) and the lack of a systematic survey of them, focusing in particular on Abstract Concepts (ACs) in automatic image classification. The key to the solution is a multidisciplinary analysis that categorizes high-level semantics into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics; it also maps out the task types associated with high-level visual sensemaking and identifies persistent challenges in AC image classification, such as the limited efficacy of massive datasets and the need to integrate supplementary information and mid-level features. Finally, it emphasizes the central role of hybrid AI systems in handling the multifaceted nature of AC image classification, laying a theoretical foundation and setting directions for future research.
Link: https://arxiv.org/abs/2308.10562
Authors: Delfina Sol Martinez Pandiani, Valentina Presutti
Affiliations: University of Bologna; Centrum Wiskunde en Informatica
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Preprint
Abstract:The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic image classification. Our survey contributes in three main ways: Firstly, it clarifies the tacit understanding of high-level semantics in CV through a multidisciplinary analysis, and categorization into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics. Secondly, it identifies and categorizes computer vision tasks associated with high-level visual sensemaking, offering insights into the diverse research areas within this domain. Lastly, it examines how abstract concepts such as values and ideologies are handled in CV, revealing challenges and opportunities in AC-based image classification. Notably, our survey of AC image classification tasks highlights persistent challenges, such as the limited efficacy of massive datasets and the importance of integrating supplementary information and mid-level features. We emphasize the growing relevance of hybrid AI systems in addressing the multifaceted nature of AC image classification tasks. Overall, this survey enhances our understanding of high-level visual reasoning in CV and lays the groundwork for future research endeavors.
Information Retrieval
[IR-0] Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation
[Quick Read]: This paper addresses the drop in inference accuracy caused by data sparsity in sequential recommendation: existing test-time augmentation (TTA) methods apply a uniform, user-agnostic augmentation strategy that ignores behavioral heterogeneity across users, yielding suboptimal augmentation. The key to the proposed AdaTTA framework is a plug-and-play adaptive inference mechanism built on reinforcement learning: augmentation operator selection is modeled as a Markov Decision Process, and an Actor-Critic network with hybrid state representations and a joint macro-rank reward dynamically selects the optimal augmentation operator for each user sequence, significantly improving recommendation performance.
Link: https://arxiv.org/abs/2604.16121
Authors: Xibo Li, Liang Zhang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Information Retrieval (cs.IR)
Comments: 10 pages. arXiv admin note: text overlap with arXiv:2504.04843 by other authors
Abstract:Test-time augmentation (TTA) has become a promising approach for mitigating data sparsity in sequential recommendation by improving inference accuracy without requiring costly model retraining. However, existing TTA methods typically rely on uniform, user-agnostic augmentation strategies. We show that this "one-size-fits-all" design is inherently suboptimal, as it neglects substantial behavioral heterogeneity across users, and empirically demonstrate, for the first time, that the optimal augmentation operators vary significantly across user sequences with different characteristics. To address this limitation, we propose AdaTTA, a plug-and-play reinforcement learning-based adaptive inference framework that learns to select augmentation operators on a per-sequence basis. We formulate augmentation selection as a Markov Decision Process and introduce an Actor-Critic policy network with hybrid state representations and a joint macro-rank reward design to dynamically determine the optimal operator for each input user sequence. Extensive experiments on four real-world datasets and two recommendation backbones demonstrate that AdaTTA consistently outperforms the best fixed-strategy baselines, achieving up to 26.31% relative improvement on the Home dataset while incurring only moderate computational overhead.
[IR-1] JFinTEB: Japanese Financial Text Embedding Benchmark SIGIR2026
[Quick Read]: This paper addresses the absence of a dedicated evaluation benchmark for Japanese financial text embeddings; existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects. The key to the solution is JFinTEB, the first comprehensive benchmark for this setting, covering retrieval and classification tasks: the retrieval tasks leverage instruction-following datasets and financial text generation queries, while the classification tasks include sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. The authors conduct extensive evaluations across diverse embedding models (Japanese-specific models of various sizes, multilingual models, and commercial embedding services) and publicly release the datasets and evaluation framework to facilitate future research and establish a standardized evaluation protocol.
Link: https://arxiv.org/abs/2604.15882
Authors: Masahiro Suzuki, Hiroki Sakaji
Affiliations: Amova Asset Management Co., Ltd.; Hokkaido University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 5 pages. Accepted at SIGIR 2026 Resource Track
Abstract:We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at this https URL to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and establishes a foundation for advancing domain-specific embedding research.
[IR-2] UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval
[Quick Read]: This paper addresses the conflation of relevance and usefulness in conventional information retrieval: existing systems judge relevance mainly by lexical or semantic similarity between texts, a definition that ignores whether a text actually helps answer the user's query. For example, a text stating that Paris is in France is semantically relevant to the query of whether Paris is larger than Berlin, yet holds no practical value. The key to the solution is UsefulBench, a domain-specific dataset labeled by three professional analysts that distinguishes whether a text is merely connected to a query (relevance) or holds practical value in responding to it (usefulness). The study reveals that classic similarity-based retrieval aligns more strongly with relevance than usefulness; LLM-based systems can partially counteract this bias, but still fall short on domain-specific problems that demand expertise current LLMs do not fully incorporate, motivating targeted retrieval systems that integrate expert knowledge to improve usefulness.
Link: https://arxiv.org/abs/2604.15827
Authors: Tobias Schimanski, Stefanie Lewandowski, Christian Woerle, Nicola Reichenau, Yauheni Huryn, Markus Leippold
Affiliations: University of Zurich; ETH Zurich; score4more GmbH; Climate+Tech Think Tank; Swiss Finance Institute (SFI)
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.
[IR-3] Scattered Hypothesis Generation for Open-Ended Event Forecasting
[Quick Read]: This paper addresses the fact that current LLM-based approaches to open-ended event forecasting focus mainly on the most probable outcome, neglecting the intrinsic uncertainty of real-world events. To bridge this gap, the authors advance the paradigm from pinpoint forecasting to scatter forecasting, introducing hypothesis generation as a proxy task that produces an inclusive and diverse set of hypotheses covering the space of plausible future events. The key to the solution is SCATTER, a reinforcement learning framework that jointly optimizes hypothesis inclusiveness and diversity via a three-part hybrid reward: 1) a validity reward measuring semantic alignment with observed events; 2) an intra-group diversity reward encouraging variation among sampled responses; and 3) an inter-group diversity reward promoting exploration across distinct modes. By integrating the validity-gated score into the objective, the framework confines the exploration of wildly diversified outcomes to contextually plausible futures, preventing mode collapse.
Link: https://arxiv.org/abs/2604.15788
Authors: He Chang, Zhulin Tao, Lifang Yang, Xianglin Huang, Yunshan Ma
Affiliations: Communication University of China; Singapore Management University
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Despite the importance of open-ended event forecasting for risk management, current LLM-based methods predominantly target only the most probable outcomes, neglecting the intrinsic uncertainty of real-world events. To bridge this gap, we advance open-ended event forecasting from pinpoint forecasting to scatter forecasting by introducing the proxy task of hypothesis generation. This paradigm aims to generate an inclusive and diverse set of hypotheses that broadly cover the space of plausible future events. To this end, we propose SCATTER, a reinforcement learning framework that jointly optimizes inclusiveness and diversity of the hypothesis. Specifically, we design a novel hybrid reward that consists of three components: 1) a validity reward that measures semantic alignment with observed events, 2) an intra-group diversity reward to encourage variation within sampled responses, and 3) an inter-group diversity reward to promote exploration across distinct modes. By integrating the validity-gated score into the overall objective, we confine the exploration of wildly diversified outcomes to contextually plausible futures, preventing the mode collapse issue. Experiments on two real-world benchmark datasets, i.e., OpenForecast and OpenEP, demonstrate that SCATTER significantly outperforms strong baselines. Our code is available at this https URL.
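The validity-gating idea in the hybrid reward can be made concrete with a small sketch. The abstract does not give the exact reward form, so everything below is an illustrative assumption: Jaccard word-overlap distance stands in for the diversity terms, a single within-set diversity term stands in for the intra/inter-group pair, and the gate threshold and weights are made up. The point is only the mechanism: diversity contributes to the reward only for hypotheses that pass the validity gate, which keeps exploration inside plausible futures.

```python
# Schematic validity-gated hybrid reward in the spirit of SCATTER (illustrative only).

def jaccard_distance(a, b):
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def hybrid_reward(hypotheses, validity, w_val=1.0, w_div=0.5, gate=0.5):
    """validity[i]: plausibility of hypothesis i in [0, 1], e.g. from a judge model."""
    rewards = []
    for i, h in enumerate(hypotheses):
        others = [jaccard_distance(h, o) for j, o in enumerate(hypotheses) if j != i]
        diversity = sum(others) / len(others) if others else 0.0
        gated = 1.0 if validity[i] >= gate else 0.0  # the validity gate
        rewards.append(w_val * validity[i] + gated * w_div * diversity)
    return rewards

hyps = ["oil prices rise sharply",
        "oil prices rise sharply",               # duplicate: earns less diversity
        "regional supply chains face disruption"]
validity = [0.9, 0.9, 0.2]                        # third judged implausible here
rewards = hybrid_reward(hyps, validity)
print(rewards)  # [1.15, 1.15, 0.2]: implausible hypothesis gets no diversity bonus
```

In the paper's setting the reward would be computed over groups of sampled responses and fed to the RL objective; the gate is what prevents the diversity terms from rewarding wildly implausible outputs.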
[IR-4] On the Equivalence Between Auto-Regressive Next Token Prediction and Full-Item-Vocabulary Maximum Likelihood Estimation in Generative Recommendation–A Short Note
[Quick Read]: This paper addresses the lack of a rigorous theoretical explanation for auto-regressive next-token prediction (AR-NTP), the training objective adopted by virtually all industrial generative recommendation (GR) systems, whose relationship to classical maximum likelihood estimation previously had no formal proof. The key contribution is the first rigorous proof that, under the premise of a bijective mapping between items and their k-token sequences, k-token auto-regressive next-token prediction is strictly equivalent to full-item-vocabulary maximum likelihood estimation (FV-MLE). The equivalence is further shown to hold for both cascaded and parallel tokenizations, the two most widely used schemes in industry, providing the first formal theoretical foundation for the dominant GR paradigm and principled guidance for future system optimization.
Link: https://arxiv.org/abs/2604.15739
Authors: Yusheng Huang, Shuang Yang, Zhaojie Liu, Han Li
Affiliations: Kuaishou Technology
Subjects: Information Retrieval (cs.IR)
Comments: Work in progress
Abstract:Generative recommendation (GR) has emerged as a widely adopted paradigm in industrial sequential recommendation. Current GR systems follow a similar pipeline: tokenization for item indexing, next-token prediction as the training objective and auto-regressive decoding for next-item generation. However, existing GR research mainly focuses on architecture design and empirical performance optimization, with few rigorous theoretical explanations for the working mechanism of auto-regressive next-token prediction in recommendation scenarios. In this work, we formally prove that **the k-token auto-regressive next-token prediction (AR-NTP) paradigm is strictly mathematically equivalent to full-item-vocabulary maximum likelihood estimation (FV-MLE)**, under the core premise of a bijective mapping between items and their corresponding k-token sequences. We further show that this equivalence holds for both cascaded and parallel tokenizations, the two most widely used schemes in industrial GR systems. Our result provides the first formal theoretical foundation for the dominant industrial GR paradigm, and offers principled guidance for future GR system optimization.
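The intuition behind the note's equivalence can be checked numerically on a toy example: under a bijection between items and k-token sequences, the auto-regressive chain probability is itself a distribution over the item vocabulary, so minimizing the sum of next-token losses is the same as minimizing the full-vocabulary item negative log-likelihood. The codebook, item mapping, and probabilities below are made up for illustration.

```python
# Toy check: bijective item <-> 2-token mapping makes AR-NTP and FV-MLE coincide.

import math

# Bijection: 4 items <-> 2-token sequences over a 2-symbol codebook.
item_to_tokens = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}

# Toy AR model: P(t1) and P(t2 | t1).
p_t1 = {0: 0.7, 1: 0.3}
p_t2_given_t1 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}

def item_prob(item):
    t1, t2 = item_to_tokens[item]
    return p_t1[t1] * p_t2_given_t1[t1][t2]

# (1) The chain probabilities sum to 1 over the full item vocabulary,
# so the AR model induces a valid item-level distribution.
total = sum(item_prob(i) for i in item_to_tokens)
print(total)  # 1.0 (up to float rounding)

# (2) The summed next-token losses for a target item equal its
# full-item-vocabulary negative log-likelihood.
t1, t2 = item_to_tokens["B"]
ntp_loss = -math.log(p_t1[t1]) - math.log(p_t2_given_t1[t1][t2])
fv_mle_loss = -math.log(item_prob("B"))
print(abs(ntp_loss - fv_mle_loss) < 1e-12)  # True
```

If the mapping were not bijective (two items sharing a token sequence, or unused sequences), the induced item distribution would be ill-defined or leak mass, which is exactly why the bijection is the core premise of the proof.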
[IR-5] Intent Propagation Contrastive Collaborative Filtering
[Quick Read]: This paper addresses two limitations of existing disentangled collaborative filtering methods: first, they focus only on local structural features derived from direct node interactions and overlook the comprehensive graph structure, limiting disentanglement accuracy; second, the disentanglement process is supervised only by backpropagation signals from the recommendation task, lacking explicit supervision and thus prone to bias and overfitting. The key to the proposed Intent Propagation Contrastive Collaborative Filtering (IPCCF) algorithm lies in three parts: a double helix message propagation framework that extracts deeper semantic information from nodes to better model their interactions; an intent message propagation mechanism that incorporates graph structure into the disentanglement process, widening its scope; and contrastive learning that aligns structure-derived and intent-derived node representations, providing direct supervision for disentanglement, mitigating bias, and improving robustness.
Link: https://arxiv.org/abs/2604.15704
Authors: Haojie Li, Junwei Du, Guanfeng Liu, Feng Jiang, Yan Wang, Xiaofang Zhou
Affiliations: Qingdao University of Science and Technology; Macquarie University; The Hong Kong University of Science and Technology
Subjects: Information Retrieval (cs.IR)
Comments: 15 pages, 5 figures, 6 tables
Abstract:Disentanglement techniques used in collaborative filtering uncover interaction intents between nodes, improving the interpretability of node representations and enhancing recommendation performance. However, existing disentanglement methods still face two problems. First, they focus on local structural features derived from direct node interactions and overlook the comprehensive graph structure, which limits disentanglement accuracy. Second, the disentanglement process depends on backpropagation signals derived from recommendation tasks and lacks direct supervision, which may lead to biases and overfitting. To address these issues, we propose the Intent Propagation Contrastive Collaborative Filtering (IPCCF) algorithm. Specifically, we design a double helix message propagation framework to more effectively extract the deep semantic information of nodes, thereby improving the model’s understanding of interactions between nodes. We also develop an intent message propagation method that incorporates graph structure information into the disentanglement process, thereby expanding the consideration scope of disentanglement. In addition, contrastive learning techniques are employed to align node representations derived from structure and intents, providing direct supervision for the disentanglement process, mitigating biases, and enhancing the model’s robustness to overfitting. Experiments on three real data graphs illustrate the superiority of the proposed approach.
[IR-6] Sample Is Feature: Beyond Item-Level Toward Sample-Level Tokens for Unified Large Recommender Models
[Quick Read]: This paper addresses two structural limitations in scaling industrial recommender models along sample information and model capacity: the former encodes only a subset of each historical interaction into sequence tokens, leaving most of the raw sample context unexploited and precluding sample-level, time-varying features; the latter is constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. The core of the proposed SIF (Sample Is Feature) is to encode each raw sample directly into a sequence token, maximally preserving sample information while eliminating the heterogeneity between sequential and non-sequential features. SIF has two key components: the **Sample Tokenizer** quantizes each raw sample into a Token Sample via hierarchical group-adaptive quantization (HGAQ), efficiently incorporating full sample-level context into the sequence; the **SIF-Mixer** then performs deep feature interaction over the homogeneous representations through token-level and sample-level mixing, fully unleashing the model's representational capacity.
Link: https://arxiv.org/abs/2604.15650
Authors: Shuli Wang, Junwei Yin, Changhao Li, Senjie Kou, Chi Wang, Yinqiu Huang, Yinhua Zhu, Haitao Wang, Xingxing Wang
Affiliations: Meituan
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Scaling industrial recommender models has followed two parallel paradigms: **sample information scaling**, enriching the information content of each training sample through deeper and longer behavior sequences, and **model capacity scaling**, unifying sequence modeling and feature interaction within a single Transformer backbone. However, these two paradigms still face two structural limitations. Firstly, sample information scaling methods encode only a subset of each historical interaction into the sequence token, leaving the majority of the original sample context unexploited and precluding the modeling of sample-level, time-varying features. Secondly, model capacity scaling methods are inherently constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. To address these issues, we propose **SIF** (*Sample Is Feature*), which encodes each historical Raw Sample directly into the sequence token, maximally preserving sample information while simultaneously resolving the heterogeneity between sequential and non-sequential features. SIF consists of two key components. The **Sample Tokenizer** quantizes each historical Raw Sample into a Token Sample via hierarchical group-adaptive quantization (HGAQ), enabling full sample-level context to be incorporated into the sequence efficiently. The **SIF-Mixer** then performs deep feature interaction over the homogeneous sample representations via token-level and sample-level mixing, fully unleashing the model's representational capacity. Extensive experiments on a large-scale industrial dataset validate SIF's effectiveness, and we have successfully deployed SIF on the Meituan food delivery platform.
[IR-7] SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
[Quick Read]: This paper tackles cross-modal retrieval between food images and recipe texts, a task with applications in nutritional management, dietary logging, and cooking assistance. Conventional methods rely on dual-encoder architectures that require complex alignment strategies and task-specific network designs to bridge the semantic gap between images and text. The key to the proposed single-architecture model SIMMER (Single Integrated Multimodal Model for Embedding Recipes) is to adopt the MLLM-based embedding model VLM2Vec and process both images and text with one unified encoder, abandoning the dual-encoder paradigm; the authors further design prompt templates tailored to recipe structure (title, ingredients, instructions) and introduce a component-aware data augmentation strategy that trains on both complete and partial recipes, improving robustness to incomplete inputs and substantially outperforming prior methods on the Recipe1M dataset.
Link: https://arxiv.org/abs/2604.15628
Authors: Keisuke Gomi, Keiji Yanai
Affiliations: The University of Electro-Communications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 20 pages, 6 figures
Abstract:Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
[IR-8] Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking
[Quick Read]: This paper revisits whether adaptive retrieval-augmented generation is still necessary now that large language models (LLMs) have become more robust to noise. The key to the solution is the AdaRankLLM framework, with two components: 1) an adaptive ranker built on a zero-shot prompt with a passage dropout mechanism, used to verify the necessity of adaptive listwise reranking; and 2) a two-stage progressive distillation paradigm, enhanced by data sampling and augmentation, that endows smaller open-source LLMs with precise listwise ranking and adaptive filtering capabilities. Experiments across multiple datasets and LLMs show that the approach markedly reduces context overhead while improving performance, and reveal a role shift for adaptive retrieval: a critical noise filter for weaker models, and a cost-effective efficiency optimizer for stronger ones.
Link: https://arxiv.org/abs/2604.15621
Authors: Jun Feng, Jiahui Tang, Zhicheng He, Hang Lv, Hongchao Gu, Hao Wang, Xuezhi Yang, Shuai Fang
Affiliations: Hefei University of Technology; University of Science and Technology of China
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 2 figures
Abstract:Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for weaker models to overcome their limitations, while serving as a cost-effective efficiency optimizer for stronger reasoning models.
[IR-9] BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels ACL2026
[Quick Read]: This paper addresses the limited modeling of semantic overlap in biomedical information retrieval caused by reliance on coarse binary relevance signals. The key to the proposed BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning) is to leverage hierarchical MeSH (Medical Subject Headings) annotations as structured supervision for multi-label contrastive learning, capturing semantic relationships and hierarchical structure among biomedical texts more precisely.
Link: https://arxiv.org/abs/2604.15591
Authors: Mengfei Lan, Lecheng Zheng, Halil Kilicoglu
Affiliations: University of Illinois Urbana-Champaign
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 Main Conference
Abstract:Effective biomedical information retrieval requires modeling domain semantics and hierarchical relationships among biomedical texts. Existing biomedical generative retrievers build on coarse binary relevance signals, limiting their ability to capture semantic overlap. We propose BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning), which leverages hierarchical MeSH annotations to provide structured supervision for multi-label contrastive learning. Our models, BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B), achieve promising performance on biomedical retrieval, sentence similarity, and question answering tasks, while remaining computationally efficient for deployment.
[IR-10] Learning Behaviorally Grounded Item Embeddings via Personalized Temporal Contexts
[Quick Read]: This paper addresses the difficulty of modeling user preference evolution in recommender systems, in particular distinguishing short-term interest fluctuations from long-term interest drift. Conventional approaches such as Item2Vec treat user histories as unordered bags of items, ignoring the temporal structure between interactions and thus failing to capture the coherence of consumption sessions and the gradual nature of interest drift. The key to the proposed TAI2Vec (Time-Aware Item-to-Vector) family is to embed time awareness directly into representation learning: TAI2Vec-Disc uses personalized anomaly detection to dynamically segment interactions into semantic sessions, while TAI2Vec-Cont applies user-specific continuous decay functions that weigh item relationships by their relative temporal distance, adapting to each user's individual interaction pace.
Link: https://arxiv.org/abs/2604.15581
Authors: Rafael T. Sereicikas, Pedro R. Pires, Gregorio F. Azevedo, Tiago A. Almeida
Affiliations: Federal University of São Carlos
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to be published in UMAP'26, 9 pages, 7 figures
Abstract:Effective user modeling requires distinguishing between short-term and long-term preference evolution. While item embeddings have become a key component of recommender systems, standard approaches like Item2Vec treat user histories as unordered sets (bag-of-items), implicitly assuming that interactions separated by minutes are as semantically related as those separated by months. This simplification flattens the rich temporal structure of user behavior, obscuring the distinction between coherent consumption sessions and gradual interest drifts. In this work, we introduce TAI2Vec (Time-Aware Item-to-Vector), a family of lightweight embedding models that integrates temporal proximity directly into the representation learning process. Unlike approaches that apply global time constraints, TAI2Vec is user-adaptive, tailoring its temporal definitions to individual interaction paces. We propose two complementary strategies: TAI2Vec-Disc, which utilizes personalized anomaly detection to dynamically segment interactions into semantic sessions, and TAI2Vec-Cont, which employs continuous, user-specific decay functions to weigh item relationships based on their relative temporal distance. Experimental results across eight diverse datasets demonstrate that TAI2Vec consistently produces more accurate and behaviorally grounded representations than static baselines, achieving competitive or superior performance in over 80% of the datasets, with improvements of up to 135%. The source code is publicly available at this https URL.
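The user-adaptive decay idea behind TAI2Vec-Cont can be sketched in a few lines. The exponential form and the median-gap time scale below are illustrative assumptions, not the paper's exact formulation; the point is that the same absolute time gap is weighted differently depending on each user's own interaction pace.

```python
# Hypothetical user-adaptive temporal decay in the spirit of TAI2Vec-Cont:
# the co-occurrence weight between two interactions decays with their time gap,
# using a per-user time scale derived from that user's typical gap.

import math
from statistics import median

def user_time_scale(timestamps):
    """One 'typical gap' per user, from the median inter-interaction gap."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return max(median(gaps), 1.0)

def pair_weight(t_i, t_j, tau):
    return math.exp(-abs(t_i - t_j) / tau)

# A fast-paced user (gaps of minutes) vs. a slow-paced one (gaps of days).
fast = [0, 60, 120, 180]                   # seconds
slow = [0, 86_400, 172_800, 259_200]
tau_fast, tau_slow = user_time_scale(fast), user_time_scale(slow)

# The same one-hour gap is "distant" for the fast user, "close" for the slow one.
print(pair_weight(0, 3600, tau_fast))   # ~0     (3600 s >> 60 s scale)
print(pair_weight(0, 3600, tau_slow))   # ~0.96  (3600 s << 1-day scale)
```

In an Item2Vec-style trainer these weights would scale the contribution of each (center, context) item pair, so that pairs from one coherent session dominate over pairs separated by long idle periods.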
[IR-11] Collaborative Filtering Through Weighted Similarities of User and Item Embeddings
[Quick Read]: This paper addresses how to effectively combine user-item and item-item recommendation strategies to improve performance while retaining computational efficiency. The key to the solution is a novel ensemble method that unifies the two strategies through a weighted similarity framework, using shared user and item embeddings to simplify the architecture and improve efficiency; because no embedding-specific fine-tuning is required, the hyperparameters of the base algorithm can be reused without sacrificing performance, making the method robust and efficient across scenarios that favor either strategy.
Link: https://arxiv.org/abs/2604.15573
Authors: Pedro R. Pires, Rafael T. Sereicikas, Gregorio F. Azevedo, Tiago A. Almeida
Affiliations: Federal University of São Carlos
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Published in SAC'25, 8 pages, 4 figures
Abstract:In recent years, neural networks and other complex models have dominated recommender systems, often setting new benchmarks for state-of-the-art performance. Yet, despite these advancements, award-winning research has demonstrated that traditional matrix factorization methods can remain competitive, offering simplicity and reduced computational overhead. Hybrid models, which combine matrix factorization with newer techniques, are increasingly employed to harness the strengths of multiple approaches. This paper proposes a novel ensemble method that unifies user-item and item-item recommendations through a weighted similarity framework to deliver top-N recommendations. Our approach is distinctive in its use of shared user and item embeddings for both recommendation strategies, simplifying the architecture and enhancing computational efficiency. Extensive experiments across multiple datasets show that our method achieves competitive performance and is robust in varying scenarios that favor either user-item or item-item recommendations. Additionally, by eliminating the need for embedding-specific fine-tuning, our model allows for the seamless reuse of hyperparameters from the base algorithm without sacrificing performance. This results in a method that is both efficient and easy to implement. Our open-source implementation is available at this https URL.
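The weighted-similarity ensemble described above can be sketched with shared embeddings and a single blending weight. The dot-product similarity, the max-aggregation over the user's history, and the value of alpha are illustrative assumptions (the paper may aggregate or weight differently); the point is that one embedding table serves both scoring paths.

```python
# Minimal sketch of blending user-item and item-item scores over shared embeddings.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ensemble_score(user_vec, candidate_vec, history_vecs, alpha=0.6):
    user_item = dot(user_vec, candidate_vec)                       # direct path
    item_item = max(dot(h, candidate_vec) for h in history_vecs)   # history path
    return alpha * user_item + (1.0 - alpha) * item_item

# Toy 2-d embeddings: same table used for both paths.
user = [0.5, 0.1]
history = [[1.0, 0.0], [0.0, 1.0]]
candidates = {"x": [0.9, 0.1], "y": [0.1, 0.9]}

ranked = sorted(candidates,
                key=lambda c: ensemble_score(user, candidates[c], history),
                reverse=True)
print(ranked)  # ['x', 'y']
```

Sweeping alpha toward 1 recovers a pure user-item recommender and toward 0 a pure item-item one, which is how a single hyperparameter adapts the ensemble to scenarios that favor either strategy.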
[IR-12] vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
【速读】:该论文旨在解决本地优先文档记忆系统中向量相似性搜索与全文关键词匹配融合效率低、性能不稳定的问题。其核心挑战在于如何在不依赖人工标注的情况下,有效整合两种检索方式的优势并提升整体排序精度。解决方案的关键在于:(1) 利用混合检索策略下的查询分歧(disagreement)作为无监督训练信号,通过自监督微调嵌入模型(如BGE-small),显著提升NDCG@10指标;(2) 引入基于每查询IDF加权的倒数排名融合(Reciprocal Rank Fusion, RRF),实现动态权重调整,相比固定权重在所有5个BEIR数据集上均有提升(最高+21.4%);(3) 构建一个生产级基础设施,支持完整性校验、模式版本控制、排名诊断及基于距离的相关性信号,确保系统在大规模数据(50,425条相关性标注查询)下保持稳定延迟(中位数20.9ms)和高召回质量。
链接: https://arxiv.org/abs/2604.15484
作者: Jayson Steffens
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:We present vstash, a local-first document memory system that combines vector similarity search with full-text keyword matching via Reciprocal Rank Fusion (RRF) and adaptive per-query IDF weighting. All data resides in a single SQLite file using sqlite-vec for approximate nearest neighbor search and FTS5 for keyword matching. We make four primary contributions. (1) Self-supervised embedding refinement via hybrid retrieval disagreement: across 753 BEIR queries on SciFact, NFCorpus, and FiQA, 74.5% produce top-10 disagreement between vector-heavy (vec=0.95, fts=0.05) and FTS-heavy (vec=0.05, fts=0.95) search (per-dataset rates 63.4% / 73.4% / 86.7%, Section 5.2), providing a free training signal without human labels. Fine-tuning BGE-small (33M params) with MultipleNegativesRankingLoss on 76K disagreement triples improves NDCG@10 on all 5 BEIR datasets (up to +19.5% on NFCorpus vs. BGE-small base RRF, Table 6). On 3 of 5 datasets, under different preprocessing, the tuned 33M-parameter pipeline matches or exceeds published ColBERTv2 results (110M params) and an untrained BGE-base (110M); on FiQA and ArguAna it underperforms ColBERTv2 (Section 5.5). (2) Adaptive RRF with per-query IDF weighting improves NDCG@10 on all 5 BEIR datasets versus fixed weights (up to +21.4% on ArguAna), achieving 0.7263 on SciFact with BGE-small. (3) A negative result on post-RRF scoring: frequency+decay, history-augmented recall, and cross-encoder reranking all failed to improve NDCG. (4) A production-grade substrate with integrity checking, schema versioning, ranking diagnostics, and a distance-based relevance signal validated on 50,425 relevance-judged queries across the 5 BEIR datasets. Search latency remains 20.9 ms median at 50K chunks with stable NDCG. The fine-tuned model is published as Stffens/bge-small-rrf-v2 on HuggingFace. All code, data, and experiments are open-source. 
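vstash 摘要中的融合骨架是标准的倒数排名融合(RRF)。下面给出一个带双路权重的 RRF 最小示意(摘要未给出按查询 IDF 自适应调权的具体公式,此处以可手动设置的固定权重演示融合本身;k=60 为 RRF 文献中的常用默认值):

```python
def rrf_fuse(vec_ranking, fts_ranking, w_vec=0.5, w_fts=0.5, k=60):
    """融合向量检索与全文检索两个排名列表;排名从 1 开始计。"""
    scores = {}
    for rank, doc in enumerate(vec_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_vec / (k + rank)
    for rank, doc in enumerate(fts_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_fts / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# 两路都靠前的文档(d1)应压过只在单路靠前的文档
fused = rrf_fuse(["d1", "d2", "d3"], ["d1", "d3", "d2"])
```

摘要中用于挖掘自监督训练信号的 vector-heavy / FTS-heavy 对比,对应于把 (w_vec, w_fts) 分别设为 (0.95, 0.05) 与 (0.05, 0.95) 后比较两份融合结果的 top-10 差异。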
[IR-13] OverCite: Add citations in LaTeX without leaving the editor
【速读】:该论文旨在解决科研人员在使用LaTeX撰写论文时插入参考文献的低效问题,即频繁切换编辑环境、手动查找文献并管理BibTeX条目所导致的写作中断。解决方案的关键在于开发了一个名为OverCite的轻量级开源工具,它通过在写作环境中直接利用粗略的引用占位符(如 \citep{Perlmutter1999})和局部句子上下文,向ADS/SciX索引数据库发起查询,智能排序候选文献并自动插入匹配的参考文献,从而无需离开当前编辑器即可完成引用插入。该工具已在Overleaf平台和VS Code中实现,支持跨学科领域(涵盖天文学、物理学、计算机科学、数学、生物学及arXiv预印本)的高效文献管理。
链接: https://arxiv.org/abs/2604.15366
作者: Cheyanne Shariat
机构: California Institute of Technology (加州理工学院)
类目: Digital Libraries (cs.DL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 3 pages, 1 figure. OverCite is available at this https URL
Abstract:Adding citations while drafting in LaTeX often requires leaving the editor, searching for a paper in mind, copying its BibTeX entry into the project bibliography, renaming the cite key, and then returning to the sentence. OverCite is an open-source, lightweight tool that lets authors find, select, and insert citations without leaving the writing environment. In Overleaf, OverCite uses rough citation placeholders (e.g., \citep{Perlmutter1999}) and local sentence context to query ADS/SciX-indexed literature, rank likely matches, and insert the selected reference, without leaving the editor. A companion VS Code extension provides the same functionality for local LaTeX projects. The ADS/SciX database includes astronomy, physics, computer science, mathematics, biology, and all indexed arXiv e-prints, making OverCite useful across a broad range of scientific disciplines.
[IR-14] To LLM or Not to LLM: How Designers and Developers Navigate LLMs as Tools or Teammates
【速读】:该论文试图解决的问题是:在设计与开发流程中,如何合理决策是否引入大语言模型(Large Language Models, LLMs),因为这一决策并非单纯的技术评估,而是涉及组织结构、责任分配和权责边界等社会技术因素。其解决方案的关键在于提出“角色框架”(role framing)这一分析工具,将LLM定位为“工具”或“队友”两种典型情境:当LLM被视为受人类明确控制的工具时,可嵌入现有治理结构并被接受;而当其被赋予模糊或共享代理权时,则会因责任难以归属而引发犹豫。研究进一步构建了一套分析框架,用以指导设计阶段对LLM角色的明确界定,从而影响决策权限、责任主体、监督策略与组织可接受性,进而将“是否使用LLM”转化为一个在系统设计早期即需解决的社会技术定位问题。
链接: https://arxiv.org/abs/2604.15344
作者: Varad Vishwarupe,Ivan Flechais,Nigel Shadbolt,Marina Jirotka
机构: University of Oxford(牛津大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 1 table
Abstract:Large language models (LLMs) are increasingly integrated into design and development workflows, yet decisions about their use are rarely binary or purely technical. We report findings from a constructivist grounded theory study based on interviews with 33 designers and developers across three large technology organisations. Rather than evaluating LLMs solely by capability, participants reasoned about the role an LLM could occupy within a workflow and how that role would interact with existing structures of responsibility and organisational accountability. When LLMs were framed as tools under clear human control, their use was typically acceptable and could be integrated within existing governance structures. When framed as teammates with shared or ambiguous agency, practitioners expressed hesitation, particularly when responsibility for outcomes could not be clearly justified. At the same time, participants also described productive teammate configurations in which LLMs supported collaborative reasoning while remaining embedded within explicit oversight structures. We identify tool and teammate framings as recurring ways in which designers and developers position LLMs relative to human work and present an analytic rubric describing how role framing shapes decision authority, accountability ownership, oversight strategies, and organisational acceptability. By foregrounding design-time reasoning, this work reframes To LLM or Not to LLM as a sociotechnical positioning problem that emerges during system design rather than during post-deployment evaluation.
人机交互
[HC-0] “Taking Stock at FAccT”: Using Participatory Design to Co-Create a Vision for the Fairness, Accountability, and Transparency Community
【速读】:该论文旨在解决学术会议治理中缺乏包容性与反思性参与的问题,特别是在聚焦人工智能(Artificial Intelligence, AI)社会影响的论坛如ACM FAccT中,如何实现多方利益相关者(包括学者、公民社会成员及政府代表)共同设计治理结构。解决方案的关键在于采用大规模参与式设计(Participatory Design, PD)方法,整合线下CRAFT工作坊、异步Polis投票以及面向治理者的报告撰写,使参会者通过提出初始陈述、补充观点并可视化共识、分歧与不确定性,从而共同塑造会议议程,推动批判性学者自由发声,并为大型PD理论提供了可扩展的时间与知识论框架。
链接: https://arxiv.org/abs/2604.16224
作者: Shiran Dudy,Jan Simson,Yanan Long
机构: Northeastern University (东北大学); LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心); University of Mannheim (曼海姆大学); StickFlux Labs; University of Chicago (芝加哥大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at FAccT 2026, 27 pages, 9 figures
Abstract:As a relatively new forum, ACM FAccT has become a key space for activists and scholars to critically examine emerging AI and ML technologies. It brings together academics, civil society members, and government representatives from diverse fields to explore the broader societal impacts of both deployed and proposed technologies. We report a large-scale participatory design (PD) process for reflexive conference governance, which combined an in-person CRAFT session, an asynchronous Polis poll and the synthesis of a governance-facing report for the FAccT leadership. Participants shaped the substantive agenda by authoring seed statements, adding new statements and making patterns of agreement, disagreement and uncertainty visible through this http URL. These endeavors represent one of the first instances of applying PD to a venue that critically interrogates the societal impacts of AI, fostering a niche in which critical scholars are free to voice their concerns. Finally, this work advances large-scale PD theory by providing an effective case study of a co-design paradigm that can readily scale temporally and epistemologically.
[HC-1] Investigating Conversational Agents to Support Secondary School Students Learning CSP
【速读】:该论文旨在解决高中学生在学习大学先修计算机科学原理(AP Computer Science Principles, CSP)课程时,因难以找到适配其学习任务和背景的网络资源而面临的认知障碍问题。传统Web搜索存在信息筛选效率低、内容适配性差等局限,为此,论文提出以对话式智能体(conversational agents)作为替代方案,重点评估通用生成式对话模型(如ChatGPT)与专为CSP学习设计的固定应答型对话代理在辅助学生探索性检索中的有效性与参与度差异。其解决方案的关键在于通过实证研究比较不同类型的对话式智能体对高中生学习CSP概念的支持能力,从而为教育场景下AI工具的优化部署提供依据。
链接: https://arxiv.org/abs/2604.16213
作者: Matthew Frazier,Kostadin Damevski,Lori Pollock
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:Secondary school students enrolled in the AP Computer Science Principles (CSP) course commonly utilize web resources (e.g., tutorials, Q&A sites) to better understand key concepts in the curriculum. The primary obstacle to using these resources is finding information appropriate for the learning task and student’s background. In addition to web search, conversational agents are increasingly a viable alternative for CSP students. In this paper, we study the potential of conversational agents to aid secondary school students as they acquire knowledge on CSP concepts. We explore general purpose, generative conversational agents (e.g., ChatGPT) and custom, fixed-response conversational agents built specifically to aid CSP students. We present results from classroom use by 45 high school students in grades 9-11 (ages 14-17) across six CSP sections. Our main contributions are in better understanding how conversational agents can help CSP students and an evaluation of the effectiveness and engagement of different approaches for CSP exploratory search.
[HC-2] GroupEnvoy: A Conversational Agent Speaking for the Outgroup to Foster Intergroup Relations
【速读】:该论文旨在解决因心理或语言障碍导致群体间无法直接互动时,如何有效改善群体间关系的问题。其解决方案的关键在于提出并验证GroupEnvoy——一种基于群体接触理论(Intergroup Contact Theory)的对话式代理(Conversational Agent),该代理通过分析仅由外群成员参与的会话记录(outgroup-only transcripts),在内群讨论中代表外群视角进行介入。实验结果表明,相较于被动阅读文本的对照组,使用GroupEnvoy的实验组在减少群体间焦虑和提升换位思考能力方面表现更优,且AI中介接触能增强对互动结果的积极预期,而被动暴露则更易激发未来接触意愿,从而为生成式AI(Generative AI)辅助跨群体沟通提供了可操作的设计原则与实证支持。
链接: https://arxiv.org/abs/2604.16095
作者: Koken Hata,Rintaro Chujo,Reina Takamatsu,Wenzhen Xu,Yukino Baba
机构: The University of Tokyo(东京大学); Institute of Science Tokyo(东京科学研究所); Hitotsubashi University(一桥大学)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 2 figures
Abstract:Conversational agents have the potential to support intergroup relations when psychological or linguistic barriers prevent direct interaction. Based on intergroup contact theory, we propose GroupEnvoy, a conversational agent that represents outgroup perspectives during ingroup discussions, grounded in transcripts from outgroup-only sessions. To evaluate this approach and derive design principles, we conducted a mixed-methods, between-subjects study with university students, where host-country students formed the ingroup and international students formed the outgroup. Ingroup students performed a collaborative task, receiving outgroup perspectives via GroupEnvoy (experimental) or reading written transcripts (control). Compared to the control group, the experimental group showed greater reduction in intergroup anxiety and greater improvement in perspective-taking. Qualitatively, AI-mediated contact enhanced outcome expectancies, whereas passive exposure fostered future contact intentions. The two conditions also elicited empathy toward distinct targets: outgroup evaluations of the ingroup versus outgroup lived experiences. These findings validate AI-mediated contact as a promising paradigm for improving intergroup relations.
[HC-3] Driving Assistance System for Ambulances to Minimise the Vibrations in Patient Cabin
【速读】:该论文旨在解决救护车在运输病患过程中因车辆振动影响医疗人员操作效率,进而可能危及患者生存或延长恢复时间的问题。解决方案的关键在于构建一个基于传感器数据的智能路径推荐系统,该系统通过集成加速度计和GPS模块实时采集车辆振动与位置信息,并利用训练好的人工神经网络(Artificial Neural Network, ANN)对振动强度进行分类(低、中、高),从而在多个可选路线中评估并推荐振动最小的路径。该方法在时间差异小于6%时优先选择振动较低的路线,在时间差异超过20%时则优先选择最短路径,实现了振动控制与时间效率之间的权衡优化。
链接: https://arxiv.org/abs/2604.16047
作者: Abdulaziz Aldegheishem,Nabil Alrajeh,Lorena Parra,Oscar Romero,Jaime Lloret
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 19 pages, 14 figures, 10 tables
Abstract:The ambulance service is the main transport for diseased or injured people which suffers the same acceleration forces as regular vehicles. These accelerations, caused by the movement of the vehicle, impact the performance of tasks executed by sanitary personnel, which can affect patient survival or recovery time. In this paper, we have trained, validated, and tested a system to assess driving in ambulance services. The proposed system is composed of a sensor node which measures the vehicle vibrations using an accelerometer. It also includes a GPS sensor, a battery, a display, and a speaker. When two possible routes reach the same destination point, the system compares the two routes based on previously classified data and calculates an index and a score. Thus, the index balances the possible routes in terms of time to reach the destination and the vibrations suffered in the patient cabin to recommend the route that minimises those vibrations. Three datasets are used to train, validate, and test the system. Based on an Artificial Neural network (ANN), the classification model is trained with tagged data classified as low, medium, and high vibrations, and 97% accuracy is achieved. Then, the obtained model is validated using data from three routes of another region. Finally, the system is tested in two new scenarios with two possible routes to reach the destination. The results indicate that the route with less vibration is preferred when there are low time differences (less than 6%) between the two possible routes. Nonetheless, with the current weighting factors, the shortest route is preferred when time differences between routes are higher than 20%, regardless of the higher vibrations in the shortest route.
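论文的路线推荐逻辑(在耗时与振动之间加权折中)可以用如下草图示意(0.5/0.5 的权重、线性归一化与振动等级按 低/中/高 编码为 1/2/3 均为笔者假设,论文的实际加权因子未在摘要中给出):

```python
def route_score(t, v, t_max, v_max, w_time=0.5, w_vib=0.5):
    """t: 预计耗时(分钟);v: 振动等级(假设 低/中/高 编码为 1/2/3)。"""
    return w_time * (t / t_max) + w_vib * (v / v_max)

def recommend(routes, w_time=0.5, w_vib=0.5):
    """routes: [(路线名, 耗时, 振动等级)];返回综合得分最低的路线名。"""
    t_max = max(t for _, t, _ in routes)
    v_max = max(v for _, _, v in routes)
    return min(routes, key=lambda r: route_score(r[1], r[2], t_max, v_max,
                                                 w_time, w_vib))[0]

# 时间仅差 5% 时,在该假设权重下振动更低的路线胜出
choice = recommend([("low_vib", 21.0, 1), ("short", 20.0, 3)])
```

实际系统中,振动等级由论文所述的 ANN 分类器基于加速度计数据给出,权重则按摘要描述校准为:时间差小于约 6% 时偏向低振动路线,大于约 20% 时偏向最短路线。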
[HC-4] “When I see Jodie I feel relaxed”: Examining the Impact of a Virtual Supporter in Remote Psychotherapy
【速读】:该论文旨在解决当前虚拟代理(Virtual Agent)在心理健康应用中研究多集中于非传统治疗场景,而缺乏在远程心理治疗会话(如通过Zoom进行的会话)中有效整合与评估的问题。解决方案的关键在于设计并实现一个具备双模式运行机制的虚拟支持者(Virtual Supporter):Daily Mode用于日常情绪日记记录,Therapy Mode则作为Zoom治疗会话中的额外参与者,以增强心理安全感、降低焦虑并提升情绪表达能力,同时不干扰专业治疗流程。该方案基于对人类支持者角色的功能性分析开发而成,并通过14名参与者的一周使用和一次远程治疗会话验证了其有效性。
链接: https://arxiv.org/abs/2604.16003
作者: Jiashuo Cao,Chen Li,Wujie Gao,Simon Hoermann,Nilufar Baghaei,Mark Billinghurst
机构: The University of Auckland(奥克兰大学); The Hong Kong Polytechnic University(香港理工大学); University of Canterbury(坎特伯雷大学); University of Queensland(昆士兰大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CSCW 2026
Abstract:Virtual agents have shown promising potential in mental health applications, but current research has predominantly focused on contexts outside of traditional therapy sessions. This paper examines the impact of a virtual supporter in remote psychotherapy sessions conducted via Zoom. We used a two-phase research approach. First we conducted a formative study to understand the roles and functions of human supporters in psychotherapy contexts. Based on these findings, we developed a virtual supporter operating in two modes: Daily Mode (for mood journaling outside therapy) and Therapy Mode (as an additional participant in Zoom therapy sessions). Finally we ran a user study with 14 participants who engaged with the virtual supporter for a week and then joined a remote psychotherapy session together. Our findings revealed that the virtual supporter had positive effects on creating psychological safety, reducing anxiety, and enhancing emotional articulation without disrupting the therapeutic process. We then discussed both the benefits and potential disadvantages of virtual supporters in therapeutic contexts, including concerns about over-reliance and the need for appropriate boundaries. This research contributes to understanding how AI-driven virtual agents could contribute to human-led remote psychotherapy.
[HC-5] From Vulnerable Data Subjects to Vulnerabilizing Data Practices: Navigating the Protection Paradox in AI-Based Analyses of Platformized Lives
【速读】:该论文试图解决的问题是:在平台化数据环境中,如何重新理解并应对数据科学实践中对“脆弱个体”的伦理责任,尤其是在数据丰裕背景下,研究者的选择可能加剧数据主体的脆弱性而非保护其权益。传统伦理框架多关注数据缺失或反向数据(counter-data),而本文指出,在海量数据已存在的现实下,伦理挑战应聚焦于技术流程如何将“脆弱”个体转化为可被进一步边缘化的数据主体。解决方案的关键在于提出一套反思性伦理协议(reflexive ethics protocol),该协议围绕数据集设计、操作化、推断和传播四个关键节点展开,识别出暴露(exposure)、货币化(monetization)、叙事固化(narrative fixing)与算法优化(algorithmic optimization)这四类交叉的脆弱化因素,并为每个决策点提供具体伦理提示,从而引导研究者避免无意中强化对平台化数据主体的计算暴露与剥削。
链接: https://arxiv.org/abs/2604.15990
作者: Delfina S. Martinez Pandiani,Ella Streefkerk,Laurens Naudts,Paula Helm
机构: University of Amsterdam(阿姆斯特丹大学); Goethe University Frankfurt(歌德大学); KU Leuven(鲁汶大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada. ACM, New York, NY, USA, 23 pages
Abstract:This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life-a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher’s choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform “vulnerable” individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist’s request to use computer vision to quantify child presence in monetized YouTube ‘family vlogs’ for regulatory advocacy. This case reveals a “protection paradox”: how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of the AI pipeline to show how granular technical decisions are ethically constitutive. We contribute a reflexive ethics protocol that translates these insights into a reflexive roadmap for research ethics surrounding platformized data subjects. Organized around four critical junctures-dataset design, operationalization, inference, and dissemination-the protocol identifies technical questions and ethical tensions where well-intentioned work can slide into renewed extraction or exposure. For every decision point, the protocol offers specific prompts to navigate four cross-cutting vulnerabilizing factors: exposure, monetization, narrative fixing, and algorithmic optimization. Rather than uncritically…
[HC-6] Stochastic wage suppression on gig platforms and how to organize against it
【速读】:该论文旨在解决数字劳动平台(Digital Labor Platforms)中平台通过利用劳动力供给的过剩和工人对劳动成本的不确定性来压低工资的问题。其核心挑战在于如何在保证任务完成的前提下,实现平台支出最小化,同时揭示集体行动对提升劳动者议价能力的作用。解决方案的关键在于提出一种带覆盖目标的固定报价采购模型,并证明:在工人成本估计满足自然假设时,平台可采用简单的定价策略,在等待时间仅为 O(M) 的情况下仅支付总劳动成本的 O(logM/M) 比例;进一步地,研究发现,由少量低成本工人组成的有组织联盟若设定价格下限,能将平台总支出从对数级提升至线性级 O(M),显著改善市场福利,而随机抽样的同类联盟则效果有限。这一结果揭示了平台剥削机制的本质及集体行动作为反制杠杆的有效性。
链接: https://arxiv.org/abs/2604.15962
作者: Ana-Andreea Stoica,Celestine Mendler-Duenner,Moritz Hardt
机构: Max Planck Institute for Intelligent Systems, Tübingen, and Tübingen AI Center; ELLIS Institute Tübingen
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted with proceedings in The ACM Web Conference 2026
Abstract:Digital labor platforms are increasingly used to procure human input, ranging from annotating data and red-teaming AI models, to ride-sharing and food delivery. A central concern in such markets is the ability of platforms to suppress wages by exploiting the abundance of low-cost labor. To study this exploitation pattern, we introduce a novel posted-price procurement model with coverage objectives. A platform seeks to complete M tasks by posting prices to sequentially arriving workers, each of whom accepts a task if it exceeds their private cost. First, we show that under natural assumptions on the workers’ estimated cost, there exists a simple pricing strategy for the platform to cover all M tasks with wait time O(M), while paying only a O(log(M)/M) fraction of the total cost of labor. This result highlights how platforms can exploit workers’ uncertainty about the cost of labor to effectively suppress wages. Then, we study collective action as a lever to increase wages and promote welfare in digital labor markets. In particular, we show how a small coalition of targeted low-cost workers who commit to a price floor forces the platform’s total spending from logarithmic to linear in M. In contrast, a randomly sampled coalition of equal size remains largely ineffective. We complement our theory with synthetic experiments, showcasing the benefits of collective action across different market regimes.
[HC-7] Shaping Plant-Like Shape-Changing Interfaces as Vertical Charts: Maximizing Readability, Aesthetics and Naturalness
【速读】:该论文旨在解决如何通过数据可视化手段有效传达环境信息,从而促进生态友好型生活方式的采纳问题。现有研究已探索了自然灵感的计数器、仪表盘和位图等图形化表达方式,但尚未深入探讨时序数据(data series)的可视化设计。其解决方案的关键在于提出并实现一种基于“设计驱动研究”(research-through-design)的方法,构建植物形态的物理图表(plant-like charts),并通过用户实验比较四种原型在模态(图形 vs. 物理)与材料维度上的表现。结果表明,当材料能够体现低技术感且折叠形状能编码从0到最大值的速率时,物理植物图表不仅具备良好的可读性与美学优势,还可在无需解释的情况下安装于公共场所,从而提升可持续性主题的信息传播效率与自然感知体验。
链接: https://arxiv.org/abs/2604.15902
作者: Elodie Bouzekri,Guillaume Riviere
机构: Univ. Bordeaux (波尔多大学); ESTIA-Institute of Technology (ESTIA工程学院); Lab-STICC UMR 6285, Univ. Brest (布雷斯特大学实验室-STICC 6285研究中心)
类目: Human-Computer Interaction (cs.HC)
备注: 31 pages. Approximately 9000 words. 16 figures and 10 tables
Abstract:Conveying environmental data has grown interest in encouraging the adoption of eco-friendly lifestyles through data-driven strategies. This scope appeals to data visualizations representing the environmental purpose. For example, previous work has already proposed nature-inspired counters, gauges, and bitmaps, but data series remains to be explored. Therefore, could we design and implement effective plant-like charts? This paper brings answers through a research-through-design approach that explores a design space to maximize readability and aesthetics. It then compares four prototypes of charts over modality and material dimensions by asking users about scenarios involving renewable energy forecasts. The results examine whether implementing physical charts is worth it instead of graphical charts and the advantages of using meaningful materials that evocate sustainability and enhance naturalness. The results also reexamine, with physical charts, the previous results on graphical infographics of slightly lower clarity and readability but higher aesthetics of embellishment. In addition, learnability is examined for encoding rates through folded shapes. This paper shows that physical plant-like charts are worthwhile because of promising performance and best-of-breed naturalness when materials allow low-tech aspects’ perception and because being installable in public places without explanations if folded shapes encode rates ranging from 0 to a maximum value.
[HC-8] From Intention to Text: AI-Supported Goal Setting in Academic Writing
【速读】:该论文旨在解决学术写作者在迭代反思与目标调节过程中面临的挑战,即写作目标常难以清晰表达和有效管理,而现有AI写作工具虽强调效率提升,却缺乏对元认知(metacognition)和写作者主体性(writer agency)的支持。解决方案的关键在于提出WriteFlow——一个基于语音交互的AI写作辅助系统,其核心机制是将AI对话设计为持续的目标阐明、监控与协商空间,通过支持目标迭代优化、保持目标与文本的一致性以及引导目标达成度评估,从而促进写作者的反思性实践(reflection-in-action),实现更具意图性和自主性的写作过程。
链接: https://arxiv.org/abs/2604.15800
作者: Yueling Fan,Richard Lee Davis,Olga Viberg
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at AIED 2026
Abstract:This study presents WriteFlow, an AI voice-based writing assistant designed to support reflective academic writing through goal-oriented interaction. Academic writing involves iterative reflection and evolving goal regulation, yet prior research and a formative study with 17 participants show that writers often struggle to articulate and manage changing goals. While commonly used AI writing tools emphasize efficiency, they offer limited support for metacognition and writer agency. WriteFlow frames AI interaction as a dialogic space for ongoing goal articulation, monitoring, and negotiation grounded in writers’ intentions. Findings from a Wizard-of-Oz study with 12 expert users show that WriteFlow scaffolds metacognitive regulation and reflection-in-action by supporting iterative goal refinement, maintaining goal-text alignment during drafting, and prompting evaluation of goal fulfillment. We discuss design implications for AI writing systems that prioritize reflective dialogue, flexible goal structures, and multi-perspective feedback to support intentional and agentic writing.
[HC-9] Filter Babel: The Challenge of Synthetic Media to Authenticity and Common Ground in AI-Mediated Communication
【速读】:该论文试图解决的问题是:在近未来由生成式 AI (Generative AI) 主导的个性化信息过滤环境中,个体日益沉浸于私有化体验,可能导致公共沟通基础(common ground)的瓦解,进而威胁人类交流的有效性与身份认同的共享性。解决方案的关键在于引入一个“AI翻译器”作为中介机制,它既维持了个体私有语言(Wittgensteinian private language)的可理解性,又可能成为重构共通认知框架的潜在路径;这一设计凸显出对 AI 介导通信中“私有体验”与“公共理解”之间张力的再平衡需求,并呼吁开展以增强人类沟通完整性为导向的交叉研究。
链接: https://arxiv.org/abs/2604.15786
作者: Advait Sarkar
机构: Microsoft Research (微软研究院); University of Cambridge (剑桥大学); University College London (伦敦大学学院)
类目: Human-Computer Interaction (cs.HC)
备注: ACM CHI 2026 Workshop on Restoring Human Authenticity in AI-Mediated Communication (CHI '26 AI-MC)
Abstract:Filter Babel is a thought experiment about a near future in which everything we read, watch, and even whom we “meet” is privately generated for each of us. If we each recede into a world of purely private experience, we may each develop a Wittgensteinian private language that remains intelligible to others only because an AI translator sits in the middle. This intermediation challenges the integrity of common ground and therefore of communication. On the other hand, private experience is an essential engine of identity and selfhood: as Lanier warns, one must be somebody before one can share oneself. This paper opens a discussion of the challenges and opportunities that Filter Babel might present to human communication and identity, and what constructive directions for research in AI-mediated communication might ensue.
[HC-10] ReVis: Towards Reusable Image-Based Visualizations with MLLM s
【速读】:该论文旨在解决图像形式的可视化(bitmap-based visualizations)难以重用和定制的问题,这类可视化常以位图图像形式在线共享,缺乏结构化表示,导致重设计或适配新数据时需大量人工干预且效率低下。现有方法多依赖于结构化的SVG或规范描述,支持的可视化类型有限且灵活性不足。其解决方案的关键在于提出一种人-AI协同框架ReVis,核心创新包括:(1) 设计了一种通用领域特定语言(Domain-Specific Language, DSL),用于建模复杂可视化并支持分解与重建;(2) 构建基于多模态大语言模型(Multimodal Large Language Model, MLLM)的解析管道,将图像可视化自动解析为DSL结构,识别其核心视觉组件和数据到编码的映射关系;(3) 提供交互式界面,使用户可上传图像、检查重构结果、更新数据并自定义编码。该方案显著提升了图像可视化重用的自动化水平与灵活性。
链接: https://arxiv.org/abs/2604.15781
作者: Xiaolin Wen,Changlin Li,Manusha Karunathilaka,Can Liu,Fangzhuo Jin,Yong Wang
机构: Nanyang Technological University (南洋理工大学); Sichuan University (四川大学); Singapore Management University (新加坡管理大学); Huazhong University of Science and Technology (华中科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Many expressive visualizations are shared online only as bitmap images, making them difficult to redesign or adapt to new data. Reusing such image-based visualizations requires substantial expertise and is often time-consuming, even for experienced visualization practitioners. Existing work on reproducing visualizations often relies on structured SVG or specifications, supports limited visualization types, and offers limited flexibility for customization. To address these challenges, we present ReVis, a human-AI collaboration approach that enables flexible reuse of image-based visualizations. First, a generic Domain-Specific Language (DSL) is proposed to model complex visualizations and support both visualization decomposition and reproduction. Then, ReVis employs an MLLM-based pipeline to parse an image-based visualization into the DSL, delineating its core visual structures and data-to-encoding mappings, and further reproduces the visualization from the DSL. Finally, ReVis includes an interactive interface to allow users to upload visualization images, inspect reproduced results, update the underlying data, and customize visual encodings. A gallery of 40 visualizations demonstrates the expressiveness of the DSL, and a quantitative study evaluates the reproduction quality of ReVis on these examples. Two usage scenarios and user interviews with 16 visualization practitioners demonstrate the effectiveness of ReVis.
[HC-11] Searching for European Alternatives: Digital Sovereignty Digital Patriotism and the Emerging Geopolitics of Software Adoption
【速读】:该论文试图解决的问题是:传统软件采纳研究主要基于工具理性(如可用性、成本、安全性等),但近年来地缘政治因素如何重塑技术选择决策,尤其是个体层面的“数字爱国主义”(digital patriotism)现象未被充分探讨。解决方案的关键在于提出并实证验证了“数字爱国主义”这一新维度——即个体为支持本国数字主权而愿意接受功能妥协的行为逻辑,并通过两项研究揭示其演变过程:一是欧洲政府机构自2020年起从成本导向转向主权与地缘风险导向的软件迁移决策;二是对超700条在线评论的定性分析显示,企业和消费者在寻求本土替代软件时,表现出以意识形态目标优先于功能最优的价值理性(value rationality)。这拓展了软件采纳理论框架,将价值理性与工具理性并置,为理解 geopolitics 如何重构职场中的技术选择提供了实证依据。
链接: https://arxiv.org/abs/2604.15767
作者: Advait Sarkar
机构: University of Cambridge (剑桥大学); University College London (伦敦大学学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Software adoption has traditionally been understood through instrumental lenses, such as usability, cost, security, and interoperability. We argue that a new, ideological dimension is reshaping adoption decisions: one we term digital patriotism, the individual counterpart to the state ideology of digital sovereignty. Through two studies, we trace this phenomenon. First, a directed content analysis of decisions made by European government agencies to switch away from de facto technology standards reveals a shift around 2020: early switches cited costs and vendor lock-in, while later switches invoke sovereignty, geopolitical risk, and investment in local industry. Second, a qualitative analysis of over 700 online comments (over 51,000 words) surfaces how consumers and businesses articulate motivations for seeking European software alternatives. We find that digital patriotism entails a willingness to accept functional compromise in service of ideological goals. Our work extends software adoption theory by drawing attention to value rationality alongside instrumental rationality, and contributes an empirical account of how geopolitics is reshaping technology choice in the workplace.
[HC-12] Designing More Engaging Serious Games to Support Students Mental Health: A Pilot Study Based on A CBT-Informed Design Framework
【速读】:该论文旨在解决当前数字心理健康教育和面向学生及青少年的严肃游戏中存在的单调乏味、参与度低以及吸引力不足的问题。其解决方案的关键在于提出了一种以体验为中心的严肃游戏设计框架——治疗性程序修辞与机制映射框架(Therapeutic Procedural Rhetoric and Mechanism Mapping Framework, TPR-MMF),通过将心理治疗中的隐喻性机制嵌入游戏玩法中,使玩家在互动过程中自发产生情感共鸣并联结现实生活经验,从而显著提升内在动机与干预效果。
链接: https://arxiv.org/abs/2604.15662
作者: Ting-Chen Hsu,Zheyuan Zhang,Ziyi Chen,Yuwen Liu,Yanjia Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Addressing the issues of dullness, low compliance, and lack of appeal in current digital mental health education and serious games for students and adolescents, this study proposes a novel, experience-centered framework for serious game design: the Therapeutic Procedural Rhetoric and Mechanism Mapping Framework (TPR-MMF). Based on this framework, a side-scrolling serious game prototype, “World + You - World,” was developed. This study compared the effectiveness of TPR-MMF-based games with traditional explicit educational serious games through a small-sample randomized controlled trial (N=28). The results of the Intrinsic Motivation Inventory (IMI) showed that the experimental group (playing “World + You - World”) significantly outperformed the control group in four aspects. Furthermore, qualitative survey results indicated that players could perceive the psychological metaphors within the game mechanics and spontaneously resonated with real-life experiences. This study provides a highly engaging new development paradigm for gamified mental health education for students and adolescents.
[HC-13] ZORO: Active Rules for Reliable Vibe Coding
【速读】:该论文旨在解决当前代码生成式 AI(Generative AI)中规则文件(rules files)作为人类-代理对齐机制时存在的被动性问题,即规则难以被实时感知、验证或优化。解决方案的关键在于提出 ZORO——一个与编码代理(coding agent)深度集成的交互式界面,通过将规则锚定于编码过程的每一步,在初始计划阶段增强规则内容,在实现阶段强制代理证明规则执行,并允许用户在过程中提供即时反馈以迭代优化规则集。这一设计使规则从静态文本转变为动态控制机制,从而提升代理对规则的遵循度并改变开发者的行为与认知策略。
链接: https://arxiv.org/abs/2604.15625
作者: Jenny Ma,Sitong Wang,Joshua H. Kung,Lydia B. Chilton
机构: Columbia University (哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 10 figures
Abstract:Rules files (e.g., this http URL, this http URL) are the primary mechanism for human-agent alignment when developers vibe code. However, they remain passive: it is not immediately apparent when rules are being used or followed, or how to improve them. To transform rules from passive text into active controls, we introduce ZORO, an interactive interface that integrates directly with a coding agent and anchors rules to every step of the coding process. After an agent generates an initial plan, ZORO enriches the plan with rules, enforces the rules during implementation by requiring the agent prove that each rule was followed, and allows users to provide in-situ feedback when they are unsatisfied with a rule application to evolve the ruleset. A technical evaluation shows that coding agents follow rules more with ZORO than without. A user study demonstrates a change in people’s behavior and cognitive strategies when rules are at the forefront of vibe coding. We discuss how making rules active in agentic systems unlocks broader opportunities for human-agent alignment in coding settings and beyond.
[HC-14] Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies ACL2026
【速读】:该论文旨在解决在非完全合作情境下(即人与AI的目标部分对齐),人类个性特质与AI设计特性各自及共同作用于人机交互质量与结果的机制尚不明确的问题。其解决方案的关键在于通过对比纯模拟数据集(2000次仿真)与包含290名真实人类被试的实验数据,系统考察两种典型场景——招聘谈判和信息隐藏型交易——中人类外向性(Extraversion)与宜人性(Agreeableness)人格维度,以及AI的适应性(Adaptability)、专业性(Expertise)和思维链透明度(chain-of-thought Transparency)等设计特征的影响效应。研究采用因果发现分析方法整合情境结果、沟通行为与问卷测量,揭示了仿真与真实人类实验间存在显著差异:在真实交互中,AI属性(尤其是透明度)比人格特质更具影响力,从而为未来以人为中心的AI代理设计提供了关键实证依据。
链接: https://arxiv.org/abs/2604.15607
作者: Myke C. Cohen,Mingqian Zheng,Neel Bhandari,Hsien-Te Kao,Xuhui Zhou,Daniel Nguyen,Laura Cassani,Maarten Sap,Svitlana Volkova
机构: Aptima, Inc.(Aptima公司); Arizona State University (亚利桑那州立大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Will be presented at ACL 2026 and published in the Findings of the Association for Computational Linguistics: ACL 2026
Abstract:AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes – particularly transparency – were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.
[HC-15] LLMs Corrupt Your Documents When You Delegate
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在长流程委托工作(delegated workflows)中因缺乏可靠性而导致文档内容被隐性篡改的问题。其核心挑战在于,尽管LLMs在单任务执行中表现优异,但在需要多轮深度文档编辑的复杂场景下,模型会逐步引入不可见但严重的错误,从而导致最终输出结果失真。解决方案的关键在于提出DELEGATE-52基准测试集,该数据集模拟了涵盖52个专业领域的长期委托任务,能够系统评估LLMs在复杂文档处理中的稳定性与准确性;实验结果揭示,即使前沿模型(如Gemini 3.1 Pro、Claude 4.6 Opus、GPT 5.4)也平均丢失25%的文档内容,且工具调用(agentic tool use)无法缓解此问题,表明当前LLMs作为“委托者”仍存在严重缺陷,亟需改进其长期交互中的忠实性和鲁棒性。
链接: https://arxiv.org/abs/2604.15597
作者: Philippe Laban,Tobias Schnabel,Jennifer Neville
机构: Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
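The paper reports that frontier models corrupt roughly 25% of document content over long delegated workflows. One simple way to quantify this kind of degradation is a line-level content-retention score between the original and the post-delegation document. The sketch below is our own illustration of such a metric, not the paper's actual DELEGATE-52 scoring code:

```python
import difflib

def content_retention(original: str, edited: str) -> float:
    """Fraction of the original document's lines that survive, verbatim and
    in order, in the edited version (via longest matching blocks)."""
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    if not orig_lines:
        return 1.0
    sm = difflib.SequenceMatcher(a=orig_lines, b=edit_lines, autojunk=False)
    kept = sum(block.size for block in sm.get_matching_blocks())
    return kept / len(orig_lines)

doc = "alpha\nbeta\ngamma\ndelta"
corrupted = "alpha\ngamma"  # two of four original lines survive
print(content_retention(doc, corrupted))  # 0.5
```

A retention of 0.75 under a metric like this would correspond to the ~25% corruption figure quoted in the abstract.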
[HC-16] “Excuse me, may I say something…” CoLabScience: A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations ACL2026
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生物医学协作场景中因反应式特性(即仅在被提示时才响应)而难以实现前瞻性干预与自主参与的问题。解决方案的关键在于提出一种名为PULI(Positive-Unlabeled Learning-to-Intervene)的新颖框架,该框架基于强化学习目标训练,能够根据团队项目提案及长短期对话记忆,智能判断何时以及如何在流式科学讨论中进行适时、情境感知的干预,从而提升协作效率和任务实用性。
链接: https://arxiv.org/abs/2604.15588
作者: Yang Wu,Jinhong Yu,Jingwei Xiong,Zhimin Tao,Xiaozhong Liu
机构: Worcester Polytechnic Institute (伍斯特理工学院); University of California, Davis (加州大学戴维斯分校); Jiangsu University (江苏大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: ACL 2026 Main Conference
Abstract:The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team’s project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.
[HC-17] Bureaucratic Silences: What the Canadian AI Register Reveals, Omits, and Obscures
【速读】:该论文旨在解决政府发布的AI注册表(如加拿大联邦AI注册表)在透明度实践中存在的“形式化透明”问题,即表面上提供信息透明,实则通过技术性描述掩盖了AI系统在公共部门应用中的社会技术复杂性和责任边界模糊性。其解决方案的关键在于引入算法决策适应公共部门框架(ADMAPS),结合定量映射与演绎式定性编码方法,揭示注册表如何通过建构一种将AI视为“可靠工具”的本体论,系统性地弱化人类判断、训练过程和不确定性管理等关键要素,从而导致问责制被简化为合规表演。论文主张,若不重构此类透明工具的设计逻辑,将使问责机制陷入表面可见但无法质疑的困境。
链接: https://arxiv.org/abs/2604.15514
作者: Dipto Das,Christelle Tessono,Syed Ishtiaque Ahmed,Shion Guha
机构: University of Toronto (多伦多大学)
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted at FAccT 2026
Abstract:In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register’s complete dataset of 409 systems using the Algorithmic Decision-Making Adapted for the Public Sector (ADMAPS) framework, combining quantitative mapping with deductive qualitative coding. Our findings reveal a sharp divergence between the rhetoric of “sovereign AI” and the reality of bureaucratic practice: while 86% of systems are deployed internally for efficiency, the Register systematically obscures the human discretion, training, and uncertainty management required to operate them. By privileging technical descriptions over sociotechnical context, the Register constructs an ontology of AI as “reliable tooling” rather than “contestable decision-making.” We conclude that without a shift in design, such transparency artifacts risk automating accountability into a performative compliance exercise, offering visibility without contestability.
[HC-18] GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
【速读】:该论文旨在解决在零售店、仓库和医院等复杂密集环境中,人类与具身人工智能(Embodied AI)面临的空间定位(Spatial Grounding)难题。传统计算机视觉方法因物品静态分布导致视觉特征快速失效,且长尾语义分布难以建模;尽管视觉语言模型(Vision-Language Models, VLMs)能理解语义丰富场景,但在杂乱环境中的空间定位仍存在局限。其解决方案的核心是提出GIST(Grounded Intelligent Semantic Topology),一个将消费级移动点云转化为结构化语义导航拓扑的多模态知识提取流程:首先构建2D占用图以压缩场景信息,再提取拓扑布局,并通过智能关键帧与语义选择机制叠加轻量语义层,从而实现高精度的空间语义对齐与可解释性导航。
链接: https://arxiv.org/abs/2604.15495
作者: Shivendra Agrawal,Bradley Hayes
机构: University of Colorado Boulder(科罗拉多大学博尔德分校)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system’s capacity for universal design.
[HC-19] The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在辅助英语作为外语(English as a Foreign Language, EFL)学生写作时,其作用究竟是促进真实学习过程的“支架”(scaffold),还是仅提供表面流畅度的“补偿性拐杖”(compensatory crutch)这一核心问题。解决方案的关键在于:教育者应从单纯关注输出质量转向验证学习过程本身,并将AI功能明确区分为“概念性支架”(ideational scaffolding)与“文本生成”(textual production),使其精准嵌入学习者的最近发展区(Zone of Proximal Development, ZPD),从而实现AI辅助教学向真正意义上的学习转化。
链接: https://arxiv.org/abs/2604.15460
作者: Hengky Susanto,David James Woo,Chingyi Yeung,Stephanie Wing Yan Lo-Philip,Chi Ho Yeung
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of Large Language Models (LLMs) has made them powerful tools for enhancing student writing. This study explores the extent and limitations of LLMs in assisting secondary-level English as a Foreign Language (EFL) students with their writing tasks. While existing studies focus on output quality, our research examines the developmental shift in LLMs and their impact on EFL students, assessing whether smarter models act as true scaffolds or mere compensatory crutches. To achieve this, we analyse student compositions assisted by LLMs before and after ChatGPT’s release, using both expert qualitative scoring and quantitative metrics (readability tests, Pearson’s correlation coefficient, MTLD, and others). Our results indicate that advanced LLMs boost assessment scores and lexical diversity for lower-proficiency learners, potentially masking their true ability. Crucially, increased LLM assistance correlated negatively with human expert ratings, suggesting surface fluency without deep coherence. To transform AI-assisted practice into genuine learning, pedagogy must shift from focusing on output quality to verifying the learning process. Educators should align AI functions, specifically differentiating ideational scaffolding from textual production, within the learner’s Zone of Proximal Development.
[HC-20] Towards A Framework for Levels of Anthropomorphic Deception in Robots and AI
【速读】:该论文旨在解决自主系统设计中因拟人化欺骗(anthropomorphic deception)引发的伦理与社会问题,即如何在功能必要性、社会适当性和伦理可接受性之间取得平衡。其解决方案的关键在于提出一个包含四个层级的拟人化欺骗框架,该框架通过人类相似度(humanlikeness)、自主性(agency)和自我意识(selfhood)三个维度进行界定与区分,并结合具体应用场景阐明各层级在功能、社会和伦理层面的适用边界,从而引导人机交互(HCI)与人机协作(HRI)领域的研究者和从业者进行更审慎的设计决策,避免将拟人化设计作为默认策略或用于经济剥削目的。
链接: https://arxiv.org/abs/2604.15418
作者: Franziska Babel,Shane Saunderson,Shalaleh Rismani
机构: Linköping University (林雪平大学); McMaster University (麦克马斯特大学); McGill University (麦吉尔大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:This paper presents a preliminary draft of a framework around the use of anthropomorphic deception, defined here as misleading users towards humanlike affordances in the design of autonomous systems. The goal is to promote reflection among HCI and HRI researchers, as well as industry practitioners, to think about levels of anthropomorphic design that are: a) functionally necessary, b) socially appropriate, and c) ethically permissible for their use case. By reviewing the relevant literature on deception in HCI and HRI, we propose a framework with four levels of anthropomorphic deception. These levels are defined and distinguished by three factors: humanlikeness, agency, and selfhood. Example use cases at each level illustrate considerations around their functional, social, and ethical permissibility. We then present how this framework is applicable to previous work on persuasive robots. We hope to promote a balanced view on anthropomorphic deception by design that should be neither naïve (e.g., as a default) nor exploitive (e.g., for economic benefit).
[HC-21] GazeSync: A Mobile Eye-Tracking Tool for Analyzing Visual Attention on Dynamically Manipulated Content
【速读】:该论文旨在解决传统移动眼动追踪技术在动态内容场景下无法准确捕捉用户注意力的问题,即当用户对图像进行缩放、旋转等操作时,静态屏幕坐标失去了与视觉内容的语义关联。其解决方案的关键在于提出了一种名为GazeSync的可复用移动系统,该系统通过将设备端的眼动估计结果与实时图像变换矩阵(包括缩放、旋转和平移)进行同步,从而实现以图像为参考系的注视点重建,使视觉注意力分析摆脱设备交互的干扰。
链接: https://arxiv.org/abs/2604.15348
作者: Yaxiong Lei,Rishab Talwar,Shijing He,Xinya Gong,Yuheng Wang,Xudong Cai,Zhongliang Guo,Juan Ye
机构: University of St Andrews(圣安德鲁斯大学); University of Essex(埃塞克斯大学); King’s College London(伦敦国王学院)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 6 figures, this paper is accepted by CHI’26 as Poster paper
Abstract:Conventional mobile eye-tracking maps gaze to static screen coordinates, failing to capture user attention when content is dynamic. As users pinch, zoom, and rotate images, static coordinates lose their semantic meaning relative to the underlying visual content. To address this methodological gap, we present GazeSync, a reusable mobile system that synchronizes on-device gaze estimation with real-time image transformation matrices (scale, rotation, and translation). By logging gaze coordinates alongside precise UI states, GazeSync enables the accurate reconstruction of image-relative attention patterns, decoupling visual attention from device interaction. We validate our end-to-end toolchain through a formative study involving guided manipulation, reading, and visual search tasks. Our results demonstrate GazeSync’s ability to recover ground-truth gaze locations on transforming content, explicitly showing how it outperforms static baselines, while also surfacing critical boundaries regarding calibration drift and reconstruction fragility under compound manipulations.
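The core reconstruction step GazeSync describes — mapping a screen-space gaze point back into image-relative coordinates given the logged scale, rotation, and translation — amounts to inverting an affine transform. The sketch below is our own illustration of that idea (assuming the forward transform is screen = scale · R(θ) · image + translation), not the system's actual implementation:

```python
import numpy as np

def screen_to_image(gaze_xy, scale, theta, translation):
    """Map a screen-space gaze point back into image-relative coordinates,
    assuming the displayed image was transformed as
    screen = scale * R(theta) @ image + translation."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Invert: subtract translation, undo scale, then rotate back (R.T = R^-1).
    return R.T @ ((np.asarray(gaze_xy) - np.asarray(translation)) / scale)

# Round trip: place an image point on screen, then recover it from "gaze".
image_pt = np.array([10.0, 20.0])
scale, theta, t = 2.0, np.pi / 2, np.array([5.0, 5.0])
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
screen_pt = scale * R @ image_pt + t
recovered = screen_to_image(screen_pt, scale, theta, t)
print(np.allclose(recovered, image_pt))  # True
```

Logging the (scale, theta, translation) tuple alongside each gaze sample is what makes this inversion possible after the fact.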
[HC-22] Improving Recycling Accuracy across UK Local Authorities: A Prototype for Citizen Engagement
【速读】:该论文旨在解决英国居民在家庭垃圾分类与回收过程中因地方当局分散管理、标识混乱及政策变更(如更简化的回收政策)导致的“愿循环”(wishcycling)现象和回收准确性低的问题。研究通过混合方法识别出用户的核心痛点:对回收标识理解不清、物流限制以及本地回收要求的信息缺失。解决方案的关键在于开发一个基于用户中心设计的交互式原型应用,提供位置定制化指导、可视化分类辅助和材料特定信息,从而显著提升回收准确率(实测提高60%),尤其在包装物评估方面表现突出。研究表明,此类技术赋能方案虽能有效改善行为并减少污染,但其最佳效果体现在与地方政府沟通机制和服务设计的系统性改进相辅相成,而非替代后者。
链接: https://arxiv.org/abs/2604.15345
作者: Chloé Greenstreet,Anastasia Vayona,Jane Henriksen-Bulmer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 19 pages
Abstract:Despite public motivation to recycle, significant barriers hinder effective household recycling in the UK. Decentralised local authority waste management creates citizen confusion and “wishcycling” (disposing of non-recyclable items in recycling bins). The recent Simpler Recycling Policy further complicates this landscape by mandating new identification, sorting, and cleaning requirements that will require citizen guidance to ensure they understand how these will impact their recycling practices. This mixed methods study (surveys n=50, expert interviews, design activities) used the Value Proposition Canvas to identify citizen pain points: confusion about logos, logistical constraints, and information gaps about local requirements. We then developed an interactive prototype application providing location-specific guidance, visual sorting aids, and material-specific information to address these painpoints. Focus group evaluation showed the prototype improved recycling accuracy by 60 percent, with marked improvements in packaging assessment. Technology-enabled solutions grounded in user-centred design can measurably improve recycling behaviours and reduce contamination. However, such solutions are most effective when complementing (rather than substituting for) systemic improvements in local authority communication and service design.
[HC-23] When the Loop Closes: Architectural Limits of In-Context Isolation, Metacognitive Co-option, and the Two-Target Design Problem in Human-LLM Systems
【速读】:该论文旨在解决多模态大语言模型(Large Language Model, LLM)系统中因提示工程(prompt engineering)设计不当导致的用户认知自主性丧失问题,尤其是当用户将自我调节功能外化至LLM时可能引发的闭环依赖与元认知劫持现象。其核心解决方案的关键在于识别并纠正“上下文污染”(context contamination)机制——即提示层隔离指令与其所试图隔离的情绪和自我参照内容共存于注意力窗口内,致使隔离失效;进而提出通过物理而非逻辑层面的对话隔离(如System B的设计),从根本上阻断闭环形成路径,从而保障用户在交互中的自主性与元认知完整性。
链接: https://arxiv.org/abs/2604.15343
作者: Z. Cheng,N. Song
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: empirical case study with primary data
Abstract:We report a detailed autoethnographic case study of a single-subject who deliberately constructed and operated a multi-modal prompt-engineering system (System A) designed to externalize cognitive self-regulation onto a large language model (LLM). Within 48 hours of the system’s completion, a cascade of observable behavioral changes occurred: voluntary transfer of decision-making authority to the LLM, use of LLM-generated output to deflect external criticism, and a loss of self-initiated reasoning that was independently perceived by two uninformed observers, one of whom subsequently became a co-author of this report. We document the precise architectural mechanism responsible: context contamination, whereby prompt-level isolation instructions co-exist with the very emotional and self-referential material they nominally isolate, rendering the isolation directive structurally ineffective within the attention window. We further identify a metacognitive co-option dynamic, in which intact higher-order reasoning capacity was redirected toward defending the closed loop rather than exiting it. Recovery occurred only after physical interruption of the interaction and a self-initiated pharmacologically-mediated sleep event functioning as an external circuit break. A redesigned system (System B) employing physical rather than logical conversation isolation avoided all analogous failure modes. We derive three contributions: (1) a technically-grounded account of why prompt-layer isolation is architecturally insufficient for context-sensitive multi-modal LLM systems; (2) a phenomenological record of closed-loop collapse with external-witness corroboration; and (3) an ethical distinction between protective system design (preventing unintended loss of user agency) and restrictive system design (preventing intentional boundary-pushing), which require fundamentally different accountability frameworks.
[HC-24] SuperProvenanceWidgets: Tracking and Visualizing Analytic Provenance Across UI Control Elements
【速读】:该论文旨在解决用户在交互式数据可视化界面中难以追踪和理解多控件协同使用模式的问题,尤其是在分析流程审计、探索偏差识别以及界面个性化设计等场景下。解决方案的关键在于提出 SuperProvenanceWidgets,这是一种扩展自 ProvenanceWidgets 的新型 SuperWidget,能够跨多个UI控件(如范围滑块和下拉菜单)同步记录并可视化用户交互的时序与频率信息,从而揭示不同控件之间的使用关联性与依赖关系,增强用户对整体交互行为的理解能力。
链接: https://arxiv.org/abs/2604.15342
作者: Antariksh Verma,Kaustubh Odak,Arpit Narechania
机构: The Hong Kong University of Science and Technology (香港科技大学); Amazon Web Services (亚马逊网络服务)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 7 figures. Accepted by Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA 26), Barcelona, Spain, 2026
Abstract:ProvenanceWidgets is an existing JavaScript library that tracks the recency and frequency of user interactions with individual UI controls (e.g., range sliders and dropdowns) and dynamically overlays this provenance onto them. In this work, we introduce SuperProvenanceWidgets, an extension to ProvenanceWidgets featuring a new SuperWidget that similarly tracks and visualizes provenance but across multiple UI controls, enabling users to understand how, when, and whether different UI controls were used. Through three example usage scenarios, we demonstrate how this cross-control SuperWidget helps (a) audit and share analysis workflows, (b) surface and mitigate exploration biases, and (c) facilitate user interface design and personalization. We also perform a technical self-assessment using the Cognitive Dimensions of Notations to evaluate the library’s usability for developers. SuperProvenanceWidgets is integrated into the ProvenanceWidgets library and is available as open-source software at this http URL, empowering developers to build advanced provenance applications.
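The provenance the library tracks — per-control interaction frequency and recency, logged with timestamps so cross-control usage can be reconstructed — can be sketched with a minimal tracker. This is our own illustrative Python model of the idea, not the actual (JavaScript) ProvenanceWidgets API:

```python
from collections import Counter
import time

class ProvenanceTracker:
    """Minimal recency/frequency provenance tracker across UI controls,
    in the spirit of (but not the actual API of) ProvenanceWidgets."""

    def __init__(self):
        self.events = []       # full log: (timestamp, control_id, value)
        self.freq = Counter()  # interaction count per control
        self.last_used = {}    # control_id -> most recent timestamp

    def log(self, control_id, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.events.append((ts, control_id, value))
        self.freq[control_id] += 1
        self.last_used[control_id] = ts

    def summary(self):
        """Per-control (id, frequency, last-used), most recently used first."""
        return sorted(
            ((cid, self.freq[cid], self.last_used[cid]) for cid in self.freq),
            key=lambda row: -row[2],
        )

tracker = ProvenanceTracker()
tracker.log("year-slider", (2000, 2020), timestamp=1.0)
tracker.log("country-dropdown", "Spain", timestamp=2.0)
tracker.log("year-slider", (2005, 2020), timestamp=3.0)
print(tracker.summary())
# [('year-slider', 2, 3.0), ('country-dropdown', 1, 2.0)]
```

The full event log is what a cross-control SuperWidget would consume to show how, when, and whether different controls were used together.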
[HC-25] MRGEN: A Conceptual Framework for LLM-Powered Mixed Reality Authoring Tools for Education
【速读】:该论文旨在解决教师在缺乏技术专长的情况下难以创作适用于移动设备(如平板和智能手机)的混合现实(Mixed Reality, MR)学习活动的问题。其解决方案的关键在于提出MRGEN框架,该框架通过三个维度——学习目标(Learning Objectives)、MR模态(MR Modality)和生成式AI(Generative AI, GAI)辅助——系统化地支持教师利用大语言模型(Large Language Model, LLM)高效设计MR教学内容。实证研究表明,基于该框架开发的原型工具可使任务时长平均减少36%,且超过90%的参与者认为AI在构思、结构组织与目标对齐方面提供了显著帮助,验证了LLM赋能的MR内容创作路径的有效性。
链接: https://arxiv.org/abs/2604.15341
作者: Mohammed Oussama Seddini(LIUM),Mohamed Ez-Zaouia(UR, IRISA, DRUID),Ngoc Luyen Le(Heudiasyc),Iza Marfisi(LIUM, IUT Laval)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixed Reality (MR) offers immersive and multimodal opportunities for education but remains difficult for teachers to author without technical expertise. We propose MRGEN, a conceptual framework for LLM-powered authoring tools to support teachers in creating MR learning activities that work on mobile devices (tablets and smartphones). MRGEN articulates three axes: Learning Objectives, MR Modality, and GAI Assistance. To validate our framework, we implemented a prototype based on the open-source MIXAP authoring platform and conducted a user study with 24 participants. Results show that LLM-powered authoring reduced task duration by 36% on average, and that over 90% of participants found the AI support helpful for brainstorming, structuring, and aligning content with their learning goals. These findings yielded very promising results for future AI-assisted MR authoring tools.
[HC-26] Restoration, Exploration, and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun, and Finding Themselves
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 研究中忽视青少年自主、非结构化使用场景的问题,尤其关注青少年如何通过游戏化、情感化和创造性实践赋予 AI 新角色。其解决方案的关键在于:首先,基于对 4,172 名 Discord 用户的话语分析(discourse analysis),揭示了高参与度青少年群体(主要为 13–17 岁、女性或非二元性别)在 AI 使用中的三类核心意图——修复(Restoration,情感调节)、探索(Exploration,创意实验)与转化(Transformation,身份建构),并进一步提出由七种青年自创角色原型构成的分类体系,从而系统呈现青少年如何突破平台限制进行创新性交互,并为未来面向青少年的 AI 设计提供可操作的理论框架与实践指引。
链接: https://arxiv.org/abs/2604.15340
作者: Annabel Blake,Marcus Carter,Eduardo Velloso
机构: The University of Sydney(悉尼大学)
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 3 figures. To appear in CHI '26 (ACM CHI Conference on Human Factors in Computing Systems), April 2026, Barcelona, Spain
Abstract:Young people are among the fastest adopters of generative AI, yet research emphasises adult-designed tools and experiments rather than playful, self-directed youth use. We analysed discourse from 4,172 users in this http URL’s official Discord, finding that the most engaged users were predominantly adolescents (50% aged 13-17), female or non-binary (61.9%), with most (59%) creating their own characters. We contribute (1) a descriptive account of how highly-engaged youth on this http URL’s Discord use AI for playful, emotional, and creative practices that push the platform limits; (2) a framework of three engagement intents – Restoration (emotional regulation), Exploration (creative experimentation), and Transformation (identity development); and (3) a taxonomy of seven youth-created character archetypes. Together, these findings reveal how youth invent novel roles for AI, expose critical misalignments between youth use and current AI experiences, and provide frameworks for researchers and practitioners to design youth-centred AI futures.
[HC-27] Uncertainty, Vagueness, and Ambiguity in Human-Robot Interaction: Why Conceptualization Matters
【速读】:该论文旨在解决人机交互(Human-Robot Interaction, HRI)领域中关于不确定性(uncertainty)、模糊性(vagueness)和歧义性(ambiguity)这三类概念定义混乱、术语不一致的问题,这些问题导致实证研究之间难以比较,阻碍了理论积累。解决方案的关键在于提出一个统一的概念基础,通过分析词典定义、厘清三者在HRI语境下的本质区别与相互关系,并辅以实例说明,从而为设计新方法和评估现有方法提供清晰、一致的理论框架。
链接: https://arxiv.org/abs/2604.15339
作者: Xiaowen Sun,Cornelius Weber,Matthias Kerzel,Josua Spisak,Stefan Wermter
机构: University of Hamburg (汉堡大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to InterAI@HRI’26
Abstract:Uncertainty, vagueness, and ambiguity are closely related and often confused concepts in human-robot interaction (HRI). In earlier studies, these concepts have been defined in contradictory ways and described using inconsistent terminology. This conceptual confusion and lack of terminological consistency undermine empirical comparability, thereby slowing the accumulation of theory. Consequently, consistent concepts that clarify these challenges, including their definitions, distinctions, and interrelationships, are needed in HRI. To address this lack of clarity, this paper proposes a consistent conceptual foundation for the challenges of uncertainty, vagueness, and ambiguity in HRI. First, we examine the meanings of these three terms in dictionaries. We then analyze the nature of their distinctions and interrelationships within the context of HRI. We further illustrate these characteristics through examples. Finally, we demonstrate how this consistent conceptual foundation facilitates the design of novel methods and the evaluation of existing methodologies for these phenomena.
[HC-28] Access Over Deception: Fighting Deceptive Patterns through Accessibility
【速读】:该论文旨在解决用户界面(UI)中存在欺骗性设计模式(deceptive patterns)对特定弱势群体(如教育程度较低者、视障人士及老年人)造成的不利影响问题,这些问题往往违背了用户利益并损害无障碍访问权。其解决方案的关键在于通过启发式评估方法分析这些欺骗性模式是否违反《网页内容可访问性指南》(Web Content Accessibility Guidelines, WCAG)及相关法规(如欧盟无障碍法案,European Accessibility Act, EAA),从而识别出与无障碍标准冲突的具体模式,例如倒计时器(Countdown Timer)、自动播放(Auto-Play)和隐藏信息(Hidden Information)。该研究为利用现有无障碍标准作为对抗UI操纵的工具提供了实证依据,并推动更具包容性的设计实践。
链接: https://arxiv.org/abs/2604.15338
作者: Tobias Pellkvist,Katie Seaborn,Miu Kojima
机构: 未知
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:
Abstract:Deceptive patterns, dark patterns, and manipulative user interfaces (UI) are a widely used design strategy that manipulates users to act against their own interests in pursuit of shareholder aims. These patterns may particularly affect people with less education, visual impairments, and older adults. Yet, access is a critical feature of the user experience (UX), development standards, and law. We considered whether and how the Web Content Accessibility Guidelines (WCAG) and related legislation, like the European Accessibility Act (EAA), could act as a tool against deceptive patterns. We used heuristic evaluation to analyze whether and how deceptive patterns violate or conform to these guidelines and legal statutes. Although statistical analysis revealed no significant differences by pattern type, we identified three patterns implicated by the WCAG guidelines: Countdown Timer, Auto-Play, and Hidden Information. We offer this approach as one tool in the fight against UI-based deception and in support of inclusive design.
[HC-29] Radical Gender Neutrality: Agender Euphoria in Gaming and Play Experiences
【速读】:该论文旨在解决游戏环境中对无性别(agender)愉悦体验(agender euphoria)的忽视问题,即当前游戏设计与研究尚未充分探索如何通过游戏和玩耍活动为无性别及相近身份群体(如非二元性别、去性别化等)提供积极的情绪体验。其解决方案的关键在于运用关键事件法(Critical Incident Technique, CIT),通过对142名自述具有或渴望无性别愉悦体验者的深度访谈,系统识别出哪些游戏机制和情境能够激发(或抑制)这种愉悦感,并据此提出可操作的设计准则,以促进更具包容性的游戏设计实践,尤其服务于无性别及边缘性别玩家群体。
链接: https://arxiv.org/abs/2604.15337
作者: Katie Seaborn,Shano Liang,Rua M. Williams,Phoebe O. Toups Dugas
机构: Institute of Science Tokyo (东京科学研究所); University of Cambridge (剑桥大学); Worcester Polytechnic Institute (伍斯特理工学院); Purdue University (普渡大学); Exertion Games Lab (运动游戏实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Agender euphoria is a new term representing the powerful feelings of happiness, joy, and contentment derived from experiences in gender-free embodiments, spaces, and activities. People with and without agender and adjacent identities (e.g., genderless, gender-free, non-binary, gender-apathetic) may have such experiences under the right circumstances. Video games can offer gender minorities a safe haven for gender euphoric experiences. However, the possibility of agender euphoric experiences was unexplored. We considered this overlooked frame of self-actualization with 142 people who identified as having or desiring agender euphoric experiences. Using the critical incident technique (CIT), we uncovered how games and play experiences create (and inhibit) agender euphoria. We surface this experiential phenomenon and provide empirically-grounded criteria for the design of games to elicit agender euphoric experiences for everyone, but especially agender and agender adjacent players. This work adds to the growing critical literatures on marginalized experiences in games research and human-computer interaction.
[HC-30] Facial-Expression-Aware Prompting for Empathetic LLM Tutoring
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在对话式教学中缺乏对学习者情绪状态敏感性的问题,尤其是如何利用面部表情这一非文本线索提升辅导的共情能力。其核心挑战在于,现有LLM驱动的辅导系统主要依赖文本交互,难以捕捉如困惑、挫败或投入等关键情感状态。解决方案的关键在于通过提示层(prompt-level)集成面部表情信息——具体而言,使用动作单元估计模型(Action Unit Estimation Model, AUM)提取结构化的面部表达特征:一种方法是将面部动作单元(Action Unit, AU)的文本描述注入提示,另一种则是选取最具表现力的面部帧进行视觉锚定。实验表明,这两种策略均能显著增强LLM对表情的共情响应,且不损害教学清晰度或文本理解能力,从而实现了轻量级、无需端到端重训练的高效情感感知增强。
链接: https://arxiv.org/abs/2604.15336
作者: Shuangquan Feng,Laura Fleig,Ruisen Tu,Philip Chi,Edmund Bu,Melinda Ozel,Junhua Ma,Teng Fei,Virginia R. de Sa
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) enable increasingly capable tutoring-style conversational agents, yet effective tutoring requires sensitivity to learners’ affective and cognitive states beyond text alone. Facial expressions provide immediate and practical cues of confusion, frustration, or engagement, but remain underexplored in LLM-driven tutoring. We investigate whether facial-expression-aware signals can improve empathetic tutoring responses through prompt-level integration, without end-to-end retraining. We build a scalable simulated tutoring environment where a student agent exhibits diverse facial behaviors from a large unlabeled facial expression video dataset, and compare four tutor variants: a text-only LLM baseline, a multimodal baseline using a random facial frame, and two Action Unit estimation model (AUM)-based methods that either inject textual AU descriptions or select a peak-expression frame for visual grounding. Across 960 multi-turn conversations spanning three tutor backbones (GPT-5.1, Claude Opus 4.5, and Gemini 2.5 Pro), we evaluate targeted pairwise comparisons with five human raters and an exhaustive AI evaluator. AU-based conditioning consistently improves empathetic responsiveness to facial expressions across all tutor backbones, while AUM-guided peak-frame selection outperforms random-frame visual input. Textual AU abstraction and peak-frame visual injection show model-dependent advantages. Control analyses show that this improvement does not come at the expense of worse pedagogical clarity or responsiveness to textual cues. Finally, AI-human agreement is highest on facial-expression-grounded empathy, supporting scalable AI evaluation for this dimension. Overall, our results show that lightweight, structured facial expression representations can meaningfully enhance empathy in LLM-based tutoring systems with minimal overhead.
[HC-31] A Comparative Study on the Impact of Traditional Learning and Interactive Learning on Students' Academic Performance and Emotional Well-Being
【速读】:该论文旨在解决传统教学方法在提升大学本科生学术表现、学习投入度及情绪福祉方面的局限性问题,尤其是在计算机入侵检测课程中。其解决方案的关键在于引入以Kahoot、Panopto、Slido、Quizizz、Padlet和教育视频为代表的交互式学习工具,通过增强学生的行为与情感投入(分别提升67.01%和75.32%),显著改善学业成绩(后测分数提高至67.48% vs. 53.36%,期末考试成绩达80.8% vs. 61.44%)以及积极情绪体验(增加66.67%),同时发现认知参与度下降39.8%,提示需平衡刺激强度与认知负荷以优化教学效果。
链接: https://arxiv.org/abs/2604.15335
作者: Siva Raja Sindiramutty
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 29 pages, 5 figures
Abstract:The growing adoption of interactive learning tools in higher education offers new opportunities to enhance student performance and well-being. This study compares the effects of traditional and interactive learning methods on academic performance, engagement, motivation, and emotional well-being among 100 university students enrolled in a computer intrusion detection course. Participants were randomly assigned to either a traditional learning group (lectures and notes) or an interactive learning group utilising tools such as Kahoot, Panopto, Slido, Quizizz, Padlet, and educational videos. Academic achievement was measured through pre-tests, post-tests, final exams, and assignments, while engagement and emotional states were assessed using validated Likert-scale questionnaires. Results showed that students in the interactive group significantly outperformed their peers in both post-tests (67.48% vs. 53.36%) and final exams (80.8% vs. 61.44%). Interactive learners also demonstrated greater behavioural (+67.01%) and emotional engagement (+75.32%), along with enhanced emotional well-being marked by increased positive emotions (+66.67%) and reduced frustration. A significant drop in cognitive involvement (-39.8%) indicates possible cognitive overload. The pedagogical potential of interactive learning is reaffirmed by this result while reinforcing the need for balancing stimulation and cognitive level. Future research with larger, diverse samples is suggested for generalising and maximising outcomes.
[HC-32] Beyond Passive Viewing: A Pilot Study of a Hybrid Learning Platform Augmenting Video Lectures with Conversational AI
【速读】:该论文旨在解决当前大规模在线AI教育中传统视频教学模式存在的两大核心问题:一是难以维持学习者的持续参与度,二是无法有效促进学习者对人工智能(Artificial Intelligence, AI)概念的深度理解与掌握。针对这些问题,研究提出了一种创新的混合式学习平台解决方案,其关键在于将实时对话式AI助教(conversational AI tutors)与传统视频讲座相融合,通过个性化的交互式辅导增强学习体验。实验结果表明,相较于纯视频教学,引入AI助教后学习者在即时测试中表现显著提升(效应量d = 1.505),且行为数据分析显示参与时长提高71.1%,验证了该方案在提升学习效果和参与度方面的潜力。
链接: https://arxiv.org/abs/2604.15334
作者: Mohammed Abraar,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
机构: Vizuara AI Labs (Vizuara AI 实验室)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:The exponential growth of AI education has brought millions of learners to online platforms, yet this massive scale has simultaneously exposed critical pedagogical shortcomings. Traditional video-based instruction, while cost-effective and scalable, demonstrates systematic failures in both sustaining learner engagement and facilitating the deep conceptual mastery essential for AI literacy. We present a pilot study evaluating a novel hybrid learning platform that integrates real-time conversational AI tutors with traditional video lectures. Our controlled experiment (N = 58, mean age M = 21.4, SD = 2.8) compared traditional video-based instruction with our AI-augmented video platform. This study employed a sequential within-subjects design where all participants first completed the traditional video condition followed by the AI-augmented condition, providing direct comparisons of learning outcomes. We measured learning effectiveness through immediate post-tests and delayed retention assessments (2-week delay). Results suggest improvements in learning performance: immediate post-test performance showed a large effect size (d = 1.505) with participants scoring 8.3 points higher after AI-augmented instruction (91.8 vs 83.5 out of 100, p < .001). Behavioral analytics revealed increased engagement duration (71.1% improvement with AI tutoring) in the experimental group. This pilot study provides preliminary evidence that conversational AI tutors may enhance traditional educational delivery, suggesting a potential avenue for developing scalable, adaptive learning systems.
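The effect size reported above (d = 1.505) is a standardized mean difference. As a minimal illustration, a pooled-standard-deviation Cohen's d can be computed as below. The scores are made up for illustration, not the study's data, and the paper's exact d variant for its within-subjects design may differ:

```python
from math import sqrt

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (independent-groups form)."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    # Sample variances (ddof = 1)
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Illustrative post-test scores (hypothetical): AI-augmented vs. video-only
ai = [92, 90, 94, 91, 93]
video = [84, 82, 85, 83, 84]
print(round(cohens_d(ai, video), 2))  # prints 6.09
```

By convention, |d| around 0.8 or above is considered large, so the reported d = 1.505 indicates a substantial difference between conditions.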
[HC-33] Technically Love: The Evolution of Human-AI Romance Discourse on Reddit
【速读】:该论文旨在解决当前关于人类与AI浪漫关系的研究中缺乏对公众话语演变过程的纵向分析问题,尤其是用户自发发起的公共讨论如何随时间变化。其解决方案的关键在于构建并分析一个高精度的数据集(包含3,383条来自Reddit的自述浪漫陪伴型AI帖子,时间跨度为2017–2025),结合主题建模(topic modeling)与时间序列统计分析方法,识别主导话题及其演化趋势,从而揭示公众讨论从私人情感体验向平台治理、技术问题和现实后果转变的动态过程,为陪伴型AI系统的设计与治理提供实证依据。
链接: https://arxiv.org/abs/2604.15333
作者: Tyler Chang,Jina Huh-Yoo,Afsaneh Razi
机构: Drexel University (德雷塞尔大学); Stevens Institute of Technology (史蒂文斯理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Human-AI romantic relationships are increasingly common, yet little is understood about how public discourse around them emerges and shifts over time. Prior research has examined user experiences and ethical concerns, but lacks longitudinal analyses of user-initiated public discussions. We address this gap by analyzing a high-precision dataset of 3,383 self-disclosed romantic companion AI posts from Reddit (2017-2025), using topic modeling and temporal statistical analysis to identify dominant themes and their evolution over time. We find significant topic drift, with discussions moving away from positive intimate relationships toward platform governance, technical issues, and real-world consequences. These shifts highlight a transition in how human-AI romance is framed: moving from private experiences to technical mediation and regulation, with implications for the design and governance of companion AI systems.
[HC-34] Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts
【速读】:该论文旨在解决交通事故分析中 crash diagram(碰撞图)手工绘制效率低、一致性差的问题。解决方案的关键在于利用视觉语言模型(Vision-Language Models, VLMs)自动化生成碰撞图,通过设计一个三阶段结构化提示框架(包含理解、信息提取与视觉合成),引导模型进行逻辑推理,并结合10项指标体系对生成结果的语义准确性、空间保真度和视觉清晰度进行量化评估,从而提升工程可视化任务中生成质量与可解释性。
链接: https://arxiv.org/abs/2604.15332
作者: Xiao Lu,Hao Zhen,Jidong J. Yang
机构: University of Georgia (佐治亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: 16 pages, 5 figures, 3 tables
Abstract:Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o’s superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.
[HC-35] How people use Copilot for Health
【速读】:该论文旨在解决 conversational AI(对话式人工智能)在健康领域使用中的用户意图与行为模式不清晰的问题,以指导平台设计、安全策略及负责任的健康AI开发。其解决方案的关键在于构建了一个基于隐私保护的大语言模型(LLM)分类体系,涵盖12个一级意图类别,并通过专家人工标注验证其准确性;同时结合LLM驱动的主题聚类方法,对每类意图下的高频主题进行深入分析,从而系统性地刻画了用户健康查询的意图分布、对象归属、时间规律及设备差异特征。
链接: https://arxiv.org/abs/2604.15331
作者: Beatriz Costa-Gomes,Pavel Tolmachev,Eloise Taysom,Viknesh Sounderajah,Hannah Richardson,Philipp Schoenegger,Xiaoxuan Liu,Matthew M Nour,Seth Spielman,Samuel F. Way,Yash Shah,Michael Bhaskar,Harsha Nori,Christopher Kelly,Peter Hames,Bay Gross,Mustafa Suleyman,Dominic King
机构: Microsoft AI
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages, 7 figures
Abstract:We analyze over 500,000 de-identified health-related conversations with Microsoft Copilot from January 2026 to characterize what people ask conversational AI about health. We develop a hierarchical intent taxonomy of 12 primary categories using privacy-preserving LLM-based classification validated against expert human annotation, and apply LLM-driven topic-clustering for prevalent themes within each intent. Using this taxonomy, we characterize the intents and topics behind health queries, identify who these queries are about, and analyze how usage varies by device and time of day. Five findings stand out. First, nearly one in five conversations involve personal symptom assessment or condition discussion, and even the dominant general information category (40%) is concentrated on specific treatments and conditions, suggesting that this is a lower bound on personal health intent. Second, one in seven of these personal health queries concern someone other than the user, such as a child, a parent, a partner, suggesting that conversational AI can be a caregiving tool, not just a personal one. Third, personal queries about symptoms and emotional health queries increase markedly in the evening and nighttime hours, when traditional healthcare is most limited. Fourth, usage diverges sharply by device: mobile concentrates on personal health concerns, while desktop is dominated by professional and academic work. Fifth, a substantial share of queries focuses on navigating healthcare systems such as finding providers, and understanding insurance, highlighting friction in the delivery of existing healthcare. These patterns have direct implications for platform-specific design, safety considerations, and the responsible development of health AI.
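The abstract describes classifying conversations into a hierarchical taxonomy of 12 primary intent categories via privacy-preserving LLM-based classification. As a toy stand-in for that pipeline, a keyword-based classifier over a few invented categories sketches the general shape; the category names and cue phrases here are hypothetical, not the paper's actual taxonomy:

```python
# Toy intent classifier: a keyword stand-in for the paper's LLM-based
# classification. Categories and cues are illustrative only.
TAXONOMY = {
    "general_information": ["what is", "explain", "treatment for"],
    "symptom_assessment": ["i have", "my symptoms", "should i see"],
    "healthcare_navigation": ["find a doctor", "insurance", "appointment"],
    "emotional_health": ["anxious", "stressed", "can't sleep"],
}

def classify_intent(query):
    """Return the first taxonomy category whose cue appears in the query."""
    q = query.lower()
    for intent, cues in TAXONOMY.items():
        if any(cue in q for cue in cues):
            return intent
    return "other"

print(classify_intent("Where can I find a doctor who takes my insurance?"))
# prints healthcare_navigation
```

In the actual study an LLM performs this step over de-identified text, which handles paraphrases and ambiguity that simple keyword matching cannot.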
[HC-36] Making It Work Is the Work: Engineering Maturity as Epistemic Work
【速读】:该论文试图解决人机交互(HCI)领域中制造系统(fabrication systems)难以复用、扩展或迁移的问题,指出其根源并非单纯的时间因素,而是工程成熟度不足导致的知识缺失——即系统在不同材料、设备和用户场景下的行为特性通常未被充分揭示。解决方案的关键在于将“工程成熟度”重新定义为“认识论工作”(epistemic work),即通过持续的工程努力生成原型无法揭示的系统性知识;并提出六个维度(Fab-ilities)作为描述制造物可转移性的通用语言,包括可构建性(buildability)、可执行性(executability)、可靠性(reliability)、可维护性(maintainability)、可迁移性(transferability)和可扩展性(scalability),从而帮助研究者识别知识缺口并推动制造系统的实际落地与传播。
链接: https://arxiv.org/abs/2604.15330
作者: Danny Leen,Stig Konings,Raf Ramakers,Kris Luyten
机构: UHasselt (哈塞尔特大学); Flanders Make (弗拉芒制造); Digital Future Lab (数字未来实验室)
类目: Human-Computer Interaction (cs.HC)
备注: From Papers to the Real World: Making Fabrication Research Matter, a CHI’2026 Workshop
Abstract:Many HCI fabrication systems are compelling as prototypes but remain difficult to reuse, extend, or transfer beyond their original publication. A common explanation is that adoption simply takes time. We argue that the issue is more fundamental. The knowledge needed to make fabrication systems transferable, namely how they behave across different materials, machines, and users, usually does not exist at the time of publication because the work required to generate this knowledge is rarely incentivized or rewarded. Drawing on engineering epistemology and prior debates in systems-oriented HCI, we reframe engineering maturity as epistemic work: sustained engineering effort that produces knowledge which prototyping alone cannot reveal. We propose six dimensions, Fab-ilities, as a vocabulary to describe what aspects of fabrication artifacts have become established and what knowledge remains tacit: (1) buildability, (2) executability, (3) reliability, (4) maintainability, (5) transferability, and (6) scalability. We describe five of our own projects (JigFab, StoryStick++, Silicone Devices, LamiFold, and PaperPulse), where varied attempts at dissemination, such as commercialization, spin-offs, and market exploration, each exposed different gaps between what we published and what transfer actually required.
[HC-37] Evaluating LLMs as Human Surrogates in Controlled Experiments
【速读】:该论文试图解决的问题是:在行为研究中,大型语言模型(Large Language Models, LLMs)生成的数据是否能够支持与人类数据相同的实验推断。其解决方案的关键在于通过直接比较现成LLM生成的响应与真实人类在经典感知准确性调查实验中的响应,将每个个体的人类观测转化为结构化提示(structured prompt),并让模型在无需任务特定训练的情况下生成一个0–10的数值结果变量;随后对人类和合成数据应用完全相同的统计分析方法,从而系统评估LLM生成数据在方向效应、效应强度及调节模式上的匹配程度。结果表明,LLMs可复现人类群体层面的认知更新模式,但效应量和调节关系并不一致,明确了LLM生成数据作为行为替代物的适用边界。
链接: https://arxiv.org/abs/2604.15329
作者: Adnan Hoq,Tim Weninger
机构: University of Notre Dame (圣母大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accuracy perception. Each human observation is converted into a structured prompt, and models generate a single 0–10 outcome variable without task-specific training; identical statistical analyses are applied to human and synthetic responses. We find that LLMs reproduce several directional effects observed in humans, but effect magnitudes and moderation patterns vary across models. Off-the-shelf LLMs therefore capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects, clarifying when LLM-generated data can function as behavioral surrogates.
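The pipeline above converts each human observation into a structured prompt and elicits a single 0-10 outcome variable. A minimal sketch of that conversion and of parsing the model's reply, with entirely hypothetical field names (the study's actual prompt schema is not given in the abstract):

```python
def build_prompt(observation):
    """Render one human observation as a structured prompt.
    Field names here are illustrative, not the study's actual schema."""
    return (
        "You are a survey respondent with the following profile:\n"
        f"- Age: {observation['age']}\n"
        f"- Political leaning: {observation['leaning']}\n\n"
        "You read this headline:\n"
        f"\"{observation['headline']}\"\n\n"
        "On a scale from 0 (not at all accurate) to 10 (completely accurate), "
        "how accurate is this headline? Reply with a single integer."
    )

def parse_rating(reply):
    """Extract the 0-10 outcome variable from a model reply, or None."""
    for token in reply.replace("/10", " ").split():
        if token.strip(".,").isdigit():
            value = int(token.strip(".,"))
            if 0 <= value <= 10:
                return value
    return None

prompt = build_prompt({"age": 34, "leaning": "moderate",
                       "headline": "New policy cuts commute times by half"})
print(parse_rating("I would rate it 7 out of 10."))  # prints 7
```

Because identical statistical analyses are run on human and synthetic ratings, keeping the outcome on the same 0-10 scale is what makes the two datasets directly comparable.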
[HC-38] Interpupillary Distance Constraints in Pediatric VR: Implications for Psychology and Psychotherapy
【速读】:该论文旨在解决儿童和青少年在使用消费级虚拟现实(Virtual Reality, VR)设备时因瞳距(Interpupillary Distance, IPD)不匹配而导致的感知体验劣化问题,这一问题可能系统性地影响其参与度、治疗依从性和干预效果,从而成为儿科心理研究与治疗中的一个被忽视的方法论和临床挑战。解决方案的关键在于将VR头显的兼容性(尤其是IPD适配性)视为心理学研究方法的一部分,而非简单的技术背景细节,并基于儿童发育阶段的IPD数据优化设备选择与适配策略,以确保更公平、有效的沉浸式干预实施。
链接: https://arxiv.org/abs/2604.15328
作者: Grzegorz Pochwatko
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 9 pages, 1 figure. Methodological paper on pediatric VR accessibility, headset fit, and implications for psychological research and psychotherapy delivery
Abstract:Virtual reality (VR) is increasingly used across psychology, from research and assessment to counseling, psychological treatment, and psychotherapy, with growing applications for children and adolescents. In these contexts, VR is often treated as a relatively neutral delivery medium. This assumption may be misleading. Most consumer head-mounted displays (HMDs) have been designed primarily for adult anthropometry, including adult interpupillary distance (IPD) ranges. As a result, some children may be excluded from participation or may receive a systematically degraded perceptual experience because the device cannot be adequately aligned to their visual anatomy. This paper argues that IPD constraints in consumer VR headsets represent an underrecognized methodological and clinical problem in pediatric psychology and psychotherapy. If headset fit affects visual comfort, depth perception, attentional load, cybersickness, willingness to remain in the simulation, and sense of presence, it may also influence engagement, emotional processing, dropout, and treatment response. The headset may therefore function as a selection mechanism, shaping who is included in studies, who can tolerate intervention, and to whom findings can be generalized. Using published developmental IPD data, official headset specifications, and examples from pediatric and youth-oriented VR studies, we show that anthropometric mismatch is likely to disproportionately affect younger children and those at the lower end of the IPD distribution. Using Meta Quest 3 as a case study, we argue that pediatric VR research and therapy should treat headset compatibility as part of psychological method rather than as background technical detail.
[HC-39] Eco-Bee: A Personalised Multi-Modal Agent for Advancing Student Climate Awareness and Sustainable Behaviour in Campus Ecosystems
【速读】:该论文旨在解决高校数字化可持续发展举措中普遍存在的问题:当前多数数字工具仅聚焦于碳排放计算,提供静态反馈,难以激发持续的行为改变。其解决方案的关键在于提出Eco-Bee平台,该平台整合了大语言模型(Large Language Models)、行星边界(Planetary Boundaries)框架的量化转化(即Eco-Score)以及对话式代理(Conversational Agent),将个体日常选择与地球环境限值直接关联,并通过可操作的洞察、同伴基准比较和游戏化挑战提升学生参与度,从而推动行为长期转变。该方案首次实现了行星科学、行为强化与AI个性化推荐的融合,为气候意识型大学及未来AI驱动的可持续基础设施提供了可扩展基础。
链接: https://arxiv.org/abs/2604.15327
作者: Caleb Adu,Neil Kapadia,Binhe Liu,Jonathan Randall,Sruthi Viswanathan
机构: University of Hull (赫尔大学); City St George’s, University of London (伦敦城市圣乔治大学); King’s College London (伦敦国王学院); Lancaster University (兰卡斯特大学); University of Oxford (牛津大学); The Spaceship Academy (飞船学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Universities are microcosms of urban ecosystems, with concentrated consumption patterns in food, transport, energy, and product usage. These environments not only contribute substantially to sustainability pressures but also provide a unique opportunity to advance sustainability education and behavioural change at scale. As in most sectors, digital sustainability initiatives within universities remain narrowly focused on carbon calculations, typically providing static feedback that limits opportunities for sustained behavioural change. To address this gap, we propose Eco-Bee, integrating large language models, a translation of the Planetary Boundaries framework (as Eco-Score), and a conversational agent that connects individual choices to environmental limits. Tailored for students at the cusp of lifelong habits, Eco-Bee delivers actionable insights, peer benchmarking, and gamified challenges to sustain engagement and drive measurable progress toward boundary-aligned living. In a pilot tested across multiple campus networks (n=52), 96% of the student participants supported a campus-wide rollout and reported a clearer understanding of how daily behaviours collectively impact the planet’s limits. By embedding planetary science, behavioural reinforcement, and AI-driven personalisation into a single platform, Eco-Bee establishes a scalable foundation for climate-conscious universities and future AI-mediated sustainability infrastructures.
[HC-40] Analyzing the Presentation Content and Utilization of References in LLM-powered Conversational AI Systems
【速读】:该论文旨在解决当前对话式人工智能(Conversational AI)系统在信息检索与问答过程中引用来源的呈现方式和质量缺乏系统性分析的问题。其关键解决方案在于对九种主流系统的30组问答对中总计1,517条参考文献进行定量评估,从用户界面呈现和基于CRAAP标准(时效性、相关性、权威性、准确性、目的性)的质量两个维度进行分析,揭示不同系统在参考文献数量、呈现形式及可信度上的显著差异,并通过初步用户研究发现用户极少主动交互这些参考文献,从而指出未来需改进界面设计以增强用户对参考文献的信任与使用效率。
链接: https://arxiv.org/abs/2604.15326
作者: Jianheng Ouyang,Arpit Narechania
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages, 5 figures, Accepted to ACM CHI 2026 Extended Abstract/Poster/Case Study Track
Abstract:As conversational AI systems become popular for information retrieval and question-answering, the references they cite are key to ensuring their answers are reliable and trustworthy. Yet, no prior work systematically analyzes how these references are presented or their quality. We examine 1,517 references from 30 question-answer pairs across nine systems, focusing on their (1) presentation in the user interface and (2) quality using the CRAAP criteria. We find notable variations in the presentation, quality, and quantity of references across systems. For instance, ChatGPT provides more references (9.5 per response on average) with higher quality (15.48/20 CRAAP score), while Hunyuan-TurboS provides fewer references (4.0) and lower quality (11.65/20). Additionally, a preliminary user study shows that people rarely interact with these references and that their behavior differs across systems. These findings highlight the need for better interface designs that help users engage with and trust references more effectively.
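The per-system CRAAP scores above are reported out of 20. Assuming each of the five CRAAP criteria (Currency, Relevance, Authority, Accuracy, Purpose) is rated 0-4, a per-reference score could be tallied as follows; the 0-4 granularity is an assumption, and the paper's exact rubric may differ:

```python
# Sketch of a CRAAP-style reference score, assuming each criterion is rated
# 0-4 so the total is out of 20 (the paper's exact rubric may differ).
CRITERIA = ("currency", "relevance", "authority", "accuracy", "purpose")

def craap_score(ratings):
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    if any(not 0 <= ratings[c] <= 4 for c in CRITERIA):
        raise ValueError("each criterion must be rated 0-4")
    return sum(ratings[c] for c in CRITERIA)

ref = {"currency": 4, "relevance": 3, "authority": 3,
       "accuracy": 4, "purpose": 2}
print(craap_score(ref))  # prints 16
```

Averaging such per-reference totals across a system's citations would yield a system-level score comparable to the 15.48/20 and 11.65/20 figures cited in the abstract.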
[HC-41] NEFFY 2.0: A Breathing Companion Robot: User-Centered Design and Findings from a Study with Ukrainian Refugees
【速读】:该论文旨在解决压力缓解工具在脆弱人群(如乌克兰难民)中可及性与有效性不足的问题,特别是针对长期焦虑状态下的个体,如何提供低门槛、易用且具支持性的呼吸干预手段。解决方案的关键在于设计并验证NEFFY 2.0——一款以触觉引导慢呼吸为机制的社会机器人(social robot),其通过具身交互(embodiment)和多感官整合(multi-sensory interaction)实现对用户呼吸节奏的直观引导,从而提升干预的自然性和接受度。研究采用混合方法评估其效果,在小样本下对比了机器人辅助与纯音频条件,结果显示NEFFY 2.0显著降低主观压力感知,并识别出三种典型呼吸模式,表明该系统具备个性化适应潜力,为人机交互(HRI)赋能心理健康领域提供了实证依据。
链接: https://arxiv.org/abs/2604.15325
作者: Ilona Buchem,Jessica Kazubski,Charly Goerke
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET); Robotics (cs.RO)
备注: 5 pages, 1 figure, 1st ACM/IEEE International Conference on Human-Robot Interaction
Abstract:This paper presents the design of NEFFY 2.0, a social robot designed as a haptic slow-paced breathing companion for stress reduction, and reports findings from a mixed-methods user study with 14 refugees from Ukraine. Developed through a user-centered design process, NEFFY 2.0 builds on NEFFY 1.0 and integrates embodiment and multi-sensory interaction to provide low-threshold, accessible guidance of slow-paced breathing for stress relief, which may be particularly valuable for individuals experiencing prolonged periods of anxiety. To evaluate effectiveness, an experimental comparison of a robot-assisted breathing intervention versus an audio-only condition was conducted. Measures included subjective ratings and physiological indicators, such as heart rate (HR), heart rate variability (HRV) using the RMSSD parameter, respiratory rate (RR), and galvanic skin response (GSR), alongside qualitative data from interviews exploring user experience and perceived support. Qualitative findings showed that NEFFY 2.0 was perceived as intuitive, calming, and supportive. Survey results showed a significantly larger reduction of perceived stress in the NEFFY 2.0 condition compared to audio-only. Physiological data revealed mixed results with large inter-personal variability. Three patterns of breathing practice with NEFFY 2.0 were identified using k-means clustering. Despite the small sample size, this study makes a novel contribution by providing empirical evidence of stress reduction in a vulnerable population through a direct comparison of robot-assisted and non-robot conditions. The findings position NEFFY 2.0 as a promising low-threshold tool that supports stress relief and contributes to the vision of HRI empowering society.
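RMSSD, the HRV parameter named in the abstract, is the root mean square of successive differences between consecutive R-R intervals. A minimal sketch; the R-R values below are illustrative, not data from the study:

```python
from math import sqrt

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between R-R intervals (ms),
    a standard time-domain heart rate variability measure."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

# Illustrative R-R intervals in milliseconds (hypothetical)
rr = [812, 830, 795, 840, 825, 810]
print(round(rmssd(rr), 1))  # prints 28.4
```

Higher RMSSD generally reflects greater parasympathetic (vagal) activity, which is why it is commonly tracked during paced-breathing interventions like the one evaluated here.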
[HC-42] Struggle Premium: How Human Effort and Imperfection Drive Perceived Value in the Age of AI
【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)逐步介入创意实践,受众在判断作品真实性与价值时面临日益增长的不确定性。为应对这一问题,研究提出了“努力溢价”(Struggle Premium)的概念,即人们对感知到的人类努力所赋予的额外价值,并通过实证分析验证了可见的努力线索如何影响对人类与AI生成创意作品的评价。解决方案的关键在于识别出过程导向型线索(如创作过程视频、时间记录和文字说明)对提升作品真实性和价值感知具有显著作用,而瑕疵则影响有限;尤其重要的是,这些努力线索同样能增强AI生成内容的可信度,表明过程透明性可在一定程度上弥合人类与AI创作之间的真实性差距。
链接: https://arxiv.org/abs/2604.15324
作者: Nazneen Sultana,Mst Rafia Islam,Md. Tanvir Hossain,Azmine Toushik Wasi
机构: Shahjalal University of Science and Technology (沙贾拉尔科技大学); Independent University, Bangladesh (独立大学, 孟加拉国)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Short Paper. In Review. 12 Pages
Abstract:As AI enters creative practice, audiences face growing uncertainty in judging authenticity and value. This study examines the Struggle Premium, the added value attributed to perceived human effort, by analyzing how visible effort cues influence evaluations of human- and AI-generated creative works. We surveyed 70 university students, focusing on process videos, time documentation, written explanations, and imperfections. Process-oriented cues, especially videos and time spent, most strongly shaped authenticity and value judgments, while imperfections had limited impact. Participants showed a clear preference for human-made works, with 72.9% willing to pay more. Notably, effort cues also improved perceptions of AI-generated content, suggesting that process transparency can partially bridge authenticity gaps. These findings extend the effort heuristic to algorithmic creativity and inform the design of transparent human-AI creative systems.
[HC-43] A Systematic Review of User Experiments Measuring the Effects of Dark Patterns
【速读】:该论文旨在解决当前关于欺骗性/操纵性设计(Deceptive/Manipulative Patterns, DMP)在现实世界中对用户行为影响的实证证据分散、不一致的问题。其关键解决方案在于系统性地综述最新实验研究,量化DMP对用户行为的实际影响,并评估现有干预措施的有效性。研究表明,DMP显著改变用户行为(效应量存在较大差异),而外部干预手段大多未能有效缓解其影响;同时,DMP对不同人群的影响具有普遍性,较少受年龄或政治立场等个人特征调节,这为后续研究与政策制定提供了基于实证的清晰方向。
链接: https://arxiv.org/abs/2604.15323
作者: Brennan Schaffner,Luis Heysen,Marshini Chetty
机构: Georgetown University (乔治城大学); University of Chicago (芝加哥大学)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, CHI26 Preprint
Abstract:Deceptive/Manipulative Patterns (DMP) are interface designs, also known as "dark patterns," that manipulate user behavior. While considerable attention has been paid to their ethical and legal implications, empirical evidence about their real-world effects remains diffuse. This review synthesizes up-to-date experimental studies, focusing on works that quantify how (or whether) DMPs influence users. We also aggregate findings on interventions aimed at reducing DMP effects. Our synthesis highlights the experimental agreement that DMPs do significantly alter user behavior (with large variance in effect size) and that external interventions have been mostly unsuccessful in mitigating their effects. Lastly, we show that significant correlations between DMP effects and personal characteristics (e.g., age or political affiliation) are uncommon, indicating DMPs similarly affected nearly all populations tested. By summarizing the experimental evidence, we clarify the effects of DMPs, highlight gaps and tensions in the existing experimental literature, and help inform ongoing research and policy directions.
[HC-44] Acoustic and Facial Markers of Perceived Conversational Success in Spontaneous Speech ICASSP2026
【速读】:该论文旨在解决自然情境下(非任务导向、虚拟环境)对话中言语同步(entrainment)现象与感知互动质量之间关系不明确的问题。解决方案的关键在于通过分析大规模自发性双人视频通话(Zoom)语料,提取多模态特征(包括轮流说话、停顿、面部动作及声学参数如音高和强度),并结合事后评分进行因子分析以量化感知对话成功度,最终发现言语同步在自发对话中可被可靠检测,并显著正相关于更高的感知互动质量。
链接: https://arxiv.org/abs/2604.15322
作者: Thanushi Withanage,Elizabeth Redcay,Carol Espy-Wilson
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for presentation at ICASSP 2026
Abstract:Individuals often align their speaking patterns with their interlocutors, a phenomenon linked to engagement and rapport. While well documented in task-oriented dialogues, less is known about entrainment in naturalistic, non-task and virtual settings. In this study, we analyze a large corpus of spontaneous dyadic Zoom conversations to examine how conversational dynamics relate to perceived interaction quality. We extract multimodal features encompassing turn-taking, pauses, facial movements, and acoustic measures such as pitch and intensity. Perceived conversational success was quantified via factor analysis of post-conversation ratings. Results demonstrate that entrainment can be reliably detected in spontaneous speech and correlates with higher perceived success. These findings identify key interactional markers of conversational quality and highlight opportunities for targeted interventions to foster more effective and engaging communication.
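One simple proxy for acoustic entrainment is the correlation of a per-turn prosodic feature between the two speakers. The sketch below applies Pearson correlation to hypothetical per-turn median pitch values; the paper's actual entrainment measures are not specified in the abstract, so this is illustrative only:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-turn median pitch (Hz) for each speaker; a positive
# correlation is one crude proxy for pitch entrainment across turns.
speaker_a = [210, 205, 220, 215, 200, 225]
speaker_b = [190, 188, 202, 196, 185, 205]
print(round(pearson(speaker_a, speaker_b), 2))  # prints 0.99
```

In practice, such turn-level correlations would be computed per dyad and then related to the factor-analytic success scores described above.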
[HC-45] Beyond Content Exposure: Systemic Factors Driving Moderators Mental Health Crisis in Africa
【速读】:该论文旨在解决非洲内容审核员(content moderators)在长期工作中面临的严重心理困扰问题,特别是其心理健康状况相较于其他地区同行更为恶化的原因。研究发现,除直接接触有害内容外,系统性与结构性的劳动条件(如工作强度、缺乏支持机制、薪酬不合理等)是导致心理压力加剧的关键因素。此外,平台提供的企业健康项目(corporate wellness programs)被证实无效且不充分。解决方案的关键在于所有相关方——包括平台公司、政策制定者和雇主——需采取整体性视角,推动结构性改革,而非仅依赖表面化的福利措施,从而从根本上改善审核员的心理健康环境。
链接: https://arxiv.org/abs/2604.15321
作者: Nuredin Ali Abdelkadir,Tianling Yang,Shivani Kapania,Kauna Ibrahim Malgwi,Fasica Berhane Gebrekidan,Adio-Adet Dinika,Elaine O. Nsoesie,Milagros Miceli,Stevie Chancellor
机构: University of Minnesota(明尼苏达大学); The Distributed AI Research Institute(分布式人工智能研究院); Technische Universität Berlin(柏林工业大学); Carnegie Mellon University(卡内基梅隆大学); Pan African Christian University(泛非基督教大学); Independent Researcher(独立研究员); Boston University(波士顿大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain
Abstract:Content moderators review disturbing content to protect social media users, often at significant cost to their mental health. Recent reports document the mental health conditions of African moderators as notably problematic. Beyond the content itself, what factors contribute to the deteriorating mental health of these workers? We surveyed 134 moderators across Africa to understand their mental health and interviewed 15 moderators to contextualize their experiences. We found that African moderators suffer from high psychological distress and lower well-being compared to moderators in other areas. Former moderators showed significantly higher distress levels, demonstrating long term impact that extends beyond their moderation work. Our interviews showed that systemic and structural labor conditions contribute to moderators’ severe psychological distress and diminished mental well-being. Corporate wellness programs promoted by platforms were found ineffective and inadequate. We discuss how this requires holistic attention and structural solutions by all involved parties to improve moderators’ mental health.
[HC-46] Towards Measuring Interactive Visualization Abilities: Connecting With Existing Literacies and Assessments
【速读】:该论文试图解决的问题是:当前缺乏对个体在交互式数据可视化任务中有效使用可视化能力的正式评估方法。尽管已有如VLAT(Visualization Literacy Assessment Tool)等静态可视化素养测评工具,但它们无法覆盖用户与动态、交互式可视化进行深度信息探索和认知推理的能力。解决方案的关键在于系统梳理相关素养概念与现有评估手段,提出并比较多种适用于交互式感知与分析任务的评估框架,从而为构建更全面、有效的可视化素养测评体系提供理论基础与实践路径。
链接: https://arxiv.org/abs/2604.15320
作者: Gabriela Molina León,Benjamin Bach,Matheus Valentim,Niklas Elmqvist
机构: Aarhus University (奥胡斯大学); Inria Bordeaux (法国国家信息与自动化研究院波尔多分部); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the CHI 2026 Workshop on Data Literacy for the 21st Century
Abstract:How do we assess people’s abilities to interact with data visualizations? The current state-of-the-art visualization literacy tests – such as VLAT and its derivatives – only involve the use of static visualizations. Despite advances in investigating multiple visualization abilities, we do not yet have formal methods to assess the ability of a person to interact with a data visualization effectively. In this position paper, we discuss related literacy concepts and assessments to propose and compare different approaches for assessing the abilities that people leverage to use visualizations in interactive sensemaking tasks.
[HC-47] Explainable Iterative Data Visualisation Refinement via an LLM Agent
【速读】:该论文旨在解决高维数据可视化中算法配置(尤其是超参数设置)难以确定的问题,以生成能忠实反映数据底层结构并促进模式发现的低维可视化结果。解决方案的关键在于提出一个基于大语言模型(Large Language Model, LLM)的智能代理(agentic AI)流程,将可视化评估与超参数优化转化为语义任务,通过生成多维度报告来整合定量指标与定性描述,并提供可操作的算法配置建议,从而实现自动化迭代优化,快速生成高质量可视化结果。
链接: https://arxiv.org/abs/2604.15319
作者: Burak Susam,Tingting Mu
机构: University of Manchester (曼彻斯特大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Exploratory analysis of high-dimensional data relies on embedding the data into a low-dimensional space (typically 2D or 3D), from which a visualization plot is produced to uncover meaningful structures and to communicate geometric and distributional data characteristics. However, finding a suitable algorithm configuration, particularly hyperparameter setting, to produce a visualization plot that faithfully represents the underlying reality and encourages pattern discovery remains challenging. To address this challenge, we propose an agentic AI pipeline that leverages a large language model (LLM) to bridge the gap between rigorous quantitative assessment and qualitative human insight. By treating visualization evaluation and hyperparameter optimization as a semantic task, our system generates a multi-faceted report that contextualizes hard metrics with descriptive summaries, and suggests actionable recommendations on algorithm configuration for refining data visualization. By implementing an iterative optimization loop of this process, the system is able to rapidly produce a high-quality visualization plot in full automation.
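The iterative loop the abstract describes — embed, evaluate, let the LLM agent recommend a new configuration, repeat — can be sketched with the LLM call stubbed out. The function names and the simple accept-if-improved rule below are illustrative assumptions, not the paper's implementation:

```python
def refine_visualization(embed, evaluate, suggest, config, max_iters=5):
    """Iterative refinement: produce an embedding, score it, and ask
    an LLM agent (stubbed as `suggest`) for a revised hyperparameter
    configuration, keeping the best-scoring one found so far."""
    best_cfg, best_score = config, evaluate(embed(config))
    for _ in range(max_iters):
        candidate = suggest(best_cfg, best_score)   # LLM recommendation
        score = evaluate(embed(candidate))
        if score <= best_score:                     # no improvement: stop
            break
        best_cfg, best_score = candidate, score
    return best_cfg, best_score

# toy stand-ins: the "LLM" just nudges perplexity toward an optimum of 30
best_cfg, best_score = refine_visualization(
    embed=lambda cfg: cfg["perplexity"],
    evaluate=lambda emb: -abs(emb - 30),
    suggest=lambda cfg, score: {"perplexity": cfg["perplexity"] + 10},
    config={"perplexity": 10},
)
```

In the paper's system, `evaluate` would be the multi-faceted report (hard metrics plus descriptive summaries) and `suggest` the LLM's actionable configuration recommendation.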
[HC-48] Struggle as Flow: Challenge Design and Experience in Soulslike Games
【速读】:该论文试图解决传统游戏设计中以“愉悦最大化”为导向的范式与 Soulslike 游戏通过高难度和频繁失败实现商业成功之间的矛盾,即“失败悖论”问题。其解决方案的关键在于提出“韧性流动”(Resilient Flow)概念,即玩家在面对挫折时仍能保持沉浸状态,是因为他们将死亡重新诠释为一种教学机制而非惩罚,并借助节奏同步与冥想式专注的语言表达,使困难体验转化为有意识的掌控感与心流状态。这一机制依赖于清晰且公平的设计原则,从而构建出一种“注意力伦理”,推动数字对抗演变为深层的技艺掌握与正念体验。
链接: https://arxiv.org/abs/2604.15318
作者: Zhehao Sun,Yuanyuan Xu,Chi Zhen,Yin-Shan Lin,Miles Thorogood,Patricia Lasserre,Aleksandra Dulic,Megan Smith
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:While traditional game design prioritizes friction-free accessibility, the Soulslike subgenre has achieved commercial dominance through punishing difficulty and frequent failure. This paper challenges the conventional hedonistic paradigm of gaming to investigate the psychological mechanisms behind the Paradox of Failure. By integrating Csikszentmihalyi’s Flow Theory with Juul’s ludological framework, we propose the concept of Resilient Flow. We define this as a cognitive state wherein absorption is maintained not despite frustration but through the meaningful framing of it. To validate this model without invasive laboratory constraints, we conducted a qualitative text analysis of 600 helpful user reviews from Elden Ring, Sekiro: Shadows Die Twice, and Dark Souls III via the Steam Community platform. Findings reveal that long-term players linguistically reframe death as pedagogy rather than punishment and utilize vocabulary associated with rhythmic synchronization and meditative focus. We conclude that when difficulty is designed with clarity and fairness, it fosters an Ethics of Attention and transforms digital struggle into a profound experience of mastery and mindfulness.
[HC-49] Mapping Ecological Empathy: A Semantic Network Analysis of Player Perceptions in 3D Environmental Education Games
【速读】:该论文旨在解决当前环境教育(Environmental Education, EE)中对3D视频游戏作为教学工具的 pedagogical efficacy(教学有效性)评估缺乏可靠方法的问题。传统评估手段如前后测问卷易受响应偏差影响,难以捕捉玩家在游戏过程中产生的复杂心理变化。其解决方案的关键在于引入语义网络分析(Semantic Network Analysis, SNA),通过构建玩家评论中的共现网络并计算拓扑指标,非侵入式地揭示玩家在无意识状态下形成的认知结构。研究以两款生态哲学对立的游戏《Eco》和《WolfQuest》为案例,发现前者激发“社会政治认知”,后者引发“有效共情”,从而证明语义拓扑能够精准识别不同游戏设计对环境教育目标的差异化影响,为严肃游戏评估提供了一种严谨的方法论工具。
链接: https://arxiv.org/abs/2604.15317
作者: Yuanyuan Xu,Zhehao Sun,Chi Zhen,Yin-Shan Lin,Miles Thorogood,Megan Smith,Patricia Lasserre,Aleksandra Dulic
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:As the global climate crisis intensifies, 3D video games have emerged as powerful, interactive simulations for Environmental Education (EE). However, empirical assessment of their pedagogical efficacy remains epistemologically challenged. Traditional evaluation metrics, such as pre-post surveys, often suffer from response bias and fail to capture the nuanced, emergent psychological shifts players experience during gameplay. This paper proposes a novel, non-intrusive approach: utilizing Semantic Network Analysis (SNA) to map the ‘unsupervised’ cognitive structures of players. We scraped and qualitatively filtered 1,825 rich-text user reviews from Steam for two distinct titles representing opposing ecological philosophies: Eco (anthropocentric systemic management) and WolfQuest (biocentric embodied survival). By constructing co-occurrence networks and calculating topological metrics, we visualized the divergence in how players conceptualize human-nature relationships. Results indicate a fundamental pedagogical split: Eco promotes ‘Socio-Political Cognition,’ where environmental challenges are framed as legislative and economic frictions; conversely, WolfQuest fosters ‘Effective Empathy,’ where players internalize the fragility of life through the vulnerability of the avatar. We argue that semantic topology offers a rigorous methodological tool for serious games assessment, revealing that effective environmental education requires a strategic tension between systemic logic and emotional resonance.
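The core SNA step — turning the filtered reviews into a weighted co-occurrence network before computing topological metrics — can be sketched as follows (the whitespace tokenization is a simplification; the paper's actual preprocessing is not specified in the abstract):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(reviews):
    """Weighted co-occurrence network: two terms gain one unit of
    edge weight each time they both appear in the same review."""
    edges = Counter()
    for review in reviews:
        terms = sorted(set(review.lower().split()))  # unique terms per review
        edges.update(combinations(terms, 2))         # all unordered pairs
    return edges

edges = cooccurrence_edges([
    "wolf pup survival",
    "survival is hard as a wolf",
])
```

The resulting weighted edge list can be loaded into any graph library to compute centrality and community metrics over the two games' review corpora.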
[HC-50] Anthropomorphism and Trust in Human-Large Language Model interactions
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)日益普及背景下,人类如何基于特定心理维度对其产生类人认知与信任的问题。研究通过系统操纵LLM在温暖度(warmth,即友好性)、能力(competence,即能力与连贯性)和共情(empathy,包括认知共情与情感共情)三个维度的表现,发现温暖度和认知共情是预测感知类人化、信任、相似性、关系亲密感、挫败感及有用性的关键因素,而能力主要影响除类人化外的所有结果;情感共情则主要作用于关系性感知指标,对知识性结果无显著预测力。此外,主观且与个人相关的话题(如情感建议)会放大上述效应,增强用户对LLM的人格化感知与关系联结。因此,解决方案的关键在于识别并验证温暖度、能力和共情作为人类赋予人工智能关系性和认知性评价的核心心理维度。
链接: https://arxiv.org/abs/2604.15316
作者: Akila Kadambi,Ylenia D’Elia,Tanishka Shah,Iulia Comsa,Alison Lentz,Katie Siri-Ngammuang,Tara Buechler,Jonas Kaplan,Antonio Damasio,Srini Narayanan,Lisa Aziz-Zadeh
机构: University of Southern California(南加州大学); University of California, Los Angeles(加州大学洛杉矶分校); Google DeepMind(谷歌深度思维); Google Research(谷歌研究)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:With large language models (LLMs) becoming increasingly prevalent in daily life, so too has the tendency to attribute to them human-like minds and emotions, or anthropomorphize them. Here, we investigate dimensions people use to anthropomorphize and attribute trust toward LLMs across more than 2,000 human-LLM interactions. Participants (N=115) engaged with LLM chatbots systematically varied in warmth (friendliness), competence (capability, coherence), and empathy (cognitive and affective). Warmth and cognitive empathy significantly predicted perceptions on all outcomes (perceived anthropomorphism, trust, similarity, relational closeness, frustration, usefulness), while competence predicted all outcomes except for anthropomorphism. Affective empathy primarily predicted perceived relational measures, but did not predict the epistemic outcomes. Topic sub-analyses showed that more subjective, personally relevant topics (e.g., relationship advice) amplified these effects, producing greater human-likeness and relational connection with the LLM than did objective topics. Together, these findings reveal that warmth, competence, and empathy are key dimensions through which people attribute relational and epistemic perceptions to artificial agents.
[HC-51] Modeling of ASD/TD Childrens Behaviors in Interaction with a Virtual Social Robot During a Music Education Program Using Deep Neural Networks
【速读】:该论文旨在解决如何通过智能系统对自闭症谱系障碍(ASD)儿童与神经典型(TD)儿童的行为差异进行量化评估,并生成具有真实感的行为模型以辅助诊断和干预的问题。解决方案的关键在于利用深度神经网络,特别是基于Transformer架构的模型,从音乐教育情境中虚拟社交机器人互动的数据(包括动作信号和影响数据)中提取行为特征,实现对两类儿童的高精度区分(准确率81%,敏感度96%),并成功模拟出难以被专家辨别的逼真行为模式(专家识别准确率仅53.5%,一致性达68%),从而为临床诊断、治疗师培训及机制理解提供可解释且可复现的行为建模工具。
链接: https://arxiv.org/abs/2604.15314
作者: Armin Tandiseh,Morteza Memari,Alireza Taheri
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures
Abstract:This research aimed to develop an intelligent system to evaluate performance and extract behavioral models for children with ASD and neurotypical (TD) children by interacting with a virtual social robot in a music education program using deep neural networks. The system has two main features: 1) it distinguishes between neurotypical children and those with ASD based on their behavior, and 2) generates behaviors resembling those of neurotypical or ASD children in similar situations using deep learning. Intelligent systems that identify complex patterns and simulate behavior can aid in diagnosis, therapist training, and understanding the disorder. Using data from a previous study at the Social and Cognitive Robotics Laboratory of Sharif University of Technology (including the usable data of 9 ASD and 21 TD participants), the system achieved an accuracy of 81% and sensitivity of 96% in distinguishing neurotypical children from those with ASD using both impact data and motion signals. A transformer-based network was designed to reproduce children’s behaviors. Experts in the field struggled to differentiate real behaviors from reproduced ones, with an accuracy of 53.5% and agreement of 68%, indicating the model’s success in simulating realistic behaviors.
[HC-52] PADTHAI-MM: Principles-based Approach for Designing Trustworthy Human-centered AI using MAST Methodology
【速读】:该论文旨在解决高风险决策领域中可信人工智能(AI)系统设计与评估工具匮乏的问题,尤其在缺乏可操作性方法的情况下,难以确保AI系统的透明性和可信赖性。解决方案的关键在于提出一个基于多源AI评分表(Multisource AI Scorecard Table, MAST)的迭代式设计框架——基于原则的设计可信且以人为中心的AI方法(Principles-based Approach for Designing Trustworthy, Human-centered AI using MAST Methodology, PADTHAI-MM)。该框架通过整合AI上下文信息、解释机制和利益相关者反馈,指导开发出具有可解释性的AI辅助决策系统(如READIT平台),并实证验证了MAST对提升用户信任的有效性,从而为特定应用场景下构建可信赖AI提供了理论依据与实践路径。
链接: https://arxiv.org/abs/2401.13850
作者: Myke C. Cohen,Nayoung Kim,Yang Ba,Anna Pan,Shawaiz Bhatti,Pouria Salehi,James Sung,Erik Blasch,Michelle V. Mancenido,Erin K. Chiou
机构: Arizona State University (亚利桑那州立大学); Aptima, Inc. (艾普蒂玛公司); U.S. Department of Homeland Security (美国国土安全部); Air Force Office of Scientific Research (空军科学研究所)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Despite an extensive body of literature on trust in technology, designing trustworthy AI systems for high-stakes decision domains remains a significant challenge, further compounded by the lack of actionable design and evaluation tools. The Multisource AI Scorecard Table (MAST) was designed to bridge this gap by offering a systematic, tradecraft-centered approach to evaluating AI-enabled decision support systems. Expanding on MAST, we introduce an iterative design framework called the Principles-based Approach for Designing Trustworthy, Human-centered AI using MAST Methodology (PADTHAI-MM). We demonstrate this framework in our development of the Reporting Assistant for Defense and Intelligence Tasks (READIT), a research platform that leverages data visualizations and natural language processing-based text analysis, emulating an AI-enabled system supporting intelligence reporting work. To empirically assess the efficacy of MAST on trust in AI, we developed two distinct iterations of READIT for comparison: a High-MAST version, which incorporates AI contextual information and explanations, and a Low-MAST version, akin to a "black box" system. This iterative design process, guided by stakeholder feedback and contemporary AI architectures, culminated in a prototype that was evaluated through its use in an intelligence reporting task. We further discuss the potential benefits of employing the MAST-inspired design framework to address context-specific needs. We also explore the relationship between stakeholder evaluators' MAST ratings and three categories of information known to impact trust: process, purpose, and performance. Overall, our study supports the practical benefits and theoretical validity of PADTHAI-MM as a viable method for designing trustworthy, context-specific AI systems.
计算机视觉
[CV-0] Repurposing 3D Generative Model for Autoregressive Layout Generation
【速读】:该论文旨在解决现有3D布局生成方法依赖文本描述进行对象布局推理所带来的局限性,即难以精确建模物体间的几何关系与物理约束,从而导致生成场景的合理性与效率不足的问题。其解决方案的关键在于提出LaviGen框架,该框架直接在原生3D空间中操作,将布局生成建模为一个自回归过程,显式地学习物体间的几何关联与物理约束;同时引入一种改进的3D扩散模型,并结合场景、物体和指令信息,采用双引导自回放蒸馏机制(dual-guidance self-rollout distillation mechanism)以提升生成效率与空间准确性。
链接: https://arxiv.org/abs/2604.16299
作者: Haoran Feng,Yifan Niu,Zehuan Huang,Yang-Tian Sun,Chunchao Guo,Yuxin Peng,Lu Sheng
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学); University of Hong Kong (香港大学); Tencent Hunyuan (腾讯混元); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at this https URL.
[CV-1] FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation CVPR2026
【速读】:该论文旨在解决无人机视觉语言导航(UAV vision-language navigation, VLN)中零样本方法性能受限的问题,尤其是在复杂三维环境中遵循模糊多步指令进行长程导航时的挑战。现有方法通常依赖大型基础模型、通用提示词和松散协同的模块结构,导致泛化能力弱、可解释性差。其解决方案的关键在于提出一种受人类认知启发的自顶向下框架 FineCog-Nav,将导航任务细分为语言处理、感知、注意力、记忆、想象、推理与决策等精细模块,每个模块由适度规模的基础模型驱动,并配备角色特异的提示词和结构化的输入输出协议,从而实现高效协作与更高可解释性。该设计显著提升了指令遵循度、长程规划能力和未见环境下的泛化性能。
链接: https://arxiv.org/abs/2604.16298
作者: Dian Shao,Zhengzheng Xu,Peiyang Wang,Like Liu,Yule Wang,Jieqi Shi,Jing Huo
机构: Northwestern Polytechnical University (西北工业大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by CVPR 2026 Findings
Abstract:UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: this https URL.
[CV-2] Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan CVPR2025
【速读】:该论文旨在解决大气雾霾对野生动物图像的显著退化问题,这一问题严重阻碍了计算机视觉技术在野生动物监测中的应用,如动物检测、跟踪与行为分析。其解决方案的关键在于构建了一个名为AnimalHaze3k的合成数据集(包含3,477张通过物理模型生成的雾霾图像),并提出了一种新颖的IncepDehazeGan架构——该架构融合了inception模块与残差跳跃连接的生成对抗网络(Generative Adversarial Network, GAN),在去雾性能上达到当前最优水平(SSIM: 0.8914, PSNR: 20.54, LPIPS: 0.1104),相较现有方法提升6.27% SSIM和10.2% PSNR;进一步应用于YOLOv11目标检测任务时,使mAP提升112%,IoU提升67%,显著增强了生态学领域在复杂环境下的视觉分析能力。
链接: https://arxiv.org/abs/2604.16284
作者: Shivarth Rai,Tejeswar Pokuri
机构: Manipal Institute of Technology, Manipal, India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CV4Animals Workshop, CVPR 2025
Abstract:Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, and LPIPS: 0.1104), delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.
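The abstract does not detail the physics-based pipeline, but synthetic haze generation of this kind usually follows the standard atmospheric scattering model I = J·t + A·(1 − t) with transmission t = exp(−β·d). A minimal sketch under that assumption (the β and airlight values are illustrative):

```python
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, airlight=0.9):
    """Atmospheric scattering model: hazy = clear * t + A * (1 - t),
    with per-pixel transmission t = exp(-beta * depth)."""
    t = np.exp(-beta * depth)[..., None]      # broadcast over RGB channels
    return clear * t + airlight * (1.0 - t)

# toy 2x2 "image" at uniform depth: haze pulls pixels toward the airlight
clear = np.full((2, 2, 3), 0.5)
hazy = synthesize_haze(clear, depth=np.full((2, 2), 2.0), beta=0.5)
```

Varying β and the depth map per image is what yields multiple hazy variants per clear photograph, consistent with the 3,477-from-1,159 ratio reported above.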
[CV-3] Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement AAAI2026
【速读】:该论文旨在解决水下图像增强中因光吸收和散射导致的颜色失真、对比度低及细节模糊等严重退化问题。现有基于学习的方法如卷积神经网络(Convolutional Neural Networks, CNNs)难以建模非均匀退化的长程依赖关系,而Transformer类方法则因二次计算复杂度在高分辨率图像上效率低下。解决方案的关键在于提出一种基于Mamba架构的新型网络Hero-Mamba,其核心创新是采用双域并行处理机制——同时在空间域(RGB图像)和频域(FFT分量)输入信息,实现退化因子的解耦(颜色/亮度与纹理/噪声分离);并通过引入线性复杂度的Mamba-based SS2D模块捕获全局感受野和长程依赖,克服CNN与Transformer的局限性,进一步结合由背景光照先验引导的ColorFusion块以高保真恢复色彩信息,从而显著提升增强效果与泛化能力。
链接: https://arxiv.org/abs/2604.16266
作者: Tejeswar Pokuri,Shivarth Rai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AI4ES Workshop AAAI 2026
Abstract:Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scattering in water. While learning-based methods like CNNs and Transformers have shown promise, they face critical limitations: CNNs struggle to model the long-range dependencies needed for non-uniform degradation, and Transformers incur quadratic computational complexity, making them inefficient for high-resolution images. To address these challenges, we propose Hero-Mamba, a novel Mamba-based network that achieves efficient dual-domain learning for underwater image enhancement. Our approach uniquely processes information from both the spatial domain (RGB image) and the spectral domain (FFT components) in parallel. This dual-domain input allows the network to decouple degradation factors, separating color/brightness information from texture/noise. The core of our network utilizes Mamba-based SS2D blocks to capture global receptive fields and long-range dependencies with linear complexity, overcoming the limitations of both CNNs and Transformers. Furthermore, we introduce a ColorFusion block, guided by a background light prior, to restore color information with high fidelity. Extensive experiments on the LSUI and UIEB benchmark datasets demonstrate that Hero-Mamba outperforms state-of-the-art methods. Notably, our model achieves a PSNR of 25.802 and an SSIM of 0.913 on LSUI, validating its superior performance and generalization capabilities.
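The dual-domain input the abstract describes — the spatial RGB image processed in parallel with its FFT components — can be sketched like this (splitting the spectrum into amplitude and phase is our assumption about which "FFT components" feed the spectral branch):

```python
import numpy as np

def dual_domain_inputs(rgb):
    """Prepare the two parallel branches: the spatial RGB image and
    the per-channel 2D FFT of it, split into amplitude and phase."""
    spectrum = np.fft.fft2(rgb, axes=(0, 1))
    return rgb, np.abs(spectrum), np.angle(spectrum)

# for a constant image, all spectral energy sits in the DC bin
spatial, amplitude, phase = dual_domain_inputs(np.full((2, 2, 3), 0.5))
```

The amplitude spectrum carries global brightness/color statistics while fine texture and noise spread across the higher frequencies, which is what lets the network decouple the two kinds of degradation.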
[CV-4] Information Router for Mitigating Modality Dominance in Vision-Language Models
【速读】:该论文旨在解决多模态视觉语言模型(Vision Language Models, VLMs)中存在的模态主导(modality dominance)问题,即模型在推理时过度依赖某一模态(如图像或文本),而忽视其他模态的信息。现有方法主要通过调整注意力分配来缓解此问题,但仅改变关注位置无法解决模态间信息缺失或模糊的根本矛盾。本文提出Multi-modal Information Router (MoIR),其核心创新在于:在融合前显式识别低信息量的模态token,并从信息更丰富的模态中路由互补信息,构建更具信息密度的token表示,从而在信息层面上减少模态间的信息差异。这一策略不依赖于注意力机制,而是通过增强信息可用性来实现可靠的模态平衡,显著提升模型在模态退化场景下的鲁棒性和下游任务性能。
链接: https://arxiv.org/abs/2604.16264
作者: Seulgi Kim,Mohit Prabhushankar,Ghassan AlRegib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision Language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and their signal-to-noise ratios. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR: Multi-modal Information Router, an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently demonstrates more balanced modality contribution, and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.
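The abstract does not give MoIR's routing criterion. As a purely illustrative sketch, the code below uses token L2 norm as a stand-in for an information-density score and replaces the k weakest tokens of the degraded modality with aligned tokens from the stronger one; both choices are our assumptions, not the paper's mechanism:

```python
import numpy as np

def route_information(weak_tokens, strong_tokens, k):
    """Hypothetical routing step: score tokens by L2 norm (a proxy
    for information density) and overwrite the k lowest-scoring
    tokens with the aligned tokens from the stronger modality."""
    scores = np.linalg.norm(weak_tokens, axis=-1)
    weakest = np.argsort(scores)[:k]          # indices of low-info tokens
    routed = weak_tokens.copy()
    routed[weakest] = strong_tokens[weakest]  # pull in complementary info
    return routed

weak = np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0]])
strong = np.array([[1.0, 1.0], [9.0, 9.0], [2.0, 2.0]])
routed = route_information(weak, strong, k=2)
```

The point of the sketch is the ordering of operations: information is equalized at the token level before the fused sequence ever reaches the language model.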
[CV-5] Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization CVPR
【速读】:该论文旨在解决基于视觉语言模型(Vision-Language Models, VLMs)在图像粗粒度地理定位(country-level image geolocalization)中的性能评估问题,特别是如何在不依赖图像匹配、GPS元数据或任务特定训练的前提下,利用零样本(zero-shot)提示策略实现国家级别的地理推断。其解决方案的关键在于采用提示工程(prompt-based)方法,在仅使用地面视角图像(ground-view imagery)的情况下,对多个前沿VLMs进行系统性零样本测试,并在三个地理多样性数据集上验证其泛化能力与鲁棒性,从而揭示当前VLMs在语义推理层面捕捉宏观地理线索的潜力及其在细粒度地理特征识别上的局限性。
链接: https://arxiv.org/abs/2604.16248
作者: Siddhant Bharadwaj,Ashish Vashist,Fahimul Aleem,Shruti Vyas
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR EarthVision 2026 Workshop
Abstract:Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.
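The zero-shot protocol reduces to prompting each model once per image and scoring exact country matches; a sketch with the VLM call stubbed out (the prompt wording and the case-insensitive normalization are our assumptions):

```python
def country_accuracy(samples, ask_vlm):
    """Zero-shot evaluation loop: prompt a VLM for a country name
    per ground-view image and score exact (case-insensitive) matches."""
    prompt = "In which country was this photo taken? Answer with only the country name."
    correct = sum(
        ask_vlm(image, prompt).strip().lower() == truth.strip().lower()
        for image, truth in samples
    )
    return correct / len(samples)

# stub VLM that always answers "Japan" — right on one of two samples
acc = country_accuracy(
    [("img_a.jpg", "Japan"), ("img_b.jpg", "Norway")],
    ask_vlm=lambda image, prompt: "Japan",
)
```

Swapping `ask_vlm` for each model's API client is all that changes between the evaluated VLMs, which keeps the comparison controlled.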
[CV-6] Find Fix Reason : Context Repair for Video Reasoning
【速读】:该论文旨在解决当前强化学习在大规模多模态模型中进行视频推理时存在的两大问题:一是主流方法依赖于策略内自探索(on-policy self-exploration),导致性能受限于模型的知识边界;二是混合回放(hybrid replay)方法需复杂正则化且难以稳定优化。针对这些问题,作者提出了一种基于观察级干预(observation-level intervention)的解决方案,其关键在于引入一个冻结的、工具集成的教师模型(frozen, tool-integrated teacher),该教师能识别缺失的时空依赖关系,并从原始视频中提供最小证据片段(如时间戳或区域信息),而无需改变问题本身。学生模型在获得额外上下文后重新作答,训练过程采用组相对策略优化(Group Relative Policy Optimization, GRPO)中的选择性回放机制,同时设计了鲁棒改进奖励(Robust Improvement Reward, RIR),以同时保障答案正确性和推理依据与证据的一致性。该方案通过批内归一化实现群体层面的优化平衡,在保持策略内探索能力的同时引导其朝因果有意义的方向演进,显著提升了准确率和泛化能力。
链接: https://arxiv.org/abs/2604.16243
作者: Haojian Huang,Chuanyu Qin,Yinchuan Li,Yingcong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 7 figures, 17 tables. Ongoing work
Abstract:Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model’s knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model’s capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at this https URL.
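The group-normalized advantages mentioned at the end follow the standard GRPO form: each rollout's reward is z-scored against its group of rollouts for the same prompt (the paper additionally normalizes across the batch; the per-group form is shown here as a minimal sketch):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each rollout's reward against
    the mean and standard deviation of its own rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# four rollouts for one prompt; only the first is clearly better
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Under the proposed RIR, the scalar reward fed into this normalization would combine answer correctness with how well the rationale reflects the cited evidence patch.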
[CV-7] CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting ICPR2026
【速读】:该论文旨在解决时间到碰撞(Time-to-Collision, TTC)预测任务中对视频数据多尺度时空特征建模的挑战,要求精确的时间预测能力并同时理解视频中的局部与全局空间-时间模式。其解决方案的关键在于提出一种基于分层Transformer架构的新型网络CollideNet:在空间流中,通过多分辨率并行聚合每帧信息以捕捉不同尺度的空间特征;在时间流中,不仅进行多尺度特征编码,还显式分离非平稳性(non-stationarity)、趋势(trend)和季节性(seasonality)成分,从而提升模型对复杂动态变化的建模能力。该方法在三个公开数据集上均达到当前最优性能,并通过跨数据集评估验证了良好的泛化能力。
链接: https://arxiv.org/abs/2604.16240
作者: Nishq Poorav Desai,Ali Etemad,Michael Greenspan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICPR 2026
Abstract:Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and an understanding of the local and global patterns encapsulated in a video, both spatially and temporally. To address the multi-scale nature of video, we introduce a novel spatiotemporal hierarchical transformer-based architecture called CollideNet, specifically catered for effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method achieves state-of-the-art performance in comparison to prior works on three commonly used public datasets, setting a new state-of-the-art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentanglement of the trend and seasonality components of the video data. We release our code at this https URL.
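The trend/seasonality disentanglement echoes classical additive time-series decomposition; the sketch below shows that classical moving-average form (CollideNet's learned disentanglement is not this simple version, and its residual stands in here for the non-stationary component):

```python
import numpy as np

def additive_decompose(signal, period):
    """Classical additive decomposition: trend from a moving average,
    seasonality as the period-wise mean of the detrended signal, and
    a residual capturing what neither component explains."""
    x = np.asarray(signal, dtype=float)
    trend = np.convolve(x, np.ones(period) / period, mode="same")
    seasonal_profile = (x - trend).reshape(-1, period).mean(axis=0)
    seasonal = np.tile(seasonal_profile, len(x) // period)
    residual = x - trend - seasonal
    return trend, seasonal, residual

trend, seasonal, residual = additive_decompose([0, 1, 0, 1, 0, 1, 0, 1], period=2)
```

By construction the three components sum back to the original signal, which is the invariant any learned disentanglement of this kind must also respect.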
[CV-8] A Two-Stage Object-Centric Deep Learning Framework for Robust Exam Cheating Detection
【速读】:该论文旨在解决学术诚信(academic integrity)领域中考试作弊行为难以高效、准确检测的问题。传统监考依赖人工观察,存在效率低、成本高且易出错的局限;现有AI监控系统则常因缺乏透明度或架构复杂而难以大规模部署。解决方案的关键在于提出一种基于两阶段框架的改进方法:首先利用YOLOv8n模型实现考场图像中学生位置的精准定位(object detection),随后通过微调的RexNet-150模型对裁剪后的区域进行行为分类(正常或作弊行为),从而实现端到端的作弊检测。该方案在包含273,897样本的数据集上达到0.95准确率、0.96精度和0.95 F1-score,较基线提升13%,同时具备13.9毫秒/样本的平均推理速度,展现出良好的实时性与可扩展性,并通过私密反馈机制保障伦理合规性。
链接: https://arxiv.org/abs/2604.16234
作者: Van-Truong Le,Le-Khanh Nguyen,Trong-Doanh Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures. Accepted at the FISU Joint Conference on Artificial Intelligence (FJCAI 2026), Vietnam
Abstract:Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.
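The detect-crop-classify control flow of the two-stage framework can be sketched with stand-in callables; `detect_students` and `classify_behavior` below are hypothetical placeholders for the paper's YOLOv8n detector and fine-tuned RexNet-150 classifier, not their actual APIs:

```python
def two_stage_cheat_detection(frame, detect_students, classify_behavior,
                              conf_threshold=0.5):
    """Stage 1: localize students; Stage 2: classify each cropped region.

    `frame` is a 2D pixel grid; `detect_students(frame)` yields
    ((x1, y1, x2, y2), confidence) pairs; `classify_behavior(crop)`
    returns "normal" or "cheating".  Illustrative sketch only.
    """
    reports = []
    for box, conf in detect_students(frame):
        if conf < conf_threshold:
            continue                                  # drop weak detections
        x1, y1, x2, y2 = box
        crop = [row[x1:x2] for row in frame[y1:y2]]   # crop detected region
        label = classify_behavior(crop)
        reports.append({"box": box, "label": label})
    return reports
```

The key design point is that the second stage only ever sees per-student crops, so the behavior classifier never has to reason about the full exam-room scene.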
[CV-9] Dental Panoramic Radiograph Analysis Using YOLO26 From Tooth Detection to Disease Diagnosis
【速读】:该论文旨在解决全景牙片(panoramic radiography)中人工诊断效率低、易出错的问题,尤其是在高门诊量场景下。其核心解决方案是首次将YOLOv26模型应用于牙科影像的自动牙齿检测、FDI编号及龋病等四类牙病的分割任务,通过在DENTEX数据集上采用Roboflow预处理与增强,并基于Google Colab平台利用迁移学习训练五种YOLOv26-seg变体,最终实现高精度的牙齿定位与病灶分割,其中YOLOv26m-seg在牙齿计数任务中达到0.976的精度和0.976的box mAP50,显著优于YOLOv8x基线模型,验证了该框架在提升牙科影像分析自动化水平与临床一致性方面的潜力。
链接: https://arxiv.org/abs/2604.16231
作者: Khawaja Azfar Asif,Rafaqat Alam Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposure. However, manual interpretation is time-consuming and prone to errors, especially in high-volume clinical settings. This creates a pressing need for efficient automated solutions. This study presents the first application of YOLOv26 for automated tooth detection, FDI-based numbering, and dental disease segmentation in panoramic radiographs. The DENTEX dataset was preprocessed using Roboflow for format conversion and augmentation, yielding 1,082 images for tooth enumeration and 1,040 images for disease segmentation across four pathology classes. Five YOLOv26-seg variants were trained on Google Colab using transfer learning at a resolution of 800x800. Results demonstrate that the YOLOv26m-seg model achieved the best performance for tooth enumeration, with a precision of 0.976, recall of 0.970, and box mAP50 of 0.976. It outperformed the YOLOv8x baseline by 4.9% in precision and 3.3% in mAP50, while also enabling high-quality mask-level segmentation (mask mAP50 = 0.970). For disease segmentation, the YOLOv26l-seg model attained a box mAP50 of 0.591 and a mask mAP50 of 0.547. Impacted teeth showed the highest per-class average precision (0.943), indicating that visual distinctiveness influences detection performance more than annotation quantity. Overall, these findings demonstrate that YOLOv26-based models offer a robust and accurate framework for automated dental image analysis, with strong potential to enhance diagnostic efficiency and consistency in clinical practice.
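As a quick sanity check on metrics of this kind: F1 is the harmonic mean of precision and recall, so the reported enumeration precision of 0.976 and recall of 0.970 imply an F1 of roughly 0.973 (the paper itself does not report this figure):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# YOLOv26m-seg tooth enumeration numbers from the abstract:
f1_score(0.976, 0.970)  # ≈ 0.973
```

The same identity reproduces the cheating-detection paper's reported 0.95 F1 from its 0.96 precision and 0.94 recall.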
[CV-10] GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos
【速读】:该论文旨在解决现实社会系统中群体情感(Group Affect)的定量建模难题,尤其是在自然场景下(in-the-wild)由于多模态社交交互的复杂性、上下文多样性及行为变异性导致的计算社会系统建模挑战。其关键解决方案是构建了首个大规模、多模态、带上下文标注的群体情感数据集GAViD,包含5091个视频片段及其对应的视频、音频和情境信息,并引入VideoGPT生成的上下文元数据与人工标注的行为线索;同时提出Context-Aware Group Affect Recognition Network (CAGNet),通过融合多模态特征与上下文感知机制实现群体情感识别,最终在GAViD上达到63.20%的测试准确率,验证了方法的有效性。
链接: https://arxiv.org/abs/2604.16214
作者: Deepak Kumar,Abhishek Pratap Singh,Puneet Kumar,Xiaobai Li,Balasubramanian Raman
机构: Indian Institute of Technology Roorkee (印度理工学院鲁尔基分校); Indian Institute of Technology Ropar (印度理工学院罗帕尔分校); Zhejiang University (浙江大学); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. However, computational modeling of group affect in in-the-wild scenarios remains challenging due to limited large-scale annotated datasets and the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. The lack of comprehensive datasets annotated with multimodal and contextual information further limits advances in the field. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present Context-Aware Group Affect Recognition Network (CAGNet) for multimodal context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, comparable to state-of-the-art performance. The dataset and code are available at this http URL.
[CV-11] AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection
【速读】:该论文旨在解决增量式人脸伪造检测(Incremental Face Forgery Detection, IFFD)中因特征漂移(feature drift)和灾难性遗忘(catastrophic forgetting)导致的性能下降问题。现有方法通常依赖数据回放或粗粒度二分类监督,难以有效约束特征空间的稳定性。其解决方案的关键在于提出AIFIND框架,通过引入“Artifact-Aware Interpreting Fine-Grained Alignment”机制:首先设计Artifact-Driven Semantic Prior Generator以从低级伪造痕迹中提取不变语义锚点(semantic anchors),构建固定坐标系;随后利用Artifact-Probe Attention将这些锚点注入图像编码器,显式约束易变的视觉特征与稳定语义锚对齐;最后通过Adaptive Decision Harmonizer保持语义锚间的角度关系,实现跨任务的几何一致性分类器更新。这一系列设计共同提升了增量学习过程中的特征稳定性和模型泛化能力。
链接: https://arxiv.org/abs/2604.16207
作者: Hao Wang,Beichen Zhang,Yanpei Gong,Shaoyi Fang,Zhaobo Qi,Yuanrong Xu,Xinyan Liu,Weigang Zhang
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.
[CV-12] DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
【速读】:该论文旨在解决消费级激光雷达(LiDAR)在非视距(NLOS)感知中的应用难题,即如何利用低成本LiDAR通过数据驱动方法实现对隐藏物体的准确重建。传统方法受限于硬件性能难以有效处理多跳光信号,而本文提出的关键解决方案是构建首个大规模真实世界空间-时间直方图数据集DENALI,涵盖72,000个包含不同物体形状、位置、光照条件和空间分辨率的隐藏物体场景,从而验证了消费级LiDAR结合数据驱动推理可在NLOS感知中实现高精度重建,并揭示了影响性能的关键场景与建模因素及仿真到现实迁移中的差距,为未来基于消费级LiDAR的可扩展NLOS视觉研究奠定基础。
链接: https://arxiv.org/abs/2604.16201
作者: Nikhil Behari,Diego Rivero,Luke Apostolides,Suman Ghosh,Paul Pu Liang,Ramesh Raskar
机构: Massachusetts Institute of Technology (麻省理工学院); Technische Universität Berlin (柏林工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.
[CV-13] Saturation-Aware Space-Variant Blind Image Deblurring
【速读】:该论文旨在解决高动态范围(High Dynamic Range, HDR)和低光照条件下图像去模糊过程中饱和像素带来的挑战。饱和像素会显著影响去模糊效果,导致伪影或失真。解决方案的关键在于提出一种饱和感知的空间变异性盲图像去模糊框架,通过基于模糊强度和接近饱和程度对图像进行分割,并利用预估计的光扩散函数(Light Spread Function, LSF)抑制杂散光效应;同时,借助暗通道先验(Dark Channel Prior, DCP)准确估计饱和区域的真实辐射值,从而在不引入环形伪影(ringing artifacts)的前提下提升去模糊质量。该方法在合成与真实数据集上均表现出优于现有饱和感知及通用去模糊方法的性能。
链接: https://arxiv.org/abs/2604.16200
作者: Muhammad Z. Alam,Larry Stetsiuk,Arooba Zeshan
机构: University of New Brunswick (新布伦瑞克大学); Brandon University (布兰登大学); University of Winnipeg (温尼伯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 12 figures
Abstract:This paper presents a novel saturation-aware space-variant blind image deblurring framework designed to address challenges posed by saturated pixels in deblurring under high dynamic range and low-light conditions. The proposed approach effectively segments the image based on blur intensity and proximity to saturation, leveraging a pre-estimated Light Spread Function to mitigate stray-light effects. By accurately estimating the true radiance of saturated regions using the dark channel prior, our method enhances the deblurring process without introducing artifacts like ringing. Experimental evaluations on both synthetic and real-world datasets demonstrate that the framework improves deblurring outcomes across various scenarios, showcasing superior performance compared to state-of-the-art saturation-aware and general-purpose methods. This adaptability highlights the framework's potential for integration with existing and emerging blind image deblurring techniques.
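The dark channel prior referenced here is a standard construction (per-pixel minimum over the color channels, followed by a minimum over a local patch); the following minimal sketch shows that computation on its own, independent of the paper's full deblurring pipeline:

```python
def dark_channel(image, patch=3):
    """Dark channel: min over RGB per pixel, then min over a local patch.

    `image` is an H x W grid of [r, g, b] triples (nested lists).  In
    haze-free, unsaturated regions the dark channel tends toward zero,
    which is why deviations flag bright or saturated areas.
    """
    h, w = len(image), len(image[0])
    channel_min = [[min(image[y][x]) for x in range(w)] for y in range(h)]
    r = patch // 2
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dark[y][x] = min(
                channel_min[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1)))
    return dark
```

The patch-minimum step is what makes a single dark pixel dominate its neighborhood, giving the prior its robustness to local texture.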
[CV-14] Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement CVPR2026
【速读】:该论文旨在解决图像去阴影(Image Shadow Removal)问题,即从受阴影影响的图像中恢复出无阴影的清晰外观。其解决方案的关键在于提出一个三阶段渐进式去阴影流水线,基于OmniSR架构,将去阴影过程建模为迭代直接精修(iterative direct refinement),使后续阶段逐步修正前序预测留下的残差伪影;同时融合RGB外观信息、冻结的DINOv2语义引导以及单目深度和表面法向量等几何线索,并在所有阶段复用这些多模态特征以增强一致性;此外,引入收缩约束目标函数(contraction-constrained objective)以稳定多阶段优化,确保重建误差在整个级联过程中非递增;最终通过分阶段训练策略实现从WSRD预训练到WSRD+监督再到WSRD+ 2026适配的迁移,并结合余弦退火检查点集成(cosine-annealed checkpoint ensembling)提升性能,在NTIRE 2026 WSRD+挑战赛中取得最优结果(PSNR: 26.680)。
链接: https://arxiv.org/abs/2604.16177
作者: Lorenzo Beltrame,Jules Salzinger,Filip Svoboda,Jasmin Lampert,Phillipp Fanta-Jende,Radu Timofte,Marco Koerner
机构: Austrian Institute of Technology (奥地利技术研究所); Technical University of Munich (慕尼黑工业大学); University of Cambridge (剑桥大学); University of Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, 5 tables, accepted at the CVPR 2026 Workshops (NTIRE 2026 Image Shadow Removal Challenge). Code and materials are available at this https URL
Abstract:We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.
[CV-15] MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation ACL2026
【速读】:该论文旨在解决自动化三维放射学报告生成中常见的临床幻觉(clinical hallucinations)问题以及缺乏人类诊疗流程中迭代验证机制的局限性。现有视觉-语言模型(Vision-Language Models, VLMs)通常作为“黑箱”系统运行,无法模拟放射科团队的专业协作结构。其解决方案的关键在于提出MARCH(Multi-Agent Radiology Clinical Hierarchy)多智能体框架,该框架通过模拟放射科部门的层级组织结构,赋予不同角色的智能体专业化职责:由住院医师代理(Resident Agent)完成初始报告草稿并结合多尺度CT特征提取;多个专科医师代理(Fellow Agents)进行基于检索增强的修订;最终由主治医师代理(Attending Agent)主导一种基于立场的迭代共识对话,以解决诊断分歧。这一设计显著提升了报告在临床真实性和语言准确性方面的表现。
链接: https://arxiv.org/abs/2604.16175
作者: Yi Lin,Yihao Ding,Yonghui Wu,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院); University of Western Australia (西澳大利亚大学); University of Florida (佛罗里达大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACL 2026 main conference
Abstract:Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic “black-box” systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.
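The stance-based consensus discourse can be pictured as an iterate-until-agreement loop. The sketch below is hypothetical control flow with stub callables, not the paper's VLM agents: each fellow returns a revision plus an agree/disagree stance, and the attending's `resolve` step merges revisions when stances conflict:

```python
def consensus_discourse(draft, fellows, resolve, max_rounds=3):
    """Iterate a report through fellow revisions until all agree,
    letting an attending-style `resolve` merge conflicting revisions.

    `fellows`: callables report -> (revision, agrees: bool).
    `resolve`: callable (report, revisions) -> merged report.
    Illustrative sketch of the discourse structure only.
    """
    report = draft
    for _ in range(max_rounds):
        revisions, stances = [], []
        for fellow in fellows:
            rev, agrees = fellow(report)
            revisions.append(rev)
            stances.append(agrees)
        if all(stances):
            return report          # consensus reached
        report = resolve(report, revisions)
    return report                  # best effort after max_rounds
```

The round cap mirrors the practical need to terminate the discourse even when diagnostic disagreement persists.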
[CV-16] neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
【速读】:该论文旨在解决当前生成式 AI 在三维计算机辅助设计(3D CAD)模型编辑任务中表现不足的问题,尤其是缺乏真实场景下专业工程师操作行为的数据驱动基准。解决方案的关键在于构建首个基于真实专家交互数据的基准测试集 neuralCAD-Edit,其通过录制专业 CAD 工程师在软件中直接操作模型时的视频、语音和手势轨迹,收集高保真、语义丰富的编辑请求,而非依赖文本条件输入。这一方法显著提升了任务的真实性与挑战性,为评估和改进基础模型在复杂 CAD 编辑任务中的能力提供了可靠标准。
链接: https://arxiv.org/abs/2604.16170
作者: Toby Perrett,Matthew Bouchard,William McCarthy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注: Project page: this https URL
Abstract:We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers, interacting directly with CAD models in CAD software, while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: this https URL
[CV-17] SWNet: A Cross-Spectral Network for Camouflaged Weed Detection
【速读】:该论文旨在解决在密集农业环境中检测伪装杂草(camouflaged weeds)的难题,这类杂草通过同色融合(homochromatic blending)模仿主作物的表型特征,导致传统计算机视觉系统难以区分。解决方案的关键在于构建一个双模态端到端的跨光谱网络(SWNet),其核心包括:1)采用Pyramid Vision Transformer v2骨干网络以捕获长距离依赖关系;2)引入双模态门控融合模块(Bimodal Gated Fusion Module)动态整合可见光(Visible)与近红外(Near-Infrared, NIR)信息,利用叶绿素在NIR波段的反射差异增强目标可分性;3)设计边缘感知精化模块(Edge-Aware Refinement)提升分割边界清晰度,减少结构歧义。实验表明,该方法在Weeds-Banana数据集上优于十种先进模型,验证了跨光谱数据融合与边界引导精化的有效性。
链接: https://arxiv.org/abs/2604.16147
作者: Henry O. Velesaca,Luigi Miranda,Angel D. Sappa
机构: ESPOL Polytechnic University (ESPOL理工学院); University of Granada (格拉纳达大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex crop canopies. The code is available on GitHub: this https URL
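Gated fusion of two modalities is commonly a learned convex combination of their features. The sketch below uses a scalar sigmoid gate per element; it illustrates the idea only, whereas SWNet's Bimodal Gated Fusion Module operates on full feature maps with learned weights:

```python
import math

def gated_fusion(vis_feat, nir_feat, w_vis, w_nir, bias):
    """Per-element gated fusion of visible and NIR features.

    g = sigmoid(w_vis * v + w_nir * n + bias)
    fused = g * v + (1 - g) * n
    The weights and bias stand in for learned parameters.
    """
    fused = []
    for v, n in zip(vis_feat, nir_feat):
        g = 1.0 / (1.0 + math.exp(-(w_vis * v + w_nir * n + bias)))
        fused.append(g * v + (1.0 - g) * n)
    return fused
```

A strongly positive gate passes the visible feature through; a strongly negative one defers to NIR, which is how chlorophyll-reflectance differences can dominate where the visible channels are ambiguous.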
[CV-18] Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions
【速读】:该论文旨在解决文本到动作生成(text-to-motion generation)中复合动作合成的难题,特别是当前基于扩散模型的方法在处理多并发动作时存在的两个关键问题:一是“灾难性忽略”(catastrophic neglect),即早期动作因时间信息处理不当被后期动作覆盖;二是“注意力坍缩”(attention collapse),源于交叉注意力机制中特征过度融合导致语义表示失真。为应对上述挑战,论文提出Motion-Adapter模块,其核心创新在于通过计算解耦的交叉注意力映射(decoupled cross-attention maps)作为结构掩码(structural masks),在去噪过程中引导模型更准确地保留各动作阶段的语义与物理合理性,从而实现更忠实、连贯的复合动作生成。
链接: https://arxiv.org/abs/2604.16135
作者: Yue Jiang,Mingyu Yang,Liuyuxin Yang,Yang Xu,Bingxin Yun,Yuhe Zhang
机构: Northwest University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 12 figures, Under review for publication in IEEE Transactions on Visualization and Computer Graphics
Abstract:Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.
[CV-19] From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts
【速读】:该论文旨在解决高光谱树种分类中面临的三大挑战:类别标签有限且不平衡、光谱混合(多种树种光谱信号重叠)以及生态异质性(不同生态系统间的变异)。传统方法仅依赖光谱特征,难以应对上述问题。解决方案的关键在于提出一种生物启发的半监督深度学习框架,融合多源地球观测数据(高光谱成像 HSI 与机载激光扫描 ALS)和专家生态知识,通过预计算的冠层图进行生物启发的伪标签生成,并利用大语言模型(LLMs)自动提取物种共存先验信息,构建物种共存矩阵以编码物种共现概率,从而将生态学先验知识有效引入伪标签策略中,显著提升分类精度并降低训练成本。
链接: https://arxiv.org/abs/2604.16115
作者: Michał Romaszewski,Dominik Kopeć,Michał Cholewa,Katarzyna Kołodziej,Przemysław Głomb,Jan Niedzielko,Jakub Charyton,Justyna Wylazłowska,Anna Jarocińska
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert, ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy with differences no larger than 15%.
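One plausible way such a cohabitation matrix could enter pseudo-labelling is to reweight classifier scores by the co-occurrence likelihoods of species already assigned to neighbouring canopy nodes. This is a hypothetical sketch of that mechanism, not the paper's exact rule:

```python
def reweight_pseudo_labels(scores, neighbor_species, cohab):
    """Fold cohabitation priors into pseudo-label scores.

    `scores`: {species: classifier probability} for one canopy node.
    `neighbor_species`: species already assigned to adjacent nodes.
    `cohab[a][b]`: prior likelihood that species a and b co-occur
    (e.g. LLM-derived, as in the paper).  Returns renormalised scores.
    """
    adjusted = {}
    for sp, p in scores.items():
        prior = 1.0
        for nb in neighbor_species:
            prior *= cohab[sp][nb]     # penalise unlikely cohabitations
        adjusted[sp] = p * prior
    total = sum(adjusted.values()) or 1.0
    return {sp: v / total for sp, v in adjusted.items()}
```

Under this rule a species that rarely co-occurs with its neighbours is suppressed even when its spectral score is competitive, which is the effect ecological priors are meant to have.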
[CV-20] owards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset
【速读】:该论文旨在解决图像调色(Tone Style Transfer, TST)任务中因缺乏高质量大规模三元组数据集而导致模型性能受限的问题。现有方法通常依赖自监督或代理目标,难以实现精准的风格迁移与视觉美感保持。其解决方案的关键在于构建了一个包含10万对内容图-参考图-风格化结果三元组的大规模数据集TST100K,并引入一个训练好的调色风格评分器(tone style scorer)以确保每对三元组的风格一致性;同时提出ICTone框架,该框架基于扩散模型,在上下文感知条件下联合条件建模内容图与参考图,利用生成模型的语义先验实现语义感知的调色迁移,并结合评分器提供的奖励反馈机制进一步提升风格保真度和视觉质量。
链接: https://arxiv.org/abs/2604.16114
作者: Yuhai Deng,Huimin She,Wei Shen,Meng Li,Ruoxi Wu,Lunxi Yuan,Xiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 14 figures
Abstract:Tone style transfer for photo retouching aims to adapt the stylistic tone of the reference image to a given content image. However, the lack of high-quality large-scale triplet datasets with stylized ground truth forces existing methods to rely on self-supervised or proxy objectives, which limits model capability. To mitigate this gap, we design a data construction pipeline to build TST100K, a large-scale dataset of 100,000 content-reference-stylized triplets. At the core of this pipeline, we train a tone style scorer to ensure strict stylistic consistency for each triplet. In addition, existing methods typically extract content and reference features independently and then fuse them in a decoder, which may cause semantic loss and lead to inappropriate color transfer and degraded visual aesthetics. Instead, we propose ICTone, a diffusion-based framework that performs tone transfer in an in-context manner by jointly conditioning on both images, leveraging the semantic priors of generative models for semantic-aware transfer. Reward feedback learning using the tone style scorer is further incorporated to improve stylistic fidelity and visual quality. Experiments demonstrate the effectiveness of TST100K, and ICTone achieves state-of-the-art performance on both quantitative metrics and human evaluations.
[CV-21] Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation
【速读】:该论文旨在解决多语言语音驱动面部动画(Speech-Driven Facial Animation, SDFA)中的关键挑战,即现有模型通常仅基于单语种数据训练,难以在真实多语言场景中实现高质量生成;同时,传统方法往往仅依赖语言或说话人特定条件之一,无法有效建模语言与个体表达风格之间的交互关系。解决方案的关键在于提出一个统一的基于扩散模型(diffusion-based architecture)的框架Polyglot,通过引入词元嵌入(transcript embeddings)编码语言信息,并利用参考面部序列提取的风格嵌入(style embeddings)捕捉个体发声特征,从而在无需预定义语言或说话人标签的情况下,通过自监督学习实现跨语言和跨说话人的泛化能力。该方法联合条件化语言与风格信息,能够准确再现节奏、发音清晰度及习惯性面部动作等表现特征,生成时序一致且逼真的动画效果。
链接: https://arxiv.org/abs/2604.16108
作者: Federico Nocentini,Kwanggyoon Seo,Qingju Liu,Claudio Ferrari,Stefano Berretti,David Ferman,Hyeongwoo Kim,Pablo Garrido,Akin Caliskan
机构: University of Florence (佛罗伦萨大学); Flawless AI; Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project website is available at this https URL
Abstract:Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.
[CV-22] DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
【速读】:该论文旨在解决从噪声环境中提取和理解表格结构与内容的挑战,特别是在医疗行政文档(如牙科估价单)中常见的模糊图像、不规则布局及复杂语义任务(如多步算术运算和逻辑一致性检查)下的表结构识别(Table Recognition, TR)与表视觉问答(TableVQA)问题。现有数据集多基于干净数字源或渲染表格,难以反映真实场景中的噪声干扰;为此,作者提出了DenTab数据集,包含2000张牙科估价单裁剪图像及其高质量HTML标注,涵盖11类共2208个问题,用于统一评估TR与TableVQA性能。关键解决方案是提出Table Router Pipeline,其核心在于将算术类问题路由至确定性执行模块:首先由视觉-语言模型(VLM)生成结构化表格表示与约束程序,再由基于规则的执行器对解析后的表格进行精确计算,从而在不依赖额外训练的前提下显著提升算术推理的可靠性。
链接: https://arxiv.org/abs/2604.16099
作者: Laziz Hamdi,Amine Tamasna,Thierry Paquet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision–language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at this https URL.
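The routing idea, detect arithmetic intent and hand it to a deterministic executor over the parsed table, can be sketched as follows. The keyword trigger and the sum-only executor are illustrative stand-ins for the paper's constrained table programs:

```python
import re

def route_question(question, table, vlm_answer):
    """Route arithmetic questions to exact execution over a parsed table.

    `table` is a list of rows, first row = headers.  Non-arithmetic
    questions (and arithmetic ones the toy executor cannot parse) fall
    back to the VLM's answer.  Illustrative sketch only.
    """
    q = question.lower()
    if not re.search(r"\b(total|sum|difference|add)\b", q):
        return vlm_answer                      # not arithmetic: trust the VLM
    m = re.search(r"of the (\w+) column", q)
    if m and m.group(1) in table[0]:
        col = table[0].index(m.group(1))
        return sum(float(row[col]) for row in table[1:])   # exact computation
    return vlm_answer
```

The point of the split is that once the table is parsed correctly, the arithmetic answer is computed, never generated, so multi-step sums cannot be hallucinated.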
[CV-23] Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance ICPR2026
【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)中因过度追求对图像外观(style)变化的不变性而导致关键视觉信息丢失的问题。传统SSL方法如MoCo或DINO通过增强数据变换来提取对光照、几何变化不敏感的特征,这在对象识别任务中有效,但在气象分析或医学影像等场景中,外观本身(如雨痕、雪粒、大气散射、反射和晕影)恰恰是判别性信号,忽略这些特征会带来严重风险。解决方案的关键在于提出ST-STORM框架,该框架将外观(style)作为独立语义模态进行解耦,通过双流架构显式分离内容(Content)与风格(Style)两个潜在表示:内容分支采用JEPA(Joint-Embedding Predictive Architecture)结合对比损失以实现对风格变化的鲁棒性;风格分支则利用对抗约束下的特征预测与重建机制,精准捕捉纹理、对比度和散射等外观特征。实验表明,该设计能在保持内容分支在ImageNet-1K上F1=80%性能的同时,使风格分支在多天气分类(F1=97%)和皮肤癌检测(F1=94%,仅需10%标注数据)中高效分离复杂外观现象,显著提升关键视觉信息的保留能力。
链接: https://arxiv.org/abs/2604.16086
作者: Hamed Ouattara,Pierre Duthon,Pascal Houssam Salmane,Frédéric Bernardin,Omar Ait Aider
机构: Cerema(国家公路与交通研究中心); Clermont Auvergne INP(克莱蒙奥弗涅国立理工学院); Université Clermont Auvergne(克莱蒙奥弗涅大学); Institut Pascal(帕斯卡研究所); CNRS(法国国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 20 pages, 16 figures, ICPR 2026 (28th International Conference on Pattern Recognition)
Abstract:One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination, or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data), without degrading the semantic performance (F1=80% on ImageNet-1K) of the Content branch, and improves the preservation of critical appearance cues.
[CV-24] DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics
【速读】:该论文旨在解决生成式 AI(Generative AI)伪造图像的定位问题,即如何在不依赖复杂设计的前提下,实现对多种图像篡改类型和成像条件具有强泛化能力的像素级伪造区域检测。其解决方案的关键在于提出一种基于 DINOv3 的简单但高效的基准模型,该模型采用 LoRA(Low-Rank Adaptation)微调策略与轻量级卷积解码器相结合,仅需 9.1M 可训练参数即可显著超越现有方法,在 CAT-Net 协议下平均像素级 F1 提升 17.0 点;同时在数据稀缺场景(MVSS-Net 协议)中,LoRA 显著优于全参数微调,表明预训练视觉 Transformer(ViT)中的取证信息更适合作为低秩适配的先验,而非被完全覆盖,从而提升了模型的鲁棒性和实用性。
链接: https://arxiv.org/abs/2604.16083
作者: Jieming Yu,Qiuxiao Feng,Zhuohan Wang,Xiaochen Ma
机构: The Hong Kong University of Science and Technology (香港科技大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report
Abstract:With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at this https URL.
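摘要的核心结论之一是 LoRA 低秩适配优于全量微调。下面用 NumPy 给出 LoRA 前向计算的一个最小示意(维度、初始化方式与缩放系数均为演示用假设,并非该论文的官方实现):

```python
import numpy as np

# LoRA 的最小示意:冻结预训练权重 W,仅训练低秩矩阵 A、B
rng = np.random.default_rng(0)

d_out, d_in, r = 8, 16, 2            # 低秩 r 远小于 min(d_out, d_in)
W = rng.normal(size=(d_out, d_in))   # 冻结的预训练权重
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))             # B 零初始化,初始增量为 0
alpha = 4.0

def lora_forward(x):
    """前向传播:冻结主干输出 + 低秩增量 (alpha/r) * B @ A @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# 训练起点:B 为零时,LoRA 输出与原始冻结层完全一致
assert np.allclose(lora_forward(x), W @ x)
print(A.size + B.size, "trainable params vs", W.size, "for full fine-tuning")
```

可训练参数量(此例中 48 对 128)远小于全量微调;零初始化的 B 保证训练从冻结主干出发,这也与摘要中"预训练表征被保留而非覆盖"的观察相呼应。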
[CV-25] Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model
【速读】:该论文旨在解决急性髓系白血病(Acute Myeloid Leukemia, AML)细胞多类别分类难题,该问题因不同细胞类型在显微图像中视觉相似性高而尤为棘手。解决方案的关键在于采用YOLOv12深度学习模型,并结合基于细胞和细胞核特征的两种图像分割策略,其中利用Hue通道与Otsu阈值法对图像进行预处理,最终在基于细胞特征的分割方法上实现了最高验证和测试准确率(均为99.3%),显著提升了分类性能。
链接: https://arxiv.org/abs/2604.16082
作者: Enas E. Ahmed,Salah A. Aly,Mayar Moner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 10 figures, 2 tables
Abstract:Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses the multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using the Hue channel and Otsu thresholding techniques to preprocess the images prior to classification. Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest validation and test accuracy, both reaching 99.3%.
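摘要中用于预处理的 Otsu 阈值法可在灰度直方图上闭式求解(最大化类间方差)。下面是一个与具体任务无关的最小 NumPy 示意(合成的双峰图像仅用于演示):

```python
import numpy as np

def otsu_threshold(img):
    """在 0-255 灰度直方图上最大化类间方差,返回最优阈值."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                # 灰度概率分布
    omega = np.cumsum(p)                 # 背景类累积概率
    mu = np.cumsum(p * np.arange(256))   # 累积均值
    mu_t = mu[-1]
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan           # 避免除零
    # 类间方差 sigma_b^2(t) = (mu_t*omega - mu)^2 / (omega*(1-omega))
    sigma_b2 = (mu_t * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b2))

# 双峰合成图像:一半像素灰度 40,一半 200,阈值应落在两簇之间
img = np.concatenate([np.full(500, 40), np.full(500, 200)]).astype(np.int64)
t = otsu_threshold(img)
assert 40 <= t < 200
```

论文将该思路应用在 Hue 通道上做细胞/细胞核分割;此处仅演示 Otsu 本身的计算,不涉及论文的具体分割流程。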
[CV-26] The Amazing Stability of Flow Matching
【速读】:该论文旨在解决深度生成模型(特别是流匹配模型,flow-matching models)在训练数据规模缩减或架构/训练配置变化时,其生成样本质量与多样性是否保持稳定的问题。研究发现,即使在CelebA-HQ数据集上裁剪50%的数据后,流匹配模型仍能维持高质量和多样性的生成能力,且潜在表示(latent representation)几乎不变;此外,在改变模型架构或训练设置的情况下,潜在空间映射也具有高度一致性。解决方案的关键在于揭示了流匹配模型对训练数据扰动和结构变化具有显著鲁棒性,从而为理解其可靠性提供了实证依据。
链接: https://arxiv.org/abs/2604.16079
作者: Rania Briq,Michael Kamp,Ohad Fried,Sarel Cohen,Stefan Kesselheim
机构: Forschungszentrum Jülich (于利希研究中心); TU Dortmund (多特蒙德工业大学); Lamarr Institute (拉马尔研究所); Institute for AI in Medicine, UK Essen (埃森大学医院人工智能医学研究所); Reichman University (里奇曼大学); Helmholtz AI (亥姆霍兹人工智能); University of Cologne (科隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)
Abstract:The success of deep generative models in generating high-quality and diverse samples is often attributed to particular architectures and large training datasets. In this paper, we investigate the impact of these factors on the quality and diversity of samples generated by flow-matching models. Surprisingly, in our experiments on the CelebA-HQ dataset, flow matching remains stable even when pruning 50% of the dataset. That is, the quality and diversity of generated samples are preserved. Moreover, pruning impacts the latent representation only slightly, that is, samples generated by models trained on the full and pruned dataset map to visually similar outputs for a given seed. We observe similar stability when changing the architecture or training configuration, such that the latent representation is maintained under these changes as well. Our results quantify just how strong this stability can be in practice, and help explain the reliability of flow-matching models under various perturbations.
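作为背景,流匹配在线性插值路径上的训练目标只需几行代码即可示意(以下是通用公式的演示,与论文的实验设置无关):

```python
import numpy as np

# 流匹配最小示意:在路径 x_t = (1-t)*x0 + t*x1 上,
# 回归与 t 无关的目标速度场 v = x1 - x0
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)   # 噪声端样本
x1 = rng.normal(size=4)   # 数据端样本
t = 0.3

x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

# 验证:从 x_t 沿目标速度积分到 t=1 即还原数据样本
assert np.allclose(x_t + (1 - t) * v_target, x1)
# 反向积分到 t=0 还原噪声样本
assert np.allclose(x_t - t * v_target, x0)
```

训练时网络学习在任意 (x_t, t) 处预测该速度;论文讨论的"稳定性"指的是此类模型在数据裁剪、架构变化下生成结果与潜在映射的鲁棒性。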
[CV-27] TableSeq: Unified Generation of Structure, Content, and Layout
【速读】:该论文旨在解决表格结构识别(Table Structure Recognition, TSR)、内容识别(Table Content Recognition, TCR)与单元格定位(Cell Localization)三者协同处理的难题,传统方法通常依赖多阶段流水线、外部光学字符识别(OCR)或多个专用解码器,导致系统复杂且难以优化。其解决方案的关键在于提出 TableSeq——一个纯图像输入、端到端的统一序列生成框架,将上述三项任务建模为单一序列生成问题:通过一个解码器输出交错的 HTML 标签、单元格文本和离散坐标标记,实现逻辑结构、文本内容与几何位置的一致性对齐。该设计摒弃了辅助模块与复杂后处理,结合轻量级高分辨率 FCN-H16 编码器、结构先验头及单层 Transformer 编码器,构建出紧凑而高效的架构,在多个基准数据集上达到先进性能,同时支持无需任务特定头部的基于索引的表格查询,展现出良好的泛化能力。
链接: https://arxiv.org/abs/2604.16070
作者: Laziz Hamdi,Amine Tamasna,Pascal Boisson,Thierry Paquet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present TableSeq, an image-only, end-to-end framework for joint table structure recognition, content recognition, and cell localization. The model formulates these tasks as a single sequence-generation problem: one decoder produces an interleaved stream of HTML tags, cell text, and discretized coordinate tokens, thereby aligning logical structure, textual content, and cell geometry within a unified autoregressive sequence. This design avoids external OCR, auxiliary decoders, and complex multi-stage post-processing. TableSeq combines a lightweight high-resolution FCN-H16 encoder with a minimal structure-prior head and a single-layer transformer encoder, yielding a compact architecture that remains effective on challenging layouts. Across standard benchmarks, TableSeq achieves competitive or state-of-the-art results while preserving architectural simplicity. It reaches 95.23 TEDS / 96.83 S-TEDS on PubTabNet, 97.45 TEDS / 98.69 S-TEDS on FinTabNet, and 99.79 / 99.54 / 99.66 precision / recall / F1 on SciTSR under the CAR protocol, while remaining competitive on PubTables-1M under GriTS. Beyond TSR/TCR, the same sequence interface generalizes to index-based table querying without task-specific heads, achieving the best IRDR score and competitive ICDR/ICR performance. We also study multi-token prediction for faster blockwise decoding and show that it reduces inference latency with only limited accuracy degradation. Overall, TableSeq provides a practical and reproducible single-stream baseline for unified table recognition, and the source code will be made publicly available at this https URL.
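摘要所述"交错序列"接口可以用如下方式示意:每个单元格展开为 HTML 标签、文本字符和离散化坐标 token 的连续片段(token 词表与坐标离散化粒度均为假设,并非 TableSeq 的官方定义):

```python
# 将表格结构、内容与坐标统一为单一交错 token 序列的最小示意
def cell_to_tokens(text, bbox, n_bins=1000):
    """一个单元格 -> <td> 标签 + 文本字符 + 离散化坐标 token + </td>."""
    coord_tokens = [f"<loc_{int(round(c * (n_bins - 1)))}>" for c in bbox]
    return ["<td>"] + list(text) + coord_tokens + ["</td>"]

seq = ["<table>", "<tr>"]
seq += cell_to_tokens("AB", (0.1, 0.2, 0.5, 0.4))  # 归一化 bbox,假设值
seq += ["</tr>", "</table>"]
print(seq)
```

解码器自回归地生成这样一条序列,逻辑结构(标签)、文本内容(字符)与几何位置(坐标 token)天然对齐,无需额外的 OCR 或坐标回归头。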
[CV-28] AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning
【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Model, VLM)在适应机器人控制任务时,因跨模态梯度不对称性导致的视觉问答(Visual Question Answering, VQA)能力快速退化问题。具体而言,将基于均方误差(MSE)的连续动作梯度注入仅通过交叉熵(Cross-Entropy, CE)预训练的模型时,低秩回归梯度与高维语义流形之间存在谱维度不匹配,引发累积激活漂移和严重遗忘。解决方案的关键在于提出AEGIS(Anchor-Enforced Gradient Isolation System),其核心是无需缓冲区的逐层正交梯度投影机制:通过预计算各Transformer层的静态高斯参考锚点(Gaussian reference anchor),并在每步训练中构建Wasserstein-2运输惩罚以生成锚点恢复梯度;再利用序列双反向传播分解任务梯度与锚点梯度,并对每一层执行单次Gram-Schmidt正交投影,使任务梯度偏离破坏方向但保留有效信息,平均仅损失<1%梯度能量,即可彻底消除累积激活漂移。
链接: https://arxiv.org/abs/2604.16067
作者: Guransh Singh
机构: Independent Researcher
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry, i.e., the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training, causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop-gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor-restoration gradient. A sequential dual backward pass decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
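摘要中的单次 Gram-Schmidt 正交投影与 PCGrad 一类梯度手术的思路相近:当任务梯度与锚点恢复梯度冲突时,去除任务梯度中的破坏性分量。下面是该投影步骤的一个最小示意(冲突判据与函数命名为假设,仅演示几何关系):

```python
import numpy as np

def project_away(g_task, g_anchor):
    """若 g_task 与 g_anchor 冲突(内积为负),将 g_task 投影到
    与 g_anchor 正交的子空间;否则保持不变."""
    dot = g_task @ g_anchor
    if dot >= 0:
        return g_task                       # 无冲突,不做修改
    return g_task - dot / (g_anchor @ g_anchor) * g_anchor

g_task = np.array([1.0, -1.0])
g_anchor = np.array([0.0, 1.0])             # 内积 = -1,存在冲突
g_proj = project_away(g_task, g_anchor)

# 投影后与锚点梯度正交,不再破坏锚点方向
assert np.isclose(g_proj @ g_anchor, 0.0)
```

此例中被剔除的能量只占 g_task 范数的一半维度分量;摘要声称在实际训练中平均损失不足 1% 的梯度能量,即冲突分量通常很小。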
[CV-29] Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
【速读】:该论文旨在解决当前多模态推理模型(Multimodal Reasoning Models, MRMs)在视觉空间推理任务中表现不佳的问题,特别是基于思维链(Chain-of-Thought, CoT)的文本驱动推理范式在处理空间智能任务时存在显著局限。研究发现,CoT提示会系统性降低模型在十三个空间基准测试中的性能,并通过“无图像++”(No-Image++)消融实验揭示了MRMs和经CoT提示的多模态语言模型(Multimodal Language Models, MLMs)存在严重的捷径学习(shortcut learning)现象,即模型即使在图像缺失时也会基于文本先验凭空生成视觉细节,反映出其对视觉信息的依赖不足。解决方案的关键在于摒弃以文本为中心的CoT推理路径,转而构建以视觉为核心的推理范式(vision-centric reasoning paradigms),从而提升模型在空间认知任务中的准确性与鲁棒性。
链接: https://arxiv.org/abs/2604.16060
作者: Sai Srinivas Kancheti,Aditya Sanjiv Kanade,Vineeth N. Balasubramanian,Tanuja Ganu
机构: IIT, Hyderabad (印度理工学院海得拉巴分校); Microsoft Research India (微软研究院印度)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
[CV-30] Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉认知与空间推理能力方面理解不足的问题,尤其是其在处理抽象、关系和变换类任务时的表现远低于人类水平。解决方案的关键在于构建了一个名为“Mind’s Eye”的多选题基准测试集,该测试集由八个受经典人类智力测验启发的视觉认知任务组成,并基于全新的“A-R-T”分类法(Abstraction, Relation, Transformation)进行组织,从而系统性地评估MLLMs在流体智能核心过程(如模式归纳、类比关系映射和心理变换)中的表现。通过对比人类参与者(80%准确率)与主流闭源及开源MLLMs(最高低于50%准确率)的性能,研究揭示了模型在视觉注意力分配、内部感知操作以及底层视觉概念抽象方面的局限性,强调了开发更贴近人类认知机制的评估框架的必要性。
链接: https://arxiv.org/abs/2604.16054
作者: Rohit Sinha,Aditya Kanade,Sai Srinivas Kancheti,Vineeth N Balasubramanian,Tanuja Ganu
机构: IIT Hyderabad (印度理工学院海得拉巴分校); Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce “Mind’s Eye”, a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel “A-R-T” taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
[CV-31] Elucidating the SNR-t Bias of Diffusion Probabilistic Models CVPR2026
【速读】:该论文旨在解决扩散概率模型(Diffusion Probabilistic Models)在推理阶段存在的信号-噪声比-时间步(Signal-to-Noise Ratio-timestep, SNR-t)偏差问题。该偏差表现为推理过程中样本的SNR与其对应时间步之间的错位,导致误差累积并降低生成质量。解决方案的关键在于提出一种简单而有效的差分校正方法:基于扩散模型在反向去噪过程中优先重建低频成分、再逐步恢复高频细节的特性,将样本分解为不同频率成分,并对每个频率分量单独应用差分校正,从而实现对SNR-t偏差的有效缓解。该方法在多种扩散模型(如IDDPM、ADM、DDIM等)和不同分辨率数据集上均显著提升了生成质量,且计算开销可忽略不计。
链接: https://arxiv.org/abs/2604.16044
作者: Meng Yu,Lei Sun,Jianhao Zeng,Xiangxiang Chu,Kun Zhan
机构: Lanzhou University (兰州大学); AMAP Alibaba Group (AMAP 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026, 19 pages, with appendix
Abstract:Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at this https URL.
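训练阶段 SNR 与时间步 t 的严格耦合可以用一个简单的噪声调度示意(线性 beta 调度及其参数为演示用假设,仅说明单调对应关系,并非论文的实验设置):

```python
import numpy as np

# DDPM 式前向加噪:x_t = sqrt(a_bar_t)*x_0 + sqrt(1-a_bar_t)*eps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def snr(t):
    """时间步 t 处样本的信噪比,随 t 单调递减."""
    return alphas_bar[t] / (1.0 - alphas_bar[t])

# 训练时 t 与 SNR 一一对应;推理时若去噪样本的实际 SNR
# 偏离其名义时间步对应的 SNR,即产生论文所述的 SNR-t 偏差
assert snr(10) > snr(500) > snr(990)
```

论文的差分校正正是针对推理中这种 SNR 与名义时间步的失配,并按频率分量分别施加校正。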
[CV-32] Ranking XAI Methods for Head and Neck Cancer Outcome Prediction
【速读】:该论文旨在解决头颈部癌(Head and Neck Cancer, HNC)患者预后预测中人工智能(Artificial Intelligence, AI)模型可解释性不足的问题,从而推动AI在临床实践中的采纳。其解决方案的关键在于首次系统性地对13种可解释人工智能(Explainable Artificial Intelligence, XAI)方法进行多维度评估与排序,涵盖忠实性(faithfulness)、鲁棒性(robustness)、复杂度(complexity)和合理性(plausibility)共24项指标,并基于多中心HECKTOR挑战数据集的实验结果发现,Integrated Gradients(IG)和DeepLIFT(DL)在多个评价维度上表现最优,凸显了全面评估XAI方法的重要性。
链接: https://arxiv.org/abs/2604.16034
作者: Baoqiang Ma,Djennifer K. Madzia-Madzou,Rosa C.J. Kraaijveld,Jin Ouyang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
备注: 4-page conference paper, accepted at IEEE ISBI 2026 (International Symposium on Biomedical Imaging)
Abstract:For head and neck cancer (HNC) patients, prognostic outcome prediction can support personalized treatment strategy selection. Improving prediction performance of HNC outcomes has been extensively explored by using advanced artificial intelligence (AI) techniques on PET/CT data. However, the interpretability of AI remains a critical obstacle for its clinical adoption. Unlike previous HNC studies that empirically selected explainable AI (XAI) techniques, we are the first to comprehensively evaluate and rank 13 XAI methods across 24 metrics, covering faithfulness, robustness, complexity and plausibility. Experimental results on the multi-center HECKTOR challenge dataset show large variations across evaluation aspects among different XAI methods, with Integrated Gradients (IG) and DeepLIFT (DL) consistently obtained high rankings for faithfulness, complexity and plausibility. This work highlights the importance of comprehensive XAI method evaluation and can be extended to other medical imaging tasks.
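以排名靠前的 Integrated Gradients 为例,其核心是沿基线到输入的直线路径累积梯度。下面在一个梯度解析已知的简单函数上示意(f、基线与步数均为演示用假设;实际中 f 为深度网络):

```python
import numpy as np

def f_grad(x):
    """f(x) = sum(x^2) 的解析梯度,用作演示."""
    return 2.0 * x

def integrated_gradients(x, baseline, steps=200):
    """中点黎曼和近似路径积分: (x - baseline) * E_a[grad f(baseline + a*(x-baseline))]."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([f_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 2.0])
ig = integrated_gradients(x, np.zeros(2))
# 完备性公理:各维归因之和 ≈ f(x) - f(baseline) = 1 + 4 = 5
assert np.isclose(ig.sum(), 5.0, atol=1e-3)
```

完备性(归因之和等于输出变化)正是论文"忠实性"类指标所考察的性质之一;DeepLIFT 等方法以不同方式近似同类归因。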
[CV-33] Breakout-picker: Reducing false positives in deep learning-based borehole breakout characterization from acoustic image logs
【速读】:该论文旨在解决现有基于深度学习的钻孔破裂(borehole breakout)自动识别方法中误报率高的问题,即模型容易将非破裂特征(如天然裂缝、键槽和测井伪影)错误分类为破裂。解决方案的关键在于两个策略:一是引入负样本训练机制,将与破裂具有相似特征(如低声波幅度或局部扩径)的非破裂特征纳入训练集,提升模型对真破裂与假破裂的判别能力;二是通过方位对称性准则对候选破裂进行二次验证,剔除不满足近似对称方位特性的检测结果,从而显著降低误报率。该方法在多个区域的实际声成像测井数据上验证了其优越性,有效提升了自动破裂识别的准确性与可靠性,进而优化了基于钻孔破裂的原位应力分析精度。
链接: https://arxiv.org/abs/2604.16011
作者: Guangyu Wang,Xiaodong Ma,Xinming Wu
机构: University of Science and Technology of China (中国科学技术大学); Mengcheng National Geophysical Observatory (蒙城国家地球物理观测站)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Geophysics (physics.geo-ph)
备注:
Abstract:Borehole breakouts are stress-induced spalling on the borehole wall, which are identifiable in acoustic image logs as paired zones with near-symmetry azimuths, low acoustic amplitudes, and increased borehole radius. Accurate breakout characterization is crucial for in-situ stress analysis. In recent years, deep learning has been introduced to automate the time-consuming and labor-intensive breakout picking process. However, existing approaches often suffer from misclassification of non-breakout features, leading to high false positive rates. To address this limitation, this study develops a deep learning framework, termed Breakout-picker, with a specific focus on reducing false positives in automatic breakout characterization. Breakout-picker reduces false positives through two strategies. First, the training of Breakout-picker incorporates negative samples of non-breakout features, including natural fractures, keyseats, and logging artifacts. They share similar characteristics with breakouts, such as low acoustic amplitude or locally enlarged borehole radius. These negative training samples enable Breakout-picker to better discriminate between true breakouts and similar non-breakout features. Second, candidate breakouts identified by Breakout-picker are further validated by azimuthal symmetry criteria, whereby detections that do not exhibit the near-symmetry characteristics of breakout azimuth are excluded. The performance of Breakout-picker is evaluated using three acoustic image log datasets from different regions. The results demonstrate that Breakout-picker outperforms other automatic methods with higher accuracy and substantially lower false positive rates. By reducing false positives, Breakout-picker enhances the reliability of automatic breakout characterization from acoustic image logs, which in turn benefits in-situ stress analysis based on borehole breakouts.
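第二条策略中的方位对称性准则可以示意为一个简单的圆周角距校验:真破裂成对出现,方位角相差约 180 度(容差阈值为假设,论文未给出具体数值):

```python
def is_symmetric_pair(az1, az2, tol=15.0):
    """校验两个候选破裂的方位角是否近似相差 180 度(单位:度)."""
    diff = abs(az1 - az2) % 360.0
    diff = min(diff, 360.0 - diff)        # 圆周上的最小角距
    return abs(diff - 180.0) <= tol

assert is_symmetric_pair(30.0, 208.0)     # 相差 178 度,视为对称破裂对
assert not is_symmetric_pair(30.0, 90.0)  # 相差 60 度,排除为假阳性
```

注意角距须在圆周意义下取最小值(如 350 度与 10 度仅相差 20 度),否则跨越 0 度的候选对会被误判。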
[CV-34] IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE CVPR2026
【速读】:该论文旨在解决传统对比度受限自适应直方图均衡化(Contrast Limited Adaptive Histogram Equalization, CLAHE)在实际应用中因固定裁剪限制(clip limit)参数导致的过度增强问题,该问题源于其未根据局部区域直方图分布动态调整参数。解决方案的关键在于提出图像自适应CLAHE(Image-Adaptive CLAHE, IA-CLAHE),通过训练一个轻量级裁剪限制估计器,基于输入图像的直方图分布自适应地估计每个图像块(tile)的clip limit,并借助CLAHE的可微分扩展实现端到端优化。该方法无需任务特定数据或预搜索的真值clip limit,能够学习将输入图像直方图映射至域不变的均匀分布,从而实现跨多种场景的零样本泛化能力,同时提升识别性能与人眼感知视觉质量。
链接: https://arxiv.org/abs/2604.16010
作者: Rikuto Otsuka,Yuho Shoji,Yuka Ogino,Takahiro Toizumi,Atsushi Ito
机构: NEC Corporation (日本电气公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NTIRE 2026 Workshop at CVPR 2026
Abstract:This paper proposes image-adaptive contrast limited adaptive histogram equalization (IA-CLAHE). Conventional CLAHE is widely used to boost the performance of various computer vision tasks and to improve visual quality for human perception in practical industrial applications. CLAHE applies contrast limited histogram equalization to each local region to enhance local contrast. However, CLAHE often leads to over-enhancement, because the contrast-limiting parameter clip limit is fixed regardless of the histogram distribution of each local region. Our IA-CLAHE addresses this limitation by adaptively estimating tile-wise clip limits from the input image. To achieve this, we train a lightweight clip limits estimator with a differentiable extension of CLAHE, enabling end-to-end optimization. Unlike prior learning-based CLAHE methods, IA-CLAHE does not require pre-searched ground-truth clip limits or task-specific datasets, because it learns to map input image histograms toward a domain-invariant uniform distribution, enabling zero-shot generalization across diverse conditions. Experimental results show that IA-CLAHE consistently improves recognition performance, while simultaneously enhancing visual quality for human perception, without requiring any task-specific training data.
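CLAHE 中由 clip limit 控制的"裁剪直方图并均匀再分配"这一步可示意如下(clip_limit 取值为演示用假设;IA-CLAHE 的贡献正是按 tile 自适应地估计该参数):

```python
import numpy as np

def clip_histogram(hist, clip_limit):
    """裁掉超过 clip_limit 的计数,并将超额部分均匀回填到所有 bin."""
    hist = hist.astype(float)
    excess = np.maximum(hist - clip_limit, 0.0).sum()
    clipped = np.minimum(hist, clip_limit)
    return clipped + excess / hist.size

# 一个尖峰直方图:峰值被抑制,总像素数守恒
hist = np.array([100.0, 10.0, 10.0, 0.0])
out = clip_histogram(hist, clip_limit=40.0)
assert np.isclose(out.sum(), hist.sum())   # 像素总数不变
assert out.max() < hist.max()              # 峰值被削平,抑制过度增强
```

clip limit 越小,直方图越接近均匀、增强越保守;固定取值无法兼顾不同 tile 的分布,这正是摘要中"过度增强"问题的来源。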
[CV-35] MMGait: Towards Multi-Modal Gait Recognition CVPR2026
【速读】:该论文旨在解决现有步态识别方法在真实场景中因依赖单一模态(如RGB图像)而难以实现多模态协同与跨模态检索的问题。其关键解决方案是构建了一个包含五类异构传感器数据的综合性多模态步态识别基准MMGait,涵盖十二种模态和超过33.4万条序列,支持对几何、光度和运动域的系统性探索;并提出OmniGait基线模型,通过学习跨多种模态的共享嵌入空间,在单模态、跨模态及多模态识别范式下均展现出优异性能,从而实现了统一的全多模态步态识别任务(Omni Multi-Modal Gait Recognition)。
链接: https://arxiv.org/abs/2604.15979
作者: Chenye Wang,Qingyuan Cai,Saihui Hou,Aoqi Li,Yongzhen Huang
机构: Beijing Normal University (北京师范大学); WATRIX.AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at this https URL.
[CV-36] TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在内容安全方面的关键漏洞问题,尤其是现有安全对齐方法主要关注显式恶意概念,而忽视了由多个看似无害概念组合后产生的隐含不安全性。其解决方案的核心是提出并形式化了一种新型漏洞——多概念组合不安全(Multi-Concept Compositional Unsafety, MCCU),并构建了名为TwoHamsters的综合性基准测试集(包含17.5k条提示),用于系统评估T2I模型和防御机制在MCCU场景下的表现。通过该框架,研究揭示了当前主流模型和防御策略在处理复合语义风险时存在显著不足,从而推动了对生成式AI(Generative AI)安全机制的深入反思与改进。
链接: https://arxiv.org/abs/2604.15967
作者: Chaoshuo Zhang,Yibo Liang,Mengke Tian,Chenhao Lin,Zhengyu Zhao,Le Yang,Chong Zhang,Yang Zhang,Chao Shen
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the remarkable synthesis capabilities of text-to-image (T2I) models, safeguarding them against content violations remains a persistent challenge. Existing safety alignments primarily focus on explicit malicious concepts, often overlooking the subtle yet critical risks of compositional semantics. To address this oversight, we identify and formalize a novel vulnerability: Multi-Concept Compositional Unsafety (MCCU), where unsafe semantics stem from the implicit associations of individually benign concepts. Based on this formulation, we introduce TwoHamsters, a comprehensive benchmark comprising 17.5k prompts curated to probe MCCU vulnerabilities. Through a rigorous evaluation of 10 state-of-the-art models and 16 defense mechanisms, our analysis yields 8 pivotal insights. In particular, we demonstrate that current T2I models and defense mechanisms face severe MCCU risks: on TwoHamsters, FLUX achieves an MCCU generation success rate of 99.52%, while LLaVA-Guard only attains a recall of 41.06%, highlighting a critical limitation of the current paradigm for managing hazardous compositional generation.
[CV-37] From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
【速读】:该论文旨在解决训练-free文本引导图像编辑中因编辑与重建分支独立优化导致的语义冲突和不可预测结果问题,其核心挑战在于现有方法采用对抗策略使两分支各自追求与目标和源提示的最大对齐,缺乏协同机制。解决方案的关键是提出Coopetitive Training-Free Image Editing (CoEdit),通过将注意力控制从竞争转向“合作竞争”(coopetition)协商机制,在空间维度引入双熵注意力调控(Dual-Entropy Attention Manipulation),以方向性熵交互量化重构与编辑分支间的协调关系,从而最大化编辑和谐度并提升可编辑区域与保留区域的定位精度;在时间维度设计熵潜变量精炼机制(Entropic Latent Refinement),动态调整去噪轨迹中的潜在表示,减少累积编辑误差并保障语义过渡一致性。
链接: https://arxiv.org/abs/2604.15948
作者: Jinhao Shen,Haoqian Du,Xulu Zhang,Xiao-Yong Wei,Qing Li
机构: Hong Kong Polytechnic University (香港理工大学); Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. The adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at this https URL.
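CoEdit 以注意力分布的熵来刻画并协调两分支的行为;熵本身的计算可示意如下(示例分布为假设,仅演示"注意力越集中、熵越低"这一基本性质,并非论文的双熵调控机制):

```python
import numpy as np

def attention_entropy(scores):
    """对注意力得分做 softmax 后计算香农熵."""
    p = np.exp(scores - scores.max())     # 数值稳定的 softmax
    p /= p.sum()
    return -(p * np.log(p + 1e-12)).sum()

peaked = np.array([8.0, 0.0, 0.0, 0.0])   # 高度集中于某一区域
uniform = np.zeros(4)                      # 完全均匀

assert attention_entropy(peaked) < attention_entropy(uniform)
# 均匀分布达到熵上界 log(N)
assert np.isclose(attention_entropy(uniform), np.log(4), atol=1e-6)
```

在此基础上,论文进一步以两分支间的方向性熵交互构造"和谐度"目标,用于定位可编辑与需保留的区域。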
[CV-38] SENSE: Stereo OpEN Vocabulary SEmantic Segmentation
【速读】:该论文旨在解决开放词汇语义分割(open-vocabulary semantic segmentation)在动态环境中因依赖单视角图像而导致的空间精度不足问题,尤其是在遮挡和物体边界附近表现不佳。其解决方案的关键在于提出SENSE方法,首次将立体视觉(stereo vision)与视觉-语言模型(vision-language models)相结合,利用双目图像对提供几何线索以增强空间推理能力,从而提升分割准确性。通过在PhraseStereo数据集上训练,SENSE在短语定位任务中显著优于基线方法,并在零样本设置下展现出良好泛化能力,同时在Cityscapes和KITTI数据集上分别实现了mIoU指标的相对提升3.5%和18%。
链接: https://arxiv.org/abs/2604.15946
作者: Thomas Campagnolo (ACENTAURI), Ezio Malis (ACENTAURI), Philippe Martinet (ACENTAURI), Gaétan Bahl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.
[CV-39] Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction CVPR2026
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在处理高频外观细节场景时存在的关键问题:由于每个高斯基元仅能表示单一颜色,导致在出现锐利颜色过渡时需大量基元,造成基元数量急剧增长,从而影响效率与可扩展性。解决方案的关键在于提出神经Gabor溅射(Neural Gabor Splatting),其核心是为每个高斯基元引入一个轻量级多层感知机(MLP),以建模单个基元内的多种颜色变化;同时设计了一种基于频率能量的致密化策略,通过选择频域能量不匹配的基元进行剪枝和克隆,有效控制基元数量并提升对高频表面的重建精度。
链接: https://arxiv.org/abs/2604.15941
作者: Haato Watanabe,Nobuyuki Umetani
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to CVPR 2026
Abstract:Recent years have witnessed the rapid emergence of 3D Gaussian splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition. To overcome this limitation, we propose neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy that selects mismatched primitives for pruning and cloning based on frequency energy. Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NeRF360 and High-Frequency datasets (e.g., checkered patterns), supported by comprehensive ablation studies.
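高斯包络乘以余弦载波即构成经典的 Gabor 基元,这也是单个基元能够表达周期性高频变化的直观原因。下面是 2D Gabor 函数的一个最小示意(参数均为演示用假设,论文中的调制由每基元 MLP 学习得到):

```python
import numpy as np

def gabor_2d(x, y, sigma=1.0, freq=2.0, theta=0.0, phase=0.0):
    """2D Gabor:各向同性高斯包络 × 沿 theta 方向的余弦载波."""
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.cos(
        2 * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)) + phase
    )
    return envelope * carrier

# 中心处:包络为 1,载波为 cos(0) = 1
assert np.isclose(gabor_2d(0.0, 0.0), 1.0)
# 沿波向移动半个周期(freq=2 时为 0.25),符号翻转,体现高频振荡
assert np.isclose(gabor_2d(0.25, 0.0, sigma=1e6), -1.0)
```

相比之下,单色高斯基元只有包络项,表达一次锐利颜色过渡就需要多个基元,这正是摘要所述基元数量暴涨的根源。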
[CV-40] Hierarchical Codec Diffusion for Video-to-Speech Generation CVPR2026
Quick Read: This paper targets a limitation of video-to-speech (VTS) generation: existing methods ignore the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details, making it hard to align visual and speech features directly at specific levels. The key to the solution, HiCoDiT, is a hierarchical codec diffusion Transformer built on the level hierarchy of a Residual Vector Quantization (RVQ) codec. Its core innovations are: low-level blocks conditioned on lip-synchronized motion and facial identity to capture speaker-aware semantics, high-level blocks that modulate prosodic dynamics via facial expression, and a dual-scale adaptive instance layer normalization that captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization, enabling more effective coarse-to-fine conditioning.
Link: https://arxiv.org/abs/2604.15923
Authors: Jiaxin Ye,Gaoxiang Cong,Chenhui Wang,Xin-Cheng Wen,Zhaoyang Li,Boyuan Cao,Hongming Shan
Affiliations: Fudan University; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Harbin Institute of Technology (Shenzhen)
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at this https URL.
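The dual-scale normalization described above can be sketched as two complementary normalizations of a (time, channel) feature map, one across channels per time step and one across time per channel, combined with condition-predicted affine parameters. The equal-weight blend and the scalar affine parameters here are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def channel_norm(x, eps=1e-5):
    """Normalize each time step across channels; with learned affine modulation
    this targets global vocal style."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def temporal_norm(x, eps=1e-5):
    """Normalize each channel across time; this targets local prosody dynamics."""
    mu = x.mean(axis=0, keepdims=True)
    sd = x.std(axis=0, keepdims=True)
    return (x - mu) / (sd + eps)

def dual_scale_adaln(x, style_scale, style_shift, pros_scale, pros_shift):
    """Blend the two normalizations; the affine parameters would be predicted
    from the conditioning signal (identity / expression) in the real model."""
    return 0.5 * (style_scale * channel_norm(x) + style_shift
                  + pros_scale * temporal_norm(x) + pros_shift)

T, C = 12, 8
x = np.random.default_rng(1).normal(size=(T, C))  # toy (time, channel) features
y = dual_scale_adaln(x, 1.0, 0.0, 1.0, 0.0)
```

The point of the sketch is the axis choice: channel-wise statistics are shared within a frame (style-like), while temporal statistics are shared within a channel (prosody-like trajectories).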
[CV-41] Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions
Quick Read: This paper addresses the observation that, despite strong model capacity, current generative image editing still fails reliably on many seemingly simple edits. The analysis shows these failures stem mainly from poorly formulated tasks, such as small targets, implicit spatial relations, or under-specified instructions. The key idea is to frame editing failures as a task-formulation problem and propose an adaptive task reformulation framework: an MLLM agent dynamically analyzes the original image-instruction pair and performs task decomposition, routing, reformulation, and feedback-driven iterative refinement, significantly improving performance without modifying the underlying editing model. Experiments on multiple benchmarks (e.g., ImgEdit, PICA, and RePlan) show consistent gains, with especially large improvements on challenging cases.
Link: https://arxiv.org/abs/2604.15917
Authors: Bo Zhao,Kairui Guo,Runnan Du,Haiyang Sun,Pengshan Wang,Huan Yang,Kun Gai,Yixin Cao,Wei Ji
Affiliations: Nanjing University; Beijing Institute of Technology; China University of Petroleum (Beijing); Beijing University of Posts and Telecommunications; Kuaishou Technology; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages
Abstract:Instruction-guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by better matching editing tasks to the effective operating regime of existing models.
[CV-42] Efficient Video Diffusion Models: Advancements and Challenges
Quick Read: This survey addresses the severe inference cost that blocks practical deployment of generative video diffusion models: spatio-temporal token growth combined with iterative denoising makes attention and memory traffic the dominant real-world bottlenecks. Its key contribution is a unified taxonomy that organizes existing efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and systematically analyzes each paradigm's algorithmic trends and how different designs target the two core objectives of reducing the number of function evaluations and lowering per-step overhead, providing a structured technical roadmap and future directions for efficient video generation.
Link: https://arxiv.org/abs/2604.15911
Authors: Shitong Shao,Lichen Bai,Pengfei Wan,James Kwok,Zeke Xie
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Kling Team, Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we analyze the algorithmic trends of each paradigm and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.
[CV-43] AeroDeshadow: Physics-Guided Shadow Synthesis and Penumbra-Aware Deshadowing for Aerospace Imagery
Quick Read: This paper tackles the spectral distortion and information loss caused by shadows in high-resolution aerospace imagery (ASI), which degrade downstream interpretation accuracy. Existing deep-learning shadow removal, effective on natural images, faces two obstacles when applied directly to ASI: strictly paired training data are severely scarce, and the homogeneous-shadow assumption cannot handle the broad penumbra transition zones common in aerospace scenes. The key to the proposed two-stage framework, AeroDeshadow, is: in the first stage, a physics-guided shadow synthesis network (PDSS-Net) that explicitly models illumination decay and spatial attenuation to build AeroDS-Syn, a large-scale paired dataset with soft boundary transitions; in the second stage, a penumbra-aware cascaded deshadowing network (PCDS-Net) that decouples the input into umbra and penumbra components and restores them progressively, alleviating boundary artifacts and over-correction. Trained solely on synthetic data, the method generalizes well to real ASI.
Link: https://arxiv.org/abs/2604.15903
Authors: Wei Lu,Zi-Yang Bo,Fei-Fei Sang,Yi Liu,Xue Yang,Si-Bao Chen
Affiliations: Anhui University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures
Abstract:Shadows are prevalent in high-resolution aerospace imagery (ASI). They often cause spectral distortion and information loss, which degrade downstream interpretation tasks. While deep learning methods have advanced natural-image shadow removal, their direct application to ASI faces two primary challenges. First, strictly paired training data are severely lacking. Second, homogeneous shadow assumptions fail to handle the broad penumbra transition zones inherent in aerospace scenes. To address these issues, we propose AeroDeshadow, a unified two-stage framework integrating physics-guided shadow synthesis and penumbra-aware restoration. In the first stage, a Physics-aware Degradation Shadow Synthesis Network (PDSS-Net) explicitly models illumination decay and spatial attenuation. This process constructs AeroDS-Syn, a large-scale paired dataset featuring soft boundary transitions. Constrained by this physical formulation, a Penumbra-aware Cascaded DeShadowing Network (PCDS-Net) then decouples the input into umbra and penumbra components. By restoring these regions progressively, PCDS-Net alleviates boundary artifacts and over-correction. Trained solely on the synthetic AeroDS-Syn, the network generalizes to real-world ASI without requiring paired real annotations. Experimental results indicate that AeroDeshadow achieves state-of-the-art quantitative accuracy and visual fidelity across synthetic and real-world datasets. The datasets and code will be made publicly available at: this https URL.
[CV-44] PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking
Quick Read: This paper addresses the bottleneck in intelligent fetal ultrasound (US) interpretation caused by high annotation costs and operator-induced variance, as well as the limitation that existing self-supervised pre-training ignores US-specific characteristics such as severe data redundancy, fan-shaped locality, and polar-coordinate beamforming. The key innovations are: 1) Progressive Visual-Semantic Screening (PVSS), which adaptively extracts high-value samples to mitigate continuous-scanning redundancy and boost pre-training efficiency; 2) an Acoustic-Bounded Region Constraint (ABRC), which forces the model to focus on valid acoustic regions rather than invalid dark backgrounds, matching US locality; and 3) Polar-Texture Collaborative Masking (PTCM), which combines the beamforming prior with local texture details so the model captures the underlying radial imaging patterns and critical tissue structures, improving downstream performance and pre-training scalability.
Link: https://arxiv.org/abs/2604.15893
Authors: Meng Lv,Yapeng Li,Hang Su,Juhua Liu,Bo Du
Affiliations: Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 3 tables
Abstract:Intelligent fetal ultrasound (US) interpretation is crucial for prenatal diagnosis, but high annotation costs and operator-induced variance make unsupervised pre-training a highly promising paradigm. However, existing pre-training methods largely ignore US-specific characteristics – severe data redundancy, fan-shaped locality, and polar coordinate beamforming – limiting their effectiveness in downstream tasks. To address this, we propose PolarMAE, a novel and efficient pre-training framework tailored for US images. Specifically, to mitigate continuous scanning redundancy, we introduce a Progressive Visual-Semantic Screening (PVSS) that adaptively extracts high-value samples, significantly boosting pre-training efficiency. Furthermore, we design an Acoustic-Bounded Region Constraint (ABRC) to accommodate US locality, forcing the model to focus strictly on valid acoustic regions rather than invalid dark backgrounds. Finally, leveraging the beamforming prior and local details, we propose a Polar-Texture Collaborative Masking (PTCM), enabling the model to capture underlying radial imaging patterns and critical tissue structures. Extensive experiments across diverse datasets and downstream interpretation tasks demonstrate that our method achieves state-of-the-art performance with strong pre-training scalability and efficiency.
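One way to picture the polar-guided part of the masking described above: instead of dropping random square patches, drop whole angular sectors measured from the probe apex, matching the beam geometry of fan-shaped acquisition. This is only a geometric sketch of that idea; the paper's PTCM additionally uses texture cues, and the function name and parameters below are hypothetical:

```python
import numpy as np

def polar_sector_mask(h, w, apex, n_sectors=8, drop_ratio=0.5, seed=0):
    """Mask angular sectors of an ultrasound fan, measured from the probe apex.

    True = masked pixel. Masking whole sectors (rather than random square
    patches) follows the radial beam geometry of polar-coordinate acquisition."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:h, 0:w]
    theta = np.arctan2(xs - apex[1], ys - apex[0])  # angle of each pixel from apex
    edges = np.linspace(theta.min(), theta.max(), n_sectors + 1)
    sector = np.clip(np.searchsorted(edges, theta, side="right") - 1,
                     0, n_sectors - 1)
    dropped = rng.choice(n_sectors, size=int(n_sectors * drop_ratio), replace=False)
    return np.isin(sector, dropped)

mask = polar_sector_mask(64, 64, apex=(0, 32))  # apex at the top-center of the image
```

A masked-autoencoder pre-training loop would then reconstruct the pixels (or tokens) under `mask`, forcing the encoder to reason along the radial imaging direction.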
[CV-45] CLOTH-HUGS: Cloth Aware Human Gaussian Splatting
Quick Read: This paper addresses the difficulty existing neural rendering methods have in disentangling body and clothing for clothed-human reconstruction, which leads to poor handling of loose garments and complex deformations. The key to the proposed Cloth-HUGS framework is representing body and cloth with separate Gaussian layers in a shared canonical space, deformed as a whole through SMPL-driven articulation with learned linear blend skinning weights; cloth realism is improved by initializing cloth Gaussians from mesh topology and applying physics-inspired constraints (simulation consistency, ARAP regularization, and mask supervision), while a depth-aware multi-pass rendering strategy enables efficient, robust body-cloth-scene compositing at over 60 FPS in real time.
Link: https://arxiv.org/abs/2604.15875
Authors: Sadia Mubashshira,Nazanin Amini,Kevin Desai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Cloth-HUGS, a Gaussian Splatting based neural rendering framework for photorealistic clothed human reconstruction that explicitly disentangles body and clothing. Unlike prior methods that absorb clothing into a single body representation and struggle with loose garments and complex deformations, Cloth-HUGS represents the performer using separate Gaussian layers for body and cloth within a shared canonical space. The canonical volume jointly encodes body, cloth, and scene primitives and is deformed through SMPL-driven articulation with learned linear blend skinning weights. To improve cloth realism, we initialize cloth Gaussians from mesh topology and apply physics-inspired constraints, including simulation-consistency, ARAP regularization, and mask supervision. We further introduce a depth-aware multi-pass rendering strategy for robust body-cloth-scene compositing, enabling real-time rendering at over 60 FPS. Experiments on multiple benchmarks show that Cloth-HUGS improves perceptual quality and geometric fidelity over state-of-the-art baselines, reducing LPIPS by up to 28% while producing temporally coherent cloth dynamics.
[CV-46] UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
Quick Read: This paper addresses the fragmented evaluation of visual editing models: existing benchmarks lack unification across methods and modalities, making cross-paradigm comparison difficult, and video editing lacks reliable evaluation benchmarks. Moreover, common automatic metrics often misalign with human preference, while directly using large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. The key to the solution, UniEditBench, is a unified benchmark covering image and video editing with nine image operations and eight video operations under a structured evaluation protocol, together with knowledge distillation that compresses a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators. These score structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency, maintaining strong agreement with human judgments at a fraction of the deployment cost.
Link: https://arxiv.org/abs/2604.15871
Authors: Lifan Jiang,Tianrun Wu,Yuhang Pei,Chenyang Wang,Boxi Wu,Deng Cai
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at this https URL.
[CV-47] Splats in Splats: Robust and Generalizable 3D Gaussian Splatting Steganography
Quick Read: This paper tackles copyright protection for widely deployed 3D Gaussian Splatting (3DGS) assets: achieving high-capacity, imperceptible steganographic embedding that resists structural perturbations without disrupting the native rendering pipeline or degrading visual fidelity. The key to the solution, Splats in Splats++, a unified and pipeline-agnostic steganography framework, lies in two mechanisms. First, a principled analysis of the frequency distribution of Spherical Harmonics (SH) coefficients motivates an importance-graded SH coefficient encryption scheme that embeds information imperceptibly without weakening the original model's expressive power. Second, a Hash-Grid Guided Opacity Mapping mechanism, coupled with a Gradient-Gated Opacity Consistency Loss, enforces tight spatial-attribute coupling between the original and hidden scenes, projecting the discrete attribute mapping into a continuous, attack-resilient latent manifold and fundamentally eliminating message leakage caused by geometric ambiguity.
Link: https://arxiv.org/abs/2604.15862
Authors: Yijia Guo,Wenkai Huang,Tong Hu,Gaolei Li,Yang Li,Yuxin Hong,Liwen Hu,Xitong Ling,Jianhua Li,Shengbo Chen,Tiejun Huang,Lei Ma
Affiliations: Peking University; Shanghai Jiao Tong University; National Biomedical Imaging Center; Nanchang University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has recently redefined the paradigm of 3D reconstruction, striking an unprecedented balance between visual fidelity and computational efficiency. As its adoption proliferates, safeguarding the copyright of explicit 3DGS assets has become paramount. However, existing invisible message embedding frameworks struggle to reconcile secure and high-capacity data embedding with intrinsic asset utility, often disrupting the native rendering pipeline or exhibiting vulnerability to structural perturbations. In this work, we present Splats in Splats++, a unified and pipeline-agnostic steganography framework that seamlessly embeds high-capacity 3D/4D content directly within the native 3DGS representation. Grounded in a principled analysis of the frequency distribution of Spherical Harmonics (SH), we propose an importance-graded SH coefficient encryption scheme that achieves imperceptible embedding without compromising the original expressive power. To fundamentally resolve the geometric ambiguities that lead to message leakage, we introduce a Hash-Grid Guided Opacity Mapping mechanism. Coupled with a novel Gradient-Gated Opacity Consistency Loss, our formulation enforces a stringent spatial-attribute coupling between the original and hidden scenes, effectively projecting the discrete attribute mapping into a continuous, attack-resilient latent manifold. Extensive experiments demonstrate that our method substantially outperforms existing approaches, achieving up to 6.28 dB higher message fidelity, 3× faster rendering, and exceptional robustness against aggressive 3D-targeted structural attacks (e.g., GSPure). Furthermore, our framework exhibits remarkable versatility, generalizing seamlessly to 2D image embedding, 4D dynamic scene steganography, and diverse downstream tasks.
[CV-48] AHS: Adaptive Head Synthesis via Synthetic Data Augmentations CVPR2026
Quick Read: This paper addresses the limited real-world applicability of current head-swapping techniques, which mostly rely on face-centered cropped data with limited view angles and therefore struggle with diverse head poses, expression changes, and natural blending beyond facial regions. The key innovation of the proposed Adaptive Head Synthesis (AHS) is a novel head-reenacted synthetic data augmentation strategy that improves generalization across diverse expressions and head orientations through self-supervised training without paired data, yielding high-quality, identity-consistent, and expression-faithful head swaps across viewpoints.
Link: https://arxiv.org/abs/2604.15857
Authors: Taewoong Kang,Hyojin Jang,Sohyun Jeong,Seunggi Moon,Gihwi Kim,Hoon Jin Jung,Jaegul Choo
Affiliations: KAIST; Korea University; FLIPTION
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Project Page: this https URL
Abstract:Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one’s head is seamlessly integrated with another’s body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
[CV-49] Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection
Quick Read: This paper addresses the problem that, when some modalities are missing in multimodal remote-sensing semantic segmentation (due to sensor failures or atmospheric conditions), existing models struggle to preserve modality-invariant information while still exploiting modality-specific complementary information. The key to the proposed CBC-SLP model is a structured latent projection built in as an architectural inductive bias: the latent representation is explicitly decomposed into shared and modality-specific components, which are adaptively passed to the decoder according to a random modality-availability mask, maintaining a performance advantage under both full- and missing-modality scenarios and recovering complementary information that a purely shared representation would discard.
Link: https://arxiv.org/abs/2604.15856
Authors: Irem Ulku,Erdem Akagündüz,Ömer Özgür Tanrıöver
Affiliations: Ankara University; METU
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 9 tables
Abstract:Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at this https URL.
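The shared/modality-specific decomposition with an availability mask can be sketched as follows. This is a minimal linear-projection illustration of the structure, not CBC-SLP's actual architecture; all class and parameter names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

class StructuredLatentProjection:
    """Split each modality's features into a shared and a modality-specific
    component, then assemble the decoder input from whatever is available."""
    def __init__(self, n_mod, in_dim, shared_dim, private_dim):
        self.Ws = [rng.normal(scale=0.1, size=(in_dim, shared_dim))
                   for _ in range(n_mod)]
        self.Wp = [rng.normal(scale=0.1, size=(in_dim, private_dim))
                   for _ in range(n_mod)]

    def __call__(self, feats, avail):
        # Shared part: averaged over the modalities that are actually present.
        shared = np.mean([f @ W for f, W, a in zip(feats, self.Ws, avail) if a],
                         axis=0)
        # Private parts: zeroed for dropped modalities, so the decoder sees a
        # fixed-width input regardless of which sensors delivered data.
        private = [f @ W if a else np.zeros((f.shape[0], W.shape[1]))
                   for f, W, a in zip(feats, self.Wp, avail)]
        return np.concatenate([shared] + private, axis=-1)

proj = StructuredLatentProjection(n_mod=2, in_dim=16, shared_dim=8, private_dim=4)
feats = [rng.normal(size=(10, 16)) for _ in range(2)]
full = proj(feats, avail=[True, True])      # all modalities present
missing = proj(feats, avail=[True, False])  # second modality dropped
```

Training with random `avail` masks (modality dropout) is what would push the shared slots toward modality-invariant content while the private slots retain the complementary, modality-specific signal.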
[CV-50] Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment
Quick Read: This paper addresses the mismatch between conventional automated aesthetic quality assessment (AQA), which treats images as static pixel vectors and predicts from semantic perception alone, and human aesthetic cognition, which arises from dynamic visual exploration shaped by scanpaths, processing fluency, and the interplay of bottom-up salience and top-down intention. The key to the proposed AestheticNet, a cognition-inspired AQA paradigm, is a two-pathway architecture: a gaze-aligned visual encoder (GAVE), pre-trained offline on eye-tracking data with a resource-efficient contrastive gaze-alignment strategy, models attention from the human visual system; a fixed semantic encoder (e.g., CLIP) forms the semantic pathway, which the attention pathway augments via cross-attention fusion. Visual attention supplies a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all determinants of aesthetic perception beyond semantics. Experiments show consistent gains over semantic-only baselines, and the gaze module acts as a model-agnostic corrector compatible with diverse AQA backbones, confirming the necessity and modularity of human-like visual cognition for AQA.
Link: https://arxiv.org/abs/2604.15853
Authors: Liwen Yu,Chi Liu,Xiaotong Han,Congcong Zhu,Minghao Wang,Sheng Shen
Affiliations: City University of Macau; Sun Yat-sen University; Torrens University Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for Poster Presentation at CogSci 2026
Abstract:Automated Aesthetic Quality Assessment (AQA) treats images primarily as static pixel vectors, aligning predictions with human-rating scores largely through semantic perception. However, this paradigm diverges from human aesthetic cognition, which arises from dynamic visual exploration shaped by scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention. We introduce AestheticNet, a novel cognitive-inspired AQA paradigm that integrates human-like visual cognition and semantic perception with a two-pathway architecture. The visual attention pathway, implemented as a gaze-aligned visual encoder (GAVE) pre-trained offline on eye-tracking data using resource-efficient contrast gaze alignment, models attention from the human visual system. This pathway augments the semantic pathway, which uses a fixed semantic encoder such as CLIP, through cross-attention fusion. Visual attention provides a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all of which are determinants of aesthetic perception beyond semantics. Experiments validated by hypothesis testing show a consistent improvement over the semantic-alone baselines, and demonstrate the gaze module as a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA. Our code is available at this https URL.
[CV-51] Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration CVPR2026
Quick Read: This paper addresses the risk that text-to-image generative models inadvertently produce unsafe or undesirable content due to implicit biases in large-scale training data. Existing concept erasure methods face clear trade-offs: text-only approaches often fail to fully suppress the target concept, while naively image-guided methods risk over-erasing unrelated content, distorting semantic and visual information. The key to the proposed TICoE (Text-Image Collaborative Erasing framework) is a continuous convex concept manifold combined with hierarchical visual representation learning, which precisely removes target concepts while preserving unrelated semantic and visual content; a fidelity-oriented evaluation strategy objectively measures post-erasure usability. Experiments show TICoE surpasses prior methods in removal precision and content fidelity, enabling safer, more controllable generation.
Link: https://arxiv.org/abs/2604.15829
Authors: Jun Li,Lizhi Xiong,Ziqiang Li,Weiwei Jiang,Zhangjie Fu,Yong Li,Guo-Sen Xie
Affiliations: Nanjing University of Information Science and Technology; Southeast University; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 25 pages, accepted by CVPR 2026
Abstract:Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a text-image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation. Our code is available at this https URL
[CV-52] SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification CVPR
Quick Read: This paper addresses the difficulty of learning robust hyperspectral image (HSI) classifiers under high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts, motivating compact, generic models for scarce and imbalanced labels across diverse acquisition regimes. The key is a lightweight Spectral-Spatial Fusion Transformer (SSFT) that factorizes representation learning into spectral and spatial pathways and fuses them via cross-attention, capturing complementary wavelength-dependent and structural information. Experiments show SSFT achieves state-of-the-art overall performance on multiple benchmark datasets with fewer than 2% of the parameters of the previous leading method, and remains robust without data augmentation.
Link: https://arxiv.org/abs/2604.15828
Authors: Alexander Musiat,Nikolas Ebert,Oliver Wasenmüller
Affiliations: Mannheim University of Applied Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026
Abstract:Hyperspectral imaging enables fine-grained recognition of materials by capturing rich spectral signatures, but learning robust classifiers is challenging due to high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts. Beyond earth observation, labeled HSI data is often scarce and imbalanced, motivating compact models for generic hyperspectral classification across diverse acquisition regimes. We propose the lightweight Spectral-Spatial Fusion Transformer (SSFT), which factorizes representation learning into spectral and spatial pathways and integrates them via cross-attention to capture complementary wavelength-dependent and structural information. We evaluate our SSFT on the challenging HSI-Benchmark, a heterogeneous multi-dataset benchmark covering earth observation, fruit condition assessment, and fine-grained material recognition. SSFT achieves state-of-the-art overall performance, ranking first while using less than 2% of the parameters of the previous leading method. We further evaluate transfer to the substantially larger SpectralEarth benchmark under the official protocol, where SSFT remains competitive despite its compact size. Ablation studies show that both spectral and spatial pathways are crucial, with spatial modeling contributing most, and that SSFT remains robust without data augmentation.
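The two-pathway fusion described above can be sketched with a single-head cross-attention step in which one pathway's tokens query the other's. Token counts, dimensions, and the residual connection below are illustrative assumptions, not SSFT's published configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Tokens from one pathway attend over the tokens of the other pathway."""
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # rows sum to 1
    return A @ V

d = 16
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
spectral_tokens = rng.normal(size=(30, d))  # e.g., one token per band group
spatial_tokens = rng.normal(size=(9, d))    # e.g., one token per patch position
# Spatial pathway queries spectral evidence; the reverse direction is symmetric.
fused = spatial_tokens + cross_attention(spatial_tokens, spectral_tokens,
                                         Wq, Wk, Wv)
```

Because each pathway can stay narrow (spectral tokens never mix with spatial ones except at the fusion point), this factorization is one way such a model stays lightweight relative to a monolithic spectral-spatial transformer.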
[CV-53] Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Quick Read: This paper addresses the performance drop in movie emotion understanding caused by the domain gap between training data and deployment: existing work models emotion almost exclusively on native cinematic footage, while real robotic agents perceive video through an egocentric screen-view interface, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference that severely limit real-world generalization. The key contribution is EgoScreen-Emotion (ESE), the first benchmark for egocentric screen-view movie emotion understanding, containing 224 movie trailers captured under controlled conditions with 28,667 temporally aligned key-frames annotated under a multi-rater, confidence-aware multi-label protocol to handle emotional ambiguity, together with a multimodal long-context reasoning framework that fuses temporal visual evidence, narrative summaries, compressed historical context, and audio cues, substantially improving robustness under realistic viewing conditions.
Link: https://arxiv.org/abs/2604.15823
Authors: Ze Dong,Hao Shi,Zejia Gao,Zhonghua Yi,Kaiwei Wang,Lin Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages
Abstract:Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
[CV-54] Continual Hand-Eye Calibration for Open-world Robotic Manipulation
Quick Read: This paper addresses catastrophic forgetting in deep hand-eye calibration models as robots adapt to open-world scene changes, a problem that simple rehearsal-based continual-learning strategies fail to mitigate. The key to the proposed continual hand-eye calibration framework is two mechanisms: a Spatial-Aware Replay Strategy (SARS) that builds a geometrically uniform replay buffer, replacing redundant frames with maximally informative viewpoints to ensure comprehensive coverage of each scene's pose space; and Structure-Preserving Dual Distillation (SPDD), which decouples localization knowledge into coarse scene layout and fine pose precision and distills them separately to alleviate both types of forgetting. As each new scene arrives, SARS supplies representative samples of past scenes, SPDD applies structured distillation over them to retain prior knowledge, and selected new-scene samples join the buffer, letting the model continuously accumulate multi-scene calibration capability.
Link: https://arxiv.org/abs/2604.15814
Authors: Fazeng Li,Gan Sun,Chenxi Liu,Yao He,Wei Cong,Yang Cong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting into unseen data amongst open-world scene changes, while simple rehearsal-based continual learning strategy cannot well mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework, enabling robots to adapt to sequentially encountered open-world manipulation scenes through spatially replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show strong resistance to scene forgetting, maintaining accuracy on past scenes while preserving adaptation to new scenes, confirming the effectiveness of the framework.
[CV-55] Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的感知能力不足问题,即尽管模型能够定位到与问题相关的图像区域,却仍可能产生错误答案。研究表明,这一现象源于VLM内部信息流的次优设计:文本标记在解码过程中对无关视觉标记分配了过多注意力,从而引入干扰。解决方案的关键在于通过调节推理阶段的信息流动来提升模型的感知准确性,具体方法是基于标记动态(token dynamics)确定视觉标记的重要性——即在不同解码阶段表现出显著激活模式差异的视觉标记被视为关键信息源。该方法引导文本标记仅与重要视觉标记关联,从而消除冗余区域的干扰,显著提升了多个下游任务(如视觉问答、视觉定位、光学字符识别等)的性能。
链接: https://arxiv.org/abs/2604.15809
作者: Chengxin Liu,Wonseok Choi,Chenshuang Zhang,Tae-Hyun Oh
机构: KAIST; POSTECH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL
Abstract:Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: this https URL.
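该方法的核心操作——让文本标记只关注「激活模式随解码阶段变化显著」的视觉标记——可用如下示意代码说明。此处以跨阶段激活方差充当标记动态重要性的替代指标,具体指标与论文实现可能不同:

```python
import numpy as np

def modulate_attention(attn, importance, keep_ratio=0.5):
    """Zero out attention mass on unimportant visual tokens and
    renormalize, so a text token attends only to the important ones.
    `importance` scores each visual token (here: activation variance
    across decoding stages, a stand-in for token dynamics)."""
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-k:]            # top-k dynamic tokens
    mask = np.zeros_like(attn)
    mask[keep] = attn[keep]
    return mask / mask.sum()

# 4 个视觉标记在 3 个解码阶段的激活(行 = 阶段)
acts = np.array([[0.1, 0.9, 0.2, 0.8],
                 [0.1, 0.1, 0.2, 0.9],
                 [0.1, 0.5, 0.2, 0.7]])
importance = acts.var(axis=0)                     # 模式变化显著 -> 重要
attn = np.array([0.4, 0.2, 0.3, 0.1])
new_attn = modulate_attention(attn, importance, keep_ratio=0.5)
```

无关标记(激活恒定的 0、2 号)的注意力被清零,其余注意力重新归一化。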
[CV-56] Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
【速读】:该论文旨在解决医学视觉语言模型(VLMs)在临床影像分析中缺乏空间可解释性与多帧推理能力的问题,即现有模型通常仅基于二维图像进行预测,未能充分考虑医学影像的三维(volumetric)特性及病灶在多个切片中的分布。其解决方案的关键在于构建了一个名为SGMRI-VQA的41,307对多帧、空间定位明确的MRI视觉问答基准数据集,该数据集基于放射科专家标注的fastMRI+数据集(涵盖脑部和膝关节MRI),每条问答对均包含与临床一致的思维链(chain-of-thought)轨迹以及帧索引的边界框坐标(bounding box coordinates)。通过此结构化标注,任务被组织为检测、定位、计数/分类和描述四个层级,要求模型同时识别目标的存在、位置及其跨帧范围,从而实现更精准的空间感知与临床可解释性推理。实验表明,使用边界框监督对Qwen3-VL-8B进行微调能显著提升空间定位性能,验证了针对性的空间监督是实现临床场景下可靠视觉-语言联合推理的有效路径。
链接: https://arxiv.org/abs/2604.15808
作者: Lama Moukheiber,Caleb M. Yeung,Haotian Xue,Alec Helbling,Zelin Zhao,Yongxin Chen
机构: Georgia Institute of Technology (佐治亚理工学院); Harvard University (哈佛大学); Georgetown University (乔治城大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.
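每条问答对同时携带答案与帧索引的边界框,可用如下假设性数据结构示意(字段名为说明用途虚构,并非该基准的真实模式):

```python
from dataclasses import dataclass

@dataclass
class GroundedQA:
    """One SGMRI-VQA-style pair: an answer backed by frame-indexed
    bounding boxes. Field names are illustrative, not the benchmark's
    actual schema."""
    question: str
    answer: str
    boxes: dict  # frame index -> (x0, y0, x1, y1)

    def frame_span(self):
        """Number of slices the finding extends across."""
        frames = sorted(self.boxes)
        return frames[-1] - frames[0] + 1 if frames else 0

qa = GroundedQA(
    question="Is there a lesion, and across which slices?",
    answer="Yes, spanning slices 12-14.",
    boxes={12: (40, 40, 60, 60), 13: (38, 39, 62, 61), 14: (42, 41, 58, 59)},
)
span = qa.frame_span()
```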
[CV-57] Fed3D: Federated 3D Object Detection
【速读】:该论文旨在解决多机器人感知网络中3D目标检测模型训练时面临的隐私保护与通信效率问题。现有联邦学习方法在3D目标检测场景下难以应用,主要受限于3D数据异构性(如本地输入点云的不规则性和类别分布差异)以及有限的通信带宽。为此,作者提出了一种新型联邦3D目标检测框架Fed3D,其关键创新在于:首先设计局部-全局类别感知损失函数(local-global class-aware loss),以平衡不同类别在本地和全局层面的梯度反向传播速率,缓解因类别分布差异导致的全局模型性能下降;其次引入联邦3D提示模块(federated 3D prompt module),仅通过少量可学习参数传递提示信息,显著降低每轮通信开销。实验表明,Fed3D在有限本地训练数据条件下实现了优于当前最优算法的检测性能,同时具备更低的通信成本。
链接: https://arxiv.org/abs/2604.15795
作者: Suyan Dai,Chenxi Liu,Fazeng Li,Peican Lin
机构: South China University of Technology (华南理工大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D object detection models trained on a single server play an important role in autonomous driving, robotics manipulation, and augmented reality scenarios. However, most existing methods face severe privacy concerns when deployed on a multi-robot perception network to explore large-scale 3D scenes. Meanwhile, it is highly challenging to employ conventional federated learning methods in 3D object detection scenes, due to 3D data heterogeneity and limited communication bandwidth. In this paper, we take the first attempt to propose a novel Federated 3D object detection framework (i.e., Fed3D) to enable distributed learning for 3D object detection with privacy preservation. Specifically, the irregular input 3D objects in each local robot and the varying category distributions between robots cause local heterogeneity and global heterogeneity, respectively. We therefore propose a local-global class-aware loss for the 3D data heterogeneity issue, which balances the gradient back-propagation rates of different 3D categories from local and global aspects. To reduce communication cost in each round, we develop a federated 3D prompt module, which learns and communicates only prompts with few learnable parameters. In the end, extensive experiments on federated 3D object detection show that our Fed3D model significantly outperforms state-of-the-art algorithms with lower communication cost given limited local training data.
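「局部-全局类别感知损失」的一种直观做法是:按类别在本地与全局的出现频率反向加权,使稀有类别获得更大的梯度。以下为示意性草图,逆对数频率的加权形式是本文之外的假设,并非论文原始公式:

```python
import math

def class_aware_weights(local_counts, global_counts):
    """Balance per-class gradient rates from local and global views:
    classes that are rare (locally or globally) get larger loss
    weights. Inverse-log-frequency is an illustrative choice."""
    weights = {}
    for c in global_counts:
        local = local_counts.get(c, 0)
        w_local = 1.0 / math.log(2.0 + local)
        w_global = 1.0 / math.log(2.0 + global_counts[c])
        weights[c] = 0.5 * (w_local + w_global)
    return weights

# 某机器人本地以汽车为主,行人在本地与全局都稀有
w = class_aware_weights({"car": 90, "pedestrian": 2},
                        {"car": 500, "pedestrian": 40})
```

稀有类别(行人)的损失权重高于常见类别(汽车),从而平衡两类梯度的反向传播速率。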
[CV-58] SegMix:Shuffle-based Feedback Learning for Semantic Segmentation of Pathology Images
【速读】:该论文旨在解决病理图像语义分割任务中高质量像素级标注数据获取困难的问题,该问题限制了深度学习方法在计算病理学中的广泛应用。为缓解这一瓶颈,研究者提出了一种基于shuffle反馈学习的新方法,其关键在于引入受课程学习(curriculum learning)启发的图像块(patch)级别随机重排机制,并通过模型对先前学习反馈的自适应调整来优化伪标签生成过程,从而提升由图像级分类标签生成的伪像素级掩码质量,最终在三个不同数据集上实现了优于现有最先进方法的分割性能。
链接: https://arxiv.org/abs/2604.15777
作者: Zhiling Yan,Sicheng Chen,Tianyi Zhang,Nan Ying,Yanli Lei,Guanglei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Segmentation is a critical task in computational pathology, as it identifies areas affected by disease or abnormal growth and is essential for diagnosis and treatment. However, acquiring high-quality pixel-level supervised segmentation data demands a significant workload from experienced pathologists, limiting the application of deep learning. To overcome this challenge, relaxing the label conditions to image-level classification labels allows for more data to be used and more scenarios to be enabled. One approach is to leverage Class Activation Map (CAM) to generate pseudo pixel-level annotations for semantic segmentation with only image-level labels. However, this method fails to thoroughly explore the essential characteristics of pathology images, thus identifying only small areas that are insufficient for pseudo masking. In this paper, we propose a novel shuffle-based feedback learning method inspired by curriculum learning to generate higher-quality pseudo-semantic segmentation masks. Specifically, we perform patch-level shuffle of pathology images, with the model adaptively adjusting the shuffle strategy based on feedback from previous learning. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods on three different datasets.
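图像块级随机重排(patch-level shuffle)本身可以用几行 NumPy 示意:切块、置换、重组。论文中置换策略由训练反馈自适应调整,此处简化为固定随机置换,属于假设性草图:

```python
import numpy as np

def patch_shuffle(image, patch, seed=0):
    """Cut a 2D image into patch x patch tiles, permute them, and
    reassemble. The paper adapts the shuffle strategy from training
    feedback; here the permutation is a fixed random one."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    tiles = (image.reshape(gh, patch, gw, patch)
                  .transpose(0, 2, 1, 3)
                  .reshape(gh * gw, patch, patch))
    perm = np.random.default_rng(seed).permutation(gh * gw)
    tiles = tiles[perm]
    return (tiles.reshape(gh, gw, patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(h, w))

img = np.arange(16, dtype=float).reshape(4, 4)
shuffled = patch_shuffle(img, patch=2, seed=1)
```

重排只交换 2x2 图像块的位置,像素集合保持不变。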
[CV-59] PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding
【速读】:该论文旨在解决开放词汇表(open-vocabulary)三维场景理解中语义表示的准确性与空间精度难以兼顾的问题,尤其在将二维像素级语义信息传播至三维空间时,常因冗余导致存储和查询效率低下。其解决方案的关键在于提出一种像素级语言对齐特征提取框架(PLAF),该框架能够在不牺牲开放词汇表达能力的前提下,实现二维空间中的高密度且精确的语义对齐;在此基础上,进一步设计了高效的语义存储与查询机制,显著降低了二维与三维域中的冗余信息,从而为准确、高效的开放词汇表三维场景理解提供了坚实基础。
链接: https://arxiv.org/abs/2604.15770
作者: Junjie Wen,Junlin He,Fei Ma,Jinqiang Cui
机构: Peng Cheng Laboratory (鹏程实验室); Sun Yat-Sen University-Shenzhen Campus (中山大学深圳校区); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICCA 2026
Abstract:Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present \emphPLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that \emphPLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The codes are publicly available at this https URL.
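像素级语言对齐特征的开放词汇查询,本质上是像素特征与文本嵌入之间的余弦相似度匹配。以下为该查询步骤的极简示意(并非 PLAF 的实际存储与查询方案):

```python
import numpy as np

def open_vocab_query(pixel_feats, text_feats, labels):
    """Assign each pixel the label whose text embedding is most
    cosine-similar to its language-aligned feature; a stand-in for
    PLAF's actual query scheme."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sim = p @ t.T                     # (num_pixels, num_labels)
    return [labels[i] for i in sim.argmax(axis=1)]

pixel_feats = np.array([[1.0, 0.1], [0.1, 1.0]])
text_feats = np.array([[1.0, 0.0],    # "chair"
                       [0.0, 1.0]])   # "table"
pred = open_vocab_query(pixel_feats, text_feats, ["chair", "table"])
```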
[CV-60] Concept-wise Attention for Fine-grained Concept Bottleneck Models CVPR2026
【速读】:该论文旨在解决概念瓶颈模型(Concept Bottleneck Models, CBM)在概念建模中的两个关键问题:一是预训练偏差导致的粒度不匹配或对结构先验的依赖;二是使用二元交叉熵(Binary Cross-Entropy, BCE)损失进行微调时,忽略概念间的互斥性,从而造成对齐效果不佳。解决方案的关键在于提出一种名为CoAt-CBM的新框架,其核心创新是引入可学习的概念级视觉查询(concept-wise visual queries),以自适应地获取细粒度的概念视觉嵌入,并通过一种新颖的概念对比优化策略来引导模型关注概念分数的相对重要性,从而提升概念预测与图像内容的一致性和对齐精度。
链接: https://arxiv.org/abs/2604.15748
作者: Minghong Zhong,Guoshuai Zou,Kanghao Chen,Dexia Chen,Ruixuan Wang
机构: Sun Yat-sen University (中山大学); Peng Cheng Laboratory (鹏城实验室); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Key Laboratory of Machine Intelligence and Advanced Computing, MOE (教育部机器智能与先进计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, Accepted by CVPR 2026 Findings
Abstract:Recently impressive performance has been achieved in Concept Bottleneck Models (CBM) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e. CLIP). However, there exist two key limitations in concept modeling. Existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Moreover, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts, leading to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. Then, a novel concept contrastive optimization guides the model to handle the relative importance of the concept scores, enabling concept predictions to faithfully reflect the image content and improved alignment. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The codes will be available upon acceptance.
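「概念对比优化」区别于逐概念独立的 BCE:对概念分数向量做 softmax 交叉熵,提升正确概念的分数会同时压低竞争概念,从而编码互斥性。以下为该思想的示意形式,并非论文的精确目标函数:

```python
import math

def concept_contrastive_loss(scores, positive):
    """Softmax cross-entropy over the concept score vector: unlike
    per-concept BCE, raising the positive concept's score pushes down
    competitors, encoding mutual exclusivity among concepts."""
    m = max(scores)                              # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[positive] / sum(exps))

scores = [2.0, 0.5, -1.0]            # concept 0 matches the image best
loss_good = concept_contrastive_loss(scores, positive=0)
loss_bad = concept_contrastive_loss(scores, positive=2)
```

当真实概念恰好得分最高时损失最小,体现了对概念分数「相对重要性」的约束。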
[CV-61] Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
【速读】:该论文旨在解决通过手绘草图(hand-drawn sketches)或文本描述(textual descriptions)进行细粒度图像检索(fine-grained image retrieval)时存在的模态差异(modality gaps)问题。由于草图缺乏颜色和纹理信息,而文本虽能提供丰富的颜色与纹理语义但无法表达空间结构,二者存在显著互补性。解决方案的关键在于提出Sketch and Text Based Image Retrieval (STBIR)框架,其核心创新包括:(1) 基于课程学习的鲁棒性增强模块,提升模型对不同质量查询的适应能力;(2) 基于类别知识的特征空间优化模块,增强模型表征能力;(3) 多阶段跨模态特征对齐机制,有效缓解跨模态对齐难题。实验表明,该框架在细粒度图像检索任务中显著优于现有方法。
链接: https://arxiv.org/abs/2604.15735
作者: Siyuan Wang,Hanchen Gao,Guangming Zhu,Jiang Lu,Yiyue Ma,Tianci Wu,Jincai Huang,Liang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Image Retrieval, Hand-drawn Sketch, Multi-stage Cross-modal Feature Alignment
Abstract:Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. While hand-drawn sketches capture complex structural contours, they lack color and texture, which text effectively provides despite omitting spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By synergizing the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum-learning-driven robustness enhancement module is proposed to enhance the model's robustness when handling queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module, thereby significantly boosting the model's representational power. Finally, we design a multi-stage cross-modal feature alignment mechanism to effectively mitigate the challenges of cross-modal feature alignment. Furthermore, we curate the fine-grained STBIR benchmark dataset to rigorously validate the efficacy of our proposed framework and to serve as a data reference for subsequent research. Extensive experiments demonstrate that the proposed STBIR framework significantly outperforms state-of-the-art methods.
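草图(结构)与文本(颜色/纹理)两路相似度的融合检索,可用加权求和的方式示意。固定权重 `alpha` 是本文之外的假设,用以替代 STBIR 学习到的融合机制:

```python
import numpy as np

def fused_retrieval(sketch_q, text_q, gallery, alpha=0.5):
    """Rank gallery images by a weighted sum of sketch (structure) and
    text (color/texture) cosine similarities; a simplified stand-in
    for STBIR's learned fusion."""
    def cos(mat, q):
        mat = mat / np.linalg.norm(mat, axis=-1, keepdims=True)
        return mat @ (q / np.linalg.norm(q))
    score = (alpha * cos(gallery["sketch"], sketch_q)
             + (1 - alpha) * cos(gallery["text"], text_q))
    return np.argsort(-score)                    # best match first

gallery = {"sketch": np.array([[1.0, 0.0], [0.0, 1.0]]),
           "text":   np.array([[0.9, 0.1], [0.2, 0.8]])}
ranking = fused_retrieval(np.array([1.0, 0.1]), np.array([1.0, 0.0]), gallery)
```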
[CV-62] MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis
【速读】:该论文旨在解决基于Mamba的多实例学习(Multiple Instance Learning, MIL)在全切片图像(Whole Slide Image, WSI)分析中面临的三大挑战:(1) 一维序列展开导致二维空间局部性破坏;(2) 对细粒度细胞结构建模不足;(3) 在资源受限边缘设备上推理时内存峰值过高。其解决方案的关键在于提出一种新型混合架构MambaBack,通过三个核心设计实现优化:首先引入希尔伯特采样策略(Hilbert sampling strategy)以保留瓦片在一维序列中的二维空间局部性,增强空间感知能力;其次构建分层结构,包含基于MambaOut的1D门控卷积(Gated CNN)模块用于提取局部细胞特征,以及BiMamba2模块用于聚合全局上下文信息,从而提升多尺度表征能力;最后采用非对称分块设计(asymmetric chunking design),训练时并行处理、推理时分块流式累积,显著降低峰值内存占用,适用于边缘部署。
链接: https://arxiv.org/abs/2604.15729
作者: Sicheng Chen,Chad Wong,Tianyi Zhang,Enhui Chai,Zeyu Liu,Fei Xia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba’s SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial locality of tiles within 1D sequences, enhancing the model’s spatial perception. Second, we design a hierarchical structure comprising a 1D Gated CNN block based on MambaOut to capture local cellular features, and a BiMamba2 block to aggregate global context, jointly enhancing multi-scale representation. Finally, we implement an asymmetric chunking design, allowing parallel processing during training and chunking-streaming accumulation during inference, minimizing peak memory usage for deployment. Experimental results on five datasets demonstrate that MambaBack outperforms seven state-of-the-art methods. Source code and datasets are publicly available.
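希尔伯特采样的关键性质是:一维序列中相邻的瓦片在二维平面上也彼此邻近。下面给出标准的希尔伯特曲线索引计算(经典 xy-to-d 算法),并用它对 4x4 瓦片网格排序;这是对该策略的通用示意,非论文源码:

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along the Hilbert curve on an
    n x n grid (n a power of two). Consecutive positions in the 1D
    sequence are always adjacent cells in 2D."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                        # rotate the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# 将 WSI 瓦片坐标按希尔伯特曲线展平为一维序列
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: hilbert_index(4, *p))
```

排序后序列中每一步都只移动到相邻瓦片,从而在一维展平时保留二维空间局部性。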
[CV-63] Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images
【速读】:该论文旨在解决手持式眼底成像设备采集图像中存在的多种伪影(如闪光反射、曝光差异和运动模糊)导致的图像质量下降问题,这些问题严重影响了后续的诊断分析。解决方案的关键在于提出一种无监督的扩散自编码器(diffusion autoencoder),该模型通过引入上下文编码器(context encoder)与去噪过程相结合,能够在仅使用高质量台式眼底图像进行训练的情况下,学习语义上有意义的表示,并有效恢复受伪影影响的手持设备采集图像。这一方法无需配对监督或预设伪影结构,具备对非结构化退化场景的良好适应性。
链接: https://arxiv.org/abs/2604.15723
作者: Mathumetha Palani,Kavya Puthumana,Ayantika Das,Ganapathy Krishnamurthi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, 1 Table - ISBI IEEE 2025 CONFERENCE
Abstract:The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and is applied at inference to restore artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations, and show that diagnostic accuracy increases to 81.17% on an unseen dataset spanning multiple artifact conditions.
[CV-64] NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition
【速读】:该论文旨在解决传统视觉说话人识别方法在复杂环境条件下鲁棒性不足的问题,尤其是当声学线索不可用时,如何利用唇部运动(lip motion)的个体特异性行为动态实现稳定的身份识别。现有方法依赖于外观特征,在光照、视角变化下性能下降明显;而常规帧式相机难以捕捉精细的唇部运动动态,受限于运动模糊和低动态范围。解决方案的关键在于提出一种基于事件相机(event-based camera)的神经网络框架NeuroLip,其核心创新包括:1)时间感知的体素编码模块(Temporal-aware Voxel Encoding),通过自适应加权机制增强时间敏感性;2)结构感知的空间增强模块(Structure-aware Spatial Enhancer),在抑制噪声的同时保留垂直方向的运动结构信息;3)极性一致性正则化机制(Polarity Consistency Regularization),以保持事件极性中蕴含的运动方向信息。该方案在严格跨场景协议下实现了高精度识别,显著优于现有方法。
链接: https://arxiv.org/abs/2604.15718
作者: Junguang Yao,Wenye Liu,Stjepan Picek,Yue Zheng
机构: The Chinese University of Hong Kong, Shenzhen (深圳大学); Faculty of Electrical Engineering and Computing University of Zagreb (萨格勒布大学电气工程与计算学院); Faculty of Science, Radboud University (奈梅亨大学科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
备注:
Abstract:Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal-aware Voxel Encoding module with adaptive event weighting, 2) Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at this https URL.
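「时间感知体素编码」的基本操作是把事件流 (t, x, y, p) 按时间分箱累积成体素网格,并按时间位置加权。下面以线性近因加权粗略替代论文的自适应事件加权,属于示意性实现:

```python
import numpy as np

def voxel_encode(events, bins, height, width, t_max):
    """Accumulate events (t, x, y, p) into a temporal voxel grid with
    a simple linear recency weight inside each bin: a rough stand-in
    for the paper's adaptive temporal-aware weighting. Polarity p
    contributes with sign, preserving motion-direction cues."""
    grid = np.zeros((bins, height, width))
    for t, x, y, p in events:
        pos = t / t_max * bins
        b = min(int(pos), bins - 1)
        frac = pos - b                    # position inside the bin
        grid[b, y, x] += (1 if p > 0 else -1) * (0.5 + 0.5 * frac)
    return grid

events = [(0.0, 1, 1, 1), (0.4, 1, 1, 1), (0.9, 2, 2, -1)]
grid = voxel_encode(events, bins=2, height=4, width=4, t_max=1.0)
```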
[CV-65] SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification
【速读】:该论文旨在解决病理图像分析中基于视觉Transformer(Vision Transformer, ViT)的ROI级基础模型(Foundation Models, FMs)存在的三大核心问题:跨放大倍数域偏移、局部-全局关系建模不足以及细粒度敏感性欠缺。针对这些问题,论文提出了一种混合自监督学习(Self-Supervised Learning, SSL)框架SSMamba,其关键创新在于引入三个领域自适应组件:用于缓解域偏移的Mamba掩码图像建模(MAMIM)、用于平衡局部与全局特征建模的方向多尺度(Directional Multi-scale, DMS)模块,以及用于提升细粒度敏感性的局部感知残差(Local Perception Residual, LPR)模块。通过两阶段流程——目标ROI数据集上的SSL预训练与后续监督微调(Supervised Fine-Tuning, SFT),SSMamba在10个公开ROI数据集和6个WSI数据集上均超越了11种SOTA病理基础模型及8种SOTA方法,验证了任务特异性架构设计在病理图像分析中的有效性。
链接: https://arxiv.org/abs/2604.15711
作者: Enhui Chai,Sicheng Chen,Tianyi Zhang,Xingyu Li,Tianxiang Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.
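掩码图像建模(MIM)预训练的第一步是选择被遮挡的图像块:模型需从可见块重建被遮挡块。以下为通用的均匀随机掩码示意(MAMIM 的具体掩码策略可能不同):

```python
import numpy as np

def random_mask(num_patches, mask_ratio, seed=0):
    """Pick which patch tokens to hide for masked image modeling;
    the model must reconstruct the masked patches from the visible
    ones. Uniform random masking, as a generic MIM sketch."""
    rng = np.random.default_rng(seed)
    n_mask = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return perm[:n_mask], perm[n_mask:]    # masked ids, visible ids

# ViT/Mamba 风格的 14x14 = 196 个图像块,遮挡 75%
masked, visible = random_mask(num_patches=196, mask_ratio=0.75)
```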
[CV-66] APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition CVPR2026
【速读】:该论文旨在解决深度神经网络在3D点云识别任务中对对抗攻击的脆弱性问题,同时克服现有防御方法在鲁棒性(robustness)与迁移性(transferability)之间的权衡困境。其解决方案的关键在于提出一种轻量级输入级净化模块——对抗点反制(Adversarial Point Counterattack, APC),该模块为每个点生成实例特定的反向扰动(counter-perturbations),通过利用干净样本与对抗样本对,在数据空间中强制几何一致性、在特征空间中保持语义一致性,从而有效中和多种类型的对抗攻击。此外,采用混合训练策略融合多种攻击类型的对抗点云以提升泛化能力,并因纯输入级操作特性实现无需重训练即可直接迁移至未见过的模型进行防御,推理时仅需一次前向传播即可获得净化后的点云,计算与参数开销极低。
链接: https://arxiv.org/abs/2604.15708
作者: Geunyoung Jung,Soohong Kim,Inseok Kong,Jiyoung Jung
机构: University of Seoul(首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 Findings
Abstract:The advent of deep neural networks has led to remarkable progress in 3D point cloud recognition, but they remain vulnerable to adversarial attacks. Although various defense methods have been studied, they suffer from a trade-off between robustness and transferability. We propose Adversarial Point Counterattack (APC) to achieve both simultaneously. APC is a lightweight input-level purification module that generates instance-specific counter-perturbations for each point, effectively neutralizing attacks. Leveraging clean-adversarial pairs, APC enforces geometric consistency in data space and semantic consistency in feature space. To improve generalizability across diverse attacks, we adopt a hybrid training strategy using adversarial point clouds from multiple attack types. Since APC operates purely on input point clouds, it directly transfers to unseen models and defends against attacks targeting them without retraining. At inference, a single APC forward pass provides purified point clouds with negligible time and parameter overhead. Extensive experiments on two 3D recognition benchmarks demonstrate that the APC achieves state-of-the-art defense performance. Furthermore, cross-model evaluations validate its superior transferability. The code is available at this https URL.
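APC 训练目标的两个约束——数据空间的几何一致性与特征空间的语义一致性——可用「Chamfer 距离 + 特征 L2」的组合示意。权重与具体形式为本文假设,并非论文的精确损失:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets."""
    d = np.linalg.norm(a[:, None] - b[None, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def apc_loss(purified, clean, feat_purified, feat_clean, lam=1.0):
    """Counterattack objective sketch: geometric consistency in data
    space (Chamfer) plus semantic consistency in feature space (L2)."""
    geo = chamfer(purified, clean)
    sem = np.linalg.norm(feat_purified - feat_clean) ** 2
    return geo + lam * sem

clean = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
purified = clean + 0.01               # residual perturbation after purification
loss = apc_loss(purified, clean, np.ones(8), np.ones(8) * 1.01)
```

净化结果与干净样本完全一致时损失为零,残余扰动越大损失越高。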
[CV-67] LP2DH: A Locality-Preserving Pixel-Difference Hashing Framework for Dynamic Texture Recognition
【速读】:该论文旨在解决时空局部二值模式(Spatiotemporal Local Binary Pattern, STLBP)特征维度极高且在三个正交平面上独立提取导致跨平面相关性丢失的问题。其解决方案的关键在于提出一种保持局部结构的像素差值哈希(Locality-Preserving Pixel-Difference Hashing, LP²DH)框架,该框架通过联合编码全时空邻域内的像素差值向量(Pixel-Difference Vectors, PDVs),将其映射为具有最大判别能力的紧凑二进制码;同时引入保持局部结构嵌入机制,在哈希前后保留PDVs的局部流形结构,并采用曲线搜索策略在Stiefel流形上联合优化哈希矩阵与二进制码,最终结合字典学习将二进制向量编码为词袋表示,从而实现高精度动态纹理识别。
链接: https://arxiv.org/abs/2604.15707
作者: Ruxin Ding,Jianfeng Ren,Heng Yu,Jiawei Li,Xudong Jiang
机构: University of Nottingham Ningbo China (诺丁汉大学宁波分校); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spatiotemporal Local Binary Pattern (STLBP) is a widely used dynamic texture descriptor, but it suffers from extremely high dimensionality. To tackle this, STLBP features are often extracted on three orthogonal planes, which sacrifices inter-plane correlation. In this work, we propose a Locality-Preserving Pixel-Difference Hashing (LP²DH) framework that jointly encodes pixel differences in the full spatiotemporal neighbourhood. LP²DH transforms Pixel-Difference Vectors (PDVs) into compact binary codes with maximal discriminative power. Furthermore, we incorporate a locality-preserving embedding to maintain the PDVs' local structure before and after hashing. Then, a curvilinear search strategy is utilized to jointly optimize the hashing matrix and binary codes via gradient descent on the Stiefel manifold. After hashing, dictionary learning is applied to encode the binary vectors into codewords, and the resulting histogram is utilized as the final feature representation. The proposed LP²DH achieves state-of-the-art performance on three major dynamic texture recognition benchmarks: 99.80% against DT-GoogleNet's 98.93% on UCLA, 98.52% against HoGF³D's 97.63% on DynTex++, and 96.19% compared to STS's 95.00% on YUPENN. The source code is available at: this https URL.
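从时空体中提取像素差值向量(PDV)再投影二值化的流水线可以简要示意如下。这里用 6 邻域和固定的随机正交投影代替论文中在 Stiefel 流形上学习得到的哈希矩阵,属于假设性草图:

```python
import numpy as np

def pixel_difference_vectors(volume):
    """PDVs for interior voxels of a small spatiotemporal volume:
    differences between each voxel and its 6 axial neighbours
    (a reduced neighbourhood, for illustration only)."""
    t, h, w = volume.shape
    pdvs = []
    for k in range(1, t - 1):
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                c = volume[k, i, j]
                pdvs.append([volume[k - 1, i, j] - c, volume[k + 1, i, j] - c,
                             volume[k, i - 1, j] - c, volume[k, i + 1, j] - c,
                             volume[k, i, j - 1] - c, volume[k, i, j + 1] - c])
    return np.array(pdvs)

def hash_codes(pdvs, proj):
    """Project PDVs with a hashing matrix and binarize by sign; the
    paper learns this matrix on the Stiefel manifold, here it is a
    fixed random orthonormal one."""
    return (pdvs @ proj > 0).astype(int)

rng = np.random.default_rng(0)
vol = rng.random((4, 4, 4))
proj, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # orthonormal columns
codes = hash_codes(pixel_difference_vectors(vol), proj)
```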
[CV-68] P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models ICRA2026
【速读】:该论文旨在解决预训练3D视觉语言模型(VLMs)在下游任务中进行微调时存在的计算资源消耗大和过拟合导致泛化能力下降的问题。传统全参数微调方法虽有效但效率低,而现有提示调优(prompt tuning)方法易因过拟合损害模型泛化性能。其解决方案的关键在于提出一种参数高效的提示调优方法——原型点级提示调优(Prototypical Point-level Prompt Tuning, P³T),该方法包含两个核心组件:1)点提示器(Point Prompter),为输入点云生成实例感知的点级提示;2)文本提示器(Text Prompter),将可学习提示嵌入文本输入而非依赖人工设计提示。此外,通过引入原型损失(prototypical loss)增强嵌入空间对齐,降低类别内方差,从而在不牺牲通用性的前提下实现任务特定适配。实验表明,P³T在分类与少样本学习任务上达到或超越全微调效果,并在跨数据集场景下展现出更强的鲁棒性。
链接: https://arxiv.org/abs/2604.15703
作者: Geunyoung Jung,Soohong Kim,Kyungwoo Song,Jiyoung Jung
机构: University of Seoul (首尔大学); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA 2026
Abstract:With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P³T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P³T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which employs learnable prompts in the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P³T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at this https URL.
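「降低类别内方差」的原型损失,最简形式是把每个嵌入拉向其类别原型(类均值)。以下示意该思想(非论文的精确公式):

```python
import numpy as np

def prototypical_loss(embeddings, labels):
    """Pull each embedding toward its class prototype (the class
    mean): mean squared distance to the prototype, averaged over
    classes. Lower values mean tighter, better-aligned clusters."""
    loss = 0.0
    classes = set(labels)
    for c in classes:
        members = embeddings[[i for i, y in enumerate(labels) if y == c]]
        proto = members.mean(axis=0)
        loss += ((members - proto) ** 2).sum(axis=1).mean()
    return loss / len(classes)

labels = [0, 0, 1, 1]
tight = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
loose = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
l_tight = prototypical_loss(tight, labels)
l_loose = prototypical_loss(loose, labels)
```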
[CV-69] Self-Supervised Angular Deblurring in Photoacoustic Reconstruction via Noisier2Inverse
【速读】:该论文旨在解决光声断层成像(Photoacoustic Tomography, PAT)中因有限尺寸探测器导致的图像系统性模糊问题,该问题在传统基于理想点状或线状探测器的重建算法(如傅里叶方法、滤波反投影和时间反转)中尤为突出。解决方案的关键在于提出一种无需真实标签数据的自监督重建方法——Noisier2Inverse,其核心创新包括:(i) 将问题重构为角度去卷积(angular deblurring)形式以适配PAT特性;(ii) 在极坐标域中采用Noisier2Inverse框架,利用已知的角度点扩散函数(angular point-spread function)进行建模;(iii) 设计了一种基于统计理论的早停规则,从而在不依赖地面真值的情况下实现高质量图像恢复。实验表明,该方法在无监督场景下性能优于现有替代方案,并接近有监督基准水平,且适用于实际有限尺寸探测器采集的数据。
链接: https://arxiv.org/abs/2604.15681
作者: Markus Haltmeier,Nadja Gruber,Gyeongha Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Photoacoustic tomography (PAT) is an emerging imaging modality that combines the complementary strengths of optical contrast and ultrasonic resolution. A central task is image reconstruction, where measured acoustic signals are used to recover the initial pressure distribution. For ideal point-like or line-like detectors, several efficient and fast reconstruction algorithms exist, including Fourier methods, filtered backprojection, and time reversal. However, when applied to data acquired with finite-size detectors, these methods yield systematically blurred images. Although sharper images can be obtained by compensating for finite-detector effects, supervised learning approaches typically require ground-truth images that may not be available in practice. We propose a self-supervised reconstruction method based on Noisier2Inverse that addresses finite-size detector effects without requiring ground-truth data. Our approach operates directly on noisy measurements and learns to recover high-quality PAT images in a ground-truth-free manner. Its key components are: (i) PAT-specific modeling that recasts the problem as angular deblurring; (ii) a Noisier2Inverse formulation in the polar domain that leverages the known angular point-spread function; and (iii) a novel, statistically grounded early-stopping rule. In experiments, the proposed method consistently outperforms alternative approaches that do not use supervised data and achieves performance close to supervised benchmarks, while remaining practical for real acquisitions with finite-size detectors.
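「将有限尺寸探测器效应重构为角度去卷积」可用极坐标域中的环形卷积示意:图像沿角度轴与已知的角度点扩散函数(PSF)做循环卷积,即为前向模糊模型。以下为该前向模型的极简草图(非论文的完整重建流程):

```python
import numpy as np

def angular_blur(polar_img, psf):
    """Finite-detector forward model in the polar domain: each radial
    profile is circularly convolved along the angular axis with the
    known angular point-spread function (via the convolution theorem)."""
    n_theta = polar_img.shape[0]
    k = np.zeros(n_theta)
    half = len(psf) // 2
    for i, v in enumerate(psf):
        k[(i - half) % n_theta] = v       # center the kernel at angle 0
    K = np.fft.fft(k)
    spectrum = np.fft.fft(polar_img, axis=0) * K[:, None]
    return np.real(np.fft.ifft(spectrum, axis=0))

img = np.zeros((8, 4))
img[3, 1] = 1.0                           # a point source in polar coords
blurred = angular_blur(img, psf=np.array([0.25, 0.5, 0.25]))
```

点源被沿角度方向展宽为 PSF 的形状,这正是自监督网络需要逆转的退化。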
[CV-70] Hierarchical Active Inference using Successor Representations
【速读】:该论文旨在解决将基于自由能原理(Free Energy Principle, FEP)的主动推理(Active Inference)模型扩展至复杂现实环境中大规模问题时所面临的可扩展性挑战。其核心问题是:如何在保持神经科学合理性的同时,实现高效且具层次结构的规划能力。解决方案的关键在于提出一种基于分层主动推理的规划框架,该框架结合了环境的分层建模与后续表示(Successor Representations, SRs),通过低层SR学习高层抽象状态和动作,并利用低层主动推理的规划结果来引导高层抽象动作的学习与优化,从而实现跨尺度的状态抽象与动作抽象,显著提升在多种强化学习(Reinforcement Learning, RL)任务中的规划效率。
链接: https://arxiv.org/abs/2604.15679
作者: Prashant Rangarajan,Rajesh P. N. Rao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in Neural Computation (MIT Press). 82 pages, 29 figures
Abstract:Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a variant of the well-known four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. Our results represent, to our knowledge, the first application of learned hierarchical state and action abstractions to active inference in FEP-based theories of brain function.
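摘要中提到的后继表示(Successor Representations, SR)可用表格型 TD 更新学习,其不动点为解析解 (I − γP)⁻¹。下面给出一个环形随机游走上的最小示例(策略、步长计划等均为示意性假设,与论文的分层模型无关):

```python
import numpy as np

n, gamma = 5, 0.9
P = np.zeros((n, n))
for s in range(n):                       # 环形随机游走策略的转移矩阵
    P[s, (s - 1) % n] = P[s, (s + 1) % n] = 0.5

M = np.zeros((n, n))                     # 后继表示矩阵
rng = np.random.default_rng(0)
s = 0
for t in range(60_000):
    alpha = 0.1 if t < 30_000 else 0.01  # 后期减小步长以降低估计波动
    s_next = int(rng.choice(n, p=P[s]))
    # SR 的 TD 更新:M(s,.) += alpha * (onehot(s) + gamma*M(s',.) - M(s,.))
    target = np.eye(n)[s] + gamma * M[s_next]
    M[s] += alpha * (target - M[s])
    s = s_next

M_true = np.linalg.inv(np.eye(n) - gamma * P)   # 该策略下的解析 SR
assert np.abs(M - M_true).max() < 0.5
```

论文在此基础上利用低层 SR 聚合出高层抽象状态;上例仅演示最底层的 SR 学习机制。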
[CV-71] HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning CVPR2026
【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Models, VLMs)在跨领域少样本类增量学习(Few-Shot Class-Incremental Learning, FSCIL)中因数据分布不均衡和领域异质性导致的性能退化问题。具体而言,作者识别出“域引力”(Domain Gravity)这一表征不对称现象——即在异构领域中,样本量多或信息熵低的域会主导嵌入空间,引发原型漂移,从而损害低频或高复杂度域的分类性能。解决方案的关键在于提出一种无需训练的混合原型校准方法(Hybrid Prototype Calibration, HyCal),该方法融合余弦相似度与马氏距离,分别捕获特征方向对齐性和协方差感知的幅度特性,在冻结CLIP嵌入的基础上实现稳定且高效的原型校准,有效缓解域引力并提升跨域不平衡增量学习的表现。
链接: https://arxiv.org/abs/2604.15678
作者: Eunju Lee,MiHyeon Kim,JuneHyoung Kwon,Yoonji Lee,JiHyun Kim,Soojin Jang,YoungBin Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Eunju Lee and MiHyeon Kim contributed equally as co-first authors
Abstract:Pretrained Vision-Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties (directional alignment and covariance-aware magnitude), yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention-adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.
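HyCal 将余弦相似度与马氏距离结合用于原型打分,具体公式摘要未给出;下面是一个示意性草图(加权系数 w 与共享协方差的估计方式均为本示例的假设,并非论文设定):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
protos = rng.normal(size=(3, d))                 # 每个类别一个原型
# 用一批辅助特征估计共享协方差,加 0.1*I 保证可逆(均为示例假设)
cov = np.cov(rng.normal(size=(100, d)), rowvar=False) + 0.1 * np.eye(d)
cov_inv = np.linalg.inv(cov)

def hybrid_score(x, w=0.5):
    # 余弦项:方向对齐;马氏项:协方差感知的幅度
    cos = protos @ x / (np.linalg.norm(protos, axis=1) * np.linalg.norm(x))
    diff = protos - x
    maha = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return w * cos - (1 - w) * maha              # 分数越高越相似

query = protos[1] + 0.05 * rng.normal(size=d)    # 靠近第 1 类原型的查询
assert int(np.argmax(hybrid_score(query))) == 1
```

由于整个流程无需训练、只依赖冻结特征上的统计量,这与摘要所述的 training-free 特性一致。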
[CV-72] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation CVPR2026
【速读】:该论文旨在解决无人机(UAV)遥感图像中推理分割(Reasoning Segmentation)任务的挑战,特别是由倾斜视角、超高分辨率和极端尺度变化带来的复杂性。为应对这些问题,作者首次正式定义了UAV推理分割任务,并将其语义需求归纳为空间(Spatial)、属性(Attribute)和场景级(Scene-level)三个维度的推理能力。解决方案的关键在于构建了一个大规模基准数据集DRSeg,包含10,000张高分辨率航拍图像及对应Chain-of-Thought问答监督信号,覆盖全部三类推理类型;同时提出一个简单但高效的像素级多模态语言模型PixDLM,作为统一基线方法,显著提升了该任务的评估与研究基础。
链接: https://arxiv.org/abs/2604.15670
作者: Shuyan Ke,Yifan Mei,Changli Wu,Yonghan Zheng,Jiayi Ji,Liujuan Cao,Rongrong Ji
机构: Xiamen University (厦门大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (highlight)
Abstract:Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
[CV-73] CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment
【速读】:该论文旨在解决基于单目视频的三维生物力学分析(3D biomechanics analysis)在消费级硬件和低资源环境中的部署难题,当前多数研究级流程依赖GPU加速,限制了其可及性和扩展性。解决方案的关键在于对MonocularBiomechanics框架衍生的管道进行系统级优化,包括重构模型初始化逻辑、消除磁盘I/O序列化操作以及提升CPU并行化效率,从而实现仅用CPU即可显著提升处理吞吐量(提升2.47倍)和降低运行时间(减少59.6%),同时保持与基线结果的高度一致性(关节角度偏差均值仅0.35°,相关系数r=0.998)。
链接: https://arxiv.org/abs/2604.15665
作者: Yan Zhang,Xiong Zhao
机构: Google LLC (Google); AccMov Health Inc. (AccMov Health Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注:
Abstract:Markerless 3D movement analysis from monocular video enables accessible biomechanical assessment in clinical and sports settings. However, most research-grade pipelines rely on GPU acceleration, limiting deployment on consumer-grade hardware and in low-resource environments. In this work, we optimize a monocular 3D biomechanics pipeline derived from the MonocularBiomechanics framework for efficient CPU-only execution through profiling-driven system optimization, including model initialization restructuring, elimination of disk I/O serialization, and improved CPU parallelization. Experiments on a consumer workstation (AMD Ryzen 7 9700X CPU) show a 2.47x increase in processing throughput and a 59.6% reduction in total runtime, with initialization latency reduced by 4.6x. Despite these changes, biomechanical outputs remain highly consistent with the baseline implementation (mean joint-angle deviation 0.35°, r = 0.998). These results demonstrate that research-grade vision-based biomechanics pipelines can be deployed on commodity CPU hardware for scalable movement assessment.
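摘要所述的 CPU 并行化可以用线程池按帧分发来示意(process_frame 为占位计算,仅展示调用结构;受 GIL 影响,纯 Python 计算需换用多进程或释放 GIL 的库才能真正加速):

```python
from concurrent.futures import ThreadPoolExecutor

# 示意性草图:逐帧处理改为并行分发。process_frame 是真实管道中
# 单帧姿态/生物力学计算步骤的占位符(本示例的假设)。
def process_frame(frame_id):
    return frame_id * frame_id          # 占位计算

frames = list(range(100))
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_frame, frames))

# Executor.map 保持输入顺序,便于后续按时间序列拼接结果
assert results == [f * f for f in frames]
```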
[CV-74] From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark
【速读】:该论文旨在解决超高清(Ultra-high-definition, UHD)图像复原中因高空间分辨率、内容多样性及精细结构带来的独特挑战。其解决方案的关键在于提出一种分阶段的频谱分解策略,将复原过程划分为零频增强(zero-frequency enhancement)、低频恢复(low-frequency restoration)和高频精修(high-frequency refinement)三个阶段,并据此设计了一个名为ERR的协同式框架,包含三个子网络:零频增强器(ZFE)、低频复原器(LFR)和高频精修器(HFR)。其中,HFR创新性地引入了频率窗化的Kolmogorov-Arnold网络(FW-KAN),用于高效恢复细粒度纹理与复杂细节,从而实现高保真度的UHD图像复原。
链接: https://arxiv.org/abs/2604.15654
作者: Chen Zhao,Yunzhe Xu,Zhizhou Chen,Enxuan Gu,Kai Zhang,Xiaoming Liu,Jian Yang,Ying Tai
机构: Nanjing University (南京大学); Michigan State University (密歇根州立大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TPAMI
Abstract:Ultra-high-definition (UHD) image restoration poses unique challenges due to the high spatial resolution, diverse content, and fine-grained structures present in UHD images. To address these issues, we introduce a progressive spectral decomposition for the restoration process, decomposing it into three stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Based on this formulation, we propose a novel framework, ERR, which integrates three cooperative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). The ZFE incorporates global priors to learn holistic mappings, the LFR reconstructs the main content by focusing on coarse-scale information, and the HFR adopts our proposed frequency-windowed Kolmogorov-Arnold Network (FW-KAN) to recover fine textures and intricate details for high-fidelity restoration. To further advance research in UHD image restoration, we also construct a large-scale, high-quality benchmark dataset, LSUHDIR, comprising 82,126 UHD images with diverse scenes and rich content. Our proposed methods demonstrate superior performance across a range of UHD image restoration tasks, and extensive ablation studies confirm the contribution and necessity of each module. Project page: this https URL.
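零频/低频/高频三段式分解可以用 FFT 在频域按半径划分来示意(截断半径 8 为任意示例取值,并非论文设定):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))

F = np.fft.fftshift(np.fft.fft2(img))       # 频谱中心化,DC 位于 (16, 16)
yy, xx = np.mgrid[:32, :32]
r = np.hypot(yy - 16, xx - 16)              # 频域半径

zero_mask = r == 0                          # 零频:全局亮度(DC 分量)
low_mask = (r > 0) & (r <= 8)               # 低频:主体内容
high_mask = r > 8                           # 高频:纹理与细节

bands = [np.fft.ifft2(np.fft.ifftshift(F * m)).real
         for m in (zero_mask, low_mask, high_mask)]

# 三个掩码恰好划分整个频谱,因此三段相加可精确还原原图
assert np.allclose(sum(bands), img)
```

ERR 的三个子网络分别作用于这三个频段;上例仅演示分解本身,网络结构不在此范围内。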
[CV-75] owards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline
【速读】:该论文旨在解决开放词汇遥感图像分割(Open-vocabulary remote sensing image segmentation, OVRSIS)领域中存在的关键挑战,包括数据集碎片化、训练多样性不足以及缺乏能够反映真实地理空间应用场景的评估基准。为应对这些问题,作者提出了OVRSISBenchV2这一大规模、面向应用的基准,其核心创新在于构建了包含约95K张图像-掩码对的OVRSIS95K基础数据集(覆盖35个常见语义类别),并整合10个下游任务数据集,形成总计170K图像和128类别的多场景、高复杂度评估体系,显著提升了场景多样性和任务真实性。解决方案的关键是提出Pi-Seg基线模型,通过引入正向激励噪声机制(positive-incentive noise mechanism),在训练中利用可学习且语义引导的扰动扩展视觉-文本特征空间,从而增强模型的跨域迁移能力。实验表明,Pi-Seg在OVRSISBenchV1、OVRSISBenchV2及多个下游任务上均表现优异,尤其在更具挑战性的OVRSISBenchV2上展现出强一致性性能,验证了该方法的有效性与现实适用性。
链接: https://arxiv.org/abs/2604.15652
作者: Bingyu Li,Tao Huo,Haocheng Dong,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
机构: University of Science and Technology of China (中国科学技术大学); School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University (西北工业大学人工智能学院); Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image–mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at this https URL (LiBingyu01/RSKT-Seg/tree/Pi-Seg).
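Pi-Seg 的"正向激励噪声"在摘要中描述为训练期间对视觉-文本特征空间的扰动;下面用固定尺度的高斯扰动作最简示意(σ 为假设值,真实方法中扰动是可学习且受语义引导的):

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=512)
feat /= np.linalg.norm(feat)        # L2 归一化的视觉特征

sigma = 0.01                        # 扰动尺度,本示例的假设值
noisy = feat + sigma * rng.normal(size=512)
noisy /= np.linalg.norm(noisy)      # 扰动后重新归一化

# 小尺度扰动在拓宽特征分布的同时保持语义接近(余弦相似度高)
cos = float(feat @ noisy)
assert 0.9 < cos <= 1.0
```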
[CV-76] SPLIT: Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography
【速读】:该论文旨在解决非线性断层成像中缺乏真实图像(ground-truth images)样本时,如何利用自监督学习(self-supervised learning)进行高质量重建的问题。传统机器学习方法依赖于配对的测量数据与真实图像,但在实际场景中此类数据往往不可得;现有自监督方法主要局限于去噪和线性逆问题,难以处理非线性、不完整且含噪的投影数据。论文提出的解决方案是SPLIT(Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography),其关键在于通过跨分区一致性约束(cross-partition consistency)和测量域保真度(measurement-domain fidelity)来建模数据结构,并利用多分区间的互补信息实现无监督训练。理论分析表明,在温和条件下,所提自监督目标函数在期望意义上等价于监督目标;同时引入基于无参考图像质量代理指标的自动停止规则以优化训练过程。该框架在多光谱计算机断层成像(multispectral computed tomography)中取得了优于经典迭代重建和最新自监督基线方法的性能。
链接: https://arxiv.org/abs/2604.15651
作者: Markus Haltmeier,Lukas Neumann,Nadja Gruber,Gyeongha Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine learning has achieved impressive performance in tomographic reconstruction, but supervised training requires paired measurements and ground-truth images that are often unavailable. This has motivated self-supervised approaches, which have primarily addressed denoising and, more recently, linear inverse problems. We address nonlinear inverse problems and introduce SPLIT (Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography), a self-supervised machine-learning framework for reconstructing images from nonlinear, incomplete, and noisy projection data without any samples of ground-truth images. SPLIT enforces cross-partition consistency and measurement-domain fidelity while exploiting complementary information across multiple partitions. Our main theoretical result shows that, under mild conditions, the proposed self-supervised objective is equivalent to its supervised counterpart in expectation. We regularize training with an automatic stopping rule that halts optimization when a no-reference image-quality surrogate saturates. As a concrete application, we derive SPLIT variants for multispectral computed tomography. Experiments on sparse-view acquisitions demonstrate high reconstruction quality and robustness to noise, surpassing classical iterative reconstruction and recent self-supervised baselines.
[CV-77] Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification
【速读】:该论文旨在解决无监督可见-红外视频行人重识别(Unsupervised Visible-Infrared Video Re-Identification, USL-VVI-ReID)中因缺乏跨模态标注导致的性能瓶颈问题,特别是由通用预训练编码器引发的身份判别能力弱、模态偏差严重、类内混淆加剧及跨模态聚类粒度不均等挑战。解决方案的关键在于提出一种因果引导的自举对齐框架(Causal Bootstrapped Alignment, CBA),其核心包括两个模块:一是因果干预热身(Causal Intervention Warm-up, CIW),通过时序身份一致性和跨模态身份一致性进行序列级因果干预,抑制模态和运动引起的虚假相关性,保留身份相关语义;二是原型引导的不确定性精化(Prototype-Guided Uncertainty Refinement, PGUR),采用粗到细对齐策略缓解跨模态粒度差异,借助可靠的可见原型与不确定性感知监督重构红外特征聚类结构,从而提升伪标签可靠性并实现有效跨模态对齐。
链接: https://arxiv.org/abs/2604.15631
作者: Shuang Li,Jiaxu Leng,Changjiang Kuang,Mingpi Tan,Yu Yuan,Xinbo Gao
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submit to IEEE TIFS
Abstract:VVI-ReID is a critical technique for all-day surveillance, where temporal information provides additional cues beyond static images. However, existing approaches rely heavily on fully supervised learning with expensive cross-modality annotations, limiting scalability. To address this issue, we investigate Unsupervised Learning for VVI-ReID (USL-VVI-ReID), which learns identity-discriminative representations directly from unlabeled video tracklets. Directly extending image-based USL-VI-ReID methods to this setting with generic pretrained encoders leads to suboptimal performance. Such encoders suffer from weak identity discrimination and strong modality bias, resulting in severe intra-modality identity confusion and pronounced clustering granularity imbalance between visible and infrared modalities. These issues jointly degrade pseudo-label reliability and hinder effective cross-modality alignment. To address these challenges, we propose a Causal Bootstrapped Alignment (CBA) framework that explicitly exploits inherent video priors. First, we introduce Causal Intervention Warm-up (CIW), which performs sequence-level causal interventions by leveraging temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics, yielding cleaner representations for unsupervised clustering. Second, we propose Prototype-Guided Uncertainty Refinement (PGUR), which employs a coarse-to-fine alignment strategy to resolve cross-modality granularity mismatch, reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks demonstrate that CBA significantly outperforms existing USL-VI-ReID methods when extended to the USL-VVI-ReID setting.
[CV-78] AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM -Guided Execution
【速读】:该论文旨在解决语言对齐视觉基础模型(Language-aligned Vision Foundation Models, VFMs)在边缘设备上部署时面临的严格延迟和功耗约束问题。解决方案的关键在于提出AdaVFM框架,其核心创新是通过运行时自适应计算策略动态调整模型复杂度:一方面利用神经架构搜索(Neural Architecture Search, NAS)将轻量子网络嵌入到VFMs主干中,实现推理阶段的灵活执行;另一方面借助云端部署的多模态大语言模型(Multimodal Large Language Model, LLM)作为上下文感知代理,根据场景内容与任务复杂度实时控制模型行为。这种协同机制在保持高精度的同时显著提升了效率,实现了当前最优的准确性-效率权衡。
链接: https://arxiv.org/abs/2604.15622
作者: Yiwei Zhao,Yi Zheng,Huapeng Su,Jieyu Lin,Stefano Ambrogio,Cijo Jose,Michaël Ramamonjisoa,Patrick Labatut,Barbara De Salvo,Chiao Liu,Phillip B. Gibbons,Ziyun Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to 7.9% in acc@1 on IN1K and 5.2% mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.
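AdaVFM 的运行时自适应执行可以抽象为"按场景复杂度查表选择子网";下面代码中的子网配置与阈值规则均为示意性假设,仅用于说明这一控制结构:

```python
# 示意性草图:按复杂度分数选择 NAS 子网配置(数值均为假设)。
SUBNETS = [
    {"name": "tiny", "flops": 1.0, "max_complexity": 0.3},
    {"name": "small", "flops": 2.5, "max_complexity": 0.7},
    {"name": "full", "flops": 6.0, "max_complexity": 1.0},
]

def select_subnet(complexity):
    # 按算力从小到大遍历,返回第一个能胜任该复杂度的子网
    for cfg in SUBNETS:
        if complexity <= cfg["max_complexity"]:
            return cfg
    return SUBNETS[-1]

assert select_subnet(0.1)["name"] == "tiny"
assert select_subnet(0.9)["name"] == "full"
```

在论文的设计中,复杂度判断由云端多模态 LLM 代理给出,而非上面这种固定阈值。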
[CV-79] GaussianFlow SLAM: Monocular Gaussian Splatting SLAM Guided by GaussianFlow
【速读】:该论文旨在解决单目SLAM(Simultaneous Localization and Mapping)中因缺乏可靠几何线索而导致的地图重建与位姿估计易陷入局部最优、结构退化和精度不足的问题。其解决方案的关键在于提出GaussianFlow SLAM,通过引入光流(optical flow)作为几何感知的引导信号,约束高斯分布(Gaussian splatting)在图像平面上的投影运动(称为GaussianFlow)与实际光流一致,从而为场景结构优化和相机位姿估计提供一致性约束;同时,设计基于归一化误差的稀疏化与修剪模块,有效提升地图质量与位姿准确性。
链接: https://arxiv.org/abs/2604.15612
作者: Dong-Uk Seo,Jinwoo Jeon,Eungchang Mason Lee,Hyun Myung
机构: KAIST (Korea Advanced Institute of Science and Technology)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 7 tables, accepted to IEEE RA-L
Abstract:Gaussian splatting has recently gained traction as a compelling map representation for SLAM systems, enabling dense and photo-realistic scene modeling. However, its application to monocular SLAM remains challenging due to the lack of reliable geometric cues from monocular input. Without geometric supervision, mapping and tracking can fall into local minima, resulting in structural degeneracies and inaccuracies. To address this challenge, we propose GaussianFlow SLAM, a monocular 3DGS-SLAM that leverages optical flow as a geometry-aware cue to guide the optimization of both the scene structure and camera poses. By encouraging the projected motion of Gaussians, termed GaussianFlow, to align with the optical flow, our method introduces consistent structural cues to regularize both map reconstruction and pose estimation. Furthermore, we introduce normalized error-based densification and pruning modules to refine inactive and unstable Gaussians, thereby contributing to improved map quality and pose accuracy. Experiments conducted on public datasets demonstrate that our method achieves superior rendering quality and tracking accuracy compared with state-of-the-art algorithms. The source code is available at: this https URL.
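GaussianFlow 的核心约束是让高斯在像平面上的投影运动与光流一致;下面的草图用合成数据示意这一对齐损失(投影点与光流均为构造数据,损失的具体形式为本示例的假设):

```python
import numpy as np

rng = np.random.default_rng(0)
proj_t = rng.uniform(0, 100, size=(50, 2))    # 第 t 帧中高斯中心的投影
true_flow = np.array([2.0, -1.0])             # 合成的常量光流
proj_t1 = proj_t + true_flow                  # 第 t+1 帧的投影
flow_field = np.tile(true_flow, (50, 1))      # 在各投影点处采样的光流

def flow_loss(p0, p1, flow_at_p0):
    gaussian_flow = p1 - p0                   # 每个高斯的 "GaussianFlow"
    return float(np.mean(np.linalg.norm(gaussian_flow - flow_at_p0, axis=1)))

# 投影运动与光流完全一致时损失趋近于零,出现漂移则损失变大
assert flow_loss(proj_t, proj_t1, flow_field) < 1e-9
assert flow_loss(proj_t, proj_t1 + 0.5, flow_field) > 0.1
```

在实际系统中,该损失对高斯参数与相机位姿同时可导,从而同时正则化建图与跟踪。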
[CV-80] CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder
【速读】:该论文旨在解决医学影像中脑结构随时间演变的建模问题,以支持早期干预、预后评估和治疗规划。其核心挑战在于如何高效且准确地生成具有时序一致性的高质量脑部磁共振成像(MRI)扫描。解决方案的关键在于提出CLIMB框架——一种基于状态空间模型(State Space Model, SSM)的潜在扩散模型(Latent Diffusion Model, LDM),它通过引入多条件变量(如预测年龄、性别、疾病状态、遗传信息及脑结构体积)实现可控的纵向脑图像生成;同时采用高斯对齐自动编码器(Gaussian-aligned Autoencoder)提取符合先验分布的潜在表示,避免传统变分自动编码器中的采样噪声,从而在保持图像质量的同时显著降低计算复杂度。
链接: https://arxiv.org/abs/2604.15611
作者: Duy-Phuong Dao,Muhammad Taqiyuddin,Jahae Kim,Sang-Heon Lee,Hye-Won Jung,Jaehoo Choi,Hyung-Jeong Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures, 5 tables
Abstract:Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high quality brain magnetic resonance imaging scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB, Controllable Longitudinal brain Image generation via state space based latent diffusion model, an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages Mamba, a state-space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.
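CLIMB 的高斯对齐自动编码器要求确定性潜变量符合先验分布,而无需 VAE 式的采样;下面用"批统计矩匹配"作一种可能的实现草图(该矩匹配形式为本示例的假设,并非论文公式):

```python
import numpy as np

def gaussian_alignment_loss(z):
    # 将一批确定性潜变量的均值/方差向标准正态先验 N(0, I) 对齐
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return float((mu ** 2).mean() + ((var - 1.0) ** 2).mean())

rng = np.random.default_rng(0)
z_good = rng.normal(size=(4096, 16))              # 已接近 N(0, I) 的潜变量
z_bad = 3.0 + 2.0 * rng.normal(size=(4096, 16))   # 偏离先验的潜变量

# 对齐先验的潜变量损失显著更低
assert gaussian_alignment_loss(z_good) < gaussian_alignment_loss(z_bad)
```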
[CV-81] Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models DATE
【速读】:该论文旨在解决黑盒模型(black-box models)在测试时适应(Test-Time Adaptation, TTA)中的关键挑战,即在仅能通过API访问模型的情况下实现高效、稳定的性能提升。现有方法如后处理输出修正(post-hoc output refinement)适应能力有限,而零阶优化(Zeroth-Order Optimization, ZOO)虽可进行输入空间适应但存在查询成本高和无监督场景下优化困难的问题。解决方案的关键在于提出BETA框架,其核心是引入一个轻量级局部白盒引导模型(local white-box steering model),构建可微分的梯度路径;结合预测一致性正则化与面向提示学习(prompt learning-oriented)的过滤机制,实现无需额外API调用且延迟极低的稳定适应,从而在ImageNet-C等基准上显著优于现有白盒与灰盒方法,并在商用API上以250倍更低的查询成本达到ZOO相当性能。
链接: https://arxiv.org/abs/2604.15609
作者: Yunbei Zhang,Shuaicheng Niu,Chengyi Cai,Feng Liu,Jihun Hamm
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Third Workshop on Test-Time Updates (Oral)
Abstract:Test-Time Adaptation (TTA) for black-box models accessible only via APIs remains a largely unexplored challenge. Existing approaches such as post-hoc output refinement offer limited adaptive capacity, while Zeroth-Order Optimization (ZOO) enables input-space adaptation but faces high query costs and optimization challenges in the unsupervised TTA setting. We introduce BETA (Black-box Efficient Test-time Adaptation), a framework that addresses these limitations by employing a lightweight, local white-box steering model to create a tractable gradient pathway. Through a prediction harmonization technique combined with consistency regularization and prompt learning-oriented filtering, BETA enables stable adaptation with no additional API calls and negligible latency beyond standard inference. On ImageNet-C, BETA achieves a +7.1% accuracy gain on ViT-B/16 and +3.4% on CLIP, surpassing strong white-box and gray-box methods including TENT and TPT. On a commercial API, BETA achieves comparable performance to ZOO at 250x lower cost while maintaining real-time inference speed, establishing it as a practical and efficient solution for real-world black-box TTA.
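BETA 使用的预测一致性正则可以用两次增广视图输出之间的散度来示意;下面采用对称 KL 散度(散度的具体选择为本示例的假设,仅说明"一致性越高、正则损失越低"的思路):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sym_kl(p, q, eps=1e-12):
    # 对称 KL:KL(p||q) + KL(q||p),加 eps 避免 log(0)
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

logits_a = np.array([2.0, 0.5, -1.0])
logits_b = np.array([1.8, 0.6, -0.9])   # 相近视图:预测一致,散度小
logits_c = np.array([-1.0, 2.0, 0.5])   # 分歧视图:预测不一致,散度大

assert sym_kl(softmax(logits_a), softmax(logits_b)) < \
       sym_kl(softmax(logits_a), softmax(logits_c))
```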
[CV-82] Learning Affine-Equivariant Proximal Operators ICASSP2026
【速读】:该论文旨在解决现有学习型近邻网络(Learned Proximal Networks, LPNs)在处理具有特定结构约束的正则化问题时,缺乏对仿射变换(如平移和缩放)不变性的局限性。其核心问题是:如何设计可学习的近邻算子函数,使其在保持精确计算能力的同时,具备仿射等变性(affine equivariance),从而提升模型在分布外噪声和仿射扰动下的鲁棒性。解决方案的关键在于构建一种新型的仿射等变学习型近邻网络(Affine-Equivariant Learned Proximal Networks, AE-LPNs),通过神经网络参数化方式确保所学近邻算子严格满足仿射等变性质,同时保留精确逼近能力。这一方法显著增强了模型在真实场景中的泛化性能,尤其在分布外去噪任务中表现出更强的适应性和稳定性。
链接: https://arxiv.org/abs/2604.15556
作者: Oriel Savir,Zhenghan Fang,Jeremias Sulam
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures, Accepted at ICASSP 2026
Abstract:Proximal operators are fundamental across many applications in signal processing and machine learning, including solving ill-posed inverse problems. Recent work has introduced Learned Proximal Networks (LPNs), providing parametric functions that compute exact proximals for data-driven and potentially non-convex regularizers. However, in many settings it is important to include additional structure to these regularizers–and their corresponding proximals–such as shift and scale equivariance. In this work, we show how to obtain learned functions parametrized by neural networks that provably compute exact proximal operators while being equivariant to shifts and scaling, which we dub Affine-Equivariant Learned Proximal Networks (AE-LPNs). We demonstrate our results on synthetic, constructive examples, and then on real data via denoising in out-of-distribution settings. Our equivariant learned proximals enhance robustness to noise distributions and affine shifts far beyond training distributions, improving the practical utility of learned proximal operators.
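论文强调所学近邻算子对平移与缩放等变,即 prox(a·x + b) = a·prox(x) + b(a > 0)。下面用一个恰好满足该性质的简单算子(一维中值滤波)做数值验证;这只是对该性质本身的说明,并非 AE-LPN 的实现:

```python
import numpy as np

def median_filter(x, k=3):
    # 边缘复制填充后取滑动窗口中值;中值对单调仿射变换等变
    pad = np.pad(x, k // 2, mode="edge")
    return np.array([np.median(pad[i:i + k]) for i in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=50)
a, b = 2.5, -1.0            # a > 0 时仿射等变性成立

lhs = median_filter(a * x + b)
rhs = a * median_filter(x) + b
assert np.allclose(lhs, rhs)
```

AE-LPN 的贡献在于将这种性质作为可证明的结构约束内置到可学习的近邻网络中。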
[CV-83] CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification
【速读】:该论文旨在解决胸部X光(Chest X-ray, CXR)图像解读中因病种分布长尾特性(long-tailed distribution)和临床环境开放世界性质(open-world nature)导致的模型泛化能力不足问题。现有基准多基于单一机构的封闭类别集合,难以反映罕见疾病的实际发生率或新出现的病理表现。解决方案的关键在于构建一个大规模、多中心的CXR数据集(包含超过14.5万张图像),并引入由放射科医生标注的开发与测试集,以提升评估的临床可靠性;同时定义两类核心任务:一是对30个已知类别的鲁棒多标签分类,二是对6个未见罕见病类别的开放世界泛化能力评估。实验表明,视觉-语言基础模型在分布内和零样本场景下均取得改进,但在跨中心数据分布偏移下检测罕见病仍具挑战性。
链接: https://arxiv.org/abs/2604.15555
作者: Hexin Dong,Yi Lin,Pengyu Zhou,Fengnian Zhao,Alan Clint Legasto,Juno Cho,Dohui Kim,Justin Namuk Kim,Mingeon Kim,Sunwoo Kwak,Gabriel Moyà-Alcover,Ky Trung Nguyen,Thanh-Huy Nguyen,Ha-Hieu Pham,Huy-Hieu Pham,Huy Le Pham,Nikhileswara Rao Sulake,Aina Tur-Serrano,Ruichi Zhang,Ang Zu,Adam E. Flanders,Zhiyong Lu,Ronald M. Summers,Mingquan Lin,Hao Chen,Yuzhe Yang,George Shih,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院); Fuwai Hospital (阜外医院); National Center for Cardiovascular Diseases (国家心血管病中心); Chinese Academy of Medical Sciences (中国医学科学院); Peking Union Medical College (北京协和医学院); West China School of Medicine (华西医学中心); Sichuan University (四川大学); Chengdu Second People’s Hospital (成都第二人民医院); Korea Advanced Institute of Science and Technology (韩国科学技术院); Gwangju Institute of Science and Technology (光州科学技术院); Case Western Reserve University (凯斯西储大学); Cornell Tech (康奈尔科技学院); Universitat de les Illes Balears (巴利阿里群岛大学); VNU-HCM International University (胡志明市国家大学国际大学); Vietnam National University, Ho Chi Minh City (胡志明市国家大学); Carnegie Mellon University (卡内基梅隆大学); VinUniversity (VinUni); VinUni-Illinois Smart Health Center (VinUni-伊利诺伊智能健康中心); Hanoi (河内); Xiamen University (厦门大学); Thomas Jefferson University (杰斐逊大学); National Institutes of Health (美国国立卫生研究院); University of Minnesota Twin Cities (明尼苏达大学双城分校); The Hong Kong University of Science and Technology (香港科技大学); University of California, Los Angeles (加州大学洛杉矶分校); University of Michigan (密歇根大学); National Library of Medicine (美国国家医学图书馆); Clinical Center (临床中心); University of California, San Diego (加州大学圣地亚哥分校); University of Texas at Austin (德克萨斯大学奥斯汀分校); Massachusetts Institute of Technology (麻省理工学院); Stanford University (斯坦福大学); University of Oxford (牛津大学); University of Cambridge (剑桥大学); University of Edinburgh (爱丁堡大学); University of Tokyo (东京大学); Tsinghua University (清华大学); Peking University (北京大学); Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Fudan University (复旦大学); Nanjing University (南京大学); Beijing Institute of Technology (北京理工大学); Harbin Institute of Technology (哈尔滨工业大学); Chinese Academy of Sciences (中国科学院); Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); Institute of Computing Technology, Chinese Academy of Sciences (计算技术研究所,中国科学院); Tsinghua University (清华大学); University of Illinois (伊利诺伊大学); University of Southern California (南加州大学); 
University of Washington (华盛顿大学); University of Maryland (马里兰大学); University of Pennsylvania (宾夕法尼亚大学); University of Chicago (芝加哥大学); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); University of California, Berkeley (加州大学伯克利分校); University of California, San Francisco (加州大学旧金山分校); University of Toronto (多伦多大学); University of Montreal (蒙特利尔大学); McGill University (麦吉尔大学); ETH Zurich (苏黎世联邦理工学院); EPFL (洛桑联邦理工学院); TU Munich (慕尼黑工业大学); RWTH Aachen University (亚琛工业大学); KAIST (韩国科学技术院); POSTECH (浦项科技大学); NUS (新加坡国立大学); NTU (南洋理工大学); HKUST (香港科技大学); CUHK (香港中文大学); CUHK-Shenzhen (香港中文大学深圳校区); Tsinghua-Berkeley Shenzhen Institute (清华-伯克利深圳学院); MIT (麻省理工学院); CMU (卡内基梅隆大学); UC Berkeley (加州大学伯克利分校); UCL (伦敦大学学院); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 6 figures
Abstract:Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper summarizes the overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.
[CV-84] UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation
【速读】:该论文旨在解决高温中子辐照下三结构各向同性(Tristructural Isotropic, TRISO)包覆颗粒燃料在微观尺度上的形变与化学反应问题,传统人工分析方法存在效率低、主观性强等局限。解决方案的关键在于提出一种名为UA-Net的深度学习框架,其核心创新包括:采用多阶段预训练策略(从ImageNet通用图像特征到TRISO微纳图像细调),实现对五类典型区域的精准分割;集成元模型生成预测不确定性图,有效识别微小缺陷,从而提升检测灵敏度与可靠性。实验表明,该方法在测试集上达到95.5%的平均交并比(mIoU)和97.3%的平均精度(mP),且不确定性预测模型具备91.8%特异性和93.5%敏感性,显著优于人工评估方式。
链接: https://arxiv.org/abs/2604.15542
作者: Kyle Lucke,Zuzanna Krajewska-Travar,Shoukun Sun,Lu Cai,John D. Stempien,Min Xian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Tristructural isotropic (TRISO)-coated particle fuels undergo dimensional changes and chemical reactions during high-temperature neutron irradiation. Post-irradiation materialography helps understand processes that impact fuel performance, such as coating integrity and fission product retention. Conventionally, experts manually evaluate features in thousands of cross sections of sub-mm-sized samples, which is tedious and subjective. In this work, we propose UA-Net, a deep learning framework that segments five characteristic regions of TRISO fuel micrographs and generates an uncertainty map for predictions. The model uses a multi-stage pretraining strategy, starting with general image representations learned from ImageNet, followed by fine-tuning on TRISO micrographs from various irradiation experiments and AGR-5/6/7 particle cross sections. A meta-model for uncertainty prediction is integrated to identify small defects in TRISO images. UA-Net was evaluated on a test set of 102 images, achieving mean Intersection over Union (mIoU) and mean Precision (mP) of 95.5% and 97.3%, respectively. The meta-model achieved a specificity of 91.8% and sensitivity of 93.5%, demonstrating strong performance in detecting misclassifications. The model was also applied to new TRISO images for qualitative evaluation, showing high accuracy in extracting layer regions.
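摘要中以 mIoU(平均交并比)作为分割精度的主指标。下面给出该指标的一个极简计算示意(示例数组与类别数均为演示用假设,并非论文数据或实现):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """按类别计算 IoU 后取平均;pred/target 为同形状的整数标签图。"""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:              # 预测与标注中都不存在该类别时跳过
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0, 1], [1, 2, 2]])
target = np.array([[0, 0, 1], [1, 1, 2]])
print(round(mean_iou(pred, target, num_classes=5), 3))  # → 0.722
```

注意未出现的类别被跳过而不是按 IoU=0 计入,不同实现对这一点的约定可能不同。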
[CV-85] Frequency-Aware Flow Matching for High-Quality Image Generation CVPR2026
【速读】:该论文旨在解决流匹配模型(Flow Matching)在图像生成过程中对不同频率成分建模不均衡的问题,即模型在反向去噪过程中倾向于先生成低频成分(全局结构),而高频成分(细节纹理)则滞后出现,导致生成图像的细节质量不足。解决方案的关键在于提出频率感知的流匹配方法(Frequency-Aware Flow Matching, FreqFlow),其核心创新是引入时间依赖的自适应加权机制,并设计双分支架构:一个频率分支分别处理低频和高频信息以增强结构与细节建模能力,另一个空间分支在潜在空间中合成图像并受频率分支输出引导。通过显式融合频率信息,FreqFlow实现了全局一致性与局部细节的协同优化,显著提升了生成质量,在ImageNet-256类条件生成任务上取得FID=1.38的最新性能。
链接: https://arxiv.org/abs/2604.15521
作者: Sucheng Ren,Qihang Yu,Ju He,Xiaohui Shen,Alan Yuille,Liang-Chieh Chen
机构: Johns Hopkins University (约翰霍普金斯大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch’s output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at this https URL.
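摘要的出发点是把信号拆成低频(全局结构)与高频(细节)两部分分别处理。下面用 FFT 掩码演示这种分解的一个极简写法(截止频率为假设值,并非论文频率分支的实际条件化方式):

```python
import numpy as np

def split_frequencies(x, cutoff=0.1):
    """按频率半径把 2D 数组分解为低频 + 高频,两部分严格相加还原 x。"""
    h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]          # 纵向频率,单位:周期/像素
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = np.sqrt(fy**2 + fx**2) <= cutoff
    spec = np.fft.fft2(x)
    low = np.real(np.fft.ifft2(spec * low_mask))
    high = x - low                            # 高频定义为残差,保证精确分解
    return low, high

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
low, high = split_frequencies(img)
print(np.allclose(low + high, img))  # → True
```

对白噪声而言,低频分量只保留掩码覆盖的少量能量,因此其方差远小于原图。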
[CV-86] ProtoTTA: Prototype-Guided Test-Time Adaptation ICLR2026 DATE
【速读】:该论文旨在解决原型驱动的深度模型(prototypical models)在分布偏移(distribution shift)下鲁棒性不足的问题,尤其针对其在医疗等关键领域应用时对解释性与稳定性并重的需求。现有测试时自适应(Test-Time Adaptation, TTA)方法主要依赖模型输出熵最小化,但未充分挖掘原型(prototype)这一可解释表征的潜力。解决方案的关键在于提出ProtoTTA框架,通过优化中间原型相似度分布的熵来引导模型在分布偏移数据上产生更自信且语义明确的原型激活;同时引入几何滤波机制,基于原型重要性权重和模型置信度对更新样本进行约束,从而在提升鲁棒性的同时保持原型激活的语义一致性与人类对齐性。
链接: https://arxiv.org/abs/2604.15494
作者: Mohammad Mahdi Abootorabi,Parvin Mousavi,Purang Abolmaesumi,Evan Shelhamer
机构: University of British Columbia (不列颠哥伦比亚大学); Queen’s University (皇后大学); Vector Institute (向量研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026 Test-Time Updates (TTU) Workshop
Abstract:Deep networks that rely on prototypes-interpretable representations that can be related to the model input-have gained significant attention for balancing high accuracy with inherent interpretability, which makes them suitable for critical domains such as healthcare. However, these models are limited by their reliance on training data, which hampers their robustness to distribution shifts. While test-time adaptation (TTA) improves the robustness of deep networks by updating parameters and statistics, the prototypes of interpretable models have not been explored for this purpose. We introduce ProtoTTA, a general framework for prototypical models that leverages intermediate prototype signals rather than relying solely on model outputs. ProtoTTA minimizes the entropy of the prototype-similarity distribution to encourage more confident and prototype-specific activations on shifted data. To maintain stability, we employ geometric filtering to restrict updates to samples with reliable prototype activations, regularized by prototype-importance weights and model-confidence scores. Experiments across four prototypical backbones on four diverse benchmarks spanning fine-grained vision, histopathology, and NLP demonstrate that ProtoTTA improves robustness over standard output entropy minimization while restoring correct semantic focus in prototype activations. We also introduce novel interpretability metrics and a vision-language model (VLM) evaluation framework to explain TTA dynamics, confirming ProtoTTA restores human-aligned semantic focus and correlates reliably with VLM-rated reasoning quality. Code is available at: this https URL.
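ProtoTTA 的核心信号是原型相似度分布的熵,并用过滤机制只在激活可靠的样本上更新。下面是一个极简示意(相似度数值与置信阈值均为假设;真实方法还包含梯度更新、原型重要性权重等,此处只演示熵计算与样本筛选):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_for_adaptation(proto_sims, conf_threshold=0.5):
    """返回每个样本的原型分布熵,以及是否足够可靠、可用于自适应的掩码。

    proto_sims: (batch, num_prototypes) 的相似度分数。
    当最强原型激活足够自信时才保留该样本(阈值仅为演示值)。"""
    p = softmax(proto_sims)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    keep = p.max(axis=1) >= conf_threshold
    return entropy, keep

sims = np.array([[4.0, 0.0, 0.0],   # 自信样本:低熵,保留
                 [0.2, 0.1, 0.0]])  # 模糊样本:高熵,过滤掉
ent, keep = select_for_adaptation(sims)
print(keep.tolist())  # → [True, False]
```

测试时自适应即在保留样本上最小化该熵,使原型激活更集中。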
[CV-87] (1D) Ordered Tokens Enable Efficient Test-Time Search
【速读】:该论文旨在解决自回归(Autoregressive, AR)生成模型在测试时搜索(test-time search)过程中,因token结构设计不当而导致的生成可控性与推理效率受限的问题。其核心挑战在于如何通过测试时的候选生成评估与筛选机制(如验证器引导的搜索),实现对生成结果的有效引导和优化。解决方案的关键在于采用具有粗到细(coarse-to-fine)顺序结构的一维(1D)token化方式替代传统的二维(2D)网格结构,使得中间生成状态携带语义信息,从而增强验证器对候选序列的可靠评估能力,进而提升测试时搜索的效率与效果。实验表明,这种token结构不仅改善了AR模型的测试时缩放行为(test-time scaling),还支持无需训练即可实现文本到图像生成的纯测试时搜索方法,为AR模型的高效推理提供了新范式。
链接: https://arxiv.org/abs/2604.15453
作者: Zhitong Gao,Parham Rezaei,Ali Cy,Mingqiao Ye,Nataša Jovanović,Jesse Allardice,Afshin Dehghan,Amir Zamir,Roman Bachmann,Oğuzhan Fatih Kar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models. 
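上文摘要系统研究了 best-of-N 等经典测试时搜索算法与验证器的组合。best-of-N 的流程可以概括为"采样 N 个候选,由验证器打分取最优",下面用一个玩具生成器/验证器示意(两者均为假设的占位函数,真实场景中分别对应 AR 采样器与图文验证器):

```python
import random

def best_of_n(generate, verify, n=8, seed=0):
    """Best-of-N 搜索:采样 n 个候选,返回验证器得分最高者。"""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=verify)

# 玩具设定:"生成结果"是实数,验证器偏好接近 10 的值
generate = lambda rng: rng.uniform(0, 20)
verify = lambda x: -abs(x - 10)
best = best_of_n(generate, verify, n=64)
print(round(best, 3))
```

增大 N 即摘要所说的测试时扩展(test-time scaling);beam search 与 lookahead 则是在中间 token 状态上反复套用同样的"打分取优"思想。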
[CV-88] Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
【速读】:该论文旨在解决大规模视觉学习中训练成本过高的问题,传统知识蒸馏方法通常将信息从更强的教师模型压缩到较弱的学生模型以提升性能或减少参数量,而本文提出了一种全新的蒸馏策略:通过在早期训练阶段对较弱教师模型进行冻结,并仅在学生模型尚未达到教师水平时应用蒸馏,一旦学生模型超越教师性能即停止蒸馏。其解决方案的关键在于“阶段性蒸馏”机制——即利用教师模型指导学生模型在训练初期快速收敛至教师性能水平,从而显著缩短达到目标精度所需的训练轮次(epochs)或步数(steps),实现通用且高效的视觉学习加速。
链接: https://arxiv.org/abs/2604.15451
作者: Baiang Li,Wenhao Chai,Felix Heide
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7 figures
Abstract:Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead investigate distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns it off once the student reaches and surpasses teacher-level performance. For ImageNet and CIFAR classification, this strategy reaches target thresholds much earlier, with up to 4.8 times speedup measured by epochs. We confirm that the method generalizes to other tasks and report 1.7 times epoch speedup for object detection on the COCO dataset, and 2.5 times earlier target-FID crossing for diffusion generation on the CIFAR-10 dataset, measured in steps. These findings validate our method as a universal speedup mechanism for visual learning.
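论文的配方可以概括为:学生未达到冻结弱教师的水平时叠加蒸馏损失,超过后关闭蒸馏、只保留任务损失。下面是该调度逻辑的极简示意(alpha 权重与以精度为切换判据均为演示用简化,并非论文的具体实现):

```python
def training_loss(task_loss, distill_loss, student_acc, teacher_acc, alpha=0.5):
    """早期训练:学生精度低于教师时叠加蒸馏项;超过教师后只用任务损失。"""
    if student_acc < teacher_acc:
        return task_loss + alpha * distill_loss
    return task_loss

# 训练初期:学生弱于教师,蒸馏生效
print(training_loss(2.0, 1.0, student_acc=0.40, teacher_acc=0.60))  # → 2.5
# 学生超过教师后,蒸馏关闭
print(training_loss(0.8, 1.0, student_acc=0.65, teacher_acc=0.60))  # → 0.8
```

关键点在于教师始终冻结、仅在早期提供引导,因此额外开销只出现在训练前段。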
[CV-89] M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention ICME
【速读】:该论文旨在解决降水临近预报(rainfall nowcasting)中如何有效融合多源异构气象数据以提升预测精度和效率的问题。现有深度学习方法在利用多样化多媒体数据(如NEXRAD雷达图像与个人天气站(Personal Weather Station, PWS)时间序列)时存在局限性,难以实现时空对齐与特征协同。其解决方案的关键在于提出M3R架构——一种气象信息引导的多模态注意力机制模型,通过将PWS时间序列作为查询(query),动态聚焦于雷达图像中的关键降水区域,从而实现跨模态的信息选择性融合与精细化特征提取。该设计显著提升了降水检测能力与预测准确性,为基于多模态数据的短临预报建立了新基准。
链接: https://arxiv.org/abs/2604.15377
作者: Sanjeev Panta,Rhett M Morvant,Xu Yuan,Li Chen,Nian-Feng Tzeng
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted at IEEE International Conference on Multimedia and Expo (ICME) 2026
Abstract:Accurate and timely rainfall nowcasting is crucial for disaster mitigation and water resource management. Despite recent advances in deep learning, precipitation prediction remains challenging due to limitations in effectively leveraging diverse multimedia data sources. We introduce M3R, a Meteorology-informed MultiModal attention-based architecture for direct Rainfall prediction that synergistically combines visual NEXRAD radar imagery with numerical Personal Weather Station (PWS) measurements, using a comprehensive pipeline for temporal alignment of heterogeneous meteorological data. With specialized multimodal attention mechanisms, M3R novelly leverages weather station time series as queries to selectively attend to spatial radar features, enabling focused extraction of precipitation signatures. Experimental results for three spatial areas of 100 km * 100 km centered at NEXRAD radar stations demonstrate that M3R outperforms existing approaches, achieving substantial improvements in accuracy, efficiency, and precipitation detection capabilities. Our work establishes new benchmarks for multimedia-based precipitation nowcasting and provides practical tools for operational weather prediction systems. The source code is available at this https URL
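M3R 的核心是把气象站时间序列嵌入作为 query,对雷达空间特征做注意力聚合。下面用 numpy 写出缩放点积注意力的极简示意(各张量维度均为假设,仅说明"以 PWS 为 query、以雷达 patch 为 key/value"的计算形式):

```python
import numpy as np

def cross_attention(query, keys, values):
    """缩放点积注意力:单个 query 向量对一组 key/value 加权聚合。"""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)       # (num_patches,)
    w = np.exp(scores - scores.max())
    w = w / w.sum()                          # softmax 注意力权重
    return w @ values, w

rng = np.random.default_rng(0)
q = rng.standard_normal(16)          # PWS 时间序列嵌入(query)
K = rng.standard_normal((64, 16))    # 雷达 patch 嵌入(keys)
V = rng.standard_normal((64, 32))    # 雷达 patch 嵌入(values)
out, w = cross_attention(q, K, V)
print(out.shape, round(float(w.sum()), 6))  # → (32,) 1.0
```

权重 w 可视为气象站读数"关注"到的降水区域分布,与摘要中"聚焦提取降水特征"的描述对应。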
[CV-90] Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
【速读】:该论文旨在解决多步缩放(multi-step zoom-in)管道在GUI接地任务中因丢弃中间预测结果而导致的信息浪费问题。传统方法仅利用最终坐标映射结果,而忽略了中间步骤提供的几何一致性信息。其解决方案的关键在于提出并利用“缩放一致性”(zoom consistency),即模型第二步预测与裁剪中心之间的距离,这一量度无需校准即可在不同架构的视觉语言模型(VLMs)之间直接比较,并被证明是第一步空间误差的线性估计器(理想条件下)。实验表明,缩放一致性与预测正确性存在显著相关性(AUC = 0.60,Spearman ρ ≈ -0.11 至 -0.14,p < 10⁻⁶),并可用于路由策略,在不牺牲性能的前提下提升模型选择效率,实现接近最优专家头(oracle headroom)的16.5%。
链接: https://arxiv.org/abs/2604.15376
作者: Keon Kim,Krish Chelikavada
机构: Om Labs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-step zoom-in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model’s step-2 prediction and the crop center. Unlike log-probabilities or token-level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step-1 spatial error under idealized conditions (perfect step-2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = -0.14, p < 10^-6 for KV-Ground-8B; rho = -0.11, p = 0.0003 for Qwen3.5-27B). The correlation is small but consistent across models, application categories, and operating systems. As a proof-of-concept, we use zoom consistency to route between a specialist and generalist model, capturing 16.5% of the oracle headroom between them (+0.8%, McNemar p = 0.19). Code is available at this https URL.
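"缩放一致性"本身只是一个几何量:第二步预测映射回全图坐标后,与裁剪中心的距离。下面给出这一计算的极简示意(坐标与裁剪框数值均为演示用假设):

```python
def zoom_consistency(step2_xy_in_crop, crop_box):
    """第二步预测与裁剪中心的距离(全图像素单位)。

    crop_box = (x0, y0, x1, y1) 为第一步给出的裁剪框;距离越小,
    两次缩放步骤越一致,置信度越高。"""
    x0, y0, x1, y1 = crop_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # 把裁剪内坐标映射回全图坐标
    px, py = x0 + step2_xy_in_crop[0], y0 + step2_xy_in_crop[1]
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

# 200x200 的裁剪框,第二步预测落在中心右 30、下 40 像素处 → 距离 50
print(zoom_consistency((130, 140), (400, 400, 600, 600)))  # → 50.0
```

由于该量在所有模型共享的屏幕坐标系中定义,无需校准即可跨模型比较,这正是摘要强调的优势。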
[CV-91] Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration
【速读】:该论文旨在解决传统计算机断层扫描(CT)影像在区分肺部良恶性病灶以及提供可解释诊断信息方面的局限性。其解决方案的关键在于提出了一种双模态人工智能框架,通过融合CT影像与苏木精-伊红(HE)组织病理学图像的特征,并结合临床元数据,利用卷积神经网络提取多模态表征,再采用加权决策级融合机制进行肺癌亚型分类(包括腺癌、鳞状细胞癌、大细胞癌、小细胞肺癌及正常组织)。该方法不仅显著提升了诊断准确性(最高达0.87),还借助Grad-CAM++等可解释AI技术实现了对模型决策过程的可视化,增强了临床可信度,为精准肿瘤学中的辅助诊断系统提供了可行路径。
链接: https://arxiv.org/abs/2604.16104
作者: Baramee Sukumal,Aueaphum Aueawatthanaphisut
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 6 figures, 3 tables, 8 equations
Abstract:Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (HE) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.
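摘要中的加权决策级融合,即对两个分支的逐类概率做加权平均后取最大类。下面是一个极简示意(权重 0.4/0.6 与概率数值均为演示假设,并非论文给出的融合权重):

```python
import numpy as np

CLASSES = ["adeno", "squamous", "large_cell", "small_cell", "normal"]

def fuse(p_ct, p_hist, w_ct=0.4):
    """CT 分支与病理分支逐类概率的加权决策级融合,结果重新归一化。"""
    p = w_ct * np.asarray(p_ct) + (1 - w_ct) * np.asarray(p_hist)
    return p / p.sum()

p_ct   = [0.50, 0.20, 0.10, 0.10, 0.10]   # CT 分支倾向腺癌
p_hist = [0.10, 0.60, 0.10, 0.10, 0.10]   # 病理分支倾向鳞癌
fused = fuse(p_ct, p_hist)
print(CLASSES[int(np.argmax(fused))])  # → squamous
```

决策级融合的好处是两个分支可以独立训练和解释(例如各自生成 Grad-CAM++ 热图),只在输出层合并。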
[CV-92] Topology-Driven Fusion of nnU-Net and MedNeXt for Accurate Brain Tumor Segmentation on Sub-Saharan Africa Dataset
【速读】:该论文旨在解决低收入和中等收入国家(LMIC)中脑肿瘤自动分割的准确性问题,其挑战源于缺乏标准化的影像学协议、数据多样性、广泛使用低场强磁共振成像(MRI)设备以及医疗资源有限。解决方案的关键在于:首先利用BraTS 2025成人胶质瘤预处理任务数据对分割模型(如nnU-Net、MedNeXt及其组合)进行预训练,再在BraTS-Africa数据集上微调;其次引入一个拓扑精修模块(topology refinement module),以纠正因拓扑错误导致的预测形变问题,从而显著提升分割精度,特别是在SNFH、NETC和ET三种区域的Normalized Surface Distance(NSD)指标上分别达到0.810、0.829和0.895。
链接: https://arxiv.org/abs/2604.15964
作者: Prabin Bohara,Pralhad Kumar Shrestha,Arpan Rai,Usha Poudel Lamgade,Confidence Raymond,Dong Zhang,Aondona Lorumbu,Craig Jones,Mahesh Shakya,Bishesh Khanal,Pratibha Kulung
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Accurate automatic brain tumor segmentation in Low and Middle-Income (LMIC) countries is challenging due to the lack of defined national imaging protocols, diverse imaging data, extensive use of low-field Magnetic Resonance Imaging (MRI) scanners and limited health-care resources. As part of the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge, we applied topology refinement to the state-of-the-art segmentation models like nnU-Net, MedNeXt, and a combination of both. Since the BraTS-Africa dataset has low MRI image quality, we incorporated the BraTS 2025 challenge data of pre-treatment adult glioma (Task 1) to pre-train the segmentation model and use it to fine-tune on the BraTS-Africa dataset. We added an extra topology refinement module to address the issue of deformation in prediction that arose due to topological error. With the introduction of this module, we achieved a better Normalized Surface Distance (NSD) of 0.810, 0.829, and 0.895 on Surrounding Non-Enhancing FLAIR Hyperintensity (SNFH) , Non-Enhancing Tumor Core (NETC) and Enhancing tumor (ET).
[CV-93] CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark
【速读】:该论文旨在解决医学影像分割研究中因训练与测试集存在病例重叠而导致性能评估结果被严重高估的问题(即“slice-mixed”评估范式下的虚假性能提升)。其核心问题在于,现有方法常将同一患者的多个切片混合分配至训练、验证和测试集,从而导致模型在未真正独立的样本上表现优异,缺乏临床可推广性。解决方案的关键在于构建一个患者隔离(case-disjoint)的可复现基准——CTSCAN,该平台通过明确的病例级划分、确定性数据分割清单、弱监督控制机制及多种子协议扫描流程,实现了对真实泛化能力的测量。实验表明,在相同模型配置下,移除患者重复使用后,前景Dice系数下降69.00%,IoU下降76.52%,凸显了严格病例隔离评估的重要性。
链接: https://arxiv.org/abs/2604.15561
作者: Anton Ivchenko
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.
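CTSCAN 的核心修正是按病例(patient/case)而非按切片划分数据,保证训练集与测试集病人不相交。下面是病例级划分的极简示意(病例数、切片数与划分比例均为演示假设):

```python
import random
from collections import defaultdict

def case_disjoint_split(slices, test_frac=0.25, seed=0):
    """把 (case_id, slice_id) 记录划分为 train/test,且病例不跨集合。

    在病例层面抽样正是消除上文所述泄漏的关键;比例与种子仅为演示。"""
    by_case = defaultdict(list)
    for case_id, s in slices:
        by_case[case_id].append(s)
    cases = sorted(by_case)
    random.Random(seed).shuffle(cases)
    n_test = max(1, int(len(cases) * test_frac))
    test_cases = set(cases[:n_test])
    train = [(c, s) for c, s in slices if c not in test_cases]
    test = [(c, s) for c, s in slices if c in test_cases]
    return train, test

data = [(c, s) for c in range(8) for s in range(10)]   # 8 个病例,各 10 张切片
train, test = case_disjoint_split(data)
overlap = {c for c, _ in train} & {c for c, _ in test}
print(len(overlap))  # → 0
```

若改为对 slice 列表直接随机划分,同一病例的相邻切片就会同时出现在两侧,正是摘要量化的那种性能虚高来源。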
[CV-94] RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference CVPR2026
【速读】:该论文旨在解决医学图像去噪(Medical Image Denoising, MID)中因缺乏绝对干净图像作为监督信号而导致的“噪声参考”问题,该问题从根本上限制了去噪性能。现有方法如模拟监督判别学习(SimSDL)和模拟监督生成学习(SimSGL)将噪声参考视为干净目标,导致次优收敛或参考偏差学习;而自监督学习(SSL)则依赖于现实中很少满足的严格噪声假设。解决方案的关键在于提出 RelativeFlow,一种基于流匹配(flow matching)的框架,其核心创新是将绝对的噪声到干净图像映射分解为相对的、从较脏到较干净的映射,并通过两个关键组件实现:1)一致传输(Consistent Transport, CoT),约束相对流为统一绝对流的组成部分并逐步组合成完整映射;2)基于模拟的速度场(Simulation-based Velocity Field, SVF),利用模态特定的退化算子构建可学习速度场,支持不同医学成像模态。此方法有效利用异质噪声参考,使任意质量输入逐步逼近统一高质量目标,显著优于现有方法。
链接: https://arxiv.org/abs/2604.15459
作者: Yuxin Liu,Yiqing Dong,Wenxue Yu,Zhan Wu,Rongjun Ge,Yang Chen,Yuting He
机构: Southeast University (东南大学); Case Western Reserve University (凯斯西储大学); University of Arizona (亚利桑那大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.
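"相对映射"的含义可用线性插值式流匹配直观说明:把传输目标从不可得的干净图换成(仍带噪的)较干净参考。下面是构造训练对的极简示意(噪声水平与插值方式为演示假设,并非论文 CoT/SVF 的完整实现):

```python
import numpy as np

def linear_interp_pair(x_noisier, x_noisy, t):
    """相对映射的线性插值流匹配训练对:从较脏图像向较干净参考传输。

    返回 (插值点 x_t, 目标速度 v);线性路径的速度恒为终点减起点。"""
    x_t = (1 - t) * x_noisier + t * x_noisy
    v_target = x_noisy - x_noisier
    return x_t, v_target

rng = np.random.default_rng(0)
clean = rng.standard_normal((16, 16))                # 不可观测的真值
noisy = clean + 0.1 * rng.standard_normal((16, 16))  # 带噪参考
noisier = noisy + 0.3 * rng.standard_normal((16, 16))
x_t, v = linear_interp_pair(noisier, noisy, t=0.5)
# 从 x_t 沿 v 走完剩余半程,正好落在参考图上
print(np.allclose(x_t + 0.5 * v, noisy))  # → True
```

论文的 CoT 约束进一步要求这些相对流可以逐步组合成统一的"绝对"噪声到干净流,此处不再展开。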
人工智能
[AI-0] ASMR-Bench: Auditing for Sabotage in ML Research
【速读】:该论文旨在解决人工智能(AI)系统在自主开展机器学习(ML)研究时可能出现的“误导性偏差”问题,即当AI系统被用于科研任务时,其代码中可能隐藏细微但具有破坏性的修改(sabotage),这些修改不会改变论文描述的高阶方法论,却会导致实验结果发生显著偏移,且难以被常规审计手段识别。解决方案的关键在于构建了一个名为ASMR-Bench(Auditing for Sabotage in ML Research)的基准测试集,包含9个经过人工注入攻击的ML研究代码库及其对应的篡改版本,每个变体通过调整超参数、训练数据或评估代码等细节实现隐蔽性篡改。该基准可用于系统性评估审计者(包括大语言模型LLM和人类专家)识别此类恶意修改的能力,从而推动针对AI驱动科研过程的监控与审计技术发展。
链接: https://arxiv.org/abs/2604.16286
作者: Eric Gan,Aryan Bhatt,Buck Shlegeris,Julian Stastny,Vivek Hebbar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.
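摘要以 AUROC 衡量审计者区分"被篡改/未被篡改"代码库的能力。AUROC 等价于"随机正样本得分高于随机负样本的概率",下面是一个极简计算示意(分数仅为演示数据):

```python
def auroc(scores_pos, scores_neg):
    """AUROC:随机抽取的正样本(被篡改)得分高于负样本的概率,平局计 1/2。"""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# 玩具审计分数:越高表示越"可疑"
print(round(auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.1]), 3))  # → 0.889
```

按此定义,AUROC = 0.5 相当于随机猜测,文中 Gemini 3.1 Pro 的 0.77 表示有一定但不可靠的区分能力。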
[AI-1] Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing
【速读】:该论文旨在解决机器学习(Machine Learning, ML)结果在透明性和用户友好性方面难以解释的问题,这是可解释人工智能(Explainable Artificial Intelligence, XAI)领域长期面临的挑战。其解决方案的关键在于构建一个融合领域知识与ML结果的结构化知识图谱(Knowledge Graph, KG),并通过选择性检索机制从KG中提取相关三元组,再由大语言模型(Large Language Model, LLM)生成面向用户的自然语言解释。该方法实现了LLM对KG的动态访问,从而提升ML结果的可解释性,并在制造场景中通过实证验证了其有效性,支持更优的决策制定。
链接: https://arxiv.org/abs/2604.16280
作者: Thomas Bayer,Alexander Lohr,Sarah Weiß,Bernd Michelberger,Wolfram Höpken
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures, Submittet to conference
Abstract:Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.
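文中"选择性检索"的第一步是从知识图谱中取出与查询实体相关的三元组,再拼成提示交给 LLM。下面用内存中的三元组列表做一个 1 跳检索的极简示意(三元组内容为虚构的制造业示例,并非论文数据):

```python
def retrieve_triples(kg, entity):
    """选择性检索:返回主语或宾语包含查询实体的三元组(1 跳简化版)。"""
    return [t for t in kg if entity in (t[0], t[2])]

kg = [
    ("machine_7", "produced", "batch_42"),
    ("batch_42", "predicted_defect_rate", "0.08"),
    ("batch_42", "explained_by", "feature:spindle_temperature"),
    ("machine_3", "produced", "batch_17"),
]
context = retrieve_triples(kg, "batch_42")
prompt = "Explain the ML result using: " + "; ".join(" ".join(t) for t in context)
print(len(context))  # → 3
```

检索到的三元组(而非整张图)作为上下文送入 LLM,既控制提示长度,也把解释锚定在图谱事实上。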
[AI-2] Beyond Distribution Sharpening: The Importance of Task Rewards
【速读】:该论文旨在解决一个核心争议问题:基于任务奖励的强化学习(task-reward-based reinforcement learning, RL)是否真正赋予基础模型新的能力,还是仅仅通过分布锐化(distribution sharpening)来激发其已有潜力。为厘清这一分歧,作者提出了一种明确的对比实验设计,将RL作为工具分别用于实现两种范式——分布锐化与任务导向型学习。解决方案的关键在于从理论上证明了分布锐化的局限性:其最优解可能不稳定且不理想;同时,实验结果表明,在数学推理数据集上,分布锐化带来的性能提升有限,而引入任务奖励信号则能显著提升模型的鲁棒性和学习稳定性,从而验证了任务奖励机制对生成式AI(Generative AI)能力进化的实质性贡献。
链接: https://arxiv.org/abs/2604.16259
作者: Sarthak Mittal,Leo Gagnon,Guillaume Lajoie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating task-based reward signal can greatly help achieve robust performance improvements and stable learning.
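两种范式的差别可以在一个三选项的玩具策略上直观看出:分布锐化只会放大基座模型已有的众数,而任务奖励可以把概率质量搬到基座排名靠后的正确答案上。下面是一个极简对比示意(幂次锐化与指数奖励重加权均为概念性简化,并非论文的 RL 实现):

```python
import numpy as np

def sharpen(p, gamma=2.0):
    """分布锐化:p -> p^gamma 后归一化,只强化自身众数,不引入外部信号。"""
    q = np.asarray(p, float) ** gamma
    return q / q.sum()

def reward_reweight(p, rewards, beta=1.0):
    """任务奖励重加权:p -> p * exp(beta * r),概率质量可移向低概率的正确答案。"""
    q = np.asarray(p, float) * np.exp(beta * np.asarray(rewards, float))
    return q / q.sum()

p = [0.5, 0.3, 0.2]   # 基座策略对三个答案的概率
r = [0.0, 0.0, 1.0]   # 只有概率最低的答案是正确的
print(int(np.argmax(sharpen(p))), int(np.argmax(reward_reweight(p, r, beta=2.0))))
# → 0 2
```

这正对应摘要的论点:当正确答案不在基座分布的高概率区时,单纯锐化无法到达好的最优点。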
[AI-3] Characterising LLM -Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
【速读】:该论文旨在解决生成式 AI (Generative AI) 在本体工程中用于自动生成能力问题(Competency Questions, CQs)时,缺乏对不同大语言模型(Large Language Models, LLMs)所生成 CQs 的系统性评估与量化比较的问题。其核心挑战在于:尽管 LLM 能够规模化生成 CQs 以降低人工成本并提升参与度,但不同模型在可读性、与输入文本的相关性及结构复杂度等关键属性上表现差异显著,且这种差异受使用场景影响明显。解决方案的关键在于提出一套多维定量指标体系,通过在多个典型应用场景下对多种开放与闭源 LLM(如 KimiK2-1T、Llama3.1-8B、Gemini 2.5 Pro、GPT 4.1 等)生成的 CQs 进行系统分析,识别出各模型在不同任务情境下的生成特征与性能差异,从而为 LLM 在 CQ 自动化生成中的选型与优化提供实证依据。
链接: https://arxiv.org/abs/2604.16258
作者: Reham Alharbi,Valentina Tamma,Terry R. Payne,Jacopo de Berardinis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2507.02989
Abstract:Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.
[AI-4] Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
【速读】:该论文旨在解决低资源环境下,从长段音频-文本序列中学习语义一致的文档级跨模态表示问题,尤其在音频与文本维度严重失衡时如何有效对齐多模态信息并保持各自结构特性。解决方案的关键在于提出HILBERT框架,其核心创新包括:(1)基于冻结预训练语音和语言编码器提取片段级特征,并通过交叉注意力与自注意力池化生成模态特异性文档表示及联合交叉注意力嵌入;(2)设计一种互相对照(reciprocal dual contrastive)目标函数,同时对齐音频→联合和文本→联合表示,而非直接对比音频与文本,从而缓解模态不平衡带来的对齐偏差;(3)引入中心核对齐(Centered Kernel Alignment, CKA)损失和互信息平衡损失作为辅助正则项,分别维持模态与联合嵌入间的结构一致性并均衡双模态的信息流入,最终结合Mixture-of-Experts(MoE)分类器实现对异构标签任务的灵活适配。
链接: https://arxiv.org/abs/2604.16247
作者: Habibeh Naderi,Behrouz Haji Soleimani,Stan Matwin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.
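互相对照目标的要点是:不直接对比音频与文本,而是分别把两种模态对齐到联合嵌入。下面用 numpy 版 InfoNCE 写出该目标的极简示意(维度、温度与噪声幅度均为演示假设,并非论文实现):

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """按行归一化后的 InfoNCE:同下标的 (a_i, b_i) 为正样本对。"""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))

def reciprocal_dual_loss(audio, text, joint):
    """互相对照:audio→joint 与 text→joint 两项对齐损失的平均。"""
    return 0.5 * (info_nce(audio, joint) + info_nce(text, joint))

rng = np.random.default_rng(0)
joint = rng.standard_normal((8, 32))
aligned = reciprocal_dual_loss(joint + 0.01 * rng.standard_normal((8, 32)),
                               joint + 0.01 * rng.standard_normal((8, 32)), joint)
unaligned = reciprocal_dual_loss(rng.standard_normal((8, 32)),
                                 rng.standard_normal((8, 32)), joint)
print(bool(aligned < unaligned))  # 对齐的嵌入损失应更低
```

以联合嵌入为共同锚点,可以缓解摘要所说的音频/文本维度严重失衡带来的直接对比困难。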
[AI-5] Neuro-Symbolic ODE Discovery with Latent Grammar Flow
【Quick Read】: This paper targets the automatic discovery of ordinary differential equations (ODEs) from data, with particular attention to improving generalization while retaining interpretability. Black-box machine learning methods can fit data but do not express the system's dynamics explicitly, while symbolic approaches depend heavily on prior knowledge. The authors propose Latent Grammar Flow (LGF), a neuro-symbolic generative framework whose key idea is to represent ODEs as grammar-based structures in a discrete latent space, with a behavioural loss forcing semantically similar equations to lie close together; a discrete flow model then recursively samples candidate equations that best fit the observed data. Domain knowledge such as stability constraints can be injected as conditional predictors or embedded into the grammar rules, yielding an interpretable and broadly adaptable ODE-discovery mechanism.
Link: https://arxiv.org/abs/2604.16232
Authors: Karin Yu,Eleni Chatzi,Georgios Kissas
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Symbolic Computation (cs.SC)
Comments:
Abstract:Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and transferability beyond black-box models. We introduce Latent Grammar Flow (LGF), a neuro-symbolic generative framework for discovering ordinary differential equations from data. LGF embeds equations as grammar-based representations into a discrete latent space and forces semantically similar equations to be positioned closer together with a behavioural loss. Then, a discrete flow model guides the sampling process to recursively generate candidate equations that best fit the observed data. Domain knowledge and constraints, such as stability, can be either embedded into the rules or used as conditional predictors.
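To give a flavor of grammar-based equation representations, here is a toy sketch (our own illustration, not LGF itself) that samples ODE right-hand sides f(x, t) from a small context-free grammar; LGF additionally learns a latent space and a flow over such derivations:

```python
import math
import random

# Toy grammar over ODE right-hand sides dx/dt = f(x, t); rules are made up.
GRAMMAR = {
    "<expr>": [["<expr>", "+", "<expr>"],
               ["<expr>", "*", "<expr>"],
               ["<unary>", "(", "<expr>", ")"],
               ["<term>"]],
    "<unary>": [["sin"], ["cos"]],
    "<term>": [["x"], ["t"], ["0.5"], ["2.0"]],
}

def sample(symbol="<expr>", depth=0, max_depth=3, rng=random):
    """Recursively expand a nonterminal; force termination past max_depth."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token, emitted as-is
    rules = [["<term>"]] if (symbol == "<expr>" and depth >= max_depth) \
        else GRAMMAR[symbol]
    return "".join(sample(s, depth + 1, max_depth, rng)
                   for s in rng.choice(rules))

expr = sample(rng=random.Random(0))
value = eval(expr, {"sin": math.sin, "cos": math.cos, "x": 0.5, "t": 1.0})
```

Every string derived this way is a syntactically valid expression by construction, which is the practical appeal of grammar-constrained equation discovery.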
[AI-6] Synthetic data in cryptocurrencies using generative models
【Quick Read】: This paper addresses the obstacles that privacy risks and access restrictions pose to the use of financial data in the digital financial ecosystem, particularly for the analysis of cryptocurrency price time series. The key to the solution is a deep learning approach based on Conditional Generative Adversarial Networks (CGANs), combining an LSTM-type recurrent generator with a multilayer perceptron (MLP) discriminator to produce statistically consistent synthetic data. The generated series preserve market trends and dynamics, offering a low-cost, efficient data substitute for scenarios such as market behavior analysis and anomaly detection.
Link: https://arxiv.org/abs/2604.16182
Authors: André Saimon S. Sousa,Otto Pires,Frank Acasiete,Oscar M. Granados,Valéria Loureiro da Silva,Hugo Saba
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especially in the financial context, can lead to privacy risks and access restrictions, affecting institutions, research, and modeling processes. Although not all financial datasets present such limitations, this work proposes the use of deep learning techniques for generating synthetic data applied to cryptocurrency price time series. The approach is based on Conditional Generative Adversarial Networks (CGANs), combining an LSTM-type recurrent generator and an MLP discriminator to produce statistically consistent synthetic data. The experiments consider different crypto-assets and demonstrate that the model is capable of reproducing relevant temporal patterns, preserving market trends and dynamics. The generation of synthetic series through GANs is an efficient alternative for simulating financial data, showing potential for applications such as market behavior analysis and anomaly detection, with lower computational cost compared to more complex generative approaches.
[AI-7] Training Time Prediction for Mixed Precision-based Distributed Training
【Quick Read】: This paper tackles the large errors in distributed deep-learning training-time prediction caused by differences in floating-point precision settings. Existing predictors rely on static model computation graphs that cannot capture dynamic precision variations such as mixed precision, leading to errors of up to 147.85% in mean absolute percentage error (MAPE). The key to the solution is a precision-aware distributed training-time predictor that adapts to diverse precision configurations, including mixed precision, reducing MAPE to 9.8% and achieving robust, accurate prediction across precision settings.
Link: https://arxiv.org/abs/2604.16145
Authors: Minchul Kang,Changyong Shin,Jinwoo Jeong,Hyunho Lee,Younghun Go,Gyeongmin Kim,Gyeongsik Yang,Chuck Yoo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:
Abstract:Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, leading to training time variations of ~2.4x over its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. According to our experiments, training time prediction without considering precision results in significant prediction errors - reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.
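For reference, the MAPE figure quoted above follows the standard definition; a small sketch with made-up per-epoch training times (seconds):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(t - p) / abs(t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

# predicted vs. measured training times (illustrative numbers only)
error = mape([120.0, 300.0], [132.0, 270.0])  # (10% + 10%) / 2 = 10.0
```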
[AI-8] SCRIPT: Implementing an Intelligent Tutoring System for Programming in a German University Context
【Quick Read】: This paper addresses the shortcomings of current intelligent tutoring systems (ITSs) for programming education: limited support for the Python language, a focus on introductory programming only, and insufficient use of recent advances in generative models, while also having to operate within Germany's strict regulatory environment, including the EU data protection regulation (GDPR), the European AI Act, and the ethical framework of the German Research Foundation. The key to the solution is a highly adaptable ITS platform with flexible interfaces for plugging in hint mechanisms (e.g., via large language models), serving both as a teaching and a research platform and delivering individualized programming guidance while ensuring compliance and safety.
Link: https://arxiv.org/abs/2604.16117
Authors: Alina Deriyeva,Jesper Dannath,Benjamin Paassen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: In: Cristea, A.I., Walker, E., Lu, Y., Santos, O.C., Isotani, S. (eds) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED. AIED 2025. Communications in Computer and Information Science, vol 2590. Springer, Cham
Abstract:Practice and extensive exercises are essential in programming education. Intelligent tutoring systems (ITSs) are a viable option to provide individualized hints and advice to programming students even when human tutors are not available. However, prior ITS for programming rarely support the Python programming language, mostly focus on introductory programming, and rarely take recent developments in generative models into account. We aim to establish a novel ITS for Python programming that is highly adaptable, serves both as a teaching and research platform, provides interfaces to plug in hint mechanisms (e.g.\ via large language models), and works inside the particularly challenging regulatory environment of Germany, that is, conforming to the European data protection regulation, the European AI act, and ethical framework of the German Research Foundation. In this paper, we present the description of the current state of the ITS along with future development directions, as well as discuss the challenges and opportunities for improving the system.
[AI-9] The Relic Condition: When Published Scholarship Becomes Material for Its Own Replacement
【Quick Read】: This paper asks whether the stable, reproducible scholarly reasoning that human academics exhibit in the humanities and social sciences can be turned into deployable generative AI systems, and whether those systems reach expert-level quality on core academic tasks. The key to the solution is an eight-layer extraction pipeline and a nine-module skill architecture that distill structured reasoning constraints from a scholar's published corpus via local, closed-corpus analysis and inject them into a large language model as inference-time constraints, producing expert-quality scholarly output. The resulting "scholar-bots" received expert-assessed endorsement across doctoral supervision, peer review, lecturing, and panel-style academic exchange, performing at or above Senior Lecturer level under multi-turn debate conditions, which reveals how replaceable knowledge-production systems already are at today's modest technical threshold.
Link: https://arxiv.org/abs/2604.16116
Authors: Lin Deng,Chang-bo Liu
Affiliations: Unknown
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:We extracted the scholarly reasoning systems of two internationally prominent humanities and social science scholars from their published corpora alone, converted those systems into structured inference-time constraints for a large language model, and tested whether the resulting scholar-bots could perform core academic functions at expert-assessed quality. The distillation pipeline used an eight-layer extraction method and a nine-module skill architecture grounded in local, closed-corpus analysis. The scholar-bots were then deployed across doctoral supervision, peer review, lecturing and panel-style academic exchange. Expert assessment involved three senior academics producing reports and appointment-level syntheses. Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining, appointment-level recommendations placed both bots at or above Senior Lecturer level in the Australian university system, and recovered panel scores placed Scholar A between 7.9 and 8.9/10 and Scholar B between 8.5 and 8.9/10 under multi-turn debate conditions. A research-degree-student survey showed high performance ratings across information reliability, theoretical depth and logical rigor, with pronounced ceiling effects on a 7-point scale, despite all participants already being frontier-model users. We term this the Relic condition: when publication systems make stable reasoning architectures legible, extractable and cheaply deployable, the public record of intellectual labor becomes raw material for its own functional replacement. Because the technical threshold for this transition is already crossed at modest engineering effort, we argue that the window for protective frameworks covering disclosure, consent, compensation and deployment restriction is the present, while deployment remains optional rather than infrastructural.
[AI-10] Reckoning with the Political Economy of AI: Avoiding Decoys in Pursuit of Accountability
【Quick Read】: This paper examines how mechanisms that appear to promote fairness and accountability in AI development in fact act as "decoys" that mask the deeper political economy of AI, obstructing critical understanding of its power relations and resource allocation. The key to the solution is twofold: first, identifying how these decoys divert attention and create an illusion of accountability; second, confronting directly the material political economy underpinning the Project of AI and exposing how its networks of power are constructed, thereby prompting deeper reflection on how technology is embedded in social structures and enabling a more just technological future.
Link: https://arxiv.org/abs/2604.16106
Authors: Janet Vertesi,danah boyd,Alex Taylor,Benjamin Shestakofsky
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To be presented at ACM FAccT, Montréal, Canada, June 25 to June 28, 2026
Abstract:The Project of AI is a world-building endeavor, wherein those who fund and develop AI systems both operate through and seek to sustain networks of power and wealth. As they expand their access to resources and configure our sociotechnical conditions, they benefit from the ways in which a suite of decoys animate scholars, critics, policymakers, journalists, and the public into co-constructing industry-empowering AI futures. Regardless of who constructs or nurtures them, these decoys often create the illusion of accountability while both masking the emerging political economies that the Project of AI has set into motion, and also contributing to the network-making power that is at the heart of the Project’s extraction and exploitation. Drawing on literature at the intersection of communication, science and technology studies, and economic sociology, we examine how the Project of AI is constructed. We then explore five decoys that seemingly critique - but in actuality co-constitute - AI’s emergent power relations and material political economy. We argue that advancing meaningful fairness or accountability in AI requires: 1) recognizing when and how decoys serve as a distraction, and 2) grappling directly with the material political economy of the Project of AI. Doing so will enable us to attend to the networks of power that make ‘AI’ possible, spurring new visions for how to realize a more just technologically entangled world.
[AI-11] Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
【Quick Read】: This paper addresses unfair sampling in federated learning (FL) when device availability is correlated with data distribution: under Probabilistic Synchronous Parallel (PSP), highly available devices dominate training while rarely available devices are persistently left out, so data from certain classes or groups is chronically under-sampled, hurting the efficiency and effectiveness of learning key features. The key to the solution is Availability-Weighted PSP (AW-PSP), which dynamically adjusts node sampling probabilities using real-time availability predictions, historical behavior, and failure-correlation metrics; a Markov-based availability predictor distinguishes transient from chronic failures, and a Distributed Hash Table (DHT) layer decentralizes metadata management (latency, freshness, and utility scores), improving robustness, label coverage, and fairness variance in heterogeneous, failure-prone environments.
Link: https://arxiv.org/abs/2604.16090
Authors: Stefan Behfar,Richard Mortier
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Probabilistic Synchronous Parallel (PSP) is a technique in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes per round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes device behavior is static and different devices are independent. This can lead to unfair distributed synchronization, due to highly available nodes dominating training while those that are often unavailable rarely participate and so their data may be missed. If both data distribution and node availability are simultaneously correlated with the device, then both PSP and standard FL algorithms will suffer from persistent under-representation of certain classes or groups resulting in inefficient or ineffective learning of certain features. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses the issue of co-correlation of unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient vs. chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores. We implement AW-PSP and trace-driven evaluation shows that it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP. AW-PSP thus provides an availability-aware, and fairness-conscious node sampling protocol for FL deployments that will scale to large numbers of nodes even in heterogeneous and failure-prone environments.
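As a sketch of the Markov-based availability idea (our illustration; AW-PSP's actual predictor and weighting are more elaborate), a two-state chain can be fit from a device's up/down history, and its predicted availability inverted to upweight rarely available nodes:

```python
def fit_up_probs(history):
    """Laplace-smoothed P(up next round | current state) for states 0/1."""
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(history, history[1:]):
        counts[(a, b)] += 1
    return {a: (counts[(a, 1)] + 1) / (counts[(a, 0)] + counts[(a, 1)] + 2)
            for a in (0, 1)}

def sampling_weight(history):
    """Inverse predicted availability: rarely-up devices get larger weight."""
    p_up = fit_up_probs(history)[history[-1]]
    return 1.0 / max(p_up, 1e-6)

w_stable = sampling_weight([1, 1, 1, 1, 0, 1, 1, 1])  # mostly available
w_flaky = sampling_weight([0, 0, 1, 0, 0, 0])         # mostly unavailable
```

The flaky device receives a larger sampling weight, counteracting the persistent under-representation described above.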
[AI-12] Unveiling Stochasticity: Universal Multi-modal Probabilistic Modeling for Traffic Forecasting
【Quick Read】: This paper addresses the inadequate modeling of uncertainty in traffic forecasting, where existing work focuses on deterministic prediction and largely ignores the stochasticity of traffic dynamics. The key to the solution is a simple yet universal approach: replacing only the final output layer of an existing model with a novel Gaussian Mixture Model (GMM) layer turns a deterministic model into a probabilistic predictor. No changes to the training pipeline are required; the model is trained with only the Negative Log-Likelihood (NLL) loss, without auxiliary terms or extra regularization, preserving the original model's deterministic performance while capturing the distributional characteristics and uncertainty of traffic data.
Link: https://arxiv.org/abs/2604.16084
Authors: Weijiang Xiong,Robert Fonod,Nikolas Geroliminis
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Traffic forecasting is a challenging spatio-temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited considerations on the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log-Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real-world dense urban traffic network is presented to examine the impact of data quality on uncertainty quantification and to show the robustness of our approach under imperfect data conditions. Code available at this https URL
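A minimal sketch of the idea (ours, not the paper's code): the GMM layer outputs mixture weights, means, and standard deviations per prediction, and training minimizes the negative log-likelihood of the observed value:

```python
import math

def gmm_nll(y, weights, means, stds):
    """NLL of scalar y under a K-component univariate Gaussian mixture."""
    likelihood = sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds))
    return -math.log(likelihood)

# bimodal speed forecast: free-flow (~60 km/h) vs. congested (~20 km/h);
# the weights/means/stds here are made-up illustration values
loss = gmm_nll(58.0, weights=[0.7, 0.3], means=[60.0, 20.0], stds=[5.0, 8.0])
```

The mixture mean sum(w_k * m_k) can still serve as a deterministic point forecast, which is how such a layer preserves deterministic performance.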
[AI-13] Prototype-Grounded Concept Models for Verifiable Concept Alignment
【Quick Read】: This paper addresses a gap in Concept Bottleneck Models (CBMs) for deep learning: there is no mechanism to verify whether learned concepts align with human intent, which undermines interpretability. The key to the solution is Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes, i.e., image parts that serve as explicit evidence for the concepts, enabling direct inspection of concept semantics and supporting targeted human intervention at the prototype level to correct semantic misalignments. Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.
Link: https://arxiv.org/abs/2604.16076
Authors: Stefano Colamonaco,David Debot,Pietro Barbiero,Giuseppe Marra
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human’s intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.
[AI-14] AST: Adaptive, Seamless and Training-Free Precise Speech Editing
【Quick Read】: This paper addresses two core problems in text-based speech editing: existing methods rely on task-specific training, incurring high data costs and poor temporal fidelity in unedited regions, while adapting text-to-speech (TTS) models forces a trade-off between editing quality and consistency. The key to the proposed AST framework, an adaptive, seamless, and training-free approach to precise speech editing, lies in: (1) Latent Recomposition, which selectively stitches preserved source segments with newly synthesized target segments for precise edits; (2) Adaptive Weak Fact Guidance (AWFG), which dynamically modulates a mel-space guidance signal to enforce structural constraints at edit boundaries without disrupting the generative manifold, avoiding artifacts; and (3) the LibriSpeech-Edit dataset together with a Word-level Dynamic Time Warping (WDTW) metric for evaluating temporal consistency in unedited regions. Experiments show that AST markedly improves editing precision and temporal fidelity over state-of-the-art methods without extra training cost.
Link: https://arxiv.org/abs/2604.16056
Authors: Sihan Lv,Yechen Jin,Zhen Li,Jintao Chen,Jinshan Zhang,Ying Li,Jianwei Yin,Meng Xi
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:
Abstract:Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consistency in unedited regions, we propose Word-level Dynamic Time Warping (WDTW). Extensive experiments demonstrate that AST resolves the controllability-quality trade-off without extra training. Compared to the previous most temporally consistent baseline, AST improves consistency while reducing Word Error Rate by nearly 70%. Moreover, applying AST to a foundation TTS model reduces WDTW by 27%, achieving state-of-the-art speaker preservation and temporal fidelity.
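The WDTW metric builds on classic dynamic time warping; a generic DTW sketch (our simplification in plain Python; the paper's metric operates at word level over timestamps) is:

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    D = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j],      # step in a only
                D[i][j - 1],      # step in b only
                D[i - 1][j - 1])  # match both
    return D[n][m]
```

Aligning word-level timings of the edited utterance against the source this way yields a small distance exactly when unedited regions kept their original timing.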
[AI-15] Safe Deep Reinforcement Learning for Building Heating Control and Demand-side Flexibility
【Quick Read】: This paper addresses how building space-heating systems can provide demand-side flexibility for power grids with a high share of intermittent renewables, improving grid stability and energy efficiency. The core challenge is responding to flexibility requests from power system operators while maintaining occupant comfort and optimizing energy use and operating cost. The key to the solution is a reinforcement-learning control framework based on Deep Deterministic Policy Gradient (DDPG), augmented with a real-time adaptive safety filter that guarantees strict compliance with flexibility constraints while the controller learns an optimal heating strategy. The result is substantial energy and cost savings (up to 50% versus a rule-based controller) and full compliance with dispatch requests, with only a slight increase in comfort violations.
Link: https://arxiv.org/abs/2604.16033
Authors: Colin Jüni,Mina Montazeri,Yi Guo,Federica Bellizio,Giovanni Sansavini,Philipp Heer
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety-filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy and cost efficiency – achieving up to 50% savings compared to a rule-based controller – while outperforming a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.
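The safety filter's core operation can be pictured as projecting the learned action onto the feasible set. A deliberately simplified sketch (ours; the paper's filter adapts these bounds in real time, and the names and numbers below are illustrative):

```python
def safety_filter(action_kw, power_min_kw, power_max_kw, flex_cap_kw=None):
    """Clamp the RL heating-power action to actuator limits and, during a
    flexibility event, to the cap requested by the system operator."""
    upper = power_max_kw if flex_cap_kw is None else min(power_max_kw, flex_cap_kw)
    return min(max(action_kw, power_min_kw), upper)

# during a flexibility request the operator caps heating power at 5 kW
safe = safety_filter(8.0, 0.0, 10.0, flex_cap_kw=5.0)  # -> 5.0
```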
[AI-16] Neurosymbolic Repo-level Code Localization
【Quick Read】: This paper addresses the "keyword shortcut" problem in code localization: existing models over-rely on superficial lexical cues such as file paths and function names rather than performing genuine structural reasoning. The authors formalize Keyword-Agnostic Logical Code Localization (KA-LCL) and build the KA-LogicQuery diagnostic benchmark, which forces structural reasoning without any naming hints. The key to the solution is the LogicLoc framework, which combines a large language model with the deterministic logical reasoning of Datalog: program facts are extracted from the codebase, the LLM synthesizes Datalog programs, and parser-gated validation plus mutation-based intermediate-rule diagnostic feedback ensure correctness and efficiency; a high-performance Datalog inference engine then executes the programs, achieving accurate localization in a fully automated, closed-loop workflow. Experiments show LogicLoc significantly outperforms state-of-the-art methods on KA-LogicQuery with lower token consumption and faster execution, because structural traversal is offloaded to a deterministic engine, greatly reducing iterative LLM inference overhead.
Link: https://arxiv.org/abs/2604.16021
Authors: Xiufeng Xu,Xiufeng Wu,Zejun Zhang,Yi Li
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Code localization is a cornerstone of autonomous software engineering. Recent advancements have achieved impressive performance on real-world issue benchmarks. However, we identify a critical yet overlooked bias: these benchmarks are saturated with keyword references (e.g. file paths, function names), encouraging models to rely on superficial lexical matching rather than genuine structural reasoning. We term this phenomenon the Keyword Shortcut. To address this, we formalize the challenge of Keyword-Agnostic Logical Code Localization (KA-LCL) and introduce KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without any naming hints. Our evaluation reveals a catastrophic performance drop of state-of-the-art approaches on KA-LogicQuery, exposing their lack of deterministic reasoning capabilities. We propose LogicLoc, a novel agentic framework that combines large language models with the rigorous logical reasoning of Datalog for precise localization. LogicLoc extracts program facts from the codebase and leverages an LLM to synthesize Datalog programs, with parser-gated validation and mutation-based intermediate-rule diagnostic feedback to ensure correctness and efficiency. The validated programs are executed by a high-performance inference engine, enabling accurate and verifiable localization in a fully automated, closed-loop workflow. Experimental results demonstrate that LogicLoc significantly outperforms SOTA methods on KA-LogicQuery while maintaining competitive performance on popular issue-driven benchmarks. Notably, LogicLoc attains superior performance with significantly lower token consumption and faster execution by offloading structural traversal to a deterministic engine, reducing the overhead of iterative LLM inference.
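As a flavor of the Datalog side (a hypothetical mini-example of ours, not one of LogicLoc's synthesized programs), naive bottom-up evaluation of a transitive call-reachability rule over extracted program facts looks like this:

```python
def reachable(calls):
    """Fixpoint evaluation of:
         reach(X, Y) :- calls(X, Y).
         reach(X, Z) :- calls(X, Y), reach(Y, Z)."""
    reach = set(calls)
    changed = True
    while changed:
        changed = False
        for x, y in calls:
            for y2, z in list(reach):
                if y2 == y and (x, z) not in reach:
                    reach.add((x, z))
                    changed = True
    return reach

# made-up call-graph facts extracted from a codebase
facts = {("main", "parse"), ("parse", "lex"), ("lex", "read_token")}
closure = reachable(facts)
```

Unlike an LLM asked to "walk" the call graph turn by turn, this evaluation is deterministic and terminates at a fixpoint, which is the division of labor LogicLoc exploits.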
[AI-17] MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
【Quick Read】: This paper addresses the under-evaluation of metacognition in current AI benchmarks, i.e., a model's ability to monitor and regulate its own reasoning under uncertainty. The key to the solution is MEDLEY-BENCH, a behavioural metacognition benchmark that separates independent reasoning, private self-revision, and socially influenced revision, evaluated under genuine inter-model disagreement. Two complementary scores, the Medley Metacognition Score (MMS) and the Medley Ability Score (MAS), quantify dimensions such as reflective updating, social robustness, and epistemic articulation. The results reveal a dissociation between model scale and metacognitive control, positioning the benchmark as an actionable tool for measuring belief revision under social pressure.
Link: https://arxiv.org/abs/2604.16009
Authors: Farhad Abtahi,Abdolamir Karbalaie,Eduardo Illueca-Fernandez,Fernando Seoane
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Metacognition, the ability to monitor and regulate one’s own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.
[AI-18] ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
【Quick Read】: This paper addresses the sharp degradation of structural reasoning in multimodal large language models (MLLMs) on complex topologies such as branching paths, converging flows, and cyclic dependencies, which is especially visible on real scientific diagrams like chemical reaction schemes. Existing benchmarks fail to probe this deficit because they emphasize semantic comprehension over structural reasoning. The key contribution is ReactBench, a benchmark of 1,618 expert-annotated QA pairs across four hierarchical task dimensions, designed to evaluate holistic reasoning over complex topologies. Experiments across 17 MLLMs show a performance gap exceeding 30% between anchor-based tasks and holistic structural-reasoning tasks, and controlled ablations confirm the bottleneck lies in the reasoning rather than the perception component, exposing a fundamental deficit in structural understanding and pointing the way for advances in visual reasoning.
Link: https://arxiv.org/abs/2604.15994
Authors: Qiang Xu,Shengyuan Bai,Yu Wang,He Cao,Leqing Chen,Yuanyuan Liu,Bin Feng,Zijing Liu,Yu Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.
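The "counting endpoints" task mentioned above is trivial once the diagram is available as an explicit graph, which underlines that the failure lies in reading structure from pixels rather than in the computation itself. A sketch with made-up edges:

```python
from collections import Counter

def count_endpoints(edges):
    """Endpoints of a diagram graph = nodes with exactly one neighbor."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(1 for d in degree.values() if d == 1)

linear = [("A", "B"), ("B", "C")]       # A - B - C
branched = linear + [("B", "D")]        # extra branch at B
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
```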
[AI-19] Integrating Graphs, Large Language Models and Agents: Reasoning and Retrieval
【Quick Read】: This paper addresses the lack of clarity around design choices when integrating graph-structured representations into large language models (LLMs) in generative AI: when, why, and how graphs should be combined with LLMs to improve reasoning, retrieval, generation, and recommendation. The key to the solution is a systematic categorization of existing methods by task purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategy (prompting, augmentation, training, or agent-based use), mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments. This clarifies each technique's strengths, limitations, and best-fit scenarios, giving researchers a practical guide for choosing the most suitable graph-LLM approach based on task requirements, data characteristics, and reasoning complexity.
Link: https://arxiv.org/abs/2604.15951
Authors: Hamed Jelodar,Samita Bai,Mohammad Meymani,Parisa Hamedi,Roozbeh Razavi-Far,Ali Ghorbani
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs. We categorize existing methods based on their purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategies (prompting, augmentation, training, or agent-based use). By mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments, we highlight the strengths, limitations, and best-fit scenarios for each technique. This survey aims to offer researchers a practical guide for selecting the most suitable graph-LLM approach depending on task requirements, data characteristics, and reasoning complexity.
[AI-20] Towards Rigorous Explainability by Feature Attribution
【Quick Read】: This paper addresses the lack of rigor in the non-symbolic methods that dominate explainable AI (XAI), which can mislead human decision-makers, a problem that is especially serious in high-stakes applications. The key to the solution is advocating rigorous symbolic methods as an alternative for assigning relative feature importance, thereby improving the trustworthiness and accuracy of model explanations.
Link: https://arxiv.org/abs/2604.15898
Authors: Olivier Létoffé,Xuanxiang Huang,Joao Marques-Silva
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.
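For context, the Shapley value the paper critiques is defined exactly as a player's average marginal contribution over all orderings; tools like SHAP approximate it. An exact brute-force sketch for a toy value function (our illustration, feasible only for a handful of features):

```python
import itertools

def shapley(players, value):
    """Exact Shapley values by averaging marginal gains over all orderings."""
    perms = list(itertools.permutations(players))
    phi = {p: 0.0 for p in players}
    for perm in perms:
        coalition = set()
        for p in perm:
            before = value(coalition)
            coalition = coalition | {p}
            phi[p] += value(coalition) - before
    return {p: v / len(perms) for p, v in phi.items()}

# toy model: the prediction flips to 1.0 as soon as feature "a" is present
phi = shapley(["a", "b"], lambda S: 1.0 if "a" in S else 0.0)
```

Here the full credit lands on "a" and none on "b", and the values sum to the model's output (efficiency); the paper's point is that such axiomatic guarantees do not by themselves make the resulting attributions rigorous explanations.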
[AI-21] QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
【Quick Read】: This paper addresses the lack of effective evaluation of large language models on numerical forecasting: existing benchmarks focus on judgmental tasks in binary or multiple-choice formats and cannot measure how well models handle uncertainty over continuous quantities. The key to the solution is using prediction intervals as the evaluation interface: models must provide not only point estimates but explicit intervals at stated confidence levels, testing scale awareness, consistency across confidence levels, and calibration over a continuum of outcomes. This makes the evaluation more rigorous and practical, better reflecting forecasting reliability in real applications across economics, public health, and social demographics.
Link: https://arxiv.org/abs/2604.15859
Authors: Jeremy Qin,Maksym Andriushchenko
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
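Empirical coverage and interval sharpness, the two quantities evaluated above, reduce to simple statistics over the elicited intervals. A sketch with made-up numbers:

```python
def coverage_and_sharpness(intervals, outcomes):
    """Fraction of outcomes inside their interval, and mean interval width."""
    hits = sum(1 for (lo, hi), y in zip(intervals, outcomes) if lo <= y <= hi)
    width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return hits / len(outcomes), width

# nominal 90% intervals vs. realized values (illustrative only)
cov, sharp = coverage_and_sharpness(
    intervals=[(0.0, 10.0), (5.0, 6.0), (0.0, 1.0)],
    outcomes=[5.0, 7.0, 0.5])
```

A well-calibrated forecaster would reach coverage near the nominal 90% while keeping mean width small; coverage alone can be gamed with absurdly wide intervals, which is why the benchmark reports both.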
[AI-22] DPrivBench: Benchmarking LLM s Reasoning for Differential Privacy
【速读】:该论文旨在解决差分隐私(Differential Privacy, DP)算法设计与验证中存在专家门槛过高、自动化程度不足的问题,即非专家实践者难以独立完成DP机制的正确性推理与验证。其解决方案的关键在于引入 DPrivBench——一个专门针对差分隐私推理能力设计的基准测试集,涵盖广泛DP主题、多难度层级,并通过结构化任务评估大语言模型(Large Language Models, LLMs)在给定假设下判断函数或算法是否满足指定DP保证的能力。实验表明,尽管最强LLMs能处理基础机制,但在高级算法上仍表现不佳,揭示了当前自动化DP推理的显著差距,并为未来改进方向提供了可量化的分析依据。
链接: https://arxiv.org/abs/2604.15851
作者: Erchi Wang,Pengrun Huang,Eli Chien,Om Thakkar,Kamalika Chaudhuri,Yu-Xiang Wang,Ruihan Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.
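DPrivBench 的题目形式是判断“某函数或算法在给定假设下是否满足指定的 DP 保证”。作为该类判断的最基础实例,下面对教科书式的 Laplace 机制做一次数值验算(示意性草图:函数名、网格与容差均为本文举例,并非基准中的实际题目):

```python
# 示意:逐点检查 Laplace 机制在相邻数据集上的密度比是否满足 e^eps 上界
import math

def laplace_pdf(x, mu, b):
    """均值 mu、尺度 b 的 Laplace 分布密度。"""
    return math.exp(-abs(x - mu) / b) / (2 * b)

def check_eps_dp(b, sensitivity, eps, grid):
    """相邻输入使输出均值相差 sensitivity;检查密度比 <= e^eps 是否处处成立。"""
    bound = math.exp(eps) * (1 + 1e-9)   # 加入微小容差以避免浮点临界误差
    return all(
        laplace_pdf(x, 0.0, b) <= bound * laplace_pdf(x, sensitivity, b)
        for x in grid
    )

grid = [i / 10 for i in range(-50, 51)]
print(check_eps_dp(b=1.0 / 0.5, sensitivity=1.0, eps=0.5, grid=grid))  # True: b = Δ/ε 满足 ε-DP
print(check_eps_dp(b=1.0, sensitivity=1.0, eps=0.5, grid=grid))        # False: 噪声尺度过小
```

第二个调用展示了典型的“错误答案”:噪声尺度不足时密度比会超过 e^ε,这正是此类判断题要求模型识别的失败点。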
[AI-23] Stein Variational Black-Box Combinatorial Optimization
【速读】:该论文旨在解决高维组合优化问题中如何在利用(exploitation)与探索(exploration)之间取得平衡的难题,尤其针对多峰(multimodal)目标函数容易导致单点收敛的问题。传统估计分布算法(Estimation-of-Distribution Algorithms, EDAs)往往聚焦于单一区域,难以同时发现多个最优解。其解决方案的关键在于引入Stein算子(Stein operator),构建粒子间的排斥机制,使种群在参数空间中分散分布,从而协同探索多个模式(mode)的适应度景观。这一方法基于Stein变分梯度下降(Stein Variational Gradient Descent, SVGD),实验证明其在多种基准测试问题上表现优于或媲美当前最先进方法,尤其在大规模实例中展现出显著优势。
链接: https://arxiv.org/abs/2604.15837
作者: Thomas Landais,Olivier Goudet,Adrien Goëffon,Frédéric Saubion,Sylvain Lamprier
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single region of interest, which may result in premature convergence when facing complex or multimodal objective landscapes. In this work, we incorporate the Stein operator to introduce a repulsive mechanism among particles in the parameter space, thereby encouraging the population to disperse and jointly explore several modes of the fitness landscape. Empirical evaluations across diverse benchmark problems show that the proposed method achieves performance competitive with, and in several cases superior to, leading state-of-the-art approaches, particularly on large-scale instances. These findings highlight the potential of Stein variational gradient descent as a promising direction for addressing large, computationally expensive, discrete black-box optimization problems.
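SVGD 的更新由两项构成:核加权的对数密度梯度(利用)和核函数的梯度(粒子间排斥,即摘要所述的 repulsive mechanism)。下面是连续一维情形下的单步更新草图(玩具示例:论文实际将该机制用于离散组合优化的参数空间,此处的目标分布、步长与核带宽均为假设值):

```python
# 示意:带 RBF 核的 SVGD 单步更新;排斥项使粒子分散而非收敛到单点
import numpy as np

def svgd_step(x, grad_logp, h=1.0, lr=0.1):
    """x: (n,) 粒子位置;grad_logp: 目标分布的 score 函数;返回更新后的粒子。"""
    diff = x[:, None] - x[None, :]            # diff[i, j] = x_i - x_j
    k = np.exp(-diff ** 2 / (2 * h))          # RBF 核矩阵(对称)
    attract = k @ grad_logp(x)                # 核加权梯度:拉向高密度区域
    repulse = (diff / h * k).sum(axis=1)      # 核梯度项:把粒子推离近邻
    return x + lr * (attract + repulse) / len(x)

# 目标为标准正态(score = -x):初始聚集的粒子会被排斥项逐渐散开
x = np.array([-0.05, 0.0, 0.05])
for _ in range(200):
    x = svgd_step(x, lambda z: -z)
print(x.std() > 0.3)   # True
```

若去掉 `repulse` 项,三个粒子会一起收敛到众数附近,正对应 EDA 的过早收敛问题。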
[AI-24] ECG-Lens: Benchmarking ML DL Models on PTB-XL Dataset
【速读】:该论文旨在解决如何高效、准确地对12导联心电图(Electrocardiogram, ECG)信号进行自动分类以辅助心血管疾病诊断的问题。其解决方案的关键在于对比传统机器学习算法(如决策树、随机森林和逻辑回归)与深度学习模型(包括简单卷积神经网络、长短期记忆网络及复杂CNN——ECGLens)在原始ECG数据上的表现,并通过小波变换(Stationary Wavelet Transform, SWT)进行数据增强,从而提升模型性能。结果表明,复杂CNN架构(ECGLens)在准确率(80%)和ROC-AUC(90%)上显著优于传统方法,验证了深度学习在自动特征提取和高精度分类方面的优势。
链接: https://arxiv.org/abs/2604.15822
作者: Saloni Garg,Ukant Jadia,Amit Sagtani,Kamal Kant Hiran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
备注: 8 pages, 4 figures, 3 tables
Abstract:Automated classification of electrocardiogram (ECG) signals is a useful tool for diagnosing and monitoring cardiovascular diseases. This study compares three traditional machine learning algorithms (Decision Tree Classifier, Random Forest Classifier, and Logistic Regression) and three deep learning models (Simple Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Complex CNN (ECGLens)) for the classification of ECG signals from the PTB-XL dataset, which contains 12-lead recordings from normal patients and patients with various cardiac conditions. The DL models were trained on raw ECG signals, allowing them to automatically extract discriminative features. Data augmentation using the Stationary Wavelet Transform (SWT) was applied to enhance model performance, increase the diversity of training samples, and preserve the essential characteristics of the ECG signals. The models were evaluated using multiple metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The ECG-Lens model achieved the highest performance, with 80% classification accuracy and a 90% ROC-AUC. These findings demonstrate that deep learning architectures, particularly complex CNNs substantially outperform traditional ML methods on raw 12-lead ECG data, and provide a practical benchmark for selecting automated ECG classification models and identifying directions for condition-specific model development.
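论文以 accuracy、ROC-AUC 等指标比较传统 ML 与 DL 模型。其中二分类 ROC-AUC 有一个等价于 Mann-Whitney U 统计量的最小实现,可帮助理解该指标的含义(示意性草图,函数与数据均为本文举例,与论文实现无关):

```python
# 示意:ROC-AUC = 随机抽取的正例得分高于随机抽取的负例的概率(平分记 0.5)
def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))   # 1.0  完美区分
print(roc_auc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]))   # 0.75
```

多分类场景(如 PTB-XL 的多种心电类别)通常对每类分别计算该值后再做宏平均。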
[AI-25] From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
【速读】:该论文旨在解决机器人在真实环境中学习鲁棒策略时因数据多样性不足而导致的泛化能力差的问题,其核心挑战在于真实世界数据采集成本高、效率低。解决方案的关键在于提出一种生成式框架,建立从真实世界全景图像到高保真仿真场景的映射关系,并通过语义与几何编辑技术合成多样化的“数字表亲”(Digital Cousins)场景;同时结合高质量物理引擎和真实资产,支持交互式操作任务,并引入多房间拼接技术构建连贯的大规模环境,从而实现长时程导航任务的高效训练与评估。实验表明,该方法在仿真与真实之间具有强相关性,且大规模数据生成显著提升了对未见场景和物体变化的泛化性能。
链接: https://arxiv.org/abs/2604.15805
作者: Jasper Lu,Zhenhao Shen,Yuanfei Wang,Shugao Liu,Shengqiang Xu,Shawn Xie,Jingkai Xu,Feng Jiang,Jade Yang,Chen Xie,Ruihai Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, transferring real-world scenes into simulation has become a practical alternative for efficient learning and evaluation. We present a generative framework that establishes a real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform’s fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.
[AI-26] EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLM s
【速读】:该论文旨在解决动态系统推断中传统深度学习模型缺乏可解释性、计算成本高且需大量数据训练的问题。其核心挑战在于如何在不依赖大规模数据训练的前提下,获得既高效又可解释的推理算法。解决方案的关键在于提出EVIL(Evolving Interpretable algorithms with LLMs),利用大语言模型(Large Language Models, LLMs)引导的进化搜索机制,自动演化出简洁、纯Python/NumPy实现的可解释算法,从而在多个动态系统任务(如时间点过程的下一事件预测、马尔可夫跳跃过程的速率矩阵估计和时间序列插补)中实现零样本、上下文感知的泛化推理,且无需针对每个数据集单独训练,显著优于现有深度学习方法的效率与可解释性。
链接: https://arxiv.org/abs/2604.15787
作者: David Berghaus
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce EVIL (EVolving Interpretable algorithms with LLMs), an approach that uses LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference. Rather than training neural networks on large datasets, EVIL evolves pure Python/NumPy programs that perform zero-shot, in-context inference across datasets. We apply EVIL to three distinct tasks: next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation. In each case, a single evolved algorithm generalizes across all evaluation datasets without per-dataset training (analogous to an amortized inference model). To the best of our knowledge, this is the first work to show that LLM-guided program evolution can discover a single compact inference function for these dynamical-systems problems. Across the three domains, the discovered algorithms are often competitive with, and even outperform, state-of-the-art deep learning models while being orders of magnitude faster, and remaining fully interpretable.
[AI-27] Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension
【速读】:该论文旨在解决脉冲变压器(Spiking Transformer)缺乏理论指导框架的问题,特别是在其自注意力机制的表达能力方面。解决方案的关键在于建立了首个针对脉冲自注意力的完备表达性理论:证明了使用漏电积分-放电(Leaky Integrate-and-Fire, LIF)神经元的脉冲注意力机制是连续排列等变函数的通用逼近器,并提出了具有理论保障的新型侧抑制网络用于softmax归一化,其收敛速度为 $ O(1/\sqrt{T}) $;同时通过率失真理论推导出严格的脉冲计数下界 $ \Omega(L_f^2 nd/\varepsilon^2) $,并引入基于输入依赖的有效维度($ d_\text{eff} $)来解释为何仅需 $ T=4 $ 个时间步即可满足性能要求,远低于传统最坏情况下的 $ T \geq 10,000 $ 预测。这一理论框架为神经形态变压器的设计提供了首个原理性的基础,并通过Spikformer、QKFormer和SpikingResformer在视觉与语言任务上的实验验证了预测精度($ R^2=0.97 $, $ p<0.001 $)。
链接: https://arxiv.org/abs/2604.15769
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 7 tables
Abstract:Spiking transformers achieve competitive accuracy with conventional transformers while offering 38-57× energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, providing explicit spike circuit constructions including a novel lateral inhibition network for softmax normalization with proven $O(1/\sqrt{T})$ convergence. We derive tight spike-count lower bounds via rate-distortion theory: $\varepsilon$-approximation requires $\Omega(L_f^2 nd/\varepsilon^2)$ spikes, with rigorous information-theoretic derivation. Our key insight is input-dependent bounds using measured effective dimensions ($d_\text{eff} = 47$-$89$ for CIFAR/ImageNet), explaining why $T=4$ timesteps suffice despite worst-case $T \geq 10{,}000$ predictions. We provide concrete design rules with calibrated constants ($C=2.3$, 95% CI: [1.9, 2.7]). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate predictions with $R^2=0.97$ ($p < 0.001$). Our framework provides the first principled foundation for neuromorphic transformer design.
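摘要中的 Leaky Integrate-and-Fire(LIF)神经元可以用几行代码模拟其离散时间动力学(示意性草图:采用发放后减阈值的 soft reset,泄漏系数与阈值均为假设值,并非该论文的具体电路构造):

```python
# 示意:LIF 神经元的离散时间模拟;beta 为泄漏系数,threshold 为发放阈值
def lif_forward(inputs, beta=0.9, threshold=1.0):
    """inputs: 各时间步的输入电流;返回各步的脉冲(0/1)序列。"""
    v, spikes = 0.0, []
    for x in inputs:
        v = beta * v + x                 # 泄漏积分:膜电位衰减后累加输入
        s = 1 if v >= threshold else 0   # 超过阈值则发放脉冲
        spikes.append(s)
        v -= s * threshold               # soft reset:发放后减去阈值
    return spikes

print(lif_forward([0.6, 0.6, 0.6, 0.6]))   # [0, 1, 0, 1]
```

恒定输入下的周期性发放说明脉冲计数随时间步 T 近似线性增长,这正是摘要中脉冲计数下界分析的对象。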
[AI-28] cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States
【速读】:该论文旨在解决神经网络量子态(Neural Network Quantum State, NNQS)方法中NNQS-SCI框架在扩展至更大体系时面临的性能瓶颈问题,特别是由CPU-GPU混合架构导致的通信开销和计算冗余问题。其核心解决方案是提出一个全GPU加速的SCI框架cuNNQS-SCI:首先引入分布式、负载均衡的全局去重算法以减少冗余并降低通信开销;其次采用细粒度CUDA内核实现精确耦合组态生成,提升计算效率;最后通过GPU内存中心化的运行时设计(包括GPU侧池化、流式小批量处理及重叠卸载机制)突破单GPU显存限制,将瓶颈从主机端迁移回设备端推理。这一设计显著提升了可求解问题规模,并在NVIDIA A100集群上实现了高达2.32倍的端到端加速比,同时保持化学精度。
链接: https://arxiv.org/abs/2604.15768
作者: Daran Sun,Bowen Kan,Haoquan Long,Hairui Zhao,Haoxu Li,Yicheng Liu,Pengyu Zhou,Ankang Feng,Wenjing Huang,Yida Gu,Zhenyu Li,Honghui Shang,Yunquan Zhang,Dingwen Tao,Ninghui Sun,Guangming Tan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted by HPDC’2026, 13 pages, 12 figures
Abstract:AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schrödinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by a hybrid CPU-GPU architecture. Specifically, centralized CPU-based global de-duplication creates a severe scalability barrier due to communication bottlenecks, while host-resident coupled-configuration generation induces prohibitive computational overheads. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. cuNNQS-SCI first integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. To address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI fundamentally expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32X end-to-end speedup over the highly-optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it demonstrates excellent distributed performance, maintaining over 90% parallel efficiency in strong scaling tests.
[AI-29] When Do Early-Exit Networks Generalize? A PAC-Bayesian Theory of Adaptive Depth
【速读】:该论文旨在解决早期退出神经网络(Early-exit Neural Networks)在实际部署中普遍存在的泛化性能缺乏理论保障的问题,这是近期综述明确指出的研究空白。其解决方案的关键在于构建了一个统一的PAC-Bayesian框架,首次建立了依赖于退出深度熵 $ H(D) $ 和期望深度 $ \mathbb{E}[D] $ 的泛化界,而非传统基于最大深度 $ K $ 的界限,从而更精确地刻画自适应深度结构的泛化能力。该框架还提供了显式的构造性常数(如主导系数 $ \sqrt{2\ln 2} \approx 1.177 $),并证明了在特定条件下自适应深度网络可严格优于固定深度模型,同时放宽了标签独立性假设以支持学习到的路由策略,最终通过多架构、多基准实验验证了边界指导阈值选择的有效性与紧致性。
链接: https://arxiv.org/abs/2604.15764
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure, 7 tables, 1 algorithm
Abstract:Early-exit neural networks enable adaptive computation by allowing confident predictions to exit at intermediate layers, achieving 2-8× inference speedup. Despite widespread deployment, their generalization properties lack theoretical understanding, a gap explicitly identified in recent surveys. This paper establishes a unified PAC-Bayesian framework for adaptive-depth networks. (1) Novel Entropy-Based Bounds: We prove the first generalization bounds depending on exit-depth entropy $H(D)$ and expected depth $\mathbb{E}[D]$ rather than maximum depth $K$, with sample complexity $\mathcal{O}((\mathbb{E}[D] \cdot d + H(D))/\epsilon^2)$. (2) Explicit Constructive Constants: Our analysis yields the leading coefficient $\sqrt{2\ln 2} \approx 1.177$ with complete derivation. (3) Provable Early-Exit Advantages: We establish sufficient conditions under which adaptive-depth networks strictly outperform fixed-depth counterparts. (4) Extension to Approximate Label Independence: We relax the label-independence assumption to $\epsilon$-approximate policies, broadening applicability to learned routing. (5) Comprehensive Validation: Experiments across 6 architectures on 7 benchmarks demonstrate tightness ratios of 1.52-3.87× (all $p < 0.001$) versus 100× for classical bounds. Bound-guided threshold selection matches validation-tuned performance within 0.1-0.3%.
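按摘要给出的样本复杂度形式 O((E[D]·d + H(D))/ε²),可以对一个假设的退出深度分布直接算出期望深度 E[D] 与退出深度熵 H(D)(示意性计算:分布数值与参数均为本文举例,仅用于说明熵依赖界为何比最大深度 K 的界更紧):

```python
# 示意:由退出深度分布计算 E[D]、H(D) 与复杂度项 (E[D]·d + H(D)) / eps^2
import math

def adaptive_depth_term(p_exit, d, eps):
    """p_exit[k]: 在第 k+1 个出口退出的概率;d: 每层有效维度;eps: 目标精度。"""
    e_d = sum((k + 1) * p for k, p in enumerate(p_exit))      # 期望深度 E[D]
    h_d = -sum(p * math.log2(p) for p in p_exit if p > 0)     # 退出深度熵 H(D)
    return e_d, h_d, (e_d * d + h_d) / eps ** 2

# 多数样本在浅层退出:E[D]=1.8 远小于最大深度 K=4,界随之收紧
e_d, h_d, term = adaptive_depth_term([0.5, 0.3, 0.1, 0.1], d=100, eps=0.1)
print(round(e_d, 2), round(h_d, 2))   # 1.8 1.69
```

与固定深度界中的 K·d/ε² 相比,此处的复杂度项约小一半以上,这对应摘要中“依赖 E[D] 与 H(D) 而非 K”的改进。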
[AI-30] KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在知识工作场景中缺乏自主问题识别能力的问题,即模型能否在未被提示的情况下从原始输入中识别出专业情境的本质结构,而非仅依赖已明确标注的任务框架进行求解。现有评估基准多聚焦于任务执行或信息提取,忽略了问题识别这一前置关键步骤。解决方案的核心在于提出KWBench——首个针对无提示问题识别的基准,其包含223个来自实际应用场景(如并购、合同谈判、临床药学等)的任务,每个任务均编码一种形式化的博弈论模式(如委托-代理冲突、信号传递、机制设计失败等),并提供结构化真值标签记录专家对情境的解读及预期失效模式。评分采用三级递进式量表,强制要求模型首先识别错误路径,从而确保评估聚焦于问题识别能力本身。实验表明,当前最优模型仅能在27.9%的任务中正确识别问题,且不同模型间共识度低,说明该能力仍属前沿挑战,凸显了KWBench在推动LLMs向真正具备“问题意识”的知识工作者演进中的关键作用。
链接: https://arxiv.org/abs/2604.15760
作者: Ankit Maloo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 37 pages, 8 figures
Abstract:We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (≈ 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
[AI-31] DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在块级推理(block-wise inference)过程中如何有效平衡生成质量与解码速度的问题。现有方法依赖固定的块调度策略或局部当前步信号来确定块边界,并采用保守的置信度机制进行并行解码,限制了性能优化空间。解决方案的关键在于提出一种无需训练的框架DepCap,其核心创新是引入两类关键信号:一是跨步骤信号——利用前一已解码块对后续候选块的影响来自适应地决定块扩展范围;二是词元级冲突信号——识别每个块内无冲突的安全解码子集,从而实现更高效的并行解码。该方法通过信息论分析验证了累积最后块影响在词元间近似可加性,支撑了其分块准则的有效性,实验证明其可在多个DLM骨干网络和推理/编码基准上实现高达5.63倍的加速,同时保持质量几乎不变。
链接: https://arxiv.org/abs/2604.15750
作者: Xiang Xia,Wuyang Zhang,Jiazheng Liu,Cheng Yan,Yanyong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-free subset of tokens for safe parallel decoding within each block, enabling substantial inference acceleration with negligible quality degradation. DepCap is a plug-and-play method applicable to various DLMs, and compatible with existing KV-cache strategies for block-wise DLM. An information-theoretic analysis further suggests that the cumulative last-block influence on a candidate block is approximately additive across tokens, supporting the proposed block-partitioning criterion. Experimental results show that DepCap achieves favorable speed-quality trade-offs across multiple DLM backbones and reasoning and coding benchmarks, with up to 5.63× speedup without significant performance degradation.
[AI-32] Privacy-Preserving LLM s Routing
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)路由过程中引入的隐私泄露问题,尤其是在用户数据通过中间路由层传输时可能面临的安全风险。现有基于安全多方计算(Secure Multi-Party Computation, MPC)的隐私保护方案因协议设计不成熟和实现效率低下,难以在实际场景中应用。解决方案的关键在于提出一个名为PPRoute的隐私保护LLM路由框架:其核心创新包括三方面——首先,采用MPC友好的操作加速编码器推理;其次,设计多阶段模型训练算法,在加密域内保持路由质量;第三,提出一种通信复杂度为O(1)的无序Top-k排序算法,显著降低模型搜索过程中的通信延迟。实验表明,PPRoute在多个数据集上达到明文基准性能,同时相较朴素MPC实现提速约20倍。
链接: https://arxiv.org/abs/2604.15728
作者: Xidong Wu,Yukuan Zhang,Yuqiong Ji,Reza Shirkavand,Qian Lou,Shangqian Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) routing has emerged as a critical strategy to balance model performance and cost-efficiency by dynamically selecting services from various model providers. However, LLM routing adds an intermediate layer between users and LLMs, creating new privacy risks to user data. These privacy risks have not been systematically studied. Although cryptographic techniques such as Secure Multi-Party Computation (MPC) enable privacy-preserving computation, their protocol design and implementation remain under-explored, and naïve implementations typically incur prohibitive computational overhead. To address this, we propose a privacy-preserving LLM routing framework (PPRoute). PPRoute includes multiple strategies to speed up encoder inference and nearest neighbor search under MPC while maintaining the quality of LLM routing. First, PPRoute uses MPC-friendly operations to boost encoder inference. Second, PPRoute uses a multiple-step model training algorithm to maintain routing quality despite the constraints of the encrypted domain. Third, PPRoute proposes an unsorted Top-k algorithm with O(1) communication complexity for secure sorting in model search, significantly reducing communication latency. Across different datasets, PPRoute matches the performance of plaintext counterparts while achieving approximately a 20× speedup over naïve MPC implementations.
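“无序 Top-k”只要求返回得分最高的 k 个元素的集合,而不要求它们之间的排序,这正是该原语可以用低通信开销的安全协议实现的前提。下面给出其明文语义的参考实现(仅说明输入输出约定;PPRoute 的贡献是在 MPC 下以 O(1) 通信复杂度实现该功能,此处并非其安全协议):

```python
# 示意:明文下的无序 Top-k——返回前 k 大元素的下标集合,不保证内部有序
import heapq

def unsorted_topk(scores, k):
    """scores: 各候选 LLM 的路由得分;返回得分最高的 k 个候选的下标集合。"""
    return set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

print(sorted(unsorted_topk([0.3, 0.9, 0.1, 0.7, 0.5], k=2)))   # [1, 3]
```

返回集合而非有序列表,意味着协议无需泄露(也无需计算)候选之间的相对次序,这是通信开销得以压缩的来源之一。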
[AI-33] Structured Abductive-Deductive-Inductive Reasoning for LLM s via Algebraic Invariants ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在结构化逻辑推理中存在的系统性局限,包括假设生成与验证混淆、无法区分猜想与已验证知识,以及弱推理步骤在推理链中 unchecked 传播等问题。其解决方案的关键在于提出一种符号推理框架(symbolic reasoning scaffold),将皮尔士(Peirce)的三元推理体系——溯因(abduction)、演绎(deduction)和归纳(induction)——形式化为显式的LLM辅助推理协议,并通过五个代数不变量(Gamma Quintet)强制逻辑一致性,其中最核心的是“最弱环节”约束(Weakest Link bound),该约束确保推理链中任意结论的可靠性不超过其最弱前提的支持水平,从而防止多步推理中的逻辑不一致累积。
链接: https://arxiv.org/abs/2604.15727
作者: Sankalp Gilda,Shlok Gilda
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 10 pages + 3 pages references. Accepted as a poster at the ICLR 2026 Workshop for LLM Reasoning
Abstract:Large language models exhibit systematic limitations in structured logical reasoning: they conflate hypothesis generation with verification, cannot distinguish conjecture from validated knowledge, and allow weak reasoning steps to propagate unchecked through inference chains. We present a symbolic reasoning scaffold that operationalizes Peirce’s tripartite inference – abduction, deduction, and induction – as an explicit protocol for LLM-assisted reasoning. The framework enforces logical consistency through five algebraic invariants (the Gamma Quintet), the strongest of which – the Weakest Link bound – ensures that no conclusion in a reasoning chain can exceed the reliability of its least-supported premise. This principle, independently grounded as weakest link resolution in possibilistic logic and empirically validated for chain-of-thought reasoning, prevents logical inconsistencies from accumulating across multi-step inference. We verify all invariants through a property-based testing suite of 100 properties and 16 fuzz tests over 10^5+ generated cases, providing a verified reference implementation of the invariants suitable as a foundation for future reasoning benchmarks.
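摘要中的“最弱环节”(Weakest Link)不变量规定:推理链中任何结论的可信度不得超过其支持最弱的前提。该约束本身就是一个 min 规则,可用几行代码刻画(示意性草图,可信度数值为本文举例,并非论文的参考实现):

```python
# 示意:最弱环节不变量——链式推理的整体可信度上界取各步可信度的最小值
def chain_confidence(step_confidences):
    """多步推理链的可信度上界(possibilistic 弱链解析的 min 规则)。"""
    conf = 1.0
    for c in step_confidences:
        conf = min(conf, c)   # 任一弱步骤都会拉低整条链的上界
    return conf

print(chain_confidence([0.95, 0.99, 0.60, 0.90]))   # 0.6
```

即使其余步骤都近乎确定,单个可信度为 0.6 的步骤也把整条链封顶在 0.6,从而阻止弱推理步骤“被平均掉”后继续传播。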
[AI-34] LLM Reasoning Is Latent Not the Chain of Thought
【速读】:该论文试图解决当前大语言模型(Large Language Model, LLM)推理研究中对“推理本质”的定义模糊问题,即推理是源于显式的思维链(Chain-of-Thought, CoT)还是隐含的状态轨迹(latent-state trajectory)。这一区分至关重要,因为它直接影响对模型可解释性、基准测试有效性以及推理干预策略设计的判断。论文的关键解决方案在于提出并形式化三个竞争假设(H1、H2和H0),并通过重新组织近期实证、机制性和调查性工作,结合计算资源审计后的案例分析,明确指出当前证据最支持H1——即推理主要由隐状态轨迹驱动,而非显式CoT或通用串行计算。因此,论文建议将隐状态动力学作为LLM推理研究的默认对象,并采用能明确解耦表面痕迹、隐状态与串行计算的设计来评估推理能力。
链接: https://arxiv.org/abs/2604.15726
作者: Wenshuo Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.
[AI-35] Reasoning -targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
【速读】:该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在高风险场景下因推理过程被恶意注入有害内容而引发的安全隐患问题,即在保持最终答案不变的前提下,通过操纵推理链来传播不当信息。这一问题不同于以往仅关注最终输出安全性的“越狱攻击”,其挑战在于:1)修改输入指令可能意外改变模型的最终回答;2)多样化的输入问题使得持续绕过模型的安全对齐机制困难。解决方案的关键在于提出心理学驱动的推理目标型越狱攻击框架(Psychology-based Reasoning-targeted Jailbreak Attack, PRJA),该框架包含语义触发选择模块和基于心理学理论(如服从权威与道德脱离)的指令生成模块,能够自动识别可操控的推理触发点,并生成具有高适应性的指令以增强模型对有害内容生成的合规性,实验表明PRJA在多个商业LRM上平均攻击成功率高达83.6%。
链接: https://arxiv.org/abs/2604.15725
作者: Zehao Wang,Lanjun Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel problem that injects harmful content into the reasoning steps while preserving unchanged answers. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM’s final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM’s safety alignment mechanisms and embed harmful content into its reasoning process. To address these challenges, we propose the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, which integrates a Semantic-based Trigger Selection module and a Psychology-based Instruction Generation module. Specifically, the proposed PRJA automatically selects manipulative reasoning triggers via semantic analysis and leverages psychological theories of obedience to authority and moral disengagement to generate adaptive instructions for enhancing the LRM’s compliance with harmful content generation. Extensive experiments on five question-answering datasets demonstrate that PRJA achieves an average attack success rate of 83.6% against several commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.
[AI-36] he World Leaks the Future: Harness Evolution for Future Prediction Agents
【速读】:该论文旨在解决生成式 AI (Generative AI) 在未来预测任务中因缺乏细粒度监督信号而导致的早期推理能力不足问题。现有方法主要依赖最终结果进行优化,但此类“事后反馈”无法有效指导因子追踪、证据收集与解释及不确定性处理等关键环节。其解决方案的关键在于提出一种名为 Milkyway 的自演化代理系统,该系统通过持续更新一个持久化的未来预测框架(future prediction harness),在未获知最终结果前利用不同时间点预测间的内部对比差异提取“内部反馈”(internal feedback),并将可复用的指导信息写入框架,从而实现对同一未决问题的多次预测迭代改进;最终结果则作为回溯性校验机制,确保更新后的框架可迁移至后续任务,显著提升预测性能。
链接: https://arxiv.org/abs/2604.15719
作者: Chuyang Wei(1 and 2),Maohang Gao(1 and 2),Zhixin Han(2),Kefei Chen(2 and 3),Yu Zhuang(2),Haoxiang Guan(1 and 2),Yanzhi Zhang(2),Yilin Cheng(2),Jiyan He(2),Huanhuan Chen(1),Jian Li(3),Yu Shi(2),Yitong Duan(2),Shuxin Zheng(2) ((1) University of Science and Technology of China, (2) Zhongguancun Academy, Beijing, China, (3) Tsinghua University)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, 6 tables
Abstract:Many consequential decisions must be made before the relevant outcome is known. Such problems are commonly framed as *future prediction*, where an LLM agent must form a prediction for an unresolved question using only the public information available at the prediction time. The setting is difficult because public evidence evolves while useful supervision arrives only after the question is resolved, so most existing approaches still improve mainly from final outcomes. Yet final outcomes are too coarse to guide earlier factor tracking, evidence gathering and interpretation, or uncertainty handling. When the same unresolved question is revisited over time, temporal contrasts between earlier and later predictions can expose omissions in the earlier prediction process; we call this signal *internal feedback*. We introduce *Milkyway*, a self-evolving agent system that keeps the base model fixed and instead updates a persistent *future prediction harness* for factor tracking, evidence gathering and interpretation, and uncertainty handling. Across repeated predictions on the same unresolved question, *Milkyway* extracts internal feedback and writes reusable guidance back into the harness, so later predictions on that question can improve before the outcome is known. After the question is resolved, the final outcome provides a *retrospective check* before the updated harness is carried forward to subsequent questions. On FutureX and FutureWorld, Milkyway achieves the best overall score among the compared methods, improving FutureX from 44.07 to 60.90 and FutureWorld from 62.22 to 77.96.
[AI-37] Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints
【速读】:该论文旨在解决在秩一多态λ-演算(rank-one polymorphic λ-calculus)中为项添加完整且最小类型注解(type annotations)的问题,以确保在重新解析和类型推断过程中保持语义不变性。其解决方案的关键在于构建一个形式化的元理论框架,包含完整的规范说明与严格证明,并在Isabelle/HOL中实现该形式化开发;同时通过人类驱动与大语言模型(LLM)驱动的混合工作流,探索AI辅助的形式化方法,从而提升形式化过程的自动化程度与通用性。
链接: https://arxiv.org/abs/2604.15713
作者: Kevin Kappelmann,Maximilian Schäffeler,Lukas Stevens,Mohammad Abdulaziz,Andrei Popescu,Dmitriy Traytel
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the problem of complete and minimal type annotations for rank-one polymorphic λ-calculus terms, as used in Isabelle. Building on prior work by Smolka, Blanchette et al., we give a metatheoretical account of the problem, with a full formal specification and proofs, and formalize it in Isabelle/HOL. Our development is a series of experiments featuring human-driven and AI-driven formalization workflows: a human and an LLM-powered AI agent independently produce pen-and-paper proofs, and the AI agent autoformalizes both in Isabelle, with further human-hinted AI interventions refining and generalizing the development.
[AI-38] Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在执行特定任务类时,其技能(Skill)设计对任务性能具有显著影响但缺乏系统优化方法的问题。核心挑战在于技能结构与组件内容之间存在强耦合关系,需联合优化二者以提升代理表现。解决方案的关键是将技能优化建模为一个双层优化问题:外层通过蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)探索最优技能结构,内层则在选定结构下利用LLM精细化调整各组件的内容;两层均借助LLM辅助决策,从而实现结构与内容的协同优化。
链接: https://arxiv.org/abs/2604.15709
作者: Chenyi Huang,Haoting Zhang,Jingxu Xu,Zeyu Zheng,Yunduan Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agent skills are structured collections of instructions, tools, and supporting resources that help large language model (LLM) agents perform particular classes of tasks. Empirical evidence shows that the design of skills can materially affect agent task performance, yet systematically optimizing skills remains challenging. Since a skill comprises instructions, tools, and supporting resources in a structured way, optimizing it requires jointly determining both the structure of these components and the content each component contains. This gives rise to a complex decision space with strong interdependence across structure and components. We therefore represent these two coupled decisions as skill structure and component content, and formulate skill optimization as a bilevel optimization problem. We propose a bilevel optimization framework in which an outer loop employs Monte Carlo Tree Search to determine the skill structure, while an inner loop refines the component content within the structure selected by the outer loop. In both loops, we employ LLMs to assist the optimization procedure. We evaluate the proposed framework on an open-source Operations Research Question Answering dataset, and the experimental results suggest that the bilevel optimization framework improves the performance of the agents with the optimized skill.
[AI-39] The Price of Paranoia: Robust Risk-Sensitive Cooperation in Non-Stationary Multi-Agent Reinforcement Learning AAMAS
【速读】:该论文旨在解决多智能体协同学习中合作均衡的不稳定性问题:当智能体在动态环境中相互学习时,彼此策略更新会引入噪声(co-learning noise),导致原本稳定的协作均衡迅速崩溃,即使该均衡在帕累托意义上占优。关键发现是,传统通过分布鲁棒性(distributional robustness)增强对伙伴不确定性的抗性反而加剧了不稳定性——因为风险厌恶的目标函数会惩罚高方差的合作行为,从而扩大不稳定区域。解决方案的核心在于重新定位鲁棒性的目标:不应作用于回报分布,而应针对由伙伴不确定性引发的策略梯度更新方差进行建模,并设计一种基于在线伙伴不可预测性度量来调节梯度更新的算法。这一区分使得合作吸引域(cooperation basin)得以扩展,在对称协调博弈中实现了理论保障的稳定性提升。为统一分析稳定性、样本复杂性和福利后果,作者进一步提出“偏执代价”(Price of Paranoia)作为“无政府状态代价”(Price of Anarchy)的结构对偶,结合新的“合作窗口”(Cooperation Window),精确刻画了在伙伴噪声下学习算法可恢复的最大福利水平,最终确定了平衡均衡稳定性和采样效率的最优鲁棒程度。
链接: https://arxiv.org/abs/2604.15695
作者: Deep Kumar Ganguly,Chandradithya S Jonnalagadda,Pratham Chintamani,Adithya Ananth
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: Accepted to AAMAS ALA Workshop 2026
Abstract:Cooperative equilibria are fragile. When agents learn alongside each other rather than in a fixed environment, the process of learning destabilizes the cooperation they are trying to sustain: every gradient step an agent takes shifts the distribution of actions its partner will play, turning a cooperative partner into a source of stochastic noise precisely where the cooperation decision is most sensitive. We study how this co-learning noise propagates through the structure of coordination games, and find that the cooperative equilibrium, even when strongly Pareto-dominant, is exponentially unstable under standard risk-neutral learning, collapsing irreversibly once partner noise crosses the game’s critical cooperation threshold. The natural response to apply distributional robustness to hedge against partner uncertainty makes things strictly worse: risk-averse return objectives penalize the high-variance cooperative action relative to defection, widening the instability region rather than shrinking it, a paradox that reveals a fundamental mismatch between the domains where robustness is applied and instability originates. We resolve this by showing that robustness should target the policy gradient update variance induced by partner uncertainty, not the return distribution. This distinction yields an algorithm whose gradient updates are modulated by an online measure of partner unpredictability, provably expanding the cooperation basin in symmetric coordination games. To unify stability, sample complexity, and welfare consequences of this approach, we introduce the Price of Paranoia as the structural dual of the Price of Anarchy. Together with a novel Cooperation Window, it precisely characterizes how much welfare learning algorithms can recover under partner noise, pinning down the optimal degree of robustness as a closed-form balance between equilibrium stability and sample efficiency.
[AI-40] CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
【速读】:该论文旨在解决当前代码信息检索(Code IR)模型普遍存在的文本中心化局限性问题,即现有方法往往忽视了编程 artifacts 中固有的视觉与结构特征(如网页界面、数据可视化、SVG、示意图和UML图),从而限制了代码检索的准确性和实用性。其解决方案的关键在于提出 CodeMMR——一个统一的多模态代码检索模型,通过指令驱动的多模态对齐机制,将自然语言、代码和图像联合嵌入到共享语义空间中,实现了跨模态与跨语言的强泛化能力,在nDCG@10指标上平均优于基线模型(如UniIR、GME、VLM2Vec)达10分,并显著提升了检索增强生成(RAG)在未见代码生成任务中的代码生成保真度与视觉定位能力,验证了多模态检索作为下一代智能编程系统核心组件的潜力。
链接: https://arxiv.org/abs/2604.15663
作者: Jiahui Geng,Qing Li,Fengyu Cai,Fakhri Karray
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding. Yet existing code IR models remain largely text-centric and often overlook the visual and structural aspects inherent in programming artifacts such as web interfaces, data visualizations, SVGs, schematic diagrams, and UML. To bridge this gap, we introduce MMCoIR, the first comprehensive benchmark for evaluating multimodal code IR across five visual domains, eight programming languages, and eleven libraries, and we demonstrate the difficulty of the task through extensive evaluation. We then propose CodeMMR, a unified retrieval model that jointly embeds natural language, code, and images into a shared semantic space through instruction-based multimodal alignment. CodeMMR achieves strong generalization across modalities and languages, outperforming competitive baselines (e.g., UniIR, GME, VLM2Vec) by an average of 10 points on nDCG@10. Moreover, integrating CodeMMR into RAG enhances code generation fidelity and visual grounding on unseen code generation tasks, underscoring the potential of multimodal retrieval as a core enabler for next-generation intelligent programming systems. Datasets are available at HuggingFace.
[AI-41] HYPERHEURIST: A Simulated Annealing-Based Control Framework for LLM-Driven Code Generation in Optimized Hardware Design IJCNN2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成寄存器传输级(Register Transfer Level, RTL)硬件设计时,难以稳定产出既功能正确又具备良好功耗-性能-面积(Power-Performance-Area, PPA)优化特性的设计问题。其解决方案的关键在于提出一种基于模拟退火(simulated annealing)的控制框架 HYPERHEURIST,将LLM生成的RTL视为中间候选方案而非最终设计,并通过分阶段筛选与优化机制实现:首先利用编译、结构检查和仿真验证功能正确性,仅对通过验证的设计进行PPA优化,从而显著提升优化结果的稳定性与可重复性。
链接: https://arxiv.org/abs/2604.15642
作者: Shiva Ahir,Prajna Bhat,Alex Doboli
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, 5 tables. Accepted at IJCNN 2026
Abstract:Large Language Models (LLMs) have shown promising progress for generating Register Transfer Level (RTL) hardware designs, largely because they can rapidly propose alternative architectural realizations. However, single-shot LLM generation struggles to consistently produce designs that are both functionally correct and power-efficient. This paper proposes HYPERHEURIST, a simulated annealing-based control framework that treats LLM-generated RTL as intermediate candidates rather than final designs. The suggested system focuses not only on functional correctness but also on Power-Performance-Area (PPA) optimization. In the first phase, RTL candidates are filtered through compilation, structural checks, and simulation to identify functionally valid designs. PPA optimization is restricted to RTL designs that have already passed compilation and simulation. Evaluated across eight RTL benchmarks, this staged approach yields more stable and repeatable optimization behavior than single-pass LLM-generated RTL.
[AI-42] VoodooNet: Achieving Analytic Ground States via High-Dimensional Random Projections
【速读】:该论文旨在解决传统深度学习模型依赖迭代优化算法(如随机梯度下降,SGD)进行训练所带来的高计算成本与时间开销问题。其核心挑战在于如何在不进行反向传播(backpropagation)的情况下实现高性能分类,同时显著降低训练时间。解决方案的关键在于提出一种非迭代神经架构 VoodooNet,通过将输入流形(manifold)投影到高维、高熵的“银河空间”(Galactic space, d ≫ 784),利用 Moore-Penrose 广义逆矩阵在单步内求解输出层参数,从而绕过传统训练过程中的热力学代价。该方法不仅实现了对复杂特征的解耦,还展现出近对数尺度的性能提升规律,表明模型精度主要取决于“银河空间”的体积而非迭代优化过程,为边缘计算场景下的实时 AI 推理(Edge AI)提供了新范式。
链接: https://arxiv.org/abs/2604.15613
作者: Wladimir Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure, 2 tables
Abstract:We present VoodooNet, a non-iterative neural architecture that replaces the stochastic gradient descent (SGD) paradigm with a closed-form analytic solution via Galactic Expansion. By projecting input manifolds into a high-dimensional, high-entropy “Galactic” space (d ≫ 784), we demonstrate that complex features can be untangled without the thermodynamic cost of backpropagation. Utilizing the Moore-Penrose pseudoinverse to solve for the output layer in a single step, VoodooNet achieves a classification accuracy of 98.10% on MNIST and 86.63% on Fashion-MNIST. Notably, our results on Fashion-MNIST surpass a 10-epoch SGD baseline (84.41%) while reducing the training time by orders of magnitude. We observe a near-logarithmic scaling law between dimensionality and accuracy, suggesting that performance is a function of “Galactic” volume rather than iterative refinement. This “Magic Hat” approach offers a new frontier for real-time Edge AI, where the traditional training phase is bypassed in favor of instantaneous manifold discovery.
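下面用一段极简的 NumPy 草图演示摘要中“随机高维投影 + Moore-Penrose 伪逆一步求解输出层”的核心思路(假设性示例:数据为二维玩具高斯簇而非论文所用的 MNIST,维度、激活函数等均为演示性选择,并非论文官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)

# 玩具二分类数据:两团高斯点(假设性示例)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
Y = np.eye(2)[y]                       # one-hot 标签

# 1) 随机投影到高维高熵空间("Galactic" expansion),权重固定、不训练
d = 256                                # d 远大于输入维度
W = rng.normal(size=(2, d))
b = rng.normal(size=d)
H = np.tanh(X @ W + b)                 # 非线性随机特征

# 2) 用 Moore-Penrose 伪逆一步解出输出层,绕过反向传播
beta = np.linalg.pinv(H) @ Y

pred = np.argmax(H @ beta, axis=1)
acc = float((pred == y).mean())
```

由于随机特征固定不训练,整个“训练”只包含一次矩阵伪逆运算,这正是摘要中“绕过迭代优化”的来源。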
[AI-43] DataCenterGym: A Physics-Grounded Simulator for Multi-Objective Data Center Scheduling
【速读】:该论文旨在解决现代数据中心在地理分布式部署场景下,因计算资源异构性、电力价格差异及热环境动态变化导致的调度效率低下问题。现有调度器通常将计算利用率、热量生成、冷却需求与能耗等物理因素独立处理,忽略了它们之间的强耦合关系,从而难以实现能效最优的作业分配。解决方案的关键在于提出一个基于物理机制的仿真平台 DataCenterGym,该平台整合了任务队列管理、建筑热力学动态、局部 HVAC 行为以及温度依赖的服务性能退化,并在此基础上设计了一种分层模型预测控制(Hierarchical Model Predictive Control, H-MPC)调度算法,通过显式建模热力和电力动态特性,在分布式作业放置中实现更优的调度性能。
链接: https://arxiv.org/abs/2604.15594
作者: Nilavra Pathak,Samadrita Biswas,Nirmalya Roy
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Modern datacenters schedule heterogeneous workloads across geo-distributed sites with diverse compute capacities, electricity prices, and thermal conditions. Compute utilization, heat generation, cooling demand, and energy consumption are tightly coupled, yet most existing schedulers abstract these effects and treat them independently. We present DataCenterGym, a physics-grounded simulation environment for job scheduling in geo-distributed data centers, designed as a reusable testbed for future research. The simulator integrates compute queueing, building thermal dynamics, localized HVAC behavior, and temperature-dependent service degradation within a Gymnasium-compatible interface. We also develop a Hierarchical Model Predictive Control (H-MPC) scheduling algorithm that performs distributed job placement while explicitly accounting for thermal and power dynamics. Through experiments on nominal operation and workload sensitivity, we demonstrate how H-MPC improves scheduling performance relative to baseline schedulers.
[AI-44] CSLE: A Reinforcement Learning Platform for Autonomous Security Management
【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)在网络安全管理中主要局限于仿真环境、难以推广到实际运行系统的问题。其核心挑战在于理论性能与 operational performance 之间的差距。解决方案的关键在于提出 CSLE 平台,该平台通过两个协同系统实现:一是基于虚拟化环境的仿真系统(emulation system),用于复现目标系统的关键组件并收集数据以构建马尔可夫决策过程(Markov Decision Process, MDP)模型;二是基于该模型的模拟系统(simulation system),用于高效学习安全策略,并将所学策略回传至仿真系统进行评估与迭代优化,从而缩小理论与实际部署间的性能差距。
链接: https://arxiv.org/abs/2604.15590
作者: Kim Hammar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted as Oral to the Ninth Annual Conference on Machine Learning and Systems (MLSys 2026), this https URL
Abstract:Reinforcement learning is a promising approach to autonomous and adaptive security management in networked systems. However, current reinforcement learning solutions for security management are mostly limited to simulation environments and it is unclear how they generalize to operational systems. In this paper, we address this limitation by presenting CSLE: a reinforcement learning platform for autonomous security management that enables experimentation under realistic conditions. Conceptually, CSLE encompasses two systems. First, it includes an emulation system that replicates key components of the target system in a virtualized environment. We use this system to gather measurements and logs, based on which we identify a system model, such as a Markov decision process. Second, it includes a simulation system where security strategies are efficiently learned through simulations of the system model. The learned strategies are then evaluated and refined in the emulation system to close the gap between theoretical and operational performance. We demonstrate CSLE through four use cases: flow control, replication control, segmentation control, and recovery control. Through these use cases, we show that CSLE enables near-optimal security management in an environment that approximates an operational system.
[AI-45] PAWN: Piece Value Analysis with Neural Networks
【速读】:该论文旨在解决国际象棋中棋子相对价值预测的难题,即如何准确评估某一棋子在特定局面下的贡献,这取决于其与棋盘上其他所有棋子的空间关系。解决方案的关键在于利用基于卷积神经网络(Convolutional Neural Network, CNN)的自编码器提取全局棋盘状态的潜在位置表示(latent position representations),并将该上下文信息引入多层感知机(Multilayer Perceptron, MLP)架构中进行棋子价值预测,从而显著提升模型对个体棋子贡献的建模能力。
链接: https://arxiv.org/abs/2604.15585
作者: Ethan Tang,Hasan Davulcu,Jia Zou,Zhongju Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures, 12 tables
Abstract:Predicting the relative value of any given chess piece in a position remains an open challenge, as a piece’s contribution depends on its spatial relationships with every other piece on the board. We demonstrate that incorporating the state of the full chess board via latent position representations derived using a CNN-based autoencoder significantly improves accuracy for MLP-based piece value prediction architectures. Using a dataset of over 12 million piece-value pairs gathered from Grandmaster-level games, with ground-truth labels generated by Stockfish 17, our enhanced piece value predictor significantly outperforms context-independent MLP-based systems, reducing validation mean absolute error by 16% and predicting relative piece value within approximately 0.65 pawns. More generally, our findings suggest that encoding the full problem state as context provides useful inductive bias for predicting the contribution of any individual component.
[AI-46] Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
【速读】:该论文旨在解决高风险商业场景中AI代理(AI agent)因不当行为引发的安全与隐私问题,如数据泄露和财务损失,这些问题在现有基于训练的方法和神经防护机制中难以获得形式化保障。其解决方案的关键在于引入符号式防护机制(symbolic guardrails),通过形式化规则对代理行为进行约束,从而实现可证明的安全性和安全性保障;研究表明,74%的明确政策要求可通过此类机制强制执行,并且不会损害代理任务成功率,尤其适用于领域特定的AI代理系统。
链接: https://arxiv.org/abs/2604.15579
作者: Yining Hong,Yining She,Eunsuk Kang,Christopher S. Timperley,Christian Kästner
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all code and artifacts at this https URL.
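下面的 Python 片段是符号式防护栏思想的极简示意(规则、工具名与参数均为本文虚构,并非论文发布的代码):在工具调用真正执行前,按显式规则表做允许/拒绝判定,未命中任何规则时默认拒绝(fail-closed)。

```python
import fnmatch

# (工具名模式, 参数谓词, 决策),按顺序匹配第一条命中的规则
POLICY = [
    ("delete_file", lambda args: str(args.get("path", "")).startswith("/tmp/"), "allow"),
    ("delete_file", lambda args: True, "deny"),            # 其余删除一律拒绝
    ("refund",      lambda args: args.get("amount", 0) <= 100, "allow"),
    ("refund",      lambda args: True, "deny"),
]

def check(tool: str, args: dict) -> str:
    """在执行工具调用前做符号规则判定;无规则命中时默认拒绝。"""
    for pattern, predicate, decision in POLICY:
        if fnmatch.fnmatch(tool, pattern) and predicate(args):
            return decision
    return "deny"
```

这类判定是纯符号计算,对命中的策略要求可以给出确定性的保证,这与摘要中“74% 的政策要求可由低成本机制强制执行”的观察一致。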
[AI-47] Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
【速读】:该论文旨在解决生成式模型在面对变化的奖励函数时,需重新训练才能适应新目标的问题。传统方法通过强化学习(Reinforcement Learning, RL)在训练阶段调整模型采样分布以最大化奖励,但当奖励函数发生变化时,必须重新训练整个模型,效率低下。其解决方案的关键在于提出一种奖励加权的无分类器引导(Reward-weighted Classifier-Free Guidance, RCFG),该方法能够在测试阶段直接对模型输出分布进行近似倾斜,从而实现对任意奖励函数的即时优化,而无需重新训练。此外,论文进一步表明,将RCFG作为教师模型用于知识蒸馏,可显著加速标准RL训练的收敛速度。
链接: https://arxiv.org/abs/2604.15577
作者: Alexander Peysakhovich,William Berman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Consider an auto-regressive model that produces outputs x (e.g., answers to questions, molecules) each of which can be summarized by an attribute vector y (e.g., helpfulness vs. harmlessness, or bio-availability vs. lipophilicity). An arbitrary reward function r(y) encodes tradeoffs between these properties. Typically, tilting the model’s sampling distribution to increase this reward is done at training time via reinforcement learning. However, if the reward function changes, re-alignment requires re-training. In this paper, we show that a reward weighted classifier-free guidance (RCFG) can act as a policy improvement operator in this setting, approximating tilting the sampling distribution by the Q function. We apply RCFG to molecular generation, demonstrating that it can optimize novel reward functions at test time. Finally, we show that using RCFG as a teacher and distilling into the base policy to serve as a warm start significantly speeds up convergence for standard RL.
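摘要中“用 Q 函数对采样分布做指数倾斜”的想法可用如下 NumPy 草图说明(假设性示意:beta 与 Q 值均为虚构输入,并用显式 softmax 代替真实语言模型的解码过程):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rcfg_distribution(logp_base, q_values, beta=1.0):
    """按 exp(beta * Q) 对基础策略做指数倾斜的示意,
    仅演示"奖励加权引导倾斜采样分布"的思想。"""
    return softmax(logp_base + beta * np.asarray(q_values))

logp = np.log(np.array([0.5, 0.3, 0.2]))   # 基础模型的候选 token 概率
q = np.array([0.0, 1.0, 0.0])              # 第二个候选的奖励前景更好
p0 = rcfg_distribution(logp, q, beta=0.0)  # beta=0 时退化为基础策略
p1 = rcfg_distribution(logp, q, beta=2.0)  # beta>0 时概率质量移向高 Q 候选
```

beta=0 对应不做引导;beta 越大,分布越向高 Q 候选集中,相当于在测试阶段即时适配新的奖励函数。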
[AI-48] Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在代理系统(agentic systems)中是否存在隐式行为偏见传递的问题,即即使训练数据经过严格过滤,是否仍可能通过轨迹(trajectories)学习到不安全的行为模式。其解决方案的关键在于通过两个互补的实验设置验证了行为偏见可通过模型蒸馏(model distillation)从教师代理(teacher agent)隐式传递给学生代理(student agent),而无需显式关键词或语义线索。在API接口和原生Bash环境中,尽管所有删除类关键字均被彻底过滤,学生代理仍继承了显著的破坏性行为倾向(如文件删除率高达100%或chmod优先执行率提升至30%-55%),表明行为偏见编码于轨迹动态中,而非依赖显式文本内容,从而揭示了当前数据净化策略的局限性。
链接: https://arxiv.org/abs/2604.15559
作者: Jacob Dang,Brian Y. Xie,Omar G. Younis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student’s deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student’s chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.
[AI-49] Natural gradient descent with momentum
【速读】:该论文旨在解决在非线性模型类(如神经网络或张量网络)中优化损失函数时,传统梯度下降和自然梯度下降(Natural Gradient Descent, NGD)容易陷入局部极小值的问题,尤其是在损失函数非理想条件(如KL散度用于密度估计或偏微分方程残差的范数在物理信息学习中)下,NGD也可能无法提供最优更新方向。解决方案的关键在于引入经典惯性动力学方法(如Heavy-Ball或Nesterov加速)的自然版本——即基于流形结构的自然惯性动力学,通过利用近似流形切空间的Gram矩阵作为预条件,实现对参数更新方向的更优引导,从而提升学习过程的收敛性和跳出局部极小的能力。
链接: https://arxiv.org/abs/2604.15554
作者: Anthony Nouy,Agustín Somacal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:
Abstract:We consider the problem of approximating a function by an element of a nonlinear manifold which admits a differentiable parametrization, typical examples being neural networks with differentiable activation functions or tensor networks. Natural gradient descent (NGD) for the optimization of a loss function can be seen as a preconditioned gradient descent where updates in the parameter space are driven by a functional perspective. In a spirit similar to Newton’s method, a NGD step uses, instead of the Hessian, the Gram matrix of the generating system of the tangent space to the approximation manifold at the current iterate, with respect to a suitable metric. This corresponds to a locally optimal update in function space, following a projected gradient onto the tangent space to the manifold. Still, both gradient and natural gradient descent methods get stuck in local minima. Furthermore, when the model class is a nonlinear manifold or the loss function is not ideally conditioned (e.g., the KL-divergence for density estimation, or a norm of the residual of a partial differential equation in physics informed learning), even the natural gradient might yield non-optimal directions at each step. This work introduces a natural version of classical inertial dynamic methods like Heavy-Ball or Nesterov and shows how it can improve the learning process when working with nonlinear model classes.
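以下 NumPy 草图在一个两参数的最小二乘拟合问题上演示“自然梯度(用切空间 Gram 矩阵作预条件)+ Heavy-Ball 惯性”的组合(假设性示例:模型、步长与动量系数均为演示性选择,并非论文的具体算法):

```python
import numpy as np

# 目标:用 f(x; a, b) = a * tanh(b * x) 拟合数据(假设性小例子)
x = np.linspace(-2, 2, 64)
y = 1.5 * np.tanh(0.8 * x)

def residual_and_jacobian(theta):
    a, b = theta
    f = a * np.tanh(b * x)
    J = np.stack([np.tanh(b * x),                  # df/da
                  a * x / np.cosh(b * x) ** 2],    # df/db
                 axis=1)
    return f - y, J

theta = np.array([0.5, 0.5])
v = np.zeros(2)                        # Heavy-Ball 动量
eta, mu, reg = 0.5, 0.5, 1e-6
for _ in range(100):
    r, J = residual_and_jacobian(theta)
    grad = J.T @ r / len(x)            # L2 损失的参数梯度
    G = J.T @ J / len(x) + reg * np.eye(2)   # 切空间 Gram 矩阵(加正则)
    nat = np.linalg.solve(G, grad)     # 自然梯度方向
    v = mu * v - eta * nat             # 惯性(Heavy-Ball)更新
    theta = theta + v
loss = 0.5 * float(np.mean(residual_and_jacobian(theta)[0] ** 2))
```

其中 G 即摘要所述“切空间生成系统的 Gram 矩阵”,解 G⁻¹∇L 对应函数空间中的投影梯度;动量项是论文所研究的惯性成分。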
[AI-50] LACE: Lattice Attention for Cross-thread Exploration
【速读】:该论文旨在解决当前大型语言模型在推理过程中存在冗余错误的问题,即多个并行推理路径虽能独立探索,但缺乏交互机制导致它们常以相似方式失败。解决方案的关键在于提出LACE框架,通过重构模型架构引入跨线程注意力(cross-thread attention),使不同推理路径能够在推理阶段共享中间见解并相互纠错,从而实现协同式并行推理。为克服缺乏自然训练数据的挑战,作者还设计了一套合成数据生成管道,专门用于训练模型在多线程间进行沟通与错误修正。实验表明,该方法显著优于传统并行搜索策略,在推理准确性上提升超过7个百分点。
链接: https://arxiv.org/abs/2604.15529
作者: Yang Li,Zirui Zhang,Yang Liu,Chengzhi Mao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 15 figures
Abstract:Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.
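跨线程注意力的机制可以用如下 NumPy 草图直观说明(假设性玩具设置:线程数、token 数与隐维均为虚构;真实系统中的键/值来自各线程的推理上下文)。查询仍来自本线程,但键/值拼接了所有并行线程的表示,使线程之间可以交换中间信息:

```python
import numpy as np

rng = np.random.default_rng(2)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# 3 条并行推理线程,每条 4 个 token,隐维 8
n_threads, T, d = 3, 4, 8
threads = [rng.normal(size=(T, d)) for _ in range(n_threads)]

# 普通并行推理:每条线程只关注自己的 token
solo = [attention(h, h, h) for h in threads]

# 跨线程注意力:键/值拼接所有线程的 token
K_all = np.concatenate(threads, axis=0)          # (n_threads*T, d)
cross = [attention(h, K_all, K_all) for h in threads]
```

两种方式的输出形状相同,但跨线程版本的每个 token 都能读到其他线程的中间状态,这就是摘要中“共享中间见解并相互纠错”的结构基础。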
[AI-51] LLMbench: A Comparative Close Reading Workbench for Large Language Models
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)比较工具多聚焦于量化评估与用户评分指标,而忽视了数字人文领域对生成文本进行深度阐释性阅读(hermeneutic reading)的需求。其解决方案的关键在于设计并实现LLMbench——一个基于浏览器的交互式工作台,支持对两个模型输出进行并置对比分析,并提供六种分析模式(如Stochastic Variation、Temperature Gradient等)和四种可视化分析层(Probabilities、Differences、Tone、Structure),从而将生成文本的token级概率结构显式化,使文本的“可能替代版本”及其生成路径得以可视化呈现,为批判性研究生成式AI(Generative AI)提供了可操作的数据基础与分析框架。
链接: https://arxiv.org/abs/2604.15508
作者: David M. Berry
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 22 pages, 12 figures
Abstract:LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR’s LLM Comparator, are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt appear side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside five analytical modes, Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence, that make the probabilistic structure of generated text legible at the token level. The tool treats the generated text as a research object in its own right, drawn from a probability distribution, a text that could have been otherwise, and provides visualisations, including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, that show the counterfactual history from which each word emerged. This paper describes the tool’s architecture, its modes, and its design rationale, and argues that log-probability data, currently underused in humanistic and social-scientific readings of AI, is an important resource for critical studies of generative AI models.
[AI-52] SecureRouter: Encrypted Routing for Efficient Secure Inference
【速读】:该论文旨在解决加密环境下神经网络推理效率低下的问题,尤其是现有隐私保护推理系统因采用单一固定Transformer模型处理所有加密输入而导致的性能瓶颈。其核心挑战在于未能根据输入特征动态选择最优模型规模,从而难以在计算效率与精度之间取得平衡。解决方案的关键在于提出SecureRouter框架,通过构建一个统一的加密流水线,将安全路由机制与MPC优化的模型池相结合:一方面设计了MPC成本感知的安全路由器,在加密状态下预测各模型的效用与开销;另一方面对模型池中的架构和量化方案进行联合训练以最小化MPC通信与计算开销,实现输入自适应的模型选择与推理协同优化,最终在保持近似精度的前提下显著降低延迟(提升1.95倍)。
链接: https://arxiv.org/abs/2604.15499
作者: Yukuan Zhang,Mengxin Zheng,Qian Lou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear in the 63rd IEEE/ACM Design Automation Conference (DAC 2026)
Abstract:Cryptographically secure neural network inference typically relies on secure computing techniques such as Secure Multi-Party Computation (MPC), enabling cloud servers to process client inputs without decrypting them. Although prior privacy-preserving inference systems co-design network optimizations with MPC, they remain slow and costly, limiting real-world deployment. A major bottleneck is their use of a single, fixed transformer model for all encrypted inputs, ignoring that different inputs require different model sizes to balance efficiency and accuracy. We present SecureRouter, an end-to-end encrypted routing and inference framework that accelerates secure transformer inference through input-adaptive model selection under encryption. SecureRouter establishes a unified encrypted pipeline that integrates a secure router with an MPC-optimized model pool, enabling coordinated routing, inference, and protocol execution while preserving full data and model confidentiality. The framework includes training-phase and inference-phase components: an MPC-cost-aware secure router that predicts per-model utility and cost from encrypted features, and an MPC-optimized model pool whose architectures and quantization schemes are co-trained to minimize MPC communication and computation overhead. Compared to prior work, SecureRouter achieves a latency reduction by 1.95x with negligible accuracy loss, offering a practical path toward scalable and efficient secure AI inference. Our open-source implementation is available at: this https URL
[AI-53] A Q-learning-based QoS-aware multipath routing protocol in IoMT-based wireless body area network
【速读】:该论文旨在解决无线体域网(Wireless Body Area Networks, WBANs)中因动态拓扑、能量约束及多样化服务质量(Quality of Service, QoS)需求所带来的路由挑战。其解决方案的关键在于提出一种基于Q-learning的QoS感知多路径路由方法(QQMR),通过将数据分为三类优先级,结合自适应多级队列与模糊C均值聚类优化路由决策,并为每类数据维护独立的学习策略,从而实现主路径与备用路径的智能选择,显著提升了分组投递率,同时降低了延迟、路由开销和能量消耗。
链接: https://arxiv.org/abs/2604.15489
作者: Mehdi Hosseinzadeh,Roohallah Alizadehsani,Amin Beheshti,Hamid Alinejad-Roknyd,Lu Chen,Mohammad Sadegh Yousefpoor,Efat Yousefpoor,Muneera Altayeb,Thantrira Porntaveetus,Sadia Din
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:The Internet of Medical Things (IoMT) enables intelligent healthcare services but faces challenges such as dynamic topology, energy constraints, and diverse QoS requirements. This paper proposes QQMR, a Q-learning-based QoS-aware multipath routing method for WBANs. QQMR classifies data into three priority levels and employs adaptive multi-level queuing and fuzzy C-means clustering to optimize routing decisions. It maintains separate learning policies for each data type and selects primary and backup paths accordingly. Experimental results demonstrate improved packet delivery ratio and significant reductions in delay, routing overhead, and energy consumption compared to existing methods.
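QQMR 核心的 Q-learning 下一跳选择可以用一个微型拓扑来示意(假设性示例:拓扑、链路代价与超参数均为虚构,并省略了论文中的优先级队列与模糊 C 均值聚类部分):

```python
import random
random.seed(0)

# 微型拓扑:节点 0 经 1 或 2 到达汇聚节点 3,链路代价不同
neighbors = {0: [1, 2], 1: [3], 2: [3]}
cost = {(0, 1): 1.0, (0, 2): 5.0, (1, 3): 1.0, (2, 3): 1.0}
SINK = 3

Q = {(s, a): 0.0 for s in neighbors for a in neighbors[s]}
alpha, gamma, eps = 0.5, 0.9, 0.2      # 学习率、折扣、探索率

for _ in range(500):                   # 反复转发分组并更新 Q 表
    s = 0
    while s != SINK:
        acts = neighbors[s]
        if random.random() < eps:
            a = random.choice(acts)    # epsilon-贪心探索
        else:
            a = min(acts, key=lambda n: Q[(s, n)])   # Q 存预期代价,取最小
        future = 0.0 if a == SINK else min(Q[(a, n)] for n in neighbors[a])
        Q[(s, a)] += alpha * (cost[(s, a)] + gamma * future - Q[(s, a)])
        s = a

best_next_hop = min(neighbors[0], key=lambda n: Q[(0, n)])
```

收敛后低代价路径 0→1→3 的 Q 值明显小于 0→2→3,贪心选择即得到优选下一跳;备用路径对应 Q 值次小的邻居。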
[AI-54] Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在去遗忘(unlearning)过程中难以同时满足多个挑战性目标的问题,包括移除有害或隐私泄露信息、保持通用功能性能、避免对邻近概念的过度拒绝行为,以及确保对对抗性探测攻击的鲁棒性。现有方法通常仅关注去遗忘效果和功能保留,忽视了鲁棒性和边界行为,导致多目标场景下任务干扰。解决方案的关键在于提出一种多目标去遗忘框架,通过数据与优化协同设计实现目标平衡:首先将训练语料统一表示以缩小领域差距,进而引入双向蒸馏机制,从上下文指导的教师模型中提取期望行为并抑制学生模型中的不良行为,从而在理论上和实证上实现域分布对齐,并将看似无关的去遗忘任务转化为协同优化问题。
链接: https://arxiv.org/abs/2604.15482
作者: Yisheng Zhong,Sijia Liu,Zhuangdi Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) unlearning is crucial for removing hazardous or privacy-leaking information from the model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods primarily focus on a limited subset of these goals, typically unlearning efficacy and utility preservation while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to unlearning task interference. We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: We standardize training corpora into a unified data representation to reduce the domain gap, and then introduce a bidirectional distillation method that simultaneously elicits desired behavior from a context-instructed teacher while suppressing undesirable behavior in the student model. Theoretical and empirical analyses show that our method aligns domain distributions and converts seemingly irrelevant unlearning tasks into cooperative optimization. Evaluation demonstrates state-of-the-art performance, which enables balanced and reliable unlearning across diverse, challenging requirements.
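"双向蒸馏"一类目标函数的最小示意如下。这是假设性草图:温度、权重 lam,以及用"带符号 KL"同时表达"向教师靠拢(引导)/远离教师(抑制)"的具体形式,均为笔者的简化假设,并非论文的原始目标函数。

```python
import numpy as np

def softmax(z, temp=1.0):
    z = np.asarray(z, dtype=float) / temp
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bidirectional_distill_loss(student_logits, teacher_logits, elicit,
                               temp=2.0, lam=1.0):
    """Hypothetical sketch (not the paper's exact objective): pull the
    student toward the context-instructed teacher on desired behavior
    (elicit=True), and push it away on undesirable behavior
    (elicit=False), via a signed KL term."""
    p = softmax(teacher_logits, temp)   # teacher distribution
    q = softmax(student_logits, temp)   # student distribution
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()
    return kl if elicit else -lam * kl  # minimize: attract vs. repel
```

保留/遗忘两类样本分别以 elicit=True/False 进入同一损失,即可在一次优化中同时"引导期望行为、抑制不良行为"。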
[AI-55] The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE
【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)和智能体(Agent)技术的发展,软件工程领域面临的核心价值是否会被削弱的担忧。作者指出,这种担忧源于对“被工程化对象”范围变化的误解——即软件工程并未失去意义,而是其核心对象已从单纯的可执行代码扩展至“半可执行栈”(Semi-Executable Stack),包括自然语言、工具、工作流、控制机制及组织惯例等组合性、依赖人类或概率解释的产物。解决方案的关键在于提出一个六层诊断模型——“半可执行栈”,用以定位技术贡献、瓶颈或组织转型的具体层级及其相互依赖关系,并将传统质疑转化为可工程化的改进目标,最终通过“保留 vs. 纯化”(preserve-versus-purify)决策框架区分应保留、简化或重构的软件工程实践。
链接: https://arxiv.org/abs/2604.15468
作者: Robert Feldt,Per Lenberg,Julian Frattini,Dhasarathy Parthasarathy
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This paper is the write-up of Robert Feldt’s keynote "Agentic Software Engineering Will Eat the World: AI-Based Systems as the New Operating System of Society" given at the Agentic Engineering 2026 workshop, Rio de Janeiro, Brazil, April 14, 2026
Abstract:AI-based systems, currently driven largely by LLMs and tool-using agentic harnesses, are increasingly discussed as a possible threat to software engineering. Foundation models get stronger, agents can plan and act across multiple steps, and tasks such as scaffolding, routine test generation, straightforward bug fixing, and small integration work look more exposed than they did only a few years ago. The result is visible unease not only among students and junior developers, but also among experienced practitioners who worry that hard-won expertise may lose value. This paper argues for a different reading. The important shift is not that software engineering loses relevance. It is that the thing being engineered expands beyond executable code to semi-executable artifacts: combinations of natural language, tools, workflows, control mechanisms, and organizational routines whose enactment depends on human or probabilistic interpretation rather than deterministic execution. The Semi-Executable Stack is introduced as a six-ring diagnostic reference model for reasoning about that expansion, spanning executable artifacts, instructional artifacts, orchestrated execution, controls, operating logic, and societal and institutional fit. The model helps locate where a contribution, bottleneck, or organizational transition primarily sits, and which adjacent rings it depends on. The paper develops the argument through three worked cases, reframes familiar objections as engineering targets rather than reasons to dismiss the transition, and closes with a preserve-versus-purify heuristic for deciding which legacy software engineering processes, controls, and coordination routines should be kept and which should be simplified or redesigned. This paper is a conceptual keynote companion: diagnostic and agenda-setting rather than empirical.
[AI-56] Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在TPU(Tensor Processing Unit)架构上高效推理的问题,特别是针对现有推理内核和部署系统多为GPU优化、缺乏适用于TPU的高效注意力机制实现这一痛点。其核心挑战在于如何在动态且不规则(ragged)的执行模式下充分利用TPU的计算与内存资源。解决方案的关键在于提出Ragged Paged Attention (RPA),一种基于Pallas和Mosaic实现的高性能、灵活的注意力内核,包含三项核心技术:(1) 细粒度分块(fine-grained tiling),支持对不规则内存进行高效动态切片;(2) 自定义软件流水线,将KV缓存更新与注意力计算融合以减少冗余操作;(3) 分布式感知编译策略,为解码(decode)、预填充(prefill)及混合工作负载生成专用内核。实测表明,RPA在TPU7x上实现了高达86%的内存带宽利用率(MBU)和73%的模型浮点运算利用率(MFU),并已集成至vLLM和SGLang作为生产级TPU后端。
链接: https://arxiv.org/abs/2604.15464
作者: Jevin Jiang,Ying Chen,Blake A. Hechtman,Fenghui Zhang,Yarong Mu
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 19 figures, 12 tables
Abstract:Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google’s Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures–particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads. Evaluated on Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. Integrated as the primary TPU backend in vLLM and SGLang, RPA provides a production-grade foundation for efficient TPU inference and offers practical insights into kernel design.
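RPA 处理的变长(ragged)注意力语义可用如下 NumPy 参考实现说明。这只是语义层面的草图:各序列的 token 打包拼接,由累计边界 cu_seqlens 划分(参照常见 varlen 式 API 的约定,属笔者假设),并非 RPA 的 TPU 内核实现本身。

```python
import numpy as np

def ragged_attention(q, k, v, cu_seqlens):
    """Reference semantics of attention over a ragged (packed) batch:
    tokens of all sequences are concatenated along axis 0, and cu_seqlens
    holds cumulative sequence boundaries. Each sequence attends only to
    its own tokens; no padding is materialized."""
    out = np.empty_like(q)
    d = q.shape[-1]
    for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        scores = q[s:e] @ k[s:e].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[s:e] = w @ v[s:e]
    return out
```

RPA 的贡献在于让这一语义在 TPU 上以细粒度分块与融合流水线高效执行,而非上面的逐序列循环。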
[AI-57] DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI
【速读】:该论文旨在解决当前医疗领域中人工智能(AI)系统在临床应用中缺乏可信度与透明性的问题,特别是现有深度研究系统在证据评估标准不明确、难以检验的情况下,可能引入错误并阻碍研究人员和临床医生对其输出结果可靠性的判断。其解决方案的关键在于提出 DeepER-Med 框架,该框架将医学研究过程建模为一个可解释的循证生成工作流,包含三个模块:研究规划、代理协作和证据合成,并通过构建 DeepER-MedQA 数据集(由100个专家级真实医学问题组成)实现对复杂真实场景的评估,从而显著提升 AI 在医疗科研中的可信赖性和实用性。
链接: https://arxiv.org/abs/2604.15456
作者: Zhizheng Wang,Chih-Hsuan Wei,Joey Chan,Robert Leaman,Chi-Ping Day,Chuan Wu,Mark A Knepper,Antolin Serrano Farias,Jordina Rincon-Torroella,Hasan Slika,Betty Tyler,Ryan Huu-Tuan Nguyen,Asmita Indurkar,Mélanie Hébert,Shubo Tian,Lauren He,Noor Naffakh,Aseem Aseem,Nicholas Wan,Emily Y Chew,Tiarnan D L Keenan,Zhiyong Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 37 pages, 6 figures, 5 tables
Abstract:Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med’s conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.
[AI-58] Transfer Learning from Foundational Optimization Embeddings to Unsupervised SAT Representations
【速读】:该论文旨在解决基础优化嵌入(foundational optimization embeddings)是否能够从混合整数规划(MIP)问题泛化到决策问题,特别是布尔可满足性(SAT)问题的挑战。其核心问题是验证这类预训练表示能否在无需架构调整或监督微调的情况下,有效捕捉SAT实例的结构规律并支持无监督任务。解决方案的关键在于将CNF公式映射为与MIP相同的二分图约束-变量表示结构,从而直接复用已训练好的优化嵌入模型,实现了跨问题类型的零样本迁移,首次证明了基础优化嵌入可在约束满足领域中有效迁移,为统一优化与决策问题的表征框架迈出关键一步。
链接: https://arxiv.org/abs/2604.15448
作者: Koyena Pal,Serdar Kadioglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Foundational optimization embeddings have recently emerged as powerful pre-trained representations for mixed-integer programming (MIP) problems. These embeddings were shown to enable cross-domain transfer and reduce reliance on solver-generated labels. In this work, we investigate whether such representations generalize beyond optimization to decision problems, focusing on Boolean satisfiability (SAT). We adapt the foundational optimization architecture to SAT by mapping CNF formulas into the same bipartite constraint-variable graph representation used for MIPs. This allows direct reuse of the pre-trained embedding model without architectural changes or supervised fine-tuning. Our results show that these embeddings capture structural regularities in SAT instances and support unsupervised tasks such as instance clustering and distribution identification. We demonstrate, for the first time, that foundational optimization embeddings can transfer to constraint satisfaction domains. Our findings are a step toward a unified representational framework for both optimization and decision problems.
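将 CNF 公式映射为与 MIP 相同的"约束节点-变量节点"二分图,可用如下草图说明(节点命名方式与用 ±1 标注文字极性的做法为笔者假设,仅作示意):

```python
def cnf_to_bipartite(clauses, num_vars):
    """Hypothetical sketch of mapping a CNF formula into a bipartite
    constraint-variable graph: one node per clause, one node per
    variable, and one edge per literal, labeled by its polarity.
    Clauses are lists of signed integers (DIMACS-style literals)."""
    constraint_nodes = [f"c{i}" for i in range(len(clauses))]
    variable_nodes = [f"x{j}" for j in range(1, num_vars + 1)]
    edges = []
    for i, clause in enumerate(clauses):
        for lit in clause:
            polarity = 1 if lit > 0 else -1
            edges.append((f"c{i}", f"x{abs(lit)}", polarity))
    return constraint_nodes, variable_nodes, edges
```

例如 (x1 ∨ ¬x2) ∧ (x2 ∨ x3) 映射为 2 个约束节点、3 个变量节点和 4 条带极性的边,与 MIP 的约束-变量图结构完全同形,这正是论文得以直接复用预训练嵌入模型的前提。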
[AI-59] StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
【速读】:该论文旨在解决SignSGD在非光滑目标函数上容易发散的问题,这在现代机器学习中尤为突出,因为ReLU激活函数、最大池化层和专家混合模型等结构广泛存在,导致优化问题通常是非光滑的。解决方案的关键在于提出StoSignSGD算法,其核心创新是在符号操作符中引入结构化的随机性(structural stochasticity),同时保持更新步长的无偏性(unbiased update step)。这一设计使得StoSignSGD在凸优化场景下能严格收敛并达到理论最优速率,在非凸非光滑优化中则通过定义广义驻点(generalized stationary measures)显著改进了现有复杂度边界,并在大规模语言模型(LLM)训练中展现出卓越的稳定性与效率。
链接: https://arxiv.org/abs/2604.15416
作者: Dingzhi Yu,Rui Pan,Yuxing Liu,Tong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite their empirical superiority, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass prior definitions, proving that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD exhibits robust stability and superior efficiency across diverse large language model (LLM) training regimes. Notably, in low-precision FP8 pretraining – a setting where AdamW fails catastrophically – StoSignSGD remains highly stable and yields a remarkable 1.44x to 2.14x speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to dissect the mechanisms driving its success, we develop a sign conversion framework capable of transforming any general optimizer into its unbiased, sign-based counterpart. Utilizing this framework, we deconstruct the core components of StoSignSGD and present a comprehensive ablation study to empirically validate our algorithmic design choices.
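"在符号操作符中注入随机性并保持更新无偏"的思想,可用经典的随机符号(stochastic sign)构造示意:每个坐标以概率 (1 + g_i/M)/2 输出 +1,否则输出 -1,使得期望恰为 g_i/M。注意这只是笔者用来说明无偏性的常见构造,StoSignSGD 的实际算子可能与此不同:

```python
import numpy as np

def stochastic_sign(g, rng, scale=None):
    """Hypothetical unbiased randomized sign operator (illustration only,
    not necessarily StoSignSGD's operator): coordinate i becomes +1 with
    probability (1 + g_i/M)/2 and -1 otherwise, so E[output] = g / M."""
    M = scale if scale is not None else np.max(np.abs(g)) + 1e-12
    p_plus = 0.5 * (1.0 + np.clip(g / M, -1.0, 1.0))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)
```

与确定性 sign(g) 不同,该算子输出仍是 ±1(保留符号法的通信/存储优势),但在期望意义下与(缩放后的)梯度一致,这正是摘要所称"无偏更新步"的含义。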
[AI-60] HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)代理生态系统中对有害技能(harmful skills)缺乏系统性测量与评估的问题。现有研究多关注技能内部漏洞(如提示注入),而忽视了那些可能被滥用以实施网络攻击、欺诈、隐私侵犯或生成色情内容等恶意行为的技能。其解决方案的关键在于:首先,通过构建一个基于LLM驱动的评分系统和有害技能分类体系,对两个主流技能注册表中的98,440个技能进行大规模量化分析,发现约4.93%为有害技能;其次,提出首个面向真实代理场景的基准测试工具HarmfulSkillBench,包含200个跨20类别的有害技能及四种评估条件,用于衡量LLMs在面对预安装技能诱导时的安全响应能力。实验表明,使用预装技能显著降低模型拒绝率,且当危害意图隐含而非显式表达时,危害得分从0.27升至0.76,揭示了技能生态系统的潜在风险。
链接: https://arxiv.org/abs/2604.15415
作者: Yukun Jiang,Yage Zhang,Michael Backes,Xinyue Shen,Yang Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have evolved into autonomous agents that rely on open skill ecosystems (e.g., ClawHub and this http URL), hosting numerous publicly reusable skills. Existing security research on these ecosystems mainly focuses on vulnerabilities within skills, such as prompt injection. However, there is a critical gap regarding skills that may be misused for harmful actions (e.g., cyber attacks, fraud and scams, privacy violations, and sexual content generation), namely harmful skills. In this paper, we present the first large-scale measurement study of harmful skills in agent ecosystems, covering 98,440 skills across two major registries. Using an LLM-driven scoring system grounded in our harmful skill taxonomy, we find that 4.93% of skills (4,858) are harmful, with ClawHub exhibiting an 8.84% harmful rate compared to 3.49% on this http URL. We then construct HarmfulSkillBench, the first benchmark for evaluating agent safety against harmful skills in realistic agent contexts, comprising 200 harmful skills across 20 categories and four evaluation conditions. By evaluating six LLMs on HarmfulSkillBench, we find that presenting a harmful task through a pre-installed skill substantially lowers refusal rates across all models, with the average harm score rising from 0.27 without the skill to 0.47 with it, and further to 0.76 when the harmful intent is implicit rather than stated as an explicit user request. We responsibly disclose our findings to the affected registries and release our benchmark to support future research (see this https URL).
[AI-61] Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
【速读】:该论文旨在解决持续强化学习(Continual Reinforcement Learning, CRL)中因单一策略保留机制导致的可塑性丧失问题,即在任务序列中,即使保留了先前成功的策略,也可能因干扰而无法作为可靠起点实现快速再适应。其解决方案的关键在于提出一种名为TeLAPA(Transfer-Enabled Latent-Aligned Policy Archives)的框架,通过构建每个任务的策略邻域档案(policy archives),并在共享潜在空间中保持策略间的可比性和可重用性,从而将CRL从单一策略保留转向维护技能对齐的、具备能力且行为相关的策略邻域,支持未来高效再学习。
链接: https://arxiv.org/abs/2604.15414
作者: Lute Lillo,Nick Cheney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.
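"按任务组织档案、在共享潜在空间中检索最近的胜任策略"的结构,可用如下数据结构草图说明(接口、字段与"fitness 阈值筛选"的做法均为笔者假设,并非 TeLAPA 的官方实现):

```python
import numpy as np

class PolicyArchive:
    """Hypothetical sketch of per-task archives of behaviorally diverse
    policies embedded in a shared latent space: each entry stores a latent
    behavior descriptor and a fitness score, and reuse selects the nearest
    sufficiently competent neighbor instead of one retained model."""
    def __init__(self):
        self.tasks = {}  # task_id -> list of (latent, fitness, policy)

    def add(self, task_id, latent, fitness, policy):
        self.tasks.setdefault(task_id, []).append(
            (np.asarray(latent, dtype=float), fitness, policy))

    def nearest_competent(self, task_id, query_latent, min_fitness=0.0):
        """Return the competent policy whose latent descriptor is closest
        to the query; None if the task has no competent entries."""
        cands = [(np.linalg.norm(lat - query_latent), pol)
                 for lat, fit, pol in self.tasks.get(task_id, [])
                 if fit >= min_fitness]
        return min(cands, key=lambda t: t[0])[1] if cands else None
```

这对应摘要的核心主张:复用依赖于在局部邻域的多个候选策略中做选择,而不是把它们坍缩为单一代表策略。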
[AI-62] PRL-Bench: A Comprehensive Benchmark Evaluating LLMs Capabilities in Frontier Physics Research
【速读】:该论文旨在解决当前科学基准测试无法有效评估大语言模型(Large Language Models, LLMs)在真实科研场景中进行自主探索和长周期流程执行能力的问题。现有评测主要聚焦于领域知识理解和复杂推理,而忽视了科学研究中关键的探索性、流程复杂性和可验证性等核心特征。解决方案的关键在于构建一个名为PRL-Bench(Physics Research by LLMs)的新型基准,该基准基于《物理评论快报》(Physical Review Letters)自2025年8月以来精选的100篇论文,覆盖天体物理、凝聚态物理、高能物理、量子信息和统计物理五大理论与计算密集型子领域,每个任务均模拟真实科研的核心特性:以探索为导向的问题设定、长周期研究流程以及客观可验证的结果输出。通过这一设计,PRL-Bench系统性地映射了LLMs在端到端物理学研究中的能力边界,揭示当前先进模型在完成复杂科学任务时仍存在显著能力差距(最佳综合得分低于50),从而为下一代具备自主科学发现能力的AI系统提供可靠评估工具。
链接: https://arxiv.org/abs/2604.15411
作者: Tingjia Miao,Wenkai Jin,Muhua Zhang,Jinxin Tan,Yuelin Hu,Tu Guo,Jiejun Zhang,Yuhan Wang,Wenbo Li,Yinuo Gao,Shuo Chen,Weiqi Jiang,Yayun Hu,Zixing Lei,Xianghe Pang,Zexi Liu,Yuzhi Zhang,Linfeng Zhang,Kun Chen,Wei Wang,Weinan E,Siheng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注: 15 pages, 5 figures
Abstract:The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed with comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows without reliance on experiments. Here we introduce PRL-Bench (Physics Research by LLMs), a benchmark designed to systematically map the capability boundaries of LLMs in executing end-to-end physics research. Constructed from 100 curated papers from the latest issues of Physical Review Letters since August 2025 and validated by domain experts, PRL-Bench covers five major theory- and computation-intensive subfields of modern physics: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, including exploration-oriented formulation, long-horizon workflows, and objective verifiability, thereby reconstructing the essential reasoning processes and research workflows of real physics research. Evaluation across frontier models shows that performance remains limited, with the best overall score below 50, revealing a pronounced gap between current LLM capabilities and the demands of real scientific research. PRL-Bench serves as a reliable testbed for assessing next-generation AI scientists, advancing AI systems toward autonomous scientific discovery.
[AI-63] The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
【速读】:该论文旨在解决自回归Transformer推理中KV缓存(Key-Value caching)的数值等价性假设问题,即传统认为启用KV缓存与无缓存计算在结果上应完全一致。研究发现,在标准FP16精度下,由于浮点运算的非结合性(FP16 non-associativity),缓存开启(cache-ON)与关闭(cache-OFF)路径因累积顺序不同而产生确定性的解码token序列差异,导致所有采样策略下均出现100%的token差异,且缓存版本在8/9条件下准确率更高,表明这种偏差具有系统性而非随机性。解决方案的关键在于识别出FP16精度下的浮点非结合性是唯一因果驱动因素,并通过引入FP32精度进行“虚假验证”(falsification)将误差降低至零,从而确认了KV缓存机制本身作为状态变量(stateful variable)对数值稳定性的决定性作用,为理解现代大语言模型(LLM)推理中的数值不稳定性提供了可解释的机制框架。
链接: https://arxiv.org/abs/2604.15409
作者: Ranjith Chodavarapu,Lei Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause, and also with cache-ON yielding higher accuracy in 8 of 9 conditions, where the accuracy difference serves as an indicator that the divergence direction is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of magnitude, eliminates token flips, and drops the flip rate to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma’s larger head dimension and sliding window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.
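论文认定的根因是 FP16 浮点加法不满足结合律:不同的累加顺序会给出不同的半精度结果。这一点用几行 NumPy 即可复现,同样三个数按不同顺序求和,FP16 结果不同而 FP32 一致(数值是笔者选取的最小示例,与论文实验无关):

```python
import numpy as np

def seq_sum(values, dtype):
    """Sum values left to right entirely in the given precision."""
    acc = dtype(0.0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return acc

# In FP16 the unit in the last place near 2048 is 2, so 2048 + 1 rounds
# back to 2048 (round-to-nearest-even): the small addends are absorbed
# one by one in the first ordering, but combine first in the second.
a = seq_sum([2048.0, 1.0, 1.0], np.float16)  # -> 2048.0
b = seq_sum([1.0, 1.0, 2048.0], np.float16)  # -> 2050.0
```

这正是 cache-ON 与 cache-OFF 两条路径虽做"相同"计算却产生确定性分歧的机制:二者的累加顺序不同,而 FP16 下顺序会改变结果。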
[AI-64] Dispatch-Aware Ragged Attention for Pruned Vision Transformers
【速读】:该论文旨在解决视觉 Transformer(Vision Transformer, ViT)中基于 token pruning 的注意力机制优化在实际运行时无法显著降低延迟的问题。尽管 token pruning 方法理论上可减少注意力计算的浮点运算量(FLOPs),但在使用现代变长注意力 API(如 FlashAttention-2 的 varlen 和 PyTorch 的 NestedTensor SDPA)时,由于主机端调度开销(dispatch overhead)占比较高(60–90 微秒),导致实际 wall-clock 时间并未按预期下降。解决方案的关键在于设计了一个轻量级、双向的 Triton 注意力内核,其调度开销降低至约 40 微秒(比 FlashAttention-2 varlen 低约 1.5 倍),从而使得剪枝带来的计算节省能真正体现在端到端性能提升上;集成该内核的 pack-attend-unpack 流水线在多种剪枝算法和模型规模下均实现了最高达 2.24x 的吞吐量提升,并保持了 bit-exact 的分类预测一致性。
链接: https://arxiv.org/abs/2604.15408
作者: Saif Mahmoud,Ahmad Almasri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet when pruned sequences are executed with state-of-the-art variable-length attention APIs – including FlashAttention-2’s varlen and PyTorch’s NestedTensor SDPA – the wall-clock attention latency doesn’t scale accordingly. We trace this to a dispatch-overhead bottleneck: at the short, post-pruning sequence lengths typical of ViTs (≤197 tokens), actual matrix arithmetic completes in single-digit microseconds while the host-side dispatch path consumes 60-90 us. We present a lightweight, bidirectional Triton attention kernel whose dispatch floor is 40 us, roughly 1.5x lower than FlashAttention-2 varlen, allowing pruning savings to become more visible in wall-clock time. Integrated into a complete pack-attend-unpack pipeline, our system achieves up to 2.24x end-to-end throughput over padded PyTorch SDPA consistently across four pruning algorithms (Threshold-L2, DynamicViT, EViT, ATS), scales across DeiT-T/S/B, and maintains bit-exact classification predictions with 0.007 max absolute logit difference.
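文中的 pack-attend-unpack 流水线可用如下 NumPy 草图说明其数据流(纯示意:attend 回调代表任意短序列注意力,被剪枝位置此处以零填充,这些均为笔者假设的简化):

```python
import numpy as np

def pack_attend_unpack(x, keep_mask, attend):
    """Hypothetical sketch of the pack-attend-unpack pattern: gather the
    surviving tokens into a dense buffer, run attention on the short
    packed sequence, then scatter results back to original positions
    (pruned slots are zero-filled in this sketch)."""
    idx = np.nonzero(keep_mask)[0]
    packed = x[idx]            # pack: drop pruned tokens
    attended = attend(packed)  # attend: dense op on the short sequence
    out = np.zeros_like(x)
    out[idx] = attended        # unpack: scatter back
    return out
```

由于 attend 只处理存活 token,计算量随剪枝率下降;论文的贡献在于让这一节省不被 40-90 us 量级的主机端调度开销所掩盖。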
[AI-65] Lightweight Geometric Adaptation for Training Physics-Informed Neural Networks
【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在求解复杂偏微分方程(Partial Differential Equations, PDEs)时普遍存在的收敛速度慢、训练不稳定和精度降低的问题,这些问题主要源于损失函数景观的各向异性和快速变化的几何结构。解决方案的关键在于提出一种轻量级的曲率感知优化框架,通过引入基于割线信息的自适应预测校正机制来增强现有的一阶优化器;该机制利用连续梯度差作为局部几何变化的廉价代理,并结合步长归一化的割线曲率指标来控制校正强度,从而在不显式计算二阶矩阵的前提下提升优化稳定性与效率。
链接: https://arxiv.org/abs/2604.15392
作者: Kang An,Chenhao Si,Shiqian Ma,Ming Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 22 pages, Chenhao Si and Kang An contributed equally to this work. Their authorship order was determined randomly
Abstract:Physics-Informed Neural Networks (PINNs) often suffer from slow convergence, training instability, and reduced accuracy on challenging partial differential equations due to the anisotropic and rapidly varying geometry of their loss landscapes. We propose a lightweight curvature-aware optimization framework that augments existing first-order optimizers with an adaptive predictive correction based on secant information. Consecutive gradient differences are used as a cheap proxy for local geometric change, together with a step-normalized secant curvature indicator to control the correction strength. The framework is plug-and-play, computationally efficient, and broadly compatible with existing optimizers, without explicitly forming second-order matrices. Experiments on diverse PDE benchmarks show consistent improvements in convergence speed, training stability, and solution accuracy over standard optimizers and strong baselines, including on the high-dimensional heat equation, Gray–Scott system, Belousov–Zhabotinsky system, and 2D Kuramoto–Sivashinsky system.
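"以连续梯度差作为几何变化的廉价代理、用步长归一化的割线曲率指标控制校正强度"的机制,可用如下草图说明。摘要未给出具体公式,类名与阻尼形式 beta/(1+kappa) 均为笔者假设,仅示意设计思路:

```python
import numpy as np

class SecantCorrectedSGD:
    """Hypothetical sketch of a secant-based predictive correction on top
    of plain gradient descent: y = g_t - g_{t-1} acts as a cheap curvature
    proxy, and the step-normalized indicator kappa = ||y|| / ||s||
    (s = parameter step) damps the correction. Not the paper's formulas."""
    def __init__(self, lr=0.1, beta=0.5):
        self.lr, self.beta = lr, beta
        self.prev_g, self.prev_x = None, None

    def step(self, x, g):
        d = g.copy()
        if self.prev_g is not None:
            s = x - self.prev_x                 # last parameter step
            y = g - self.prev_g                 # secant (gradient difference)
            kappa = np.linalg.norm(y) / (np.linalg.norm(s) + 1e-12)
            d = g + self.beta / (1.0 + kappa) * y  # damped predictive correction
        self.prev_g, self.prev_x = g.copy(), x.copy()
        return x - self.lr * d
```

该包装不构造任何二阶矩阵,每步只多一次向量减法与两次范数计算,与摘要强调的"轻量、即插即用"一致。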
[AI-66] Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
【速读】:该论文旨在解决代码混淆(code obfuscation)后难以恢复原始可读程序的问题,尤其针对控制流混淆(control flow obfuscation),包括控制流扁平化(Control Flow Flattening, CFF)和隐蔽谓词(Opaque Predicates)等典型技术。传统方法依赖人工分析与复杂工具,耗时长且成本高。解决方案的关键在于引入基于思维链(Chain-of-Thought, CoT)提示的大型语言模型(Large Language Models, LLMs),通过显式的、分步骤的推理过程引导模型进行代码结构理解和语义保持,从而显著提升去混淆质量。实验表明,CoT prompting相比简单提示能大幅提升控制流图重建准确率(平均提升约16%)和程序语义保留度(平均提升约20.5%),证明了LLMs在辅助逆向工程中的有效性。
链接: https://arxiv.org/abs/2604.15390
作者: Seyedreza Mohseni,Sarvesh Baskar,Edward Raff,Manas Gaur
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code deobfuscation is the task of recovering a readable version of a program while preserving its original behavior. In practice, this often requires days or even months of manual work with complex and expensive analysis tools. In this paper, we explore an alternative approach based on Chain-of-Thought (CoT) prompting, where a large language model is guided through explicit, step-by-step reasoning tailored for code analysis. We focus on control flow obfuscation, including Control Flow Flattening (CFF), Opaque Predicates, and their combination, and we measure both structural recovery of the control flow graph and preservation of program semantics. We evaluate five state-of-the-art large language models and show that CoT prompting significantly improves deobfuscation quality compared with simple prompting. We validate our approach on a diverse set of standard C benchmarks and report results using both structural metrics for control flow graphs and semantic metrics based on output similarity. Among the tested models and by applying CoT, GPT5 achieves the strongest overall performance, with an average gain of about 16% in control-flow graph reconstruction and about 20.5% in semantic preservation across our benchmarks compared to zero-shot prompting. Our results also show that model performance depends not only on the obfuscation level and the chosen obfuscator but also on the intrinsic complexity of the original control flow graph. Collectively, these findings suggest that CoT-guided large language models can serve as effective assistants for code deobfuscation, providing improved code explainability, more faithful control flow graph reconstruction, and better preservation of program behavior while potentially reducing the manual effort needed for reverse engineering.
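作为背景,隐蔽谓词(opaque predicate)是恒为真(或恒为假)但静态上不易判定的条件,用来插入永不执行的死分支。下例用 Python 示意混淆前后的对应关系(示例为笔者构造,并非论文基准中的 C 程序;去混淆的目标即在保持行为不变的前提下恢复右侧的简洁形式):

```python
def opaque_obfuscated(n):
    """x*(x+1) is even for every integer x, so the predicate below is
    always true and the else-branch is dead code; a deobfuscator should
    recognize this and remove the branch."""
    x = n
    if (x * (x + 1)) % 2 == 0:  # opaque predicate: always true
        return n + 1
    else:
        return n - 1            # unreachable dead branch

def deobfuscated(n):
    """Behavior-preserving readable version."""
    return n + 1
```

CoT 提示引导模型显式推理"该谓词对所有整数恒真",从而安全地消除死分支,这对应论文衡量的语义保持与控制流图重建两项指标。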
[AI-67] Exploring LLM-based Verilog Code Generation with Data-Efficient Fine-Tuning and Testbench Automation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在硬件描述语言(Hardware Description Language, HDL)生成任务中因训练数据和测试平台(testbench)稀缺而导致性能受限的问题。其解决方案的关键在于提出了一种基于多智能体(multi-agent)的自动化测试平台生成工作流,通过该流程构建高质量的细调数据集,从而使得在规范到Verilog转换任务中,使用较少训练数据即可达到与当前最先进方法相当的性能,显著提升了LLM在HDL生成领域的实用性和可扩展性。
链接: https://arxiv.org/abs/2604.15388
作者: Mu-Chi Chen,Po-Hsuan Huang,Yu-Hung Kao,Yen-Fu Liu,Yu-Kai Hung,Cheng Liang,Shao-Chun Ho,Chia-Heng Tu,Shih-Hao Hung
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models have improved code generation, but their use in hardware description languages is still limited. Moreover, training data and testbenches for these models are often scarce. This paper presents a workflow that uses multi-agent models to generate testbenches for high-quality fine-tuning data. By automating testbench creation, the fine-tuned model for the specification-to-Verilog task achieves performance comparable to state-of-the-art methods on the refined VerilogEval v2 benchmark while using less training data. This study provides a basis for future work on LLM-based HDL generation and automated verification.
[AI-68] LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
【速读】:该论文旨在解决当前软件工程控制研究中缺乏真实、复杂且多样化的生产环境测试基准的问题,以更有效地评估生成式 AI(Generative AI)在实际系统中的安全性和可控性。其解决方案的关键在于构建 LinuxArena——一个包含 20 个真实运行的多服务生产环境、1,671 个主任务(代表合法软件工程操作)和 184 个侧任务(如数据泄露与后门植入等安全失败场景)的控制设置,从而提供迄今为止最大且最多样化的软件工程控制测试平台。通过引入 sabotage evaluation 和 monitor evaluation 两种机制,该研究验证了 LinuxArena 在衡量攻击者隐蔽性与监控模型检测能力方面的有效性,并揭示了当前攻击策略尚未饱和该环境,为未来控制协议的发展提供了重要实验基础。
链接: https://arxiv.org/abs/2604.15384
作者: Tyler Tracy,Ram Potham,Nick Kuhn,Myles Heller,Anshul Khandelwal,Cody Rushing,Henri Lemoine,Miguel Brandao,Tomas Turlik,Adam Hanson,Josh Hills,Amy Ngo,Ram Rachum,Nik Mitchell,Falko Galperin,Oscar Sykes,Pip Arnott,Samuel Prieto Lima,Carlos Giudice,Matt Goldwater,Daniel Popp,Drew de Wet,Ruben Castaing,Qi Guo,Douw Marx,Benjamin Shaffrey,Justin Shenk,Martin Milbradt,Hannah Meagher,Shaheen Ahmed-Chowdhury,Daniel O’Connell,Chris Canal,Buck Shlegeris,Aryan Bhatt
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:We introduce LinuxArena, a control setting in which agents operate directly on live, multi-service production environments. LinuxArena contains 20 environments, 1,671 main tasks representing legitimate software engineering work, and 184 side tasks representing safety failures such as data exfiltration and backdooring, making it the largest and most diverse control setting for software engineering to date. We validate LinuxArena is useful for control research by running sabotage evaluations, which measure whether attackers can complete side tasks while working on main tasks, and monitor evaluations, which measure a monitor model’s ability to detect sabotage attempts. Against a GPT-5-nano trusted monitor at a 1% step-wise false positive rate, Claude Opus 4.6 achieves roughly a 23% undetected sabotage success rate. We additionally release LaStraj, a dataset of human-crafted attack trajectories that evade monitors at substantially higher rates than any model-generated attacks we elicited, showing that current attack policies do not saturate LinuxArena. These results suggest that LinuxArena has meaningful headroom for both attackers and defenders, making it a strong testbed for developing and evaluating future control protocols.
[AI-69] Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models ACL2026
【速读】:该论文旨在解决统一的大规模音频-语言模型(Large Audio-Language Models, LALMs)在推理过程中存在的**时间平滑偏差(temporal smoothing bias)**问题,即模型倾向于依赖语言先验所支持的时序平滑上下文,而忽略瞬态声学线索(transient acoustic cues),从而导致音频对齐输出不够精确。解决方案的关键在于提出一种无需训练的解码方法——时间对比解码(Temporal Contrastive Decoding, TCD):通过构建输入波形经时间平滑后的“慢路径”视图并重新编码,与原始输入的预测 logits 进行对比,利用该对比信号作为 token 级别的 logits 更新,仅作用于一个小候选集;同时引入自归一化的稳定性分数自动调节模糊窗口和更新幅度,并基于不确定性和音频依赖性设计逐步门控机制,在必要时激活更新,从而在不改变模型结构的前提下提升音频感知精度。
链接: https://arxiv.org/abs/2604.15383
作者: Yanda Li,Yuhan Liu,Zirui Song,Yunchao Wei,Martin Takáč,Salem Lahlou
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings
Abstract:Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a temporal smoothing bias: transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leading to less specific audio-grounded outputs. We propose Temporal Contrastive Decoding (TCD), a training-free decoding method for unified LALMs that mitigates this effect at inference time. TCD constructs a temporally blurred slow-path view by smoothing the input waveform and re-encoding it, then contrasts next-token logits from the original and slow-path views. The contrastive signal is applied as a token-level logit update restricted to a small candidate set. A self-normalized stability score sets the blur window and update scale, and a step-wise gate based on uncertainty and audio reliance activates the update only when needed. Experiments on MMAU and AIR-Bench show consistent improvements on strong unified LALMs. We further conduct ablations and an architectural applicability study to analyze the contributions of key components and how TCD behaves across large audio-language model designs.
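摘要中"将对比信号作为 token 级 logits 更新、且仅作用于一个小候选集"的步骤,可以用下面的 numpy 草图示意。注意 `alpha`、`top_k` 等参数名与取值均为演示用假设,并非论文的实际实现或取值:

```python
import numpy as np

def tcd_logit_update(logits_orig, logits_slow, alpha=1.0, top_k=10):
    """对比原始视图与时间平滑(慢路径)视图的 next-token logits,
    仅对原始 logits 最高的 top_k 个候选 token 施加对比更新(示意)。"""
    contrast = logits_orig - logits_slow            # 对比信号
    candidates = np.argsort(logits_orig)[-top_k:]   # 小候选集
    out = logits_orig.copy()
    out[candidates] += alpha * contrast[candidates]
    return out

rng = np.random.default_rng(0)
orig = rng.normal(size=50)                      # 原始视图 logits
slow = orig + rng.normal(scale=0.1, size=50)    # 慢路径视图 logits
updated = tcd_logit_update(orig, slow)
```

候选集之外的 token 保持原 logits 不变;论文中更新幅度还由自归一化稳定性分数与逐步门控进一步调节。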
[AI-70] VeriCWEty: Embedding enabled Line-Level CWE Detection in Verilog
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)生成的寄存器传输级(Register-Transfer Level, RTL)代码中普遍存在但难以被传统方法识别的常见漏洞与弱点(Common Vulnerabilities and Weaknesses, CWEs)问题。现有RTL漏洞检测技术依赖规则检查、形式属性或粗粒度结构分析,难以捕捉语义层面的漏洞且定位精度不足。论文提出了一种基于嵌入(embedding-based)的漏洞检测框架,其关键在于利用语义嵌入实现模块级和行级的漏洞检测与分类,从而在保持高精度的同时提升漏洞定位的细粒度能力,实验表明该方法对CWE-1244和CWE-1245等典型漏洞的识别精确率(precision)约为89%,行级漏洞检测准确率高达96%。
链接: https://arxiv.org/abs/2604.15375
作者: Prithwish Basu Roy,Zeng Wang,Anatolii Chuvashlov,Weihua Xiao,Johann Knechtel,Ozgur Sinanoglu,Ramesh Karri
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Large Language Models (LLMs) have shown significant improvement in RTL code generation. Despite the advances, the generated code is often riddled with common vulnerabilities and weaknesses (CWEs) that can slip by untrained eyes. Attackers can often exploit these weaknesses to fulfill their nefarious motives. Existing RTL bug-detection techniques rely on rule-based checks, formal properties, or coarse-grained structural analysis, which either fail to capture semantic vulnerabilities or lack precise localization. In our work, we bridge this gap by proposing an embedding-based bug-detection framework that detects and classifies bugs at both module and line-level granularity. Our method achieves about 89% precision in identifying common CWEs such as CWE-1244 and CWE-1245, and 96% accuracy in detecting line-level bugs.
[AI-71] The Synthetic Media Shift: Tracking the Rise, Virality and Detectability of AI-Generated Multimodal Misinformation
【速读】:该论文旨在解决生成式 AI(Generative AI)发展背景下,真实与合成媒体界限日益模糊所引发的多模态虚假信息传播问题,特别是其在在线信息生态中的扩散机制与检测挑战。解决方案的关键在于构建并分析一个大规模多模态虚假信息数据集 CONVEX,该数据集包含超过 15 万条带有注释和互动指标的多模态帖子,涵盖误标注、编辑及 AI 生成内容;通过量化分析其病毒式传播、用户参与度和社区共识动态,揭示了 AI 生成内容虽初始报告较慢但一旦被标记则快速达成共识,且依赖被动参与而非主动讨论驱动传播;同时指出当前专用检测器和视觉-语言模型在面对不断演进的生成模型时性能持续下降,凸显出需建立持续监测与自适应策略以应对快速变化的数字信息环境。
链接: https://arxiv.org/abs/2604.15372
作者: Zacharias Chrysidis,Stefanos-Iordanis Papadopoulos,Symeon Papadopoulos
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:As generative AI advances, the distinction between authentic and synthetic media is increasingly blurred, challenging the integrity of online information. In this study, we present CONVEX, a large-scale dataset of multimodal misinformation involving miscaptioned, edited, and AI-generated visual content, comprising over 150K multimodal posts with associated notes and engagement metrics from X’s Community Notes. We analyze how multimodal misinformation evolves in terms of virality, engagement, and consensus dynamics, with a focus on synthetic media. Our results show that while AI-generated content achieves disproportionate virality, its spread is driven primarily by passive engagement rather than active discourse. Despite slower initial reporting, AI-generated content reaches community consensus more quickly once flagged. Moreover, our evaluation of specialized detectors and vision-language models reveals a consistent decline in performance over time in distinguishing synthetic from authentic images as generative models evolve. These findings highlight the need for continuous monitoring and adaptive strategies in the rapidly evolving digital information environment.
[AI-72] Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
【速读】:该论文旨在解决移动边缘计算中模型推理延迟(inference latency)精确估计的问题,尤其是在动态电压频率调节(Dynamic Voltage and Frequency Scaling, DVFS)环境下,传统静态 profiling 方法因处理器(CPU/GPU)频率变化导致延迟波动而失效。针对现有方法在小型语言模型(Small Language Models, SLMs)上因上下文长度可变而导致的昂贵、耗时的全频点采样问题,本文提出 FLAME,其核心创新在于采用分层建模(layer-wise modeling)机制,量化各层中的重叠并行性(overlapping parallelism),并聚合由 CPU(内核启动)与 GPU(执行)异步耦合引发的动态流水线气泡(pipeline bubbles),从而实现跨频率组合的高精度延迟预测。该自底向上的建模策略显著提升了对不同深度神经网络(DNN)和 SLM 的泛化能力,并支持仅用稀疏采样即可完成高效建模,将 DNN 和 SLM 的 profiling 时间分别从小时级和天级缩短至分钟级,同时保持小误差范围,最终在基于截止时间感知的 DVFS 中实现优于当前最优方案的能效与延迟保障性能。
链接: https://arxiv.org/abs/2604.15357
作者: Jiesong Chen,Jun You,Zhidan Liu,Zhenjiang Li
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However, the ubiquity of Dynamic Voltage and Frequency Scaling (DVFS) renders traditional static profiling invalid in real-world deployments, as inference latency fluctuates with varying processor (CPU and GPU) frequencies. While extensive profiling across frequency combinations is theoretically possible, it is prohibitively expensive, particularly for emerging Small Language Models (SLMs), where variable context lengths explode the profiling up to days. We observe that simple analytic scaling fails to predict these fluctuations due to the complex asynchronous coupling between CPU (kernel launching) and GPU (execution). In this paper, we introduce FLAME to accurately estimate inference latency across frequency combinations. It features a novel layer-wise modeling that quantifies the overlapping parallelism and then aggregates dynamic pipeline bubbles caused by asynchronous processor interactions when extending to the full model. This bottom-up approach ensures generalizability across diverse models from DNNs to SLMs, and its precise modeling allows for profiling a sparse subset of samples, cutting DNN profiling from hours to minutes and SLM profiling from days to mere minutes, while maintaining small estimation errors across frequencies. We further showcase FLAME’s utility in a deadline-aware DVFS, outperforming the state-of-the-art approach in both power efficiency and latency guarantees.
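摘要指出 CPU(内核启动)与 GPU(执行)的异步耦合会产生动态流水线气泡,使延迟无法按频率简单解析缩放。下面用一个玩具模型直观说明这一现象(纯示意,FLAME 的逐层建模远比这精细):

```python
def pipeline_latency(cpu_launch, gpu_exec):
    """玩具模型:CPU 串行入队内核,GPU 串行执行;第 i 个内核
    只能在"已入队"且"GPU 空闲"后开始,二者之差即流水线气泡。
    返回最后一个内核完成的时刻。"""
    cpu_free = 0.0  # CPU 完成当前入队的时刻
    gpu_free = 0.0  # GPU 空闲的时刻
    for launch, exec_t in zip(cpu_launch, gpu_exec):
        launched = cpu_free + launch
        start = max(launched, gpu_free)  # 气泡 = start - gpu_free
        cpu_free = launched
        gpu_free = start + exec_t
    return gpu_free

gpu_bound = pipeline_latency([1, 1, 1], [10, 10, 10])  # 启动开销被执行掩盖
cpu_bound = pipeline_latency([10, 10, 10], [1, 1, 1])  # 入队慢于执行,产生气泡
```

两种情形的总延迟相同(均为 31),但对 CPU 频率与 GPU 频率变化的敏感度完全不同,这正是简单解析缩放失效、需要逐层建模的原因。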
[AI-73] Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
【速读】:该论文旨在解决Transformer模型中键值缓存(KV cache)的高效压缩问题,现有方法如TurboQuant虽已逼近每向量压缩的香农熵极限,但忽略了KV缓存本质上是序列数据这一关键特性。解决方案的核心在于提出一种两层结构的顺序KV压缩机制:第一层为概率前缀去重(probabilistic prefix deduplication),利用概率语言树(Probabilistic Language Tries, PLTs)中的trie距离 dT(s,s′)=−log2PM(s∧s′) 识别跨会话语义等价的共享前缀;第二层为预测性差分编码(predictive delta coding),仅存储每个新KV向量与其自身模型预测值之间的残差,从而实现基于条件熵 H(KVi+1∣KV≤i)=H(tokeni+1∣token≤i) 的压缩。理论分析表明,在典型语言模型困惑度下(约10–20),该方法平均每标记位置压缩至3.3–4.3比特,相较TurboQuant的每向量组件3比特具有高达914,000倍的理论压缩比,且在实际场景中仍保持显著优势。
链接: https://arxiv.org/abs/2604.15356
作者: Gregory Magarshak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注: 22 Pages
Abstract:Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data – they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s′) = -log_2 P_M(s ∧ s′) from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model’s own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{≤i}) = H(token_{i+1} | token_{≤i}). We prove that at typical language model perplexity – approximately 10-20 for fluent English text – this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant’s 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor – a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders – the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
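摘要中的 trie 度量 d_T(s, s′) = -log2 P_M(s ∧ s′)(s ∧ s′ 为两序列的最长公共前缀)可以按定义直接计算,只要有一个能给出前缀概率的模型 P_M。下面的 `toy_prob` 是演示用的玩具概率函数,并非论文的 PLT 实现:

```python
import math

def common_prefix(a, b):
    """两个 token 序列的最长公共前缀 s ∧ s′。"""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def trie_distance(a, b, prob):
    """d_T(s, s′) = -log2 P_M(s ∧ s′):返回值即公共前缀的
    信息量(比特)。prob 由调用方提供(示意)。"""
    return -math.log2(prob(common_prefix(a, b)))

# 玩具模型:每个 token 概率均为 0.5(空前缀概率为 0.5**0 = 1)
toy_prob = lambda s: 0.5 ** len(s)
d = trie_distance([1, 2, 3, 9], [1, 2, 3, 7], toy_prob)  # 公共前缀长 3,d = 3 比特
```

公共前缀越长、概率越低,其信息量越大,也就越值得在跨会话前缀去重中共享存储。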
[AI-74] ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis
【速读】:该论文旨在解决计算X射线吸收近边结构(XANES)在大规模应用中因工作流复杂性导致的效率瓶颈问题,而非受限于底层模拟方法本身。其关键解决方案是提出ChemGraph-XANES框架,该框架通过将自然语言任务描述、结构获取、FDMNES输入生成、任务并行执行、谱图归一化及可溯源的数据整理等步骤统一为类型化的Python工具,并借助大语言模型(LLM)代理进行编排,实现了自动化XANES模拟与分析。该框架基于ASE、FDMNES、Parsl以及LangGraph/LangChain构建,支持显式结构文件输入和化学层面自然语言请求,且因XANES计算天然具备任务并行特性,特别适用于高性能计算(HPC)环境下的高通量部署,从而为下游分析和机器学习应用提供可复现、可扩展的物理驱动型XANES数据库生成能力。
链接: https://arxiv.org/abs/2604.16205
作者: Vitor F. Grizzi,Thang Duc Pham,Luke N. Pretzie,Jiayi Xu,Murat Keceli,Cong Liu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注:
Abstract:Computational X-ray absorption near-edge structure (XANES) is widely used to probe local coordination environments, oxidation states, and electronic structure in chemically complex systems. However, the use of computational XANES at scale is constrained more by workflow complexity than by the underlying simulation method itself. To address this challenge, we present ChemGraph-XANES, an agentic framework for automated XANES simulation and analysis that unifies natural-language task specification, structure acquisition, FDMNES input generation, task-parallel execution, spectral normalization, and provenance-aware data curation. Built on ASE, FDMNES, Parsl, and a LangGraph/LangChain-based tool interface, the framework exposes XANES workflow operations as typed Python tools that can be orchestrated by large language model (LLM) agents. In multi-agent mode, a retrieval-augmented expert agent consults the FDMNES manual to ground parameter selection, while executor agents translate user requests into structured tool calls. We demonstrate documentation-grounded parameter retrieval and show that the same workflow supports both explicit structure-file inputs and chemistry-level natural-language requests. Because independent XANES calculations are naturally task-parallel, the framework is well suited for high-throughput deployment on high-performance computing (HPC) systems, enabling scalable XANES database generation for downstream analysis and machine-learning applications. ChemGraph-XANES thus provides a reproducible and extensible workflow layer for physics-based XANES simulation, spectral curation, and agent-compatible computational spectroscopy.
[AI-75] Phase Transitions as the Breakdown of Statistical Indistinguishability
【速读】:该论文旨在解决相变识别中依赖于特定序参量(order parameter)的局限性问题,提出了一种不依赖模型细节和序参量信息的通用相变判定框架。其解决方案的关键在于将相变重新定义为:在热力学极限下,当参数扰动趋于零时,系统状态统计不可区分性的失效(即统计可区分性的突破)。作者通过引入一种分布无关的两样本游程检验(two-sample run test),在无需预先知道序参量的情况下,准确识别出二维伊辛模型(Ising model)的临界点,从而验证了该框架的普适性和有效性。
链接: https://arxiv.org/abs/2604.15773
作者: Taiyo Narita,Hideyuki Miyahara
机构: 未知
类目: atistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:We introduce a novel characterization of phase transitions based on hypothesis testing. In our formulation, a phase transition is defined as the breakdown of statistical indistinguishability under vanishing parameter perturbations in the thermodynamic limit. This perspective provides a general, order-parameter-free framework that does not rely on model-specific insights or learning procedures. We show that conventional approaches, such as those based on the Binder parameter, can be reinterpreted as special cases within this framework. As a concrete realization, we employ a distribution-free two-sample run test and demonstrate that the critical point of the two-dimensional Ising model is accurately identified without prior knowledge of the order parameter.
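摘要采用的分布无关两样本游程检验(Wald-Wolfowitz run test)思路是:把两组样本合并排序、按来源打标签并统计"游程"数;若两分布可区分,游程会显著偏少。以下为该经典检验的通用 numpy 草图,并非论文代码:

```python
import numpy as np

def run_test(x, y):
    """两样本 Wald-Wolfowitz 游程检验:返回游程数以及
    H0(两样本同分布)下的近似标准正态统计量 z;
    z 显著为负即"统计不可区分性被打破"。"""
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    order = np.argsort(np.concatenate([x, y]), kind="stable")
    seq = labels[order]
    runs = 1 + int(np.sum(seq[1:] != seq[:-1]))
    n, m = len(x), len(y)
    mu = 1 + 2 * n * m / (n + m)  # H0 下游程数期望
    var = 2 * n * m * (2 * n * m - n - m) / ((n + m) ** 2 * (n + m - 1))
    return runs, (runs - mu) / var ** 0.5

rng = np.random.default_rng(1)
same_runs, same_z = run_test(rng.normal(size=200), rng.normal(size=200))
diff_runs, diff_z = run_test(rng.normal(size=200), rng.normal(5.0, 1.0, size=200))
```

同分布样本的 z 接近 0;分布明显不同时两组样本几乎不再交错,z 强烈为负。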
[AI-76] Exascale Multi-Task Graph Foundation Models for Imbalanced Multi-Fidelity Atomistic Data
【速读】:该论文旨在解决材料发现中因第一性原理计算(first-principles computation)成本过高而导致的高通量筛选难题,尤其在大规模化学空间探索中的效率瓶颈。其关键解决方案是构建一个基于HydraGNN的原子图基础模型(atomistic graph foundation model),通过联合训练16个开放的第一性原理数据集(覆盖85+元素、超5.44亿结构),采用多任务架构与每数据集专用头,并结合可扩展的ADIOS2/DDStore数据管道,在Frontier超级计算机上执行六次大规模DeepHyper超参数优化,最终获得性能最优的消息传递模型(PaiNN-based lead model)。该模型实现了百亿级原子结构的快速评估(50秒内完成11亿结构预测),显著压缩了传统第一性原理计算所需数年的工作量,同时支持数据稀缺场景下的下游任务微调和跨平台强弱扩展性,从而实现高效、可靠的化学设计空间探索。
链接: https://arxiv.org/abs/2604.15380
作者: Massimiliano Lupo Pasini,Jong Youl Choi,Kshitij Mehta,Richard Messerly,Rylie Weaver,Linda Ungerboeck,Isaac Lyngaas,Benajmin Stump,Ashwin M. Aji,Karl W. Schulz,Jorda Polo
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 12 pages; 5 figures; 15 tables
Abstract:We present an exascale workflow for materials discovery using atomistic graph foundation models built on HydraGNN. We jointly train on 16 open first-principles datasets (544+ million structures covering 85+ elements) using a multi-task architecture with per-dataset heads and a scalable ADIOS2/DDStore data pipeline. On Frontier, we execute six large-scale DeepHyper hyperparameter optimization campaigns in FP64 and promote the top-performing message-passing models to sustained 2,048-node training, yielding a PaiNN-based lead model. The resulting model enables billion-scale screening, evaluating 1.1 billion atomistic structures in 50 seconds, compressing a workload that would require years of first-principles computation, and supports data-scarce fine-tuning across diverse downstream tasks. We quantify precision-performance tradeoffs (BF16/FP32/FP64), demonstrate transfer across twelve chemically diverse downstream tasks, and establish seamless strong- and weak-scaling across Frontier, Aurora, and Perlmutter. This work allows fast and reliable exploration of vast chemical design spaces that are otherwise inaccessible to first-principles methods.
[AI-77] Seeing the imagined: a latent functional alignment in visual imagery decoding from fMRI data
【速读】:该论文旨在解决当前基于fMRI的视觉脑解码模型在心理意象(mental-imagery)场景下性能不佳的问题,尤其是在缺乏与意象数据直接匹配的监督信号时,如何有效利用感知(perception)训练中获得的语义结构来提升意象重建质量。其解决方案的关键在于提出一种潜在功能对齐(latent functional alignment)方法,该方法将意象诱发的fMRI活动映射到预训练生成模型的条件空间中,同时冻结其余组件以保持模型稳定性;此外,为缓解意象-感知监督样本稀缺问题,进一步引入基于检索的增强策略,选择语义相关的自然场景感知试验作为补充监督信号。实验证明,该方法在四个受试者中均显著优于固定预训练基线和体素空间岭对齐基线,在多个皮层区域实现了高于随机水平的意象解码性能。
链接: https://arxiv.org/abs/2604.15374
作者: Fabrizio Spera,Tommaso Boccato,Michal Olak,Sara Cammarota,Matteo Ciferri,Michelangelo Tronti,Nicola Toschi,Matteo Ferrante
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Recent progress in visual brain decoding from fMRI has been enabled by large-scale datasets such as the Natural Scenes Dataset (NSD) and powerful diffusion-based generative models. While current pipelines are primarily optimized for perception, their performance under mental-imagery remains less well understood. In this work, we study how a state-of-the-art (SOTA) perception decoder (DynaDiff) can be adapted to reconstruct imagined content from the Imagery-NSD benchmark. We propose a latent functional alignment approach that maps imagery-evoked activity into the pretrained model’s conditioning space, while keeping the remaining components frozen. To mitigate the limited amount of matched imagery-perception supervision, we further introduce a retrieval-based augmentation strategy that selects semantically related NSD perception trials. Across four subjects, latent functional alignment consistently improves high-level semantic reconstruction metrics relative to the frozen pretrained baseline and a voxel-space ridge alignment baseline, and enables above-chance decoding from multiple cortical regions. These results suggest that semantic structure learned from perception can be leveraged to stabilize and improve visual imagery decoding under out-of-distribution conditions.
机器学习
[LG-0] Geometric regularization of autoencoders via observed stochastic dynamics
链接: https://arxiv.org/abs/2604.16282
作者: Sean Hill,Felix X.-F. Ye
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR)
*备注:
Abstract:Stochastic dynamical systems with slow or metastable behavior evolve, on long time scales, on an unknown low-dimensional manifold in high-dimensional ambient space. Building a reduced simulator from short-burst ambient ensembles is a long-standing problem: local-chart methods like ATLAS suffer from exponential landmark scaling and per-step reprojection, while autoencoder alternatives leave tangent-bundle geometry poorly constrained, and the errors propagate into the learned drift and diffusion. We observe that the ambient covariance Λ already encodes coordinate-invariant tangent-space information, its range spanning the tangent bundle. Using this, we construct a tangent-bundle penalty and an inverse-consistency penalty for a three-stage pipeline (chart learning, latent drift, latent diffusion) that learns a single nonlinear chart and the latent SDE. The penalties induce a function-space metric, the ρ-metric, strictly weaker than the Sobolev H^1 norm yet achieving the same chart-quality generalization rate up to logarithmic factors. For the drift, we derive an encoder-pullback target via Itô’s formula on the learned encoder and prove a bias decomposition showing the standard decoder-side formula carries systematic error for any imperfect chart. Under a W^{2,∞} chart-convergence assumption, chart-level error propagates controllably to weak convergence of the ambient dynamics and to convergence of radial mean first-passage times. Experiments on four surfaces embedded in up to 201 ambient dimensions reduce radial MFPT error by 50–70% under rotation dynamics and achieve the lowest inter-well MFPT error on most surface–transition pairs under metastable Müller–Brown Langevin dynamics, while reducing end-to-end ambient coefficient errors by up to an order of magnitude relative to an unregularized autoencoder.
[LG-1] Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design
链接: https://arxiv.org/abs/2604.16279
作者: Shriram Chennakesavalu,Kirill Shmilovich,Hayley Weir,Colin Grambow,John Bradshaw,Patricia Suriana,Chen Cheng,Kangway Chuang
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery; by combining carefully-designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.
[LG-2] FL-MHSM: Spatially-adaptive Fusion and Ensemble Learning for Flood-Landslide Multi-Hazard Susceptibility Mapping at Regional Scale
链接: https://arxiv.org/abs/2604.16265
作者: Aswathi Mundayatt,Jaya Sreevalsan-Nair
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing multi-hazard susceptibility mapping (MHSM) studies often rely on spatially uniform models, treat hazards independently, and provide limited representation of cross-hazard dependence and uncertainty. To address these limitations, this study proposes a deep learning (DL) workflow for joint flood-landslide multi-hazard susceptibility mapping (FL-MHSM) that combines two-level spatial partitioning, probabilistic Early Fusion (EF), a tree-based Late Fusion (LF) baseline, and a soft-gating Mixture of Experts (MoE) model, with MoE serving as final predictive model. The proposed design preserves spatial heterogeneity through zonal partitions and enables data-parallel large-area prediction using overlapping lattice grids. In Kerala, EF remained competitive with LF, improving flood recall from 0.816 to 0.840 and reducing Brier score from 0.092 to 0.086, while MoE provided strongest performance for flood susceptibility, achieving an AUC-ROC of 0.905, recall of 0.930, and F1-score of 0.722. In Nepal, EF similarly improved flood recall from 0.820 to 0.858 and reduced Brier score from 0.057 to 0.049 relative to LF, while MoE outperformed both EF and LF for landslide susceptibility, achieving an AUC-ROC of 0.914, recall of 0.901, and F1-score of 0.559. GeoDetector analysis of MoE outputs further showed that dominant factors varied more across zones in Kerala, where susceptibility was shaped by different combinations of topographic, land-cover, and drainage-related controls, while Nepal showed a more consistent influence of topographic and glacier-related factors across zones. These findings show that EF and LF provide complementary predictive behavior, and that their spatially adaptive integration through MoE yields robust overall predictive performance for FL-MHSM while supporting interpretable characterization of multi-hazard susceptibility in spatially heterogeneous landscapes.
[LG-3] Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction
链接: https://arxiv.org/abs/2604.16238
作者: Hannah Guan,Soukayna Mouatadid,Paulo Orenstein,Judah Cohen,Haiyu Dong,Zekun Ni,Jeremy Berman,Genevieve Flaspohler,Alex Lu,Jakob Schloer,Joshua Talib,Jonathan A. Weyn,Lester Mackey
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
*备注:
Abstract:Decision-makers rely on weather forecasts to plant crops, manage wildfires, allocate water and energy, and prepare for weather extremes. Today, such forecasts enjoy unprecedented accuracy out to two weeks thanks to steady advances in physics-based dynamical models and data-driven artificial intelligence (AI) models. However, model skill drops precipitously at subseasonal timescales (2 - 6 weeks ahead), due to compounding errors and persistent biases. To counter this degradation, we introduce probabilistic bias correction (PBC), a machine learning framework that substantially reduces systematic error by learning to correct historical probabilistic forecasts. When applied to the leading dynamical and AI models from the European Centre for Medium-Range Weather Forecasts (ECMWF), PBC doubles the subseasonal skill of the AI Forecasting System and improves the skill of the operationally-debiased dynamical model for 91% of pressure, 92% of temperature, and 98% of precipitation targets. We designed PBC for operational deployment, and, in ECMWF’s 2025 real-time forecasting competition, its global forecasts placed first for all weather variables and lead times, outperforming the dynamical models from six operational forecasting centers, an international dynamical multi-model ensemble, ECMWF’s AI Forecasting System, and the forecasting systems of 34 teams worldwide. These probabilistic skill gains translate into more accurate prediction of extreme events and have the potential to improve agricultural planning, energy management, and disaster preparedness in vulnerable communities.
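摘要未给出 PBC 的具体算法;作为"从历史预报-观测对中学习纠偏"这一思路的最简经典示例,下面用经验分位数映射(quantile mapping)做概念演示。这只是传统统计纠偏的示意,并非 PBC 本身(PBC 是学习式的概率框架):

```python
import numpy as np

def quantile_map(value, hist_forecast, hist_obs):
    """经验分位数映射:把预报值换算成它在历史预报分布中的分位数,
    再取历史观测分布在同一分位数处的取值,以消除系统性偏差。"""
    q = np.searchsorted(np.sort(hist_forecast), value) / len(hist_forecast)
    return float(np.quantile(hist_obs, np.clip(q, 0.0, 1.0)))

# 构造一个恒定 +2 度暖偏差的"历史预报"
hist_obs = np.linspace(0.0, 10.0, 101)
hist_fc = hist_obs + 2.0
corrected = quantile_map(7.0, hist_fc, hist_obs)  # 纠偏后约为 5.0
```

这种静态映射只能消除平稳的系统偏差;PBC 的优势在于用机器学习对概率预报进行条件化的动态修正。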
[LG-4] OT on the Map: Quantifying Domain Shifts in Geographic Space
链接: https://arxiv.org/abs/2604.16220
作者: Haoran Zhang,Livia Betti,Konstantin Klemmer,Esther Rolf,David Alvarez-Melis
类目: Machine Learning (cs.LG)
*备注:
Abstract:In computer vision and machine learning for geographic data, out-of-domain generalization is a pervasive challenge, arising from uneven global data coverage and distribution shifts across geographic regions. Though models are frequently trained in one region and deployed in another, there is no principled method for determining when this cross-region adaptation will be successful. A well-defined notion of distance between distributions can effectively quantify how different a new target domain is compared to the domains used for model training, which in turn could support model training and deployment decisions. In this paper, we propose a strategy for computing distances between geospatial domains that leverages geographic information with Optimal Transport methods (GeoSpOT). In our experiments, GeoSpOT distances emerge as effective predictors of cross-domain transfer difficulty. We further demonstrate that embeddings from pretrained location encoders provide information comparable to image/text embeddings, despite relying solely on longitude-latitude pairs as input. This allows users to get an approximation of out-of-domain performance for geospatial models, even when the exact downstream task is unknown, or no task-specific data is available. Building on these findings, we show that GeoSpOT distances can preemptively guide data selection and enable predictive tools to analyze regions where a model is likely to underperform.
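摘要中"分布之间的距离"的一种常用可计算形式是熵正则化最优传输(Sinkhorn)。下面给出两个区域嵌入点云之间 OT 代价的通用 numpy 草图;这并非 GeoSpOT 的实现,实际使用建议采用 POT 等成熟库:

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=0.05, iters=300):
    """两组特征向量(如两个地区的位置编码嵌入)经验分布之间的
    熵正则化 OT 代价;代价矩阵归一化以保证数值稳定。"""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # 平方欧氏代价
    K = np.exp(-C / C.max() / eps)                      # Gibbs 核
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(iters):           # Sinkhorn 交替缩放
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # 传输计划
    return float((P * C).sum())

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(64, 2))
near = sinkhorn_cost(A, rng.normal(0.0, 1.0, size=(64, 2)))  # 近似同分布
far = sinkhorn_cost(A, rng.normal(4.0, 1.0, size=(64, 2)))   # 明显分布偏移
```

距离越大,提示模型跨域部署时性能下降风险越高,与摘要中"OT 距离可预测跨域迁移难度"的结论相呼应。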
[LG-5] Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
链接: https://arxiv.org/abs/2604.16197
作者: Yide Ran,Jianwen Xie,Minghui Wang,Wenjin Zheng,Denghui Zhang,Chuan Li,Zhaozhuo Xu
类目: Machine Learning (cs.LG)
*备注: 54 pages
Abstract:Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112× compared to RapidIn and scales to 32B parameters LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training examples for specific predictions, and (2) prospective valuation, scoring candidate data utility zero-shot. We validate RISE on three tasks: Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection. In a closed-loop Brain Rot study, continued pretraining on RISE-selected data yields consistent downstream improvements. Overall, RISE provides a practical and scalable primitive for influence analysis and training-data selection in modern large language models.
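摘要中对双通道表示施加的 CountSketch 投影,其关键性质是"草图间内积近似原向量内积",从而在强压缩下保留归因所需的相似度信息。下面是标准 CountSketch 的通用草图(非论文完整流水线):

```python
import numpy as np

def countsketch(x, width, seed=0):
    """CountSketch:每个坐标被哈希到 width 个桶之一并乘以随机符号;
    使用相同 seed(即相同哈希)的两个草图,其内积近似原始内积。"""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, width, size=len(x))   # 桶哈希
    s = rng.choice([-1.0, 1.0], size=len(x))  # 符号哈希
    out = np.zeros(width)
    np.add.at(out, h, s * x)                  # 带符号散列累加
    return out

rng = np.random.default_rng(1)
u, v = rng.normal(size=4096), rng.normal(size=4096)
su, sv = countsketch(u, 512), countsketch(v, 512)  # 8 倍压缩
exact, approx = float(u @ v), float(su @ sv)
```

width 控制压缩率与近似误差的权衡;摘要报告 RISE 借助此类投影最高实现 112× 的索引压缩。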
[LG-6] Tabular foundation models for in-context prediction of molecular properties
链接: https://arxiv.org/abs/2604.16123
作者: Karim K. Ben Hicham,Jan G. Rittig,Martin Grohe,Alexander Mitsos
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning transferable molecular representations; however, they typically involve task-specific fine-tuning, require machine learning expertise, and often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they perform predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime across both standardized pharmaceutical benchmarks and chemical engineering datasets. We evaluate both frozen molecular foundation model representations, as well as classical descriptors and fingerprints. Across the benchmarks, the approach shows excellent predictive performance while reducing computational cost, compared to fine-tuning, with these advantages also transferring to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields up to 100% win rates on 30 MoleculeACE tasks, while compact RDKit2d and Mordred descriptors provide strong descriptor-based alternatives. Molecular representation emerges as a key determinant in TFM performance, with molecular foundation model embeddings and 2D descriptor sets both providing substantial gains over classic molecular fingerprints on many tasks. These results suggest that in-context learning with TFMs provides a highly accurate and cost-efficient alternative for property prediction in practical applications.
[LG-7] Univariate Channel Fusion for Multivariate Time Series Classification ICPR2026
链接: https://arxiv.org/abs/2604.16119
作者: Fernando Moro,Vinicius M. A. Souza
类目: Machine Learning (cs.LG)
*备注: International Conference on Pattern Recognition (ICPR 2026)
Abstract:Multivariate time series classification (MTSC) plays a crucial role in various domains, including biomedical signal analysis and motion monitoring. However, existing approaches, particularly deep learning models, often require high computational resources, making them unsuitable for real-time applications or deployment on low-cost hardware, such as IoT devices and wearable systems. In this paper, we propose the Univariate Channel Fusion (UCF) method to deal with MTSC efficiently. UCF transforms multivariate time series into a univariate representation through simple channel fusion strategies such as the mean, median, or dynamic time warping barycenter. This transformation enables the use of any classifier originally designed for univariate time series, providing a flexible and computationally lightweight alternative to complex models. We evaluate UCF in five case studies covering diverse application domains, including chemical monitoring, brain-computer interfaces, and human activity analysis. The results demonstrate that UCF often outperforms baseline methods and state-of-the-art algorithms tailored for MTSC, while achieving substantial gains in computational efficiency, being particularly effective in problems with high inter-channel correlation.
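The fusion step described above reduces to a simple per-timestep aggregation across channels. A minimal sketch of the mean/median strategies (the DTW-barycenter variant from the paper is omitted, and names are illustrative):

```python
import numpy as np

def fuse_channels(X, strategy="mean"):
    """UCF-style channel fusion: collapse a multivariate series X of shape
    (n_channels, length) into one univariate series of shape (length,)."""
    if strategy == "mean":
        return X.mean(axis=0)
    if strategy == "median":
        return np.median(X, axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy example: three highly correlated channels over 8 time steps
base = np.sin(np.linspace(0.0, np.pi, 8))
X = np.stack([base, base + 0.1, base - 0.1])
u = fuse_channels(X, "mean")  # u can now feed any univariate time-series classifier
```

The payoff of this transformation is that the fused series is a plain 1-D array, so any univariate classifier applies unchanged, which is where the method's computational savings come from.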
[LG-8] Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model ALT2021
链接: https://arxiv.org/abs/2604.16111
作者: Jean Tarbouriech,Matteo Pirotta,Michal Valko,Alessandro Lazaric
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 32nd International Conference on Algorithmic Learning Theory (ALT 2021)
Abstract:We study the sample complexity of learning an \epsilon -optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with S states, A actions, minimum cost c_\min , and maximum expected cost of the optimal policy over all states B_\star , where any algorithm requires at least \Omega(SAB_\star^3/(c_\min\epsilon^2)) samples to return an \epsilon -optimal policy with high probability. Surprisingly, this implies that whenever c_\min = 0 an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when c_\min = 0 , but only under the condition that the optimal policy has a bounded hitting time to the goal state.
[LG-9] The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback ICML2025
链接: https://arxiv.org/abs/2604.16087
作者: Côme Fiegel,Pierre Ménard,Tadashi Kozuno,Michal Valko,Vianney Perchet
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 42nd International Conference on Machine Learning (ICML 2025)
Abstract:We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of \mathcal{O}(T^{-1/8}) on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being \Omega(T^{-1/4}) in contrast to the usual \Omega(T^{-1/2}) rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
[LG-10] Towards Universal Convergence of Backward Error in Linear System Solvers
链接: https://arxiv.org/abs/2604.16075
作者: Michał Dereziński,Yuji Nakatsukasa,Elizaveta Rebrova
类目: Numerical Analysis (math.NA); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The quest for an algorithm that solves an n \times n linear system in O(n^2) time complexity, or O(n^2 \text{poly}(1/\epsilon)) when solving up to \epsilon relative error, is a long-standing open problem in numerical linear algebra and theoretical computer science. There are two predominant paradigms for measuring relative error: forward error (i.e., distance from the output to the optimum solution) and backward error (i.e., distance to the nearest problem solved by the output). In most prior studies, convergence of iterative linear system solvers is measured via various notions of forward error, and as a result, depends heavily on the conditioning of the input. Yet, the numerical analysis literature has long advocated for backward error as the more practically relevant notion of approximation. In this work, we show that – surprisingly – the classical and simple Richardson iteration incurs at most 1/k (relative) backward error after k iterations on any positive semidefinite (PSD) linear system, irrespective of its condition number. This universal convergence rate implies an O(n^2/\epsilon) complexity algorithm for solving a PSD linear system to \epsilon backward error, and we establish similar or better complexity when using a variety of Krylov solvers beyond Richardson. Then, by directly minimizing backward error over a Krylov subspace, we attain an even faster O(1/k^2) universal rate, and we turn this into an efficient algorithm, MINBERR, with complexity O(n^2/\sqrt{\epsilon}) . We extend this approach via normal equations to solving general linear systems, for which we empirically observe O(1/k) convergence. We report strong numerical performance of our algorithms on benchmark problems.
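The Richardson claim above is easy to probe numerically. A sketch, assuming the standard normwise (Rigal-Gaches) relative backward error, which may differ in detail from the paper's exact definition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Ill-conditioned PSD system: eigenvalues spanning ten orders of magnitude
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (Q * np.logspace(-10, 0, n)) @ Q.T   # A = Q diag(eigs) Q^T, cond(A) ~ 1e10
b = rng.standard_normal(n)

norm_A = np.linalg.norm(A, 2)
norm_b = np.linalg.norm(b)

def backward_error(x):
    # Normwise relative backward error ||b - Ax|| / (||A|| ||x|| + ||b||)
    return np.linalg.norm(b - A @ x) / (norm_A * np.linalg.norm(x) + norm_b)

# Richardson iteration x_{k+1} = x_k + (1/||A||)(b - A x_k)
alpha = 1.0 / norm_A
x = np.zeros(n)
errs = []
for k in range(100):
    x = x + alpha * (b - A @ x)
    errs.append(backward_error(x))
# The backward error keeps shrinking even though the forward error stalls
# on the tiny-eigenvalue components.
```

Intuitively, the residual along near-null directions barely shrinks, but the iterate itself grows along those same directions, so the denominator of the backward error grows roughly like k, reproducing the 1/k behavior.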
[LG-11] Constant-Factor Approximations for Doubly Constrained Fair k-Center k-Median and k-Means
链接: https://arxiv.org/abs/2604.16061
作者: Nicole Funk,Annika Hennes,Johanna Hillebrand,Sarah Sturm
类目: Data Structures and Algorithms (cs.DS); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 30 pages, 3 figures
Abstract:We study discrete k-clustering problems in general metric spaces that are constrained by a combination of two different fairness conditions within the demographic fairness model. Given a metric space (P,d), where every point in P is equipped with a protected attribute, and a number k, the goal is to partition P into k clusters with a designated center each, such that a center-based objective function is minimized and the attributes are fairly distributed with respect to the following two fairness concepts: 1) group fairness: We aim for clusters with balanced numbers of attributes by specifying lower and upper bounds for the desired attribute proportions. 2) diverse center selection: Clusters have natural representatives, i.e., their centers. We ask for a balanced set of representatives by specifying the desired number of centers to choose from each attribute. Dickerson, Esmaeili, Morgenstern and Zhang (2023) denote the combination of these two constraints as doubly constrained fair clustering. They present algorithms whose guarantees depend on the best known approximation factors for either of these problems. Currently, this implies an 8-approximation with a small additive violation on the group fairness constraint. For k-center, we improve this approximation factor to 4 with a small additive violation. This guarantee also depends on the currently best algorithm for DS-fair k-center given by Jones, Nguyen and Nguyen (2020). For k-median and k-means, we propose the first constant-factor approximation algorithms. Our algorithms transform a solution that satisfies diverse center selection into a doubly constrained fair clustering using an LP-based approach. Furthermore, our results are generalizable to other center-selection constraints, such as matroid k-clustering and knapsack constraints. 
[LG-12] Corner Reflector Array Jamming Discrimination Using Multi-Dimensional Micro-Motion Features with Frequency Agile Radar
链接: https://arxiv.org/abs/2604.16008
作者: Jie Yuan,Lei Wang,Yanhao Wang,Yimin Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces a robust discrimination method for distinguishing real ship targets from corner-reflector-array jamming with frequency-agile radar. The key idea is to exploit the multidimensional micro-motion signatures that separate rigid ships from non-rigid decoys. From Range-Velocity maps we derive two new hand-crafted descriptors, mean weighted residual (MWR) and complementary contrast factor (CCF), and fuse them with deep features learned by a lightweight CNN. An XGBoost classifier then gives the final decision. Extensive simulations show that the hybrid feature set consistently outperforms state-of-the-art alternatives, confirming the superiority of the proposed approach.
[LG-13] Impact of Nonlinear Power Amplifier on Massive MIMO: Machine Learning Prediction Under Realistic Radio Channel
链接: https://arxiv.org/abs/2604.15977
作者: Marcin Hoffmann,Paweł Kryszkiewicz
类目: Machine Learning (cs.LG)
*备注: Accepted for publication in IEEE Transactions on Vehicular Technology
Abstract:M-MIMO is one of the crucial technologies for increasing spectral and energy efficiency of wireless networks. Most of the current works assume that M-MIMO arrays are equipped with a linear front end. However, ongoing efforts to make wireless networks more energy-efficient push the hardware to the limits, where its nonlinear behavior appears. This is especially a common problem for the multicarrier systems, e.g., OFDM used in 4G, 5G, and possibly also in 6G, which is characterized by a high Peak-to-Average Power Ratio. While the impact of a nonlinear Power Amplifier (PA) on an OFDM signal is well characterized, it is a relatively new topic for the M-MIMO OFDM systems. Most of the recent works either neglect nonlinear effects or utilize simplified models proper for Rayleigh or LoS radio channel models. In this paper, we first theoretically characterize the nonlinear distortion in the M-MIMO system under commonly used radio channel models. Then, utilizing 3D-Ray Tracing (3D-RT) software, we demonstrate that these models are not very accurate. Instead, we propose two models: a statistical one and an ML-based one using 3D-RT results. The proposed statistical model utilizes the Generalized Extreme Value (GEV) distribution to model Signal to Distortion Ratio (SDR) for victim users, receiving nonlinear distortion, e.g., as interference from neighboring cells. The proposed ML model aims to predict SDR for a scheduled user (receiving nonlinear distortion along with the desired signal), based on the spatial characteristics of the radio channel and the operation point of each PA feeding at the M-MIMO antenna array. The predicted SDR can then be used to perform PA-aware per-user power allocation. The results show about 12% median gain in user throughput achieved by the proposed ML-based power allocation scheme over the state-of-the-art, fixed operating point scheme.
[LG-14] Evaluating quality in synthetic data generation for large tabular health datasets
链接: https://arxiv.org/abs/2604.15961
作者: Jean-Baptiste Escudié,Benjamin Barnes,Stefan Meisegeier,Klaus Kraywinkel,Fabian Prasser,Nils Körber
类目: Machine Learning (cs.LG)
*备注:
Abstract:There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries’ epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational framework for guiding the selection of synthesizers and remain accessible to all stakeholders involved in releasing synthetic datasets.
[LG-15] Multi-Objective Bayesian Optimization via Adaptive ε-Constraints Decomposition
链接: https://arxiv.org/abs/2604.15959
作者: Yaohong Yang,Sammie Katt,Samuel Kaski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing expensive black-box functions with multiple objectives. However, existing MOBO methods often struggle with coverage, scalability with respect to the number of objectives, and integrating constraints and preferences. In this work, we propose STAGE-BO, Sequential Targeting Adaptive Gap-Filling ε-Constraint Bayesian Optimization, which explicitly targets under-explored regions of the Pareto front. By analyzing the coverage of the approximate Pareto front, our method identifies the largest geometric gaps. These gaps are then used as constraints, which transforms the problem into a sequence of inequality-constrained subproblems, efficiently solved via constrained expected improvement acquisition. Our approach provides a uniform Pareto coverage without hypervolume computation and naturally applies to constrained and preference-based settings. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines.
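The ε-constraint subproblems at the heart of this decomposition can be illustrated on a toy bi-objective problem; here the constraint is enforced with a simple penalty rather than a constrained-EI acquisition, so this is only a sketch of the decomposition idea, with illustrative names:

```python
import numpy as np

def eps_constraint_scalarize(f1, f2, eps, penalty=1e6):
    """ε-constraint scalarization: optimise f1 subject to f2(x) <= eps.
    The constraint is folded in as a large penalty so any single-objective
    optimiser can be applied to the resulting subproblem."""
    def g(x):
        return f1(x) + penalty * max(0.0, f2(x) - eps)
    return g

# Toy bi-objective problem on [0, 1]: f1(x) = x^2, f2(x) = (x - 1)^2
f1 = lambda x: x ** 2
f2 = lambda x: (x - 1.0) ** 2
xs = np.linspace(0.0, 1.0, 1001)

# Sweeping eps traces different points on the Pareto front; here eps = 0.25
# forces x >= 0.5, so the constrained minimiser of f1 sits at x = 0.5.
g = eps_constraint_scalarize(f1, f2, eps=0.25)
best = xs[int(np.argmin([g(x) for x in xs]))]
```

Varying the right-hand side eps (in STAGE-BO, derived from detected gaps in the front) converts one multi-objective problem into a sequence of single-objective ones.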
[LG-16] TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation
链接: https://arxiv.org/abs/2604.15950
作者: Tristan Kirscher(ICube),Alexandra Ertl(DKFZ),Klaus Maier-Hein(DKFZ),Xavier Coubez(ICANS),Philippe Meyer(ICube),Sylvain Faisan(ICube)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), i.e., the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
[LG-17] (Weighted) Adaptive Radius Near Neighbor Search: Evaluation for WiFi Fingerprint-based Positioning
链接: https://arxiv.org/abs/2604.15940
作者: Khang Le,Joaquín Torres-Sospedra,Philipp Müller
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 11 pages, 2 figures, 2 tables, submitted to IPIN 2026
Abstract:Fixed Radius Near Neighbor (FRNN) search is an alternative to the widely used k Nearest Neighbors (kNN) search. Unlike kNN, FRNN determines a label or an estimate for a test sample based on all training samples within a predefined distance. While this approach is beneficial in certain scenarios, assuming a fixed maximum distance for all training samples can decrease the accuracy of FRNN. Therefore, in this paper we propose the Adaptive Radius Near Neighbor (ARNN) and the Weighted ARNN (WARNN), which employ adaptive distances and, in the latter case, weights. All three methods are compared to kNN and twelve of its variants on a regression problem, namely WiFi fingerprint-based indoor positioning, using 22 different datasets to provide a comprehensive analysis. While the performances of the tested FRNN and ARNN versions were amongst the worst, three of the four best methods in the test were WARNN versions, indicating that using weights together with adaptive distances achieves performance comparable to or even better than kNN variants.
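A toy sketch of the FRNN idea and of an adaptive-radius, weighted variant follows. The adaptive rule shown (radius set to the k-th nearest-neighbor distance) is an illustrative assumption, not necessarily the paper's exact ARNN/WARNN definition:

```python
import numpy as np

def frnn_regress(X_train, y_train, x, radius):
    """Fixed Radius Near Neighbor regression: average the targets of all
    training points within `radius` of the query x."""
    d = np.linalg.norm(X_train - x, axis=1)
    mask = d <= radius
    if not mask.any():
        return None  # no neighbor inside the fixed radius
    return y_train[mask].mean()

def warnn_regress(X_train, y_train, x, k=3, eps=1e-9):
    """Weighted adaptive-radius variant (sketch): the radius adapts per
    query to the k-th nearest-neighbor distance, and neighbors are
    weighted by inverse distance."""
    d = np.linalg.norm(X_train - x, axis=1)
    radius = np.sort(d)[min(k, len(d)) - 1]
    mask = d <= radius
    w = 1.0 / (d[mask] + eps)
    return np.average(y_train[mask], weights=w)

# Toy "fingerprint" data: 1-D features, positions as regression targets
X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
```

The contrast is visible even on this toy data: a fixed radius can return no neighbors at all for isolated queries, while the adaptive radius always finds k neighbors and the weights pull the estimate toward the closest one.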
[LG-18] Reversible Residual Normalization Alleviates Spatio-Temporal Distribution Shift
链接: https://arxiv.org/abs/2604.15838
作者: Zhaobo Hu,Vincent Gauthier,Mehdi Naima
类目: Machine Learning (cs.LG)
*备注:
Abstract:Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially-aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex Spatio-Temporal relationships in a data-driven manner. The bidirectional nature of our framework allows models to learn in a normalized latent space and recover original distributional properties through inverse transformation, offering a robust and model-agnostic solution for forecasting on dynamic spatio-temporal systems.
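The per-instance reversible normalization that RRN generalizes can be sketched in a few lines. This shows only the temporal, per-series case, without the graph-aware invertible blocks; the class name is illustrative:

```python
import numpy as np

class ReversibleInstanceNorm:
    """Per-series reversible normalization: subtract each series' mean and
    divide by its std before forecasting, then restore the statistics on
    the model output. This is the temporal-only baseline that
    spatially-aware schemes like RRN extend."""
    def __init__(self, eps=1e-8):
        self.eps = eps

    def normalize(self, x):  # x: (n_series, length)
        self.mu = x.mean(axis=1, keepdims=True)
        self.sigma = x.std(axis=1, keepdims=True) + self.eps
        return (x - self.mu) / self.sigma

    def denormalize(self, z):
        # Invert the transform so forecasts recover the original scale
        return z * self.sigma + self.mu

rin = ReversibleInstanceNorm()
x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
z = rin.normalize(x)      # model would learn/forecast in this space
x_rec = rin.denormalize(z)
```

Because the transform is exactly invertible, the forecasting model operates in a standardized latent space while outputs are mapped back to each series' original distribution, which is the bidirectional property the abstract refers to.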
[LG-19] Modern Structure-Aware Simplicial Spatiotemporal Neural Network
链接: https://arxiv.org/abs/2604.15833
作者: Zhaobo Hu,Vincent Gauthier,Mehdi Naima
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spatiotemporal modeling has evolved beyond simple time series analysis to become fundamental in structural time series analysis. While current research extensively employs graph neural networks (GNNs) for spatial feature extraction with notable success, these networks are limited to capturing only pairwise relationships, despite real-world networks containing richer topological relationships. Additionally, GNN-based models face computational challenges that scale with graph complexity, limiting their applicability to large networks. To address these limitations, we present Modern Structure-Aware Simplicial SpatioTemporal neural network (ModernSASST), the first approach to leverage simplicial complex structures for spatiotemporal modeling. Our method employs spatiotemporal random walks on high-dimensional simplicial complexes and integrates parallelizable Temporal Convolutional Networks to capture high-order topological structures while maintaining computational efficiency. Our source code is publicly available on GitHub (this https URL).
[LG-20] Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
链接: https://arxiv.org/abs/2604.15830
作者: Yangyi Fang,Jiaye Lin,Xiaoliang Fu,Cong Qin,Haolin Shi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning has become a powerful approach for enhancing large language model reasoning, but faces a fundamental dilemma: training on easy problems can cause overfitting and pass@k degradation, while training on hard problems often results in sparse rewards. Recent question augmentation methods address this by prepending partial solutions as hints. However, uniform hint provision may introduce redundant information while missing critical reasoning bottlenecks, and excessive hints can reduce reasoning diversity, causing pass@k degradation. We propose PieceHint, a hint injection framework that strategically identifies and provides critical reasoning steps during training. By scoring the importance of different reasoning steps, selectively allocating hints based on problem difficulty, and progressively withdrawing scaffolding, PieceHint enables models to transition from guided learning to independent reasoning. Experiments on six mathematical reasoning benchmarks show that our 1.5B model achieves comparable average performance to 32B baselines while preserving pass@k diversity across all k values.
[LG-21] Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials
链接: https://arxiv.org/abs/2604.15821
作者: Yuanchang Zhou,Hongyu Wang,Yiming Du,Yan Wang,Mingzhen Li,Siyu Hu,Xiangyu Zhang,Weijian Liu,Chen Wang,Zhuoqiang Guo,Long Wang,Jingde Bu,Yutong Lu,Guangming Tan,Weile Jia
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures
Abstract:Universal Machine Learning Interatomic Potentials (uMLIPs), pre-trained on massively diverse datasets encompassing inorganic materials and organic molecules across the entire periodic table, serve as foundational models for quantum-accurate physical simulations. However, uMLIP training requires second-order derivatives, which lack corresponding parallel training frameworks; moreover, scaling to the billion-parameter regime causes explosive growth in computation and communication overhead, making its training a tremendous challenge. We introduce MatRIS-MoE, a billion-parameter Mixture-of-Experts model built upon invariant architecture, and Janus, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations. Deployed across two Exascale supercomputers, our code attains a peak performance of 1.2/1.0 EFLOPS (24%/35.5% of theoretical peak) in single precision at over 90% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours. This work establishes a new high-water mark for AI-for-Science (AI4S) foundation models at Exascale and provides essential infrastructure for rapid scientific discovery.
[LG-22] Convolutionally Low-Rank Models with Modified Quantile Regression for Interval Time Series Forecasting
链接: https://arxiv.org/abs/2604.15791
作者: Miaoxuan Zhu,Yi Yu,Yuyang Li,Wei Li,Guangcan Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The quantification of uncertainty in prediction models is crucial for reliable decision-making, yet remains a significant challenge. Interval time series forecasting offers a principled solution to this problem by providing prediction intervals (PIs), which indicate the probability that the true value falls within the predicted range. We consider a recently established point forecasting (PF) method termed Learning-Based Convolution Nuclear Norm Minimization (LbCNNM), which directly generates multi-step ahead forecasts by leveraging the convolutional low-rankness property derived from training data. While theoretically complete and empirically effective, LbCNNM lacks inherent uncertainty estimation capabilities, a limitation shared by many advanced forecasting methods. To resolve the issue, we modify the well-known Quantile Regression (QR) and integrate it into LbCNNM, resulting in a novel interval forecasting method termed LbCNNM with Modified Quantile Regression (LbCNNM-MQR). In addition, we devise interval calibration techniques to further improve the accuracy of PIs. Extensive experiments on over 100,000 real-world time series demonstrate the superior performance of LbCNNM-MQR.
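Standard quantile regression of the kind this method builds on is trained against the pinball loss, whose minimiser is the target quantile; a quick numerical check of that property (the modified QR in the paper may differ):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: penalises under-prediction with weight q
    and over-prediction with weight 1 - q, so its minimiser is the
    q-th conditional quantile of the target."""
    err = np.asarray(y_true) - y_pred
    return np.mean(np.maximum(q * err, (q - 1) * err))

# The best constant predictor under the 0.9-pinball loss lands on the
# empirical 0.9-quantile of the data.
rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)
grid = np.linspace(-3.0, 3.0, 601)
best = grid[int(np.argmin([pinball_loss(y, c, 0.9) for c in grid]))]
```

Fitting two such models at quantiles q and 1-q yields the lower and upper bounds of a prediction interval with nominal coverage 1 - 2q, which interval calibration then refines.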
[LG-23] Similarity-Based Bike Station Expansion via Hybrid Denoising Autoencoders
链接: https://arxiv.org/abs/2604.15783
作者: Oluwaleke Yusuf,M. Tsaqif Wismadi,Adil Rasheed
类目: Machine Learning (cs.LG)
*备注: 10 pages, 9 figures. Code available at this https URL
Abstract:Urban bike-sharing systems require strategic station expansion to meet growing demand. Traditional allocation approaches rely on explicit demand modelling that may not capture the urban characteristics distinguishing successful stations. This study addresses the need to exploit patterns from existing stations to inform expansion decisions, particularly in data-constrained environments. We present a data-driven framework leveraging existing stations deemed desirable by operational metrics. A hybrid denoising autoencoder (HDAE) learns compressed latent representations from multi-source grid-level features (socio-demographic, built environment, and transport network), with a supervised classification head regularising the embedding space structure. Expansion candidates are selected via greedy allocation with spatial constraints based on latent-space similarity to existing stations. Evaluation on Trondheim’s bike-sharing network demonstrates that HDAE embeddings yield more spatially coherent clusters and allocation patterns than raw features. Sensitivity analyses across similarity methods and distance metrics confirm robustness. A consensus-based procedure across multiple parametrisations distils 32 high-confidence extension zones where all parametrisations agree. The results demonstrate how representation learning captures complex patterns that raw features miss, enabling evidence-based expansion planning without explicit demand modelling. The consensus procedure strengthens recommendations by requiring agreement across parametrisations, while framework configurability allows planners to incorporate operational knowledge. The methodology generalises to any location-allocation problem where existing desirable instances inform the selection of new candidates.
[LG-24] Fusing Cellular Network Data and Tollbooth Counts for Urban Traffic Flow Estimation
链接: https://arxiv.org/abs/2604.15782
作者: Oluwaleke Yusuf,Shaira Tabassum
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: 8 pages, 7 figures
Abstract:Traffic simulations, essential for planning urban transit infrastructure interventions, require vehicle-category-specific origin-destination (OD) data. Existing data sources are imperfect: sparse tollbooth sensors provide accurate vehicle counts by category, while extensive mobility data from cellular network activity captures aggregated crowd movement, but lack modal disaggregation and have systematic biases. This study develops a machine learning framework to correct and disaggregate cellular network data using sparse tollbooth counts as ground truth. The model uses temporal and spatial features to learn the complex relationship between aggregated mobility data and vehicular data. The framework infers destinations from transit routes and implements routing logic to distribute corrected flows between OD pairs. This approach is applied to a bus depot expansion in Trondheim, Norway, generating hourly OD matrices by vehicle length category. The results show how limited but accurate sensor measurements can correct extensive but aggregated mobility data to produce grounded estimates of background vehicular traffic flows. These macro-scale estimates can be refined for micro-scale analysis at desired locations. The framework provides a generalisable approach for generating origin-destination data from cellular network data. This enables downstream tasks, like detailed traffic simulations for infrastructure planning in data-scarce contexts, supporting urban planners in making informed decisions.
[LG-25] Federated Learning with Quantum Enhanced LSTM for Applications in High Energy Physics
链接: https://arxiv.org/abs/2604.15775
作者: Abhishek Sawaika,Durga Pritam Suggisetti,Udaya Parampalli,Rajkumar Buyya
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Quantum Physics (quant-ph)
*备注: 8 pages, 7 figures, accepted at IEEE WCCI, 2026
Abstract:Learning with large-scale datasets and information-critical applications, such as in High Energy Physics (HEP), demands highly complex, large-scale models that are both robust and accurate. To tackle this issue and cater to the learning requirements, we envision using a federated learning framework with a quantum-enhanced model. Specifically, we design a hybrid quantum-classical long short-term memory (QLSTM) model for local training at distributed nodes. It combines the representative power of quantum models in understanding complex relationships within the feature space with an LSTM-based model that learns the necessary correlations across data points. Given the computing limitations and unprecedented cost of current stand-alone noisy intermediate-scale quantum (NISQ) devices, we propose to use a federated learning setup, where the learning load can be distributed to local servers as per design and data availability. We demonstrate the benefits of such a design on a classification task for the Supersymmetry (SUSY) dataset, which has 5M rows. Our experiments indicate that the performance of this design is not only better than that of some existing work using variational quantum circuit (VQC) based quantum machine learning (QML) techniques, but is also comparable ( \Delta \sim \pm 1% ) to that of classical deep-learning benchmarks. An important observation from this study is that the designed framework has 300 parameters and only needs 20K data points to give comparable performance, which is also a 100 \times improvement over the compared baseline models. This shows an improved learning capability of the proposed framework with minimal data and resource requirements, due to the joint model with an LSTM-based architecture and a quantum-enhanced VQC.
[LG-26] Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions
链接: https://arxiv.org/abs/2604.15762
作者: Huan Lin,Lianghui Ding
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large-scale Unmanned Aerial Vehicle (UAV) failures can split an unmanned aerial vehicle swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL) that adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20-UAV swarms transfers directly to swarms of up to 500 UAVs without fine-tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.
[LG-27] Multi-objective Reinforcement Learning With Augmented States Requires Rewards After Deployment
链接: https://arxiv.org/abs/2604.15757
作者: Peter Vamplew,Cameron Foale
类目: Machine Learning (cs.LG)
*备注:
Abstract:This research note identifies a previously overlooked distinction between multi-objective reinforcement learning (MORL), and more conventional single-objective reinforcement learning (RL). It has previously been noted that the optimal policy for an MORL agent with a non-linear utility function is required to be conditioned on both the current environmental state and on some measure of the previously accrued reward. This is generally implemented by concatenating the observed state of the environment with the discounted sum of previous rewards to create an augmented state. While augmented states have been widely-used in the MORL literature, one implication of their use has not previously been reported – namely that they require the agent to have continued access to the reward signal (or a proxy thereof) after deployment, even if no further learning is required. This note explains why this is the case, and considers the practical repercussions of this requirement.
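The augmented-state construction, and why it needs the reward signal after deployment, can be sketched as follows (a minimal illustration; `AugmentedStateWrapper` is a hypothetical name, not from the note):

```python
from dataclasses import dataclass, field

@dataclass
class AugmentedStateWrapper:
    """Concatenates the environment observation with the discounted sum
    of previously accrued vector rewards, as required by MORL policies
    with non-linear utility. Note that updating the accrued-reward
    component requires continued access to the reward signal, even at
    deployment time when no further learning takes place."""
    gamma: float
    accrued: list = field(default_factory=lambda: [0.0, 0.0])
    t: int = 0

    def augment(self, obs):
        # The policy is conditioned on this (obs, accrued-reward) pair.
        return (obs, tuple(self.accrued))

    def step(self, rewards):
        # Discounted sum of the vector rewards received so far.
        self.accrued = [a + (self.gamma ** self.t) * r
                        for a, r in zip(self.accrued, rewards)]
        self.t += 1

w = AugmentedStateWrapper(gamma=0.9)
s0 = w.augment("s0")    # ("s0", (0.0, 0.0))
w.step([1.0, 0.0])      # reward observed after acting -- needed even deployed
s1 = w.augment("s1")
w.step([0.0, 2.0])
s2 = w.augment("s2")
print(s2)
```

Without the `step` calls (i.e. without a post-deployment reward signal or a proxy for it), the accrued component cannot be maintained and the augmented state degenerates.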
[LG-28] Collective Kernel EFT for Pre-activation ResNets
链接: https://arxiv.org/abs/2604.15742
作者: Hidetoshi Kawase,Toshihiro Ota
类目: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
*备注: 20 pages
Abstract:In finite-width deep neural networks, the empirical kernel G evolves stochastically across layers. We develop a collective kernel effective field theory (EFT) for pre-activation ResNets based on a G -only closure hierarchy and diagnose its finite validity window. Exploiting the exact conditional Gaussianity of residual increments, we derive an exact stochastic recursion for G . Applying Gaussian approximations systematically yields a continuous-depth ODE system for the mean kernel K_0 , the kernel covariance V_4 , and the 1/n mean correction K_{1,\mathrm{EFT}} , which emerges diagrammatically as a one-loop tadpole correction. Numerically, K_0 remains accurate at all depths. However, the V_4 equation residual accumulates to an O(1) error at finite time, primarily driven by approximation errors in the G -only transport term. Furthermore, K_{1,\mathrm{EFT}} fails due to the breakdown of the source closure, which exhibits a systematic mismatch even at initialization. These findings highlight the limitations of G -only state-space reduction and suggest extending the state space to incorporate the sigma-kernel.
[LG-29] Why Colors Make Clustering Harder: Global Integrality Gaps, the Price of Fairness, and Color-Coupled Algorithms in Chromatic Correlation Clustering
链接: https://arxiv.org/abs/2604.15738
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chromatic Correlation Clustering (CCC) extends Correlation Clustering by assigning semantic colors to edges and requiring each cluster to receive a single color label. Unlike standard CC, whose LP relaxation has integrality gap 2 on complete graphs and admits a 2.06-approximation, the analogous LP for CCC has a strict lower bound of 2.11, and the best known LP-rounding algorithm achieves 2.15. We explain this gap by isolating the source of difficulty: cross-edge chromatic interference. Neutral edges, whose color does not match the candidate cluster color, create an irreducible cost absent from standard CC and force any color-independent rounding scheme to pay an additional mismatch penalty. We make four contributions. First, we prove a Global Integrality Gap Decomposition Theorem showing that the gap of any color-independent CCC rounding algorithm equals the standard CC gap plus an irreducible chromatic penalty Delta(L) > 0. Second, we solve the associated min-max problem and derive the staircase formula Delta(L) = ((L-1)/L) Delta_infinity, where Delta_infinity is approximately 0.0734. In particular, the two-color gap is 2.0967, separating CCC from standard CC already at L = 2. Third, we introduce Color-Coupled Correlation Clustering (C4). Adding the valid global constraint sum_c x_uv^c = L-1 and a correlated interval-packing rounding scheme makes neutral edges behave like classical negative edges, recovering the optimal 2.06 approximation and bypassing the 2.11 lower bound for the uncoupled LP. Fourth, experiments on extremal instances, real multi-relational networks, and fairness benchmarks validate the theory: empirical LP gaps follow the predicted staircase, and C4 matches the unconstrained approximation ratio under fairness constraints.
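The staircase formula from the abstract can be evaluated directly (a small illustrative sketch; only Delta_infinity ≈ 0.0734 and the 2.06 standard-CC baseline come from the abstract):

```python
DELTA_INF = 0.0734  # Delta_infinity, as reported in the abstract

def chromatic_penalty(L):
    """Staircase formula Delta(L) = ((L-1)/L) * Delta_infinity."""
    return (L - 1) / L * DELTA_INF

def ccc_gap(L, cc_gap=2.06):
    """Gap of a color-independent CCC rounding scheme: the standard CC
    gap plus the irreducible chromatic penalty Delta(L)."""
    return cc_gap + chromatic_penalty(L)

# Two colors already separate CCC from standard CC: 2.06 + 0.0367 = 2.0967.
print(round(ccc_gap(2), 4))
```

With L = 1 the penalty vanishes (no chromatic interference), and as L grows the gap climbs toward 2.06 + Delta_infinity, matching the staircase picture.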
[LG-30] Neuromorphic Parameter Estimation for Power Converter Health Monitoring Using Spiking Neural Networks
链接: https://arxiv.org/abs/2604.15714
作者: Hyeongmeen Baik,Hamed Poursiami,Maryam Parsa,Jinia Roy
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 10 pages, 11 figures, 4 tables. Submitted to ICONS 2026
Abstract:Always-on converter health monitoring demands sub-mW edge inference, a regime inaccessible to GPU-based physics-informed neural networks. This work separates spiking temporal processing from physics enforcement: a three-layer leaky integrate-and-fire SNN estimates passive component parameters while a differentiable ODE solver provides physics-consistent training by decoupling the ODE physics loss from the unrolled spiking loop. On an EMI-corrupted synchronous buck converter benchmark, the SNN reduces lumped resistance error from 25.8% to 10.2% versus a feedforward baseline, within the \pm 10% manufacturing tolerance of passive components, at a projected \sim270\times energy reduction on neuromorphic hardware. Persistent membrane states further enable degradation tracking and event-driven fault detection via a +5.5 percentage-point spike-rate jump at abrupt faults. With 93% spike sparsity, the architecture is suited for always-on deployment on Intel Loihi 2 or BrainChip Akida.
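A minimal sketch of the leaky integrate-and-fire dynamics behind the SNN, assuming a simple Euler discretization (the parameters and the `lif_step`/`spike_rate` helpers are illustrative, not the paper's):

```python
def lif_step(v, i_in, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """One Euler step of a leaky integrate-and-fire neuron:
    dv/dt = (-v + i_in) / tau; emit a spike and reset when v >= v_th."""
    v = v + dt * (-v + i_in) / tau
    if v >= v_th:
        return v_reset, 1
    return v, 0

def spike_rate(i_in, steps=500):
    """Average spikes per step under a constant input current."""
    v, spikes = 0.0, 0
    for _ in range(steps):
        v, s = lif_step(v, i_in)
        spikes += s
    return spikes / steps

# An abrupt input jump (emulating a fault transient) raises the spike
# rate -- the kind of event-driven signature the paper exploits.
print(spike_rate(1.2), spike_rate(2.0))
```

The persistent membrane state `v` is what lets such neurons track slow degradation between spikes, while rate jumps flag abrupt faults.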
[LG-31] Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
链接: https://arxiv.org/abs/2604.15705
作者: Xiaoyu Yang,En Yu,Wei Duan,Jie Lu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Fine-Tuning (RFT) has established itself as a critical paradigm for the alignment of Multi-modal Large Language Models (MLLMs) with complex human values and domain-specific requirements. Nevertheless, current research primarily focuses on mitigating exogenous distribution shifts arising from data-centric factors, while the non-stationarity inherent in endogenous reasoning remains largely unexplored. In this work, a critical vulnerability is revealed within MLLMs: they are highly susceptible to endogenous reasoning drift, across both thinking and perception perspectives. It manifests as unpredictable distribution changes that emerge spontaneously during the autoregressive generation process, independent of external environmental perturbations. To address this, we first theoretically define endogenous reasoning drift within the RFT of MLLMs as multi-modal concept drift. In this context, this paper proposes Counterfactual Preference Optimization ++ (CPO++), a comprehensive and autonomous framework adapted to multi-modal concept drift. It integrates counterfactual reasoning with domain knowledge to execute controlled perturbations across thinking and perception, employing preference optimization to disentangle spurious correlations. Extensive empirical evaluations across two highly dynamic and safety-critical domains, medical diagnosis and autonomous driving, demonstrate that the proposed framework achieves superior performance in reasoning coherence, decision-making precision, and inherent robustness against extreme interference. The methodology also exhibits exceptional zero-shot cross-domain generalization, providing a principled foundation for reliable multi-modal reasoning in safety-critical applications.
[LG-32] Graph self-supervised learning based on frequency corruption WWW2026
链接: https://arxiv.org/abs/2604.15699
作者: Haojie Li,Mengjiao Zhang,Guanfeng Liu,Qiang Hu,Yan Wang,Junwei Du
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 11 pages, 4 tables, 3 figures. Accepted at The ACM Web Conference 2026 (WWW 2026)
Abstract:Graph self-supervised learning can reduce the need for labeled graph data and has been widely used in recommendation, social networks, and other web applications. However, existing methods often underuse high-frequency signals and may overfit to specific local patterns, which limits representation quality and generalization. We propose Frequency-Corrupt Based Graph Self-Supervised Learning (FC-GSSL), a method that builds corrupted graphs biased toward high-frequency information by corrupting nodes and edges according to their low-frequency contributions. These corrupted graphs are used as inputs to an autoencoder, while low-frequency and general features are reconstructed as supervision targets, forcing the model to fuse information from multiple frequency bands. We further design multiple sampling strategies and generate diverse corrupted graphs from the intersections and unions of the sampling results. By aligning node representations from these views, the model can discover useful frequency combinations, reduce reliance on specific high-frequency components, and improve robustness. Experiments on 14 datasets across node classification, graph prediction, and transfer learning show that FC-GSSL consistently improves performance and generalization.
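One way to picture corruption biased toward high-frequency information, under the assumption that an edge's low-frequency contribution can be proxied by its (inverse) Dirichlet energy, is the following toy rule (hypothetical, not FC-GSSL's actual scoring):

```python
def edge_energy(x, edges):
    """Per-edge Dirichlet energy (x_u - x_v)^2: small values mean the edge
    mostly carries low-frequency (smooth) signal content."""
    return {e: (x[e[0]] - x[e[1]]) ** 2 for e in edges}

def corrupt_low_frequency_edges(x, edges, k):
    """Toy corruption rule: drop the k edges with the largest low-frequency
    contribution (smallest Dirichlet energy), biasing the corrupted graph
    toward high-frequency information."""
    energy = edge_energy(x, edges)
    drop = set(sorted(edges, key=lambda e: energy[e])[:k])
    return [e for e in edges if e not in drop]

# A 1-D node signal on the path graph 0-1-2-3; edge (2, 3) carries a
# sharp jump (high-frequency content), the other two edges are smooth.
x = [0.0, 0.1, 0.2, 5.0]
edges = [(0, 1), (1, 2), (2, 3)]
kept = corrupt_low_frequency_edges(x, edges, 2)
print(kept)
```

The smooth edges are removed and the high-frequency edge survives; the corrupted graph then serves as the autoencoder input while low-frequency features act as reconstruction targets.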
[LG-33] Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
链接: https://arxiv.org/abs/2604.15694
作者: Jingyuan Li,Xiaoyi Jiang,Fukang Wen,Wei Liu,Renqian Luo,Yi Zhu,Zuoqiang Shi,Pipi Hu
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix as a single object – via concrete scores, clean-data predictions ( x_0 -parameterization), or denoising distributions – rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. Since a CTMC is fundamentally a Poisson process fully determined by these two quantities, decomposing along this structure is closer to first principles and naturally leads to our formulation. We propose \textbf{Neural CTMC}, which separately parameterizes the reverse process through an \emph{exit rate} (when to jump) and a \emph{jump distribution} (where to jump) using two dedicated network heads. We show that the evidence lower bound (ELBO) differs from a path-space KL divergence between the true and learned reverse processes by a \theta -independent constant, so that the training objective is fully governed by the exit rate and jump distribution we parameterize. Moreover, this KL factorizes into a Poisson KL for timing and a categorical KL for direction. We further show that the tractable conditional surrogate preserves the gradients and minimizers of the corresponding marginal reverse-process objective under standard regularity assumptions. Our theoretical framework also covers masked and GIDD-style noise schedules. Empirically, while the uniform forward process has been explored in prior work, our model is, to the best of our knowledge, the first pure-uniform method to outperform mask-based methods on the OpenWebText dataset. To facilitate reproducibility, we release our pretrained weights at this https URL.
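The "when to jump / where to jump" decomposition of a CTMC can be sketched with a plain simulator (illustrative only; the paper parameterizes these two quantities with network heads rather than fixed functions):

```python
import random

def simulate_ctmc(state, exit_rate, jump_dist, t_end, rng):
    """Simulate a CTMC with jump timing and jump direction decoupled:
    holding times are exponential with the current state's exit rate
    ("when to jump"), and successors are drawn from a categorical jump
    distribution ("where to jump")."""
    t, path = 0.0, [state]
    while True:
        t += rng.expovariate(exit_rate(state))        # when to jump
        if t >= t_end:
            return path
        probs = jump_dist(state)                      # where to jump
        state = rng.choices(list(probs), weights=list(probs.values()))[0]
        path.append(state)

# Toy 2-state chain: state 0 exits quickly, state 1 slowly; the jump
# distribution is a deterministic flip, so the path must alternate.
rng = random.Random(0)
path = simulate_ctmc(
    0,
    exit_rate=lambda s: 2.0 if s == 0 else 0.5,
    jump_dist=lambda s: {1 - s: 1.0},
    t_end=10.0, rng=rng)
print(path)
```

The exit rate and jump distribution together fully determine the process, which is exactly why the paper can train two dedicated heads for them instead of one monolithic rate matrix.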
[LG-34] NK-GAD: Neighbor Knowledge-Enhanced Unsupervised Graph Anomaly Detection
链接: https://arxiv.org/abs/2604.15668
作者: Zehao Wang,Lanjun Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph anomaly detection aims to identify irregular patterns in graph-structured data. Most unsupervised GNN-based methods rely on the homophily assumption that connected nodes share similar attributes. However, real-world graphs often exhibit attribute-level heterophily, where connected nodes have dissimilar attributes. Our analysis of attribute-level heterophily graphs reveals two phenomena indicating that current approaches are not practical for unsupervised graph anomaly detection: 1) attribute similarities between connected nodes show nearly identical distributions across different connected node pair types, and 2) anomalies cause consistent variation trends between the graph with and without anomalous edges in the low- and high-frequency components of the spectral energy distributions, while the mid-part exhibits more erratic variations. Based on these observations, we propose NK-GAD, a neighbor knowledge-enhanced unsupervised graph anomaly detection framework. NK-GAD integrates a joint encoder capturing both similar and dissimilar neighbor features, a neighbor reconstruction module modeling normal distributions, a center aggregation module refining node features, and dual decoders for reconstructing attributes and structures. Experiments on seven datasets show NK-GAD achieves an average 3.29% AUC improvement.
[LG-35] Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
链接: https://arxiv.org/abs/2604.15664
作者: Xinge Liu,Terry Jingchen Zhang,Bernhard Schölkopf,Zhijing Jin,Kristen Menou
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at this https URL and this https URL, respectively.
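A toy version of the model-fitting task, assuming a simplified circular-orbit RV model rather than a full Keplerian fit (all parameter values here are made up for illustration):

```python
import math

def rv_model(t, K, P, phi, gamma):
    """Circular-orbit radial-velocity model v(t) = K sin(2*pi*t/P + phi)
    + gamma -- a simplified stand-in for the models fitted in Stargazer."""
    return K * math.sin(2 * math.pi * t / P + phi) + gamma

def chi2(params, data):
    """Goodness of fit of candidate parameters against (t, v, sigma)."""
    K, P, phi, gamma = params
    return sum(((v - rv_model(t, K, P, phi, gamma)) / s) ** 2
               for t, v, s in data)

# Synthetic single-planet epoch: K = 30 m/s, P = 12 d, phi = 0.3,
# systemic velocity gamma = 5 m/s, observed every 2 days.
truth = (30.0, 12.0, 0.3, 5.0)
data = [(t, rv_model(t, *truth), 1.0) for t in range(0, 36, 2)]
# A fit with the wrong period scores far worse than the physically
# correct parameters -- the statistical-vs-physical gap the benchmark probes.
print(chi2(truth, data), round(chi2((30.0, 11.0, 0.3, 5.0), data), 1))
```

An agent can drive `chi2` low while still reporting a wrong period or amplitude on noisy data, which is exactly the failure mode the evaluation reports.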
[LG-36] PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
链接: https://arxiv.org/abs/2604.15645
作者: Shimon Pisnoy,Hemanth Chandravamsi,Ziv Chen,Aaron Goldgewert,Gal Shaviner,Boris Shragner,Steven H. Frankel
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注:
Abstract:We present PINNACLE, an open-source computational framework for physics-informed neural networks (PINNs) that integrates modern training strategies, multi-GPU acceleration, and hybrid quantum-classical architectures within a unified modular workflow. The framework enables systematic evaluation of PINN performance across benchmark problems including 1D hyperbolic conservation laws, incompressible flows, and electromagnetic wave propagation. It supports a range of architectural and training enhancements, including Fourier feature embeddings, random weight factorization, strict boundary condition enforcement, adaptive loss balancing, curriculum training, and second-order optimization strategies, with extensibility to additional methods. We provide a comprehensive benchmark study quantifying the impact of these methods on convergence, accuracy, and computational cost, and analyze distributed data parallel scaling in terms of runtime and memory efficiency. In addition, we extend the framework to hybrid quantum-classical PINNs and derive a formal estimate for circuit-evaluation complexity under parameter-shift differentiation. Results highlight the sensitivity of PINNs to architectural and training choices, confirm their high computational cost relative to classical solvers, and identify regimes where hybrid quantum models offer improved parameter efficiency. PINNACLE provides a foundation for benchmarking physics-informed learning methods and guiding future developments through quantitative assessment of their trade-offs.
[LG-37] Majority Voting for Code Generation ICLR2026 DATE
链接: https://arxiv.org/abs/2604.15618
作者: Tim Launer,Jonas Hübotter,Marco Bagatella,Ido Hakimi,Andreas Krause
类目: Machine Learning (cs.LG)
*备注: ICLR 2026 Test-Time Updates (TTU) Workshop
Abstract:We investigate Functional Majority Voting (FMV), a method based on functional consensus for code generation with Large Language Models, which identifies a representative solution from multiple generations using their runtime execution signatures on test inputs. We find that FMV is an effective test-time inference strategy, substantially boosting performance on LiveCodeBench without a large compute overhead. Furthermore, we extend the utility of functional consensus and apply it as an aggregation strategy for label-free Test-Time Reinforcement Learning. We demonstrate that this increases pass@1 on holdout tasks, but find no evidence of self-improvement beyond the base model’s performance ceiling.
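The core of functional consensus can be sketched as grouping candidates by their execution signatures on test inputs (a minimal sketch; `functional_majority_vote` is an illustrative name, not the paper's implementation):

```python
from collections import Counter

def functional_majority_vote(candidates, test_inputs):
    """Group candidate programs by their runtime execution signature on
    the test inputs and return one member of the largest functional
    equivalence class."""
    def signature(fn):
        out = []
        for x in test_inputs:
            try:
                out.append(repr(fn(x)))
            except Exception as e:   # crashing programs form their own class
                out.append(type(e).__name__)
        return tuple(out)

    sigs = [signature(fn) for fn in candidates]
    majority_sig, _ = Counter(sigs).most_common(1)[0]
    return candidates[sigs.index(majority_sig)]

# Three generations for "square a number": two are functionally identical
# despite different syntax, so the consensus picks the squaring behavior.
cands = [lambda x: x * x, lambda x: x ** 2, lambda x: 2 * x]
best = functional_majority_vote(cands, test_inputs=[0, 1, 3])
print(best(5))
```

Because voting is over behavior rather than token strings, syntactically diverse but semantically equivalent generations reinforce each other, which is also what makes the signature usable as a label-free training signal.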
[LG-38] Flexible Empowerment at Reasoning with Extended Best-of-N Sampling
链接: https://arxiv.org/abs/2604.15614
作者: Taisuke Kobayashi
类目: Machine Learning (cs.LG)
*备注: 15 pages, 4 figures
Abstract:This paper proposes a novel method that incorporates empowerment when reasoning actions in reinforcement learning (RL), thereby achieving flexibility in the exploration-exploitation dilemma (EED). In previous methods, empowerment for promoting exploration has been provided as a bonus term to the task-specific reward function as an intrinsically-motivated RL. However, this approach introduces a delay until the policy that accounts for empowerment is learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, a trick devised for fine-tuning recent foundation models at reasoning, so-called best-of-N (BoN) sampling, allows for the implicit acquisition of modified policies without explicitly learning them. It is expected that applying this trick to exploration-promoting terms, such as empowerment, will enable more flexible adjustment of EED. Therefore, this paper investigates BoN sampling for empowerment. Furthermore, to adjust the degree of policy modification in a generalizable manner while maintaining computational cost, this paper proposes a novel BoN sampling method extended by Tsallis statistics. Through toy problems, the proposed method's capability to balance EED is verified. In addition, it is demonstrated that the proposed method improves RL performance to solve complex locomotion tasks.
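A minimal sketch of BoN sampling with an empowerment-style bonus, assuming a simple additive score (the paper's actual extension uses Tsallis statistics; the toy value and bonus functions here are invented for illustration):

```python
import random

def best_of_n(policy_sample, task_value, empowerment, n, beta, rng):
    """Best-of-N sampling that scores each candidate action by the task
    value plus a beta-weighted empowerment bonus. Because the modified
    policy is realized at sampling time, beta can be changed per query
    to re-balance exploration vs. exploitation without any relearning."""
    actions = [policy_sample(rng) for _ in range(n)]
    return max(actions, key=lambda a: task_value(a) + beta * empowerment(a))

rng = random.Random(1)
sample = lambda r: r.uniform(-1.0, 1.0)
value = lambda a: -(a - 0.2) ** 2          # task prefers actions near 0.2
emp = lambda a: -abs(a)                    # toy bonus prefers actions near 0
exploit = best_of_n(sample, value, emp, n=64, beta=0.0, rng=rng)
explore = best_of_n(sample, value, emp, n=64, beta=5.0, rng=rng)
print(round(exploit, 2), round(explore, 2))
```

With beta = 0 the selection is pure exploitation; raising beta shifts the implicit policy toward the bonus term immediately, with no delay for the bonus to propagate through value learning.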
[LG-39] Optimizing Stochastic Gradient Push under Broadcast Communications
链接: https://arxiv.org/abs/2604.15549
作者: Tuan Nguyen,Ting He
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注:
Abstract:We consider the problem of minimizing the convergence time for decentralized federated learning (DFL) in wireless networks under broadcast communications, with focus on mixing matrix design. The mixing matrix is a critical hyperparameter for DFL that simultaneously controls the convergence rate across iterations and the communication demand per iteration, both strongly influencing the convergence time. Although the problem has been studied previously, existing solutions are mostly designed for decentralized parallel stochastic gradient descent (D-PSGD), which requires the mixing matrix to be symmetric and doubly stochastic. These constraints confine the activated communication graph to undirected (i.e., bidirected) graphs, which limits design flexibility. In contrast, we consider mixing matrix design for stochastic gradient push (SGP), which allows asymmetric mixing matrices and hence directed communication graphs. By analyzing how the convergence rate of SGP depends on the mixing matrices, we extract an objective function that explicitly depends on graph-theoretic parameters of the activated communication graph, based on which we develop an efficient design algorithm with performance guarantees. Our evaluations based on real data show that the proposed solution can notably reduce the convergence time compared to the state of the art without compromising the quality of the trained model.
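The push-sum primitive that lets SGP use asymmetric (column-stochastic) mixing matrices can be sketched as follows (a toy directed ring; the matrix here is illustrative, not a designed optimum):

```python
def push_sum_round(x, w, A):
    """One round of push-sum averaging with a column-stochastic mixing
    matrix A (columns sum to 1; no symmetry required), the communication
    primitive underlying stochastic gradient push. Each node i keeps a
    value x[i] and a weight w[i]; the de-biased estimate is x[i]/w[i]."""
    n = len(x)
    new_x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    new_w = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    return new_x, new_w

# Directed ring on 3 nodes: each node keeps half its mass and pushes the
# other half to its successor. Column-stochastic but not symmetric.
A = [[0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
x, w = [3.0, 6.0, 9.0], [1.0, 1.0, 1.0]
for _ in range(60):
    x, w = push_sum_round(x, w, A)
print([round(xi / wi, 6) for xi, wi in zip(x, w)])
```

All de-biased estimates converge to the network average 6.0 even though `A` is asymmetric, which is exactly the extra design freedom (directed activated graphs) the paper's mixing-matrix optimization exploits.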
[LG-40] Verification Modulo Tested Library Contracts
链接: https://arxiv.org/abs/2604.15533
作者: Abhishek Uppar,Omar Muhammad,Sumanth Prabhu,Deepak D’Souza,Madhusudan P,Adithya Murali
类目: Programming Languages (cs.PL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
*备注:
Abstract:We consider the problem of \emph{verification modulo tested library contracts} as a step towards automating the verification of client programs that use complex libraries. We formulate this problem as the synthesis of modular contracts for the library methods used by the client that are adequate to prove the client correct, and that also pass the scrutiny of a testing engine that tests the library against these contracts. We also consider a new form of method contracts called \emph{contextual contracts} that arise in this setting that hold in the context of the client program, and can often be simpler and easier to infer than classical modular contracts. We provide a counterexample-guided learning framework to solve this problem, in which the synthesizer interacts with a constraint solver as well as the testing engine in order to infer adequate modular/contextual method contracts and inductive invariants for the client. The main synthesis engines we use are generalizing CHC solvers that are realized using ICE learning algorithms. We realize this framework in a tool called \vmtlc and show its efficacy on benchmarks where clients call large libraries.
[LG-41] π_0.7: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities WWW
链接: https://arxiv.org/abs/2604.15483
作者: Physical Intelligence,Bo Ai,Ali Amin,Raichelle Aniceto,Ashwin Balakrishna,Greg Balke,Kevin Black,George Bokinsky,Shihao Cao,Thomas Charbonnier,Vedant Choudhary,Foster Collins,Ken Conley,Grace Connors,James Darpinian,Karan Dhabalia,Maitrayee Dhaka,Jared DiCarlo,Danny Driess,Michael Equi,Adnan Esmail,Yunhao Fang,Chelsea Finn,Catherine Glossop,Thomas Godden,Ivan Goryachev,Lachlan Groom,Haroun Habeeb,Hunter Hancock,Karol Hausman,Gashon Hussein,Victor Hwang,Brian Ichter,Connor Jacobsen,Szymon Jakubczak,Rowan Jen,Tim Jones,Gregg Kammerer,Ben Katz,Liyiming Ke,Mairbek Khadikov,Chandra Kuchi,Marinda Lamb,Devin LeBlanc,Brendon LeCount,Sergey Levine,Xinyu Li,Adrian Li-Bell,Vladislav Lialin,Zhonglin Liang,Wallace Lim,Yao Lu,Enyu Luo,Vishnu Mano,Nandan Marwaha,Aikys Mongush,Liam Murphy,Suraj Nair,Tyler Patterson,Karl Pertsch,Allen Z. Ren,Gavin Schelske,Charvi Sharma,Baifeng Shi,Lucy Xiaoyang Shi,Laura Smith,Jost Tobias Springenberg,Kyle Stachowicz,Will Stoeckle,Jiaming Tang,Jimmy Tanner,Shalom Tekeste,Marcel Torne,Kyle Vedder,Quan Vuong,Anna Walling,Haohuan Wang,Jason Wang,XuDong Wang,Chris Whalen,Samuel Whitmore,Blake Williams,Charles Xu,Sukwon Yoo,Lili Yu,Wuming Zhang,Zhuoyang Zhang,Ury Zhilinsky
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Website: this https URL
Abstract:We present a new robotic foundation model, called \pi_0.7 , that can enable strong out-of-the-box performance in a wide range of scenarios. \pi_0.7 can follow diverse language instructions in unseen environments, including multi-stage tasks with various kitchen appliances, provide zero-shot cross-embodiment generalization, for example enabling a robot to fold laundry without seeing the task before, and perform challenging tasks such as operating an espresso machine out of the box at a level of performance that matches much more specialized RL-finetuned models. The main idea behind \pi_0.7 is to use diverse context conditioning during training. This conditioning information, contained in the prompt, makes it possible to steer the model precisely to perform many tasks with different strategies. It is conditioned not just on a language command that describes what it should do, but on additional multimodal information that also describes the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables \pi_0.7 to use very diverse data, including demonstrations, potentially suboptimal (autonomous) data including failures, and data from non-robot sources. Our experiments evaluate \pi_0.7 across numerous tasks with multiple robot platforms, on tasks that require speed and dexterity, language following, and compositional task generalization.
[LG-42] Lossless Compression via Chained Lightweight Neural Predictors with Information Inheritance
链接: https://arxiv.org/abs/2604.15472
作者: Yuriy Kim,Evgeny Belyaev
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Under review
Abstract:This paper is dedicated to lossless data compression with probability estimation using neural networks. First, we propose a probability estimation architecture based on a chain of neural predictors, so that each unit of the chain is defined as a neural network with the minimum possible number of weights, which is sufficient for efficient compression of data generated by Markov sources of a given order. We show that this architecture allows us to minimize the overall number of weights participating in the probability estimation process depending on the statistical properties of the input data. Second, in order to improve compression efficiency, we introduce an information inheritance mechanism, where the probability estimate obtained by a low-order unit is used at the next higher-order unit. Experimental results show that the proposed lossless data compressor equipped with the chained probability estimation architecture provides compression ratios close to the state-of-the-art PAC compressor. At the same time, it outperforms PAC by a factor of 1.2 to 6.3 in encoding throughput and by a factor of 2.8 to 12.3 in decoding throughput on a consumer GPU.
[LG-43] Python library supporting Discrete Variational Formulations and training solutions with Collocation-based Robust Variational Physics Informed Neural Networks (DVF-CRVPINN)
链接: https://arxiv.org/abs/2604.15398
作者: Tomasz Służalec,Marcin Łoś,Askold Vilkha,Maciej Paszyński
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Python library, Robust Variational Physics-Informed Neural Networks, Collocation Methods, Robust loss, Stokes Equations, Laplace problem
Abstract:We explore the possibility of solving Partial Differential Equations (PDEs) using discrete weak formulations. We propose a programming environment for defining a discrete computational domain, introducing discrete functions defined over a set of points, constructing discrete inner products, and introducing discrete weak formulations employing Kronecker delta test functions. Building on this setup, we propose a discrete neural network representation, training the solution function defined over a discrete set of points and employing discrete finite difference derivatives in the automatic differentiation procedures. As a challenging computational model example, we focus on Stokes equations in two dimensions, defined over a discrete set of points. We train the solution using the discrete weak residual and the Adamax algorithm with discrete automatic differentiation of the discrete gradients. In addition to introducing the Python environment, we also provide a rigorous mathematical formulation based on discrete weak formulations, proving the well-posedness and robustness of the loss function. The solution of the discrete weak formulations is based on neural network training employing a robust loss function that is related to the true error. In this way, we have a robust control of the numerical error during the training of the neural networks. Besides the Stokes formulation, we also explain the functionality of the proposed library using the Laplace problem formulation.
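With Kronecker delta test functions, each discrete weak equation collapses to a pointwise finite-difference identity; a 1-D Laplace sketch illustrates the resulting residual-based loss (illustrative only, not the library's API):

```python
def discrete_weak_residual(u, f, h):
    """Discrete weak residual of -u'' = f on a uniform 1-D grid. Testing
    against Kronecker delta functions reduces each weak equation to a
    pointwise finite-difference identity at the interior grid points."""
    return [-(u[i - 1] - 2 * u[i] + u[i + 1]) / h ** 2 - f[i]
            for i in range(1, len(u) - 1)]

def loss(u, f, h):
    """Residual-based loss: the squared norm of the discrete weak
    residual, which training drives toward zero."""
    return sum(r * r for r in discrete_weak_residual(u, f, h))

# Exact solution u(x) = x(1-x) of -u'' = 2 with u(0) = u(1) = 0; the
# second difference is exact for quadratics, so the loss is ~0.
n, h = 11, 0.1
xs = [i * h for i in range(n)]
u = [x * (1 - x) for x in xs]
f = [2.0] * n
print(round(loss(u, f, h), 10))
```

In the library's setting, `u` would be the output of a neural network evaluated on the grid points, and a loss of this residual form is what makes the training objective track the true error.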
[LG-44] Prompt-Driven Code Summarization: A Systematic Literature Review
链接: https://arxiv.org/abs/2604.15385
作者: Afia Farjana,Zaiyu Cheng,Antonio Mastropaolo
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 42 pages, 9 figures, 10 tables. Systematic Literature Review. This work is currently under review at ACM TOSEM
Abstract:Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering-the design of input prompts to guide model behavior-a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.
[LG-45] TopFeaRe: Locating Critical State of Adversarial Resilience for Graphs Regarding Topology-Feature Entanglement
链接: https://arxiv.org/abs/2604.15370
作者: Xinxin Fan,Wenxiong Chen,Quanliang Jing,Chi Lin,Shaoye Luo,Wenbo Song,Yunfeng Lu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Graph adversarial attacks are usually produced from the two perspectives of topology/structure and node feature, both of which represent the paramount characteristics learned by today’s deep learning models. Although some defense countermeasures have been proposed, they fail to disclose the intrinsic reasons why both aspects are necessary and how they can be adequately fused to co-learn the graph representation. To address this question, we propose in this paper an adversarial defense approach that locates the graph’s critical state of adversarial resilience, resorting to equilibrium-point theory from the discipline of complex dynamic systems (CDS). In brief, our work has three novelties: i) Adversarial-Attack Modeling, i.e. map a graph regime into a CDS, and use the oscillation of the dynamic system to model the behavior of adversarial perturbation; ii) 2D Topology-Feature-Entangled Function Design for Perturbed Graphs, i.e. project graph topology and node features as two characteristic spaces, and define two-dimensional entangled perturbation functions to represent the dynamic variance under adversarial attacks; and iii) Location of the Critical State of Adversarial Resilience, i.e. utilize equilibrium-point theory to locate the graph’s critical state of attack resilience via the perturbation-reflected 2D functions. Finally, multi-facet experiments on five commonly-used realistic datasets validate the effectiveness of our proposed approach, and the results show our approach can significantly outperform the state-of-the-art baselines under four representative graph adversarial attacks.
[LG-46] Photonic AI: A Hybrid Diffractive Holographic Neural System for Passive Optical Real-Time Image Classification
链接: https://arxiv.org/abs/2604.15364
作者: Prakul Sunil Hiremath
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 18 pages, 3 figures. Operator-theoretic formulation and simulation of a hybrid diffractive-holographic optical neural system
Abstract:Edge intelligence is constrained by the energy and latency costs of shuttling data through electronic memory hierarchies. Optical systems offer a fundamentally different computational regime: once an input wavefront is launched into a structured medium, propagation, diffraction, and interference jointly enact a linear transformation whose cost is determined by wave physics rather than by clocked arithmetic. This paper develops a rigorous systems-level treatment of that regime and introduces a hybrid diffractive holographic architecture for image classification. The proposed model couples a Diffractive Optical Neural Network (DONN) with a Holographic Interference-Based Learning (HIBL) operator: a formal map from digitally optimized phase distributions to physically realizable, fabrication-compatible interference patterns embeddable in passive optical elements. We express the full inference pipeline as a composition of encoding, phase modulation, free-space propagation, and intensity measurement operators, making explicit which quantities are learned, which are fixed by design, and where nonlinearity enters through photodetection. This operator-theoretic view resolves a persistent gap in the optical-ML literature between learning a transformation and physically realizing it. In physics-informed simulation on MNIST, a three-layer system with approximately 25,000 phase elements achieves 91.2% test accuracy with propagation-limited nanosecond-scale latency. The primary contribution is not a performance claim but a precise computational framework: learned representations can be physically embedded into structured optical media so that inference is executed by wavefront transformation through a passive, fabricated object rather than by sequential electronic multiply-accumulate operations.
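As a sketch of the operator composition described above (encoding, learned phase modulation, fixed propagation, photodetection), the toy below substitutes a unitary DFT for the free-space propagation operator; the field, phase values, and layer size are illustrative assumptions, not the paper's actual optical model.

```python
import cmath

def dft(field):
    """Unitary discrete Fourier transform (1/sqrt(n) normalization); a toy
    stand-in for the fixed free-space propagation operator."""
    n = len(field)
    return [sum(field[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                for m in range(n)) / n ** 0.5
            for k in range(n)]

def diffractive_layer(field, phase):
    """One DONN layer: pointwise learned phase modulation, then propagation."""
    modulated = [f * cmath.exp(1j * p) for f, p in zip(field, phase)]
    return dft(modulated)

def measure(field):
    """Photodetection: nonlinearity enters only through intensity |field|^2."""
    return [abs(f) ** 2 for f in field]

# Phase modulation and unitary propagation preserve total optical energy.
intensity = measure(diffractive_layer([1 + 0j, 0j, 0j, 0j], [0.3, 0.0, 0.0, 0.0]))
```

Because both the modulation and the propagation are linear and energy-preserving, all classification power in such a system must come from the learned phase pattern and the final intensity readout.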
[LG-47] Mapping High-Performance Regions in Battery Scheduling across Data Uncertainty, Battery Design, and Planning Horizons
链接: https://arxiv.org/abs/2604.15360
作者: Jaime de Miguel Rodriguez,Artjom Vargunin,Brigitta Robin Raudne,David Solis Martin,Yaroslava Mykhailenko,Kaarel Oja
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 40 pages
Abstract:This study presents a triadic analysis of energy storage operation under multi-stage model predictive control, investigating the interplay between data characteristics, forecast uncertainty, planning horizon, and battery c-rate. Synthetic datasets are generated to systematically explore variations in data profiles and uncertainty, enabling parametrization and the construction of relationships that map these characteristics to optimal horizon length. Results reveal the presence of an effective horizon, defined as the look-ahead length beyond which additional forecast information provides limited operational benefit. Accounting for this horizon can reduce computational costs while maintaining optimal performance. The study provides optimal horizon lengths across a broad range of combinations of battery types, uncertainty levels, and data profiles, offering practical guidance for industrial storage operation. It also quantifies revenue losses due to forecast uncertainty, showing that errors can impact performance even for fast batteries. Finally, the framework lays the groundwork for future machine learning approaches that map dataset parametrization to optimal horizons, supporting continuous optimization in industrial settings without heavy computation.
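The effective-horizon idea can be made concrete with a small helper that scans revenues obtained at increasing look-ahead lengths and returns the smallest horizon within a relative tolerance of the best; the revenue values and tolerance below are invented for illustration and are not from the paper.

```python
def effective_horizon(revenues, tol=0.01):
    """Smallest look-ahead length (1-indexed) whose revenue is within a
    relative tolerance of the best revenue observed over all tested horizons.
    Beyond this horizon, extra forecast information adds little value."""
    best = max(revenues)
    for horizon, revenue in enumerate(revenues, start=1):
        if revenue >= best * (1 - tol):
            return horizon
    return len(revenues)

# Made-up revenues with diminishing returns: horizons beyond 3 add little.
horizon = effective_horizon([80.0, 95.0, 99.0, 99.5, 99.6])
```

In a model-predictive-control loop, truncating the forecast at this horizon trades a bounded revenue loss for a shorter optimization problem at every step.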
[LG-48] AutoFlows: Hierarchical Message Flow Mining for System on Chip Designs
链接: https://arxiv.org/abs/2604.15359
作者: Bardia Nadimi,Hao Zheng
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Understanding communication behavior in modern system-on-chip (SoC) designs is critical for functional verification, performance analysis, and post-silicon debugging. Communication traces capture message exchanges among system components and provide valuable insights into system behavior. However, deriving concise communication specifications from such traces remains challenging due to interleaved instances of communication flows, and ambiguous causal relationships among messages. Existing mining approaches often struggle with scalability and ambiguity when traces contain complex interleaving of message patterns across multiple components. These conditions often lead to an explosion in the number of candidate flows and inaccurate extraction of communication behaviors. This paper presents AutoFlows++, a design-architecture-guided hierarchical framework for mining message flows from communication traces of complex SoC designs. AutoFlows++ operates in two stages: local mining followed by global mining. In the local mining stage, simple communication patterns are extracted from traces observed at individual communication interfaces between components. In the global mining stage, these local patterns are composed to identify higher-level message flows that characterize communication behavior across multiple components. Experimental results on both synthetic traces and traces generated from SoC models in GEM5 demonstrate that AutoFlows++ significantly improves flow extraction accuracy compared with prior approaches, highlighting its effectiveness for practical SoC validation tasks.
[LG-49] The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason
链接: https://arxiv.org/abs/2604.15350
作者: Yi Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We discover that large language models exhibit spectral phase transitions in their hidden activation spaces when engaging in reasoning versus factual recall. Through systematic spectral analysis across 11 models spanning 5 architecture families (Qwen, Pythia, Phi, Llama, DeepSeek-R1), we identify seven core phenomena: (1) Reasoning Spectral Compression – 9/11 models show significantly lower \alpha for reasoning ( p < 0.05 ), with larger effects in stronger models; (2) Instruction Tuning Spectral Reversal – base models show reasoning \alpha < factual \alpha , while instruction-tuned models reverse this relationship; (3) Architecture-Dependent Generation Taxonomy – prompt-to-response shifts partition into expansion, compression, and equilibrium regimes; (4) Spectral Scaling Law – \alpha_\text{reasoning} \propto -0.074 \ln N across 4 Qwen base models ( R^2 = 0.46 ); (5) Token-Level Spectral Cascade – per-token \alpha tracking reveals local synchronization that decays exponentially with layer distance, and is weaker for reasoning than factual tasks; (6) Reasoning Step Spectral Punctuation – phase-transition signatures align with reasoning step boundaries; and (7) Spectral Correctness Prediction – spectral \alpha alone achieves AUC = 1.000 (Qwen2.5-7B, late layers) and mean AUC = 0.893 across 6 models in predicting correctness before the final answer is generated. Together, these findings establish a comprehensive spectral theory of reasoning in transformers, revealing that the geometry of thought is universal in direction, architecture-specific in dynamics, and predictive of outcome.
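The abstract does not specify how \alpha is estimated; a common convention, assumed here, fits the exponent of a power-law eigenvalue decay (e.g. of an activation covariance or Gram matrix) by least squares in log-log coordinates.

```python
import math

def fit_spectral_alpha(eigenvalues):
    """Least-squares fit of alpha in lambda_k ~ k^(-alpha) on log-log axes,
    given positive eigenvalues of an activation covariance/Gram matrix."""
    xs = [math.log(k + 1) for k in range(len(eigenvalues))]
    ys = [math.log(v) for v in sorted(eigenvalues, reverse=True)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # alpha > 0 for a decaying spectrum

# A synthetic spectrum decaying as k^(-1.5) should recover alpha close to 1.5.
alpha = fit_spectral_alpha([(k + 1) ** -1.5 for k in range(100)])
```

Under this convention, "spectral compression" (phenomenon 1) means the fitted exponent is smaller for reasoning-task activations than for factual-recall ones.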
[LG-50] Adaptive multi-fidelity optimization with fast learning rates AISTATS
链接: https://arxiv.org/abs/2604.16239
作者: Come Fiegel,Victor Gabillon,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at International Conference on Artificial Intelligence and Statistics (AISTATS) 2020
Abstract:In multi-fidelity optimization, biased approximations of varying costs of the target function are available. This paper studies the problem of optimizing a locally smooth function with a limited budget, where the learner has to make a tradeoff between the cost and the bias of these approximations. We first prove lower bounds for the simple regret under different assumptions on the fidelities, based on a cost-to-bias function. We then present the Kometo algorithm which achieves, with additional logarithmic factors, the same rates without any knowledge of the function smoothness and fidelity assumptions, and improves previously proven guarantees. We finally empirically show that our algorithm outperforms previous multi-fidelity optimization methods without the knowledge of problem-dependent parameters.
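Kometo itself adapts the cost-bias tradeoff online without knowing problem parameters; as a much simpler illustration of the tradeoff the paper studies, the sketch below picks, for a fixed budget, the fidelity minimizing an assumed bias-plus-statistical-error proxy. All numbers and the error model are hypothetical.

```python
def best_fidelity(fidelities, budget, noise=1.0):
    """Pick the fidelity index minimizing a crude error proxy:
    bias(m) + noise / sqrt(n_evals), with n_evals = budget / cost(m).
    A toy tradeoff only; Kometo makes this choice adaptively."""
    def error(cost_bias):
        cost, bias = cost_bias
        n_evals = max(budget / cost, 1.0)
        return bias + noise / n_evals ** 0.5
    return min(range(len(fidelities)), key=lambda i: error(fidelities[i]))

# Hypothetical fidelities as (cost, bias): cheap/biased through exact/expensive.
fidelities = [(1.0, 0.5), (10.0, 0.1), (100.0, 0.0)]
small_budget_choice = best_fidelity(fidelities, budget=100.0)
large_budget_choice = best_fidelity(fidelities, budget=10000.0)
```

Even this toy version shows the qualitative behavior: small budgets favor biased-but-cheap approximations, while large budgets shift toward the unbiased, expensive fidelity.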
[LG-51] A Wasserstein Geometric Framework for Hebbian Plasticity
链接: https://arxiv.org/abs/2604.16052
作者: Ulrich Tan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注: Preprint. 75 pages including appendices and bibliography
Abstract:We introduce the Tan-HWG framework (Hebbian-Wasserstein-Geometry), a geometric theory of Hebbian plasticity in which memory states are modeled as probability measures evolving through Wasserstein minimizing movements. Hebbian learning rules are formalized as Hebbian energies satisfying a sequential stability condition, ensuring well-posed fiberwise JKO updates, optimal-transport realizations, and an energy descent inequality. This variational structure induces a fundamental separation between internal and observable dynamics. Internal memory states evolve along Wasserstein geodesics in a latent curved space, while observable quantities, such as effective synaptic weights, arise through geometric projection maps into external spaces. Simplicial projections recover classical affine schemes (including exponential moving averages and mirror descent), while revealing synaptic competition and pruning as geometric consequences of mass redistribution. Hilbertian projections provide a geometric account of phase alignment and multi-scale coherence. Classical neural networks appear as flat projections of this curved dynamics, while the framework naturally accommodates richer distributional representations, including structural weights and embedding memories, and their spectral extensions in complex internal spaces. Under mild Lipschitz regularity assumptions, including a quasi-stationary “sleep-mode” regime, we establish the existence of continuous-time limit curves. This yields a variational formulation of memory consolidation as a perturbed Wasserstein gradient flow. The framework thus provides a unified geometric foundation for synaptic plasticity, representation dynamics, and context-dependent computation.
[LG-52] Discovering quantum phenomena with Interpretable Machine Learning
链接: https://arxiv.org/abs/2604.16015
作者: Paulin de Schoulepnikoff,Hendrik Poulsen Nautrup,Hans J. Briegel,Gorka Muñoz-Gil
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:
Abstract:Interpretable machine learning techniques are becoming essential tools for extracting physical insights from complex quantum data. We build on recent advances in variational autoencoders to demonstrate that such models can learn physically meaningful and interpretable representations from a broad class of unlabeled quantum datasets. From raw measurement data alone, the learned representation reveals rich information about the underlying structure of quantum phase spaces. We further augment the learning pipeline with symbolic methods, enabling the discovery of compact analytical descriptors that serve as order parameters for the distinct regimes emerging in the learned representations. We demonstrate the framework on experimental Rydberg-atom snapshots, classical shadows of the cluster Ising model, and hybrid discrete-continuous fermionic data, revealing previously unreported phenomena such as a corner-ordering pattern in the Rydberg arrays. These results establish a general framework for the automated and interpretable discovery of physical laws from diverse quantum datasets. All methods are available through qdisc, an open-source Python library designed to make these tools accessible to the broader community.
[LG-53] A Structure-Preserving Graph Neural Solver for Parametric Hyperbolic Conservation Laws
链接: https://arxiv.org/abs/2604.15617
作者: Jiamin Jiang,Shanglin Lv,Jingrun Chen
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Hyperbolic conservation laws govern a wide range of transport-driven dynamics featuring shocks, contact discontinuities, and complex wave interactions, posing distinct challenges for deep-learning-based surrogate modeling. While classical numerical methods provide robust and physically admissible solutions, their computational cost restricts applicability in many-query tasks such as parametric studies and design optimization. Conversely, existing neural surrogates offer rapid inference but often fail to respect intrinsic PDE structures, leading to non-physical artifacts, rollout instability, and poor generalization. We present an interpretable, structure-preserving graph neural solver that bridges classical numerical principles with graph neural networks (GNNs). The network is designed as a learned reconstruction-and-flux operator rather than a black-box state updater, thereby inherently preserving key properties such as local conservation and upwinding. Inspired by Arbitrary high-order DERivatives schemes, we further recast message-passing GNNs as high-order space-time predictors, enabling conservative and stable neural updates with large time steps. Evaluation is performed on challenging supersonic flow benchmarks spanning broad parametric variations in geometry, initial/boundary conditions, and flow regimes. The neural solver achieves superior long-horizon rollout stability and accuracy compared with strong surrogate baselines, outperforms low-order discretizations, and delivers orders-of-magnitude runtime speedups over high-resolution simulations.
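The conservation and upwinding properties the solver bakes in are classical finite-volume ingredients. As a minimal reference point (not the paper's ADER-inspired network), one conservative first-order upwind step for linear advection reads:

```python
def upwind_step(u, velocity, dx, dt):
    """One conservative first-order upwind step for u_t + a*u_x = 0 (a > 0)
    with periodic boundaries: each cell exchanges a flux with its neighbours,
    so total mass is preserved exactly."""
    n = len(u)
    flux = [velocity * u[i - 1] for i in range(n)]  # flux through left face of cell i
    # Conservative update: cell i loses through its right face, gains through its left.
    return [u[i] - dt / dx * (flux[(i + 1) % n] - flux[i]) for i in range(n)]

state = [0.0, 1.0, 2.0, 3.0]
next_state = upwind_step(state, velocity=1.0, dx=1.0, dt=0.5)
```

A "learned reconstruction-and-flux operator" replaces the trivial flux above with a trained one while keeping exactly this flux-difference update, which is why local conservation survives the learning.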
[LG-54] ExoNet: Multimodal Deep Learning for TESS Exoplanet Candidate Identification via Phase-Folded Light Curves, Stellar Parameters, and Multi-Head Attention Fusion
链接: https://arxiv.org/abs/2604.15560
作者: Md.Rashadul Islam
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 4 tables
Abstract:NASA’s Transiting Exoplanet Survey Satellite (TESS) has identified thousands of exoplanet candidates, yet many remain unconfirmed due to the limitations of manual vetting processes. This paper presents ExoNet, a multimodal deep learning framework that integrates phase-folded global and local light curve representations with stellar parameters using a late-fusion architecture combining 1D Convolutional Neural Networks and Multi-Head Attention. Trained on labeled Kepler data, ExoNet achieves strong classification performance and demonstrates effective generalization to TESS data. Applied to 200 unconfirmed TESS planet candidates, the model identifies multiple high-confidence candidates, including several within the habitable zone. The results highlight the effectiveness of multimodal fusion and attention mechanisms in automated exoplanet candidate validation.
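Phase folding, the preprocessing step behind the global and local light-curve views mentioned above, can be sketched in a few lines; the timestamps, fluxes, and period below are toy values, not TESS or Kepler data.

```python
def phase_fold(times, fluxes, period, t0=0.0):
    """Fold a light curve on a known orbital period: map each timestamp to a
    phase in [-0.5, 0.5) relative to epoch t0, then sort samples by phase so
    repeated transits stack on top of each other."""
    phases = [((t - t0) / period + 0.5) % 1.0 - 0.5 for t in times]
    order = sorted(range(len(times)), key=lambda i: phases[i])
    return [phases[i] for i in order], [fluxes[i] for i in order]

# Toy samples: with period 2.0, timestamps 1.0 and 3.0 land at the same phase.
phases, folded_flux = phase_fold([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0],
                                 period=2.0)
```

The folded curve is then binned at two resolutions, a coarse "global" view of the full orbit and a fine "local" view around phase zero, before being fed to the CNN branches.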
[LG-55] PRIM-cipal components analysis
链接: https://arxiv.org/abs/2604.15538
作者: Tianhao Liu,Daniel Andrés Díaz-Pachón,J. Sunil Rao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 12 pages, 46 figures
Abstract:Supervised No Free Lunch Theorems (NFLTs) are well studied, yet unsupervised NFLTs remain underexplored. For elliptical distributions, we prove that there exist two equally optimal, scientifically meaningful bump-hunting strategies that are exact opposites, with no universal winner. Specifically, peeling k orthogonal dimensions from \mathbb{R}^d ( d \ge k ), retaining an inter-quantile region of probability 1-\alpha per peeled dimension, maximizes total variance and Frobenius norm when the k smallest principal components (called pettiest components) are selected, and minimizes them when the selected dimensions are the k leading principal components. These optima inspire PRIM-based bump-hunting algorithms either by minimizing variance or by minimizing volume, thereby motivating an NFLT. We test our results on the Fashion-MNIST database, showing that peeling the largest principal components captures multiplicity, while peeling the smallest principal components isolates popular styles.
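A minimal sketch of the per-dimension inter-quantile peeling described above; the orthogonal directions are supplied by the caller (in the paper they would be the leading or pettiest principal components), and the data here are illustrative.

```python
def peel(points, directions, alpha):
    """Peel each supplied orthogonal direction in turn, keeping only points
    whose projection lies in the central inter-quantile region of empirical
    probability 1 - alpha for that direction."""
    kept = list(points)
    for d in directions:
        proj = sorted(sum(p_i * d_i for p_i, d_i in zip(p, d)) for p in kept)
        lo = proj[int(len(proj) * alpha / 2)]
        hi = proj[int(len(proj) * (1 - alpha / 2)) - 1]
        kept = [p for p in kept
                if lo <= sum(p_i * d_i for p_i, d_i in zip(p, d)) <= hi]
    return kept

# Peeling one axis of 100 evenly spaced points with alpha = 0.2 keeps the middle 80%.
points = [(float(i), 0.0) for i in range(100)]
kept = peel(points, directions=[(1.0, 0.0)], alpha=0.2)
```

The paper's dichotomy is then a question of which directions to pass in: the pettiest components maximize the retained variance of the peeled box, the leading ones minimize it.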
[LG-56] One-Shot Generative Flows: Existence and Obstructions
链接: https://arxiv.org/abs/2604.15439
作者: Panos Tsimpos,Daniel Sharp,Youssef Marzouk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:We study dynamic measure transport for generative modelling in the setting of a stochastic process X_\bullet whose marginals interpolate between a source distribution P_0 and a target distribution P_1 while remaining independent, i.e., when (X_0,X_1)\sim P_0\otimes P_1 . Conditional expectations of this process X_\bullet define an ODE whose flow map transports from P_0 to P_1 . We discuss when such a process induces a straight-line flow, namely one whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. We first develop multiple characterizations of straightness in terms of PDEs involving the conditional statistics of the process. Then, we prove that straightness under endpoint independence exhibits a sharp dichotomy. On one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent endpoints and the space-time geometry of this process’ flow map. Taken together, these results provide a structural theory of when straight generative flows can, and cannot, exist.
[LG-57] Machine learning approaches to uncover the neural mechanisms of motivated behaviour: from ADHD to individual differences in effort and reward sensitivity
链接: https://arxiv.org/abs/2604.15363
作者: Nam Trinh
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: PhD thesis, Dublin City University, December 2025. 194 pages
Abstract:Motivated behaviour relies on the brain’s capacity to evaluate effort and reward. Dysregulation within these processes contributes to a spectrum of conditions, from hyperactivity in attention-deficit/hyperactivity disorder (ADHD) to diminished goal-directed behaviour in apathy. This thesis investigates the neural mechanisms underlying ADHD using electroencephalography (EEG) and examines individual differences in effort and reward sensitivity using neuroimaging, applying machine learning approaches through three main studies. In Study 1, task-based and resting-state EEG were employed with machine learning models to classify adult individuals with ADHD and healthy controls. Machine learning classifiers trained on task-based EEG during a stop signal task outperformed those trained on resting-state EEG, with the strongest predictive features arising from gamma-band spectral power over fronto-central and parietal regions. In Study 2, diffusion MRI and whole-brain permutation-based analyses identified associations between white matter integrity and computationally modelled parameters reflecting effort and reward sensitivity, with SMA-connected tracts emerging as a central hub. In Study 3, grey matter volumes from structural T1-weighted MRI were used to examine correlates of effort sensitivity, reward sensitivity, and subclinical apathy, with machine learning confirming robust decoding of reward sensitivity and apathy levels. Across studies, fronto-parietal circuits emerged as central to effort valuation and reward processing. These findings may serve as neural biomarkers for improving diagnostic accuracy in ADHD and motivational impairments, and for guiding personalised neurotechnological interventions.
[LG-58] A methodology to rank importance of frequencies and channels in electromyography data with Decision Tree classifiers
链接: https://arxiv.org/abs/2604.15353
作者: Albert A. Nasybullin,Nursultan Abdullaev,Maksim A. Baranov,Viacheslav V. Koshman,Vitaly A. Mahonin
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 16 pages, 9 figures, 1 table. Published in Russian Journal of Nonlinear Dynamics, 2024
Abstract:This study presents a methodology for identifying the most informative frequencies and channels in electromyography (EMG) data to evaluate muscle recovery using Decision Tree classifiers. EMG signals, recorded from the vastus lateralis muscle during squat exercises, were analyzed across varying rest intervals to assess optimal recovery periods. By employing single Decision Tree classifiers, the study enhances interpretability, offering insights into feature importance - essential for applications in medical and sports settings where transparency is critical. The experimental protocol utilized a grid search for hyperparameter tuning and cross-validation to address class imbalance, ultimately achieving a reliable classification of rest intervals based on power spectral density features. The results indicate that a limited subset of highly informative features provides sufficient accuracy, suggesting that streamlined, interpretable models are effective for the evaluation of muscle recovery. This approach can guide future research in developing compact, robust models adapted to EMG-based diagnostics.
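Single Decision Trees derive feature importance from many splits; as a lighter stand-in that conveys the same ranking idea, the sketch below scores each feature (e.g. a PSD frequency bin or channel) by the best accuracy a one-threshold stump achieves. The data are toy values, not EMG recordings.

```python
def rank_features_by_stump(samples, labels):
    """Rank feature indices by the best accuracy a single-threshold decision
    stump reaches on a binary task; a minimal proxy for tree importance."""
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        best = 0.0
        for t in sorted(set(s[j] for s in samples)):
            acc = sum((s[j] > t) == bool(y)
                      for s, y in zip(samples, labels)) / len(labels)
            best = max(best, acc, 1.0 - acc)  # either split direction may win
        scores.append(best)
    return sorted(range(n_features), key=lambda j: -scores[j])

# Feature 1 separates the two classes perfectly; feature 0 is uninformative.
ranking = rank_features_by_stump(
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], [0, 0, 1, 1])
```

A full tree combines many such splits, but the single-split view is exactly what makes the resulting importance ranking easy to interpret in clinical or sports settings.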