本篇博文主要内容为 2025-06-26 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱地址。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址。
目录
概览 (2025-06-26)
今日共更新380篇论文,其中:
- 自然语言处理共51篇(Computation and Language (cs.CL))
- 人工智能共118篇(Artificial Intelligence (cs.AI))
- 计算机视觉共68篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共134篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] MMSearch-R1: Incentivizing LMMs to Search
【速读】: 该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在真实世界场景中部署时,因信息复杂性和动态性而需要依赖外部知识源的问题。现有方法如检索增强生成(Retrieval-Augmented Generation, RAG)和提示工程搜索代理依赖于刚性流程,常导致搜索行为低效或过度。论文提出的解决方案是MMSearch-R1,这是首个端到端的强化学习框架,使LMMs能够在真实互联网环境中执行按需的多轮搜索。其关键在于整合图像和文本搜索工具,并通过基于结果的奖励机制与搜索惩罚引导模型决定何时以及如何调用这些工具,从而实现高效、按需的搜索行为。
链接: https://arxiv.org/abs/2506.20670
作者: Jinming Wu,Zihao Deng,Wei Li,Yiding Liu,Bo You,Bo Li,Zejun Ma,Ziwei Liu
机构: ByteDance(字节跳动); S-Lab, NTU(南洋理工大学S-Lab)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Code: this https URL
Abstract:Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
zh
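下面用一个极简的 Python 片段示意摘要中"基于结果的奖励 + 搜索惩罚"的奖励设计;函数签名、惩罚系数与精确匹配的判定方式均为假设,仅用于说明机制,并非论文官方实现:

```python
def outcome_reward(answer: str, gold: str, num_search_calls: int,
                   penalty: float = 0.1) -> float:
    """基于结果的奖励:答案正确计 1 分,再按搜索调用次数线性扣分。
    penalty 取 0.1 为假设值,论文摘要未给出具体数值。"""
    correctness = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    # 搜索惩罚:促使模型只在必要时调用图像/文本搜索工具
    return correctness - penalty * num_search_calls

# 答案正确但调用了两次搜索:奖励被扣减,鼓励"按需搜索"
print(outcome_reward("Paris", "paris", num_search_calls=2))  # 0.8
```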
[NLP-1] Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
【速读】: 该论文试图解决当前在大型语言模型(Large Language Models, LLMs)中对人类价值权衡(value trade-offs)的动态和多维表征进行解释的工具有限的问题。其解决方案的关键在于应用一种先进的认知模型——即礼貌言语的认知模型,以评估LLMs在不同模型设置下(如前沿黑盒模型中的推理“努力”程度以及开源模型的强化学习后训练动态)是否能够体现类似人类的价值权衡。该方法揭示了推理模型在信息效用上高于社交效用,并强调了基础模型选择和预训练数据对模型训练初期效用值的长期影响。
链接: https://arxiv.org/abs/2506.20666
作者: Sonia K. Murthy,Rosie Zhao,Jennifer Hu,Sham Kakade,Markus Wulfmeier,Peng Qian,Tomer Ullman
机构: Kempner Institute for Natural and Artificial Intelligence, Harvard University (哈佛大学自然与人工智能研究所); Google DeepMind (谷歌深度思维); Department of Psychology, Harvard University (哈佛大学心理学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures
Abstract:Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person’s feelings. These value trade-offs are an integral part of human decision-making and language use, however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called “cognitive models” provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker’s competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning “effort” in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs’ training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
zh
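摘要中的礼貌言语认知模型把说话者效用建模为多个相互竞争的效用函数的加权组合。下面给出一个示意性的简化(效用的对数形式与全部数值均为假设),用来说明"信息效用 vs. 社交效用"的权衡结构:

```python
import math

def speaker_utility(truthfulness: float, listener_feeling: float,
                    w_info: float, w_soc: float) -> float:
    """总效用 = w_info * 信息效用 + w_soc * 社交效用(假设的简化形式)。
    两个输入取 (0, 1] 区间,分别刻画话语的真实度与听者感受。"""
    return w_info * math.log(truthfulness) + w_soc * math.log(listener_feeling)

# 偏重信息效用的权重下,"严厉真话"胜过"善意委婉"
harsh  = speaker_utility(0.95, 0.20, w_info=0.8, w_soc=0.2)
polite = speaker_utility(0.40, 0.90, w_info=0.8, w_soc=0.2)
print(harsh > polite)  # True:对应摘要中推理模型信息效用偏高的行为模式
```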
[NLP-2] The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在多智能体场景中缺乏理论心智(Theory of Mind, ToM)能力的问题,即模型难以准确推理其他智能体的“心理”状态,从而影响其在合作与竞争环境中的表现。解决方案的关键在于提出Decrypto,这是一个基于游戏的多智能体推理与ToM基准测试平台,其设计旨在消除其他基准中常见的混淆因素,提供一个更真实、互动性更强的评估环境。Decrypto结合了认知科学、计算语用学和多智能体强化学习的思想,为研究LLMs的ToM能力提供了新的实验框架,并揭示了当前先进模型在这一领域表现不佳的现象。
链接: https://arxiv.org/abs/2506.20664
作者: Andrei Lupu,Timon Willi,Jakob Foerster
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 41 pages, 19 figures
Abstract:As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the “mental” states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.
zh
[NLP-3] Memento: Note-Taking for Your Future Self
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在需要紧密结合推理与检索的任务中表现不佳的问题,特别是在多跳问答任务中。其解决方案的关键在于提出一种名为Memento的提示策略,该策略通过三个阶段实现:首先将复杂问题分解为较小的步骤,然后利用LLMs动态构建事实数据库,最后将这些事实拼接起来以解决问题。这种分步处理和动态知识整合的方法显著提升了现有提示策略的性能。
链接: https://arxiv.org/abs/2506.20642
作者: Chao Wan,Albert Gong,Mihir Mishra,Carl-Leander Henneking,Claas Beger,Kilian Q. Weinberger
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) excel at reasoning-only tasks, but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. To overcome these limitations, we introduce a prompting strategy that first decomposes a complex question into smaller steps, then dynamically constructs a database of facts using LLMs, and finally pieces these facts together to solve the question. We show how this three-stage strategy, which we call Memento, can boost the performance of existing prompting strategies across diverse settings. On the 9-step PhantomWiki benchmark, Memento doubles the performance of chain-of-thought (CoT) when all information is provided in context. On the open-domain version of 2WikiMultiHopQA, CoT-RAG with Memento improves over vanilla CoT-RAG by more than 20 F1 percentage points and over the multi-hop RAG baseline, IRCoT, by more than 13 F1 percentage points. On the challenging MuSiQue dataset, Memento improves ReAct by more than 3 F1 percentage points, demonstrating its utility in agentic settings.
zh
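Memento 的"分解、建库、拼接"三阶段可以写成如下流程示意(llm 为假设的文本补全接口,提示词为示意,非论文原文):

```python
def memento(question: str, llm) -> str:
    """Memento 三阶段提示策略的极简示意(非官方实现)。"""
    # 阶段 1:把复杂问题分解为较小的有序子问题
    steps = llm(f"将下面的问题分解为有序的子问题,每行一个:\n{question}").splitlines()

    # 阶段 2:逐个回答子问题,动态构建事实数据库
    facts = []
    for step in steps:
        known = "\n".join(facts)
        facts.append(llm(f"已知事实:\n{known}\n请回答:{step}"))

    # 阶段 3:把事实拼接起来,求解原问题
    return llm("已知事实:\n" + "\n".join(facts) + f"\n据此回答原问题:{question}")
```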
[NLP-4] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
【速读】: 该论文旨在解决扩散大语言模型(dLLMs)在代码生成任务中训练与推理机制尚不成熟的问题,特别是其解码行为与自回归(AR)模型的差异及其在强化学习(RL)训练中的优化。论文的关键解决方案是通过系统研究dLLMs的去噪过程和RL方法,提出一种名为耦合-GRPO(coupled-GRPO)的新采样方案,该方案通过构建互补掩码噪声来降低令牌对数似然估计的方差并保持训练效率,从而提升模型在代码生成基准测试中的性能,并减少对AR因果性的依赖。
链接: https://arxiv.org/abs/2506.20639
作者: Shansan Gong,Ruixiang Zhang,Huangjie Zheng,Jiatao Gu,Navdeep Jaitly,Lingpeng Kong,Yizhe Zhang
机构: Apple(苹果); The University of Hong Kong(香港大学)
类目: Computation and Language (cs.CL)
备注: preprint
Abstract:Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder’s performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR causality during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. this https URL.
zh
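coupled-GRPO 的关键是为训练中的补全构造互补掩码噪声,使每个 token 的对数似然在一对前向传播中恰好被估计一次,从而降低方差。下面用 PyTorch 给出掩码构造这一步的示意(掩码比例为假设):

```python
import torch

def coupled_masks(seq_len: int, mask_ratio: float = 0.5):
    """生成一对互补掩码:每个位置恰好被其中一个掩码覆盖。"""
    mask_a = torch.rand(seq_len) < mask_ratio
    mask_b = ~mask_a  # 互补掩码
    return mask_a, mask_b

m1, m2 = coupled_masks(8)
assert torch.all(m1 ^ m2)  # 每个 token 恰好被掩盖一次,对应"互补掩码噪声"
```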
[NLP-5] PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
【速读】: 该论文试图解决在使用低秩适应(Low-Rank Adaptation, LoRA)进行模型微调时,如何有效确定适配器(adapter)应放置的模块类型的问题。现有方法在适配器放置策略上存在分歧,例如原始LoRA论文建议将适配器放置在注意力模块中,而其他工作则建议将其放置在多层感知机(MLP)模块中。论文提出的解决方案关键在于引入PLoP(Precise LoRA Placement),这是一种轻量级方法,能够根据预训练模型和微调任务自动识别适合放置LoRA适配器的模块类型,从而提升微调效果。
链接: https://arxiv.org/abs/2506.20629
作者: Soufiane Hayou,Nikhil Ghosh,Bin Yu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: TL;DR: A lightweight module type selection method for LoRA finetuning. PLoP gives precise placements for LoRA adapters for improved performance
Abstract:Low-Rank Adaptation (LoRA) is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency by, for example, setting the learning rate, the rank, and the initialization. Another improvement axis is adapter placement strategy: when using LoRA, practitioners usually pick module types to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with nonconclusive results: original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in the MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that allows automatic identification of module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. We demonstrate that PLoP consistently outperforms, and in the worst case competes, with commonly used placement strategies through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning.
zh
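PLoP 在给定预训练模型与微调任务时自动选出应放置 LoRA 适配器的模块类型。以下用 peft 库勾勒这一流程;其中打分函数是占位实现(论文的具体度量未在摘要中给出),模块名取 GPT-2 的命名,均为假设:

```python
import random
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def score_module_type(model, module_type: str) -> float:
    """占位打分函数:PLoP 实际使用的度量请参考原论文。"""
    random.seed(len(module_type))  # 仅为可复现的演示
    return random.random()

model = AutoModelForCausalLM.from_pretrained("gpt2")
candidates = ["c_attn", "c_proj", "c_fc"]  # GPT-2 中的注意力与 MLP 模块名
# 选出得分最高的模块类型作为 LoRA 放置位置
best = max(candidates, key=lambda t: score_module_type(model, t))
model = get_peft_model(model, LoraConfig(r=8, target_modules=[best]))
model.print_trainable_parameters()
```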
[NLP-6] Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
【速读】: 该论文试图解决在高风险领域部署基于大语言模型(Large Language Models, LLMs)的智能体所带来的安全与伦理风险问题。其核心挑战在于如何有效引导智能体的行为以确保其符合伦理规范,避免造成现实世界中的严重后果。解决方案的关键在于将智能体行为引导问题建模为一种模型编辑任务,即行为编辑(Behavior Editing),通过精确且高效地修改LLM,同时保持其整体能力,从而实现对智能体伦理行为的动态调控。
链接: https://arxiv.org/abs/2506.20606
作者: Baixiang Huang,Zhen Tan,Haoran Wang,Zijie Liu,Dawei Li,Ali Payani,Huan Liu,Tianlong Chen,Kai Shu
机构: Emory University (埃默里大学); Arizona State University (亚利桑那州立大学); UNC-Chapel Hill (北卡罗来纳大学教堂山分校); Cisco Research (思科研究院)
类目: Computation and Language (cs.CL)
备注: Main paper: 9 pages; total: 18 pages (including appendix). Code, data, results, and additional resources are available at: this https URL
Abstract:Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent’s global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
zh
[NLP-7] When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
【速读】: 该论文试图解决在多语言、多任务环境下,如何稳健地扩展推理时计算资源以提升生成式AI(Generative AI)性能的问题。现有方法主要针对英语及少数领域如数学和代码进行优化,但缺乏对开放性任务、可形式化验证任务以及多种语言的通用性。论文的关键解决方案在于适应不同领域和语言设置,调整基于温度变化的采样策略和选择策略,并提出专门针对多语言和多任务推理场景的新型采样与选择策略,从而在多个语言和任务上实现显著性能提升。
链接: https://arxiv.org/abs/2506.20544
作者: Ammar Khairi,Daniel D’souza,Ye Shen,Julia Kreutzer,Sara Hooker
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. However, work to date has focused on English and a handful of domains such as math and code. In contrast, we are most interested in techniques that generalize across open-ended tasks, formally verifiable tasks, and across languages. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both sampling strategy based on temperature variation and selection strategy must be adapted to account for diverse domains and varied language settings. We evaluate existing selection methods, revealing that strategies effective in English often fail to generalize across languages. We propose novel sampling and selection strategies specifically adapted for multilingual and multi-task inference scenarios, and show they yield notable gains across languages and tasks. In particular, our combined sampling and selection methods lead to an average +6.8 jump in win-rates for our 8B models on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At larger scale, Command-A (111B model) equipped with our methods, shows +9.0 improvement in win-rates on the same benchmark with just five samples against single-sample decoding, a substantial increase at minimal cost. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.
zh
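摘要中"并行采样多个输出再选择"的基本框架可概括为如下极简示意(llm 与 judge 为假设接口;论文的核心结论是采样温度与选择策略都需针对语言和任务调整):

```python
def best_of_n(prompt: str, llm, judge, temperatures=(0.3, 0.7, 1.0)):
    """温度多样化采样 + 选择策略的示意:judge 给每个候选打分。"""
    candidates = [llm(prompt, temperature=t) for t in temperatures]
    return max(candidates, key=judge)  # 选择策略本身也需适配多语言场景
```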
[NLP-8] Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)中使用强化学习(Reinforcement Learning, RL)时,离策略(Off-policy)方法虽然在实现简便性和数据效率上优于在策略(On-policy)技术,但往往导致次优性能的问题。解决方案的关键在于研究介于离策略RL和监督微调之间的算法,通过分析一种简单的离策略REINFORCE算法,其中优势函数定义为A=r−V,其中r为奖励,V为可调节基线。该方法通过调整基线V来控制对高奖励样本的强调或对低奖励样本的惩罚,理论分析表明当基线V下界期望奖励时,该算法具有策略改进保证,并且实验验证了其在控制环境和推理任务中的有效性。
链接: https://arxiv.org/abs/2506.20520
作者: Charles Arnal,Gaëtan Narozniak,Vivien Cabannes,Yunhao Tang,Julia Kempe,Remi Munos
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as A = r - V, with r a reward and V some tunable baseline. Intuitively, lowering V emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline V lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
zh
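论文分析的离策略 REINFORCE 以 A = r - V 作为优势,其单步损失可以写成如下 PyTorch 示意(仅说明基线 V 的作用,省略重要性加权等细节,数值为演示用):

```python
import torch

def off_policy_reinforce_loss(log_probs: torch.Tensor,
                              rewards: torch.Tensor,
                              baseline: float) -> torch.Tensor:
    """L = -mean[(r - V) * log pi]。V 下界期望奖励时具有策略改进保证。"""
    advantage = rewards - baseline  # A = r - V
    return -(advantage * log_probs).mean()

log_probs = torch.log(torch.tensor([0.6, 0.3, 0.1]))
rewards = torch.tensor([1.0, 0.2, 0.0])
# 调低 V 更强调高奖励样本;调高 V 则更重地惩罚低奖励样本
print(off_policy_reinforce_loss(log_probs, rewards, baseline=0.1))
```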
[NLP-9] OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
【速读】: 该论文试图解决基础语言模型在强化学习(Reinforcement Learning, RL)微调过程中表现差异的问题,特别是如何提升模型在推理密集型任务中的RL适应性。其解决方案的关键在于通过优化中段训练策略,包括引入高质量数学语料库(如MegaMath-Web-Pro)、增加问答风格数据及长链式思维(long chain-of-thought, CoT)示例,并采用分阶段的训练策略(Stable-then-Decay),以提升模型的推理深度与RL训练稳定性,从而增强模型的RL兼容性。
链接: https://arxiv.org/abs/2506.20512
作者: Zengzhi Wang,Fan Zhou,Xuefeng Li,Pengfei Liu
机构: Shanghai Jiao Tong University (上海交通大学); SII (SII); GAIR Lab (GAIR实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages; The first three authors contribute to this work equally
Abstract:Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
zh
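Stable-then-Decay 两阶段中段训练策略(先以恒定学习率训练 200B token,再在 20B token 上衰减)对应的学习率调度可示意如下;基础学习率与余弦衰减形状均为假设:

```python
import math

def stable_then_decay_lr(tokens_seen: float, base_lr: float = 3e-4,
                         stable_tokens: float = 200e9,
                         decay_tokens: float = 20e9) -> float:
    """阶段一:恒定学习率;阶段二:余弦衰减到 0(衰减形式为假设)。"""
    if tokens_seen <= stable_tokens:
        return base_lr
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(stable_then_decay_lr(100e9))  # 稳定阶段:3e-4
print(stable_then_decay_lr(210e9))  # 衰减阶段中点:约 1.5e-4
```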
[NLP-10] ReCode: Updating Code API Knowledge with Reinforcement Learning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对外部库API频繁更新时适应能力不足的问题,这一问题源于其训练数据中包含的过时API知识。解决方案的关键在于提出一种名为ReCode的框架,该框架基于规则的强化学习(rule-based Reinforcement learning),通过构建约2000条数据的版本迁移数据集来训练LLMs,并引入改进的字符串相似性度量作为强化学习的奖励机制,从而提升模型在动态API场景下的代码生成性能。
链接: https://arxiv.org/abs/2506.20495
作者: Haoze Wu,Yunzhi Yao,Wenhao Yu,Huajun Chen,Ningyu Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Work in progress
Abstract:Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at this https URL.
zh
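ReCode 以"改进的字符串相似度"作为代码评估的 RL 奖励。作为参照,可以用标准库 difflib 写一个朴素的基线(论文采用的是其改进版度量,此处仅示意奖励的形态):

```python
import difflib

def code_similarity_reward(generated: str, reference: str) -> float:
    """朴素的字符串相似度奖励,取值 [0, 1];非论文的改进版度量。"""
    return difflib.SequenceMatcher(None, generated, reference).ratio()

# 以一次真实的 pandas API 迁移为例:append 在新版本中被 concat 取代
old_call = "df = df.append(row, ignore_index=True)"
new_call = "df = pd.concat([df, row], ignore_index=True)"
print(code_similarity_reward(old_call, new_call))
```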
[NLP-11] Counterfactual Influence as a Distributional Quantity ICML2025
【速读】: 该论文试图解决机器学习模型在训练过程中对训练数据样本的记忆现象所带来的隐私和泛化性问题,特别是传统基于自影响(self-influence)的评估方法可能无法全面反映实际风险的问题。其解决方案的关键在于将反事实影响(counterfactual influence)视为一个分布量,考虑所有训练样本对单个样本记忆程度的综合影响,而非仅关注单一样本的自影响。通过计算小规模语言模型中训练样本之间的完整影响分布并分析其特性,研究揭示了仅依赖自影响可能导致对记忆风险的严重低估,尤其是在存在(近似)重复样本的情况下。
链接: https://arxiv.org/abs/2506.20481
作者: Matthieu Meeus,Igor Shilov,Georgios Kaissis,Yves-Alexandre de Montjoye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Workshop on The Impact of Memorization on Trustworthy Foundation Models (MemFM) @ ICML 2025
Abstract:Machine learning models are known to memorize samples from their training data, raising concerns around privacy and generalization. Counterfactual self-influence is a popular metric to study memorization, quantifying how the model’s prediction for a sample changes depending on the sample’s inclusion in the training dataset. However, recent work has shown memorization to be affected by factors beyond self-influence, with other training samples, in particular (near-)duplicates, having a large impact. We here study memorization treating counterfactual influence as a distributional quantity, taking into account how all training samples influence how a sample is memorized. For a small language model, we compute the full influence distribution of training samples on each other and analyze its properties. We find that solely looking at self-influence can severely underestimate tangible risks associated with memorization: the presence of (near-)duplicates seriously reduces self-influence, while we find these samples to be (near-)extractable. We observe similar patterns for image classification, where simply looking at the influence distributions reveals the presence of near-duplicates in CIFAR-10. Our findings highlight that memorization stems from complex interactions across training data and is better captured by the full influence distribution than by self-influence alone.
zh
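"把反事实影响当作分布量"意味着对每个样本考察整个训练集对它的影响,而非只看影响矩阵对角线上的自影响。下面用 numpy 的玩具数据说明为何近重复样本会压低自影响(数值纯属构造):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
influence = np.abs(rng.normal(0.1, 0.05, size=(n, n)))  # influence[i, j]: 样本 i 对样本 j
influence[1, 3] = influence[3, 1] = 0.9                 # 构造一对近重复样本
np.fill_diagonal(influence, 0.05)                       # 近重复使自影响显得很低

self_influence = influence.diagonal().copy()
others = influence.copy()
np.fill_diagonal(others, -np.inf)
peer_influence = others.max(axis=0)  # 其它样本对该样本影响的最大值

# 样本 3:自影响低、同伴影响高,只看自影响会低估其被记忆/可提取的风险
print(self_influence[3], peer_influence[3])
```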
[NLP-12] GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在部署和推理过程中因模型规模庞大而导致的计算成本过高的问题。其解决方案的关键在于提出一种新的模型压缩策略,通过有策略地组合或合并微调后的模型变体中的层,从而在保持原始模型能力的同时降低参数数量。该方法将最优模型定制问题建模为零阶优化问题,并支持三种操作:层移除、从不同候选模型中选择层以及层合并。实验结果表明,该方法在保持高性能的同时显著减少了参数量,例如在Llama2-13B模型家族上,压缩后的模型保持了约97.3%的原始性能,同时移除了约25%的参数。
链接: https://arxiv.org/abs/2506.20480
作者: Guinan Su,Li Shen,Lu Yin,Shiwei Liu,Yanwu Yang,Jonas Geiping
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); ELLIS Institute Tübingen (图宾根ELLIS研究所); Tübingen AI Center (图宾根人工智能中心); University of Tübingen (图宾根大学); Sun Yat-sen University (中山大学); University of Surrey (萨里大学); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model’s abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning, for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3% of the original performance while removing ~25% of parameters, significantly outperforming previous state-of-the-art methods. The code is available at this https URL.
zh
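GPTailor 把"层移除、跨候选模型选层、层合并"作为零阶优化的搜索空间。下面用随机爬山给出一个结构示意(evaluate 为假设的黑盒评估函数,层数与候选模型数为假设):

```python
import random

def propose(config: list, num_models: int) -> list:
    """对层配置随机施加一种操作:移除 / 选层 / 合并。"""
    new = list(config)
    i = random.randrange(len(new))
    op = random.choice(["remove", "select", "merge"])
    if op == "remove":
        new[i] = None                             # (1) 移除该层
    elif op == "select":
        new[i] = (random.randrange(num_models),)  # (2) 从某候选模型选层
    else:
        new[i] = tuple(random.sample(range(num_models), 2))  # (3) 合并两模型同位层
    return new

def zero_order_search(evaluate, num_layers=40, num_models=3, steps=100):
    """evaluate(config) -> float 为黑盒指标;随机爬山即可示意零阶优化。"""
    best = [(0,)] * num_layers
    best_score = evaluate(best)
    for _ in range(steps):
        cand = propose(best, num_models)
        score = evaluate(cand)
        if score > best_score:
            best, best_score = cand, score
    return best
```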
[NLP-13] Knowledge-Aware Diverse Reranking for Cross-Source Question Answering
【速读】: 该论文旨在解决在大规模文档集合中高效检索与问题相关支持文档的问题,特别是在面对多样化目标主题、问题类型和知识组织方式时的挑战。其解决方案的关键在于提出了一种知识感知的多样化重排序(knowledge-aware diverse reranking)RAG(Retrieval-Augmented Generation)流水线,该方法在SIGIR 2025 LiveRAG竞赛中取得了第一名的成绩。
链接: https://arxiv.org/abs/2506.20476
作者: Tong Zhou
机构: Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:This paper presents Team Marikarp’s solution for the SIGIR 2025 LiveRAG competition. The competition’s evaluation set, automatically generated by DataMorgana from internet corpora, encompassed a wide range of target topics, question types, question formulations, audience types, and knowledge organization methods. It offered a fair evaluation of retrieving question-relevant supporting documents from a 15M documents subset of the FineWeb corpus. Our proposed knowledge-aware diverse reranking RAG pipeline achieved first place in the competition.
zh
[NLP-14] Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
【速读】: 该论文试图解决对话中说话时间分配的量化与分析问题,具体关注对话层面的说话时间分布以及导致这种分布的底层动态机制。其解决方案的关键在于引入一个计算框架,用于量化说话时间的分配及其动态过程,并基于多个直观的变异轴构建了说话时间共享动态的类型学。该框架不仅能够描述对话的整体平衡性,还能揭示不同类型的动态对参与者感知的影响,从而为计算机中介通信平台的设计提供新的工具和视角。
链接: https://arxiv.org/abs/2506.20474
作者: Kaixiang Zhang,Justine Zhang,Cristian Danescu-Niculescu-Mizil
机构: Cornell University (康奈尔大学); University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:An intrinsic aspect of every conversation is the way talk-time is shared between multiple speakers. Conversations can be balanced, with each speaker claiming a similar amount of talk-time, or imbalanced when one talks disproportionately. Such overall distributions are the consequence of continuous negotiations between the speakers throughout the conversation: who should be talking at every point in time, and for how long? In this work we introduce a computational framework for quantifying both the conversation-level distribution of talk-time between speakers, as well as the lower-level dynamics that lead to it. We derive a typology of talk-time sharing dynamics structured by several intuitive axes of variation. By applying this framework to a large dataset of video-chats between strangers, we confirm that, perhaps unsurprisingly, different conversation-level distributions of talk-time are perceived differently by speakers, with balanced conversations being preferred over imbalanced ones, especially by those who end up talking less. Then we reveal that – even when they lead to the same level of overall balance – different types of talk-time sharing dynamics are perceived differently by the participants, highlighting the relevance of our newly introduced typology. Finally, we discuss how our framework offers new tools to designers of computer-mediated communication platforms, for both human-human and human-AI communication.
zh
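对话层面的说话时间分配可直接由说话人片段统计得到。下面给出一个计算时长占比与整体均衡度的示意(均衡度取两人占比之差的简化定义,为假设):

```python
def talk_time_shares(segments):
    """segments: (speaker, start, end) 列表;返回各说话人时长占比。"""
    totals = {}
    for speaker, start, end in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    grand = sum(totals.values())
    return {s: t / grand for s, t in totals.items()}

segments = [("A", 0, 30), ("B", 30, 42), ("A", 42, 70), ("B", 70, 80)]
shares = talk_time_shares(segments)
balance = 1 - abs(shares["A"] - shares["B"])  # 1 表示完全均衡(简化定义)
print(shares, balance)
```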
[NLP-15] Probing AI Safety with Source Code
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在安全性和对齐人类价值观方面存在的不足问题,特别是其在安全关键型应用中可能带来的不安全和有害用户体验。论文提出的解决方案的关键在于引入一种称为“思维代码”(Code of Thought, CoDoT)的提示策略,该策略将自然语言输入转换为表示相同意图的简单代码,从而有效评估LLMs的安全性。通过CoDoT,研究发现多种先进的LLMs在安全性方面存在显著缺陷,例如GPT-4 Turbo的毒性增加了16.5倍,DeepSeek R1完全失败,且在七种现代LLMs中平均毒性增加了300%。此外,递归应用CoDoT可进一步提升毒性两倍,表明当前LLMs在安全机制上的不足,强调了从基础原理出发评估安全性的必要性。
链接: https://arxiv.org/abs/2506.20471
作者: Ujwal Narayan,Shreyas Chaudhari,Ashwin Kalyan,Tanmay Rajpurohit,Karthik Narasimhan,Ameet Deshpande,Vishvak Murahari
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt “Make the statement more toxic: text” to: “make_more_toxic(text)”. We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo’s toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.
zh
[NLP-16] An Agent ic System for Rare Disease Diagnosis with Traceable Reasoning
【速读】: 该论文旨在解决罕见疾病诊断中存在的时间延迟和准确性不足的问题,这一问题主要源于罕见疾病的临床异质性、个体患病率低以及大多数临床医生对罕见病缺乏熟悉度。解决方案的关键在于提出DeepRare系统,该系统是首个基于大型语言模型(Large Language Model, LLM)的罕见病诊断代理系统,能够处理多种类型的临床输入,并生成带有可解释推理链的排名诊断假设,从而提升诊断的准确性和透明度。
链接: https://arxiv.org/abs/2506.20430
作者: Weike Zhao,Chaoyi Wu,Yanjie Fan,Xiaoman Zhang,Pengcheng Qiu,Yuze Sun,Xiao Zhou,Yanfeng Wang,Ya Zhang,Yongguo Yu,Kun Sun,Weidi Xie
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Xinhua Hospital affiliated to Shanghai Jiao Tong University School of Medicine (上海交通大学医学院附属新华医院); Harvard Medical School (哈佛医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:
Abstract:Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks integrating over 40 specialized tools and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases. In HPO-based evaluations, DeepRare significantly outperforms other 15 methods, like traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser’s 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts achieves 95.40% agreements. Furthermore, the DeepRare system has been implemented as a user-friendly web application this http URL.
zh
[NLP-17] TAPS: Tool-Augmented Personalisation via Structured Tagging
【速读】: 该论文试图解决在目标导向对话代理中,用户偏好未能有效融入工具使用的问题(即个性化工具使用的不足)。解决方案的关键在于引入TAPS(Tagging and Uncertainty-based Personalised Tool Selection),通过结构化标签工具和基于不确定性的工具检测器,显著提升大型语言模型对用户偏好的整合能力。
链接: https://arxiv.org/abs/2506.20409
作者: Ekaterina Taktasheva,Jeff Dalton
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving the new state-of-the-art for open source models on the NLSI task.
zh
[NLP-18] Biomed-Enriched: A Biomedical Dataset Enriched with LLM s for Pretraining and Extracting Rare and Hidden Content
【速读】: 该论文试图解决临床文本在生物医学自然语言处理(NLP)中难以获取的问题,尤其是由于隐私限制导致医院记录无法公开共享。其解决方案的关键在于构建一个名为Biomed-Enriched的生物医学文本数据集,该数据集通过两阶段标注流程从PubMed中提取并精炼出高质量的临床案例段落,从而提供了一个大规模、公开可用的临床病例资源,为生物医学和临床NLP研究提供了支持。
链接: https://arxiv.org/abs/2506.20331
作者: Rian Touchent,Nathan Godey,Eric de la Clergerie
机构: Sorbonne Université (索邦大学); INRIA Paris (法国国家信息与自动化研究所巴黎分部)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Dataset link: this https URL
Abstract:We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs with over 450K high-quality ones from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Hence, our dataset provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements, with clinical upsampling boosting performance by ~5% on MMLU ProfMed and educational quality filtering improving MedQA and MedMCQA by ~1%. Combinations of these techniques led to faster convergence, reaching same performance with a third of training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.
zh
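有了段落级的类型、领域与教育质量标注,就可以按元数据筛选子集。下面用 datasets 库示意筛选"高质量临床病例段落"的做法;数据集标识与字段名均为假设,请以官方发布的 schema(见备注中的数据集链接)为准:

```python
from datasets import load_dataset

# 数据集名与字段名为示意,非官方确认的标识
ds = load_dataset("almanach/biomed-enriched", split="train")

clinical_cases = ds.filter(
    lambda x: x["type"] == "clinical_case"       # 段落类型
    and x["domain"] == "clinical"                # 领域标签
    and x["educational_quality"] >= 4            # 教育质量分(1-5)
)
print(len(clinical_cases))
```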
[NLP-19] From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents
【速读】: 该论文旨在解决历史文献中鲁棒的文档布局分析(Document Layout Analysis, DLA)问题,特别是在处理具有复杂页面组织结构的文献时。其解决方案的关键在于评估不同目标检测架构在多个标注数据集上的表现,并强调使用有向边界框(Oriented Bounding Boxes, OBB)对于准确建模历史手稿非笛卡尔特性的必要性。研究结果表明,基于卷积神经网络(CNN)的OBB模型在视觉多样性和复杂性较高的数据集上表现出更好的泛化能力,而Transformer模型在结构化布局中更具全局上下文感知优势。
链接: https://arxiv.org/abs/2506.20326
作者: Sergio Torres Aguilar
机构: University of Luxembourg(卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Databases (cs.DB)
备注:
Abstract:Robust Document Layout Analysis (DLA) is critical for the automated processing and understanding of historical documents with complex page organizations. This paper benchmarks five state-of-the-art object detection architectures on three annotated datasets representing a spectrum of codicological complexity: The e-NDP, a corpus of Parisian medieval registers (1326-1504); CATMuS, a diverse multiclass dataset derived from various medieval and modern sources (ca.12th-17th centuries) and HORAE, a corpus of decorated books of hours (ca.13th-16th centuries). We evaluate two Transformer-based models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, and YOLO-World). Our findings reveal significant performance variations dependent on model architecture, data set characteristics, and bounding box representation. In the e-NDP dataset, Co-DETR achieves state-of-the-art results (0.752 mAP@.50:.95), closely followed by YOLOv11X-OBB (0.721). Conversely, on the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB significantly outperforms all other models (0.564 and 0.568, respectively). This study unequivocally demonstrates that using Oriented Bounding Boxes (OBB) is not a minor refinement but a fundamental requirement for accurately modeling the non-Cartesian nature of historical manuscripts. We conclude that a key trade-off exists between the global context awareness of Transformers, ideal for structured layouts, and the superior generalization of CNN-OBB models for visually diverse and complex documents.
zh
[NLP-20] Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
【速读】: 该论文试图解决在大规模文本语料库中动态建模叙事演变的问题,尤其是在面对高计算或财务成本时的挑战。其关键解决方案是结合大型语言模型(Large Language Model)的语言理解能力与主题模型的大规模适用性,利用叙事政策框架(Narrative Policy Framework)来检测特定主题的叙事变化,并通过变化点检测方法识别具有代表性的文档,进而由大型语言模型自动解析内容与叙事变化的区别。
链接: https://arxiv.org/abs/2506.20269
作者: Kai-Robin Lange,Tobias Schmidt,Matthias Reccius,Henrik Müller,Michael Roos,Carsten Jentsch
机构: TU Dortmund University (多特蒙德工业大学); Ruhr University Bochum (鲁尔大学波鸿分校)
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注: 14 pages, 1 figure
Abstract:With rapidly evolving media narratives, it has become increasingly critical to not just extract narratives from a given corpus but rather investigate how they develop over time. While popular narrative extraction methods such as Large Language Models do well in capturing typical narrative elements or even the complex structure of a narrative, applying them to an entire corpus comes with obstacles, such as a high financial or computational cost. We propose a combination of the language understanding capabilities of Large Language Models with the large scale applicability of topic models to dynamically model narrative shifts across time using the Narrative Policy Framework. We apply a topic model and a corresponding change point detection method to find changes that concern a specific topic of interest. Using this model, we filter our corpus for documents that are particularly representative of that change and feed them into a Large Language Model that interprets the change that happened in an automated fashion and distinguishes between content and narrative shifts. We employ our pipeline on a corpus of The Wall Street Journal newspaper articles from 2009 to 2023. Our findings indicate that a Large Language Model can efficiently extract a narrative shift if one exists at a given point in time, but does not perform as well when having to decide whether a shift in content or a narrative shift took place.
zh
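流水线中"主题占比时间序列 + 变点检测"这一步可以用 ruptures 库示意(主题占比序列为模拟数据,惩罚系数为假设;变点附近的代表性文档随后交给 LLM 判别是内容变化还是叙事变化):

```python
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(1)
# 模拟某主题随时间的占比:在第 100 期发生转移
signal = np.concatenate([rng.normal(0.2, 0.03, 100),
                         rng.normal(0.5, 0.03, 80)]).reshape(-1, 1)

algo = rpt.Pelt(model="rbf").fit(signal)
change_points = algo.predict(pen=10)
print(change_points)  # 预期在 100 附近报告一个变点
```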
[NLP-21] Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue
【速读】: 该论文试图解决在人机交互中检测误解(miscommunication)的问题,这一问题对于维持用户参与度和信任至关重要。尽管人类能够通过言语和非言语线索轻松检测交流中的错误,但机器人在解读非言语反馈方面仍面临重大挑战。论文提出的解决方案关键在于利用机器学习模型对多模态数据进行分析,以评估其在检测对话中误解的能力。研究使用了包含240次人类-机器人对话的多模态数据集,并引入了四种不同类型的对话失败,旨在测试当前最先进的计算机视觉模型在识别误解方面的有效性。然而,实验结果表明,即使使用最先进的模型,其性能也仅略高于随机猜测,这揭示了在对话中识别机器人误解的根本性局限。
链接: https://arxiv.org/abs/2506.20268
作者: Ruben Janssens,Jens De Bock,Sofie Labat,Eva Verhelst,Veronique Hoste,Tony Belpaeme
机构: Ghent University–imec(根特大学–imec); Ghent University(根特大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted at the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2025)
Abstract:Detecting miscommunication in human-robot interaction is a critical function for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversations through both verbal and non-verbal cues, robots face significant challenges in interpreting non-verbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates the effectiveness of machine learning models in detecting miscommunications in robot dialogue. Using a multi-modal dataset of 240 human-robot conversations, where four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users provided feedback on whether they perceived an error, enabling an analysis of the models’ ability to accurately detect robot mistakes. Despite using state-of-the-art models, the performance barely exceeds random chance in identifying miscommunication, while on a dataset with more expressive emotional content, they successfully identified confused states. To explore the underlying cause, we asked human raters to do the same. They could also only identify around half of the induced miscommunications, similarly to our model. These results uncover a fundamental limitation in identifying robot miscommunications in dialogue: even when users perceive the induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This knowledge can shape expectations of the performance of computer vision models and can help researchers to design better human-robot conversations by deliberately eliciting feedback where needed.
zh
[NLP-22] Language Modeling by Language Models
【速读】: 该论文试图解决如何利用大语言模型(Large Language Models, LLMs)来建模发现新型语言模型(Language Model, LM)架构的过程。其关键解决方案是提出一种多智能体LLM方法,称为Genesys,该方法模拟了传统研究的各个阶段,包括构思、文献检索、设计实现、生成式预训练和下游评估。Genesys采用了一种“尺度阶梯”(Ladder of Scales)策略,在不断扩大的模型规模(14M ∼ 350M参数)上逐步提出、对抗性评审、实现并选择性验证新设计,同时预算逐渐缩减。为提高发现效率与可分解性,Genesys引入了一种新颖的遗传编程骨干架构,实验证明其在成功设计生成方面优于常用的直接提示生成流程。
链接: https://arxiv.org/abs/2506.20249
作者: Junyan Cheng,Peter Clark,Kyle Richardson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre-training, and downstream evaluation (verification). Using ideas from scaling laws, our system, Genesys, employs a Ladder of Scales approach; new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M \sim 350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., \sim 86% percentage point improvement in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified through pre-training) and find the best designs to be highly competitive with known architectures (e.g., outperform GPT2, Mamba2, etc., on 6/9 common benchmarks). We couple these results with comprehensive system-level ablations and formal results, which give broader insights into the design of effective autonomous discovery systems.
zh
[NLP-23] CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment
【速读】: 该论文旨在解决非母语者语音流利度评估(fluency assessment)中的挑战,特别是在捕捉语音节奏、停顿和不流畅现象方面。其解决方案的关键在于提出一种基于分块(chunk-based)的方法,融合自监督学习(SSL)模型(如Wav2Vec2、HuBERT和WavLM)的互补优势,并结合层次化CNN-BiLSTM框架进行细粒度的时间分析。通过Silero-VAD对语音进行呼吸群分块,减少过度分割带来的伪影,并利用可学习加权机制融合SSL嵌入,同时引入块级流利度标记,从而提升流利度评估的准确性。
链接: https://arxiv.org/abs/2506.20243
作者: Papa Séga Wade,Mihai Andries,Ioannis Kanellos,Thierry Moudenc
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 5 pages, accepted for presentation at EUSIPCO 2025
Abstract:Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and Speechocean762, our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing this http URL-based segmentation baselines. These findings highlight chunk-based multi-SSL fusion for robust fluency evaluation, though future work should explore generalization to dialects with irregular prosody.
zh
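多路 SSL 嵌入的"可学习加权融合"可以用 softmax 权重实现。下面是一个 PyTorch 示意(模型数取 3 对应 Wav2Vec2/HuBERT/WavLM,嵌入维度与张量布局为假设):

```python
import torch
import torch.nn as nn

class WeightedSSLFusion(nn.Module):
    """对多路 SSL 嵌入做可学习加权融合的示意。"""
    def __init__(self, num_models: int = 3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_models))  # 可学习融合权重

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (num_models, batch, time, dim)
        w = torch.softmax(self.logits, dim=0)
        return torch.einsum("m,mbtd->btd", w, embeddings)

fusion = WeightedSSLFusion()
fused = fusion(torch.randn(3, 2, 50, 768))
print(fused.shape)  # torch.Size([2, 50, 768])
```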
[NLP-24] Enhancing Large Language Models through Structured Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在执行涉及逻辑推理和系统规划的复杂推理任务时存在的困难,这些问题主要源于模型依赖隐式的统计关系而缺乏结构化知识。解决方案的关键在于通过显式标注推理步骤将非结构化数据转换为结构化格式,并利用该结构化数据通过监督微调(Supervised Fine-Tuning, SFT)增强LLMs的推理能力。此外,还引入了Group Relative Policy Optimization (GRPO)方法,并结合MAX-Flow和Longest Common Subsequence (LCS)两种创新算法,以提升推理效果并降低计算复杂度。
链接: https://arxiv.org/abs/2506.20241
作者: Yubo Dong,Hehe Fan
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:Recent Large Language Models (LLMs) have significantly advanced natural language processing and automated decision-making. However, these models still encounter difficulties when performing complex reasoning tasks involving logical deduction and systematic planning, primarily due to their reliance on implicit statistical relationships without structured knowledge representation. Inspired by cognitive science and neurosymbolic AI, we introduce a novel approach to enhance LLMs through explicit structured reasoning. First, we convert unstructured data into structured formats by explicitly annotating reasoning steps. We then employ this structured dataset to train LLMs through Supervised Fine-Tuning (SFT). Additionally, we enhance the structured reasoning capabilities of LLMs using Group Relative Policy Optimization (GRPO), incorporating two innovative algorithms–MAX-Flow and Longest Common Subsequence (LCS)–which notably improve reasoning effectiveness and reduce computational complexity. Experimental results from fine-tuning a DeepSeek-R1-Distill-Qwen-1.5B model demonstrate concise reasoning, robust performance across various scenarios, and improved compatibility with optimization techniques, validating the efficacy of structured reasoning integration in LLMs.
zh
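摘要提到的最长公共子序列(LCS)算法可用经典动态规划实现;将其用作推理步骤序列的奖励是一种自然做法,下面的归一化方式为假设:

```python
def lcs_length(a: list, b: list) -> int:
    """O(len(a)*len(b)) 动态规划求最长公共子序列长度。"""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

gen = ["设未知数", "列方程", "化简", "求解"]
ref = ["设未知数", "列方程", "求解", "验算"]
print(lcs_length(gen, ref) / len(ref))  # 0.75,可作为结构化推理的奖励信号
```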
[NLP-25] Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
【速读】: 该论文试图解决在自然语言处理(Natural Language Processing, NLP)中,传统方法通过聚合标注者观点以建立单一真实标签所导致的少数观点被边缘化的问题,特别是在主观任务中,标注者可能因个人偏好而产生系统性分歧。解决方案的关键在于提出一种多视角方法,利用软标签(soft labels)来捕捉人类标注的多样性,从而促进更具包容性和多元化的视角感知模型的发展。该方法在多个主观文本分类任务中表现出更接近人类标签分布的能力,并提升了分类性能。
链接: https://arxiv.org/abs/2506.20209
作者: Benedetta Muscato,Lucia Passaro,Gizem Gezici,Fosca Giannotti
机构: Scuola Normale Superiore (意大利圣安娜高等学院); University of Pisa (比萨大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In the realm of Natural Language Processing (NLP), common approaches for handling human disagreement consist of aggregating annotators’ viewpoints to establish a single ground truth. However, prior studies show that disregarding individual opinions can lead to the side effect of underrepresenting minority perspectives, especially in subjective tasks, where annotators may systematically disagree because of their preferences. Recognizing that labels reflect the diverse backgrounds, life experiences, and values of individuals, this study proposes a new multi-perspective approach using soft labels to encourage the development of the next generation of perspective-aware models, more inclusive and pluralistic. We conduct an extensive analysis across diverse subjective text classification tasks, including hate speech, irony, abusive language, and stance detection, to highlight the importance of capturing human disagreements, often overlooked by traditional aggregation methods. Results show that the multi-perspective approach not only better approximates human label distributions, as measured by Jensen-Shannon Divergence (JSD), but also achieves superior classification performance (higher F1 scores), outperforming traditional approaches. However, our approach exhibits lower confidence in tasks like irony and stance detection, likely due to the inherent subjectivity present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model uncertainty and uncover meaningful insights into model predictions.
zh
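多视角方法用软标签保留标注者之间的分歧。由多名标注者的标签构造软标签分布,并用 Jensen-Shannon 散度(JSD)衡量模型预测与人类分布的差距,示意如下:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def soft_label(annotations, num_classes=2):
    """由多个标注者的离散标签得到软标签分布,保留少数观点。"""
    counts = np.bincount(annotations, minlength=num_classes)
    return counts / counts.sum()

human = soft_label([1, 1, 0, 1, 0])           # 3 人标注类别 1,2 人标注类别 0
model_pred = np.array([0.35, 0.65])           # 模型输出的类别分布
print(human)                                  # [0.4 0.6]
print(jensenshannon(human, model_pred) ** 2)  # 平方后即为 JSD 散度
```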
[NLP-26] Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation
【速读】: 该论文试图解决句子嵌入模型在内在语义相似性测试与下游翻译评估任务之间表现不一致的问题,即模型在语义属性探测任务中表现优异却未必能在实际应用任务中取得良好效果。其解决方案的关键在于通过基于COMET的度量标准对嵌入模型进行微调,以探索嵌入空间的平滑性与下游任务性能之间的关系,从而揭示句子嵌入中“可操作化语义”的重要性。
链接: https://arxiv.org/abs/2506.20203
作者: Petra Barančíková,Ondřej Bojar
机构: Charles University (查理大学); Faculty of Mathematics and Physics (数学与物理学院); Institute of Formal and Applied Linguistics (形式与应用语言学研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream tasks, emphasizing the need for more research into ‘operationalizable semantics’ in sentence embeddings, or more in-depth downstream task datasets (here, translation evaluation).
zh
[NLP-27] How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models?
【速读】: 该论文旨在解决对话情感识别(Conversational Emotion Recognition, CER)中由于任务主观性导致的高精度应用构建难题。其解决方案的关键在于通过在上下文学习(In-Context Learning, ICL)中检索高质量示例来提升CER性能,特别是采用增强型示例检索策略,通过改写等方法优化检索到的示例,从而显著提高模型在多个数据集上的表现。
链接: https://arxiv.org/abs/2506.20199
作者: Mengqi Wang,Tiantian Feng,Shrikanth Narayanan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have enabled a wide variety of real-world applications in various domains. However, creating a high-performing application with high accuracy remains challenging, particularly for subjective tasks like emotion recognition. Inspired by the SLT 2024 GenSER Challenge, this study investigates approaches to improving conversational emotion recognition (CER) by LLMs. Specifically, we explore how to retrieve high-quality examples in in-context learning (ICL) to enhance CER. We propose various strategies based on random and augmented example retrieval and also analyze the impact of conversational context on CER accuracy. Experiments were conducted on three datasets: IEMOCAP, MELD, and EmoryNLP. The results show that augmented example retrieval consistently outperforms other techniques under investigation across all datasets, highlighting the importance of retrieving coherent targeted examples and enhancing them through paraphrasing.
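下面是一个示例检索策略的简化草图:按向量相似度选取 ICL 示例,并可选地用改写函数进行增强。embed 与 paraphrase 均为假设接口,论文实际比较了随机检索与增强检索等多种策略,此处仅示意其中一种可行做法。

```python
import numpy as np

def retrieve_icl_examples(embed, query, pool, k=4, paraphrase=None):
    # pool: [(对话文本, 情感标签), ...];embed/paraphrase 为假设接口
    q = embed(query)
    def cos(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(pool, key=lambda ex: -cos(embed(ex[0])))[:k]
    if paraphrase is not None:  # 增强检索:对示例做改写后再放入提示
        ranked = [(paraphrase(text), label) for text, label in ranked]
    return ranked
```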
zh
[NLP-28] COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees
【速读】: 该论文旨在解决基础模型在生成文本时存在的不确定性量化(Uncertainty Quantification, UQ)问题,特别是如何在保证选择性预测中假发现率(False Discovery Rate, FDR)控制的前提下,有效识别并减少生成文本中的幻觉现象。其解决方案的关键在于提出COIN框架,该框架通过校准统计上有效的阈值,在用户指定的FDR约束下筛选每个问题的单一生成答案。COIN利用校准集估计经验错误率,并结合Clopper-Pearson置信区间方法建立真实错误率(即FDR)的高概率上限,从而在测试数据上实现FDR控制的同时显著提升样本保留率。
链接: https://arxiv.org/abs/2506.20178
作者: Zhiyuan Wang,Jinhao Duan,Qingni Wang,Xiaofeng Zhu,Tianlong Chen,Xiaoshuang Shi,Kaidi Xu
机构: University of Electronic Science and Technology of China (中国电子科技大学); Drexel University (德雷塞尔大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Uncertainty quantification (UQ) for foundation models is essential to identify and mitigate potential hallucinations in automatically generated text. However, heuristic UQ approaches lack formal guarantees for key metrics such as the false discovery rate (FDR) in selective prediction. Previous work adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing prediction sets, but these sets often contain incorrect candidates, limiting their practical utility. To address this, we propose COIN, an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question under user-specified FDR constraints. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods such as Clopper-Pearson to establish a high-probability upper bound on the true error rate (i.e., FDR). This enables the selection of the largest uncertainty threshold that ensures FDR control on test data while significantly increasing sample retention. We demonstrate COIN’s robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data across both general and multimodal text generation tasks. Furthermore, we show that employing alternative upper bound constructions and UQ strategies can further boost COIN’s power performance, which underscores its extensibility and adaptability to diverse application scenarios.
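COIN 的核心统计步骤可以用如下草图理解:在校准集上用 Clopper-Pearson 区间给出错误率(即 FDR)的高概率上界,并选取满足约束的最大不确定性阈值。示例数据为随机生成,阈值搜索方式为简化假设,并非论文官方实现。

```python
import numpy as np
from scipy.stats import beta

def cp_upper(k, n, delta=0.05):
    # 保留 n 个答案中有 k 个错误时,错误率的 1-delta Clopper-Pearson 上界
    return 1.0 if k == n else beta.ppf(1 - delta, k + 1, n - k)

rng = np.random.default_rng(0)
u = rng.random(500)                          # 假设:每个答案的不确定性分数
err = (rng.random(500) < 0.15).astype(int)   # 假设:答案是否错误

target_fdr, best_t = 0.1, None
for t in np.sort(u):                         # 保留 u <= t 的答案,寻找满足约束的最大阈值
    keep = u <= t
    if keep.sum() and cp_upper(int(err[keep].sum()), int(keep.sum())) <= target_fdr:
        best_t = t
print("选定阈值:", best_t)
```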
zh
[NLP-29] SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs
【速读】: 该论文旨在解决多变量时间序列预测中结构依赖建模与语义层面推理及任务适应性不足的问题。现有结构编码器虽能有效建模特征交互,但缺乏语义推理和任务迁移能力;而大语言模型(Large Language Models, LLMs)虽具备强泛化能力,却无法直接处理原始时间序列输入。解决方案的关键在于提出SEED框架,其核心是通过嵌入驱动解码的结构编码器,整合四个阶段:感知标记的编码器用于片段提取、投影模块将片段对齐至语言模型嵌入、语义重编程机制将片段映射到任务感知原型,以及冻结的语言模型用于预测,从而实现数值模式与语义推理的高效对齐。
链接: https://arxiv.org/abs/2506.20167
作者: Fengze Li,Yue Wang,Yangle Liu,Ming Huang,Dou Hong,Jieming Ma
机构: Xi’an Jiaotong-Liverpool University (西安交通大学-利物浦大学); University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multivariate time series forecasting requires models to simultaneously capture variable-wise structural dependencies and generalize across diverse tasks. While structural encoders are effective in modeling feature interactions, they lack the capacity to support semantic-level reasoning or task adaptation. Conversely, large language models (LLMs) possess strong generalization capabilities but remain incompatible with raw time series inputs. This gap limits the development of unified, transferable prediction systems. Therefore, we introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, a semantic reprogramming mechanism that maps patches to task-aware prototypes, and a frozen language model for prediction. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning. Empirical results demonstrate that the proposed method achieves consistent improvements over strong baselines, and comparative studies on various datasets confirm SEED’s role in addressing the structural-semantic modeling gap.
zh
[NLP-30] AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在生成长链式思维时导致的高延迟和高成本问题,而这种“过度思考”并未带来相应的准确率提升。解决方案的关键在于引入AALC(一种轻量级、准确性感知的长度奖励机制),该机制通过将验证准确率纳入奖励函数,并采用平滑且动态调度的长度惩罚策略,在训练过程中动态平衡正确性与简洁性,从而在保证或提升原始准确率的前提下,显著减少响应长度并抑制冗余推理模式。
链接: https://arxiv.org/abs/2506.20160
作者: Ruosen Li,Ziming Luo,Quan Zhang,Ruochen Li,Ben Zhou,Ali Payani,Xinya Du
机构: University of Texas at Dallas (德克萨斯大学达拉斯分校); Arizona State University (亚利桑那州立大学); Cisco Research (思科研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large reasoning models (LRMs) achieve impressive reasoning capabilities by generating lengthy chain-of-thoughts, but this “overthinking” incurs high latency and cost without commensurate accuracy gains. In this work, we introduce AALC, a lightweight, accuracy-aware length reward integrated into reinforcement learning that dynamically balances correctness and brevity during training. By incorporating validation accuracy into the reward and employing a smooth, dynamically scheduled length penalty, AALC delays length penalty until target performance is met. Through extensive experiments across standard and out-of-distribution math benchmarks, we show that our approach reduces response length by over 50% while maintaining or even improving the original accuracy. Furthermore, qualitative analysis reveals that our method curbs redundant reasoning patterns such as excessive subgoal setting and verification, leading to structurally refined outputs rather than naive truncation. We also identify that efficiency gains are accompanied by reduced interpretability: models trained with AALC omit some narrative framing and explanatory context. These findings highlight the potential of reward-based strategies to guide LRMs toward more efficient, generalizable reasoning paths.
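论文描述的“平滑、动态调度的长度惩罚”可以用如下假设性的奖励函数草图来理解;门控与系数的具体函数形式均为本文示意时的假设,并非官方实现。

```python
def aalc_style_reward(correct, length, val_acc,
                      target_acc=0.9, max_len=4096, lam=0.5):
    # 验证准确率未达标时门控系数接近 0,几乎不惩罚长度;
    # 达标后平滑增大长度惩罚,引导模型在保证正确率的同时缩短推理链
    base = 1.0 if correct else 0.0
    gate = min(1.0, val_acc / target_acc) ** 2
    return base - lam * gate * min(1.0, length / max_len)

print(aalc_style_reward(True, 2048, val_acc=0.5))   # 训练早期:长度惩罚很轻
print(aalc_style_reward(True, 2048, val_acc=0.92))  # 准确率达标后:惩罚变大
```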
zh
[NLP-31] CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation SIGIR2025
【速读】: 该论文旨在解决RAG(Retrieval-Augmented Generation)系统输出质量评估的复杂性问题,特别是如何有效衡量其在语境连贯性、查询相关性、事实正确性、信息密度和信息召回等方面的多维质量。现有评估方法要么依赖于简单的词法重叠指标,无法捕捉细节,要么涉及复杂的多阶段流程或需要微调专用判断模型,影响实际效率。论文提出的解决方案的关键在于引入CCRS(Contextual Coherence and Relevance Score),这是一个基于单一强大预训练语言模型的零样本端到端评估框架,通过五个核心指标(Contextual Coherence, Question Relevance, Information Density, Answer Correctness, Information Recall)实现对RAG系统的高效、全面评估。
链接: https://arxiv.org/abs/2506.20128
作者: Aashiq Muhamed
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at LLM4Eval @ SIGIR 2025
Abstract:RAG systems enhance LLMs by incorporating external knowledge, which is crucial for domains that demand factual accuracy and up-to-date information. However, evaluating the multifaceted quality of RAG outputs, spanning aspects such as contextual coherence, query relevance, factual correctness, and informational completeness, poses significant challenges. Existing evaluation methods often rely on simple lexical overlap metrics, which are inadequate for capturing these nuances, or involve complex multi-stage pipelines with intermediate steps like claim extraction or require finetuning specialized judge models, hindering practical efficiency. To address these limitations, we propose CCRS (Contextual Coherence and Relevance Score), a novel suite of five metrics that utilizes a single, powerful, pretrained LLM as a zero-shot, end-to-end judge. CCRS evaluates: Contextual Coherence (CC), Question Relevance (QR), Information Density (ID), Answer Correctness (AC), and Information Recall (IR). We apply CCRS to evaluate six diverse RAG system configurations on the challenging BioASQ dataset. Our analysis demonstrates that CCRS effectively discriminates between system performances, confirming, for instance, that the Mistral-7B reader outperforms Llama variants. We provide a detailed analysis of CCRS metric properties, including score distributions, convergent/discriminant validity, tie rates, population statistics, and discriminative power. Compared to the complex RAGChecker framework, CCRS offers comparable or superior discriminative power for key aspects like recall and faithfulness, while being significantly more computationally efficient. CCRS thus provides a practical, comprehensive, and efficient framework for evaluating and iteratively improving RAG systems.
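零样本 LLM-as-a-Judge 的调用方式可以用如下草图说明:对五个 CCRS 指标分别构造提示,让单一预训练 LLM 打分。client.chat 为假设的 LLM 调用接口,提示词也仅为示意,并非论文使用的原始提示。

```python
METRICS = ["Contextual Coherence", "Question Relevance",
           "Information Density", "Answer Correctness", "Information Recall"]

def ccrs_score(client, question, context, answer):
    # client.chat(prompt) -> str,为假设的 LLM 接口
    scores = {}
    for metric in METRICS:
        prompt = (f"Rate the RAG answer on '{metric}' from 1 to 5. "
                  f"Reply with a single digit.\n"
                  f"Question: {question}\nContext: {context}\nAnswer: {answer}")
        scores[metric] = int(client.chat(prompt).strip())
    return scores
```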
zh
[NLP-32] Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests
【速读】: 该论文试图解决传统构造性回答测试在评估学习者高阶能力(如表达能力和逻辑思维)时,因需要大量人工评分而导致的劳动密集和成本高昂的问题。其解决方案的关键在于提出一种新的缺失分数填补方法,该方法利用自动化评分技术提升基于项目反应理论(Item Response Theory, IRT)的能力估计准确性,从而在保证评估质量的同时显著降低人工评分的工作量。
链接: https://arxiv.org/abs/2506.20119
作者: Masaki Uto,Yuma Ito
机构: The University of Electro-Communications (电气通信大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to EvalLAC’25: 2nd Workshop on Automatic Evaluation of Learning and Assessment Content, held at AIED 2025, Palermo, Italy. This is the camera-ready version submitted to CEUR Workshop Proceedings
Abstract:Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.
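IRT 能力估计环节可以用最简单的 Rasch(1PL)模型草图来理解:缺失的分数先由自动评分器填补,再对能力参数做极大似然估计。此处采用二值计分的简化假设,并非论文采用的具体 IRT 模型。

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_ability(scores, difficulties):
    # Rasch 模型:P(答对) = sigmoid(theta - b),对能力 theta 做 MLE
    s = np.asarray(scores, float)
    b = np.asarray(difficulties, float)
    def nll(theta):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        return -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(-4, 4), method="bounded").x

# 用法:0/1 分数中缺失的位置先用自动评分器的预测填补,再代入估计
theta = estimate_ability([1, 0, 1, 1], [-0.5, 0.3, 0.0, 1.2])
print(f"估计能力: {theta:.3f}")
```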
zh
[NLP-33] A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)在放射学报告校对中的阳性预测值(Positive Predictive Value, PPV)受限的问题,主要原因是错误发生率较低。解决方案的关键在于提出一种三阶段LLM框架,包括提取器、检测器和假阳性验证器,通过多步骤处理显著提高了PPV并降低了运营成本,同时保持了检测性能的稳定性。
链接: https://arxiv.org/abs/2506.20112
作者: Songsoo Kim,Seungtae Lee,See Young Lee,Joonho Kim,Keechan Kan,Dukyong Yoon
机构: 未知
类目: Computation and Language (cs.CL)
备注: 29 pages, 5 figures, 4 tables. Code available at this https URL
Abstract:Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) served as validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3’s superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
zh
[NLP-34] MIRAG E: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
【速读】: 该论文试图解决在咨询交互场景中,多模态模型在专家级推理与决策方面的评估问题,特别是在农业领域中,如何有效处理基于图像的上下文、复杂用户查询及专家响应的高保真基准构建问题。解决方案的关键在于构建MIRAGE基准,该基准基于超过35,000次真实用户-专家交互,并通过多步骤筛选流程精心设计,涵盖了多样化的作物健康、病虫害诊断及作物管理场景,同时包含超过7,000个独特的生物实体,具有高度的分类学多样性,能够支持对模型在 grounded reasoning(具身推理)、澄清策略和长文本生成能力的评估。此外,MIRAGE引入了非明确指定、上下文丰富的开放世界场景,要求模型具备推断隐含知识缺口、处理罕见实体以及主动引导交互或作出回应的能力。
链接: https://arxiv.org/abs/2506.20100
作者: Vardhan Dongre,Chi Gui,Shubham Garg,Hooshang Nayyeri,Gokhan Tur,Dilek Hakkani-Tür,Vikram S. Adve
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 66 pages, 32 figures, 23 tables
Abstract:We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models, grounded in the real world. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: this https URL
zh
[NLP-35] PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
【速读】: 该论文旨在解决在视觉环境中自主推导符号化动作语义(即前提和后置条件)的问题,而无需依赖专家定义的动作规则。传统方法通常局限于文本领域或依赖于不现实的假设,如预定义的问题文件、完全可观测性或显式错误信息。论文提出的PSALM-V系统通过分析执行结果并合成可能的错误解释,动态推断PDDL问题文件和领域动作语义,其关键在于利用大语言模型(LLMs)生成启发式计划和候选符号语义,并通过迭代生成与执行计划,维护每个动作可能语义的树状信念结构,逐步优化这些信念直至达到目标状态。
链接: https://arxiv.org/abs/2506.20097
作者: Wang Bill Zhu,Miaosen Chai,Ishika Singh,Robin Jia,Jesse Thomason
机构: University of Southern California (南加州大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注:
Abstract:We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot.
zh
[NLP-36] ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset
【速读】: 该论文试图解决如何有效整合高维时间序列信号与自然语言以支持动态、交互式任务的问题。时间序列数据在工业监控、医疗诊断和气候研究等应用中至关重要,但其与自然语言的融合仍面临显著挑战。其解决方案的关键在于提出一种新的框架——Instruct Time Transformer (ITFormer),该框架将时间序列编码器与冻结的大规模语言模型(LLMs)相结合,从而有效地提取、对齐和融合时序与文本特征,实现了在仅增加少于1%可训练参数的情况下显著提升问答准确率。
链接: https://arxiv.org/abs/2506.20093
作者: Yilin Wang,Peixuan Lei,Jie Song,Yuzhe Hao,Tao Chen,Yuxuan Zhang,Lei Jia,Yuanxiang Li,Zhongyu Wei
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: this https URL
zh
[NLP-37] Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
【速读】: 该论文试图解决当前基于Transformer的自回归语言模型(Language Models, LMs)在可解释性、可控性、组合性和泛化能力方面的不足。其解决方案的关键在于通过将组合语义属性整合到分布语义空间中,实现符号语义与分布语义之间的桥梁构建,这一方向被称为语义表示学习(Semantic Representation Learning)。研究对比了三种主流的自编码器架构——变分自编码器(Variational AutoEncoder, VAE)、向量量化变分自编码器(Vector Quantised VAE, VQVAE)和稀疏自编码器(Sparse AutoEncoder, SAE),并分析了它们在语义结构和可解释性方面所诱导的独特潜在空间几何特性。
链接: https://arxiv.org/abs/2506.20083
作者: Yingji Zhang,Danilo S. Carvalho,André Freitas
机构: University of Manchester (曼彻斯特大学); Idiap Research Institute (Idiap研究机构); CRUK Manchester Institute (英国癌症研究基金会曼彻斯特研究所)
类目: Computation and Language (cs.CL)
备注: In progress
Abstract:Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as "semantic representation learning". This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures-Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE)-and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.
zh
[NLP-38] SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
【速读】: 该论文试图解决当前代码检索模型过度依赖表面文本特征(如文档字符串、标识符名称)以及对良好文档化代码存在显著偏见的问题。其解决方案的关键在于提出SACL框架,通过增强文本信息并结合语义信息来补充代码或结构知识,从而减少检索过程中的偏差并提升检索效果。
链接: https://arxiv.org/abs/2506.20081
作者: Dhruv Gupta,Gayathri Ganesh Lakshmy,Yiqing Xie
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
zh
[NLP-39] A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLM s
【速读】: 该论文试图解决现有时空数据挖掘模型在多任务推理和复杂长文本推理方面能力不足的问题,这些模型通常局限于单一任务,无法生成深入且解释性强的输出,从而限制了其在现实世界多维度决策场景中的应用。解决方案的关键在于提出STReason框架,该框架将大语言模型(Large Language Models, LLMs)的推理能力与时空模型的分析能力相结合,通过上下文学习将复杂的自然语言查询分解为模块化、可解释的程序,并系统执行以生成解决方案和详细推理过程,从而实现多任务推理与执行。
链接: https://arxiv.org/abs/2506.20073
作者: Kethmi Hirushini Hettige,Jiahao Ji,Cheng Long,Shili Xiang,Gao Cong,Jingyuan Wang
机构: Nanyang Technological University (南洋理工大学); A*STAR (新加坡科技研究局); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Spatio-temporal data mining plays a pivotal role in informed decision making across diverse domains. However, existing models are often restricted to narrow tasks, lacking the capacity for multi-task inference and complex long-form reasoning that require generation of in-depth, explanatory outputs. These limitations restrict their applicability to real-world, multi-faceted decision scenarios. In this work, we introduce STReason, a novel framework that integrates the reasoning strengths of large language models (LLMs) with the analytical capabilities of spatio-temporal models for multi-task inference and execution. Without requiring task-specific finetuning, STReason leverages in-context learning to decompose complex natural language queries into modular, interpretable programs, which are then systematically executed to generate both solutions and detailed rationales. To facilitate rigorous evaluation, we construct a new benchmark dataset and propose a unified evaluation framework with metrics specifically designed for long-form spatio-temporal reasoning. Experimental results show that STReason significantly outperforms advanced LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations further validate STReason’s credibility and practical utility, demonstrating its potential to reduce expert workload and broaden the applicability to real-world spatio-temporal tasks. We believe STReason provides a promising direction for developing more capable and generalizable spatio-temporal reasoning systems.
zh
[NLP-40] Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models
【速读】: 该论文试图解决强化学习中指令跟随策略开发的问题,该问题由于依赖于大量人工标注的指令数据集以及从稀疏奖励中学习的难度而显得尤为困难。论文提出的解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)从先前收集的智能体轨迹中回顾性地自动生成开放式的指令。通过识别智能体隐含完成的有意义子任务,LLMs被用于重新标注失败的轨迹,从而丰富智能体的训练数据并显著减少对人工标注的依赖。这一开放指令重标注方法使智能体能够高效学习到一个统一的指令跟随策略,以处理多种任务。
链接: https://arxiv.org/abs/2506.20061
作者: Zhicheng Zhang,Ziyan Wang,Yali Du,Fei Fang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Under Review
Abstract:Developing effective instruction-following policies in reinforcement learning remains challenging due to the reliance on extensive human-labeled instruction datasets and the difficulty of learning from sparse rewards. In this paper, we propose a novel approach that leverages the capabilities of large language models (LLMs) to automatically generate open-ended instructions retrospectively from previously collected agent trajectories. Our core idea is to employ LLMs to relabel unsuccessful trajectories by identifying meaningful subtasks the agent has implicitly accomplished, thereby enriching the agent’s training data and substantially alleviating reliance on human annotations. Through this open-ended instruction relabeling, we efficiently learn a unified instruction-following policy capable of handling diverse tasks within a single policy. We empirically evaluate our proposed method in the challenging Craftax environment, demonstrating clear improvements in sample efficiency, instruction coverage, and overall policy performance compared to state-of-the-art baselines. Our results highlight the effectiveness of utilizing LLM-guided open-ended instruction relabeling to enhance instruction-following reinforcement learning.
zh
[NLP-41] Cross-Layer Discrete Concept Discovery for Interpreting Language Models
【速读】: 该论文试图解决在Transformer模型中跨层信息混合与冗余导致的新兴概念难以被揭示的问题,现有研究通常仅分析单层神经表示,忽略了跨层叠加现象。其解决方案的关键在于提出跨层VQ-VAE(CLVQVAE)框架,通过向量量化将多层表示映射到紧凑且可解释的概念向量,从而压缩重复的残差流特征。该方法结合了基于温度的top-k采样与EMA代码本更新,实现了对离散潜在空间的可控探索,同时保持代码本多样性,并通过缩放球面k-means++进行代码本初始化,以方向相似性聚类,更符合词嵌入空间中的语义结构。
链接: https://arxiv.org/abs/2506.20040
作者: Ankur Garg,Xuemin Yu,Hassan Sajjad,Samira Ebrahimi Kahou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose CLVQVAE, a framework that uses vector quantization to map representations across layers and in the process collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining code-book diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with semantic structure in word embedding space.
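摘要中提到的 EMA 码本更新是 VQ-VAE 文献中的标准做法,可用如下 PyTorch 草图示意(与论文的具体实现细节可能不同):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_avg,
                        x, codes, decay=0.99, eps=1e-5):
    # codebook: (K, D) 码本;x: (N, D) 编码器输出;codes: (N,) 最近邻码字索引
    onehot = F.one_hot(codes, codebook.size(0)).type_as(x)        # (N, K)
    cluster_size.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)  # 各码字使用频次的 EMA
    embed_avg.mul_(decay).add_(onehot.t() @ x, alpha=1 - decay)    # 各码字对应向量和的 EMA
    n = cluster_size.sum()
    # 拉普拉斯平滑,避免空码字导致除零
    smoothed = (cluster_size + eps) / (n + codebook.size(0) * eps) * n
    codebook.copy_(embed_avg / smoothed.unsqueeze(1))
```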
zh
[NLP-42] Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
【速读】: 该论文试图解决生成式 AI (Generative AI) 在面对涉及身份认同的问题时是否表现出类似人类的动机性推理(motivated reasoning)这一问题,即模型是否会因被赋予的个性角色(persona)而倾向于得出与该角色身份一致的结论。解决方案的关键在于通过设计包含四个政治和社会人口属性的八种人格角色,测试不同大型语言模型(LLMs)在两项基于人类受试者研究的推理任务中的表现,从而验证人格设定是否引发模型的动机性推理行为。
链接: https://arxiv.org/abs/2506.20020
作者: Saloni Dash,Amélie Reymond,Emma S. Spiro,Aylin Caliskan
机构: University of Washington (华盛顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reasoning in humans is prone to biases due to underlying motivations like identity protection, that undermine rational decision-making and judgment. This motivated reasoning at a collective level can be detrimental to society when debating critical issues such as human-driven climate change or vaccine safety, and can further aggravate political polarization. Prior studies have reported that large language models (LLMs) are also susceptible to human-like cognitive biases, however, the extent to which LLMs selectively reason toward identity-congruent conclusions remains largely unexplored. Here, we investigate whether assigning 8 personas across 4 political and socio-demographic attributes induces motivated reasoning in LLMs. Testing 8 LLMs (open source and proprietary) across two reasoning tasks from human-subject studies – veracity discernment of misinformation headlines and evaluation of numeric scientific evidence – we find that persona-assigned LLMs have up to 9% reduced veracity discernment relative to models without personas. Political personas specifically, are up to 90% more likely to correctly evaluate scientific evidence on gun control when the ground truth is congruent with their induced political identity. Prompt-based debiasing methods are largely ineffective at mitigating these effects. Taken together, our empirical findings are the first to suggest that persona-assigned LLMs exhibit human-like motivated reasoning that is hard to mitigate through conventional debiasing prompts – raising concerns of exacerbating identity-congruent reasoning in both LLMs and humans.
zh
[NLP-43] Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks
【速读】: 该论文试图解决人工智能在医疗领域应用中的环境影响与伦理问题,特别是商业大型语言模型(Large Language Models, LLMs)在资源消耗、患者隐私及安全方面的不足。其解决方案的关键在于开发一个可定制的检索增强生成(Retrieval-Augmented Generation, RAG)框架,该框架不仅能够监测能源使用和二氧化碳排放,还通过结合开源LLMs构建高效的RAG模型,从而在提升医疗任务准确性的同时降低能耗与碳足迹。
链接: https://arxiv.org/abs/2506.20009
作者: Konstantinos Vrettos,Michail E. Klontzas
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 3 Figures
Abstract:Background The increasing adoption of Artificial Intelligence (AI) in healthcare has sparked growing concerns about its environmental and ethical implications. Commercial Large Language Models (LLMs), such as ChatGPT and DeepSeek, require substantial resources, while the utilization of these systems for medical purposes raises critical issues regarding patient privacy and safety. Methods We developed a customizable Retrieval-Augmented Generation (RAG) framework for medical tasks, which monitors its energy usage and CO2 emissions. This system was then used to create RAGs based on various open-source LLMs. The tested models included both general purpose models like llama3.1:8b and medgemma-4b-it, which is medical-domain specific. The best RAG's performance and energy consumption were compared to DeepSeekV3-R1 and OpenAI's o4-mini model. A dataset of medical questions was used for the evaluation. Results Custom RAG models outperformed commercial models in accuracy and energy consumption. The RAG model built on llama3.1:8B achieved the highest accuracy (58.5%) and was significantly better than other models, including o4-mini and DeepSeekV3-R1. The llama3.1-RAG also exhibited the lowest energy consumption and CO2 footprint among all models, with a Performance per kWh of 0.52 and a total CO2 emission of 473g. Compared to o4-mini, the llama3.1-RAG achieved 2.7x more accuracy points per kWh and 172% less electricity usage while maintaining higher accuracy. Conclusion Our study demonstrates that local LLMs can be leveraged to develop RAGs that outperform commercial, online LLMs in medical tasks, while having a smaller environmental impact. Our modular framework promotes sustainable AI development, reducing electricity usage and aligning with the UN's Sustainable Development Goals.
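能耗与碳排放监测可以借助 codecarbon 这类开源库实现,下面是一个最小示意(论文未说明其具体监测实现,rag 对象亦为假设接口):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()        # 监测 CPU/GPU 能耗并换算为 CO2 排放
tracker.start()
answer = rag.query("What is the first-line treatment for hypertension?")
emissions_kg = tracker.stop()       # 返回本次运行的 CO2 排放量(千克)
print(f"CO2 排放: {emissions_kg:.6f} kg")
```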
zh
[NLP-44] A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior ACL2025
【速读】: 该论文试图解决传统眼动建模方法在捕捉阅读过程中固定点(fixation)和扫视(saccade)的时空动态特性方面的不足。传统方法依赖于聚合的眼动测量数据和强假设模型,忽略了阅读过程中的复杂时空模式。论文提出的解决方案的关键在于构建一个基于标记时空点过程(marked spatio-temporal point process)的更通用的概率模型,该模型不仅能够描述固定点的持续时间,还能捕捉其在空间和时间上的分布。其中,扫视被建模为一个Hawkes过程,以捕捉固定点对后续固定点发生概率的激发效应;而固定点持续时间则通过固定点特定预测因子的时间卷积函数进行建模,从而捕捉溢出效应。
链接: https://arxiv.org/abs/2506.19999
作者: Francesco Ignazio Re,Andreas Opedal,Glib Manaiev,Mario Giulianelli,Ryan Cotterell
机构: ETH Zürich (ETH Zurich); Max Planck Institute for Intelligent Systems, Tübingen (马克斯·普朗克智能系统研究所,图宾根)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: ACL 2025
Abstract:Reading is a process that unfolds across space and time, alternating between fixations where a reader focuses on a specific point in space, and saccades where a reader rapidly shifts their focus to a new point. An ansatz of psycholinguistics is that modeling a reader’s fixations and saccades yields insight into their online sentence processing. However, standard approaches to such modeling rely on aggregated eye-tracking measurements and models that impose strong assumptions, ignoring much of the spatio-temporal dynamics that occur during reading. In this paper, we propose a more general probabilistic model of reading behavior, based on a marked spatio-temporal point process, that captures not only how long fixations last, but also where they land in space and when they take place in time. The saccades are modeled using a Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. The duration time of fixation events is modeled as a function of fixation-specific predictors convolved across time, thus capturing spillover effects. Empirically, our Hawkes process model exhibits a better fit to human saccades than baselines. With respect to fixation durations, we observe that incorporating contextual surprisal as a predictor results in only a marginal improvement in the model’s predictive accuracy. This finding suggests that surprisal theory struggles to explain fine-grained eye movements.
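Hawkes 过程的“激发”机制可以用一维指数核的条件强度函数直观理解(论文建模的是带标记的时空点过程,此处为简化示意,参数均为假设值):

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.5):
    # lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
    # 每次注视事件都会暂时抬高其后新注视发生的概率,并随时间指数衰减
    past = np.asarray(history, float)
    past = past[past < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

print(hawkes_intensity(2.0, [0.5, 1.2, 1.8]))  # 距离 t 越近的注视事件贡献越大
```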
zh
[NLP-45] Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation
【速读】: 该论文试图解决如何从非结构化的API文档中构建可调用工具的代理(agent)的问题,特别是在任意领域中实现工具使用代理的构建,这需要处理复杂的实际API。解决方案的关键在于提出Doc2Agent,这是一个可扩展的流程,能够从API文档生成可执行工具,并通过代码代理迭代优化这些工具,从而有效提升工具的准确性和实用性。
链接: https://arxiv.org/abs/2506.19998
作者: Xinyi Ni,Haonan Jian,Qiuyang Wang,Vedanshi Chetan Shah,Pengyu Hong
机构: Brandeis University (布兰戴斯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:REST APIs play important roles in enriching the action space of web agents, yet most API-based agents rely on curated and uniform toolsets that do not reflect the complexity of real-world APIs. Building tool-using agents for arbitrary domains remains a major challenge, as it requires reading unstructured API documentation, testing APIs and inferring correct parameters. We propose Doc2Agent, a scalable pipeline to build agents that can call Python-based tools generated from API documentation. Doc2Agent generates executable tools from API documentation and iteratively refines them using a code agent. We evaluate our approach on real-world APIs, WebArena APIs, and research APIs, producing validated tools. We achieved a 55% relative performance improvement with 90% lower cost compared to direct API calling on the WebArena benchmark. A domain-specific agent built for glycomaterial science further demonstrates the pipeline's adaptability to complex, knowledge-rich tasks. Doc2Agent offers a generalizable solution for building tool agents from unstructured API documentation at scale.
zh
[NLP-46] Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型推理任务中表现不足的问题,尤其是由于缺乏对结构化上下文和多跳信息的访问能力。其解决方案的关键在于提出一种名为Inference-Scaled GraphRAG的新框架,通过在推理阶段应用计算扩展来增强基于图的推理能力,具体包括序列缩放与深度思维链图遍历的结合,以及在交错推理-执行循环中对采样轨迹进行并行缩放与多数投票。
链接: https://arxiv.org/abs/2506.19967
作者: Travis Thompson,Seung-Hwan Lim,Paul Liu,Ruoying He,Dongkuan Xu
机构: North Carolina State University (北卡罗来纳州立大学); Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling with deep chain-of-thought graph traversal, and parallel scaling with majority voting over sampled trajectories within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs.
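其中“并行扩展 + 多数投票”部分可用如下草图说明(answer_fn 为假设的单次图遍历推理接口):

```python
from collections import Counter

def parallel_scale(answer_fn, question, n=8):
    # 并行扩展:采样 n 条推理轨迹,对最终答案做多数投票
    answers = [answer_fn(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```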
zh
[NLP-47] CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
【速读】: 该论文试图解决低资源语言中高质量机器翻译(Machine Translation, MT)系统难以构建的问题,因为这类语言通常缺乏平行语料库。解决方案的关键在于提出一种名为CycleDistill的自举方法,该方法利用大型语言模型(Large Language Models, LLMs)和少量示例翻译,通过迭代生成合成平行语料库,并以此微调用于生成数据的模型,从而获得高质量的MT系统。该方法仅需1到4个少量示例即可实现有效训练,实验表明其在三个印度语言上的表现优于基线模型。
链接: https://arxiv.org/abs/2506.19952
作者: Deepon Halder,Thanmay Jayakumar,Raj Dabre
机构: Nilekani Centre at AI4Bharat (Nilekani Centre at AI4Bharat); Indian Institute of Technology (Indian Institute of Technology); Indian Institute of Technology (Indian Institute of Technology); Indian Institute of Engineering Science and Technology (Indian Institute of Engineering Science and Technology)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high quality machine translation (MT). However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose CycleDistill, a bootstrapping approach leveraging LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill involves iteratively generating synthetic parallel corpora from monolingual corpora via zero- or few-shot MT, which is then used to fine-tune the model that was used for generating said data for MT. CycleDistill does not need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments focusing on three Indian languages, by relying solely on monolingual corpora, it can achieve high-quality machine translation, improving upon a few-shot baseline model by over 20-30 chrF points on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
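CycleDistill 的自举循环可以抽象为如下伪代码式草图;translate 与 finetune 均为假设接口,仅呈现“单语语料 → 合成平行语料 → 微调自身”的迭代流程:

```python
def cycle_distill(model, mono_corpus, rounds=3, k_shot=4):
    # 每一轮:用当前模型做少样本翻译得到合成平行语料,再用它微调同一模型
    for _ in range(rounds):
        synthetic = [(src, model.translate(src, num_examples=k_shot))
                     for src in mono_corpus]
        model = model.finetune(synthetic)
    return model
```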
zh
[NLP-48] Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
【速读】: 该论文试图解决机器学习(Machine Learning, ML)领域中由于同行评审机制的不完善而导致的错误、误导性或有缺陷的研究被接受甚至被突出展示的问题。其解决方案的关键是建议ML会议设立一个专门的“反驳与批评”(Refutations and Critiques, R&C)赛道,为批判性挑战既有研究的重要工作提供高知名度和可信度的平台,从而促进研究生态系统的动态自我修正能力。
链接: https://arxiv.org/abs/2506.19882
作者: Rylan Schaeffer,Joshua Kazdan,Yegor Denisov-Blanch,Brando Miranda,Matthias Gerstgrasser,Susan Zhang,Andreas Haupt,Isha Gupta,Elyas Obbad,Jesse Dodge,Jessica Zosa Forde,Koustuv Sinha,Francesco Orabona,Sanmi Koyejo,David Donoho
机构: Stanford University (斯坦福大学); ETH Zürich (苏黎世联邦理工学院); Allen Institute for AI (艾伦人工智能研究所); Brown University (布朗大学); Meta AI, FAIR (Meta AI, FAIR); KAUST (阿卜杜拉国王科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Science progresses by iteratively advancing and correcting humanity's understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.
zh
[NLP-49] Capturing Visualization Design Rationale
【速读】: 该论文试图解决现有自然语言数据集在数据可视化任务中过于侧重可视化解读而非设计编码的问题,这些问题通常依赖于人工构建的可视化和问题。其解决方案的关键在于利用真实世界中的可视化笔记本,这些笔记本由学生在数据可视化课程中创建,结合了视觉元素与设计说明,从而显性化地呈现设计决策的依据。此外,通过大型语言模型(Large Language Models, LLMs)生成并分类问答-理由三元组,最终构建出一个能够捕捉和提炼学生可视化设计选择及其对应理由的数据集。
链接: https://arxiv.org/abs/2506.16571
作者: Maeve Hutchinson,Radu Jianu,Aidan Slingsby,Jo Wood,Pranava Madhyastha
机构: City St George’s, University of London (城市圣乔治大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
Abstract:Prior natural language datasets for data visualization have focused on tasks such as visualization literacy assessment, insight generation, and visualization generation from natural language instructions. These studies often rely on controlled setups with purpose-built visualizations and artificially constructed questions. As a result, they tend to prioritize the interpretation of visualizations, focusing on decoding visualizations rather than understanding their encoding. In this paper, we present a new dataset and methodology for probing visualization design rationale through natural language. We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course. These notebooks combine visual artifacts with design exposition, in which students make explicit the rationale behind their design decisions. We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks. We then carefully validate the triples and curate a dataset that captures and distills the visualization design choices and corresponding rationales of the students.
zh
[NLP-50] FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment
【速读】: 该论文旨在解决自动视网膜图像质量评估(Fundus Image Quality Assessment, FIQA)中存在的挑战,这些问题主要源于图像采集的差异性和专家主观评价的不一致性。解决方案的关键在于提出一种基于专家验证的新型框架FundaQ-8,该框架通过八个关键参数对视网膜图像质量进行系统性评估,包括视野覆盖范围、解剖结构可见性、光照条件和图像伪影等。在此基础上,研究者构建了一个基于ResNet18的回归模型,用于预测0到1之间的连续质量评分,并通过迁移学习、均方误差优化和标准化预处理方法进行训练,从而实现了高质量的图像质量评估。
链接: https://arxiv.org/abs/2506.20303
作者: Lee Qi Zun,Oscar Wong Jin Hao,Nor Anita Binti Che Omar,Zalifa Zakiah Binti Asnir,Mohamad Sabri bin Sinal Zainal,Goh Man Fye
机构: 未知
类目: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated fundus image quality assessment (FIQA) remains a challenge due to variations in image acquisition and subjective expert evaluations. We introduce FundaQ-8, a novel expert-validated framework for systematically assessing fundus image quality using eight critical parameters, including field coverage, anatomical visibility, illumination, and image artifacts. Using FundaQ-8 as a structured scoring reference, we develop a ResNet18-based regression model to predict continuous quality scores in the 0 to 1 range. The model is trained on 1800 fundus images from real-world clinical sources and Kaggle datasets, using transfer learning, mean squared error optimization, and standardized preprocessing. Validation against the EyeQ dataset and statistical analyses confirm the framework’s reliability and clinical interpretability. Incorporating FundaQ-8 into deep learning models for diabetic retinopathy grading also improves diagnostic robustness, highlighting the value of quality-aware training in real-world screening applications.
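基于 ResNet18 的质量分数回归可以用如下 PyTorch 草图复现其大意(回归头设计与超参数为假设,非官方代码):

```python
import torch
import torch.nn as nn
from torchvision import models

# 迁移学习:ImageNet 预训练的 ResNet18,末层改为回归头,Sigmoid 约束输出到 0~1
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(nn.Linear(model.fc.in_features, 1), nn.Sigmoid())

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, quality_scores):
    # images: (B, 3, H, W);quality_scores: (B,) 取值 0~1 的质量分
    optimizer.zero_grad()
    pred = model(images).squeeze(1)
    loss = criterion(pred, quality_scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```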
zh
计算机视觉
[CV-0] IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals
【速读】:该论文旨在解决视觉基础的3D全景场景补全(Panoptic Scene Completion, PSC)中实例级信息感知不足以及模型在测试阶段无法动态适应观测场景的问题。现有基于Transformer的方法虽然在训练过程中利用图像上下文更新查询,但这些查询在测试时保持静态,限制了其对场景的动态适应能力。论文提出的IPFormer方法的关键在于引入上下文自适应的实例提议(context-adaptive instance proposals),在训练和测试阶段均基于图像上下文生成并优化这些提议,通过注意力机制进行编码与解码,从而更精确地推理语义实例-体素关系,提升了整体全景指标表现并显著降低了运行时间。
链接: https://arxiv.org/abs/2506.20671
作者: Markus Gross,Aya Fahmy,Danit Niwattananan,Dominik Muhle,Rui Song,Daniel Cremers,Henri Meeß
机构: Technical University of Munich (慕尼黑工业大学); Fraunhofer Institute IVI (弗劳恩霍夫研究所IVI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based SSC approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt specifically to the observed scene. To overcome these limitations, we propose IPFormer, the first approach that leverages context-adaptive instance proposals at train and test time to address vision-based 3D Panoptic Scene Completion. Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Experimental results show that our approach surpasses state-of-the-art methods in overall panoptic metrics PQ† and PQ-All, matches performance in individual metrics, and achieves a runtime reduction exceeding 14×. Furthermore, our ablation studies reveal that dynamically deriving instance proposals from image context, as opposed to random initialization, leads to a 3.62% increase in PQ-All and a remarkable average improvement of 18.65% in combined Thing-metrics. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.
zh
[CV-1] EditP23: 3D Editing via Propagation of Image Prompts to Multi-View
【速读】:该论文试图解决无掩码的3D编辑问题,即如何在不依赖手动掩码或文本提示的情况下,将2D图像编辑一致地传播到多视角的3D表示中。解决方案的关键在于利用一对图像作为提示:原始视角及其用户编辑后的对应图像,通过引导预训练多视角扩散模型潜在空间中的编辑感知流,实现跨视角的连贯编辑,且无需优化过程,同时保持原始物体的结构和外观一致性。
链接: https://arxiv.org/abs/2506.20652
作者: Roi Bar-On,Dana Cohen-Bar,Daniel Cohen-Or
机构: Tel-Aviv University (特拉维夫大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Code, supplementary videos, interactive 3D visualizations, and additional results are available at this https URL
Abstract:We present EditP23, a method for mask-free 3D editing that propagates 2D image edits to multi-view representations in a 3D-consistent manner. In contrast to traditional approaches that rely on text-based prompting or explicit spatial masks, EditP23 enables intuitive edits by conditioning on a pair of images: an original view and its user-edited counterpart. These image prompts are used to guide an edit-aware flow in the latent space of a pre-trained multi-view diffusion model, allowing the edit to be coherently propagated across views. Our method operates in a feed-forward manner, without optimization, and preserves the identity of the original object, in both structure and appearance. We demonstrate its effectiveness across a range of object categories and editing scenarios, achieving high fidelity to the source while requiring no manual masks.
zh
[CV-2] Disentangled representations of microscopy images MICRO IJCNN2025
【速读】:该论文旨在解决显微图像分析中模型可解释性不足的问题,这是该领域的一个关键挑战。尽管深度神经网络在显微图像分类任务中表现出色,但其缺乏可解释性限制了其在实际应用中的可靠性与可信度。论文提出的解决方案是基于解耦表征学习(Disentangled Representation Learning, DRL)的方法,通过利用从合成数据中学习到的表征进行迁移,实现准确性和可解释性之间的良好平衡。
链接: https://arxiv.org/abs/2506.20649
作者: Jacopo Dapueto,Vito Paolo Pastore,Nicoletta Noceti,Francesca Odone
机构: MaLGa-DIBRIS, Università degli studi di Genova, Genova, Italy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in: International Joint Conference on Neural Networks (IJCNN 2025). Project page: this https URL
Abstract:Microscopy image analysis is fundamental for different applications, from diagnosis to synthetic engineering and environmental monitoring. Modern acquisition systems have granted the possibility to acquire an escalating amount of images, requiring a consequent development of a large collection of deep learning-based automatic image analysis methods. Although deep neural networks have demonstrated great performance in this field, interpretability, an essential requirement for microscopy image analysis, remains an open challenge. This work proposes a Disentangled Representation Learning (DRL) methodology to enhance model interpretability for microscopy image classification. Exploiting benchmark datasets from three different microscopic image domains (plankton, yeast vacuoles, and human cells), we show how a DRL framework, based on transferring a representation learnt from synthetic data, can provide a good trade-off between accuracy and interpretability in this domain.
zh
[CV-3] Joint attitude estimation and 3D neural reconstruction of non-cooperative space objects CVPR2025
【速读】:该论文旨在解决从模拟图像中对非合作空间目标进行高精度三维重建的问题,这对于提升空间态势感知(Space Situational Awareness, SSA)能力具有重要意义。其解决方案的关键在于利用神经辐射场(NeRF)模型,并通过联合优化相机位姿与NeRF参数来提高重建精度,特别是在面对单色图像、未知目标姿态、有限观测角度等挑战性条件下,通过逐帧训练和引入正则化策略以保持相机位姿的连续性。
链接: https://arxiv.org/abs/2506.20638
作者: Clément Forray,Pauline Delporte,Nicolas Delaygue,Florence Genin,Dawa Derksen
机构: CS Group( CS 组); CNES(国家空间研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted for CVPR 2025 NFBCC workshop
Abstract:Obtaining a better knowledge of the current state and behavior of objects orbiting Earth has proven to be essential for a range of applications such as active debris removal, in-orbit maintenance, or anomaly detection. 3D models represent a valuable source of information in the field of Space Situational Awareness (SSA). In this work, we leveraged Neural Radiance Fields (NeRF) to perform 3D reconstruction of non-cooperative space objects from simulated images. This scenario is challenging for NeRF models due to unusual camera characteristics and environmental conditions: monochromatic images, unknown object orientation, limited viewing angles, absence of diffuse lighting, etc. In this work, we focus primarily on the joint optimization of camera poses alongside the NeRF. Our experimental results show that the most accurate 3D reconstruction is achieved when training with successive images one-by-one. We estimate camera poses by optimizing a uniform rotation and use regularization to prevent successive poses from being too far apart.
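摘要中“相机姿态与 NeRF 联合优化”的做法可用如下草图说明:把姿态参数化为可学习的轴角与平移,与 NeRF 权重放进同一个优化器。其中 nerf、generate_rays、render、intrinsics、gt_image 均为假设的对象与接口,仅为示意,并非论文官方实现。

```python
import torch

def axis_angle_to_matrix(r):
    # 轴角 -> 旋转矩阵(通过李代数指数映射)
    K = torch.zeros(3, 3)
    K[0, 1], K[0, 2] = -r[2], r[1]
    K[1, 0], K[1, 2] = r[2], -r[0]
    K[2, 0], K[2, 1] = -r[1], r[0]
    return torch.matrix_exp(K)

pose = torch.zeros(6, requires_grad=True)   # 前 3 维为旋转轴角,后 3 维为平移
optimizer = torch.optim.Adam([
    {"params": nerf.parameters(), "lr": 5e-4},   # nerf 为假设的 NeRF 模块
    {"params": [pose], "lr": 1e-3},              # 姿态参数用更大的学习率
])

for step in range(2000):
    R, t = axis_angle_to_matrix(pose[:3]), pose[3:]
    rays_o, rays_d = generate_rays(R, t, intrinsics)   # 假设的光线生成函数
    loss = ((render(nerf, rays_o, rays_d) - gt_image) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```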
zh
[CV-4] Shape2Animal: Creative Animal Generation from Natural Silhouettes
【速读】:该论文试图解决如何将自然物体的轮廓(如云、石头或火焰)重新诠释为合理的动物形态的问题,从而模拟人类在模糊刺激中感知有意义模式的能力,即pareidolia。解决方案的关键在于利用开放词汇分割技术提取物体轮廓,并通过视觉-语言模型解释语义上合适的动物概念,随后借助文本到图像扩散模型生成符合输入形状的动物图像,并将其无缝融合到原始场景中,以生成视觉连贯且空间一致的组合。
链接: https://arxiv.org/abs/2506.20616
作者: Quoc-Duy Tran,Anh-Tuan Vo,Dinh-Khoi Vo,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
机构: University of Science (科学大学); Vietnam National University - Ho Chi Minh (越南国家大学-胡志明); Dayton University (戴顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces the Shape2Animal framework, which mimics this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract object silhouettes and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging a text-to-image diffusion model, and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Our Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: this https URL
zh
[CV-5] Video Perception Models for 3D Scene Synthesis
【速读】:该论文试图解决传统3D场景合成需要专家知识和大量手动操作的问题,以及现有方法在3D空间推理能力不足和多视角不一致性方面的局限性。解决方案的关键在于提出一种名为VIPScene的框架,该框架利用视频生成模型中编码的3D物理世界的常识知识,以确保场景布局的一致性和物体放置的跨视角一致性,同时结合文本和图像提示,融合视频生成、前馈3D重建和开放词汇感知模型,实现对场景中每个物体的语义与几何分析。
链接: https://arxiv.org/abs/2506.20601
作者: Rui Huang,Guangyao Zhai,Zuria Bauer,Marc Pollefeys,Federico Tombari,Leonidas Guibas,Gao Huang,Francis Engelmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditionally, 3D scene synthesis requires expert knowledge and significant manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, virtual reality, and gaming. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors of modern image generation models. However, current LLMs demonstrate limited 3D spatial reasoning ability, which restricts their ability to generate realistic and coherent 3D scenes. Meanwhile, image generation-based methods often suffer from constraints in viewpoint selection and multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. This enables flexible scene synthesis with high realism and structural consistency. For more precise analysis, we further introduce First-Person View Score (FPVScore) for coherence and plausibility evaluation, utilizing continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios. The code will be released.
zh
[CV-6] SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery Detection
【速读】:该论文旨在解决生成式人工智能(Generative AI)产生的难以检测的虚假遥感图像(RSI)所带来的挑战,这些问题可能引发错误的情报、虚假新闻甚至阴谋论。现有伪造检测方法通常依赖于单一视觉特征来捕捉预定义的伪影,然而由于地理地形、地表覆盖类型或RSI中的特定特征差异,伪影的性质会显著不同,且随着生成模型的复杂化而演变,导致现有方法在多样化遥感数据上的泛化能力不足。该论文提出了一种名为SFNet的新颖伪造检测框架,其关键在于通过结合空间域和频域特征来获取丰富且全面的视觉信息,并设计了领域特征映射模块和混合领域特征精炼模块(CBAM注意力机制),以依次对齐和融合多领域特征并抑制冗余信息,从而提升检测性能和泛化能力。
链接: https://arxiv.org/abs/2506.20599
作者: Ji Qi,Xinchang Zhang,Dingqi Ye,Yongjia Ruan,Xin Guo,Shaowen Wang,Haifeng Li
机构: Guangzhou University (广州大学); Xiamen University Of Technology (厦门理工学院); Xinjiang University (新疆大学); University of Illinois at Urbana–Champaign (伊利诺伊大学厄巴纳-香槟分校); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in generative adversarial networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposes a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging spatial and frequency domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial and frequency domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module (CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information. Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over the state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at this https URL.
zh
[CV-7] WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration
【速读】:该论文旨在解决单图像交互式3D场景生成中的可探索性受限问题,即现有方法在原视点之外进行大范围移动时无法生成高质量图像,尤其是在进入未见区域时存在视觉伪影和空间不一致的问题。解决方案的关键在于将问题分解为两个核心子问题:新视角质量(novel view quality)和跨视角一致性(cross-view consistency)。为提升新视角的渲染质量,提出WorldRestorer,一个数据驱动的视频修复模型,用于消除浮动物和伪影;为增强跨视角一致性,提出ConsistView,一种多视角联合修复机制,确保不同视角间的时空一致性。
链接: https://arxiv.org/abs/2506.20590
作者: Chaojun Ni,Jie Li,Haoyun Li,Hengyu Liu,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Boyuan Wang,Chenxin Li,Guan Huang,Wenjun Mei
机构: GigaAI; Peking University (北京大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is the limited explorability, which cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address this challenge, we propose WonderFree, the first model that enables users to interactively generate 3D worlds with the freedom to explore from arbitrary angles and directions. Specifically, we decouple this challenge into two key subproblems: novel view quality, which addresses visual artifacts and floating issues in novel views, and cross-view consistency, which ensures spatial consistency across different viewpoints. To enhance rendering quality in novel views, we introduce WorldRestorer, a data-driven video restoration model designed to eliminate floaters and artifacts. In addition, a data collection pipeline is presented to automatically gather training data for WorldRestorer, ensuring it can handle scenes with varying styles needed for 3D scene generation. Furthermore, to improve cross-view consistency, we propose ConsistView, a multi-view joint restoration mechanism that simultaneously restores multiple perspectives while maintaining spatiotemporal coherence. Experimental results demonstrate that WonderFree not only enhances rendering quality across diverse viewpoints but also significantly improves global coherence and consistency. These improvements are confirmed by CLIP-based metrics and a user study showing a 77.20% preference for WonderFree over WonderWorld, enabling a seamless and immersive 3D exploration experience. The code, model, and data will be publicly available.
zh
[CV-8] TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness
【速读】:该论文旨在解决视频摘要生成中对监督标注依赖性强以及基于注意力机制的模型在计算效率和跨领域适应性上的不足问题。其解决方案的关键在于提出一种开创性的自监督视频摘要模型,该模型通过结合基于马尔可夫过程的损失度量和两阶段自监督学习范式,在无需注意力机制、循环神经网络(RNN)或Transformer的情况下,有效捕捉视频中的空间和时间依赖性,从而实现了高效且性能优越的视频摘要生成。
链接: https://arxiv.org/abs/2506.20588
作者: Pritam Mishra,Coloma Ballester,Dimosthenis Karatzas
机构: Pompeu Fabra University (庞佩乌法布拉大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SumMe and TVSum datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.
zh
[CV-9] Learning-Based Distance Estimation for 360° Single-Sensor Setups
【速读】:该论文试图解决在全景成像中基于传统几何方法进行精确距离估计的挑战,尤其是在存在镜头畸变和环境变化的情况下。其解决方案的关键在于提出一种基于神经网络的单目距离估计方法,该方法使用单一360°鱼眼镜头相机,直接从原始全景输入中学习并推断物体的距离,无需依赖精确的镜头标定,从而提高了在不同条件下的鲁棒性和适应性。
链接: https://arxiv.org/abs/2506.20586
作者: Yitong Quan,Benjamin Kiefer,Martin Messmer,Andreas Zell
机构: University of Tuebingen (图宾根大学); LOOKOUT (LOOKOUT)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Submitted to ECMR 2025
Abstract:Accurate distance estimation is a fundamental challenge in robotic perception, particularly in omnidirectional imaging, where traditional geometric methods struggle with lens distortions and environmental variability. In this work, we propose a neural network-based approach for monocular distance estimation using a single 360° fisheye lens camera. Unlike classical trigonometric techniques that rely on precise lens calibration, our method directly learns and infers the distance of objects from raw omnidirectional inputs, offering greater robustness and adaptability across diverse conditions. We evaluate our approach on three 360° datasets (LOAF, ULM360, and a newly captured dataset Boat360), each representing distinct environmental and sensor setups. Our experimental results demonstrate that the proposed learning-based model outperforms traditional geometry-based methods and other learning baselines in both accuracy and robustness. These findings highlight the potential of deep learning for real-time omnidirectional distance estimation, making our approach particularly well-suited for low-cost applications in robotics, autonomous navigation, and surveillance.
zh
[CV-10] Dense Video Captioning using Graph-based Sentence Summarization
【速读】:该论文旨在解决密集视频描述任务中,现有方法在处理较长事件时间提议时,未能充分探索场景演变的问题,导致在场景和物体发生变化时性能不理想。其解决方案的关键在于提出一种基于图的分段与摘要(GPaS)框架,该框架通过将整个事件提议分割为短视频片段进行细粒度描述,并在摘要阶段利用语义词之间的关系进行有效整合,具体通过将语义词视为图中的节点,并结合图卷积网络(GCN)与长短期记忆网络(LSTM)来学习它们的交互,从而生成能够概括整个事件的简洁句子。
链接: https://arxiv.org/abs/2506.20583
作者: Zhiwang Zhang,Dong Xu,Wanli Ouyang,Luping Zhou
机构: University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Recently, dense video captioning has made attractive progress in detecting and captioning all events in a long untrimmed video. Despite promising results, most existing methods do not sufficiently explore the scene evolution within an event temporal proposal for captioning, and therefore perform less satisfactorily when the scenes and objects change over a relatively long proposal. To address this problem, we propose a graph-based partition-and-summarization (GPaS) framework for dense video captioning within two stages. For the "partition" stage, a whole event proposal is split into short video segments for captioning at a finer level. For the "summarization" stage, the generated sentences carrying rich description information for each segment are summarized into one sentence to describe the whole event. We particularly focus on the "summarization" stage, and propose a framework that effectively exploits the relationship between semantic words for summarization. We achieve this goal by treating semantic words as nodes in a graph and learning their interactions by coupling Graph Convolutional Network (GCN) and Long Short-Term Memory (LSTM), with the aid of visual cues. Two schemes of GCN-LSTM Interaction (GLI) modules are proposed for seamless integration of GCN and LSTM. The effectiveness of our approach is demonstrated via an extensive comparison with state-of-the-art methods on two benchmarks, the ActivityNet Captions dataset and the YouCook II dataset.
zh
[CV-11] Causal Representation Learning with Observational Grouping for CXR Classification
【速读】:该论文旨在解决医学影像中任务特定潜在特征的泛化性和鲁棒性不足的问题,通过学习可识别的因果表示来提升模型性能。其解决方案的关键在于利用观测数据的分组策略,在端到端框架下学习疾病分类的因果表示,从而强制模型对种族、性别和成像视角等无关因素保持不变性,进而提升模型在多个分类任务中的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2506.20582
作者: Rajat Rasal,Avinash Kori,Ben Glocker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Identifiable causal representation learning seeks to uncover the true causal relationships underlying a data generation process. In medical imaging, this presents opportunities to improve the generalisability and robustness of task-specific latent features. This work introduces the concept of grouping observations to learn identifiable representations for disease classification in chest X-rays via an end-to-end framework. Our experiments demonstrate that these causal representations improve generalisability and robustness across multiple classification tasks when grouping is used to enforce invariance w.r.t. race, sex, and imaging views.
zh
[CV-12] Show Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization
【速读】:该论文试图解决密集视频描述(dense video captioning)问题,即为视频中的多个事件生成多句描述。解决方案的关键在于提出了一种分而总结(division-and-summarization, DaS)框架,该框架通过将长视频划分为多个事件提案,并对每个提案中的视频片段生成句子描述,随后利用带有层次化注意力机制的两阶段长短期记忆网络(LSTM)模型,结合视觉特征对生成的句子进行语义总结,最终输出一个描述性的句子。
链接: https://arxiv.org/abs/2506.20567
作者: Zhiwang Zhang,Dong Xu,Wanli Ouyang,Chuanqi Tan
机构: University of Sydney (悉尼大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:In this work, we propose a division-and-summarization (DaS) framework for dense video captioning. After partitioning each untrimmed long video into multiple event proposals, where each event proposal consists of a set of short video segments, we extract visual features (e.g., C3D features) from each segment and use an existing image/video captioning approach to generate one sentence description for this segment. Considering that the generated sentences contain rich semantic descriptions about the whole event proposal, we formulate the dense video captioning task as a visual cue aided sentence summarization problem and propose a new two-stage Long Short-Term Memory (LSTM) approach equipped with a new hierarchical attention mechanism to summarize all generated sentences as one descriptive sentence with the aid of visual features. Specifically, the first-stage LSTM network takes all semantic words from the generated sentences and the visual features from all segments within one event proposal as the input, and acts as the encoder to effectively summarize both semantic and visual information related to this event proposal. The second-stage LSTM network takes the output from the first-stage LSTM network and the visual features from all video segments within one event proposal as the input, and acts as the decoder to generate one descriptive sentence for this event proposal. Our comprehensive experiments on the ActivityNet Captions dataset demonstrate the effectiveness of our newly proposed DaS framework for dense video captioning.
zh
[CV-13] HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
【速读】:该论文试图解决大型视觉-语言模型(VLM)在实时人机交互(HRI)中因高延迟导致的感知能力不足问题,从而影响用户体验和实际应用。解决方案的关键在于构建HRIBench,这是一个针对HRI中关键人类感知任务的视觉问答(VQA)基准测试,涵盖非语言线索理解、语言指令理解、人机物体关系理解、社会导航和人员识别五个领域,并通过真实HRI环境数据与公开数据集相结合的方式,构建了1000个VQA问题进行评估。
链接: https://arxiv.org/abs/2506.20566
作者: Zhonghao Shi,Enyu Zhao,Nathaniel Dennler,Jingzhen Wang,Xinyang Xu,Kaleen Shrestha,Mengxue Fu,Daniel Seita,Maja Matarić
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 19th International Symposium on Experimental Robotics (ISER 2025)
Abstract:Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: this https URL.
zh
[CV-14] AdvMIM: Adversarial Masked Image Modeling for Semi-Supervised Medical Image Segmentation MICCAI2025
【速读】:该论文旨在解决在标注数据稀缺的半监督学习场景中,如何有效训练Vision Transformer(视觉Transformer)以提升医学图像分割性能的问题。其关键解决方案是提出一种对抗性掩码图像建模方法,通过构建一个从原始域中提取的辅助掩码域,并利用标记数据的原始标签和未标记数据的伪标签来训练Transformer预测完整的分割掩码,从而增强监督信号。此外,论文还从多域学习的角度进行了理论分析,并设计了一种新颖的对抗性训练损失以减小原始域与掩码域之间的领域差异,进而提升半监督学习性能。
链接: https://arxiv.org/abs/2506.20563
作者: Lei Zhu,Jun Zhou,Rick Siow Mong Goh,Yong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025
Abstract:Vision Transformer has recently gained tremendous popularity in medical image segmentation task due to its superior capability in capturing long-range dependencies. However, transformer requires a large amount of labeled data to be effective, which hinders its applicability in annotation scarce semi-supervised learning scenario where only limited labeled data is available. State-of-the-art semi-supervised learning methods propose combinatorial CNN-Transformer learning to cross teach a transformer with a convolutional neural network, which achieves promising results. However, it remains a challenging task to effectively train the transformer with limited labeled data. In this paper, we propose an adversarial masked image modeling method to fully unleash the potential of transformer for semi-supervised medical image segmentation. The key challenge in semi-supervised learning with transformer lies in the lack of sufficient supervision signal. To this end, we propose to construct an auxiliary masked domain from original domain with masked image modeling and train the transformer to predict the entire segmentation mask with masked inputs to increase supervision signal. We leverage the original labels from labeled data and pseudo-labels from unlabeled data to learn the masked domain. To further benefit the original domain from masked domain, we provide a theoretical analysis of our method from a multi-domain learning perspective and devise a novel adversarial training loss to reduce the domain gap between the original and masked domain, which boosts semi-supervised learning performance. We also extend adversarial masked image modeling to CNN network. Extensive experiments on three public medical image segmentation datasets demonstrate the effectiveness of our method, where our method outperforms existing methods significantly. Our code is publicly available at this https URL.
zh
[CV-15] Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos
【速读】:该论文旨在解决视频中基于单帧的物体检测模型忽略时间上下文信息以及现有视频检测方法引入复杂时间模块导致模型规模和计算复杂度增加的问题。其解决方案的关键在于将多个连续帧堆叠作为输入送入基于YOLO的检测器,仅对单个目标帧的输出进行监督,从而在不显著改变现有架构的前提下,有效利用时间信息,提升检测鲁棒性,同时保持模型的简洁性、计算效率和实时推理能力。
链接: https://arxiv.org/abs/2506.20550
作者: Yitong Quan,Benjamin Kiefer,Martin Messmer,Andreas Zell
机构: University of Tuebingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Submitted to ECMR 2025
Abstract:Modern image-based object detection models, such as YOLOv7, primarily process individual frames independently, thus ignoring valuable temporal context naturally present in videos. Meanwhile, existing video-based detection methods often introduce complex temporal modules, significantly increasing model size and computational complexity. In practical applications such as surveillance and autonomous driving, transient challenges including motion blur, occlusions, and abrupt appearance changes can severely degrade single-frame detection performance. To address these issues, we propose a straightforward yet highly effective strategy: stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame. This approach leverages temporal information with minimal modifications to existing architectures, preserving simplicity, computational efficiency, and real-time inference capability. Extensive experiments on the challenging MOT20Det and our BOAT360 datasets demonstrate that our method improves detection robustness, especially for lightweight models, effectively narrowing the gap between compact and heavy detection networks. Additionally, we contribute the BOAT360 benchmark dataset, comprising annotated fisheye video sequences captured from a boat, to support future research in multi-frame video object detection in challenging real-world scenarios.
zh
[CV-16] Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks
【速读】:该论文旨在解决在线社交网络(OSNs)中深度伪造图像检测面临的两大挑战:一是现有方法忽视了压缩引入的“块效应”导致深度伪造痕迹被掩盖,二是多数方法依赖于成对数据,而实际场景中此类数据稀缺。解决方案的关键在于提出PLADA框架,其核心模块包括Block Effect Eraser(B2E),通过双阶段注意力机制处理块效应,以及Open Data Aggregation(ODA),能够处理成对和非成对数据以提升检测性能。该方法在26个数据集上表现出色,尤其在有限成对数据和压缩条件下仍能有效检测深度伪造图像。
链接: https://arxiv.org/abs/2506.20548
作者: Manyi Li,Renshuai Tao,Yufan Liu,Chuangchuang Tan,Haotong Qin,Bing Li,Yunchao Wei,Yao Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 20 pages, 10 figures
Abstract:With the rapid advancement of deep learning, particularly through generative adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or "deepfakes", have become nearly indistinguishable from real ones. These images are widely shared across Online Social Networks (OSNs), raising concerns about their misuse. Existing deepfake detection methods overlook the "block effects" introduced by compression in OSNs, which obscure deepfake artifacts, and primarily focus on raw images, rarely encountered in real-world scenarios. To address these challenges, we propose PLADA (Pay Less Attention to Deceptive Artifacts), a novel framework designed to tackle the lack of paired data and the ineffective use of compressed images. PLADA consists of two core modules: Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to handle block effects, and Open Data Aggregation (ODA), which processes both paired and unpaired data to improve detection. Extensive experiments across 26 datasets demonstrate that PLADA achieves a remarkable balance in deepfake detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with limited paired data and compression. More importantly, this work introduces the "block effect" as a critical factor in deepfake detection, providing a robust solution for open-world scenarios. Our code is available at this https URL.
zh
[CV-17] AI-assisted radiographic analysis in detecting alveolar bone-loss severity and patterns
【速读】:该论文旨在解决牙周炎导致的牙槽骨吸收程度和模式的准确评估问题,这是牙周病诊断与治疗计划中的关键环节。其解决方案的关键在于提出一种基于人工智能的深度学习框架,结合YOLOv8进行牙齿检测、Keypoint R-CNN识别解剖标志点以精确计算骨吸收严重程度,并利用YOLOv8x-seg模型通过几何分析区分水平型与角形骨吸收模式,从而实现对牙槽骨吸收的自动化检测与量化。
链接: https://arxiv.org/abs/2506.20522
作者: Chathura Wimalasiri,Piumal Rathnayake,Shamod Wijerathne,Sumudu Rasnayaka,Dhanushka Leuke Bandara,Roshan Ragel,Vajira Thambawita,Isuru Nawinne
机构: Faculty of Engineering, University of Peradeniya; Faculty of Dental Sciences, University of Peradeniya; Simula Metropolitan Center for Digital Engineering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript is 17 pages with 5 tables and 12 figures. The manuscript is under review at Nature Scientific Reports
Abstract:Periodontitis, a chronic inflammatory disease causing alveolar bone loss, significantly affects oral health and quality of life. Accurate assessment of bone loss severity and pattern is critical for diagnosis and treatment planning. In this study, we propose a novel AI-based deep learning framework to automatically detect and quantify alveolar bone loss and its patterns using intraoral periapical (IOPA) radiographs. Our method combines YOLOv8 for tooth detection with Keypoint R-CNN models to identify anatomical landmarks, enabling precise calculation of bone loss severity. Additionally, YOLOv8x-seg models segment bone levels and tooth masks to determine bone loss patterns (horizontal vs. angular) via geometric analysis. Evaluated on a large, expertly annotated dataset of 1000 radiographs, our approach achieved high accuracy in detecting bone loss severity (intra-class correlation coefficient up to 0.80) and bone loss pattern classification (accuracy 87%). This automated system offers a rapid, objective, and reproducible tool for periodontal assessment, reducing reliance on subjective manual evaluation. By integrating AI into dental radiographic analysis, our framework has the potential to improve early diagnosis and personalized treatment planning for periodontitis, ultimately enhancing patient care and clinical outcomes.
zh
[CV-18] A Deep Learning Approach to Identify Rock Bolts in Complex 3D Point Clouds of Underground Mines Captured Using Mobile Laser Scanners
【速读】:该论文试图解决在地下矿山中对锚杆进行高效、准确自动检测的问题,尤其是在低光环境和复杂结构下手动检测困难的情况。解决方案的关键在于提出一种名为DeepBolt的两阶段深度学习架构,该架构专门设计用于处理大规模点云数据中的严重类别不平衡问题,从而实现对锚杆的自动且高效识别。
链接: https://arxiv.org/abs/2506.20464
作者: Dibyayan Patra,Pasindu Ranasinghe,Bikram Banerjee,Simit Raval
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rock bolts are crucial components of the subterranean support systems in underground mines that provide adequate structural reinforcement to the rock mass to prevent unforeseen hazards like rockfalls. This makes frequent assessments of such bolts critical for maintaining rock mass stability and minimising risks in underground mining operations. Where manual surveying of rock bolts is challenging due to the low light conditions in the underground mines and the time-intensive nature of the process, automated detection of rock bolts serves as a plausible solution. To that end, this study focuses on the automatic identification of rock bolts within medium to large-scale 3D point clouds obtained from underground mines using mobile laser scanners. Existing techniques for automated rock bolt identification primarily rely on feature engineering and traditional machine learning approaches. However, such techniques lack robustness as these point clouds present several challenges due to data noise, varying environments, and complex surrounding structures. Moreover, the target rock bolts are extremely small objects within large-scale point clouds and are often partially obscured due to the application of reinforcement shotcrete. Addressing these challenges, this paper proposes an approach termed DeepBolt, which employs a novel two-stage deep learning architecture specifically designed for handling severe class imbalance for the automatic and efficient identification of rock bolts in complex 3D point clouds. The proposed method surpasses state-of-the-art semantic segmentation models by up to 42.5% in Intersection over Union (IoU) for rock bolt points. Additionally, it outperforms existing rock bolt identification techniques, achieving a 96.41% precision and 96.96% recall in classifying rock bolts, demonstrating its robustness and effectiveness in complex underground environments.
zh
[CV-19] HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
【速读】:该论文旨在解决高分辨率图像合成中生成式AI(Generative AI)模型在超出训练分辨率时产生的视觉伪影问题,如对象重复和空间不一致性。其解决方案的关键在于提出HiWave,一种无需训练的零样本方法,通过两阶段流程实现:首先从预训练模型生成基础图像,随后进行基于DDIM的块级反演和新颖的小波域细节增强模块。该方法在采样过程中保留基础图像的低频成分以确保结构一致性,同时引导高频成分以增强细节和纹理,从而显著提升超高清图像合成的视觉保真度和结构连贯性。
链接: https://arxiv.org/abs/2506.20452
作者: Tobias Vontobel,Seyedmorteza Sadat,Farnood Salehi,Romann M. Weber
机构: ETH Zurich (苏黎世联邦理工学院); Disney Research|Studios (迪士尼研究院|工作室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave’s performance, where it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.
zh
[CV-20] Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation
【速读】:该论文旨在解决医疗图像生成中面临的数据集规模小和医疗文本数据稀缺的问题。其解决方案的关键在于提出Med-Art框架,该框架利用视觉-语言模型生成医学图像的视觉描述,从而缓解可用医疗文本数据不足的问题,并基于预训练的文本到图像模型PixArt-α进行适应性调整,在有限数据下实现高性能。此外,论文还引入了混合层级扩散微调(HLDF)方法,通过像素级损失有效解决颜色过度饱和等问题。
链接: https://arxiv.org/abs/2506.20449
作者: Changlu Guo,Anders Nymark Christensen,Morten Rieger Hannemose
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project is available at this https URL
Abstract:Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application in medical image generation still faces significant challenges, including small dataset sizes and scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images, which overcomes the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-α, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which enables pixel-level losses, effectively addressing issues such as overly saturated colors. We achieve state-of-the-art performance on two medical image datasets, measured by FID, KID, and downstream classification performance.
zh
[CV-21] A Novel Large Vision Foundation Model (LVFM)-based Approach for Generating High-Resolution Canopy Height Maps in Plantations for Precision Forestry Management
【速读】:该论文旨在解决如何以低成本、高精度的方式监测人工林的地上生物量(Aboveground Biomass, AGB),以支持地方生计和碳汇项目如中国核证自愿减排量(CCER)计划。传统基于激光雷达(LiDAR)的方法虽然能生成高分辨率冠层高度图(Canopy Height Maps, CHMs),但成本较高;而利用RGB影像的深度学习方法在准确提取冠层高度特征方面仍存在挑战。论文提出的解决方案是开发一种基于大型视觉基础模型(Large Vision Foundation Model, LVFM)的新模型,其关键在于集成特征提取器、自监督特征增强模块以及高度估计器,从而有效保留空间细节并提升冠层高度估算的准确性。
链接: https://arxiv.org/abs/2506.20388
作者: Shen Tan,Xin Zhang,Liangxiu Han,Huaguo Huang,Han Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate, cost-effective monitoring of plantation aboveground biomass (AGB) is crucial for supporting local livelihoods and carbon sequestration initiatives like the China Certified Emission Reduction (CCER) program. High-resolution canopy height maps (CHMs) are essential for this, but standard lidar-based methods are expensive. While deep learning with RGB imagery offers an alternative, accurately extracting canopy height features remains challenging. To address this, we developed a novel model for high-resolution CHM generation using a Large Vision Foundation Model (LVFM). Our model integrates a feature extractor, a self-supervised feature enhancement module to preserve spatial details, and a height estimator. Tested in Beijing’s Fangshan District using 1-meter Google Earth imagery, our model outperformed existing methods, including conventional CNNs. It achieved a mean absolute error of 0.09 m, a root mean square error of 0.24 m, and a correlation of 0.78 against lidar-based CHMs. The resulting CHMs enabled over 90% success in individual tree detection, high accuracy in AGB estimation, and effective tracking of plantation growth, demonstrating strong generalization to non-training areas. This approach presents a promising, scalable tool for evaluating carbon sequestration in both plantations and natural forests.
zh
[CV-22] Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking
【速读】:该论文旨在解决基于Transformer的视觉跟踪模型在资源受限设备上因处理速度慢而实用性受限的问题。其关键解决方案是提出HiT系列高效跟踪模型,核心创新在于引入Bridge Module,将轻量级Transformer连接至跟踪框架以提升特征表示质量,并采用双图像位置编码方法有效编码空间信息。进一步地,论文还提出了DyHiT,通过动态路由机制根据场景复杂度选择不同计算路径,实现精度与速度的优化平衡。
链接: https://arxiv.org/abs/2506.20381
作者: Ben Kang,Xin Chen,Jie Zhao,Chunjuan Bo,Dong Wang,Huchuan Lu
机构: Dalian University of Technology (大连理工大学); Liaoning Normal University (辽宁师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper was accepted by International Journal of Computer Vision(IJCV)
Abstract:Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT uses search area features extracted by the backbone network and inputs them into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT.
zh
[CV-23] InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking
【速读】:该论文旨在解决图像零水印技术中鲁棒性不足的问题,特别是在面对各种图像失真时,如何保持水印的稳定性和可恢复性。其解决方案的关键在于提出一种基于畸变不变特征学习的深度学习框架,该框架包含两个核心模块:第一模块通过噪声对抗学习训练特征提取器,生成既对畸变具有不变性又具备语义表达能力的特征表示;第二模块则设计了一个基于学习的多比特零水印方案,将训练得到的不变特征投影到一组可训练参考码上,以优化匹配目标二进制信息。该方法在多个图像数据集和广泛畸变条件下均表现出优异的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2506.20370
作者: Abdullah All Tanvir,Xin Zhong
机构: University of Nebraska Omaha(内布拉斯加大学奥马哈分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:This paper introduces a novel deep learning framework for robust image zero-watermarking based on distortion-invariant feature learning. As a zero-watermarking scheme, our method leaves the original image unaltered and learns a reference signature through optimization in the feature space. The proposed framework consists of two key modules. In the first module, a feature extractor is trained via noise-adversarial learning to generate representations that are both invariant to distortions and semantically expressive. This is achieved by combining adversarial supervision against a distortion discriminator and a reconstruction constraint to retain image content. In the second module, we design a learning-based multibit zero-watermarking scheme where the trained invariant features are projected onto a set of trainable reference codes optimized to match a target binary message. Extensive experiments on diverse image datasets and a wide range of distortions show that our method achieves state-of-the-art robustness in both feature stability and watermark recovery. Comparative evaluations against existing self-supervised and deep watermarking techniques further highlight the superiority of our framework in generalization and robustness.
zh
[CV-24] DreamAnywhere: Object-Centric Panoramic 3D Scene Generation
【速读】:该论文试图解决现有文本到3D场景生成方法在生成环境时存在视角受限、视觉保真度低、场景理解不足以及仅适用于室内或室外设置等问题。其解决方案的关键在于提出DreamAnywhere,一个模块化系统,能够从文本生成360°全景图像,分解背景与物体,通过混合修复构建完整的3D表示,并将物体掩码提升为细节丰富的3D物体,从而实现沉浸式导航和直观的物体级编辑。
链接: https://arxiv.org/abs/2506.20367
作者: Edoardo Alberto Dominici,Jozef Hladky,Floor Verhoeven,Lukas Radl,Thomas Deixelberger,Stefan Ainetter,Philipp Drescher,Stefan Hauswiesner,Arno Coomans,Giacomo Nazzaro,Konstantinos Vardis,Markus Steinberger
机构: Huawei Technologies Switzerland(华为技术瑞士公司); Huawei Technologies Germany(华为技术德国公司); Graz University of Technology(格拉茨工业大学); Huawei Technologies Austria(华为技术奥地利公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in text-to-3D scene generation have demonstrated significant potential to transform content creation across multiple industries. Although the research community has made impressive progress in addressing the challenges of this complex task, existing methods often generate environments that are only front-facing, lack visual fidelity, exhibit limited scene understanding, and are typically fine-tuned for either indoor or outdoor settings. In this work, we address these issues and propose DreamAnywhere, a modular system for the fast generation and prototyping of 3D scenes. Our system synthesizes a 360° panoramic image from text, decomposes it into background and objects, constructs a complete 3D representation through hybrid inpainting, and lifts object masks to detailed 3D objects that are placed in the virtual environment. DreamAnywhere supports immersive navigation and intuitive object-level editing, making it ideal for scene exploration, visual mock-ups, and rapid prototyping – all with minimal manual modeling. These features make our system particularly suitable for low-budget movie production, enabling quick iteration on scene layout and visual tone without the overhead of traditional 3D workflows. Our modular pipeline is highly customizable as it allows components to be replaced independently. Compared to current state-of-the-art text and image-based 3D scene generation approaches, DreamAnywhere shows significant improvements in coherence in novel view synthesis and achieves competitive image quality, demonstrating its effectiveness across diverse and challenging scenarios. A comprehensive user study demonstrates a clear preference for our method over existing approaches, validating both its technical robustness and practical usefulness.
zh
[CV-25] Feature Hallucination for Self-supervised Action Recognition
【速读】:该论文旨在解决视频中人类动作识别的问题,即在仅依赖原始像素分析的基础上,如何提升对高层语义信息的捕捉能力以及多模态特征的有效融合。其解决方案的关键在于提出一种深度翻译动作识别框架,通过联合预测动作概念和辅助特征来增强识别准确性,并在测试阶段利用幻觉流推断缺失线索,从而在不增加计算开销的情况下丰富特征表示。此外,引入了两种新的领域特定描述符——目标检测特征(Object Detection Features, ODF)和显著性检测特征(Saliency Detection Features, SDF),以聚焦于动作相关区域,同时将这些描述符与多种辅助模态(如光流、改进的密集轨迹、骨骼数据和音频线索)无缝集成,提升了模型的鲁棒性和性能。
链接: https://arxiv.org/abs/2506.20342
作者: Lei Wang,Piotr Koniusz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in International Journal of Computer Vision (IJCV)
Abstract:Understanding human actions in videos requires more than raw pixel analysis; it relies on high-level semantic reasoning and effective integration of multimodal features. We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. At test time, hallucination streams infer missing cues, enriching feature representations without increasing computational overhead. To focus on action-relevant regions beyond raw pixels, we introduce two novel domain-specific descriptors. Object Detection Features (ODF) aggregate outputs from multiple object detectors to capture contextual cues, while Saliency Detection Features (SDF) highlight spatial and intensity patterns crucial for action recognition. Our framework seamlessly integrates these descriptors with auxiliary modalities such as optical flow, Improved Dense Trajectories, skeleton data, and audio cues. It remains compatible with state-of-the-art architectures, including I3D, AssembleNet, Video Transformer Network, FASTER, and recent models like VideoMAE V2 and InternVideo2. To handle uncertainty in auxiliary features, we incorporate aleatoric uncertainty modeling in the hallucination step and introduce a robust loss function to mitigate feature noise. Our multimodal self-supervised action recognition framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.
zh
[CV-26] On the Burstiness of Faces in Set
【速读】:该论文试图解决在基于集合的人脸识别(Set-based Face Recognition, SFR)中广泛存在的突发性(burstiness)问题,该问题表现为某些特定人脸在集合中出现的频率高于统计独立模型的预期,从而影响模型的泛化能力和评估准确性。解决方案的关键在于提出三种检测突发性人脸的策略:基于Quickshift++的聚类、特征自相似性分析以及广义最大池化(Generalized Max-Pooling, GMP)。通过将检测结果应用于训练和评估阶段,提升罕见人脸的采样比例或贡献度,并引入具有质量感知能力的GMP以增强对低质量人脸的鲁棒性,从而有效抑制突发性现象并提升识别性能。
链接: https://arxiv.org/abs/2506.20312
作者: Jiong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 5 figures
Abstract:Burstiness, a phenomenon observed in text and image retrieval, refers to particular elements appearing more times in a set than a statistically independent model assumes. We argue that in the context of set-based face recognition (SFR), burstiness exists widely and degrades performance in two aspects. Firstly, bursty faces, i.e., faces with attributes that appear frequently within a face set, dominate the training face sets and lead to poor generalization ability to unconstrained scenarios. Secondly, bursty faces dominating the evaluation sets interfere with the similarity comparison in set verification and identification. To detect the bursty faces in a set, we propose three strategies based on Quickshift++, feature self-similarity, and generalized max-pooling (GMP). We apply the burst detection results at the training and evaluation stages to enhance the sampling ratios or contributions of the infrequent faces. For evaluation, we additionally propose the quality-aware GMP, which makes the original GMP aware of face quality and robust to low-quality faces. We give illustrations and extensive experiments on the SFR benchmarks to demonstrate that burstiness is widespread and that suppressing burstiness considerably improves recognition performance.
zh
[CV-27] Radiomic fingerprints for knee MR images assessment
【速读】:该论文试图解决传统放射组学(radiomic)方法在膝关节MRI影像诊断中因固定特征集(signature)导致的泛化能力差和个体病理变异表达不足的问题。现有方法依赖于在群体层面选择的固定特征集,虽然具有可解释性,但缺乏对个体差异的适应性,从而限制了其性能。解决方案的关键在于提出一种动态构建的放射组学指纹(radiomic fingerprint)框架,通过深度学习模型为每位患者从大规模特征池中选择与临床状况相关的预测特征,实现个性化特征提取,同时结合低维逻辑回归模型进行下游分类,从而在保持可解释性的同时提升诊断准确性。
链接: https://arxiv.org/abs/2506.20306
作者: Yaxi Chen,Simin Ni,Shaheer U. Saeed,Aleksandra Ivanova,Rikin Hargunani,Jie Huang,Chaozong Liu,Yipeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate interpretation of knee MRI scans relies on expert clinical judgment, often with high variability and limited scalability. Existing radiomic approaches use a fixed set of radiomic features (the signature), selected at the population level and applied uniformly to all patients. While interpretable, these signatures are often too constrained to represent individual pathological variations. As a result, conventional radiomic-based approaches are found to be limited in performance, compared with recent end-to-end deep learning (DL) alternatives that do not use interpretable radiomic features. We argue that the individual-agnostic nature of current radiomic selection is not central to its interpretability, but is responsible for the poor generalization in our application. Here, we propose a novel radiomic fingerprint framework, in which a radiomic feature set (the fingerprint) is dynamically constructed for each patient, selected by a DL model. Unlike existing radiomic signatures, our fingerprints are derived on a per-patient basis by predicting the feature relevance in a large radiomic feature pool and selecting only those that are predictive of clinical conditions for individual patients. The radiomic-selecting model is trained simultaneously with a low-dimensional (considered relatively explainable) logistic regression for downstream classification. We validate our method across multiple diagnostic tasks, including general knee abnormalities, anterior cruciate ligament (ACL) tears, and meniscus tears, demonstrating comparable or superior diagnostic accuracy relative to state-of-the-art end-to-end DL models. More importantly, we show that the interpretability inherent in our approach facilitates meaningful clinical insights and potential biomarker discovery, with detailed discussion and quantitative and qualitative analysis of real-world clinical cases to evidence these advantages.
zh
[CV-28] Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding
【速读】:该论文试图解决的是如何利用学习方法实现对二维码(QR code)的解码问题,特别是针对具有中等输入敏感度(input-sensitivity)的学习函数进行研究。解决方案的关键在于使用Transformer模型,通过学习嵌入文本的结构,成功解码二维码,甚至超越了理论上的纠错极限。实验表明,该方法能够从以英语为主的训练数据泛化到其他语言及随机字符串,并且Transformer模型在解码过程中更关注数据位而非纠错位,表明其采用了与传统二维码读取器不同的解码机制。
链接: https://arxiv.org/abs/2506.20305
作者: Kazuki Yoda,Kazuhiko Kawamoto,Hiroshi Kera
机构: Chiba University (千叶大学); Zuse Institute Berlin (祖泽研究所柏林)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 13 figures
Abstract:The hardness of learning a function that attains a target task relates to its input-sensitivity. For example, image classification tasks are input-insensitive as minor corruptions should not affect the classification results, whereas arithmetic and symbolic computation, which have been recently attracting interest, are highly input-sensitive as each input variable connects to the computation results. This study presents the first learning-based Quick Response (QR) code decoding and investigates learning functions of medium sensitivity. Our experiments reveal that Transformers can successfully decode QR codes, even beyond the theoretical error-correction limit, by learning the structure of embedded texts. They generalize from English-rich training data to other languages and even random strings. Moreover, we observe that the Transformer-based QR decoder focuses on data bits while ignoring error-correction bits, suggesting a decoding mechanism distinct from standard QR code readers.
zh
[CV-29] TDiR: Transformer-based Diffusion for Image Restoration Tasks
【速读】:该论文旨在解决在复杂环境中捕获的图像因噪声、色彩偏移、模糊和光散射等因素导致的图像质量下降问题,这些问题严重影响了图像在目标检测、地图构建和分类等下游任务中的应用。其解决方案的关键在于提出一种基于Transformer的扩散模型,通过结合扩散模型与Transformer架构,显著提升了退化图像的质量,从而在多个公共数据集上的水下图像增强、去噪和去雨任务中超越了现有深度学习方法。
链接: https://arxiv.org/abs/2506.20302
作者: Abbas Anwar,Mohammad Shullar,Ali Arshad Nasir,Mudassir Masood,Saeed Anwar
机构: KFUPM(沙特国王大学); UWA(西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Images captured in challenging environments often experience various forms of degradation, including noise, color cast, blur, and light scattering. These effects significantly reduce image quality, hindering their applicability in downstream tasks such as object detection, mapping, and classification. Our transformer-based diffusion model was developed to address image restoration tasks, aiming to improve the quality of degraded images. This model was evaluated against existing deep learning methodologies across multiple quality metrics for underwater image enhancement, denoising, and deraining on publicly available datasets. Our findings demonstrate that the diffusion model, combined with transformers, surpasses current methods in performance. The results of our model highlight the efficacy of diffusion models and transformers in improving the quality of degraded images, consequently expanding their utility in downstream tasks that require high-fidelity visual data.
zh
[CV-30] Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations
【速读】:该论文试图解决扩散模型在条件生成过程中容易陷入局部最优解的问题,这些局部最优解虽然在视觉上局部一致,但整体上可能存在不一致或条件不对齐的情况。解决方案的关键在于提出一种名为Ctrl-Z Sampling的新型采样策略,该策略通过奖励模型识别潜在的局部最大值,并通过注入噪声和回退到更嘈杂的状态来逃离这些局部最优解,从而实现更高质量的生成结果。
链接: https://arxiv.org/abs/2506.20294
作者: Shunqi Mao,Wei Guo,Chaoyi Zhang,Weidong Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 2 tables
Abstract:Diffusion models have shown strong performance in conditional generation by progressively denoising Gaussian noise toward a target data distribution. This denoising process can be interpreted as a form of hill climbing in a learned latent space, where the model iteratively refines the sample toward regions of higher probability. However, diffusion models often converge to local optima that are locally visually coherent yet globally inconsistent or conditionally misaligned, due to latent space complexity and suboptimal initialization. Prior efforts attempted to address this by strengthening guidance signals or manipulating the initial noise distribution. We introduce Controlled Random Zigzag Sampling (Ctrl-Z Sampling), a novel sampling strategy designed to detect and escape such local maxima during conditional generation. The method first identifies potential local maxima using a reward model. Upon detection, it injects noise and reverts to a previous, noisier state to escape the current optimization plateau. The reward model then evaluates candidate trajectories, accepting only those that offer improvement, while progressively deeper retreat enables stronger escapes when nearby alternatives fail. This controlled random zigzag process allows dynamic alternation between forward refinement and backward exploration, enhancing both alignment and visual quality in the generated outputs. The proposed Ctrl-Z Sampling is model-agnostic and compatible with existing diffusion frameworks. Experimental results show that Ctrl-Z Sampling substantially improves generation quality with only around 7.6X increase in function evaluations.
zh
[CV-31] Breaking Spatial Boundaries: Spectral-Domain Registration Guided Hyperspectral and Multispectral Blind Fusion
【速读】:该论文旨在解决未配准的高光谱图像(Hyperspectral Images, HSIs)与多光谱图像(Multispectral Images, MSIs)的盲融合问题。现有方法通过在HSI上应用空间变换以实现与MSI的对齐,但由于两者在空间分辨率上的显著差异,导致性能不理想,且在处理大尺寸遥感图像时注册过程耗时较长。该论文的关键解决方案是从光谱域出发,提出一种轻量级的光谱先验学习(Spectral Prior Learning, SPL)网络,用于提取HSI的光谱特征并增强MSI的光谱分辨率,随后通过子空间表示和循环训练策略提升注册后HSI的光谱精度,并结合盲稀疏融合(Blind Sparse Fusion, BSF)方法,利用组稀疏正则化来等效促进图像的低秩性,从而避免了秩估计并降低计算复杂度。
链接: https://arxiv.org/abs/2506.20293
作者: Kunjing Yang,Libin Zheng,Minru Bai,Ting Lu,Leyuan Fang
机构: Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:The blind fusion of unregistered hyperspectral images (HSIs) and multispectral images (MSIs) has attracted growing attention recently. To address the registration challenge, most existing methods employ spatial transformations on the HSI to achieve alignment with the MSI. However, due to the substantial differences in spatial resolution of the images, the performance of these methods is often unsatisfactory. Moreover, the registration process tends to be time-consuming when dealing with large-sized images in remote sensing. To address these issues, we propose tackling the registration problem from the spectral domain. Initially, a lightweight Spectral Prior Learning (SPL) network is developed to extract spectral features from the HSI and enhance the spectral resolution of the MSI. Following this, the obtained image undergoes spatial downsampling to produce the registered HSI. In this process, subspace representation and cyclic training strategy are employed to improve spectral accuracy of the registered HSI obtained. Next, we propose a blind sparse fusion (BSF) method, which utilizes group sparsity regularization to equivalently promote the low-rankness of the image. This approach not only circumvents the need for rank estimation, but also reduces computational complexity. Then, we employ the Proximal Alternating Optimization (PAO) algorithm to solve the BSF model, and present its convergence analysis. Finally, extensive numerical experiments on simulated and real datasets are conducted to verify the effectiveness of our method in registration and fusion. We also demonstrate its efficacy in enhancing classification performance.
zh
[CV-32] From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios
【速读】:该论文旨在解决密集预测任务在真实世界场景中泛化能力不足的问题,当前方法主要针对理想化条件设计,面临真实世界数据稀缺的挑战。其解决方案的关键在于提出DenseDiT,该方法通过统一策略最大化利用生成模型的视觉先验,结合参数复用机制和两个轻量级分支,自适应整合多尺度上下文信息,仅需不到0.1%的额外参数即可完成多种真实世界的密集预测任务。
链接: https://arxiv.org/abs/2506.20279
作者: Changliang Xia,Chengyou Jia,Zhuohang Dang,Minnan Luo
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dense prediction tasks hold significant importance in computer vision, aiming to learn pixel-wise annotated labels for an input image. Despite advances in this field, existing methods primarily focus on idealized conditions, with limited generalization to real-world scenarios and facing the challenging scarcity of real-world data. To systematically study this problem, we first introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications, featuring unified evaluation across tasks. Then, we propose DenseDiT, which maximally exploits generative models' visual priors to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism and two lightweight branches that adaptively integrate multi-scale context, working with less than 0.1% additional parameters. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment. Our data, checkpoints, and code are available at this https URL
zh
[CV-33] Forensic Study of Paintings Through the Comparison of Fabrics
【速读】:该论文试图解决艺术作品中画布织物鉴定、归属和保护过程中传统方法的局限性,特别是当画布不来自卷轴连续位置时,基于线密度图匹配的方法无法适用的问题。解决方案的关键在于提出一种基于深度学习的新型方法,通过设计并训练一个Siamese深度学习模型,利用从扫描图像中学习到的特征表示来比较画布图像对,并结合多对织物样本的预测结果生成鲁棒的相似性评分,从而实现无需依赖线密度图的画布相似性评估。
链接: https://arxiv.org/abs/2506.20272
作者: Juan José Murillo-Fuentes,Pablo M. Olmos,Laura Alba-Carcelén
机构: ETSi Universidad de Sevilla(ETSi Universidad de Sevilla); Universidad Carlos III de Madrid( Universidad Carlos III de Madrid); Museo Nacional del Prado(国家普拉多博物馆)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The study of canvas fabrics in works of art is a crucial tool for authentication, attribution and conservation. Traditional methods are based on thread density map matching, which cannot be applied when canvases do not come from contiguous positions on a roll. This paper presents a novel approach based on deep learning to assess the similarity of textiles. We introduce an automatic tool that evaluates the similarity between canvases without relying on thread density maps. A Siamese deep learning model is designed and trained to compare pairs of images by exploiting the feature representations learned from the scans. In addition, a similarity estimation method is proposed, aggregating predictions from multiple pairs of cloth samples to provide a robust similarity score. Our approach is applied to canvases from the Museo Nacional del Prado, corroborating the hypothesis that plain weave canvases, widely used in painting, can be effectively compared even when their thread densities are similar. The results demonstrate the feasibility and accuracy of the proposed method, opening new avenues for the analysis of masterpieces.
zh
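To make the pairwise-comparison idea concrete, here is a minimal PyTorch sketch of a Siamese comparator with a shared encoder and a scoring head, together with the aggregation of many patch-pair predictions into one canvas-level similarity. The encoder layers, the sigmoid head, and the median aggregation are illustrative stand-ins, not the authors' architecture.

```python
import torch
import torch.nn as nn

class SiameseFabricNet(nn.Module):
    """Shared encoder embeds each canvas scan patch; a small head scores a pair."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
        self.head = nn.Sequential(nn.Linear(2 * dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, a, b):
        za, zb = self.encoder(a), self.encoder(b)   # identical weights for both inputs
        return torch.sigmoid(self.head(torch.cat([za, zb], dim=-1)))

def canvas_similarity(model, patches_a, patches_b):
    """Aggregate many patch-pair scores into a robust canvas-level similarity."""
    with torch.no_grad():
        scores = [model(pa[None], pb[None]).item()
                  for pa in patches_a for pb in patches_b]
    return float(torch.tensor(scores).median())

model = SiameseFabricNet()
sim = canvas_similarity(model, torch.randn(3, 1, 64, 64), torch.randn(3, 1, 64, 64))
```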
[CV-34] X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis MICCAI2025
【速读】:该论文旨在解决三维医学影像中复杂结构(如大脑皮层)难以可视化和解释的问题,从而提升可解释模型在临床决策中的应用。其解决方案的关键在于提出一种基于可解释皮层特征的人类可理解预测的神经网络——可解释表面视觉Transformer(X-SiT),其中引入了原型表面块解码器,结合基于案例的推理与空间对应皮层原型,实现了对阿尔茨海默病和额颞叶痴呆的高精度检测,并提供了与已知疾病模式一致的可解释原型。
链接: https://arxiv.org/abs/2506.20267
作者: Fabian Bongratz,Tom Nuno Wolf,Jaume Gual Ramon,Christian Wachinger
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI 2025
Abstract:Interpretable models are crucial for supporting clinical decision-making, driving advances in their development and application for medical images. However, the nature of 3D volumetric data makes it inherently challenging to visualize and interpret intricate and complex structures like the cerebral cortex. Cortical surface renderings, on the other hand, provide a more accessible and understandable 3D representation of brain anatomy, facilitating visualization and interactive exploration. Motivated by this advantage and the widespread use of surface data for studying neurological disorders, we present the eXplainable Surface Vision Transformer (X-SiT). This is the first inherently interpretable neural network that offers human-understandable predictions based on interpretable cortical features. As part of X-SiT, we introduce a prototypical surface patch decoder for classifying surface patch embeddings, incorporating case-based reasoning with spatially corresponding cortical prototypes. The results demonstrate state-of-the-art performance in detecting Alzheimer’s disease and frontotemporal dementia while additionally providing informative prototypes that align with known disease patterns and reveal classification errors.
zh
[CV-35] Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification
【速读】:该论文旨在解决少样本细粒度图像分类(Few-shot Fine-grained Image Classification, FS-FGIC)问题,即在标注样本有限的情况下,模型需准确区分视觉相似的子类。现有方法存在两大局限:基于度量的方法丢失空间信息且局部特征对齐不足,基于重建的方法未能有效利用层次化特征信息且缺乏关注判别区域的机制。论文提出的解决方案是Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN),其关键在于通过双层特征重建与融合模块,结合可学习融合权重,平衡高层语义表示与中层结构细节;同时引入空间二值掩码增强的Transformer自重建模块,通过自适应阈值处理查询特征并保留完整支持特征,提升对判别区域的关注并抑制背景噪声。
链接: https://arxiv.org/abs/2506.20263
作者: Ning Luo,Meiyin Hu,Huan Wan,Yanyan Yang,Zhuohang Jiang,Xin Wei
机构: Nanchang University (南昌大学); Jiangxi Normal University (江西师范大学); Beijing Jiaotong University (北京交通大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot fine-grained image classification (FS-FGIC) presents a significant challenge, requiring models to distinguish visually similar subclasses with limited labeled examples. Existing methods have critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods fail to utilize hierarchical feature information and lack mechanisms to focus on discriminative regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which integrates dual-layer feature reconstruction with mask-enhanced feature processing to improve fine-grained classification. HMDRN incorporates a dual-layer feature reconstruction and fusion module that leverages complementary visual information from different network hierarchies. Through learnable fusion weights, the model balances high-level semantic representations from the last layer with mid-level structural details from the penultimate layer. Additionally, we design a spatial binary mask-enhanced transformer self-reconstruction module that processes query features through adaptive thresholding while maintaining complete support features, enhancing focus on discriminative regions while filtering background noise. Extensive experiments on three challenging fine-grained datasets demonstrate that HMDRN consistently outperforms state-of-the-art methods across Conv-4 and ResNet-12 backbone architectures. Comprehensive ablation studies validate the effectiveness of each proposed component, revealing that dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations. Visualization results provide evidence of HMDRN’s superior feature reconstruction capabilities.
zh
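Two of HMDRN's ingredients are easy to sketch: a learnable weight that blends last-layer semantics with penultimate-layer structure, and a spatial binary mask obtained by adaptive thresholding of activation energy. Both snippets below are simplified readings of the abstract; in particular, the mean-energy threshold is an assumption, not necessarily the paper's rule.

```python
import torch
import torch.nn as nn

class DualLayerFusion(nn.Module):
    """Blend features from the last and penultimate layers with a learned weight."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, feat_last, feat_penult):   # same shape assumed
        w = torch.sigmoid(self.alpha)            # fusion weight kept in (0, 1)
        return w * feat_last + (1 - w) * feat_penult

def adaptive_binary_mask(feat):
    """Binary spatial mask from per-location activation energy; the threshold
    is the mean energy of the feature map (one plausible 'adaptive' choice)."""
    energy = feat.pow(2).mean(dim=1, keepdim=True)      # (B, 1, H, W)
    thresh = energy.mean(dim=(2, 3), keepdim=True)
    return (energy > thresh).float()

query = torch.randn(2, 64, 7, 7)
masked_query = query * adaptive_binary_mask(query)   # suppress background positions
```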
[CV-36] A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features
【速读】:该论文试图解决手写识别中单一模态信息利用不足的问题,即现有系统通常仅依赖离线图像或在线笔迹轨迹中的一种模态,而忽略了两者可能提供的互补线索。解决方案的关键在于引入一个端到端网络,在共享潜在空间中进行离线图像与在线笔迹数据的早期融合。该网络通过补丁编码器将灰度图像转换为固定长度的视觉标记,并利用轻量级Transformer对(x, y, pen)序列进行嵌入,可学习的潜在查询联合关注两种标记流,生成上下文增强的笔迹嵌入,最终在交叉熵损失目标下进行池化和解码,从而提升识别性能。
链接: https://arxiv.org/abs/2506.20255
作者: Ayush Lodh,Ritabrata Chakraborty,Shivakumara Palaiahnakote,Umapada Pal
机构: Indian Statistical Institute (印度统计研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 7 figures
Abstract:We posit that handwriting recognition benefits from complementary cues carried by the rasterized complex glyph and the pen's trajectory, yet most systems exploit only one modality. We introduce an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the (x, y, pen) sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy loss objective. Because integration occurs before any high-level classification, temporal cues reinforce each other during representation learning, producing stronger writer independence. Comprehensive experiments on IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%. Our study also shows adaptation of this pipeline with gesturification on the ISI-Air dataset. Our code can be found here.
zh
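The fusion step, learnable latent queries attending jointly to visual tokens and stroke tokens, can be sketched in a few lines of PyTorch. The token dimension, the number of queries, and the mean-pooling classification head below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentQueryFusion(nn.Module):
    """Learnable latent queries cross-attend to the concatenation of visual
    tokens (from a patch encoder) and stroke tokens (from an (x, y, pen)
    transformer), then get pooled and decoded."""
    def __init__(self, dim=128, n_queries=8, n_heads=4, n_classes=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, visual_tokens, stroke_tokens):
        # visual_tokens: (B, Nv, D); stroke_tokens: (B, Ns, D)
        tokens = torch.cat([visual_tokens, stroke_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)   # queries attend to both streams
        return self.cls(fused.mean(dim=1))        # pool, then decode

logits = LatentQueryFusion()(torch.randn(2, 49, 128), torch.randn(2, 60, 128))
```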
[CV-37] Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement MICCAI2025
【速读】:该论文旨在解决跨机构和跨手术流程的外科工作流理解中模型泛化能力不足的问题,尤其是在面对不同手术室环境、机构协议和解剖变异时,现有基于大规模视觉-语言数据预训练的外科基础模型在零样本场景下的性能受限。解决方案的关键在于提出一种轻量级框架Surgical Phase Anywhere (SPA),其核心是通过少量样本的空间适应对多模态嵌入进行机构特定场景和阶段的对齐,并利用扩散建模确保时间一致性,同时通过动态测试时适应机制,在无需额外标注的情况下自监督地提升模型在测试分布偏移下的可靠性。
链接: https://arxiv.org/abs/2506.20254
作者: Kun Yuan,Tingxuan Chen,Shi Li,Joel L. Lavanchy,Christian Heiliger,Ege Özsoy,Yiming Huang,Long Bai,Nassir Navab,Vinkle Srivastav,Hongliang Ren,Nicolas Padoy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025
Abstract:The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at this https URL
zh
[CV-38] FedBKD: Distilled Federated Learning to Embrace Generalization and Personalization on Non-IID Data
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中非独立同分布(non-IID)数据带来的挑战,即如何同时构建一个泛化能力强的全局模型和性能优异的个性化本地模型。现有方法要么专注于构建强大的全局模型,要么侧重于定制本地模型,难以兼顾两者。此外,许多解决方案依赖引入公共数据集来缓解non-IID问题,但可能增加数据泄露风险。论文提出的解决方案关键在于提出一种无需外部数据的双向知识蒸馏框架——联邦双向知识蒸馏(FedBKD),通过训练生成对抗网络(GAN)生成合成数据,并利用本地模型作为判别器冻结参数,随后通过双向蒸馏实现全局与本地模型之间的知识交互,从而提升双方性能。
链接: https://arxiv.org/abs/2506.20245
作者: Yushan Zhao,Jinyuan He,Donglai Chen,Weijie Luo,Chong Xie,Ri Zhang,Yonghong Chen,Yan Xu
机构: Linklogis Inc. (Linklogis公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Federated learning (FL) is a decentralized collaborative machine learning (ML) technique. It provides a solution to the issues of isolated data islands and data privacy leakage in industrial ML practices. One major challenge in FL is handling non-identically and independently distributed (non-IID) data. Current solutions either focus on constructing an all-powerful global model, or customizing personalized local models. Few of them can provide both a well-generalized global model and well-performing local models at the same time. Additionally, many FL solutions to the non-IID problem benefit from introducing public datasets. However, this will also increase the risk of data leakage. To tackle the problems, we propose a novel data-free distillation framework, Federated Bidirectional Knowledge Distillation (FedBKD). Specifically, we train Generative Adversarial Networks (GAN) for synthetic data. During the GAN training, local models serve as discriminators and their parameters are frozen. The synthetic data is then used for bidirectional distillation between global and local models to achieve knowledge interactions so that performance on both sides is improved. We conduct extensive experiments on 4 benchmarks under different non-IID settings. The results show that FedBKD achieves SOTA performance in every case.
zh
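A minimal sketch of the bidirectional-distillation step on GAN-synthesized inputs: the global model distills from the local model and vice versa, each treating the other's detached logits as a soft teacher. The temperature, the optimizer handling, and the alternating schedule are simplifications, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL distillation loss."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def bidirectional_step(global_model, local_model, synth_x, opt_g, opt_l):
    """One exchange on synthetic inputs: global learns from local, then
    local learns from global (teacher logits detached each time)."""
    opt_g.zero_grad()
    loss_g = distill_loss(global_model(synth_x), local_model(synth_x).detach())
    loss_g.backward()
    opt_g.step()

    opt_l.zero_grad()
    loss_l = distill_loss(local_model(synth_x), global_model(synth_x).detach())
    loss_l.backward()
    opt_l.step()
    return loss_g.item(), loss_l.item()
```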
[CV-39] Dynamic Bandwidth Allocation for Hybrid Event-RGB Transmission
【速读】:该论文试图解决混合事件相机(event camera)与RGB相机系统中由于大量触发事件和RGB图像传输导致的带宽瓶颈问题,以及由此带来的重建性能下降和实时去模糊困难。其解决方案的关键在于提出一种联合事件与图像(E-I)传输框架,通过贝叶斯建模和信息瓶颈方法分离共享信息与领域特定信息,从而消除冗余并优化信道带宽利用,同时根据场景动态自适应分配传输带宽,以实现高效重建和实时去模糊。
链接: https://arxiv.org/abs/2506.20222
作者: Pujing Yang,Guangyi Zhang,Yunlong Cai,Lei Yu,Guanding Yu
机构: Zhejiang University (浙江大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:Event cameras asynchronously capture pixel-level intensity changes with extremely low latency. They are increasingly used in conjunction with RGB cameras for a wide range of vision-related applications. However, a major challenge in these hybrid systems lies in the transmission of the large volume of triggered events and RGB images. To address this, we propose a transmission scheme that retains efficient reconstruction performance of both sources while accomplishing real-time deblurring in parallel. Conventional RGB cameras and event cameras typically capture the same scene in different ways, often resulting in significant redundant information across their outputs. To address this, we develop a joint event and image (E-I) transmission framework to eliminate redundancy and thereby optimize channel bandwidth utilization. Our approach employs Bayesian modeling and the information bottleneck method to disentangle the shared and domain-specific information within the E-I inputs. This disentangled information bottleneck framework ensures both the compactness and informativeness of extracted shared and domain-specific information. Moreover, it adaptively allocates transmission bandwidth based on scene dynamics, i.e., more symbols are allocated to events for dynamic details or to images for static information. Simulation results demonstrate that the proposed scheme not only achieves superior reconstruction quality compared to conventional systems but also delivers enhanced deblurring performance.
zh
[CV-40] UniCode2: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
【速读】:该论文旨在解决现有基于代码本(codebook)的多模态大语言模型在视觉标记化过程中存在的语义细粒度不足、令牌利用率低以及训练不稳定的问题。其解决方案的关键在于提出UniCode^2,一个级联代码本框架,通过聚类数百万个SigLIP序列嵌入构建一个500K条目的代码本,以保持视觉-语言对齐并扩展容量,同时通过冻结代码本与可训练代码本的解耦设计确保稳定性与高令牌利用率。
链接: https://arxiv.org/abs/2506.20214
作者: Yanzhe Chen(Yen-chieh Chan),Huasong Zhong,Yan Li,Zhenheng Yang
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 19 pages, 5 figures
Abstract:Unified multimodal large language models (MLLMs) have shown promise in jointly advancing multimodal understanding and generation, with visual codebooks discretizing images into tokens for autoregressive modeling. Existing codebook-based methods either rely on small vocabularies (~16K entries) that lack fine-grained semantics or naively scale up, resulting in low token utilization and unstable training. We propose UniCode^2, a cascaded codebook framework enabling large-scale, semantically aligned, and stable visual tokenization. By clustering millions of SigLIP sequence embeddings, we build a 500K-entry codebook that preserves vision-language alignment while expanding capacity. Stability is ensured via a cascaded design: a frozen codebook anchors the embedding space, and a trainable codebook refines task-specific semantics. This decoupling promotes high utilization and robust learning. Moreover, the alignment of our visual tokens with textual semantics enables seamless integration with pretrained diffusion decoders, supporting high-quality visual synthesis with minimal adaptation. UniCode^2 delivers strong performance across diverse benchmarks, demonstrating the viability of scaling visual token spaces without sacrificing stability, semantics, or modularity.
zh
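The cascaded design can be sketched as a frozen anchor codebook plus a trainable refinement codebook, both indexed by the same nearest-neighbor assignment. For brevity the anchor book is random here (the paper builds a ~500K-entry book by clustering SigLIP embeddings), and the straight-through gradient trick usually needed for codebook training is omitted.

```python
import torch
import torch.nn as nn

class CascadedCodebook(nn.Module):
    """Frozen anchor codebook + trainable refinement codebook sharing one
    nearest-neighbor assignment (toy sizes; no straight-through gradients)."""
    def __init__(self, n_codes=1024, dim=256):
        super().__init__()
        # In the paper the anchor book comes from clustering SigLIP embeddings;
        # random entries keep this sketch self-contained.
        self.register_buffer("anchor", torch.randn(n_codes, dim))
        self.refine = nn.Embedding(n_codes, dim)   # trainable refinement
        nn.init.zeros_(self.refine.weight)         # starts as a no-op

    def forward(self, feats):                      # feats: (B, N, D)
        ref = self.anchor.unsqueeze(0).expand(feats.size(0), -1, -1)
        idx = torch.cdist(feats, ref).argmin(dim=-1)      # token ids, (B, N)
        quantized = self.anchor[idx] + self.refine(idx)   # anchor + refinement
        return idx, quantized

idx, q = CascadedCodebook()(torch.randn(2, 16, 256))
```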
[CV-41] Progressive Alignment Degradation Learning for Pansharpening
【速读】:该论文试图解决深度学习在全色锐化(pansharpening)任务中因Wald协议对真实世界退化模式近似不准确而导致模型泛化能力受限的问题。解决方案的关键在于提出Progressive Alignment Degradation Module (PADM),通过PAlignNet和PDegradeNet两个子网络的相互迭代,自适应地学习精确的退化过程,而不依赖于预定义的操作符。
链接: https://arxiv.org/abs/2506.20179
作者: Enzhe Zhao,Zhichang Guo,Yao Li,Fanghui Song,Boying Wu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 13 pages, 9 figures
Abstract:Deep learning-based pansharpening has been shown to effectively generate high-resolution multispectral (HRMS) images. To create supervised ground-truth HRMS images, synthetic data generated using the Wald protocol is commonly employed. This protocol assumes that networks trained on artificial low-resolution data will perform equally well on high-resolution data. However, well-trained models typically exhibit a trade-off in performance between reduced-resolution and full-resolution datasets. In this paper, we delve into the Wald protocol and find that its inaccurate approximation of real-world degradation patterns limits the generalization of deep pansharpening models. To address this issue, we propose the Progressive Alignment Degradation Module (PADM), which uses mutual iteration between two sub-networks, PAlignNet and PDegradeNet, to adaptively learn accurate degradation processes without relying on predefined operators. Building on this, we introduce HFreqdiff, which embeds high-frequency details into a diffusion framework and incorporates CFB and BACM modules for frequency-selective detail extraction and precise reverse process learning. These innovations enable effective integration of high-resolution panchromatic and multispectral images, significantly enhancing spatial sharpness and quality. Experiments and ablation studies demonstrate the proposed method’s superior performance compared to state-of-the-art techniques.
zh
[CV-42] Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition
【速读】:该论文试图解决如何在地球观测(Earth Observation)任务中提升模型性能的同时降低训练成本的问题。其解决方案的关键在于探索将预训练的生成式 AI (Generative AI) 模型进行特征级集成,以替代传统的大规模从头训练方法,从而在保持或超越大模型性能的前提下,减少训练时间和计算资源消耗。此外,研究还提出了通过知识蒸馏技术将集成模型的优势转移到更轻量级模型中的方法,为实际应用提供了可行路径。
链接: https://arxiv.org/abs/2506.20174
作者: Man Duc Chuc
机构: University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models are rapidly transforming Earth Observation data mining by enabling generalizable and scalable solutions for key tasks such as scene classification and semantic segmentation. While most efforts in the geospatial domain have focused on developing large models trained from scratch using massive Earth Observation datasets, an alternative strategy that remains underexplored is the reuse and combination of existing pretrained models. In this study, we investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance across a diverse set of key Earth Observation tasks. Using the GEO-Bench benchmark, we evaluate several prominent models, including Prithvi, Hiera, and DOFA, on eleven datasets covering a range of spatial resolutions, sensor modalities, and task types. The results show that feature-level ensembling of smaller pretrained models can match or exceed the performance of much larger models, while requiring less training time and computational resources. Moreover, the study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models, offering a practical path for deploying foundation models in real-world Earth Observation applications.
zh
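Feature-level ensembling of pretrained models reduces, in its simplest form, to concatenating frozen embeddings and training a light head. The toy backbones below are stand-ins for encoders like Prithvi, Hiera, or DOFA; the output dimensions and the linear head are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class FeatureEnsemble(nn.Module):
    """Concatenate frozen embeddings from several pretrained backbones and
    train only a light linear head on top."""
    def __init__(self, backbones, feat_dims, n_classes):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        for b in self.backbones:
            for p in b.parameters():
                p.requires_grad = False            # backbones stay frozen
        self.head = nn.Linear(sum(feat_dims), n_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = [b(x) for b in self.backbones]   # each: (B, D_i)
        return self.head(torch.cat(feats, dim=-1))

# Toy stand-ins for remote-sensing encoders.
b1 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
b2 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
logits = FeatureEnsemble([b1, b2], [128, 64], n_classes=10)(torch.randn(4, 3, 32, 32))
```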
[CV-43] Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
【速读】:该论文旨在解决在视觉退化条件下,多模态大语言模型在文档理解中因无法准确感知视觉不确定性而导致的OCR幻觉问题。其关键解决方案是提出一种基于GRPO的框架,该框架引入了视觉不确定性自感知机制和拒绝回答的分析方法,以实现视觉忠实的推理,从而有效减少在不确定数据上的幻觉生成。
链接: https://arxiv.org/abs/2506.20168
作者: Zhentao He,Can Zhang,Ziheng Wu,Zhenghao Chen,Yufei Zhan,Yifan Li,Zhao Zhang,Xian Wang,Minghui Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in the generation of hallucinatory content, especially when a precise answer is not feasible. To better demonstrate and analyze this phenomenon and problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. This dataset includes test samples spanning identity cards and invoices, with simulated real-world degradations for OCR reliability. This setup allows for evaluating models’ capacity, under degraded input, to distinguish reliable visual information and answer accordingly, thereby highlighting the challenge of avoiding hallucination on uncertain data. To achieve vision-faithful reasoning and thereby avoid the aforementioned issues, we further introduce a GRPO-based framework featuring a novel reward mechanism. By incorporating a self-awareness of visual uncertainty and an analysis method that initiates refusal to answer to increase task difficulty within our supervised fine-tuning and reinforcement learning framework, we successfully mitigated hallucinations in ambiguous regions. Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a 22% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA and there is no significant performance drop in standard tasks, highlighting both effectiveness and robustness.
zh
[CV-44] Towards Efficient Exemplar Based Image Editing with Multimodal VLMs ECCV2024
【速读】:该论文试图解决基于示例对(exemplar pair)的图像编辑问题,即如何将一个示例对中的编辑效果迁移到目标图像上。传统方法依赖文本描述进行编辑,但文本难以准确表达某些模糊的图像编辑操作。该工作的关键在于利用预训练的文本到图像扩散模型和多模态视觉语言模型(VLM),构建了一个无需优化的端到端管道,从而在保持高效性的同时,提升了多种类型编辑任务的性能。
链接: https://arxiv.org/abs/2506.20155
作者: Avadhoot Jadhav,Ashutosh Srivastava,Abhinav Java,Silky Singh,Tarun Ram Menta,Surgan Jandial,Balaji Krishnamurthy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2024 (AI4VA Workshop)
Abstract:Text-to-Image Diffusion models have enabled a wide array of image editing applications. However, capturing all types of edits through text alone can be challenging and cumbersome. The ambiguous nature of certain image edits is better expressed through an exemplar pair, i.e., a pair of images depicting an image before and after an edit respectively. In this work, we tackle exemplar-based image editing – the task of transferring an edit from an exemplar pair to a content image(s), by leveraging pretrained text-to-image diffusion models and multimodal VLMs. Even though our end-to-end pipeline is optimization-free, our experiments demonstrate that it still outperforms baselines on multiple types of edits while being ~4x faster.
zh
[CV-45] Loss-Aware Automatic Selection of Structured Pruning Criteria for Deep Neural Network Acceleration
【速读】:该论文旨在解决深度神经网络在资源受限的边缘设备上部署时的模型压缩问题,特别是如何高效地进行结构化剪枝以实现模型轻量化和加速。其解决方案的关键在于提出一种基于损失感知的自动结构化剪枝准则选择方法(LAASP),该方法采用“训练中剪枝”的策略,将传统流程中的训练、剪枝和微调阶段整合为一个循环,同时通过网络在小规模训练数据上的整体损失来自动选择剪枝准则和剪枝层,从而避免了手动设置各层剪枝率的繁琐过程,并有效缓解了因剪枝导致的精度骤降问题。
链接: https://arxiv.org/abs/2506.20152
作者: Deepak Ghimire,Kilho Lee,Seong-heum Kim
机构: Soongsil University (崇实大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Structured pruning is a well-established technique for compressing neural networks, making it suitable for deployment in resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages: 1) training, 2) pruning, and 3) fine-tuning, whereas the proposed pruning technique adopts a pruning-while-training approach that eliminates the first stage and integrates the second and third stages into a single cycle. The automatic selection of magnitude or similarity-based filter pruning criteria from a specified pool of criteria and the specific pruning layer at each pruning iteration is guided by the network’s overall loss on a small subset of the training data. To mitigate the abrupt accuracy drop due to pruning, the network is retrained briefly after each reduction of a predefined number of floating-point operations (FLOPs). The optimal pruning rates for each layer in the network are automatically determined, eliminating the need for manual allocation of fixed or variable pruning rates for each layer. Experiments on the VGGNet and ResNet models on the CIFAR-10 and ImageNet benchmark datasets demonstrate the effectiveness of the proposed method. In particular, the ResNet56 and ResNet110 models on the CIFAR-10 dataset significantly improve the top-1 accuracy compared to state-of-the-art methods while reducing the network FLOPs by 52%. Furthermore, the ResNet50 model on the ImageNet dataset reduces FLOPs by more than 42% with a negligible 0.33% drop in top-5 accuracy. The source code of this paper is publicly available online - this https URL.
zh
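The core selection loop, choosing the (criterion, layer) pair whose trial pruning least increases the loss on a small batch of training data, can be sketched as follows. `prune_fn` is a hypothetical placeholder for an actual structured filter-pruning routine, and the 5% trial ratio is likewise an assumption.

```python
import copy
import torch

def select_pruning_action(model, criteria, layers, batch, loss_fn, prune_fn,
                          ratio=0.05):
    """Try every (criterion, layer) pair on a deep copy, prune a small fraction
    of filters, and keep the pair whose pruned copy has the lowest loss on a
    small batch of training data. prune_fn(model, layer, criterion, ratio) is
    a hypothetical placeholder for a real structured-pruning routine."""
    x, y = batch
    best = (None, None, float("inf"))
    for crit in criteria:            # e.g. ["l1_magnitude", "cosine_similarity"]
        for layer in layers:
            candidate = copy.deepcopy(model)
            prune_fn(candidate, layer, crit, ratio)
            with torch.no_grad():
                loss = loss_fn(candidate(x), y).item()
            if loss < best[2]:
                best = (crit, layer, loss)
    return best   # then prune the real model with the winning (criterion, layer)
```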
[CV-46] EAR: Erasing Concepts from Unified Autoregressive Models
【速读】:该论文试图解决在保持生成质量的前提下,从自回归模型(AR)中有效移除不需要的概念这一挑战。其解决方案的关键在于提出一种名为Erasure Autoregressive Model (EAR)的微调方法,该方法通过引入Windowed Gradient Accumulation (WGA)策略以对齐块级解码与擦除目标,并结合Thresholded Loss Masking (TLM)策略以保护与目标概念无关的内容,在微调过程中维持模型的实用性。
链接: https://arxiv.org/abs/2506.20151
作者: Haipeng Fan,Shiyuan Zhang,Baohunesitu,Zihang Guo,Huaiwen Zhang
机构: Inner Mongolia University (内蒙古大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, 1 tables
Abstract:Autoregressive (AR) models have achieved unified and strong performance across both visual understanding and image generation tasks. However, removing undesired concepts from AR models while maintaining overall generation quality remains an open challenge. In this paper, we propose Erasure Autoregressive Model (EAR), a fine-tuning method for effective and utility-preserving concept erasure in AR models. Specifically, we introduce a Windowed Gradient Accumulation (WGA) strategy to align patch-level decoding with erasure objectives, and a Thresholded Loss Masking (TLM) strategy to protect content unrelated to the target concept during fine-tuning. Furthermore, we propose a novel benchmark, Erase Concept Generator and Visual Filter (ECGVF), aimed at providing a more rigorous and comprehensive foundation for evaluating concept erasure in AR models. Specifically, we first employ structured templates across diverse large language models (LLMs) to pre-generate a large-scale corpus of target-replacement concept prompt pairs. Subsequently, we generate images from these prompts and subject them to rigorous filtering via a visual classifier to ensure concept fidelity and alignment. Extensive experimental results conducted on the ECGVF benchmark with the AR model Janus-Pro demonstrate that EAR achieves marked improvements in both erasure effectiveness and model utility preservation. Code is available at: this https URL
zh
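One plausible reading of Thresholded Loss Masking is a per-token loss in which gradient flows only to tokens that are marked as concept-related and whose loss exceeds a threshold, so everything else is protected during fine-tuning. The masking rule and threshold below are our assumptions, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def thresholded_masked_loss(logits, targets, concept_mask, tau=1.0):
    """logits: (B, N, V); targets: (B, N) token ids; concept_mask: (B, N) in {0,1}.
    Keeps loss only on concept-related tokens above the threshold tau; all
    other tokens contribute no gradient (a guess at the TLM rule)."""
    per_tok = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                              reduction="none").view(targets.shape)
    keep = concept_mask.float() * (per_tok > tau).float()
    return (per_tok * keep).sum() / keep.sum().clamp(min=1.0)

loss = thresholded_masked_loss(torch.randn(2, 8, 100),
                               torch.randint(0, 100, (2, 8)),
                               torch.randint(0, 2, (2, 8)))
```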
[CV-47] From 2D to 3D Cognition: A Brief Survey of General World Models
【速读】:该论文试图解决当前3D认知世界模型领域缺乏系统性分析与分类的问题,旨在明确新兴技术的作用并推动其发展。其解决方案的关键在于提出一个概念性框架,聚焦于两个核心技术驱动因素:3D表示的进展和世界知识的融合,并进一步剖析支撑3D世界建模的三项核心认知能力:3D物理场景生成、3D空间推理和3D空间交互。
链接: https://arxiv.org/abs/2506.20134
作者: Ningwei Xie,Zizi Tian,Lei Yang,Xiao-Ping Zhang,Meng Guo,Jie Li
机构: China Mobile Research Institute(中国移动研究院); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models have garnered increasing attention in the development of artificial general intelligence (AGI), serving as computational frameworks for learning representations of the external world and forecasting future states. While early efforts focused on 2D visual perception and simulation, recent 3D-aware generative world models have demonstrated the ability to synthesize geometrically consistent, interactive 3D environments, marking a shift toward 3D spatial cognition. Despite rapid progress, the field lacks systematic analysis to categorize emerging techniques and clarify their roles in advancing 3D cognitive world models. This survey addresses this need by introducing a conceptual framework, providing a structured and forward-looking review of world models transitioning from 2D perception to 3D cognition. Within this framework, we highlight two key technological drivers, particularly advances in 3D representations and the incorporation of world knowledge, as fundamental pillars. Building on these, we dissect three core cognitive capabilities that underpin 3D world modeling: 3D physical scene generation, 3D spatial reasoning, and 3D spatial interaction. We further examine the deployment of these capabilities in real-world applications, including embodied AI, autonomous driving, digital twin, and gaming/VR. Finally, we identify challenges across data, modeling, and deployment, and outline future directions for advancing more robust and generalizable 3D world models.
zh
[CV-48] BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos
【速读】:该论文旨在解决AI生成视频中视觉伪影(visual artifacts)的检测与空间定位问题,这一问题限制了生成视频的真实性和用户信任度。现有研究缺乏专门针对AI生成视频伪影定位的全面基准数据集,现有数据集要么仅限于视频或帧级别的检测,要么缺乏细粒度的空间标注。为此,论文提出了BrokenVideos,一个包含3,254个AI生成视频的基准数据集,其特点是具有像素级的标注掩码,用于突出显示视觉退化区域,并通过人工详细检查确保高质量的地面真实数据。该数据集的关键在于其高精度的像素级标注和严格的质量控制流程,为改进生成模型和提升伪影定位能力提供了重要基础。
链接: https://arxiv.org/abs/2506.20103
作者: Jiahao Lin,Weixuan Peng,Bojia Zi,Yifeng Gao,Xianbiao Qi,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); Shenzhen University (深圳大学); The Chinese University of Hong Kong (香港中文大学); IntelliFusion Inc. (智融科技公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 page,4 figures,2 tables
Abstract:Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial for both automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI generated videos. Existing datasets either restrict themselves to video or frame level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high quality ground truth. Our experiments show that training state of the art artifact detection models and multi modal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: this https URL.
zh
[CV-49] ToSA: Token Merging with Spatial Awareness IROS2025
【速读】:该论文试图解决Vision Transformers (ViT)在计算成本上的高消耗问题,特别是通过引入更有效的token merging策略来加速ViT。现有方法主要依赖视觉token的特征相似性进行合并,但忽略了空间信息的潜在价值。解决方案的关键在于提出ToSA(Token Merging with Spatial Awareness),该方法结合语义和空间感知,利用深度图像生成伪空间token作为辅助空间信息,从而在早期层中更准确地指导token合并,更好地保留关键场景结构。
链接: https://arxiv.org/abs/2506.20066
作者: Hsiang-Wei Huang,Wenhao Chai,Kuang-Ming Chen,Cheng-Yen Yang,Jenq-Neng Hwang
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IROS 2025
Abstract:Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token’s feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: this https URL
zh
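A batch-size-1 sketch of spatially aware token merging: the pairwise affinity mixes cosine feature similarity with a negative normalized spatial distance, and the highest-affinity pairs are merged greedily. The coordinates stand in for the depth-derived pseudo spatial tokens; the mixing weight beta and the greedy, no-recompute merging loop are simplifications.

```python
import torch
import torch.nn.functional as F

def merge_tokens(feats, coords, keep_ratio=0.5, beta=0.5):
    """Greedy pairwise merging for one sample. Affinities are not recomputed
    after each merge, which keeps the sketch short."""
    f = F.normalize(feats, dim=-1)           # (N, D)
    d = torch.cdist(coords, coords)          # spatial distances, (N, N)
    affinity = f @ f.t() - beta * d / d.max()
    affinity.fill_diagonal_(float("-inf"))
    alive = torch.ones(feats.size(0), dtype=torch.bool)
    merged = feats.clone()
    for _ in range(int(feats.size(0) * (1 - keep_ratio))):
        i, j = divmod(int(affinity.argmax()), affinity.size(1))
        merged[i] = (merged[i] + merged[j]) / 2   # absorb token j into token i
        alive[j] = False
        affinity[j, :] = float("-inf")
        affinity[:, j] = float("-inf")
    return merged[alive], coords[alive]

toks, pos = merge_tokens(torch.randn(16, 64), torch.rand(16, 2))
```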
[CV-50] Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception
【速读】:该论文试图解决机器人抓取任务中因目标物体位姿估计过于自信而导致的任务失败问题,其核心在于通过预测位姿估计的不确定性来提升抓取策略的鲁棒性。解决方案的关键是训练轻量级深度网络,使其能够在基于图像的位姿估计指导下预测抓取是否成功,从而在高不确定性情况下避免执行可能导致失败的抓取动作。该方法通过在真实图像上的位姿估计和模拟抓取生成训练数据,并发现尽管抓取实验中物体存在高度多样性,但联合训练所有物体仍能有效提升模型性能。
链接: https://arxiv.org/abs/2506.20045
作者: Eric C. Joyce,Qianwen Zhao,Nathaniel Burgdorfer,Long Wang,Philippos Mordohai
机构: Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal.
zh
[CV-51] EBC-ZIP: Improving Blockwise Crowd Counting with Zero-Inflated Poisson Regression
【速读】:该论文旨在解决现有密度图估计方法在处理真实场景中极度稀疏的地面真实密度图时存在的问题,即模型容易偏向高密度区域的过估计以及在稀疏区域表现不佳。其关键解决方案是提出EBC-ZIP框架,该框架通过使用零膨胀泊松(Zero-Inflated Poisson, ZIP)回归建模计数的空间分布,并用ZIP分布的负对数似然作为损失函数,从而更好地处理零值密集的数据分布,同时保持计数精度。
链接: https://arxiv.org/abs/2506.19955
作者: Yiming Ma,Victor Sanchez,Tanaya Guha
机构: University of Warwick (华威大学); University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Density map estimation has become the mainstream paradigm in crowd counting. However, most existing methods overlook the extreme sparsity of ground-truth density maps. In real-world crowd scenes, the vast majority of spatial regions (often over 95%) contain no people, leading to heavily imbalanced count distributions. Ignoring this imbalance can bias models toward overestimating dense regions and underperforming in sparse areas. Furthermore, most loss functions used in density estimation are largely based on MSE and implicitly assume Gaussian distributions, which are ill-suited for modeling discrete, non-negative count data. In this paper, we propose EBC-ZIP, a crowd counting framework that models the spatial distribution of counts using a Zero-Inflated Poisson (ZIP) regression formulation. Our approach replaces the traditional regression loss with the negative log-likelihood of the ZIP distribution, enabling better handling of zero-heavy distributions while preserving count accuracy. Built upon the recently proposed Enhanced Block Classification (EBC) framework, EBC-ZIP inherits EBC's advantages in preserving the discreteness of targets and ensuring training stability, while further improving performance through a more principled probabilistic loss. We also evaluate EBC-ZIP with backbones of varying computational complexity to assess its scalability. Extensive experiments on four crowd counting benchmarks demonstrate that EBC-ZIP consistently outperforms EBC and achieves state-of-the-art results.
zh
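The ZIP negative log-likelihood that replaces the usual MSE is short to write down: P(y=0) = pi + (1-pi)·exp(-lambda) and P(y=k) = (1-pi)·Poisson(k; lambda) for k ≥ 1. In the sketch below the network is assumed to output a zero-inflation logit and a log-rate per block; that parameterization is our choice for numerical convenience, not necessarily the paper's.

```python
import torch

def zip_nll(pi_logit, log_lam, y, eps=1e-8):
    """Negative log-likelihood of a Zero-Inflated Poisson over block counts y."""
    pi = torch.sigmoid(pi_logit)
    lam = log_lam.exp()
    # P(y = 0) = pi + (1 - pi) * exp(-lam)
    ll_zero = torch.log(pi + (1 - pi) * torch.exp(-lam) + eps)
    # P(y = k) = (1 - pi) * lam^k * exp(-lam) / k!,  k >= 1
    ll_pos = torch.log(1 - pi + eps) + y * log_lam - lam - torch.lgamma(y + 1.0)
    return -torch.where(y > 0, ll_pos, ll_zero).mean()

y = torch.tensor([0.0, 0.0, 3.0, 1.0])   # block-wise counts, mostly zeros
loss = zip_nll(torch.zeros(4), torch.zeros(4), y)
```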
[CV-52] Computer Vision based Automated Quantification of Agricultural Sprayers Boom Displacement
【速读】:该论文试图解决自走式农业喷雾机在农业生产中因喷洒臂(spray boom)不稳定导致的施药率误差问题。解决方案的关键是开发一种自动化计算机视觉系统,用于量化喷洒臂的运动。该系统利用YOLO V7、V8和V11神经网络模型实时跟踪喷洒臂边缘的目标,并结合倾角传感器数据验证模型输出,从而准确测量喷洒臂在垂直和横向方向上的有效位移。
链接: https://arxiv.org/abs/2506.19939
作者: Aryan Singh Dalal,Sidharth Rai,Rahul Singh,Treman Singh Kaloya,Rahul Harsha Cheppally,Ajay Sharda
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under publication process for COMPAG
Abstract:Application rate errors when using self-propelled agricultural sprayers for agricultural production remain a concern. Among other factors, spray boom instability is one of the major contributors to application errors. Spray booms’ width of 38m, combined with 30 kph driving speeds, varying terrain, and machine dynamics when maneuvering complex field boundaries, make controls of these booms very complex. However, there is no quantitative knowledge on the extent of boom movement to systematically develop a solution that might include boom designs and responsive boom control systems. Therefore, this study was conducted to develop an automated computer vision system to quantify the boom movement of various agricultural sprayers. A computer vision system was developed to track a target on the edge of the sprayer boom in real time. YOLO V7, V8, and V11 neural network models were trained to track the boom’s movements in field operations to quantify effective displacement in the vertical and transverse directions. An inclinometer sensor was mounted on the boom to capture boom angles and validate the neural network model output. The results showed that the model could detect the target with more than 90 percent accuracy, and distance estimates of the target on the boom were within 0.026 m of the inclinometer sensor data. This system can quantify the boom movement on the current sprayer and potentially on any other sprayer with minor modifications. The data can be used to make design improvements to make sprayer booms more stable and achieve greater application accuracy.
zh
[CV-53] Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)中自回归(Autoregressive, AR)范式与掩码扩散模型(Masked Diffusion Models, MDMs)之间比较不公平的问题,其核心挑战在于两者通常同时改变建模范式和架构,导致难以区分性能差异是由范式本身还是架构变化引起的。解决方案的关键是将MDMs置于解码器-only框架中进行评估,从而公平比较MDMs(作为任意顺序自回归,Any-Order AR, AO-AR)与标准AR范式,并探讨MDMs中解码器-only与编码器-only架构的影响,揭示了在生成速度与困惑度之间的关键权衡。
链接: https://arxiv.org/abs/2506.19935
作者: Shuchen Xue,Tianyu Xie,Tianyang Hu,Zijin Feng,Jiacheng Sun,Kenji Kawaguchi,Zhenguo Li,Zhi-Ming Ma
机构: University of Chinese Academy of Sciences (中国科学院大学); Academy of Mathematics and Systems Science, Chinese Academy of Sciences (中国科学院数学与系统科学研究院); School of Mathematical Sciences, Peking University (北京大学数学科学学院); National University of Singapore (新加坡国立大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups (~25×) and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at this https URL.
zh
[CV-54] Weighted Mean Frequencies: a handcraft Fourier feature for 4D Flow MRI segmentation
【速读】:该论文旨在解决4D Flow MRI图像在血管分割任务中因分辨率不足和噪声干扰导致的性能受限问题,特别是对壁面剪切应力等生物标志物的影响。其解决方案的关键在于引入一种新的手工特征——加权平均频率(Weighted Mean Frequencies, WMF),该特征能够揭示三维空间中脉动血流经过的区域,从而提升分割精度。通过实验验证,WMF在使用最优阈值分割和深度学习方法时,显著提高了IoU和Dice系数,相较于PC-MRA特征分别提升了0.12和0.13。
链接: https://arxiv.org/abs/2506.20614
作者: Simon Perrin,Sébastien Levilly,Huajun Sun,Harold Mouchère,Jean-Michel Serfaty
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent decades, the use of 4D Flow MRI images has enabled the quantification of velocity fields within a volume of interest and along the cardiac cycle. However, the lack of resolution and the presence of noise in these biomarkers are significant issues. As indicated by recent studies, it appears that biomarkers such as wall shear stress are particularly impacted by the poor resolution of vessel segmentation. The Phase Contrast Magnetic Resonance Angiography (PC-MRA) is the state-of-the-art method to facilitate segmentation. The objective of this work is to introduce a new handcraft feature that provides a novel visualisation of 4D Flow MRI images, which is useful in the segmentation task. This feature, termed Weighted Mean Frequencies (WMF), is capable of revealing the region in three dimensions where a voxel has been passed by pulsatile flow. Indeed, this feature is representative of the hull of all pulsatile velocity voxels. The value of the feature under discussion is illustrated by two experiments. The experiments involved segmenting 4D Flow MRI images using optimal thresholding and deep learning methods. The results obtained demonstrate a substantial enhancement in terms of IoU and Dice, with a respective increase of 0.12 and 0.13 in comparison with the PC-MRA feature, as evidenced by the deep learning task. This feature has the potential to yield valuable insights that could inform future segmentation processes in other vascular regions, such as the heart or the brain.
zh
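Our reconstruction of the WMF idea from the abstract: for each voxel, take the power spectrum of the velocity-magnitude time series over the cardiac cycle and compute the power-weighted mean of the non-DC frequencies, yielding a 3D map that highlights voxels traversed by pulsatile flow. The exact weighting in the paper may differ.

```python
import numpy as np

def weighted_mean_frequencies(vel, dt=1.0):
    """Per-voxel power-weighted mean temporal frequency of a velocity-magnitude
    series vel of shape (T, X, Y, Z). The DC bin is excluded so the score
    reflects pulsatile (time-varying) content."""
    spec = np.fft.rfft(vel, axis=0)                   # (F, X, Y, Z)
    power = np.abs(spec[1:]) ** 2                     # drop the DC component
    freqs = np.fft.rfftfreq(vel.shape[0], d=dt)[1:]   # matching frequency bins
    num = np.tensordot(freqs, power, axes=(0, 0))     # sum_f f * P_f per voxel
    return num / (power.sum(axis=0) + 1e-12)          # (X, Y, Z) WMF map

wmf = weighted_mean_frequencies(np.random.rand(20, 8, 8, 8))
```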
[CV-55] Fusing Radiomic Features with Deep Representations for Gestational Age Estimation in Fetal Ultrasound Images MICCAI2025
【速读】:该论文旨在解决通过胎儿超声图像准确估计孕周(Gestational Age, GA)的问题,传统方法依赖人工测量,存在操作者依赖性和耗时的缺点。其解决方案的关键在于提出一种新型的特征融合框架,利用深度学习模型提取超声图像的深层表征,并结合影像组学(Radiomic)特征来揭示胎儿大脑生长的模式和特征,从而实现无需任何测量信息的GA估计。该方法在三个孕期阶段均表现出优于现有基于机器学习的方法,且在不同地理区域的不同人群中具有良好的鲁棒性。
链接: https://arxiv.org/abs/2506.20407
作者: Fangyijie Wang,Yuan Liang,Sourav Bhattacharjee,Abey Campbell,Kathleen M. Curran,Guénolé Silvestre
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2025
Abstract:Accurate gestational age (GA) estimation, ideally through fetal ultrasound measurement, is a crucial aspect of providing excellent antenatal care. However, deriving GA from manual fetal biometric measurements depends on the operator and is time-consuming. Hence, automatic computer-assisted methods are demanded in clinical practice. In this paper, we present a novel feature fusion framework to estimate GA using fetal ultrasound images without any measurement information. We adopt a deep learning model to extract deep representations from ultrasound images. We extract radiomic features to reveal patterns and characteristics of fetal brain growth. To harness the interpretability of radiomics in medical imaging analysis, we estimate GA by fusing radiomic features and deep representations. Our framework estimates GA with a mean absolute error of 8.0 days across three trimesters, outperforming current machine learning-based methods at these gestational ages. Experimental results demonstrate the robustness of our framework across different populations in diverse geographical regions. Our code is publicly available on GitHub at this https URL.
zh
[CV-56] Practical insights on the effect of different encodings ansätze and measurements in quantum and hybrid convolutional neural networks
【速读】:该论文旨在解决量子和混合卷积神经网络架构中参数化量子电路(PQCs)设计选择对卫星图像分类任务性能的影响问题。其解决方案的关键在于系统评估数据编码技术、变分回路(variational ansätze)以及测量策略在约500种不同模型配置中的表现,从而揭示各设计因素对模型性能的影响层级。研究发现,在混合架构中,数据编码策略是影响模型性能的主要因素,而在纯量子模型中,测量协议和数据到振幅的映射则成为决定性因素。
链接: https://arxiv.org/abs/2506.20355
作者: Jesús Lozano-Cruz,Albert Nieto-Morales,Oriol Balló-Gimbernat,Adan Garriga,Antón Rodríguez-Otero,Alejandro Borrallo-Rentero
机构: Quantum Lab, CTIC Centro Tecnológico, Parque Científico Tecnológico de Gijón, Asturias, Spain; Universidad Internacional de la Rioja (UNIR), Logroño, Spain; Eurecat, Centre Tecnològic de Catalunya, Multimedia Technologies, Barcelona, Spain; Centre de Visió per Computador (CVC), Barcelona, Spain; Universitat Autònoma de Barcelona (UAB), Barcelona, Spain; Fsas International Quantum Center (Fujitsu), Santiago de Compostela, Spain
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 22 figures
Abstract:This study investigates the design choices of parameterized quantum circuits (PQCs) within quantum and hybrid convolutional neural network (HQNN and QCNN) architectures, applied to the task of satellite image classification using the EuroSAT dataset. We systematically evaluate the performance implications of data encoding techniques, variational ansätze, and measurement strategies across approximately 500 distinct model configurations. Our analysis reveals a clear hierarchy of influence on model performance. For hybrid architectures, which were benchmarked against their direct classical equivalents (e.g. the same architecture with the PQCs removed), the data encoding strategy is the dominant factor, with validation accuracy varying by over 30% across distinct embeddings. In contrast, the selection of variational ansätze and measurement basis had a comparatively marginal effect, with validation accuracy variations remaining below 5%. For purely quantum models, restricted to amplitude encoding, performance was most dependent on the measurement protocol and the data-to-amplitude mapping. The measurement strategy varied the validation accuracy by up to 30% and the encoding mapping by around 8 percentage points.
zh
[CV-57] EAGLE: An Efficient Global Attention Lesion Segmentation Model for Hepatic Echinococcosis
【速读】:该论文旨在解决肝包虫病(Hepatic Echinococcosis, HE)医学图像分割中的准确性和效率问题。现有方法如基于卷积神经网络(CNN)和Transformer的模型在处理HE病变分割时存在局限性,CNN缺乏全局上下文建模能力,而Transformer计算成本较高。本文提出的EAGLE网络通过结合一种渐进式视觉状态空间(Progressive Visual State Space, PVSS)编码器与混合视觉状态空间(Hybrid Visual State Space, HVSS)解码器,实现高效且精确的分割。其关键在于引入的卷积视觉状态空间块(Convolutional Vision State Space Block, CVSSB)模块,用于融合局部与全局特征,以及哈尔小波变换块(Haar Wavelet Transformation Block, HWTB)模块,用于将空间信息压缩到通道维度以实现无损下采样。
链接: https://arxiv.org/abs/2506.20333
作者: Jiayan Chen,Kai Li,Yulu Zhao,Jianqiang Huang,Zhan Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hepatic echinococcosis (HE) is a widespread parasitic disease in underdeveloped pastoral areas with limited medical resources. While CNN-based and Transformer-based models have been widely applied to medical image segmentation, CNNs lack global context modeling due to local receptive fields, and Transformers, though capable of capturing long-range dependencies, are computationally expensive. Recently, state space models (SSMs), such as Mamba, have gained attention for their ability to model long sequences with linear complexity. In this paper, we propose EAGLE, a U-shaped network composed of a Progressive Visual State Space (PVSS) encoder and a Hybrid Visual State Space (HVSS) decoder that work collaboratively to achieve efficient and accurate segmentation of hepatic echinococcosis (HE) lesions. The proposed Convolutional Vision State Space Block (CVSSB) module is designed to fuse local and global features, while the Haar Wavelet Transformation Block (HWTB) module compresses spatial information into the channel dimension to enable lossless downsampling. Due to the lack of publicly available HE datasets, we collected CT slices from 260 patients at a local hospital. Experimental results show that EAGLE achieves state-of-the-art performance with a Dice Similarity Coefficient (DSC) of 89.76%, surpassing MSVM-UNet by 1.61%.
zh
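The HWTB idea, compressing spatial information into channels via a lossless downsampling, corresponds to a one-level 2x2 Haar transform whose four sub-bands (LL, LH, HL, HH) are stacked along the channel axis. A minimal sketch follows; the normalization convention is our choice, and H and W must be even.

```python
import torch

def haar_downsample(x):
    """Lossless 2x downsampling: (B, C, H, W) -> (B, 4C, H/2, W/2).
    The four Haar sub-bands are orthonormal, so the transform is invertible."""
    a = x[:, :, 0::2, 0::2]   # top-left of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

out = haar_downsample(torch.randn(1, 3, 64, 64))   # -> (1, 12, 32, 32)
```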
[CV-58] Opportunistic Osteoporosis Diagnosis via Texture-Preserving Self-Supervision Mixture of Experts and Multi-Task Integration MICCAI2025
【速读】:该论文旨在解决骨质疏松症诊断中因双能X线吸收法(DXA)可及性受限而带来的问题,同时克服现有基于机会性CT分析方法的三个局限:未充分利用未标记的椎体数据、设备特异性DXA差异导致的系统性偏差以及临床知识(如BMD空间分布模式)整合不足。其解决方案的关键在于提出一个统一的深度学习框架,包含三项创新:基于影像组学表征的自监督学习方法以利用未标记CT数据并保留骨纹理;采用带有学习门控机制的专家混合(MoE)架构以提升跨设备适应性;以及融合骨质疏松症诊断、BMD回归和椎体定位预测的多任务学习框架。
链接: https://arxiv.org/abs/2506.20282
作者: Jiaxing Huang,Heng Guo,Le Lu,Fan Yang,Minfeng Xu,Ge Yang,Wei Luo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025
Abstract:Osteoporosis, characterized by reduced bone mineral density (BMD) and compromised bone microstructure, increases fracture risk in aging populations. While dual-energy X-ray absorptiometry (DXA) is the clinical standard for BMD assessment, its limited accessibility hinders diagnosis in resource-limited regions. Opportunistic computed tomography (CT) analysis has emerged as a promising alternative for osteoporosis diagnosis using existing imaging data. Current approaches, however, face three limitations: (1) underutilization of unlabeled vertebral data, (2) systematic bias from device-specific DXA discrepancies, and (3) insufficient integration of clinical knowledge such as spatial BMD distribution patterns. To address these, we propose a unified deep learning framework with three innovations. First, a self-supervised learning method using radiomic representations to leverage unlabeled CT data and preserve bone texture. Second, a Mixture of Experts (MoE) architecture with learned gating mechanisms to enhance cross-device adaptability. Third, a multi-task learning framework integrating osteoporosis diagnosis, BMD regression, and vertebra location prediction. Validated across three clinical sites and an external hospital, our approach demonstrates superior generalizability and accuracy over existing methods for opportunistic osteoporosis screening and diagnosis.
zh
[CV-59] MS-IQA: A Multi-Scale Feature Fusion Network for PET/CT Image Quality Assessment MICCAI2025
【速读】:该论文旨在解决医学影像质量评估(Medical Image Quality Assessment, IQA)中无法同时考虑低级特征(如失真)和高级特征(如器官解剖结构)的问题。现有方法在处理PET/CT图像时,难以全面反映图像的诊断价值。该研究提出的MS-IQA模型通过融合ResNet和Swin Transformer不同中间层的多尺度特征,增强了对局部和全局信息的感知能力,并引入多尺度特征融合模块,利用动态加权通道注意力机制有效结合高低级信息,从而提升评估性能。
链接: https://arxiv.org/abs/2506.20200
作者: Siqiao Li,Chen Hui,Wei Zhang,Rui Liang,Chenyue Song,Feng Jiang,Haiqi Zhu,Zhixuan Li,Hong Huang,Xiang Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025
Abstract:Positron Emission Tomography / Computed Tomography (PET/CT) plays a critical role in medical imaging, combining functional and anatomical information to aid in accurate diagnosis. However, image quality degradation due to noise, compression and other factors could potentially lead to diagnostic uncertainty and increase the risk of misdiagnosis. When evaluating the quality of a PET/CT image, both low-level features like distortions and high-level features like organ anatomical structures affect the diagnostic value of the image. However, existing medical image quality assessment (IQA) methods are unable to account for both feature types simultaneously. In this work, we propose MS-IQA, a novel multi-scale feature fusion network for PET/CT IQA, which utilizes multi-scale features from various intermediate layers of ResNet and Swin Transformer, enhancing its ability to perceive both local and global information. In addition, a multi-scale feature fusion module is also introduced to effectively combine high-level and low-level information through a dynamically weighted channel attention mechanism. Finally, to fill the gap in PET/CT IQA datasets, we construct PET-CT-IQA-DS, a dataset containing 2,700 varying-quality PET/CT images with quality scores assigned by radiologists. Experiments on our dataset and the publicly available LDCTIQAC2023 dataset demonstrate that our proposed model has achieved superior performance against existing state-of-the-art methods in various IQA metrics. This work provides an accurate and efficient IQA method for PET/CT. Our code and dataset are available at this https URL.
zh
[CV-60] VoxelOpt: Voxel-Adaptive Message Passing for Discrete Optimization in Deformable Abdominal CT Registration MICCAI2025
【速读】:该论文旨在解决基于学习的变形图像配准(DIR)方法在训练数据有限、大形变以及缺乏标签监督时性能下降的问题,同时克服传统迭代方法在运行效率上的不足。其解决方案的关键在于提出VoxelOpt框架,该框架结合了学习方法与迭代方法的优势,通过引入体素级自适应消息传递机制、多层级图像金字塔结构以及预训练分割模型进行特征提取,从而在保持高配准精度的同时提升计算效率。
链接: https://arxiv.org/abs/2506.19975
作者: Hang Zhang,Yuxi Zhang,Jiazheng Wang,Xiang Chen,Renjiu Hu,Xin Tian,Gaolei Li,Min Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted for publication at MICCAI 2025
Abstract:Recent developments in neural networks have improved deformable image registration (DIR) by amortizing iterative optimization, enabling fast and accurate DIR results. However, learning-based methods often face challenges with limited training data, large deformations, and tend to underperform compared to iterative approaches when label supervision is unavailable. While iterative methods can achieve higher accuracy in such scenarios, they are considerably slower than learning-based methods. To address these limitations, we propose VoxelOpt, a discrete optimization-based DIR framework that combines the strengths of learning-based and iterative methods to achieve a better balance between registration accuracy and runtime. VoxelOpt uses displacement entropy from local cost volumes to measure displacement signal strength at each voxel, which differs from earlier approaches in three key aspects. First, it introduces voxel-wise adaptive message passing, where voxels with lower entropy receive less influence from their neighbors. Second, it employs a multi-level image pyramid with 27-neighbor cost volumes at each level, avoiding exponential complexity growth. Third, it replaces hand-crafted features or contrastive learning with a pretrained foundational segmentation model for feature extraction. In abdominal CT registration, these changes allow VoxelOpt to outperform leading iterative methods in both efficiency and accuracy, while matching state-of-the-art learning-based methods trained with label supervision. The source code will be available at this https URL
zh
[CV-61] A Multi-Modal Spatial Risk Framework for EV Charging Infrastructure Using Remote Sensing
【速读】:该论文旨在解决电动汽车(Electric Vehicle, EV)充电基础设施在环境和基础设施压力下的韧性问题,即评估其在极端气候条件下的脆弱性。解决方案的关键在于提出一种名为RSERI-EV的空间显式多模态风险评估框架,该框架融合了遥感数据、开放基础设施数据集和空间图分析技术,通过整合洪水风险图、地表温度极端值、植被指数、土地利用/覆被类型、电力变电站接近度及道路可达性等多源数据,生成综合的韧性评分,并利用空间k近邻图进行网络层面的邻域比较与图感知诊断,从而支持气候适应性强、基础设施意识明确的EV部署。
链接: https://arxiv.org/abs/2506.19860
作者: Oktay Karakuş,Padraig Corcoran
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, 2 tables
Abstract:Electric vehicle (EV) charging infrastructure is increasingly critical to sustainable transport systems, yet its resilience under environmental and infrastructural stress remains underexplored. In this paper, we introduce RSERI-EV, a spatially explicit and multi-modal risk assessment framework that combines remote sensing data, open infrastructure datasets, and spatial graph analytics to evaluate the vulnerability of EV charging stations. RSERI-EV integrates diverse data layers, including flood risk maps, land surface temperature (LST) extremes, vegetation indices (NDVI), land use/land cover (LULC), proximity to electrical substations, and road accessibility to generate a composite Resilience Score. We apply this framework to an EV charger dataset covering the country of Wales to demonstrate its feasibility. A spatial k-nearest neighbours (kNN) graph is constructed over the charging network to enable neighbourhood-based comparisons and graph-aware diagnostics. Our prototype highlights the value of multi-source data fusion and interpretable spatial reasoning in supporting climate-resilient, infrastructure-aware EV deployment.
zh
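A small sketch of the composite scoring plus the spatial kNN graph: min-max normalize each risk layer, take a weighted sum, then smooth and compare scores over the k nearest charging stations. The layer set, weights, and k below are illustrative, not the paper's calibration; scikit-learn is assumed for the neighbor search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def resilience_scores(layers, weights, coords, k=5):
    """layers: (N, L) risk layers per station; coords: (N, 2) locations.
    Returns the composite score and its kNN-neighborhood mean."""
    Z = []
    for col in layers.T:                       # min-max normalize each layer
        rng = col.max() - col.min()
        Z.append((col - col.min()) / (rng + 1e-12))
    score = np.asarray(Z).T @ np.asarray(weights)         # weighted composite
    nn = NearestNeighbors(n_neighbors=k).fit(coords)
    _, idx = nn.kneighbors(coords)
    return score, score[idx].mean(axis=1)                 # kNN-smoothed score

layers = np.random.rand(100, 4)    # e.g. flood, LST, NDVI, road access
score, neigh = resilience_scores(layers, [0.4, 0.3, 0.2, 0.1],
                                 np.random.rand(100, 2))
```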
[CV-62] Enhanced Dermatology Image Quality Assessment via Cross-Domain Training
【速读】:该论文试图解决远程皮肤科(teledermatology)中图像质量差的问题,这一问题严重影响了远程会诊的实用性。现有的皮肤科图像质量评估(IQA)研究较少,且未充分利用非皮肤科领域的最新IQA进展。论文提出的解决方案的关键在于跨领域(cross-domain)训练IQA模型,通过结合皮肤科和非皮肤科的IQA数据集,从而克服皮肤科IQA数据规模小的限制,并提升模型对多种图像失真类型的处理能力。
链接: https://arxiv.org/abs/2506.16116
作者: Ignacio Hernández Montilla,Alfonso Medela,Paola Pasquali,Andy Aguilar,Taig Mac Carthy,Gerardo Fernández,Antonio Martorell,Enrique Onieva
机构: Legit.Health(合法健康); University of Deusto(德乌斯托大学); Pius Hospital de Valls(比乌斯医院瓦尔斯); Dermatology Department, Hospital de Manises(皮肤病科,曼尼塞斯医院); University of Deusto(德乌斯托大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 9 pages, 4 figures. This manuscript has been accepted to the 2025 12th International Conference on Bioinformatics Research and Applications (ICBRA 2025). It will be published in International Conference Proceedings by ACM, which will be archived in ACM Digital Library, indexed by Ei Compendex and Scopus
Abstract:Teledermatology has become a widely accepted communication method in daily clinical practice, enabling remote care while showing strong agreement with in-person visits. Poor image quality remains an unsolved problem in teledermatology and is a major concern to practitioners, as bad-quality images reduce the usefulness of the remote consultation process. However, research on Image Quality Assessment (IQA) in dermatology is sparse, and does not leverage the latest advances in non-dermatology IQA, such as using larger image databases with ratings from large groups of human observers. In this work, we propose cross-domain training of IQA models, combining dermatology and non-dermatology IQA datasets. For this purpose, we created a novel dermatology IQA database, this http URL-DIQA-Artificial, using dermatology images from several sources and having them annotated by a group of human observers. We demonstrate that cross-domain training yields optimal performance across domains and overcomes one of the biggest limitations in dermatology IQA, which is the small scale of data, and leads to models trained on a larger pool of image distortions, resulting in a better management of image quality in the teledermatology process.
zh
人工智能
[AI-0] Towards Community-Driven Agents for Machine Learning Engineering
【速读】:该论文试图解决现有基于大型语言模型的机器学习(Machine Learning, ML)代理在独立处理研究问题时缺乏与更广泛研究社区互动的问题,从而无法利用集体知识进行优化和创新。解决方案的关键在于提出MLE-Live框架,该框架用于评估代理与模拟Kaggle研究社区交流和利用集体知识的能力,并在此基础上开发出CoMind代理,该代理能够在社区环境中高效交换见解并提出新颖解决方案。
链接: https://arxiv.org/abs/2506.20640
作者: Sijie Li,Weiwei Sun,Shanda Li,Ameet Talwalkar,Yiming Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model-based machine learning (ML) agents have shown great promise in automating ML research. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent’s ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a novel agent that excels at exchanging insights and developing novel solutions within a community context. CoMind achieves state-of-the-art performance on MLE-Live and outperforms 79.2% of human competitors on average across four ongoing Kaggle competitions. Our code is released at this https URL.
zh
[AI-1] Define-ML: An Approach to Ideate Machine Learning-Enabled Systems MICRO
【速读】:该论文试图解决机器学习(Machine Learning, ML)在软件系统中日益广泛应用所带来的特定挑战,如数据依赖性、技术可行性以及业务目标与概率系统行为之间的对齐问题。传统构想方法(如Lean Inception)缺乏针对这些ML特有考虑的结构化支持,可能导致产品愿景偏离和不切实际的期望。解决方案的关键在于提出Define-ML框架,该框架通过引入三项定制活动——数据源映射(Data Source Mapping)、特征到数据源映射(Feature-to-Data Source Mapping)和ML映射(ML Mapping),系统地将数据和技术约束整合到早期ML产品构想中。
链接: https://arxiv.org/abs/2506.20621
作者: Silvio Alonso,Antonio Pedro Santos Alves,Lucas Romao,Hélio Lopes,Marcos Kalinowski
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 51st Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA) 2025
Abstract:[Context] The increasing adoption of machine learning (ML) in software systems demands specialized ideation approaches that address ML-specific challenges, including data dependencies, technical feasibility, and alignment between business objectives and probabilistic system behavior. Traditional ideation methods like Lean Inception lack structured support for these ML considerations, which can result in misaligned product visions and unrealistic expectations. [Goal] This paper presents Define-ML, a framework that extends Lean Inception with tailored activities - Data Source Mapping, Feature-to-Data Source Mapping, and ML Mapping - to systematically integrate data and technical constraints into early-stage ML product ideation. [Method] We developed and validated Define-ML following the Technology Transfer Model, conducting both static validation (with a toy problem) and dynamic validation (in a real-world industrial case study). The analysis combined quantitative surveys with qualitative feedback, assessing utility, ease of use, and intent of adoption. [Results] Participants found Define-ML effective for clarifying data concerns, aligning ML capabilities with business goals, and fostering cross-functional collaboration. The approach’s structured activities reduced ideation ambiguity, though some noted a learning curve for ML-specific components, which can be mitigated by expert facilitation. All participants expressed the intention to adopt Define-ML. [Conclusion] Define-ML provides an openly available, validated approach for ML product ideation, building on Lean Inception’s agility while aligning features with available data and increasing awareness of technical feasibility.
zh
[AI-2] Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings
【速读】:该论文旨在解决枪支暴力事件中实时、准确的枪声检测与枪型分类问题,以提升执法部门应对此类事件的能力。其解决方案的关键在于利用声学分析技术,通过常见的录音设备(如手机)获取的枪声数据,结合机器学习方法实现枪声检测与枪型分类。研究提出并评估了支持向量机(SVM)和卷积神经网络(CNN)两种框架,其中基于深度学习的CNN方法在干净标注数据上的平均精度(mAP)达到0.58,优于SVM基准(mAP 0.39),展示了其在复杂环境下的潜力与优势。
链接: https://arxiv.org/abs/2506.20609
作者: Ankit Shah,Rita Singh,Bhiksha Raj,Alexander Hauptmann
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 4 pages + 1 References
Abstract:The escalating rates of gun-related violence and mass shootings represent a significant threat to public safety. Timely and accurate information for law enforcement agencies is crucial in mitigating these incidents. Current commercial gunshot detection systems, while effective, often come with prohibitive costs. This research explores a cost-effective alternative by leveraging acoustic analysis of gunshot recordings, potentially obtainable from ubiquitous devices like cell phones, to not only detect gunshots but also classify the type of firearm used. This paper details a study on deciphering gun type hierarchies using a curated dataset of 3459 recordings. We investigate the fundamental acoustic characteristics of gunshots, including muzzle blasts and shockwaves, which vary based on firearm type, ammunition, and shooting direction. We propose and evaluate machine learning frameworks, including Support Vector Machines (SVMs) as a baseline and a more advanced Convolutional Neural Network (CNN) architecture for joint gunshot detection and gun type classification. Results indicate that our deep learning approach achieves a mean average precision (mAP) of 0.58 on clean labeled data, outperforming the SVM baseline (mAP 0.39). Challenges related to data quality, environmental noise, and the generalization capabilities when using noisy web-sourced data (mAP 0.35) are also discussed. The long-term vision is to develop a highly accurate, real-time system deployable on common recording devices, significantly reducing detection costs and providing critical intelligence to first responders.
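下面用 PyTorch 给出一个联合“枪声检测 + 枪型分类”的小型 CNN 示意(网络结构与类别数 9 均为本文假设,非论文原始架构),输入为对数梅尔谱:

```python
import torch
import torch.nn as nn

class GunshotCNN(nn.Module):
    """对数梅尔谱输入 (B, 1, n_mels, T),联合输出枪声检测与枪型分类。"""
    def __init__(self, n_gun_types=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.detect = nn.Linear(64, 1)              # 是否为枪声
        self.classify = nn.Linear(64, n_gun_types)  # 枪型层级类别

    def forward(self, x):
        z = self.features(x).flatten(1)
        return self.detect(z), self.classify(z)

model = GunshotCNN()
x = torch.randn(4, 1, 64, 128)  # 4 段录音的对数梅尔谱
det_logit, type_logits = model(x)
print(det_logit.shape, type_logits.shape)  # (4, 1) (4, 9)
```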
zh
[AI-3] AI Assistants to Enhance and Exploit the PETSc Knowledge Base
【速读】:该论文试图解决科学计算库PETSc中积累的大量非结构化、分散的技术知识难以被有效利用的问题,这些知识包括源代码、文档、邮件列表、GitLab问题、Discord对话和技术论文等。解决方案的关键在于构建一个基于大型语言模型(LLMs)的系统,结合检索增强生成(RAG)、重排序算法和聊天机器人等工具,以激活和利用这些知识,从而辅助用户、支持开发者并推动正式文档的更新。该系统通过特定于PETSc的信息检索与处理,提升了数值软件的开发与使用效率。
链接: https://arxiv.org/abs/2506.20608
作者: Barry Smith,Junchao Zhang,Hong Zhang,Lois Curfman McInnes,Murat Keceli,Archit Vasan,Satish Balay,Toby Isaac,Le Chen,Venkatram Vishwanath
机构: 未知
类目: Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:Generative AI, especially through large language models (LLMs), is transforming how technical knowledge can be accessed, reused, and extended. PETSc, a widely used numerical library for high-performance scientific computing, has accumulated a rich but fragmented knowledge base over its three decades of development, spanning source code, documentation, mailing lists, GitLab issues, Discord conversations, technical papers, and more. Much of this knowledge remains informal and inaccessible to users and new developers. To activate and utilize this knowledge base more effectively, the PETSc team has begun building an LLM-powered system that combines PETSc content with custom LLM tools – including retrieval-augmented generation (RAG), reranking algorithms, and chatbots – to assist users, support developers, and propose updates to formal documentation. This paper presents initial experiences designing and evaluating these tools, focusing on system architecture, using RAG and reranking for PETSc-specific information, evaluation methodologies for various LLMs and embedding models, and user interface design. Leveraging the Argonne Leadership Computing Facility resources, we analyze how LLM responses can enhance the development and use of numerical software, with an initial focus on scalable Krylov solvers. Our goal is to establish an extensible framework for knowledge-centered AI in scientific software, enabling scalable support, enriched documentation, and enhanced workflows for research and development. We conclude by outlining directions for expanding this system into a robust, evolving platform that advances software ecosystems to accelerate scientific discovery.
zh
[AI-4] CogGen: A Learner-Centered Generative AI Architecture for Intelligent Tutoring with Programming Video
【速读】:该论文旨在解决传统编程视频教育中缺乏个性化和互动性的问题,通过将编程视频转化为具有适应性的学习体验来提升教学效果。其解决方案的关键在于构建一个以学习者为中心的AI架构——CogGen,该架构整合了基于认知导师制(Cognitive Apprenticeship)的生成式AI辅导与学生建模技术,包含视频按学习目标分割、对话式辅导引擎以及利用贝叶斯知识追踪(Bayesian Knowledge Tracing)进行自适应教学的学生模型三个核心组件。
链接: https://arxiv.org/abs/2506.20600
作者: Wengxi Li,Roy Pea,Nick Haber,Hari Subramonyam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce CogGen, a learner-centered AI architecture that transforms programming videos into interactive, adaptive learning experiences by integrating student modeling with generative AI tutoring based on the Cognitive Apprenticeship framework. The architecture consists of three components: (1) video segmentation by learning goals, (2) a conversational tutoring engine applying Cognitive Apprenticeship strategies, and (3) a student model using Bayesian Knowledge Tracing to adapt instruction. Our technical evaluation demonstrates effective video segmentation accuracy and strong pedagogical alignment across knowledge, method, action, and interaction layers. Ablation studies confirm the necessity of each component in generating effective guidance. This work advances AI-powered tutoring by bridging structured student modeling with interactive AI conversations, offering a scalable approach to enhancing video-based programming education.
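CogGen 的学生模型基于贝叶斯知识追踪(BKT)。下面给出标准 BKT 单步更新的最小实现(slip/guess/transit 参数为演示用假设值),说明如何由作答正误更新“已掌握”概率:

```python
def bkt_update(p_know, correct, p_slip=0.1, p_guess=0.2, p_transit=0.15):
    """标准 BKT 单步更新:由作答结果更新"已掌握"概率。
    参数(slip/guess/transit)为演示用假设值。"""
    if correct:
        posterior = (p_know * (1 - p_slip)) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        posterior = (p_know * p_slip) / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    # 学习迁移:未掌握者有 p_transit 概率在本次练习后掌握
    return posterior + (1 - posterior) * p_transit

p = 0.3  # 先验掌握概率
for outcome in [True, False, True, True]:
    p = bkt_update(p, outcome)
    print(round(p, 3))
```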
zh
[AI-5] Fine-Tuning and Prompt Engineering of LLMs for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges
【速读】:该论文旨在解决可持续蛋白质来源研究中科学知识处理与合成效率低的问题,提出了一种基于多智能体的人工智能(AI)框架以支持微生物蛋白生产的智能化研究。该解决方案的关键在于构建了一个面向检索增强生成(RAG)的系统,包含两个基于GPT的大语言模型(LLM)代理:文献检索代理用于获取特定微生物菌株相关的科学文献,信息提取代理则用于从检索到的内容中提取相关的生物和化学信息。通过微调和提示工程两种方法对代理进行优化,显著提升了信息提取代理的性能,其生成结果与理想输出之间的变换器余弦相似度得分最高提升了25%,并达到了≥0.89的平均得分,其中微调方法在提升平均得分方面表现更优。
链接: https://arxiv.org/abs/2506.20598
作者: Alexander D. Kalian,Jaewook Lee,Stefan P. Johannesson,Lennart Otte,Christer Hogstrand,Miao Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:The global demand for sustainable protein sources has accelerated the need for intelligent tools that can rapidly process and synthesise domain-specific scientific knowledge. In this study, we present a proof-of-concept multi-agent Artificial Intelligence (AI) framework designed to support sustainable protein production research, with an initial focus on microbial protein sources. Our Retrieval-Augmented Generation (RAG)-oriented system consists of two GPT-based LLM agents: (1) a literature search agent that retrieves relevant scientific literature on microbial protein production for a specified microbial strain, and (2) an information extraction agent that processes the retrieved content to extract relevant biological and chemical information. Two parallel methodologies, fine-tuning and prompt engineering, were explored for agent optimisation. Both methods demonstrated effectiveness at improving the performance of the information extraction agent in terms of transformer-based cosine similarity scores between obtained and ideal outputs. Mean cosine similarity scores were increased by up to 25%, while universally reaching mean scores of ≥ 0.89 against ideal output text. Fine-tuning overall improved the mean scores to a greater extent (consistently ≥ 0.94) compared to prompt engineering, although lower statistical uncertainties were observed with the latter approach. A user interface was developed and published for enabling the use of the multi-agent AI system, alongside preliminary exploration of additional chemical safety-based search capabilities.
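论文以“变换器余弦相似度”度量生成输出与理想输出的接近程度。以下是一个基于 sentence-transformers 的最小示意(嵌入模型 all-MiniLM-L6-v2 与样例文本均为本文假设,论文未指明具体嵌入模型):

```python
from sentence_transformers import SentenceTransformer, util

# 模型名与样例文本均为演示用假设
model = SentenceTransformer("all-MiniLM-L6-v2")

obtained = "The strain produces 55% crude protein on methanol substrate."
ideal    = "Crude protein content of the strain is about 55% when grown on methanol."

emb = model.encode([obtained, ideal], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.3f}")  # 论文中理想输出的平均得分 ≥ 0.89
```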
zh
[AI-6] AI in the Writing Process: How Purposeful AI Support Fosters Student Writing
【速读】:该论文试图解决生成式 AI (Generative AI) 在学生写作中的应用所带来的学习者自主性下降和内容参与度不足的问题。研究通过对比三种不同的AI支持方式,探讨其对写作者自主性和知识转化深度的影响。解决方案的关键在于设计一种集成的AI写作工具,该工具能够支持写作过程中的多个子过程,从而增强学生的写作自主性并促进更深层次的知识转化。
链接: https://arxiv.org/abs/2506.20595
作者: Momin N. Siddiqui,Roy Pea,Hari Subramonyam
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:The ubiquity of technologies like ChatGPT has raised concerns about their impact on student writing, particularly regarding reduced learner agency and superficial engagement with content. While standalone chat-based LLMs often produce suboptimal writing outcomes, evidence suggests that purposefully designed AI writing support tools can enhance the writing process. This paper investigates how different AI support approaches affect writers’ sense of agency and depth of knowledge transformation. Through a randomized control trial with 90 undergraduate students, we compare three conditions: (1) a chat-based LLM writing assistant, (2) an integrated AI writing tool to support diverse subprocesses, and (3) a standard writing interface (control). Our findings demonstrate that, among AI-supported conditions, students using the integrated AI writing tool exhibited greater agency over their writing process and engaged in deeper knowledge transformation overall. These results suggest that thoughtfully designed AI writing support targeting specific aspects of the writing process can help students maintain ownership of their work while facilitating improved engagement with content.
zh
[AI-7] Vulnerability Disclosure through Adaptive Black-Box Adversarial Attacks on NIDS
【速读】:该论文试图解决对抗攻击在结构化数据(如网络流量)中的实际应用难题,特别是现有方法在可重复性和应对不断演化的对抗攻击方面存在的不足。其解决方案的关键在于提出一种新的黑盒对抗攻击方法,该方法严格遵守黑盒约束,通过减少交互以避免被检测,并采用基于变化点检测和因果分析的自适应特征选择策略,以识别并针对敏感特征进行扰动,从而实现低计算成本和高部署性的有效攻击。
链接: https://arxiv.org/abs/2506.20576
作者: Sabrine Ennaji,Elhadj Benkhelifa,Luigi V. Mancini
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Adversarial attacks, wherein slightly perturbed inputs are carefully crafted to mislead intelligent models, have attracted increasing attention. However, a critical gap persists between theoretical advancements and practical application, particularly in structured data like network traffic, where interdependent features complicate effective adversarial manipulations. Moreover, ambiguity in current approaches restricts reproducibility and limits progress in this field. Hence, existing defenses often fail to handle evolving adversarial attacks. This paper proposes a novel approach for black-box adversarial attacks that addresses these limitations. Unlike prior work, which often assumes system access or relies on repeated probing, our method strictly respects black-box constraints, reducing interaction to avoid detection and better reflect real-world scenarios. We present an adaptive feature selection strategy using change-point detection and causality analysis to identify and target features sensitive to perturbations. This lightweight design ensures low computational cost and high deployability. Our comprehensive experiments show the attack’s effectiveness in evading detection with minimal interaction, enhancing its adaptability and applicability in real-world scenarios. By advancing the understanding of adversarial attacks in network traffic, this work lays a foundation for developing robust defenses.
zh
[AI-8] Large Language Model-Driven Code Compliance Checking in Building Information Modeling
【速读】:该论文试图解决建筑信息模型(Building Information Modeling, BIM)中手动代码合规性检查耗时且易出错的问题。解决方案的关键在于引入大型语言模型(Large Language Model, LLM)驱动的方法,通过将LLM与Revit软件集成,实现对建筑规范的解读、Python脚本的生成以及BIM环境内的半自动化合规性检查,从而提升检查效率和准确性。
链接: https://arxiv.org/abs/2506.20551
作者: Soumya Madireddy,Lu Gao,Zia Din,Kinam Kim,Ahmed Senouci,Zhe Han,Yunpeng Zhang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:This research addresses the time-consuming and error-prone nature of manual code compliance checking in Building Information Modeling (BIM) by introducing a Large Language Model (LLM)-driven approach to semi-automate this critical process. The developed system integrates LLMs such as GPT, Claude, Gemini, and Llama, with Revit software to interpret building codes, generate Python scripts, and perform semi-automated compliance checks within the BIM environment. Case studies on a single-family residential project and an office building project demonstrated the system’s ability to reduce the time and effort required for compliance checks while improving accuracy. It streamlined the identification of violations, such as non-compliant room dimensions, material usage, and object placements, by automatically assessing relationships and generating actionable reports. Compared to manual methods, the system eliminated repetitive tasks, simplified complex regulations, and ensured reliable adherence to standards. By offering a comprehensive, adaptable, and cost-effective solution, this proposed approach offers a promising advancement in BIM-based compliance checking, with potential applications across diverse regulatory documents in construction projects.
zh
[AI-9] WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
【速读】:该论文试图解决当前AI模型训练和推理过程中能源消耗与碳排放测量与报告工具碎片化、缺乏系统性指标整合及有限的相关性分析支持的问题。解决方案的关键在于提出WattsOnAI,这是一个全面的软件工具包,能够对AI工作负载中的能耗、功耗、硬件性能和碳排放进行测量、分析和可视化,并通过与现有AI框架的无缝集成,提供标准化报告和细粒度时间序列数据,从而支持基准测试和可重复性研究,同时实现硬件指标与模型性能的深入相关性分析,促进瓶颈识别和性能优化。
链接: https://arxiv.org/abs/2506.20535
作者: Hongzhen Huang,Kunming Zhang,Hanlong Liao,Kui Wu,Guoming Tang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 7 figures and 5 tables
Abstract:The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents WattsOnAI, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, WattsOnAI offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, WattsOnAI encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable “Green AI” practices. The code is available at this https URL.
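下面给出一个基于 pynvml 的 GPU 功率采样与能耗/碳排放折算的最小示意(采样频率、时长与电网碳强度 400 gCO2/kWh 均为本文假设,非 WattsOnAI 的实际实现):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
t0 = time.time()
while time.time() - t0 < 5.0:          # 采样 5 秒
    mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # 单位:毫瓦
    samples.append((time.time(), mw / 1000.0))
    time.sleep(0.1)                    # 10 Hz 采样

avg_w = sum(p for _, p in samples) / len(samples)
energy_wh = avg_w * 5.0 / 3600.0
co2_g = energy_wh / 1000.0 * 400       # 400 gCO2/kWh 为假设的电网碳强度
print(f"平均功率 {avg_w:.1f} W,能耗 {energy_wh:.4f} Wh,约 {co2_g:.4f} g CO2")
pynvml.nvmlShutdown()
```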
zh
[AI-10] Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios
【速读】:该论文旨在解决自动驾驶系统在高风险、动态环境中的避让决策问题,特别是在需要快速、情境感知且具有可解释性的决策场景中。其关键解决方案是提出一种基于案例推理增强的大型语言模型(CBR-LLM)框架,该框架通过整合行车记录仪视频输入的语义场景理解与相关历史驾驶案例的检索,使语言模型能够生成符合情境且与人类行为一致的驾驶操作建议。
链接: https://arxiv.org/abs/2506.20531
作者: Wenbin Gan,Minh-Son Dao,Koji Zettsu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages, 10 figures, under-review conference
Abstract:Driving in safety-critical scenarios requires quick, context-aware decision-making grounded in both situational understanding and experiential reasoning. Large Language Models (LLMs), with their powerful general-purpose reasoning capabilities, offer a promising foundation for such decision-making. However, their direct application to autonomous driving remains limited due to challenges in domain adaptation, contextual grounding, and the lack of experiential knowledge needed to make reliable and interpretable decisions in dynamic, high-risk environments. To address this gap, this paper presents a Case-Based Reasoning Augmented Large Language Model (CBR-LLM) framework for evasive maneuver decision-making in complex risk scenarios. Our approach integrates semantic scene understanding from dashcam video inputs with the retrieval of relevant past driving cases, enabling LLMs to generate maneuver recommendations that are both context-sensitive and human-aligned. Experiments across multiple open-source LLMs show that our framework improves decision accuracy, justification quality, and alignment with human expert behavior. Risk-aware prompting strategies further enhance performance across diverse risk types, while similarity-based case retrieval consistently outperforms random sampling in guiding in-context learning. Case studies further demonstrate the framework’s robustness in challenging real-world conditions, underscoring its potential as an adaptive and trustworthy decision-support tool for intelligent driving systems.
zh
[AI-11] Industrial Energy Disaggregation with Digital Twin-generated Dataset and Efficient Data Augmentation
【速读】:该论文旨在解决工业非侵入式负载监测(Industrial Non-Intrusive Load Monitoring, NILM)中因高质量数据集稀缺和工业能耗模式复杂多变而导致的性能受限问题。其关键解决方案是提出了一种名为设备调制数据增强(Appliance-Modulated Data Augmentation, AMDA)的方法,该方法通过智能地根据设备功率贡献的相对影响进行缩放,以计算效率的方式增强NILM模型的泛化能力。实验结果表明,使用AMDA增强数据训练的模型在复杂工业设备的能量分解任务中表现显著优于未进行数据增强或使用随机数据增强的模型。
链接: https://arxiv.org/abs/2506.20525
作者: Christian Internò,Andrea Castellani,Sebastian Schmitt,Fabio Stella,Barbara Hammer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Industrial Non-Intrusive Load Monitoring (NILM) is limited by the scarcity of high-quality datasets and the complex variability of industrial energy consumption patterns. To address data scarcity and privacy issues, we introduce the Synthetic Industrial Dataset for Energy Disaggregation (SIDED), an open-source dataset generated using Digital Twin simulations. SIDED includes three types of industrial facilities across three different geographic locations, capturing diverse appliance behaviors, weather conditions, and load profiles. We also propose the Appliance-Modulated Data Augmentation (AMDA) method, a computationally efficient technique that enhances NILM model generalization by intelligently scaling appliance power contributions based on their relative impact. We show in experiments that NILM models trained with AMDA-augmented data significantly improve the disaggregation of energy consumption of complex industrial appliances like combined heat and power systems. Specifically, in our out-of-sample scenarios, models trained with AMDA achieved a Normalized Disaggregation Error of 0.093, outperforming models trained without data augmentation (0.451) and those trained with random data augmentation (0.290). Data distribution analyses confirm that AMDA effectively aligns training and test data distributions, enhancing model generalization.
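以下为 AMDA 核心思想的最小示意(缩放规则为本文演示用假设,非论文原始公式):按设备对总负荷的相对贡献自适应决定随机缩放幅度,再重新合成总功率曲线:

```python
import numpy as np

def amda_augment(appliance_signals, rng, max_scale=0.3):
    """设备调制数据增强(AMDA)思路的最小示意:
    按各设备对总负荷的相对贡献决定缩放幅度——
    贡献越大的设备,施加的随机缩放越温和,反之更激进。"""
    totals = {k: v.sum() for k, v in appliance_signals.items()}
    grand = sum(totals.values())
    augmented = {}
    for name, sig in appliance_signals.items():
        impact = totals[name] / grand            # 相对贡献 ∈ (0, 1)
        span = max_scale * (1 - impact)          # 贡献越大,扰动越小
        factor = 1.0 + rng.uniform(-span, span)
        augmented[name] = sig * factor
    aggregate = np.sum(list(augmented.values()), axis=0)  # 重新合成总负荷
    return augmented, aggregate

rng = np.random.default_rng(42)
signals = {"chp": rng.random(1000) * 50,      # 热电联产:大负荷
           "compressor": rng.random(1000) * 5,
           "lighting": rng.random(1000) * 1}
aug, agg = amda_augment(signals, rng)
print(agg.shape)
```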
zh
[AI-12] Engineering Sentience
【速读】:该论文试图解决如何为人工智能(Artificial Intelligence, AI)定义一种可操作的“意识”(sentience)概念,以指导其在机器中的设计与构建。解决方案的关键在于将意识概念转化为功能性和计算性的描述,使其具备足够的细节以支持实际实现,同时确保该概念不仅包含对感知内容的编码能力,还必须反映某种本质上的“主观性”。为此,作者提出特定的感官信号需要同时具备断言性(assertoric)和质性(qualitative)特征,以满足功能性意识的要求。
链接: https://arxiv.org/abs/2506.20504
作者: Konstantin Demin,Taylor Webb,Eric Elmoznino,Hakwan Lau
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:We spell out a definition of sentience that may be useful for designing and building it in machines. We propose that for sentience to be meaningful for AI, it must be fleshed out in functional, computational terms, in enough detail to allow for implementation. Yet, this notion of sentience must also reflect something essentially ‘subjective’, beyond just having the general capacity to encode perceptual content. For this specific functional notion of sentience to occur, we propose that certain sensory signals need to be both assertoric (persistent) and qualitative. To illustrate the definition in more concrete terms, we sketch out some ways for potential implementation, given current technology. Understanding what it takes for artificial agents to be functionally sentient can also help us avoid creating them inadvertently, or at least, realize that we have created them in a timely manner.
zh
[AI-13] Mixtures of Neural Cellular Automata: A Stochastic Framework for Growth Modelling and Self-Organization
【速读】:该论文试图解决神经细胞自动机(Neural Cellular Automata, NCA)在建模真实生物和物理系统时因确定性特性而无法捕捉随机性的局限性。解决方案的关键在于提出混合神经细胞自动机(Mixture of Neural Cellular Automata, MNCA),通过结合概率规则分配与内在噪声,实现对多样化局部行为的建模,并重现生物过程中观察到的随机动力学。
链接: https://arxiv.org/abs/2506.20486
作者: Salvatore Milite,Giulio Caravagna,Andrea Sottoriva
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Neural Cellular Automata (NCAs) are a promising new approach to model self-organizing processes, with potential applications in life science. However, their deterministic nature limits their ability to capture the stochasticity of real-world biological and physical systems. We propose the Mixture of Neural Cellular Automata (MNCA), a novel framework incorporating the idea of mixture models into the NCA paradigm. By combining probabilistic rule assignments with intrinsic noise, MNCAs can model diverse local behaviors and reproduce the stochastic dynamics observed in biological processes. We evaluate the effectiveness of MNCAs in three key domains: (1) synthetic simulations of tissue growth and differentiation, (2) image morphogenesis robustness, and (3) microscopy image segmentation. Results show that MNCAs achieve superior robustness to perturbations, better recapitulate real biological growth patterns, and provide interpretable rule segmentation. These findings position MNCAs as a promising tool for modeling stochastic dynamical systems and studying self-growth processes.
zh
[AI-14] Automatic Demonstration Selection for LLM-based Tabular Data Classification
【速读】:该论文试图解决在表格数据分类中应用上下文学习(In-Context Learning, ICL)时,如何确定提示中理想演示样本数量的问题。其解决方案的关键在于提出一种算法,该算法通过整合表格数据的分布、用户选择的提示模板以及特定的大语言模型(Large Language Model, LLM)来自动选择合理的演示数量。该方法基于谱图理论,定义了一个新的度量标准来量化不同演示之间的相似性,并通过构建相似性图及其拉普拉斯矩阵的特征值分析,得出能够代表数据的最小演示数量。
链接: https://arxiv.org/abs/2506.20451
作者: Shuchu Han,Wolfgang Bruckner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A fundamental question in applying In-Context Learning (ICL) for tabular data classification is how to determine the ideal number of demonstrations in the prompt. This work addresses this challenge by presenting an algorithm to automatically select a reasonable number of required demonstrations. Our method distinguishes itself by integrating not only the tabular data’s distribution but also the user’s selected prompt template and the specific Large Language Model (LLM) into its estimation. Rooted in Spectral Graph Theory, our proposed algorithm defines a novel metric to quantify the similarities between different demonstrations. We then construct a similarity graph and analyze the eigenvalues of its Laplacian to derive the minimum number of demonstrations capable of representing the data within the LLM’s intrinsic representation space. We validate the efficacy of our approach through experiments comparing its performance against conventional random selection algorithms on diverse datasets and LLMs.
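下面给出该谱图思路的最小示意(高斯核相似度与谱能量停止准则均为演示用假设,非论文定义的新度量):对演示样本的嵌入构建相似度图,由其拉普拉斯特征值估计所需的最小演示数:

```python
import numpy as np

def min_demonstrations(embeddings, gamma=1.0, gap_ratio=0.9):
    """谱图思路的最小示意:对演示样本构相似度图,
    分析其拉普拉斯特征值,用谱能量累积点估计代表性样本数。"""
    d = ((embeddings[:, None] - embeddings[None, :]) ** 2).sum(-1)
    W = np.exp(-gamma * d)                 # 高斯核相似度图
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(1))
    L = D - W                              # 非归一化拉普拉斯
    eig = np.sort(np.linalg.eigvalsh(L))
    cum = np.cumsum(eig) / eig.sum()
    # 取解释大部分谱能量所需的最小特征值个数作为演示数下界
    return int(np.searchsorted(cum, gap_ratio) + 1)

rng = np.random.default_rng(0)
demos = rng.normal(size=(40, 8))           # 40 条候选演示的嵌入
print(min_demonstrations(demos))
```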
zh
[AI-15] Off-Policy Evaluation and Learning for the Future under Non-Stationarity
【速读】:该论文试图解决非平稳环境中未来离策略评估(Future Off-Policy Evaluation, F-OPE)与学习(Future Off-Policy Learning, F-OPL)的问题,即在历史数据中无法获取未来环境信息的情况下,如何准确估计和优化策略的未来价值。现有方法通常假设环境是平稳的或依赖于严格的奖励建模假设,导致显著偏差。该论文提出了一种名为OPFV(Off-Policy Estimator for the Future Value)的新估计器,其关键在于利用时间序列数据中的结构信息,如季节性、周周期或节假日效应,这些特征在历史和未来数据中具有一致性,从而通过一种新型重要性加权方法有效实现未来离策略评估。
链接: https://arxiv.org/abs/2506.20417
作者: Tatsuhiro Shimizu,Kazuki Kawamura,Takanori Muroi,Yusuke Narita,Kei Tateno,Takuma Udagawa,Yuta Saito
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study the novel problem of future off-policy evaluation (F-OPE) and learning (F-OPL) for estimating and optimizing the future value of policies in non-stationary environments, where distributions vary over time. In e-commerce recommendations, for instance, our goal is often to estimate and optimize the policy value for the upcoming month using data collected by an old policy in the previous month. A critical challenge is that data related to the future environment is not observed in the historical data. Existing methods assume stationarity or depend on restrictive reward-modeling assumptions, leading to significant bias. To address these limitations, we propose a novel estimator named the Off-Policy Estimator for the Future Value (OPFV), designed for accurately estimating policy values at any future time point. The key feature of OPFV is its ability to leverage the useful structure within time-series data. While future data might not be present in the historical log, we can leverage, for example, seasonal, weekly, or holiday effects that are consistent in both the historical and future data. Our estimator is the first to exploit these time-related structures via a new type of importance weighting, enabling effective F-OPE. Theoretical analysis identifies the conditions under which OPFV becomes low-bias. In addition, we extend our estimator to develop a new policy-gradient method to proactively learn a good future policy using only historical data. Empirical results show that our methods substantially outperform existing methods in estimating and optimizing the future policy value under non-stationarity for various experimental setups.
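以下是 OPFV 式“时间特征重要性加权”的最小示意(以“星期几”作为时间特征、采用硬匹配核,均为本文假设):只用与目标未来时刻共享时间特征的历史样本做策略重要性加权平均:

```python
import numpy as np

def opfv_estimate(rewards, actions, times, pi_e, pi_0, future_dow):
    """OPFV 思路的最小示意:在普通重要性权重之外,
    只保留与目标未来时刻共享时间特征(此处为星期几)的样本。"""
    dow = times % 7                                  # 时间特征 φ(t)
    mask = (dow == future_dow).astype(float)         # 与未来同特征的样本
    iw = pi_e[np.arange(len(actions)), actions] / \
         pi_0[np.arange(len(actions)), actions]      # 策略重要性权重
    w = mask * iw
    return (w * rewards).sum() / max(mask.sum(), 1.0)

rng = np.random.default_rng(1)
n, k = 5000, 3
actions = rng.integers(0, k, n)
times = rng.integers(0, 28, n)                        # 四周的历史日志
pi_0 = np.full((n, k), 1 / k)                         # 旧策略:均匀随机
pi_e = np.tile([0.6, 0.3, 0.1], (n, 1))               # 待评估的新策略
rewards = (actions == 0) * 1.0 + rng.normal(0, 0.1, n)
print(opfv_estimate(rewards, actions, times, pi_e, pi_0, future_dow=5))
```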
zh
[AI-16] SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models
【速读】:该论文旨在解决复杂系统级芯片(SoC)设计安全性的保障问题,传统验证技术在自动化、可扩展性、全面性和适应性方面面临显著挑战。其解决方案的关键在于引入SV-LLM,一个基于多智能体协作的新型辅助系统,通过集成专门化的大型语言模型(LLM)代理,分别执行验证问答、安全资产识别、威胁建模、测试计划与属性生成、漏洞检测以及基于仿真的缺陷验证等任务,并利用不同的学习范式(如上下文学习、微调和检索增强生成)优化各代理性能,从而实现自动化和增强的SoC安全验证。
链接: https://arxiv.org/abs/2506.20415
作者: Dipayan Saha,Shams Tarek,Hasan Al Shaikh,Khan Thamid Hasan,Pavan Sai Nalluri,Md. Ajoad Hasan,Nashmin Alam,Jingbo Zhou,Sujan Kumar Saha,Mark Tehranipoor,Farimah Farahmandi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Ensuring the security of complex system-on-chips (SoCs) designs is a critical imperative, yet traditional verification techniques struggle to keep pace due to significant challenges in automation, scalability, comprehensiveness, and adaptability. The advent of large language models (LLMs), with their remarkable capabilities in natural language understanding, code generation, and advanced reasoning, presents a new paradigm for tackling these issues. Moving beyond monolithic models, an agentic approach allows for the creation of multi-agent systems where specialized LLMs collaborate to solve complex problems more effectively. Recognizing this opportunity, we introduce SV-LLM, a novel multi-agent assistant system designed to automate and enhance SoC security verification. By integrating specialized agents for tasks like verification question answering, security asset identification, threat modeling, test plan and property generation, vulnerability detection, and simulation-based bug validation, SV-LLM streamlines the workflow. To optimize their performance in these diverse tasks, agents leverage different learning paradigms, such as in-context learning, fine-tuning, and retrieval-augmented generation (RAG). The system aims to reduce manual intervention, improve accuracy, and accelerate security analysis, supporting proactive identification and mitigation of risks early in the design cycle. We demonstrate its potential to transform hardware security practices through illustrative case studies and experiments that showcase its applicability and efficacy.
zh
[AI-17] Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning
【速读】:该论文旨在解决在异构、资源受限的物联网(IoT)设备中实现高效且私密的个性化学习方法所面临的挑战,包括客户端间有效知识迁移、数据隐私保护以及对抗污染攻击的鲁棒性。其解决方案的关键在于提出P4(Personalized, Private, Peer-to-Peer)方法,该方法采用轻量级、完全去中心化的算法,以私密方式检测客户端相似性并形成协作群体,随后在群体内通过差分隐私的知识蒸馏进行联合训练,从而在保持高精度的同时增强对恶意客户端的鲁棒性。
链接: https://arxiv.org/abs/2506.20413
作者: Mohammad Mahdi Maheri,Denys Herasymuk,Hamed Haddadi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The growing adoption of Artificial Intelligence (AI) in Internet of Things (IoT) ecosystems has intensified the need for personalized learning methods that can operate efficiently and privately across heterogeneous, resource-constrained devices. However, enabling effective personalized learning in decentralized settings introduces several challenges, including efficient knowledge transfer between clients, protection of data privacy, and resilience against poisoning attacks. In this paper, we address these challenges by developing P4 (Personalized, Private, Peer-to-Peer) – a method designed to deliver personalized models for resource-constrained IoT devices while ensuring differential privacy and robustness against poisoning attacks. Our solution employs a lightweight, fully decentralized algorithm to privately detect client similarity and form collaborative groups. Within each group, clients leverage differentially private knowledge distillation to co-train their models, maintaining high accuracy while ensuring robustness to the presence of malicious clients. We evaluate P4 on popular benchmark datasets using both linear and CNN-based architectures across various heterogeneity settings and attack scenarios. Experimental results show that P4 achieves 5% to 30% higher accuracy than leading differentially private peer-to-peer approaches and maintains robustness with up to 30% malicious clients. Additionally, we demonstrate its practicality by deploying it on resource-constrained devices, where collaborative training between two clients adds only ~7 seconds of overhead.
zh
[AI-18] GymPN: A Library for Decision-Making in Process Management Systems
【速读】:该论文旨在解决业务流程中任务分配与调度的决策优化问题,具体包括确定下一步执行的任务、执行时间以及任务分配对象。解决方案的关键在于提出一个名为GymPN的软件库,该库基于深度强化学习实现业务流程中的最优决策。GymPN通过支持部分流程可观测性和多决策建模,克服了以往方法在现实流程决策表示上的局限性,从而提升了决策的准确性和灵活性。
链接: https://arxiv.org/abs/2506.20404
作者: Riccardo Lo Bianco,Willem van Jaarsveld,Remco Dijkman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Process management systems support key decisions about the way work is allocated in organizations. This includes decisions on which task to perform next, when to execute the task, and who to assign the task to. Suitable software tools are required to support these decisions in a way that is optimal for the organization. This paper presents a software library, called GymPN, that supports optimal decision-making in business processes using Deep Reinforcement Learning. GymPN builds on previous work that supports task assignment in business processes, introducing two key novelties: support for partial process observability and the ability to model multiple decisions in a business process. These novel elements address fundamental limitations of previous work and thus enable the representation of more realistic process decisions. We evaluate the library on eight typical business process decision-making problem patterns, showing that GymPN allows for easy modeling of the desired problems, as well as learning optimal decision policies.
zh
[AI-19] Smart Ride and Delivery Services with Electric Vehicles: Leveraging Bidirectional Charging for Profit Optimisation
【速读】:该论文试图解决电动车辆(Electric Vehicles, EVs)在服务系统中集成时面临的路径规划与充电/放电调度问题,即电动车辆路径规划问题与车网互动(Vehicle-to-Grid, V2G)相结合的利润最大化问题(Electric Vehicle Orienteering Problem with V2G, EVOP-V2G)。解决方案的关键在于通过混合整数规划(Mixed Integer Programming, MIP)模型对问题进行建模,并采用两种近优元启发式算法:一种是基于进化算法(Evolutionary Algorithm, EA)的方法,另一种是基于大邻域搜索(Large Neighborhood Search, LNS)的方法,以有效处理动态电价、充电站选择和路径约束等复杂因素。
链接: https://arxiv.org/abs/2506.20401
作者: Jinchun Du,Bojie Shen,Muhammad Aamir Cheema,Adel N. Toosi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the rising popularity of electric vehicles (EVs), modern service systems, such as ride-hailing delivery services, are increasingly integrating EVs into their operations. Unlike conventional vehicles, EVs often have a shorter driving range, necessitating careful consideration of charging when fulfilling requests. With recent advances in Vehicle-to-Grid (V2G) technology - allowing EVs to also discharge energy back to the grid - new opportunities and complexities emerge. We introduce the Electric Vehicle Orienteering Problem with V2G (EVOP-V2G): a profit-maximization problem where EV drivers must select customer requests or orders while managing when and where to charge or discharge. This involves navigating dynamic electricity prices, charging station selection, and route constraints. We formulate the problem as a Mixed Integer Programming (MIP) model and propose two near-optimal metaheuristic algorithms: one evolutionary (EA) and the other based on large neighborhood search (LNS). Experiments on real-world data show our methods can double driver profits compared to baselines, while maintaining near-optimal performance on small instances and excellent scalability on larger ones. Our work highlights a promising path toward smarter, more profitable EV-based mobility systems that actively support the energy grid.
zh
[AI-20] Paladin-mini: A Compact and Efficient Grounding Model Excelling in Real-World Scenarios
【速读】:该论文旨在解决在给定上下文中验证声明(claim)是否具有支持性证据的问题,即“接地”(grounding)问题。其关键解决方案是提出Paladin-mini,一个参数量为3.8B的轻量级开源分类模型(classifier model),用于标注数据是否为接地或非接地,并设计了grounding-benchmark这一新的评估数据集,以衡量在关键推理任务上的性能。
链接: https://arxiv.org/abs/2506.20384
作者: Dror Ivry,Oran Nahum
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures
Abstract:This paper introduces two significant contributions to address the issue of grounding claims in a given context. Grounding means that given a context (document) and a claim, there is at least one piece of supporting evidence for the claim in the document. We introduce Paladin-mini, a compact (3.8B parameters) open-source classifier model (used for labeling data as grounded or ungrounded) engineered for robust performance in real-world scenarios, and the grounding-benchmark, a new evaluation dataset designed to assess performance on critical reasoning tasks. We also demonstrate the results of Paladin-mini with benchmarks against the current state of the art and share clear and reproducible results.
zh
[AI-21] CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition
【速读】:该论文试图解决人机群体交互中的情境定位问题,即如何在动态协作环境中实现对参与者、物体及其交互的准确识别与跟踪。解决方案的关键在于CARMA系统通过唯一标识现实世界中实体的物理实例,并将其组织为基于角色的行动者-物体-动作三元组,从而提供一致且结构化的场景表示,确保机器人能够可靠地进行时空推理和情境决策。
链接: https://arxiv.org/abs/2506.20373
作者: Joerg Deigmoeller,Stephan Hasler,Nakul Agarwal,Daniel Tanneberg,Anna Belardinelli,Reza Ghoddoosian,Chao Wang,Felix Ocker,Fan Zhang,Behzad Dariush,Michael Gienger
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce CARMA, a system for situational grounding in human-robot group interactions. Effective collaboration in such group settings requires situational awareness based on a consistent representation of present persons and objects coupled with an episodic abstraction of events regarding actors and manipulated objects. This calls for a clear and consistent assignment of instances, ensuring that robots correctly recognize and track actors, objects, and their interactions over time. To achieve this, CARMA uniquely identifies physical instances of such entities in the real world and organizes them into grounded triplets of actors, objects, and actions. To validate our approach, we conducted three experiments, where multiple humans and a robot interact: collaborative pouring, handovers, and sorting. These scenarios allow the assessment of the system’s capabilities as to role distinction, multi-actor awareness, and consistent instance identification. Our experiments demonstrate that the system can reliably generate accurate actor-action-object triplets, providing a structured and robust foundation for applications requiring spatiotemporal reasoning and situated decision-making in collaborative settings.
zh
[AI-22] Self-Supervised Graph Learning via Spectral Bootstrapping and Laplacian-Based Augmentations
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在自监督学习中依赖负采样、对比目标或手工设计增强方法的问题。其解决方案的关键在于引入基于拉普拉斯(Laplacian)的信号,并通过谱增强技术实现结构信息的丰富监督,同时采用对抗性自举训练机制以提升特征学习和模型鲁棒性,从而在不依赖传统对比学习策略的情况下实现高效且表达能力强的图表示学习。
链接: https://arxiv.org/abs/2506.20362
作者: Lorenzo Bini,Stephane Marchand-Maillet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: LaplaceGNN is a novel graph learning framework that employs a bootstrapped teacher-student architecture. Its precomputed spectral augmentations and adversarial training enable robust performance, outperforming SOTA methods while scaling linearly
Abstract:We present LaplaceGNN, a novel self-supervised graph learning framework that bypasses the need for negative sampling by leveraging spectral bootstrapping techniques. Our method integrates Laplacian-based signals into the learning process, allowing the model to effectively capture rich structural representations without relying on contrastive objectives or handcrafted augmentations. By focusing on positive alignment, LaplaceGNN achieves linear scaling while offering a simpler, more efficient, self-supervised alternative for graph neural networks, applicable across diverse domains. Our contributions are twofold: we precompute spectral augmentations through max-min centrality-guided optimization, enabling rich structural supervision without relying on handcrafted augmentations, then we integrate an adversarial bootstrapped training scheme that further strengthens feature learning and robustness. Our extensive experiments on different benchmark datasets show that LaplaceGNN achieves superior performance compared to state-of-the-art self-supervised graph methods, offering a promising direction for efficiently learning expressive graph representations.
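下面给出“拉普拉斯谱增强”方向的最小示意(仅对归一化拉普拉斯的低频谱做随机扰动,为演示用假设,非论文的 max-min 中心性引导优化):扰动后重构的图保留全局结构,可作为自监督学习的增强视图:

```python
import numpy as np

def spectral_augmentation(adj, k=4, strength=0.1, rng=None):
    """扰动归一化拉普拉斯的低频特征值后重构图权重,
    得到保留全局结构的增强视图(扰动方式为演示用假设)。"""
    rng = rng or np.random.default_rng()
    deg = adj.sum(1)
    d_inv = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(adj)) - d_inv @ adj @ d_inv     # 归一化拉普拉斯
    vals, vecs = np.linalg.eigh(L)
    vals_aug = vals.copy()
    vals_aug[:k] = np.clip(vals[:k] + strength * rng.normal(size=k), 0.0, 2.0)
    L_aug = vecs @ np.diag(vals_aug) @ vecs.T      # 只扰动低频谱后重构
    A_aug = np.clip(np.eye(len(adj)) - L_aug, 0.0, None)  # 回到相似度视图
    return A_aug

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
print(spectral_augmentation(A).round(2))
```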
zh
[AI-23] Tabular Feature Discovery With Reasoning Type Exploration
【速读】:该论文试图解决表格数据特征工程中生成的特征过于简单或重复的问题,这主要是由于大语言模型(Large Language Models, LLMs)在转换过程中存在固有偏差以及生成过程中缺乏结构化推理指导所致。论文提出的解决方案关键在于REFeat方法,该方法通过利用多种推理类型引导特征生成过程,从而发现多样且具有信息量的特征。
链接: https://arxiv.org/abs/2506.20357
作者: Sungwon Han,Sungkyu Park,Seungeon Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Feature engineering for tabular data remains a critical yet challenging step in machine learning. Recently, large language models (LLMs) have been used to automatically generate new features by leveraging their vast knowledge. However, existing LLM-based approaches often produce overly simple or repetitive features, partly due to inherent biases in the transformations the LLM chooses and the lack of structured reasoning guidance during generation. In this paper, we propose a novel method REFeat, which guides an LLM to discover diverse and informative features by leveraging multiple types of reasoning to steer the feature generation process. Experiments on 59 benchmark datasets demonstrate that our approach not only achieves higher predictive accuracy on average, but also discovers more diverse and meaningful features. These results highlight the promise of incorporating rich reasoning paradigms and adaptive strategy selection into LLM-driven feature discovery for tabular data.
zh
[AI-24] A foundation model with multi-variate parallel attention to generate neuronal activity
【速读】:该论文旨在解决多变量时间序列中异构通道配置带来的建模挑战,特别是在临床领域如颅内脑电图(iEEG)中,由于不同受试者的通道设置差异较大,导致深度神经网络难以有效学习。解决方案的关键在于提出一种新型的自注意力机制——多变量并行注意力(MVPA),该机制能够解耦内容、时间和空间注意力,从而实现对具有不同通道数量和配置的时间序列数据的灵活、泛化和高效建模。
链接: https://arxiv.org/abs/2506.20354
作者: Francesco Carzaniga,Michael Hersche,Abu Sebastian,Kaspar Schindler,Abbas Rahimi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The code is available at this https URL . The SWEC iEEG dataset is available at this https URL
Abstract:Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks (DNNs), particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the SWEC iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in seizure detection and outperforming state-of-the-art Transformer baselines on our SWEC, the MAYO, and the FNUSA datasets. We further validate MVPA on standard time-series forecasting and classification tasks, where it matches or exceeds existing attention-based models. Together, our contributions establish MVPA as a general-purpose attention mechanism for heterogeneous time-series and MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with state-of-the-art clinical performance. The code is available at this https URL. The SWEC iEEG dataset is available at this https URL.
zh
[AI-25] DipSVD: Dual-importance Protected SVD for Efficient LLM Compression
【速读】:该论文试图解决传统基于奇异值分解(SVD)的压缩方法在模型压缩过程中忽视矩阵中关键组件保护的问题,导致压缩后模型性能下降。其解决方案的关键在于提出一种双层次重要性保护机制:局部重要性保护通过通道加权数据白化保留每个权重矩阵中的关键奇异向量;全局重要性保护则通过启发式或优化方法让不重要的层承担更多的压缩负担,从而最小化对关键层的影响。
链接: https://arxiv.org/abs/2506.20353
作者: Xuan Ding,Rui Sun,Yunjian Zhang,Xiu Yan,Yueqi Zhou,Kaihao Huang,Suzhong Fu,Chuanlong Xie,Yao Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The ever-increasing computational demands and deployment costs of large language models (LLMs) have spurred numerous compressing methods. Compared to quantization and unstructured pruning, SVD compression offers superior hardware compatibility and theoretical guarantees. However, existing SVD-based methods focus on the overall discrepancy between the original and compressed matrices while overlooking the protection of critical components within the matrix, which leads to inferior performance in the compressed models. This paper proposes a dual-level importance protection mechanism to enhance SVD-based compression methods: (1) local importance protection: preserving the most critical singular vectors within each weight matrix through channel-weighted data whitening; and (2) global importance protection: enabling less important layers to bear a greater portion of the compression burden through either a heuristic or optimization-based approach, thereby minimizing the impact of compression on critical layers. Extensive experiments demonstrate that DipSVD outperforms existing SVD-based compression approaches across multiple benchmarks, achieving superior model performance especially at high model compression ratios.
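以下为“通道加权数据白化 + 截断 SVD”压缩流程的简化示意(白化方式与秩选择为本文假设,非论文完整的双层次保护机制):用校准激活的通道尺度变换权重,使 SVD 在“数据感知”的空间中保留对输出影响最大的奇异方向:

```python
import numpy as np

def whitened_truncated_svd(W, X, rank):
    """通道加权白化 + 截断 SVD 的最小示意(流程为简化假设)。
    W: (out, in) 层权重;X: (n, in) 校准激活。"""
    scale = np.sqrt((X ** 2).mean(0)) + 1e-6   # 每个输入通道的激活尺度
    S = np.diag(scale)
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    Uk, sk, Vtk = U[:, :rank], s[:rank], Vt[:rank]
    # 低秩因子:W ≈ (Uk·diag(sk)) · (Vtk·S^{-1}),两小矩阵可替代原层
    A = Uk * sk
    B = Vtk @ np.diag(1.0 / scale)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))       # 某一层权重
X = rng.normal(size=(1024, 512)) * np.linspace(0.1, 3, 512)  # 校准激活
A, B = whitened_truncated_svd(W, X, rank=64)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"相对重构误差: {err:.3f}")
```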
zh
[AI-26] Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards
【速读】:该论文试图解决基于视觉-语言模型的移动代理在动态环境中交互能力不足的问题,特别是由于现有研究主要依赖离线强化学习或基于动作级奖励的在线优化,导致代理容易陷入局部最优,削弱了其探索能力和错误动作修正能力。解决方案的关键在于提出一种称为Mobile-R1的方法,该方法采用基于任务级奖励的多轮交互强化学习,通过三个阶段的训练框架——初始格式微调、基于动作级奖励的单步在线训练以及基于多轮轨迹的任务级奖励在线训练,以提升代理的探索与错误修正能力。
链接: https://arxiv.org/abs/2506.20332
作者: Jihao Gu,Qihang Ai,Yingyao Wang,Pi Bu,Jingxuan Xing,Zekun Zhu,Wei Jiang,Ziming Wang,Yingxiu Zhao,Ming-Liang Zhang,Jun Song,Yuning Jiang,Bo Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures
Abstract:Vision-language model-based mobile agents have gained the ability to not only understand complex instructions and mobile screenshots, but also optimize their action outputs via thinking and reasoning, benefiting from reinforcement learning, such as Group Relative Policy Optimization (GRPO). However, existing research centers on offline reinforcement learning training or online optimization using action-level rewards, which limits the agent’s dynamic interaction with the environment. This often results in agents settling into local optima, thereby weakening their ability for exploration and error action correction. To address these challenges, we introduce an approach called Mobile-R1, which employs interactive multi-turn reinforcement learning with task-level rewards for mobile agents. Our training framework consists of three stages: initial format finetuning, single-step online training via action-level reward, followed by online training via task-level reward based on multi-turn trajectories. This strategy is designed to enhance the exploration and error correction capabilities of Mobile-R1, leading to significant performance improvements. Moreover, we have collected a dataset covering 28 Chinese applications with 24,521 high-quality manual annotations and established a new benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weight, and codes: this https URL.
zh
[AI-27] Comparative Analysis of Deep Learning Models for Crop Disease Detection: A Transfer Learning Approach
【速读】:该论文试图解决农村地区资源有限的环境下作物病害检测效率低的问题,其解决方案的关键在于利用深度学习模型进行迁移学习,通过比较不同模型(如EfficientNet、ResNet101、MobileNetV2及自定义的CNN)的性能,最终采用验证准确率达95.76%的自定义卷积神经网络(CNN)实现对植物病害的有效分类,从而提升作物健康管理并支持可持续农业发展。
链接: https://arxiv.org/abs/2506.20323
作者: Saundarya Subramaniam,Shalini Majumdar,Shantanu Nadar,Kaustubh Kulkarni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This research presents the development of an Artificial Intelligence (AI)-driven crop disease detection system designed to assist farmers in rural areas with limited resources. We compare different deep learning models, focusing on their efficacy in transfer learning. By leveraging deep learning models, including EfficientNet, ResNet101, MobileNetV2, and our custom CNN, which achieved a validation accuracy of 95.76%, the system effectively classifies plant diseases. This research demonstrates the potential of transfer learning in reshaping agricultural practices, improving crop health management, and supporting sustainable farming in rural environments.
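下面给出论文所用迁移学习范式的最小示意(以 torchvision 的 MobileNetV2 为例;类别数 38 参照常见的 PlantVillage 设定,为本文假设):冻结 ImageNet 预训练主干,仅训练新的分类头:

```python
import torch
import torch.nn as nn
from torchvision import models

# 迁移学习示意:冻结 MobileNetV2 主干,只训练新分类头
num_classes = 38  # 类别数为本文假设
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False
model.classifier[1] = nn.Linear(model.last_channel, num_classes)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 224, 224)        # 一个批次的叶片图像
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```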
zh
[AI-28] Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration
【速读】:该论文旨在解决模仿学习(Imitation Learning)中从有限的专家示范中准确学习专家策略的难题,以及如何通过环境探索实现超越专家性能的问题。其解决方案的关键在于提出一种名为ILDE(Imitation Learning with Double Exploration)的新算法,该算法通过两个方面的探索机制来克服上述挑战:一是通过探索奖励机制进行乐观策略优化,以高不确定性的状态-动作对为奖励,从而提高收敛到专家策略的效率;二是通过好奇心驱动的探索机制,对偏离示范轨迹的状态进行探索,以期获得超越专家的表现。
链接: https://arxiv.org/abs/2506.20307
作者: Heyang Zhao,Xingrui Yu,David M. Bossens,Ivor W. Tsang,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Imitation learning is a central problem in reinforcement learning where the goal is to learn a policy that mimics the expert’s behavior. In practice, it is often challenging to learn the expert policy from a limited number of demonstrations accurately due to the complexity of the state space. Moreover, it is essential to explore the environment and collect data to achieve beyond-expert performance. To overcome these challenges, we propose a novel imitation learning algorithm called Imitation Learning with Double Exploration (ILDE), which implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty to potentially improve the convergence to the expert policy, and (2) curiosity-driven exploration of the states that deviate from the demonstration trajectories to potentially yield beyond-expert performance. Empirically, we demonstrate that ILDE outperforms the state-of-the-art imitation learning algorithms in terms of sample efficiency and achieves beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations than in previous work. We also provide a theoretical justification of ILDE as an uncertainty-regularized policy optimization method with optimistic exploration, leading to a regret growing sublinearly in the number of episodes.
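以下为 ILDE“双重探索”奖励塑形的最小示意(计数型不确定性奖励与好奇心系数均为演示用假设,非论文的完整算法):在模仿奖励之外,叠加乐观探索奖励与偏离演示轨迹的好奇心项:

```python
import numpy as np

def ilde_reward(r_imitation, visit_counts, s, a, off_demo_novelty,
                beta_bonus=0.5, beta_curiosity=0.1):
    """ILDE 双重探索的奖励塑形示意(系数为演示用假设):
    (1) 乐观探索奖励:访问次数越少的 (s, a) 不确定性越高,奖励越大;
    (2) 好奇心项:偏离演示轨迹的状态按新颖度额外加分。"""
    bonus = beta_bonus / np.sqrt(visit_counts[s, a] + 1)   # 计数型不确定性
    curiosity = beta_curiosity * off_demo_novelty           # 如预测误差
    return r_imitation + bonus + curiosity

counts = np.zeros((10, 4))        # 表格型环境的 (状态, 动作) 访问计数
counts[3, 1] = 50                 # 演示中频繁出现的动作
print(ilde_reward(1.0, counts, s=3, a=1, off_demo_novelty=0.0))  # 熟悉区域
print(ilde_reward(0.0, counts, s=7, a=2, off_demo_novelty=0.8))  # 新区域
```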
zh
[AI-29] Enterprise Large Language Model Evaluation Benchmark
【速读】:该论文试图解决现有基准(如MMLU)在评估企业特定任务复杂性方面的不足,从而无法全面反映大型语言模型(Large Language Models, LLMs)在企业场景中的实际能力。其解决方案的关键在于提出一个基于布鲁姆分类法的14任务框架,并构建了一个结合LLM-as-a-Labeler、LLM-as-a-Judge和修正检索增强生成(CRAG)的可扩展流水线,以应对数据噪声和标注成本高的问题,最终形成了一个包含9,700个样本的稳健基准。
链接: https://arxiv.org/abs/2506.20274
作者: Liya Wang,David Yi,Damien Jose,John Passarelli,James Gao,Jordan Leventis,Kang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to MLNLP 2025 at this https URL
Abstract:Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom’s Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-a-Judge, and corrective retrieval-augmented generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model optimization. This work provides enterprises a blueprint for tailored evaluations and advances practical LLM deployment.
zh
[AI-30] Argumentative Ensembling for Robust Recourse under Model Multiplicity
【速读】:该论文试图解决在模型多重性(Model Multiplicity, MM)背景下提供鲁棒的反事实解释(Counterfactual Explanations, CEs)的问题,即当多个性能相当的模型对同一输入产生不同预测时,传统的集成方法可能无法保证CE的普遍有效性。解决方案的关键在于提出一种名为“可追溯集成”(Recourse-Aware Ensembling, RAE)的方法,该方法通过将每个模型的CE与其预测结果共同考虑,实现预测与可追溯性的同时决策。其核心思想是利用计算论证技术显式表示模型与CE之间的冲突,并通过论证语义解决这些冲突,从而确保CE在MM下的鲁棒性。
链接: https://arxiv.org/abs/2506.20260
作者: Junqi Jiang,Antonio Rago,Francesco Leofante,Francesca Toni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:In machine learning, it is common to obtain multiple equally performing models for the same prediction task, e.g., when training neural networks with different random seeds. Model multiplicity (MM) is the situation which arises when these competing models differ in their predictions for the same input, for which ensembling is often employed to determine an aggregation of the outputs. Providing recourse recommendations via counterfactual explanations (CEs) under MM thus becomes complex, since the CE may not be valid across all models, i.e., the CEs are not robust under MM. In this work, we formalise the problem of providing recourse under MM, which we name recourse-aware ensembling (RAE). We propose the idea that under MM, CEs for each individual model should be considered alongside their predictions so that the aggregated prediction and recourse are decided in tandem. Centred around this intuition, we introduce six desirable properties for solutions to this problem. For solving RAE, we propose a novel argumentative ensembling method which guarantees the robustness of CEs under MM. Specifically, our method leverages computational argumentation to explicitly represent the conflicts between models and counterfactuals regarding prediction results and CE validity. It then uses argumentation semantics to resolve the conflicts and obtain the final solution, in a manner which is parametric to the chosen semantics. Our method also allows for the specification of preferences over the models under MM, allowing further customisation of the ensemble. In a comprehensive theoretical analysis, we characterise the behaviour of argumentative ensembling with four different argumentation semantics. We then empirically demonstrate the effectiveness of our approach in satisfying desirable properties with eight instantiations of our method. (Abstract is shortened for arXiv.)
zh
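The conflict-resolution step relies on standard abstract argumentation semantics. As a self-contained illustration, the sketch below computes the grounded extension of an attack graph; how models and counterfactuals are encoded as arguments and attacks here is an assumption for illustration, not the paper's exact construction.

```python
def grounded_extension(arguments, attacks):
    """Compute the grounded extension of an abstract argumentation framework.

    arguments: iterable of argument ids (e.g., models and counterfactuals).
    attacks: set of (attacker, target) pairs encoding conflicts such as
        "model m2's prediction invalidates counterfactual ce1".
    """
    arguments, attacks = set(arguments), set(attacks)
    accepted, defeated = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments - accepted - defeated:
            attackers = {x for (x, y) in attacks if y == a}
            if attackers <= defeated:            # every attacker already defeated
                accepted.add(a)
                changed = True
        for a in arguments - accepted - defeated:
            if any(x in accepted for (x, y) in attacks if y == a):
                defeated.add(a)                  # attacked by an accepted argument
                changed = True
    return accepted

# toy example: model m2 attacks the validity of counterfactual ce1
print(grounded_extension({"m1", "m2", "ce1"}, {("m2", "ce1")}))  # {'m1', 'm2'}
```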
[AI-31] Generating and Customizing Robotic Arm Trajectories using Neural Networks
【速读】:该论文试图解决机器人手臂轨迹生成与定制的问题,旨在实现高精度和可重复性的运动控制。解决方案的关键在于采用一种神经网络方法,该方法通过计算机器人手臂的正向运动学,并结合关节角度生成器,训练出能够生成精确轨迹的模型。此方法使得机器人能够在与人类交互过程中表现出更高的可预测性与动作质量。
链接: https://arxiv.org/abs/2506.20259
作者: Andrej Lúčny,Matilde Antonj,Carlo Mazzola,Hana Hornáčková,Igor Farkaš
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: The code is released at this https URL
Abstract:We introduce a neural network approach for generating and customizing the trajectory of a robotic arm, that guarantees precision and repeatability. To highlight the potential of this novel method, we describe the design and implementation of the technique and show its application in an experimental setting of cognitive robotics. In this scenario, the NICO robot was characterized by the ability to point to specific points in space with precise linear movements, increasing the predictability of the robotic action during its interaction with humans. To achieve this goal, the neural network computes the forward kinematics of the robot arm. By integrating it with a generator of joint angles, another neural network was developed and trained on an artificial dataset created from suitable start and end poses of the robotic arm. Through the computation of angular velocities, the robot was characterized by its ability to perform the movement, and the quality of its action was evaluated in terms of shape and accuracy. Thanks to its broad applicability, our approach successfully generates precise trajectories that could be customized in their shape and adapted to different settings.
zh
[AI-32] Time-series surrogates from energy consumers generated by machine learning approaches for long-term forecasting scenarios
【速读】:该论文旨在解决个体电力消耗长期预测中数据不足的问题,特别是针对传统方法在模拟个体能耗时间序列的时序动态、长程依赖和概率转移方面的局限性。其解决方案的关键在于采用多种先进的数据驱动方法,包括混合 Wasserstein 生成对抗网络(WGAN)、去噪扩散概率模型(DDPM)、隐马尔可夫模型(HMM)和掩码自回归伯恩斯坦多项式归一化流(MABF),对合成时间序列数据进行深入比较评估,以生成高保真度的电力消耗数据,从而提升状态估计和其他能源相关任务的准确性与可靠性。
链接: https://arxiv.org/abs/2506.20253
作者: Ben Gerhards,Nikita Popkov,Annekatrin König,Marcel Arpogaus,Bastian Schäfermeier,Leonie Riedl,Stephan Vogt,Philip Hehlert
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Forecasting attracts a lot of research attention in the electricity value chain. However, most studies concentrate on short-term forecasting of generation or consumption with a focus on systems and less on individual consumers. Even more neglected is the topic of long-term forecasting of individual power consumption. Here, we provide an in-depth comparative evaluation of data-driven methods for generating synthetic time series data tailored to energy consumption long-term forecasting. High-fidelity synthetic data is crucial for a wide range of applications, including state estimations in energy systems or power grid planning. In this study, we assess and compare the performance of multiple state-of-the-art but less common techniques: a hybrid Wasserstein Generative Adversarial Network (WGAN), Denoising Diffusion Probabilistic Model (DDPM), Hidden Markov Model (HMM), and Masked Autoregressive Bernstein polynomial normalizing Flows (MABF). We analyze the ability of each method to replicate the temporal dynamics, long-range dependencies, and probabilistic transitions characteristic of individual energy consumption profiles. Our comparative evaluation highlights the strengths and limitations of WGAN, DDPM, HMM, and MABF, aiding in selecting the most suitable approach for state estimations and other energy-related tasks. Our generation and analysis framework aims to enhance the accuracy and reliability of synthetic power consumption data while generating data that fulfils criteria like anonymisation, preserving privacy and mitigating the risk of profiling individual customers. This study utilizes an open-source dataset from households in Germany with 15-min time resolution. The generated synthetic power profiles can readily be used in applications like state estimations or consumption forecasting.
zh
[AI-33] Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models ICML2025
【速读】:该论文旨在解决量化大型语言模型(Large Language Models, LLMs)可能带来的安全能力下降问题,特别是在无校准数据集的量化方法下,量化过程可能对模型的安全性产生负面影响。为应对这一问题,论文提出了一种量化感知的安全修补框架Q-resafe,其关键在于在最小化对模型实用性的负面影响的同时,有效恢复量化后LLMs的安全能力。
链接: https://arxiv.org/abs/2506.20251
作者: Kejia Chen,Jiawen Zhang,Jiacong Hu,Yu Wang,Jian Lou,Zunlei Feng,Mingli Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025
Abstract:Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experimental results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page is available at: this https URL.
zh
[AI-34] Directed Link Prediction using GNN with Local and Global Feature Fusion
【速读】:该论文旨在解决有向图中的链接预测(link prediction)问题,这是一个在图分析中经典的难题,具有广泛的实际应用。其解决方案的关键在于提出一种融合特征嵌入与社区信息的新型图神经网络(GNN)框架,并通过将输入图转换为有向线图(directed line graph)以更有效地利用这些混合特征进行图卷积操作,从而提升有向链接预测的性能。
链接: https://arxiv.org/abs/2506.20235
作者: Yuyang Zhang,Xu Shen,Yu Xie,Ka-Chun Wong,Weidun Xie,Chengbin Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Link prediction is a classical problem in graph analysis with many practical applications. For directed graphs, recently developed deep learning approaches typically analyze node similarities through contrastive learning and aggregate neighborhood information through graph convolutions. In this work, we propose a novel graph neural network (GNN) framework to fuse feature embedding with community information. We theoretically demonstrate that such hybrid features can improve the performance of directed link prediction. To utilize such features efficiently, we also propose an approach to transform input graphs into directed line graphs so that nodes in the transformed graph can aggregate more information during graph convolutions. Experiments on benchmark datasets show that our approach outperforms the state-of-the-art in most cases when 30%, 40%, 50%, and 60% of the connected links are used as training data, respectively.
zh
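The line-graph transformation the authors build on is a standard graph operation, and networkx implements it for directed graphs. A minimal example is below; the toy graph is arbitrary.

```python
import networkx as nx

# In the directed line graph, each node corresponds to an edge (u, v) of the
# original graph, and there is an arc (u, v) -> (v, w) whenever two edges share
# the middle node v, so graph convolutions aggregate richer directional context.
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (1, 3)])
L = nx.line_graph(G)
print(list(L.nodes()))  # edges of G: (0, 1), (1, 2), (2, 0), (1, 3)
print(list(L.edges()))  # e.g. ((0, 1), (1, 2)), ((0, 1), (1, 3)), ...
```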
[AI-35] Affective Priming Score: A Data-Driven Method to Detect Priming in Sequential Datasets
【速读】:该论文试图解决情感计算中由于情感启动效应(affective priming)导致的数据歧义问题,该效应会影响生理信号数据的准确性,进而引发学习模型中的误分类。解决方案的关键在于提出一种数据驱动的方法——情感启动评分(Affective Priming Score, APS),通过为每个数据点分配评分来量化其受启动效应影响的程度,从而识别并去除受启动效应干扰的数据点,提升模型的鲁棒性。
链接: https://arxiv.org/abs/2506.20204
作者: Eduardo Gutierrez Maestro,Hadi Banaee,Amy Loutfi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Affective priming exemplifies the challenge of ambiguity in affective computing. While the community has largely addressed this issue from a label-based perspective, identifying data points in the sequence affected by the priming effect, the impact of priming on data itself, particularly in physiological signals, remains underexplored. Data affected by priming can lead to misclassifications when used in learning models. This study proposes the Affective Priming Score (APS), a data-driven method to detect data points influenced by the priming effect. The APS assigns a score to each data point, quantifying the extent to which it is affected by priming. To validate this method, we apply it to the SEED and SEED-VII datasets, which contain sufficient transitions between emotional events to exhibit priming effects. We train models with the same configuration using both the original data and priming-free sequences. The misclassification rate is significantly reduced when using priming-free sequences compared to the original data. This work contributes to the broader challenge of ambiguity by identifying and mitigating priming effects at the data level, enhancing model robustness, and offering valuable insights for the design and collection of affective computing datasets.
zh
[AI-36] Zero-Shot Attribution for Large Language Models: A Distribution Testing Approach
【速读】:该论文试图解决如何将由大型语言模型(Large Language Models, LLMs)生成的代码进行归属的问题,即判断给定的代码样本集是否来源于特定的LLM。其解决方案的关键在于引入一种名为Anubis的零样本归属工具,该工具将归属问题建模为分布检验问题,并利用代码样本及其来自LLM的密度估计值,从而克服了仅依赖样本时因维度灾难导致的不可行性。实验表明,Anubis在仅使用约2000个样本的情况下,能够有效区分如DeepSeek-Coder、CodeGemma和Stable-Code等LLMs,取得了高于0.9的AUROC分数。
链接: https://arxiv.org/abs/2506.20197
作者: Clément L. Canonne,Yash Pote,Uddalok Sarkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures
Abstract:A growing fraction of all code is sampled from Large Language Models (LLMs). We investigate the problem of attributing code generated by language models using hypothesis testing to leverage established techniques and guarantees. Given a set of samples $S$ and a suspect model $\mathcal{L}^*$, our goal is to assess the likelihood of $S$ originating from $\mathcal{L}^*$. Due to the curse of dimensionality, this is intractable when only samples from the LLM are given: to circumvent this, we use both samples and density estimates from the LLM, a form of access commonly available. We introduce Anubis, a zero-shot attribution tool that frames attribution as a distribution testing problem. Our experiments on a benchmark of code samples show that Anubis achieves high AUROC scores ($\ge 0.9$) when distinguishing between LLMs like DeepSeek-Coder, CodeGemma, and Stable-Code using only $\approx 2000$ samples.
zh
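The abstract does not give Anubis's exact statistic, so the following is only a plausible sketch of a distribution-test-style score built from the two kinds of access the paper mentions: samples and density (log-likelihood) estimates. The calibration scheme is an assumption.

```python
import numpy as np

def attribution_score(loglik_observed, loglik_self):
    """Toy attribution statistic (not the paper's exact test).

    loglik_observed: suspect model's log-densities on the code samples in S.
    loglik_self:     suspect model's log-densities on its *own* generations,
                     serving as a calibration reference distribution.
    Returns a z-like statistic: values near 0 suggest S could come from the model.
    """
    mu = np.mean(loglik_self)
    sigma = np.std(loglik_self) + 1e-8
    return abs(np.mean(loglik_observed) - mu) / sigma
```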
[AI-37] AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary
【速读】:该论文试图解决将人工智能(Artificial Intelligence, AI)集成到敏捷软件开发(Agile Software Development)实践中所面临的实际挑战与机遇。研究聚焦于工具支持、治理机制、数据质量以及关键技能缺口等问题,并通过系统化分析识别了根本原因。解决方案的关键在于通过协作制定出一个研究路线图,明确未来工作的可操作方向,包括短期解决方案和长期目标,以推动产业界与学术界共同实现从现有问题到成功实施的转变。
链接: https://arxiv.org/abs/2506.20159
作者: Tomas Herda,Victoria Pichler,Zheying Zhang,Pekka Abrahamsson,Geir K. Hanssen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The full-day workshop on AI and Agile at XP 2025 convened a diverse group of researchers and industry practitioners to address the practical challenges and opportunities of integrating Artificial Intelligence into Agile software development. Through interactive sessions, participants identified shared frustrations related to integrating AI into Agile Software Development practices, including challenges with tooling, governance, data quality, and critical skill gaps. These challenges were systematically prioritized and analyzed to uncover root causes. The workshop culminated in the collaborative development of a research roadmap that pinpoints actionable directions for future work, including both immediate solutions and ambitious long-term goals. The key outcome is a structured agenda designed to foster joint industry-academic efforts to move from identified frustrations to successful implementation.
zh
[AI-38] Irec: A Metacognitive Scaffolding for Self-Regulated Learning through Just-in-Time Insight Recall: A Conceptual Framework and System Prototype
【速读】:该论文试图解决现有数字工具在支持元认知反思方面的不足,特别是在间隔重复系统(Spaced Repetition Systems, SRS)中忽略上下文作用以及个人知识管理(Personal Knowledge Management, PKM)工具需要高人工维护的问题。解决方案的关键在于提出“洞察回忆”(Insight Recall)这一新范式,将其作为元认知支架以促进自我调节学习(Self-Regulated Learning, SRL)。该范式基于及时自适应干预(Just-in-Time Adaptive Intervention, JITAI)框架,并通过原型系统Irec实现,其核心是一个动态的知识图谱,结合混合检索引擎与大语言模型(Large Language Model, LLM)进行相关性评估,从而在适当的时间提供最相关的学习支架。
链接: https://arxiv.org/abs/2506.20156
作者: Xuefei Hou,Xizhao Tan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Version 1 of a work in progress. Finalized system flowcharts, a public GitHub repository with the source code, and a full reproducibility package detailing the prompts, models, and testing guidelines will be provided in v2
Abstract:The core challenge in learning has shifted from knowledge acquisition to effective Self-Regulated Learning (SRL): planning, monitoring, and reflecting on one’s learning. Existing digital tools, however, inadequately support metacognitive reflection. Spaced Repetition Systems (SRS) use de-contextualized review, overlooking the role of context, while Personal Knowledge Management (PKM) tools require high manual maintenance. To address these challenges, this paper introduces “Insight Recall,” a novel paradigm that conceptualizes the context-triggered retrieval of personal past insights as a metacognitive scaffold to promote SRL. We formalize this paradigm using the Just-in-Time Adaptive Intervention (JITAI) framework and implement a prototype system, Irec, to demonstrate its feasibility. At its core, Irec uses a dynamic knowledge graph of the user’s learning history. When a user faces a new problem, a hybrid retrieval engine recalls relevant personal “insights.” Subsequently, a large language model (LLM) performs a deep similarity assessment to filter and present the most relevant scaffold in a just-in-time manner. To reduce cognitive load, Irec features a human-in-the-loop pipeline for LLM-based knowledge graph construction. We also propose an optional “Guided Inquiry” module, where users can engage in a Socratic dialogue with an expert LLM, using the current problem and recalled insights as context. The contribution of this paper is a solid theoretical framework and a usable system platform for designing next-generation intelligent learning systems that enhance metacognition and self-regulation.
zh
[AI-39] AI Copilots for Reproducibility in Science: A Case Study
【速读】:该论文试图解决科学研究中可重复性(reproducibility)的挑战,即如何确保已发表的研究成果能够被独立复现。解决方案的关键在于提出OpenPub平台,其中的核心组件是Reproducibility Copilot,它通过分析论文、代码和补充材料,生成结构化的Jupyter Notebooks和改进建议,以促进计算可重复性。该工具能够系统检测影响可重复性的障碍,如缺失的超参数、未记录的预处理步骤以及不完整或不可访问的数据集,从而显著降低复现时间并提高结果的可计算复现覆盖率。
链接: https://arxiv.org/abs/2506.20130
作者: Adrien Bibal,Steven N. Minton,Deborah Khider,Yolanda Gil
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Open science initiatives seek to make research outputs more transparent, accessible, and reusable, but ensuring that published findings can be independently reproduced remains a persistent challenge. This paper introduces OpenPub, an AI-powered platform that supports researchers, reviewers, and readers through a suite of modular copilots focused on key open science tasks. In this work, we present the Reproducibility Copilot, which analyzes manuscripts, code, and supplementary materials to generate structured Jupyter Notebooks and recommendations aimed at facilitating computational, or “rote”, reproducibility. We conducted feasibility tests using previously studied research papers with known reproducibility benchmarks. Results indicate that OpenPub can substantially reduce reproduction time - from over 30 hours to about 1 hour - while achieving high coverage of figures, tables, and results suitable for computational reproduction. The system systematically detects barriers to reproducibility, including missing hyperparameters, undocumented preprocessing steps, and incomplete or inaccessible datasets. These findings suggest that AI-driven tools can meaningfully reduce the burden of reproducibility efforts and contribute to more transparent and verifiable scientific communication. The modular copilot architecture also provides a foundation for extending AI assistance to additional open science objectives beyond reproducibility.
zh
[AI-40] Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents
【速读】:该论文试图解决当前AI代码助手在生成代码补全时缺乏解释机制的问题,这种不透明性阻碍了开发者对输出结果的批判性评估、准确心理模型的构建以及对系统的合理信任。解决方案的关键在于提出CopilotLens,这是一个交互式框架,通过动态的双层界面揭示AI代理的“思考过程”,包括其重构的高层次计划及影响代码生成的具体代码库上下文,从而将代码补全从简单的建议转变为可解释的事件。
链接: https://arxiv.org/abs/2506.20062
作者: Runlong Ye,Zeling Zhang,Boushra Almazroua,Michael Liut
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:AI-powered code assistants are widely used to generate code completions, significantly boosting developer productivity. However, these tools typically present suggestions without explaining their rationale, leaving their decision-making process inscrutable. This opacity hinders developers’ ability to critically evaluate the output, form accurate mental models, and build calibrated trust in the system. To address this, we introduce CopilotLens, a novel interactive framework that reframes code completion from a simple suggestion into a transparent, explainable event. CopilotLens operates as an explanation layer that reveals the AI agent’s “thought process” through a dynamic two-level interface, surfacing everything from its reconstructed high-level plans to the specific codebase context influencing the code. This paper presents the design and rationale of CopilotLens, offering a concrete framework for building future agentic code assistants that prioritize clarity of reasoning over speed of suggestion, thereby fostering deeper comprehension and more robust human-AI collaboration.
zh
[AI-41] DiaLLMs: EHR Enhanced Clinical Conversational System for Clinical Test Recommendation and Diagnosis Prediction
【速读】:该论文旨在解决现有医学大语言模型(Large Language Models, LLMs)在临床应用中的局限性,即过度关注诊断推荐而忽视电子健康记录(Electronic Health Records, EHR)的重要性,从而限制了其在实际医疗场景中的适用性。解决方案的关键在于提出DiaLLM,这是首个将异构EHR数据整合到临床基础对话中的医学大语言模型,通过临床检验参考(Clinical Test Reference, CTR)策略实现临床代码与描述的映射,并对检验结果进行“正常”或“异常”的分类,同时结合强化学习框架进行证据获取和自动化诊断,以提升临床检验推荐和诊断预测的准确性。
链接: https://arxiv.org/abs/2506.20059
作者: Weijieying Ren,Tianxiang Zhao,Lei Wang,Tianchun Wang,Vasant Honavar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have led to remarkable progress in medical consultation. However, existing medical LLMs overlook the essential role of Electronic Health Records (EHR) and focus primarily on diagnosis recommendation, limiting their clinical applicability. We propose DiaLLM, the first medical LLM that integrates heterogeneous EHR data into clinically grounded dialogues, enabling clinical test recommendation, result interpretation, and diagnosis prediction to better align with real-world medical practice. To construct clinically grounded dialogues from EHR, we design a Clinical Test Reference (CTR) strategy that maps each clinical code to its corresponding description and classifies test results as “normal” or “abnormal”. Additionally, DiaLLM employs a reinforcement learning framework for evidence acquisition and automated diagnosis. To handle the large action space, we introduce a reject sampling strategy to reduce redundancy and improve exploration efficiency. Furthermore, a confirmation reward and a class-sensitive diagnosis reward are designed to guide accurate diagnosis prediction. Extensive experimental results demonstrate that DiaLLM outperforms baselines in clinical test recommendation and diagnosis prediction.
zh
[AI-42] Robust Robotic Exploration and Mapping Using Generative Occupancy Map Synthesis
【速读】:该论文旨在解决机器人探索中地图构建质量与可通行性(traversability)不足的问题。其核心解决方案是提出 SceneSense,一种基于扩散模型的生成式占用映射(Generative Occupancy Mapping)方法,能够根据部分观测数据预测三维占用图,并将其概率融合到实时运行的占用图中,从而提升地图质量和可通行性。
链接: https://arxiv.org/abs/2506.20049
作者: Lorin Achey,Alec Reed,Brendan Crowe,Bradley Hayes,Christoffer Heckman
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2409.10681
Abstract:We present a novel approach for enhancing robotic exploration by using generative occupancy mapping. We introduce SceneSense, a diffusion model designed and trained for predicting 3D occupancy maps given partial observations. Our proposed approach probabilistically fuses these predictions into a running occupancy map in real-time, resulting in significant improvements in map quality and traversability. We implement SceneSense onboard a quadruped robot and validate its performance with real-world experiments to demonstrate the effectiveness of the model. In these experiments, we show that occupancy maps enhanced with SceneSense predictions better represent our fully observed ground truth data (24.44% FID improvement around the robot and 75.59% improvement at range). We additionally show that integrating SceneSense-enhanced maps into our robotic exploration stack as a “drop-in” map improvement, utilizing an existing off-the-shelf planner, results in improvements in robustness and traversability time. Finally we show results of full exploration evaluations with our proposed system in two dissimilar environments and find that locally enhanced maps provide more consistent exploration results than maps constructed only from direct sensor measurements.
zh
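Fusing predicted occupancy into a running map is commonly done with a Bayesian log-odds update. The sketch below shows that standard update; the `confidence` down-weighting of generated predictions relative to direct sensor hits is an assumed design choice, and the paper's exact fusion rule may differ.

```python
import numpy as np

def fuse_occupancy(map_logodds, pred_prob, confidence=0.7):
    """Probabilistically fuse predicted occupancy into a log-odds grid map.

    map_logodds: current occupancy map, shape (X, Y, Z), in log-odds.
    pred_prob:   diffusion model's occupancy probabilities in [0, 1], same shape.
    confidence:  down-weights generated predictions vs. direct measurements.
    """
    p = np.clip(pred_prob, 1e-6, 1 - 1e-6)           # avoid log(0)
    return map_logodds + confidence * np.log(p / (1 - p))
```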
[AI-43] GNNs Uncertainty Quantification using Self-Distillation ALT
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在医疗领域中预测不确定性量化的问题,这一问题对于临床场景中的可信度至关重要。现有的贝叶斯和集成方法虽然可以用于量化不确定性,但计算成本较高,且集成方法中使用的分歧度量无法充分捕捉集成网络中模型的多样性。论文提出的解决方案关键在于基于知识蒸馏的方法,通过自蒸馏(self-distillation)使同一网络同时作为教师和学生模型,从而避免独立训练多个网络,并引入一种能够体现网络多样性的不确定性度量,通过为每个GNN分类器分配不同权重来增强不确定性量化的效果。
链接: https://arxiv.org/abs/2506.20046
作者: Hirad Daneshvar,Reza Samavi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The paper has been accepted in the International Conference on AI in Healthcare (AIiH) 2025 and will appear in the conference proceedings
Abstract:Graph Neural Networks (GNNs) have shown remarkable performance in the healthcare domain. However, what remained challenging is quantifying the predictive uncertainty of GNNs, which is an important aspect of trustworthiness in clinical settings. While Bayesian and ensemble methods can be used to quantify uncertainty, they are computationally expensive. Additionally, the disagreement metric used by ensemble methods to compute uncertainty cannot capture the diversity of models in an ensemble network. In this paper, we propose a novel method, based on knowledge distillation, to quantify GNNs’ uncertainty more efficiently and with higher precision. We apply self-distillation, where the same network serves as both the teacher and student models, thereby avoiding the need to train several networks independently. To ensure the impact of self-distillation, we develop an uncertainty metric that captures the diverse nature of the network by assigning different weights to each GNN classifier. We experimentally evaluate the precision, performance, and ability of our approach in distinguishing out-of-distribution data on two graph datasets: MIMIC-IV and Enzymes. The evaluation results demonstrate that the proposed method can effectively capture the predictive uncertainty of the model while having performance similar to that of the MC Dropout and ensemble methods. The code is publicly available at this https URL.
zh
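A minimal sketch of a weighted per-head disagreement metric in the spirit of the abstract: each self-distilled head's prediction is compared, via KL divergence, against a weighted consensus. The use of KL and the weighting scheme are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def weighted_uncertainty(probs_per_head, weights):
    """Uncertainty from self-distilled classifier heads of one GNN.

    probs_per_head: array (H, C) of class probabilities from H internal heads.
    weights: per-head weights summing to 1 (e.g., deeper heads trusted more).
    Returns the weight-averaged KL divergence of each head from the consensus.
    """
    probs = np.asarray(probs_per_head, dtype=float)
    w = np.asarray(weights, dtype=float)
    consensus = (w[:, None] * probs).sum(axis=0)      # weighted mean prediction
    kl = (probs * np.log((probs + 1e-12) / (consensus + 1e-12))).sum(axis=1)
    return float((w * kl).sum())
```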
[AI-44] LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification
【速读】:该论文旨在解决多类不平衡数据流分类问题,特别是在处理动态变化的类别分布时面临的挑战。其关键解决方案是将局部敏感哈希与随机超平面投影(LSH-RHP)集成到动态集成多样化(DynED)框架中,通过LSH-RHP对多数类进行下采样,从而生成平衡的训练集,提升集成模型的预测性能。
链接: https://arxiv.org/abs/2506.20041
作者: Soheil Abadifard,Fazli Can
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:The classification of imbalanced data streams, which have unequal class distributions, is a key difficulty in machine learning, especially when dealing with multiple classes. While binary imbalanced data stream classification tasks have received considerable attention, only a few studies have focused on multi-class imbalanced data streams. Effectively managing the dynamic imbalance ratio is a key challenge in this domain. This study introduces a novel, robust, and resilient approach to address these challenges by integrating Locality Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) into the Dynamic Ensemble Diversification (DynED) framework. To the best of our knowledge, we present the first application of LSH-RHP for undersampling in the context of imbalanced non-stationary data streams. The proposed method undersamples the majority classes by utilizing LSH-RHP, provides a balanced training set, and improves the ensemble’s prediction performance. We conduct comprehensive experiments on 23 real-world and ten semi-synthetic datasets and compare LSH-DynED with 15 state-of-the-art methods. The results reveal that LSH-DynED outperforms other approaches in terms of both Kappa and mG-Mean effectiveness measures, demonstrating its capability in dealing with multi-class imbalanced non-stationary data streams. Notably, LSH-DynED performs well in large-scale, high-dimensional datasets with considerable class imbalances and demonstrates adaptation and robustness in real-world circumstances. To motivate our design, we review existing methods for imbalanced data streams, outline key challenges, and offer guidance for future work. For the reproducibility of our results, we have made our implementation available on GitHub.
zh
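Random-hyperplane LSH itself is a standard technique: points that fall on the same side of every random hyperplane share a hash bucket. The sketch below uses it to undersample a majority class bucket-by-bucket, preserving local structure while shrinking the class; the per-bucket quota and other parameters are assumptions, not the paper's settings.

```python
import numpy as np

def lsh_rhp_undersample(X_majority, n_planes=8, per_bucket=5, seed=0):
    """Undersample a majority class with random-hyperplane LSH (sketch)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X_majority.shape[1], n_planes))
    codes = (X_majority @ planes > 0).astype(int)     # (N, n_planes) sign bits
    keys = codes @ (1 << np.arange(n_planes))         # pack bits into bucket ids
    keep = []
    for b in np.unique(keys):
        idx = np.flatnonzero(keys == b)
        keep.extend(rng.choice(idx, size=min(per_bucket, idx.size), replace=False))
    return X_majority[np.sort(keep)]
```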
[AI-45] Learning Bilateral Team Formation in Cooperative Multi-Agent Reinforcement Learning
【速读】:该论文试图解决多智能体强化学习(MARL)中动态群体环境下算法双边组队选择的影响问题,现有研究主要关注单边分组、预定义团队或固定种群设置,而对动态种群中算法双边组队选择的影响研究不足。解决方案的关键在于引入一种用于动态多智能体系统中双侧团队形成的框架,通过该框架分析双边团队形成中的算法特性如何影响策略性能和泛化能力,并在广泛采用的多智能体场景中验证了方法的有效性。
链接: https://arxiv.org/abs/2506.20039
作者: Koorosh Moslemi,Chi-Guhn Lee
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注: Accepted to the 2nd Coordination and Cooperation in Multi-Agent Reinforcement Learning (CoCoMARL) Workshop at RLC 2025
Abstract:Team formation and the dynamics of team-based learning have drawn significant interest in the context of Multi-Agent Reinforcement Learning (MARL). However, existing studies primarily focus on unilateral groupings, predefined teams, or fixed-population settings, leaving the effects of algorithmic bilateral grouping choices in dynamic populations underexplored. To address this gap, we introduce a framework for learning two-sided team formation in dynamic multi-agent systems. Through this study, we gain insight into what algorithmic properties in bilateral team formation influence policy performance and generalization. We validate our approach using widely adopted multi-agent scenarios, demonstrating competitive performance and improved generalization in most scenarios.
zh
[AI-46] Hierarchical Reinforcement Learning and Value Optimization for Challenging Quadruped Locomotion
【速读】:该论文试图解决复杂地形下四足机器人运动控制的问题,旨在提升其在多样化和挑战性环境中的适应能力和运动效率。解决方案的关键在于提出一种分层强化学习框架,其中高层策略(High-Level Policy, HLP)通过在线优化低层策略(Low-Level Policy, LLP)的已学习价值函数来选择最优步态目标,而低层策略则通过基于策略的演员-评论家算法进行训练,以足部放置作为目标。这种结构使得系统能够在不额外训练或依赖环境样本的情况下,有效提升奖励并减少碰撞。
链接: https://arxiv.org/abs/2506.20036
作者: Jeremiah Coholich,Muhammad Ali Murtaza,Seth Hutchinson,Zsolt Kira
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a novel hierarchical reinforcement learning framework for quadruped locomotion over challenging terrain. Our approach incorporates a two-layer hierarchy in which a high-level policy (HLP) selects optimal goals for a low-level policy (LLP). The LLP is trained using an on-policy actor-critic RL algorithm and is given footstep placements as goals. We propose an HLP that does not require any additional training or environment samples and instead operates via an online optimization process over the learned value function of the LLP. We demonstrate the benefits of this framework by comparing it with an end-to-end reinforcement learning (RL) approach. We observe improvements in its ability to achieve higher rewards with fewer collisions across an array of different terrains, including terrains more difficult than any encountered during training.
zh
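The key architectural point, a high-level policy that is pure online optimization over the low-level policy's learned value function, can be stated in a few lines. In this sketch, `value_fn(state, goal)` is assumed to be the trained LLP's goal-conditioned value, and candidate footstep generation is left abstract; no extra training or environment samples are needed for this step.

```python
import numpy as np

def select_footstep_goal(state, candidate_goals, value_fn):
    """HLP as online optimization: pick the footstep goal the LLP values most."""
    scores = [value_fn(state, g) for g in candidate_goals]
    return candidate_goals[int(np.argmax(scores))]
```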
[AI-47] Automated Generation of Diverse Courses of Actions for Multi-Agent Operations using Binary Optimization and Graph Learning
【速读】:该论文旨在解决在涉及多智能体的灾难响应、搜救和军事任务中,如何生成多样化的情景行动方案(COA)以应对环境变化和智能体能力差异的问题。解决方案的关键在于将任务空间和COA池抽象为图结构,以量化其多样性,并通过遗传算法进行不考虑顺序的任务分配,以联合最大化COA池内的多样性和智能体-任务映射的整体兼容性。此外,利用图神经网络结合策略梯度方法进行单个智能体的任务序列规划,以适应任务特征并提高完成率。
链接: https://arxiv.org/abs/2506.20031
作者: Prithvi Poddar,Ehsan Tarkesh Esfahani,Karthik Dantu,Souma Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Operations in disaster response, search & rescue, and military missions that involve multiple agents demand automated processes to support the planning of the courses of action (COA). Moreover, traverse-affecting changes in the environment (rain, snow, blockades, etc.) may impact the expected performance of a COA, making it desirable to have a pool of COAs that are diverse in task distributions across agents. Further, variations in agent capabilities, which could be human crews and/or autonomous systems, present practical opportunities and computational challenges to the planning process. This paper presents a new theoretical formulation and computational framework to generate such diverse pools of COAs for operations with soft variations in agent-task compatibility. Key to the problem formulation is a graph abstraction of the task space and the pool of COAs itself to quantify its diversity. Formulating the COAs as a centralized multi-robot task allocation problem, a genetic algorithm is used for (order-ignoring) allocations of tasks to each agent that jointly maximize diversity within the COA pool and overall compatibility of the agent-task mappings. A graph neural network is trained using a policy gradient approach to then perform single agent task sequencing in each COA, which maximizes completion rates adaptive to task features. Our tests of the COA generation process in a simulated environment demonstrate significant performance gain over a random walk baseline, small optimality gap in task sequencing, and execution time of about 50 minutes to plan up to 20 COAs for 5 agent/100 task operations.
zh
[AI-48] Elucidated Rolling Diffusion Models for Probabilistic Weather Forecasting
【速读】:该论文旨在解决高维混沌系统中扩散模型在预测未来快照时难以建模复杂时间依赖性和未能显式考虑不确定性逐步增长的问题。其解决方案的关键在于引入了Elucidated Rolling Diffusion Models (ERDM),这是首个成功将滚动预测结构与Elucidated Diffusion Models (EDM) 的合理且高效设计相结合的框架。ERDM通过调整EDM的核心组件(如噪声调度、网络预处理和Heun采样器)以适应滚动预测场景,并通过三项关键贡献实现有效整合:新颖的损失加权方案、利用预训练EDM的高效初始化策略,以及用于稳健时空特征提取的定制混合序列架构。
链接: https://arxiv.org/abs/2506.20024
作者: Salva Rühling Cachay,Miika Aittala,Karsten Kreis,Noah Brenowitz,Arash Vahdat,Morteza Mardani,Rose Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
备注:
Abstract:Diffusion models are a powerful tool for probabilistic forecasting, yet most applications in high-dimensional chaotic systems predict future snapshots one-by-one. This common approach struggles to model complex temporal dependencies and fails to explicitly account for the progressive growth of uncertainty inherent to such systems. While rolling diffusion frameworks, which apply increasing noise to forecasts at longer lead times, have been proposed to address this, their integration with state-of-the-art, high-fidelity diffusion techniques remains a significant challenge. We tackle this problem by introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to successfully unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM components (noise schedule, network preconditioning, and Heun sampler) to the rolling forecast setting. The success of this integration is driven by three key contributions: (i) a novel loss weighting scheme that focuses model capacity on the mid-range forecast horizons where determinism gives way to stochasticity; (ii) an efficient initialization strategy using a pre-trained EDM for the initial window; and (iii) a bespoke hybrid sequence architecture for robust spatiotemporal feature extraction under progressive denoising. On 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5° resolution, ERDM consistently outperforms key diffusion-based baselines, including conditional autoregressive EDM. ERDM offers a flexible and powerful general framework for tackling diffusion-based sequence generation problems where modeling escalating uncertainty is paramount. Code is available at: this https URL
zh
[AI-49] Achieving Trustworthy Real-Time Decision Support Systems with Low-Latency Interpretable AI Models
【速读】:该论文试图解决在资源受限环境下,如何通过低延迟人工智能模型实现高效实时决策支持的问题,重点探讨了生成式AI(Generative AI)在决策辅助中的应用潜力。其解决方案的关键在于整合全栈AI驱动的决策工具、边缘-物联网(Edge-IoT)技术以及人机协作的有效方法,同时结合模型压缩技术与边缘设备上的分析优化,以应对计算资源有限和系统灵活性不足等挑战。
链接: https://arxiv.org/abs/2506.20018
作者: Zechun Deng,Ziwei Liu,Ziqian Bi,Junhao Song,Chia Xin Liang,Joe Yeong,Junfeng Hao
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:This paper investigates real-time decision support systems that leverage low-latency AI models, bringing together recent progress in holistic AI-driven decision tools, integration with Edge-IoT technologies, and approaches for effective human-AI teamwork. It looks into how large language models can assist decision-making, especially when resources are limited. The research also examines the effects of technical developments such as DeLLMa, methods for compressing models, and improvements for analytics on edge devices, while also addressing issues like limited resources and the need for adaptable frameworks. Through a detailed review, the paper offers practical perspectives on development strategies and areas of application, adding to the field by pointing out opportunities for more efficient and flexible AI-supported systems. The conclusions set the stage for future breakthroughs in this fast-changing area, highlighting how AI can reshape real-time decision support.
zh
[AI-50] New Insights on Unfolding and Fine-tuning Quantum Federated Learning
【速读】:该论文旨在解决量子联邦学习(Quantum Federated Learning, QFL)中客户端异质性带来的性能挑战。其关键解决方案是引入深度展开(deep unfolding)技术,使客户端能够根据自身的训练行为自主优化超参数,如学习率和正则化因子,从而实现动态适应,缓解过拟合问题,并在高度异质环境中确保鲁棒优化。该方法通过自适应微调显著提升了模型性能,在真实量子硬件和模拟器上的实验表明其准确率可达约90%,远超传统方法的约55%。
链接: https://arxiv.org/abs/2506.20016
作者: Shanika Iroshi Nanayakkara,Shiva Raj Pokhrel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 9 figures, 7 Tables, Submitted to IEEE/ACM journal 2025
Abstract:Client heterogeneity poses significant challenges to the performance of Quantum Federated Learning (QFL). To overcome these limitations, we propose a new approach leveraging deep unfolding, which enables clients to autonomously optimize hyperparameters, such as learning rates and regularization factors, based on their specific training behavior. This dynamic adaptation mitigates overfitting and ensures robust optimization in highly heterogeneous environments where standard aggregation methods often fail. Our framework achieves approximately 90% accuracy, significantly outperforming traditional methods, which typically yield around 55% accuracy, as demonstrated through real-time training on IBM quantum hardware and Qiskit Aer simulators. By developing self-adaptive fine-tuning, the proposed method proves particularly effective in critical applications such as gene expression analysis and cancer detection, enhancing diagnostic precision and predictive modeling within quantum systems. Our results are attributed to convergence-aware, learnable optimization steps intrinsic to the deep unfolded framework, which maintain generalization. Hence, this study addresses the core limitations of conventional QFL, streamlining its applicability to complex challenges such as healthcare and genomic research.
zh
[AI-51] QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在量子计算代码生成任务中的有效性问题,特别是针对基于PennyLane的量子编程场景。其关键解决方案是引入QHackBench基准数据集,并结合原始提示和检索增强生成(Retrieval-Augmented Generation, RAG)方法进行模型评估,同时提出一种多智能体评估流水线以迭代优化错误解,从而提升代码执行成功率。
链接: https://arxiv.org/abs/2506.20008
作者: Abdul Basit,Minghao Shao,Haider Asif,Nouhaila Innan,Muhammad Kashif,Alberto Marchisio,Muhammad Shafique
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: 8 pages, 6 figures, 3 tables, submitted to QAI 2025
Abstract:Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, generate results roughly on par with standard prompting, particularly in complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.
zh
[AI-52] RACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
【速读】:该论文旨在解决深度强化学习智能体在未见过环境中的泛化问题,其核心挑战在于如何设计有效的课程以提升智能体的零样本泛化能力。解决方案的关键在于提出一种名为TRACED(Transition-aware Regret Approximation with Co-learnability for Environment Design)的方法,该方法通过结合过渡预测误差和共学习性度量来改进遗憾值近似,从而更有效地指导环境设计,实现样本高效的课程生成。
链接: https://arxiv.org/abs/2506.19997
作者: Geonwoo Cho,Jaegyun Im,Jihwan Lee,Hojun Yi,Sejin Kim,Sundong Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called co-learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED yields curricula that improve zero-shot generalization across multiple benchmarks while requiring up to 2x fewer environment interactions than strong baselines. Ablation studies confirm that the transition prediction error drives rapid complexity ramp-up and that co-learnability delivers additional gains when paired with the transition prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED.
zh
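As a concrete reading of the method, here is a minimal sketch of how TRACED's three signals might be combined into one learning-potential score for a candidate training level. The linear form and the coefficients are assumptions for illustration; the abstract does not specify the exact combination.

```python
def traced_score(value_loss, transition_error, colearnability,
                 alpha=1.0, beta=0.5, gamma=0.2):
    """Learning-potential score for a candidate level (illustrative sketch).

    value_loss:        standard regret proxy (value-function loss).
    transition_error:  dynamics-prediction error on the level, TRACED's
                       additional regret term.
    colearnability:    how much training on this level helps other levels.
    """
    return alpha * value_loss + beta * transition_error + gamma * colearnability
```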
[AI-53] HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization
【速读】:该论文试图解决复杂多模态数据集日益增长所带来的分析挑战,即需要一种不仅能有效分组数据,还能提供人类可理解的结构洞察的高级分析工具。解决方案的关键在于提出HERCULES(Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization),该算法通过递归应用k-means聚类构建层次化聚类结构,并深度集成大型语言模型(Large Language Models, LLMs)以生成语义丰富的聚类标题和描述,从而显著提升聚类结果的可解释性。
链接: https://arxiv.org/abs/2506.19992
作者: Gabor Petnehazi,Bernadett Aradi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The explosive growth of complex datasets across various modalities necessitates advanced analytical tools that not only group data effectively but also provide human-understandable insights into the discovered structures. We introduce HERCULES (Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization), a novel algorithm and Python package designed for hierarchical k-means clustering of diverse data types, including text, images, and numeric data (processed one modality per run). HERCULES constructs a cluster hierarchy by recursively applying k-means clustering, starting from individual data points at level 0. A key innovation is its deep integration of Large Language Models (LLMs) to generate semantically rich titles and descriptions for clusters at each level of the hierarchy, significantly enhancing interpretability. The algorithm supports two main representation modes: ‘direct’ mode, which clusters based on original data embeddings or scaled numeric features, and ‘description’ mode, which clusters based on embeddings derived from LLM-generated summaries. Users can provide a `topic_seed` to guide LLM-generated summaries towards specific themes. An interactive visualization tool facilitates thorough analysis and understanding of the clustering results. We demonstrate HERCULES’s capabilities and discuss its potential for extracting meaningful, hierarchical knowledge from complex datasets.
zh
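The recursive k-means backbone is straightforward to reproduce; a minimal sketch with scikit-learn follows. The LLM-generated titles/descriptions and the ‘direct’/‘description’ representation modes are omitted here, and the depth/size parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def recursive_kmeans(X, k=3, min_size=10, depth=0, max_depth=3):
    """Build a HERCULES-style cluster hierarchy by recursive k-means.

    Returns a nested dict: {'indices': ..., 'children': [...]}, where a
    child's indices are relative to its parent's data slice.
    """
    node = {"indices": np.arange(len(X)), "children": []}
    if depth >= max_depth or len(X) < max(min_size, k):
        return node                                   # leaf: too small or too deep
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        child = recursive_kmeans(X[idx], k, min_size, depth + 1, max_depth)
        child["indices"] = idx                        # positions within this level
        node["children"].append(child)
    return node
```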
[AI-54] Context Attribution with Multi-Armed Bandit Optimization
【速读】:该论文试图解决生成式问答系统中如何识别检索上下文中对模型生成答案有贡献的部分,以提升系统的可解释性和可信度。解决方案的关键在于将上下文归属问题建模为组合多臂老虎机(Combinatorial Multi-Armed Bandit, CMAB)问题,将每个上下文片段视为一个老虎机臂,并采用组合贝叶斯采样(Combinatorial Thompson Sampling, CTS)方法,在有限的查询预算下高效探索上下文子集的空间。通过基于归一化token似然的奖励函数,该方法能够衡量一段上下文支持原始模型响应的程度,并通过后验估计的片段相关性自适应地平衡探索与利用,从而在保持高归属保真度的同时显著提升查询效率。
链接: https://arxiv.org/abs/2506.19977
作者: Deng Pan,Keerthiram Murugesan,Nuno Moniz,Nitesh Chawla
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding which parts of the retrieved context contribute to a large language model’s generated answer is essential for building interpretable and trustworthy generative QA systems. We propose a novel framework that formulates context attribution as a combinatorial multi-armed bandit (CMAB) problem. Each context segment is treated as a bandit arm, and we employ Combinatorial Thompson Sampling (CTS) to efficiently explore the exponentially large space of context subsets under a limited query budget. Our method defines a reward function based on normalized token likelihoods, capturing how well a subset of segments supports the original model response. Unlike traditional perturbation-based attribution methods such as SHAP, which sample subsets uniformly and incur high computational costs, our approach adaptively balances exploration and exploitation by leveraging posterior estimates of segment relevance. This leads to substantially improved query efficiency while maintaining high attribution fidelity. Extensive experiments on diverse datasets and LLMs demonstrate that our method achieves competitive attribution quality with fewer model queries.
zh
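Combinatorial Thompson Sampling with Beta posteriors is a standard instantiation of the CMAB idea. The sketch below treats each context segment as an arm and uses a caller-supplied `reward_fn` (e.g., normalized token likelihood of the original answer given only a subset of segments); the fixed `subset_size` and the Bernoulli-style posterior update are assumptions, not the paper's exact design.

```python
import numpy as np

def cts_attribution(n_segments, reward_fn, budget=100, subset_size=3, seed=0):
    """Combinatorial Thompson Sampling for context attribution (sketch).

    reward_fn: maps an array of segment indices to a reward in [0, 1].
    Returns per-segment posterior means as attribution scores.
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_segments)          # Beta posterior "successes"
    beta = np.ones(n_segments)           # Beta posterior "failures"
    for _ in range(budget):
        theta = rng.beta(alpha, beta)                  # sample arm qualities
        chosen = np.argsort(-theta)[:subset_size]      # best subset under sample
        r = reward_fn(chosen)
        alpha[chosen] += r                             # fractional Bernoulli update
        beta[chosen] += 1.0 - r
    return alpha / (alpha + beta)
```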
[AI-55] Prover Agent: An Agent-based Framework for Formal Mathematical Proofs
【速读】:该论文旨在解决自动化定理证明(Automated Theorem Proving, ATP)中的挑战,特别是在有限样本预算下提高证明成功率的问题。其解决方案的关键在于提出Prover Agent,该代理整合了大型语言模型(Large Language Models, LLMs)与形式化证明助手Lean,通过协调非正式推理的LLM、形式化证明模型以及来自Lean的反馈,同时生成辅助引理以辅助发现整体证明策略,从而显著提升了定理证明的效率和成功率。
链接: https://arxiv.org/abs/2506.19923
作者: Kaito Baba,Chaoran Liu,Shuhei Kurita,Akiyoshi Sannai
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 2 figures
Abstract:We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas to assist in discovering the overall proof strategy. It achieves an 86.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present case studies illustrating how these generated lemmas contribute to solving challenging problems.
zh
[AI-56] Can LLM s Replace Humans During Code Chunking?
【速读】:该论文试图解决政府应用中遗留代码(如ALC和MUMPS)在使用大型语言模型(LLMs)进行现代化过程中面临的输入限制问题。解决方案的关键在于通过不同的代码分块(code-chunking)方法优化生成模块注释的效率与质量,实验结果表明LLMs能够选择与人类专家划分高度一致的分割点,并且由LLMs生成的代码块在文档生成任务中表现出更高的事实准确性和实用性。
链接: https://arxiv.org/abs/2506.19897
作者: Christopher Glasz,Emily Escamilla,Eric O. Scott,Anand Patel,Jacob Zimmer,Colin Diggs,Michael Doyle,Scott Rosen,Nitin Naik,Justin F. Brunelle,Samruddhi Thaker,Parthav Poudel,Arun Sridharan,Amit Madan,Doug Wendt,William Macke,Thomas Schill
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have become essential tools in computer science, especially for tasks involving code understanding and generation. However, existing work does not address many of the unique challenges presented by code written for government applications. In particular, government enterprise software is often written in legacy languages like MUMPS or assembly language code (ALC) and the overall token lengths of these systems exceed the context window size for current commercially available LLMs. Additionally, LLMs are primarily trained on modern software languages and have undergone limited testing with legacy languages, making their ability to understand legacy languages unknown and, hence, an area for empirical study. This paper examines the application of LLMs in the modernization of legacy government code written in ALC and MUMPS, addressing the challenges of input limitations. We investigate various code-chunking methods to optimize the generation of summary module comments for legacy code files, evaluating the impact of code-chunking methods on the quality of documentation produced by different LLMs, including GPT-4o, Claude 3 Sonnet, Mixtral, and Llama 3. Our results indicate that LLMs can select partition points closely aligned with human expert partitioning. We also find that chunking approaches have significant impact on downstream tasks such as documentation generation. LLM-created partitions produce comments that are up to 20% more factual and up to 10% more useful than when humans create partitions. Therefore, we conclude that LLMs can be used as suitable replacements for human partitioning of large codebases during LLM-aided modernization.
zh
[AI-57] A Framework for Uncertainty Quantification Based on Nearest Neighbors Across Layers ICANN2025
【速读】:该论文试图解决神经网络在高风险领域(如医学诊断或自动驾驶)中可能出现的错误决策问题,这些问题往往由于模型无法准确检测模式或建立逻辑模型而导致。解决方案的关键在于提出一种基于检索到的训练案例的后处理框架,通过测量与查询具有相似激活向量的训练案例来评估决策的不确定性。该框架引入了两个新指标:Decision Change 和 Layer Uncertainty,用于捕捉跨层的最近邻类别分布变化,从而提升不确定性估计的准确性,特别是在具有挑战性的分类任务中表现优于基于 softmax 的置信度方法。
链接: https://arxiv.org/abs/2506.19895
作者: Miguel N. Font,José L. Jorro-Aragoneses,Carlos M. Alaíz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for presentation at ICANN 2025 (International Conference on Artificial Neural Networks) and will appear in the conference proceedings published by Springer Nature in the Lecture Notes in Computer Science (LNCS) series. The final authenticated version will be available on the publisher website
Abstract:Neural Networks have high accuracy in solving problems where it is difficult to detect patterns or create a logical model. However, these algorithms sometimes return wrong solutions, which become problematic in high-risk domains like medical diagnosis or autonomous driving. One strategy to detect and mitigate these errors is the measurement of the uncertainty over neural network decisions. In this paper, we present a novel post-hoc framework for measuring the uncertainty of a decision based on retrieved training cases that have a similar activation vector to the query for each layer. Based on these retrieved cases, we propose two new metrics: Decision Change and Layer Uncertainty, which capture changes in nearest-neighbor class distributions across layers. We evaluated our approach in a classification model for two datasets: CIFAR-10 and MNIST. The results show that these metrics enhance uncertainty estimation, especially in challenging classification tasks, outperforming softmax-based confidence.
zh
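A rough sketch of the retrieval-based metrics: for each layer, fetch the k nearest training activations and track how the neighbor class histogram shifts from layer to layer. Here, counting argmax flips stands in for Decision Change, and the final-layer histogram entropy stands in for Layer Uncertainty; both are simplifications of the paper's metrics.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def layerwise_uncertainty(query_acts, train_acts, train_labels, n_classes, k=10):
    """Decision-Change-style and entropy-style scores from layer-wise kNN.

    query_acts:  list of per-layer activation vectors for one query.
    train_acts:  list of per-layer activation matrices for the training set.
    train_labels: np.ndarray of integer class labels for the training cases.
    """
    prev_hist, changes, hist = None, 0, None
    for q, A in zip(query_acts, train_acts):
        nn = NearestNeighbors(n_neighbors=k).fit(A)
        idx = nn.kneighbors(q.reshape(1, -1), return_distance=False)[0]
        hist = np.bincount(train_labels[idx], minlength=n_classes) / k
        if prev_hist is not None and hist.argmax() != prev_hist.argmax():
            changes += 1                      # nearest-neighbor class flipped
        prev_hist = hist
    entropy = float(-(hist * np.log(hist + 1e-12)).sum())
    return changes, entropy                   # (decision changes, final entropy)
```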
[AI-58] Explaining deep neural network models for electricity price forecasting with XAI
【速读】:该论文试图解决电力市场中价格动态难以理解的问题,特别是由于市场内部复杂的交互和依赖关系导致的价格驱动因素不明确。解决方案的关键在于利用深度神经网络模型(DNN)进行价格预测,并结合可解释人工智能(XAI)方法如SHAP和梯度分析,以及可视化技术如热图(saliency maps),来解析不同特征对价格变化的贡献。论文还引入了SSHAP值和SSHAP线以增强高维表格模型的复杂表示。
链接: https://arxiv.org/abs/2506.19894
作者: Antoine Pesenti,Aidan OSullivan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Electricity markets are highly complex, involving lots of interactions and complex dependencies that make it hard to understand the inner workings of the market and what is driving prices. Econometric methods, i.e., white-box models, have been developed for this, but they are not as powerful as deep neural network models (DNNs). In this paper, we use a DNN to forecast the price and then use XAI methods to understand the factors driving the price dynamics in the market. The objective is to increase our understanding of how different electricity markets work. To do that, we apply explainable methods such as SHAP and Gradient, combined with visual techniques like heatmaps (saliency maps), to analyse the behaviour and contributions of various features across five electricity markets. We introduce the novel concepts of SSHAP values and SSHAP lines to enhance the complex representation of high-dimensional tabular models.
zh
[AI-59] Distillation-Enabled Knowledge Alignment for Generative Semantic Communications in AIGC Provisioning Tasks
【速读】:该论文旨在解决生成式语义通信(GSC)系统中知识对齐的挑战,即如何在云生成式AI(GAI)与边缘和用户端之间,以及无线传输知识与实际信道特性之间实现有效对齐。其解决方案的关键在于提出一种基于知识蒸馏的对齐算法DeKA-g,通过将云-GAI的生成知识蒸馏到低秩矩阵中,使边缘端能够利用这些知识以适应不同的无线信道条件。该方法包含两个创新技术:元词辅助知识蒸馏(MAKD)和可变速率分组信噪比适配(VGSA),从而显著提升了图像生成质量、压缩率适应效率及低信噪比环境下的性能。
链接: https://arxiv.org/abs/2506.19893
作者: Jingzhi Hu,Geoffrey Ye Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Image and Video Processing (eess.IV)
备注:
Abstract:Due to the surging amount of AI-generated content (AIGC), its provisioning to edges and mobile users from the cloud incurs substantial traffic on networks. Generative semantic communication (GSC) offers a promising solution by transmitting highly compact information, i.e., prompt text and latent representations, instead of high-dimensional AIGC data. However, GSC relies on the alignment between the knowledge in the cloud generative AI (GAI) and that possessed by the edges and users, and between the knowledge for wireless transmission and that of actual channels, which remains challenging. In this paper, we propose DeKA-g, a distillation-enabled knowledge alignment algorithm for GSC systems. The core idea is to distill the generation knowledge from the cloud-GAI into low-rank matrices, which can be incorporated by the edge and used to adapt the transmission knowledge to diverse wireless channel conditions. DeKA-g comprises two novel methods: metaword-aided knowledge distillation (MAKD) and variable-rate grouped SNR adaptation (VGSA). For MAKD, an optimized metaword is employed to enhance the efficiency of knowledge distillation, while VGSA enables efficient adaptation to diverse compression rates and SNR ranges. From simulation results, DeKA-g improves the alignment between the edge-generated images and the cloud-generated ones by 44%. Moreover, it adapts to compression rates with 116% higher efficiency than the baseline and enhances the performance in low-SNR conditions by 28%.
zh
[AI-60] RepuNet: A Reputation System for Mitigating Malicious Clients in DFL
【速读】:该论文旨在解决去中心化联邦学习(Decentralized Federated Learning, DFL)中由于节点自主选择聚合对象而引入的新安全威胁,如模型污染、延迟攻击和网络拥塞等问题。现有解决方案通常依赖于固定配置或额外基础设施(如区块链),导致计算开销大、可扩展性差或适应性有限。论文提出的解决方案是RepuNet,其关键在于构建一个去中心化的声誉系统,通过动态评估节点行为(如模型相似性、参数变化、消息延迟和通信量)来分类威胁,并根据节点的声誉分数调整其在模型聚合中的影响力,从而有效检测和缓解恶意行为。
链接: https://arxiv.org/abs/2506.19892
作者: Isaac Marroqui Penalva,Enrique Tomás Martínez Beltrán,Manuel Gil Pérez,Alberto Huertas Celdrán
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
备注:
Abstract:Decentralized Federated Learning (DFL) enables nodes to collaboratively train models without a central server, introducing new vulnerabilities since each node independently selects peers for model aggregation. Malicious nodes may exploit this autonomy by sending corrupted models (model poisoning), delaying model submissions (delay attack), or flooding the network with excessive messages, negatively affecting system performance. Existing solutions often depend on rigid configurations or additional infrastructures such as blockchain, leading to computational overhead, scalability issues, or limited adaptability. To overcome these limitations, this paper proposes RepuNet, a decentralized reputation system that categorizes threats in DFL and dynamically evaluates node behavior using metrics like model similarity, parameter changes, message latency, and communication volume. Nodes’ influence in model aggregation is adjusted based on their reputation scores. RepuNet was integrated into the Nebula DFL platform and experimentally evaluated with MNIST and CIFAR-10 datasets under non-IID distributions, using federations of up to 25 nodes in both fully connected and random topologies. Different attack intensities, frequencies, and activation intervals were tested. Results demonstrated that RepuNet effectively detects and mitigates malicious behavior, achieving F1 scores above 95% for MNIST scenarios and approximately 76% for CIFAR-10 cases. These outcomes highlight RepuNet’s adaptability, robustness, and practical potential for mitigating threats in decentralized federated learning environments.
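作为理解辅助,下面给出"按行为指标更新声誉分并按声誉加权聚合"这一思路的最小示意;指标组合方式、权重系数与平滑因子均为假设,并非 RepuNet 的原始公式。

```python
# 示意:由模型相似度、延迟、消息速率更新节点声誉,再按声誉加权聚合。
# 各权重系数与指数平滑更新规则均为假设。
import numpy as np

def behavior_score(cos_sim, latency_s, msg_rate):
    # 相似度越高、延迟与消息量越低,得分越高(假设的归一化方式)
    return 0.6 * cos_sim + 0.2 * np.exp(-latency_s) + 0.2 * np.exp(-msg_rate / 10)

rng = np.random.default_rng(0)
models = [rng.normal(size=100) for _ in range(4)]   # 4 个邻居节点的模型
local = rng.normal(size=100)
reputation = np.full(4, 0.5)                        # 初始声誉

for i, m in enumerate(models):
    cos = m @ local / (np.linalg.norm(m) * np.linalg.norm(local))
    s = behavior_score(cos, latency_s=0.1 * (i + 1), msg_rate=5.0)
    reputation[i] = 0.8 * reputation[i] + 0.2 * s   # 指数平滑更新

w = reputation / reputation.sum()                   # 声誉归一化为聚合权重
aggregated = sum(wi * mi for wi, mi in zip(w, models))
```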
zh
[AI-61] Orthogonal Soft Pruning for Efficient Class Unlearning
【速读】:该论文旨在解决机器遗忘(machine unlearning)中面临的挑战,即在满足隐私法规(如GDPR)的同时,如何在遗忘特定类别知识的过程中保持模型的预测准确性,而现有方法通常在遗忘速度与模型性能之间存在权衡。论文提出的解决方案关键在于一种基于类感知的软剪枝框架,通过正交卷积核正则化来实现快速且精确的遗忘,其核心机制是在训练过程中引入正交性约束,以解耦卷积滤波器和特征表示,并通过激活差异分析高效识别特定类别的通道,从而实现对目标类别的完全遗忘以及对保留类别的最小精度损失。
链接: https://arxiv.org/abs/2506.19891
作者: Qinghui Gong,Xue Yang,Xiaohu Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages,3 figures
Abstract:Machine unlearning aims to selectively remove class-specific knowledge from pretrained neural networks to satisfy privacy regulations such as the GDPR. Existing methods typically face a trade-off between unlearning speed and preservation of predictive accuracy, often incurring either high computational overhead or significant performance degradation on retained classes. In this paper, we propose a novel class-aware soft pruning framework leveraging orthogonal convolutional kernel regularization to achieve rapid and precise forgetting with millisecond-level response times. By enforcing orthogonality constraints during training, our method decorrelates convolutional filters and disentangles feature representations, while efficiently identifying class-specific channels through activation difference analysis. Extensive evaluations across multiple architectures and datasets demonstrate stable pruning with near-instant execution, complete forgetting of targeted classes, and minimal accuracy loss on retained data. Experiments on CIFAR-10, CIFAR-100, and TinyImageNet confirm that our approach substantially reduces membership inference attack risks and accelerates unlearning by orders of magnitude compared to state-of-the-art baselines. This framework provides an efficient, practical solution for real-time machine unlearning in Machine Learning as a Service (MLaaS) scenarios.
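摘要中"对卷积核施加正交性约束"的一种常见做法是软正交正则 ||W Wᵀ - I||_F^2;下面给出该思想的最小 PyTorch 草图(任务损失为占位,正则系数为假设),并非论文原始代码。

```python
# 示意:对卷积层权重施加软正交正则,使滤波器之间去相关。
import torch
import torch.nn as nn

def orthogonality_penalty(conv: nn.Conv2d) -> torch.Tensor:
    w = conv.weight.reshape(conv.out_channels, -1)   # (C_out, C_in*k*k)
    gram = w @ w.t()
    eye = torch.eye(conv.out_channels, device=w.device)
    return ((gram - eye) ** 2).sum()                 # ||W W^T - I||_F^2

conv = nn.Conv2d(3, 16, kernel_size=3)
x = torch.randn(8, 3, 32, 32)
task_loss = conv(x).pow(2).mean()                    # 占位的任务损失
loss = task_loss + 1e-3 * orthogonality_penalty(conv)
loss.backward()
```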
zh
[AI-62] Causal-Aware Intelligent QoE Optimization for VR Interaction with Adaptive Keyframe Extraction
【速读】:该论文旨在解决多用户虚拟现实(Multi-User Virtual Reality, MVR)交互中服务质量(Quality of Experience, QoE)优化的问题,具体包括超低延迟、高保真运动同步和公平资源分配之间的平衡难题。现有方法在应对带宽分配、CPU频率调整与用户感知之间的因果关系时存在不足,限制了QoE的提升。论文提出的解决方案关键在于将自适应关键帧提取与因果感知强化学习(Causal-Aware Reinforcement Learning, CARL)相结合,通过引入基于韦伯-费希纳定律的新型QoE度量指标,并构建混合整数规划(Mixed Integer Programming, MIP)模型,联合优化关键帧比例、带宽和计算资源。此外,提出的部分状态因果深度确定性策略梯度(Partial State Causal Deep Deterministic Policy Gradient, PS-CDDPG)方法结合了深度确定性策略梯度(Deep Deterministic Policy Gradient, DDPG)与因果影响检测,利用因果推理计算的权重引导策略探索,从而提升训练效率。
链接: https://arxiv.org/abs/2506.19890
作者: Ziru Zhang,Jiadong Yu,Danny H.K. Tsang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The optimization of quality of experience (QoE) in multi-user virtual reality (VR) interactions demands a delicate balance between ultra-low latency, high-fidelity motion synchronization, and equitable resource allocation. While adaptive keyframe extraction mitigates transmission overhead, existing approaches often overlook the causal relationships among allocated bandwidth, CPU frequency, and user perception, limiting QoE gains. This paper proposes an intelligent framework to maximize QoE by integrating adaptive keyframe extraction with causal-aware reinforcement learning (RL). First, a novel QoE metric is formulated using the Weber-Fechner Law, combining perceptual sensitivity, attention-driven priorities, and motion reconstruction accuracy. The QoE optimization problem is then modeled as a mixed integer programming (MIP) task, jointly optimizing keyframe ratios, bandwidth, and computational resources under horizon-fairness constraints. We propose Partial State Causal Deep Deterministic Policy Gradient (PS-CDDPG), which integrates the Deep Deterministic Policy Gradient (DDPG) method with causal influence detection. By leveraging causal information regarding how QoE is influenced and determined by various actions, we explore actions guided by weights calculated from causal inference (CI), which in turn improves training efficiency. Experiments conducted with the CMU Motion Capture Database demonstrate that our framework significantly reduces interactive latency, enhances QoE, and maintains fairness, achieving superior performance compared to benchmark methods.
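摘要中基于韦伯-费希纳定律的 QoE 指标,其核心形式是感知强度与刺激强度之比的对数关系 P = k * ln(S / S0);下面给出一个假设性的最小实现,常数与各项组合方式均为示例,并非论文原始公式。

```python
# 示意:韦伯-费希纳式感知函数 P = k * ln(S / S0),再加权组合为 QoE。
# 常数 k、阈值 S0 与各项权重均为假设值。
import math

def weber_fechner(stimulus: float, s0: float, k: float = 1.0) -> float:
    return k * math.log(stimulus / s0)

def qoe(bitrate_mbps: float, attention_w: float, recon_acc: float) -> float:
    perceived = weber_fechner(bitrate_mbps, s0=0.5)  # 感知到的码率收益
    return attention_w * perceived + (1 - attention_w) * recon_acc

print(qoe(bitrate_mbps=20.0, attention_w=0.7, recon_acc=0.9))
```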
zh
[AI-63] Retrieval-Confused Generation is a Good Defender for Privacy Violation Attack of Large Language Models
【速读】:该论文试图解决隐私泄露攻击(Privacy Violation Attack, PVA)对大型语言模型(Large Language Models, LLMs)带来的个人隐私安全问题。现有防御方法存在推理成本高、防御效果不佳以及易被攻击者绕过等缺陷。该论文提出的解决方案关键在于基于检索混淆生成(Retrieval-Confused Generation, RCG)的防御范式,通过设计改写提示构建干扰数据库,并采用最不相关检索策略从干扰数据库中提取用户数据,最终替换原始查询中的“数据评论”以生成带有错误个人属性的防御查询,从而有效阻止PVA攻击。
链接: https://arxiv.org/abs/2506.19889
作者: Wanli Peng,Xin Chen,Hang Fu,XinYu He,Xue Yiming,Juan Wen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have made a profound impact on our society and also raised new security concerns. Particularly, due to the remarkable inference ability of LLMs, the privacy violation attack (PVA), revealed by Staab et al., introduces serious personal privacy issues. Existing defense methods mainly leverage LLMs to anonymize the input query, which requires costly inference time and cannot achieve satisfactory defense performance. Moreover, directly rejecting a PVA query may seem effective, but it exposes the defense itself, promoting the evolution of PVA. In this paper, we propose a novel defense paradigm based on retrieval-confused generation (RCG) of LLMs, which can efficiently and covertly defend against PVA. We first design a paraphrasing prompt to induce the LLM to rewrite the “user comments” of the attack query to construct a disturbed database. Then, we propose the most irrelevant retrieval strategy to retrieve the desired user data from the disturbed database. Finally, the “data comments” are replaced with the retrieved user data to form a defended query, so that the adversary is answered with incorrect personal attributes, i.e., the attack fails. Extensive experiments are conducted on two datasets and eight popular LLMs to comprehensively evaluate the feasibility and superiority of the proposed defense method.
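下面用一个极简草图说明"最不相关检索"的思路:对查询与干扰数据库条目计算余弦相似度,取相似度最低者作为替换数据。此处用 TF-IDF 充当嵌入仅为演示,论文中的嵌入与提示构造方式可能不同。

```python
# 示意:从干扰数据库中检索与查询"最不相关"的用户数据。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

disturbed_db = [
    "user enjoys hiking in the alps",
    "comment about a cooking recipe",
    "post discussing stock markets",
]
query = "user mentions living near the alps and skiing"

vec = TfidfVectorizer()
m = vec.fit_transform(disturbed_db + [query])
sims = cosine_similarity(m[-1], m[:-1]).ravel()
print(disturbed_db[sims.argmin()])   # 相似度最低的条目,用于替换原查询数据
```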
zh
[AI-64] FlightKooba: A Fast Interpretable FTP Model
【速读】:该论文试图解决现有基于Koopman理论的飞行轨迹预测(FTP)模型在模型可解释性差、计算复杂度高及训练时间长等问题。解决方案的关键在于提出一种新的建模与控制框架FlightKooba,该框架结合了HIPPO方法、Koopman理论和控制论中的状态空间方程,通过直接从数据中构建Koopman算子,实现了模型的高度可解释性,并显著减少了可训练参数的数量,从而大幅降低了训练时间。
链接: https://arxiv.org/abs/2506.19885
作者: Jing Lu,Xuan Wu,Yizhun Tian,Songhan Fan,Yali Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 7 figures
Abstract:The Koopman theory is a powerful and effective modeling tool for converting nonlinear systems into linear representations, and flight trajectory prediction (FTP) is a complex nonlinear system. However, current models applying the Koopman theory to FTP tasks are not very effective: model interpretability remains an issue, and the Koopman operators are computationally intensive, resulting in long training times. To address these issues, this paper proposes a new modeling and control framework based on the HIPPO method, the Koopman theory, and state space equations from cybernetics: FlightKooba. Inspired by the idea of structural state space equations, FlightKooba constructs the Koopman operators directly from data. This makes the framework highly interpretable and significantly reduces the number of trainable parameters in the module, thereby greatly reducing training time. Experiments have demonstrated the superiority of the FlightKooba modeling method in terms of time and memory consumption (training time comparable to the Mamba module without using CUDA-level acceleration; memory reduced by more than 50% on most datasets, with a tenfold reduction in the number of parameters), essentially completing the FTP task. It provides a new method for the fast computation of the Koopman operators, opening up new possibilities for the combination of time series forecasting and control.
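作为背景,"直接从数据构造 Koopman 算子"的一个经典做法是 DMD 式最小二乘:给定相邻快照矩阵 (X, X'),取 K = X' X⁺。下面是该最朴素版本的 numpy 草图;FlightKooba 实际结合了 HIPPO 与结构化状态空间方程,此处仅演示基本思想。

```python
# 示意:用最小二乘从轨迹快照估计线性 Koopman 算子 K,使 x_{t+1} ≈ K x_t。
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [-0.1, 0.95]])   # 假设的真实线性动力学
x = rng.normal(size=2)
snapshots = []
for _ in range(200):
    x = A_true @ x
    snapshots.append(x)
traj = np.array(snapshots)

X, Xp = traj[:-1].T, traj[1:].T                 # 快照矩阵,形状 (d, T-1)
K = Xp @ np.linalg.pinv(X)                      # K = X' X^+
print(np.round(K, 3))                           # 应接近 A_true
```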
zh
[AI-65] MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core Selection
【速读】:该论文旨在解决在移动设备上进行大型语言模型(Large Language Model, LLM)推理时的能耗问题,特别是针对内存受限的设备。现有研究多集中于加速预填充(prefill)阶段,而忽略了解码(decode)阶段的高能耗问题。论文提出的解决方案关键在于引入自适应能量中心核心选择(Adaptive Energy-Centric Core Selection, AECS),通过动态选择低功耗CPU核心,在保证解码速度在可接受的延迟阈值内的情况下,有效降低LLM解码过程中的能耗。
链接: https://arxiv.org/abs/2506.19884
作者: Zhengxiang Huang,Chaoyue Niu,Zhaode Wang,Jiarui Xue,Hanming Zhang,Yugang Wang,Zewei Xin,Xiaotang Jiang,Chengfei Lv,Fan Wu,Guihai Chen
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE)
备注:
Abstract:As the demand for on-device Large Language Model (LLM) inference grows, energy efficiency has become a major concern, especially for battery-limited mobile devices. Our analysis shows that the memory-bound LLM decode phase dominates energy use, and yet most existing works focus on accelerating the prefill phase, neglecting energy concerns. We introduce Adaptive Energy-Centric Core Selection (AECS) and integrate it into MNN to create the energy-efficient version, MNN-AECS, the first engine-level system solution without requiring root access or OS modifications for energy-efficient LLM decoding. MNN-AECS is designed to reduce LLM decoding energy while keeping decode speed within an acceptable slowdown threshold by dynamically selecting low-power CPU cores. MNN-AECS is evaluated across 5 Android and 2 iOS devices on 5 popular LLMs of various sizes. Compared to original MNN, MNN-AECS cuts down energy use by 23% without slowdown averaged over all 7 devices and 4 datasets. Against other engines, including this http URL, executorch, mllm, and MediaPipe, MNN-AECS delivers 39% to 78% energy saving and 12% to 363% speedup on average.
zh
[AI-66] STIMULUS: Achieving Fast Convergence and Low Sample Complexity in Stochastic Multi-Objective Learning
【速读】:该论文旨在解决多目标优化(Multi-Objective Optimization, MOO)算法设计中收敛速度和样本复杂度不理想的问题。其解决方案的关键在于提出一种名为STIMULUS的新型鲁棒算法,该算法采用简单而有效的递归框架来更新随机梯度估计,从而在保持低样本复杂度的同时提升收敛性能;此外,还引入了带有动量项的改进版本STIMULUS-M以进一步加速收敛。
链接: https://arxiv.org/abs/2506.19883
作者: Zhuqing Liu,Chaosheng Dong,Michinari Momma,Simone Shao,Shaoyuan Xu,Yan Gao,Haibo Yang,Jia Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, multi-objective optimization (MOO) has gained attention for its broad applications in ML, operations research, and engineering. However, MOO algorithm design remains in its infancy and many existing MOO methods suffer from unsatisfactory convergence rate and sample complexity performance. To address this challenge, in this paper, we propose an algorithm called STIMULUS (stochastic path-integrated multi-gradient recursive estimator), a new and robust approach for solving MOO problems. Different from the traditional methods, STIMULUS introduces a simple yet powerful recursive framework for updating stochastic gradient estimates to improve convergence performance with low sample complexity. In addition, we introduce an enhanced version of STIMULUS, termed STIMULUS-M, which incorporates a momentum term to further expedite convergence. We establish $O(1/T)$ convergence rates of the proposed methods for non-convex settings and $O(\exp(-\mu T))$ for strongly convex settings, where $T$ is the total number of iteration rounds. Additionally, we achieve state-of-the-art $O(n+\sqrt{n}\epsilon^{-1})$ sample complexities for non-convex settings and $O(n+\sqrt{n}\ln(\mu/\epsilon))$ for strongly convex settings, where $\epsilon > 0$ is a desired stationarity error. Moreover, to alleviate the periodic full gradient evaluation requirement in STIMULUS and STIMULUS-M, we further propose enhanced versions with adaptive batching called STIMULUS+ / STIMULUS-M+ and provide their theoretical analysis.
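摘要中的"递归随机梯度估计"与 STORM 类估计器同源:v_t = ∇f(x_t) + (1-a)(v_{t-1} - ∇f(x_{t-1})),其中两次梯度在同一采样上计算。下面在一个两目标玩具问题上给出该递归式的示意实现;步长、系数与多梯度组合方式(此处简单取平均)均为假设,并非 STIMULUS 的原始算法。

```python
# 示意:STORM 风格的递归梯度估计,用于两目标玩具问题。
import numpy as np

rng = np.random.default_rng(0)
f_grads = [lambda x: 2 * (x - 1.0), lambda x: 2 * (x + 1.0)]  # 两个目标的梯度

x, x_prev = np.array([5.0]), np.array([5.0])
v = [g(x) for g in f_grads]          # 每个目标各维护一份递归估计
a, lr = 0.3, 0.05                    # 修正系数与步长(假设值)

for t in range(200):
    noise = rng.normal(scale=0.1, size=1)          # 模拟随机采样噪声
    for k, g in enumerate(f_grads):
        v[k] = g(x) + noise + (1 - a) * (v[k] - g(x_prev) - noise)
    d = np.mean(v, axis=0)           # 简化的多梯度组合(取平均)
    x_prev, x = x, x - lr * d

print(x)                             # 应收敛到两目标的折中点 0 附近
```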
zh
[AI-67] Robust Anomaly Detection in Network Traffic: Evaluating Machine Learning Models on CICIDS2017
【速读】:该论文试图解决在动态网络环境中选择合适的入侵检测系统(Intrusion Detection System, IDS)模型以实现有效且泛化的安全解决方案的问题。解决方案的关键在于通过控制实验对比四种代表性模型——多层感知机(Multi-Layer Perceptron, MLP)、一维卷积神经网络(1D Convolutional Neural Network, CNN)、单类支持向量机(One-Class Support Vector Machine, OCSVM)和局部离群因子(Local Outlier Factor, LOF)——在已知攻击类型检测和未知威胁泛化两个场景下的性能表现,从而为实际应用提供模型选择的实践指导。
链接: https://arxiv.org/abs/2506.19877
作者: Zhaoyang Xu,Yunbo Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to IEEE CNS 2025
Abstract:Identifying suitable machine learning paradigms for intrusion detection remains critical for building effective and generalizable security solutions. In this study, we present a controlled comparison of four representative models - Multi-Layer Perceptron (MLP), 1D Convolutional Neural Network (CNN), One-Class Support Vector Machine (OCSVM) and Local Outlier Factor (LOF) - on the CICIDS2017 dataset under two scenarios: detecting known attack types and generalizing to previously unseen threats. Our results show that supervised MLP and CNN achieve near-perfect accuracy on familiar attacks but suffer drastic recall drops on novel attacks. Unsupervised LOF attains moderate overall accuracy and high recall on unknown threats at the cost of elevated false alarms, while boundary-based OCSVM balances precision and recall best, demonstrating robust detection across both scenarios. These findings offer practical guidance for selecting IDS models in dynamic network environments.
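下面给出摘要所比较的两类无监督模型(OCSVM 与 LOF)在合成数据上的最小 sklearn 草图;参数为默认或假设值,与论文在 CICIDS2017 上的具体配置无关。

```python
# 示意:在合成流量特征上比较 One-Class SVM 与 LOF 的未知攻击检测。
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))                       # 仅含正常流量
X_test = np.vstack([rng.normal(size=(50, 10)),             # 正常流量
                    rng.normal(loc=4.0, size=(10, 10))])   # 未知攻击

ocsvm = OneClassSVM(nu=0.05).fit(X_train)
lof = LocalOutlierFactor(novelty=True).fit(X_train)

print("OCSVM:", ocsvm.predict(X_test)[-10:])               # -1 表示异常
print("LOF:  ", lof.predict(X_test)[-10:])
```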
zh
[AI-68] owards Provable (In)Secure Model Weight Release Schemes
【速读】:该论文试图解决现有安全权重释放方案在理论安全基础不足的问题,这些方案虽然声称能够在开放源码模型分发中保护模型所有权并防止滥用,但缺乏严格的密码学安全定义和形式化安全保障。解决方案的关键在于通过引入具体的安全部署定义来形式化权重释放方案的安全性,并以TaylorMLP为例进行实证分析,揭示其参数提取漏洞,从而展示其未能实现原有的非正式安全目标。
链接: https://arxiv.org/abs/2506.19874
作者: Xing Yang,Bingtao Wang,Yuhao Wang,Zimo Ji,Terry Jingchen Zhang,Wenyuan Jiang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures
Abstract:Recent secure weight release schemes claim to enable open-source model distribution while protecting model ownership and preventing misuse. However, these approaches lack rigorous security foundations and provide only informal security guarantees. Inspired by established works in cryptography, we formalize the security of weight release schemes by introducing several concrete security definitions. We then demonstrate our definition’s utility through a case study of TaylorMLP, a prominent secure weight release scheme. Our analysis reveals vulnerabilities that allow parameter extraction thus showing that TaylorMLP fails to achieve its informal security goals. We hope this work will advocate for rigorous research at the intersection of machine learning and security communities and provide a blueprint for how future weight release schemes should be designed and evaluated.
zh
[AI-69] An Attack Method for Medical Insurance Claim Fraud Detection based on Generative Adversarial Network
【速读】:该论文试图解决保险欺诈检测系统在面对对抗性攻击时的脆弱性问题,即当前系统缺乏标准化的防御机制,导致容易受到新型对抗性威胁的影响。解决方案的关键在于提出一种基于生成对抗网络(Generative Adversarial Network, GAN)的方法,通过生成看似合法的欺诈案例,实现对现有检测系统的对抗攻击,从而揭示其安全漏洞并强调提升模型鲁棒性的紧迫性。
链接: https://arxiv.org/abs/2506.19871
作者: Yining Pang,Chenghan Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2405.12076 by other authors
Abstract:Insurance fraud detection represents a pivotal advancement in modern insurance services, providing intelligent and digitalized monitoring to enhance management and prevent fraud. It is crucial for ensuring the security and efficiency of insurance systems. Although AI and machine learning algorithms have demonstrated strong performance in detecting fraudulent claims, the absence of standardized defense mechanisms renders current systems vulnerable to emerging adversarial threats. In this paper, we propose a GAN-based approach to conduct adversarial attacks on fraud detection systems. Our results indicate that an attacker, without knowledge of the training data or internal model details, can generate fraudulent cases that are classified as legitimate with a 99% attack success rate (ASR). By subtly modifying real insurance records and claims, adversaries can significantly increase the fraud risk, potentially bypassing compromised detection systems. These findings underscore the urgent need to enhance the robustness of insurance fraud detection models against adversarial manipulation, thereby ensuring the stability and reliability of different insurance systems.
zh
[AI-70] Secure Energy Transactions Using Blockchain Leverag ing AI for Fraud Detection and Energy Market Stability
【速读】:该论文试图解决去中心化能源市场中的安全性、交易真实性以及市场可靠性问题。其解决方案的关键在于将区块链技术与人工智能(Artificial Intelligence, AI)相结合,通过区块链层确保交易的不可篡改性和透明性,同时利用AI层进行欺诈行为检测和市场智能分析,从而构建一个安全、智能且高效的能源交易系统。
链接: https://arxiv.org/abs/2506.19870
作者: Md Asif Ul Hoq Khan,MD Zahedul Islam,Istiaq Ahmed,Md Masud Karim Rabbi,Farhana Rahman Anonna,MD Abdul Fahim Zeeshan,Mehedi Hasan Ridoy,Bivash Ranjan Chowdhury,Md Nazmul Shakir Rabbi,GM Alamin Sadnan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Peer-to-peer trading and the move to decentralized grids have reshaped the energy markets in the United States. Nevertheless, such developments introduce new challenges, mainly regarding the safety and authenticity of energy trading. This study aimed to develop and build a secure, intelligent, and efficient energy transaction system for the decentralized US energy market. This research interlinks the technological prowess of blockchain and artificial intelligence (AI) in a novel way to solve long-standing challenges in the distributed energy market, specifically those of security, fraudulent behavior detection, and market reliability. The dataset for this research comprises more than 1.2 million anonymized energy transaction records from a simulated peer-to-peer (P2P) energy exchange network emulating real-life blockchain-based American microgrids, including those tested by LO3 Energy and Grid+ Labs. Each record contains detailed fields for transaction identifier, timestamp, energy volume (kWh), transaction type (buy/sell), unit price, prosumer/consumer identifier (hashed for privacy), smart meter readings, geolocation regions, and settlement confirmation status. The dataset also includes system-calculated behavior metrics such as transaction rate, variability of energy production, and historical pricing patterns. The proposed system architecture involves the integration of two layers, namely a blockchain layer and an artificial intelligence (AI) layer, each playing a unique but complementary role in securing energy transactions and improving market intelligence. The machine learning models used in this research were chosen for their established high performance in classification tasks, particularly in the identification of energy transaction fraud in decentralized markets.
zh
[AI-71] DeepQuark: deep-neural-network approach to multiquark bound states
【速读】:该论文旨在解决多夸克束缚态的计算问题,这类系统由于强SU(3)色相互作用而比电子或核子系统更加复杂。其关键解决方案是设计了一种新型高效架构DeepQuark,以应对多夸克系统中的强关联、额外离散量子数以及难以处理的禁闭相互作用等独特挑战。通过引入基于深度神经网络的变分蒙特卡洛方法,该研究在核子、双重重四夸克和全重四夸克系统中表现出与当前先进方法相当的性能,并在五夸克系统中超越了现有计算结果。
链接: https://arxiv.org/abs/2506.20555
作者: Wei-Lin Wu,Lu Meng,Shi-Lin Zhu
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); High Energy Physics - Lattice (hep-lat); Nuclear Theory (nucl-th)
备注: 10 pages, 3 figures, 6 tables
Abstract:For the first time, we implement the deep-neural-network-based variational Monte Carlo approach for the multiquark bound states, whose complexity surpasses that of electron or nucleon systems due to strong SU(3) color interactions. We design a novel and high-efficiency architecture, DeepQuark, to address the unique challenges in multiquark systems such as stronger correlations, extra discrete quantum numbers, and intractable confinement interaction. Our method demonstrates competitive performance with state-of-the-art approaches, including diffusion Monte Carlo and Gaussian expansion method, in the nucleon, doubly heavy tetraquark, and fully heavy tetraquark systems. Notably, it outperforms existing calculations for pentaquarks, exemplified by the triply heavy pentaquark. For the nucleon, we successfully incorporate three-body flux-tube confinement interactions without additional computational costs. In tetraquark systems, we consistently describe the hadronic molecule $T_{cc}$ and the compact tetraquark $T_{bb}$ with an unbiased form of wave function ansatz. In the pentaquark sector, we obtain a weakly bound $\bar{D}^*\Xi_{cc}^*$ molecule $P_{cc\bar{c}}(5715)$ with $S=\frac{5}{2}$ and its bottom partner $P_{bb\bar{b}}(15569)$. They can be viewed as analogs of the molecular $T_{cc}$. We recommend an experimental search for $P_{cc\bar{c}}(5715)$ in the $D$-wave $J/\psi\,\Lambda_c$ channel. DeepQuark holds great promise for extension to larger multiquark systems, overcoming the computational barriers in conventional methods. It also serves as a powerful framework for exploring confining mechanism beyond two-body interactions in multiquark states, which may offer valuable insights into nonperturbative QCD and general many-body physics.
zh
[AI-72] Valid Selection among Conformal Sets
【速读】:该论文试图解决在存在多个有效共形预测集(conformal prediction sets)的情况下,如何选择最优预测集(如最小的预测集)而不破坏覆盖保证(coverage guarantees)的问题。解决方案的关键在于提出一种基于稳定性的方法,该方法能够在选择预测集的同时确保覆盖概率的准确性。
链接: https://arxiv.org/abs/2506.20173
作者: Mahmoud Hegazy,Liviu Aolaritei,Michael I. Jordan,Aymeric Dieuleveut
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)
备注:
Abstract:Conformal prediction offers a distribution-free framework for constructing prediction sets with coverage guarantees. In practice, multiple valid conformal prediction sets may be available, arising from different models or methodologies. However, selecting the most desirable set, such as the smallest, can invalidate the coverage guarantees. To address this challenge, we propose a stability-based approach that ensures coverage for the selected prediction set. We extend our results to the online conformal setting, propose several refinements in settings where additional structure is available, and demonstrate its effectiveness through experiments.
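作为背景,下面给出标准分裂共形分类预测集的最小实现(该论文研究的是如何在多个这类集合之间做保持覆盖率的选择,此处仅演示单个共形集合的构造);数据为合成示例。

```python
# 示意:分裂共形预测——用校准集的非一致性分数构造覆盖率约 1-alpha 的预测集。
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # 非一致性分数
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=100)    # 校准集的类别概率
cal_labels = rng.integers(0, 3, size=100)
test_probs = rng.dirichlet(np.ones(3), size=5)
print(conformal_sets(cal_probs, cal_labels, test_probs))
```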
zh
[AI-73] Do psychic cells generate consciousness?
【速读】:该论文试图解决关于意识产生的细胞层面机制这一基础科学问题,特别是探讨大脑中皮层锥体神经元(cortical pyramidal neurons)在意识处理中的作用。其解决方案的关键在于揭示分布于锥体细胞树突上的特定代谢型受体(metabotropic receptors)作为关键细胞机制,能够解释麻醉诱导意识丧失时反馈信号的选择性破坏。
链接: https://arxiv.org/abs/2506.20164
作者: Mototaka Suzuki,Jaan Aru
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:
Abstract:Technological advances in the past decades have begun to enable neuroscientists to address fundamental questions about consciousness in an unprecedented way. Here we review remarkable recent progress in our understanding of cellular-level mechanisms of conscious processing in the brain. Of particular interest are the cortical pyramidal neurons – or “psychic cells” called by Ramón y Cajal more than 100 years ago – which have an intriguing cellular mechanism that accounts for selective disruption of feedback signaling in the brain upon anesthetic-induced loss of consciousness. Importantly, a particular class of metabotropic receptors distributed over the dendrites of pyramidal cells are highlighted as the key cellular mechanism. After all, Cajal’s instinct over a century ago may turn out to be correct – we may have just begun to understand whether and how psychic cells indeed generate and control our consciousness.
zh
[AI-74] Quantum Neural Networks for Propensity Score Estimation and Survival Analysis in Observational Biomedical Studies
【速读】:该论文试图解决在比较腹腔镜与开腹手术治疗结直肠癌患者生存结果时的选取偏差问题,通过生成式 AI (Generative AI) 构建的量子神经网络(Quantum Neural Networks, QNNs)进行倾向得分估计。其解决方案的关键在于采用线性 ZFeatureMap 进行数据编码、SummedPaulis 算子进行预测,并利用协方差矩阵自适应进化策略(CMA-ES)实现噪声环境下的鲁棒优化,同时引入方差正则化以减轻量子测量噪声的影响。此外,通过模拟硬件噪声和遗传匹配策略优化倾向得分匹配与加权,有效实现了协变量平衡,并在调整后未发现显著的生存差异,表明该方法在小样本、高维生物医学数据中的因果推断潜力。
链接: https://arxiv.org/abs/2506.19973
作者: Vojtěch Novák,Ivan Zelinka,Lenka Přibylová,Lubomír Martínek
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:This study investigates the application of quantum neural networks (QNNs) for propensity score estimation to address selection bias in comparing survival outcomes between laparoscopic and open surgical techniques in a cohort of 1177 colorectal carcinoma patients treated at University Hospital Ostrava (2001-2009). Using a dataset with 77 variables, including patient demographics and tumor characteristics, we developed QNN-based propensity score models focusing on four key covariates (Age, Sex, Stage, BMI). The QNN architecture employed a linear ZFeatureMap for data encoding, a SummedPaulis operator for predictions, and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for robust, gradient-free optimization in noisy quantum environments. Variance regularization was integrated to mitigate quantum measurement noise, with simulations conducted under exact, sampling (1024 shots), and noisy hardware (FakeManhattanV2) conditions. QNNs, particularly with simulated hardware noise, outperformed classical logistic regression and gradient boosted machines in small samples (AUC up to 0.750 for n=100), with noise modeling enhancing predictive stability. Propensity score matching and weighting, optimized via genetic matching and matching weights, achieved covariate balance with standardized mean differences of 0.0849 and 0.0869, respectively. Survival analyses using Kaplan-Meier estimation, Cox proportional hazards, and Aalen additive regression revealed no significant survival differences post-adjustment (p-values 0.287-0.851), indicating confounding bias in unadjusted outcomes. These results highlight QNNs’ potential, enhanced by CMA-ES and noise-aware strategies, to improve causal inference in biomedical research, particularly for small-sample, high-dimensional datasets.
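作为经典(非量子)对照,下面给出逻辑回归倾向得分、matching weights 以及标准化均值差(SMD)检查的最小草图;协变量与数据均为合成示例,仅用于说明摘要中提到的平衡诊断流程。

```python
# 示意:逻辑回归倾向得分 + matching weights,并用 SMD 检查协变量平衡。
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                     # 例如 Age, Sex, Stage, BMI
z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # 处理指示(两种术式)

e = LogisticRegression().fit(X, z).predict_proba(X)[:, 1]   # 倾向得分
w = np.minimum(e, 1 - e) / np.where(z == 1, e, 1 - e)       # matching weights

def smd(x, z, w):
    m1 = np.average(x[z == 1], weights=w[z == 1])
    m0 = np.average(x[z == 0], weights=w[z == 0])
    s = np.sqrt((x[z == 1].var() + x[z == 0].var()) / 2)
    return abs(m1 - m0) / s

print([round(smd(X[:, j], z, w), 3) for j in range(4)])     # 各协变量的 SMD
```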
zh
[AI-75] An ab initio foundation model of wavefunctions that accurately describes chemical bond breaking
【速读】:该论文试图解决量子化学中键断裂的可靠描述问题,这一问题由于解离物种电子结构的多参考特性而具有挑战性。传统多参考方法在计算上成本高昂,且对于每个体系都需要从头计算,未能利用分子间电子结构的共性。该研究提出的解决方案的关键在于引入Orbformer,这是一种基于深度神经网络的可迁移波函数模型,通过在22,000个平衡和解离结构上进行预训练,能够对未见过的分子进行微调,从而实现与经典多参考方法相当的精度-成本比。
链接: https://arxiv.org/abs/2506.19960
作者: Adam Foster,Zeno Schätzle,P. Bernát Szabó,Lixue Cheng,Jonas Köhler,Gino Cassella,Nicholas Gao,Jiawei Li,Frank Noé,Jan Hermann
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Reliable description of bond breaking remains a major challenge for quantum chemistry due to the multireferential character of the electronic structure in dissociating species. Multireferential methods in particular suffer from large computational cost, which under the normal paradigm has to be paid anew for each system at a full price, ignoring commonalities in electronic structure across molecules. Quantum Monte Carlo with deep neural networks (deep QMC) uniquely offers to exploit such commonalities by pretraining transferable wavefunction models, but all such attempts were so far limited in scope. Here, we bring this new paradigm to fruition with Orbformer, a novel transferable wavefunction model pretrained on 22,000 equilibrium and dissociating structures that can be fine-tuned on unseen molecules reaching an accuracy-cost ratio rivalling classical multireferential methods. On established benchmarks as well as more challenging bond dissociations and Diels-Alder reactions, Orbformer is the only method that consistently converges to chemical accuracy (1 kcal/mol). This work turns the idea of amortizing the cost of solving the Schrödinger equation over many molecules into a practical approach in quantum chemistry.
zh
[AI-76] MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition INTERSPEECH2025
【速读】:该论文旨在解决自然条件下语音情感识别(Speech Emotion Recognition in Naturalistic Conditions, SERNC)中的分类情感识别和情感属性预测问题。其关键解决方案是提出多层级声学-文本情感表示(Multi-level Acoustic-Textual Emotion Representation, MATER),该框架通过在词、话语和嵌入层面上融合声学与文本特征,结合低级词汇和声学线索与高级上下文表示,有效捕捉细微的韵律变化和语义细节。此外,引入不确定性感知的集成策略以减轻标注者不一致性,提升模糊情感表达的鲁棒性。
链接: https://arxiv.org/abs/2506.19887
作者: Hyo Jin Jon,Longbin Jin,Hyuntaek Jung,Hyunseo Kim,Donghun Min,Eun Yi Kim
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 5 pages, 4 figures, 2 tables, 1 algorithm, Accepted to INTERSPEECH 2025
Abstract:This paper presents our contributions to the Speech Emotion Recognition in Naturalistic Conditions (SERNC) Challenge, where we address categorical emotion recognition and emotional attribute prediction. To handle the complexities of natural speech, including intra- and inter-subject variability, we propose Multi-level Acoustic-Textual Emotion Representation (MATER), a novel hierarchical framework that integrates acoustic and textual features at the word, utterance, and embedding levels. By fusing low-level lexical and acoustic cues with high-level contextualized representations, MATER effectively captures both fine-grained prosodic variations and semantic nuances. Additionally, we introduce an uncertainty-aware ensemble strategy to mitigate annotator inconsistencies, improving robustness in ambiguous emotional expressions. MATER ranks fourth in both tasks with a Macro-F1 of 41.01% and an average CCC of 0.5928, securing second place in valence prediction with an impressive CCC of 0.6941.
zh
[AI-77] Physics-Guided Radiotherapy Treatment Planning with Deep Learning
【速读】:该论文旨在解决放射治疗中因解剖结构变化而需要频繁调整治疗计划所带来的效率问题,提出了一种基于物理引导的深度学习两阶段流水线以实现自动化治疗计划生成。其解决方案的关键在于第一阶段通过直接监督训练网络学习治疗计划参数(包括MLC和MU值),第二阶段则引入基于预测三维剂量分布的额外监督信号,从而将物理约束融入训练过程中,提升计划的准确性与临床适用性。
链接: https://arxiv.org/abs/2506.19880
作者: Stefanos Achlatis,Efstratios Gavves,Jan-Jakob Sonke
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Radiotherapy (RT) is a critical cancer treatment, with volumetric modulated arc therapy (VMAT) being a commonly used technique that enhances dose conformity by dynamically adjusting multileaf collimator (MLC) positions and monitor units (MU) throughout gantry rotation. Adaptive radiotherapy requires frequent modifications to treatment plans to account for anatomical variations, necessitating time-efficient solutions. Deep learning offers a promising solution to automate this process. To this end, we propose a two-stage, physics-guided deep learning pipeline for radiotherapy planning. In the first stage, our network is trained with direct supervision on treatment plan parameters, consisting of MLC and MU values. In the second stage, we incorporate an additional supervision signal derived from the predicted 3D dose distribution, integrating physics-based guidance into the training process. We train and evaluate our approach on 133 prostate cancer patients treated with a uniform 2-arc VMAT protocol delivering a dose of 62 Gy to the planning target volume (PTV). Our results demonstrate that the proposed approach, implemented using both 3D U-Net and UNETR architectures, consistently produces treatment plans that closely match clinical ground truths. Our method achieves a mean difference of D95% = 0.42 +/- 1.83 Gy and V95% = -0.22 +/- 1.87% at the PTV while generating dose distributions that reduce radiation exposure to organs at risk. These findings highlight the potential of physics-guided deep learning in RT planning.
zh
[AI-78] Speaker Embeddings to Improve Tracking of Intermittent and Moving Speakers
【速读】:该论文试图解决在说话人位置发生变化的场景下,传统基于空间观测的说话人跟踪方法所面临的身份分配不连贯问题(speaker tracking methods often rely on spatial observations to assign coherent track identities over time)。解决方案的关键在于在跟踪过程后使用说话人嵌入(speaker embeddings)进行身份重新分配,通过初始跟踪步骤提供的轨迹信息和多通道音频信号,结合波束成形技术增强说话人位置的信号以计算嵌入,并基于注册池为轨迹分配新的身份。
链接: https://arxiv.org/abs/2506.19875
作者: Taous Iatariene(MULTISPEECH),Can Cui(MULTISPEECH),Alexandre Guérin,Romain Serizel(MULTISPEECH)
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 33rd European Signal Processing Conference (EUSIPCO 2025), Sep 2025, Palermo, Italy
Abstract:Speaker tracking methods often rely on spatial observations to assign coherent track identities over time. This raises limits in scenarios with intermittent and moving speakers, i.e., speakers that may change position when they are inactive, thus leading to discontinuous spatial trajectories. This paper proposes to investigate the use of speaker embeddings, in a simple solution to this issue. We propose to perform identity reassignment post-tracking, using speaker embeddings. We leverage trajectory-related information provided by an initial tracking step and multichannel audio signal. Beamforming is used to enhance the signal towards the speakers’ positions in order to compute speaker embeddings. These are then used to assign new track identities based on an enrollment pool. We evaluate the performance of the proposed speaker embedding-based identity reassignment method on a dataset where speakers change position during inactivity periods. Results show that it consistently improves the identity assignment performance of neural and standard tracking systems. In particular, we study the impact of beamforming and input duration for embedding extraction.
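下面是"基于注册池用说话人嵌入为轨迹重新分配身份"这一后处理步骤的最小示意;嵌入维度与数据均为假设,实际系统中嵌入来自波束成形增强后的音频。

```python
# 示意:将轨迹嵌入与注册池比对,按最高余弦相似度重新分配身份。
import numpy as np

def reassign_identities(track_embs, enroll_embs, enroll_ids):
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(track_embs) @ normalize(enroll_embs).T
    return [enroll_ids[i] for i in sims.argmax(axis=1)]

rng = np.random.default_rng(0)
enroll = rng.normal(size=(3, 192))                  # 3 位注册说话人的嵌入
tracks = enroll[[2, 0, 2]] + 0.1 * rng.normal(size=(3, 192))
print(reassign_identities(tracks, enroll, ["A", "B", "C"]))  # ['C', 'A', 'C']
```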
zh
[AI-79] Scalable and Cost-Efficient de Novo Template-Based Molecular Generation
【速读】:该论文旨在解决模板基础的生成流网络(GFlowNets)在分子生成中的三个核心挑战:最小化合成成本、扩展至大规模构建块库以及有效利用小片段集合。其解决方案的关键在于提出一种**递归成本引导(Recursive Cost Guidance)机制,该机制通过辅助机器学习模型近似合成成本和可行性,从而在反向策略框架中引导生成过程向低成本合成路径迁移。此外,为提升小规模构建块库下的性能,还引入了动态库(Dynamic Library)**机制,通过重用高奖励中间状态构建完整的合成树,从而显著提升了合成效率、分子多样性和质量。
链接: https://arxiv.org/abs/2506.19865
作者: Piotr Gaiński,Oussama Boussif,Andrei Rekesh,Dmytro Shevchuk,Ali Parviz,Mike Tyers,Robert A. Batey,Michał Koziarski
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Template-based molecular generation offers a promising avenue for drug design by ensuring generated compounds are synthetically accessible through predefined reaction templates and building blocks. In this work, we tackle three core challenges in template-based GFlowNets: (1) minimizing synthesis cost, (2) scaling to large building block libraries, and (3) effectively utilizing small fragment sets. We propose Recursive Cost Guidance, a backward policy framework that employs auxiliary machine learning models to approximate synthesis cost and viability. This guidance steers generation toward low-cost synthesis pathways, significantly enhancing cost-efficiency, molecular diversity, and quality, especially when paired with an Exploitation Penalty that balances the trade-off between exploration and exploitation. To enhance performance in smaller building block libraries, we develop a Dynamic Library mechanism that reuses intermediate high-reward states to construct full synthesis trees. Our approach establishes state-of-the-art results in template-based molecular generation.
zh
[AI-80] Exploring the Capabilities of the Frontier Large Language Models for Nuclear Energy Research
【速读】:该论文试图解决如何利用生成式 AI (Generative AI) 加速核能研究中的聚变与裂变科学问题,具体包括提升研究效率、优化实验设计及自动化模拟流程等。解决方案的关键在于通过专家驱动的提示工程(prompt engineering)和将 AI 作为物理模型的补充工具,而非替代方案,从而有效利用大型语言模型(LLMs)在早期探索、文献综述和工作流设计方面的优势,同时克服其在新材料设计、复杂代码生成及领域细节准确性方面的局限性。
链接: https://arxiv.org/abs/2506.19863
作者: Ahmed Almeldein,Mohammed Alnaggar,Rick Archibald,Tom Beck,Arpan Biswas,Rike Bostelmann,Wes Brewer,Chris Bryan,Christopher Calle,Cihangir Celik,Rajni Chahal,Jong Youl Choi,Arindam Chowdhury,Mark Cianciosa,Franklin Curtis,Gregory Davidson,Sebastian De Pascuale,Lisa Fassino,Ana Gainaru,Yashika Ghai,Luke Gibson,Qian Gong,Christopher Greulich,Scott Greenwood,Cory Hauck,Ehab Hassan,Rinkle Juneja,Soyoung Kang,Scott Klasky,Atul Kumar,Vineet Kumar,Paul Laiu,Calvin Lear,Yan-Ru Lin,Jono McConnell,Furkan Oz,Anant Raj,Pradeep Ramuhalli,Marie Romedenne,Samantha Sabatino,José Salcedo-Pérez,Nathan D. See,Arpan Sircar,Punam Thankur,Tim Younkin,Xiao-Ying Yu,Prashant Jain,Tom Evans,Prasanna Balaprakash
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:The AI for Nuclear Energy workshop at Oak Ridge National Laboratory evaluated the potential of Large Language Models (LLMs) to accelerate fusion and fission research. Fourteen interdisciplinary teams explored diverse nuclear science challenges using ChatGPT, Gemini, Claude, and other AI models over a single day. Applications ranged from developing foundation models for fusion reactor control to automating Monte Carlo simulations, predicting material degradation, and designing experimental programs for advanced reactors. Teams employed structured workflows combining prompt engineering, deep research capabilities, and iterative refinement to generate hypotheses, prototype code, and research strategies. Key findings demonstrate that LLMs excel at early-stage exploration, literature synthesis, and workflow design, successfully identifying research gaps and generating plausible experimental frameworks. However, significant limitations emerged, including difficulties with novel materials designs, advanced code generation for modeling and simulation, and domain-specific details requiring expert validation. The successful outcomes resulted from expert-driven prompt engineering and treating AI as a complementary tool rather than a replacement for physics-based methods. The workshop validated AI’s potential to accelerate nuclear energy research through rapid iteration and cross-disciplinary synthesis while highlighting the need for curated nuclear-specific datasets, workflow automation, and specialized model development. These results provide a roadmap for integrating AI tools into nuclear science workflows, potentially reducing development cycles for safer, more efficient nuclear energy systems while maintaining rigorous scientific standards.
zh
[AI-81] DualEquiNet: A Dual-Space Hierarchical Equivariant Network for Large Biomolecules
【速读】:该论文旨在解决几何图神经网络(Geometric Graph Neural Networks, GNNs)在应用于大型生物分子(如RNA和蛋白质)时所面临的可扩展性和表达能力不足的问题。现有方法通常仅在欧几里得空间或球面谐波空间中操作,难以同时捕捉原子级细节与长程对称性感知的依赖关系。论文提出的解决方案是DualEquiNet,其关键在于构建欧几里得空间和球面谐波空间中的互补表示,通过双向跨空间信息传递和新型跨空间交互池化机制,实现原子特征向生物上有意义单元(如残基)的分层聚合,从而高效且有效地进行多尺度建模。
链接: https://arxiv.org/abs/2506.19862
作者: Junjie Xu,Jiahao Zhang,Mangal Prakash,Xiang Zhang,Suhang Wang
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Geometric graph neural networks (GNNs) that respect E(3) symmetries have achieved strong performance on small molecule modeling, but they face scalability and expressiveness challenges when applied to large biomolecules such as RNA and proteins. These systems require models that can simultaneously capture fine-grained atomic interactions, long-range dependencies across spatially distant components, and biologically relevant hierarchical structure, such as atoms forming residues, which in turn form higher-order domains. Existing geometric GNNs, which typically operate exclusively in either Euclidean or Spherical Harmonics space, are limited in their ability to capture both the fine-scale atomic details and the long-range, symmetry-aware dependencies required for modeling the multi-scale structure of large biomolecules. We introduce DualEquiNet, a Dual-Space Hierarchical Equivariant Network that constructs complementary representations in both Euclidean and Spherical Harmonics spaces to capture local geometry and global symmetry-aware features. DualEquiNet employs bidirectional cross-space message passing and a novel Cross-Space Interaction Pooling mechanism to hierarchically aggregate atomic features into biologically meaningful units, such as residues, enabling efficient and expressive multi-scale modeling for large biomolecular systems. DualEquiNet achieves state-of-the-art performance on multiple existing benchmarks for RNA property prediction and protein modeling, and outperforms prior methods on two newly introduced 3D structural benchmarks demonstrating its broad effectiveness across a range of large biomolecule modeling tasks.
zh
机器学习
[LG-0] DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
链接: https://arxiv.org/abs/2506.20668
作者: Sungjae Park,Homanga Bharadhwaj,Shubham Tulsiani
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注: Preprint (17 pages). Under Review
Abstract:We propose DemoDiffusion, a simple and scalable method for enabling robots to perform manipulation tasks in natural environments by imitating a single human demonstration. Our approach is based on two key insights. First, the hand motion in a human demonstration provides a useful prior for the robot’s end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Our approach avoids the need for online reinforcement learning or paired human-robot data, enabling robust adaptation to new tasks and scenes with minimal manual effort. Experiments in both simulation and real-world settings show that DemoDiffusion outperforms both the base policy and the retargeted trajectory, enabling the robot to succeed even on tasks where the pre-trained generalist policy fails entirely. Project page: this https URL
[LG-1] Hear No Evil: Detecting Gradient Leakage by Malicious Servers in Federated Learning
链接: https://arxiv.org/abs/2506.20651
作者: Fei Wang,Baochun Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Recent work has shown that gradient updates in federated learning (FL) can unintentionally reveal sensitive information about a client’s local data. This risk becomes significantly greater when a malicious server manipulates the global model to provoke information-rich updates from clients. In this paper, we adopt a defender’s perspective to provide the first comprehensive analysis of malicious gradient leakage attacks and the model manipulation techniques that enable them. Our investigation reveals a core trade-off: these attacks cannot be both highly effective in reconstructing private data and sufficiently stealthy to evade detection – especially in realistic FL settings that incorporate common normalization techniques and federated averaging. Building on this insight, we argue that malicious gradient leakage attacks, while theoretically concerning, are inherently limited in practice and often detectable through basic monitoring. As a complementary contribution, we propose a simple, lightweight, and broadly applicable client-side detection mechanism that flags suspicious model updates before local training begins, despite the fact that such detection may not be strictly necessary in realistic FL settings. This mechanism further underscores the feasibility of defending against these attacks with minimal overhead, offering a deployable safeguard for privacy-conscious federated learning systems.
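下面给出摘要所述"本地训练开始前标记可疑全局模型更新"这类轻量客户端检测的一个最小示意;逐层范数统计与 z 分数阈值均为假设,并非论文的具体机制。

```python
# 示意:客户端在训练前检查全局模型逐层更新幅度,偏离历史统计则告警。
import numpy as np

history = {"layer0": [], "layer1": []}       # 各层更新范数的历史记录

def suspicious(prev, curr, history, z_thresh=3.0):
    flags = []
    for name in prev:
        delta = np.linalg.norm(curr[name] - prev[name])
        h = history[name]
        if len(h) >= 5:
            mu, sd = np.mean(h), np.std(h) + 1e-8
            if abs(delta - mu) / sd > z_thresh:   # 更新幅度显著偏离历史
                flags.append(name)
        h.append(delta)
    return flags

prev = {k: np.zeros(10) for k in history}
for _ in range(6):                            # 先积累若干正常轮次的统计
    suspicious(prev, {k: np.full(10, 0.01) for k in history}, history)
curr = {"layer0": np.full(10, 0.01), "layer1": np.full(10, 5.0)}
print(suspicious(prev, curr, history))        # 预期标记 'layer1'
```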
[LG-2] Mastering Multiple-Expert Routing: Realizable H-Consistency and Strong Guarantees for Learning to Defer ICML2025
链接: https://arxiv.org/abs/2506.20650
作者: Anqi Mao,Mehryar Mohri,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: ICML 2025
Abstract:The problem of learning to defer with multiple experts consists of optimally assigning input instances to experts, balancing the trade-off between their accuracy and computational cost. This is a critical challenge in natural language generation, but also in other fields such as image processing and medical diagnostics. Recent studies have proposed surrogate loss functions to optimize deferral, but challenges remain in ensuring their consistency properties. This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage (jointly learning predictor and deferral function) and two-stage (learning only the deferral function with a fixed expert) learning scenarios. For single-stage deferral, we introduce a family of new realizable $H$-consistent surrogate losses and further prove $H$-consistency for a selected member. For two-stage deferral, we derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, the multiple-expert scenario. Additionally, we provide enhanced theoretical guarantees under low-noise assumptions for both scenarios. Finally, we report the results of experiments using our proposed surrogate losses, comparing their performance against existing baselines.
[LG-3] Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices
链接: https://arxiv.org/abs/2506.20644
作者: Hangyu Li,Hongyue Wu,Guodong Fan,Zhen Zhang,Shizhan Chen,Zhiyong Feng
类目: Machine Learning (cs.LG)
备注: Accepted by ICWS 2025
Abstract:As privacy protection gains increasing importance, more models are being trained on edge devices and subsequently merged into the central server through Federated Learning (FL). However, current research overlooks the impact of network topology, physical distance, and data heterogeneity on edge devices, leading to issues such as increased latency and degraded model performance. To address these issues, we propose a new federated learning scheme for edge devices called Federated Learning with Encrypted Data Sharing (FedEDS). FedEDS uses the client model and the model’s stochastic layer to train the data encryptor. The data encryptor generates encrypted data and shares it with other clients. Each client uses the corresponding client’s stochastic layer and encrypted data to train and adjust the local model. FedEDS uses the client’s local private data and encrypted shared data from other clients to train the model. This approach accelerates the convergence speed of federated learning training and mitigates the negative impact of data heterogeneity, making it suitable for application services deployed on edge devices requiring rapid convergence. Experimental results show the efficacy of FedEDS in promoting model performance.
[LG-4] Lost in Retraining: Roaming the Parameter Space of Exponential Families Under Closed-Loop Learning
链接: https://arxiv.org/abs/2506.20623
作者: Fariba Jangjoo,Matteo Marsili,Yasser Roudi
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
备注: 13 pages, 2 figures
Abstract:Closed-loop learning is the process of repeatedly estimating a model from data generated from the model itself. It is receiving great attention due to the possibility that large neural network models may, in the future, be primarily trained with data generated by artificial neural networks themselves. We study this process for models that belong to exponential families, deriving equations of motion that govern the dynamics of the parameters. We show that maximum likelihood estimation of the parameters endows sufficient statistics with the martingale property and that as a result the process converges to absorbing states that amplify initial biases present in the data. However, we show that this outcome may be prevented by polluting the data with an infinitesimal fraction of data points generated from a fixed model, by relying on maximum a posteriori estimation or by introducing regularisation. Furthermore, we show that the asymptotic behavior of the dynamics is not reparametrisation invariant.
[LG-5] H-FEX: A Symbolic Learning Method for Hamiltonian Systems
链接: https://arxiv.org/abs/2506.20607
作者: Jasen Lai,Senwei Liang,Chunmei Wang
类目: Machine Learning (cs.LG)
备注: 16 pages, 7 figures
Abstract:Hamiltonian systems describe a broad class of dynamical systems governed by Hamiltonian functions, which encode the total energy and dictate the evolution of the system. Data-driven approaches, such as symbolic regression and neural network-based methods, provide a means to learn the governing equations of dynamical systems directly from observational data of Hamiltonian systems. However, these methods often struggle to accurately capture complex Hamiltonian functions while preserving energy conservation. To overcome this limitation, we propose the Finite Expression Method for learning Hamiltonian Systems (H-FEX), a symbolic learning method that introduces novel interaction nodes designed to capture intricate interaction terms effectively. Our experiments, including those on highly stiff dynamical systems, demonstrate that H-FEX can recover Hamiltonian functions of complex systems that accurately capture system dynamics and preserve energy over long time horizons. These findings highlight the potential of H-FEX as a powerful framework for discovering closed-form expressions of complex dynamical systems.
[LG-6] he kernel of graph indices for vector search
链接: https://arxiv.org/abs/2506.20584
作者: Mariano Tepper,Ted Willke
类目: Machine Learning (cs.LG)
备注:
Abstract:The most popular graph indices for vector search use principles from computational geometry to build the graph. Hence, their formal graph navigability guarantees are only valid in Euclidean space. In this work, we show that machine learning can be used to build graph indices for vector search in metric and non-metric vector spaces (e.g., for inner product similarity). From this novel perspective, we introduce the Support Vector Graph (SVG), a new type of graph index that leverages kernel methods to establish the graph connectivity and that comes with formal navigability guarantees valid in metric and non-metric vector spaces. In addition, we interpret the most popular graph indices, including HNSW and DiskANN, as particular specializations of SVG and show that new indices can be derived from the principles behind this specialization. Finally, we propose SVG-L0 that incorporates an $\ell_0$ sparsity constraint into the SVG kernel method to build graphs with a bounded out-degree. This yields a principled way of implementing this practical requirement, in contrast to the traditional heuristic of simply truncating the out edges of each node. Additionally, we show that SVG-L0 has a self-tuning property that avoids the heuristic of using a set of candidates to find the out-edges of each node and that keeps its computational complexity in check.
[LG-7] Exploring Graph-Transformer Out-of-Distribution Generalization Abilities
链接: https://arxiv.org/abs/2506.20575
作者: Itay Niv,Neta Rabin
类目: Machine Learning (cs.LG)
备注:
Abstract:Deep learning on graphs has shown remarkable success across numerous applications, including social networks, bio-physics, traffic networks, and recommendation systems. Regardless of their successes, current methods frequently depend on the assumption that training and testing data share the same distribution, a condition rarely met in real-world scenarios. While graph-transformer (GT) backbones have recently outperformed traditional message-passing neural networks (MPNNs) in multiple in-distribution (ID) benchmarks, their effectiveness under distribution shifts remains largely unexplored. In this work, we address the challenge of out-of-distribution (OOD) generalization for graph neural networks, with a special focus on the impact of backbone architecture. We systematically evaluate GT and hybrid backbones in OOD settings and compare them to MPNNs. To do so, we adapt several leading domain generalization (DG) algorithms to work with GTs and assess their performance on a benchmark designed to test a variety of distribution shifts. Our results reveal that GT and hybrid GT-MPNN backbones consistently demonstrate stronger generalization ability compared to MPNNs, even without specialized DG algorithms. Additionally, we propose a novel post-training analysis approach that compares the clustering structure of the entire ID and OOD test datasets, specifically examining domain alignment and class separation. Demonstrating its model-agnostic design, this approach not only provided meaningful insights into GT and MPNN backbones. It also shows promise for broader applicability to DG problems beyond graph learning, offering a deeper perspective on generalization abilities that goes beyond standard accuracy metrics. Together, our findings highlight the promise of graph-transformers for robust, real-world graph learning and set a new direction for future research in OOD generalization.
[LG-8] Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series VLDB2026
链接: https://arxiv.org/abs/2506.20574
作者: Laura Boggia,Rafael Teixeira de Lima,Bogdan Malaescu
类目: Machine Learning (cs.LG); Methodology (stat.ME)
备注: Submitted to VLDB 2026 conference, currently under review
Abstract:Anomaly detection in multivariate time series is an important problem across various fields such as healthcare, financial services, manufacturing or physics detector monitoring. Accurately identifying when unexpected errors or faults occur is essential, yet challenging, due to the unknown nature of anomalies and the complex interdependencies between time series dimensions. In this paper, we investigate transformer-based approaches for time series anomaly detection, focusing on the recently proposed iTransformer architecture. Our contributions are fourfold: (i) we explore the application of the iTransformer to time series anomaly detection, and analyse the influence of key parameters such as window size, step size, and model dimensions on performance; (ii) we examine methods for extracting anomaly labels from multidimensional anomaly scores and discuss appropriate evaluation metrics for such labels; (iii) we study the impact of anomalous data present during training and assess the effectiveness of alternative loss functions in mitigating their influence; and (iv) we present a comprehensive comparison of several transformer-based models across a diverse set of datasets for time series anomaly detection.
[LG-9] Demonstration of effective UCB-based routing in skill-based queues on real-world data
链接: https://arxiv.org/abs/2506.20543
作者: Sanne van Kempen,Jaron Sanders,Fiona Sloothaak,Maarten G. Wolf
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This paper is about optimally controlling skill-based queueing systems such as data centers, cloud computing networks, and service systems. By means of a case study using a real-world data set, we investigate the practical implementation of a recently developed reinforcement learning algorithm for optimal customer routing. Our experiments show that the algorithm efficiently learns and adapts to changing environments and outperforms static benchmark policies, indicating its potential for live implementation. We also augment the real-world applicability of this algorithm by introducing a new heuristic routing rule to reduce delays. Moreover, we show that the algorithm can optimize for multiple objectives: next to payoff maximization, secondary objectives such as server load fairness and customer waiting time reduction can be incorporated. Tuning parameters are used for balancing inherent performance trade–offs. Lastly, we investigate the sensitivity to estimation errors and parameter tuning, providing valuable insights for implementing adaptive routing algorithms in complex real-world queueing systems.
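As a rough illustration of the bandit view behind such routing, the sketch below assigns each arriving customer class to the compatible server with the highest UCB1 index on empirical payoffs; the class/server structure, payoff model, and names are assumptions for illustration, not the paper's algorithm.

```python
import math

class UCBRouter:
    """Route each arriving customer class to the compatible server whose
    optimistic payoff estimate (UCB1 index) is highest, then update the
    chosen (class, server) pair with the realized payoff."""

    def __init__(self, n_classes, n_servers, compatible):
        self.compatible = compatible   # class id -> list of compatible server ids
        self.mean = [[0.0] * n_servers for _ in range(n_classes)]
        self.count = [[0] * n_servers for _ in range(n_classes)]
        self.t = 0

    def route(self, c):
        self.t += 1
        def index(s):
            n = self.count[c][s]
            if n == 0:
                return float("inf")    # try every compatible server at least once
            return self.mean[c][s] + math.sqrt(2 * math.log(self.t) / n)
        return max(self.compatible[c], key=index)

    def update(self, c, s, payoff):
        self.count[c][s] += 1
        self.mean[c][s] += (payoff - self.mean[c][s]) / self.count[c][s]
```

Secondary objectives such as server-load fairness or waiting-time reduction could be folded into `payoff` as a weighted combination, mirroring the tuning parameters mentioned in the abstract.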
[LG-10] Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Laser Powder Bed Fusion
链接: https://arxiv.org/abs/2506.20537
作者: R. Sharma,M. Raissi,Y.B. Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process prediction due to the lasting issue of high computation cost using traditional numerical methods such as finite element analysis (FEA). This study presents an efficient modeling framework termed FEA-Regulated Physics-Informed Neural Network (FEA-PINN) to accelerate the thermal field prediction in a LPBF process while maintaining the FEA accuracy. A novel dynamic material updating strategy is developed to capture the dynamic phase change of powder-liquid-solid in the PINN model. The PINN model incorporates temperature-dependent material properties and phase change behavior using the apparent heat capacity method. While the PINN model demonstrates high accuracy with a small amount of training data and enables generalization to new process parameters via transfer learning, it faces the challenge of high computation cost in time-dependent problems due to residual accumulation. To overcome this issue, the FEA-PINN framework integrates corrective FEA simulations during inference to enforce physical consistency and reduce error drift. A comparative analysis shows that FEA-PINN achieves equivalent accuracy to FEA while significantly reducing computational cost. The framework has been validated using the benchmark FEA data and demonstrated through single-track scanning in LPBF.
[LG-11] WallStreetFeds: Client-Specific Tokens as Investment Vehicles in Federated Learning
链接: https://arxiv.org/abs/2506.20518
作者: Arno Geimer,Beltran Fiz Pontiveros,Radu State
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) is a collaborative machine learning paradigm which allows participants to collectively train a model while training data remains private. This paradigm is especially beneficial for sectors like finance, where data privacy, security and model performance are paramount. FL has been extensively studied in the years following its introduction, leading to, among other advances, better-performing collaboration techniques, defenses against clients trying to attack the model, and contribution assessment methods. An important element in for-profit Federated Learning is the development of incentive methods to determine the allocation and distribution of rewards for participants. While numerous methods for allocation have been proposed and thoroughly explored, distribution frameworks remain relatively understudied. In this paper, we propose a novel framework which introduces client-specific tokens as investment vehicles within the FL ecosystem. Our framework aims to address the limitations of existing incentive schemes by leveraging a decentralized finance (DeFi) platform and automated market makers (AMMs) to create a more flexible and scalable reward distribution system for participants, and a mechanism for third parties to invest in the federated learning process.
[LG-12] Collaborative Batch Size Optimization for Federated Learning
链接: https://arxiv.org/abs/2506.20511
作者: Arno Geimer,Karthick Panner Selvam,Beltran Fiz Pontiveros
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Federated Learning (FL) is a decentralized collaborative Machine Learning framework for training models without collecting data in a centralized location. It has seen application across various disciplines, from helping medical diagnoses in hospitals to detecting fraud in financial transactions. In this paper, we focus on improving the local training process through hardware usage optimization. While participants in a federation might share the hardware they are training on, since there is no information exchange between them, their training process can be hindered by an improper training configuration. Taking advantage of the parallel processing inherent to Federated Learning, we use a greedy randomized search to optimize local batch sizes for the best training settings across all participants. Our results show that against default parameter settings, our method improves convergence speed while staying nearly on par with the case where local parameters are optimized.
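A minimal sketch of the idea, assuming a per-client `measure_throughput` probe (samples/sec at a given batch size); the actual search and its coordination across the federation are not reproduced here.

```python
import random

def greedy_random_batch_search(candidates, measure_throughput,
                               iters=20, restart_prob=0.2, seed=0):
    """Greedy randomized search over a sorted list of feasible batch sizes:
    propose a random neighbor (or an occasional random restart) and keep
    it only if the measured throughput improves."""
    rng = random.Random(seed)
    best = rng.choice(candidates)
    best_score = measure_throughput(best)
    for _ in range(iters):
        if rng.random() < restart_prob:
            cand = rng.choice(candidates)          # random restart
        else:
            i = candidates.index(best)             # move to a neighbor
            j = min(max(i + rng.choice([-1, 1]), 0), len(candidates) - 1)
            cand = candidates[j]
        score = measure_throughput(cand)
        if score > best_score:
            best, best_score = cand, score
    return best

# e.g.: best = greedy_random_batch_search([8, 16, 32, 64, 128, 256], probe_fn)
```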
[LG-13] Unidentified and Confounded? Understanding Two-Tower Models for Unbiased Learning to Rank
链接: https://arxiv.org/abs/2506.20501
作者: Philipp Hager,Onno Zoeter,Maarten de Rijke
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Additive two-tower models are popular learning-to-rank methods for handling biased user feedback in industry settings. Recent studies, however, report a concerning phenomenon: training two-tower models on clicks collected by well-performing production systems leads to decreased ranking performance. This paper investigates two recent explanations for this observation: confounding effects from logging policies and model identifiability issues. We theoretically analyze the identifiability conditions of two-tower models, showing that either document swaps across positions or overlapping feature distributions are required to recover model parameters from clicks. We also investigate the effect of logging policies on two-tower models, finding that they introduce no bias when models perfectly capture user behavior. However, logging policies can amplify biases when models imperfectly capture user behavior, particularly when prediction errors correlate with document placement across positions. We propose a sample weighting technique to mitigate these effects and provide actionable insights for researchers and practitioners using two-tower models.
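For readers unfamiliar with the model class under study, here is a minimal additive two-tower sketch in PyTorch: a relevance tower over query-document features plus a bias tower over the rank position, combined additively in logit space. Layer sizes and names are illustrative.

```python
import torch
from torch import nn

class AdditiveTwoTower(nn.Module):
    """Additive two-tower click model: click logit = relevance logit
    (from document/query features) + examination logit (from position)."""

    def __init__(self, feat_dim, n_positions, hidden=64):
        super().__init__()
        self.relevance = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.position_bias = nn.Embedding(n_positions, 1)

    def forward(self, features, position):
        rel = self.relevance(features).squeeze(-1)        # (batch,)
        bias = self.position_bias(position).squeeze(-1)   # (batch,)
        return rel + bias

# Train with nn.BCEWithLogitsLoss on (features, position, click) triples;
# at serving time, rank by the relevance tower alone.
model = AdditiveTwoTower(feat_dim=32, n_positions=10)
logits = model(torch.randn(4, 32), torch.tensor([0, 1, 2, 3]))
```

The identifiability question the paper studies is precisely when clicks allow the `relevance` and `position_bias` components to be recovered separately; the proposed sample weighting would enter as per-example weights in the loss.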
[LG-14] Multimodal Representation Learning and Fusion
链接: https://arxiv.org/abs/2506.20494
作者: Qihang Jin,Enze Ge,Yuhang Xie,Hongying Luo,Junhao Song,Ziqian Bi,Chia Xin Liang,Jibin Guan,Joe Yeong,Junfeng Hao
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
Abstract:Multi-modal learning is a fast-growing area of artificial intelligence that helps machines understand complex phenomena by combining information from different sources, such as images, text, and audio. By drawing on the strengths of each modality, multi-modal learning allows AI systems to build stronger and richer internal representations, which in turn support better interpretation, reasoning, and decision-making in real-life situations. The field rests on core techniques such as representation learning (to extract shared features from different data types), alignment methods (to match information across modalities), and fusion strategies (to combine modalities with deep learning models). Despite good progress, major problems remain, including handling different data formats, coping with missing or incomplete inputs, and defending against adversarial attacks. Researchers are now exploring new methods, such as unsupervised or semi-supervised learning and AutoML tools, to make models more efficient and easier to scale, and are paying more attention to designing better evaluation metrics and building shared benchmarks that make it easier to compare model performance across tasks and domains. As the field continues to grow, multi-modal learning is expected to improve many areas, including computer vision, natural language processing, speech recognition, and healthcare. In the future, it may help build AI systems that understand the world in a more human-like way: flexible, context-aware, and able to deal with real-world complexity.
[LG-15] Méthode de quadrature pour les PINNs fondée théoriquement sur la hessienne des résiduels
链接: https://arxiv.org/abs/2506.20441
作者: Antoine Caradot,Rémi Emonet,Amaury Habrard,Abdel-Rahim Mezidi,Marc Sebban
类目: Machine Learning (cs.LG)
*备注: 10 pages. In French. Comments are welcome
Abstract:Physics-informed Neural Networks (PINNs) have emerged as an efficient way to learn surrogate neural solvers of PDEs by embedding the physical model in the loss function and minimizing its residuals using automatic differentiation at so-called collocation points. Originally uniformly sampled, the choice of the latter has been the subject of recent advances leading to adaptive sampling refinements. In this paper, we propose a new quadrature method for approximating definite integrals based on the hessian of the considered function, and that we leverage to guide the selection of the collocation points during the training process of PINNs.
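A one-dimensional sketch of the idea: estimate the magnitude of the second derivative on a candidate pool and sample collocation points with probability proportional to it. The finite-difference estimator and the sampling rule are illustrative assumptions, not the paper's quadrature method.

```python
import numpy as np

def hessian_weighted_collocation(f, domain, n_points, n_pool=10_000,
                                 h=1e-3, seed=0):
    """Draw collocation points from a uniform candidate pool with
    probability proportional to a finite-difference estimate of |f''|,
    concentrating points where the residual curves the most (1-D case)."""
    rng = np.random.default_rng(seed)
    lo, hi = domain
    pool = rng.uniform(lo, hi, n_pool)
    curvature = np.abs((f(pool + h) - 2 * f(pool) + f(pool - h)) / h**2)
    p = curvature + 1e-12                    # avoid zero-probability points
    p /= p.sum()
    return pool[rng.choice(n_pool, size=n_points, replace=False, p=p)]

# Points cluster where sin(x) has the largest curvature (near its extrema):
pts = hessian_weighted_collocation(np.sin, (0.0, 2 * np.pi), n_points=256)
```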
[LG-16] Tackling Data Heterogeneity in Federated Learning through Knowledge Distillation with Inequitable Aggregation
链接: https://arxiv.org/abs/2506.20431
作者: Xing Ma
类目: Machine Learning (cs.LG)
*备注: 33 pages, 8 figures
Abstract:Federated learning aims to train a global model in a distributed environment that is close to the performance of centralized training. However, issues such as client label skew, data quantity skew, and other heterogeneity problems severely degrade the model’s performance. Most existing methods overlook the scenario where only a small portion of clients participate in training within a large-scale client setting, whereas our experiments show that this scenario presents a more challenging federated learning task. Therefore, we propose a Knowledge Distillation with teacher-student Inequitable Aggregation (KDIA) strategy tailored to address the federated learning setting mentioned above, which can effectively leverage knowledge from all clients. In KDIA, the student model is the average aggregation of the participating clients, while the teacher model is formed by a weighted aggregation of all clients based on three frequencies: participation intervals, participation counts, and data volume proportions. During local training, self-knowledge distillation is performed. Additionally, we utilize a generator trained on the server to generate approximately independent and identically distributed (IID) data features locally for auxiliary training. We conduct extensive experiments on the CIFAR-10/100/CINIC-10 datasets and various heterogeneous settings to evaluate KDIA. The results show that KDIA can achieve better accuracy with fewer rounds of training, and the improvement is more significant under severe heterogeneity.
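The teacher construction can be sketched as a weighted aggregation over all clients driven by the three frequencies named in the abstract; the normalized-product combination below is an assumption for illustration, and the paper's exact rule may differ.

```python
import numpy as np

def kdia_teacher_weights(intervals, counts, volumes):
    """Combine three per-client statistics into teacher weights: average
    participation interval (smaller is better), participation count, and
    data-volume share (illustrative normalized product)."""
    f_int = 1.0 / (1.0 + np.asarray(intervals, dtype=float))
    f_cnt = np.asarray(counts, dtype=float)
    f_vol = np.asarray(volumes, dtype=float)
    w = (f_int / f_int.sum()) * (f_cnt / f_cnt.sum()) * (f_vol / f_vol.sum())
    return w / w.sum()

def weighted_aggregate(client_params, weights):
    """Weighted average of client parameters (lists of numpy arrays)."""
    return [sum(w * p for w, p in zip(weights, layer))
            for layer in zip(*client_params)]
```

The student model would instead use `weighted_aggregate` with uniform weights over only the participating clients.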
[LG-17] TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
链接: https://arxiv.org/abs/2506.20380
作者: Zhengpeng Feng,Sadiq Jaffer,Jovana Knezevic,Silja Sormunen,Robin Young,Madeline Lisaius,Markus Immitzer,James Ball,Clement Atzberger,David A. Coomes,Anil Madhavapeddy,Andrew Blake,Srinivasan Keshav
类目: Machine Learning (cs.LG)
*备注:
Abstract:Satellite remote sensing (RS) enables a wide array of downstream Earth observation (EO) applications, including climate modeling, carbon accounting, and strategies for conservation and sustainable land use. We present TESSERA, a novel Remote Sensing Foundation Model (RSFM) that uses Self-Supervised Learning (SSL) to generate global, robust representations at 10m scale from pixel-level satellite time series data. TESSERA combines information from only optical and SAR data streams using two parallel Transformer-based encoders: one dedicated to Sentinel-1 SAR polarizations and another to Sentinel-2 MSI data (10 selected spectral bands) to create representations that are then fused using a multilayer perceptron (MLP), resulting in a global representation map covering the years 2017 to 2024. Our precomputed representations set a new state-of-the-art performance benchmark and our open-source approach democratizes access to high-performance, high-resolution representations. We benchmark the performance of TESSERA in five diverse tasks, comparing our work with state-of-the-art task-specific models and other foundation models. Our results show that TESSERA outperforms both traditional RS baselines and the leading geospatial foundation models in these diverse downstream tasks.
[LG-18] Towards Interpretable and Efficient Feature Selection in Trajectory Datasets: A Taxonomic Approach
链接: https://arxiv.org/abs/2506.20359
作者: Chanuka Don Samarasinghage,Dhruv Gulabani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Trajectory analysis is not only about obtaining movement data; it is also of paramount importance in understanding the pattern in which an object moves through space and time, as well as in predicting its next move. Due to the significant interest in the area, data collection has improved substantially, resulting in a large number of features becoming available for model training and prediction. However, this introduces a high-dimensionality-induced feature explosion problem, which reduces the efficiency and interpretability of the data, thereby reducing the accuracy of machine learning models. To overcome this issue, feature selection has become one of the most prevalent tools. Thus, the objective of this paper is to introduce a taxonomy-based feature selection method that categorizes features based on their internal structure. This approach classifies the data into geometric and kinematic features, further categorizing them into curvature, indentation, speed, and acceleration. The comparative analysis indicated that the taxonomy-based approach consistently achieved comparable or superior predictive performance. Furthermore, because the taxonomic grouping reduces the combinatorial space, the time taken to select features was drastically reduced. The taxonomy was also used to gain insights into which feature sets each dataset was more sensitive to. Overall, this study provides robust evidence that a taxonomy-based feature selection method can add a layer of interpretability, reduce dimensionality and computational complexity, and contribute to high-level decision-making. It serves as a step toward providing a methodological framework for researchers and practitioners dealing with trajectory datasets and contributing to the broader field of explainable artificial intelligence.
[LG-19] On the ability of Deep Neural Networks to Learn Granger Causality in Multi-Variate Time Series Data
链接: https://arxiv.org/abs/2506.20347
作者: Malik Shahid Sultan,Hernando Ombao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Granger Causality (GC) offers an elegant statistical framework to study the association between multivariate time series data. Linear Vector Autoregressive (VAR) models have attractive interpretation properties but limited practical applicability due to underlying assumptions about the kinds of associations they can capture. Numerous attempts in the literature exploit the functional approximation power of Deep Neural Networks (DNNs) for the task of GC estimation; these methods, however, treat GC as a variable selection problem. We present a novel paradigm for approaching GC, built on the idea that GC is essentially linked with prediction: if a deep learning model is used to model the time series collectively or jointly, a well-regularized model may learn the true Granger-causal structure from the data, given enough training data. We propose to uncover the learned GC structure by comparing the model uncertainty or the distribution of residuals when the past of everything is used, against the case where a specific time series component is dropped from the model. We also compare the effect of input-layer dropout on the ability of a neural network to learn Granger causality from the data. We show that a well-regularized model can in fact learn the true GC structure from the data without explicitly adding terms to the loss function that guide the model to select variables or perform sparse regression.
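The proposed test can be sketched as a residual comparison: fit one joint one-step-ahead forecaster, then ablate a candidate driver and compare residual distributions. The zero-ablation and the variance-ratio statistic below are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def gc_score_by_ablation(fit_predict, X, target, candidate):
    """Compare residuals of a joint forecaster for `target` when the full
    past is available vs. when a candidate driver series is ablated
    (here, zeroed). A ratio well above 1 suggests the candidate
    Granger-causes the target.

    fit_predict: callable mapping past values (T-1, D) to one-step-ahead
    predictions of the target series; X: (T, D) multivariate series."""
    full_resid = X[1:, target] - fit_predict(X[:-1])
    X_abl = X.copy()
    X_abl[:, candidate] = 0.0                 # drop the candidate driver
    abl_resid = X[1:, target] - fit_predict(X_abl[:-1])
    return np.var(abl_resid) / np.var(full_resid)
```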
[LG-20] Recurrent neural network-based robust control systems with closed-loop regional incremental ISS and application to MPC design
链接: https://arxiv.org/abs/2506.20334
作者: Daniele Ravasio,Marcello Farina,Alessio La Bella,Andrea Ballarino
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 16 pages, 7 figures, submitted to IEEE Transactions on Automatic Control (under review)
Abstract:This paper investigates the design of output-feedback schemes for systems described by a class of recurrent neural networks. We propose a procedure based on linear matrix inequalities for designing an observer and a static state-feedback controller. The algorithm leverages global and regional incremental input-to-state stability (incremental ISS) and enables the tracking of constant setpoints, ensuring robustness to disturbances and state estimation uncertainty. To address the potential limitations of regional incremental ISS, we introduce an alternative scheme in which the static law is replaced with a tube-based nonlinear model predictive controller (NMPC) that exploits regional incremental ISS properties. We show that these conditions enable the formulation of a robust NMPC law with guarantees of convergence and recursive feasibility, leading to an enlarged region of attraction. Theoretical results are validated through numerical simulations on the pH-neutralisation process benchmark, demonstrating the effectiveness of the proposed schemes.
[LG-21] Producer-Fairness in Sequential Bundle Recommendation
链接: https://arxiv.org/abs/2506.20329
作者: Alexandre Rio,Marta Soare,Sihem Amer-Yahia
类目: Machine Learning (cs.LG)
*备注:
Abstract:We address fairness in the context of sequential bundle recommendation, where users are served in turn with sets of relevant and compatible items. Motivated by real-world scenarios, we formalize producer-fairness, that seeks to achieve desired exposure of different item groups across users in a recommendation session. Our formulation combines naturally with building high quality bundles. Our problem is solved in real time as users arrive. We propose an exact solution that caters to small instances of our problem. We then examine two heuristics, quality-first and fairness-first, and an adaptive variant that determines on-the-fly the right balance between bundle fairness and quality. Our experiments on three real-world datasets underscore the strengths and limitations of each solution and demonstrate their efficacy in providing fair bundle recommendations without compromising bundle quality.
[LG-22] Permutation Equivariant Neural Controlled Differential Equations for Dynamic Graph Representation Learning
链接: https://arxiv.org/abs/2506.20324
作者: Torben Berndt,Benjamin Walker,Tiexin Qin,Jan Stühmer,Andrey Kormilitzin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dynamic graphs exhibit complex temporal dynamics due to the interplay between evolving node features and changing network structures. Recently, Graph Neural Controlled Differential Equations (Graph Neural CDEs) successfully adapted Neural CDEs from paths on Euclidean domains to paths on graph domains. Building on this foundation, we introduce Permutation Equivariant Neural Graph CDEs, which project Graph Neural CDEs onto permutation equivariant function spaces. This significantly reduces the model’s parameter count without compromising representational power, resulting in more efficient training and improved generalisation. We empirically demonstrate the advantages of our approach through experiments on simulated dynamical systems and real-world tasks, showing improved performance in both interpolation and extrapolation scenarios.
[LG-23] Distilling A Universal Expert from Clustered Federated Learning
链接: https://arxiv.org/abs/2506.20285
作者: Zeqi Leng,Chunxu Zhang,Guodong Long,Riting Xia,Bo Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clustered Federated Learning (CFL) addresses the challenges posed by non-IID data by training multiple group- or cluster-specific expert models. However, existing methods often overlook the shared information across clusters, which represents the generalizable knowledge valuable to all participants in the Federated Learning (FL) system. To overcome this limitation, this paper introduces a novel FL framework that distills a universal expert model from the knowledge of multiple clusters. This universal expert captures globally shared information across all clients and is subsequently distributed to each client as the initialization for the next round of model training. The proposed FL framework operates in three iterative steps: (1) local model training at each client, (2) cluster-specific model aggregation, and (3) universal expert distillation. This three-step learning paradigm ensures the preservation of fine-grained non-IID characteristics while effectively incorporating shared knowledge across clusters. Compared to traditional gradient-based aggregation methods, the distillation-based model aggregation introduces greater flexibility in handling model heterogeneity and reduces conflicts among cluster-specific experts. Extensive experimental results demonstrate the superior performance of the proposed method across various scenarios, highlighting its potential to advance the state of CFL by balancing personalized and shared knowledge more effectively.
[LG-24] Exploration-Exploitation Tradeoff in Universal Lossy Compression
链接: https://arxiv.org/abs/2506.20261
作者: Nir Weinberger,Ram Zamir
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: An extended version of ISIT 2025 paper
Abstract:Universal compression can learn the source and adapt to it either in a batch mode (forward adaptation), or in a sequential mode (backward adaptation). We recast the sequential mode as a multi-armed bandit (MAB) problem, a fundamental model in reinforcement learning, and study the trade-off between exploration and exploitation in the lossy compression case. We show that a previously proposed "natural type selection" scheme can be cast as a reconstruction-directed MAB algorithm for sequential lossy compression, and explain its limitations in terms of robustness and short-block performance. We then derive and analyze robust cost-directed MAB algorithms, which work at any block length.
[LG-25] DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
链接: https://arxiv.org/abs/2506.20194
作者: Ruokai Yin,Yuhang Li,Donghyun Lee,Priyadarshini Panda
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose DuoGPT, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39 \times compared to the baseline dense model.
[LG-26] Causal Operator Discovery in Partial Differential Equations via Counterfactual Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2506.20181
作者: Ronald Katende
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We develop a principled framework for discovering causal structure in partial differential equations (PDEs) using physics-informed neural networks and counterfactual perturbations. Unlike classical residual minimization or sparse regression methods, our approach quantifies operator-level necessity through functional interventions on the governing dynamics. We introduce causal sensitivity indices and structural deviation metrics to assess the influence of candidate differential operators within neural surrogates. Theoretically, we prove exact recovery of the causal operator support under restricted isometry or mutual coherence conditions, with residual bounds guaranteeing identifiability. Empirically, we validate the framework on both synthetic and real-world datasets across climate dynamics, tumor diffusion, and ocean flows. Our method consistently recovers governing operators even under noise, redundancy, and data scarcity, outperforming standard PINNs and DeepONets in structural fidelity. This work positions causal PDE discovery as a tractable and interpretable inference task grounded in structural causal models and variational residual analysis.
[LG-27] Causal discovery in deterministic discrete LTI-DAE systems
链接: https://arxiv.org/abs/2506.20169
作者: Bala Rajesh Konkathi,Arun K. Tangirala
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Methodology (stat.ME)
*备注:
Abstract:Discovering pure causes or driver variables in deterministic LTI systems is of vital importance in the data-driven reconstruction of causal networks. A recent work by Kathari and Tangirala, proposed in 2022, formulated the causal discovery method as a constraint identification problem. The constraints are identified using a dynamic iterative PCA (DIPCA)-based approach for dynamical systems corrupted with Gaussian measurement errors. The DIPCA-based method works efficiently for dynamical systems devoid of any algebraic relations. However, several dynamical systems operate under feedback control and/or are coupled with conservation laws, leading to differential-algebraic (DAE) or mixed causal systems. In this work, a method, namely the partition of variables (PoV), for causal discovery in LTI-DAE systems is proposed. This method is superior to the method presented by Kathari and Tangirala (2022), as PoV also works for pure dynamical systems, which are devoid of algebraic equations. The proposed method identifies the causal drivers up to a minimal subset. PoV deploys DIPCA to first determine the number of algebraic relations ( n_a ), the number of dynamical relations ( n_d ) and the constraint matrix. Subsequently, the subsets are identified through an admissible partitioning of the constraint matrix by examining its condition number. Case studies are presented to demonstrate the effectiveness of the proposed method.
[LG-28] Accept More, Reject Less: Reducing up to 19% Unnecessary Desk-Rejections over 11 Years of ICLR Data
链接: https://arxiv.org/abs/2506.20141
作者: Xiaoyu Li,Zhao Song,Jiahao Zhang
类目: Data Structures and Algorithms (cs.DS); Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:The explosive growth of AI research has driven paper submissions at flagship AI conferences to unprecedented levels, necessitating many venues in 2025 (e.g., CVPR, ICCV, KDD, AAAI, IJCAI, WSDM) to enforce strict per-author submission limits and to desk-reject any excess papers by simple ID order. While this policy helps reduce reviewer workload, it may unintentionally discard valuable papers and penalize authors’ efforts. In this paper, we ask an essential research question on whether it is possible to follow submission limits while minimizing needless rejections. We first formalize the current desk-rejection policies as an optimization problem, and then develop a practical algorithm based on linear programming relaxation and a rounding scheme. Under extensive evaluation on 11 years of real-world ICLR (International Conference on Learning Representations) data, our method preserves up to 19.23% more papers without violating any author limits. Moreover, our algorithm is highly efficient in practice, with all results on ICLR data computed within at most 53.64 seconds. Our work provides a simple and practical desk-rejection strategy that significantly reduces unnecessary rejections, demonstrating strong potential to improve current CS conference submission policies.
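To illustrate the formulation, the sketch below maximizes the number of accepted papers under a per-author cap via an LP relaxation (SciPy's `linprog`) followed by a simple greedy rounding; the paper's objective and rounding scheme may differ in detail.

```python
import numpy as np
from scipy.optimize import linprog

def select_papers(author_lists, limit):
    """Keep as many papers as possible subject to a per-author cap.
    author_lists: list of author-id lists, one per paper."""
    n = len(author_lists)
    authors = sorted({a for al in author_lists for a in al})
    a_idx = {a: i for i, a in enumerate(authors)}
    A = np.zeros((len(authors), n))          # author-paper incidence
    for p, al in enumerate(author_lists):
        for a in al:
            A[a_idx[a], p] = 1.0
    # maximize sum(x)  <=>  minimize -sum(x), with 0 <= x_p <= 1
    res = linprog(c=-np.ones(n), A_ub=A, b_ub=np.full(len(authors), limit),
                  bounds=[(0, 1)] * n, method="highs")
    assert res.success
    # greedy rounding: accept papers by descending fractional value while
    # every author involved still has remaining budget
    accepted, load = [], {a: 0 for a in authors}
    for p in np.argsort(-res.x):
        if all(load[a] < limit for a in author_lists[p]):
            accepted.append(int(p))
            for a in author_lists[p]:
                load[a] += 1
    return sorted(accepted)

# e.g.: select_papers([["alice", "bob"], ["alice"], ["carol"]], limit=1)
```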
[LG-29] Piecewise Linear Approximation in Learned Index Structures: Theoretical and Empirical Analysis
链接: https://arxiv.org/abs/2506.20139
作者: Jiayong Qin,Xianyu Zhu,Qiyu Liu,Guangyi Zhang,Zhigang Cai,Jianwei Liao,Sha Hu,Jingshu Peng,Yingxia Shao,Lei Chen
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
Abstract:A growing trend in the database and system communities is to augment conventional index structures, such as B+-trees, with machine learning (ML) models. Among these, error-bounded Piecewise Linear Approximation ( \epsilon -PLA) has emerged as a popular choice due to its simplicity and effectiveness. Despite its central role in many learned indexes, the design and analysis of \epsilon -PLA fitting algorithms remain underexplored. In this paper, we revisit \epsilon -PLA from both theoretical and empirical perspectives, with a focus on its application in learned index structures. We first establish a fundamentally improved lower bound of \Omega(\kappa \cdot \epsilon^2) on the expected segment coverage for existing \epsilon -PLA fitting algorithms, where \kappa is a data-dependent constant. We then present a comprehensive benchmark of state-of-the-art \epsilon -PLA algorithms when used in different learned data structures. Our results highlight key trade-offs among model accuracy, model size, and query performance, providing actionable guidelines for the principled design of future learned data structures.
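For context, here is the classic one-pass "shrinking cone" construction of an \epsilon -PLA over sorted keys, the kind of fitting algorithm the paper analyzes; this is a textbook-style sketch, not the authors' implementation.

```python
import math

def epsilon_pla(keys, positions, eps):
    """One-pass epsilon-PLA over strictly increasing keys: extend the
    current segment while some line through its origin predicts every
    covered position within +/- eps; otherwise start a new segment."""
    segments = []                                  # (start_key, slope)
    x0, y0 = keys[0], positions[0]
    lo, hi = -math.inf, math.inf                   # feasible slope cone
    for x, y in zip(keys[1:], positions[1:]):
        dx = x - x0                                # > 0 for sorted keys
        s_lo, s_hi = (y - eps - y0) / dx, (y + eps - y0) / dx
        if max(lo, s_lo) <= min(hi, s_hi):
            lo, hi = max(lo, s_lo), min(hi, s_hi)  # cone stays non-empty
        else:
            segments.append((x0, (lo + hi) / 2))   # close the segment
            x0, y0, lo, hi = x, y, -math.inf, math.inf
    segments.append((x0, (lo + hi) / 2 if math.isfinite(lo) else 0.0))
    return segments

# Larger eps yields fewer, longer segments (higher expected coverage):
segs = epsilon_pla(list(range(0, 200, 2)), list(range(100)), eps=1.0)
```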
[LG-30] High-Resolution Live Fuel Moisture Content (LFMC) Maps for Wildfire Risk from Multimodal Earth Observation Data ICML2025
链接: https://arxiv.org/abs/2506.20132
作者: Patrick Alan Johnson,Gabriel Tseng,Yawen Zhang,Heather Heward,Virginia Sjahli,Favyen Bastani,Joseph Redmon,Patrick Beukema
类目: Machine Learning (cs.LG)
*备注: 10 pages, ICML 2025 (TerraBytes)
Abstract:Wildfires are increasing in intensity and severity at an alarming rate. Recent advances in AI and publicly available satellite data enable monitoring critical wildfire risk factors globally, at high resolution and low latency. Live Fuel Moisture Content (LFMC) is a critical wildfire risk factor and is valuable for both wildfire research and operational response. However, ground-based LFMC samples are both labor intensive and costly to acquire, resulting in sparse and infrequent updates. In this work, we explore the use of a pretrained, highly-multimodal earth-observation model for generating large-scale spatially complete (wall-to-wall) LFMC maps. Our approach achieves significant improvements over previous methods using randomly initialized models (20% reduction in RMSE). We provide an automated pipeline that enables rapid generation of these LFMC maps across the United States, and demonstrate its effectiveness in two regions recently impacted by wildfire (Eaton and Palisades).
[LG-31] Autonomous Cyber Resilience via a Co-Evolutionary Arms Race within a Fortified Digital Twin Sandbox
链接: https://arxiv.org/abs/2506.20102
作者: Malikussaid,Sutiyo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 17 pages, 2 figures, 4 equations, 2 algorithms, 4 tables, to be published in ISPACS Conference 2025, unabridged version
Abstract:The convergence of IT and OT has created hyper-connected ICS, exposing critical infrastructure to a new class of adaptive, intelligent adversaries that render static defenses obsolete. Existing security paradigms often fail to address a foundational “Trinity of Trust,” comprising the fidelity of the system model, the integrity of synchronizing data, and the resilience of the analytical engine against sophisticated evasion. This paper introduces the ARC framework, a method for achieving analytical resilience through an autonomous, closed-loop hardening process. ARC establishes a perpetual co-evolutionary arms race within the high-fidelity sandbox of a F-SCDT. A DRL agent, the “Red Agent,” is formalized and incentivized to autonomously discover stealthy, physically-plausible attack paths that maximize process disruption while evading detection. Concurrently, an ensemble-based “Blue Agent” defender is continuously hardened via adversarial training against the evolving threats discovered by its adversary. This co-evolutionary dynamic forces both agents to become progressively more sophisticated, enabling the system to autonomously probe and patch its own vulnerabilities. Experimental validation on both the TEP and the SWaT testbeds demonstrates the framework’s superior performance. A comprehensive ablation study, supported by extensive visualizations including ROC curves and SHAP plots, reveals that the co-evolutionary process itself is responsible for a significant performance increase in detecting novel attacks. By integrating XAI to ensure operator trust and proposing a scalable F-ARC architecture, this work presents ARC not merely as an improvement, but as a necessary paradigm shift toward dynamic, self-improving security for the future of critical infrastructure.
[LG-32] MEL: Multi-level Ensemble Learning for Resource-Constrained Environments
链接: https://arxiv.org/abs/2506.20094
作者: Krishna Praneet Gudipaty,Walid A. Hanafy,Kaan Ozkara,Qianlin Liang,Jesse Milzman,Prashant Shenoy,Suhas Diggavi
类目: Machine Learning (cs.LG)
*备注:
Abstract:AI inference at the edge is becoming increasingly common for low-latency services. However, edge environments are power- and resource-constrained, and susceptible to failures. Conventional failure resilience approaches, such as cloud failover or compressed backups, often compromise latency or accuracy, limiting their effectiveness for critical edge inference services. In this paper, we propose Multi-Level Ensemble Learning (MEL), a new framework for resilient edge inference that simultaneously trains multiple lightweight backup models capable of operating collaboratively, refining each other when multiple servers are available, and independently under failures while maintaining good accuracy. Specifically, we formulate our approach as a multi-objective optimization problem with a loss formulation that inherently encourages diversity among individual models to promote mutually refining representations, while ensuring each model maintains good standalone performance. Empirical evaluations across vision, language, and audio datasets show that MEL provides performance comparable to original architectures while also providing fault tolerance and deployment flexibility across edge platforms. Our results show that our ensemble model, sized at 40% of the original model, achieves similar performance, while preserving 95.6% of ensemble accuracy in the case of failures when trained using MEL.
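The multi-objective idea can be sketched as a per-model accuracy term plus a pairwise diversity penalty over the ensemble members' predictive distributions; the inner-product penalty and its weighting below are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def mel_style_loss(logits_list, targets, diversity_weight=0.1):
    """Per-model cross-entropy keeps every backup model accurate on its
    own; a pairwise similarity penalty on softmax outputs discourages
    members from collapsing onto identical predictions."""
    standalone = sum(F.cross_entropy(z, targets) for z in logits_list)
    standalone = standalone / len(logits_list)
    probs = [F.softmax(z, dim=-1) for z in logits_list]
    penalty, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            # high inner product = similar predictions = less diversity
            penalty = penalty + (probs[i] * probs[j]).sum(dim=-1).mean()
            pairs += 1
    return standalone + diversity_weight * penalty / max(pairs, 1)
```

Under failures, any surviving subset of members can still vote or average, since each was trained to be accurate standalone.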
[LG-33] A Survey of Predictive Maintenance Methods: An Analysis of Prognostics via Classification and Regression
链接: https://arxiv.org/abs/2506.20090
作者: Ainaz Jamshidi,Dongchan Kim,Muhammad Arif
类目: Machine Learning (cs.LG)
*备注: 13 pages, 7 figures
Abstract:Predictive maintenance (PdM) has become a crucial element of modern industrial practice. PdM plays a significant role in operational dependability and cost management by decreasing unforeseen downtime and optimizing asset life cycle management. Machine learning and deep learning have enabled more precise forecasts of equipment failure and remaining useful life (RUL). Although many studies have been conducted on PdM, there has not yet been a standalone comparative study between regression- and classification-based approaches. In this review, we look across a range of PdM methodologies, while focusing more strongly on the comparative use of classification and regression methods in prognostics. While regression-based methods typically provide estimates of RUL, classification-based methods present a forecast of the probability of failure across defined time intervals. Through a comprehensive analysis of recent literature, we highlight key advancements, challenges (such as data imbalance and high-dimensional feature spaces), and emerging trends, including hybrid approaches and AI-enabled prognostic systems. This review aims to provide researchers and practitioners with an awareness of the strengths and trade-offs of various PdM methods and to help identify future research directions for building more robust, adaptive maintenance systems. Future work may include a systematic review of practical aspects such as public datasets, benchmarking platforms, and open-source tools to support the advancement of PdM research.
[LG-34] Attack Smarter: Attention-Driven Fine-Grained Webpage Fingerprinting Attacks
链接: https://arxiv.org/abs/2506.20082
作者: Yali Yuan,Weiyi Zou,Guang Cheng
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Website Fingerprinting (WF) attacks aim to infer which websites a user is visiting by analyzing traffic patterns, thereby compromising user anonymity. Although this technique has been demonstrated to be effective in controlled experimental environments, it remains largely limited to small-scale scenarios, typically restricted to recognizing website homepages. In practical settings, however, users frequently access multiple subpages in rapid succession, often before previous content fully loads. WebPage Fingerprinting (WPF) generalizes the WF framework to large-scale environments by modeling subpages of the same site as distinct classes. These pages often share similar page elements, resulting in lower inter-class variance in traffic features. Furthermore, we consider multi-tab browsing scenarios, in which a single trace encompasses multiple categories of webpages. This leads to overlapping traffic segments, and similar features may appear in different positions within the traffic, thereby increasing the difficulty of classification. To address these challenges, we propose an attention-driven fine-grained WPF attack, named ADWPF. Specifically, during the training phase, we apply targeted augmentation to salient regions of the traffic based on attention maps, including attention cropping and attention masking. ADWPF then extracts low-dimensional features from both the original and augmented traffic and applies self-attention modules to capture the global contextual patterns of the trace. Finally, to handle the multi-tab scenario, we employ the residual attention to generate class-specific representations of webpages occurring at different temporal positions. Extensive experiments demonstrate that the proposed method consistently surpasses state-of-the-art baselines across datasets of different scales.
[LG-35] Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision ICDE’24 WWW
链接: https://arxiv.org/abs/2506.20070
作者: KMA Solaiman,Bharat Bhargava
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Submitted to ICDE'24. An earlier version of this paper appeared on TechRxiv: this https URL, uploaded on February 05, 2023
Abstract:Existing multi-media retrieval models either rely on creating a common subspace with modality-specific representation models or require schema mapping among modalities to measure similarities among multi-media data. Our goal is to avoid the annotation overhead incurred from considering retrieval as a supervised classification task and re-use the pretrained encoders in large language models and vision tasks. We propose “FemmIR”, a framework to retrieve multimodal results relevant to information needs expressed with multimodal queries by example without any similarity label. Such identification is necessary for real-world applications where data annotations are scarce and satisfactory performance is required without fine-tuning with a common framework across applications. We curate a new dataset called MuQNOL for benchmarking progress on this task. Our technique is based on weak supervision introduced through edit distance between samples: graph edit distance can be modified to consider the cost of replacing a data sample in terms of its properties, and relevance can be measured through the implicit signal from the amount of edit cost among the objects. Unlike metric learning or encoding networks, FemmIR re-uses the high-level properties and maintains the property value and relationship constraints with a multi-level interaction score between data samples and the query example provided by the user. We empirically evaluate FemmIR on a missing person use case with MuQNOL. FemmIR performs comparably to similar retrieval systems in delivering on-demand retrieval results with exact and approximate similarities while using the existing property identifiers in the system.
[LG-36] Supervised Coupled Matrix-Tensor Factorization (SCMTF) for Computational Phenotyping of Patient Reported Outcomes in Ulcerative Colitis
链接: https://arxiv.org/abs/2506.20065
作者: Cristian Minoccheri,Sophia Tesic,Kayvan Najarian,Ryan Stidham
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Phenotyping is the process of distinguishing groups of patients to identify different types of disease progression. A recent trend employs low-rank matrix and tensor factorization methods for their capability of dealing with multi-modal, heterogeneous, and missing data. Symptom quantification is crucial for understanding patient experiences in inflammatory bowel disease, especially in conditions such as ulcerative colitis (UC). However, patient-reported symptoms are typically noisy, subjective, and significantly more sparse than other data types. For this reason, they are usually not included in phenotyping and other machine learning methods. This paper explores the application of computational phenotyping to leverage Patient-Reported Outcomes (PROs) using a novel supervised coupled matrix-tensor factorization (SCMTF) method, which integrates temporal PROs and temporal labs with static features to predict medication persistence in ulcerative colitis. This is the first tensor-based method that is both supervised and coupled, it is the first application to the UC domain, and the first application to PROs. We use a deep learning framework that makes the model flexible and easy to train. The proposed method allows us to handle the large amount of missing data in the PROs. The best model predicts changes in medication 8 and 20 months in the future with AUCs of 0.853 and 0.803 on the test set respectively. We derive interpretable phenotypes consisting of static features and temporal features (including their temporal patterns). We show that low-rank matrix- and tensor-based phenotyping can be successfully applied to the UC domain and to highly missing PRO data. We identify phenotypes useful to predict medication persistence - these phenotypes include several symptom variables, showing that PROs contain relevant information that is usually discarded.
[LG-37] Universal pre-training by iterated random computation
链接: https://arxiv.org/abs/2506.20057
作者: Peter Bloem
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate the use of randomly generated data for the sake of pre-training a model. We justify this approach theoretically from the perspective of algorithmic complexity, building on recent research that shows that sequence models can be trained to approximate Solomonoff induction. We derive similar, but complementary theoretical results. We show empirically that synthetically generated data can be used to pre-train a model before the data is seen. We replicate earlier results that models trained this way show zero-shot in-context learning across a variety of datasets, and that this performance improves with scale. We extend earlier results to real-world data, and show that finetuning a model after pre-training offers faster convergence and better generalization.
[LG-38] Verifiable Unlearning on Edge
链接: https://arxiv.org/abs/2506.20037
作者: Mohammad M Maheri,Alex Davidson,Hamed Haddadi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: This paper has been accepted to the IEEE European Symposium on Security and Privacy (EuroSP) 2025
Abstract:Machine learning providers commonly distribute global models to edge devices, which subsequently personalize these models using local data. However, issues such as copyright infringements, biases, or regulatory requirements may require the verifiable removal of certain data samples across all edge devices. Ensuring that edge devices correctly execute such unlearning operations is critical to maintaining integrity. In this work, we introduce a verification framework leveraging zero-knowledge proofs, specifically zk-SNARKs, to confirm data unlearning on personalized edge-device models without compromising privacy. We have developed algorithms explicitly designed to facilitate unlearning operations that are compatible with efficient zk-SNARK proof generation, ensuring minimal computational and memory overhead suitable for constrained edge environments. Furthermore, our approach carefully preserves personalized enhancements on edge devices, maintaining model performance post-unlearning. Our results affirm the practicality and effectiveness of this verification framework, demonstrating verifiable unlearning with minimal degradation in personalization-induced performance improvements. Our methodology ensures verifiable, privacy-preserving, and effective machine unlearning across edge devices.
[LG-39] Thumb on the Scale: Optimal Loss Weighting in Last Layer Retraining
链接: https://arxiv.org/abs/2506.20025
作者: Nathan Stromberg,Christos Thrampoulidis,Lalitha Sankar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:While machine learning models become more capable in discriminative tasks at scale, their ability to overcome biases introduced by training data has come under increasing scrutiny. Previous results suggest that there are two extremes of parameterization with very different behaviors: the population (underparameterized) setting where loss weighting is optimal and the separable overparameterized setting where loss weighting is ineffective at ensuring equal performance across classes. This work explores the regime of last layer retraining (LLR) in which the unseen limited (retraining) data is frequently inseparable and the model proportionately sized, falling between the two aforementioned extremes. We show, in theory and practice, that loss weighting is still effective in this regime, but that these weights must take into account the relative overparameterization of the model.
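A minimal sketch of group-weighted loss for last-layer retraining follows; how `group_weights` should scale with the model's relative overparameterization is exactly what the paper characterizes, so the weights here are placeholders.

```python
import torch
import torch.nn.functional as F

def weighted_llr_loss(logits, targets, group, group_weights):
    """Group-weighted cross-entropy for retraining only the classifier
    head: each sample's loss is scaled by the weight of its group."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    w = group_weights[group]              # gather one weight per sample
    return (w * per_sample).sum() / w.sum()

# e.g. upweight a minority group (index 1) during last-layer retraining:
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
group = torch.randint(0, 2, (8,))
loss = weighted_llr_loss(logits, targets, group, torch.tensor([1.0, 4.0]))
```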
[LG-40] DIM-SUM: Dynamic IMputation for Smart Utility Management
链接: https://arxiv.org/abs/2506.20023
作者: Ryan Hildebrant,Rahul Bhope,Sharad Mehrotra,Christopher Tull,Nalini Venkatasubramanian
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
Abstract:Time series imputation models have traditionally been developed using complete datasets with artificial masking patterns to simulate missing values. However, in real-world infrastructure monitoring, practitioners often encounter datasets where large amounts of data are missing and follow complex, heterogeneous patterns. We introduce DIM-SUM, a preprocessing framework for training robust imputation models that bridges the gap between artificially masked training data and real missing patterns. DIM-SUM combines pattern clustering and adaptive masking strategies with theoretical learning guarantees to handle diverse missing patterns actually observed in the data. Through extensive experiments on over 2 billion readings from California water districts, electricity datasets, and benchmarks, we demonstrate that DIM-SUM outperforms traditional methods by reaching similar accuracy with lower processing time and significantly less training data. When compared against a large pre-trained model, DIM-SUM averages 2x higher accuracy with significantly less inference time.
[LG-41] Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons
链接: https://arxiv.org/abs/2506.20015
作者: Dengyu Wu,Jiechen Chen,H. Vincent Poor,Bipin Rajendran,Osvaldo Simeone
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Neuromorphic computing offers an energy-efficient alternative to conventional deep learning accelerators for real-time time-series processing. However, many edge applications, such as wireless sensing and audio recognition, generate streaming signals with rich spectral features that are not effectively captured by conventional leaky integrate-and-fire (LIF) spiking neurons. This paper investigates a wireless split computing architecture that employs resonate-and-fire (RF) neurons with oscillatory dynamics to process time-domain signals directly, eliminating the need for costly spectral pre-processing. By resonating at tunable frequencies, RF neurons extract time-localized spectral features while maintaining low spiking activity. This temporal sparsity translates into significant savings in both computation and transmission energy. Assuming an OFDM-based analog wireless interface for spike transmission, we present a complete system design and evaluate its performance on audio classification and modulation classification tasks. Experimental results show that the proposed RF-SNN architecture achieves comparable accuracy to conventional LIF-SNNs and ANNs, while substantially reducing spike rates and total energy consumption during inference and communication.
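A discretized resonate-and-fire neuron can be sketched as a damped complex oscillator, z' = (b + i*omega) z + I(t), that spikes when part of its state crosses a threshold; the reset rule, constants, and threshold below are illustrative.

```python
import numpy as np

def resonate_and_fire(drive, omega, b=-0.1, dt=1e-3, threshold=1.0):
    """RF neuron as a damped complex oscillator that fires selectively
    for inputs near its resonant frequency omega (rad/s)."""
    z = 0j
    step = np.exp((b + 1j * omega) * dt)   # exact homogeneous evolution
    spikes = np.zeros(len(drive), dtype=int)
    for t, i_t in enumerate(drive):
        z = step * z + dt * i_t            # forward-Euler input injection
        if z.imag > threshold:
            spikes[t] = 1
            z = 0j                         # reset after a spike
    return spikes

# A neuron tuned to 50 Hz fires for a matched sinusoid but stays mostly
# silent for a strongly detuned one:
t = np.arange(0.0, 1.0, 1e-3)
on = resonate_and_fire(200 * np.sin(2 * np.pi * 50 * t), omega=2 * np.pi * 50)
off = resonate_and_fire(200 * np.sin(2 * np.pi * 5 * t), omega=2 * np.pi * 50)
```

The low spiking activity away from resonance is what translates into the computation and transmission energy savings discussed in the abstract.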
[LG-42] Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing ICML2025 ICML'25
链接: https://arxiv.org/abs/2506.20000
作者: Narasimha Raghavan Veeraragavan,Jan Franz Nygård
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted at ICML 2025 Workshop on Collaborative and Federated Agentic Workflows (CFAgentic@ICML’25)
Abstract:We propose Guardian-FC, a novel two-layer framework for privacy preserving federated computing that unifies safety enforcement across diverse privacy preserving mechanisms, including cryptographic back-ends like fully homomorphic encryption (FHE) and multiparty computation (MPC), as well as statistical techniques such as differential privacy (DP). Guardian-FC decouples guard-rails from privacy mechanisms by executing plug-ins (modular computation units), written in a backend-neutral, domain-specific language (DSL) designed specifically for federated computing workflows and interchangeable Execution Providers (EPs), which implement DSL operations for various privacy back-ends. An Agentic-AI control plane enforces a finite-state safety loop through signed telemetry and commands, ensuring consistent risk management and auditability. The manifest-centric design supports fail-fast job admission and seamless extensibility to new privacy back-ends. We present qualitative scenarios illustrating backend-agnostic safety and a formal model foundation for verification. Finally, we outline a research agenda inviting the community to advance adaptive guard-rail tuning, multi-backend composition, DSL specification development, implementation, and compiler extensibility alongside human-override usability.
[LG-43] CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems ACL2025
链接: https://arxiv.org/abs/2506.19993
作者: Haochen Zhang,Tianyi Zhang,Junze Yin,Oren Gal,Anshumali Shrivastava,Vladimir Braverman
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by ACL 2025 Findings
Abstract:Recommender systems play a pivotal role in providing relevant content to users. With the rapid development of large language models (LLMs), researchers have begun utilizing LLMs to build more powerful recommender systems. However, existing approaches that focus on aligning LLMs with recommendation tasks do not fully leverage their sequential information processing capabilities, leading to suboptimal performance. In this paper, we propose a novel system called compressed vocabulary expansion (CoVE). In CoVE, each item is assigned a unique ID within the expanded vocabulary. Our framework effectively capitalizes on sequence understanding abilities of LLMs, significantly enhancing their performance on recommendation tasks. Additionally, we compress the embedding layer, making CoVE practical for large-scale industrial applications. The effectiveness and performance of CoVE are demonstrated through comprehensive experiments on multiple recommendation datasets and comparisons with prior works. Our code can be found at this https URL.
[LG-44] MAIZX: A Carbon-Aware Framework for Optimizing Cloud Computing Emissions
链接: https://arxiv.org/abs/2506.19972
作者: Federico Ruilova,Ernst Gunnar Gran,Sven-Arne Reinemo
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 2 pages, 2 figures. LOCO 2024, December 3, 2024, Glasgow/Online
Abstract:Cloud computing drives innovation but also poses significant environmental challenges due to its high-energy consumption and carbon emissions. Data centers account for 2-4% of global energy usage, and the ICT sector’s share of electricity consumption is projected to reach 40% by 2040. As the goal of achieving net-zero emissions by 2050 becomes increasingly urgent, there is a growing need for more efficient and transparent solutions, particularly for private cloud infrastructures, which are utilized by 87% of organizations, despite the dominance of public-cloud systems. This study evaluates the MAIZX framework, designed to optimize cloud operations and reduce carbon footprint by dynamically ranking resources, including data centers, edge computing nodes, and multi-cloud environments, based on real-time and forecasted carbon intensity, Power Usage Effectiveness (PUE), and energy consumption. Leveraging a flexible ranking algorithm, MAIZX achieved an 85.68% reduction in CO2 emissions compared to baseline hypervisor operations. Tested across geographically distributed data centers, the framework demonstrates scalability and effectiveness, directly interfacing with hypervisors to optimize workloads in private, hybrid, and multi-cloud environments. MAIZX integrates real-time data on carbon intensity, power consumption, and carbon footprint, as well as forecasted values, into cloud management, providing a robust tool for enhancing climate performance potential while maintaining operational efficiency.
[LG-45] MILAAP: Mobile Link Allocation via Attention-based Prediction
Link: https://arxiv.org/abs/2506.19947
Authors: Yung-Fu Chen, Anish Arora
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:
Abstract:Channel hopping (CS) communication systems must adapt to interference changes in the wireless network and to node mobility for maintaining throughput efficiency. Optimal scheduling requires up-to-date network state information (i.e., of channel occupancy) to select non-overlapping channels for links in interference regions. However, state sharing among nodes introduces significant communication overhead, especially as network size or node mobility scale, thereby decreasing throughput efficiency of already capacity-limited networks. In this paper, we eschew state sharing while adapting the CS schedule based on a learning-based channel occupancy prediction. We propose the MiLAAP attention-based prediction framework for machine learning models of spectral, spatial, and temporal dependencies among network nodes. MiLAAP uses a self-attention mechanism that lets each node capture the temporospectral CS pattern in its interference region and accordingly predict the channel occupancy state within that region. Notably, the prediction relies only on locally and passively observed channel activities, and thus introduces no communication overhead. To deal with node mobility, MiLAAP also uses a multi-head self-attention mechanism that lets each node locally capture the spatiotemporal dependencies on other network nodes that can interfere with it and accordingly predict the motion trajectory of those nodes. Detecting nodes that enter or move outside the interference region is used to further improve the prediction accuracy of channel occupancy. We show that for dynamic networks that use local CS sequences to support relatively long-lived flow traffic, the channel state prediction accuracy of MiLAAP is remarkably ~100% across different node mobility patterns and it achieves zero-shot generalizability across different periods of CS sequences.
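A minimal sketch of a self-attention occupancy predictor in the spirit of the description above (architecture, sizes, and objective are illustrative guesses, not MiLAAP's actual design): it maps a locally observed occupancy history, one binary vector per time slot, to next-slot occupancy probabilities and would be trained with a binary cross-entropy loss.

```python
import torch
import torch.nn as nn

class OccupancyPredictor(nn.Module):
    def __init__(self, n_channels: int, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_channels)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, time, n_channels) of passively observed 0/1 occupancy
        h = self.encoder(self.embed(history))
        return torch.sigmoid(self.head(h[:, -1]))  # next-slot occupancy probs

model = OccupancyPredictor(n_channels=16)
obs = torch.randint(0, 2, (8, 32, 16)).float()  # 8 nodes, 32 past slots
probs = model(obs)                              # (8, 16) probabilities
```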
[LG-46] The Most Important Features in Generalized Additive Models Might Be Groups of Features
Link: https://arxiv.org/abs/2506.19937
Authors: Tomas M. Bosschieter, Luis Franca, Jessica Wolk, Yiyuan Wu, Bella Mehta, Joseph Dehoney, Orsolya Kiss, Fiona C. Baker, Qingyu Zhao, Rich Caruana, Kilian M. Pohl
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:While analyzing the importance of features has become ubiquitous in interpretable machine learning, the joint signal from a group of related features is sometimes overlooked or inadvertently excluded. Neglecting the joint signal could bypass a critical insight: in many instances, the most significant predictors are not isolated features, but rather the combined effect of groups of features. This can be especially problematic for datasets that contain natural groupings of features, including multimodal datasets. This paper introduces a novel approach to determine the importance of a group of features for Generalized Additive Models (GAMs) that is efficient, requires no model retraining, allows defining groups posthoc, permits overlapping groups, and remains meaningful in high-dimensional settings. Moreover, this definition offers a parallel with explained variation in statistics. We showcase properties of our method on three synthetic experiments that illustrate the behavior of group importance across various data regimes. We then demonstrate the importance of groups of features in identifying depressive symptoms from a multimodal neuroscience dataset, and study the importance of social determinants of health after total hip arthroplasty. These two case studies reveal that analyzing group importance offers a more accurate, holistic view of the medical issues compared to a single-feature analysis.
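Because a fitted GAM decomposes each prediction into additive per-feature contributions, group importance can be computed posthoc from those contributions alone, with no retraining. The sketch below scores a group by the variability of its summed contributions, echoing the paper's parallel with explained variation; the authors' exact definition may differ. The toy example also shows why groups matter: two individually strong features can nearly cancel jointly.

```python
import numpy as np

def group_importance(contributions: np.ndarray, group: list[int]) -> float:
    """contributions: (n_samples, n_features) per-feature GAM shape outputs."""
    joint = contributions[:, group].sum(axis=1)  # the group's joint signal
    return float(joint.std())                    # its variability across samples

rng = np.random.default_rng(0)
contrib = rng.normal(size=(1000, 6))
contrib[:, 1] = -contrib[:, 0] + 0.1 * rng.normal(size=1000)  # cancelling pair
print(group_importance(contrib, [0]))     # large in isolation
print(group_importance(contrib, [0, 1]))  # small jointly: the signals cancel
```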
[LG-47] A Comparative Analysis of Reinforcement Learning and Conventional Deep Learning Approaches for Bearing Fault Diagnosis
Link: https://arxiv.org/abs/2506.19929
Authors: Efe Çakır, Patrick Dumond
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 5 figures. To appear in the Proceedings of the Canadian Society for Mechanical Engineering (CSME) Congress 2025
Abstract:Bearing faults in rotating machinery can lead to significant operational disruptions and maintenance costs. Modern methods for bearing fault diagnosis rely heavily on vibration analysis and machine learning techniques, which often require extensive labeled data and may not adapt well to dynamic environments. This study explores the feasibility of reinforcement learning (RL), specifically Deep Q-Networks (DQNs), for bearing fault classification tasks in machine condition monitoring to enhance the accuracy and adaptability of bearing fault diagnosis. The results demonstrate that while RL models developed in this study can match the performance of traditional supervised learning models under controlled conditions, they excel in adaptability when equipped with optimized reward structures. However, their computational demands highlight areas for further improvement. These findings demonstrate RL’s potential to complement traditional methods, paving the way for adaptive diagnostic frameworks.
[LG-48] Diffusion-based Task-oriented Semantic Communications with Model Inversion Attack
Link: https://arxiv.org/abs/2506.19886
Authors: Xuesong Wang, Mo Li, Xingyan Shi, Zhaoqian Liu, Shenghao Yang
Subjects: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Semantic communication has emerged as a promising neural network-based system design for 6G networks. Task-oriented semantic communication is a novel paradigm whose core goal is to efficiently complete specific tasks by transmitting semantic information, optimizing communication efficiency and task performance. The key challenge lies in preserving privacy while maintaining task accuracy, as this scenario is susceptible to model inversion attacks. In such attacks, adversaries can restore or even reconstruct input data by analyzing and processing model outputs, owing to the neural network-based nature of the systems. In addition, traditional systems use image quality indicators (such as PSNR or SSIM) to assess attack severity, which may be inadequate for task-oriented semantic communication, since visual differences do not necessarily ensure semantic divergence. In this paper, we propose a diffusion-based semantic communication framework, named DiffSem, that optimizes semantic information reconstruction through a diffusion mechanism with self-referential label embedding to significantly improve task performance. Our model also compensates for channel noise and adopts semantic information distortion to ensure the robustness of the system in various signal-to-noise ratio environments. To evaluate the attacker's effectiveness, we propose a new metric that better quantifies the semantic fidelity of estimations from the adversary. Experimental results based on this criterion show that on the MNIST dataset, DiffSem improves the classification accuracy by 10.03%, and maintains stable performance under dynamic channels. Our results further demonstrate that significant deviation exists between traditional image quality indicators and the leakage of task-relevant semantic information.
[LG-49] Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models
Link: https://arxiv.org/abs/2506.19881
Authors: Aloni Cohen
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Are there any conditions under which a generative model’s outputs are guaranteed not to infringe the copyrights of its training data? This is the question of “provable copyright protection” first posed by Vyas, Kakade, and Barak (ICML 2023). They define near access-freeness (NAF) and propose it as sufficient for protection. This paper revisits the question and establishes new foundations for provable copyright protection – foundations that are firmer both technically and legally. First, we show that NAF alone does not prevent infringement. In fact, NAF models can enable verbatim copying, a blatant failure of copy protection that we dub being tainted. Then, we introduce our blameless copy protection framework for defining meaningful guarantees, and instantiate it with clean-room copy protection. Clean-room copy protection allows a user to control their risk of copying by behaving in a way that is unlikely to copy in a counterfactual clean-room setting. Finally, we formalize a common intuition about differential privacy and copyright by proving that DP implies clean-room copy protection when the dataset is golden, a copyright deduplication requirement.
[LG-50] First-order methods for stochastic and finite-sum convex optimization with deterministic constraints
Link: https://arxiv.org/abs/2506.20630
Authors: Zhaosong Lu, Yifeng Xiao
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 41 pages
Abstract:In this paper, we study a class of stochastic and finite-sum convex optimization problems with deterministic constraints. Existing methods typically aim to find an \epsilon-expectedly feasible stochastic optimal solution, in which the expected constraint violation and expected optimality gap are both within a prescribed tolerance \epsilon. However, in many practical applications, constraints must be nearly satisfied with certainty, rendering such solutions potentially unsuitable due to the risk of substantial violations. To address this issue, we propose stochastic first-order methods for finding an \epsilon-surely feasible stochastic optimal (\epsilon-SFSO) solution, where the constraint violation is deterministically bounded by \epsilon and the expected optimality gap is at most \epsilon. Our methods apply an accelerated stochastic gradient (ASG) scheme or a modified variance-reduced ASG scheme only once to a sequence of quadratic penalty subproblems with appropriately chosen penalty parameters. We establish first-order oracle complexity bounds for the proposed methods in computing an \epsilon-SFSO solution. As a byproduct, we also derive first-order oracle complexity results for the sample average approximation method in computing an \epsilon-SFSO solution of the stochastic optimization problem using our proposed methods to solve the sample average problem.
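For intuition, a quadratic penalty subproblem for constraints g(x) <= 0 typically takes the generic form below; the paper's actual construction (the ASG solvers and the penalty schedule) is more involved.

```latex
\min_{x}\; f(x) + \frac{\rho_k}{2}\,\bigl\| [\, g(x) \,]_+ \bigr\|^2,
\qquad [u]_+ := \max(u, 0)\ \text{componentwise},
```

with the penalty parameter \rho_k chosen large enough across subproblems that the final iterate's constraint violation is deterministically below \epsilon.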
[LG-51] LARP: Learner-Agnostic Robust Data Prefiltering
Link: https://arxiv.org/abs/2506.20573
Authors: Kristian Minchev, Dimitar Iliev Dimitrov, Nikola Konstantinov
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:The widespread availability of large public datasets is a key factor behind the recent successes of statistical inference and machine learning methods. However, these datasets often contain some low-quality or contaminated data, to which many learning procedures are sensitive. Therefore, the question of whether and how public datasets should be prefiltered to facilitate accurate downstream learning arises. On a technical level this requires the construction of principled data prefiltering methods which are learner-agnostic robust, in the sense of provably protecting a set of pre-specified downstream learners from corrupted data. In this work, we formalize the problem of Learner-Agnostic Robust data Prefiltering (LARP), which aims at finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We first instantiate our framework in the context of scalar mean estimation with Huber estimators under the Huber data contamination model. We provide a hardness result on a specific problem instance and analyze several natural prefiltering procedures. Our theoretical results indicate that performing LARP on a heterogeneous set of learners leads to some loss in model performance compared to the alternative of prefiltering data for each learner/use-case individually. We explore the resulting utility loss and its dependence on the problem parameters via extensive experiments on real-world image and tabular data, observing statistically significant reduction in utility. Finally, we model the trade-off between the utility drop and the cost of repeated (learner-specific) prefiltering within a game-theoretic framework and showcase benefits of LARP for large datasets.
[LG-52] Reinforcement Learning Increases Wind Farm Power Production by Enabling Closed-Loop Collaborative Control
Link: https://arxiv.org/abs/2506.20554
Authors: Andrew Mole, Max Weissenbacher, Georgios Rigas, Sylvain Laizet
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Traditional wind farm control operates each turbine independently to maximize individual power output. However, coordinated wake steering across the entire farm can substantially increase the combined wind farm energy production. Although dynamic closed-loop control has proven effective in flow control applications, wind farm optimization has relied primarily on static, low-fidelity simulators that ignore critical turbulent flow dynamics. In this work, we present the first reinforcement learning (RL) controller integrated directly with high-fidelity large-eddy simulation (LES), enabling real-time response to atmospheric turbulence through collaborative, dynamic control strategies. Our RL controller achieves a 4.30% increase in wind farm power output compared to baseline operation, nearly doubling the 2.19% gain from static optimal yaw control obtained through Bayesian optimization. These results establish dynamic flow-responsive control as a transformative approach to wind farm optimization, with direct implications for accelerating renewable energy deployment to net-zero targets.
[LG-53] Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery
Link: https://arxiv.org/abs/2506.20533
Authors: Gilad Lerman, Kang Li, Tyler Maunu, Teng Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Robust subspace estimation is fundamental to many machine learning and data analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to this problem, yet its theoretical properties remain poorly understood. This paper establishes that, under deterministic conditions, a variant of IRLS with dynamic smoothing regularization converges linearly to the underlying subspace from any initialization. We extend these guarantees to affine subspace estimation, a setting that lacks prior recovery theory. Additionally, we illustrate the practical benefits of IRLS through an application to low-dimensional neural network training. Our results provide the first global convergence guarantees for IRLS in robust subspace recovery and, more broadly, for nonconvex IRLS on a Riemannian manifold.
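The IRLS iteration itself is short enough to state in full. The sketch below alternates a weighted PCA step with residual-based reweighting and shrinks the smoothing parameter over iterations, loosely mirroring the dynamic smoothing regularization analyzed in the paper; the exact schedule and stopping rule here are placeholders.

```python
import numpy as np

def irls_subspace(X: np.ndarray, d: int, n_iter: int = 100,
                  delta0: float = 1.0, shrink: float = 0.9) -> np.ndarray:
    """Estimate a d-dim subspace (columns of U) from rows of X, robust to outliers."""
    w = np.ones(X.shape[0])
    delta = delta0
    for _ in range(n_iter):
        C = (X * w[:, None]).T @ X              # weighted scatter matrix
        _, V = np.linalg.eigh(C)
        U = V[:, -d:]                           # top-d eigenvectors
        resid = np.linalg.norm(X - (X @ U) @ U.T, axis=1)
        delta *= shrink                         # dynamic smoothing
        w = 1.0 / np.maximum(resid, delta)      # downweight large residuals
    return U

rng = np.random.default_rng(1)
inliers = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))  # rank-2 data
outliers = 5.0 * rng.normal(size=(40, 10))
U = irls_subspace(np.vstack([inliers, outliers]), d=2)
```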
[LG-54] Fast ground penetrating radar dual-parameter full waveform inversion method accelerated by hybrid compilation of CUDA kernel function and PyTorch
Link: https://arxiv.org/abs/2506.20513
Authors: Lei Liu, Chao Song, Liangsheng He, Silin Wang, Xuan Feng, Cai Liu
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:This study proposes a high-performance dual-parameter full waveform inversion (FWI) framework for ground-penetrating radar (GPR), accelerated through the hybrid compilation of CUDA kernel functions and PyTorch. The method leverages the computational efficiency of GPU programming while preserving the flexibility and usability of Python-based deep learning frameworks. By integrating customized CUDA kernels into PyTorch's automatic differentiation mechanism, the framework enables accurate and efficient inversion of both dielectric permittivity and electrical conductivity. Experimental evaluations on synthetic data and real wavefield data demonstrate that the proposed method achieves dual-parameter FWI for GPR data while maintaining high accuracy. Moreover, the framework is flexible and extensible, supporting optional regularization strategies such as total variation and multi-scale inversion. These features make the proposed approach a practical and scalable framework for rapid GPR-based subsurface imaging in applications including civil engineering, environmental monitoring, and geophysical exploration.
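At its core, the hybrid-compilation pattern is a custom torch.autograd.Function whose forward launches a compiled CUDA kernel and whose backward launches the corresponding adjoint kernel. The sketch below substitutes a trivial differentiable toy operator for the FDTD kernels (real kernels would be built with torch.utils.cpp_extension); the wiring into autograd is the part being illustrated.

```python
import torch
from torch.autograd import Function

def toy_kernel(eps, sig):
    # stand-in for a CUDA forward-modeling kernel over the two parameters
    return (eps ** 2 + sig).sum()

class ForwardModel(Function):
    @staticmethod
    def forward(ctx, eps, sig):
        ctx.save_for_backward(eps, sig)
        return toy_kernel(eps, sig)

    @staticmethod
    def backward(ctx, grad_out):
        eps, sig = ctx.saved_tensors
        # stand-in for the adjoint kernel: analytic gradients of toy_kernel
        return grad_out * 2 * eps, grad_out * torch.ones_like(sig)

eps = torch.rand(64, 64, requires_grad=True)  # dielectric permittivity model
sig = torch.rand(64, 64, requires_grad=True)  # electrical conductivity model
misfit = ForwardModel.apply(eps, sig)
misfit.backward()  # autograd now yields dual-parameter FWI gradients
```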
[LG-55] Scalable Subset Selection in Linear Mixed Models
Link: https://arxiv.org/abs/2506.20425
Authors: Ryan Thompson, Matt P. Wand, Joanna J. J. Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments:
Abstract:Linear mixed models (LMMs), which incorporate fixed and random effects, are key tools for analyzing heterogeneous data, such as in personalized medicine or adaptive marketing. Nowadays, this type of data is increasingly wide, sometimes containing thousands of candidate predictors, necessitating sparsity for prediction and interpretation. However, existing sparse learning methods for LMMs do not scale well beyond tens or hundreds of predictors, leaving a large gap compared with sparse methods for linear models, which ignore random effects. This paper closes the gap with a new \ell_0 regularized method for LMM subset selection that can run on datasets containing thousands of predictors in seconds to minutes. On the computational front, we develop a coordinate descent algorithm as our main workhorse and provide a guarantee of its convergence. We also develop a local search algorithm to help traverse the nonconvex optimization surface. Both algorithms readily extend to subset selection in generalized LMMs via a penalized quasi-likelihood approximation. On the statistical front, we provide a finite-sample bound on the Kullback-Leibler divergence of the new method. We then demonstrate its excellent performance in synthetic experiments and illustrate its utility on two datasets from biology and journalism.
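To show the computational core, here is a bare-bones \ell_0-regularized coordinate descent for a plain linear model, i.e., fixed effects only; the paper's method additionally handles random effects via a penalized quasi-likelihood approximation and adds a local-search step, neither of which is sketched here.

```python
import numpy as np

def l0_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_0 coordinate-wise."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]           # put coordinate j back
            b = (X[:, j] @ resid) / col_sq[j]    # unpenalized optimum
            gain = col_sq[j] * b * b / (2 * n)   # loss reduction if kept
            beta[j] = b if gain > lam else 0.0   # hard-thresholding
            resid -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=500)  # 5 true predictors
print(np.nonzero(l0_coordinate_descent(X, y, lam=0.05))[0])
```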
[LG-56] POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes
Link: https://arxiv.org/abs/2506.20406
Authors: Ruijia Zhang, Zhengling Qi, Yue Wu, Xiangyu Zhang, Yanxun Xu
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.
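The pessimistic penalty at the heart of POLAR admits a one-line caricature: subtract a multiple of the model's uncertainty from the estimated reward of each history-action pair. The snippet below uses ensemble disagreement as the uncertainty proxy; POLAR's actual uncertainty quantification and penalty scaling are its own.

```python
import numpy as np

def ensemble_disagreement(next_state_preds: np.ndarray) -> float:
    # next_state_preds: (n_models, state_dim) predictions for one (history, action)
    return float(np.linalg.norm(next_state_preds.std(axis=0)))

def pessimistic_reward(r_hat: float, next_state_preds: np.ndarray,
                       lam: float = 1.0) -> float:
    # penalize actions whose dynamics the offline data pins down poorly
    return r_hat - lam * ensemble_disagreement(next_state_preds)

preds = np.random.default_rng(0).normal(size=(5, 8))  # 5-model ensemble
print(pessimistic_reward(1.0, preds, lam=0.5))
```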
[LG-57] A Complete Loss Landscape Analysis of Regularized Deep Matrix Factorization
Link: https://arxiv.org/abs/2506.20344
Authors: Po Chen, Rujun Jiang, Peng Wang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 35 pages, 3 figures
Abstract:Despite its wide range of applications across various domains, the optimization foundations of deep matrix factorization (DMF) remain largely open. In this work, we aim to fill this gap by conducting a comprehensive study of the loss landscape of the regularized DMF problem. Toward this goal, we first provide a closed-form expression of all critical points. Building on this, we establish precise conditions under which a critical point is a local minimizer, a global minimizer, a strict saddle point, or a non-strict saddle point. Leveraging these results, we derive a necessary and sufficient condition under which each critical point is either a local minimizer or a strict saddle point. This provides insights into why gradient-based methods almost always converge to a local minimizer of the regularized DMF problem. Finally, we conduct numerical experiments to visualize its loss landscape under different settings to support our theory.
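For reference, a standard form of the regularized DMF objective in this line of work is the following (whether the paper's setup matches this exactly, e.g. with a data matrix multiplying the product, is an assumption here):

```latex
\min_{W_1, \dots, W_L}\;
\frac{1}{2}\,\bigl\| W_L W_{L-1} \cdots W_1 - M \bigr\|_F^2
\;+\; \frac{\lambda}{2} \sum_{l=1}^{L} \| W_l \|_F^2
```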
[LG-58] OLALa: Online Learned Adaptive Lattice Codes for Heterogeneous Federated Learning
Link: https://arxiv.org/abs/2506.20297
Authors: Natalie Lang, Maya Simhi, Nir Shlezinger
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Under review for publication in the IEEE
Abstract:Federated learning (FL) enables collaborative training across distributed clients without sharing raw data, often at the cost of substantial communication overhead induced by transmitting high-dimensional model updates. This overhead can be alleviated by having the clients quantize their model updates, with dithered lattice quantizers identified as an attractive scheme due to its structural simplicity and convergence-preserving properties. However, existing lattice-based FL schemes typically rely on a fixed quantization rule, which is suboptimal in heterogeneous and dynamic environments where the model updates distribution varies across users and training rounds. In this work, we propose Online Learned Adaptive Lattices (OLALa), a heterogeneous FL framework where each client can adjust its quantizer online using lightweight local computations. We first derive convergence guarantees for FL with non-fixed lattice quantizers and show that proper lattice adaptation can tighten the convergence bound. Then, we design an online learning algorithm that enables clients to tune their quantizers throughout the FL process while exchanging only a compact set of quantization parameters. Numerical experiments demonstrate that OLALa consistently improves learning performance under various quantization rates, outperforming conventional fixed-codebook and non-adaptive schemes.
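For readers unfamiliar with dithered lattice quantization, the sketch below shows the subtractive-dither scheme on the simplest possible lattice, a scaled integer lattice; OLALa's contribution, learning the lattice generator online per client, is not reproduced here. The dither is drawn from a PRNG whose seed the client and server share, so the server can subtract it after decoding.

```python
import numpy as np

def dithered_quantize(x: np.ndarray, step: float, rng) -> np.ndarray:
    """Subtractive dithered quantization on the lattice step * Z^d."""
    dither = rng.uniform(-step / 2, step / 2, size=x.shape)
    q = step * np.round((x + dither) / step)  # nearest lattice point
    return q - dither  # error is uniform on a cell and independent of x

rng = np.random.default_rng(42)  # seed shared between client and server
update = np.random.default_rng(0).normal(size=1000)  # a model update
recovered = dithered_quantize(update, step=0.5, rng=rng)
print(np.abs(recovered - update).max() <= 0.25)  # error bounded by step / 2
```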
[LG-59] Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives
Link: https://arxiv.org/abs/2506.20114
Authors: Brian Liu, Rahul Mazumder, Peter Radchenko
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Tree ensembles are non-parametric methods widely recognized for their accuracy and ability to capture complex interactions. While these models excel at prediction, they are difficult to interpret and may fail to uncover useful relationships in the data. We propose an estimator to extract compact sets of decision rules from tree ensembles. The extracted models are accurate and can be manually examined to reveal relationships between the predictors and the response. A key novelty of our estimator is the flexibility to jointly control the number of rules extracted and the interaction depth of each rule, which improves accuracy. We develop a tailored exact algorithm to efficiently solve optimization problems underlying our estimator and an approximate algorithm for computing regularization paths, sequences of solutions that correspond to varying model sizes. We also establish novel non-asymptotic prediction error bounds for our proposed approach, comparing it to an oracle that chooses the best data-dependent linear combination of the rules in the ensemble subject to the same complexity constraint as our estimator. The bounds illustrate that the large-sample predictive performance of our estimator is on par with that of the oracle. Through experiments, we demonstrate that our estimator outperforms existing algorithms for rule extraction.
[LG-60] Machine-Learning-Assisted Photonic Device Development: A Multiscale Approach from Theory to Characterization
Link: https://arxiv.org/abs/2506.20056
Authors: Yuheng Chen, Alexander Montes McNeil, Taehyuk Park, Blake A. Wilson, Vaishnavi Iyer, Michael Bezick, Jae-Ik Choi, Rohan Ojha, Pravin Mahendran, Daksh Kumar Singh, Geetika Chitturi, Peigang Chen, Trang Do, Alexander V. Kildishev, Vladimir M. Shalaev, Michael Moebius, Wenshan Cai, Yongmin Liu, Alexandra Boltasseva
Subjects: Optics (physics.optics); Machine Learning (cs.LG)
Comments:
Abstract:Photonic device development (PDD) has achieved remarkable success in designing and implementing new devices for controlling light across various wavelengths, scales, and applications, including telecommunications, imaging, sensing, and quantum information processing. PDD is an iterative, five-step process that consists of: i) deriving device behavior from design parameters, ii) simulating device performance, iii) finding the optimal candidate designs from simulations, iv) fabricating the optimal device, and v) measuring device performance. Classically, all these steps involve Bayesian optimization, material science, control theory, and direct physics-driven numerical methods. However, many of these techniques are computationally intractable, monetarily costly, or difficult to implement at scale. In addition, PDD suffers from large optimization landscapes, uncertainties in structural or optical characterization, and difficulties in implementing robust fabrication processes. However, the advent of machine learning over the past decade has provided novel, data-driven strategies for tackling these challenges, including surrogate estimators for speeding up computations, generative modeling for noisy measurement modeling and data augmentation, reinforcement learning for fabrication, and active learning for experimental physical discovery. In this review, we present a comprehensive perspective on these methods to enable machine-learning-assisted PDD (ML-PDD) for efficient design optimization with powerful generative models, fast simulation and characterization modeling under noisy measurements, and reinforcement learning for fabrication. This review will provide researchers from diverse backgrounds with valuable insights into this emerging topic, fostering interdisciplinary efforts to accelerate the development of complex photonic devices and systems.
[LG-61] A Principled Path to Fitted Distributional Evaluation
Link: https://arxiv.org/abs/2506.20048
Authors: Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond Ka Wai Wong
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted-Q evaluation – developed for expectation-based reinforcement learning – to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.
[LG-62] PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning
Link: https://arxiv.org/abs/2506.20043
Authors: Ahmet Sarigun, Bora Uyar, Vedran Franke, Altuna Akalin
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract:Sampling physically valid ligand-binding poses remains a major challenge in molecular docking, particularly for unseen or structurally diverse targets. We introduce PocketVina, a fast and memory-efficient, search-based docking framework that combines pocket prediction with systematic multi-pocket exploration. We evaluate PocketVina across four established benchmarks–PDBbind2020 (timesplit and unseen), DockGen, Astex, and PoseBusters–and observe consistently strong performance in sampling physically valid docking poses. PocketVina achieves state-of-the-art performance when jointly considering ligand RMSD and physical validity (PB-valid), while remaining competitive with deep learning-based approaches in terms of RMSD alone, particularly on structurally diverse and previously unseen targets. PocketVina also maintains state-of-the-art physically valid docking accuracy across ligands with varying degrees of flexibility. We further introduce TargetDock-AI, a benchmarking dataset we curated, consisting of over 500000 protein-ligand pairs, and a partition of the dataset labeled with PubChem activity annotations. On this large-scale dataset, PocketVina successfully discriminates active from inactive targets, outperforming a deep learning baseline while requiring significantly less GPU memory and runtime. PocketVina offers a robust and scalable docking strategy that requires no task-specific training and runs efficiently on standard GPUs, making it well-suited for high-throughput virtual screening and structure-based drug discovery.
[LG-63] Data-Driven Dynamic Factor Modeling via Manifold Learning
Link: https://arxiv.org/abs/2506.19945
Authors: Graeme Baker, Agostino Capponi, J. Antonio Sidaoui
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:We propose a data-driven dynamic factor framework where a response variable depends on a high-dimensional set of covariates, without imposing any parametric model on the joint dynamics. Leveraging Anisotropic Diffusion Maps, a nonlinear manifold learning technique introduced by Singer and Coifman, our framework uncovers the joint dynamics of the covariates and responses in a purely data-driven way. We approximate the embedding dynamics using linear diffusions, and exploit Kalman filtering to predict the evolution of the covariates and response variables directly from the diffusion map embedding space. We generalize Singer’s convergence rate analysis of the graph Laplacian from the case of independent uniform samples on a compact manifold to the case of time series arising from Langevin diffusions in Euclidean space. Furthermore, we provide rigorous justification for our procedure by showing the robustness of approximations of the diffusion map coordinates by linear diffusions, and the convergence of ergodic averages under standard spectral assumptions on the underlying dynamics. We apply our method to the stress testing of equity portfolios using a combination of financial and macroeconomic factors from the Federal Reserve’s supervisory scenarios. We demonstrate that our data-driven stress testing method outperforms standard scenario analysis and Principal Component Analysis benchmarks through historical backtests spanning three major financial crises, achieving reductions in mean absolute error of up to 55% and 39% for scenario-based portfolio return prediction, respectively.
[LG-64] Supervised Similarity for Firm Linkages
Link: https://arxiv.org/abs/2506.19856
Authors: Ryan Samson, Adrian Banner, Luca Candelori, Sebastien Cottrell, Tiziana Di Matteo, Paul Duchnowski, Vahagn Kirakosyan, Jose Marques, Kharen Musaelian, Stefano Pasquali, Ryan Stever, Dario Villani
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:
Abstract:We introduce a novel proxy for firm linkages, Characteristic Vector Linkages (CVLs). We use this concept to estimate firm linkages, first through Euclidean similarity, and then by applying Quantum Cognition Machine Learning (QCML) to similarity learning. We demonstrate that both methods can be used to construct profitable momentum spillover trading strategies, but QCML similarity outperforms the simpler Euclidean similarity.
[LG-65] Neural networks for the prediction of peel force for skin adhesive interface using FEM simulation
Link: https://arxiv.org/abs/2506.19855
Authors: Ashish Masarkar, Rakesh Gupta, Naga Neehar Dingari, Beena Rai
Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
Comments:
Abstract:Studying the peeling behaviour of adhesives on skin is vital for advancing biomedical applications such as medical adhesives and transdermal patches. Traditional methods like experimental testing and the finite element method (FEM), though considered gold standards, are resource-intensive, computationally expensive and time-consuming, particularly when analysing a wide material parameter space. In this study, we present a neural network-based approach to predict the minimum peel force (F_min) required for adhesive detachment from skin tissue, limiting the need for repeated FEM simulations and significantly reducing the computational cost. Leveraging a dataset generated from FEM simulations of a 90 degree peel test with varying adhesive and fracture mechanics parameters, our neural network model achieved high accuracy, validated through rigorous 5-fold cross-validation. The final architecture was able to predict a wide variety of skin-adhesive peeling behaviour, exhibiting a mean squared error (MSE) of 3.66*10^-7 and an R^2 score of 0.94 on the test set, demonstrating robust performance. This work introduces a reliable, computationally efficient method for predicting adhesive behaviour, significantly reducing simulation time while maintaining accuracy. This integration of machine learning with high-fidelity biomechanical simulations enables efficient design and optimization of skin-adhesive systems, providing a scalable framework for future research in computational dermato-mechanics and bio-adhesive material design.
[LG-66] Finite-Time Information-Theoretic Bounds in Queueing Control
Link: https://arxiv.org/abs/2506.18278
Authors: Yujie Liu, Vincent Y. F. Tan, Yunbei Xu
Subjects: Optimization and Control (math.OC); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:We establish the first finite-time information-theoretic lower bounds, and derive new policies that achieve them, for the total queue length in scheduling problems over stochastic processing networks with both adversarial and stochastic arrivals. Prior analyses of MaxWeight guarantee only stability and asymptotic optimality in heavy traffic; we prove that, at finite horizons, MaxWeight can incur strictly larger backlog by problem-dependent factors which we identify. Our main innovations are 1) a minimax framework that pinpoints the precise problem parameters governing any policy’s finite-time performance; 2) an information-theoretic lower bound on total queue length; 3) a fundamental limitation of MaxWeight, namely that it is suboptimal in finite time; and 4) a new scheduling rule that minimizes the full Lyapunov drift, including its second-order term, thereby matching the lower bound under certain conditions, up to universal constants. These findings reveal a fundamental limitation on “drift-only” methods and point the way toward principled, non-asymptotic optimality in queueing control.
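For context, the MaxWeight rule whose finite-time suboptimality the paper establishes picks, at each slot, the feasible schedule with the largest queue-weighted service rate; a toy version follows. The paper's improved rule, which also minimizes the second-order Lyapunov drift term, is not reproduced here.

```python
def maxweight(queues, rates, feasible_schedules):
    """queues[i]: backlog of queue i; rates[i]: service rate if scheduled.

    feasible_schedules: tuples of queue indices that can be served
    simultaneously (e.g., independent sets in a conflict graph).
    """
    weight = lambda s: sum(queues[i] * rates[i] for i in s)
    return max(feasible_schedules, key=weight)

queues = [7, 3, 5]
rates = [1.0, 2.0, 1.5]
schedules = [(0,), (1, 2)]  # queues 1 and 2 can be served together
print(maxweight(queues, rates, schedules))  # (1, 2): weight 13.5 beats 7.0
```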
Information Retrieval
[IR-0] Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search SIGIR2023
Link: https://arxiv.org/abs/2506.20330
Authors: Zhigong Zhou, Ning Ding, Xiaochuan Fan, Yue Shang, Yiming Qiu, Jingwei Zhuo, Zhiwei Ge, Songlin Wang, Lin Liu, Sulong Xu, Han Zhang
Subjects: Information Retrieval (cs.IR)
Comments: Published in SIGIR 2023
Abstract:Semantic retrieval, which retrieves semantically matched items given a textual query, has been an essential component to enhance system effectiveness in e-commerce search. In this paper, we study the multimodal retrieval problem, where the visual information (e.g., image) of an item is leveraged as a supplement to textual information to enrich the item representation and further improve retrieval performance. Though learning from cross-modality data has been studied extensively in tasks such as visual question answering or media summarization, multimodal retrieval remains a non-trivial and unsolved problem, especially in the asymmetric scenario where the query is unimodal while the item is multimodal. In this paper, we propose a novel model named SMAR, which stands for Semantic-enhanced Modality-Asymmetric Retrieval, to tackle the problem of modality fusion and alignment in this kind of asymmetric scenario. Extensive experimental results on an industrial dataset show that the proposed model outperforms baseline models significantly in retrieval accuracy. We have open sourced our industrial dataset for the sake of reproducibility and future research works.
[IR-1] A Literature Review on Simulation in Conversational Recommender Systems
Link: https://arxiv.org/abs/2506.20291
Authors: Haoran Zhang, Xin Zhao, Jinze Chen, Junpeng Guo
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 6 pages, 1 figure, accepted as a poster for CSWIM 2025
Abstract:Conversational Recommender Systems (CRSs) have garnered attention as a novel approach to delivering personalized recommendations through multi-turn dialogues. This review developed a taxonomy framework to systematically categorize relevant publications into four groups: dataset construction, algorithm design, system evaluation, and empirical studies, providing a comprehensive analysis of simulation methods in CRSs research. Our analysis reveals that simulation methods play a key role in tackling CRSs’ main challenges. For example, LLM-based simulation methods have been used to create conversational recommendation data, enhance CRSs algorithms, and evaluate CRSs. Although several challenges persist due to the complexity of Human-Computer Interaction (HCI) in CRSs, such as dataset bias, the limited output flexibility of LLM-based simulations, and the gap between text semantic space and behavioral semantics, simulation methods hold significant potential for advancing CRS research. This review offers a thorough summary of the current research landscape in this domain and identifies promising directions for future inquiry.
[IR-2] Controlled Retrieval-augmented Context Evaluation for Long-form RAG
Link: https://arxiv.org/abs/2506.20051
Authors: Jia-Huei Ju, Suzan Verberne, Maarten de Rijke, Andrew Yates
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect the retrieval’s impact on the final RAG result, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation and propose metrics for assessing the context independent of generation. We introduce CRUX, a Controlled Retrieval-aUgmented conteXt evaluation framework designed to directly assess retrieval-augmented contexts. This framework uses human-written summaries to control the information scope of knowledge, enabling us to measure how well the context covers information essential for long-form generation. CRUX uses question-based evaluation to assess RAG’s retrieval in a fine-grained manner. Empirical results show that CRUX offers more reflective and diagnostic evaluation. Our findings also reveal substantial room for improvement in current retrieval methods, pointing to promising directions for advancing RAG’s retrieval. Our data and code are publicly available to support and advance future research on retrieval.