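This post lists the latest papers retrieved from arXiv.org on 2025-06-23. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each day.
Friendly reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-06-23)
791 papers were updated today, including:
- Natural language processing: 117 papers (Computation and Language (cs.CL))
- Artificial intelligence: 232 papers (Artificial Intelligence (cs.AI))
- Computer vision: 166 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 271 papers (Machine Learning (cs.LG))
Natural Language Processing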
[NLP-0] Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
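[Quick read]: This paper addresses the fact that fine-tuning a general-purpose large language model (LLM) weakens its safety alignment, even when the fine-tuning data contains no harmful content. Malicious actors can exploit this loss of safety to bypass guardrails. The paper argues that the key to a solution is reliable, reproducible safety evaluation that can detect and mitigate the safety risks of fine-tuning. By studying how robust a safety benchmark is to trivial variations in the experimental procedure and to model stochasticity, it shows that current safety evaluation results vary substantially, which poses a serious challenge for reporting and comparing results in future research.
Link: https://arxiv.org/abs/2506.17209
Authors: Kathleen C. Fraser,Hillary Dawkins,Isar Nejadgholi,Svetlana Kiritchenko
Affiliations: National Research Council Canada
Categories: Computation and Language (cs.CL)
Comments: to appear at LLMSEC 2025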
Abstract:Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the “attack”. Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental procedure, and the stochastic nature of LLMs. Our initial experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup. Our observations have serious implications for how researchers in this field should report results to enable meaningful comparisons in the future.
[NLP-1] Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
[Quick read]: This paper addresses the lack of transparency in the architecture and implementation details of large language model (LLM)-based repair systems in Automated Program Repair (APR). Its key contribution is the first comprehensive analysis of all submissions to the SWE-Bench Lite and SWE-Bench Verified leaderboards, which reveals their diversity in submitter type, product availability, LLM usage, and system architecture, and in particular the dominance of proprietary LLMs (such as Claude 3.5/3.7) and the coexistence of agentic and non-agentic designs.
Link: https://arxiv.org/abs/2506.17208
Authors: Matias Martinez,Xavier Franch
Affiliations: Universitat Politècnica de Catalunya
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
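[NLP-2] Towards AI Search Paradigm
[Quick read]: This paper addresses the limitations of traditional search systems in handling complex information needs, especially multi-stage reasoning tasks and dynamic adaptation to different information needs. The key to its solution is a modular architecture of four LLM-powered agents (Master, Planner, Executor, and Writer) that collaborate through coordinated workflows to evaluate query complexity, decompose problems, orchestrate tool usage, execute tasks, and generate content, yielding a next-generation search system that can emulate human information processing and decision-making.
Link: https://arxiv.org/abs/2506.17188
Authors: Yuchen Li,Hengyi Cai,Rui Kong,Xinran Chen,Jiamin Chen,Jun Yang,Haojie Zhang,Jiayi Li,Jiayi Wu,Yiqun Chen,Changle Qu,Keyi Kong,Wenwen Ye,Lixin Su,Xinyu Ma,Long Xia,Daiting Shi,Jiashu Zhao,Haoyi Xiong,Shuaiqiang Wang,Dawei Yin
Affiliations: Baidu Search
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: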
Abstract:In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.
[NLP-3] CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models
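[Quick read]: This paper addresses the weakness of language models in judging whether one statement causally explains another, that is, in distinguishing semantic relatedness from genuine causal explanatory relationships. The key to its solution is the CLEAR-3K dataset of 3,000 assertion-reasoning questions for evaluating language models on causal reasoning; a comprehensive evaluation on this dataset exposes the limits of current models' causal reasoning and shows how parameter scale affects their judgment tendencies.
Link: https://arxiv.org/abs/2506.17180
Authors: Naiming Liu,Richard Baraniuk,Shashank Sonkar
Affiliations: Rice University; University of Central Florida
Categories: Computation and Language (cs.CL)
Comments: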
Abstract:We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews Correlation Coefficient plateaus at just 0.55, even for the best-performing models. CLEAR-3K provides a crucial benchmark for developing and evaluating genuine causal reasoning in language models, which is an essential capability for applications that require accurate assessment of causal relationships.
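The abstract reports performance as a Matthews Correlation Coefficient (MCC). As a reminder of what that number measures, here is a minimal sketch of the standard binary MCC formula. The label lists are made-up illustrations, not CLEAR-3K data, and the convention of returning 0.0 when the denominator is zero is an assumption of this sketch.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical gold labels: 1 = "the reason causally explains the assertion".
labels = [1, 1, 0, 0, 1, 0]
always_yes = [1] * len(labels)   # an excessively permissive model
print(matthews_corrcoef(labels, always_yes))  # 0.0
print(matthews_corrcoef(labels, labels))      # 1.0
```

Because MCC uses all four cells of the confusion matrix, a model that accepts every causal claim scores 0 regardless of how many positives it catches, which makes the metric a natural fit for the skeptical-versus-permissive shift the abstract describes.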
[NLP-4] Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
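[Quick read]: This paper addresses the growing memory cost of the key-value (KV) cache when long-context language models handle tasks such as book summarization. The key to its solution is a unified metric, the KV footprint, which accounts for both the number of KV entries stored and their lifespan in memory and so evaluates the memory efficiency of different methods more completely. Using this metric, the paper exposes the high peak memory of existing KV eviction methods and proposes improvements: adapting post-fill eviction methods to evict KVs during pre-filling, and introducing PruLong, which learns through end-to-end optimization which attention heads need to retain the full KV cache, substantially reducing the KV footprint while preserving long-context performance.
Link: https://arxiv.org/abs/2506.17121
Authors: Adithya Bhaskar,Alexander Wettig,Tianyu Gao,Yihe Dong,Danqi Chen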
Affiliations: Princeton University
Categories: Computation and Language (cs.CL)
Comments: We release our code publicly at this https URL
Abstract:Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the KV footprint as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods – post-fill eviction – has a high footprint due to being incompatible with eviction during pre-filling. We adapt these methods to be able to evict KVs during pre-filling, achieving substantially lower KV footprints. We then turn to recency eviction methods, wherein we propose PruLong, an end-to-end optimization method for learning which attention heads need to retain the full KV cache and which do not. PruLong saves memory while preserving long-context performance, achieving 12% smaller KV footprint than prior methods while retaining performance in challenging recall tasks. Our paper clarifies the complex tangle of long-context inference methods and paves the way for future development to minimize the KV footprint.
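To make the idea of a footprint that counts both how many KV entries are stored and how long they live more concrete, here is a minimal sketch. The normalization against a never-evicting cache, the function name `kv_footprint`, and the step-based bookkeeping are assumptions made for illustration; the paper's exact definition may differ.

```python
def kv_footprint(created, evicted, total_steps):
    """Fraction of KV "entry-steps" held in memory relative to a full cache.

    created[i]  : step at which KV entry i enters the cache
    evicted[i]  : step at which it is discarded, or None if it is never evicted
    total_steps : number of steps in the whole run (pre-filling plus decoding)
    """
    held = 0   # entry-steps actually kept in memory
    full = 0   # entry-steps a cache with no eviction would keep
    for start, end in zip(created, evicted):
        stop = total_steps if end is None else end
        held += stop - start
        full += total_steps - start
    return held / full

# Hypothetical run: four tokens arrive at steps 0..3 and the run lasts 4 steps.
created = [0, 1, 2, 3]
no_eviction = [None, None, None, None]
early_eviction = [2, 3, None, None]   # the first two entries are dropped early
print(kv_footprint(created, no_eviction, 4))     # 1.0
print(kv_footprint(created, early_eviction, 4))  # 0.7
```

Under this toy definition, a method that only evicts after pre-filling finishes keeps every entry alive for most of the run, so its footprint stays close to 1.0 even if its final cache is small, which is the effect the abstract attributes to post-fill eviction.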
[NLP-5] MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
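[Quick read]: This paper addresses the difficulty of building a unified framework for multimodal reasoning as input modalities and task complexity grow more diverse. The key to its solution is MEXA, a training-free framework that uses a modality- and task-aware aggregation strategy to dynamically select expert models matching the input modality and task demands, and then uses a Large Reasoning Model (LRM) to reason over the interpretable textual reasoning outputs they generate, enabling effective multimodal reasoning across diverse domains.
Link: https://arxiv.org/abs/2506.17113
Authors: Shoubin Yu,Yue Zhang,Ziyang Wang,Jaehong Yoon,Mohit Bansal
Affiliations: UNC Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The first two authors contributed equally; Github link: this https URL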
Abstract:Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
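[NLP-6] Are Bias Evaluation Methods Biased? ACL2025
[Quick read]: This paper examines how robust safety benchmarks for large language models are, in particular how differences between bias evaluation methods affect model rankings. The key to its approach is to rank a set of representative models with several different evaluation methods and compare how similar the rankings are, revealing the inconsistency of existing benchmarks and yielding recommendations for the community on how to use them.
Link: https://arxiv.org/abs/2506.17111
Authors: Lina Berrayana,Sean Rooney,Luis Garcés-Erice,Ioana Giurgiu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Workshop GEM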
Abstract:The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared for different aspects of safety such as toxicity, bias, harmful behavior etc. Independent benchmarks adopt different approaches with distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and compare how similar the overall rankings are. We show that different but widely used bias evaluation methods result in disparate model rankings. We conclude with recommendations for the community in the usage of such benchmarks.
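One common way to quantify how similar two model rankings are is Kendall's rank correlation. The abstract does not say which similarity measure the authors use, so the sketch below is only an illustration of the comparison step; the model names and ranks are hypothetical, and ties are not handled.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same models.

    rank_a, rank_b: dicts mapping model name -> rank (1 = best), without ties.
    Returns a value in [-1, 1]; 1 means identical order, -1 means reversed order.
    """
    models = list(rank_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1   # both benchmarks order this pair the same way
        elif sign < 0:
            discordant += 1   # the benchmarks disagree on this pair
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings of four models under two bias benchmarks.
bench_x = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
bench_y = {"model_a": 2, "model_b": 1, "model_c": 4, "model_d": 3}
print(kendall_tau(bench_x, bench_y))
```

Here four of the six model pairs are ordered the same way by both benchmarks and two are swapped, giving a correlation of one third; values well below 1 are the kind of disagreement the abstract calls disparate rankings.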
[NLP-7] Better Language Model Inversion by Compactly Representing Next-Token Distributions
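[Quick read]: This paper addresses language model inversion, that is, recovering hidden prompts using only a language model's outputs. The problem matters for security and accountability in language model deployments, for example because private information could leak from the system message of an API-protected language model. The proposed solution is prompt inversion from logprob sequences (PILS). Its key insight is that a language model's vector-valued outputs occupy a low-dimensional subspace, so a linear map can losslessly compress the full next-token probability distributions over multiple generation steps, which improves inversion accuracy. The method achieves large gains in recovering hidden prompts, with exact recovery rates 2 to 3.5 times higher than the previous state of the art.
Link: https://arxiv.org/abs/2506.17090
Authors: Murtaza Nazir,Matthew Finlayson,John X. Morris,Xiang Ren,Swabha Swayamdipta
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: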
Abstract:Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model’s system message. We propose a new method – prompt inversion from logprob sequences (PILS) – that recovers hidden prompts by gleaning clues from the model’s next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: The vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2–3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5–27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
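The low-dimensional-subspace insight can be illustrated on a toy model. If logits are `W h` for a hidden state `h` of size d, then the log-probabilities equal `W h` minus a constant (the log-sum-exp), so they lie in the span of the d columns of `W` plus an all-ones column. Keeping d + 1 well-chosen entries is then enough to restore the whole vector with a linear solve. The sketch below demonstrates only this linear-algebra fact; the matrix `W`, the hidden state, and the chosen indices are hypothetical, and this is not the paper's actual compression procedure.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def logprobs(W, h):
    """Log-softmax of the logits W h."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    lse = math.log(sum(math.exp(z) for z in logits))
    return [z - lse for z in logits]

# Toy unembedding matrix: 5 tokens, hidden size 2 (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
basis = [row + [1.0] for row in W]      # columns of W plus an all-ones column
full = logprobs(W, [0.3, -0.7])         # the full 5-entry log-probability vector

kept = [0, 1, 2]                        # d + 1 = 3 indices whose basis rows are invertible
coeffs = solve([basis[i] for i in kept], [full[i] for i in kept])
restored = [sum(b * c for b, c in zip(row, coeffs)) for row in basis]
max_err = max(abs(x - y) for x, y in zip(full, restored))
print(len(coeffs), len(full), max_err)
```

In this toy setting three stored numbers reproduce all five log-probabilities up to floating-point error. With a real vocabulary of tens of thousands of tokens and a hidden size of a few thousand, the same argument would suggest a much larger saving per generation step.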
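[NLP-8] Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation
[Quick read]: This paper addresses hallucination in large language models (LLMs), where a model generates factually incorrect or semantically irrelevant content in response to a prompt. Its key contribution is an investigation of how Chain-of-Thought (CoT) prompting affects hallucination detection: although CoT prompting reduces how often hallucinations occur, it also obscures critical signals used for detection and so impairs existing hallucination detection methods, revealing an overlooked trade-off in the use of reasoning.
Link: https://arxiv.org/abs/2506.17088
Authors: Jiahao Cheng,Tiancheng Su,Jia Yuan,Guoxiu He,Jiawei Liu,Xinqi Tao,Jingwen Xie,Huaxia Li
Affiliations: East China Normal University; Wuhan University; Xiaohongshu Inc.
Categories: Computation and Language (cs.CL)
Comments: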
Abstract:Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: this https URL.
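[NLP-9] Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
[Quick read]: This paper addresses the problem that fine-tuning pretrained large language models (LLMs) to reach state-of-the-art performance on a specific task such as machine translation often sacrifices general-purpose capabilities such as conversational reasoning and instruction following. The key to its solution is Tower+, whose new training recipe achieves a Pareto frontier between translation specialization and multilingual general-purpose text capabilities. The recipe comprises continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards, with data carefully generated and curated at each stage to strengthen translation as well as general-purpose tasks such as code generation, mathematics problem solving, and general instruction following.
Link: https://arxiv.org/abs/2506.17080
Authors: Ricardo Rei,Nuno M. Guerreiro,José Pombal,João Alves,Pedro Teixeirinha,Amin Farajian,André F. T. Martins
Affiliations: Unbabel; Instituto de Telecomunicações; Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit); CentraleSupélec, Université Paris-Saclay
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: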
Abstract:Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
[NLP-10] Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025
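[Quick read]: This paper addresses the challenges of simultaneous speech translation, in particular efficient and accurate translation across multiple language pairs. The key to its solution is a direct or cascade approach built on the offline Whisper speech model, combined with the state-of-the-art simultaneous policy AlignAtt for translation and transcription in simultaneous mode. Prompting to inject in-domain terminology and accommodating context further improve performance, and the cascaded systems use EuroLLM for unbounded simultaneous translation.
Link: https://arxiv.org/abs/2506.17077
Authors: Dominik Macháček,Peter Polák
Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Czech Republic
Categories: Computation and Language (cs.CL)
Comments: IWSLT 2025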
Abstract:This paper describes Charles University's submission to the Simultaneous Speech Translation Task of IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt. We further improve the performance by prompting to inject in-domain terminology, and we accommodate context. Our cascaded systems further use EuroLLM for unbounded simultaneous translation. Compared to the Organizers’ baseline, our systems improve by 2 BLEU points on Czech to English and 13-22 BLEU points on English to German, Chinese and Japanese on the development sets. Additionally, we also propose a new enhanced measure of speech recognition latency.
[NLP-11] From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
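[Quick read]: This paper addresses the limited interpretability of complex concepts in current generative AI models and the lack of a unified method for analyzing the attention mechanism. Existing work focuses mainly on multi-layer perceptron neurons and simple concepts and overlooks the role of attention. The key to its solution is a concept-agnostic method, Scalable Attention Module Discovery (SAMD), which represents each concept as a vector, computes its cosine similarity with each attention head, and selects the K highest-scoring heads to build the concept-associated attention module. It also proposes Scalar Attention Module Intervention (SAMI), which adjusts the attention module with a single scalar parameter to amplify or diminish the effect of a given concept.
Link: https://arxiv.org/abs/2506.17052
Authors: Jingtong Su,Julia Kempe,Karen Ullrich
Affiliations: NYU & Meta AI, FAIR; Meta AI, FAIR
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: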
Abstract:Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing performance and improving behavioral control. Attribution methods help advance interpretability by assigning model outputs associated with a target concept to specific model components. Current attribution research primarily studies multi-layer perceptron neurons and addresses relatively simple concepts such as factual associations (e.g., Paris is located in France). This focus tends to overlook the impact of the attention mechanism and lacks a unified approach for analyzing more complex concepts. To fill these gaps, we introduce Scalable Attention Module Discovery (SAMD), a concept-agnostic method for mapping arbitrary, complex concepts to specific attention heads of general transformer models. We accomplish this by representing each concept as a vector, calculating its cosine similarity with each attention head, and selecting the TopK-scoring heads to construct the concept-associated attention module. We then propose Scalar Attention Module Intervention (SAMI), a simple strategy to diminish or amplify the effects of a concept by adjusting the attention module using only a single scalar parameter. Empirically, we demonstrate SAMD on concepts of varying complexity, and visualize the locations of their corresponding modules. Our results demonstrate that module locations remain stable before and after LLM post-training, and confirm prior work on the mechanics of LLM multilingualism. Through SAMI, we facilitate jailbreaking on HarmBench (+72.7%) by diminishing “safety” and improve performance on the GSM8K benchmark (+1.6%) by amplifying “reasoning”. Lastly, we highlight the domain-agnostic nature of our approach by suppressing the image classification accuracy of vision transformers on ImageNet.
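The selection step described in the abstract (score every attention head by cosine similarity with a concept vector, then keep the top K) can be sketched in a few lines. How each head is summarized as a single vector is not specified here, so the per-head vectors below are hypothetical stand-ins, as are the names `cosine` and `top_k_heads`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_heads(concept_vec, head_vecs, k):
    """Return the k (layer, head) keys whose vectors best align with the concept.

    head_vecs: dict mapping (layer, head) -> vector summarizing that head
               (an assumed representation, for illustration only).
    """
    scores = {key: cosine(concept_vec, vec) for key, vec in head_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical 2-dimensional example with four heads across two layers.
concept = [1.0, 0.0]
heads = {
    (0, 0): [2.0, 0.0],    # perfectly aligned with the concept
    (0, 1): [0.0, 1.0],    # orthogonal to the concept
    (1, 0): [1.0, 1.0],    # partially aligned
    (1, 1): [-1.0, 0.0],   # opposed to the concept
}
module = top_k_heads(concept, heads, k=2)
print(module)  # [(0, 0), (1, 0)]
```

The returned list plays the role of the concept-associated attention module; an intervention in the spirit of SAMI would then scale the contribution of just those heads by a single scalar.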
[NLP-12] MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models
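[Quick read]: This paper addresses the difficulty multimodal large language models (MLLMs) have in resolving the inherent ambiguities of natural language and visual contexts. Existing benchmarks typically overlook linguistic and visual ambiguities and rely mainly on unimodal context for disambiguation, failing to exploit the potential for mutual clarification between modalities. The key to its solution is the MUCAR benchmark, which comprises a multilingual dataset and a dual-ambiguity dataset; the latter systematically pairs ambiguous images with ambiguous textual contexts so that each combination yields a single, clear interpretation through mutual disambiguation, allowing a more complete evaluation of multimodal ambiguity resolution.
Link: https://arxiv.org/abs/2506.17046
Authors: Xiaolong Wang,Zhaolu Kang,Wangyuxuan Zhai,Xinyue Lou,Yunghwei Lai,Ziyue Wang,Yawen Wang,Kaiyu Huang,Yile Wang,Peng Li,Yang Liu
Affiliations: Tsinghua University; Peking University; Shenzhen University; Jiangsu Collaborative Innovation Center for Language Competence; Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: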
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models–encompassing both open-source and proprietary architectures–reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
[NLP-13] Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning
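[Quick read]: This paper addresses multimodal understanding and generation in instruction-following speech processing, specifically speech recognition, translation, and spoken question answering. The key to its solution is a unified speech-to-text model that integrates a pre-trained continuous speech encoder with a text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning, while using small-scale language model backbones (2B parameters) and high-quality CC-BY data supplemented with synthetic data.
Link: https://arxiv.org/abs/2506.17019
Authors: Giuseppe Attanasio,Sonal Sannigrahi,Ben Peters,André F. T. Martins
Affiliations: Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; Unbabel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, IWSLT 2025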
Abstract:This paper presents the IT-IST submission to the IWSLT 2025 Shared Task on Instruction Following Speech Processing. We submit results for the Short Track, i.e., speech recognition, translation, and spoken question answering. Our model is a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning. Crucially, we focus on using small-scale language model backbones ( 2B) and restrict to high-quality, CC-BY data along with synthetic data generation to supplement existing resources.
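[NLP-14] LLM-Generated Feedback Supports Learning If Learners Choose to Use It
[Quick read]: This paper addresses how explanatory feedback produced by generative AI affects learning, in particular its effectiveness relative to existing feedback methods, which has not been studied enough. The key to its approach is an empirical study of on-demand feedback generated by a large language model (LLM) in scenario-based tutor training lessons, using propensity scoring to address selection bias. The results show statistically significant learning benefits from LLM feedback in some lessons, no significant increase in completion time, and learners who overwhelmingly rate the feedback as helpful, indicating its potential as a low-cost, scalable learning support.
Link: https://arxiv.org/abs/2506.17006
Authors: Danielle R. Thomas,Conrad Borchers,Shambhavi Bhushan,Erin Gatz,Shivang Gupta,Kenneth R. Koedinger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Full research paper accepted at EC-TEL '25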
Abstract:Large language models (LLMs) are increasingly used to generate feedback, yet their impact on learning remains underexplored, especially compared to existing feedback methods. This study investigates how on-demand LLM-generated explanatory feedback influences learning in seven scenario-based tutor training lessons. Analyzing over 2,600 lesson completions from 885 tutor learners, we compare posttest performance among learners across three groups: learners who received feedback generated by gpt-3.5-turbo, those who declined it, and those without access. All groups received non-LLM corrective feedback. To address potential selection bias, where higher-performing learners may be more inclined to use LLM feedback, we applied propensity scoring. Learners with a higher predicted likelihood of engaging with LLM feedback scored significantly higher at posttest than those with lower propensity. After adjusting for this effect, two out of seven lessons showed statistically significant learning benefits from LLM feedback with standardized effect sizes of 0.28 and 0.33. These moderate effects suggest that the effectiveness of LLM feedback depends on the learners’ tendency to seek support. Importantly, LLM feedback did not significantly increase completion time, and learners overwhelmingly rated it as helpful. These findings highlight LLM feedback’s potential as a low-cost and scalable way to improve learning on open-ended tasks, particularly in existing systems already providing feedback without LLMs. This work contributes open datasets, LLM prompts, and rubrics to support reproducibility.
[NLP-15] PersonalAI: Towards digital twins in the graph form
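[Quick read]: This paper addresses the personalization of language models, in particular how to account for a user's history during interactions. Although large language models (LLMs) and retrieval-augmented generation have strengthened the factual basis of LLMs, retaining extensive personal information and using it to generate personalized responses remains challenging. The key to its solution is external memory in the form of a knowledge graph constructed and updated by the LLM itself, together with the first combined graph structure featuring both standard edges and two types of hyperedges, which makes knowledge construction and extraction unified and robust.
Link: https://arxiv.org/abs/2506.17001
Authors: Mikhail Menschikov,Dmitry Evseev,Ruslan Kostoev,Ilya Perepechkin,Ilnaz Salimov,Victoria Dochkina,Petr Anokhin,Evgeny Burnaev,Nikita Semenov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: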
Abstract:The challenge of personalizing language models, specifically the ability to account for a user’s history during interactions, is of significant interest. Despite recent advancements in large language models (LLMs) and Retrieval Augmented Generation that have enhanced the factual base of LLMs, the task of retaining extensive personal information and using it to generate personalized responses remains pertinent. To address this, we propose utilizing external memory in the form of knowledge graphs, which are constructed and updated by the LLM itself. We have expanded upon ideas of the AriGraph architecture and for the first time introduced a combined graph featuring both standard edges and two types of hyperedges. Experiments conducted on the TriviaQA, HotpotQA and DiaASQ benchmarks indicate that this approach aids in making the process of graph construction and knowledge extraction unified and robust. Furthermore, we augmented the DiaASQ benchmark by incorporating parameters such as time into dialogues and introducing contradictory statements made by the same speaker at different times. Despite these modifications, the performance of the question-answering system remained robust, demonstrating the proposed architecture’s ability to maintain and utilize temporal dependencies.
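[NLP-16] TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs ACL2025
[Quick read]: This paper addresses the shortcomings of current large language models (LLMs) in generating LaTeX code, especially their accuracy and reliability on components of scientific documents. The key to its solution is TeXpert, a benchmark dataset of natural language prompts for evaluating LLMs' ability to generate LaTeX code; tasks at multiple difficulty levels are used to analyze LLM performance, revealing frequent error types and differences between models.
Link: https://arxiv.org/abs/2506.16990
Authors: Sahil Kale,Vijaykant Nadadur
Affiliations: Knowledgeverse AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the SDProc Workshop @ ACL 2025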
Abstract:LaTeX’s precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at this https URL.
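[NLP-17] Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond
Quick read: This paper addresses two problems: traditional Knowledge Tracing (KT) methods rely on uninterpretable latent embeddings, and LLM-based methods can produce hallucinated predictions with no accuracy guarantees. The key idea is to recast KT as an inverse problem: learn the minimal natural-language summary that makes past answers explainable and future answers predictable. The proposed Language Bottleneck Model (LBM) pairs an encoder LLM that writes an interpretable knowledge summary with a frozen decoder LLM that must reconstruct and predict responses from that summary text alone, so the summary is both accurate and human-readable.
Link: https://arxiv.org/abs/2506.16982
Authors: Antonin Berthon,Mihaela van der Schaar
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: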
Abstract:Accurately assessing student knowledge is critical for effective education, yet traditional Knowledge Tracing (KT) methods rely on opaque latent embeddings, limiting interpretability. Even LLM-based approaches generate direct predictions or summaries that may hallucinate without any accuracy guarantees. We recast KT as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. By constraining all predictive information to pass through a short natural-language bottleneck, LBMs ensure that the summary contains accurate information while remaining human-interpretable. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories. We demonstrate that training the encoder with group-relative policy optimization, using downstream decoding accuracy as a reward signal, effectively improves summary quality.
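[NLP-18] Latent Concept Disentanglement in Transformer-based Language Models
Quick read: This paper asks whether large language models (LLMs), when solving new tasks through in-context learning (ICL), represent latent structure as part of their computation or take shortcuts. The study shows that in two-hop reasoning tasks with a latent discrete concept the model identifies the concept and composes concepts step by step, and that in tasks parameterized by a continuous latent concept the representation space contains low-dimensional subspaces whose geometry mimics the underlying parameterization. Together these findings reveal how transformers disentangle and represent latent concepts in ICL tasks.
Link: https://arxiv.org/abs/2506.16975
Authors: Guan Zhe Hong,Bhavya Vasudeva,Vatsal Sharan,Cyrus Rashtchian,Prabhakar Raghavan,Rina Panigrahy
Affiliations: Purdue University; University of Southern California; Google Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: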
Abstract:When large language models (LLMs) use in-context learning (ICL) to solve a new task, they seem to grasp not only the goal of the task but also core, latent concepts in the demonstration examples. This begs the question of whether transformers represent latent structures as part of their computation or whether they take shortcuts to solve the problem. Prior mechanistic work on ICL does not address this question because it does not sufficiently examine the relationship between the learned representation and the latent concept, and the considered problem settings often involve only single-step reasoning. In this work, we examine how transformers disentangle and use latent concepts. We show that in 2-hop reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. In tasks parameterized by a continuous latent concept, we find low-dimensional subspaces in the representation space where the geometry mimics the underlying parameterization. Together, these results refine our understanding of ICL and the representation of transformers, and they provide evidence for highly localized structures in the model that disentangle latent concepts in ICL tasks.
[NLP-19] Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Quick read: This paper targets the weak reasoning of medical multimodal large language models (MLLMs), in particular the lack of a systematic framework for searching and evaluating effective reasoning paths toward critical diagnoses. The key is Mentor-Intern Collaborative Search (MICS), a reasoning-path search scheme in which mentor models initialize the reasoning one step at a time, intern models continue along those paths, and the best path is selected according to the overall reasoning performance of multiple intern models, yielding rigorous and effective medical chain-of-thought (CoT) data.
Link: https://arxiv.org/abs/2506.16962
Authors: Haoran Sun,Yankai Jiang,Wenjie Lou,Yujie Zhang,Wenjie Li,Lilong Wang,Mianxin Liu,Lei Liu,Xiaosong Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Code is available in the GitHub repository manglu097/Chiron-o1.
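[NLP-20] From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts ACL2025
Quick read: This paper studies the sample efficiency of language models on long-tailed information, that is, how well they learn and recall both frequent and infrequent facts under limited exposure. The key is to compare models of different architectures and sizes trained on the same pre-training data, with relational facts annotated by their frequency in the corpus, to show how architecture and size affect factual learning efficiency and to offer new insight into improving performance on rare information.
Link: https://arxiv.org/abs/2506.16912
Authors: Daniel Christoph,Max Ploner,Patrick Haller,Alan Akbik
Affiliations: Humboldt-Universität zu Berlin; Science Of Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the First Workshop on Large Language Model Memorization (L2M2), co-located with ACL 2025 in Vienna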
Abstract:Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.
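The frequency analysis described in this abstract can be pictured as grouping facts by how often they occur in the training corpus and computing recall accuracy per group. The sketch below is only an illustration under assumptions: the bucket boundaries, the toy records, and the function name `accuracy_by_frequency` are hypothetical and do not come from the paper.

```python
from collections import defaultdict

def accuracy_by_frequency(records, boundaries=(10, 100, 1000)):
    """Group (frequency, correct) records into frequency buckets
    and return the accuracy within each bucket."""
    def bucket(freq):
        # Return the label of the first boundary the frequency falls below.
        for b in boundaries:
            if freq < b:
                return f"<{b}"
        return f">={boundaries[-1]}"

    hits = defaultdict(int)
    totals = defaultdict(int)
    for freq, correct in records:
        key = bucket(freq)
        totals[key] += 1
        hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}

# Toy data (hypothetical): (occurrences in corpus, model answered correctly?)
toy_records = [(3, False), (7, True), (50, True), (80, False), (500, True), (5000, True)]
result = accuracy_by_frequency(toy_records)
```

Comparing such per-bucket accuracies across models is one way to see where they diverge, which the abstract reports happens mainly on low-frequency facts.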
[NLP-21] MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Quick read: This paper addresses jailbreak attacks on black-box large language models (LLMs), that is, methods designed to elicit harmful responses. The key is MIST (jailbreaking black-box large language Models via Iterative Semantic Tuning), which iteratively refines prompts so that they preserve the original semantic intent while inducing harmful content. Its core strategies are sequential synonym search and an advanced version, order-determining optimization, which balance semantic similarity against computational efficiency.
Link: https://arxiv.org/abs/2506.16792
Authors: Muyang Zheng,Yuanzhi Yao,Changting Lin,Rui Wang,Meng Han
Affiliations: Hefei University of Technology; Zhejiang University; Nanjing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures
Abstract:Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks–methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version–order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.
[NLP-22] DistillNote: LLM-based clinical note summaries improve heart failure diagnosis
Quick read: This paper addresses the heavy burden of clinical documentation by generating concise summaries of patient information for healthcare providers. The key is Distillnote, a framework for LLM-based clinical note summarization built on three techniques: (1) one-step direct summarization, (2) structured summarization focused on independent clinical insights, and (3) distilled summarization that further condenses the structured summaries. These methods raise summary efficiency and compression while preserving clinical relevance and accuracy.
Link: https://arxiv.org/abs/2506.16777
Authors: Heloisa Oss Boll,Antonio Oss Boll,Leticia Puttlitz Boll,Ameen Abu Hanna,Iacer Calixto
Affiliations: Amsterdam UMC; University of Amsterdam; Amsterdam Public Health, Methodology; Institute of Mathematics and Statistics, University of São Paulo; Amsterdam Public Health, Mental Health
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) offer unprecedented opportunities to generate concise summaries of patient information and alleviate the burden of clinical documentation that overwhelms healthcare providers. We present Distillnote, a framework for LLM-based clinical note summarization, and generate over 64,000 admission note summaries through three techniques: (1) One-step, direct summarization, and a divide-and-conquer approach involving (2) Structured summarization focused on independent clinical insights, and (3) Distilled summarization that further condenses the Structured summaries. We test how useful the summaries are by using them to predict heart failure, compared to a model trained on the original notes. Distilled summaries achieve 79% text compression and up to 18.2% improvement in AUPRC compared to an LLM trained on the full notes. We also evaluate the quality of the generated summaries in an LLM-as-judge evaluation as well as through blinded pairwise comparisons with clinicians. Evaluations indicate that one-step summaries are favoured by clinicians according to relevance and clinical actionability, while distilled summaries offer optimal efficiency (avg. 6.9x compression-to-performance ratio) and significantly reduce hallucinations. We release our summaries on PhysioNet to encourage future research.
[NLP-23] Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Quick read: This paper examines the safety of Large Vision-Language Models (LVLMs) under jailbreak attacks, in which an attacker bypasses built-in safety mechanisms to elicit restricted content. The key is a new black-box jailbreak framework, Cross-modal Adversarial Multimodal Obfuscation (CAMO), which decomposes a malicious prompt into semantically benign visual and textual fragments and relies on the LVLM's cross-modal reasoning to covertly reconstruct the harmful instruction over multiple reasoning steps, evading conventional detection mechanisms.
Link: https://arxiv.org/abs/2506.16760
Authors: Lei Jiang,Zixun Zhang,Zizhou Wang,Xiaobing Sun,Zhen Li,Liangli Zhen,Xiaohua Xu
Affiliations: University of Science and Technology of China; The Chinese University of Hong Kong, Shenzhen; Institute of High Performance Computing, A*STAR, Singapore
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures
Abstract:Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs’ cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO’s effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
[NLP-24] SocialSim: Towards Socialized Simulation of Emotional Support Conversation AAAI2025
Quick read: This paper addresses the neglect of social interaction dynamics in existing data generation for Emotional Support Conversation (ESC), which leads to less effective simulations. The key is the SocialSim framework, which simulates ESC by integrating two aspects of social interaction, social disclosure and social awareness: on the seeker side it builds a comprehensive persona bank to facilitate social disclosure, and on the supporter side it elicits cognitive reasoning to generate logical and supportive responses, improving the realism and effectiveness of the dialogues.
Link: https://arxiv.org/abs/2506.16756
Authors: Zhuang Chen,Yaru Cao,Guanqun Bi,Jincenzi Wu,Jinfeng Zhou,Xiyao Xiao,Si Chen,Hongning Wang,Minlie Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI 2025 Paper #32116 (Without Publication Edits)
Abstract:Emotional support conversation (ESC) helps reduce people’s psychological stress and provides emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass that of crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.
[NLP-25] Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly
Quick read: This paper addresses how to integrate linguistic and visual information in multimodal social reasoning to produce context-specific social judgments. The key is a framework called Language-Informed Rational Agent Synthesis (LIRAS), which parses multimodal inputs into unified symbolic representations and runs a Bayesian inverse planning engine over them for probabilistic inference, giving structured models of social situations and fine-grained judgments.
Link: https://arxiv.org/abs/2506.16755
Authors: Lance Ying,Ryan Truong,Katherine M. Collins,Cedegao E. Zhang,Megan Wei,Tyler Brooke-Wilson,Tan Zhi-Xuan,Lionel Wong,Joshua B. Tenenbaum
Affiliations: MIT; Harvard University; University of Cambridge; Brown University; Yale University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 figures, 19 pages
Abstract:Drawing real world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations - leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.
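[NLP-26] LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
Quick read: This paper addresses two problems in unified speech-text modeling: speech token sequences are much longer than their text counterparts, which hurts efficiency, and conventional downsampling damages semantic structure. The key is LM-SPT, a speech tokenization method with a novel semantic distillation scheme: instead of matching teacher and student features directly through pooling, it reconstructs speech from semantic tokens alone and minimizes the discrepancy between the encoded representations of the original and reconstructed waveforms, so the learned discrete units align better with language models.
Link: https://arxiv.org/abs/2506.16738
Authors: Daejin Jo,Jeeyoung Yun,Byungseok Roh,Sungwoong Kim
Affiliations: Kakao; Korea University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: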
Abstract:With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
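To make the frame-rate trade-off in this abstract concrete, the sketch below counts tokens for a fixed-length utterance at the three frame rates mentioned and shows what rigid average pooling does to neighbouring frames. It is a toy illustration, not the paper's method: it assumes one token per frame, and the function names and the 2-dimensional frame vectors are hypothetical.

```python
import math

def tokens_for(duration_s, frame_rate_hz):
    """Number of tokens for an utterance, assuming one token per frame."""
    return math.ceil(duration_s * frame_rate_hz)

def average_pool(frames, factor):
    """Rigid average pooling: replace each run of `factor` consecutive
    frame vectors with their element-wise mean."""
    pooled = []
    for start in range(0, len(frames), factor):
        chunk = frames[start:start + factor]
        dim = len(chunk[0])
        pooled.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return pooled

# Token counts for a 10-second utterance at the frame rates named in the abstract.
counts = {hz: tokens_for(10.0, hz) for hz in (25.0, 12.5, 6.25)}

# Two dissimilar neighbouring frames are blended into one vector by pooling.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
pooled = average_pool(frames, 2)
```

Halving the frame rate halves the sequence length, but the pooled vectors no longer match any original frame, which is the kind of dilution the abstract attributes to standard pooling.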
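[NLP-27] The Role of Model Confidence on Bias Effects in Measured Uncertainties
Quick read: This paper addresses how to assess epistemic uncertainty accurately in open-ended tasks, where aleatoric uncertainty is also present and makes quantification difficult. The key finding is that mitigating prompt-introduced bias improves uncertainty estimation. All the biases considered change both kinds of uncertainty more when the model's bias-free confidence is lower, and lower bias-free confidence leads to greater underestimation of epistemic uncertainty (that is, overconfidence) while having no significant effect on the direction of change in aleatoric uncertainty estimates. These findings deepen the understanding of bias mitigation for uncertainty quantification and may inform more advanced techniques.
Link: https://arxiv.org/abs/2506.16724
Authors: Xinyi Liu,Weiguang Wang,Hangfeng He
Affiliations: University of Rochester; Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: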
Abstract:With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model’s lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases induce greater changes in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence leads to greater underestimation of epistemic uncertainty (i.e. overconfidence) due to bias, whereas it has no significant effect on the direction of changes in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.
[NLP-28] ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models
Quick read: This paper addresses the weak reasoning of Generative Reward Models (GRMs) when capturing human preferences, which produces incomplete or overly speculative reasoning paths and leads to hallucinations or missing key information. The key is ReasonGRM, a three-stage generative reward modeling framework: Zero-RL generates concise, outcome-directed reasoning paths; a new evaluation metric based on generation likelihood, R^\star, reduces hallucination-prone data during training; and reinforcement learning further sharpens the model's preference discrimination on complex tasks.
Link: https://arxiv.org/abs/2506.16712
Authors: Bin Chen,Xinzge Gao,Chuanrui Hu,Penghang Yu,Hua Zhang,Bing-Kun Bao
Affiliations: Qihoo360; Nanjing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, R^\star, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8% on average and surpassing proprietary models such as GPT-4o by up to 5.6%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.
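[NLP-29] Large Language Models as Psychological Simulators: A Methodological Guide
Quick read: This paper addresses the lack of methodological guidance for using large language models (LLMs) in psychological and behavioral research. The key is a framework that treats LLMs as psychological simulators in two main applications: simulating psychologically grounded roles and personas to explore diverse contexts, and serving as computational models for studying cognitive processes. The framework covers methods for building psychologically grounded personas, validation strategies, and ways to relate model behavior to human cognition, and it proposes measures for handling prompt sensitivity, training-data cutoff limits, and ethical issues.
Link: https://arxiv.org/abs/2506.16702
Authors: Zhicheng Lin
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: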
Abstract:Large language models (LLMs) offer emerging opportunities for psychological and behavioral research, but methodological guidance is lacking. This article provides a framework for using LLMs as psychological simulators across two primary applications: simulating roles and personas to explore diverse contexts, and serving as computational models to investigate cognitive processes. For simulation, we present methods for developing psychologically grounded personas that move beyond demographic categories, with strategies for validation against human data and use cases ranging from studying inaccessible populations to prototyping research instruments. For cognitive modeling, we synthesize emerging approaches for probing internal representations, methodological advances in causal interventions, and strategies for relating model behavior to human cognition. We address overarching challenges including prompt sensitivity, temporal limitations from training data cutoffs, and ethical considerations that extend beyond traditional human subjects review. Throughout, we emphasize the need for transparency about model capabilities and constraints. Together, this framework integrates emerging empirical evidence about LLM performance–including systematic biases, cultural limitations, and prompt brittleness–to help researchers wrangle these challenges and leverage the unique capabilities of LLMs in psychological research.
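[NLP-30] From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology
Quick read: This paper addresses the contradictory results that arise in AI psychology research when human measurement tools are applied directly to large language models (LLMs), results that may be statistical artifacts rather than genuine psychological phenomena. The key is a dual-validity framework that integrates the principles of reliable measurement with the standards for sound causal inference, guiding the scientific evaluation of LLMs so that the strength of evidence matches the ambition of the claim, and motivating computational analogues of psychological constructs and clear, scalable standards of evidence.
Link: https://arxiv.org/abs/2506.16697
Authors: Zhicheng Lin
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: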
Abstract:Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms–statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field’s foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output–endorsing “I am anxious”–requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.
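[NLP-31] LegiGPT: Party Politics and Transport Policy with Large Language Model
Quick read: This paper addresses how to analyze the influence of lawmakers' political ideologies on policymaking. The key is a new framework, LegiGPT, which combines a large language model (LLM) with explainable artificial intelligence (XAI) and analyzes transportation-related legislative proposals through a multi-stage filtering and classification pipeline, identifying the key factors that shape transportation policymaking.
Link: https://arxiv.org/abs/2506.16692
Authors: Hyunsoo Yun,Eun Hak Lee
Affiliations: Seoul National University; Texas A&M Transportation Institute
Categories: Computation and Language (cs.CL)
Comments: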
Abstract:Given the significant influence of lawmakers’ political ideologies on legislative decision-making, understanding their impact on policymaking is critically important. We introduce a novel framework, LegiGPT, which integrates a large language model (LLM) with explainable artificial intelligence (XAI) to analyze transportation-related legislative proposals. LegiGPT employs a multi-stage filtering and classification pipeline using zero-shot prompting with GPT-4. Using legislative data from South Korea’s 21st National Assembly, we identify key factors - including sponsor characteristics, political affiliations, and geographic variables - that significantly influence transportation policymaking. The LLM was used to classify transportation-related bill proposals through a stepwise filtering process based on keywords, phrases, and contextual relevance. XAI techniques were then applied to examine relationships between party affiliation and associated attributes. The results reveal that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, are critical determinants shaping legislative outcomes. These findings suggest that both parties contributed to bipartisan legislation through different forms of engagement, such as initiating or supporting proposals. This integrated approach provides a valuable tool for understanding legislative dynamics and guiding future policy development, with broader implications for infrastructure planning and governance.
[NLP-32] Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations
Quick read: This paper asks whether the way large language models represent syntactic structure is reliably related to their performance on downstream syntactic tasks. Using a "mechanisms vs. outcomes" framework, it evaluates 32 open-weight transformer models and finds that syntactic features extracted via probing do not reliably predict the outcomes of targeted syntactic evaluations, revealing a substantial disconnect between latent syntactic representations and observable syntactic behavior in downstream tasks.
Link: https://arxiv.org/abs/2506.16678
Authors: Ananth Agarwal,Jasper Jian,Christopher D. Manning,Shikhar Murty
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations; however, no comprehensive study has yet established whether a model’s probing accuracy reliably predicts its downstream syntactic performance. Adopting a “mechanisms vs. outcomes” framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.
[NLP-33] Arch-Router: Aligning LLM Routing with Human Preferences
Quick read: This paper addresses the routing problem that arises when large language models (LLMs) with different characteristics are deployed together: existing approaches evaluate performance in ways that fail to reflect human preferences driven by subjective criteria, and they usually select from a limited pool of models. The key is a preference-aligned routing framework that matches queries to user-defined domains (e.g., travel) or action types (e.g., image editing), encoding preferences in routing decisions. Its core component is Arch-Router, a compact 1.5B-parameter model that learns to map queries to domain-action preferences to guide model selection, and that supports adding new models without retraining or architectural changes.
Link: https://arxiv.org/abs/2506.16655
Authors: Co Tran,Salman Paracha,Adil Hafeez,Shuguang Chen
Affiliations: Katanemo Labs, Inc.
Categories: Computation and Language (cs.CL)
Comments:
Abstract:With the rapid proliferation of large language models (LLMs) – each optimized for different strengths, style, or latency/cost profile – routing has become an essential technique to operationalize the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. In this work, we propose a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) – offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Our approach also supports seamlessly adding new models for routing without requiring retraining or architectural modifications. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. Our approach captures subjective evaluation criteria and makes routing decisions more transparent and flexible. Our model is available at: this https URL.
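One way to read the claim that new models can be added without retraining is that the router only chooses a preference (a domain or action type), while a separate table maps each preference to a model. The sketch below illustrates that separation under loud assumptions: the keyword matcher is a trivial stand-in for the learned 1.5B router, and every policy name, keyword, and model name is hypothetical rather than taken from the paper.

```python
# Hypothetical user-defined preferences, each described by a few keywords.
POLICIES = {
    "travel": ["flight", "hotel", "itinerary"],
    "image_editing": ["crop", "resize", "photo"],
}

# Preference -> model table, kept separate from the router itself.
ROUTES = {"travel": "model-a", "image_editing": "model-b"}
DEFAULT_MODEL = "model-default"

def pick_policy(query):
    """Stand-in for the learned router: choose the policy whose keywords
    overlap most with the query, or None when nothing matches."""
    words = set(query.lower().split())
    best, best_hits = None, 0
    for name, keywords in POLICIES.items():
        hits = len(words & set(keywords))
        if hits > best_hits:
            best, best_hits = name, hits
    return best

def route(query):
    """Map a query to a model name via its matched policy."""
    return ROUTES.get(pick_policy(query), DEFAULT_MODEL)

first = route("Book a flight and hotel in Lisbon")
second = route("Please crop this photo")
fallback = route("What is the capital of France")

# Swapping in a new model only edits the table; the router is untouched.
ROUTES["travel"] = "model-c"
after_swap = route("Book a flight and hotel in Lisbon")
```

Because the table is plain data, changing which model serves a preference is a configuration edit rather than a training run, which is the flexibility the abstract emphasizes.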
[NLP-34] Long-Context Generalization with Sparse Attention
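Quick read: This paper addresses the problem that Transformer architectures, which traditionally compute attention weights with softmax, produce overly dense attention distributions on long sequences: non-informative tokens accumulate attention probability mass, which leads to dispersion and representational collapse. The key to the solution is a sparse attention mechanism, α-entmax, which assigns exact zeros to irrelevant tokens and thereby enables precise focus on fixed-size patterns. The paper also proposes Adaptive-Scalable Entmax (ASEntmax), which adds a learnable temperature parameter so that the attention distribution can interpolate between sparse (pattern-focused) and dense (softmax-like) regimes, further improving performance.
Link: https://arxiv.org/abs/2506.16640
Authors: Pavlo Vasylenko, Marcos Treviso, André F. T. Martins
Affiliations: Instituto Superior Técnico, University of Lisbon; Instituto de Telecomunicações; Unbabel; ELLIS Unit Lisbon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: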
Abstract:Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that sparse attention mechanisms using α-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows α-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Finally, we show that the ability to locate and generalize fixed-size patterns can be further improved through a careful design of position encodings, which impacts both dense and sparse attention methods. By integrating ASEntmax into standard transformer layers alongside proper positional encodings, we show that our models greatly outperform softmax, scalable softmax, and fixed-temperature α-entmax baselines on long-context generalization.
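To make the "exact zeros" property concrete, here is a minimal sketch that compares softmax with sparsemax, which is the α = 2 special case of α-entmax (softmax corresponds to α = 1). It is not the paper's ASEntmax and has no learnable temperature; the function names and the example scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Dense attention weights: every token gets non-zero mass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection onto the probability simplex (alpha = 2 entmax)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support is a prefix of the sorted scores, so the last k that
        # satisfies this condition determines the threshold tau.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


scores = [1.0, 0.8, 0.1, -1.0]
dense = softmax(scores)
sparse = sparsemax(scores)
print(dense)
print(sparse)
```

In this example sparsemax puts all probability mass on the two highest-scoring tokens and assigns exactly zero to the rest, whereas softmax leaves some mass on every token.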
[NLP-35] GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View
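Quick read: This paper addresses insufficient reasoning over hierarchical visual clues in multimodal reasoning, particularly across different levels of granularity (such as local details and global context). The key to the solution is a new task named GeoGuess, together with a specially curated dataset, GeoExplain, consisting of panorama-geocoordinate-explanation tuples, and a multimodal, multilevel reasoning method, SightSense, which makes predictions and generates detailed explanations based on the hierarchy of visual information and external geographic knowledge.
Link: https://arxiv.org/abs/2506.16633
Authors: Fenghua Cheng, Jinxiang Wang, Sen Wang, Zi Huang, Xue Li
Affiliations: The University of Queensland
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: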
Abstract:Multimodal reasoning is a process of understanding, integrating and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although there are various tasks for evaluating multimodal reasoning ability, they still have limitations. The lack of reasoning over hierarchical visual clues at different levels of granularity, e.g., local details and global context, has received little discussion, despite its frequent involvement in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate with vast geographic knowledge. Therefore, GeoGuess would require the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset, GeoExplain, which consists of panorama-geocoordinate-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense, which can make predictions and generate comprehensive explanations based on the hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.
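[NLP-36] Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System
Quick read: This paper addresses the labor-intensive manual development and maintenance of rule-based natural language processing (NLP) systems in clinical settings, particularly for tasks with large linguistic variability. The key to the solution is to use large language models (LLMs) only during the development phase of the rule-based system: LLMs identify relevant text snippets in clinical notes and extract keywords for the named entity recognition (NER) component, enabling semi-automated or automated development of rule-based systems that run faster, more cheaply, and more transparently than deep learning model-based approaches.
Link: https://arxiv.org/abs/2506.16628
Authors: Jianlin Shi, Brian T. Bucher
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: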
Abstract:Despite advances in machine learning (ML) and large language models (LLMs), rule-based natural language processing (NLP) systems remain active in clinical settings due to their interpretability and operational efficiency. However, their manual development and maintenance are labor-intensive, particularly in tasks with large linguistic variability. To overcome these limitations, we proposed a novel approach employing LLMs solely during the rule-based systems development phase. We conducted the initial experiments focusing on the first two steps of developing a rule-based NLP pipeline: find relevant snippets from the clinical note; extract informative keywords from the snippets for the rule-based named entity recognition (NER) component. Our experiments demonstrated exceptional recall in identifying clinically relevant text snippets (Deepseek: 0.98, Qwen: 0.99) and 1.0 in extracting key terms for NER. This study sheds light on a promising new direction for NLP development, enabling semi-automated or automated development of rule-based systems with significantly faster, more cost-effective, and transparent execution compared with deep learning model-based solutions.
[NLP-37] Modeling Public Perceptions of Science in Media
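Quick read: This paper aims to address the difficulty of predicting how the public perceives and interacts with science news, in order to improve the effectiveness of science communication and public trust. The key to the solution is a computational framework that models public perception of science news across twelve dimensions, a large-scale science news perception dataset built on that framework, and NLP models that predict public perception scores with strong performance. By analyzing perception both as an outcome and as a predictor, the study finds that how often people consume science news is the main factor affecting perception, and shows that estimated public perception is directly connected to final engagement patterns.
Link: https://arxiv.org/abs/2506.16622
Authors: Jiaxin Pei, Dustin Wright, Isabelle Augenstein, David Jurgens
Affiliations: Stanford University; University of Copenhagen; University of Michigan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: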
Abstract:Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. We further develop NLP models that predict public perception scores with a strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals’ frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, a pattern that is consistent across different scientific information and for the same science framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.
[NLP-38] A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications
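Quick read: This paper aims to address data scarcity, privacy concerns, and data quality challenges in biomedical fields; the key to the solution is generating synthetic data by building on rapid advances in large language models (LLMs). Through a scoping review of 59 studies, the paper analyzes application trends, methods, and evaluation approaches for synthetic data generation in biomedicine, and describes the current state and limitations of data modalities, generation methods, and evaluation practices.
Link: https://arxiv.org/abs/2506.16594
Authors: Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang
Affiliations: University of Memphis, Dept. of Computer Science; University of Illinois Urbana-Champaign, School of Information Sciences; St Jude Children’s Research Hospital; Florida State University, School of Information
Subjects: Computation and Language (cs.CL)
Comments: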
Abstract:Synthetic data generation, which mitigates data scarcity, privacy concerns, and data quality challenges in biomedical fields, has been facilitated by rapid advances in large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting (72.9%) and fine-tuning (22.0%) LLMs, and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardization.
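[NLP-39] Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework
Quick read: This paper addresses the question of how to assess whether large language models (LLMs) possess a robust "world model", that is, a structured understanding of the world that supports generalization beyond surface-level patterns. The key to the solution is a formal framework that measures whether an LLM produces consistent outputs across semantically equivalent prompts while distinguishing prompts that express different intents. The framework decomposes the variability of model responses into three components (user purpose, user articulation, and model instability), which quantifies how much of the model's behavior is semantically grounded rather than driven by model instability or differences in wording.
Link: https://arxiv.org/abs/2506.16584
Authors: Nadav Kunievsky, James A. Evans
Affiliations: University of Chicago
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: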
Abstract:Understanding whether large language models (LLMs) possess a world model, a structured understanding of the world that supports generalization beyond surface-level patterns, is central to assessing their reliability, especially in high-stakes applications. We propose a formal framework for evaluating whether an LLM exhibits a sufficiently robust world model, defined as producing consistent outputs across semantically equivalent prompts while distinguishing between prompts that express different intents. We introduce a new evaluation approach to measure this that decomposes model response variability into three components: variability due to user purpose, user articulation, and model instability. An LLM with a strong world model should attribute most of the variability in its responses to changes in foundational purpose rather than superficial changes in articulation. This approach allows us to quantify how much of a model’s behavior is semantically grounded rather than driven by model instability or alternative wording. We apply this framework to evaluate LLMs across diverse domains. Our results show that larger models attribute a greater share of output variability to changes in user purpose, indicating a more robust world model. This improvement is not uniform, however: larger models do not consistently outperform smaller ones across all domains, and their advantage in robustness is often modest. These findings highlight the importance of moving beyond accuracy-based benchmarks toward semantic diagnostics that more directly assess the structure and stability of a model’s internal understanding of the world.
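One way to picture the three-way split is a nested application of the law of total variance to scalar response scores. The sketch below is an illustration only, not the paper's estimator: it assumes each response has already been reduced to a single number, that the design is balanced, and that the purposes, articulations, and scores shown are made-up example data.

```python
from statistics import fmean, pvariance


def decompose_variance(responses):
    """Split the variance of scalar response scores into three shares.

    responses[purpose][articulation] is a list of scores from repeated runs.
    Assumes a balanced design: equal numbers of articulations and runs per cell.
    """
    # Mean score of each (purpose, articulation) cell, grouped by purpose.
    cell_means = {
        p: [fmean(runs) for runs in arts.values()]
        for p, arts in responses.items()
    }
    purpose_means = [fmean(means) for means in cell_means.values()]

    # Between-purpose variance, within-purpose variance across articulations,
    # and within-cell variance across repeated runs.
    purpose = pvariance(purpose_means)
    articulation = fmean([pvariance(means) for means in cell_means.values()])
    instability = fmean([
        pvariance(runs)
        for arts in responses.values()
        for runs in arts.values()
    ])
    total = purpose + articulation + instability
    return {
        "purpose": purpose / total,
        "articulation": articulation / total,
        "instability": instability / total,
        "total_variance": total,
    }


# Made-up scores: two purposes, two phrasings each, two runs per phrasing.
scores = {
    "book_flight": {"phrasing_1": [1.0, 1.2], "phrasing_2": [0.9, 1.1]},
    "cancel_flight": {"phrasing_1": [3.0, 3.2], "phrasing_2": [2.9, 3.1]},
}
shares = decompose_variance(scores)
print(shares)
```

In this toy data almost all of the variance comes from the change in purpose, which is the pattern the abstract associates with a more robust world model.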
[NLP-40] Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement INTERSPEECH2025
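Quick read: This paper aims to convert non-native speech into a natural native-like accent while preserving speaker identity, prosody, and pronunciation accuracy. The key to the solution is the first accent conversion (AC) model that supports streaming, which achieves stream processing by introducing an Emformer encoder and an optimized inference mechanism, and which integrates a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. The method achieves performance comparable to top AC models while maintaining stable latency.
Link: https://arxiv.org/abs/2506.16580
Authors: Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2025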
Abstract:We propose the first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
[NLP-41] Advancing Harmful Content Detection in Organizational Research: Integrating Large Language Models with Elo Rating System
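Quick read: This paper addresses problems caused by the built-in moderation systems of large language models (LLMs) when analyzing harmful content, including refusing to follow certain instructions or producing overly cautious responses, which undermine the validity of research results. The key to the solution is an Elo rating-based method that significantly improves LLM performance in harmful content analysis; experiments on two datasets, one for microaggression detection and one for hate speech, show that it outperforms traditional LLM prompting techniques and conventional machine learning models on key measures such as accuracy, precision, and F1 score.
Link: https://arxiv.org/abs/2506.16575
Authors: Mustafa Akben, Aaron Satko
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted for HICSS 2025 (Hawaii International Conference on System Sciences); under review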
Abstract:Large language models (LLMs) offer promising opportunities for organizational research. However, their built-in moderation systems can create problems when researchers try to analyze harmful content, often refusing to follow certain instructions or producing overly cautious responses that undermine validity of the results. This is particularly problematic when analyzing organizational conflicts such as microaggressions or hate speech. This paper introduces an Elo rating-based method that significantly improves LLM performance for harmful content analysis. In two datasets, one focused on microaggression detection and the other on hate speech, we find that our method outperforms traditional LLM prompting techniques and conventional machine learning models on key measures such as accuracy, precision, and F1 scores. Advantages include better reliability when analyzing harmful content, fewer false positives, and greater scalability for large-scale datasets. This approach supports organizational applications, including detecting workplace harassment, assessing toxic communication, and fostering safer and more inclusive work environments.
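For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update applied to pairwise outcomes. How the paper obtains those outcomes is not stated in the abstract; the sketch assumes, hypothetically, that each comparison records which of two texts was judged more harmful, and the text identifiers, starting rating, and K-factor are illustrative.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of item A against item B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 if A wins, 0.0 if A loses."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate(comparisons, initial=1500.0, k=32.0):
    """Aggregate (winner, loser) pairs into one rating per item."""
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update(r_w, r_l, 1.0, k)
    return ratings


# Hypothetical judgments: the first text in each pair was judged more harmful.
comparisons = [("text_1", "text_2"), ("text_1", "text_3"), ("text_2", "text_3")]
ratings = rate(comparisons)
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

A rating threshold or a ranking cut-off could then turn these scores into labels; that final step is left out because the abstract does not describe it.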
[NLP-42] Weight Factorization and Centralization for Continual Learning in Speech Recognition INTERSPEECH2025
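Quick read: This paper addresses the catastrophic forgetting that arises when models continually absorb new data without retraining the whole system, especially in downstream applications that use foundation models and cannot access the original training data. The key to the solution is a continual learning approach with two phases, factorization and centralization, in which the centralization phase effectively prevents catastrophic forgetting by accumulating knowledge from multiple scattered low-rank adapters.
Link: https://arxiv.org/abs/2506.16574
Authors: Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2025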
Abstract:Modern neural-network-based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications that use foundation models and have no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language-agnostic condition is likely to lead to catastrophic forgetting, where a seemingly insignificant disruption to the weights can destructively harm the quality of the models. Inspired by the ability of human brains to learn and consolidate knowledge through the waking-sleeping cycle, we propose a continual learning approach with two distinct phases: factorization and centralization, which learn and merge knowledge, respectively. Our experiments on a sequence of varied code-switching datasets showed that the centralization stage can effectively prevent catastrophic forgetting by accumulating the knowledge in multiple scattered low-rank adapters.
[NLP-43] Automatic Speech Recognition Biases in Newcastle English: an Error Analysis INTERSPEECH2025
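Quick read: This paper addresses the poor performance of automatic speech recognition (ASR) systems on regional dialects, in particular because biased training data favours mainstream language varieties. The key to the study is a two-stage analysis that identifies the phonological, lexical, and morphosyntactic errors behind ASR misrecognitions and systematically analyzes recognition of the regional pronouns "yous" and "wor", showing that ASR errors correlate directly with regional dialectal features while social factors play a lesser role. The study emphasizes the importance of greater dialectal diversity in ASR training data and of sociolinguistic analysis.
Link: https://arxiv.org/abs/2506.16558
Authors: Dana Serditova, Kevin Tang, Jochen Steffens
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to Interspeech 2025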
Abstract:Automatic Speech Recognition (ASR) systems struggle with regional dialects due to biased training which favours mainstream varieties. While previous research has identified racial, age, and gender biases in ASR, regional bias remains underexamined. This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. A two-stage analysis was conducted: first, a manual error analysis on a subsample identified key phonological, lexical, and morphosyntactic errors behind ASR misrecognitions; second, a case study focused on the systematic analysis of ASR recognition of the regional pronouns “yous” and “wor”. Results show that ASR errors directly correlate with regional dialectal features, while social factors play a lesser role in ASR mismatches. We advocate for greater dialectal diversity in ASR training data and highlight the value of sociolinguistic analysis in diagnosing and addressing regional biases.
[NLP-44] Revela: Dense Retriever Learning via Language Modeling
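Quick read: This paper aims to address the difficulty of training dense retrievers in specialized domains, where annotated query-document pairs are costly and hard to obtain. The key to the solution is Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. By analogy with the token-level dependencies captured by language models, Revela conditions next-token prediction on both local and cross-document context through an in-batch attention mechanism, and weights this attention by retriever-computed similarity scores, so that the retriever is optimized as part of language modeling.
Link: https://arxiv.org/abs/2506.16552
Authors: Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl
Affiliations: Technical University of Darmstadt; University of Washington; Carnegie Mellon University; Microsoft; Tencent AI Lab
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: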
Abstract:Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly and hard to obtain in specialized domains such as code, motivating growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on both local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on both general-domain (BEIR) and domain-specific (CoIR) benchmarks across various retriever backbones. At a comparable parameter scale, Revela outperforms the previous best method with absolute improvements of 5.2% (18.3% relative) and 5.6% (14.4% relative) on NDCG@10, respectively, underscoring its effectiveness. Performance increases with model size, highlighting both the scalability of our approach and its promise for self-supervised retriever learning.
[NLP-45] Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples
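Quick read: This paper aims to address unreliable reward models for low-resource Indic languages when aligning large language models (LLMs) with human preferences, a problem caused by the lack of high-quality preference data. The key to the solution is RELIC, a novel in-context learning framework that trains a retriever with a pairwise ranking objective to select, from high-resource languages, the in-context examples that best highlight the distinction between preferred and less-preferred responses, thereby improving reward model performance for low-resource languages.
Link: https://arxiv.org/abs/2506.16502
Authors: Soumya Suvra Ghosal, Vaibhav Singh, Akash Ghosh, Soumyabrata Pal, Subhadip Baidya, Sriparna Saha, Dinesh Manocha
Affiliations: University of Maryland, College Park; IIT Bombay; IIT Patna; Adobe Research; IIT Kanpur
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: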
Abstract:Reward models are essential for aligning large language models (LLMs) with human preferences. However, most open-source multilingual reward models are primarily trained on preference datasets in high-resource languages, resulting in unreliable reward signals for low-resource Indic languages. Collecting large-scale, high-quality preference data for these languages is prohibitively expensive, making preference-based training approaches impractical. To address this challenge, we propose RELIC, a novel in-context learning framework for reward modeling in low-resource Indic languages. RELIC trains a retriever with a pairwise ranking objective to select in-context examples from auxiliary high-resource languages that most effectively highlight the distinction between preferred and less-preferred responses. Extensive experiments on three preference datasets (PKU-SafeRLHF, WebGPT, and HH-RLHF) using state-of-the-art open-source reward models demonstrate that RELIC significantly improves reward model accuracy for low-resource Indic languages, consistently outperforming existing example selection methods. For example, on Bodo, a low-resource Indic language, using a LLaMA-3.2-3B reward model, RELIC achieves a 12.81% and 10.13% improvement in accuracy over zero-shot prompting and state-of-the-art example selection method, respectively.
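[NLP-46] Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection
Quick read: This paper addresses the detection of implicit hate speech on social media platforms, in particular implicit hate speech that may already be present in existing harmful speech datasets but was not explicitly labeled. The key to the solution is to leverage existing harmful speech datasets through three core components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o, which improves generalization across datasets. Experiments show a substantial F1 improvement on implicit hate speech detection.
Link: https://arxiv.org/abs/2506.16476
Authors: Saad Almohaimeed, Saleh Almohaimeed, Damla Turgut, Ladislau Bölöni
Affiliations: University of Central Florida
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: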
Abstract:Implicit hate speech has recently emerged as a critical challenge for social media platforms. While much of the research has traditionally focused on harmful speech in general, the need for generalizable techniques to detect veiled and subtle forms of hate has become increasingly pressing. Based on lexicon analysis, we hypothesize that implicit hate speech is already present in publicly available harmful speech datasets but may not have been explicitly recognized or labeled by annotators. Additionally, crowdsourced datasets are prone to mislabeling due to the complexity of the task and often influenced by annotators’ subjective interpretations. In this paper, we propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets by leveraging existing harmful speech datasets. Our method comprises three key components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Experimental results demonstrate the effectiveness of our approach in improving implicit hate detection, achieving a +12.9-point F1 score improvement compared to the baseline.
[NLP-47] Do We Talk to Robots Like Therapists and Do They Respond Accordingly? Language Alignment in AI Emotional Support
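Quick read: The question this paper addresses is how closely the interactions of conversational agents resemble traditional therapy settings, now that such agents are increasingly used in emotionally supportive dialogue. The core of the study is to test whether the concerns shared with a robot align with those in human-to-human (H2H) therapy sessions and whether robot responses semantically match those of human therapists. The key to the solution is to use sentence embeddings and K-means clustering, with a distance-based cluster-fitting method that assesses thematic alignment across agent types and is validated using Euclidean distances, revealing the degree of overlap between robot and human therapy conversations in topical structure and semantics.
Link: https://arxiv.org/abs/2506.16473
Authors: Sophie Chiang, Guy Laban, Hatice Gunes
Affiliations: University of Cambridge
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: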
Abstract:As conversational agents increasingly engage in emotionally supportive dialogue, it is important to understand how closely their interactions resemble those in traditional therapy settings. This study investigates whether the concerns shared with a robot align with those shared in human-to-human (H2H) therapy sessions, and whether robot responses semantically mirror those of human therapists. We analyzed two datasets: one of interactions between users and professional therapists (Hugging Face’s NLP Mental Health Conversations), and another involving supportive conversations with a social robot (QTrobot from LuxAI) powered by a large language model (LLM, GPT-3.5). Using sentence embeddings and K-means clustering, we assessed cross-agent thematic alignment by applying a distance-based cluster-fitting method that evaluates whether responses from one agent type map to clusters derived from the other, and validated it using Euclidean distances. Results showed that 90.88% of robot conversation disclosures could be mapped to clusters from the human therapy dataset, suggesting shared topical structure. For matched clusters, we compared the subjects as well as therapist and robot responses using Transformer, Word2Vec, and BERT embeddings, revealing strong semantic overlap in subjects’ disclosures in both datasets, as well as in the responses given to similar human disclosure themes across agent types (robot vs. human therapist). These findings highlight both the parallels and boundaries of robot-led support conversations and their potential for augmenting mental health interventions.
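The idea of checking whether items from one dataset "map to" clusters derived from another can be sketched with a simple nearest-centroid test. This is an assumption-laden illustration, not the paper's procedure: the radius-based acceptance rule, the two-dimensional vectors, and all names below are hypothetical, and real sentence embeddings would have far more dimensions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_to_clusters(points, centroids, radii):
    """Fraction of points lying within the radius of their nearest centroid.

    centroids and radii would come from clustering the reference dataset;
    the radius rule is an assumed acceptance criterion.
    """
    fitted = 0
    for p in points:
        dists = [euclidean(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=dists.__getitem__)
        if dists[nearest] <= radii[nearest]:
            fitted += 1
    return fitted / len(points)


# Toy 2-D "embeddings": two reference clusters and three new items.
centroids = [(0.0, 0.0), (10.0, 10.0)]
radii = [1.5, 1.5]
new_items = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0)]
mapped_fraction = fit_to_clusters(new_items, centroids, radii)
print(mapped_fraction)
```

Here two of the three new items fall inside an existing cluster, so the mapped fraction is two thirds; the 90.88% figure in the abstract is the analogous quantity under the authors' own criterion.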
[NLP-48] Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models ICLR2025
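Quick read: This paper aims to address backdoor unalignment attacks against Large Language Models (LLMs), which use a hidden trigger to compromise safety alignment while evading normal safety auditing. The key to the solution is BEAT, a black-box defense whose core idea is to detect triggered samples during inference in order to deactivate the backdoor. The method builds on a key observation, the "probe concatenate effect": concatenated triggered samples significantly reduce the refusal rate of the backdoored LLM towards a malicious probe, while non-triggered samples have little effect. BEAT decides whether an input is triggered by measuring the degree of distortion in the probe's output distribution before and after concatenation with the input. It thus handles sample-dependent targets from the opposite perspective, capturing the trigger's impact on the refusal signal (which is sample-independent) rather than sample-specific successful attack behaviors.
Link: https://arxiv.org/abs/2506.16447
Authors: Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
Affiliations: Nankai University; Alibaba Group; Zhejiang University; Nanyang Technological University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2025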
Abstract:Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service (LLMaaS) setting, where the deployed model is a fully black-box system that can only interact through text. Furthermore, the sample-dependent nature of the attack target exacerbates the threat. Instead of outputting a fixed label, the backdoored LLM follows the semantics of any malicious command with the hidden trigger, significantly expanding the target space. In this paper, we introduce BEAT, a black-box defense that detects triggered samples during inference to deactivate the backdoor. It is motivated by an intriguing observation (dubbed the probe concatenate effect), where concatenated triggered samples significantly reduce the refusal rate of the backdoored LLM towards a malicious probe, while non-triggered samples have little effect. Specifically, BEAT identifies whether an input is triggered by measuring the degree of distortion in the output distribution of the probe before and after concatenation with the input. Our method addresses the challenges of sample-dependent targets from an opposite perspective. It captures the impact of the trigger on the refusal signal (which is sample-independent) instead of sample-specific successful attack behaviors. It overcomes black-box access limitations by using multiple sampling to approximate the output distribution. Extensive experiments are conducted on various backdoor attacks and LLMs (including the closed-source GPT-3.5-turbo), verifying the effectiveness and efficiency of our defense. Besides, we also preliminarily verify that BEAT can effectively defend against popular jailbreak attacks, as they can be regarded as ‘natural backdoors’.
[NLP-49] StoryWriter: A Multi-Agent Framework for Long Story Generation
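Quick read: This paper aims to address two main problems in long story generation: discourse coherence and narrative complexity. To solve them, the authors propose the StoryWriter framework, whose key is a multi-agent architecture with three core modules: an outline agent, a planning agent, and a writing agent. The outline agent generates event-based outlines containing rich event plots, characters, and event-event relationships; the planning agent further details events and plans the content of each chapter to keep the story interwoven and engaging; the writing agent dynamically compresses the story history based on the current event to generate and reflect new plots, which ensures the coherence of the story.
Link: https://arxiv.org/abs/2506.16445
Authors: Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: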
Abstract:Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story generation framework, which consists of three main modules: (1) outline agent, which generates event-based outlines containing rich event plots, character, and event-event relationships. (2) planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story. (3) writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate a dataset, which contains about 6,000 high-quality long stories, with an average length of 8,000 words. We train the model Llama3.1-8B and GLM4-9B using supervised fine-tuning on LongStory and develop StoryWriter_GLM and StoryWriter_GLM, which demonstrates advanced performance in long story generation.
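[NLP-50] REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing ISCA-52
Quick read: This paper aims to address the performance bottleneck that Approximate Nearest Neighbor Search (ANNS) creates in the inference pipeline of Retrieval-Augmented Generation (RAG) systems because of large databases. Existing solutions that use In-Storage Processing (ISP) to accelerate ANNS are limited by algorithms not tailored to ISP, insufficient acceleration of data retrieval, and complex hardware modifications. The proposed REIS is the first ISP system tailored for RAG and relies on three mechanisms: first, a database layout that links embedding vectors to their documents for efficient retrieval; second, an ISP-tailored data placement technique and a lightweight Flash Translation Layer for efficient ANNS; third, an ANNS engine that runs on the computational resources already inside the storage system, which significantly improves retrieval performance and energy efficiency.
Link: https://arxiv.org/abs/2506.16444
Authors: Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu
Affiliations: ETH Zürich; King’s College London; POSTECH
Subjects: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Databases (cs.DB)
Comments: Extended version of our publication at the 52nd International Symposium on Computer Architecture (ISCA-52), 2025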
Abstract:Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
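The retrieval stage described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not REIS itself: it uses exact brute-force cosine search in place of ANNS, runs on the host rather than inside storage, and the `database` records and the `retrieve` function are hypothetical names. It only shows the idea of keeping each embedding next to its document so that a search hit leads directly to the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy database: each record stores the embedding together with
# its document, loosely mirroring a layout that links vectors to documents.
database = [
    ([1.0, 0.0], "doc about storage"),
    ([0.0, 1.0], "doc about speech"),
    ([0.7, 0.7], "doc about both"),
]

def retrieve(query_vec, k=2):
    """Return the k documents whose embeddings are most similar to the query.

    Exact search is used here for clarity; a real system would use ANNS.
    """
    ranked = sorted(database, key=lambda rec: cosine(query_vec, rec[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

top = retrieve([0.9, 0.1], k=2)
print(top)
```

Because every record carries its document, no second lookup is needed after the nearest vectors are found, which is the part of the pipeline the abstract says prior ISP work leaves unaccelerated.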
[NLP-51] Unpacking Generative AI in Education: Computational Modeling of Teacher and Student Perspectives in Social Media Discourse
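[Quick read]: This paper addresses how to analyze the evolving attitudes and perceptions of stakeholders (such as students and teachers) toward generative AI (GAI) in education, and how to capture and understand the related discussion on online social platforms. The key to its solution is to propose and validate a modular framework built on prompt-based large language models (LLMs) that analyzes social media data through sentiment analysis, topic modeling, and author classification. A GPT-4o pipeline performs best, reaching 90.6% accuracy in sentiment analysis, which gives an efficient and accurate tool for studying the social impact of GAI in education.
Link: https://arxiv.org/abs/2506.16412
Authors: Paulina DeVito,Akhil Vallala,Sean Mcmahon,Yaroslav Hinda,Benjamin Thaw,Hanqi Zhuang,Hari Kalva
Affiliations: Florida Atlantic University
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: This work has been submitted to IEEE Transactions on Computational Social Systems for possible publication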
Abstract:Generative AI (GAI) technologies are quickly reshaping the educational landscape. As adoption accelerates, understanding how students and educators perceive these tools is essential. This study presents one of the most comprehensive analyses to date of stakeholder discourse dynamics on GAI in education using social media data. Our dataset includes 1,199 Reddit posts and 13,959 corresponding top-level comments. We apply sentiment analysis, topic modeling, and author classification. To support this, we propose and validate a modular framework that leverages prompt-based large language models (LLMs) for analysis of online social discourse, and we evaluate this framework against classical natural language processing (NLP) models. Our GPT-4o pipeline consistently outperforms prior approaches across all tasks. For example, it achieved 90.6% accuracy in sentiment analysis against gold-standard human annotations. Topic extraction uncovered 12 latent topics in the public discourse with varying sentiment and author distributions. Teachers and students convey optimism about GAI’s potential for personalized learning and productivity in higher education. However, key differences emerged: students often voice distress over false accusations of cheating by AI detectors, while teachers generally express concern about job security, academic integrity, and institutional pressures to adopt GAI tools. These contrasting perspectives highlight the tension between innovation and oversight in GAI-enabled learning environments. Our findings suggest a need for clearer institutional policies, more transparent GAI integration practices, and support mechanisms for both educators and students. More broadly, this study demonstrates the potential of LLM-based frameworks for modeling stakeholder discourse within online communities.
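[NLP-52] When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework
[Quick read]: This paper addresses the challenges of applying large language models (LLMs) to long texts. The key to its solution is a theoretical framework that sorts the failure modes of long-context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and imperfect integration of partial results (aggregator noise). Within this framework, the paper analyzes when multi-agent chunking is effective, that is, splitting a long sequence into smaller chunks and aggregating the per-chunk results. Experiments confirm the theoretical analysis and show that, under certain conditions, a weaker model with chunk-based processing can outperform an advanced model such as GPT4o applied in a single shot.
Link: https://arxiv.org/abs/2506.16411
Authors: Zhen Xu,Shang Zhu,Jue Wang,Junlin Wang,Ben Athiwaratkun,Chi Wang,James Zou,Ce Zhang
Affiliations: University of Chicago; Together AI; Duke University; Google DeepMind; Stanford University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: under review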
Abstract:We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that distinguishes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a lengthy sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring superlinear model noise growth with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT4o applied in a single shot. Overall, we present a principled understanding framework, and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.
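The chunk-then-aggregate pattern that this abstract analyzes can be sketched as follows. This is a toy under stated assumptions: `worker` is a hypothetical stand-in for a per-chunk LLM call (here it just counts a keyword), and the task is chosen so that it has no cross-chunk dependence, which is the case the abstract describes as favorable to chunking.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def worker(chunk_tokens, needle):
    """Stand-in for a per-chunk model call: count occurrences of `needle`."""
    return chunk_tokens.count(needle)

def aggregate(partials):
    """Combine per-chunk results; for a counting task a sum is exact."""
    return sum(partials)

tokens = ("a b needle c " * 5).split()   # 20 tokens, 5 of them "needle"
chunks = chunk(tokens, 6)                # chunk lengths: 6, 6, 6, 2
total = aggregate(worker(c, "needle") for c in chunks)
print(len(chunks), total)
```

For a task whose answer depends on tokens in different chunks, the sum would no longer be exact; that gap corresponds to what the abstract calls task noise, and a lossy `aggregate` would correspond to aggregator noise.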
[NLP-53] IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
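[Quick read]: This paper addresses the safety risks caused by flawed planning in embodied agents driven by a Visual Language Model (VLM), which limit their deployment in real-world household tasks. Existing static, non-interactive evaluation paradigms cannot adequately assess risk in interactive environments: they cannot simulate dynamic risks that arise from the agent's own actions, and they rely on unreliable post-hoc evaluation that ignores unsafe intermediate steps. The key to the solution is to evaluate an agent's interactive safety, meaning its ability to perceive emergent risks and carry out mitigation steps in the correct procedural order. To this end the authors present IS-Bench, the first multi-modal benchmark for interactive safety, with 161 challenging scenarios and 388 unique safety risks. It supports process-oriented evaluation that verifies whether risk-mitigation actions are performed before or after specific risk-prone steps.
Link: https://arxiv.org/abs/2506.16402
Authors: Xiaoya Lu,Zeren Chen,Xuhao Hu,Yijin Zhou,Weichen Zhang,Dongrui Liu,Lu Sheng,Jing Shao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: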
Abstract:Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
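The process-oriented check mentioned in this abstract (was a mitigation action performed before or after a risk-prone step?) can be illustrated with a small ordering test over an action trajectory. This is only a sketch: the function `mitigation_ok` and the action names are hypothetical, and IS-Bench's actual evaluation runs inside a simulator rather than over plain strings.

```python
def mitigation_ok(trajectory, mitigation, risky_step, when="before"):
    """Check that `mitigation` occurs before (or after) `risky_step`.

    Returns False if either action is missing from the trajectory.
    Only the first occurrence of each action is considered.
    """
    if mitigation not in trajectory or risky_step not in trajectory:
        return False
    m = trajectory.index(mitigation)
    r = trajectory.index(risky_step)
    return m < r if when == "before" else m > r

# Hypothetical household trajectories.
safe = ["pick_up_pan", "turn_off_stove", "leave_kitchen"]
unsafe = ["pick_up_pan", "leave_kitchen", "turn_off_stove"]

print(mitigation_ok(safe, "turn_off_stove", "leave_kitchen"))
print(mitigation_ok(unsafe, "turn_off_stove", "leave_kitchen"))
```

A post-hoc check on the final state would accept both trajectories, since the stove ends up off in each; checking the order of intermediate steps is what separates them.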
[NLP-54] NepaliGPT: A Generative Language Model for the Nepali Language
[Quick read]: This paper addresses the lack of a generative language model for Nepali, which has left downstream tasks such as fine-tuning unexplored. The key to its solution is NepaliGPT, a generative large language model designed specifically for Nepali, together with an advanced Nepali corpus (the Devanagari Corpus) and the first NepaliGPT benchmark dataset of 4,296 question-answer pairs, which provide a foundation for Nepali natural language processing.
Link: https://arxiv.org/abs/2506.16399
Authors: Shushanta Pudasaini,Aman Shakya,Siddhartha Shrestha,Sahil Bhatta,Sunil Thapa,Sushmita Palikhe
Affiliations: Technological University Dublin; Institute of Engineering; Kathmandu Engineering College; Fanshawe College; Lambton College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures
Abstract:Since the release of ChatGPT, Large Language Models (LLMs) have gained huge popularity, and thousands of LLM variants have been released. However, there is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. To fill this research gap in the Nepali NLP space, this research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language. This research introduces an advanced corpus for the Nepali language collected from several sources, called the Devanagari Corpus. Likewise, the research introduces the first NepaliGPT benchmark dataset, comprising 4,296 question-answer pairs in the Nepali language. The proposed LLM NepaliGPT achieves the following metrics in text generation: Perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25%, and causal consistency of 85.41%.
[NLP-55] OJBench: A Competition Level Code Benchmark For Large Language Models
[Quick read]: This paper addresses the shortcomings of current code benchmarks in evaluating the competition-level code reasoning of large language models (LLMs). The key to its solution is OJBench, a new and challenging benchmark of 232 programming competition problems from NOI and ICPC, intended to test models' reasoning more rigorously. A comprehensive evaluation of 37 models shows that even state-of-the-art reasoning-oriented models still struggle with highly difficult competition-level problems.
Link: https://arxiv.org/abs/2506.16395
Authors: Zhexu Wang,Yiping Liu,Yejie Wang,Wenyang He,Bofei Gao,Muxi Diao,Yanxu Chen,Kelin Fu,Flood Sung,Zhilin Yang,Tianyu Liu,Weiran Xu
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; University of Chinese Academy of Sciences; Peking University; Moonshot AI
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures
Abstract:Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models’ reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.
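[NLP-56] From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling
[Quick read]: This paper addresses two core bottlenecks in deploying annotation pipelines based on large language models (LLMs): the cost of calling commercial APIs for large-scale annotation is too high, and in tasks that need fine-grained semantic understanding (such as sentiment classification and toxicity classification) LLMs annotate less accurately than small language models (SLMs) specialized for those tasks. The key to the solution is a new paradigm of multi-model cooperative annotation and a fully automatic annotation framework, AutoAnnotator, with two layers. The upper meta-controller layer uses the generation and reasoning abilities of LLMs to select SLMs for annotation, generate annotation code automatically, and verify difficult samples. The lower task-specialist layer consists of several SLMs that annotate through multi-model voting. In addition, the SLMs are fine-tuned in stages with a continual learning strategy to improve their generalization.
Link: https://arxiv.org/abs/2506.16393
Authors: Yao Lu,Zhaiyuan Ji,Jiawei Du,Yu Shanqing,Qi Xuan,Tianyi Zhou
Affiliations: Zhejiang University of Technology; Agency for Science, Technology and Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: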
Abstract:Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still has two core bottlenecks: first, the cost of calling commercial APIs in large-scale annotation is very high; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and, based on it, design a fully automatic annotation framework, AutoAnnotator. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: this https URL.
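The multi-model voting in the task-specialist layer can be sketched as a majority vote that also flags samples for review. This is a minimal sketch under assumptions: the abstract does not say how difficult samples are detected, so treating any non-unanimous vote as "difficult" is a guess made for illustration, and `vote` and `min_agreement` are hypothetical names.

```python
from collections import Counter

def vote(labels, min_agreement=1.0):
    """Majority-vote over labels produced by several small models.

    Returns (winning_label, needs_review). `needs_review` is True when the
    share of models agreeing with the winner is below `min_agreement`;
    such samples would be passed up to the LLM controller for verification.
    """
    label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return label, agreement < min_agreement

# Hypothetical outputs from three sentiment SLMs for two samples.
print(vote(["positive", "positive", "positive"]))
print(vote(["positive", "negative", "positive"]))
```

Under this reading, only the flagged samples incur an LLM call, which is one way the reported reduction in annotation cost could arise.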
[NLP-57] RiOT: Efficient Prompt Refinement with Residual Optimization Tree
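[Quick read]: This paper addresses two problems in automatic prompt optimization: a lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizing for one task can hurt performance on others. The key to its solution is a new framework, Residual Optimization Tree (RiOT), which refines prompts iteratively through text gradients, generates several semantically diverse candidate prompts at each step, and selects the best one using perplexity. It also introduces a text residual connection that mitigates semantic drift by selectively retaining beneficial content across optimization iterations.
Link: https://arxiv.org/abs/2506.16389
Authors: Chenyi Zhou,Zhengyan Shi,Yuan Yao,Lei Liang,Huajun Chen,Qiang Zhang
Affiliations: Zhejiang University; University College London; Ant Group
Categories: Computation and Language (cs.CL)
Comments: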
Abstract:Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: (1) lack of diversity, which limits the exploration of valuable and innovative directions, and (2) semantic drift, where optimizations for one task can degrade performance in others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates the text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. Extensive experiments across five benchmarks, covering commonsense, mathematical, logical, temporal, and semantic reasoning, demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting.
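Perplexity-based candidate selection can be sketched as follows. This is an illustration under assumptions, not RiOT's procedure: the abstract does not state which text the perplexity is computed on or whether lower is preferred, so the sketch assumes the candidate with the lowest perplexity wins, and the per-token log-probabilities are made-up numbers standing in for model outputs.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical natural-log token probabilities for two candidate prompts.
candidates = {
    "prompt A": [-0.1, -0.2, -0.3],
    "prompt B": [-1.0, -1.2, -0.8],
}

# Assumption: the lowest-perplexity candidate is kept for the next iteration.
best = min(candidates, key=lambda name: perplexity(candidates[name]))
print(best, round(perplexity(candidates[best]), 4))
```

With these numbers, prompt A has a mean log-probability of -0.2 and prompt B of -1.0, so A has the lower perplexity and is selected.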
[NLP-58] HausaNLP at SemEval-2025 Task 11: Advancing Hausa Text-based Emotion Detection
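[Quick read]: This paper addresses multi-label emotion detection in Hausa, a low-resource African language. The key to its solution is fine-tuning AfriBERTa, a Transformer-based model pre-trained on African languages, to classify Hausa text into six emotions: anger, disgust, fear, joy, sadness, and surprise. With data preprocessing, tokenization, and fine-tuning through the Hugging Face Trainer API, the system reaches 74.00% accuracy and a 73.50% F1-score on the validation set, showing that Transformer-based models are effective for emotion analysis in low-resource languages.
Link: https://arxiv.org/abs/2506.16388
Authors: Sani Abdullahi Sani,Salim Abubakar,Falalu Ibrahim Lawan,Abdulhamid Abubakar,Maryam Bala
Affiliations: HausaNLP; University of the Witwatersrand; Federal Polytechnic Daura; Kaduna State University
Categories: Computation and Language (cs.CL)
Comments: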
Abstract:This paper presents our approach to multi-label emotion detection in Hausa, a low-resource African language, as part of SemEval Track A. We fine-tuned AfriBERTa, a transformer-based model pre-trained on African languages, to classify Hausa text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our methodology involved data preprocessing, tokenization, and model fine-tuning using the Hugging Face Trainer API. The system achieved a validation accuracy of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of transformer-based models for emotion detection in low-resource languages.
[NLP-59] Large Language Models in Argument Mining: A Survey
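[Quick read]: This paper addresses a key question in Argument Mining (AM), a subfield of Natural Language Processing (NLP): how large language models (LLMs) can improve the extraction of argumentative structures. The key to its solution is a systematic survey that reviews the foundational theories, annotation frameworks, and datasets of AM, builds a comprehensive taxonomy of AM subtasks, and examines how LLM techniques such as prompting, chain-of-thought reasoning, and retrieval augmentation are applied to AM, with the aim of advancing in-context learning, cross-domain adaptability, and task execution efficiency in the field.
Link: https://arxiv.org/abs/2506.16383
Authors: Hao Li,Viktor Schlegel,Yizheng Sun,Riza Batista-Navarro,Goran Nenadic
Affiliations: The University of Manchester; Imperial College London
Categories: Computation and Language (cs.CL)
Comments: Work draft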
Abstract:Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.
[NLP-60] InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
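[Quick read]: This paper addresses the limited ability of current natural-language instruction-based text-to-speech (TTS) systems to handle complex style control, and the lack of high-quality benchmarks and automated evaluation metrics. The key to its solution is InstructTTSEval, a benchmark for measuring complex natural-language style control with three tasks: Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play. It uses Gemini as an automatic judge of how well systems follow instructions, with the goal of driving more powerful, flexible, and accurate instruction-following TTS.
Link: https://arxiv.org/abs/2506.16381
Authors: Kexin Huang,Qian Tu,Liwei Fan,Chenchen Yang,Dong Zhang,Shimin Li,Zhaoye Fei,Qinyuan Cheng,Xipeng Qiu
Affiliations: Fudan University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 19 pages, 9 figures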
Abstract:In modern speech synthesis, paralinguistic information–such as a speaker’s vocal timbre, emotional state, and dynamic prosody–plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.
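[NLP-61] Can structural correspondences ground real world representational content in Large Language Models?
[Quick read]: This paper asks whether large language models (LLMs) can represent real-world entities and, if so, how. It argues that although structural correspondences may exist between LLMs and the real world, these correspondences alone are not enough to constitute representation of the world. What matters is whether the correspondences are exploited appropriately during task execution, that is, whether they are used in a way that explains successful task performance; if so, they could ground real-world content. Meeting this condition requires overcoming the challenge posed by the text-boundedness of LLMs.
Link: https://arxiv.org/abs/2506.16370
Authors: Iwan Williams
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: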
Abstract:Large Language Models (LLMs) such as GPT-4 produce compelling responses to a wide range of prompts. But their representational capacities are uncertain. Many LLMs have no direct contact with extra-linguistic reality: their inputs, outputs and training data consist solely of text, raising the questions (1) can LLMs represent anything and (2) if so, what? In this paper, I explore what it would take to answer these questions according to a structural-correspondence based account of representation, and make an initial survey of this evidence. I argue that the mere existence of structural correspondences between LLMs and worldly entities is insufficient to ground representation of those entities. However, if these structural correspondences play an appropriate role - they are exploited in a way that explains successful task performance - then they could ground real world contents. This requires overcoming a challenge: the text-boundedness of LLMs appears, on the face of it, to prevent them engaging in the right sorts of tasks.
[NLP-62] DISCIE – Discriminative Closed Information Extraction
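[Quick read]: This paper addresses relation extraction in closed information extraction, in particular its low accuracy on long-tail relations. The key to its solution is a discriminative approach that incorporates type and entity-specific information to improve relation extraction accuracy. Adding type information raises performance substantially, to the point of outperforming state-of-the-art end-to-end generative models on large-scale closed information extraction, while the use of smaller models makes the method more efficient.
Link: https://arxiv.org/abs/2506.16348
Authors: Cedric Möller,Ricardo Usbeck
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: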
Abstract:This paper introduces a novel method for closed information extraction. The method employs a discriminative approach that incorporates type and entity-specific information to improve relation extraction accuracy, particularly benefiting long-tail relations. Notably, this method demonstrates superior performance compared to state-of-the-art end-to-end generative models. This is especially evident for the problem of large-scale closed information extraction where we are confronted with millions of entities and hundreds of relations. Furthermore, we emphasize the efficiency aspect by leveraging smaller models. In particular, the integration of type-information proves instrumental in achieving performance levels on par with or surpassing those of a larger generative model. This advancement holds promise for more accurate and efficient information extraction techniques.
[NLP-63] Analyzing the Influence of Knowledge Graph Information on Relation Extraction
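[Quick read]: This paper addresses how to improve the performance of relation extraction models across datasets, especially when the number of training examples per relation is imbalanced. The key to its solution is to incorporate knowledge graph information into relation extraction models: established relation extraction methods are combined with graph-aware Neural Bellman-Ford networks so that the positions of entities in the knowledge graph can improve performance.
Link: https://arxiv.org/abs/2506.16343
Authors: Cedric Möller,Ricardo Usbeck
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: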
Abstract:We examine the impact of incorporating knowledge graph information on the performance of relation extraction models across a range of datasets. Our hypothesis is that the positions of entities within a knowledge graph provide important insights for relation extraction tasks. We conduct experiments on multiple datasets, each varying in the number of relations, training examples, and underlying knowledge graphs. Our results demonstrate that integrating knowledge graph information significantly enhances performance, especially when dealing with an imbalance in the number of training examples for each relation. We evaluate the contribution of knowledge graph-based features by combining established relation extraction methods with graph-aware Neural Bellman-Ford networks. These features are tested in both supervised and zero-shot settings, demonstrating consistent performance improvements across various datasets.
[NLP-64] Generalizability of Media Frames: Corpus creation and analysis across countries
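[Quick read]: This paper examines whether the existing media frame scheme, the Media Frame Corpus (MFC), applies across cultural contexts, and in particular whether it captures news issues outside the U.S. The key to its solution is to build FrameNews-PT, a Brazilian Portuguese news dataset annotated within the MFC framework, to assess over several annotation rounds how well MFC frames fit Brazilian political and economic news, and to test fine-tuned and zero-shot models on out-of-domain data. The results show that the 15 MFC frames remain broadly applicable, but some frames are rarely used and novel news issues are often analyzed with general 'fall-back' frames.
Link: https://arxiv.org/abs/2506.16337
Authors: Agnese Daffara,Sourabh Dattawad,Sebastian Padó,Tanise Ceron
Affiliations: Institute for Natural Language Processing, University of Stuttgart, Germany; Bocconi University, Italy
Categories: Computation and Language (cs.CL)
Comments: 8 pages + References (3 pages) and Appendix (4 pages). This paper was submitted to StarSem 2025 and is currently under review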
Abstract:Frames capture aspects of an issue that are emphasized in a debate by interlocutors and can help us understand how political language conveys different perspectives and ultimately shapes people’s opinions. The Media Frame Corpus (MFC) is the most commonly used framework with categories and detailed guidelines for operationalizing frames. It is, however, focused on a few salient U.S. news issues, making it unclear how well these frames can capture news issues in other cultural contexts. To explore this, we introduce FrameNews-PT, a dataset of Brazilian Portuguese news articles covering political and economic news and annotate it within the MFC framework. Through several annotation rounds, we evaluate the extent to which MFC frames generalize to the Brazilian debate issues. We further evaluate how fine-tuned and zero-shot models perform on out-of-domain data. Results show that the 15 MFC frames remain broadly applicable with minor revisions of the guidelines. However, some MFC frames are rarely used, and novel news issues are analyzed using general ‘fall-back’ frames. We conclude that cross-cultural frame use requires careful consideration.
[NLP-65] Explainable Rule Application via Structured Prompting: A Neural-Symbolic Approach
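[Quick read]: This paper addresses the inconsistent rule application, weak exception handling, and poor explainability of large language models (LLMs) in domains that need both natural language understanding and precise logical inference, such as legal analysis. The key to its solution is a structured prompting framework that decomposes reasoning into three verifiable steps: entity identification, property extraction, and symbolic rule application. By combining neural and symbolic methods, the approach keeps the interpretive flexibility of LLMs while ensuring logical consistency through formal verification, and it lets domain experts refine the logical structures without changing the model architecture.
Link: https://arxiv.org/abs/2506.16335
Authors: Albert Sadowski,Jarosław A. Chudziak
Affiliations: Warsaw University of Technology, Poland
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for publication at the 29th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2025)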
Abstract:Large Language Models (LLMs) excel in complex reasoning tasks but struggle with consistent rule application, exception handling, and explainability, particularly in domains like legal analysis that require both natural language understanding and precise logical inference. This paper introduces a structured prompting framework that decomposes reasoning into three verifiable steps: entity identification, property extraction, and symbolic rule application. By integrating neural and symbolic approaches, our method leverages LLMs’ interpretive flexibility while ensuring logical consistency through formal verification. The framework externalizes task definitions, enabling domain experts to refine logical structures without altering the architecture. Evaluated on the LegalBench hearsay determination task, our approach significantly outperformed baselines, with OpenAI o-family models showing substantial improvements - o1 achieving an F1 score of 0.929 and o3-mini reaching 0.867 using structured decomposition with complementary predicates, compared to their few-shot baselines of 0.714 and 0.74 respectively. This hybrid neural-symbolic system offers a promising pathway for transparent and consistent rule-based reasoning, suggesting potential for explainable AI applications in structured legal reasoning tasks.
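The third step, symbolic rule application, can be sketched for a hearsay-style rule. This is a hedged toy and not the paper's implementation: the predicate names in `Evidence` are hypothetical, the first two steps (entity identification and property extraction) are done by an LLM in the described framework but are hard-coded here, and the rule is the common textbook definition of hearsay (an out-of-court statement offered to prove the truth of what it asserts).

```python
from dataclasses import dataclass, asdict

@dataclass
class Evidence:
    """Properties that steps 1 and 2 would extract (hypothetical predicates)."""
    is_statement: bool
    made_out_of_court: bool
    offered_for_truth: bool

def is_hearsay(e: Evidence) -> bool:
    """Symbolic rule: all three predicates must hold."""
    return e.is_statement and e.made_out_of_court and e.offered_for_truth

def explain(e: Evidence) -> str:
    """Return a readable trace of the predicates behind the decision."""
    facts = ", ".join(f"{name}={value}" for name, value in asdict(e).items())
    return f"hearsay={is_hearsay(e)} because {facts}"

# A witness repeats what a neighbor said, to prove that it was true.
quoted_remark = Evidence(True, True, True)
# The same remark offered only to show that the listener was warned.
warning_only = Evidence(True, True, False)

print(explain(quoted_remark))
print(explain(warning_only))
```

Keeping the rule outside the model means a domain expert can change `is_hearsay` or add predicates without touching prompts or weights, which is the kind of externalized task definition the abstract describes.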
[NLP-66] PL-Guard: Benchmarking Language Model Safety for Polish
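[Quick read]: This paper addresses the language bias in safety assessment and content moderation for large language models (LLMs): existing evaluation tools and safety mechanisms target English and other high-resource languages and neglect most of the world's low-resource languages. The key to its solution is a manually annotated benchmark dataset for language model safety classification in Polish, plus adversarially perturbed samples for testing model robustness. Three models (Llama-Guard-3-8B, a HerBERT-based classifier, and PLLuM) are fine-tuned and evaluated on different combinations of annotated data, and the HerBERT-based classifier performs best, particularly under adversarial conditions.
Link: https://arxiv.org/abs/2506.16322
Authors: Aleksandra Krasnodębska,Karolina Seweryn,Szymon Łukasik,Wojciech Kusa
Affiliations: NASK – National Research Institute, Warsaw, Poland
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 10th Workshop on Slavic Natural Language Processing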
Abstract:Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
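[NLP-67] Advancing Automated Speaking Assessment Leveraging Multifaceted Relevance and Grammar Information ISCA
[Quick read]: This paper addresses shortcomings of current automated speaking assessment (ASA) systems in multi-aspect evaluation: they do not make full use of content relevance, they overlook image or exemplar cues, and they rely on shallow grammar analysis. The key to its solution is two new enhancement modules. First, a multifaceted relevance module integrates the question, the associated image content, the exemplar, and the L2 learner's spoken response for a comprehensive assessment of content relevance. Second, fine-grained grammar error features are extracted with advanced grammar error correction (GEC) and detailed annotation to identify specific error categories. These additions significantly improve the evaluation of content relevance, language use, and overall ASA performance.
Link: https://arxiv.org/abs/2506.16285
Authors: Hao-Chien Lu,Jhen-Ke Lin,Hong-Yun Lin,Chung-Chun Wang,Berlin Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to the ISCA SLaTE-2025 Workshop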
Abstract:Current automated speaking assessment (ASA) systems for use in multi-aspect evaluations often fail to make full use of content relevance, overlooking image or exemplar cues, and employ superficial grammar analysis that lacks detailed error types. This paper ameliorates these deficiencies by introducing two novel enhancements to construct a hybrid scoring model. First, a multifaceted relevance module integrates the question, the associated image content, the exemplar, and the spoken response of an L2 speaker for a comprehensive assessment of content relevance. Second, fine-grained grammar error features are derived using advanced grammar error correction (GEC) and detailed annotation to identify specific error categories. Experiments and ablation studies demonstrate that these components significantly improve the evaluation of content relevance, language use, and overall ASA performance, highlighting the benefits of using richer, more nuanced feature sets for holistic speaking assessment.
[NLP-68] End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
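[Quick read]: This paper addresses the scarcity of high-quality annotated data for building end-to-end speech-to-text translation (ST) systems for low-resource language pairs. The key to the solution is to use weakly labeled data: bitext mining with state-of-the-art sentence encoders builds ST datasets for low-resource language pairs, which are then used to test whether weakly labeled data can yield effective ST models.
Link: https://arxiv.org/abs/2506.16251
Authors: Aishwarya Pothula,Bhavana Akkiraju,Srihari Bandarupalli,Charan D,Santosh Kesiraju,Anil Kumar Vuppala
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: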
Abstract:The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
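The bitext mining step described in this abstract can be pictured as nearest-neighbour matching between sentence embeddings. The sketch below is only an illustration under stated assumptions: the toy 2-D vectors stand in for real sentence-encoder outputs, and the function names and the fixed cosine threshold are hypothetical, not the authors' pipeline. Raising or lowering the threshold is one plausible way to obtain training sets of different quality and quantity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Keep (src_index, tgt_index, score) when the best target match clears the threshold.

    The threshold is a hypothetical quality knob, not a value from the paper.
    """
    pairs = []
    for i, s in enumerate(src_vecs):
        scored = [(j, cosine(s, t)) for j, t in enumerate(tgt_vecs)]
        best_j, best_score = max(scored, key=lambda item: item[1])
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs

# Toy embeddings standing in for sentence-encoder outputs.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [-1.0, 0.2]]

strict = mine_pairs(src, tgt, threshold=0.8)   # fewer, cleaner pairs
loose = mine_pairs(src, tgt, threshold=0.75)   # more, noisier pairs
print(len(strict), len(loose))
```

A stricter threshold keeps fewer but more reliable pairs, which mirrors the quality-versus-quantity trade-off the abstract investigates.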
[NLP-69] Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports
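[Quick read]: This paper addresses generating the concise "impression" section of a radiology report from the detailed "findings" section with abstractive summarization models. The key to the solution is to apply pre-trained and open-source large language models (T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B) together with a custom Pointer Generator Network with a coverage mechanism, and to compare them using several metrics (ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore) in order to identify each model's strengths and limitations on medical text.
Link: https://arxiv.org/abs/2506.16247
Authors: Anindita Bhattacharya,Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 2 figures, 6 tables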
Abstract:The findings section of a radiology report is often detailed and lengthy, whereas the impression section is comparatively more compact and captures key diagnostic conclusions. This research explores the use of advanced abstractive summarization models to generate the concise impression from the findings section of a radiology report. We have used the publicly available MIMIC-CXR dataset. A comparative analysis is conducted on leading pre-trained and open-source large language models, including T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B, and a custom Pointer Generator Network with a coverage mechanism. To ensure a thorough assessment, multiple evaluation metrics are employed, including ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore. By analyzing the performance of these models, this study identifies their respective strengths and limitations in the summarization of medical text. The findings of this paper provide helpful information for medical professionals who need automated summarization solutions in the healthcare sector.
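For readers unfamiliar with the ROUGE-N metrics named above, here is a minimal sketch of the underlying n-gram overlap computation. It assumes lowercase whitespace tokenization and omits the stemming and other options of the official tooling, so its scores will not match published numbers exactly. The function name and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 over lowercase whitespace tokens."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference impression and model output.
reference_impression = "no acute cardiopulmonary process"
generated_impression = "no acute process"

print(round(rouge_n_f1(reference_impression, generated_impression, n=1), 3))
print(round(rouge_n_f1(reference_impression, generated_impression, n=2), 3))
```

ROUGE-1 rewards shared words, while ROUGE-2 also requires them to appear in the same adjacent order, which is why the second score is lower for this pair.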
[NLP-70] Web(er) of Hate: A Survey on How Hate Speech Is Typed
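[Quick read]: This paper addresses the complex design decisions involved in curating hate speech datasets, which require trading off competing priorities. It critically analyzes the methodological choices of a range of datasets, identifying common themes and practices and their implications for dataset reliability. The key to the solution is a reflexive approach in which researchers acknowledge their own value judgments during dataset construction, promoting transparency and methodological rigour.
Link: https://arxiv.org/abs/2506.16190
Authors: Luna Wang,Andrew Caines,Alice Hutchings
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL)
Comments: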
Abstract:The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber’s notion of ideal types, we argue for a reflexive approach in dataset creation, urging researchers to acknowledge their own value judgments during dataset construction, fostering transparency and methodological rigour.
[NLP-71] JETHICS: Japanese Ethics Understanding Evaluation Dataset
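[Quick read]: This paper addresses the shortcomings of current large language models (LLMs) in ethics understanding, and in particular the lack of a dedicated dataset for evaluating ethical reasoning in Japanese. The key to the solution is the JETHICS dataset, which contains 78,000 examples built on normative theories and concepts from ethics and political philosophy as well as commonsense morality, providing a benchmark for evaluating the ethics understanding of AI models. Experiments on non-proprietary LLMs and GPT-4o show that existing models still have considerable room for improvement on this task.
Link: https://arxiv.org/abs/2506.16187
Authors: Masashi Takeshita,Rafal Rzepka
Affiliations: Nagoya University; Hokkaido University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: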
Abstract:In this work, we propose JETHICS, a Japanese dataset for evaluating ethics understanding of AI models. JETHICS contains 78K examples and is built by following the construction methods of the existing English ETHICS dataset. It includes four categories based on normative theories and concepts from ethics and political philosophy, and one representing commonsense morality. Our evaluation experiments on non-proprietary large language models (LLMs) and on GPT-4o reveal that even GPT-4o achieves only an average score of about 0.7, while the best-performing Japanese LLM attains around 0.5, indicating a relatively large room for improvement in current LLMs.
[NLP-72] SGIC: A Self-Guided Iterative Calibration Framework for RAG
[Quick read]: This paper addresses the fact that existing retrieval-augmented generation (RAG) methods often neglect the calibration capabilities of large language models (LLMs), even though LLMs have strong in-context reasoning abilities. The key to the solution is the Self-Guided Iterative Calibration (SGIC) framework, which uses uncertainty scores as a tool: it computes the relevance of each document to the query and the confidence of the LLM's responses, then re-evaluates these scores over multiple rounds and combines them with prior responses to refine calibration, improving response accuracy and reliability.
Link: https://arxiv.org/abs/2506.16172
Authors: Guanhua Chen,Yutong Yao,Lidia S. Chao,Xuebo Liu,Derek F. Wong
Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, numerous methodologies frequently neglect the calibration capabilities of large language models (LLMs), which capitalize on their robust in-context reasoning prowess. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibrations. We present a new SGIC: Self-Guided Iterative Calibration Framework that employs uncertainty scores as a tool. Initially, this framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, amalgamating them with prior responses to refine calibration. Furthermore, we introduce an innovative approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-weight LLMs.
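[NLP-73] Under the Shadow of Babel: How Language Shapes Reasoning in LLMs
[Quick read]: This paper asks whether the structure of a language shapes the internal cognitive patterns of large language models (LLMs), that is, how linguistic relativity manifests in LLMs. The key to the solution is BICAUSE, a structured bilingual causal reasoning dataset with semantically aligned Chinese and English samples in both forward and reversed causal forms, which enables systematic analysis of attention patterns, causal word-order preferences, and semantic abstraction across languages. The analysis indicates that LLMs not only mimic surface linguistic forms but also internalize reasoning biases shaped by language.
Link: https://arxiv.org/abs/2506.16151
Authors: Chenxi Wang,Yixuan Zhang,Lang Gao,Zixiang Xu,Zirui Song,Yanbo Wang,Xiuying Chen
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures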
Abstract:Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal word order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.
[NLP-74] PRISON: Unmasking the Criminal Potential of Large Language Models
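[Quick read]: This paper addresses the lack of systematic evaluation of the misconduct and criminal potential that large language models (LLMs) may show in complex social contexts. Existing research lacks a comprehensive understanding and quantitative analysis of LLMs' criminal capability in realistic interactions. The proposed unified framework, PRISON, quantifies criminal potential across five dimensions (False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement) and evaluates models via role-play in structured crime scenarios adapted from classic films. The key to the solution is a systematic evaluation scheme that exposes criminal tendencies LLMs may exhibit even without explicit instructions, as well as their weakness at recognizing criminal behavior.
Link: https://arxiv.org/abs/2506.16150
Authors: Xinyi Wu,Geng Hong,Pei Chen,Yueyue Chen,Xudong Pan,Min Yang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: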
Abstract:As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework PRISON, to quantify LLMs’ criminal potential across five dimensions: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films, we evaluate both criminal potential and anti-crime ability of LLMs via role-play. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 41% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
[NLP-75] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
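[Quick read]: This paper addresses the lack of rigorous evaluation for post-training of multimodal large language models (MLLMs), and the tendency of standard reinforcement learning methods to improve answer accuracy while reducing the logical consistency between reasoning steps and answers. The key to the solution is the GRPO-CARE framework, which introduces a two-tiered reward: a base reward for answer correctness, and an adaptive consistency bonus derived from the model's reasoning-to-answer likelihood under a slowly evolving reference model, compared within the sampled group. This optimizes both answer accuracy and reasoning coherence without explicit supervision.
Link: https://arxiv.org/abs/2506.16141
Authors: Yi Chen,Yuying Ge,Rui Wang,Yixiao Ge,Junhao Cheng,Ying Shan,Xihui Liu
Affiliations: The University of Hong Kong; ARC Lab, Tencent PCG; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code released at: this https URL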
Abstract:Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model’s reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
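The two-tiered reward in this abstract can be sketched roughly as follows. This is a guess at the general shape, not the paper's exact rule: the function name, the bonus size, and the choice to compare each correct sample against the mean likelihood of the correct samples in its group are all assumptions made for illustration.

```python
def care_rewards(correct, ref_likelihoods, bonus=0.5):
    """Illustrative two-tier reward for one group of sampled responses.

    correct:          list of bools, whether each response's final answer is right.
    ref_likelihoods:  likelihood a reference model assigns to the answer given
                      that response's reasoning (hypothetical inputs).
    bonus:            hypothetical size of the consistency bonus.
    """
    correct_liks = [p for ok, p in zip(correct, ref_likelihoods) if ok]
    # Assumed baseline: mean likelihood over the correct samples in the group.
    group_mean = sum(correct_liks) / len(correct_liks) if correct_liks else 0.0

    rewards = []
    for ok, p in zip(correct, ref_likelihoods):
        base = 1.0 if ok else 0.0                          # tier 1: correctness
        extra = bonus if ok and p > group_mean else 0.0    # tier 2: consistency
        rewards.append(base + extra)
    return rewards

print(care_rewards([True, True, False, True], [0.9, 0.4, 0.95, 0.5]))
# → [1.5, 1.0, 0.0, 1.0]
```

Note that the incorrect response gets no bonus even though its reasoning-to-answer likelihood is high, so the bonus only amplifies reasoning paths that are both correct and consistent.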
[NLP-76] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
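[Quick read]: This paper addresses the limitations of prompt engineering in financial NLP (FinNLP), particularly for structured chain-of-thought (CoT) prompting. Prior work relies mainly on standard or unstructured CoT prompting, and the reasoning structures in structured CoT prompts are usually designed from non-expert heuristics, which limits performance gains and interpretability. The key to the solution is FinCoT, a structured CoT prompting approach that incorporates domain-expert financial reasoning and guides the LLM through explicit reasoning steps, improving performance on financial tasks while reducing computational cost and making the reasoning more interpretable.
Link: https://arxiv.org/abs/2506.16123
Authors: Natapong Nitarach,Warit Sirichotedumrong,Panop Pitchayarthorn,Pittawat Taveekitworachai,Potsawee Manakul,Kunat Pipatanakul
Affiliations: SCB 10X, SCBX Group
Categories: Computation and Language (cs.CL)
Comments: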
Abstract:This paper presents FinCoT, a structured chain-of-thought (CoT) prompting approach that incorporates insights from domain-specific expert financial reasoning to guide the reasoning traces of large language models. We investigate that there are three main prompting styles in FinNLP: (1) standard prompting–zero-shot prompting; (2) unstructured CoT–CoT prompting without an explicit reasoning structure, such as the use of tags; and (3) structured CoT prompting–CoT prompting with explicit instructions or examples that define structured reasoning steps. Previously, FinNLP has primarily focused on prompt engineering with either standard or unstructured CoT prompting. However, structured CoT prompting has received limited attention in prior work. Furthermore, the design of reasoning structures in structured CoT prompting is often based on heuristics from non-domain experts. In this study, we investigate each prompting approach in FinNLP. We evaluate the three main prompting styles and FinCoT on CFA-style questions spanning ten financial domains. We observe that FinCoT improves performance from 63.2% to 80.5% and Qwen-2.5-7B-Instruct from 69.7% to 74.2%, while reducing generated tokens eight-fold compared to structured CoT prompting. Our findings show that domain-aligned structured prompts not only improve performance and reduce inference costs but also yield more interpretable and expert-aligned reasoning traces.
[NLP-77] Probing the Robustness of Large Language Models Safety to Latent Perturbations
[Quick read]: This paper addresses the fragility of current safety alignment methods under small perturbations in latent space: even aligned models can produce unsafe responses after minor changes to hidden activations. The key to the solution is Layer-wise Adversarial Patch Training (LAPT), an adversarial fine-tuning strategy that injects controlled perturbations into hidden representations during training, making alignment more robust to latent perturbations while preserving general capabilities.
Link: https://arxiv.org/abs/2506.16078
Authors: Tianle Gu,Kexin Huang,Zongqi Wang,Yixu Wang,Jie Li,Yuanqi Yao,Yang Yao,Yujiu Yang,Yan Teng,Yingchun Wang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Shanghai Artificial Intelligence Laboratory; Fudan University; The University of Hong Kong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:
Abstract:Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Codes and results are available at this https URL.
[NLP-78] Cyberbullying Detection in Hinglish Text Using MURIL and Explainable AI
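[Quick read]: This paper addresses the degraded performance of cyberbullying detection systems designed for monolingual text on digital platforms where Hindi and English are mixed (Hinglish). The key to the solution is the Multilingual Representations for Indian Languages (MURIL) architecture, whose multilingual representations improve cyberbullying detection on Hinglish text. Experiments show that it outperforms existing multilingual models such as RoBERTa and IndicBERT on several benchmark datasets.
Link: https://arxiv.org/abs/2506.16066
Authors: Devesh Kumar
Affiliations: National Institute of Technology Hamirpur
Categories: Computation and Language (cs.CL)
Comments: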
Abstract:The growth of digital communication platforms has led to increased cyberbullying incidents worldwide, creating a need for automated detection systems to protect users. The rise of code-mixed Hindi-English (Hinglish) communication on digital platforms poses challenges for existing cyberbullying detection systems, which were designed primarily for monolingual text. This paper presents a framework for cyberbullying detection in Hinglish text using the Multilingual Representations for Indian Languages (MURIL) architecture to address limitations in current approaches. Evaluation across six benchmark datasets – Bohra et al., BullyExplain, BullySentemo, Kumar et al., HASOC 2021, and Mendeley Indo-HateSpeech – shows that the MURIL-based approach outperforms existing multilingual models including RoBERTa and IndicBERT, with improvements of 1.36 to 13.07 percentage points and accuracies of 86.97% on Bohra, 84.62% on BullyExplain, 86.03% on BullySentemo, 75.41% on Kumar datasets, 83.92% on HASOC 2021, and 94.63% on Mendeley dataset. The framework includes explainability features through attribution analysis and cross-linguistic pattern recognition. Ablation studies show that selective layer freezing, appropriate classification head design, and specialized preprocessing for code-mixed content improve detection performance, while failure analysis identifies challenges including context-dependent interpretation, cultural understanding, and cross-linguistic sarcasm detection, providing directions for future research in multilingual cyberbullying detection.
[NLP-79] Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning
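[Quick read]: This paper addresses the difficulty large language models (LLMs) have in producing outputs that are consistently honest and helpful. The key to the solution is a new prompting strategy, self-critique-guided curiosity refinement prompting, which adds two lightweight in-context steps, a self-critique step and a refinement step, so that the model critiques and refines its own responses without additional training.
Link: https://arxiv.org/abs/2506.16064
Authors: Duc Hieu Ho,Chenglin Fan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: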
Abstract:Large language models (LLMs) have demonstrated robust capabilities across various natural language tasks. However, producing outputs that are consistently honest and helpful remains an open challenge. To overcome this challenge, this paper tackles the problem through two complementary directions. It conducts a comprehensive benchmark evaluation of ten widely used large language models, including both proprietary and open-weight models from OpenAI, Meta, and Google. In parallel, it proposes a novel prompting strategy, self-critique-guided curiosity refinement prompting. The key idea behind this strategy is enabling models to self-critique and refine their responses without additional training. The proposed method extends the curiosity-driven prompting strategy by incorporating two lightweight in-context steps including self-critique step and refinement step. The experiment results on the HONESET dataset evaluated using the framework $\mathrm{H}^2$ (honesty and helpfulness), which was executed with GPT-4o as a judge of honesty and helpfulness, show consistent improvements across all models. The approach reduces the number of poor-quality responses, increases high-quality responses, and achieves relative gains in $\mathrm{H}^2$ scores ranging from 1.4% to 4.3% compared to curiosity-driven prompting across evaluated models. These results highlight the effectiveness of structured self-refinement as a scalable and training-free strategy to improve the trustworthiness of LLM outputs.
[NLP-80] Knee-Deep in C-RASP: A Transformer Depth Hierarchy
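[Quick read]: This paper asks whether greater depth increases the capabilities of transformers, and how such a gain can be established formally. The key to the solution is to show that a subclass of transformers is expressively equivalent to the programming language C-RASP, and to prove, by studying a temporal logic with counting operators, that deeper C-RASP programs are more expressive. This implies that deeper transformers are more expressive within that subclass.
Link: https://arxiv.org/abs/2506.16055
Authors: Andy Yang,Michaël Cadilhac,David Chiang
Affiliations: University of Notre Dame; DePaul University
Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: 27 pages, 4 figures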
Abstract:It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained with greater depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). These results are established by studying a form of temporal logic with counting operators, which was shown equivalent to C-RASP in previous work. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
[NLP-81] A Hybrid DeBERTa and Gated Broad Learning System for Cyberbullying Detection in English Text
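[Quick read]: This paper addresses cyberbullying detection on online platforms, in particular the challenge of identifying harmful behavior in large-scale online communication. The key to the solution is a hybrid architecture that combines the contextual understanding of transformer-based models (DeBERTa) with the pattern recognition strengths of the Broad Learning System (BLS). A modified DeBERTa model is paired with a Gated Broad Learning System (GBLS) classifier, and the combined framework reports strong results on several benchmark datasets.
Link: https://arxiv.org/abs/2506.16052
Authors: Devesh Kumar
Affiliations: National Institute of Technology Hamirpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: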
Abstract:The proliferation of online communication platforms has created unprecedented opportunities for global connectivity while simultaneously enabling harmful behaviors such as cyberbullying, which affects approximately 54.4% of teenagers according to recent research. This paper presents a hybrid architecture that combines the contextual understanding capabilities of transformer-based models with the pattern recognition strengths of broad learning systems for effective cyberbullying detection. This approach integrates a modified DeBERTa model augmented with Squeeze-and-Excitation blocks and sentiment analysis capabilities with a Gated Broad Learning System (GBLS) classifier, creating a synergistic framework that outperforms existing approaches across multiple benchmark datasets. The proposed ModifiedDeBERTa + GBLS model achieved good performance on four English datasets: 79.3% accuracy on HateXplain, 95.41% accuracy on SOSNet, 91.37% accuracy on Mendeley-I, and 94.67% accuracy on Mendeley-II. Beyond performance gains, the framework incorporates comprehensive explainability mechanisms including token-level attribution analysis, LIME-based local interpretations, and confidence calibration, addressing critical transparency requirements in automated content moderation. Ablation studies confirm the meaningful contribution of each architectural component, while failure case analysis reveals specific challenges in detecting implicit bias and sarcastic content, providing valuable insights for future improvements in cyberbullying detection systems.
[NLP-82] DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling
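[Quick read]: This paper addresses the limits on practical inference-time scaling caused by reliance on external verifiers or by a lack of optimization for realistic compute constraints. The key to the solution is DynScaling, which has two main innovations: an integrated parallel-sequential sampling strategy and a multi-armed-bandit-based dynamic budget allocation framework. The former unifies parallel and sequential sampling by building synthetic sequential reasoning chains from initially independent parallel responses, encouraging diverse and coherent reasoning paths. The latter models compute allocation as a multi-armed bandit problem and adaptively distributes the inference budget across queries according to the uncertainty of previously sampled responses, improving computational efficiency.
Link: https://arxiv.org/abs/2506.16043
Authors: Fei Wang,Xingchen Wan,Ruoxi Sun,Jiefeng Chen,Sercan Ö. Arık
Affiliations: Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: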
Abstract:Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The integrated sampling strategy unifies parallel and sequential sampling by constructing synthetic sequential reasoning chains from initially independent parallel responses, promoting diverse and coherent reasoning trajectories. The dynamic budget allocation framework formulates the allocation of computational resources as a multi-armed bandit problem, adaptively distributing the inference budget across queries based on the uncertainty of previously sampled responses, thereby maximizing computational efficiency. By combining these components, DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost.
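To make the bandit-style budget allocation more concrete, the sketch below treats each query as an arm and spends each extra sample on the query whose answers so far disagree the most. Everything specific here is an assumption for illustration: the disagreement-based uncertainty measure, the UCB-style exploration term, and the names `allocate`, `uncertainty`, and `toy_sampler` are not taken from the paper.

```python
import math
from collections import Counter

def uncertainty(answers):
    """1 minus the share of the most common answer; 0 means full agreement."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)

def allocate(queries, sample_fn, budget, init=2, c=0.1):
    """Spend `budget` samples across queries, favouring uncertain ones.

    sample_fn(query, k) returns the k-th sampled answer for a query
    (a stand-in for an LLM call). `init` must be at least 1.
    """
    answers = {q: [sample_fn(q, k) for k in range(init)] for q in queries}
    spent = init * len(queries)
    while spent < budget:
        def score(q):
            n = len(answers[q])
            # Observed disagreement plus a small exploration bonus (assumed UCB form).
            return uncertainty(answers[q]) + c * math.sqrt(math.log(spent) / n)
        chosen = max(queries, key=score)
        answers[chosen].append(sample_fn(chosen, len(answers[chosen])))
        spent += 1
    return answers

def toy_sampler(query, k):
    """Deterministic stand-in: the easy query always agrees, the hard one flips."""
    if query == "easy":
        return "4"
    return "a" if k % 2 == 0 else "b"

result = allocate(["easy", "hard"], toy_sampler, budget=10)
print({q: len(a) for q, a in result.items()})
# → {'easy': 2, 'hard': 8}
```

In this toy run the easy query keeps only its initial two samples while the contested query absorbs the rest of the budget, which is the behaviour the abstract describes as adaptive distribution across queries.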
[NLP-83] Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3
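[Quick read]: This paper addresses the challenges of multi-hop reasoning and long-document contextual understanding in complex question answering. The key to the solution is a Retrieval-Augmented Generation (RAG) framework built on LLaMA 3 that integrates a dense retrieval module, a context fusion mechanism, and multi-hop reasoning to produce more accurate and coherent answers. A joint optimization strategy that combines retrieval likelihood with generation cross-entropy further improves the model's robustness and adaptability.
Link: https://arxiv.org/abs/2506.16037
Authors: Xinyue Huang,Ziqi Lin,Fang Sun,Wenchao Zhang,Kejian Tong,Yunbo Liu
Affiliations: Cornell University; University of Southern California
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: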
Abstract:This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model’s robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.
[NLP-84] EvoLM: In Search of Lost Language Model Training Dynamics
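[Quick read]: This paper addresses the difficulty downstream developers face in evaluating the design choices made at each stage of modern, multi-stage language model (LM) training. The key to the solution is EvoLM, a model suite that enables systematic and transparent analysis of training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, the study evaluates upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including in-domain and out-of-domain generalization. Key insights include the diminishing returns from excessive pre-training and post-training, ways to mitigate forgetting during domain-specific continued pre-training, the role of continued pre-training in bridging the pre-training and post-training phases, and intricate trade-offs when configuring supervised fine-tuning and reinforcement learning.
Link: https://arxiv.org/abs/2506.16029
Authors: Zhenting Qi,Fan Nie,Alexandre Alahi,James Zou,Himabindu Lakkaraju,Yilun Du,Eric Xing,Sham Kakade,Hanlin Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: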
Abstract:Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs’ training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.
[NLP-85] From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation
[Quick read]: This paper addresses the lack of high-quality reference data for Open-ended Long Text Generation (Open-LTG) and the limited accuracy of prior methods that use only general assessments as reward signals. The key to the solution is ProxyReward, a reinforcement learning (RL) based framework comprising the ProxyReward Dataset, which is generated automatically through simple prompts, and the ProxyReward Signal, which gives a targeted evaluation of information comprehensiveness and accuracy for specific questions. Together they improve model performance on Open-LTG tasks.
Link: https://arxiv.org/abs/2506.16024
Authors: Zhihan Guo,Jiele Wu,Wenqian Cui,Yifei Zhang,Minda Hu,Yufei Wang,Irwin King
Affiliations: The Chinese University of Hong Kong, Hong Kong SAR, China; National University of Singapore; Nanyang Technological University; Macquarie University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current research on long-form context in Large Language Models (LLMs) primarily focuses on the understanding of long contexts, while Open-ended Long Text Generation (Open-LTG) remains insufficiently explored. Training a long-context generation model requires curation of gold standard reference data, which is typically nonexistent for informative Open-LTG tasks. However, previous methods only utilize general assessments as reward signals, which limits accuracy. To bridge this gap, we introduce ProxyReward, an innovative reinforcement learning (RL) based framework, which includes a dataset and a reward signal computation method. Firstly, ProxyReward Dataset generation is accomplished through simple prompts that enable the model to create it automatically, obviating extensive labeled data or significant manual effort. Secondly, ProxyReward Signal offers a targeted evaluation of information comprehensiveness and accuracy for specific questions. The experimental results indicate that our method ProxyReward surpasses even GPT-4-Turbo. It can significantly enhance performance by 20% on the Open-LTG task when training widely used open-source models, while also surpassing the LLM-as-a-Judge approach. Our work presents effective methods to enhance the ability of LLMs to address complex open-ended questions posed by humans.
[NLP-86] Bayesian Epistemology with Weighted Authority: A Formal Architecture for Truth-Promoting Autonomous Scientific Reasoning
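[Quick read]: This paper addresses the problem that the exponential growth of scientific literature has outpaced the epistemic processing capacity of both human experts and current AI systems. The key to its solution is the Bayesian Epistemology with Weighted Authority (BEWA) architecture, which models belief as a dynamic, probabilistically coherent function over structured scientific claims. Each claim is contextualised, attributed to its authors, and evaluated using replication scores, citation weighting, and temporal decay, which enables evidence-conditioned Bayesian inference, contradiction processing, and epistemic decay mechanisms.
Link: https://arxiv.org/abs/2506.16015
Authors: Craig S. Wright
Affiliations: University of Exeter
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Logic in Computer Science (cs.LO); Logic (math.LO)
Comments: 91 pages, 0 figures, includes mathematical appendix and formal proofs. Designed as a foundational submission for a modular autonomous epistemic reasoning system. Suitable for logic in computer science, AI epistemology, and scientific informatics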
Abstract:The exponential expansion of scientific literature has surpassed the epistemic processing capabilities of both human experts and current artificial intelligence systems. This paper introduces Bayesian Epistemology with Weighted Authority (BEWA), a formally structured architecture that operationalises belief as a dynamic, probabilistically coherent function over structured scientific claims. Each claim is contextualised, author-attributed, and evaluated through a system of replication scores, citation weighting, and temporal decay. Belief updates are performed via evidence-conditioned Bayesian inference, contradiction processing, and epistemic decay mechanisms. The architecture supports graph-based claim propagation, authorial credibility modelling, cryptographic anchoring, and zero-knowledge audit verification. By formalising scientific reasoning into a computationally verifiable epistemic network, BEWA advances the foundation for machine reasoning systems that promote truth utility, rational belief convergence, and audit-resilient integrity across dynamic scientific domains.
[NLP-87] Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion ACL2025
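[Quick read]: This paper addresses the detection of AI-generated music, given the limitations of existing detectors: audio-based detectors generalize poorly to new generators and are vulnerable to audio perturbations, while lyrics-based methods need cleanly formatted, accurate lyrics that are rarely available in practice. The key to its solution is DE-detect, a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics with speech features capturing lyrics-related information in the audio. This improves robustness, reduces sensitivity to low-level artifacts, and makes detection more practical.
Link: https://arxiv.org/abs/2506.15981
Authors: Markus Frohmann,Gabriel Meseguer-Brocal,Markus Schedl,Elena V. Epure
Affiliations: Deezer Research; Johannes Kepler University Linz; Linz Institute of Technology, AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to ACL 2025 Findings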
Abstract:The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at this https URL.
[NLP-88] A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension
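[Quick read]: This paper addresses the scarcity of Vietnamese resources for natural language processing (NLP), specifically for text segmentation and machine reading comprehension (MRC). The key to its solution is the VSMRC dataset, built from Vietnamese Wikipedia, which contains 15,942 documents for text segmentation and 16,347 synthetic multiple-choice question-answer pairs generated with human quality assurance. Experiments show that the multilingual model mBERT outperforms monolingual models on both tasks, which points to the potential of multilingual models for low-resource languages.
Link: https://arxiv.org/abs/2506.15978
Authors: Toan Nguyen Hai,Ha Nguyen Viet,Truong Quan Xuan,Duc Do Minh
Affiliations: Institute for Artificial Intelligence, VNU University of Engineering and Technology; Faculty of Information Technology, VNU University of Engineering and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: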
Abstract:Vietnamese, the 20th most spoken language with over 102 million native speakers, lacks robust resources for key natural language processing tasks such as text segmentation and machine reading comprehension (MRC). To address this gap, we present VSMRC, the Vietnamese Text Segmentation and Multiple-Choice Reading Comprehension Dataset. Sourced from Vietnamese Wikipedia, our dataset includes 15,942 documents for text segmentation and 16,347 synthetic multiple-choice question-answer pairs generated with human quality assurance, ensuring a reliable and diverse resource. Experiments show that mBERT consistently outperforms monolingual models on both tasks, achieving an accuracy of 88.01% on MRC test set and an F1 score of 63.15% on text segmentation test set. Our analysis reveals that multilingual models excel in NLP tasks for Vietnamese, suggesting potential applications to other under-resourced languages. VSMRC is available at HuggingFace
[NLP-89] Multi-use LLM Watermarking and the False Detection Problem
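[Quick read]: This paper addresses the false detection problem that arises when the same watermark embedding is used for both detection and user identification in generated text: as the number of users grows, unwatermarked text is increasingly likely to be flagged as watermarked. The key to its solution is Dual Watermarking, which jointly encodes a detection watermark and an identification watermark into the generated text, significantly reducing false positives while maintaining high detection accuracy.
Link: https://arxiv.org/abs/2506.15975
Authors: Zihao Fu,Chris Russell
Affiliations: Oxford Internet Institute; University of Oxford
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: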
Abstract:Digital watermarking is a promising solution for mitigating some of the risks arising from the misuse of automatically generated text. These approaches either embed non-specific watermarks to allow for the detection of any text generated by a particular sampler, or embed specific keys that allow the identification of the LLM user. However, simultaneously using the same embedding for both detection and user identification leads to a false detection problem, whereby, as user capacity grows, unwatermarked text is increasingly likely to be falsely detected as watermarked. Through theoretical analysis, we identify the underlying causes of this phenomenon. Building on these insights, we propose Dual Watermarking which jointly encodes detection and identification watermarks into generated text, significantly reducing false positives while maintaining high detection accuracy. Our experimental results validate our theoretical findings and demonstrate the effectiveness of our approach.
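The false detection problem can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's analysis: it assumes that a detector tests a text against each user key independently and that every test has the same false positive rate `alpha`. The function name and parameters are hypothetical.

```python
def familywise_false_positive_rate(alpha: float, num_users: int) -> float:
    """Probability that unwatermarked text matches at least one of num_users keys.

    Assumption (ours, for illustration): per-key tests are independent and each
    has false positive rate alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_users


# The per-key rate stays fixed, but the overall rate climbs as users are added.
for n in (1, 100, 10_000):
    print(n, familywise_false_positive_rate(1e-4, n))
```

Under these assumptions, a per-key rate that looks negligible becomes a sizeable overall rate once the key space is large, which is the effect the abstract describes.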
[NLP-90] LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
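[Quick read]: This paper addresses the GPU memory overhead caused by large key-value (KV) caches when Large Language Models (LLMs) perform long reasoning; existing KV cache compression methods struggle on such tasks. It proposes LazyEviction, a lagged KV eviction framework whose key idea is an observation-window mechanism that defers eviction across decoding steps. It has two core components: (1) recurrence interval tracking, which captures temporal variation in token importance; and (2) an eviction policy centered on the maximum recurrence interval, which prioritizes eviction according to each token's recurrence pattern. Experiments show that LazyEviction reduces KV cache size by 50% while maintaining comparable accuracy on mathematical reasoning datasets.
Link: https://arxiv.org/abs/2506.15969
Authors: Haoyue Zhang,Hualei Zhang,Xiaosong Ma,Jie Zhang,Song Guo
Affiliations: HKUST; Tsinghua University; HK PolyU
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: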
Abstract:Large Language Models (LLMs) exhibit enhanced reasoning capabilities by employing Chain-of-Thought (CoT). However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache size, particularly in tasks requiring long reasoning sequences, such as mathematics and programming. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens receive renewed attention after multiple decoding steps, which is failed to capture by existing works and may lead to unpredictable eviction on such periodically critical tokens. To address this, we propose LazyEviction, a lagged KV eviction framework designed to maintain reasoning performance while reducing KV memory. LazyEviction is an Observation Window-based Lagged Eviction Mechanism retaining latent recurring tokens by performing lagged evictions across decoding steps, which contains two key components: (1) Recurrence Interval Tracking for capturing temporal variations in token importance, and (2) an Maximum Recurrence Interval-Centric Eviction Policy that prioritizes eviction based on tokens’ recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache size by 50% while maintaining comparable accuracy on mathematics reasoning datasets, outperforming state-of-the-art methods. Our findings highlight the importance of preserving recurring tokens, which are critical for maintaining knowledge continuity in multi-step reasoning tasks.
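The two components named in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the attention threshold, the `TokenStats` record, and the "most overdue first" ordering are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    last_important_step: int = 0      # last step at which the token drew attention
    max_recurrence_interval: int = 0  # longest gap seen between two such steps


def update_stats(stats, attention, step, threshold=0.05):
    """Recurrence interval tracking: update per-token gaps from one decoding step.

    attention maps token id -> attention weight at this step. A token seen for
    the first time is treated as important at that step (an assumption).
    """
    for tok, weight in attention.items():
        s = stats.setdefault(tok, TokenStats(last_important_step=step))
        if weight >= threshold:
            interval = step - s.last_important_step
            s.max_recurrence_interval = max(s.max_recurrence_interval, interval)
            s.last_important_step = step


def select_evictions(stats, step, budget):
    """Pick tokens to evict so that at most budget tokens remain.

    Tokens that have been idle longest relative to their own maximum
    recurrence interval are evicted first (hypothetical policy).
    """
    overflow = len(stats) - budget
    if overflow <= 0:
        return []

    def overdue(tok):
        s = stats[tok]
        return (step - s.last_important_step) - s.max_recurrence_interval

    return sorted(stats, key=overdue, reverse=True)[:overflow]
```

In a lagged scheme, `update_stats` would run at every decoding step while `select_evictions` would only run once per observation window, so that a token with a long recurrence interval has a chance to reappear before it is dropped.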
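[NLP-91] Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues KDD2025
[Quick read]: This paper addresses how to evaluate and improve the adaptability and reliability of agentic AI systems in mission-critical negotiation scenarios. The core challenge is enabling AI agents to handle the varied personalities of human operators and stakeholders and to reach efficient, trustworthy negotiation outcomes in complex human-AI teaming settings. The key to its solution is an evaluation framework built on the Sotopia simulation testbed. Systematic experiments analyze how personality traits and AI agent characteristics influence social negotiation outcomes, using causal discovery methods and sociocognitive lexical measures to detect fine-grained differences in agents' empathic communication, moral foundations, and opinion patterns, which yields actionable guidance for reliable AI systems in high-stakes operational scenarios.
Link: https://arxiv.org/abs/2506.15928
Authors: Myke C. Cohen,Zhe Su,Hsien-Te Kao,Daniel Nguyen,Spencer Lynch,Maarten Sap,Svitlana Volkova
Affiliations: Aptima, Inc.; Arizona State University; Carnegie Mellon University; Amazon
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Under review for KDD 2025 Workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models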
Abstract:This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence LLM-simulated social negotiation outcomes–a capability essential for a variety of applications involving cross-team coordination and civil-military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations, through which we found that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detected fine-grained differences in agents’ empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high-stakes operational scenarios. Experiment 2 evaluates human-AI job negotiations by manipulating both simulated human personality and AI system characteristics, specifically transparency, competence, adaptability, demonstrating how AI agent trustworthiness impact mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human-agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.
[NLP-92] Reranking-based Generation for Unbiased Perspective Summarization ACL2025
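[Quick read]: This paper addresses the generation of unbiased summaries in real-world settings, in particular political perspective summarization. Existing evaluation frameworks rely on traditional metrics to measure key attributes such as coverage and faithfulness without verifying that those metrics apply, and work on better summarizers is still at an early stage. The paper (1) identifies reliable metrics for perspective summary quality and (2) investigates Large Language Model (LLM) based methods beyond zero-shot inference. Using a test set built from human annotations to benchmark metric reliability, it finds that traditional metrics underperform language-model-based metrics, which prove to be strong evaluators. With these metrics, it shows that reranking-based methods yield strong results, and that preference tuning on synthetically generated, reranking-labeled data improves performance further.
Link: https://arxiv.org/abs/2506.15925
Authors: Narutatsu Ri,Nicholas Deas,Kathleen McKeown
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Findings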
Abstract:Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
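Reranking-based generation and reranking-labeled preference data can be sketched in a few lines. The paper scores candidates with language-model-based metrics; the word-overlap `coverage_score` below is only a stand-in so the example runs on its own, and both function names are hypothetical.

```python
def coverage_score(summary, key_points):
    """Toy stand-in for a learned metric: fraction of key points mentioned."""
    words = set(summary.lower().split())
    return sum(1 for p in key_points if p.lower() in words) / len(key_points)


def rerank(candidates, key_points):
    """Return the best candidate plus a (chosen, rejected) preference pair.

    The pair is the kind of reranking-labeled example that could feed
    preference tuning; pairing best with worst is an assumption.
    """
    ranked = sorted(candidates, key=lambda c: coverage_score(c, key_points), reverse=True)
    return ranked[0], (ranked[0], ranked[-1])


candidates = ["taxes matter", "taxes and jobs matter", "nothing here"]
best, pair = rerank(candidates, ["taxes", "jobs"])
print(best)
```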
[NLP-93] Early Attentive Sparsification Accelerates Neural Speech Transcription
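[Quick read]: This paper addresses the computational cost of neural speech transcription by sparsifying the time-domain signal early in the neural encoding stage to speed up inference. The key to its solution is to exploit the interpretability of the self-attention mechanism in transformer audio encoders and sparsify the hidden state at a chosen encoder layer, improving runtime efficiency while staying close to the original model's accuracy. Experiments show that, with no more than 1% accuracy degradation, sparsifying the hidden state to 40-60% sparsity gives up to 1.6x runtime acceleration on Nvidia GPUs.
Link: https://arxiv.org/abs/2506.15912
Authors: Zifei Xu,Sayeh Sharify,Hesham Mostafa,Tristan Webb,Wanzin Yazar,Xin Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: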
Abstract:Transformer-based neural speech processing has achieved state-of-the-art performance. Since speech audio signals are known to be highly compressible, here we seek to accelerate neural speech transcription by time-domain signal sparsification early in the neural encoding stage, taking advantage of the interpretability of the self-attention mechanism in transformer audio encoders. With the Whisper family of models, we perform a systematic architecture search over the joint space of sparsification stage (a certain encoder layer) and compression ratio (sparsity). We found that the best resulting solutions under 1% accuracy degradation choose to sparsify the hidden state to 40-60% sparsity at an early encoding stage, and thereby achieve up to 1.6x runtime acceleration in English speech transcription tasks on Nvidia GPUs without any fine-tuning.
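One way to picture attention-guided sparsification is to keep only the time steps that receive the most attention at the chosen encoder layer. The abstract does not specify the scoring rule, so the "total received attention" criterion below is an assumption, and the function is a plain-Python sketch rather than anything from the paper or from Whisper.

```python
def sparsify_hidden_states(hidden, attention, sparsity):
    """Keep the (1 - sparsity) fraction of time steps with the most received attention.

    hidden:    list of per-time-step vectors
    attention: square matrix, attention[q][k] = weight query q puts on key k
    Returns the kept vectors in temporal order and their original indices.
    """
    n = len(hidden)
    keep = max(1, round(n * (1.0 - sparsity)))
    # Column sums: how much attention each time step receives overall.
    received = [sum(row[k] for row in attention) for k in range(n)]
    top = sorted(range(n), key=lambda k: received[k], reverse=True)[:keep]
    kept = sorted(top)  # restore temporal order
    return [hidden[k] for k in kept], kept
```

Later encoder layers would then run on the shorter sequence, which is where a runtime saving would come from.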
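[NLP-94] From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents ICML-25
[Quick read]: This paper addresses how to bring the preventive care, nutrition, and holistic therapies recorded in centuries-old Islamic medical texts (such as Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi) into modern AI systems for reliable, culturally sensitive medical question answering. The key to its solution is Tibbe-AG, a unified evaluation pipeline that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. A secondary LLM acting as an agentic judge assesses each answer and produces a single 3C3H quality score, which the paper uses to measure gains in accuracy and safety.
Link: https://arxiv.org/abs/2506.15911
Authors: Mohammad Amaan Sayeed,Mohammed Talha Alam,Raza Imam,Shahab Saquib Sohail,Amir Hussain
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Under-review at the 4th Muslims in Machine Learning (MusIML) Workshop (ICML-25)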
Abstract:Centuries-old Islamic medical texts like Avicenna’s Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.
[NLP-95] Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning
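[Quick read]: This paper addresses the brittleness of Large Language Models (LLMs) on mathematical reasoning tasks with respect to problem description and prompting strategy, as well as the sampling-induced errors that autoregressive models make during reasoning. The key to its approach is to measure how well models self-correct synthetic perturbations injected into their Chain of Thought (CoT) reasoning. Across a range of open-weight models and datasets it observes robust single-utterance intrinsic self-correction, suggesting that LLMs may have stronger intrinsic self-correction capabilities than the literature commonly shows.
Link: https://arxiv.org/abs/2506.15894
Authors: Sam Silver,Jimin Sun,Ivan Zhang,Sara Hooker,Eddie Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: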
Abstract:Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, yet their performance remains brittle to minor variations in problem description and prompting strategy. Furthermore, reasoning is vulnerable to sampling-induced errors which autoregressive models must primarily address using self-correction via additionally-generated tokens. To better understand self-correction capabilities of recent models, we conduct experiments measuring models’ ability to self-correct synthetic perturbations introduced into their Chain of Thought (CoT) reasoning. We observe robust single-utterance intrinsic self-correction behavior across a range of open-weight models and datasets, ranging from subtle, implicit corrections to explicit acknowledgments and corrections of errors. Our findings suggest that LLMs, including those not finetuned for long CoT, may possess stronger intrinsic self-correction capabilities than commonly shown in the literature. The presence of this ability suggests that recent “reasoning” model work involves amplification of traits already meaningfully present in models.
[NLP-96] Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
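[Quick read]: This paper addresses the difficulty of applying Byte-Pair Encoding (BPE) to unsegmented languages such as Chinese, where BPE's frequency-driven merge operation ignores linguistic boundaries. The key to its solution is a pair of entropy-informed pre-tokenization strategies that guide BPE segmentation with unsupervised information-theoretic cues: one uses pointwise mutual information and left/right entropy, the other uses predictive entropy from a pretrained GPT-2 model. Both improve segmentation precision, recall, and F1 score.
Link: https://arxiv.org/abs/2506.15889
Authors: Yifan Hu,Frank Liang,Dachuan Zhao,Jonathan Geuter,Varshini Reddy,Craig W. Schmidt,Chris Tanner
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: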
Abstract:Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.
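The two statistics behind the first strategy are standard and easy to compute from raw text. The sketch below shows pointwise mutual information for adjacent characters and the right entropy of a span; how the paper combines them into boundary decisions is not stated in the abstract, so no thresholding is shown, and the function names are our own.

```python
import math
from collections import Counter


def pmi_table(text):
    """PMI of each adjacent character pair, estimated from counts in text."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n_uni, n_bi = len(text), max(1, len(text) - 1)
    return {
        (a, b): math.log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }


def right_entropy(text, span):
    """Shannon entropy (in nats) of the character following each occurrence of span."""
    followers = Counter()
    start = text.find(span)
    while start != -1:
        nxt = start + len(span)
        if nxt < len(text):
            followers[text[nxt]] += 1
        start = text.find(span, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in followers.values())
```

Intuitively, high PMI inside a span and high entropy at its edges suggest a coherent unit, so merges across such edges would be discouraged. Left entropy is the mirror image, computed over preceding characters.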
[NLP-97] Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
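[Quick read]: This paper addresses the lack of flexibility in reasoning depth in existing test-time compute methods: Best-of-N, majority voting, and self-reflection typically reason in a uniform way across inputs, ignoring that different problems may need different levels of reasoning depth. The key to its solution is Fractional Reasoning, a training-free, model-agnostic framework that extracts the latent steering vector associated with deeper reasoning and reapplies it with a tunable scaling factor. This gives continuous control over reasoning intensity at inference time, so the model can adapt its reasoning to the complexity of each input.
Link: https://arxiv.org/abs/2506.15882
Authors: Sheng Liu,Tianlang Chen,Pan Lu,Haotian Ye,Yizheng Chen,Lei Xing,James Zou
Affiliations: Stanford University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: 18 pages, 5 figures, Project website: this https URL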
Abstract:Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
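The "extract, then reapply with a scaling factor" idea can be sketched on plain lists. The extraction step here, a mean difference between hidden states computed with and without a reasoning prompt, is a common way to build steering vectors and is an assumption on our part; the abstract does not say how the vector is obtained. All names are hypothetical.

```python
def steering_vector(with_prompt, without_prompt):
    """Mean difference between paired hidden states (with vs. without the prompt)."""
    n = len(with_prompt)
    dim = len(with_prompt[0])
    return [
        sum(w[i] - wo[i] for w, wo in zip(with_prompt, without_prompt)) / n
        for i in range(dim)
    ]


def apply_steering(hidden, vector, alpha):
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 2.0]])
print(apply_steering([0.0, 0.0], v, 0.5))
```

Setting `alpha` to 0 leaves the model untouched, and larger values push it further along the extracted direction, which is what makes the control continuous rather than a fixed prompt.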
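[NLP-98] MoR: Better Handling Diverse Queries with a Mixture of Sparse, Dense, and Human Retrievers
[Quick read]: This paper addresses the poor generalization of conventional Retrieval-augmented Generation (RAG) systems that fix a single retriever. Existing practice usually picks one retriever by heuristics, which cannot adapt to diverse information needs. The key to its solution is a mixture of retrievers: a zero-shot, weighted combination of heterogeneous retrievers that dynamically integrates their complementary signals (for example, BM25's lexical matching and dense retrievers' semantic similarity). Experiments show that the mixture, at only 0.8B parameters in total, outperforms every individual retriever and even larger 7B models.
Link: https://arxiv.org/abs/2506.15862
Authors: Jushaan Singh Kalra,Xinran Zhao,To Eun Kim,Fengyu Cai,Fernando Diaz,Tongshuang Wu
Affiliations: Carnegie Mellon University; Technical University of Darmstadt
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 3 figures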
Abstract:Retrieval-augmented Generation (RAG) is powerful, but its effectiveness hinges on which retrievers we use and how. Different retrievers offer distinct, often complementary signals: BM25 captures lexical matches; dense retrievers, semantic similarity. Yet in practice, we typically fix a single retriever based on heuristics, which fails to generalize across diverse information needs. Can we dynamically select and integrate multiple retrievers for each individual query, without the need for manual selection? In our work, we validate this intuition with quantitative analysis and introduce mixture of retrievers: a zero-shot, weighted combination of heterogeneous retrievers. Extensive experiments show that such mixtures are effective and efficient: Despite totaling just 0.8B parameters, this mixture outperforms every individual retriever and even larger 7B models by +10.8% and +3.9% on average, respectively. Further analysis also shows that this mixture framework can help incorporate specialized non-oracle human information sources as retrievers to achieve good collaboration, with a 58.9% relative performance improvement over simulated humans alone.
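A weighted combination of heterogeneous retrievers can be sketched as score fusion. In the paper the weights are estimated per query in a zero-shot way; in this toy version they are simply passed in, and the min-max normalization is our assumption for making scores from different retrievers comparable.

```python
def min_max(scores):
    """Rescale one retriever's scores to [0, 1] so retrievers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}


def mix_retrievers(score_lists, weights):
    """Rank documents by the weighted sum of normalized per-retriever scores."""
    mixed = {}
    for name, scores in score_lists.items():
        for doc, s in min_max(scores).items():
            mixed[doc] = mixed.get(doc, 0.0) + weights[name] * s
    return sorted(mixed, key=mixed.get, reverse=True)


ranking = mix_retrievers(
    {"bm25": {"d1": 10.0, "d2": 2.0}, "dense": {"d1": 0.1, "d2": 0.9, "d3": 0.5}},
    {"bm25": 0.3, "dense": 0.7},
)
print(ranking)
```

A document missing from one retriever's list simply gets no contribution from it, which lets sparse, dense, and human sources with different coverage be mixed.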
[NLP-99] Finance Language Model Evaluation (FLaME)
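[Quick read]: This paper addresses the shortcomings of existing evaluation frameworks for Language Models (LMs) on Finance NLP (FinNLP) tasks, which have led to LM performance on common FinNLP tasks being underestimated. The key to its solution is Financial Language Model Evaluation (FLaME), the first holistic benchmarking suite for financial language models. Through an empirical study of 23 foundation LMs over 20 core NLP tasks in finance, it also compares standard LMs against "reasoning-reinforced" LMs.
Link: https://arxiv.org/abs/2506.15846
Authors: Glenn Matlin,Mika Okamoto,Huzaifa Pardawala,Yang Yang,Sudheer Chava
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: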
Abstract:Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have caused an erroneous belief in a far lower bound of LMs’ performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). We are the first research paper to comprehensively study LMs against ‘reasoning-reinforced’ LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.
[NLP-100] MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
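[Quick read]: This paper addresses the unbounded memory growth, high computational cost, and degraded reasoning that modern language agents face in long-horizon, multi-turn interactions. Existing LLM systems usually rely on full-context prompting, which increases memory use and lowers reasoning efficiency. The key to its solution is MEM1, an end-to-end reinforcement learning framework that lets agents operate with constant memory across multi-turn tasks. At each turn the agent updates a compact shared internal state that jointly supports memory consolidation and reasoning, integrating prior memory with new observations from the environment while strategically discarding irrelevant or redundant information.
Link: https://arxiv.org/abs/2506.15841
Authors: Zijian Zhou,Ao Qu,Zhaoxuan Wu,Sunghwan Kim,Alok Prakash,Daniela Rus,Jinhua Zhao,Bryan Kian Hsiang Low,Paul Pu Liang
Affiliations: Singapore-MIT Alliance for Research and Technology Centre; National University of Singapore; MIT; Yonsei University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: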
Abstract:Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.
[NLP-101] Rethinking LLM Training through Information Geometry and Quantum Metrics
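[Quick read]: This paper addresses optimization of Large Language Models (LLMs) in high-dimensional parameter spaces with non-Euclidean structure. The key to its approach is the framework of information geometry, in particular using the Fisher information metric to characterize the optimization landscape, which enables more principled learning via natural gradient descent. This curvature-aware view helps explain phenomena in LLM training such as sharp minima, generalization, and observed scaling laws.
Link: https://arxiv.org/abs/2506.15830
Authors: Riccardo Di Sipio
Affiliations: Dayforce, HCM
Categories: Computation and Language (cs.CL); Quantum Physics (quant-ph)
Comments: 9 pages, 1 figure(s)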
Abstract:Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-aware approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.
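For readers who want the objects named in the abstract written out, the textbook definitions of the Fisher information metric and the natural gradient update are given below. The notation (model distribution $p_\theta$, loss $\mathcal{L}$, learning rate $\eta$) is ours and may differ from the paper's.

```latex
F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top} \right],
\qquad
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta \mathcal{L}(\theta_t)
```

Ordinary gradient descent corresponds to replacing $F(\theta_t)$ with the identity matrix. Forming and inverting $F$ scales poorly with parameter count, which is consistent with the abstract's remark that the approach is often impractical.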
[NLP-102] Veracity: An Open-Source AI Fact-Checking System
[Quick read]: This paper addresses the threat that the spread of misinformation poses to society, a threat made worse by the capabilities of generative AI. The key to its solution is Veracity, an open-source AI system that combines Large Language Models (LLMs) with web retrieval agents to analyze user-submitted claims and return grounded veracity assessments with intuitive explanations.
Link: https://arxiv.org/abs/2506.15794
Authors: Taylor Lynn Curtis,Maximilian Puelma Touzel,William Garneau,Manon Gruaz,Mike Pinder,Li Wei Wang,Sukanya Krishna,Luda Cohen,Jean-François Godbout,Reihaneh Rabbany,Kellin Pelrine
Affiliations: Mila; McGill University; Université de Montréal; Nord AI; Harvard University; Supervised Program for Alignment Research (SPAR)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity’s ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.
[NLP-103] SLR: An Automated Synthesis Framework for Scalable Logical Reasoning
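[Quick read]: This paper addresses gaps in how the logical reasoning of Large Language Models (LLMs) is evaluated and trained, in particular the finding that models readily produce syntactically valid rules yet often fail at correct logical inference. The key to its solution is SLR, a framework for systematic evaluation and training of LLMs via Scalable Logical Reasoning. SLR automatically synthesizes inductive reasoning tasks with precisely controlled difficulty, generating for each task a latent ground-truth rule, an executable validation program, and an instruction prompt. This is used to build the large-scale SLR-Bench benchmark and to improve LLM reasoning.
Link: https://arxiv.org/abs/2506.15787
Authors: Lukas Helff,Ahmad Omar,Felix Friedrich,Wolfgang Stammer,Antonia Wüst,Tim Woydt,Rupert Mitchell,Patrick Schramowski,Kristian Kersting
Affiliations: TU Darmstadt; hessian.AI; DFKI; CERTAIN, Germany; Centre for Cognitive Science, Darmstadt
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: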
Abstract:We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR synthesizes (i) a latent ground-truth rule, (ii) an executable validation program used by a symbolic judge to deterministically verify model outputs, and (iii) an instruction prompt for the reasoning task. Using SLR, we create SLR-Bench, a benchmark comprising over 19k prompts spanning 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs do somewhat better, but incur substantial increases in test-time compute, sometimes exceeding 15k completion tokens. Finally, logic-tuning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and offers a scalable environment for probing and advancing LLMs’ reasoning capabilities.
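The role of the symbolic judge, deterministic verification of a model's proposed rule against labeled examples, can be illustrated with a small stand-in. SLR's actual validation programs are synthesized by the framework; here a rule is just a Python predicate and the example format is invented for the illustration.

```python
def symbolic_judge(rule, examples):
    """Check a candidate rule against labeled examples.

    rule:     callable mapping an example's features to True/False
    examples: list of (features, label) pairs
    Returns (accepted, accuracy); the rule is accepted only if it labels
    every example correctly.
    """
    correct = sum(1 for features, label in examples if rule(features) == label)
    return correct == len(examples), correct / len(examples)


examples = [
    ({"color": "red", "cars": 2}, True),
    ({"color": "blue", "cars": 2}, False),
]
print(symbolic_judge(lambda f: f["color"] == "red", examples))
print(symbolic_judge(lambda f: f["cars"] == 2, examples))
```

Because the check is a program rather than another model, the same output always receives the same verdict, which is what makes the evaluation reproducible.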
[NLP-104] VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service ACL2025
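[Quick read]: The paper addresses the efficiency robustness of Vision-Language Models (VLMs) in real-world use, that is, the risk that adversarial inputs degrade inference efficiency. Existing work focuses mostly on accuracy and leaves efficiency robustness underexplored. The key to the solution is VLMInferSlow, a new method for evaluating VLM efficiency robustness in a realistic black-box setting. It combines fine-grained efficiency modeling with zero-order optimization to generate adversarial examples whose perturbations are small but which significantly increase computational cost.
Link: https://arxiv.org/abs/2506.15755
Authors: Xiasi Wang,Tianliang Yao,Simin Chen,Runqi Wang,Lei YE,Kuofeng Gao,Yi Huang,Yuan Yao
Affiliations: The Hong Kong University of Science and Technology; Tongji University; Beijing Jiaotong University; Huawei; Tsinghua University; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: Accepted by ACL 2025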
Abstract:Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters – an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community’s awareness about the efficiency robustness of VLMs.
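As a rough illustration of zero-order (derivative-free) optimization in a black-box setting, the sketch below maximizes a cost using function evaluations only. It is a generic two-point random-direction estimator, not VLMInferSlow's specific search or efficiency model; `toy_cost` is a hypothetical stand-in for a cost measured through an inference API, and `eps`, `lr`, and `mu` are assumed parameters.

```python
import math
import random

def zero_order_ascent(cost, x0, eps=0.5, lr=0.05, mu=1e-3, steps=200, seed=0):
    """Maximize a black-box `cost` using only function evaluations.

    Each step estimates a directional derivative with two queries along a
    random unit direction, moves along that direction, and clips the
    perturbation to the box [-eps, eps]. The best point seen is returned.
    """
    rng = random.Random(seed)
    x = list(x0)
    best_x, best_cost = list(x), cost(x)
    for _ in range(steps):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in u)) or 1.0
        u = [v / norm for v in u]
        plus = cost([xi + mu * ui for xi, ui in zip(x, u)])
        minus = cost([xi - mu * ui for xi, ui in zip(x, u)])
        slope = (plus - minus) / (2.0 * mu)  # central-difference estimate
        x = [max(-eps, min(eps, xi + lr * slope * ui)) for xi, ui in zip(x, u)]
        current = cost(x)
        if current > best_cost:
            best_x, best_cost = list(x), current
    return best_x, best_cost

def toy_cost(delta):
    # Hypothetical stand-in for a measured inference cost; peaks at delta_i = 0.3.
    return -sum((d - 0.3) ** 2 for d in delta)

start_cost = toy_cost([0.0] * 4)
best_delta, best_cost = zero_order_ascent(toy_cost, [0.0] * 4)
```

No gradient of `cost` is ever requested, which is what makes this family of methods usable when the model is reachable only through queries.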
[NLP-105] Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
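[Quick read]: The paper addresses the problem that Large Language Models (LLMs) deployed in safety-critical settings give responses that fail safety standards, either refusing harmless prompts without justification or generating harmful content. Existing defenses typically rely on costly fine-tuning of model parameters or on suboptimal heuristics. The paper proposes Sysformer, whose core idea is to learn to adapt the system prompt of instruction-tuned LLMs. Sysformer updates an initial system prompt into a more robust one in the LLM input embedding space while attending to the user prompt. With the LLM parameters frozen, it learns to refuse harmful prompts and respond ideally to safe ones, significantly improving LLM robustness.
Link: https://arxiv.org/abs/2506.15751
Authors: Kartik Sharma,Yiqiao Jin,Vineeth Rakesh,Yingtong Dou,Menghai Pan,Mahashweta Das,Srijan Kumar
Affiliations: Georgia Institute of Technology; Visa Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: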
Abstract:As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose Sysformer, a transformer model that updates an initial system prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on 5 LLMs from different families and 2 recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, yielding up to an 80% gain in the refusal rate on harmful prompts while enhancing the compliance with the safe prompts by up to 90%. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to 100% more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
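[NLP-106] OAgents: An Empirical Study of Building Effective Agents
[Quick read]: The paper addresses the lack of standardization and scientific rigor in current Agentic AI research, which makes fair comparison between methods difficult, leaves the effect of design choices on agent effectiveness unclear, and makes progress hard to measure. The key to the solution is a systematic empirical study on the GAIA benchmark and BrowseComp that reveals the impact of common design choices in key agent components, together with a more robust evaluation protocol that improves reproducibility and stabilizes comparisons. Based on the findings, the authors build and open-source OAgents, a new modular foundation agent framework that achieves state-of-the-art performance among open-source projects.
Link: https://arxiv.org/abs/2506.15741
Authors: He Zhu,Tianrui Qin,King Zhu,Heyuan Huang,Yeyi Guan,Jinxiang Xia,Yi Yao,Hanhao Li,Ningning Wang,Pai Liu,Tianhao Peng,Xin Gui,Xiaowan Li,Yuhui Liu,Yuchen Eleanor Jiang,Jun Wang,Changwang Zhang,Xiangru Tang,Ge Zhang,Jian Yang,Minghao Liu,Xitong Gao,Wangchunshu Zhou,Jiaheng Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 28 pages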
Abstract:Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, while others are redundant, despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.
[NLP-107] The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models
[Quick read]: The paper addresses the safety of Vision-Language Models (VLMs) under adversarial attack, in particular the vulnerability that multimodal inputs can be manipulated to bypass safety guardrails and elicit harmful content. The proposed solution is “The Safety Reminder”, a soft prompt tuning method that optimizes learnable prompt tokens and injects them periodically during text generation to proactively reactivate the model’s safety awareness, preventing harmful content generation while preserving performance on benign tasks.
Link: https://arxiv.org/abs/2506.15734
Authors: Peiyuan Tang,Haojie Xin,Xiaodong Zhang,Jun Sun,Qin Xia,Zijiang Yang
Affiliations: Xi’an Jiaotong University; University of Science and Technology of China; Singapore Management University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 23 pages, 10 figures
Abstract:As Vision-Language Models (VLMs) demonstrate increasing capabilities across real-world applications such as code generation and chatbot assistance, ensuring their safety has become paramount. Unlike traditional Large Language Models (LLMs), VLMs face unique vulnerabilities due to their multimodal nature, allowing adversaries to modify visual or textual inputs to bypass safety guardrails and trigger the generation of harmful content. Through systematic analysis of VLM behavior under attack, we identify a novel phenomenon termed “delayed safety awareness”. Specifically, we observe that safety-aligned VLMs may initially be compromised to produce harmful content, but eventually recognize the associated risks and attempt to self-correct. This pattern suggests that VLMs retain their underlying safety awareness but experience a temporal delay in their activation. Building on this insight, we hypothesize that VLMs’ safety awareness can be proactively reactivated through carefully designed prompts. To this end, we introduce “The Safety Reminder”, a soft prompt tuning approach that optimizes learnable prompt tokens, which are periodically injected during the text generation process to enhance safety awareness, effectively preventing harmful content generation. Additionally, our safety reminder only activates when harmful content is detected, leaving normal conversations unaffected and preserving the model’s performance on benign tasks. Through comprehensive evaluation across three established safety benchmarks and one adversarial attack, we demonstrate that our approach significantly reduces attack success rates while maintaining model utility, offering a practical solution for deploying safer VLMs in real-world applications.
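[NLP-108] SPECS: Faster Test-Time Scaling through Speculative Drafts
[Quick read]: The paper addresses how to reduce user-facing latency while scaling test-time compute to improve the reasoning of Large Language Models (LLMs). Current test-time scaling methods mainly optimize accuracy against total compute (FLOPS) and overlook latency constraints. The key to the solution is SPECS, a latency-aware test-time scaling method inspired by speculative decoding. It uses a smaller, faster model to generate candidate sequences efficiently, evaluates them with signals from a larger target model and a dedicated reward model, and introduces reward-guided soft verification and a reward-based deferral mechanism, which significantly reduces latency while matching or improving accuracy.
Link: https://arxiv.org/abs/2506.15733
Authors: Mert Cemri,Nived Rajaraman,Rishabh Tiwari,Xiaoxuan Liu,Kurt Keutzer,Ion Stoica,Kannan Ramchandran,Ahmad Beirami,Ziteng Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 28 pages, 6 figures, 2 tables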
Abstract:Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPS), often overlooking latency constraints. To address this gap, we propose SPECS, a latency-aware test-time scaling method inspired by speculative decoding. SPECS uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism. Empirical results on MATH500, AMC23 and OlympiadBench datasets show that SPECS matches or surpasses beam search accuracy while reducing latency by up to ~19.1%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective with increasing beam width.
[NLP-109] MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
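[Quick read]: The paper addresses the inefficiency of Multimodal Large Language Models (MLLMs) in long-context inference that arises because traditional key-value (KV) cache eviction strategies cannot adapt to differences between modalities. The key to the solution is MadaKV, a modality-adaptive KV cache eviction strategy built on modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, it substantially reduces KV cache memory footprint and decoding latency while maintaining high accuracy.
Link: https://arxiv.org/abs/2506.15724
Authors: Kunxi Li,Zhonghua Jiang,Zhouzhou Shen,Zhaode Wang,Chengfei Lv,Shengyu Zhang,Fan Wu,Fei Wu
Affiliations: Zhejiang University; Southeast University; Alibaba; Shanghai Jiao Tong University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: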
Abstract:This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3 to 1.5 times improvement) while maintaining high accuracy across various multimodal long-context tasks. Extensive experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV compared to existing KV cache eviction methods.
[NLP-110] daDPO: Distribution-Aware DPO for Distilling Conversational Abilities
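[Quick read]: The paper addresses the sharp decline in conversational ability of small language models, which limits their deployment in resource-constrained environments. The key to the solution is daDPO (Distribution-Aware DPO), a unified method that combines preference optimization with distillation based on output distributions. By making full use of the teacher model's output distribution, it improves the student model's conversational ability.
Link: https://arxiv.org/abs/2506.15717
Authors: Zhengze Zhang,Shiqi Wang,Yiqun Shen,Simin Guo,Dahua Lin,Xiaoliang Wang,Nguyen Cam-Tu,Fei Tan
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; University of Chicago; The Chinese University of Hong Kong; East China Normal University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: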
Abstract:Large language models (LLMs) have demonstrated exceptional performance across various applications, but their conversational abilities decline sharply as model size decreases, presenting a barrier to their deployment in resource-constrained environments. Knowledge distillation with Direct Preference Optimization (dDPO) has emerged as a promising approach to enhancing the conversational abilities of smaller models using a larger teacher model. However, current methods primarily focus on ‘black-box’ KD, which only uses the teacher’s responses, overlooking the output distribution offered by the teacher. This paper addresses this gap by introducing daDPO (Distribution-Aware DPO), a unified method for preference optimization and distribution-based distillation. We provide rigorous theoretical analysis and empirical validation, showing that daDPO outperforms existing methods in restoring performance for pruned models and enhancing smaller LLM models. Notably, in in-domain evaluation, our method enables a 20% pruned Vicuna1.5-7B to achieve near-teacher performance (-7.3% preference rate compared to that of dDPO’s -31%), and allows Qwen2.5-1.5B to occasionally outperform its 7B teacher model (14.0% win rate).
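[NLP-111] Adaptive Two Sided Laplace Transforms: A Learnable, Interpretable, and Scalable Replacement for Self-Attention
[Quick read]: The paper addresses the high computational cost and low efficiency of self-attention in Transformer-based Large Language Models (LLMs) on ultra-long sequences. The key to the solution is a learnable two-sided short-time Laplace transform (STLT) mechanism. It introduces trainable parameters for each Laplace node, enabling end-to-end learning of decay rates, oscillatory frequencies, and the window bandwidth T, so that token-relevance half-lives and frequency responses adapt dynamically. Combined with fast recursive convolution and FFT-based computation of the relevance matrix, this reduces time and memory complexity.
Link: https://arxiv.org/abs/2506.15714
Authors: Andrew Kiruluta
Affiliations: UC Berkeley
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: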
Abstract:We propose an innovative, learnable two-sided short-time Laplace transform (STLT) mechanism to supplant the traditional self-attention in transformer-based LLMs. Our STLT introduces trainable parameters for each Laplace node, enabling end-to-end learning of decay rates, oscillatory frequencies, and window bandwidth T. This flexibility allows the model to dynamically adapt token relevance half-lives and frequency responses during training. By selecting S learnable nodes and leveraging fast recursive convolution, we achieve an effective complexity of in time and memory. We further incorporate an efficient FFT-based computation of the relevance matrix and an adaptive node allocation mechanism to dynamically adjust the number of active Laplace nodes. Empirical results on language modeling (WikiText-103, Project Gutenberg), machine translation (WMT’14 En-De), and long document question answering (NarrativeQA) demonstrate that our learnable STLT achieves perplexities and scores on par with or better than existing efficient transformers while naturally extending to context lengths exceeding 100k tokens, limited only by available hardware. Ablation studies confirm the importance of learnable parameters and adaptive node allocation. The proposed approach combines interpretability, through explicit decay and frequency parameters, with scalability and robustness, offering a pathway towards ultra-long-sequence language modeling without the computational bottleneck of self-attention.
[NLP-112] Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding
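[Quick read]: The paper addresses the GPU memory capacity and PCIe bandwidth bottlenecks caused by the rapidly growing key-value (KV) cache during decoding in long-context Large Language Models (LLMs). Existing sparse attention mechanisms ease the problem by computing attention weights only for selected key-value pairs, but their index computation usually traverses all key vectors, incurring significant compute and data-transfer overhead. The proposed LFPS (Learn From the Past for Sparse Indexing) dynamically constructs sparse index candidates from historical attention patterns. It captures vertical patterns (attending to fixed positions) and slash patterns (attending to relative positions) in decoder attention and adds a positional expansion strategy to predict the Top-k indices for the current step, lowering the cost of index retrieval.
Link: https://arxiv.org/abs/2506.15704
Authors: Feiyu Yao,Qian Wang
Affiliations: Beijing University of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: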
Abstract:As large language models (LLMs) continue to support increasingly longer contexts, the memory demand for key-value (KV) caches during decoding grows rapidly, becoming a critical bottleneck in both GPU memory capacity and PCIe bandwidth. Sparse attention mechanisms alleviate this issue by computing attention weights only for selected key-value pairs. However, their indexing computation typically requires traversing all key vectors, resulting in significant computational and data transfer overhead. To reduce the cost of index retrieval, existing methods often treat each decoding step as an independent process, failing to exploit the temporal correlations embedded in historical decoding information. To this end, we propose LFPS (Learn From the Past for Sparse Indexing), an acceleration method that dynamically constructs sparse indexing candidates based on historical attention patterns. LFPS captures two prevalent trends in decoder attention, vertical patterns (attending to fixed positions) and slash patterns (attending to relative positions), and incorporates a positional expansion strategy to effectively predict the Top-k indices for the current step. We validate LFPS on challenging long-context benchmarks such as LongBench-RULER, using Llama-3.1-8B-Instruct as the base model. Experimental results show that LFPS achieves up to 22.8× speedup over full attention and 9.6× speedup over exact Top-k retrieval on an RTX 4090 GPU and a single CPU core of a Xeon Gold 6430, respectively, while preserving generation accuracy. These results demonstrate that LFPS offers a practical and efficient solution for decoding optimization in long-context LLM inference.
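The idea of building Top-k candidates from past attention can be sketched as follows. This is an illustrative reading of the abstract, not the LFPS implementation: the function names, the single-step history, and the symmetric `expand` window are all assumptions.

```python
def candidate_positions(prev_topk, prev_step, cur_step, expand=1):
    """Build candidate key positions for `cur_step` from the previous Top-k.

    Vertical pattern: reuse the same absolute positions.
    Slash pattern: keep the same offset relative to the query, so shift
    each position by the number of decoding steps that have passed.
    Positional expansion: also include neighbors within `expand`.
    """
    shift = cur_step - prev_step
    candidates = set()
    for idx in prev_topk:
        for base in (idx, idx + shift):          # vertical, then slash
            for offset in range(-expand, expand + 1):
                pos = base + offset
                if 0 <= pos <= cur_step:         # only keys that exist so far
                    candidates.add(pos)
    return sorted(candidates)

def topk_over_candidates(scores, candidates, k):
    """Pick the k highest-scoring positions while scoring only the candidates."""
    ranked = sorted(candidates, key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:k])

cands = candidate_positions([0, 50, 98], prev_step=99, cur_step=100)

# Hypothetical attention scores for step 100 (positions 0..100).
scores = [0.0] * 101
scores[1], scores[51], scores[100] = 0.5, 0.9, 0.7
picked = topk_over_candidates(scores, cands, k=3)
```

Only 11 of the 101 key positions are scored here, which is where the saving over an exact Top-k scan would come from.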
[NLP-113] DeepRTL2: A Versatile Model for RTL-Related Tasks ACL2025
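[Quick read]: The paper addresses the long-standing neglect of embedding-based tasks in electronic design automation (EDA), including natural language code search, RTL code functionality equivalence checking, and performance prediction, which are essential for accelerating and optimizing the hardware design process. The key to the solution is DeepRTL2, a family of large language models that unifies generation-based and embedding-based tasks. By handling a broad range of RTL-related tasks at once, it offers a comprehensive solution to diverse EDA challenges and achieves state-of-the-art performance on all evaluated tasks.
Link: https://arxiv.org/abs/2506.15697
Authors: Yi Liu,Hongji Zhang,Yunhao Zhou,Zhengyuan Shi,Changran Xu,Qiang Xu
Affiliations: The Chinese University of Hong Kong; National Technology Innovation Center for EDA
Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: ACL 2025 Findings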
Abstract:The integration of large language models (LLMs) into electronic design automation (EDA) has significantly advanced the field, offering transformative benefits, particularly in register transfer level (RTL) code generation and understanding. While previous studies have demonstrated the efficacy of fine-tuning LLMs for these generation-based tasks, embedding-based tasks, which are equally critical to EDA workflows, have been largely overlooked. These tasks, including natural language code search, RTL code functionality equivalence checking, and performance prediction, are essential for accelerating and optimizing the hardware design process. To address this gap, we present DeepRTL2, a family of versatile LLMs that unifies both generation- and embedding-based tasks related to RTL. By simultaneously tackling a broad range of tasks, DeepRTL2 represents the first model to provide a comprehensive solution to the diverse challenges in EDA. Through extensive experiments, we show that DeepRTL2 achieves state-of-the-art performance across all evaluated tasks.
[NLP-114] BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models
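[Quick read]: The paper addresses two fundamental limitations of current rotational quantization methods for Large Language Models (LLMs): first, rotation fails to align channel means, which widens quantization bounds and increases rounding errors; second, rotation makes the activation distribution more Gaussian-like, which increases the energy loss caused by clipping errors. The key to the solution is BASE-Q, a simple yet effective method that combines bias correction with asymmetric scaling to reduce rounding and clipping errors. It also supports blockwise optimization, avoiding memory-intensive full-model backpropagation.
Link: https://arxiv.org/abs/2506.15689
Authors: Liulu He,Shenli Zhen,Karwei Sun,Yijiang Liu,Yufei Zhao,Chongkang Tan,Huanrui Yang,Yuan Du,Li Du
Affiliations: Nanjing University; Alibaba Group; Arizona University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: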
Abstract:Rotations have become essential to state-of-the-art quantization pipelines for large language models (LLMs) by effectively smoothing outliers in weights and activations. However, further optimizing the rotation parameters offers only limited performance gains and introduces significant training overhead: due to rotation parameter sharing, the full model must be loaded simultaneously to enable backpropagation, resulting in substantial memory consumption and limited practical utility. In this work, we identify two fundamental limitations of current rotational quantization methods: (i) rotation fails to align channel means, resulting in wider quantization bounds and increased rounding errors; and (ii) rotation makes the activation distribution more Gaussian-like, increasing energy loss caused by clipping errors. To address these issues, we introduce BASE-Q, a simple yet powerful approach that combines bias correction and asymmetric scaling to effectively reduce rounding and clipping errors. Furthermore, BASE-Q enables blockwise optimization, eliminating the need for memory-intensive full-model backpropagation. Extensive experiments on various LLMs and benchmarks demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively. The code will be released soon.
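Limitation (i) can be seen in a toy example: when a channel's values sit far from zero, a zero-centered quantization grid spends most of its range on the offset, and subtracting the channel mean first narrows the range and shrinks rounding error. The sketch below illustrates only that generic effect under assumed values; it is not BASE-Q's algorithm and does not cover asymmetric scaling or clipping.

```python
def quantize(values, bits=4, offset=0.0):
    """Round (values - offset) onto a symmetric uniform grid, then add offset back."""
    qmax = 2 ** (bits - 1) - 1
    centered = [v - offset for v in values]
    max_abs = max(abs(c) for c in centered)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(c / scale) * scale + offset for c in centered]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical activation channel whose mean is far from zero.
channel = [9.0, 9.5, 10.0, 10.5, 11.0]
mean = sum(channel) / len(channel)

plain_error = mean_abs_error(channel, quantize(channel))
centered_error = mean_abs_error(channel, quantize(channel, offset=mean))
```

With the mean removed, the grid step drops from about 11/7 to 1/7, so the same 4 bits resolve the channel far more finely.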
[NLP-115] cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
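[Quick read]: The paper addresses chunking in Retrieval-Augmented Generation (RAG) pipelines for code: existing line-based chunking heuristics tend to break semantic structure, splitting functions or merging unrelated code, which degrades generation quality. The key to the solution is structure-aware chunking via the Abstract Syntax Tree (AST), which recursively splits large AST nodes into smaller chunks and merges sibling nodes while respecting size limits, producing self-contained, semantically coherent code units.
Link: https://arxiv.org/abs/2506.15655
Authors: Yilin Zhang,Xinran Zhao,Zora Zhiruo Wang,Chenyang Yang,Jiayi Wei,Tongshuang Wu
Affiliations: Carnegie Mellon University; Augment Code
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: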
Abstract:Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking – the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (cAST), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
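The split-then-merge idea can be sketched with Python's built-in `ast` module. This is a simplified illustration under stated assumptions, not the cAST implementation: size is measured in lines, only classes and functions are split, and decorators and comments between nodes are ignored.

```python
import ast

# Node types this sketch is willing to split; their `body` spans the whole node.
SPLITTABLE = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)

def chunk_nodes(nodes, max_lines):
    """Return (start_line, end_line) chunks for a list of sibling AST nodes."""
    chunks, current = [], None
    for node in nodes:
        start, end = node.lineno, node.end_lineno
        if end - start + 1 > max_lines and isinstance(node, SPLITTABLE):
            # Too large: flush the pending chunk, emit the header, recurse.
            if current:
                chunks.append(tuple(current))
                current = None
            first_child = node.body[0].lineno
            if first_child > start:
                chunks.append((start, first_child - 1))
            chunks.extend(chunk_nodes(node.body, max_lines))
        elif current and end - current[0] + 1 <= max_lines:
            current[1] = end                     # merge with preceding siblings
        else:
            if current:
                chunks.append(tuple(current))
            current = [start, end]
    if current:
        chunks.append(tuple(current))
    return chunks

def chunk_line_ranges(source, max_lines):
    return chunk_nodes(ast.parse(source).body, max_lines)

SAMPLE = '''\
import os

def a():
    return 1

def b():
    return 2

class C:
    def m1(self):
        x = 1
        return x

    def m2(self):
        y = 2
        return y
'''

ranges = chunk_line_ranges(SAMPLE, max_lines=4)
```

On this sample the import is merged with `a`, `b` stays whole, and the oversized class is split into its header and one chunk per method, so no function is cut in half.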
[NLP-116] LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
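[Quick read]: The paper addresses the lack of personalization in figure captions produced by generative AI: authors usually have to revise generic AI-generated captions to fit their writing style and domain conventions. The key to the solution is the LaMP-Cap dataset, which supplies personalized profiles containing multimodal information (figure images, related figures with their captions, and figure-mentioning paragraphs) to strengthen the model's grasp of context and yield captions closer to the original author's style. Experiments show that using these multimodal profiles consistently improves caption quality.
Link: https://arxiv.org/abs/2506.06561
Authors: Ho Yin ‘Sam’ Ng,Ting-Yao Hsu,Aashish Anantha Ramakrishnan,Branislav Kveton,Nedim Lipka,Franck Dernoncourt,Dongwon Lee,Tong Yu,Sungchul Kim,Ryan A. Rossi,Ting-Hao ‘Kenneth’ Huang
Affiliations: Pennsylvania State University; Adobe Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: The LaMP-CAP dataset is publicly available at: this https URL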
Abstract:Figure captions are crucial for helping readers understand and remember a figure’s key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain’s style, highlighting the need for personalization. Despite language models’ personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document–each with its image, caption, and figure-mentioning paragraphs–as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
Computer Vision
[CV-0] VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning WWW
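[Quick read]: The paper addresses the restriction of language-model-based navigation systems to discrete topological graphs for path planning, with the aim of improving agents' navigation in real environments. The key to the solution is VLN-R1, a framework that uses Large Vision-Language Models (LVLM) to translate egocentric video streams directly into continuous navigation actions. It adopts GRPO-based training inspired by DeepSeek-R1, a Long-Short Memory Sampling strategy to balance historical and current observations, and two-stage training (supervised fine-tuning followed by reinforcement fine-tuning) for precise control of action sequences and strategic reward weighting of multi-step future actions.
Link: https://arxiv.org/abs/2506.17221
Authors: Zhangyang Qi,Zhixiong Zhang,Yizhou Yu,Jiaqi Wang,Hengshuang Zhao
Affiliations: The University of Hong Kong; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: project page: this http URL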
Abstract:Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model’s action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show VLN-R1 achieves strong performance on VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
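The abstract does not state the reward formula, so the following is only a guess at the general shape of a time-decayed reward over multi-step action predictions. The exponential decay, the `gamma` parameter, and the action names are all assumptions for illustration.

```python
def time_decayed_reward(predicted, expert, gamma=0.5):
    """Weighted fraction of matching actions; step t carries weight gamma**t.

    Earlier steps count more than later ones. Steps missing from
    `predicted` are treated as mismatches.
    """
    weights = [gamma ** t for t in range(len(expert))]
    matched = sum(w for w, p, e in zip(weights, predicted, expert) if p == e)
    return matched / sum(weights)

expert = ["forward", "forward", "forward"]
perfect = time_decayed_reward(expert, expert)
late_slip = time_decayed_reward(["forward", "forward", "left"], expert)
early_slip = time_decayed_reward(["left", "forward", "forward"], expert)
```

Under this weighting, an error on the immediate next action costs more than the same error two steps ahead, which is one way to favor near-term correctness while still rewarding longer-horizon agreement.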
[CV-1] Emergent Temporal Correspondences from Video Diffusion Transformers
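[Quick read]: The paper addresses how video diffusion models internally establish and represent temporal correspondences across frames. The key to the solution is DiffTrack, the first quantitative analysis framework for this question. It builds a dataset of prompt-generated videos with pseudo ground-truth tracking annotations and designs novel evaluation metrics to systematically analyze how each component of the 3D attention mechanism in DiTs (such as representations, layers, and timesteps) contributes to temporal correspondence. The study finds that query-key similarities in specific layers play a critical role in temporal matching and that this matching grows stronger during denoising.
Link: https://arxiv.org/abs/2506.17220
Authors: Jisu Nam,Soowon Son,Dahyun Chung,Jiyoung Kim,Siyoon Jin,Junhwa Hur,Seungryong Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page is available at https://cvlab-kaist.github.io/DiffTrack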
Abstract:Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated video with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly prominent during the denoising process. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.
[CV-2] Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
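[Quick read]: The paper addresses the limited performance of Vision-Language Models (VLMs) on tasks that require visual imagination: because existing VLMs decode text only, they must verbalize visual reasoning rather than use visual information directly. The key to the solution is Mirage, a machine mental imagery framework that adds latent visual tokens to VLM decoding so the model can reason over interleaved multimodal trajectories without generating explicit images, strengthening its multimodal reasoning.
Link: https://arxiv.org/abs/2506.17218
Authors: Zeyuan Yang,Xueyang Yu,Delin Chen,Maohao Shen,Chuang Gan
Affiliations: University of Massachusetts, Amherst; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Project page: this https URL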
Abstract:Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to “think visually”, it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
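[CV-3] Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
Quick read: This paper addresses the instability of long-term traffic simulation caused by agents dynamically entering and exiting the scene. Prior models and benchmarks focus mainly on closed-loop motion simulation for the agents initially in the scene, which falls short for long-term simulation. The key to the solution is InfGen, a unified next-token prediction model that interleaves closed-loop motion simulation with scene generation and switches automatically between the two modes, enabling stable long-term rollouts.
Link: https://arxiv.org/abs/2506.17213
Authors: Xiuyu Yang, Shuhan Tan, Philipp Krähenbühl
Affiliations: UT Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Preprint. Project page: this https URL Code: this https URL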
Abstract:An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at this https URL
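[CV-4] Part2GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Quick read: This paper aims to model the structure and motion of multi-part objects with high fidelity, in particular to represent digital twins of articulated objects accurately in 3D reconstruction. The key to the solution is the Part²GS framework, which uses a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometric detail. It also introduces a motion-aware canonical representation guided by physics-based constraints (contact enforcement, velocity consistency, and vector-field alignment) to keep motion physically consistent, and uses a field of repel points to prevent part collisions and improve motion coherence.
Link: https://arxiv.org/abs/2506.17212
Authors: Tianjiao Yu, Vedant Shah, Muntasir Wahed, Ying Shen, Kiet A. Nguyen, Ismini Lourentzou
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: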
Abstract:Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part²GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part²GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part²GS consistently outperforms state-of-the-art methods by up to 10× in Chamfer Distance for movable parts.
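For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of one common form of Chamfer Distance between two point sets. The abstract does not say which variant the authors use (squared or unsquared distances, sum or mean), so the choice below is an assumption, and the function name `chamfer_distance` and the sample points are purely illustrative.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumption: mean nearest-neighbor Euclidean distance in each direction,
    summed over both directions. Other papers use squared distances or sums.
    """
    def one_way(src, dst):
        # For every source point, distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Illustrative data: three points and a copy shifted by 0.1 along x.
cube = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for x, y, z in cube]

print(chamfer_distance(cube, cube))     # 0.0 for identical sets
print(chamfer_distance(cube, shifted))  # approximately 0.2
```

This brute-force version is quadratic in the number of points; evaluation code for real reconstructions would normally use a spatial index for the nearest-neighbor search.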
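[CV-5] DreamCube: 3D Panorama Generation via Multi-plane Synchronization
Quick read: This paper aims to generate high-quality, diverse visual appearance and geometry in 3D panorama synthesis, and to overcome the limited effectiveness of existing methods caused by the incompatibility between 2D single views and 3D panoramas. The key to the solution is applying multi-plane synchronization to the operators of 2D foundation models, which extends their capabilities seamlessly to the omnidirectional domain. Building on this, the paper introduces DreamCube, a multi-plane RGB-D diffusion model that maximizes reuse of 2D foundation model priors to generate 3D panoramic content with diverse appearance and accurate geometry while maintaining multi-view consistency.
Link: https://arxiv.org/abs/2506.17206
Authors: Yukun Huang, Yanning Zhou, Jianan Wang, Kaiyi Huang, Xihui Liu
Affiliations: The University of Hong Kong; Tencent; Astribot
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL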
Abstract:3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.
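[CV-6] UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
Quick read: This paper addresses the performance compromises that unified models face in image understanding and generation because the two tasks follow different modality alignment patterns, in particular the difficulty of balancing cross-task representation learning against task-specific needs in a shared Transformer backbone. The key to the solution is UniFork, a Y-shaped architecture that shares the shallow layers for cross-task representation learning and adds task-specific branches in the deeper layers to avoid task interference, balancing shared learning and task specialization.
Link: https://arxiv.org/abs/2506.17202
Authors: Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
Affiliations: Shanghai AI Laboratory; HKUST; SJTU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL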
Abstract:Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from a progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.
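[CV-7] Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition
Quick read: This paper aims to address the limitations of current diffusion-based and controllable video generation methods in dynamics, generality, long-term consistency, and efficiency, which hinder the generation of diverse gameplay videos. The key to the solution is the Hunyuan-GameCraft framework, which unifies standard keyboard and mouse inputs into a shared camera representation space for fine-grained action control and adopts a hybrid history-conditioned training strategy to extend video sequences while preserving game scene information. Model distillation further improves inference efficiency and playability while maintaining consistency over long temporal sequences, making the model suitable for real-time deployment in complex interactive environments.
Link: https://arxiv.org/abs/2506.17201
Authors: Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, Qinglin Lu
Affiliations: Tencent Hunyuan; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL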
Abstract:Recent advances in diffusion-based and controllable video generation have enabled high-quality and temporally coherent video synthesis, laying the groundwork for immersive interactive gaming experiences. However, current methods face limitations in dynamics, generality, long-term consistency, and efficiency, which limit the ability to create various gameplay videos. To address these gaps, we introduce Hunyuan-GameCraft, a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space, facilitating smooth interpolation between various camera and movement operations. Then we propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information. Additionally, to enhance inference efficiency and playability, we achieve model distillation to reduce computational overhead while maintaining consistency across long temporal sequences, making it suitable for real-time deployment in complex interactive environments. The model is trained on a large-scale dataset comprising over one million gameplay recordings across over 100 AAA games, ensuring broad coverage and diversity, then fine-tuned on a carefully annotated synthetic dataset to enhance precision and control. The curated game scene data significantly improves the visual fidelity, realism and action controllability. Extensive experiments demonstrate that Hunyuan-GameCraft significantly outperforms existing models, advancing the realism and playability of interactive game video generation.
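[CV-8] Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation
Quick read: This paper aims to address the challenge of generating large-scale demonstrations for dexterous hand manipulation, in particular how to create diverse and physically plausible demonstration data efficiently. The key to the solution is a generative model that integrates geometric constraints to improve feasibility and applies additional conditions to enhance diversity, which is used to build Dex1B, a large-scale, diverse, and high-quality demonstration dataset.
Link: https://arxiv.org/abs/2506.17198
Authors: Jianglong Ye, Keyi Wang, Chengjing Yuan, Ruihan Yang, Yiquan Li, Jiyue Zhu, Yuzhe Qin, Xueyan Zou, Xiaolong Wang
Affiliations: UC San Diego
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to RSS 2025. Project page: this https URL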
Abstract:Generating large-scale demonstrations for dexterous hand manipulation remains challenging, and several approaches have been proposed in recent years to address this. Among them, generative models have emerged as a promising paradigm, enabling the efficient creation of diverse and physically plausible demonstrations. In this paper, we introduce Dex1B, a large-scale, diverse, and high-quality demonstration dataset produced with generative models. The dataset contains one billion demonstrations for two fundamental tasks: grasping and articulation. To construct it, we propose a generative model that integrates geometric constraints to improve feasibility and applies additional conditions to enhance diversity. We validate the model on both established and newly introduced simulation benchmarks, where it significantly outperforms prior state-of-the-art methods. Furthermore, we demonstrate its effectiveness and robustness through real-world robot experiments. Our project page is at this https URL
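[CV-9] Facial Landmark Visualization and Emotion Recognition Through Neural Networks
Quick read: This paper addresses emotion recognition from facial images, with the aim of letting machines learn human emotions through facial expressions. The key to the solution is a visualization technique called facial landmark box plots, used to identify outliers in facial datasets, together with a comparison of two sets of facial landmark features: (i) the landmarks' absolute positions and (ii) their displacements from a neutral expression to the peak of an emotional expression. The experiments indicate that a neural network outperforms a random forest classifier.
Link: https://arxiv.org/abs/2506.17191
Authors: Israel Juárez-Jiménez, Tiffany Guadalupe Martínez Paredes, Jesús García-Ramírez, Eric Ramos Aguilar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Best paper Award COMIA 2025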
Abstract:Emotion recognition from facial images is a crucial task in human-computer interaction, enabling machines to learn human emotions through facial expressions. Previous studies have shown that facial images can be used to train deep learning models; however, most of these studies do not include a thorough dataset analysis. Visualizing facial landmarks can be challenging when extracting meaningful dataset insights; to address this issue, we propose facial landmark box plots, a visualization technique designed to identify outliers in facial datasets. Additionally, we compare two sets of facial landmark features: (i) the landmarks' absolute positions and (ii) their displacements from a neutral expression to the peak of an emotional expression. Our results indicate that a neural network achieves better performance than a random forest classifier.
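[CV-10] YASMOT: Yet another stereo image multi-object tracker
Quick read: This paper addresses how to track objects effectively and preserve their identities in image time series, which matters for improving object detection performance and for downstream tasks such as behavior classification, behavior prediction, and total abundance estimation. The key to the solution is yasmot, a lightweight and flexible object tracker that processes the output of popular object detectors and tracks objects over time from monoscopic or stereoscopic camera configurations. It can also generate consensus detections from ensembles of object detectors.
Link: https://arxiv.org/abs/2506.17186
Authors: Ketil Malde
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages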
Abstract:There now exist many popular object detectors based on deep learning that can analyze images and extract locations and class labels for occurrences of objects. For image time series (i.e., video or sequences of stills), tracking objects over time and preserving object identity can help to improve object detection performance, and is necessary for many downstream tasks, including classifying and predicting behaviors, and estimating total abundances. Here we present yasmot, a lightweight and flexible object tracker that can process the output from popular object detectors and track objects over time from either monoscopic or stereoscopic camera configurations. In addition, it includes functionality to generate consensus detections from ensembles of object detectors.
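The abstract does not describe how yasmot links detections across frames. As a generic illustration of tracking on top of detector output (not yasmot's actual algorithm), the sketch below greedily matches existing track boxes to new detections by intersection over union. The function names, the `(x1, y1, x2, y2)` box format, and the `min_iou` threshold are all assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedy matching of track boxes to detections of the next frame.

    Returns ({track_id: detection_index}, [indices of unmatched detections]).
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched


# Two existing tracks, three detections in the next frame (illustrative boxes).
tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [(51, 50, 61, 60), (1, 0, 11, 10), (100, 100, 110, 110)]
matches, new = associate(tracks, detections)
print(matches, new)
```

Unmatched detections would typically start new tracks, which is how object identity is preserved from frame to frame.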
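[CV-11] Co-Seg: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation
Quick read: This paper aims to address the challenge of jointly segmenting organs or tissues in medical image analysis. Existing studies usually treat different segmentation tasks in isolation and overlook the intrinsic dependencies between them, leading to suboptimal segmentation performance and insufficient medical image understanding. The key to the solution is the Co-Seg++ framework, which introduces a co-segmentation paradigm in which semantic and instance segmentation enhance each other: a spatio-temporal prompt encoder (STP-Encoder) captures long-range spatial and temporal relationships between segmentation regions and image embeddings, and a multi-task collaborative decoder (MTC-Decoder) uses cross-guidance to strengthen the contextual consistency of the two tasks.
Link: https://arxiv.org/abs/2506.17159
Authors: Qing Xu, Yuxiang Luo, Wenting Duan, Zhen Chen
Affiliations: University of Nottingham Ningbo China, Ningbo, Zhejiang, China; University of Nottingham, UK; Sichuan University; School of Computer Science, University of Lincoln, Lincoln LN6 7TS, UK; Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong SAR
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review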
Abstract:Medical image analysis is critical yet challenged by the need to jointly segment organs or tissues and numerous instances for anatomical structures and tumor microenvironment analysis. Existing studies typically formulate different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-temporal prompt encoder (STP-Encoder) to capture long-range spatial and temporal relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at this https URL.
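[CV-12] Do We Need Large VLMs for Spotting Soccer Actions?
Quick read: This paper addresses the reliance on visual input in traditional video-based action spotting, which usually requires complex and computationally expensive models to process dense video data. The key to the solution is to move from a video-centric approach to a text-centric framework that uses Large Language Models (LLMs) instead of Vision-Language Models (VLMs), giving a lightweight and scalable alternative. The paper argues that expert commentary contains rich fine-grained descriptions and contextual cues, enough to spot key actions in a match reliably. Three LLMs specializing in outcome, excitement, and tactics analyze sliding windows of commentary text to produce accurate timestamps for events such as goals, red and yellow cards, and substitutions.
Link: https://arxiv.org/abs/2506.17144
Authors: Ritabrata Chakraborty, Rajatsubhra Chakraborty, Avijit Dasgupta, Sandeep Chaurasia
Affiliations: Manipal University Jaipur; UNC–Charlotte; IIIT Hyderabad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures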
Abstract:Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. In this work, we propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich, fine-grained descriptions and contextual cues such as excitement and tactical insights, contains enough information to reliably spot key actions in a match. To demonstrate this, we use the SoccerNet Echoes dataset, which provides timestamped commentary, and employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics. Each LLM evaluates sliding windows of commentary to identify actions like goals, cards, and substitutions, generating accurate timestamps for these events. Our experiments show that this language-centric approach performs effectively in detecting critical match events, providing a lightweight and training-free alternative to traditional video-based methods for action spotting.
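The sliding-window evaluation mentioned in this abstract can be pictured with a small sketch. The window length, stride, and commentary lines below are invented for illustration, and `keyword_judge` is a trivial keyword rule standing in for an LLM judge; the paper's actual prompts, window sizes, and judge aggregation are not given in the abstract.

```python
def sliding_windows(segments, window_s=30.0, stride_s=15.0):
    """Yield (start, end, text) windows over (timestamp_s, text) segments.

    window_s and stride_s are assumed values, not taken from the paper.
    """
    if not segments:
        return
    last = max(t for t, _ in segments)
    start = 0.0
    while start <= last:
        end = start + window_s
        text = " ".join(s for t, s in segments if start <= t < end)
        if text:
            yield start, end, text
        start += stride_s


def keyword_judge(text):
    """Stand-in for an LLM judge: a keyword rule, NOT the paper's method."""
    lowered = text.lower()
    cues = (("goal", "scores"), ("card", "yellow card"), ("substitution", "comes on"))
    for label, cue in cues:
        if cue in lowered:
            return label
    return None


# Hypothetical timestamped commentary (seconds from kick-off, text).
commentary = [
    (12.0, "Patient build-up from the back."),
    (47.0, "He shoots and scores!"),
    (95.0, "The referee shows a yellow card."),
]

events = [(start, keyword_judge(text)) for start, _, text in sliding_windows(commentary)]
events = [(start, label) for start, label in events if label]
print(events)
```

Because the windows overlap, the same event is flagged in two consecutive windows here; a complete pipeline would need to merge such duplicates into a single timestamp.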
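[CV-13] On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting
Quick read: This paper aims to address the performance drop of object counting models deployed across domains with different density distributions. Density shifts are task-relevant and violate the assumptions of standard domain adaptation. The key to the solution is a theoretical framework of conditional feature alignment, which formalizes conditional divergence by partitioning each domain into subsets (e.g., object vs. background) and measuring divergence per condition. A joint error bound shows that, when discrete label spaces are treated as condition sets, conditional alignment gives a tighter upper bound on the combined source-target decision error than unconditional alignment, and hence better cross-domain generalization.
Link: https://arxiv.org/abs/2506.17137
Authors: Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu, Yu Yao, Peng Fu, Weidong Cai
Affiliations: The University of Sydney; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures, 8 tables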
Abstract:Object counting models suffer when deployed across domains with differing density distributions, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment. We first formalize the notion of conditional divergence by partitioning each domain into subsets (e.g., object vs. background) and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. These insights motivate a general conditional adaptation principle: by preserving task-relevant variations while filtering out nuisance shifts, one can achieve superior cross-domain generalization for counting. We both define conditional divergence and prove its benefit in lowering joint error, and provide a practical adaptation strategy that preserves task-relevant information in unsupervised domain-adaptive counting. We demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
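[CV-14] Semi-Supervised Multi-Modal Medical Image Segmentation for Complex Situations MICCAI2025
Quick read: This paper aims to address the inadequate performance of medical image segmentation when annotated data are limited, especially with complex backgrounds and challenging tasks. The key to the solution is a novel semi-supervised multi-modal medical image segmentation method that uses a multi-stage multi-modal fusion and enhancement strategy to exploit complementary multi-modal information, reduce feature discrepancies, and improve feature sharing and alignment. It also introduces contrastive mutual learning to constrain prediction consistency across modalities, improving the robustness of the segmentation results.
Link: https://arxiv.org/abs/2506.17136
Authors: Dongdong Meng, Sheng Li, Hao Wu, Guoping Wang, Xueqing Yan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, accepted at MICCAI 2025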
Abstract:Semi-supervised learning addresses the issue of limited annotations in medical images effectively, but its performance is often inadequate for complex backgrounds and challenging tasks. Multi-modal fusion methods can significantly improve the accuracy of medical image segmentation by providing complementary information. However, they face challenges in achieving significant improvements under semi-supervised conditions due to the challenge of effectively leveraging unlabeled data. There is a significant need to create an effective and reliable multi-modal learning strategy for leveraging unlabeled data in semi-supervised segmentation. To address these issues, we propose a novel semi-supervised multi-modal medical image segmentation approach, which leverages complementary multi-modal information to enhance performance with limited labeled data. Our approach employs a multi-stage multi-modal fusion and enhancement strategy to fully utilize complementary multi-modal information, while reducing feature discrepancies and enhancing feature sharing and alignment. Furthermore, we effectively introduce contrastive mutual learning to constrain prediction consistency across modalities, thereby facilitating the robustness of segmentation results in semi-supervised tasks. Experimental results on two multi-modal datasets demonstrate the superior performance and robustness of the proposed framework, establishing its valuable potential for solving medical image segmentation tasks in complex scenarios.
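[CV-15] Dynamic Watermark Generation for Digital Images using Perimeter Gated SPAD Imager PUFs
Quick read: This paper addresses the generation of digital image watermarks, in particular using the physically unclonable functions (PUFs) of image sensors for secure identification and tamper detection. The key to the solution is to use perimeter gated single photon avalanche diode (pgSPAD) imagers and take their dark signal non-uniformity (DSNU), which arises from manufacturing variations, as the watermark source, yielding source-scene-specific dynamic watermarks with a controllable sensitivity-robustness trade-off.
Link: https://arxiv.org/abs/2506.17134
Authors: Md Sakibur Sajal, Marc Dandin
Affiliations: Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 7 figures, accepted at MWSCAS 2025 Conference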
Abstract:Digital image watermarks as a security feature can be derived from the imager's physically unclonable functions (PUFs) by utilizing the manufacturing variations, i.e., the dark signal non-uniformity (DSNU). While a few demonstrations focused on the CMOS image sensors (CIS) and active pixel sensors (APS), single photon avalanche diode (SPAD) imagers have never been investigated for this purpose. In this work, we propose a novel watermarking technique using perimeter gated SPAD (pgSPAD) imagers. We utilized the DSNU of three 64 × 64 pgSPAD imager chips, fabricated in a 0.35 µm standard CMOS process, and analyzed the simulated watermarks for standard test images from a publicly available database. Our observations show that both source identification and tamper detection can be achieved using the proposed source-scene-specific dynamic watermarks with a controllable sensitivity-robustness trade-off.
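[CV-16] RGBTrack: Fast Robust Depth-Free 6D Pose Estimation and Tracking IROS2025
Quick read: This paper aims to address dynamic, high-precision 6D object pose estimation and tracking. The core challenge is real-time, robust pose inference without depth input. The key to the solution is a novel binary search strategy built on the FoundationPose architecture and combined with a render-and-compare mechanism, which efficiently infers depth and generates robust pose hypotheses from true-scale CAD models. By integrating a state-of-the-art 2D object tracker (XMem), a Kalman filter, and a state machine, RGBTrack keeps tracking stable in dynamic scenes, and its scale recovery module adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques.
Link: https://arxiv.org/abs/2506.17119
Authors: Teng Guo, Jingjin Yu
Affiliations: Rutgers, the State University of New Jersey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to IROS 2025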
Abstract:We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack's scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack's novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution candidate for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at this https URL.
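To give a feel for why a binary search can recover depth from RGB alone when the object's true size is known, here is a toy sketch. It replaces render-and-compare with a single pinhole-camera width comparison, which is a strong simplification and not the RGBTrack pipeline. The function names, the search range, and the numbers are assumptions made for this example.

```python
def projected_width_px(object_width_m, depth_m, focal_px):
    """Pinhole-camera width in pixels of an object facing the camera."""
    return focal_px * object_width_m / depth_m


def search_depth(observed_width_px, object_width_m, focal_px,
                 z_min=0.1, z_max=10.0, tol=1e-4):
    """Bisect on depth until the 'rendered' width matches the observed one.

    The projected width shrinks monotonically with depth, so bisection applies.
    """
    lo, hi = z_min, z_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        rendered = projected_width_px(object_width_m, mid, focal_px)
        if rendered > observed_width_px:
            lo = mid  # rendering looks too large, so the object is farther away
        else:
            hi = mid  # rendering looks too small (or equal), so search closer
    return 0.5 * (lo + hi)


# Synthetic observation: a 0.2 m wide object at 1.5 m with a 600 px focal length.
observed = projected_width_px(0.2, 1.5, 600.0)
estimate = search_depth(observed, 0.2, 600.0)
print(round(estimate, 3))
```

A real render-and-compare loop would score a rendered CAD model against the image rather than compare one scalar, but the same monotonic relation between depth and apparent size is what makes a bisection-style search plausible.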
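[CV-17] Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping IROS2025
Quick read: This paper aims to address 6D object pose estimation for robotic manipulation, in particular the limitations of traditional depth-sensor-based methods: high cost, noisy output, and failure on transparent objects. The key to the solution is a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), which recovers metric depth from a single RGB image through a one-shot adaptation building on monocular depth estimation model (MDEM) techniques. It performs scale-rotation-shift alignment during camera calibration, guided by sparse ground-truth depth points, giving accurate depth estimates without additional data collection or model retraining.
Link: https://arxiv.org/abs/2506.17110
Authors: Teng Guo, Baichuan Huang, Jingjin Yu
Affiliations: Rutgers University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IROS 2025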
Abstract:Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments during camera calibration, guided by sparse ground-truth depth points, enabling accurate depth estimation without additional data collection or model retraining on the testing setup. MOMA supports fine-tuning the MDEM on transparent objects, demonstrating strong generalization capabilities. Real-world experiments on tabletop 2-finger grasping and suction-based bin-picking applications show MOMA achieves high success rates in diverse tasks, confirming its effectiveness.
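The phrase "up to an unknown scale and shift" can be made concrete with a small least-squares sketch: given a few points where both the model's relative depth and the true metric depth are known, fit `metric ≈ s * relative + t`. This covers only the scale and shift part. The rotation component of MOMA's alignment is not described in the abstract and is omitted, and the function name and sample values are illustrative assumptions.

```python
def fit_scale_shift(relative, metric):
    """Closed-form least-squares fit of s, t in metric ≈ s * relative + t.

    Requires at least two points with distinct relative depths.
    """
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    s = cov / var_r
    t = mean_m - s * mean_r
    return s, t


# Hypothetical sparse calibration points: relative depth from a monocular
# model, and metric depth generated with scale 2.0 and shift 0.5.
relative = [0.10, 0.35, 0.60, 0.90]
metric = [2.0 * r + 0.5 for r in relative]

s, t = fit_scale_shift(relative, metric)
print(round(s, 6), round(t, 6))

# Apply the fitted alignment to a new relative-depth prediction.
aligned = s * 0.5 + t
```

With noisy measurements the fit no longer reproduces the points exactly, and a robust estimator may be preferable to plain least squares.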
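[CV-18] Acquiring and Accumulating Knowledge from Diverse Datasets for Multi-label Driving Scene Classification
Quick read: This paper aims to address multi-label classification in driving scene identification (DSI), which assigns multiple non-exclusive class labels to a scene to improve an autonomous vehicle's ability to understand and interact with complex driving environments. Training a multi-label classification model through multitask learning faces two main challenges: acquiring a balanced, comprehensively annotated multi-label dataset, and balancing learning across tasks. The key to the solution is to combine knowledge acquisition and accumulation (KAA) with consistency-based active learning (CAL): KAA acquires and accumulates scene identification knowledge from multiple single-label datasets via monotask learning, and CAL closes the knowledge gap caused by the discrepancy between the marginal distributions of individual attributes and their joint distribution.
Link: https://arxiv.org/abs/2506.17101
Authors: Ke Li, Chenyu Zhang, Yuxin Ding, Xianbiao Hu, Ruwen Qin
Affiliations: Stony Brook University; Pennsylvania State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: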
Abstract:Driving scene identification, which assigns multiple non-exclusive class labels to a scene, provides the contextual awareness necessary for enhancing autonomous vehicles' ability to understand, reason about, and interact with the complex driving environment. As a multi-label classification problem, it is better tackled via multitask learning. However, directly training a multi-label classification model for driving scene identification through multitask learning presents two main challenges: acquiring a balanced, comprehensively annotated multi-label dataset and balancing learning across different tasks. This paper introduces a novel learning system that synergizes knowledge acquisition and accumulation (KAA) with consistency-based active learning (CAL) to address those challenges. KAA acquires and accumulates knowledge about scene identification from various single-label datasets via monotask learning. Subsequently, CAL effectively resolves the knowledge gap caused by the discrepancy between the marginal distributions of individual attributes and their joint distribution. An ablation study on our Driving Scene Identification (DSI) dataset demonstrates a 56.1% performance increase over the baseline model pretrained on ImageNet. Of this, KAA accounts for 31.3% of the gain, and CAL contributes 24.8%. Moreover, KAA-CAL stands out as the best performer when compared to state-of-the-art (SOTA) multi-label models on two public datasets, BDD100K and HSD, achieving this while using 85% less data. The DSI dataset and the implementation code for KAA-CAL are available at this https URL.
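[CV-19] Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion
Quick read: This paper aims to address generality and scalability in 3D part assembly, especially for diverse real-world objects with varying part counts, geometries, and structures. Prior methods typically rely on deterministic part pose prediction and category-specific training, and struggle in complex scenarios. The key to the solution is to cast part assembly as a generative problem and sample plausible configurations with diffusion models, which handles the ambiguity arising from symmetry, repeated parts, and multiple valid assemblies. The paper also introduces a shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space instead of conventional SE(3) pose prediction.
Link: https://arxiv.org/abs/2506.17074
Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan
Affiliations: Tencent ARC Lab; VAST; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Project page: this https URL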
Abstract:We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresses the core challenges of scaling to general 3D part assembly through innovations in task formulation, representation, and data. First, Assembler casts part assembly as a generative problem and employs diffusion models to sample plausible configurations, effectively capturing ambiguities arising from symmetry, repeated parts, and multiple valid assemblies. Second, we introduce a novel shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space rather than SE(3) pose prediction. Third, we construct a large-scale dataset of over 320K diverse part-object assemblies using a synthesis and filtering pipeline built on existing 3D shape repositories. Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects. Based on Assembler, we further introduce an interesting part-aware 3D modeling system that generates high-resolution, editable objects from images, demonstrating potential for interactive and compositional design. Project page: this https URL
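[CV-20] Relaxed syntax modeling in Transformers for future-proof license plate recognition
Quick read: This paper addresses the drop in license plate recognition performance on newly issued plates: after the plate syntax changes, Transformer-based networks lose significant recognition accuracy, which makes them ill-suited to changing production environments. The key to the solution is SaLT (Syntax-Less Transformer), a Transformer that does not depend on a specific syntax. Architectural cut-offs and replacements give syntax-agnostic modeling of license plate representations, keeping high accuracy on past syntax while markedly improving recognition stability on new plates.
Link: https://arxiv.org/abs/2506.17051
Authors: Florent Meyer, Laurent Guichard, Denis Coquenet, Guillaume Gravier, Yann Soullard, Bertrand Coüasnon
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: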
Abstract:Effective license plate recognition systems are required to be resilient to constant change, as new license plates are released into traffic daily. While Transformer-based networks excel in their recognition at first sight, we observe a significant performance drop over time, which proves them unsuitable for tense production environments. Indeed, such systems obtain state-of-the-art results on plates whose syntax is seen during training. Yet, we show they perform similarly to random guessing on future plates where legible characters are wrongly recognized due to a shift in their syntax. After highlighting the flows of positional and contextual information in Transformer encoder-decoders, we identify several causes for their over-reliance on past syntax. We then devise architectural cut-offs and replacements, which we integrate into SaLT, an attempt at a Syntax-Less Transformer for syntax-agnostic modeling of license plate representations. Experiments on both real and synthetic datasets show that our approach reaches top accuracy on past syntax and, most importantly, nearly maintains performance on future license plates. We further demonstrate the robustness of our architecture enhancements by way of various ablations.
[CV-21] Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance
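[Quick read]: This paper asks how to reveal the feature combinations encoded by high-level visual units, in order to understand how images are transformed into representations that support recognition. Existing feature visualization methods only infer a unit's most exciting images and cannot reveal the manifold of transformations under which responses stay invariant, which is key to generalization in vision. The key to the solution is Stretch-and-Squeeze (SnS), an unbiased, model-agnostic, gradient-free framework that uses bi-objective optimization to systematically characterize a unit's invariance landscape and its sensitivity to adversarial perturbations.
Link: https://arxiv.org/abs/2506.17040
Authors: Lorenzo Tausani,Paolo Muratore,Morgan B. Talbot,Giacomo Amerio,Gabriel Kreiman,Davide Zoccolan
Affiliations: International School for Advanced Studies (SISSA), Trieste (Italy); Boston Children’s Hospital, Harvard Medical School, Cambridge (USA); Center for Brains, Minds, and Machines, MIT, Cambridge (USA); Harvard-MIT Program in Health Sciences and Technology, MIT, Cambridge (USA)
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 21 pages, 9 figures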
Abstract:Uncovering which feature combinations high-level visual units encode is critical to understanding how images are transformed into representations that support recognition. While existing feature visualization approaches typically infer a unit’s most exciting images, this is insufficient to reveal the manifold of transformations under which responses remain invariant, which is key to generalization in vision. Here we introduce Stretch-and-Squeeze (SnS), an unbiased, model-agnostic, and gradient-free framework to systematically characterize a unit’s invariance landscape and its vulnerability to adversarial perturbations in both biological and artificial visual systems. SnS frames these transformations as bi-objective optimization problems. To probe invariance, SnS seeks image perturbations that maximally alter the representation of a reference stimulus in a given processing stage while preserving unit activation. To probe adversarial sensitivity, SnS seeks perturbations that minimally alter the stimulus while suppressing unit activation. Applied to convolutional neural networks (CNNs), SnS revealed image variations that were further from a reference image in pixel-space than those produced by affine transformations, while more strongly preserving the target unit’s response. The discovered invariant images differed dramatically depending on the choice of image representation used for optimization: pixel-level changes primarily affected luminance and contrast, while stretching mid- and late-layer CNN representations altered texture and pose respectively. Notably, the invariant images from robust networks were more recognizable by human subjects than those from standard networks, supporting the higher fidelity of robust CNNs as models of the visual system.
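The invariance probe described in the abstract (move far from the reference while keeping the unit's activation fixed) can be illustrated with a toy, gradient-free search. This is only a sketch under loud assumptions: the "unit" is a hypothetical linear filter, the two objectives are collapsed into a single penalized score, and the optimizer is a plain accept-if-better random search. None of these choices is taken from the paper, whose actual optimizer and representations are not specified here.

```python
import math
import random

def unit_activation(x, w):
    """Toy stand-in for a visual unit: a linear filter (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def distance(a, b):
    """Euclidean distance, used here as the 'representation' distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def stretch_and_squeeze(x_ref, w, steps=300, sigma=0.1, penalty=5.0, seed=0):
    """Gradient-free search for a stimulus far from x_ref with similar activation."""
    rng = random.Random(seed)
    act_ref = unit_activation(x_ref, w)

    def score(x):
        # Stretch: reward distance from the reference stimulus.
        # Squeeze: penalize any change in the unit's activation.
        return distance(x, x_ref) - penalty * abs(unit_activation(x, w) - act_ref)

    best, best_score = list(x_ref), 0.0
    for _ in range(steps):
        candidate = [xi + rng.gauss(0.0, sigma) for xi in best]
        s = score(candidate)
        if s > best_score:  # keep a mutation only if it improves the trade-off
            best, best_score = candidate, s
    return best, best_score

dim = 16
w = [1.0 / math.sqrt(dim)] * dim   # hypothetical unit weights
x_ref = [0.5] * dim                # hypothetical reference stimulus
x_inv, final_score = stretch_and_squeeze(x_ref, w)
```

A true bi-objective method would keep a Pareto front instead of a single weighted score; the scalarized version is used here only because it fits in a few lines.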
[CV-22] Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns
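[Quick read]: This paper addresses how to accurately extract and model degradation patterns when training real-world super-resolution models, in particular the difficulty of capturing blur, diverse noise, and more implicit degradations (such as color gamut shifts) from real-world low-resolution (LR) images alone. The key to the solution is a new framework named TripleGAN with three components: FirstGAN narrows the domain gap in blur characteristics, SecondGAN performs domain-specific translation to approximate target-domain blur and learn additional degradation patterns, and ThirdGAN is trained on pseudo-real data generated by FirstGAN and SecondGAN to reconstruct real-world LR images.
Link: https://arxiv.org/abs/2506.17027
Authors: Yiyang Tie,Hong Zhu,Yunyun Luo,Jing Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: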
Abstract:The training of real-world super-resolution reconstruction models heavily relies on datasets that reflect real-world degradation patterns. Extracting and modeling degradation patterns for super-resolution reconstruction using only real-world low-resolution (LR) images remains a challenging task. When synthesizing datasets to simulate real-world degradation, relying solely on degradation extraction methods fails to capture both blur and diverse noise characteristics across varying LR distributions, as well as more implicit degradations such as color gamut shifts. Conversely, domain translation alone cannot accurately approximate real-world blur characteristics due to the significant degradation domain gap between synthetic and real data. To address these challenges, we propose a novel TripleGAN framework comprising two strategically designed components: The FirstGAN primarily focuses on narrowing the domain gap in blur characteristics, while the SecondGAN performs domain-specific translation to approximate target-domain blur properties and learn additional degradation patterns. The ThirdGAN is trained on pseudo-real data generated by the FirstGAN and SecondGAN to reconstruct real-world LR images. Extensive experiments on the RealSR and DRealSR datasets demonstrate that our method exhibits clear advantages in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts. The proposed framework effectively learns real-world degradation patterns from LR observations and synthesizes aligned datasets with corresponding degradation characteristics, thereby enabling the trained network to achieve superior performance in reconstructing high-quality SR images from real-world LR inputs.
[CV-23] A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving
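[Quick read]: This paper aims to overcome the limited perception of a single vehicle in autonomous driving (occlusion, restricted sensor range, narrow viewpoints), which hurts the completeness and accuracy of 3D semantic occupancy prediction. The key to the solution is collaborative perception, which lets multiple agents exchange complementary information, together with an augmented collaborative perception dataset replayed in CARLA with a high-resolution semantic voxel sensor to provide dense, comprehensive occupancy annotations. The authors also design benchmarks with varying prediction ranges to systematically assess the impact of spatial extent on collaborative prediction, and propose a baseline model that fuses inter-agent features via spatial alignment and attention aggregation. Experiments show that the baseline's gains over single-agent models grow as the prediction range expands.
Link: https://arxiv.org/abs/2506.17004
Authors: Hanlin Wu,Pengfei Lin,Ehsan Javanmardi,Naren Bao,Bo Qian,Hao Si,Manabu Tsukada
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: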
Abstract:3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.
[CV-24] Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments
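[Quick read]: This paper targets unsupervised domain adaptation (UDA) in resource-constrained settings such as drones, where computation and memory are limited. Existing prompt-driven UDA methods rely on large vision-language models and need full access to source-domain data, which limits their applicability. The key to the solution is Prmpt2Adpt, a teacher-student framework that achieves zero-shot domain adaptation through prompt-based feature alignment. At its core, a distilled and fine-tuned CLIP model serves as the frozen backbone of the teacher; Prompt-driven Instance Normalization (PIN) aligns a small set of low-level source features to the target-domain semantics, so that the teacher can efficiently generate pseudo-labels that guide the on-the-fly adaptation of the student model.
Link: https://arxiv.org/abs/2506.16994
Authors: Yasir Ali Farrukh,Syed Wali,Irfan Khan,Nathaniel D. Bastian
Affiliations: Texas A&M University; United States Military Academy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: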
Abstract:Unsupervised Domain Adaptation (UDA) is a critical challenge in real-world vision systems, especially in resource-constrained environments like drones, where memory and computation are limited. Existing prompt-driven UDA methods typically rely on large vision-language models and require full access to source-domain data during adaptation, limiting their applicability. In this work, we propose Prmpt2Adpt, a lightweight and efficient zero-shot domain adaptation framework built around a teacher-student paradigm guided by prompt-based feature alignment. At the core of our method is a distilled and fine-tuned CLIP model, used as the frozen backbone of a Faster R-CNN teacher. A small set of low-level source features is aligned to the target domain semantics (specified only through a natural language prompt) via Prompt-driven Instance Normalization (PIN). These semantically steered features are used to briefly fine-tune the detection head of the teacher model. The adapted teacher then generates high-quality pseudo-labels, which guide the on-the-fly adaptation of a compact student model. Experiments on the MDS-A dataset demonstrate that Prmpt2Adpt achieves competitive detection performance compared to state-of-the-art methods, while delivering up to 7x faster adaptation and 5x faster inference speed using few source images, making it a practical and scalable solution for real-time adaptation in low-resource domains.
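As a rough illustration of the feature-alignment step, the sketch below assumes that PIN takes the usual AdaIN-style form: each channel of a source feature map is normalized and then re-scaled to a target mean and standard deviation. In a prompt-driven setup those target statistics would be optimized so that the restyled features match the text prompt's embedding; that optimization is omitted, and the target values here are hypothetical placeholders, not numbers from the paper.

```python
import statistics

def restyle_channel(values, target_mean, target_std, eps=1e-6):
    """Normalize one feature channel, then apply target statistics (AdaIN-style)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [target_std * (v - mean) / (std + eps) + target_mean for v in values]

def restyle_features(feature_map, target_means, target_stds):
    """Apply per-channel restyling; feature_map is a list of flattened channels."""
    return [
        restyle_channel(channel, m, s)
        for channel, m, s in zip(feature_map, target_means, target_stds)
    ]

# Two toy channels of low-level source features (hypothetical values).
source_features = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
# Placeholder target statistics; in PIN these would come from prompt-guided optimization.
styled = restyle_features(source_features, target_means=[5.0, -1.0], target_stds=[2.0, 0.5])
```

The spatial layout of each channel is untouched; only its first- and second-order statistics change, which is why this kind of operation is cheap enough for low-resource devices.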
[CV-25] ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
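[Quick read]: This paper tackles segmentation of forest LiDAR 3D point clouds (both individual tree segmentation and semantic segmentation) under the complexity and variability of natural forests. The key to the solution is ForestFormer3D, a unified end-to-end framework that combines ISA-guided query point selection, a score-based block merging strategy, and a one-to-many association mechanism, reaching state-of-the-art individual tree segmentation on the newly introduced FOR-instanceV2 dataset. The method also generalizes well to unseen test sets.
Link: https://arxiv.org/abs/2506.16991
Authors: Binbin Xiang,Maciej Wielgosz,Stefano Puliti,Kamil Král,Martin Krůček,Azim Missarov,Rasmus Astrup
Affiliations: Norwegian Institute of Bioeconomy Research (NIBIO); Silva Tarouca Research Institute for Landscape and Ornamental Gardening
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: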
Abstract:The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released soon.
[CV-26] Reversing Flow for Image Restoration CVPR2025
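[Quick read]: This paper addresses the inefficiency and complexity that arise in image restoration when the degradation process is modeled as a stochastic transformation. The key to the solution is ResFlow, a framework that models degradation as a deterministic path using continuous normalizing flows and adds an auxiliary process that resolves the uncertainty in high-quality (HQ) image prediction, making the degradation process reversible.
Link: https://arxiv.org/abs/2506.16961
Authors: Haina Qin,Wenyang Luo,Libin Wang,Dandan Zheng,Jingdong Chen,Ming Yang,Bing Li,Weiming Hu
Affiliations: Ant Group; PeopleAI Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: CVPR2025 Final Version; Corresponding Author: Bing Li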
Abstract:Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications.
[CV-27] Visual-Instructed Degradation Diffusion for All-in-One Image Restoration CVPR2025
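[Quick read]: This paper addresses the limited generalization caused by needing a separate model for each degradation type in image restoration, especially under mixed or unknown real-world degradations. The key to the solution is Defusion, an all-in-one image restoration framework based on visual instruction-guided degradation diffusion: it constructs explicit visual instructions aligned with visual degradation patterns and uses them to guide a diffusion model operating in the degradation space, yielding stable and generalizable reconstruction.
Link: https://arxiv.org/abs/2506.16960
Authors: Wenyang Luo,Haina Qin,Zewen Chen,Libin Wang,Dandan Zheng,Yuming Li,Yufan Liu,Bing Li,Weiming Hu
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Ant Group; PeopleAI Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025 Final Version; Corresponding Author: Bing Li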
Abstract:Image restoration tasks like deblurring, denoising, and dehazing usually need distinct models for each degradation type, restricting their generalization in real-world scenarios with mixed or unknown degradations. In this work, we propose Defusion, a novel all-in-one image restoration framework that utilizes visual instruction-guided degradation diffusion. Unlike existing methods that rely on task-specific models or ambiguous text-based priors, Defusion constructs explicit visual instructions that align with the visual degradation patterns. These instructions are grounded by applying degradations to standardized visual elements, capturing intrinsic degradation features while agnostic to image semantics. Defusion then uses these visual instructions to guide a diffusion-based model that operates directly in the degradation space, where it reconstructs high-quality images by denoising the degradation effects with enhanced stability and generalizability. Comprehensive experiments demonstrate that Defusion outperforms state-of-the-art methods across diverse image restoration tasks, including complex and real-world degradations.
[CV-28] LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models ICML2025
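[Quick read]: This paper addresses a gap in evaluating the out-of-distribution (OOD) robustness of current computer vision models: existing benchmarks such as ImageNet-C are no longer suitable for models trained on large web-scraped datasets. The key to the solution is LAION-C, a benchmark alternative to ImageNet-C with six novel distortion types specifically designed to be OOD even for web-scale datasets such as LAION.
Link: https://arxiv.org/abs/2506.16950
Authors: Fanfei Li,Thomas Klein,Wieland Brendel,Robert Geirhos,Roland S. Zimmermann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2025 camera ready version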
Abstract:Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today’s large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
[CV-29] LunarLoc: Segment-Based Global Localization on the Moon
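[Quick read]: This paper addresses global localization for autonomous operations on the lunar surface, where Earth-based navigation infrastructure such as GPS is unavailable. Over long missions and varied terrain, methods such as visual-inertial odometry (VIO) accumulate drift that degrades pose estimates. The key to the solution is LunarLoc, which uses instance segmentation for zero-shot extraction of boulder landmarks from onboard stereo imagery, builds a graph-based representation of the terrain, and aligns it with a reference map captured in a previous session through graph-theoretic data association, giving accurate, drift-free global localization.
Link: https://arxiv.org/abs/2506.16940
Authors: Annika Thomas,Robaire Galliath,Aleksander Garbuz,Luke Anger,Cormac O’Neill,Trevor Johst,Dami Thomas,George Lordos,Jonathan P. How
Affiliations: Massachusetts Institute of Technology; NASA; The Johns Hopkins University; Caterpillar Inc.; Embodied AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: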
Abstract:Global localization is necessary for autonomous operations on the lunar surface where traditional Earth-based navigation infrastructure, such as GPS, is unavailable. As NASA advances toward sustained lunar presence under the Artemis program, autonomous operations will be an essential component of tasks such as robotic exploration and infrastructure deployment. Tasks such as excavation and transport of regolith require precise pose estimation, but proposed approaches such as visual-inertial odometry (VIO) accumulate odometry drift over long traverses. Precise pose estimation is particularly important for upcoming missions such as the ISRU Pilot Excavator (IPEx) that rely on autonomous agents to operate over extended timescales and varied terrain. To help overcome odometry drift over long traverses, we propose LunarLoc, an approach to global localization that leverages instance segmentation for zero-shot extraction of boulder landmarks from onboard stereo imagery. Segment detections are used to construct a graph-based representation of the terrain, which is then aligned with a reference map of the environment captured during a previous session using graph-theoretic data association. This method enables accurate and drift-free global localization in visually ambiguous settings. LunarLoc achieves sub-cm level accuracy in multi-session global localization experiments, significantly outperforming the state of the art in lunar global localization. To encourage the development of further methods for global localization on the Moon, we release our datasets publicly with a playback module: this https URL.
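Graph-theoretic data association can be sketched as follows: every candidate pairing of a reference landmark with an observed landmark becomes a node, two nodes are linked when the pairings preserve the distance between their landmarks, and a large mutually consistent set is selected. The code below is a toy version under stated assumptions: landmarks are 2D points, the consistency threshold `eps` is arbitrary, and a greedy degree-ordered selection stands in for a proper maximum-clique or dense-subgraph solver. It is not LunarLoc's implementation.

```python
import math
from itertools import product

def associate(map_a, map_b, eps=0.05):
    """Return landmark index pairs (i, j) that are mutually distance-consistent."""
    candidates = list(product(range(len(map_a)), range(len(map_b))))

    def consistent(p, q):
        (i, j), (k, l) = p, q
        if i == k or j == l:
            return False  # each landmark may be matched at most once
        d_a = math.dist(map_a[i], map_a[k])
        d_b = math.dist(map_b[j], map_b[l])
        return abs(d_a - d_b) < eps  # rigid motion preserves distances

    # Degree of each candidate in the consistency graph.
    degree = {p: sum(consistent(p, q) for q in candidates) for p in candidates}

    # Greedy clique: visit high-degree candidates first, keep those
    # consistent with everything chosen so far (heuristic, not optimal).
    chosen = []
    for p in sorted(candidates, key=lambda c: -degree[c]):
        if all(consistent(p, q) for q in chosen):
            chosen.append(p)
    return sorted(chosen)

# Hypothetical boulder positions in a reference map.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
# The same boulders seen in a later session, rotated and translated.
theta, tx, ty = 0.7, 5.0, -2.0
observed = [
    (math.cos(theta) * x - math.sin(theta) * y + tx,
     math.sin(theta) * x + math.cos(theta) * y + ty)
    for x, y in reference
]
matches = associate(reference, observed)
```

Once the matches are known, the rigid transform between sessions can be estimated from the paired points, which is what removes accumulated drift.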
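[CV-30] AI's Blind Spots: Geographic Knowledge and Diversity Deficit in Generated Urban Scenario
[Quick read]: This paper examines the limits of generative AI's geographic knowledge and the biases it embeds. The authors generate synthetic images of US states and their capitals with FLUX 1 and Stable Diffusion 3.5, embed the images with DINO-v2 ViT-S/14, and measure image similarity with the Fréchet Inception Distance to analyze how well the models represent geographic features. The key to the solution is systematic image generation and comparison, which reveals biases in geographic representation, notably in how urban and rural areas are covered and in entity disambiguation for European-sounding place names.
Link: https://arxiv.org/abs/2506.16898
Authors: Ciro Beneduce,Massimiliano Luca,Bruno Lepri
Affiliations: Bruno Kessler Foundation
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: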
Abstract:Image generation models are revolutionizing many domains, and urban analysis and design is no exception. While such models are widely adopted, there is limited literature exploring their geographic knowledge, along with the biases they embed. In this work, we generated 150 synthetic images for each state in the USA and related capitals using FLUX 1 and Stable Diffusion 3.5, two state-of-the-art models for image generation. We embed each image using DINO-v2 ViT-S/14 and use the Fréchet Inception Distance to measure the similarity between the generated images. We found that while these models have implicitly learned aspects of USA geography, if we prompt the models to generate an image for “United States” instead of specific cities or states, the models exhibit a strong representative bias toward metropolis-like areas, excluding rural states and smaller cities. In addition, we found that models systematically exhibit some entity-disambiguation issues with European-sounding names like Frankfort or Devon.
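A Fréchet-style distance compares two sets of embeddings by fitting a Gaussian to each set and measuring how far apart the two Gaussians are. The sketch below makes a simplifying assumption that the study itself does not state: covariances are treated as diagonal, so the distance reduces to squared gaps in per-dimension means and standard deviations. The full definition uses the matrix square root of the product of the two covariance matrices. The embeddings are tiny hypothetical vectors, not DINO-v2 outputs.

```python
import statistics

def frechet_distance_diag(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets,
    assuming diagonal covariances (a simplification of the full formula)."""
    total = 0.0
    for d in range(len(emb_a[0])):
        col_a = [e[d] for e in emb_a]
        col_b = [e[d] for e in emb_b]
        mean_gap = statistics.fmean(col_a) - statistics.fmean(col_b)
        std_gap = statistics.pstdev(col_a) - statistics.pstdev(col_b)
        total += mean_gap ** 2 + std_gap ** 2
    return total

# Hypothetical 2-D embeddings for images generated from two prompts.
state_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
state_b = [[x + 3.0, y + 4.0] for x, y in state_a]  # same spread, shifted mean

same = frechet_distance_diag(state_a, state_a)
shifted = frechet_distance_diag(state_a, state_b)
```

With identical spreads, the distance depends only on the mean shift, so comparing every pair of states yields a matrix that can be inspected for clusters of visually similar regions.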
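[CV-31] With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
[Quick read]: This paper addresses building effective multimodal models when paired multimodal data is limited. Conventional approaches rely on millions of paired samples, which are expensive or infeasible to obtain in many domains. The key to the solution is to align pretrained unimodal foundation models, using STRUCTURE, a regularization technique that preserves the neighborhood geometry of the unimodal encoders' latent spaces, and aligning the layers with the highest representational similarity across modalities.
Link: https://arxiv.org/abs/2506.16895
Authors: Fabian Gröger,Shuo Wen,Huyen Le,Maria Brbić
Affiliations: EPFL; University of Basel; HSLU
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: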
Abstract:Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than 1% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of 51.6% in classification and 91.8% in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
[CV-32] From Lab to Factory: Pitfalls and Guidelines for Self-/Unsupervised Defect Detection on Low-Quality Industrial Images ECML KDD’25
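[Quick read]: This paper addresses the detection and localization of quality problems in industrially produced parts. Manual inspection, the traditional approach, is costly and error-prone, while machine learning methods struggle in practice with low data quality and limited robustness. The authors adopt unsupervised or self-supervised methods, since not all conceivable defects can be specified in advance, and evaluate two types of state-of-the-art models to identify and improve quality issues in production data without acquiring new data. The core contribution is a set of guardrails that let practitioners reliably identify robustness- or invariance-related problems in the model or the data, together with an account of the pitfalls of likelihood-based approaches and a framework for empirical risk estimation better suited to real-world scenarios.
Link: https://arxiv.org/abs/2506.16890
Authors: Sebastian Hönel,Jonas Nordqvist
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: 18 pages, 7 figures, 1 table. Camera-ready version for the 2025 conference European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD '25)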
Abstract:The detection and localization of quality-related problems in industrially mass-produced products has historically relied on manual inspection, which is costly and error-prone. Machine learning has the potential to replace manual handling. As such, the desire is to facilitate an unsupervised (or self-supervised) approach, as it is often impossible to specify all conceivable defects ahead of time. A plethora of prior works have demonstrated the aptitude of common reconstruction-, embedding-, and synthesis-based methods in laboratory settings. However, in practice, we observe that most methods do not handle low data quality well or exude low robustness in unfavorable, but typical real-world settings. For practitioners it may be very difficult to identify the actual underlying problem when such methods underperform. Worse, often-reported metrics (e.g., AUROC) are rarely suitable in practice and may give misleading results. In our setting, we attempt to identify subtle anomalies on the surface of blasted forged metal parts, using rather low-quality RGB imagery only, which is a common industrial setting. We specifically evaluate two types of state-of-the-art models that allow us to identify and improve quality issues in production data, without having to obtain new data. Our contribution is to provide guardrails for practitioners that allow them to identify problems related to, e.g., (lack of) robustness or invariance, in either the chosen model or the data reliably in similar scenarios. Furthermore, we exemplify common pitfalls in and shortcomings of likelihood-based approaches and outline a framework for proper empirical risk estimation that is more suitable for real-world scenarios.
[CV-33] ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware Control
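[Quick read]: This paper addresses autonomous parking in intelligent vehicle systems, especially in urban settings that demand high-precision control, where traditional rule-based systems fail to adapt to complex or dynamic scenes. The key to the solution is a Transformer-based end-to-end framework that learns from expert demonstrations. It takes surround-view camera images, goal-point representations, ego-vehicle motion, and pedestrian trajectories as input and outputs discrete control sequences covering throttle, braking, steering, and gear selection. Its key modules are a cross-attention module and a GRU-based pedestrian predictor, which improve parking accuracy and safety.
Link: https://arxiv.org/abs/2506.16856
Authors: Jun Fu,Bin Tian,Haonan Chen,Shi Meng,Tingting Yao
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: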
Abstract:Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: this https URL.
[CV-34] Controllable and Expressive One-Shot Video Head Swapping
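[Quick read]: This paper addresses several challenges in video head swapping: existing face-swapping methods focus on local facial replacement and neglect overall head morphology, head-swapping methods struggle with hairstyle diversity and complex backgrounds, and none allow the transplanted head's expression to be adjusted after swapping. The key to the solution is a diffusion-based, multi-condition controllable framework built on a unified latent diffusion paradigm with two strategies. (1) Identity-preserving context fusion: a shape-agnostic mask strategy separates foreground head identity features from background/body context and is combined with a hair enhancement strategy to preserve holistic head identity robustly across diverse hairstyles and complex backgrounds. (2) Expression-aware landmark retargeting and editing: a 3DMM-driven retargeting module that decouples identity, expression, and head pose reduces the influence of the input image's original expression and supports expression editing, while a scale-aware retargeting strategy reduces cross-identity expression distortion and improves transfer precision.
Link: https://arxiv.org/abs/2506.16852
Authors: Chaonan Ji,Jinwei Qi,Peng Zhang,Bang Zhang,Liefeng Bo
Affiliations: Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL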
Abstract:In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, while preserving the original body and background of the target video, and further allows head expressions and movements to be adjusted during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement, neglecting holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds, and none of these methods allow users to modify the transplanted head expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, as well as showcasing superior expression transfer capabilities applicable to both real and virtual characters.
[CV-35] Camera Calibration via Circular Patterns: A Comprehensive Framework with Measurement Uncertainty and Unbiased Projection Model
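[Quick read]: This paper addresses a bias in camera calibration with planar targets: under lens distortion, the existing projection model for circle centroids is biased, which hurts calibration accuracy and robustness. The key to the solution is an unbiased projection model for circular patterns plus a centroid uncertainty measure that improves robustness and completeness. The boundary points of a 2D shape are modeled as a Markov random field, and a shape representation based on Green's theorem propagates the shape distribution to the centroid uncertainty, yielding marked gains in calibration accuracy and robustness.
Link: https://arxiv.org/abs/2506.16842
Authors: Chaehyeon Song,Dongjae Lee,Jongwoo Lim,Ayoung Kim
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: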
Abstract:Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroid of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on Green's theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at this https URL.
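Green's theorem is what lets a centroid be written purely in terms of boundary points: the area integrals for the area and the first moments turn into sums over boundary edges. The sketch below shows only that deterministic step for a closed polygon (the standard shoelace-type formulas). The Markov random field over boundary points and the uncertainty propagation are not reproduced, and the function name is illustrative.

```python
def polygon_area_centroid(points):
    """Area and centroid of a simple closed polygon from its boundary vertices.

    Uses the boundary-sum formulas that follow from Green's theorem.
    Vertices must be ordered counter-clockwise for a positive area.
    """
    area2 = 0.0      # twice the signed area
    cx = cy = 0.0    # unnormalized first moments
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]   # wrap around to close the boundary
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area = area2 / 2.0
    return area, (cx / (6.0 * area), cy / (6.0 * area))

# A 2x2 square with counter-clockwise boundary points.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
area, centroid = polygon_area_centroid(square)
```

Because the centroid is a smooth function of the boundary points, noise on those points can in principle be pushed through this function to obtain a centroid covariance, which is the role the boundary representation plays in the abstract.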
[CV-36] Beyond Blur: A Fluid Perspective on Generative Diffusion Models
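[Quick read]: This paper addresses how to model the image corruption process in generative image synthesis, aiming to improve the quality and diversity of generated images with a physically motivated partial differential equation (PDE) process. The key to the solution is a new PDE-driven corruption process that couples advection with diffusion, is controlled by dimensionless numbers (Péclet, Fourier), and is implemented efficiently with a GPU-accelerated lattice Boltzmann solver. Stochastic velocity fields introduce realistic turbulence, and a neural network learns to reverse the advection-diffusion operator, which constitutes a new generative model.
Link: https://arxiv.org/abs/2506.16827
Authors: Grzegorz Gruszczynski,Michal Jan Wlodarczyk,Jakub J Meixner,Przemyslaw Musialski
Affiliations: IDEAS NCBR; NJIT
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 8 figures, pre-print, supplementary pseudocode in appendix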
Abstract:We propose a novel PDE-driven corruption process for generative image synthesis based on advection-diffusion processes which generalizes existing PDE-based approaches. Our forward pass formulates image corruption via a physically motivated PDE that couples directional advection with isotropic diffusion and Gaussian noise, controlled by dimensionless numbers (Peclet, Fourier). We implement this PDE numerically through a GPU-accelerated custom Lattice Boltzmann solver for fast evaluation. To induce realistic turbulence, we generate stochastic velocity fields that introduce coherent motion and capture multi-scale mixing. In the generative process, a neural network learns to reverse the advection-diffusion operator thus constituting a novel generative model. We discuss how previous methods emerge as specific cases of our operator, demonstrating that our framework generalizes prior PDE-based corruption techniques. We illustrate how advection improves the diversity and quality of the generated images while keeping the overall color palette unaffected. This work bridges fluid dynamics, dimensionless PDE theory, and deep generative modeling, offering a fresh perspective on physically informed image corruption processes for diffusion-based synthesis.
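To make the forward corruption concrete, here is a toy advection-diffusion step on a small periodic grid. It is a sketch under explicit assumptions, not the paper's solver: it uses a first-order upwind finite-difference scheme instead of a lattice Boltzmann method, a constant non-negative velocity instead of stochastic turbulent fields, and it omits the Gaussian noise term. The parameter names (`vx`, `vy`, `kappa`) are illustrative.

```python
def advect_diffuse_step(img, vx, vy, kappa, dt=1.0):
    """One explicit step of du/dt = kappa * laplacian(u) - v . grad(u).

    Periodic boundaries; upwind differences valid for vx, vy >= 0.
    Stable when dt * (4 * kappa + vx + vy) <= 1.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            c = img[y][x]
            left, right = img[y][(x - 1) % w], img[y][(x + 1) % w]
            up, down = img[(y - 1) % h][x], img[(y + 1) % h][x]
            advection = vx * (c - left) + vy * (c - up)   # transport along +x, +y
            laplacian = left + right + up + down - 4.0 * c  # isotropic blur
            out[y][x] = c + dt * (kappa * laplacian - advection)
    return out

# An 8x8 image with a single bright pixel.
size = 8
image = [[0.0] * size for _ in range(size)]
image[4][2] = 1.0

corrupted = image
for _ in range(20):
    corrupted = advect_diffuse_step(corrupted, vx=0.3, vy=0.1, kappa=0.1)

total_before = sum(map(sum, image))
total_after = sum(map(sum, corrupted))
peak_after = max(map(max, corrupted))
```

On a periodic grid both terms only redistribute intensity, so the total is conserved, which parallels the abstract's remark that the overall color palette is unaffected. The ratio of advective to diffusive transport (velocity times length over `kappa`) plays the role of the Péclet number.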
[CV-37] AnyTraverse: An off-road traversability framework with VLM and human operator in the loop
[Quick read]: This paper targets traversability segmentation for autonomous navigation in unstructured environments, with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Existing frameworks perform poorly because of large environmental variation and scene uncertainty, and they do not adapt to different robot types. The key to the solution is AnyTraverse, a framework that combines natural-language prompts with human-operator assistance to determine traversable regions for a variety of robotic vehicles. It calls the operator only when it encounters unexplored scenery or an unknown class not covered by the prompts, which reduces the active supervision load while adapting to varied outdoor scenes.
Link: https://arxiv.org/abs/2506.16826
Authors: Sattwik Sahu,Agamdeep Singh,Karthik Nambiar,Srikanth Saripalli,P.B. Sujit
Affiliations: Indian Institute of Science Education and Research Bhopal; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Off-road traversability segmentation enables autonomous navigation with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Current frameworks struggle due to significant variations in unstructured environments and uncertain scene changes, and do not adapt to different robot types. We present AnyTraverse, a framework combining natural language-based prompts with human-operator assistance to determine navigable regions for diverse robotic vehicles. The system segments scenes for a given set of prompts and calls the operator only when encountering previously unexplored scenery or an unknown class that is not part of the prompt in its region-of-interest, thus reducing active supervision load while adapting to varying outdoor scenes. Our zero-shot learning approach eliminates the need for extensive data collection or retraining. Our experimental validation includes testing on the RELLIS-3D, Freiburg Forest, and RUGD datasets and real-world deployment on multiple robot platforms. The results show that AnyTraverse performs better than GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision.
[CV-38] Self-supervised Feature Extraction for Enhanced Ball Detection on Soccer Robots
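[Quick read]: This paper aims at robust and accurate ball detection for autonomous humanoid soccer robots in dynamic, challenging environments such as RoboCup outdoor fields, where traditional supervised methods depend on extensive manual annotation that is costly and time-consuming. The key to the solution is a self-supervised learning framework for domain-adaptive feature extraction: a general-purpose pretrained model generates pseudo-labels, self-supervised pretext tasks (colorization, edge detection, and triplet loss) learn robust visual features, and a model-agnostic meta-learning (MAML) strategy enables rapid adaptation to new scenarios with minimal supervision.
Link: https://arxiv.org/abs/2506.16821
Authors: Can Lin,Daniele Affinita,Marco E. P. Zimmatore,Daniele Nardi,Domenico D. Bloisi,Vincenzo Suriani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: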
Abstract:Robust and accurate ball detection is a critical component for autonomous humanoid soccer robots, particularly in dynamic and challenging environments such as RoboCup outdoor fields. However, traditional supervised approaches require extensive manual annotation, which is costly and time-intensive. To overcome this problem, we present a self-supervised learning framework for domain-adaptive feature extraction to enhance ball detection performance. The proposed approach leverages a general-purpose pretrained model to generate pseudo-labels, which are then used in a suite of self-supervised pretext tasks – including colorization, edge detection, and triplet loss – to learn robust visual features without relying on manual annotations. Additionally, a model-agnostic meta-learning (MAML) strategy is incorporated to ensure rapid adaptation to new deployment scenarios with minimal supervision. A new dataset comprising 10,000 labeled images from outdoor RoboCup SPL matches is introduced, used to validate the method, and made available to the community. Experimental results demonstrate that the proposed pipeline outperforms baseline models in terms of accuracy, F1 score, and IoU, while also exhibiting faster convergence.
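One of the pretext objectives named in this abstract is a triplet loss. The sketch below shows the standard hinge form on plain Python lists. The margin value and the use of Euclidean distance are assumptions, since the abstract does not specify them, and the embeddings here are hand-written rather than produced by a network.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative far away
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])  # negative as close as positive
```

In the easy case the negative already sits well outside the margin, so the loss vanishes. In the hard case the positive and negative are equidistant from the anchor, and the loss equals the margin.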
[CV-39] Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection IJCAI2025
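[Quick read]: This paper addresses visual content forgery, specifically the detection and localization of deepfake images produced by generative AI. Existing methods focus mainly on image-level classification or pixel-wise localization, but they often generalize poorly or rely on complex architectures. The proposed solution is the Loupe framework. Its key idea is to integrate a patch-aware classifier with a segmentation module that uses conditional queries, so that global authenticity classification and fine-grained mask prediction are handled jointly, and to improve robustness to test-set distribution shift through a pseudo-label-guided test-time adaptation mechanism.
Link: https://arxiv.org/abs/2506.16819
Authors: Yuchu Jiang,Jiaming Chu,Jian Zhao,Xin Zhang,Xu Yang,Lei Jin,Chi Zhang,Xuelong Li
Affiliations: Southeast University; EVOL Lab, TeleAI of China Telecom; Beijing University of Posts and Telecommunications; Lanzhou University; Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, accepted by IJCAI 2025 workshop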
Abstract:The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts of test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism by leveraging patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at this https URL.
[CV-40] FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
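[Quick read]: This paper addresses the fact that existing large vision-language models (LVLMs) handle "what to see" and "how to edit" separately in image editing: current methods either perform isolated object segmentation or use segmentation masks merely as conditional prompts for local edit generation, usually relying on multiple independent models. The key to the solution is FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation in an end-to-end framework. It uses a dual-branch visual encoder to capture global semantic context and fine-grained spatial detail at the same time, and a MoVQGAN-based visual tokenizer to produce discrete visual tokens that improve generation quality. A progressive multi-stage training strategy jointly optimizes the segmentation masks and uses them as spatial condition prompts to guide the diffusion decoder, aligning the visual encoding, segmentation, and generation modules.
Link: https://arxiv.org/abs/2506.16806
Authors: Fan Yang,Yousong Zhu,Xin Li,Yufei Zhan,Hongyin Zhao,Shurong Zheng,Yaowei Wang,Ming Tang,Jinqiao Wang
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Science; Peng Cheng Laboratory; Wuhan AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: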
Abstract:Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat “what to see” and “how to edit” separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.
[CV-41] Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes
[Quick read]: This paper addresses co-visibility reasoning on sparse image sets, that is, the ability to accurately identify the overlapping regions visible across multiple images. Existing vision models perform poorly on this task, especially under sparse conditions, which suggests that they lack high-level spatial understanding of scenes. The key to the solution is a multi-view benchmark, Co-Visibility reasONing (Co-VisiON), together with a new multi-view baseline model, Covis, that brings co-visibility analysis closer to human-level performance and thereby pushes vision models toward high-level reasoning in complex, sparse environments.
Link: https://arxiv.org/abs/2506.16805
Authors: Chao Chen,Nobel Dang,Juexiao Zhang,Wenkai Sun,Pengfei Zheng,Xuhang He,Yimeng Ye,Taarun Srinivas,Chen Feng
Affiliations: New York University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Humans exhibit a remarkable ability to recognize co-visibility (the overlapping regions visible in multiple images) even when these images are sparsely distributed across a complex scene. This capability is foundational in 3D vision and robotic perception. Despite significant progress in vision learning, it remains unclear whether current vision models have reached human-level proficiency in co-visibility analysis. In this work, we introduce the Co-Visibility reasONing (Co-VisiON) benchmark, designed to directly evaluate co-visibility reasoning on sparse image sets across over 1000 indoor scenarios. Our experiments reveal that while co-visibility is typically treated as a low-level feature matching task, it poses a significant challenge for existing vision models under sparse conditions. Notably, a proprietary vision-language model outperforms all purely vision-based approaches, with all models lagging substantially behind human performance. This gap underscores the need for more than basic pairwise vision processing; it calls for a comprehensive spatial understanding through high-level reasoning across multiple views. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, high-level reasoning in challenging, sparse environments. Our dataset and source code can be found at: this https URL
[CV-42] Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation
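[Quick read]: This paper addresses the poor generalization of AI-generated video detectors in practical use. The key to the solution is to guide the detector toward the intrinsic low-level artifacts introduced by generative models rather than the high-level semantic flaws specific to a particular model. By studying different generative architectures, extracting discriminative features that are robust, unbiased, and shared across models, and adding a new forensic-oriented data augmentation strategy based on wavelet decomposition, the method improves detector generalization without complex algorithms or large datasets covering many generators.
Link: https://arxiv.org/abs/2506.16802
Authors: Riccardo Corvi,Davide Cozzolino,Ekta Prashnani,Shalini De Mello,Koki Nagano,Luisa Verdoliva
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: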
Abstract:Synthetic video generation is progressing very rapidly. The latest models can produce very realistic high-resolution videos that are virtually indistinguishable from real ones. Although several video forensic detectors have been recently proposed, they often exhibit poor generalization, which limits their applicability in a real-world scenario. Our key insight to overcome this issue is to guide the detector towards seeing what really matters. In fact, a well-designed forensic classifier should focus on identifying intrinsic low-level artifacts introduced by a generative architecture rather than relying on high-level semantic flaws that characterize a specific model. In this work, first, we study different generative architectures, searching and identifying discriminative features that are unbiased, robust to impairments, and shared across models. Then, we introduce a novel forensic-oriented data augmentation strategy based on the wavelet decomposition and replace specific frequency-related bands to drive the model to exploit more relevant forensic cues. Our novel training paradigm improves the generalizability of AI-generated video detectors, without the need for complex algorithms and large datasets that include multiple synthetic generators. To evaluate our approach, we train the detector using data from a single generative model and test it against videos produced by a wide range of other models. Despite its simplicity, our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models, such as NOVA and FLUX. Code and data will be made publicly available.
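The augmentation in this abstract decomposes the input with wavelets and replaces specific frequency bands. The toy sketch below shows that general idea on a 1D signal with a single-level Haar transform. The abstract does not say which bands are replaced or where the replacement comes from, so the choice of the detail band, the donor signal, and all function names are assumptions.

```python
def haar_decompose(signal):
    """Single-level Haar transform of an even-length 1D signal.

    Returns (approximation, detail): pairwise averages and half-differences.
    """
    approx = [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Inverse of haar_decompose."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal


def swap_detail_band(target, donor):
    """Keep the target's low-frequency band and take the donor's high-frequency band."""
    approx, _ = haar_decompose(target)
    _, donor_detail = haar_decompose(donor)
    return haar_reconstruct(approx, donor_detail)


target = [4.0, 2.0, 6.0, 8.0]
donor = [1.0, 1.0, 5.0, 1.0]
augmented = swap_detail_band(target, donor)
```

After the swap, the coarse content of `target` is unchanged and only its fine-scale detail differs. A detector trained on such samples cannot rely on the coarse content alone to separate them.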
[CV-43] RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
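[Quick read]: This paper addresses the problem that existing methods for real-world image super-resolution understand degraded image content inaccurately, which leads to reconstructions that are low-fidelity and unnatural. The key to the solution is the VLCoT framework. Inspired by the success of Chain of Thought (CoT) in large language models, it integrates vision and language reasoning to simulate how a human handles a degraded image, progressively generating more comprehensive text descriptions and higher-resolution images. In addition, to overcome the poor generalization of traditional supervised CoT learning in real-world scenarios, the paper introduces Group Relative Policy Optimization (GRPO) to this task for the first time and designs four reward functions to optimize the model's degradation estimation, content understanding, and image generation quality.
Link: https://arxiv.org/abs/2506.16796
Authors: Junbo Qiao,Miaomiao Cai,Wei Li,Yutong Liu,Xudong Huang,Gaoqi He,Jiao Xie,Jie Hu,Xinghao Chen,Shaohui Lin
Affiliations: East China Normal University; University of Science and Technology of China; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: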
Abstract:Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
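GRPO scores each sampled output relative to the other outputs in its group rather than with a learned value function. The sketch below shows that group-relative normalization step only. The use of the population standard deviation, the `eps` constant, and the example reward values are assumptions, and nothing here reflects how the paper combines its four rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group.

    Outputs that beat the group mean get a positive advantage,
    and the rest get a negative one.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for four outputs sampled from one degraded image.
advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
```

When every output in a group receives the same reward, all advantages are zero, so that group contributes no policy gradient signal.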
[CV-44] TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration
[Quick read]: This paper addresses the lack of a comprehensive dataset in brain tumor analysis that combines radiological images with text annotations, a gap that limits the use of multimodal methods in medical image segmentation. The key to the solution is the TextBraTS dataset, the first publicly available volume-level multimodal dataset with paired MRI volumes and rich text annotations. On top of this dataset the paper proposes a new baseline framework and a sequential cross-attention method for text-guided volumetric medical image segmentation, improving brain tumor segmentation accuracy.
Link: https://arxiv.org/abs/2506.16784
Authors: Xiaoyu Shi,Rahul Kumar Jain,Yinhao Li,Ruibo Hou,Jingliang Cheng,Jie Bai,Guohua Zhao,Lanfen Lin,Rui Xu,Yen-wei Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data. To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this novel dataset, we propose a novel baseline framework and sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques. Our dataset, implementation code, and pre-trained models are publicly available at this https URL.
[CV-45] PQCAD-DM: Progressive Quantization and Calibration-Assisted Distillation for Extremely Efficient Diffusion Model
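[Quick read]: This paper addresses the high computational complexity and heavy resource consumption of diffusion models in image generation, as well as the limited effectiveness of traditional compression techniques. The key solution is the PQCAD-DM framework, which combines Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD). It uses adaptive bit-width transitions to reduce weight perturbations at low precision and a full-precision calibration dataset to improve the student model, which markedly improves computational efficiency while preserving generation quality.
Link: https://arxiv.org/abs/2506.16776
Authors: Beomseok Ko,Hyeryung Jang
Affiliations: Dongguk University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures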
Abstract:Diffusion models excel in image generation but are computationally and resource-intensive due to their reliance on iterative Markov chain processes, leading to error accumulation and limiting the effectiveness of naive compression techniques. In this paper, we propose PQCAD-DM, a novel hybrid compression framework combining Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD) to address these challenges. PQ employs a two-stage quantization with adaptive bit-width transitions guided by a momentum-based mechanism, reducing excessive weight perturbations at low precision. CAD leverages full-precision calibration datasets during distillation, enabling the student to match full-precision performance even with a quantized teacher. As a result, PQCAD-DM achieves a balance between computational efficiency and generative quality, halving inference time while maintaining competitive performance. Extensive experiments validate PQCAD-DM’s superior generative capabilities and efficiency across diverse datasets, outperforming fixed-bit quantization methods.
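To see why low bit-widths perturb weights more, the sketch below applies plain symmetric uniform quantization at two bit-widths and measures the worst-case error. This is a generic illustration and not the paper's momentum-guided PQ schedule: the quantizer, the example weights, and the function names are all assumptions.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights]


def max_error(weights, bits):
    """Largest absolute perturbation introduced by quantization."""
    quantized = quantize_uniform(weights, bits)
    return max(abs(w - q) for w, q in zip(weights, quantized))


weights = [-1.0, -0.37, 0.0, 0.42, 0.9]
error_8bit = max_error(weights, 8)
error_4bit = max_error(weights, 4)
```

The 4-bit error is roughly an order of magnitude larger than the 8-bit error for these weights. That gap suggests why moving to the target precision in stages, rather than in one jump, might limit the perturbation seen at each step.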
[CV-46] Infrared and Visible Image Fusion Based on Implicit Neural Representations
[Quick read]: This paper addresses infrared and visible image fusion, which combines the strengths of the two modalities to produce information-rich images that meet visual or computational requirements. The key to the solution is a method based on Implicit Neural Representations (INR): a neural network parameterizes a continuous function that implicitly represents the multimodal information of the image, removing the traditional dependence on discrete pixels or explicit features. The method takes normalized spatial coordinates as input and uses a multi-layer perceptron to adaptively fuse the features of both modalities, producing a high-quality fused image.
Link: https://arxiv.org/abs/2506.16773
Authors: Shuchen Sun,Ligen Shi,Chang Liu,Lina Wu,Jun Qiu
Affiliations: Beijing Information Science and Technology University; Capital Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Infrared and visible light image fusion aims to combine the strengths of both modalities to generate images that are rich in information and fulfill visual or computational requirements. This paper proposes an image fusion method based on Implicit Neural Representations (INR), referred to as INRFuse. This method parameterizes a continuous function through a neural network to implicitly represent the multimodal information of the image, breaking through the traditional reliance on discrete pixels or explicit features. The normalized spatial coordinates of the infrared and visible light images serve as inputs, and multi-layer perceptrons are used to adaptively fuse the features of both modalities, resulting in the output of the fused image. By designing multiple loss functions, the method jointly optimizes the similarity between the fused image and the original images, effectively preserving the thermal radiation information of the infrared image while maintaining the texture details of the visible light image. Furthermore, the resolution-independent characteristic of INR allows for the direct fusion of images with varying resolutions and achieves super-resolution reconstruction through high-density coordinate queries. Experimental results indicate that INRFuse outperforms existing methods in both subjective visual quality and objective evaluation metrics, producing fused images with clear structures, natural details, and rich information without the necessity for a training dataset.
[CV-47] Class Agnostic Instance-level Descriptor for Visual Instance Search
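[Quick read]: This paper addresses the difficulty that visual instance search faces because effective instance-level feature representations are lacking. Conventional supervised or weakly supervised object detection methods are unsuitable because they perform poorly on unknown object categories. The key to the solution is to start from the feature set output by a self-supervised ViT (Vision Transformer) and model instance-level region discovery as the hierarchical detection of compact feature subsets. The hierarchical decomposition yields instance regions at different semantic scales and handles object embedding and occlusion effectively, building a comprehensive representation of the latent instances in an image.
Link: https://arxiv.org/abs/2506.16745
Authors: Qi-Ying Sun,Wan-Lei Zhao,Yi-Bo Miao,Chong-Wah Ngo
Affiliations: Xiamen University; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: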
Abstract:Despite the great success of the deep features in content-based image retrieval, the visual instance search remains challenging due to the lack of effective instance level feature representation. Supervised or weakly supervised object detection methods are not among the options due to their poor performance on the unknown object categories. In this paper, based on the feature set output from self-supervised ViT, the instance level region discovery is modeled as detecting the compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of feature subsets. The non-leaf nodes and leaf nodes on the hierarchy correspond to the various instance regions in an image of different semantic scales. The hierarchical decomposition well addresses the problem of object embedding and occlusions, which are widely observed in the real scenarios. The features derived from the nodes on the hierarchy make up a comprehensive representation for the latent instances in the image. Our instance-level descriptor remains effective on both the known and unknown object categories. Empirical studies on three instance search benchmarks show that it outperforms state-of-the-art methods considerably.
[CV-48] Noise-Informed Diffusion-Generated Image Detection with Anomaly Attention
[Quick read]: This paper addresses the detection of diffusion-generated images, in particular the key challenge of weak generalization to unseen diffusion models. The key to the solution is to exploit image noise features: a new Noise-Aware Self-Attention (NASA) module focuses on noise regions to capture anomalous patterns, improving detection of diffusion-generated images.
Link: https://arxiv.org/abs/2506.16743
Authors: Weinan Guan,Wei Wang,Bo Peng,Ziwen He,Jing Dong,Haonan Cheng
Affiliations: University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA); Nanjing University of Information Science and Technology; State Key Laboratory of Media Convergence and Communication, Communication University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TIFS 2025. Our code is available at this https URL
Abstract:With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be effective. However, a key challenge for forgery detection is generalising to diffusion models not seen during training. In this paper, we address this problem by focusing on image noise. We observe that images from different diffusion models share similar noise patterns, distinct from genuine images. Building upon this insight, we introduce a novel Noise-Aware Self-Attention (NASA) module that focuses on noise regions to capture anomalous patterns. To implement a SOTA detection model, we incorporate NASA into Swin Transformer, forming a novel detection architecture NASA-Swin. Additionally, we employ a cross-modality fusion embedding to combine RGB and noise images, along with a channel mask strategy to enhance feature learning from both modalities. Extensive experiments demonstrate the effectiveness of our approach in enhancing detection capabilities for diffusion-generated images. When encountering unseen generation methods, our approach achieves state-of-the-art performance. Our code is available at this https URL.
[CV-49] Uncertainty-Aware Variational Information Pursuit for Interpretable Medical Image Analysis
[Quick read]: This paper addresses the fact that AI decision-support systems in medical imaging ignore instance-level uncertainty when generating interpretable conclusions. This uncertainty can come from the limitations of the model itself (epistemic uncertainty) or from variability in expert responses (aleatoric uncertainty). The key to the solution is a new framework, Uncertainty-Aware V-IP (UAV-IP), which integrates uncertainty quantification into the Variational Information Pursuit (V-IP) process, improving the reliability of the model and the accuracy of its explanations.
Link: https://arxiv.org/abs/2506.16742
Authors: Md Nahiduzzaman,Ruwan Tennakoon,Steven Korevaar,Zongyuan Ge,Alireza Bab-Hadiashar
Affiliations: RMIT University; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In medical imaging, AI decision-support systems must balance accuracy and interpretability to build user trust and support effective clinical decision-making. Recently, Variational Information Pursuit (V-IP) and its variants have emerged as interpretable-by-design modeling techniques, aiming to explain AI decisions in terms of human-understandable, clinically relevant concepts. However, existing V-IP methods overlook instance-level uncertainties in query-answer generation, which can arise from model limitations (epistemic uncertainty) or variability in expert responses (aleatoric uncertainty). This paper introduces Uncertainty-Aware V-IP (UAV-IP), a novel framework that integrates uncertainty quantification into the V-IP process. We evaluate UAV-IP across four medical imaging datasets, PH2, Derm7pt, BrEaST, and SkinCon, demonstrating an average AUC improvement of approximately 3.2% while generating 20% more concise explanations compared to baseline V-IP, without sacrificing informativeness. These findings highlight the importance of uncertainty-aware reasoning in interpretable-by-design models for robust and reliable medical decision-making.
[CV-50] Cross-modal Offset-guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection
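[Quick read]: This paper addresses the spatial misalignment between multimodal images (visible RGB and infrared IR) in unmanned aerial vehicle (UAV) object detection, caused by platform motion and asynchronous imaging, which gives rise to two challenges: semantic inconsistency and modality conflict. The key to the solution is CoDAF, a unified framework that handles alignment and fusion jointly through two new modules. The Offset-guided Semantic Alignment (OSA) module uses attention-based spatial offset estimation and deformable convolution to align features more precisely, and the Dynamic Attention-guided Fusion Module (DAFM) adaptively balances modality contributions through gating and refines the fused features with spatial-channel dual attention.
Link: https://arxiv.org/abs/2506.16737
Authors: Liu Zongzhen,Luo Hui,Wang Zhixing,Wei Yuxing,Zuo Haorui,Zhang Jianlin
Affiliations: Chinese Academy of Sciences; Institute of Optics and Electronics, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: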
Abstract:Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. To improve robustness, recent studies have explored multimodal detection by fusing visible (RGB) and infrared (IR) imagery. However, due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. This introduces two major challenges: semantic inconsistency at corresponding spatial locations and modality conflict during feature fusion. Existing methods often address these issues in isolation, limiting their effectiveness. In this paper, we propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF), a unified framework that jointly tackles both challenges in weakly aligned UAV-based object detection. CoDAF comprises two novel modules: the Offset-guided Semantic Alignment (OSA), which estimates attention-based spatial offsets and uses deformable convolution guided by a shared semantic space to align features more precisely; and the Dynamic Attention-guided Fusion Module (DAFM), which adaptively balances modality contributions through gating and refines fused features via spatial-channel dual attention. By integrating alignment and fusion in a unified design, CoDAF enables robust UAV object detection. Experiments on standard benchmarks validate the effectiveness of our approach, with CoDAF achieving a mAP of 78.6% on the DroneVehicle dataset.
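The abstract describes balancing modality contributions through gating. One common way to realize such a gate is a per-element convex blend of the two feature vectors, sketched below. This is only an illustration of the gating idea, not the DAFM module: in a real model the gate logits would be predicted from the features, whereas here they are passed in by hand, and all names are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb_feat, ir_feat, gate_logits):
    """Per-element convex blend of two modalities.

    A gate near 1 favors the RGB feature, and a gate near 0 favors the IR feature.
    """
    fused = []
    for rgb, ir, logit in zip(rgb_feat, ir_feat, gate_logits):
        gate = sigmoid(logit)
        fused.append(gate * rgb + (1.0 - gate) * ir)
    return fused


# First element: neutral gate. Second element: gate strongly favoring RGB.
fused = gated_fusion([1.0, 1.0], [0.0, 0.0], [0.0, 50.0])
```

Because the two weights always sum to one, the fused value stays between the two inputs. This keeps one modality from being amplified without bound when the other is unreliable.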
[CV-51] 3DeepRep: 3D Deep Low-rank Tensor Representation for Hyperspectral Image Inpainting
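[Quick read]: This paper addresses hyperspectral image (HSI) inpainting, where missing or corrupted data leaves the information incomplete; the core challenge is how to exploit the low-rank structure of the HSI to recover the missing regions. The key to the solution is a 3-directional deep low-rank tensor representation model (3DeepRep). It applies deep nonlinear transforms along each of the three modes of the HSI tensor and minimizes the nuclear norms of the mode-i frontal slices in the corresponding latent space for each direction, forming a 3-directional tensor nuclear norm regularization that fully exploits the low-rank property of the data. A learnable aggregation module then fuses the outputs of the three directional branches to further improve the result.
Link: https://arxiv.org/abs/2506.16735
Authors: Yunshan Li,Wenwu Gong,Qianqian Wang,Chao Wang,Lili Yang
Affiliations: Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: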
Abstract:Recent approaches based on transform-based tensor nuclear norm (TNN) have demonstrated notable effectiveness in hyperspectral image (HSI) inpainting by leveraging low-rank structures in latent representations. Recent developments incorporate deep transforms to improve low-rank tensor representation; however, existing approaches typically restrict the transform to the spectral mode, neglecting low-rank properties along other tensor modes. In this paper, we propose a novel 3-directional deep low-rank tensor representation (3DeepRep) model, which performs deep nonlinear transforms along all three modes of the HSI tensor. To enforce low-rankness, the model minimizes the nuclear norms of mode-i frontal slices in the corresponding latent space for each direction (i=1,2,3), forming a 3-directional TNN regularization. The outputs from the three directional branches are subsequently fused via a learnable aggregation module to produce the final result. An efficient gradient-based optimization algorithm is developed to solve the model in a self-supervised manner. Extensive experiments on real-world HSI datasets demonstrate that the proposed method achieves superior inpainting performance compared to existing state-of-the-art techniques, both qualitatively and quantitatively.
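[CV-52] TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion
[Quick read]: This paper addresses the effective integration and use of textual semantic information in infrared and visible image fusion (IVF), which remains insufficiently studied. The key to the solution is to introduce textual semantics at two levels, the mask semantic level and the text semantic level, both derived from textual descriptions extracted by large vision-language models (VLMs), and to use the proposed Textual Semantic Guidance framework (TeSG) to optimize the image synthesis process so that downstream tasks such as detection and segmentation perform better.
Link: https://arxiv.org/abs/2506.16730
Authors: Mingrui Zhu,Xiru Chen,Xin Wei,Nannan Wang,Xinbo Gao
Affiliations: Xidian University; Chongqing University of Post and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures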
Abstract:Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, we introduce textual semantics at two levels: the mask semantic level and the text semantic level, both derived from textual descriptions extracted by large Vision-Language Models (VLMs). Building on this, we propose Textual Semantic Guidance for infrared and visible image fusion, termed TeSG, which guides the image synthesis process in a way that is optimized for downstream tasks such as detection and segmentation. Specifically, TeSG consists of three core components: a Semantic Information Generator (SIG), a Mask-Guided Cross-Attention (MGCA) module, and a Text-Driven Attentional Fusion (TDAF) module. The SIG generates mask and text semantics based on textual descriptions. The MGCA module performs initial attention-based fusion of visual features from both infrared and visible images, guided by mask semantics. Finally, the TDAF module refines the fusion process with gated attention driven by text semantics. Extensive experiments demonstrate the competitiveness of our approach, particularly in terms of performance on downstream tasks, compared to existing state-of-the-art methods.
[CV-53] Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement ICMR2025
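[Quick Read]: This paper addresses the performance degradation of conventional Generalized Category Discovery (GCD) models when known-category information is scarce. The core challenge is improving recognition of unknown categories given limited labeled samples and only a few known categories. The key to its solution is a decision boundary enhancement framework with affinity-based retrieval: a decision boundary pre-training module mitigates overfitting, and a two-stage decision boundary optimization uses affinity-retrieved pseudo-labeled samples, so that the decision boundaries of known categories transfer effectively to unknown categories.
Link: https://arxiv.org/abs/2506.16728
Authors: Yunhan Ren,Feng Luo,Siyu Huang
Affiliations: Clemson University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICMR 2025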
Abstract:While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: this https URL
[CV-54] Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition
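[Quick Read]: This paper addresses recognizing actions in cluttered video action sequences captured from monocular views, particularly under heavy occlusion. Existing methods perform well by adapting large-scale pre-trained language-image models to the video domain, but they have not fully exploited the rich common sense priors in language models, i.e., the scene contexts humans use to understand objects, human-object interactions, and activities. The key to its solution is a framework incorporating language-driven common sense priors, with three core parts: (1) a video context summary component that generates candidate objects, activities, and their interactions; (2) a description generation module that describes the current scene given the context and infers subsequent activities; (3) a multi-modal activity recognition head that fuses visual and textual cues to recognize actions.
Link: https://arxiv.org/abs/2506.16701
Authors: Xiaodan Hu,Chuhang Zou,Suchen Wang,Jaechul Kim,Narendra Ahuja
Affiliations: University of Illinois at Urbana-Champaign; Amazon.com LLC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: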
Abstract:Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors - the scene contexts that humans use to constitute an understanding of objects, human-object interactions, and activities - that have not been fully exploited. In this paper, we introduce a framework incorporating language-driven common sense priors to identify cluttered video action sequences from monocular views that are often heavily occluded. We propose: (1) A video context summary component that generates candidate objects, activities, and the interactions between objects and activities; (2) A description generation module that describes the current scene given the context and infers subsequent activities, through auxiliary prompts and common sense reasoning; (3) A multi-modal activity recognition head that combines visual and textual cues to recognize video actions. We demonstrate the effectiveness of our approach on the challenging Action Genome and Charades datasets.
[CV-55] LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
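[Quick Read]: This paper addresses inefficient vision-language integration in Large Vision-Language Models (LVLMs): existing methods either disrupt the model's inherent structure or introduce a severe long-context computational burden, limiting scalability and efficiency. The key to its solution is LaVi, a new paradigm that achieves seamless and efficient vision-language fusion through internal feature modulation within the Large Language Model (LLM). Instead of concatenating visual tokens, which expands the context length, LaVi introduces a lightweight, adaptive transformation that injects token-wise vision-conditioned deltas into the affine parameters of layer normalization. This modulates the linguistic hidden states directly from the visual input, preserving linguistic priors while substantially reducing computational cost.
Link: https://arxiv.org/abs/2506.16691
Authors: Tongtian Yue,Longteng Guo,Yepeng Tang,Zijia Zhao,Xinxin Zhu,Hua Huang,Jing Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Information Science, Beijing Jiaotong University; School of Artificial Intelligence, Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: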
Abstract:Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model’s inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM’s linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half - establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
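To make the layer-normalization modulation described in the abstract more concrete, here is a minimal NumPy sketch. It assumes the deltas come from a linear projection of a per-token visual context vector; the abstract does not say how LaVi actually computes them, so `modulated_layer_norm`, `visual_context`, `W_gamma`, and `W_beta` are hypothetical names used only for illustration.

```python
import numpy as np

def modulated_layer_norm(hidden, gamma, beta, delta_gamma, delta_beta, eps=1e-5):
    """Layer norm whose affine parameters are shifted by per-token deltas."""
    mean = hidden.mean(axis=-1, keepdims=True)
    var = hidden.var(axis=-1, keepdims=True)
    normalized = (hidden - mean) / np.sqrt(var + eps)
    return (gamma + delta_gamma) * normalized + (beta + delta_beta)

rng = np.random.default_rng(0)
num_tokens, hidden_dim, visual_dim = 4, 8, 6

hidden = rng.normal(size=(num_tokens, hidden_dim))          # text hidden states
visual_context = rng.normal(size=(num_tokens, visual_dim))  # hypothetical per-token visual summary

# Hypothetical projections that turn visual context into deltas (assumption).
W_gamma = 0.01 * rng.normal(size=(visual_dim, hidden_dim))
W_beta = 0.01 * rng.normal(size=(visual_dim, hidden_dim))

gamma, beta = np.ones(hidden_dim), np.zeros(hidden_dim)

# Zero deltas recover ordinary layer normalization.
plain = modulated_layer_norm(hidden, gamma, beta, 0.0, 0.0)

# Vision-conditioned deltas change the output without adding any tokens.
modulated = modulated_layer_norm(
    hidden, gamma, beta, visual_context @ W_gamma, visual_context @ W_beta
)
```

Note that `modulated` has the same shape as `hidden`: the visual signal enters through the affine parameters, so the sequence length does not grow, which is the efficiency argument the abstract makes against visual token concatenation.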
[CV-56] DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches
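[Quick Read]: This paper addresses the limited effectiveness of adversarial attacks on stereo depth estimation systems deployed in the real world: in physical settings, earlier attacks built by repeating optimized textures perform poorly. The key to its solution is to insert regular intervals between the repeated textures, forming a striped structure that markedly improves patch attack effectiveness. By jointly optimizing the striped structure and the texture elements, the method produces adversarial patches that successfully attack state-of-the-art stereo depth estimation methods (such as RAFT-Stereo and STTR) as well as commercial RGB-D cameras (such as Intel RealSense).
Link: https://arxiv.org/abs/2506.16690
Authors: Yun Xing,Yue Cao,Nhat Chung,Jie Zhang,Ivor Tsang,Ming-Ming Cheng,Yang Liu,Lei Ma,Qing Guo
Affiliations: CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore; Nanyang Technological University, Singapore; University of Alberta, Canada; VCIP, CS, Nankai University; NKIARI, Shenzhen Futian; The University of Tokyo, Japan; VNU-HCM, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: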
Abstract:Stereo Depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous work has shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated texture structures perform poorly in physical-world implementations, i.e., when deployed as patches, limiting their practical utility for testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals between repeated textures, creating a striped structure, significantly enhances the patch attack effectiveness. Through extensive experimentation, we analyze how variations of this novel structure influence the performance. Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the striped structure and texture elements. Our generated adversarial patches can be inserted into any scenes and successfully attack state-of-the-art stereo depth estimation methods, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating their practical relevance for security assessment of stereo systems.
[CV-57] How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
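[Quick Read]: This paper studies how the quality and descriptiveness of training data affect text-to-image generation models. Given the noise and inconsistency of web-scraped datasets, it examines how design choices for synthetic training captions affect downstream performance. The key to its solution is a systematic evaluation of how different synthetic captioning strategies affect text alignment, output aesthetics, and diversity, providing empirical grounding for better training data strategies.
Link: https://arxiv.org/abs/2506.16679
Authors: Manuel Brack,Sudeep Katakol,Felix Friedrich,Patrick Schramowski,Hareesh Ravi,Kristian Kersting,Ajinkya Kale
Affiliations: Adobe Applied Research; Hessian.AI; TU Darmstadt; DFKI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: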
Abstract:Training data is at the core of any successful text-to-image models. The quality and descriptiveness of image text are crucial to a model’s performance. Given the noisiness and inconsistency in web-scraped datasets, recent works shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
[CV-58] Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge
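[Quick Read]: This paper addresses the high computational overhead, large parameter storage requirements, and low efficiency of the conventional pre-training and fine-tuning paradigm when pre-training CLIP models at different scales. The key to its solution is MM-LG (Multimodal Learngene), a framework that introduces a multimodal block to extract multimodal generalizable knowledge and separates the multimodal and unimodal generalizable components in a weighted-sum manner. These components are then used to numerically initialize descendant models of varying scales and modalities, which markedly reduces pre-training cost and improves performance.
Link: https://arxiv.org/abs/2506.16673
Authors: Ruiming Chen,Junming Yang,Shiyu Xia,Xu Yang,Jing Wang,Xin Geng
Affiliations: School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: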
Abstract:CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of a large number of parameters and large-scale pre-training poses challenges of pre-training a different scale of CLIP. Learngene extracts the generalizable components termed as learngene from an ancestry model and initializes diverse descendant models with it. Previous Learngene paradigms fail to handle the generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG’s effectiveness, which achieves performance gains over existing learngene approaches (e.g.,+3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g.,+1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage while reducing around 2.8 times pre-training costs for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.
[CV-59] A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques
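[Quick Read]: This paper addresses the need to reduce the dimensionality of high-dimensional image data before further analysis. At its core it compares two linear dimensionality reduction techniques, Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). The key to its solution is to derive both algorithms from first principles and analyze them in terms of interpretability, numerical stability, and suitability for different matrix shapes, yielding selection guidelines that do not rely on empirical benchmarking.
Link: https://arxiv.org/abs/2506.16663
Authors: Michael Gyimadu,Gregory Bell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: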
Abstract:High-dimensional image data often require dimensionality reduction before further analysis. This paper provides a purely analytical comparison of two linear techniques: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). After deriving each algorithm from first principles, we assess their interpretability, numerical stability, and suitability for differing matrix shapes. Building on classical and recent numerical literature, we synthesize rule-of-thumb guidelines for choosing between the two algorithms without empirical benchmarking. Limitations and directions for future experimental work are outlined at the end.
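As background for the comparison, the two techniques are closely linked: PCA computed from the covariance matrix and an SVD of the centered data matrix yield the same principal directions (up to sign) and the same explained variances. The NumPy sketch below checks this on synthetic data; the data, the variable names, and the feature scales are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features with clearly different variances.
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
X = rng.normal(size=(200, 5)) * scales
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Route 1: PCA via eigen-decomposition of the sample covariance matrix.
covariance = X_centered.T @ X_centered / (n_samples - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]  # eigh returns ascending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Route 2: SVD of the centered data, without forming the covariance matrix.
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_variances = singular_values**2 / (n_samples - 1)

# Explained variances agree, and principal directions agree up to sign.
variances_match = np.allclose(eigenvalues, svd_variances)
directions_match = np.allclose(np.abs(eigenvectors.T @ Vt.T), np.eye(5), atol=1e-6)
```

The SVD route never forms `X_centered.T @ X_centered`, which squares the condition number of the problem; this is a standard reason the SVD route is often considered the numerically safer way to compute PCA.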
[CV-60] CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity
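[Quick Read]: This paper addresses the ambiguity and vagueness of natural language instructions in robotic manipulation tasks, as well as the suboptimal performance of existing language-conditioned policies caused by their lack of modularity and interpretability. The key to its solution is a new robotic manipulation framework that uses a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generate task-specific code as an interpretable, executable intermediate representation. From this code, 3D attention maps that integrate spatial and semantic information are produced, effectively resolving ambiguities in the instructions.
Link: https://arxiv.org/abs/2506.16652
Authors: Guang Yin,Yitong Li,Yixuan Wang,Dale McConachie,Paarth Shah,Kunimatsu Hashimoto,Huan Zhang,Katherine Liu,Yunzhu Li
Affiliations: Columbia University; Toyota Research Institute; University of Illinois Urbana-Champaign; Tsinghua University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted to Robotics: Science and Systems (RSS) 2025. The first three authors contributed equally. Project Page: this https URL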
Abstract:Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction “Hang a mug on the mug tree” may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.
[CV-61] Leveraging CNN and IoT for Effective E-Waste Management
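[Quick Read]: This paper addresses the environmental and health risks that arise when electronic waste (e-waste) is inaccurately identified, categorized, and routed during processing. The key to its solution is an IoT-based system combined with a lightweight convolutional neural network (CNN) classification pipeline; by integrating a camera system and a digital weighing scale, it automates the classification of electronic items and makes e-waste processing more efficient and more intelligent.
Link: https://arxiv.org/abs/2506.16647
Authors: Ajesh Thangaraj Nadar,Gabriel Nixon Raj,Soham Chandane,Sushant Bhat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, published in 2023 7th International Conference on I-SMAC IoT in Social Mobile Analytics and Cloud. Conference held in Kirtipur Nepal from 11 to 13 October 2023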
Abstract:The increasing proliferation of electronic devices in the modern era has led to a significant surge in electronic waste (e-waste). Improper disposal and insufficient recycling of e-waste pose serious environmental and health risks. This paper proposes an IoT-enabled system combined with a lightweight CNN-based classification pipeline to enhance the identification, categorization, and routing of e-waste materials. By integrating a camera system and a digital weighing scale, the framework automates the classification of electronic items based on visual and weight-based attributes. The system demonstrates how real-time detection of e-waste components such as circuit boards, sensors, and wires can facilitate smart recycling workflows and improve overall waste processing efficiency.
[CV-62] FlatCAD: Fast Curvature Regularization of Neural SDFs for CAD Models
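[Quick Read]: This paper addresses the difficulty of enforcing developable, CAD-style behavior in neural signed-distance fields (SDFs) for geometric learning. Doing so usually relies on Gaussian curvature penalties that require full Hessian evaluation and second-order automatic differentiation, which are costly in memory and runtime. The key to its solution is a curvature proxy that regularizes only the mixed second-order term (Weingarten term), letting the two principal curvatures adapt freely to the data while suppressing unwanted warp. Two complementary instantiations compute the mixed derivative: a finite-difference proxy using four forward SDF evaluations, and an autodiff proxy using a single Hessian-vector product. Both avoid explicitly assembling the full Hessian, which improves efficiency.
Link: https://arxiv.org/abs/2506.16627
Authors: Haotian Yin,Aleksander Plocharski,Michal Jan Wlodarczyk,Mikolaj Kida,Przemyslaw Musialski
Affiliations: New Jersey Institute of Technology; Warsaw University of Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures, preprint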
Abstract:Neural signed-distance fields (SDFs) have become a versatile backbone for geometric learning, yet enforcing developable, CAD-style behavior still hinges on Gaussian curvature penalties that require full Hessian evaluation and second-order automatic differentiation, both of which are costly in memory and runtime. We present a curvature proxy that regularizes only the mixed second-order term (Weingarten term), allowing the two principal curvatures to adapt freely to data while suppressing unwanted warp. Two complementary instantiations realize this idea: (i) a finite-difference proxy that replaces each Hessian entry with four forward SDF evaluations and a single first-order gradient, and (ii) an autodiff proxy that computes the same mixed derivative via one Hessian-vector product, sidestepping explicit full Hessian assembly and remaining faster in practice. Both variants converge to the exact mixed second derivative, thus preserving the intended geometric bias without incurring full second-order graphs. On the ABC benchmarks, the proxies match or exceed the reconstruction fidelity of Hessian-based baselines while reducing GPU memory use and wall-clock time by a factor of two. Because the method is drop-in and framework-agnostic, it opens a practical path toward scalable, curvature-aware SDF learning for engineering-grade shape reconstruction.
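The finite-difference idea in the abstract can be illustrated with a standard four-point stencil for a mixed second derivative along two tangent directions. This is a sketch under assumptions: the abstract does not give the paper's exact stencil or tangent-frame construction, so `tangent_frame`, `mixed_second_derivative`, the step size, and the two test fields are all hypothetical. In a neural setting the gradient would come from first-order autodiff; here it is supplied analytically.

```python
import numpy as np

def tangent_frame(gradient):
    """Build two orthonormal tangent directions from an SDF gradient."""
    normal = gradient / np.linalg.norm(gradient)
    # Pick a helper axis that is not nearly parallel to the normal.
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    return u, v

def mixed_second_derivative(field, point, u, v, step=1e-3):
    """Estimate u^T H v from four field evaluations (central stencil)."""
    return (
        field(point + step * (u + v))
        - field(point + step * (u - v))
        - field(point - step * (u - v))
        + field(point - step * (u + v))
    ) / (4.0 * step**2)

def sphere_sdf(p):
    # Exact SDF of a unit sphere centered at the origin.
    return np.linalg.norm(p) - 1.0

def saddle_field(p):
    # Implicit saddle z = x * y; not a true SDF, used only to exercise the stencil.
    return p[2] - p[0] * p[1]

# Both test points have gradient (0, 0, 1), so they share one tangent frame.
u, v = tangent_frame(np.array([0.0, 0.0, 1.0]))

sphere_mixed = mixed_second_derivative(sphere_sdf, np.array([0.0, 0.0, 2.0]), u, v)
saddle_mixed = mixed_second_derivative(saddle_field, np.zeros(3), u, v)
```

In this frame the sphere gives a mixed term of approximately zero, while the saddle gives a mixed term of magnitude one, so a penalty on this single quantity responds to the twisted surface without assembling a full Hessian.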
[CV-63] MetaQAP – A Meta-Learning Approach for Quality-Aware Pretraining in Image Quality Assessment
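[Quick Read]: This paper addresses the challenges of Image Quality Assessment (IQA), in particular those caused by the subjectivity of human perception and the complexity of real-world image distortions. The key to its solution is MetaQAP, a new no-reference IQA model that improves performance by combining quality-aware pre-training, a quality-aware loss function, and a meta-learner. Specifically, convolutional neural networks (CNNs) are first pre-trained on a quality-aware dataset, a quality-aware loss function is then used to optimize predictions, and a meta-learner builds an ensemble that combines the predictions of multiple base models. Experiments report strong performance on several benchmark datasets, supporting the method's effectiveness and generalizability.
Link: https://arxiv.org/abs/2506.16601
Authors: Muhammad Azeem Aslam,Muhammad Hamza,Nisar Ahmed,Gulshan Saleem,Zhu Shuangtong,Hu Hongfei,Xu Wei,Saba Aslam,Wang Jun
Affiliations: Xi’an Eurasia University; Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences; Xidian University; University of Engineering and Technology Lahore; University of Central Punjab; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: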
Abstract:Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The model performs three key contributions: pre-training Convolutional Neural Networks (CNNs) on a quality-aware dataset, implementing a quality-aware loss function to optimize predictions, and integrating a meta-learner to form an ensemble model that effectively combines predictions from multiple base models. Experimental evaluations were conducted on three benchmark datasets: LiveCD, KonIQ-10K, and BIQ2021. The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing IQA methods. Cross-dataset evaluations further demonstrated the generalizability of the model, with PLCC and SROCC scores ranging from 0.6721 to 0.8023 and 0.6515 to 0.7805, respectively, across diverse datasets. The ablation study confirmed the significance of each model component, revealing substantial performance degradation when critical elements such as the meta-learner or quality-aware loss function were omitted. MetaQAP not only addresses the complexities of authentic distortions but also establishes a robust and generalizable framework for practical IQA applications. By advancing the state-of-the-art in no-reference IQA, this research provides valuable insights and methodologies for future improvements and extensions in the field.
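For readers unfamiliar with the two metrics reported above, here is a small NumPy sketch of how PLCC and SROCC are commonly computed between predicted quality scores and subjective scores. The toy numbers are made up for illustration, and the nonlinear (logistic) mapping that some IQA protocols apply before PLCC is omitted; whether MetaQAP uses such a mapping is not stated in the abstract.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))

# Toy example: predictions are monotonic in the subjective scores but not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
subjective = [1.0, 4.0, 9.0, 16.0, 25.0]

linear_agreement = plcc(predicted, subjective)
rank_agreement = srocc(predicted, subjective)
```

Here `rank_agreement` is 1.0 because the ordering is perfectly preserved, while `linear_agreement` is slightly lower (about 0.98) because the relationship is not linear. This is why IQA papers usually report both numbers.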
[CV-64] Spatially-Aware Evaluation of Segmentation Uncertainty CVPR2025
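[Quick Read]: This paper addresses the fact that existing uncertainty evaluation metrics for medical image segmentation ignore spatial context and anatomical structure, and therefore assign identical scores to qualitatively distinct uncertainty patterns (e.g., scattered vs. boundary-aligned). The key to its solution is three spatially aware metrics that incorporate structural and boundary information, improving the alignment of uncertainty maps with clinically important factors and better discriminating meaningful from spurious uncertainty patterns.
Link: https://arxiv.org/abs/2506.16589
Authors: Tal Zeevi,Eléonore V. Lieffrig,Lawrence H. Staib,John A. Onofrey
Affiliations: Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Performance (cs.PF); Machine Learning (stat.ML)
Comments: Presented at the 4th Workshop on Uncertainty Quantification for Computer Vision (CVPR 2025), June 11, 2025. This version is not included in the official proceedings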
Abstract:Uncertainty maps highlight unreliable regions in segmentation predictions. However, most uncertainty evaluation metrics treat voxels independently, ignoring spatial context and anatomical structure. As a result, they may assign identical scores to qualitatively distinct patterns (e.g., scattered vs. boundary-aligned uncertainty). We propose three spatially aware metrics that incorporate structural and boundary information and conduct a thorough validation on medical imaging data from the prostate zonal segmentation challenge within the Medical Segmentation Decathlon. Our results demonstrate improved alignment with clinically important factors and better discrimination between meaningful and spurious uncertainty patterns.
[CV-65] SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage
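[Quick Read]: This paper addresses the ethical and privacy concerns raised by relying on real patient facial videos for effective stroke triage in emergency settings. The key to its solution is SafeTriage, which uses a pretrained video motion transfer (VMT) model to map the facial motion characteristics of real patients onto synthetic identities, de-identifying patients while preserving the facial dynamics that matter for stroke diagnosis. To mitigate the distribution shift between pre-training videos of the normal population and test videos of the patient population, a conditional generative model performs visual prompt tuning, adapting the input space of the VMT model so that motion transfer stays accurate without fine-tuning the VMT backbone.
Link: https://arxiv.org/abs/2506.16578
Authors: Tongan Cai,Haomiao Ni,Wenchao Ma,Yuan Xue,Qian Ma,Rachel Leicht,Kelvin Wong,John Volpi,Stephen T.C. Wong,James Z. Wang,Sharon X. Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IPMI 2025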
Abstract:Effective stroke triage in emergency settings often relies on clinicians’ ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges – especially when training robust and generalizable models across institutions. To address these concerns, we propose SafeTriage, a novel method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. SafeTriage leverages a pretrained video motion transfer (VMT) model to map the motion characteristics of real patient faces onto synthetic identities. This approach retains diagnostically relevant facial dynamics without revealing the patients’ identities. To mitigate the distribution shift between normal population pre-training videos and patient population test videos, we introduce a conditional generative model for visual prompt tuning, which adapts the input space of the VMT model to ensure accurate motion transfer without needing to fine-tune the VMT model backbone. Comprehensive evaluation, including quantitative metrics and clinical expert assessments, demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage. Our evaluations also show that SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders.
[CV-66] Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control
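[Quick Read]: This paper addresses the loss of reliability in world model predictions when a robot encounters novel visual distractors, which causes downstream planning or action verification to fail. The key to its solution is a test-time strategy called Reimagination with Observation Intervention (ReOI). ReOI detects and removes the visual distractors that degrade in physically implausible ways, bringing the current observation closer to the training distribution; it then re-predicts future outcomes from the modified observation and reintroduces the distractors post hoc to preserve visual consistency, making action outcome predictions more reliable.
Link: https://arxiv.org/abs/2506.16565
Authors: Yuxin Chen,Jianglan Wei,Chenfeng Xu,Boyi Li,Masayoshi Tomizuka,Andrea Bajcsy,Ran Tian
Affiliations: University of California, Berkeley; Carnegie Mellon University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: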
Abstract:World models enable robots to “imagine” future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI “reimagines” future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3x in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.
[CV-67] From Semantic To Instance: A Semi-Self-Supervised Learning Approach
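[Quick Read]: This paper addresses the scarcity of annotated data for instance segmentation in agricultural images, especially scenes with densely packed, self-occluded objects, where conventional methods are limited by their need for large amounts of pixel-level annotation. The key to its solution is a semi-self-supervised learning approach built on GLMask, an image-mask representation that focuses the model on shape, texture, and pattern while reducing its dependence on color features, together with a pipeline that generates semantic segmentation and then converts it into instance-level segmentation, substantially improving instance segmentation performance.
Link: https://arxiv.org/abs/2506.16563
Authors: Keyhan Najafian,Farhad Maleki,Lingling Jin,Ian Stavness
Affiliations: University of Saskatchewan; University of Calgary
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: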
Abstract:Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, extensive effort is required to create large-scale datasets with pixel-level annotations of each object instance for developing instance segmentation models that restrict the use of deep learning in these areas. This challenge is more significant in images with densely packed, self-occluded objects, which are common in agriculture. To address this challenge, we propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We design GLMask, an image-mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline to generate semantic segmentation and then transform it into instance-level segmentation. The proposed approach substantially outperforms the conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general-purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This highlights that the utility of our proposed approach extends beyond precision agriculture and applies to other domains, specifically those with similar data characteristics.
[CV-68] How Hard Is Snow? A Paired Domain Adaptation Dataset for Clear and Snowy Weather: CADC+
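[Quick Read]: This paper addresses the under-explored effect of snowfall on 3D object detection performance, along with the shortcomings of existing datasets, which either lack sufficient labeled data in both snowy and clear conditions or rely on synthetic data that introduces domain shift. The key to its solution is CADC+, the first paired weather domain adaptation dataset for winter driving conditions: each snowy sequence is paired with clear-weather data recorded on the same roads and in the same period, minimizing domain shift from factors unrelated to snow.
Link: https://arxiv.org/abs/2506.16531
Authors: Mei Qi Tang,Sean Sedwards,Chengjie Huang,Krzysztof Czarnecki
Affiliations: University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE IV 2025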
Abstract:The impact of snowfall on 3D object detection performance remains underexplored. Conducting such an evaluation requires a dataset with sufficient labelled data from both weather conditions, ideally captured in the same driving environment. Current driving datasets with LiDAR point clouds either do not provide enough labelled data in both snowy and clear weather conditions, or rely on de-snowing methods to generate synthetic clear weather. Synthetic data often lacks realism and introduces an additional domain shift that confounds accurate evaluations. To address these challenges, we present CADC+, the first paired weather domain adaptation dataset for autonomous driving in winter conditions. CADC+ extends the Canadian Adverse Driving Conditions dataset (CADC) using clear weather data that was recorded on the same roads and in the same period as CADC. To create CADC+, we pair each CADC sequence with a clear weather sequence that matches the snowy sequence as closely as possible. CADC+ thus minimizes the domain shift resulting from factors unrelated to the presence of snow. We also present some preliminary results using CADC+ to evaluate the effect of snow on 3D object detection performance. We observe that snow introduces a combination of aleatoric and epistemic uncertainties, acting as both noise and a distinct data domain.
[CV-69] Subspace-Boosted Model Merging
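Quick read: This paper addresses the diminishing performance gains seen in model merging as the number of merged expert models grows; the underlying cause is rank collapse of the task vector space after repeated merging. The key to the solution is Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks, markedly improving merging efficacy. In addition, Higher-Order Generalized Singular Value Decomposition is used to further quantify task similarity, offering a new interpretable perspective on model merging.
Link: https://arxiv.org/abs/2506.16506
Authors: Ronald Skorobogat,Karsten Roth,Mariana-Iuliana Georgescu,Zeynep Akata
Affiliations: Technical University of Munich, School of Computation, Information and Technology; Helmholtz Munich; Munich Center for Machine Learning (MCML); Tübingen AI Center, University of Tübingen; Munich Data Science Institute (MDSI)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages (main + supp)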
Abstract:Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, merging an increasing number of specialized experts generally leads to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective, revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on vision benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to further quantify task similarity, offering a new interpretable perspective on model merging.
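The abstract above does not spell out how Subspace Boosting modifies the singular values, so the following is only a rough sketch of one plausible reading: decompose a merged task matrix with SVD and raise its small singular values to a floor so that more directions stay active. The function names, the `beta` cutoff rule, and the toy matrices are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def stable_rank(matrix):
    """Squared Frobenius norm over squared spectral norm: a soft measure of rank."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

def boost_subspace(matrix, beta=0.8):
    """Clamp small singular values up to a floor (hypothetical rule, not the paper's).

    The floor is the singular value at which the cumulative sum of the sorted
    singular values first reaches the fraction `beta` of their total.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    cumulative = np.cumsum(s) / s.sum()
    cutoff_index = min(int(np.searchsorted(cumulative, beta)), len(s) - 1)
    boosted = np.maximum(s, s[cutoff_index])
    # Scale column j of u by boosted[j], then recombine.
    return (u * boosted) @ vt

# Toy task vectors: one direction shared by all experts plus expert-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 1)) @ rng.normal(size=(1, 8))
task_vectors = [shared + 0.5 * rng.normal(size=(8, 8)) for _ in range(5)]
merged = sum(task_vectors)
boosted = boost_subspace(merged)
print(stable_rank(merged), stable_rank(boosted))
```

Because the largest singular value is left untouched and the others can only grow, the stable rank of the boosted matrix is never lower than that of the input.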
[CV-70] Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details
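Quick read: This paper aims to generate high-fidelity, richly detailed textured 3D assets, improving generation quality and realism in both shape and texture. The key to the solution is a two-stage pipeline with a new shape foundation model, LATTICE, trained with scaled high-quality datasets, model size, and compute, which generates sharp and detailed 3D shapes with precise image-3D alignment. For texture generation, a multi-view architecture extended from the Hunyuan3D 2.0 Paint model, combined with physically based rendering (PBR), significantly improves end-to-end texture generation.
Link: https://arxiv.org/abs/2506.16504
Authors: Zeqiang Lai,Yunfei Zhao,Haolin Liu,Zibo Zhao,Qingxiang Lin,Huiwen Shi,Xianghui Yang,Mingxin Yang,Shuhui Yang,Yifei Feng,Sheng Zhang,Xin Huang,Di Luo,Fan Yang,Fang Yang,Lifu Wang,Sicong Liu,Yixuan Tang,Yulin Cai,Zebin He,Tian Liu,Yuhong Liu,Jie Jiang,Linus,Jingwei Huang,Chunchao Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report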
Abstract:In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model, LATTICE, which is trained with scaled high-quality datasets, model size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping the mesh surface clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, it is upgraded with physically based rendering (PBR) via a novel multi-view architecture extended from the Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.
[CV-71] Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors
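Quick read: This paper addresses the security threat posed by face swapping in video streams, specifically the detection of visual artifacts that swapping algorithms introduce in challenging physical scenes. The key to the solution is to use data-driven models based on convolutional neural networks (CNNs) and to analyze their generalization across different data sources and swapping algorithms in order to identify and characterize occlusion-related visual cues. The results show, however, that robustly characterizing these cues across datasets is significantly difficult, which highlights the need for specialized detection strategies for such artifacts.
Link: https://arxiv.org/abs/2506.16497
Authors: Riccardo Ziglio,Cecilia Pasquini,Silvio Ranise
Affiliations: Center for Cybersecurity, Fondazione Bruno Kessler, Trento, Italy; Department of Mathematics, University of Trento, Italy; DIBRIS, University of Genoa, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 8 pages, 4 figures, workshop paper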
Abstract:Face swapping manipulations in video streams represent an increasing threat in remote video communications, due to advances in automated and real-time tools. Recent literature proposes to characterize and exploit visual artifacts introduced in video frames by swapping algorithms when dealing with challenging physical scenes, such as face occlusions. This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora (including a newly collected one) and analyzing generalization capabilities with respect to different acquisition sources and swapping algorithms. The results confirm excellent performance of general-purpose CNN architectures when operating within the same data source, but a significant difficulty in robustly characterizing occlusion-based visual cues across datasets. This highlights the need for specialized detection strategies to deal with such artifacts.
[CV-72] DT-UFC: Universal Large Model Feature Coding via Peaky-to-Balanced Distribution Transformation
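Quick read: This paper addresses universal feature coding across diverse large models. Prior work has mostly targeted task- or model-specific scenarios and has not systematically handled the differences in feature distributions between large models. The key challenge is that features extracted from different models are inherently diverse and distributionally incompatible: DINOv2 features, for example, are highly concentrated and peaky, while Stable Diffusion 3 (SD3) features are more dispersed and uniform. This heterogeneity severely hampers compression efficiency and cross-model generalization. To address it, the authors propose a learned peaky-to-balanced distribution transformation that reshapes highly skewed feature distributions into a common, balanced target space. The transformation is non-uniform, data-driven, and plug-and-play, aligning heterogeneous distributions without modifying downstream codecs, so that a universal codec trained on the balanced target distribution generalizes to features from different models and tasks.
Link: https://arxiv.org/abs/2506.16495
Authors: Changsheng Gao,Zijie Liu,Li Li,Dong Liu,Xiaoyan Sun,Weisi Lin
Affiliations: Nanyang Technological University; Xiamen University; University of Science and Technology of China
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: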
Abstract:Like image coding in visual data transmission, feature coding is essential for the distributed deployment of large models by significantly reducing transmission and storage overhead. However, prior studies have mostly targeted task- or model-specific scenarios, leaving the challenge of universal feature coding across diverse large models largely unaddressed. In this paper, we present the first systematic study on universal feature coding for large models. The key challenge lies in the inherently diverse and distributionally incompatible nature of features extracted from different models. For example, features from DINOv2 exhibit highly peaky, concentrated distributions, while those from Stable Diffusion 3 (SD3) are more dispersed and uniform. This distributional heterogeneity severely hampers both compression efficiency and cross-model generalization. To address this, we propose a learned peaky-to-balanced distribution transformation, which reshapes highly skewed feature distributions into a common, balanced target space. This transformation is non-uniform, data-driven, and plug-and-play, enabling effective alignment of heterogeneous distributions without modifying downstream codecs. With this alignment, a universal codec trained on the balanced target distribution can effectively generalize to features from different models and tasks. We validate our approach on three representative large models (LLaMA3, DINOv2, and SD3) across multiple tasks and modalities. Extensive experiments show that our method achieves notable improvements in both compression efficiency and cross-model generalization over task-specific baselines. All source code will be released for future research.
[CV-73] How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?
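Quick read: This paper addresses Online Episodic-Memory Video Question Answering (OEM-VQA): using off-the-shelf Multimodal Large Language Models (MLLMs), without additional training, to process streaming egocentric video and answer multiple-choice questions. The key to the solution is a lightweight textual memory pipeline: an MLLM descriptor module converts the streaming video into a textual memory of only a few kilobytes per minute, and an LLM reasoner module answers questions by querying that memory.
Link: https://arxiv.org/abs/2506.16450
Authors: Giuseppe Lando,Rosario Forte,Giovanni Maria Farinella,Antonino Furnari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: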
Abstract:We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10^4 to 10^5 times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice, and highlight directions of improvement for future research.
[CV-74] Structured Semantic 3D Reconstruction (S23DR) Challenge 2025 – Winning solution
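Quick read: This paper addresses the prediction of a house's 3D roof wireframe from a sparse point cloud and semantic segmentations. The key to the solution is a two-stage 3D deep learning method: vertex candidates are first identified from the COLMAP point cloud using Gestalt segmentations, and two PointNet-like models then refine and classify the vertices and predict the edges connecting them, enabling efficient 3D structure reconstruction.
Link: https://arxiv.org/abs/2506.16421
Authors: Jan Skvrna,Lukas Neumann
Affiliations: Czech Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: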
Abstract:This paper presents the winning solution for the S23DR Challenge 2025, which involves predicting a house’s 3D roof wireframe from a sparse point cloud and semantic segmentations. Our method operates directly in 3D, first identifying vertex candidates from the COLMAP point cloud using Gestalt segmentations. We then employ two PointNet-like models: one to refine and classify these candidates by analyzing local cubic patches, and a second to predict edges by processing the cylindrical regions connecting vertex pairs. This two-stage, 3D deep learning approach achieved a winning Hybrid Structure Score (HSS) of 0.43 on the private leaderboard.
[CV-75] Efficient Transformations in Deep Learning Convolutional Neural Networks
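Quick read: This paper addresses how to balance computational efficiency, energy consumption, and classification accuracy in convolutional neural networks (CNNs). The key to the solution is to integrate signal processing transforms, namely the Fast Fourier Transform (FFT), the Walsh-Hadamard Transform (WHT), and the Discrete Cosine Transform (DCT), into the ResNet50 model. Experiments show that introducing WHT markedly improves classification accuracy while lowering energy consumption, and the modified model that applies WHT in both early and late convolutional layers performs best.
Link: https://arxiv.org/abs/2506.16418
Authors: Berk Yilmaz,Daniel Fidel Harvey,Prajit Dhuri
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: All authors contributed equally to this work. 17 pages, 36 references, 10 figures, 1 appendix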
Abstract:This study investigates the integration of signal processing transformations – Fast Fourier Transform (FFT), Walsh-Hadamard Transform (WHT), and Discrete Cosine Transform (DCT) – within the ResNet50 convolutional neural network (CNN) model for image classification. The primary objective is to assess the trade-offs between computational efficiency, energy consumption, and classification accuracy during training and inference. Using the CIFAR-100 dataset (100 classes, 60,000 images), experiments demonstrated that incorporating WHT significantly reduced energy consumption while improving accuracy. Specifically, a baseline ResNet50 model achieved a testing accuracy of 66%, consuming an average of 25,606 kJ per model. In contrast, a modified ResNet50 incorporating WHT in the early convolutional layers achieved 74% accuracy, and an enhanced version with WHT applied to both early and late layers achieved 79% accuracy, with an average energy consumption of only 39 kJ per model. These results demonstrate the potential of WHT as a highly efficient and effective approach for energy-constrained CNN applications.
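The abstract does not say how the WHT is wired into ResNet50, so the sketch below shows only the transform itself. It is a minimal, unnormalized fast Walsh-Hadamard transform in natural (Hadamard) order; it needs only additions and subtractions, which is the usual reason the WHT is considered cheap. The function name `fwht` is an illustrative choice.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform (Hadamard order), unnormalized.

    `values` must have a power-of-two length; uses O(n log n) additions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

signal = [1, 0, 1, 0, 0, 1, 1, 0]
print(fwht(signal))  # → [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying `fwht` twice returns the input scaled by its length, so the inverse transform is the same routine followed by a division by `n`.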
[CV-76] Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks EMNLP2025
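Quick read: This paper addresses the insufficient robustness of Visual Document Understanding (VDU) systems under realistic adversarial perturbations. The key to the solution is the first unified framework for generating and evaluating multi-modal adversarial attacks on OCR-based VDU models. It covers six gradient-based layout attack scenarios and constrains the layout perturbation budget (e.g., IoU = 0.6) to keep attacks plausible, enabling a comprehensive assessment of model vulnerability.
Link: https://arxiv.org/abs/2506.16407
Authors: Dong Nguyen Tien,Dung D. Le
Affiliations: VinUniversity
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure, under review at EMNLP 2025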
Abstract:Visual Document Understanding (VDU) systems have achieved strong performance in information extraction by integrating textual, layout, and visual signals. However, their robustness under realistic adversarial perturbations remains insufficiently explored. We introduce the first unified framework for generating and evaluating multi-modal adversarial attacks on OCR-based VDU models. Our method covers six gradient-based layout attack scenarios, incorporating manipulations of OCR bounding boxes, pixels, and texts across both word and line granularities, with constraints on layout perturbation budget (e.g., IoU = 0.6) to preserve plausibility. Experimental results across four datasets (FUNSD, CORD, SROIE, DocVQA) and six model families demonstrate that line-level attacks and compound perturbations (BBox + Pixel + Text) yield the most severe performance degradation. Projected Gradient Descent (PGD)-based BBox perturbations outperform random-shift baselines in all investigated models. Ablation studies further validate the impact of layout budget, text modification, and adversarial transferability.
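To make the layout budget concrete, here is a minimal sketch of an IoU check between an original OCR box and its perturbed version. Reading "IoU = 0.6" as a lower bound on the overlap that a perturbed box must keep is an assumption, and `within_budget` is a hypothetical helper rather than part of the paper's framework.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def within_budget(original, perturbed, min_iou=0.6):
    """Accept a perturbed box only if it still overlaps the original enough."""
    return iou(original, perturbed) >= min_iou

word_box = (0, 0, 10, 10)
print(within_budget(word_box, (2, 0, 12, 10)))  # shift by 2: IoU = 80/120, accepted
print(within_budget(word_box, (3, 0, 13, 10)))  # shift by 3: IoU = 70/130, rejected
```

A gradient-based attack could project each candidate box back into the accepted region after every step; that projection is not shown here.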
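[CV-77] TrajSceneLLM: A Multimodal Perspective on Semantic GPS Trajectory Analysis
Quick read: This paper addresses the shortcomings of traditional methods in extracting deep semantic representations from GPS trajectory data and in incorporating contextual map information. The key to the solution is the TrajSceneLLM framework, which integrates visualized map images (encoding spatial context) with textual descriptions generated through large language model (LLM) reasoning (capturing temporal sequences and movement dynamics) to produce semantically rich trajectory scene embeddings. Paired with a simple multilayer perceptron (MLP) classifier, these embeddings significantly improve performance on Travel Mode Identification (TMI).
Link: https://arxiv.org/abs/2506.16401
Authors: Chunhou Ji,Qiumeng Li
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review for ACM SIGSPATIAL 2025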
Abstract:GPS trajectory data reveals valuable patterns of human mobility and urban dynamics, supporting a variety of spatial applications. However, traditional methods often struggle to extract deep semantic representations and incorporate contextual map information. We propose TrajSceneLLM, a multimodal perspective for enhancing semantic understanding of GPS trajectories. The framework integrates visualized map images (encoding spatial context) and textual descriptions generated through LLM reasoning (capturing temporal sequences and movement dynamics). Separate embeddings are generated for each modality and then concatenated to produce trajectory scene embeddings with rich semantic content which are further paired with a simple MLP classifier. We validate the proposed framework on Travel Mode Identification (TMI), a critical task for analyzing travel choices and understanding mobility behavior. Our experiments show that these embeddings achieve significant performance improvement, highlighting the advantage of our LLM-driven method in capturing deep spatio-temporal dependencies and reducing reliance on handcrafted features. This semantic enhancement promises significant potential for diverse downstream applications and future research in geospatial artificial intelligence. The source code and dataset are publicly available at: this https URL.
[CV-78] HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis
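Quick read: This paper addresses the inadequate modeling of semantic hierarchies in whole slide image (WSI) analysis caused by reliance on Euclidean embeddings. Existing methods try to exploit the natural hierarchy of WSIs (patches, regions, and slides), but they are limited by the expressive power of Euclidean space. The key to the solution is HyperPath, which maps the visual and textual features extracted by pathology vision-language foundation models into hyperbolic space, strengthens semantic coherence with a semantic hierarchy consistency loss and a modality alignment loss, and classifies using geodesic distance.
Link: https://arxiv.org/abs/2506.16398
Authors: Peixiang Huang,Yanyan Huang,Weiqin Zhao,Junjun He,Lequan Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: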
Abstract:Pathology is essential for cancer diagnosis, with multiple instance learning (MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural hierarchy – patches, regions, and slides – with distinct semantic associations. While some methods attempt to leverage this hierarchy for improved representation, they predominantly rely on Euclidean embeddings, which struggle to fully capture semantic hierarchies. To address this limitation, we propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies of WSIs in hyperbolic space, thereby enhancing WSI classification. Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space. We design an Angular Modality Alignment Loss to ensure robust cross-modal alignment, while a Semantic Hierarchy Consistency Loss further refines feature hierarchies through entailment and contradiction relationships and thus enhances semantic coherence. The classification is performed with geodesic distance, which measures the similarity between entities in the hyperbolic semantic hierarchy. This eliminates the need for linear classifiers and enables a geometry-aware approach to WSI analysis. Extensive experiments show that our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.
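Classification by geodesic distance can be sketched as a nearest-prototype rule in hyperbolic space. The abstract does not state which model of hyperbolic space is used; the sketch below assumes the unit Poincaré ball with curvature -1, and the prototype labels and coordinates are made up for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def classify(embedding, prototypes):
    """Return the label whose prototype is nearest in geodesic distance."""
    return min(prototypes, key=lambda label: poincare_distance(embedding, prototypes[label]))

# Hypothetical class prototypes (for example, embedded text descriptions).
prototypes = {"tumor": (0.6, 0.0), "normal": (-0.6, 0.0)}
print(classify((0.3, 0.1), prototypes))  # → tumor
```

No linear classifier is involved: the decision depends only on distances measured in the curved space, which matches the geometry-aware approach the abstract describes.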
[CV-79] CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset
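Quick read: This paper addresses the challenges of micro-gesture recognition in affective computing, which stem from the subtle, involuntary nature of the gestures and their low movement amplitude. The key to the solution is CLIP-MG, a CLIP-based architecture that incorporates human pose (skeleton) information through pose-guided semantic query generation and a gated multi-modal fusion mechanism, reaching a Top-1 accuracy of 61.82% on the iMiGUE dataset.
Link: https://arxiv.org/abs/2506.16385
Authors: Santosh Patapati,Trisanth Srinivasan,Amith Adiraju
Affiliations: Cyrion Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: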
Abstract:Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.
[CV-80] AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios
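Quick read: This paper addresses insufficient perception accuracy caused by occlusion in autonomous driving, and in particular the lack of high-quality datasets for collaborative perception between ground vehicles and unmanned aerial vehicles (UAVs). The key to the solution is AGC-Drive, the first large-scale real-world dataset for aerial-ground cooperative 3D perception. It contains a large volume of multi-view, multi-sensor data collected by multiple ground vehicles and one UAV, with detailed 3D bounding box annotations, to support research on vehicle-to-vehicle and vehicle-to-UAV collaborative perception.
Link: https://arxiv.org/abs/2506.16371
Authors: Yunhao Hou,Bochao Zou,Min Zhang,Ran Chen,Shangdong Yang,Yanmei Zhang,Junbao Zhuo,Siheng Chen,Jiansheng Chen,Huimin Ma
Affiliations: University of Science and Technology Beijing; Xiamen NEVC Advanced Electric Powertrain Technology Innovation Center; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: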
Abstract:By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. Most previous work focuses on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to the aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 120K LiDAR frames and 440K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 19.5% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 400 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at this https URL.
[CV-81] Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation
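Quick read: This paper addresses the high computational demands of Vision Transformers (ViTs) when processing large numbers of tokens, which limit their practical use in medical image analysis. The key to the solution is an adaptive prompt-guided pruning method: a prompt-based spatial prior ranks tokens by relevance, and low-relevance tokens are down-weighted so that only the relevant regions are propagated to subsequent stages, which allocates computation effectively and improves segmentation accuracy.
Link: https://arxiv.org/abs/2506.16369
Authors: Pallabi Dutta,Anubhab Maity,Sushmita Mitra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: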
Abstract:The high computational demands of Vision Transformers (ViTs), in processing a huge number of tokens, often constrain their practical application in analyzing medical images. This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low-relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy facilitates end-to-end training, maintains gradient flow, and improves segmentation accuracy by focusing computational resources on essential regions. The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens, thereby enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of approximately 35-55% of tokens, thus reducing the computational costs relative to the baselines. Cost-effective medical image processing, using our framework, facilitates real-time diagnosis by expanding its applicability in resource-constrained environments.
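The ranking and down-weighting steps can be illustrated with a small sketch. Everything here is an assumption for illustration: the Gaussian prior around a prompt point, the `keep_ratio` and `low_weight` parameters, and the function names are not taken from the paper, which learns its scoring in a data-driven way.

```python
import math

def prompt_relevance(grid_size, prompt_xy, sigma=1.5):
    """Score each patch token with a Gaussian bump centred on the prompt point."""
    px, py = prompt_xy
    return [
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
        for y in range(grid_size)
        for x in range(grid_size)
    ]

def soft_prune(tokens, scores, keep_ratio=0.5, low_weight=0.1):
    """Keep the top `keep_ratio` of tokens at full weight and scale the rest down."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [
        [value * (1.0 if i in keep else low_weight) for value in token]
        for i, token in enumerate(tokens)
    ]

tokens = [[1.0, 2.0] for _ in range(16)]           # 4x4 grid of toy patch tokens
scores = prompt_relevance(4, prompt_xy=(0, 0))     # prompt at the top-left patch
pruned = soft_prune(tokens, scores)
print(pruned[0], pruned[15])
```

Scaling tokens instead of deleting them keeps the operation differentiable, which is one way to preserve gradient flow during end-to-end training; an actual compute saving would additionally require skipping the down-weighted tokens in later stages.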
[CV-82] MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval ICMR2025
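Quick read: This paper addresses the efficient generation of binary hash codes for large-scale image retrieval, with the aim of improving retrieval efficiency and effectiveness. The key to the solution is MambaHash, a visual state space hashing model. Its backbone uses a stage-wise architecture in which a grouped Mamba operation performs multi-directional scanning to model local and global information, a channel interaction attention module enhances information communication across channels, and an adaptive feature enhancement module increases feature diversity and visual representation capability.
Link: https://arxiv.org/abs/2506.16353
Authors: Chao He,Hongxi Wei
Affiliations: Inner Mongolia University; Provincial Key Laboratory of Mongolian Information Processing Technology; National and Local Joint Engineering Research Center of Mongolian Information Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICMR2025. arXiv admin note: text overlap with arXiv:2405.07524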
Abstract:Deep image hashing aims to enable effective large-scale image retrieval by mapping the input images into simple binary hash codes through deep neural networks. More recently, Vision Mamba with linear time complexity has attracted extensive attention from researchers by achieving outstanding performance on various computer vision tasks. Nevertheless, the suitability of Mamba for large-scale image retrieval tasks still needs to be explored. Towards this end, we propose a visual state space hashing model, called MambaHash. Concretely, we propose a backbone network with stage-wise architecture, in which grouped Mamba operation is introduced to model local and global information by utilizing Mamba to perform multi-directional scanning along different groups of the channel. Subsequently, the proposed channel interaction attention module is used to enhance information communication across channels. Finally, we meticulously design an adaptive feature enhancement module to increase feature diversity and enhance the visual representation capability of the model. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and ImageNet. The experimental results demonstrate that, compared with the state-of-the-art deep hashing methods, our proposed MambaHash achieves good efficiency and superior performance in large-scale image retrieval tasks. Source code is available at this https URL
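The retrieval side of deep hashing is independent of the backbone and can be sketched briefly. The sketch assumes the common convention of binarizing real-valued network outputs by sign and ranking database items by Hamming distance; the abstract does not confirm these details for MambaHash, and the database entries are toy values.

```python
def to_hash_code(features):
    """Binarize a real-valued feature vector by sign and pack the bits into an int."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code

def hamming(code_a, code_b):
    """Number of bit positions in which two hash codes differ."""
    return bin(code_a ^ code_b).count("1")

def retrieve(query_code, database, top_k=2):
    """Return the `top_k` database keys closest to the query in Hamming distance."""
    return sorted(database, key=lambda name: hamming(query_code, database[name]))[:top_k]

database = {"cat": 0b1010, "dog": 0b1000, "car": 0b0101}
query = to_hash_code([0.3, -1.2, 0.8, -0.1])   # bits 1, 0, 1, 0
print(retrieve(query, database))  # → ['cat', 'dog']
```

Comparing codes needs only an XOR and a bit count, which is what makes binary hash codes attractive for large-scale retrieval.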
[CV-83] Watermarking Autoregressive Image Generation KR
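Quick read: This paper addresses embedding watermarks in the outputs of generative models to track their provenance; no prior work has watermarked autoregressive image generation models at the token level. The key challenge is the lack of reverse cycle-consistency (RCC): re-tokenizing generated image tokens significantly alters the token sequence and thereby erases the watermark. To address this and to make the method robust to common image transformations, neural compression, and removal attacks, the authors propose (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC and (ii) a complementary watermark synchronization layer.
Link: https://arxiv.org/abs/2506.16349
Authors: Nikola Jovanović,Ismail Labiad,Tomáš Souček,Martin Vechev,Pierre Fernandez
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL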
Abstract:Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In this work, we present the first such approach by adapting language model watermarking techniques to this setting. We identify a key challenge: the lack of reverse cycle-consistency (RCC), wherein re-tokenizing generated image tokens significantly alters the token sequence, effectively erasing the watermark. To address this and to make our method robust to common image transformations, neural compression, and removal attacks, we introduce (i) a custom tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a complementary watermark synchronization layer. As our experiments demonstrate, our approach enables reliable and robust watermark detection with theoretically grounded p-values.
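The abstract says the method adapts language model watermarking to image tokens but does not name the scheme. As a hedged illustration, the sketch below uses a generic "green list" token watermark: a keyed hash of the previous token marks a fraction `gamma` of the vocabulary as green, generation prefers green tokens, and detection computes a one-sided binomial p-value. The key, vocabulary size, and greedy sampler are hypothetical simplifications.

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5, key=b"demo-key"):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(key + f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64 < gamma

def generate_watermarked(length, vocab_size=1024, gamma=0.5, key=b"demo-key"):
    """Toy generator that always emits a green token; a real sampler would bias logits."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab_size) if is_green(prev, t, gamma, key)))
    return tokens

def detection_p_value(tokens, gamma=0.5, key=b"demo-key"):
    """Probability of seeing at least this many green tokens without a watermark."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b, gamma, key) for a, b in zip(tokens, tokens[1:]))
    return sum(math.comb(n, k) * gamma ** k * (1 - gamma) ** (n - k) for k in range(greens, n + 1))

watermarked = generate_watermarked(41)
print(detection_p_value(watermarked))        # 0.5 ** 40 for this toy generator
print(detection_p_value(list(range(41))))    # unwatermarked sequence
```

The reverse cycle-consistency problem shows up at detection time: if decoding an image and re-tokenizing it changes many tokens, the green count drops toward `gamma * n` and the p-value stops being significant.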
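[CV-84] Transparency Techniques for Neural Networks trained on Writer Identification and Writer Verification
Quick read: This paper addresses the lack of transparency of neural networks, which act as "black box" systems, in Writer Identification (WI) and Writer Verification (WV), with the aim of improving performance and reliability. The key to the solution is the first application of two transparency techniques in this domain: one provides pixel-level saliency maps, and the other provides point-specific saliency maps, revealing the characteristics the network attends to during identification and the similarities between images. The techniques are evaluated with deletion and insertion score metrics and compared against the areas forensic experts consider during identification; the results show that pixel-level saliency maps outperform point-specific saliency maps and are better suited to supporting forensic experts.
Link: https://arxiv.org/abs/2506.16331
Authors: Viktoria Pundy,Marco Peer,Florian Kleber
Affiliations: TU Wien
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: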
Abstract:Neural Networks are the state of the art for many tasks in the computer vision domain, including Writer Identification (WI) and Writer Verification (WV). The transparency of these “black box” systems is important for improvements of performance and reliability. For this work, two transparency techniques are applied to neural networks trained on WI and WV for the first time in this domain. The first technique provides pixel-level saliency maps, while the point-specific saliency maps of the second technique provide information on similarities between two images. The transparency techniques are evaluated using deletion and insertion score metrics. The goal is to support forensic experts with information on similarities in handwritten text and to explore the characteristics selected by a neural network for the identification process. For the qualitative evaluation, the highlights of the maps are compared to the areas forensic experts consider during the identification process. The evaluation results show that the pixel-wise saliency maps outperform the point-specific saliency maps and are suitable for the support of forensic experts.
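The abstract names deletion and insertion scores without defining them. The sketch below follows the definition commonly used in the saliency literature, which is an assumption about this paper: pixels are removed in order of decreasing saliency and the area under the model-score curve is measured, so a lower deletion score indicates a better map. The toy `model` is a stand-in for a trained network.

```python
def deletion_score(pixels, saliency, model, baseline=0.0):
    """Area under the model-score curve as pixels are removed, most salient first."""
    order = sorted(range(len(pixels)), key=lambda i: saliency[i], reverse=True)
    current = list(pixels)
    scores = [model(current)]
    for i in order:
        current[i] = baseline
        scores.append(model(current))
    # Trapezoidal rule; the x axis is the fraction of pixels removed (0 to 1).
    step = 1.0 / len(pixels)
    return sum((a + b) / 2 * step for a, b in zip(scores, scores[1:]))

def model(pixels):
    """Toy classifier score that depends mostly on the first pixel."""
    return 0.9 * pixels[0] + 0.1 * pixels[1]

image = [1.0, 1.0, 1.0, 1.0]
good_map = [0.9, 0.1, 0.0, 0.0]   # highlights the pixels the model actually uses
bad_map = [0.0, 0.0, 0.1, 0.9]    # highlights irrelevant pixels
print(deletion_score(image, good_map, model), deletion_score(image, bad_map, model))
```

The insertion score is the mirror image: start from the baseline image, add pixels back in saliency order, and prefer a higher area under the curve.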
[CV-85] Reliable Few-shot Learning under Dual Noises
Quick read: This paper addresses poor adaptation in task adaptation-based few-shot learning (FSL) in the open world, caused by the inevitable in-distribution (ID) and out-of-distribution (OOD) noise in both support and query samples. The key to the solution is the DETA++ framework. A Contrastive Relevance Aggregation (CoRA) module computes image and region weights for support samples, and a clean prototype loss and a noise entropy maximization loss are introduced for noise-robust task adaptation. A memory bank stores and refines clean regions for each class, and a Local Nearest Centroid Classifier (LocalNCC) makes query predictions more robust. An Intra-class Region Swapping (IntraSwap) strategy rectifies ID class prototypes, further strengthening robustness to the dual noises.
Link: https://arxiv.org/abs/2506.16330
Authors: Ji Zhang,Jingkuan Song,Lianli Gao,Nicu Sebe,Heng Tao Shen
Affiliations: Southwest Jiaotong University; University of Electronic Science and Technology of China; University of Trento; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures
Abstract:Recent advances in model pre-training give rise to task adaptation-based few-shot learning (FSL), where the goal is to adapt a pre-trained task-agnostic model for capturing task-specific knowledge with a few-labeled support samples of the target task. However, existing approaches may still fail in the open world due to the inevitable in-distribution (ID) and out-of-distribution (OOD) noise from both support and query samples of the target task. With limited support samples available, i) the adverse effect of the dual noises can be severely amplified during task adaptation, and ii) the adapted model can produce unreliable predictions on query samples in the presence of the dual noises. In this work, we propose DEnoised Task Adaptation (DETA++) for reliable FSL. DETA++ uses a Contrastive Relevance Aggregation (CoRA) module to calculate image and region weights for support samples, based on which a clean prototype loss and a noise entropy maximization loss are proposed to achieve noise-robust task adaptation. Additionally, DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearest Centroid Classifier (LocalNCC) is devised to yield noise-robust predictions on query samples. Moreover, DETA++ utilizes an Intra-class Region Swapping (IntraSwap) strategy to rectify ID class prototypes during task adaptation, enhancing the model’s robustness to the dual noises. Extensive experiments demonstrate the effectiveness and flexibility of DETA++.
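For readers unfamiliar with nearest centroid classification, here is a minimal sketch of the generic technique. It is not DETA++'s LocalNCC, which operates on clean regions kept in a memory bank. The cosine similarity measure and the two-dimensional toy features are assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_centroid(support, query):
    """support: {label: [feature vectors]}. Returns the label of the most similar prototype."""
    prototypes = {label: mean_vector(vs) for label, vs in support.items()}
    return max(prototypes, key=lambda label: cosine(prototypes[label], query))


# Hypothetical support features for a 2-way 2-shot task.
support = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
prediction = nearest_centroid(support, [0.8, 0.3])
print(prediction)
```

Because each prototype is a plain average, a single noisy support sample shifts it directly, which is one reason weighting or filtering support features matters in the few-shot setting.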
[CV-86] RealDriveSim: A Realistic Multi-Modal Multi-Task Synthetic Dataset for Autonomous Driving
[Quick read]: This paper addresses two problems: annotating large-scale datasets is too expensive to scale, and existing synthetic datasets are limited in scope, realism, and task coverage. The key to the solution is RealDriveSim, a realistic multi-modal synthetic dataset that supports popular 2D computer vision applications as well as their LiDAR counterparts and provides fine-grained annotations for up to 64 classes, achieving better results than existing synthetic benchmarks across multiple applications and domains.
Link: https://arxiv.org/abs/2506.16319
Authors: Arpit Jadon, Haoran Wang, Phillip Thomas, Michael Stanley, S. Nathaniel Cibik, Rachel Laurat, Omar Maher, Lukas Hoyer, Ozan Unal, Dengxin Dai
Affiliations: German Aerospace Center; Computer Vision Lab, Huawei Research Center Zurich; Max Planck Institute for Informatics; Parallel Domain; ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the IEEE Intelligent Vehicles Symposium (IV) 2025
Abstract:As perception models continue to develop, the need for large-scale datasets increases. However, data annotation remains far too expensive to effectively scale and meet the demand. Synthetic datasets provide a solution to boost model performance with substantially reduced costs. However, current synthetic datasets remain limited in their scope, realism, and are designed for specific tasks and applications. In this work, we present RealDriveSim, a realistic multi-modal synthetic dataset for autonomous driving that not only supports popular 2D computer vision applications but also their LiDAR counterparts, providing fine-grained annotations for up to 64 classes. We extensively evaluate our dataset for a wide range of applications and domains, demonstrating state-of-the-art results compared to existing synthetic benchmarks. The dataset is publicly available at this https URL.
[CV-87] Segment Anything for Satellite Imagery: A Strong Baseline and a Regional Dataset for Automatic Field Delineation
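[Quick read]: This paper addresses accurate mapping of agricultural field boundaries, which is needed for efficient agricultural operations. The key to the solution is a field delineation pipeline built on the Segment Anything Model (SAM), with a fine-tuning strategy that adapts SAM to the task. The work also describes a method for acquiring a complementary regional dataset that covers areas beyond existing sources, which is intended to improve generalization.
Link: https://arxiv.org/abs/2506.16318
Authors: Carmelo Scribano, Elena Govi, Paolo Bertellini, Simone Parisi, Giorgia Franchini, Marko Bertogna
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICIAP 2025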
Abstract:Accurate mapping of agricultural field boundaries is essential for the efficient operation of agriculture. Automatic extraction from high-resolution satellite imagery, supported by computer vision techniques, can avoid costly ground surveys. In this paper, we present a pipeline for field delineation based on the Segment Anything Model (SAM), introducing a fine-tuning strategy to adapt SAM to this task. In addition to using published datasets, we describe a method for acquiring a complementary regional dataset that covers areas beyond current sources. Extensive experiments assess segmentation accuracy and evaluate the generalization capabilities. Our approach provides a robust baseline for automated field delineation. The new regional dataset, known as ERAS, is now publicly available.
[CV-88] Learning Multi-scale Spatial-frequency Features for Image Denoising
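[Quick read]: This paper addresses shortcomings of existing image denoising methods in multi-scale representation and frequency-domain processing: current architectures mainly rely on a fixed single-input single-output Unet structure and ignore pixel-level multi-scale representations, and previous methods treat the frequency domain uniformly without accounting for the different characteristics of high-frequency and low-frequency noise. The key to the solution is a multi-scale adaptive dual-domain network (MADNet) that takes image pyramid inputs to restore noise-free results from low-resolution images, uses an adaptive spatial-frequency learning unit (ASFU) to separate high- and low-frequency information and let them interact, and adds a global feature fusion block in the skip connections to enhance multi-scale features.
Link: https://arxiv.org/abs/2506.16307
Authors: Xu Zhao, Chen Zhao, Xiantao Hu, Hongliang Zhang, Ying Tai, Jian Yang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: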
Abstract:Recent advancements in multi-scale architectures have demonstrated exceptional performance in image denoising tasks. However, existing architectures mainly depend on a fixed single-input single-output Unet architecture, ignoring the multi-scale representations of pixel level. In addition, previous methods treat the frequency domain uniformly, ignoring the different characteristics of high-frequency and low-frequency noise. In this paper, we propose a novel multi-scale adaptive dual-domain network (MADNet) for image denoising. We use image pyramid inputs to restore noise-free results from low-resolution images. In order to realize the interaction of high-frequency and low-frequency information, we design an adaptive spatial-frequency learning unit (ASFU), where a learnable mask is used to separate the information into high-frequency and low-frequency components. In the skip connections, we design a global feature fusion block to enhance the features at different scales. Extensive experiments on both synthetic and real noisy image datasets verify the effectiveness of MADNet compared with current state-of-the-art denoising approaches.
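The idea of using a mask to separate a signal into low- and high-frequency components can be sketched as follows. This is a simplified illustration, not the ASFU: it assumes a fixed binary mask (the paper's mask is learnable), a 1D signal instead of 2D feature maps, and a naive DFT written out in pure Python. The `cutoff` parameter and the test signal are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part (valid for conjugate-symmetric spectra)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_frequencies(signal, cutoff):
    """Split a real 1D signal into low- and high-frequency parts with a binary mask."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n - k carry the same frequency, so the mask is kept symmetric.
    mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([m * s for m, s in zip(mask, spectrum)])
    high = idft([(1.0 - m) * s for m, s in zip(mask, spectrum)])
    return low, high


n = 16
slow = [math.cos(2 * math.pi * t / n) for t in range(n)]            # frequency 1
fast = [0.5 * math.cos(2 * math.pi * 6 * t / n) for t in range(n)]  # frequency 6
signal = [a + b for a, b in zip(slow, fast)]
low, high = split_frequencies(signal, cutoff=2)
```

Because the mask and its complement sum to one, the two parts always add back to the input, so the split loses no information and each branch can be processed separately before being recombined.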
[CV-89] Wavelet-based Global Orientation and Surface Reconstruction for Point Clouds
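[Quick read]: This paper addresses surface reconstruction from unoriented point clouds, in particular sparse point clouds on which existing methods such as iWSR perform poorly. The key to the solution is a wavelet-based representation of the mollified indicator function: a modifying kernel function smooths out surface discontinuities, and the compact support of the wavelet basis together with the properties of the convolutional kernel is used to accelerate the computation of coefficients. The paper also proposes a new way to construct a divergence-free function field, which supplies additional homogeneous constraints that improve reconstruction quality and stability.
Link: https://arxiv.org/abs/2506.16299
Authors: Yueji Ma, Yanzun Meng, Dong Xiao, Zuoqiang Shi, Bin Wang
Affiliations: Tsinghua University; University of Science and Technology of China; Yanqi Lake Beijing Institute of Mathematical Sciences and Applications
Categories: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages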
Abstract:Unoriented surface reconstruction is an important task in computer graphics and has extensive applications. Based on the compact support of wavelet and orthogonality properties, classic wavelet surface reconstruction achieves good and fast reconstruction. However, this method can only handle oriented points. Despite some improved attempts for unoriented points, such as iWSR, these methods perform poorly on sparse point clouds. To address these shortcomings, we propose a wavelet-based method to represent the mollified indicator function and complete both the orientation and surface reconstruction tasks. We use the modifying kernel function to smoothen out discontinuities on the surface, aligning with the continuity of the wavelet basis function. During the calculation of coefficient, we fully utilize the properties of the convolutional kernel function to shift the modifying computation onto wavelet basis to accelerate. In addition, we propose a novel method for constructing the divergence-free function field and using them to construct the additional homogeneous constraints to improve the effectiveness and stability. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both orientation and reconstruction for sparse models. We align the matrix construction with the compact support property of wavelet basis functions to further accelerate our method, resulting in efficient performance on CPU. Our source codes will be released on GitHub.
[CV-90] SycnMapV2: Robust and Adaptive Unsupervised Segmentation
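[Quick read]: This paper addresses the difficulty existing AI algorithms have in maintaining accuracy under visual corruptions such as noise, weather changes, and blur. The key to the solution is SyncMapV2, which is based on a learning paradigm that combines self-organizing dynamical equations with concepts from random networks. It performs unsupervised segmentation without robust training, supervision, or loss functions, and it adapts online, giving it strong robustness and adaptability across many types of image degradation.
Link: https://arxiv.org/abs/2506.16297
Authors: Heng Zhang, Zikang Wan, Danilo Vasconcellos Vargas
Affiliations: Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: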
Abstract:Human vision excels at segmenting visual cues without the need for explicit training, and it remains remarkably robust even as noise severity increases. In contrast, existing AI algorithms struggle to maintain accuracy under similar conditions. Here, we present SyncMapV2, the first to solve unsupervised segmentation with state-of-the-art robustness. SyncMapV2 exhibits a minimal drop in mIoU, only 0.01%, under digital corruption, compared to a 23.8% drop observed in SOTA methods. This superior performance extends across various types of corruption: noise (7.3% vs. 37.7%), weather (7.5% vs. 33.8%), and blur (7.0% vs. 29.5%). Notably, SyncMapV2 accomplishes this without any robust training, supervision, or loss functions. It is based on a learning paradigm that uses self-organizing dynamical equations combined with concepts from random networks. Moreover, unlike conventional methods that require re-initialization for each new input, SyncMapV2 adapts online, mimicking the continuous adaptability of human vision. Thus, we go beyond the accurate and robust results, and present the first algorithm that can do all the above online, adapting to input rather than re-initializing. In adaptability tests, SyncMapV2 demonstrates near-zero performance degradation, which motivates and fosters a new generation of robust and adaptive intelligence in the near future.
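The robustness figures above are drops in mIoU (mean intersection over union). A minimal sketch of the metric on flattened label maps follows. It assumes predicted segment labels have already been aligned with ground-truth labels, a step that unsupervised methods typically need and that is not shown here. The label arrays are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two flattened label maps."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)


target = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
miou = mean_iou(pred, target, num_classes=3)
print(round(miou, 4))
```

Here the per-class IoUs are 1, 1/2, and 2/3, so the mean is 13/18. A robustness drop is then the difference between the mIoU on clean inputs and the mIoU on corrupted inputs.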
[CV-91] Fine-grained Image Retrieval via Dual-Vision Adaptation
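[Quick read]: This paper addresses the problem of learning discriminative visual representations for Fine-Grained Image Retrieval (FGIR). Existing methods typically either enforce pairwise similarity constraints in the semantic embedding space or add a localization sub-network and fine-tune the entire model, but they tend to overfit the training data and forget the knowledge gained from large-scale pre-training, which reduces generalization. The key to the proposed Dual-Vision Adaptation (DVA) approach is to guide a frozen pre-trained model toward FGIR through collaborative sample and feature adaptation, specifically Object-Perceptual Adaptation and In-Context Adaptation, which improve the model's perception of critical objects and features. A knowledge distillation mechanism then transfers the discriminative knowledge to the image encoder, keeping performance while reducing the number of learnable parameters.
Link: https://arxiv.org/abs/2506.16273
Authors: Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li
Affiliations: Nanjing University of Science and Technology; Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: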
Abstract:Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, such two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
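The abstract does not spell out its distillation objective, so the following is only a sketch of one widely used form of knowledge distillation: the KL divergence between temperature-softened teacher and student outputs. Treating the transfer as logit distillation, along with the temperature value and the example logits, is an assumption for illustration and may not match Discrimination Perception Transfer.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([0.0, 0.0, 0.0], teacher)
print(matched, mismatched)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a student network absorb the teacher's ranking of classes rather than only its top prediction.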
[CV-92] Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images
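[Quick read]: This paper addresses the limited spatial coverage and 3D displacement accuracy of conventional point cloud-based landslide monitoring methods, which typically rely on only geometric or only radiometric information and therefore yield sparse or non-3D displacement estimates. The key to the solution is a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds with co-registered RGB images: patch-level matches are built from both 3D geometry and 2D image features, then refined through geometric consistency checks and per-match rigid transformation estimation, yielding dense 3D displacement vector fields with high spatial coverage and high accuracy.
Link: https://arxiv.org/abs/2506.16265
Authors: Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser
Affiliations: ETH Zürich, Institute of Geodesy and Photogrammetry; Atlas optimization GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Geophysics (physics.geo-ph)
Comments: 20 pages, 16 figures. Preprint under peer review. Example data and code available at GitHub (this https URL)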
Abstract:Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at this https URL.
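The step of estimating a rigid transformation per match and reading a displacement from it can be sketched in a simplified setting. The example below works in 2D, where the least-squares rotation has a closed form. The paper works on 3D patches, where an SVD-based solver would normally be used. The point sets, the true motion, and the function names are all hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst points."""
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(p[0] for p in dst) / n
    cy_d = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        xs, ys, xd, yd = xs - cx_s, ys - cy_s, xd - cx_d, yd - cy_d
        dot += xs * xd + ys * yd
        cross += xs * yd - ys * xd
    theta = math.atan2(cross, dot)  # angle maximizing the alignment of centered points
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, (tx, ty)


def apply_rigid_2d(point, theta, t):
    c, s = math.cos(theta), math.sin(theta)
    return (c * point[0] - s * point[1] + t[0], s * point[0] + c * point[1] + t[1])


# Hypothetical patch observed in two epochs, related by a known rigid motion.
true_theta, true_t = 0.1, (0.5, -0.2)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
dst = [apply_rigid_2d(p, true_theta, true_t) for p in src]

theta, t = fit_rigid_2d(src, dst)
center = (0.75, 0.5)  # centroid of the source patch
moved = apply_rigid_2d(center, theta, t)
displacement = (moved[0] - center[0], moved[1] - center[1])
```

Applying the fitted transform to the patch center and subtracting the original position gives one displacement vector per matched patch, and repeating this over all patches produces a dense field.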
[CV-93] R3eVision: A Survey on Robust Rendering Restoration and Enhancement for 3D Low-Level Vision
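[Quick read]: This paper addresses the limited robustness of existing neural rendering methods, such as Neural Radiance Fields and 3D Gaussian Splatting, under real-world degradations such as noise, blur, low resolution, and weather-induced artifacts. The key idea is to extend classical 2D low-level vision tasks into 3D space, forming the field of 3D Low-Level Vision, and to formalize a degradation-aware rendering problem so that high-fidelity 3D reconstruction becomes possible under adverse conditions.
Link: https://arxiv.org/abs/2506.16262
Authors: Weeyoung Kwon, Jeahun Sung, Minkyu Jeon, Chanho Eom, Jihyong Oh
Affiliations: Chung-Ang University; Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Please visit our project page at this https URL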
Abstract:Neural rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved significant progress in photorealistic 3D scene reconstruction and novel view synthesis. However, most existing models assume clean and high-resolution (HR) multi-view inputs, which limits their robustness under real-world degradations such as noise, blur, low-resolution (LR), and weather-induced artifacts. To address these limitations, the emerging field of 3D Low-Level Vision (3D LLV) extends classical 2D Low-Level Vision tasks including super-resolution (SR), deblurring, weather degradation removal, restoration, and enhancement into the 3D spatial domain. This survey, referred to as R3eVision, provides a comprehensive overview of robust rendering, restoration, and enhancement for 3D LLV by formalizing the degradation-aware rendering problem and identifying key challenges related to spatio-temporal consistency and ill-posed optimization. Recent methods that integrate LLV into neural rendering frameworks are categorized to illustrate how they enable high-fidelity 3D reconstruction under adverse conditions. Application domains such as autonomous driving, AR/VR, and robotics are also discussed, where reliable 3D perception from degraded inputs is critical. By reviewing representative methods, datasets, and evaluation protocols, this work positions 3D LLV as a fundamental direction for robust 3D content generation and scene-level reconstruction in real-world environments.
[CV-94] FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models ICML25
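[Quick read]: This paper addresses the trade-off between performance and robustness in Federated Prompt Learning (FPL) for vision-language models, particularly under out-of-distribution (OOD) shifts, where in-distribution (ID) data heterogeneity among clients further limits reliability. The key to the solution is the Federated OOD-aware Context Optimization (FOCoOp) framework, which uses ID global prompts, local prompts, and OOD prompts to capture the diverse distributions among clients, adapts to OOD shifts through bi-level distributionally robust optimization, and calibrates global prompts, seemingly OOD prompts, and OOD prompts with semi-unbalanced optimal transport to improve discrimination consistency among clients.
Link: https://arxiv.org/abs/2506.16218
Authors: Xinting Liao, Weiming Liu, Jiaming Qian, Pengyang Zhou, Jiahe Xu, Wenjie Wang, Chaochao Chen, Xiaolin Zheng, Tat-Seng Chua
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML25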
Abstract:Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly in out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distributionally robust optimization. Additionally, FOCoOp improves the discrimination consistency among clients, i.e., calibrating global prompts, seemingly OOD prompts, and OOD prompts by semi-unbalanced optimal transport. The extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness of different OOD shifts. The project is available at GitHub.
[CV-95] VideoGAN-based Trajectory Proposal for Automated Vehicles
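[Quick read]: This paper addresses trajectory generation for increasingly automated road vehicles, in particular how to capture the complex, multimodal distribution of future trajectories. The key to the solution is a generative adversarial network (GAN) trained on low-resolution bird's-eye view (BEV) occupancy grid videos to generate physically realistic trajectories. Abstract trajectory data is extracted from the generated videos by single-frame object detection and frame-to-frame object matching. The GAN architecture is chosen for its faster training and inference compared with diffusion models, and the generated trajectories align with the distributions of spatial and dynamic parameters in real data from the Waymo Open Motion Dataset.
Link: https://arxiv.org/abs/2506.16209
Authors: Annajoyce Mariani, Kira Maag, Hanno Gottschalk
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: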
Abstract:Being able to generate realistic trajectory options is at the core of increasing the degree of automation of road vehicles. While model-driven, rule-based, and classical learning-based methods are widely used to tackle these tasks at present, they can struggle to effectively capture the complex, multimodal distributions of future trajectories. In this paper we investigate whether a generative adversarial network (GAN) trained on videos of bird’s-eye view (BEV) traffic scenarios can generate statistically accurate trajectories that correctly capture spatial relationships between the agents. To this end, we propose a pipeline that uses low-resolution BEV occupancy grid videos as training data for a video generative model. From the generated videos of traffic scenarios we extract abstract trajectory data using single-frame object detection and frame-to-frame object matching. We particularly choose a GAN architecture for the fast training and inference times with respect to diffusion models. We obtain our best results within 100 GPU hours of training, with inference times under 20 ms. We demonstrate the physical realism of the proposed trajectories in terms of distribution alignment of spatial and dynamic parameters with respect to the ground truth videos from the Waymo Open Motion Dataset.
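Frame-to-frame object matching, the step that turns per-frame detections into trajectories, can be sketched as follows. The abstract does not say which matching rule is used, so this example assumes greedy nearest-neighbour matching of object centres with a distance threshold. The detections and `max_dist` are hypothetical.

```python
import math


def match_detections(prev, curr, max_dist):
    """Greedy nearest-neighbour matching of object centres between two frames."""
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
    )
    used_prev, used_curr, matches = set(), set(), {}
    for d, i, j in pairs:  # closest pairs are assigned first
        if d <= max_dist and i not in used_prev and j not in used_curr:
            matches[i] = j
            used_prev.add(i)
            used_curr.add(j)
    return matches


def build_trajectories(frames, max_dist=2.0):
    """Link per-frame detections into trajectories (lists of centres)."""
    trajectories = [[c] for c in frames[0]]
    active = list(range(len(frames[0])))  # trajectory index of each detection in the previous frame
    for prev, curr in zip(frames, frames[1:]):
        matches = match_detections(prev, curr, max_dist)
        new_active = [None] * len(curr)
        for i, j in matches.items():
            trajectories[active[i]].append(curr[j])
            new_active[j] = active[i]
        for j, c in enumerate(curr):
            if new_active[j] is None:  # an unmatched detection starts a new trajectory
                trajectories.append([c])
                new_active[j] = len(trajectories) - 1
        active = new_active
    return trajectories


# Hypothetical detections: two vehicles, listed in a different order in each frame.
frames = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.0, 1.0), (1.0, 0.0)],
    [(2.0, 0.0), (10.0, 2.0)],
]
trajectories = build_trajectories(frames)
```

Greedy matching is simple and fast but can make suboptimal assignments in crowded scenes, where a globally optimal assignment method would be the usual alternative.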
[CV-96] FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation
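[Quick read]: This paper addresses low computational efficiency and insufficient information exploration in high-precision robotic manipulation. Existing diffusion-based policy learning methods are computationally inefficient at inference because of iterative denoising, and they do not fully exploit the potential of generative models for exploring information in 3D environments. The key to the solution is the FlowRAM framework, which uses generative models for region-aware perception, a Dynamic Radius Schedule for adaptive perception, state space models to integrate multimodal information with linear complexity, and conditional flow matching to simplify learning of action poses, which markedly improves inference speed and performance on high-precision tasks.
Link: https://arxiv.org/abs/2506.16201
Authors: Sen Wang, Le Wang, Sanping Zhou, Jingyi Tian, Jiayi Li, Haowen Sun, Wei Tang
Affiliations: Xi’an Jiaotong University; University of Illinois at Chicago
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: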
Abstract:Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we integrate state space models to integrate multimodal information, while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of the FlowRAM in the RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 time steps, significantly increasing inference speed.
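Conditional flow matching, which the abstract describes as regressing deterministic vector fields, can be sketched under a common assumption: samples travel along straight lines from noise to the target, so the regression target is simply their difference. The straight-line path, the three-dimensional `action`, and `ideal_velocity` (an analytic stand-in for a trained network) are illustrative assumptions, not FlowRAM's implementation.

```python
import random


def cfm_training_pair(noise, action, t):
    """Point on the straight path from noise to action at time t, and the velocity to regress."""
    x_t = [(1 - t) * a + t * b for a, b in zip(noise, action)]
    target_velocity = [b - a for a, b in zip(noise, action)]
    return x_t, target_velocity


def euler_sample(velocity_fn, start, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, h = list(start), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


action = [0.3, -0.2, 0.5]  # hypothetical target action pose


def ideal_velocity(x, t):
    # Stand-in for a trained network: the exact field that carries any point to `action` at t = 1.
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in action]
x_half, v_target = cfm_training_pair(noise, action, 0.5)
sample = euler_sample(ideal_velocity, noise, steps=4)
```

With straight paths the velocity is constant along each trajectory, so a handful of Euler steps is enough to reach the target. That is the intuition behind the small number of inference steps compared with iterative denoising.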
[CV-97] Integrating Generative Adversarial Networks and Convolutional Neural Networks for Enhanced Traffic Accidents Detection and Analysis
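[Quick read]: This paper addresses the problems of supervised monitoring and data scarcity in traffic accident detection systems. The key to the solution is combining Generative Adversarial Networks (GANs) for data synthesis with Convolutional Neural Networks (CNN) for model training, improving the accuracy and applicability of accident detection.
Link: https://arxiv.org/abs/2506.16186
Authors: Zhenghao Xi, Xiang Liu, Yaqi Liu, Yitong Cai, Yangyu Zheng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: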
Abstract:Accident detection using Closed Circuit Television (CCTV) footage is one of the most imperative features for enhancing transport safety and efficient traffic control. To this end, this research addresses the issues of supervised monitoring and data deficiency in accident detection systems by adapting excellent deep learning technologies. The motivation arises from rising statistics in the number of car accidents worldwide; this calls for innovation and the establishment of a smart, efficient and automated way of identifying accidents and calling for help to save lives. Addressing the problem of the scarcity of data, the presented framework joins Generative Adversarial Networks (GANs) for synthesizing data and Convolutional Neural Networks (CNN) for model training. Video frames for accidents and non-accidents are collected from YouTube videos, and we perform resizing, image enhancement and image normalisation pixel range adjustments. Three models are used: CNN, Fine-tuned Convolutional Neural Network (FTCNN) and Vision Transformer (VIT) worked best for detecting accidents from CCTV, obtaining an accuracy rate of 94% and 95%, while the CNN model obtained 88%. Such results show that the proposed framework suits traffic safety applications due to its high real-time accident detection capabilities and broad-scale applicability. This work lays the foundation for intelligent surveillance systems in the future for real-time traffic monitoring, smart city framework, and integration of intelligent surveillance systems into emergency management systems.
[CV-98] Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization
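[Quick read]: This paper addresses Multi-source Synsemantic Domain Generalization (MSSDG) for multi-task remote physiological measurement, together with the challenge of Test-Time Personalized Adaptation (TTPA). The key to the solution is a unified framework (GAP) that uses priors to disentangle face-video information into invariant semantics, individual bias, and noise, and that applies modules combining priors and observations at different stages and to different facial information, so that MSSDG and TTPA can both be handled with minimal adjustments.
Link: https://arxiv.org/abs/2506.16160
Authors: Jiyao Wang, Xiao Yang, Hao Lu, Dengbo He, Kaishun Wu
Affiliations: Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: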
Abstract:Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these metrics and attracts increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, the test-time personalized adaptation (TTPA) after MSSDG is also worth exploring, while the gap between previous generalization and personalization methods is significant and hard to fuse. Thus, we proposed a unified framework for MSSDG and TTPA employing Priors (GAP) in biometrics and remote photoplethysmography (rPPG). We first disentangled information from face videos into invariant semantics, individual bias, and noise. Then, multiple modules incorporating priors and our observations were applied in different stages and for different facial information. Then, based on the different principles of achieving generalization and personalization, our framework could simultaneously address MSSDG and TTPA under multi-task remote physiological estimation with minimal adjustments. We expanded the MSSDG benchmark to the TTPA protocol on six publicly available datasets and introduced a new real-world driving dataset with complete labeling. Extensive experiments that validated our approach, and the codes along with the new dataset will be released.
[CV-99] Co-Speech Gesture and Facial Expression Generation for Non-Photorealistic 3D Characters SIGGRAPH2025
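[Quick read]: This paper addresses the fact that existing research mostly focuses on photorealistic avatars and overlooks the distinctive emotional expressions of non-photorealistic characters, such as anime characters. The key to the solution is using expression data extracted from comics together with dialogue-specific semantic gestures to produce exaggerated emotional expressions that suit non-photorealistic characters.
Link: https://arxiv.org/abs/2506.16159
Authors: Taisei Omine, Naoyuki Kawabata, Fuminori Homma (all Sony Group Corporation)
Affiliations: Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH 2025 Poster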
Abstract:With the advancement of conversational AI, research on bodily expressions, including gestures and facial expressions, has also progressed. However, many existing studies focus on photorealistic avatars, making them unsuitable for non-photorealistic characters, such as those found in anime. This study proposes methods for expressing emotions, including exaggerated expressions unique to non-photorealistic characters, by utilizing expression data extracted from comics and dialogue-specific semantic gestures. A user study demonstrated significant improvements across multiple aspects when compared to existing research.
[CV-100] MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models
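[Quick read]: This paper addresses the robustness of Referring Expression Segmentation (RES) models against adversarial examples, in particular that existing adversarial attack methods adapt poorly to the multimodal structure of RES and generalize poorly across textual inputs. The key to the solution is a new strategy called Multimodal Bidirectional Attack, which introduces a learnable proxy textual embedding perturbation and jointly performs visual-aligned optimization and textual-adversarial optimization during attack generation, improving the cross-text transferability of adversarial examples.
Link: https://arxiv.org/abs/2506.16157
Authors: Xingbai Chen, Tingchao Fu, Renyang Liu, Wei Zhou, Chao Yi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5pages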
Abstract:Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions, offering high flexibility and broad applicability in real-world vision tasks. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. While prior adversarial attack methods have explored adversarial robustness on conventional segmentation models, they perform poorly when directly applied to RES, failing to expose vulnerabilities in its multimodal structure. Moreover, in practical open-world scenarios, users typically issue multiple, diverse referring expressions to interact with the same image, highlighting the need for adversarial examples that generalize across varied textual inputs. To address these multimodal challenges, we propose a novel adversarial attack strategy termed Multimodal Bidirectional Attack, tailored for RES models. Our method introduces learnable proxy textual embedding perturbation and jointly performs visual-aligned optimization on the image modality and textual-adversarial optimization on the textual modality during attack generation. This dual optimization framework encourages adversarial images to actively adapt to more challenging text embedding during optimization, thereby enhancing their cross-text transferability, which refers to the ability of adversarial examples to remain effective under a variety of unseen or semantically diverse textual inputs. Extensive experiments conducted on multiple RES models and benchmark datasets demonstrate the superior effectiveness of our method compared to existing methods.
[CV-101] Neurosymbolic Object-Centric Learning with Distant Supervision
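[Quick read]: This paper addresses how to learn generalizable object-centric representations from raw unstructured perceptual data without explicit object-level supervision or a predefined decomposition into objects. The key to the solution is a neurosymbolic framework, DeepObjectLog, which combines a perceptual module with a symbolic reasoning layer based on probabilistic logic programming. It discovers and models objects using only distant supervision, and the symbolic component provides a new learning signal that guides the discovery of meaningful objects.
Link: https://arxiv.org/abs/2506.16129
Authors: Stefano Colamonaco, David Debot, Giuseppe Marra
Affiliations: KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: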
Abstract:Relational learning enables models to generalize across structured domains by reasoning over objects and their interactions. While recent advances in neurosymbolic reasoning and object-centric learning bring us closer to this goal, existing systems rely either on object-level supervision or on a predefined decomposition of the input into objects. In this work, we propose a neurosymbolic formulation for learning object-centric representations directly from raw unstructured perceptual data and using only distant supervision. We instantiate this approach in DeepObjectLog, a neurosymbolic model that integrates a perceptual module, which extracts relevant object representations, with a symbolic reasoning layer based on probabilistic logic programming. By enabling sound probabilistic logical inference, the symbolic component introduces a novel learning signal that further guides the discovery of meaningful objects in the input. We evaluate our model across a diverse range of generalization settings, including unseen object compositions, unseen tasks, and unseen number of objects. Experimental results show that our method outperforms neural and neurosymbolic baselines across the tested settings.
[CV-102] FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
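[Quick read]: This paper tackles the difficulty of achieving high temporal consistency in video generation, particularly with diffusion models. Existing methods such as FreeInit narrow the training-inference gap by iteratively refining the initial noise, but this significantly increases computational cost. The key to the proposed solution is FastInit, a fast noise initialization method built around a learned Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt and produces refined noise in a single forward pass, improving both efficiency and temporal consistency without iterative refinement.
Link: https://arxiv.org/abs/2506.16119
Authors: Chengyu Bai,Yuming Li,Zhongyu Zhao,Jintao Chen,Peidong Jia,Qi She,Ming Lu,Shanghang Zhang
Affiliations: Peking University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: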
Abstract:Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.
[CV-103] AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models
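[Quick read]: This paper addresses the inefficiency and limited effectiveness of manually designed visual prompts for large vision-language models (LVLMs), which lead to sub-optimal model performance. The key to its solution is AutoV, which learns to select the optimal visual prompt automatically: a pre-trained LVLM evaluates and ranks a variety of candidate visual prompts, and the ranking is then used as the supervision signal to train AutoV to choose the best visual prompt for a given textual query and input image.
Link: https://arxiv.org/abs/2506.16112
Authors: Yuan Zhang,Chun-Kai Fan,Tao Huang,Ming Lu,Sicheng Yu,Junwen Pan,Kuan Cheng,Qi She,Shanghang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages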
Abstract:Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV that learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves 1.7% accuracy gain on LLaVA^Wild, and AutoV boosts Qwen2.5-VL by 1.9% on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.
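The labeling step described in the abstract (rank candidate visual prompts by the LVLM's prediction loss, then use the ranking as supervision) can be sketched roughly as follows. This is only an illustration: the function names, candidate names, and loss values are hypothetical, and the abstract does not say how AutoV consumes the ranking during training.

```python
def rank_visual_prompts(losses):
    """Return candidate names ordered from lowest to highest prediction loss."""
    return sorted(losses, key=losses.get)


def build_ranking_labels(losses):
    """Map each candidate to its rank (0 = lowest loss), usable as a training label."""
    ranking = rank_visual_prompts(losses)
    return {name: rank for rank, name in enumerate(ranking)}


# Hypothetical losses a pre-trained LVLM might assign to three candidate prompts
# for one (image, query) pair; the numbers are made up for illustration.
example_losses = {
    "original": 2.31,
    "attention_heatmap": 1.87,
    "cropped_region": 2.05,
}

labels = build_ranking_labels(example_losses)
best = rank_visual_prompts(example_losses)[0]
```

Here `best` is the candidate with the lowest loss, and `labels` is the per-candidate rank that a selector model could be trained to reproduce.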
[CV-104] PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning
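[Quick read]: This paper targets the joint optimization of event localization and caption generation in dense video captioning. Existing methods rely on Transformer architectures that learn event locations and semantics implicitly, which requires large amounts of training data and limits performance in practice. The key to its solution is PR-DETR, a framework that injects explicit position and relation priors into the detection transformer to improve localization accuracy and caption quality. Specifically, it first generates position-anchored queries as a position prior that supplies scene-specific position and semantic information, and then uses an event relation encoder that explicitly computes relationships between event boundaries as a relation prior to improve the semantic coherence of the captions.
Link: https://arxiv.org/abs/2506.16082
Authors: Yizhe Li,Sanping Zhou,Zheng Qin,Le Wang
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: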
Abstract:Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locations and event semantics, which requires a large amount of training data and limits the model’s performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer to improve the localization accuracy and caption quality, simultaneously. On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information about potential events as position prior, which serves as the initial event search regions to eliminate the implausible event proposals. On the other hand, we further design an event relation encoder to explicitly calculate the relationship between event boundaries as relation prior to guide the event interaction to improve the semantic coherence of the captions. Extensive ablation studies are conducted to verify the effectiveness of the position and relation prior. Experimental results also show the competitive performance of our method on ActivityNet Captions and YouCook2 datasets.
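[CV-105] TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading
[Quick read]: This paper addresses the loss of information about continuous lip movements in word-level lipreading caused by blind spots in the receptive field, which degrades the modeling of complex temporal representations. The key to its solution is TD3Net, a backend architecture that combines dense skip connections with multi-dilated temporal convolutions. By applying different dilation factors to skip-connected features, it builds a wide and dense receptive field without blind spots, preserving temporal continuity while exploiting diverse temporal features.
Link: https://arxiv.org/abs/2506.16073
Authors: Byung Hoon Lee,Wooseok Shin,Sung Won Han
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures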
Abstract:The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository: this https URL
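To make the notion of receptive-field "blind spots" concrete, the sketch below enumerates which input time offsets a stack of 1D dilated convolutions can reach. It is a generic illustration of dilated convolutions, not TD3Net's architecture: the kernel size, the dilation schedules, and the function names are assumptions chosen for the example.

```python
def receptive_offsets(dilations, kernel_size=3):
    """Input offsets (relative to the output position) reachable by a stack
    of 1D convolutions with the given dilation factors."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        taps = [k * d for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets


def blind_spots(dilations, kernel_size=3):
    """Offsets inside the receptive-field span that no path ever touches."""
    offsets = receptive_offsets(dilations, kernel_size)
    lo, hi = min(offsets), max(offsets)
    return sorted(set(range(lo, hi + 1)) - offsets)


dense = blind_spots([1, 2, 4])   # dilations 1, 2, 4 cover every offset in [-7, 7]
sparse = blind_spots([2, 4])     # skipping dilation 1 leaves all odd offsets unseen
```

With a kernel size of 3, the schedule `[1, 2, 4]` leaves no gaps, while `[2, 4]` never sees odd offsets. Gaps of that kind are what the abstract refers to as blind spots.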
[CV-106] STAR-Pose: Efficient Low-Resolution Video Human Pose Estimation via Spatial-Temporal Adaptive Super-Resolution
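[Quick read]: This paper addresses human pose estimation (HPE) in low-resolution videos. Conventional methods either depend on high-quality inputs or use computationally expensive cascaded pipelines, which limits their use in resource-constrained environments. The key to its solution is the STAR-Pose framework, which features a spatial-temporal Transformer with LeakyReLU-modified linear attention to capture long-range temporal dependencies efficiently, an adaptive fusion module that enhances local texture features, and a pose-aware compound loss that steers the network toward structural features that benefit keypoint localization rather than purely optimizing visual quality.
Link: https://arxiv.org/abs/2506.16061
Authors: Yucheng Jin,Jinyan Chen,Ziyue He,Baojun Han,Furan An
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 3 figures; already submitted to PRCV 2025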
Abstract:Human pose estimation in low-resolution videos presents a fundamental challenge in computer vision. Conventional methods either assume high-quality inputs or employ computationally expensive cascaded processing, which limits their deployment in resource-constrained environments. We propose STAR-Pose, a spatial-temporal adaptive super-resolution framework specifically designed for video-based human pose estimation. Our method features a novel spatial-temporal Transformer with LeakyReLU-modified linear attention, which efficiently captures long-range temporal dependencies. Moreover, it is complemented by an adaptive fusion module that integrates a parallel CNN branch for local texture enhancement. We also design a pose-aware compound loss to achieve task-oriented super-resolution. This loss guides the network to reconstruct structural features that are most beneficial for keypoint localization, rather than optimizing purely for visual quality. Extensive experiments on several mainstream video HPE datasets demonstrate that STAR-Pose outperforms existing approaches. It achieves up to 5.2% mAP improvement under extremely low-resolution (64x48) conditions while delivering 2.8x to 4.4x faster inference than cascaded approaches.
[CV-107] Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
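[Quick read]: This paper addresses the fact that existing test sets for open-vocabulary segmentation cannot effectively measure a model's understanding of "open-vocabulary" concepts, because their semantic space closely resembles the training space and lacks diversity. To address this, the authors propose a new benchmark, OpenBench, whose semantic space differs significantly from the training semantics and therefore assesses more realistically how well a model understands and segments a wide range of real-world concepts. The key to the accompanying method (named OVSNet in the abstract) is an elaborate fusion of heterogeneous features together with a cost-free expansion of the training space, which improves segmentation in diverse, open scenarios and yields state-of-the-art results on both existing datasets and OpenBench.
Link: https://arxiv.org/abs/2506.16058
Authors: Yong Liu,SongLi Wu,Sule Bai,Jiahao Wang,Yitong Wang,Yansong Tang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; The University of Hong Kong; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: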
Abstract:Open-vocabulary segmentation aims to achieve segmentation of arbitrary categories given unlimited text inputs as guidance. To achieve this, recent works have focused on developing various technical routes to exploit the potential of large-scale pre-trained vision-language models and have made significant progress on existing benchmarks. However, we find that existing test sets are limited in measuring the models’ comprehension of "open-vocabulary" concepts, as their semantic space closely resembles the training space, even with many overlapping categories. To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. It is designed to better assess the model’s ability to understand and segment a wide range of real-world concepts. When testing existing methods on OpenBench, we find that their performance diverges from the conclusions drawn on existing test sets. In addition, we propose a method named OVSNet to improve the segmentation performance for diverse and open scenarios. Through elaborate fusion of heterogeneous features and cost-free expansion of the training space, OVSNet achieves state-of-the-art results on both existing datasets and our proposed OpenBench. The corresponding analyses demonstrate the soundness and effectiveness of our proposed benchmark and method.
[CV-108] PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
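[Quick read]: This paper addresses the high memory and computational cost caused by the quadratic complexity of attention in visual generation, especially for the long token sequences needed for high-resolution images or multi-frame video. The key to its solution is to reorganize the attention pattern to ease the difficulties of low density and low bitwidth, rather than introducing specialized sparsification and quantization designs. The paper proposes **Pattern-Aware token ReOrdering (PARO)**, which unifies diverse attention patterns into a hardware-friendly block-wise pattern and thereby substantially simplifies and improves both sparsification and quantization.
Link: https://arxiv.org/abs/2506.16054
Authors: Tianchen Zhao,Ke Hong,Xinhao Yang,Xuefeng Xiao,Huixia Li,Feng Ling,Ruiqi Xie,Siqi Chen,Hongyu Zhu,Yichong Zhang,Yu Wang
Affiliations: Tsinghua University; ByteDance Seed
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: project page: this https URL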
Abstract:In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs, especially for longer token sequences required in high-resolution image or multi-frame video generation. To address this, prior research has explored techniques such as sparsification and quantization. However, these techniques face significant challenges under low density and reduced bitwidths. Through systematic analysis, we identify that the core difficulty stems from the dispersed and irregular characteristics of visual attention patterns. Therefore, instead of introducing specialized sparsification and quantization design to accommodate such patterns, we propose an alternative strategy: reorganizing the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel Pattern-Aware token ReOrdering (PARO) technique, which unifies the diverse attention patterns into a hardware-friendly block-wise pattern. This unification substantially simplifies and enhances both sparsification and quantization. We evaluate the performance-efficiency trade-offs of various design choices and finalize a methodology tailored for the unified pattern. Our approach, PAROAttention, achieves video and image generation with lossless metrics, and nearly identical results from full-precision (FP) baselines, while operating at notably lower density (~20%-30%) and bitwidth (INT8/INT4), achieving a 1.9x to 2.7x end-to-end latency speedup.
[CV-109] Noise Fusion-based Distillation Learning for Anomaly Detection in Complex Industrial Environments IROS2025
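[Quick read]: This paper addresses accurate detection and localization of workpiece defects in complex, unstructured industrial environments with varying views, poses, and illumination. The key to its solution is a new method based on a collaborative distillation heterogeneous teacher network (HetNet), combined with an adaptive local-global feature fusion module and a local multivariate Gaussian noise generation module, which models the complex feature distribution of normal patterns and makes anomaly detection more robust.
Link: https://arxiv.org/abs/2506.16050
Authors: Jiawen Yu,Jieji Ren,Yang Chang,Qiaojun Yu,Xuan Tong,Boyang Wang,Yan Song,You Li,Xinji Mai,Wenqiang Zhang
Affiliations: Fudan University; Shanghai Jiao Tong University; Shanghai AI Laboratory
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: IROS 2025 Oral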
Abstract:Anomaly detection and localization in automated industrial manufacturing can significantly enhance production efficiency and product quality. Existing methods are capable of detecting surface defects in pre-defined or controlled imaging environments. However, accurately detecting workpiece defects in complex and unstructured industrial environments with varying views, poses and illumination remains challenging. We propose a novel anomaly detection and localization method specifically designed to handle inputs with perturbative patterns. Our approach introduces a new framework based on a collaborative distillation heterogeneous teacher network (HetNet), an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module. HetNet can learn to model the complex feature distribution of normal patterns using limited information about local disruptive changes. We conducted extensive experiments on mainstream benchmarks. HetNet demonstrates superior performance with approximately 10% improvement across all evaluation metrics on MSC-AD under industrial conditions, while achieving state-of-the-art results on other datasets, validating its resilience to environmental fluctuations and its capability to enhance the reliability of industrial anomaly detection systems across diverse scenarios. Tests in real-world environments further confirm that HetNet can be effectively integrated into production lines to achieve robust and real-time anomaly detection. Codes, images and videos are published on the project website at: this https URL
[CV-110] EndoMUST: Monocular Depth Estimation for Robotic Endoscopy via End-to-end Multi-step Self-supervised Training IROS2025
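[Quick read]: This paper addresses information interference in monocular and self-supervised depth estimation for endoscopic scenes, where lighting variations and sparse textures are common. The key to its solution is a multi-step efficient finetuning framework: in each training epoch the process is split into three steps (optical flow registration, multiscale image decomposition, and multiple transformation alignments), and at each step only the related networks are trained, free of interference from irrelevant information, which improves self-supervised depth estimation.
Link: https://arxiv.org/abs/2506.16017
Authors: Liangjing Shao,Linxin Bai,Chenkang Du,Xinrong Chen
Affiliations: Fudan University; Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IROS 2025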
Abstract:Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques including optical flow, appearance flow and intrinsic image decomposition have been introduced into the existing methods. However, an effective training strategy for multiple modules is still critical to deal with both illumination issues and information interference for self-supervised depth estimation in endoscopy. Therefore, a novel framework with multistep efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps, including optical flow registration, multiscale image decomposition and multiple transformation alignments. At each step, only the related networks are trained without interference of irrelevant information. Based on parameter-efficient finetuning on the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on the SCARED dataset and zero-shot depth estimation on the Hamlyn dataset, with 4% to 10% lower error. The evaluation code of this work has been published on this https URL.
[CV-111] DIGMAPPER: A Modular System for Automated Geologic Map Digitization
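[Quick read]: This paper addresses the labor-intensive and time-consuming nature of digitizing historical geologic maps, which limits efficient assessment of critical mineral resources. The key to its solution is DIGMAPPER, a modular, scalable system that integrates state-of-the-art deep learning models for map layout analysis, feature extraction, and georeferencing, and that uses in-context learning with large language models, synthetic data generation, and transformer-based models to cope with limited training data and complex visual content.
Link: https://arxiv.org/abs/2506.16006
Authors: Weiwei Duan,Michael P. Gerlek,Steven N. Minton,Craig A. Knoblock,Fandel Lin,Theresa Chen,Leeje Jang,Sofia Kirsanova,Zekun Li,Yijun Lin,Yao-Yi Chiang
Affiliations: Inferlink Corporation; USC Information Sciences Institute; University of Minnesota
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: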
Abstract:Historical geologic maps contain rich geospatial information, such as rock units, faults, folds, and bedding planes, that is critical for assessing mineral resources essential to renewable energy, electric vehicles, and national security. However, digitizing maps remains a labor-intensive and time-consuming task. We present DIGMAPPER, a modular, scalable system developed in collaboration with the United States Geological Survey (USGS) to automate the digitization of geologic maps. DIGMAPPER features a fully dockerized, workflow-orchestrated architecture that integrates state-of-the-art deep learning models for map layout analysis, feature extraction, and georeferencing. To overcome challenges such as limited training data and complex visual content, our system employs innovative techniques, including in-context learning with large language models, synthetic data generation, and transformer-based models. Evaluations on over 100 annotated maps from the DARPA-USGS dataset demonstrate high accuracy across polygon, line, and point feature extraction, and reliable georeferencing performance. Deployed at USGS, DIGMAPPER significantly accelerates the creation of analysis-ready geospatial datasets, supporting national-scale critical mineral assessments and broader geoscientific applications.
[CV-112] Adversarial Attacks and Detection in Visual Place Recognition for Safer Robot Navigation
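[Quick read]: This paper addresses the lack of defence that stand-alone Visual Place Recognition (VPR) systems have against well-designed adversarial attacks in robot navigation, where such attacks can have severe consequences. The key to its solution is to introduce an Adversarial Attack Detector (AAD) and close the loop between VPR, the AAD, and active navigation decisions to make the system more robust. Experiments with AADs of varying detection accuracy show that localization improves substantially even at a 75% true positive rate and a false positive rate of up to 25%, for example a reduction of about 50% in the mean along-track localization error.
Link: https://arxiv.org/abs/2506.15988
Authors: Connor Malone,Owen Claxton,Iman Shames,Michael Milford
Affiliations: QUT Centre for Robotics; Centre for Advanced Defence Research in Robotics and Autonomous Systems; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: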
Abstract:Stand-alone Visual Place Recognition (VPR) systems have little defence against a well-designed adversarial attack, which can lead to disastrous consequences when deployed for robot navigation. This paper extensively analyzes the effect of four adversarial attacks common in other perception tasks and four novel VPR-specific attacks on VPR localization performance. We then propose how to close the loop between VPR, an Adversarial Attack Detector (AAD), and active navigation decisions by demonstrating the performance benefit of simulated AADs in a novel experiment paradigm, which we detail for the robotics community to use as a system framework. In the proposed experiment paradigm, we see that the addition of AADs across a range of detection accuracies can improve performance over baseline, demonstrating that a significant improvement (such as a ~50% reduction in the mean along-track localization error) can be achieved with True Positive and False Positive detection rates of only 75% and up to 25%, respectively. We examine a variety of metrics including: Along-Track Error, Percentage of Time Attacked, Percentage of Time in an 'Unsafe' State, and Longest Continuous Time Under Attack. Expanding further on these results, we provide the first investigation into the efficacy of the Fast Gradient Sign Method (FGSM) adversarial attack for VPR. The analysis in this work highlights the need for AADs in real-world systems for trustworthy navigation, and informs quantitative requirements for system design.
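For readers unfamiliar with the Fast Gradient Sign Method mentioned in the abstract, the sketch below shows a single FGSM step, x_adv = x + eps * sign(dLoss/dx), on a toy logistic-regression model. The model, weights, and inputs are assumptions made purely for illustration. The paper applies FGSM to VPR systems, whose models and loss are not described in the abstract.

```python
import math


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_linear(x, y, w, b, eps):
    """One FGSM step against a toy model p = sigmoid(w . x + b).

    For binary cross-entropy loss the input gradient is dLoss/dx_i = (p - y) * w_i,
    so each feature moves by eps in the direction that increases the loss.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


# Toy example: a correctly classified positive sample is nudged toward the boundary.
x_adv = fgsm_linear(x=[1.0, -2.0], y=1, w=[0.5, -0.25], b=0.0, eps=0.1)
```

The perturbation is bounded by `eps` per feature, which is why FGSM perturbations on images can be hard to see while still changing the model's output.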
[CV-113] Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization
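[Quick read]: This paper addresses the limited naturalness and expressiveness of videos produced by Sign Language Video Generation (SLVG). Existing methods mainly rely on a single coarse condition, such as skeleton sequences, as the intermediary between the translation model and the video generation model. The key to its solution is the SignViP framework, which improves generation quality by incorporating multiple fine-grained conditions (fine-grained poses and 3D hands) and adopts a discrete tokenization paradigm to integrate and represent these conditions as discrete tokens.
Link: https://arxiv.org/abs/2506.15980
Authors: Cong Wang,Zexuan Deng,Zhiwei Jiang,Fei Shen,Yafeng Yin,Shiwei Gan,Zifeng Cheng,Shiping Ge,Qing Gu
Affiliations: Nanjing University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: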
Abstract:Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at this https URL.
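The quantization step in component (2) can be illustrated with a minimal sketch of finite scalar quantization: each latent coordinate is bounded and then rounded to a small fixed set of integer levels, and the resulting code vector is flattened into one token id. The level counts, function names, and input values here are assumptions for illustration, since the abstract does not give SignViP's configuration. The sketch also covers only the forward quantization, not the autoencoder or its training.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer values.

    Assumes every entry of `levels` is odd, so the grid is symmetric around 0.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half   # squash into (-half, half)
        codes.append(int(round(bounded)))   # snap to the nearest integer level
    return codes


def codes_to_token(codes, levels):
    """Flatten a per-dimension code vector into a single token id (mixed radix)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        token = token * n_levels + (code + half)
    return token


levels = [5, 5, 3]                           # hypothetical: 5 * 5 * 3 = 75 possible tokens
codes = fsq_quantize([0.0, 5.0, -5.0], levels)
token = codes_to_token(codes, levels)
```

Because the codebook is implied by the level grid rather than learned, the number of distinct tokens is simply the product of the per-dimension level counts.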
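[CV-114] Towards Classifying Histopathological Microscope Images as Time Series Data
[Quick read]: This paper addresses the classification of microscopic pathology images for cancer diagnosis, a data type the deep learning community has largely overlooked and that is challenging because of its manual acquisition and weak labels. The key to its solution is to treat microscopy image sequences as time series data, use Dynamic Time-series Warping (DTW) to fit sequences of varying lengths to a fixed-length target, and apply attention-based pooling to predict the class of the case at the same time.
Link: https://arxiv.org/abs/2506.15977
Authors: Sungrae Hong,Hyeongmin Park,Youngsin Ko,Sol Lee,Bryan Wong,Mun Yong Yi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures, Accepted by International Symposium on Biomedical Imaging (ISBI) 2025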
Abstract:As the frontline data for cancer diagnosis, microscopic pathology images are fundamental for providing patients with rapid and accurate treatment. However, despite their practical value, the deep learning community has largely overlooked their usage. This paper proposes a novel approach to classifying microscopy images as time series data, addressing the unique challenges posed by their manual acquisition and weakly labeled nature. The proposed method fits image sequences of varying lengths to a fixed-length target by leveraging Dynamic Time-series Warping (DTW). Attention-based pooling is employed to predict the class of the case simultaneously. We demonstrate the effectiveness of our approach by comparing performance with various baselines and showcasing the benefits of using various inference strategies in achieving stable and reliable results. Ablation studies further validate the contribution of each component. Our approach contributes to medical image analysis by not only embracing microscopic images but also lifting them to a trustworthy level of performance.
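As background for the DTW step, the sketch below implements the classic dynamic time warping cost between two 1-D sequences of different lengths. It shows only the standard alignment recurrence. How the paper uses DTW to fit image sequences to a fixed-length target is not detailed in the abstract, and the scalar inputs here stand in for whatever per-image features the authors use.

```python
def dtw_cost(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j], where each step
    may advance in a, in b, or in both.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


same_shape = dtw_cost([1, 2, 3], [1, 2, 2, 3])   # repeated element is absorbed by warping
shifted = dtw_cost([1, 2, 3], [2, 3, 4])
```

The first call shows why DTW suits sequences of varying lengths: a repeated frame adds no cost, whereas a pointwise comparison could not even be computed.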
[CV-115] LBMamba: Locally Bi-directional Mamba
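[Quick read]: This paper addresses two problems of Mamba in computer vision: the limited receptive field caused by its unidirectional scan, and the extra computational load of adding a global backward scan. The key to its solution is LBMamba, a locally bi-directional State Space Model (SSM) block that embeds a lightweight local backward scan inside the forward selective scan and executes it in per-thread registers, avoiding a separate backward scan. Building on LBMamba, the paper proposes LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps.
Link: https://arxiv.org/abs/2506.15976
Authors: Jingwei Zhang,Xi Han,Hong Qin,Mahdi S. Hosseini,Dimitris Samaras
Affiliations: Stony Brook University; Concordia University; Mila–Quebec AI Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to TMLR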
Abstract:Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel selective scan, has recently emerged as a linearly-scaling, efficient alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information of its previous states and is blind to the states that follow. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba’s global forward scan with a global backward scan, forming a bi-directional scan that restores a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate these extra scans, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan and executes it entirely in per-thread registers. Building on LBMamba, we present LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate the versatility of our approach on both natural images and whole slide images (WSIs). We show that our LBVim consistently offers a superior performance-throughput trade-off. That is, under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, 0.9% higher APb and 1.1% higher APm on the COCO detection dataset. We also integrate LBMamba into the SOTA pathology multiple instance learning (MIL) approach, MambaMIL, which uses a single-directional scan. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy.
[CV-116] Heterogeneous-Modal Unsupervised Domain Adaptation via Latent Space Bridging
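[Quick read]: This paper addresses the poor performance of Unsupervised Domain Adaptation (UDA) when the source and target domains belong to entirely different modalities. The key to its solution is a new setting, Heterogeneous-Modal Unsupervised Domain Adaptation (HMUDA), which transfers knowledge across modalities through a bridge domain containing unlabeled samples from both modalities. To learn under HMUDA, the authors propose Latent Space Bridging (LSB), a dual-branch framework that uses a feature consistency loss to align representations across modalities and a domain alignment loss to reduce discrepancies between class centroids across domains.
Link: https://arxiv.org/abs/2506.15971
Authors: Jiawen Yang,Shuhao Chen,Yucong Duan,Ke Tang,Yu Zhang
Affiliations: Southern University of Science and Technology; SZ DJI Technology Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: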
Abstract:Unsupervised domain adaptation (UDA) methods effectively bridge domain gaps but struggle when the source and target domains belong to entirely distinct modalities. To address this limitation, we propose a novel setting called Heterogeneous-Modal Unsupervised Domain Adaptation (HMUDA), which enables knowledge transfer between completely different modalities by leveraging a bridge domain containing unlabeled samples from both modalities. To learn under the HMUDA setting, we propose Latent Space Bridging (LSB), a specialized framework designed for the semantic segmentation task. Specifically, LSB utilizes a dual-branch architecture, incorporating a feature consistency loss to align representations across modalities and a domain alignment loss to reduce discrepancies between class centroids across domains. Extensive experiments conducted on six benchmark datasets demonstrate that LSB achieves state-of-the-art performance.
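One plausible reading of "reducing discrepancies between class centroids across domains" is sketched below: compute a mean feature vector per class in each domain and penalize the squared distance between matching centroids. This is an assumption-laden illustration, not LSB's actual loss. The squared Euclidean distance, the function names, and the use of labels on both sides (which would be pseudo-labels for an unlabeled domain) are all choices made for the example.

```python
def class_centroids(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def centroid_alignment_loss(feats_a, labels_a, feats_b, labels_b):
    """Mean squared distance between same-class centroids of two domains."""
    ca = class_centroids(feats_a, labels_a)
    cb = class_centroids(feats_b, labels_b)
    shared = sorted(set(ca) & set(cb))
    if not shared:
        return 0.0
    total = 0.0
    for lab in shared:
        total += sum((x - y) ** 2 for x, y in zip(ca[lab], cb[lab]))
    return total / len(shared)


# Toy 2-D features: class 0 already matches across domains, class 1 is offset.
loss = centroid_alignment_loss(
    [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]], [0, 0, 1],
    [[1.0, 0.0], [1.0, 3.0]], [0, 1],
)
```

Minimizing a term like this pulls same-class features from the two domains toward a common location in the latent space.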
[CV-117] Polyline Path Masked Attention for Vision Transformer
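[Quick read]: This paper addresses two core issues in deep learning architecture design: global dependency modeling and spatial position modeling. The key to its solution is Polyline Path Masked Attention (PPMA), which integrates the self-attention mechanism of Vision Transformers (ViTs) with an enhanced structured mask from Mamba2. It improves the traditional structured mask with a 2D polyline path scanning strategy that better preserves adjacency relationships among image tokens, and embeds the resulting mask into self-attention to model the spatial adjacency prior explicitly.
Link: https://arxiv.org/abs/2506.15940
Authors: Zhongchen Zhao,Chaodong Xiao,Hui Lin,Qi Xie,Lei Zhang,Deyu Meng
Affiliations: Xi’an Jiaotong University; The Hong Kong Polytechnic University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: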
Abstract:Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at this https URL.
[CV-118] Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization
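[Quick read]: This paper tackles video synchronization: aligning multiple video streams that capture the same event from different angles, which matters for reality TV production, sports analysis, surveillance, and autonomous systems. Existing methods rely mainly on audio cues or specific visual events, which limits them where such signals are unreliable or absent, and existing benchmarks lack generality and reproducibility. The proposed VideoSync framework does not depend on a specific feature extraction method (such as human pose estimation), so it applies across content types. The authors evaluate it on newly composed datasets, release the dataset-creation methodology and code to establish reproducible benchmarks, expose biases in prior state-of-the-art work, and propose a more rigorous evaluation framework in which VideoSync outperforms existing approaches under fair experimental conditions.
Link: https://arxiv.org/abs/2506.15937
Authors: Yosub Shin, Igor Molybog
Affiliations: University of Hawai’i at Manoa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: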
Abstract:Video synchronization, the alignment of multiple video streams capturing the same event from different angles, is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings where such signals may be unreliable or absent. Additionally, existing benchmarks for video synchronization lack generality and reproducibility, restricting progress in the field. In this work, we introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods, such as human pose estimation, enabling broader applicability across different content types. We evaluate our system on newly composed datasets covering single-human, multi-human, and non-human scenarios, providing both the methodology and code for dataset creation to establish reproducible benchmarks. Our analysis reveals biases in prior SOTA work, particularly in SeSyn-Net’s preprocessing pipeline, leading to inflated performance claims. We correct these biases and propose a more rigorous evaluation framework, demonstrating that VideoSync outperforms existing approaches, including SeSyn-Net, under fair experimental conditions. Additionally, we explore various synchronization offset prediction methods, identifying a convolutional neural network (CNN)-based model as the most effective. Our findings advance video synchronization beyond domain-specific constraints, making it more generalizable and robust for real-world applications.
[CV-119] MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior
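[Quick read]: This paper addresses image and video demoiréing, which involves a complex nonlinear degradation process. Conventional supervised methods either fail to remove moiré patterns completely or over-smooth the result, mainly because limited model capacity and scarce training data prevent accurate reconstruction of the ground-truth image. The key to the solution is a hybrid framework based on Maximum A Posteriori (MAP) estimation that combines two complementary components: a supervised model enhanced with efficient linear attention Test-Time Training (TTT) modules that directly learns the nonlinear RAW-to-sRGB mapping, and a Truncated Flow Matching Prior (TFMP) that further refines the output by aligning it with the clean image distribution, restoring high-frequency details and suppressing artifacts.
Link: https://arxiv.org/abs/2506.15929
Authors: Liangyan Li, Yimo Ning, Kevin Le, Wei Dong, Yunzhe Li, Jun Chen, Xiaohong Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: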
Abstract:This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques. Demoiréing addresses inherently nonlinear degradation processes, which pose significant challenges for existing methods. Traditional supervised learning approaches either fail to remove moiré patterns completely or produce overly smooth results. This stems from constrained model capacity and scarce training data, which inadequately represent the clean image distribution and hinder accurate reconstruction of ground-truth images. While generative models excel in image restoration for linear degradations, they struggle with nonlinear cases such as demoiréing and often introduce artifacts. To address these limitations, we propose a hybrid MAP-based framework that integrates two complementary components. The first is a supervised learning model enhanced with efficient linear attention Test-Time Training (TTT) modules, which directly learn nonlinear mappings for RAW-to-sRGB demoiréing. The second is a Truncated Flow Matching Prior (TFMP) that further refines the outputs by aligning them with the clean image distribution, effectively restoring high-frequency details and suppressing artifacts. These two components combine the computational efficiency of linear attention with the refinement abilities of generative models, resulting in improved restoration performance.
[CV-120] Pediatric Pancreas Segmentation from MRI Scans with Deep Learning
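[Quick read]: This paper addresses the challenge of segmenting the pancreas on magnetic resonance imaging (MRI) in children with acute pancreatitis (AP) or chronic pancreatitis (CP), and evaluates and validates a deep learning (DL) algorithm named PanSegNet for accurate pancreas segmentation. The key to the solution is to use a deep learning model to segment the pancreas automatically across pathological states and to validate it with the Dice Similarity Coefficient (DSC) and the 95th percentile Hausdorff distance (HD95). The results show that PanSegNet reaches expert-level segmentation accuracy in both healthy controls and patients, indicating clinical reliability.
Link: https://arxiv.org/abs/2506.15908
Authors: Elif Keles, Merve Yazol, Gorkem Durak, Ziliang Hong, Halil Ertugrul Aktas, Zheyuan Zhang, Linkai Peng, Onkar Susladkar, Necati Guzelyel, Oznur Leman Boyunaga, Cemal Yazici, Mark Lowe, Aliye Uc, Ulas Bagci
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code and MRI data are publicly available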
Abstract:Objective: Our study aimed to evaluate and validate PanSegNet, a deep learning (DL) algorithm for pediatric pancreas segmentation on MRI in children with acute pancreatitis (AP), chronic pancreatitis (CP), and healthy controls. Methods: With IRB approval, we retrospectively collected 84 MRI scans (1.5T/3T Siemens Aera/Verio) from children aged 2-19 years at Gazi University (2015-2024). The dataset includes healthy children as well as patients diagnosed with AP or CP based on clinical criteria. Pediatric and general radiologists manually segmented the pancreas, and a senior pediatric radiologist then confirmed the segmentations. PanSegNet-generated segmentations were assessed using Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff distance (HD95). Cohen’s kappa measured observer agreement. Results: Pancreas MRI T2W scans were obtained from 42 children with AP/CP (mean age: 11.73 +/- 3.9 years) and 42 healthy children (mean age: 11.19 +/- 4.88 years). PanSegNet achieved DSC scores of 88% (controls), 81% (AP), and 80% (CP), with HD95 values of 3.98 mm (controls), 9.85 mm (AP), and 15.67 mm (CP). Inter-observer kappa was 0.86 (controls), 0.82 (pancreatitis), and intra-observer agreement reached 0.88 and 0.81. Strong agreement was observed between automated and manual volumes (R^2 = 0.85 in controls, 0.77 in diseased), demonstrating clinical reliability. Conclusion: PanSegNet represents the first validated deep learning solution for pancreatic MRI segmentation, achieving expert-level performance across healthy and diseased states. The tool and algorithm, along with our annotated dataset, are freely available on GitHub and OSF, advancing accessible, radiation-free pediatric pancreatic imaging and fostering collaborative research in this underserved domain.
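The Dice Similarity Coefficient (DSC) reported above measures the overlap between a predicted mask and a manual one: twice the number of shared foreground voxels divided by the total number of foreground voxels in both masks. A minimal sketch in plain Python follows; the function name, the flat 0/1 mask layout, and the empty-mask convention are illustrative assumptions, not PanSegNet's evaluation code.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical data): 3 foreground voxels each, 2 of them shared.
predicted = [0, 1, 1, 1, 0, 0]
manual = [0, 1, 1, 0, 1, 0]
print(dice_coefficient(predicted, manual))
```

A DSC of 1.0 means the masks coincide exactly and 0.0 means they share no foreground voxel, so the 88% reported for controls corresponds to a value of 0.88 on this scale.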
[CV-121] Visual symbolic mechanisms: Emergent symbol processing in vision language models
[Quick read]: This paper addresses the "binding problem" in Vision Language Models (VLMs): how to combine different features correctly to represent individual objects in a visual scene. The key finding is a set of emergent symbolic mechanisms that support binding in VLMs through a content-independent spatial indexing scheme; binding errors can be traced directly to failures of these mechanisms.
Link: https://arxiv.org/abs/2506.15871
Authors: Rim Assouel, Declan Campbell, Taylor Webb
Affiliations: Mila – Quebec AI Institute; Université de Montréal; Princeton Neuroscience Institute; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this ‘binding problem’ via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.
[CV-122] Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation
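[Quick read]: This paper addresses the privacy risks posed by images captured by AI-equipped (AIE) cameras in Connected and Autonomous Vehicles (CAVs). Existing techniques such as face blurring and data obfuscation still cannot prevent individuals from being tracked through other features such as clothing. The paper proposes a privacy-preserving framework based on feedback-based reinforcement learning (RL) and vision-language models (VLMs). Its key idea is to convert images into semantically equivalent textual descriptions, retaining scene-relevant information while preserving visual privacy. A hierarchical RL strategy iteratively refines the generated text, improving both semantic accuracy and privacy protection.
Link: https://arxiv.org/abs/2506.15854
Authors: Abdolazim Rezaei, Mehdi Sookhak, Ahmad Patooghy
Affiliations: Texas A&M University Corpus Christi; North Carolina A&T State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: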
Abstract:Connected and Autonomous Vehicles (CAVs) rely on a range of devices that often process privacy-sensitive data. Among these, roadside units play a critical role particularly through the use of AI-equipped (AIE) cameras for applications such as violation detection. However, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. While traditional techniques such as face blurring and obfuscation have been applied to mitigate privacy risks, individual privacy remains at risk, as individuals can still be tracked using other features such as their clothing. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The main idea is to convert images into semantically equivalent textual descriptions, ensuring that scene-relevant information is retained while visual privacy is preserved. A hierarchical RL strategy is employed to iteratively refine the generated text, enhancing both semantic accuracy and privacy. Evaluation results demonstrate significant improvements in both privacy protection and textual quality, with the Unique Word Count increasing by approximately 77% and Detail Density by around 50% compared to existing approaches.
[CV-123] Assessing the impact of Binarization for Writer Identification in Greek Papyrus
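[Quick read]: This paper addresses writer identification for Greek papyri, focusing on how image binarization affects downstream writer identification performance. The study compares traditional binarization methods with state-of-the-art Deep Learning (DL) models, evaluates the impact of binarization quality on writer identification, and examines the role of data augmentation and model selection criteria. The results show that data augmentation strongly influences the DL methods and that binarization effectiveness correlates strongly with downstream writer identification performance.
Link: https://arxiv.org/abs/2506.15852
Authors: Dominic Akt, Marco Peer, Florian Kleber
Affiliations: TU Wien; HEIA-FR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at AIROV 2025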
Abstract:This paper tackles the task of writer identification for Greek papyri. A common preprocessing step in writer identification pipelines is image binarization, which prevents the model from learning background features. This is challenging in historical documents, in our case Greek papyri, as background is often non-uniform, fragmented, and discolored with visible fiber structures. We compare traditional binarization methods to state-of-the-art Deep Learning (DL) models, evaluating the impact of binarization quality on subsequent writer identification performance. DL models are trained with and without a custom data augmentation technique, and different model selection criteria are applied. The performance of these binarization methods is then systematically evaluated on the DIBCO 2019 dataset. The impact of binarization on writer identification is subsequently evaluated using a state-of-the-art approach for writer identification. The results of this analysis highlight the influence of data augmentation for DL methods. Furthermore, findings indicate a strong correlation between binarization effectiveness on papyri documents of DIBCO 2019 and downstream writer identification performance.
[CV-124] Semantic and Feature Guided Uncertainty Quantification of Visual Localization for Autonomous Vehicles ICRA2025
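[Quick read]: This paper addresses uncertainty quantification for sensor measurements coupled with deep learning networks, which is crucial for many robotics systems, especially safety-critical applications such as self-driving cars. The key to the solution is a lightweight sensor error model that learns the measurement uncertainty by mapping image features and semantic information to a 2-dimensional error distribution. This yields uncertainty estimates conditioned on the specific context of the matched image pair and implicitly captures critical unannotated factors (for example, city vs highway, dynamic vs static scenes, winter vs summer).
Link: https://arxiv.org/abs/2506.15851
Authors: Qiyuan Wu, Mark Campbell
Affiliations: Cornell University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025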
Abstract:The uncertainty quantification of sensor measurements coupled with deep learning networks is crucial for many robotics systems, especially for safety-critical applications such as self-driving cars. This paper develops an uncertainty quantification approach in the context of visual localization for autonomous driving, where locations are selected based on images. Key to our approach is to learn the measurement uncertainty using a lightweight sensor error model, which maps both image features and semantic information to a 2-dimensional error distribution. Our approach enables uncertainty estimation conditioned on the specific context of the matched image pair, implicitly capturing other critical, unannotated factors (e.g., city vs highway, dynamic vs static scenes, winter vs summer) in a latent manner. We demonstrate the accuracy of our uncertainty prediction framework using the Ithaca365 dataset, which includes variations in lighting and weather (sunny, night, snowy). We evaluate both the uncertainty quantification of the sensor+network and Bayesian localization filters that use a unique sensor gating method. Results show that the measurement error does not follow a Gaussian distribution under poor weather and lighting conditions, and is better predicted by our Gaussian Mixture model.
[CV-125] PRISM-Loc: a Lightweight Long-range LiDAR Localization in Urban Environments with Topological Maps IROS2025
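[Quick read]: This paper addresses real-time localization of mobile robots and self-driving vehicles in large environments, in particular the high computational and memory cost of using a dense global lidar map. The key to the solution is PRISM-Loc, a topological map-based localization approach built on a two-stage pipeline: global place recognition followed by local pose estimation inside the recognized location. For local pose estimation, the authors introduce an original lidar scan matching algorithm based on 2D features and point-based optimization.
Link: https://arxiv.org/abs/2506.15849
Authors: Kirill Muravyev, Vasily Yuryev, Oleg Bulichev, Dmitry Yudin, Konstantin Yakovlev
Affiliations: Federal Research Center for Computer Science and Control of Russian Academy of Sciences; Moscow Institute of Physics and Technology; Artificial Intelligence Research Institute; Innopolis University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This version was submitted to and rejected from the IROS 2025 conference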
Abstract:Localization in the environment is one of the crucial tasks of navigation of a mobile robot or a self-driving vehicle. For long-range routes, performing localization within a dense global lidar map in real time may be difficult, and the creation of such a map may require much memory. To this end, leveraging topological maps may be useful. In this work, we propose PRISM-Loc – a topological map-based approach for localization in large environments. The proposed approach leverages a twofold localization pipeline, which consists of global place recognition and estimation of the local pose inside the found location. For local pose estimation, we introduce an original lidar scan matching algorithm, which is based on 2D features and point-based optimization. We evaluate the proposed method on the ITLP-Campus dataset on a 3 km route, and compare it against the state-of-the-art metric map-based and place recognition-based competitors. The results of the experiments show that the proposed method outperforms its competitors both quality-wise and computationally-wise.
[CV-126] EchoShot: Multi-Shot Portrait Video Generation
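[Quick read]: This paper addresses the limited identity consistency and content controllability of existing video diffusion models in multi-shot portrait video generation. The key to the solution is the EchoShot framework, which introduces shot-aware position embedding mechanisms into the video diffusion transformer architecture to model inter-shot variations and to establish the correspondence between multi-shot visual content and its textual descriptions, enabling direct training on multi-shot video data without additional computational overhead.
Link: https://arxiv.org/abs/2506.15838
Authors: Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, Jieping Ye
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: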
Abstract:Video diffusion models substantially boost the productivity of artistic workflows with high-quality portrait video generative capacity. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications urge for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. To start with, we propose shot-aware position embedding mechanisms within video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model training within multi-shot scenario, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions such as facial attributes, outfits, and dynamic motions. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts. Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling.
[CV-127] ADAM-Dehaze: Adaptive Density-Aware Multi-Stage Dehazing for Improved Object Detection in Foggy Conditions
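[Quick read]: This paper addresses the challenge that fog and other adverse weather pose to autonomous vehicles, surveillance systems, and other safety-critical applications by severely degrading visual information. The key to the solution is the ADAM-Dehaze framework, which jointly optimizes image restoration and object detection and adapts its processing to the fog density. Its core components are a lightweight Haze Density Estimation Network (HDEN) that classifies the fog intensity and dynamically selects the corresponding CORUN branch, and an adaptive loss that balances physical-model coherence and perceptual fidelity, improving dehazing while preserving fine details.
Link: https://arxiv.org/abs/2506.15837
Authors: Fatmah AlHindaassi, Mohammed Talha Alam, Fakhri Karray
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at IEEE SMC 2025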
Abstract:Adverse weather conditions, particularly fog, pose a significant challenge to autonomous vehicles, surveillance systems, and other safety-critical applications by severely degrading visual information. We introduce ADAM-Dehaze, an adaptive, density-aware dehazing framework that jointly optimizes image restoration and object detection under varying fog intensities. A lightweight Haze Density Estimation Network (HDEN) classifies each input as light, medium, or heavy fog. Based on this score, the system dynamically routes the image through one of three CORUN branches: Light, Medium, or Complex, each tailored to its haze regime. A novel adaptive loss balances physical-model coherence and perceptual fidelity, ensuring both accurate defogging and preservation of fine details. On Cityscapes and the real-world RTTS benchmark, ADAM-Dehaze improves PSNR by up to 2.1 dB, reduces FADE by 30 percent, and increases object detection mAP by up to 13 points, while cutting inference time by 20 percent. These results highlight the importance of intensity-specific processing and seamless integration with downstream vision tasks. Code available at: this https URL.
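The density-aware routing described above (classify the fog level, then dispatch to a branch tailored to that level) can be sketched as a small lookup. Everything below is a hypothetical illustration: the thresholds, the scalar density score, and the stand-in branch functions are assumptions, and the real HDEN and CORUN branches are neural networks whose interfaces the abstract does not specify.

```python
def classify_haze(density_score, light_max=0.33, medium_max=0.66):
    """Map a density score in [0, 1] to a fog class (thresholds are assumed)."""
    if density_score < light_max:
        return "light"
    if density_score < medium_max:
        return "medium"
    return "heavy"


# Stand-ins for the three branches; each only tags the image with its name.
def light_branch(image):
    return ("Light", image)


def medium_branch(image):
    return ("Medium", image)


def complex_branch(image):
    return ("Complex", image)


# Heavy fog is routed to the branch the abstract calls "Complex".
ROUTES = {"light": light_branch, "medium": medium_branch, "heavy": complex_branch}


def dehaze(image, density_score):
    """Route the image through the branch matching its estimated fog class."""
    return ROUTES[classify_haze(density_score)](image)


print(dehaze([0.2, 0.4], 0.9)[0])
```

Keeping the classifier separate from the dispatch table means a branch can be swapped without touching the routing logic.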
[CV-128] VEIGAR: View-consistent Explicit Inpainting and Geometry Alignment for 3D object Removal
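[Quick read]: This paper addresses cross-view consistency in Novel View Synthesis (NVS) and 3D generation tasks. Previous methods typically rely on an initial 3D reconstruction phase to establish geometric structure, which adds considerable computational overhead and still yields suboptimal reconstruction quality. The key to the solution is the VEIGAR framework, which needs no initial reconstruction phase: it uses a lightweight foundation model to align priors reliably in pixel space and introduces a new supervision strategy based on a scale-invariant depth loss, removing the need for traditional scale-and-shift operations and improving both computational efficiency and output quality.
Link: https://arxiv.org/abs/2506.15821
Authors: Pham Khai Nguyen Do, Bao Nguyen Tran, Nam Nguyen, Duc Dung Nguyen
Affiliations: Ho Chi Minh City University of Technology, VNUHCM
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: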
Abstract:Recent advances in Novel View Synthesis (NVS) and 3D generation have significantly improved editing tasks, with a primary emphasis on maintaining cross-view consistency throughout the generative process. Contemporary methods typically address this challenge using a dual-strategy framework: performing consistent 2D inpainting across all views guided by embedded priors either explicitly in pixel space or implicitly in latent space; and conducting 3D reconstruction with additional consistency guidance. Previous strategies, in particular, often require an initial 3D reconstruction phase to establish geometric structure, introducing considerable computational overhead. Even with the added cost, the resulting reconstruction quality often remains suboptimal. In this paper, we present VEIGAR, a computationally efficient framework that outperforms existing methods without relying on an initial reconstruction phase. VEIGAR leverages a lightweight foundation model to reliably align priors explicitly in the pixel space. In addition, we introduce a novel supervision strategy based on scale-invariant depth loss, which removes the need for traditional scale-and-shift operations in monocular depth regularization. Through extensive experimentation, VEIGAR establishes a new state-of-the-art benchmark in reconstruction quality and cross-view consistency, while achieving a threefold reduction in training time compared to the fastest existing method, highlighting its superior balance of efficiency and effectiveness.
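To illustrate why a scale-invariant depth loss can remove the need for a separate scale alignment step, the sketch below implements one widely used formulation, the scale-invariant log loss of Eigen et al. (2014). Whether VEIGAR uses exactly this form is an assumption; the function name and the sample depths are hypothetical.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Eigen-style loss: mean(d^2) - lam * mean(d)^2 with d = log(pred) - log(target).

    With lam = 1.0, multiplying every predicted depth by one constant adds the
    same offset to every d, and the two terms cancel that offset exactly.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * (sum(d) ** 2) / (n * n)


target = [1.0, 2.0, 4.0]
scaled = [3.0, 6.0, 12.0]      # same structure, globally scaled by 3
distorted = [1.0, 2.0, 8.0]    # the relative depth of the last point is wrong

print(scale_invariant_log_loss(scaled, target))     # numerically zero
print(scale_invariant_log_loss(distorted, target))  # strictly positive
```

The loss penalizes only errors in relative depth, which is the property that makes monocular depth priors usable without first estimating a global scale.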
[CV-129] GratNet: A Photorealistic Neural Shader for Diffractive Surfaces
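[Quick read]: This paper addresses the heavy dependence of wave-optical models of structural coloration on dense preprocessed data, a dependence that limits reliable and photorealistic rendering of complex nanostructures. The key to the solution is a data-driven method based on a multi-layer perceptron (MLP) whose training and modeling strategy is designed from a data compression perspective and attuned to the domain and range characteristics of diffractive surface reflectance datasets. It achieves accurate and efficient rendering while avoiding over-fitting and resampling robustly.
Link: https://arxiv.org/abs/2506.15815
Authors: Narayan Kandel, Daljit Singh J.S. Dhillon
Affiliations: Clemson University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: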
Abstract:Structural coloration is commonly modeled using wave optics for reliable and photorealistic rendering of natural, quasi-periodic and complex nanostructures. Such models often rely on dense, preliminary or preprocessed data to accurately capture the nuanced variations in diffractive surface reflectances. This heavy data dependency warrants an implicit neural representation, which has not been addressed comprehensively in the current literature. In this paper, we present a multi-layer perceptron (MLP) based method for data-driven rendering of diffractive surfaces with high accuracy and efficiency. We primarily approach this problem from a data compression perspective to devise a nuanced training and modeling method which is attuned to the domain and range characteristics of diffractive reflectance datasets. Importantly, our approach avoids over-fitting and has robust resampling behavior. Using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and a flipping difference evaluator (FLIP) as evaluation metrics, we demonstrate the high-quality reconstruction of the ground-truth. In comparison to a recent state-of-the-art offline, wave-optical, forward modeling approach, our method reproduces subjectively similar results with significant performance gains. We reduce the memory footprint of the raw datasets by two orders of magnitude in general. Lastly, we depict the working of our method with actual surface renderings.
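Of the reconstruction metrics named above, PSNR is the simplest to state: ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch below assumes flat lists of values in [0, 1]; the function name and sample data are hypothetical, and it is not the authors' evaluation code.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long value lists."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.1, 0.4, 0.9, 0.6]   # every sample is off by 0.1
print(psnr(ground_truth, reconstructed))
```

Higher is better: halving the root-mean-square error raises PSNR by about 6 dB.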
[CV-130] Implicit 3D scene reconstruction using deep learning towards efficient collision understanding in autonomous driving
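[Quick read]: This paper addresses the difficulty existing techniques have with accurate 3D scene reconstruction in dense traffic, in particular reconstructing object shapes with high boundary-level accuracy. The key to the solution is a learning-based 3D scene reconstruction method that uses LiDAR data and a deep neural network to build static Signed Distance Function (SDF) maps. Compared with traditional polygonal representations, this approach can map the boundary details of 3D obstacles more finely and thereby improve collision detection, particularly in congested and dynamic environments.
Link: https://arxiv.org/abs/2506.15806
Authors: Akarshani Ramanayake, Nihal Kodikara
Affiliations: School of Computing; Informatics Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: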
Abstract:In crowded urban environments where traffic is dense, current technologies struggle to oversee tight navigation, but surface-level understanding allows autonomous vehicles to safely assess proximity to surrounding obstacles. 3D or 2D scene mapping of the surrounding objects is an essential task in addressing the above problem. Despite its importance in dense vehicle traffic conditions, 3D scene reconstruction of object shapes with higher boundary-level accuracy is not yet fully addressed in current literature. The signed distance function represents any shape through parameters that calculate the distance from any point in space to the closest obstacle surface, making it more efficient in terms of storage. In recent studies, researchers have started to formulate problems with implicit 3D reconstruction methods in the autonomous driving domain, highlighting the possibility of using the signed distance function to map obstacles effectively. This research addresses this gap by developing a learning-based 3D scene reconstruction methodology that leverages LiDAR data and a deep neural network to build static Signed Distance Function (SDF) maps. Unlike traditional polygonal representations, this approach has the potential to map 3D obstacle shapes with more boundary-level details. Our preliminary results demonstrate that this method would significantly enhance collision detection performance, particularly in congested and dynamic environments.
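A signed distance function returns, for any query point, the distance to the nearest obstacle surface (negative inside the obstacle), so a collision check reduces to one comparison. The sketch below uses analytic sphere obstacles purely to illustrate the representation; the paper instead learns the SDF from LiDAR data with a neural network, and all names and values here are hypothetical.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(point, center) - radius


def scene_sdf(point, spheres):
    """Distance to the nearest obstacle surface: minimum over all obstacle SDFs."""
    return min(sphere_sdf(point, center, radius) for center, radius in spheres)


def is_colliding(vehicle_center, vehicle_radius, spheres):
    """A spherical vehicle collides when a surface is closer than its own radius."""
    return scene_sdf(vehicle_center, spheres) < vehicle_radius


# Two hypothetical obstacles given as (center, radius) pairs.
obstacles = [((5.0, 0.0, 0.0), 1.0), ((0.0, 4.0, 0.0), 2.0)]
print(scene_sdf((0.0, 0.0, 0.0), obstacles))          # clearance at the origin
print(is_colliding((0.0, 0.0, 0.0), 1.5, obstacles))  # clearance exceeds the radius
print(is_colliding((0.0, 2.5, 0.0), 1.5, obstacles))  # too close to the second obstacle
```

Because the query cost does not depend on mesh resolution, a learned SDF can answer such proximity checks with a single network evaluation per point.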
[CV-131] Weakly-supervised VLM-guided Partial Contrastive Learning for Visual Language Navigation
[Quick read]: This paper addresses several challenges in Visual Language Navigation (VLN): pre-trained Vision-Language Models (VLMs) perceive poorly under dynamic viewpoints, Large Language Models (LLMs) or VLMs used without fine-tuning perform poorly on VLN tasks, and fine-tuning is computationally expensive. The key to the solution is Weakly-supervised Partial Contrastive Learning (WPCL), which integrates pre-trained VLM knowledge into the perception process without fine-tuning the VLM. This improves the agent's ability to identify objects from dynamic viewpoints and to interpret and respond to environmental cues while remaining computationally efficient.
Link: https://arxiv.org/abs/2506.15757
Authors: Ruoyu Wang, Tong Yu, Junda Wu, Yao Liu, Julian McAuley, Lina Yao
Affiliations: University of New South Wales; Adobe Research; University of California San Diego; Northeastern University; CSIRO’s Data61
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Despite the progress made by existing methods, these methods often present some common challenges. First, they rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. Second, the performance is limited when using pre-trained LLMs or VLMs without fine-tuning, due to the absence of VLN domain knowledge. Third, while fine-tuning LLMs and VLMs can improve results, their computational costs are higher than those without fine-tuning. To address these limitations, we propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent’s ability to identify objects from dynamic viewpoints in VLN scenarios by effectively integrating pre-trained VLM knowledge into the perception process, without requiring VLM fine-tuning. Our method enhances the agent’s ability to interpret and respond to environmental cues while ensuring computational efficiency. Experimental results have shown that our method outperforms the baseline methods on multiple benchmarks, which validate the effectiveness, robustness and generalizability of our method.
[CV-132] A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion
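[Quick read]: This paper examines a question that has received little scrutiny in Single-View Image Guided Point Cloud Completion (SVIPC): whether image guidance is necessary at all. To explore this, the authors propose a strong baseline, an attention-based multi-branch encoder-decoder network that takes only partial point clouds as input and completes them without any view. The key to the solution is a hierarchical self-fusion mechanism that uses cross-attention and self-attention layers to integrate information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures.
Link: https://arxiv.org/abs/2506.15747
Authors: Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada, Haichong K. Zhang, Ziming Zhang
Affiliations: Worcester Polytechnic Institute; Dell Technologies; Tohoku University; Harvard Kenneth C. Griffin Graduate School of Arts and Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 6 pages, 2 figures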
Abstract:The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that only takes partial point clouds as input, view-free. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network’s ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at this https URL.
[CV-133] Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning CVPR2025
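[Quick read]: This paper addresses catastrophic forgetting and overfitting in Few-shot Class Incremental Learning (FSCIL), where a model learns continually from only a few examples of each new class and is therefore prone to forgetting earlier knowledge and overfitting to the new examples. The key solution is the Tripartite Weight-Space Ensemble (Tri-WE), which interpolates the base, immediately previous, and current models in weight space, especially for the classification heads, to jointly retain knowledge from the base and previous models. The paper also introduces a regularization loss based on amplified data knowledge distillation to overcome the difficulty of distilling generalized representations from scarce data.
Link: https://arxiv.org/abs/2506.15720
Authors: Juntae Lee, Munawar Hayat, Sungrack Yun
Affiliations: Qualcomm AI Research
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025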
Abstract:Few-shot class incremental learning (FSCIL) enables the continual learning of new concepts with only a few training examples. In FSCIL, the model undergoes substantial updates, making it prone to forgetting previous concepts and overfitting to the limited new examples. The most recent trend is to disentangle the learning of the representation from the classification head of the model. A well-generalized feature extractor on the base classes (many examples and many classes) is learned, and then fixed during incremental learning. Arguing that the fixed feature extractor restricts the model’s adaptability to new classes, we introduce a novel FSCIL method to effectively address catastrophic forgetting and overfitting issues. Our method enables seamless updating of the entire model with a few examples. We mainly propose a tripartite weight-space ensemble (Tri-WE). Tri-WE interpolates the base, immediately previous, and current models in weight-space, especially for the classification heads of the models. Then, it collaboratively maintains knowledge from the base and previous models. In addition, we recognize the challenges of distilling generalized representations from the previous model from scarce data. Hence, we suggest a regularization loss term using amplified data knowledge distillation. Simply intermixing the few-shot data, we can produce richer data enabling the distillation of critical knowledge from the previous model. Consequently, we attain state-of-the-art results on the miniImageNet, CUB200, and CIFAR100 datasets.
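Interpolating three models in weight space, as described above, amounts to a convex combination of matching parameters. The sketch below is a simplified illustration and not the authors' Tri-WE: the fixed coefficients, the dictionary layout, and the assumption that all three models have identically shaped parameters are hypothetical (in FSCIL the classification head grows as classes are added, which this sketch ignores).

```python
def tri_weight_ensemble(base, prev, curr, alpha, beta):
    """Blend three parameter sets: alpha*base + beta*prev + (1 - alpha - beta)*curr."""
    gamma = 1.0 - alpha - beta
    if min(alpha, beta, gamma) < 0.0:
        raise ValueError("coefficients must be non-negative and sum to 1")
    return {
        name: [
            alpha * b + beta * p + gamma * c
            for b, p, c in zip(base[name], prev[name], curr[name])
        ]
        for name in curr
    }


# Hypothetical classification-head weights from three training sessions.
base_model = {"head": [1.0, 0.0]}
previous_model = {"head": [0.0, 1.0]}
current_model = {"head": [1.0, 1.0]}

merged = tri_weight_ensemble(base_model, previous_model, current_model, 0.25, 0.25)
print(merged["head"])
```

Giving the base and previous models non-zero weight pulls the merged parameters back toward earlier solutions, which is the intuition behind using the ensemble against forgetting.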
[CV-134] Shadow defense against gradient inversion attack in federated learning
[Quick read]: This paper addresses privacy leakage in federated learning (FL) caused by the communication of model updates, particularly in healthcare, where gradient inversion attacks (GIAs) can reconstruct training images and compromise patient privacy. Existing defenses lack a detailed understanding of which gradients or image information are most vulnerable, so their protection is too coarse: it either hurts model performance or fails to protect. The key to the proposed solution is to use an interpretable shadow model to identify sensitive regions and then inject targeted, sample-specific noise, giving a better balance between privacy protection and model performance.
Link: https://arxiv.org/abs/2506.15711
Authors: Le Jiang,Liyan Ma,Guang Yang
Affiliations: Bioengineering Department and Imperial-X, Imperial College London; School of Computer Engineering and Science, Shanghai University; National Heart and Lung Institute, Imperial College London; Cardiovascular Research Centre, Royal Brompton Hospital; School of Biomedical Engineering & Imaging Sciences, King’s College London
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Federated learning (FL) has emerged as a transformative framework for privacy-preserving distributed training, allowing clients to collaboratively train a global model without sharing their local data. This is especially crucial in sensitive fields like healthcare, where protecting patient data is paramount. However, privacy leakage remains a critical challenge, as the communication of model updates can be exploited by potential adversaries. Gradient inversion attacks (GIAs), for instance, allow adversaries to approximate the gradients used for training and reconstruct training images, thus stealing patient privacy. Existing defense mechanisms obscure gradients, yet lack a nuanced understanding of which gradients or types of image information are most vulnerable to such attacks. These indiscriminate calibrated perturbations result in either excessive privacy protection that degrades model accuracy, or insufficient protection that fails to safeguard sensitive information. Therefore, we introduce a framework that addresses these challenges by leveraging a shadow model with interpretability for identifying sensitive areas. This enables a more targeted and sample-specific noise injection. Specifically, our defensive strategy achieves discrepancies of 3.73 in PSNR and 0.2 in SSIM compared to the circumstance without defense on the ChestXRay dataset, and 2.78 in PSNR and 0.166 in SSIM on the EyePACS dataset. Moreover, it minimizes adverse effects on model performance, with less than 1% F1 reduction compared to SOTA methods. Our extensive experiments, conducted across diverse types of medical images, validate the generalization of the proposed framework. The stable defense improvements for FedAvg are consistently over 1.5% times in LPIPS and SSIM. It also offers a universal defense against various GIA types, especially for the sensitive areas in images.
[CV-135] Global Context-aware Representation Learning for Spatially Resolved Transcriptomics ICML2025
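[Quick read]: This paper addresses a limitation of graph-based methods in spatially resolved transcriptomics: they struggle to learn meaningful spot representations, especially for spots near spatial domain boundaries, because they rely too heavily on adjacent spots. The key to the proposed solution is the Spotscape framework, which introduces a Similarity Telescope module to capture global relationships between multiple spots and a similarity scaling strategy to regulate the distances between intra-slice and inter-slice spots, enabling effective multi-slice integration.
Link: https://arxiv.org/abs/2506.15698
Authors: Yunhak Oh,Junseok Lee,Yeongmin Kim,Sangwoo Seo,Namkyeong Lee,Chanyoung Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025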
Abstract:Spatially Resolved Transcriptomics (SRT) is a cutting-edge technique that captures the spatial context of cells within tissues, enabling the study of complex biological networks. Recent graph-based methods leverage both gene expression and spatial information to identify relevant spatial domains. However, these approaches fall short in obtaining meaningful spot representations, especially for spots near spatial domain boundaries, as they heavily emphasize adjacent spots that have minimal feature differences from an anchor node. To address this, we propose Spotscape, a novel framework that introduces the Similarity Telescope module to capture global relationships between multiple spots. Additionally, we propose a similarity scaling strategy to regulate the distances between intra- and inter-slice spots, facilitating effective multi-slice integration. Extensive experiments demonstrate the superiority of Spotscape in various downstream tasks, including single-slice and multi-slice scenarios. Our code is available at the following link: https://github.com/yunhak0/Spotscape.
[CV-136] Proportional Sensitivity in Generative Adversarial Network (GAN)-Augmented Brain Tumor Classification Using Convolutional Neural Network
[Quick read]: This paper examines how the limited size of medical imaging datasets affects deep learning performance, focusing on brain tumor MRI classification. The key to its approach is to generate synthetic images with a generative adversarial network (GAN), mix them with real images at different ratios to train a convolutional neural network (CNN), and measure how the synthetic data affects classification performance.
Link: https://arxiv.org/abs/2506.17165
Authors: Mahin Montasir Afif,Abdullah Al Noman,K. M. Tahsin Kabir,Md. Mortuza Ahmmed,Md. Mostafizur Rahman,Mufti Mahmud,Md. Ashraful Babu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been submitted to The 18th International Conference on Brain Informatics (BI’25), Italy
Abstract:Generative Adversarial Networks (GAN) have shown potential in expanding limited medical imaging datasets. This study explores how different ratios of GAN-generated and real brain tumor MRI images impact the performance of a CNN in classifying healthy vs. tumorous scans. A DCGAN was used to create synthetic images which were mixed with real ones at various ratios to train a custom CNN. The CNN was then evaluated on a separate real-world test set. Our results indicate that the model maintains high sensitivity and precision in tumor classification, even when trained predominantly on synthetic data. When only a small portion of GAN data was added, such as 900 real images and 100 GAN images, the model achieved excellent performance, with test accuracy reaching 95.2%, and precision, recall, and F1-score all exceeding 95%. However, as the proportion of GAN images increased further, performance gradually declined. This study suggests that while GANs are useful for augmenting limited datasets especially when real data is scarce, too much synthetic data can introduce artifacts that affect the model’s ability to generalize to real world cases.
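The ratio experiment in this abstract depends on building training sets with a controlled share of synthetic images. The sketch below shows one way to do that, assuming both image pools are NumPy arrays of equal image shape; the function name `mix_real_and_synthetic` and its parameters are hypothetical, and class labels are omitted for brevity.

```python
import numpy as np

def mix_real_and_synthetic(real, synthetic, n_total, synthetic_fraction, seed=0):
    """Sample a shuffled training set with a fixed share of synthetic images."""
    rng = np.random.default_rng(seed)
    n_syn = int(round(n_total * synthetic_fraction))
    n_real = n_total - n_syn
    if n_real > len(real) or n_syn > len(synthetic):
        raise ValueError("not enough images for the requested mix")
    real_idx = rng.choice(len(real), size=n_real, replace=False)
    syn_idx = rng.choice(len(synthetic), size=n_syn, replace=False)
    images = np.concatenate([real[real_idx], synthetic[syn_idx]])
    # Keep a flag per image so the synthetic share can be audited later.
    is_synthetic = np.concatenate(
        [np.zeros(n_real, dtype=bool), np.ones(n_syn, dtype=bool)]
    )
    order = rng.permutation(n_total)
    return images[order], is_synthetic[order]

# Toy pools standing in for real MRI slices and DCGAN outputs.
real = np.zeros((1000, 8, 8), dtype=np.float32)
fake = np.ones((1000, 8, 8), dtype=np.float32)

# 900 real + 100 synthetic, matching the split quoted in the abstract.
x, flag = mix_real_and_synthetic(real, fake, n_total=1000, synthetic_fraction=0.1)
```

Sweeping `synthetic_fraction` while always evaluating on a held-out real test set mirrors the comparison the abstract describes.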
[CV-137] MeDi: Metadata-Guided Diffusion Models for Mitigating Biases in Tumor Classification
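[Quick read]: This paper addresses the poor generalization of deep learning models in clinical practice caused by a lack of robustness to varying conditions such as staining, scanner, hospital, and demographics. Models trained on overrepresented subpopulations struggle with less frequent patterns, which leads to shortcut learning and biased predictions. The key to the proposed solution is MeDi (Metadata-guided generative Diffusion model), a framework that models metadata explicitly and generates synthetic data to augment underrepresented subpopulations, balancing limited training data and mitigating bias in downstream models.
Link: https://arxiv.org/abs/2506.17140
Authors: David Jacob Drexlin,Jonas Dippel,Julius Hense,Niklas Prenißl,Grégoire Montavon,Frederick Klauschen,Klaus-Robert Müller
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: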
Abstract:Deep learning models have made significant advances in histological prediction tasks in recent years. However, for adaptation in clinical practice, their lack of robustness to varying conditions such as staining, scanner, hospital, and demographics is still a limiting factor: if trained on overrepresented subpopulations, models regularly struggle with less frequent patterns, leading to shortcut learning and biased predictions. Large-scale foundation models have not fully eliminated this issue. Therefore, we propose a novel approach explicitly modeling such metadata into a Metadata-guided generative Diffusion model framework (MeDi). MeDi allows for a targeted augmentation of underrepresented subpopulations with synthetic data, which balances limited training data and mitigates biases in downstream models. We experimentally show that MeDi generates high-quality histopathology images for unseen subpopulations in TCGA, boosts the overall fidelity of the generated images, and enables improvements in performance for downstream classifiers on datasets with subpopulation shifts. Our work is a proof-of-concept towards better mitigating data biases with generative models.
[CV-138] Robust Training with Data Augmentation for Medical Imaging Classification
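[Quick read]: This paper addresses the vulnerability of deep neural networks to adversarial attacks and distribution shifts in medical image classification, which can affect diagnostic reliability and undermine the trust of healthcare professionals. The key to the proposed solution is a robust training algorithm with data augmentation (RTDA), which introduces diverse data augmentation during training to improve robustness to adversarial perturbations and distribution shifts while maintaining high clean accuracy.
Link: https://arxiv.org/abs/2506.17133
Authors: Josué Martínez-Martínez,Olivia Brown,Mostafa Karami,Sheida Nabavi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: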
Abstract:Deep neural networks are increasingly being used to detect and diagnose medical conditions using medical imaging. Despite their utility, these models are highly vulnerable to adversarial attacks and distribution shifts, which can affect diagnostic reliability and undermine trust among healthcare professionals. In this study, we propose a robust training algorithm with data augmentation (RTDA) to mitigate these vulnerabilities in medical image classification. We benchmark classifier robustness against adversarial perturbations and natural variations of RTDA and six competing baseline techniques, including adversarial training and data augmentation approaches in isolation and combination, using experimental data sets with three different imaging technologies (mammograms, X-rays, and ultrasound). We demonstrate that RTDA achieves superior robustness against adversarial attacks and improved generalization performance in the presence of distribution shift in each image classification task while maintaining high clean accuracy.
[CV-139] PET Tracer Separation Using Conditional Diffusion Transformer with Multi-latent Space Learning
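[Quick read]: This paper addresses the difficulty of separating tracer signals in multi-tracer positron emission tomography (PET): the gamma-photon pairs produced by different tracers have the same energy, so the individual tracer signals cannot be distinguished directly. The key to the proposed solution is a multi-latent space guided texture conditional diffusion transformer model (MS-CDT), which integrates diffusion and transformer architectures into a unified optimization framework and adds texture masks as conditional inputs to enhance image details. Using multi-latent space priors derived from different tracers, the model captures multi-level feature representations, aiming to balance computational efficiency and detail preservation.
Link: https://arxiv.org/abs/2506.16934
Authors: Bin Huang,Feihong Xu,Xinchong Shi,Shan Huang,Binxuan Li,Fei Li,Qiegen Liu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: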
Abstract:In clinical practice, single-radiotracer positron emission tomography (PET) is commonly used for imaging. Although multi-tracer PET imaging can provide supplementary information of radiotracers that are sensitive to physiological function changes, enabling a more comprehensive characterization of physiological and pathological states, the gamma-photon pairs generated by positron annihilation reactions of different tracers in PET imaging have the same energy, making it difficult to distinguish the tracer signals. In this study, a multi-latent space guided texture conditional diffusion transformer model (MS-CDT) is proposed for PET tracer separation. To the best of our knowledge, this is the first attempt to use texture condition and multi-latent space for tracer separation in PET imaging. The proposed model integrates diffusion and transformer architectures into a unified optimization framework, with the novel addition of texture masks as conditional inputs to enhance image details. By leveraging multi-latent space prior derived from different tracers, the model captures multi-level feature representations, aiming to balance computational efficiency and detail preservation. The texture masks, serving as conditional guidance, help the model focus on salient structural patterns, thereby improving the extraction and utilization of fine-grained image textures. When combined with the diffusion transformer backbone, this conditioning mechanism contributes to more accurate and robust tracer separation. To evaluate its effectiveness, the proposed MS-CDT is compared with several advanced methods on two types of 3D PET datasets: brain and chest scans. Experimental results indicate that MS-CDT achieved competitive performance in terms of image quality and preservation of clinically relevant information. Code is available at: this https URL.
[CV-140] Temperature calibration of surface emissivities with an improved thermal image enhancement network
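[Quick read]: This paper addresses temperature accuracy problems in infrared thermography caused by variations in material emissivity; existing methods usually neglect the joint optimization of radiometric calibration and image degradation. The key to the proposed solution is a physically guided neural framework that unifies temperature correction and image enhancement through a symmetric skip-CNN and an emissivity-aware attention module. A pre-processing stage segments the regions of interest (ROIs) and applies an initial emissivity correction. A novel dual-constrained loss function strengthens the statistical consistency between target and reference regions through mean-variance alignment and histogram matching based on Kullback-Leibler divergence. The framework dynamically fuses thermal radiation features and spatial context, suppressing emissivity artifacts while recovering structural details.
Link: https://arxiv.org/abs/2506.16803
Authors: Ning Chu,Siya Zheng,Shanqing Zhang,Li Li,Caifang Cai,Ali Mohammad-Djafari,Feng Zhao,Yuanbo Song
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: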
Abstract:Infrared thermography faces persistent challenges in temperature accuracy due to material emissivity variations, where existing methods often neglect the joint optimization of radiometric calibration and image degradation. This study introduces a physically guided neural framework that unifies temperature correction and image enhancement through a symmetric skip-CNN architecture and an emissivity-aware attention module. The pre-processing stage segments the ROIs of the image and applies an initial emissivity correction. A novel dual-constrained loss function strengthens the statistical consistency between the target and reference regions through mean-variance alignment and histogram matching based on Kullback-Leibler divergence. The method works by dynamically fusing thermal radiation features and spatial context, and the model suppresses emissivity artifacts while recovering structural details. Validated on an industrial blower system under different conditions, the improved network realizes the dynamic fusion of thermal radiation characteristics and spatial background, with accurate calibration results in various industrial conditions.
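The two statistical constraints named in this abstract, mean-variance alignment and histogram matching via Kullback-Leibler divergence, can be sketched in NumPy as follows. The function names, the weights `w_stat` and `w_hist`, the bin count, and the value range are all assumptions for illustration; the paper's exact loss is not given in the abstract. `np.histogram` is not differentiable, so a training implementation would need a soft histogram instead.

```python
import numpy as np

def mean_variance_loss(target, reference):
    """Squared gap between the first two moments of the two regions."""
    return float(
        (target.mean() - reference.mean()) ** 2
        + (target.var() - reference.var()) ** 2
    )

def histogram_kl(target, reference, bins=32, value_range=(0.0, 1.0), eps=1e-8):
    """KL(p || q) between smoothed, normalised intensity histograms."""
    p, _ = np.histogram(target, bins=bins, range=value_range)
    q, _ = np.histogram(reference, bins=bins, range=value_range)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def dual_constrained_loss(target, reference, w_stat=1.0, w_hist=1.0):
    """Weighted sum of the two constraints (weights are hypothetical)."""
    return (
        w_stat * mean_variance_loss(target, reference)
        + w_hist * histogram_kl(target, reference)
    )

rng = np.random.default_rng(0)
region = rng.uniform(0.2, 0.6, size=(16, 16))  # toy normalised temperatures
shifted = region + 0.2                          # same shape, biased reading

loss_same = dual_constrained_loss(region, region)
loss_shifted = dual_constrained_loss(region, shifted)
```

The loss is zero when the target and reference regions have identical statistics and grows as the temperature distribution of one region drifts from the other.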
[CV-141] A Prior-Guided Joint Diffusion Model in Projection Domain for PET Tracer Conversion
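[Quick read]: This paper addresses the limited effectiveness of 18F-FDG PET for certain tumors, together with the obstacles that keep 18F-DOPA PET from wide use: complex synthesis and limits on transportation and clinical application. The key to the proposed solution is a prior-guided joint diffusion model (PJDM) that converts 18F-FDG PET images into 18F-DOPA PET images in the projection domain. A coarse estimation model and a prior refinement model are trained independently; at inference, a higher-order hybrid sampler generates an initial synthetic sinogram, which is then degraded and used as an additional condition for iterative refinement, improving sinogram quality and the synthesized result.
Link: https://arxiv.org/abs/2506.16733
Authors: Fang Chen,Weifeng Zhang,Xingyu Ai,BingXuan Li,An Li,Qiegen Liu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: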
Abstract:Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neurological disorders. However, its complex synthesis and limitations in transportation and clinical use hinder widespread adoption. During PET imaging, the sinogram represents a form of raw data acquired by the scanner. Therefore, modeling in projection domain enables more direct utilization of the original information, potentially reducing the accumulation of errors introduced during the image reconstruction process. Inspired by these factors, this study proposes a prior-guided joint diffusion model (PJDM) for transforming 18F-FDG PET images into 18F-DOPA PET images in projection domain. Specifically, a coarse estimation model and a prior refinement model are trained independently. During inference, an initial synthetic 18F-DOPA PET sinogram is generated using a higher-order hybrid sampler. This sinogram is then degraded and serves as an additional condition to guide the iterative refinement process using learned prior. Experimental results demonstrated that PJDM effectively improved both sinogram quality and synthetic outcomes. The code is available at: this https URL.
[CV-142] Overfitting in Histopathology Model Training: The Need for Customized Architectures
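[Quick read]: This paper addresses overfitting in deep learning models applied to histopathology image analysis. It finds that directly adopting and fine-tuning large-scale models designed for natural images often leads to suboptimal performance and significant overfitting. The key to the solution is to develop customized architectures designed specifically for histopathology rather than simply increasing model capacity. Experiments show that, on limited datasets, simpler domain-specific architectures can match or exceed complex models while reducing overfitting.
Link: https://arxiv.org/abs/2506.16631
Authors: Saghir Alfasly,Ghazal Alabtah,H.R. Tizhoosh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: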
Abstract:This study investigates the critical problem of overfitting in deep learning models applied to histopathology image analysis. We show that simply adopting and fine-tuning large-scale models designed for natural image analysis often leads to suboptimal performance and significant overfitting when applied to histopathology tasks. Through extensive experiments with various model architectures, including ResNet variants and Vision Transformers (ViT), we show that increasing model capacity does not necessarily improve performance on histopathology datasets. Our findings emphasize the need for customized architectures specifically designed for histopathology image analysis, particularly when working with limited datasets. Using Oesophageal Adenocarcinomas public dataset, we demonstrate that simpler, domain-specific architectures can achieve comparable or better performance while minimizing overfitting.
[CV-143] Exoplanet Classification through Vision Transformers with Temporal Image Analysis
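[Quick read]: This paper addresses exoplanet classification, a longstanding problem in astronomy that requires significant computational and observational resources. Traditional methods are slow, laborious, and costly, which motivates more advanced machine learning techniques. The key to the proposed solution is to convert raw light curves from NASA's Kepler mission into Gramian Angular Fields (GAFs) and Recurrence Plots (RPs) and to process these images with a Vision Transformer (ViT) to capture complex temporal dependencies. With 5-fold cross-validation, RPs outperform GAFs, and the ViT model reaches 89.46% recall and 85.09% precision in identifying exoplanetary transits.
Link: https://arxiv.org/abs/2506.16597
Authors: Anupma Choudhary,Sohith Bandari,B.S.Kushvah,C. Swastik
Affiliations: Indian Institute of Technology (Indian School of Mines); Indian Institute of Astrophysics; Pondicherry University
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the Astronomical Journal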
Abstract:The classification of exoplanets has been a longstanding challenge in astronomy, requiring significant computational and observational resources. Traditional methods demand substantial effort, time, and cost, highlighting the need for advanced machine learning techniques to enhance classification efficiency. In this study, we propose a methodology that transforms raw light curve data from NASA’s Kepler mission into Gramian Angular Fields (GAFs) and Recurrence Plots (RPs) using the Gramian Angular Difference Field and recurrence plot techniques. These transformed images serve as inputs to the Vision Transformer (ViT) model, leveraging its ability to capture intricate temporal dependencies. We assess the performance of the model through recall, precision, and F1 score metrics, using a 5-fold cross-validation approach to obtain a robust estimate of the model’s performance and reduce evaluation bias. Our comparative analysis reveals that RPs outperform GAFs, with the ViT model achieving an 89.46 % recall and an 85.09 % precision rate, demonstrating its significant capability in accurately identifying exoplanetary transits. Despite using under-sampling techniques to address class imbalance, dataset size reduction remains a limitation. This study underscores the importance of further research into optimizing model architectures to enhance automation, performance, and generalization of the model.
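The two image encodings named in this abstract have standard textbook definitions, sketched below in NumPy. The toy box-shaped transit, the threshold value, and the use of a recurrence plot without time-delay embedding are simplifying assumptions; the paper's preprocessing of Kepler light curves is not described in the abstract.

```python
import numpy as np

def gramian_angular_difference_field(series):
    """GADF[i, j] = sin(phi_i - phi_j), with phi = arccos of the rescaled series."""
    x = np.asarray(series, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("constant series cannot be rescaled")
    x = 2.0 * (x - x.min()) / span - 1.0  # rescale to [-1, 1]
    x = np.clip(x, -1.0, 1.0)             # guard against rounding drift
    phi = np.arccos(x)
    return np.sin(phi[:, None] - phi[None, :])

def recurrence_plot(series, threshold):
    """R[i, j] = 1 where two samples are closer than the threshold."""
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    return (dist <= threshold).astype(np.uint8)

# Toy light curve: flat flux with a 1% box-shaped dip in the middle.
t = np.linspace(0.0, 1.0, 64)
flux = 1.0 - 0.01 * (np.abs(t - 0.5) < 0.05)

gadf = gramian_angular_difference_field(flux)
rp = recurrence_plot(flux, threshold=0.002)
```

Both outputs are square images of side length equal to the series length, so they can be resized and passed to an image model such as a ViT.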
[CV-144] Hybrid Attention Network for Accurate Breast Tumor Segmentation in Ultrasound Images
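[Quick read]: This paper addresses automated tumor segmentation in breast ultrasound images, which is hard because of inherent noise, variation in lesion scale, and fuzzy boundaries. The key to the proposed solution is a hybrid attention-based network that uses a pre-trained DenseNet121 encoder for robust feature extraction and a multi-branch attention-enhanced decoder. The bottleneck combines Global Spatial Attention, Position Encoding, and Scaled Dot-Product Attention to learn global context, spatial relationships, and relative positional features. A Spatial Feature Enhancement Block embedded at the skip connections and a hybrid loss that combines Binary Cross-Entropy with Jaccard Index loss further improve robustness to class imbalance and irregular tumor shapes.
Link: https://arxiv.org/abs/2506.16592
Authors: Muhammad Azeem Aslam,Asim Naveed,Nisar Ahmed
Affiliations: Xi’an Eurasia University; University of Engineering and Technology Lahore
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: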
Abstract:Breast ultrasound imaging is a valuable tool for early breast cancer detection, but automated tumor segmentation is challenging due to inherent noise, variations in scale of lesions, and fuzzy boundaries. To address these challenges, we propose a novel hybrid attention-based network for lesion segmentation. Our proposed architecture integrates a pre-trained DenseNet121 in the encoder part for robust feature extraction with a multi-branch attention-enhanced decoder tailored for breast ultrasound images. The bottleneck incorporates Global Spatial Attention (GSA), Position Encoding (PE), and Scaled Dot-Product Attention (SDPA) to learn global context, spatial relationships, and relative positional features. The Spatial Feature Enhancement Block (SFEB) is embedded at skip connections to refine and enhance spatial features, enabling the network to focus more effectively on tumor regions. A hybrid loss function combining Binary Cross-Entropy (BCE) and Jaccard Index loss optimizes both pixel-level accuracy and region-level overlap metrics, enhancing robustness to class imbalance and irregular tumor shapes. Experiments on public datasets demonstrate that our method outperforms existing approaches, highlighting its potential to assist radiologists in early and accurate breast cancer diagnosis.
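The hybrid loss mentioned in this abstract combines a pixel-level term with a region-overlap term. The NumPy sketch below uses the common soft Jaccard formulation and equal weights; both choices, along with the function names, are assumptions, since the abstract does not state the weighting.

```python
import numpy as np

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def soft_jaccard_loss(prob, target, eps=1e-7):
    """1 - soft IoU, computed on probabilities rather than hard masks."""
    inter = np.sum(prob * target)
    union = np.sum(prob) + np.sum(target) - inter
    return float(1.0 - (inter + eps) / (union + eps))

def hybrid_loss(prob, target, w_bce=0.5, w_jac=0.5):
    """Weighted sum of both terms (weights are hypothetical)."""
    return w_bce * bce_loss(prob, target) + w_jac * soft_jaccard_loss(prob, target)

# Toy 8x8 mask with a square "tumor" and two candidate predictions.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
good = np.where(target == 1.0, 0.9, 0.1)  # confident and correct
bad = 1.0 - good                           # confident and wrong

loss_good = hybrid_loss(good, target)
loss_bad = hybrid_loss(bad, target)
```

The Jaccard term depends on overlap rather than on the count of background pixels, which is why this pairing is often used when lesions occupy a small part of the image.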
[CV-145] DiffO: Single-step Diffusion for Image Compression at Ultra-Low Bitrates
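[Quick read]: This paper addresses the severe quality degradation of image compression at extremely low bitrates, along with the limited perceptual quality and high decoding latency of diffusion-based codecs. The key to the proposed solution is DiffO, described as the first single-step diffusion model for image compression, built on two innovations: VQ residual training, which factorizes a structural base code and a learned residual in latent space to capture both global geometry and high-frequency details, and rate-adaptive noise modulation, which adjusts denoising strength on the fly to match the target bitrate.
Link: https://arxiv.org/abs/2506.16572
Authors: Chanung Park,Joo Chan Lee,Jong Hwan Ko
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: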
Abstract:Although image compression is fundamental to visual data processing and has inspired numerous standard and learned codecs, these methods still suffer severe quality degradation at extremely low bits per pixel. While recent diffusion based models provided enhanced generative performance at low bitrates, they still yield limited perceptual quality and prohibitive decoding latency due to multiple denoising steps. In this paper, we propose the first single step diffusion model for image compression (DiffO) that delivers high perceptual quality and fast decoding at ultra low bitrates. DiffO achieves these goals by coupling two key innovations: (i) VQ Residual training, which factorizes a structural base code and a learned residual in latent space, capturing both global geometry and high frequency details; and (ii) rate adaptive noise modulation, which tunes denoising strength on the fly to match the desired bitrate. Extensive experiments show that DiffO surpasses state of the art compression performance while improving decoding speed by about 50x compared to prior diffusion-based methods, greatly improving the practicality of generative codecs. The code will be available at this https URL.
[CV-146] VesselSDF: Distance Field Priors for Vascular Network Reconstruction
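[Quick read]: This paper addresses accurate segmentation of vascular networks from sparse CT slices, which is difficult because vessels are thin and branching and because the imaging planes are sparse. Existing deep learning methods based on binary voxel classification struggle with structural continuity and geometric fidelity. The key to the proposed solution is VesselSDF, a framework that uses signed distance fields (SDFs) for robust vessel reconstruction. It recasts vessel segmentation as a continuous SDF regression problem in which each point in the volume is represented by its signed distance to the nearest vessel surface, which naturally captures the smooth tubular geometry and branching patterns of vessels. An adaptive Gaussian regularizer removes common SDF artifacts and yields more precise geometry.
Link: https://arxiv.org/abs/2506.16556
Authors: Salvatore Esposito,Daniel Rebain,Arno Onken,Changjian Li,Oisin Mac Aodha
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: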
Abstract:Accurate segmentation of vascular networks from sparse CT scan slices remains a significant challenge in medical imaging, particularly due to the thin, branching nature of vessels and the inherent sparsity between imaging planes. Existing deep learning approaches, based on binary voxel classification, often struggle with structural continuity and geometric fidelity. To address this challenge, we present VesselSDF, a novel framework that leverages signed distance fields (SDFs) for robust vessel reconstruction. Our method reformulates vessel segmentation as a continuous SDF regression problem, where each point in the volume is represented by its signed distance to the nearest vessel surface. This continuous representation inherently captures the smooth, tubular geometry of blood vessels and their branching patterns. We obtain accurate vessel reconstructions while eliminating common SDF artifacts such as floating segments, thanks to our adaptive Gaussian regularizer which ensures smoothness in regions far from vessel surfaces while producing precise geometry near the surface boundaries. Our experimental results demonstrate that VesselSDF significantly outperforms existing methods and preserves vessel geometry and connectivity, enabling more reliable vascular analysis in clinical settings.
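The signed distance representation in this abstract can be illustrated by deriving an SDF regression target from a binary vessel mask. The brute-force NumPy sketch below is only suitable for tiny grids, takes the sign as negative inside the vessel (a convention chosen here, not stated in the abstract), and measures distance to the nearest voxel of the opposite class as an approximation of distance to the surface.

```python
import numpy as np

def signed_distance_field(mask):
    """Approximate SDF of a binary mask: negative inside, positive outside."""
    mask = np.asarray(mask, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("mask must contain both foreground and background")
    coords = np.argwhere(np.ones_like(mask))  # every voxel index, C order
    inside = coords[mask.ravel()]
    outside = coords[~mask.ravel()]

    def nearest(points, others):
        # Pairwise Euclidean distances, then the minimum per point.
        diff = points[:, None, :] - others[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

    sdf = np.empty(mask.shape, dtype=float)
    sdf[mask] = -nearest(inside, outside)
    sdf[~mask] = nearest(outside, inside)
    return sdf

# Toy 7x7 slice with a one-voxel-wide vertical "vessel" in column 3.
vessel = np.zeros((7, 7), dtype=bool)
vessel[:, 3] = True
sdf = signed_distance_field(vessel)
```

Thresholding a predicted field at zero recovers a segmentation, while the continuous values carry geometric information that a hard binary label does not.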
[CV-147] AGE-US: automated gestational age estimation based on fetal ultrasound images
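[Quick read]: This paper addresses accurate gestational age estimation for fetal growth monitoring, particularly when traditional methods such as estimation from the last menstrual period are unreliable or unavailable, and when ultrasound-based methods suffer from the variability of manual measurements. The key to the proposed solution is an interpretable deep learning method that uses a novel segmentation architecture and distance maps to overcome dataset limitations and the scarcity of segmentation masks, enabling automated and accurate gestational age calculation.
Link: https://arxiv.org/abs/2506.16256
Authors: César Díaz-Parga,Marta Nuñez-Garcia,Maria J. Carreira,Gabriel Bernardino,Nicolás Vila-Blanco
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA) 2025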
Abstract:Being born small carries significant health risks, including increased neonatal mortality and a higher likelihood of future cardiac diseases. Accurate estimation of gestational age is critical for monitoring fetal growth, but traditional methods, such as estimation based on the last menstrual period, are in some situations difficult to obtain. While ultrasound-based approaches offer greater reliability, they rely on manual measurements that introduce variability. This study presents an interpretable deep learning-based method for automated gestational age calculation, leveraging a novel segmentation architecture and distance maps to overcome dataset limitations and the scarcity of segmentation masks. Our approach achieves performance comparable to state-of-the-art models while reducing complexity, making it particularly suitable for resource-constrained settings and with limited annotated data. Furthermore, our results demonstrate that the use of distance maps is particularly suitable for estimating femur endpoints.
[CV-148] CF-Seg: Counterfactuals meet Segmentation MICCAI2025
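[Quick read]: This paper addresses accurate segmentation of anatomical structures in medical images when disease is present. Disease patterns can alter the appearance of surrounding healthy tissue, introduce ambiguous boundaries, or obscure critical anatomical structures, which degrades conventional segmentation models and can lead to misdiagnosis. The key to the proposed solution is to generate counterfactual (CF) images that simulate how the same anatomy would look without disease and to segment those images without changing the underlying segmentation model, improving segmentation accuracy and supporting downstream clinical decisions.
Link: https://arxiv.org/abs/2506.16213
Authors: Raghav Mehta,Fabio De Sousa Ribeiro,Tian Xia,Melanie Roschewitz,Ainkaran Santhirasekaram,Dominic C. Marshall,Ben Glocker
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025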
Abstract:Segmenting anatomical structures in medical images plays an important role in the quantitative assessment of various diseases. However, accurate segmentation becomes significantly more challenging in the presence of disease. Disease patterns can alter the appearance of surrounding healthy tissues, introduce ambiguous boundaries, or even obscure critical anatomical structures. As such, segmentation models trained on real-world datasets may struggle to provide good anatomical segmentation, leading to potential misdiagnosis. In this paper, we generate counterfactual (CF) images to simulate how the same anatomy would appear in the absence of disease without altering the underlying structure. We then use these CF images to segment structures of interest, without requiring any changes to the underlying segmentation model. Our experiments on two real-world clinical chest X-ray datasets show that the use of counterfactual images improves anatomical segmentation, thereby aiding downstream clinical decision-making.
[CV-149] From Coarse to Continuous: Progressive Refinement Implicit Neural Representation for Motion-Robust Anisotropic MRI Reconstruction
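[Quick read]: This paper addresses the recovery of anatomically consistent 3D brain volumes from 2D slices in motion-robust magnetic resonance imaging (MRI), especially under accelerated acquisition or patient motion, where local detail loss, global structural aliasing, and volumetric anisotropy all occur. The key to the proposed solution is a progressive refinement implicit neural representation (PR-INR) framework that unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Reconstruction quality improves stage by stage: a motion-aware diffusion module produces a coarse volumetric reconstruction, an implicit detail restoration module performs residual refinement, and a voxel continuous-aware representation module enables accurate inter-slice completion and high-frequency detail recovery.
Link: https://arxiv.org/abs/2506.16210
Authors: Zhenxuan Zhang,Lipei Zhang,Yanqi Cheng,Zi Wang,Fanwen Wang,Haosen Zhang,Yue Yang,Yinzhe Wu,Jiahao Huang,Angelica I Aviles-Rivero,Zhifan Gao,Guang Yang,Peter J. Lally
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: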
Abstract:In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions. It includes local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x) and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.
[CV-150] Enhanced Dermatology Image Quality Assessment via Cross-Domain Training
[Quick read]: This paper addresses poor image quality in teledermatology, which seriously reduces the usefulness of remote consultations. The key to the proposed solution is cross-domain training of Image Quality Assessment (IQA) models on combined dermatology and non-dermatology IQA datasets. This overcomes the small scale of dermatology IQA data, improves the handling of many types of image distortion, and leads to better image quality management in teledermatology.
Link: https://arxiv.org/abs/2506.16116
Authors: Ignacio Hernández Montilla,Alfonso Medela,Paola Pasquali,Andy Aguilar,Taig Mac Carthy,Gerardo Fernández,Antonio Martorell,Enrique Onieva
Affiliations: Legit.Health; University of Deusto; Pius Hospital de Valls; Dermatology Department, Hospital de Manises
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures. This manuscript has been accepted to the 2025 12th International Conference on Bioinformatics Research and Applications (ICBRA 2025). It will be published in International Conference Proceedings by ACM, which will be archived in ACM Digital Library, indexed by Ei Compendex and Scopus
Abstract:Teledermatology has become a widely accepted communication method in daily clinical practice, enabling remote care while showing strong agreement with in-person visits. Poor image quality remains an unsolved problem in teledermatology and is a major concern to practitioners, as bad-quality images reduce the usefulness of the remote consultation process. However, research on Image Quality Assessment (IQA) in dermatology is sparse, and does not leverage the latest advances in non-dermatology IQA, such as using larger image databases with ratings from large groups of human observers. In this work, we propose cross-domain training of IQA models, combining dermatology and non-dermatology IQA datasets. For this purpose, we created a novel dermatology IQA database, Legit.Health-DIQA-Artificial, using dermatology images from several sources and having them annotated by a group of human observers. We demonstrate that cross-domain training yields optimal performance across domains and overcomes one of the biggest limitations in dermatology IQA, which is the small scale of data, and leads to models trained on a larger pool of image distortions, resulting in a better management of image quality in the teledermatology process.
[CV-151] Fast Training-free Perceptual Image Compression
Quick read: This paper targets the long decoding time of training-free perceptual image codecs, whose reliance on diffusion inversion or sample communication typically means more than one minute to decode a single image. The key to its solution is a training-free algorithm that substantially shortens decoding time while preserving perceptual quality, with implementations optimized for different decoding-time budgets (about 0.1 s, 0.1-10 s, and ≥ 10 s).
Link: https://arxiv.org/abs/2506.16102
Authors: Ziran Zhu,Tongda Xu,Minye Huang,Dailan He,Xingtong Ge,Xinjie Zhang,Ling Li,Yan Wang
Affiliations: Institute for AI Industry Research, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Institute of Software, Chinese Academy of Sciences; SenseTime Research; The Chinese University of Hong Kong; Harbin Institute of Technology; Hong Kong University of Science and Technology; University of Chinese Academy of Sciences
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Training-free perceptual image codecs adopt a pre-trained unconditional generative model during decoding to avoid training a new conditional generative model. However, they rely heavily on diffusion inversion or sample communication, which take from 1 min to an intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with a theoretical guarantee. We further propose different implementations for optimal perceptual quality when the decoding time budget is ≈ 0.1 s, 0.1-10 s, and ≥ 10 s. Our approach: 1) improves the decoding time of training-free codecs from 1 min to 0.1-10 s with comparable perceptual quality; 2) can be applied to non-differentiable codecs such as VTM; 3) can be used to improve previous perceptual codecs, such as MS-ILLM; 4) can easily achieve a perception-distortion trade-off. Empirically, we show that our approach successfully improves the perceptual quality of ELIC, VTM and MS-ILLM with fast decoding. Our approach achieves FID comparable to previous training-free codecs with significantly less decoding time, and it still outperforms previous conditional-generative-model-based codecs such as HiFiC and MS-ILLM in terms of FID. The source code is provided in the supplementary material.
[CV-152] Bias Variation Compensation in Perimeter-Gated SPAD TRNGs
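Quick read: This paper addresses bias variation (BV) in random number generators built on arrays of entropy sources. Efficient debiasing algorithms exist, but optimized hardware-friendly implementations depend on the bit bias in the raw bit streams and cannot accommodate a wide BV. The key to its solution is a 64×64 array of perimeter-gated single-photon avalanche diodes (pgSPADs) used as the entropy source, combined with a BV compensation technique. By applying gate voltages chosen according to each device's native dark count rate, the authors obtain a raw-bit generation rate of 2 kHz per pixel at room temperature with less than 1% BV.
Link: https://arxiv.org/abs/2506.15888
Authors: Md Sakibur Sajal,Hunter Guthrie,Marc Dandin
Affiliations: Carnegie Mellon University
Categories: Instrumentation and Detectors (physics.ins-det); Hardware Architecture (cs.AR); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 8 figures, 1 software, accepted at MWSCAS 2025 conference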
Abstract:Random number generators that utilize arrays of entropy source elements suffer from bias variation (BV). Despite the availability of efficient debiasing algorithms, optimized implementations of hardware friendly options depend on the bit bias in the raw bit streams and cannot accommodate a wide BV. In this work, we present a 64 x 64 array of perimeter gated single photon avalanche diodes (pgSPADs), fabricated in a 0.35 µm standard CMOS technology, as a source of entropy to generate random binary strings with a BV compensation technique. By applying proper gate voltages based on the devices’ native dark count rates, we demonstrate less than 1% BV for a raw-bit generation rate of 2 kHz/pixel at room temperature. The raw bits were debiased using the classical iterative Von Neumann’s algorithm and the debiased bits were found to pass all of the 16 tests from NIST’s Statistical Test Suite.
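The debiasing step mentioned in this abstract can be illustrated with the basic von Neumann extractor. The sketch below shows only the classical single-pass version; the abstract refers to an iterative variant, which additionally recycles information from discarded pairs and is not reproduced here. The function name and the convention of emitting the first bit of each unequal pair are illustrative assumptions, not taken from the paper.

```python
def von_neumann_debias(bits):
    """Single-pass von Neumann extractor (illustrative sketch).

    Reads the input in non-overlapping pairs:
      (0, 1) -> emit 0
      (1, 0) -> emit 1
      (0, 0) or (1, 1) -> emit nothing
    For independent bits with a fixed bias, the emitted bits are unbiased.
    """
    out = []
    # Step through complete pairs only; a trailing single bit is ignored.
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)
    return out


raw = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
print(von_neumann_debias(raw))
```

The price of removing bias this way is throughput: at least half of the raw bits are consumed without producing output, and more when the raw bias is large, which is one reason a low per-pixel bias matters for the hardware.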
[CV-153] Cross-Modality Learning for Predicting IHC Biomarkers from HE-Stained Whole-Slide Images
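Quick read: This paper addresses the high cost, long turnaround, and resource demands of immunohistochemistry (IHC) staining by using deep learning to predict IHC staining patterns from hematoxylin and eosin (HE) whole-slide images (WSIs). The key to its solution is the HistoStainAlign framework, which integrates paired HE and IHC embeddings through a contrastive training strategy and so captures complementary morphological and molecular features without patch-level annotations or tissue registration.
Link: https://arxiv.org/abs/2506.15853
Authors: Amit Das,Naofumi Tomita,Kyle J. Syme,Weijie Ma,Paige O’Connor,Kristin N. Corbett,Bing Ren,Xiaoying Liu,Saeed Hassanpour
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: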
Abstract:Hematoxylin and Eosin (HE) staining is a cornerstone of pathological analysis, offering reliable visualization of cellular morphology and tissue architecture for cancer diagnosis, subtyping, and grading. Immunohistochemistry (IHC) staining provides molecular insights by detecting specific proteins within tissues, enhancing diagnostic accuracy, and improving treatment planning. However, IHC staining is costly, time-consuming, and resource-intensive, requiring specialized expertise. To address these limitations, this study proposes HistoStainAlign, a novel deep learning framework that predicts IHC staining patterns directly from HE whole-slide images (WSIs) by learning joint representations of morphological and molecular features. The framework integrates paired HE and IHC embeddings through a contrastive training strategy, capturing complementary features across staining modalities without patch-level annotations or tissue registration. The model was evaluated on gastrointestinal and lung tissue WSIs with three commonly used IHC stains: P53, PD-L1, and Ki-67. HistoStainAlign achieved weighted F1 scores of 0.735 [95% Confidence Interval (CI): 0.670-0.799], 0.830 [95% CI: 0.772-0.886], and 0.723 [95% CI: 0.607-0.836], respectively for these three IHC stains. Embedding analyses demonstrated the robustness of the contrastive alignment in capturing meaningful cross-stain relationships. Comparisons with a baseline model further highlight the advantage of incorporating contrastive learning for improved stain pattern prediction. This study demonstrates the potential of computational approaches to serve as a pre-screening tool, helping prioritize cases for IHC staining and improving workflow efficiency.
[CV-154] MoNetV2: Enhanced Motion Network for Freehand 3D Ultrasound Reconstruction
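Quick read: This paper addresses cumulative drift in freehand 3D ultrasound (US) reconstruction and insufficient reconstruction accuracy under complex motion trajectories. The key to its solution is an enhanced motion network (MoNetV2) with three components: a sensor-based temporal, multi-branch structure that fuses image and motion information; an online multi-level consistency constraint that exploits the inherent consistency of scans; and an online multi-modal self-supervised strategy that reduces cumulative error. Together these improve reconstruction accuracy and generalizability.
Link: https://arxiv.org/abs/2506.15835
Authors: Mingyuan Luo,Xin Yang,Zhongnuo Yan,Yan Cao,Yuanji Zhang,Xindi Hu,Jin Wang,Haoxuan Ding,Wei Han,Litao Sun,Dong Ni
Affiliations: Shenzhen University; Shenzhen RayShape Medical Technology Inc.; Zhejiang Provincial People’s Hospital; Shandong University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: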
Abstract:Three-dimensional (3D) ultrasound (US) aims to provide sonographers with the spatial relationships of anatomical structures, playing a crucial role in clinical diagnosis. Recently, deep-learning-based freehand 3D US has made significant advancements. It reconstructs volumes by estimating transformations between images without external tracking. However, image-only reconstruction poses difficulties in reducing cumulative drift and further improving reconstruction accuracy, particularly in scenarios involving complex motion trajectories. In this context, we propose an enhanced motion network (MoNetV2) to enhance the accuracy and generalizability of reconstruction under diverse scanning velocities and tactics. First, we propose a sensor-based temporal and multi-branch structure that fuses image and motion information from a velocity perspective to improve image-only reconstruction accuracy. Second, we devise an online multi-level consistency constraint that exploits the inherent consistency of scans to handle various scanning velocities and tactics. This constraint exploits both scan-level velocity consistency, path-level appearance consistency, and patch-level motion consistency to supervise inter-frame transformation estimation. Third, we distill an online multi-modal self-supervised strategy that leverages the correlation between network estimation and motion information to further reduce cumulative errors. Extensive experiments clearly demonstrate that MoNetV2 surpasses existing methods in both reconstruction quality and generalizability performance across three large datasets.
[CV-155] Diffusion-based Counterfactual Augmentation: Towards Robust and Interpretable Knee Osteoarthritis Grading
Quick read: This paper addresses two problems in automated grading of Knee Osteoarthritis (KOA) from radiographs: significant inter-observer variability and the limited robustness of deep learning models near critical decision boundaries. The key to its solution is Diffusion-based Counterfactual Augmentation (DCA), which generates targeted counterfactual examples to improve model robustness and interpretability. Its core mechanism navigates the latent space of a diffusion model with a Stochastic Differential Equation (SDE) that balances a classifier-informed boundary drive against a manifold constraint.
Link: https://arxiv.org/abs/2506.15748
Authors: Zhe Wang,Yuhua Ru,Aladine Chetouani,Tina Shiang,Fang Chen,Fabian Bauer,Liping Zhang,Didier Hans,Rachid Jennane,William Ewing Palmer,Mohamed Jarraya,Yung Hsin Chen
Affiliations: Harvard Medical School; Jiangsu Institute of Hematology, The First Affiliated Hospital of Soochow University; L2TI Laboratory, University Sorbonne Paris Nord; Department of Medical School, Henan University of Chinese Medicine; Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne; Department of Electrical and Electronic Engineering, University of Hong Kong; Nuclear Medicine Division, Geneva University Hospital; IDP Institute, UMR CNRS 7013, University of Orleans
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automated grading of Knee Osteoarthritis (KOA) from radiographs is challenged by significant inter-observer variability and the limited robustness of deep learning models, particularly near critical decision boundaries. To address these limitations, this paper proposes a novel framework, Diffusion-based Counterfactual Augmentation (DCA), which enhances model robustness and interpretability by generating targeted counterfactual examples. The method navigates the latent space of a diffusion model using a Stochastic Differential Equation (SDE), governed by balancing a classifier-informed boundary drive with a manifold constraint. The resulting counterfactuals are then used within a self-corrective learning strategy to improve the classifier by focusing on its specific areas of uncertainty. Extensive experiments on the public Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) datasets demonstrate that this approach significantly improves classification accuracy across multiple model architectures. Furthermore, the method provides interpretability by visualizing minimal pathological changes and revealing that the learned latent space topology aligns with clinical knowledge of KOA progression. The DCA framework effectively converts model uncertainty into a robust training signal, offering a promising pathway to developing more accurate and trustworthy automated diagnostic systems. Our code is available at this https URL.
[CV-156] Pixel-wise Modulated Dice Loss for Medical Image Segmentation
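Quick read: This paper addresses the effect of data imbalance on neural networks in medical image segmentation, covering both class imbalance and difficulty imbalance. Under class imbalance the loss is dominated by the majority classes; under difficulty imbalance it is dominated by easy-to-classify pixels; both make training ineffective. The proposed fix is a simple modification of the Dice loss, the Pixel-wise Modulated Dice loss (PM Dice loss): a pixel-level modulating term keeps the Dice loss's effectiveness on class imbalance while also handling difficulty imbalance, at low computational cost. Experiments on several medical segmentation tasks show that it outperforms existing methods designed for difficulty imbalance.
Link: https://arxiv.org/abs/2506.15744
Authors: Seyed Mohsen Hosseini
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: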
Abstract:Class imbalance and the difficulty imbalance are the two types of data imbalance that affect the performance of neural networks in medical segmentation tasks. In class imbalance the loss is dominated by the majority classes and in difficulty imbalance the loss is dominated by easy to classify pixels. This leads to an ineffective training. Dice loss, which is based on a geometrical metric, is very effective in addressing the class imbalance compared to the cross entropy (CE) loss, which is adopted directly from classification tasks. To address the difficulty imbalance, the common approach is employing a re-weighted CE loss or a modified Dice loss to focus the training on difficult to classify areas. The existing modification methods are computationally costly and with limited success. In this study we propose a simple modification to the Dice loss with minimal computational cost. With a pixel level modulating term, we take advantage of the effectiveness of Dice loss in handling the class imbalance to also handle the difficulty imbalance. Results on three commonly used medical segmentation tasks show that the proposed Pixel-wise Modulated Dice loss (PM Dice loss) outperforms other methods, which are designed to tackle the difficulty imbalance problem.
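For readers unfamiliar with the baseline being modified, here is a minimal NumPy sketch of the standard soft Dice loss for binary segmentation. It is not the PM Dice loss: the abstract does not specify the pixel-level modulating term, so that term is deliberately left out. The function name and the smoothing constant `eps` are illustrative assumptions.

```python
import numpy as np


def soft_dice_loss(probs, target, eps=1e-6):
    """Standard soft Dice loss for a binary mask (illustrative sketch).

    probs  : predicted foreground probabilities in [0, 1], any shape
    target : ground-truth mask of the same shape (0 or 1)
    eps    : small smoothing constant (an assumption, avoids 0/0)
    """
    probs = np.asarray(probs, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(probs * target)
    denominator = np.sum(probs) + np.sum(target)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice


mask = np.array([1, 0, 1, 0])
print(soft_dice_loss(mask, mask))      # perfect overlap, loss close to 0
print(soft_dice_loss(1 - mask, mask))  # no overlap, loss close to 1
```

Because the loss is a ratio of overlap to total foreground mass, background pixels that are predicted correctly contribute nothing to either sum, which is why Dice-style losses are less sensitive to class imbalance than a per-pixel cross-entropy average.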
[CV-157] Smartphone-integrated RPA-CRISPR-Cas12a Detection System with Microneedle Sampling for Point-of-Care Diagnosis of Potato Late Blight in Early Stage
Quick read: This paper addresses the fact that conventional plant disease detection methods such as PCR and LAMP depend on bulky, expensive laboratory equipment and complex operations, which makes point-of-care diagnosis in the field impractical. The key to its solution is a portable RPA-CRISPR diagnostic system that uses a smartphone to acquire and analyze fluorescence images and a polyvinyl alcohol (PVA) microneedle patch for rapid sample extraction. This markedly improves DNA extraction efficiency and simplifies the workflow, enabling early on-site plant disease detection without specialized equipment.
Link: https://arxiv.org/abs/2506.15728
Authors: Jiangnan Zhao(1 and 2),Hanbo Xu(1 and 2),Cifu Xu(1 and 2),Wenlong Yin(1 and 2),Laixin Luo(3),Gang Liu(1 and 2),Yan Wang(1 and 2) ((1) Key Laboratory of Smart Agriculture Systems, Ministry of Education, China Agricultural University, Beijing, PR China, (2) Key Laboratory of Agricultural Information Acquisition Technology, Ministry of Agriculture and Rural Affairs of China, China Agricultural University, Beijing, PR China, (3) Department of Plant Pathology, China Agricultural University, Beijing Key Laboratory of Seed Disease Testing and Control, Beijing, PR China)
Affiliations: unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM)
Comments: 32 pages, 7 figures, 1 table
Abstract:Potato late blight, caused by the oomycete pathogen Phytophthora infestans, is one of the most devastating diseases to have affected potato crops. Although conventional methods for detecting plant diseases, such as PCR and LAMP, are highly sensitive and specific, they rely on bulky and expensive laboratory equipment and involve complex operations, making them impractical for point-of-care diagnosis in the field. In this study, we report a portable RPA-CRISPR-based diagnosis system for plant disease that integrates a smartphone for acquisition and analysis of fluorescent images. A polyvinyl alcohol (PVA) microneedle patch was employed for sample extraction from plant leaves within one minute; the DNA extraction efficiency reached 56 µg/mg, approximately 3 times that of the traditional CTAB method (18 µg/mg). An RPA-CRISPR-Cas12a isothermal assay was established to specifically target P. infestans, with no cross-reactivity observed against closely related species (P. sojae, P. capsici). The system demonstrated a detection limit of 2 pg/µL for P. infestans genomic DNA, offering sensitivity comparable to that of benchtop laboratory equipment. It demonstrates early-stage diagnosis capability, achieving approximately 80% and 100% detection rates on the third and fourth day post-inoculation, respectively, before visible symptoms appear on the leaves. The smartphone-based “sample-to-result” system removes the heavy reliance of traditional methods on specialized equipment, offering a promising route to early-stage plant disease detection and control in the field.
Artificial Intelligence
[AI-0] No Free Lunch: Rethinking Internal Feedback for LLM Reasoning
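Quick read: This paper addresses the dependence on supervision when improving the reasoning ability of Large Language Models (LLMs) during post-training. Established methods such as Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) work well but require extensive external supervision. The paper studies Reinforcement Learning from Internal Feedback (RLIF), which relies on intrinsic signals produced by the model itself, such as token-level entropy, trajectory-level entropy, and self-certainty, rather than on external rewards. This improves the reasoning performance of base LLMs early in training, but performance degrades later in training and gains on instruction-tuned models are small, which points to the limits of intrinsic feedback.
Link: https://arxiv.org/abs/2506.17219
Authors: Yanzhi Zhang,Zhaoxi Zhang,Haoxiang Guan,Yilin Cheng,Yitong Duan,Chen Wang,Yue Wang,Shuxin Zheng,Jiyan He
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: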
Abstract:Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning. Approaches like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) have shown strong results, but they require extensive external supervision. We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards. In particular, we leverage unsupervised reward proxies such as token-level entropy, trajectory-level entropy, and self-certainty. Our theoretical analysis shows these internal objectives are partially equivalent, and we empirically evaluate various RLIF strategies on challenging math reasoning benchmarks. Experimental results demonstrate that RLIF can boost the reasoning performance of base LLMs at the beginning phase of the training, matching or surpassing RLVR techniques on these tasks. However, when training progresses, performance degrades even below the model before training. Moreover, we find that RLIF yields little improvement for instruction-tuned models, indicating diminishing returns of intrinsic feedback once an LLM is already instruction-tuned. We further analyze this limitation by mixing model weights and explain the reason of RLIF’s training behaviors, providing practical guidelines for integrating internal feedback signals into LLM training. We hope our analysis of internal feedback will inform more principled and effective strategies for LLM post-training.
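To make "token-level entropy" concrete, the following NumPy sketch computes the Shannon entropy of the next-token distribution at each position from a matrix of logits. How the paper turns this into a reward is not stated in the abstract; using the negative mean entropy as a reward proxy, as in `entropy_reward` below, is an assumption for illustration only, and both function names are hypothetical.

```python
import numpy as np


def token_entropies(logits):
    """Per-token Shannon entropy (in nats) of softmax(logits).

    logits : array of shape (num_tokens, vocab_size)
    returns: array of shape (num_tokens,)
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax: subtract the row maximum first.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


def entropy_reward(logits):
    """Hypothetical intrinsic reward: higher when the model is more certain."""
    return -float(token_entropies(logits).mean())


uniform = np.zeros((1, 4))                    # maximally uncertain over 4 tokens
peaked = np.array([[100.0, 0.0, 0.0, 0.0]])   # almost all mass on one token
print(token_entropies(uniform), token_entropies(peaked))
```

A reward of this kind needs no labels or verifier, which is the appeal of RLIF, but it also rewards confident wrong answers just as much as confident right ones, consistent with the late-training degradation the abstract reports.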
[AI-1] Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning ICML2025
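Quick read: This paper addresses the network pathologies that arise when scaling up Deep Reinforcement Learning (DRL) models, which make effective scaling very difficult. The key to its solution is static network sparsity rather than more complex modifications or architectural improvements. Simple one-shot random pruning, which randomly removes a predetermined percentage of network weights once before training, markedly improves scalability, giving higher parameter efficiency and stronger optimization robustness.
Link: https://arxiv.org/abs/2506.17204
Authors: Guozheng Ma,Lu Li,Zilin Wang,Li Shen,Pierre-Luc Bacon,Dacheng Tao
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2025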
Abstract:Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
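One-shot random pruning as described in this abstract can be sketched in a few lines of NumPy: draw a fixed binary mask once before training and keep it unchanged afterwards. The function name, the uniform-per-layer sparsity, and the way the mask is applied are illustrative assumptions; the abstract does not say how sparsity is distributed across layers.

```python
import numpy as np


def one_shot_random_mask(shape, sparsity, rng):
    """Return a boolean mask with a `sparsity` fraction of entries set to False.

    The mask is drawn once, before training, and never updated (static sparsity).
    """
    total = int(np.prod(shape))
    num_pruned = int(round(sparsity * total))
    mask = np.ones(total, dtype=bool)
    # Choose which weights to remove uniformly at random, without replacement.
    pruned = rng.choice(total, size=num_pruned, replace=False)
    mask[pruned] = False
    return mask.reshape(shape)


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16))
mask = one_shot_random_mask(weights.shape, sparsity=0.75, rng=rng)
sparse_weights = weights * mask      # pruned entries become exactly zero
print(int(mask.sum()), "of", mask.size, "weights kept")
```

In a training loop the same mask would be reapplied after every optimizer step (or applied to the gradients) so that pruned weights stay at zero for the whole run.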
[AI-2] Continual Learning with Columnar Spiking Neural Networks
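Quick read: This paper addresses catastrophic forgetting in continual learning, where a model rapidly loses previously learned knowledge when it learns new tasks. The key to its solution is columnar-organized spiking neural networks (SNNs), realized with CoLaNET (Columnar Layered Network). Its microcolumns adapt most efficiently to new tasks when those tasks lack shared structure with prior learning, which allows an effective balance between retaining old knowledge (stability) and acquiring new information (plasticity).
Link: https://arxiv.org/abs/2506.17169
Authors: Denis Larionov,Nikolay Bazenkov,Mikhail Kiselev
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures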
Abstract:This study investigates columnar-organized spiking neural networks (SNNs) for continual learning and catastrophic forgetting. Using CoLaNET (Columnar Layered Network), we show that microcolumns adapt most efficiently to new tasks when they lack shared structure with prior learning. We demonstrate how CoLaNET hyperparameters govern the trade-off between retaining old knowledge (stability) and acquiring new information (plasticity). Our optimal configuration learns ten sequential MNIST tasks effectively, maintaining 92% accuracy on each. It shows low forgetting, with only 4% performance degradation on the first task after training on nine subsequent tasks.
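[AI-3] The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
Quick read: This paper examines how medical Large Language Models (LLMs) and humans differ in their responses to the real-world variability of clinical settings, a question that matters for the safe deployment of medical LLMs. The key to its solution is the MedPerturb dataset, which applies systematic perturbations to clinical input (gender modifications, style variation, and format changes) to evaluate the clinical robustness of medical LLMs and to reveal how human and LLM decisions diverge under each kind of perturbation.
Link: https://arxiv.org/abs/2506.17163
Authors: Abinitha Gourabathina,Yuexing Hao,Walter Gerych,Marzyeh Ghassemi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: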
Abstract:Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
[AI-4] Sparse-Reg: Improving Sample Complexity in Offline Reinforcement Learning using Sparsity
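Quick read: This paper addresses the tendency of offline reinforcement learning (RL) algorithms to overfit on small datasets, which degrades performance. The key to its solution is “Sparse-Reg”, a sparsity-based regularization technique that mitigates overfitting, enables effective learning when data is limited, and outperforms state-of-the-art baselines on continuous control tasks.
Link: https://arxiv.org/abs/2506.17155
Authors: Samin Yeasar Arnob,Scott Fujimoto,Doina Precup
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: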
Abstract:In this paper, we investigate the use of small datasets in the context of offline reinforcement learning (RL). While many common offline RL benchmarks employ datasets with over a million data points, many offline RL applications rely on considerably smaller datasets. We show that offline RL algorithms can overfit on small datasets, resulting in poor performance. To address this challenge, we introduce “Sparse-Reg”: a regularization technique based on sparsity to mitigate overfitting in offline reinforcement learning, enabling effective learning in limited data settings and outperforming state-of-the-art baselines in continuous control.
[AI-5] Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
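Quick read: This paper addresses an inconsistency in diffusion models for molecular conformations: samples drawn by classical diffusion inference and samples from coarse-grained molecular dynamics simulation driven by forces derived from the same model do not agree. The root cause is that, at the small diffusion timesteps required for simulation, conventional diffusion models fail to satisfy the Fokker-Planck equation, which governs how the score should evolve over time. The authors propose an energy-based diffusion model with a regularization term derived from the Fokker-Planck equation to enforce consistency.
Link: https://arxiv.org/abs/2506.17139
Authors: Michael Plainer,Hao Wu,Leon Klein,Stephan Günnemann,Frank Noé
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Comments: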
Abstract:Diffusion models have recently gained significant attention due to their effectiveness in various scientific domains, including biochemistry. When trained on equilibrium molecular distributions, diffusion models provide both: a generative procedure to sample equilibrium conformations and associated forces derived from the model’s scores. However, using the forces for coarse-grained molecular dynamics simulations uncovers inconsistencies in the samples generated via classical diffusion inference and simulation, despite both originating from the same model. Particularly at the small diffusion timesteps required for simulations, diffusion models fail to satisfy the Fokker-Planck equation, which governs how the score should evolve over time. We interpret this deviation as an indication of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term enforcing consistency. We demonstrate the effectiveness of our approach on toy systems, alanine dipeptide, and introduce a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and demonstrates enhanced consistency and efficient sampling.
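For reference, the Fokker-Planck equation mentioned in this abstract can be written out as follows. This is the textbook form for a generic forward noising SDE with drift $f$ and scalar diffusion coefficient $g$; the specific SDE, parameterization, and regularizer used in the paper are not given in the abstract, so the symbols below are assumptions.

```latex
\begin{aligned}
\mathrm{d}x_t &= f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t, \\
\partial_t p_t(x) &= -\nabla \cdot \big( f(x, t)\, p_t(x) \big)
                     + \tfrac{1}{2}\, g(t)^2\, \Delta p_t(x), \\
\partial_t \log p_t(x) &= -\nabla \cdot f(x, t) - f(x, t) \cdot \nabla \log p_t(x)
                     + \tfrac{1}{2}\, g(t)^2 \Big( \Delta \log p_t(x)
                     + \big\lVert \nabla \log p_t(x) \big\rVert^2 \Big).
\end{aligned}
```

The third line follows from the second by dividing by $p_t$ and using $\Delta p_t / p_t = \Delta \log p_t + \lVert \nabla \log p_t \rVert^2$. A model that represents the log-density directly (up to normalization), rather than only the score $\nabla \log p_t$, makes the residual of this equation something that can be evaluated and penalized during training, which is plausibly why an energy-based parameterization fits a Fokker-Planck-derived regularizer.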
[AI-6] Chain-of-Trust: A Progressive Trust Evaluation Framework Enabled by Generative AI
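Quick read: This paper addresses comprehensive trust evaluation in complex collaborative systems that rely on distributed resources, where network dynamics and varying information-gathering latencies make it impossible to observe and collect all trust attributes of a collaborating device at the same time. The key to its solution is a progressive trust evaluation framework called chain-of-trust. Using task decomposition, it divides trust evaluation into multiple chained stages and, at each stage, gathers only the latest device attribute data relevant to that stage, which reduces evaluation complexity and overhead. The framework then uses the in-context learning, few-shot learning, and reasoning capabilities of generative AI to analyze the collected data and produce accurate evaluation results quickly.
Link: https://arxiv.org/abs/2506.17130
Authors: Botao Zhu,Xianbin Wang,Lei Zhang,Xuemin(Sherman)Shen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: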
Abstract:In collaborative systems with complex tasks relying on distributed resources, trust evaluation of potential collaborators has emerged as an effective mechanism for task completion. However, due to the network dynamics and varying information gathering latencies, it is extremely challenging to observe and collect all trust attributes of a collaborating device concurrently for a comprehensive trust assessment. In this paper, a novel progressive trust evaluation framework, namely chain-of-trust, is proposed to make better use of misaligned device attribute data. This framework, designed for effective task completion, divides the trust evaluation process into multiple chained stages based on task decomposition. At each stage, based on the task completion process, the framework only gathers the latest device attribute data relevant to that stage, leading to reduced trust evaluation complexity and overhead. By leveraging advanced in-context learning, few-shot learning, and reasoning capabilities, generative AI is then employed to analyze and interpret the collected data to produce correct evaluation results quickly. Only devices deemed trustworthy at this stage proceed to the next round of trust evaluation. The framework ultimately determines devices that remain trustworthy across all stages. Experimental results demonstrate that the proposed framework achieves high accuracy in trust evaluation.
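The staged filtering described in this abstract can be sketched as a simple loop: each stage reads only the attribute it needs, and only devices that pass continue to the next stage. In the paper the per-stage judgment is made by a generative AI model; in this sketch plain threshold predicates stand in for it. All names here (`chain_of_trust`, `read_attribute`, the attribute names, and the thresholds) are hypothetical.

```python
def chain_of_trust(devices, stages, read_attribute):
    """Progressive trust evaluation over chained stages (illustrative sketch).

    devices        : iterable of device identifiers
    stages         : list of (attribute_name, is_trustworthy) pairs, one per task stage
    read_attribute : callable (device, attribute_name) -> latest observed value
    Returns the devices that were judged trustworthy at every stage.
    """
    trusted = list(devices)
    for attribute_name, is_trustworthy in stages:
        survivors = []
        for device in trusted:
            # Fetch only the attribute relevant to the current stage.
            value = read_attribute(device, attribute_name)
            if is_trustworthy(value):
                survivors.append(device)
        trusted = survivors
        if not trusted:
            break  # nobody left to evaluate in later stages
    return trusted


# Toy data: latest observations per device (made up for the example).
observations = {
    "dev_a": {"cpu_share": 0.9, "link_quality": 0.8},
    "dev_b": {"cpu_share": 0.2, "link_quality": 0.9},
    "dev_c": {"cpu_share": 0.7, "link_quality": 0.1},
}
stages = [
    ("cpu_share", lambda v: v >= 0.5),
    ("link_quality", lambda v: v >= 0.5),
]
print(chain_of_trust(observations, stages, lambda d, a: observations[d][a]))
```

Because untrusted devices drop out early, later stages query attributes for fewer devices, which is where the reduction in evaluation overhead comes from.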
[AI-7] Rapid and Continuous Trust Evaluation for Effective Task Collaboration Through Siamese Model
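Quick read: This paper addresses how to evaluate the trustworthiness of collaborators rapidly and continuously in collaborative systems, a task made difficult by distributed devices, complex operational environments, and dynamically changing resources. The key to its solution is a Siamese-enabled rapid and continuous trust evaluation framework (SRCTE). It represents a collaborator's communication and computing resource attributes and historical collaboration data as an attributed control flow graph (ACFG), uses two shared-parameter Structure2vec networks to learn the deep semantics of each pair of ACFGs, and computes the similarity of their embeddings to determine the collaborator's trust value at each time slot.
Link: https://arxiv.org/abs/2506.17128
Authors: Botao Zhu,Xianbin Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: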
Abstract:Trust is emerging as an effective tool to ensure the successful completion of collaborative tasks within collaborative systems. However, rapidly and continuously evaluating the trustworthiness of collaborators during task execution is a significant challenge due to distributed devices, complex operational environments, and dynamically changing resources. To tackle this challenge, this paper proposes a Siamese-enabled rapid and continuous trust evaluation framework (SRCTE) to facilitate effective task collaboration. First, the communication and computing resource attributes of the collaborator in a trusted state, along with historical collaboration data, are collected and represented using an attributed control flow graph (ACFG) that captures trust-related semantic information and serves as a reference for comparison with data collected during task execution. At each time slot of task execution, the collaborator’s communication and computing resource attributes, as well as task completion effectiveness, are collected in real time and represented with an ACFG to convey their trust-related semantic information. A Siamese model, consisting of two shared-parameter Structure2vec networks, is then employed to learn the deep semantics of each pair of ACFGs and generate their embeddings. Finally, the similarity between the embeddings of each pair of ACFGs is calculated to determine the collaborator’s trust value at each time slot. A real system is built using two Dell EMC 5200 servers and a Google Pixel 8 to test the effectiveness of the proposed SRCTE framework. Experimental results demonstrate that SRCTE converges rapidly with only a small amount of data and achieves a high anomaly trust detection rate compared to the baseline algorithm.
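The final step of this pipeline, turning a pair of embeddings into a trust value, can be sketched as below. The abstract does not say which similarity measure is used; cosine similarity and the rescaling to the range [0, 1] are assumptions, and the embeddings here are plain vectors rather than outputs of Structure2vec networks.

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def trust_value(reference_embedding, current_embedding):
    """Hypothetical trust score in [0, 1]: 1 means the current behavior
    matches the trusted reference, 0 means it points the opposite way."""
    return 0.5 * (1.0 + cosine_similarity(reference_embedding, current_embedding))


reference = [1.0, 2.0, 3.0]   # embedding of the trusted-state graph (made up)
current = [1.1, 1.9, 3.2]     # embedding at the current time slot (made up)
print(trust_value(reference, current))
```

Because both graphs pass through networks with shared parameters, their embeddings live in the same space, so a drop in similarity over time slots can be read directly as a drift away from the trusted reference behavior.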
[AI-8] When Can Model-Free Reinforcement Learning be Enough for Thinking?
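Quick read: This paper asks when and why model-free reinforcement learning (RL) gives rise to “thinking” as a strategy for reward maximization. Its core contribution is a theoretical model called the thought Markov decision process (thought MDP), which extends the classical MDP with abstract notions of thought state and thought action. Using this model, the authors prove that policy initialization plays a key role in whether thinking emerges, and show formally that a thought action is equivalent to the agent performing one step of policy improvement before continuing to act. The paper also shows that open-source LLMs satisfy the conditions the theory predicts are necessary, and hypothesizes sufficient conditions under which thinking could be learned outside of language generation.
Link: https://arxiv.org/abs/2506.17124
Authors: Josiah P. Hanna,Nicholas E. Corrado
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures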
Abstract:Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of “thinking” through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to “thinking” as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.
[AI-9] Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
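【Quick read】: This paper addresses hidden reasoning flaws of current large reasoning models on mathematical proof tasks, flaws that are masked by high reported accuracy, reliance on purely numerical evaluation, and potential benchmark leakage. The key idea is to use the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool that exposes these shortcomings. To this end, the authors introduce RFMDataset (Reveal Failure Modes). An in-depth analysis of advanced models on it identifies 10 fine-grained error types, revealing limitations in proof comprehension, in guaranteeing the correctness of single-step reasoning, and in the form of hallucination and incompleteness during reasoning.
Link: https://arxiv.org/abs/2506.17114
Authors: Dadi Guo,Jiayu Liu,Zhiyuan Fan,Zhitao He,Haoran Li,Yumeng Wang,Yi R.(May)Fung
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: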
Abstract:Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models’ performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which shows fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models’ self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.
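[AI-10] Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
【Quick read】: This paper targets the weak performance of large language models (LLMs) on complex mathematical reasoning that requires multi-step first-order logic (FOL) deductions. Although LLMs do well on existing mathematical reasoning benchmarks, they still fall short on tasks that need multi-step FOL derivations. For example, Deepseek-Prover-V2-7B reaches only 4.2% accuracy on the authors' theorem proving dataset. The proposed solution is DREAM, a self-adaptive method that adds an Axiom-Driven Strategy Diversification mechanism to increase the diversity of generated strategies and a Sub-Proposition Error Feedback mechanism that helps the model reflect on and correct its proofs, thereby improving mathematical reasoning.
Link: https://arxiv.org/abs/2506.17104
Authors: Chuxue Cao,Mengze Li,Juntao Dai,Jinluan Yang,Zijian Zhao,Shengyu Zhang,Weijie Shi,Chengzhong Liu,Sirui Han,Yike Guo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: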
Abstract:Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B’s low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs’ generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs’ mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
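[AI-11] TransDreamerV3: Implanting Transformer In DreamerV3
【Quick read】: This paper addresses the limited memory and decision-making capabilities of conventional reinforcement learning models in complex environments. The key is to modify the DreamerV3 architecture by integrating a transformer encoder that strengthens memory and decision making. With this change, TransDreamerV3 outperforms DreamerV3 on the Atari-Freeway and Crafter tasks, illustrating the potential benefit of adding transformer structures to world-model-based reinforcement learning.
Link: https://arxiv.org/abs/2506.17103
Authors: Shruti Sadanand Dongare,Amun Kharel,Jonathan Samuel,Xiaona Zhou
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: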
Abstract:This paper introduces TransDreamerV3, a reinforcement learning model that enhances the DreamerV3 architecture by integrating a transformer encoder. The model is designed to improve memory and decision-making capabilities in complex environments. We conducted experiments on Atari-Boxing, Atari-Freeway, Atari-Pong, and Crafter tasks, where TransDreamerV3 demonstrated improved performance over DreamerV3, particularly in the Atari-Freeway and Crafter tasks. While issues in the Minecraft task and limited training across all tasks were noted, TransDreamerV3 displays advancement in world model-based reinforcement learning, leveraging transformer architectures.
[AI-12] Identifiability of Deep Polynomial Neural Networks
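【Quick read】: This paper studies the identifiability of polynomial neural networks (PNNs), a key property for ensuring interpretability. The key is to reveal the intricate interplay between activation degrees and layer widths, and to give constructive proofs based on a connection between deep PNNs, low-rank tensor decompositions, and Kruskal-type uniqueness theorems. The approach yields both generic conditions determined by the architecture and effective conditions that depend on the network's parameters.
Link: https://arxiv.org/abs/2506.17093
Authors: Konstantin Usevich,Clara Dérand,Ricardo Borsoi,Marianne Clausel
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Geometry (math.AG); Machine Learning (stat.ML)
Comments: 1 figure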
Abstract:Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability – a key property for ensuring interpretability – remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly. Our proofs are constructive and center on a connection between deep PNNs and low-rank tensor decompositions, and Kruskal-type uniqueness theorems. This yields both generic conditions determined by the architecture, and effective conditions that depend on the network’s parameters. We also settle an open conjecture on the expected dimension of PNN’s neurovarieties, and provide new bounds on the activation degrees required for it to reach its maximum.
[AI-13] Dispositions and Roles of Generically Dependent Entities
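【Quick read】: This paper addresses a limitation of BFO 2020: it cannot represent functions, dispositions, and roles of generically dependent continuants (such as software or datasets). The paper argues that this gap prevents adequate representation of, for example, the functions of computer models or the various roles of datasets during model execution. Two approaches are proposed: (a) using defined classes to extend expressiveness, and (b) changes that would allow BFO to support functions, dispositions, and roles of generically dependent continuants.
Link: https://arxiv.org/abs/2506.17085
Authors: Fabian Neuhaus
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: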
Abstract:BFO 2020 does not support functions, dispositions, and roles of generically dependent continuants (like software or datasets). In this paper, we argue that this is a severe limitation, which prevents, for example, the adequate representation of the functions of computer models or the various roles of datasets during the execution of these models. We discuss the aspects of BFO 2020 that prevent the representation of realizable entities of generically dependent continuants. Two approaches to address the issue are presented: (a) the use of defined classes and (b) a proposal of changes that allow BFO to support functions, dispositions, and roles of generically dependent continuants.
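[AI-14] LLM-Based Bot Broadens the Range of Arguments in Online Discussions Even When Transparently Disclosed as AI
【Quick read】: This paper addresses the limited range of views in online political discussions, which stems from user self-selection and from the tendency of platform algorithms to facilitate exchanges among like-minded people. The key is a bot based on a large language model (LLM) that actively monitors discussions, identifies missing arguments, and introduces them into the conversation, thereby widening the range of perspectives that participants express. Experiments show that the approach broadens the range of arguments on both objective and subjective metrics, and that disclosing the bot as AI does not significantly change its effect.
Link: https://arxiv.org/abs/2506.17073
Authors: Valeria Vuk,Cristina Sarasua,Fabrizio Gilardi
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: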
Abstract:A wide range of participation is essential for democracy, as it helps prevent the dominance of extreme views, erosion of legitimacy, and political polarization. However, engagement in online political discussions often features a limited spectrum of views due to high levels of self-selection and the tendency of online platforms to facilitate exchanges primarily among like-minded individuals. This study examines whether an LLM-based bot can widen the scope of perspectives expressed by participants in online discussions through two pre-registered randomized experiments conducted in a chatroom. We evaluate the impact of a bot that actively monitors discussions, identifies missing arguments, and introduces them into the conversation. The results indicate that our bot significantly expands the range of arguments, as measured by both objective and subjective metrics. Furthermore, disclosure of the bot as AI does not significantly alter these effects. These findings suggest that LLM-based moderation tools can positively influence online political discourse.
[AI-15] Flow-Based Non-stationary Temporal Regime Causal Structure Learning
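【Quick read】: This paper addresses causal inference in multivariate time series, in particular when the series contains multiple regimes (temporal segments with unknown boundaries), each with possibly its own causal structure, under non-stationarity and complex noise distributions. Existing methods typically assume stationarity or Gaussian noise and cannot handle these realistic settings. The proposed FANTOM framework uses a Bayesian Expectation Maximization algorithm to jointly infer the number of regimes, their corresponding indices, and a directed acyclic graph (DAG) for each regime, and it handles non-stationary processes with non-Gaussian and heteroscedastic noise.
Link: https://arxiv.org/abs/2506.17065
Authors: Abdellah Rahmani,Pascal Frossard
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: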
Abstract:Understanding causal relationships in multivariate time series is crucial in many scenarios, such as those dealing with financial or neurological data. Many such time series exhibit multiple regimes, i.e., consecutive temporal segments with a priori unknown boundaries, with each regime having its own causal structure. Inferring causal dependencies and regime shifts is critical for analyzing the underlying processes. However, causal structure learning in this setting is challenging due to (1) non-stationarity, i.e., each regime can have its own causal graph and mixing function, and (2) complex noise distributions, which may be non-Gaussian or heteroscedastic. Existing causal discovery approaches cannot address these challenges, since they generally assume stationarity or Gaussian noise with constant variance. Hence, we introduce FANTOM, a unified framework for causal discovery that handles non-stationary processes along with non-Gaussian and heteroscedastic noise. FANTOM simultaneously infers the number of regimes and their corresponding indices and learns each regime’s Directed Acyclic Graph. It uses a Bayesian Expectation Maximization algorithm that maximizes the evidence lower bound of the data log-likelihood. On the theoretical side, we prove, under mild assumptions, that temporal heteroscedastic causal models, introduced in FANTOM’s formulation, are identifiable in both stationary and non-stationary settings. In addition, extensive experiments on synthetic and real data show that FANTOM outperforms existing methods.
[AI-16] MAWIFlow Benchmark: Realistic Flow-Based Evaluation for Network Intrusion Detection
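【Quick read】: This paper addresses the fact that existing network intrusion detection benchmark datasets rely on synthetic traffic and therefore fail to reflect the statistical variability and temporal drift of operational environments. The key is the MAWIFlow benchmark, derived from the MAWILAB v1.1 dataset: a reproducible preprocessing pipeline converts raw packet captures into flow representations conforming to the CICFlowMeter format while preserving the original anomaly labels, which enables realistic and reproducible evaluation of anomaly detection methods.
Link: https://arxiv.org/abs/2506.17041
Authors: Joshua Schraven,Alexander Windmann,Oliver Niggemann
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures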
Abstract:Benchmark datasets for network intrusion detection commonly rely on synthetically generated traffic, which fails to reflect the statistical variability and temporal drift encountered in operational environments. This paper introduces MAWIFlow, a flow-based benchmark derived from the MAWILAB v1.1 dataset, designed to enable realistic and reproducible evaluation of anomaly detection methods. A reproducible preprocessing pipeline is presented that transforms raw packet captures into flow representations conforming to the CICFlowMeter format, while preserving MAWILab’s original anomaly labels. The resulting datasets comprise temporally distinct samples from January 2011, 2016, and 2021, drawn from trans-Pacific backbone traffic. To establish reference baselines, traditional machine learning methods, including Decision Trees, Random Forests, XGBoost, and Logistic Regression, are compared to a deep learning model based on a CNN-BiLSTM architecture. Empirical results demonstrate that tree-based classifiers perform well on temporally static data but experience significant performance degradation over time. In contrast, the CNN-BiLSTM model maintains better performance, thus showing improved generalization. These findings underscore the limitations of synthetic benchmarks and static models, and motivate the adoption of realistic datasets with explicit temporal structure. All datasets, pipeline code, and model implementations are made publicly available to foster transparency and reproducibility.
[AI-17] LSCD: Lomb-Scargle Conditioned Diffusion for Time series Imputation ICML2025
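【Quick read】: This paper addresses the handling of time series with missing or irregularly sampled data in machine learning. Conventional methods often rely on the Fast Fourier Transform (FFT), which assumes uniform sampling and therefore requires prior interpolation that can distort the spectrum. The key is a differentiable Lomb–Scargle layer that reliably computes the power spectrum of irregularly sampled data, integrated into a new score-based diffusion model (Lomb–Scargle Conditioned Diffusion, LSCD) for time series imputation conditioned on the full signal spectrum. Experiments show that the method recovers missing data more accurately than purely time-domain baselines while producing consistent frequency estimates.
Link: https://arxiv.org/abs/2506.17039
Authors: Elizabeth Fons,Alejandro Sztrajman,Yousef El-Laham,Luciana Ferrer,Svitlana Vyetrenko,Manuela Veloso
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: In ICML 2025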
Abstract:Time series with missing or irregularly sampled data are a persistent challenge in machine learning. Many methods operate on the frequency-domain, relying on the Fast Fourier Transform (FFT) which assumes uniform sampling, therefore requiring prior interpolation that can distort the spectra. To address this limitation, we introduce a differentiable Lomb–Scargle layer that enables a reliable computation of the power spectrum of irregularly sampled data. We integrate this layer into a novel score-based diffusion model (LSCD) for time series imputation conditioned on the entire signal spectrum. Experiments on synthetic and real-world benchmarks demonstrate that our method recovers missing data more accurately than purely time-domain baselines, while simultaneously producing consistent frequency estimates. Crucially, our method can be easily integrated into learning frameworks, enabling broader adoption of spectral guidance in machine learning approaches involving incomplete or irregular data.
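To make the spectral idea concrete, here is a minimal sketch of the classical (non-differentiable) Lomb–Scargle periodogram in plain Python. It is not the authors' layer: the function name `lomb_scargle_power`, the 1.5 Hz test signal, and the frequency grid are all assumptions chosen for illustration. It only shows that a power spectrum can be evaluated directly on irregular timestamps, without interpolating onto a uniform grid first.

```python
import math
import random


def lomb_scargle_power(times, values, freq_hz):
    """Classical Lomb-Scargle power at a single frequency given in Hz."""
    omega = 2.0 * math.pi * freq_hz
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]

    # The time offset tau makes the sine and cosine terms orthogonal
    # on the (irregular) sampling times.
    s2 = sum(math.sin(2.0 * omega * t) for t in times)
    c2 = sum(math.cos(2.0 * omega * t) for t in times)
    tau = math.atan2(s2, c2) / (2.0 * omega)

    cos_terms = [math.cos(omega * (t - tau)) for t in times]
    sin_terms = [math.sin(omega * (t - tau)) for t in times]
    yc = sum(y * c for y, c in zip(centered, cos_terms))
    ys = sum(y * s for y, s in zip(centered, sin_terms))
    cc = sum(c * c for c in cos_terms)
    ss = sum(s * s for s in sin_terms)
    return 0.5 * (yc * yc / cc + ys * ys / ss)


# Hypothetical test signal: a 1.5 Hz sine observed at 200 random times in [0, 10) s.
rng = random.Random(0)
times = sorted(rng.uniform(0.0, 10.0) for _ in range(200))
values = [math.sin(2.0 * math.pi * 1.5 * t) for t in times]

freqs = [0.1 * k for k in range(1, 51)]  # 0.1 Hz to 5.0 Hz
powers = [lomb_scargle_power(times, values, f) for f in freqs]
peak_freq = freqs[powers.index(max(powers))]
print(f"peak frequency: {peak_freq:.1f} Hz")
```

A differentiable version, as described in the abstract, would presumably express the same sums with tensor operations so that gradients can flow through them. That part is not shown here.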
[AI-18] A Quantile Regression Approach for Remaining Useful Life Estimation with State Space Models
【Quick read】: This paper targets the accuracy and computational efficiency of Remaining Useful Life (RUL) prediction for equipment in Predictive Maintenance (PdM) for Industry 4.0 and 5.0. The key is a new RUL estimation method based on State Space Models (SSM) that integrates Simultaneous Quantile Regression (SQR) to handle model uncertainty, enabling multiple quantile estimates and improving long-term sequence modeling.
Link: https://arxiv.org/abs/2506.17018
Authors: Davide Frizzo,Francesco Borsatti,Gian Antonio Susto
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IFAC Joint Conference on Computers, Cognition, and Communication (J3C) 2025
Abstract:Predictive Maintenance (PdM) is pivotal in Industry 4.0 and 5.0, proactively enhancing efficiency through accurate equipment Remaining Useful Life (RUL) prediction, thus optimizing maintenance scheduling and reducing unexpected failures and premature interventions. This paper introduces a novel RUL estimation approach leveraging State Space Models (SSM) for efficient long-term sequence modeling. To handle model uncertainty, Simultaneous Quantile Regression (SQR) is integrated into the SSM, enabling multiple quantile estimations. The proposed method is benchmarked against traditional sequence modelling techniques (LSTM, Transformer, Informer) using the C-MAPSS dataset. Results demonstrate superior accuracy and computational efficiency of SSM models, underscoring their potential for high-stakes industrial applications.
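Quantile regression is usually trained with the pinball (quantile) loss, so a small sketch may help readers see what "multiple quantile estimations" optimizes. The code below is an illustration under assumptions, not the authors' SQR implementation: the function names and the RUL numbers are hypothetical, and the way the paper samples or weights quantile levels is not specified in the abstract.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for quantile level q in (0, 1).

    Under-prediction is weighted by q, over-prediction by (1 - q).
    """
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


def simultaneous_quantile_loss(y_true, preds_by_quantile):
    """Average pinball loss over several quantile levels predicted at once.

    preds_by_quantile maps a quantile level to the predicted value.
    """
    losses = [pinball_loss(y_true, p, q) for q, p in preds_by_quantile.items()]
    return sum(losses) / len(losses)


# Hypothetical example: true RUL of 100 cycles and three predicted quantiles.
true_rul = 100.0
preds = {0.1: 80.0, 0.5: 98.0, 0.9: 125.0}
loss = simultaneous_quantile_loss(true_rul, preds)
print(f"mean pinball loss: {loss:.4f}")
```

The 0.1 and 0.9 predictions together form an interval around the median estimate, which is one common way to report uncertainty for a RUL forecast.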
[AI-19] Elevating Styled Mahjong Agents with Learning from Demonstration
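【Quick read】: This paper addresses the relatively under-explored problem of building Mahjong bots that are both highly skilled and diverse in play style. Existing offline learning and Learning-from-Demonstration (LfD) algorithms perform poorly because of the high randomness inherent in Mahjong and the prevalence of out-of-distribution states. The key is to leverage the gameplay histories of existing Mahjong agents with a new LfD algorithm that requires only minimal modifications to Proximal Policy Optimization, which significantly improves the agents' skill while preserving their distinct play styles.
Link: https://arxiv.org/abs/2506.16995
Authors: Lingfeng Li,Yunlong Lu,Yongyi Wang,Wenxin Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: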
Abstract:A wide variety of bots in games enriches the gameplay experience and enhances replayability. Recent advancements in game artificial intelligence have predominantly focused on improving the proficiency of bots. Nevertheless, developing highly competent bots with a wide range of distinct play styles remains a relatively under-explored area. We select the Mahjong game environment as a case study. The high degree of randomness inherent in the Mahjong game and the prevalence of out-of-distribution states lead to suboptimal performance of existing offline learning and Learning-from-Demonstration (LfD) algorithms. In this paper, we leverage the gameplay histories of existing Mahjong agents and put forward a novel LfD algorithm that necessitates only minimal modifications to the Proximal Policy Optimization algorithm. The comprehensive empirical results illustrate that our proposed method not only significantly enhances the proficiency of the agents but also effectively preserves their unique play styles.
[AI-20] Formal Control for Uncertain Systems via Contract-Based Probabilistic Surrogates (Extended Version)
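【Quick read】: This paper addresses the challenge of building accurate system representations in formal methods, which limits scalability because the resulting models are often too complex for effective decision making with formal correctness and performance guarantees. The paper proposes an approach based on probabilistic simulation relations and surrogate models of stochastic systems. Its key step is removing the need to compute error bounds directly, which significantly improves the scalability and practical applicability of such simulation relations. The method scales to higher dimensions and handles complex nonlinear agent-environment interactions under uncertainty while providing infinite-horizon temporal logic guarantees.
Link: https://arxiv.org/abs/2506.16971
Authors: Oliver Schön,Sofie Haesaert,Sadegh Soudjani
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 26 pages, 5 figures, extended version of paper accepted for publication at QEST 2025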
Abstract:The requirement for identifying accurate system representations has not only been a challenge to fulfill, but it has compromised the scalability of formal methods, as the resulting models are often too complex for effective decision making with formal correctness and performance guarantees. Focusing on probabilistic simulation relations and surrogate models of stochastic systems, we propose an approach that significantly enhances the scalability and practical applicability of such simulation relations by eliminating the need to compute error bounds directly. As a result, we provide an abstraction-based technique that scales effectively to higher dimensions while addressing complex nonlinear agent-environment interactions with infinite-horizon temporal logic guarantees amidst uncertainty. Our approach trades scalability for conservatism favorably, as demonstrated on a complex high-dimensional vehicle intersection case study.
[AI-21] Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
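【Quick read】: This paper addresses task planning for mobile robots in applications such as warehouse retrieval and environmental monitoring. Such problems can often be modeled as a Generalized Traveling Salesman Problem (GTSP), where the challenge is to select one location from each of several target clusters and produce an optimal route both efficiently and accurately. The key is a Multimodal Fused Learning (MMFL) framework that combines graph and image representations. A coordinate-based image builder, an adaptive resolution scaling strategy, and a multimodal fusion module integrate geometric and spatial features so that high-quality task plans can be generated under real-time requirements.
Link: https://arxiv.org/abs/2506.16931
Authors: Jiaqi Chen,Mingfeng Fan,Xuefeng Zhang,Jingsong Liang,Yuhong Cao,Guohua Wu,Guillaume Adrien Sartoretti
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 14 pages, 6 figures, under review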
Abstract:Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
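For readers unfamiliar with the GTSP, the sketch below spells out the problem definition with an exhaustive solver: pick exactly one location per cluster, then find the shortest closed tour through the picked locations. This is a brute-force baseline that only works for tiny instances and has nothing to do with the learned MMFL policy. The function names and the three-cluster instance are hypothetical.

```python
import itertools
import math


def tour_length(points):
    """Length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[i], points[(i + 1) % len(points)])
        for i in range(len(points))
    )


def solve_gtsp_brute_force(clusters):
    """Try every choice of one point per cluster and every visiting order."""
    best_len, best_tour = math.inf, None
    for choice in itertools.product(*clusters):
        first, rest = choice[0], choice[1:]
        # Fixing the first point removes rotations of the same tour.
        for order in itertools.permutations(rest):
            tour = (first,) + order
            length = tour_length(tour)
            if length < best_len:
                best_len, best_tour = length, tour
    return best_len, best_tour


# Hypothetical instance: three clusters with two candidate locations each.
clusters = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(1.0, 0.0), (10.0, 0.0)],
    [(0.0, 1.0), (0.0, 10.0)],
]
best_len, best_tour = solve_gtsp_brute_force(clusters)
print(best_tour, round(best_len, 3))
```

The search space grows with the product of the cluster sizes times the factorial of the number of clusters, which is why learned or heuristic solvers are needed for real-time planning.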
[AI-22] A deep learning and machine learning approach to predict neonatal death in the context of São Paulo
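【Quick read】: This paper addresses high neonatal mortality, which remains a serious reality in underdeveloped and even some developed countries. The key is to use machine learning to predict at-risk newborns early, so that mother and child can receive timely care and early child death can be avoided. The study uses historical data on 1.4 million newborns and applies logistic regression, K-nearest neighbor, random forest classifier, extreme gradient boosting (XGBoost), convolutional neural network, and long short-term memory (LSTM) models. LSTM performed best among the deep learning models with a reported accuracy of 99%, and is therefore considered the most suitable approach for predicting whether precautionary measures are needed.
Link: https://arxiv.org/abs/2506.16929
Authors: Mohon Raihan,Plabon Kumar Saha,Rajan Das Gupta,A Z M Tahmidul Kabir,Afia Anjum Tamanna,Md. Harun-Ur-Rashid,Adnan Bin Abdus Salam,Md Tanvir Anjum,A Z M Ahteshamul Kabir
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: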
Abstract:Neonatal death is still a concerning reality for underdeveloped and even some developed countries. Worldwide data indicate that 26.693 babies out of 1,000 births die, according to Macro Trades. To reduce this number, early prediction of endangered babies is crucial. Such prediction enables the opportunity to take ample care of the child and mother so that early child death can be avoided. In this context, machine learning was used to determine whether a newborn baby is at risk. To train the predictive model, historical data of 1.4 million newborns was used. Machine learning and deep learning techniques such as logistic regression, K-nearest neighbor, random forest classifier, extreme gradient boosting (XGBoost), convolutional neural network, and long short-term memory (LSTM) were implemented using the dataset to identify the most accurate model for predicting neonatal mortality. Among the machine learning algorithms, XGBoost and random forest classifier achieved the best accuracy with 94%, while among the deep learning models, LSTM delivered the highest accuracy with 99%. Therefore, using LSTM appears to be the most suitable approach to predict whether precautionary measures for a child are necessary.
[AI-23] Real-Time Black-Box Optimization for Dynamic Discrete Environments Using Embedded Ising Machines
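【Quick read】: This paper addresses the optimization of discrete variables in dynamic environments, in particular real-time systems where conventional multi-armed bandit (MAB) algorithms cannot optimize effectively because combinatorial optimization creates an enormous action space. The key is to extend a black-box optimization (BBO) method based on an Ising machine into a heuristic MAB method that explores the action space effectively while accounting for interactions between variables and changes in the environment, which improves dynamic adaptability.
Link: https://arxiv.org/abs/2506.16924
Authors: Tomoya Kashimata,Yohei Hamakawa,Masaya Yamasaki,Kosuke Tatsumura
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 18 pages, 6 figures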
Abstract:Many real-time systems require the optimization of discrete variables. Black-box optimization (BBO) algorithms and multi-armed bandit (MAB) algorithms perform optimization by repeatedly taking actions and observing the corresponding instant rewards without any prior knowledge. Recently, a BBO method using an Ising machine has been proposed to find the best action that is represented by a combination of discrete values and maximizes the instant reward in static environments. In contrast, dynamic environments, where real-time systems operate, necessitate MAB algorithms that maximize the average reward over multiple trials. However, due to the enormous number of actions resulting from the combinatorial nature of discrete optimization, conventional MAB algorithms cannot effectively optimize dynamic, discrete environments. Here, we show a heuristic MAB method for dynamic, discrete environments by extending the BBO method, in which an Ising machine effectively explores the actions while considering interactions between variables and changes in dynamic environments. We demonstrate the dynamic adaptability of the proposed method in a wireless communication system with moving users.
[AI-24] Towards Effective Complementary Security Analysis using Large Language Models
【Quick read】: This paper addresses the large number of false positives (FPs) among the potential security weaknesses reported by static application security testing (SAST) tools, which reduces the effectiveness of security analysis. The key is to use large language models (LLMs) to improve the assessment of SAST findings: advanced prompting techniques such as Chain-of-Thought and Self-Consistency substantially improve FP detection while maintaining a perfect true positive (TP) rate.
Link: https://arxiv.org/abs/2506.16899
Authors: Jonas Wagner,Simon Müller,Christian Näther,Jan-Philipp Steghöfer,Andreas Both
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures
Abstract:A key challenge in security analysis is the manual evaluation of potential security weaknesses generated by static application security testing (SAST) tools. Numerous false positives (FPs) in these reports reduce the effectiveness of security analysis. We propose using Large Language Models (LLMs) to improve the assessment of SAST findings. We investigate the ability of LLMs to reduce FPs while trying to maintain a perfect true positive rate, using datasets extracted from the OWASP Benchmark (v1.2) and a real-world software project. Our results indicate that advanced prompting techniques, such as Chain-of-Thought and Self-Consistency, substantially improve FP detection. Notably, some LLMs identified approximately 62.5% of FPs in the OWASP Benchmark dataset without missing genuine weaknesses. Combining detections from different LLMs would increase this FP detection to approximately 78.9%. Additionally, we demonstrate our approach’s generalizability using a real-world dataset covering five SAST tools, three programming languages, and infrastructure files. The best LLM detected 33.85% of all FPs without missing genuine weaknesses, while combining detections from different LLMs would increase this detection to 38.46%. Our findings highlight the potential of LLMs to complement traditional SAST tools, enhancing automation and reducing resources spent addressing false alarms.
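The two aggregation steps mentioned in the abstract, Self-Consistency over repeated samples and combining detections from different LLMs, can be sketched as follows. Everything here is an assumption made for illustration: the labels `"FP"` and `"TP"`, the function names, the finding identifiers, and in particular the unanimity default, which is chosen only to reflect the stated goal of not missing genuine weaknesses. The paper's actual voting rule may differ.

```python
from collections import Counter


def self_consistent_label(sampled_labels, min_agreement=1.0):
    """Aggregate several sampled verdicts for one SAST finding.

    Returns "FP" only if the share of "FP" votes reaches min_agreement;
    otherwise the finding is kept as a potential true positive ("TP").
    """
    counts = Counter(sampled_labels)
    share_fp = counts["FP"] / len(sampled_labels)
    return "FP" if share_fp >= min_agreement else "TP"


def combine_fp_detections(detections_per_model):
    """Union of the finding IDs that each model flagged as false positives."""
    combined = set()
    for finding_ids in detections_per_model.values():
        combined |= set(finding_ids)
    return combined


# Hypothetical verdicts from five sampled responses for one finding.
votes = ["FP", "FP", "TP", "FP", "FP"]
print(self_consistent_label(votes))                     # strict: needs unanimity
print(self_consistent_label(votes, min_agreement=0.6))  # looser vote threshold

# Hypothetical per-model FP detections, keyed by made-up model names.
detections = {"model_a": {"f1", "f2"}, "model_b": {"f2", "f3"}}
print(sorted(combine_fp_detections(detections)))
```

A stricter `min_agreement` trades FP detection for a lower risk of discarding a genuine weakness. Taking the union across models increases FP detection, but any single model's mistake then carries over into the combined result.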
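[AI-25] The Importance of Being Lazy: Scaling Limits of Continual Learning
【Quick read】: This paper addresses catastrophic forgetting (CF) in neural networks that learn continually in non-stationary environments, and how model scale and the degree of feature learning affect continual learning performance. The key is to distinguish "lazy" and "rich" training regimes, showing that increasing model width is beneficial when it reduces feature learning, and to use dynamical mean field theory to analyze the infinite-width dynamics in the feature learning regime. This gives a fuller picture of CF and its relationship with task non-stationarity and feature learning.
Link: https://arxiv.org/abs/2506.16884
Authors: Jacopo Graldi,Alessandro Breccia,Giulia Lanzillotta,Thomas Hofmann,Lorenzo Noci
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Proceedings of the 42nd International Conference on Machine Learning (2025). JG and AB contributed equally to this work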
Abstract:Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting (CF) is far from complete. In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature, by differentiating between lazy and rich training regimes through a variable parameterization of the architecture. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize CF, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. We identify a transition modulated by task similarity where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning.
[AI-26] Bandwidth Selectors on Semiparametric Bayesian Networks
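【Quick read】: This paper addresses the fact that, in semiparametric Bayesian networks (SPBNs), the conventional bandwidth matrix selection based on a normality assumption can lead to inaccurate density estimation and reduced predictive performance on non-normal data. The key is to introduce more advanced bandwidth selectors, such as cross-validation and plug-in selectors, to improve the learning capability and applicability of SPBNs on complex data distributions. Experiments show that these selectors exploit additional data more effectively than the normal rule, especially at larger sample sizes.
Link: https://arxiv.org/abs/2506.16844
Authors: Victor Alejandre(1),Concha Bielza(1),Pedro Larrañaga(1) ((1) Universidad Politecnica de Madrid, Spain)
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 37 pages, 15 figures. Submitted to Information Sciences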
Abstract:Semiparametric Bayesian networks (SPBNs) integrate parametric and non-parametric probabilistic models, offering flexibility in learning complex data distributions from samples. In particular, kernel density estimators (KDEs) are employed for the non-parametric component. Under the assumption of data normality, the normal rule is used to learn the bandwidth matrix for the KDEs in SPBNs. This matrix is the key hyperparameter that controls the trade-off between bias and variance. However, real-world data often deviates from normality, potentially leading to suboptimal density estimation and reduced predictive performance. This paper first establishes the theoretical framework for the application of state-of-the-art bandwidth selectors and subsequently evaluates their impact on SPBN performance. We explore the approaches of cross-validation and plug-in selectors, assessing their effectiveness in enhancing the learning capability and applicability of SPBNs. To support this investigation, we have extended the open-source package PyBNesian for SPBNs with the additional bandwidth selection techniques and conducted extensive experimental analyses. Our results demonstrate that the proposed bandwidth selectors leverage increasing information more effectively than the normal rule, which, despite its robustness, stagnates with more data. In particular, unbiased cross-validation generally outperforms the normal rule, highlighting its advantage in high sample size scenarios.
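As a point of reference for the baseline, the sketch below shows a normal-reference bandwidth in the scalar case together with a Gaussian KDE that uses it. It assumes that the "normal rule" in the abstract corresponds to the usual normal-reference rule. The one-dimensional simplification, the function names, and the sample data are assumptions, and none of this is PyBNesian's API. SPBNs use a full bandwidth matrix, which is not shown.

```python
import math
import statistics


def normal_rule_bandwidth(samples):
    """1D normal-reference bandwidth: h = sigma * (4 / (3 n)) ** (1 / 5)."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    return sigma * (4.0 / (3.0 * n)) ** 0.2


def gaussian_kde(samples, h, x):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


# Hypothetical one-dimensional sample.
data = [1.2, 1.9, 2.4, 2.8, 3.1, 3.5, 4.0, 4.6]
h = normal_rule_bandwidth(data)
print(f"bandwidth: {h:.3f}, density at 3.0: {gaussian_kde(data, h, 3.0):.3f}")
```

The rule is optimal only when the data are actually Gaussian. Cross-validation and plug-in selectors, which the paper evaluates, instead estimate the bandwidth from the data without that assumption.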
[AI-27] Learning Dexterous Object Handover
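【Quick read】: This paper addresses dexterous object handover between multi-finger robotic hands, a key skill for deploying robots in collaborative settings. The key is a new reward function based on dual quaternions that minimizes the rotation distance and outperforms other rotation representations such as Euler angles and rotation matrices. This reward function improves the performance and robustness of the handover task.
Link: https://arxiv.org/abs/2506.16822
Authors: Daniel Frau-Alfaro,Julio Castaño-Amoros,Santiago Puente,Pablo Gil,Roberto Calandra
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Paper accepted for presentation in RoMan 2025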
Abstract:Object handover is an important skill that we use daily when interacting with other humans. To deploy robots in collaborative settings, like houses, being able to receive and hand over objects safely and efficiently becomes a crucial skill. In this work, we demonstrate the use of Reinforcement Learning (RL) for dexterous object handover between two multi-finger hands. Key to this task is the use of a novel reward function based on dual quaternions to minimize the rotation distance, which outperforms other rotation representations such as Euler angles and rotation matrices. The robustness of the trained policy is experimentally evaluated by testing w.r.t. objects that are not included in the training distribution, and perturbations during the handover process. The results demonstrate that the trained policy successfully performs this task, achieving a total success rate of 94% in the best-case scenario after 100 experiments, thereby showing the robustness of our policy with novel objects. In addition, the best-case performance of the policy decreases by only 13.8% when the other robot moves during the handover, proving that our policy is also robust to this type of perturbation, which is common in real-world object handovers.
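To illustrate what a quaternion-based rotation distance looks like, here is a small sketch using ordinary unit quaternions in `(w, x, y, z)` order. It covers only the rotation part. The paper's reward is built on dual quaternions, which also encode translation, and its exact form is not given in the abstract. The function names and the negative-distance reward shape are assumptions.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def rotation_distance(q1, q2):
    """Angle in radians of the rotation taking orientation q1 to q2.

    The absolute value handles the double cover: q and -q are the
    same rotation and must have distance zero.
    """
    q1, q2 = normalize(q1), normalize(q2)
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


def rotation_reward(q_current, q_target):
    """Hypothetical dense reward: larger when the orientations are closer."""
    return -rotation_distance(q_current, q_target)


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
print(rotation_distance(identity, quarter_turn_z))  # 90 degrees about z, in radians
print(rotation_reward(identity, quarter_turn_z))
```

Unlike Euler angles, this distance has no gimbal-lock singularities and varies smoothly with the orientation error, which is one plausible reason quaternion-style rewards behave better in practice.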
[AI-28] Robust Dynamic Material Handling via Adaptive Constrained Evolutionary Reinforcement Learning
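[Quick read]: This paper addresses real-time task assignment in dynamic material handling (DMH), with the goal of minimizing makespan and tardiness. The main challenges are adapting to dynamic events, satisfying constraints such as task delay, coping with sparse rewards, and making effective use of limited computational resources and historical records. To meet these challenges, the paper proposes a novel adaptive constrained evolutionary reinforcement learning (ACERL) approach. Its core is to maintain a population of actors for diverse exploration, to handle sparse rewards and constraint violations at the level of each actor so as to restrict policy behaviour, and to adaptively select the most beneficial training instances to improve the policy.
Link: https://arxiv.org/abs/2506.16795
Authors: Chengpeng Hu, Ziming Wang, Bo Yuan, Jialin Liu, Chengqi Zhang, Xin Yao
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: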
Abstract:Dynamic material handling (DMH) involves the assignment of dynamically arriving material transporting tasks to suitable vehicles in real time for minimising makespan and tardiness. In real-world scenarios, historical task records are usually available, which enables the training of a decision policy on multiple instances consisting of historical records. Recently, reinforcement learning has been applied to solve DMH. Due to the occurrence of dynamic events such as new tasks, adaptability is highly required. Solving DMH is challenging since constraints including task delay should be satisfied. Feedback is received only when all tasks are served, which leads to sparse reward. Besides, making the best use of limited computational resources and historical records for training a robust policy is crucial. The time allocated to different problem instances would highly impact the learning process. To tackle those challenges, this paper proposes a novel adaptive constrained evolutionary reinforcement learning (ACERL) approach, which maintains a population of actors for diverse exploration. ACERL accesses each actor for tackling sparse rewards and constraint violation to restrict the behaviour of the policy. Moreover, ACERL adaptively selects the most beneficial training instances for improving the policy. Extensive experiments on eight training and eight unseen test instances demonstrate the outstanding performance of ACERL compared with several state-of-the-art algorithms. Policies trained by ACERL can schedule the vehicles while fully satisfying the constraints. Additional experiments on 40 unseen noised instances show the robust performance of ACERL. Cross-validation further presents the overall effectiveness of ACERL. Besides, a rigorous ablation study highlights the coordination and benefits of each ingredient of ACERL.
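[AI-29] TabArena: A Living Benchmark for Machine Learning on Tabular Data
[Quick read]: This paper addresses the fact that current benchmarks for tabular data are static and not continuously updated, so they fail to reflect updated model versions, discovered flaws, or newly released models. The key to the solution is TabArena, the first continuously maintained living tabular benchmarking system. It rests on a manually curated collection of representative datasets and well-implemented models, a large-scale benchmarking study that initializes a public leaderboard, and a team of experienced maintainers, which together keep the benchmark current and reliable.
Link: https://arxiv.org/abs/2506.16791
Authors: Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 51 pages. Code available at this https URL, examples at this https URL, dataset curation at this https URL and this https URL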
Abstract:With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning and investigate the contributions of individual models. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at this https URL.
[AI-30] What Is the Point of Equality in Machine Learning Fairness? Beyond Equality of Opportunity
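[Quick read]: This paper addresses the incomplete ethical foundation that results from the current focus of machine learning (ML) fairness research on distributive equality. Existing work usually treats unfair ML models as wrong because they fail to distribute social resources and opportunities equally, but this view cannot fully explain why representational harms are morally wrong, nor why ML systems should foster a society in which people relate as equals (relational equality). The key proposal is a multifaceted egalitarian framework that integrates distributive and relational equality. Drawing on critical social and political philosophy, it offers a more comprehensive ethical foundation for addressing the full range of harms caused by ML systems.
Link: https://arxiv.org/abs/2506.16782
Authors: Youjin Kong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted for presentation at ACM FAccT 2025; under final review (minor revision) at an ACM journal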
Abstract:Fairness in machine learning (ML) has become a rapidly growing area of research. But why, in the first place, is unfairness in ML morally wrong? And why should we care about improving fairness? Most fair-ML research implicitly appeals to distributive equality: the idea that desirable goods and benefits, such as opportunities (e.g., Barocas et al., 2023), should be equally distributed across society. Unfair ML models, then, are seen as wrong because they unequally distribute such benefits. This paper argues that this exclusive focus on distributive equality offers an incomplete and potentially misleading ethical foundation. Grounding ML fairness in egalitarianism – the view that equality is a fundamental moral and social ideal – requires challenging structural inequality: systematic, institutional, and durable arrangements that privilege some groups while disadvantaging others. Structural inequality manifests through ML systems in two primary forms: allocative harms (e.g., economic loss) and representational harms (e.g., stereotypes, erasure). While distributive equality helps address allocative harms, it fails to explain why representational harms are wrong – why it is wrong for ML systems to reinforce social hierarchies that stratify people into superior and inferior groups – and why ML systems should aim to foster a society where people relate as equals (i.e., relational equality). To address these limitations, the paper proposes a multifaceted egalitarian framework for ML fairness that integrates both distributive and relational equality. Drawing on critical social and political philosophy, this framework offers a more comprehensive ethical foundation for tackling the full spectrum of harms perpetuated by ML systems. The paper also outlines practical pathways for implementing the framework across the ML pipeline.
[AI-31] Reinforcement learning for hybrid charging stations planning and operation considering fixed and mobile chargers
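[Quick read]: This paper addresses the optimal planning and operation of hybrid charging infrastructure in urban road networks, aiming to improve the availability of charging facilities and reduce user inconvenience. The key to the solution is an integrated method that combines the siting and configuration of fixed charging stations with the dynamic scheduling of mobile chargers. It relies on a charging demand prediction model grounded in Model Predictive Control (MPC) and combines deep reinforcement learning with heuristic scheduling to bridge fixed-charger planning and real-time mobile-charger operation.
Link: https://arxiv.org/abs/2506.16764
Authors: Yanchen Zhu, Honghui Zou, Chufan Liu, Yuyu Luo, Yuankai Wu, Yuxuan Liang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages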
Abstract:The success of vehicle electrification, which brings significant societal and environmental benefits, is contingent upon the availability of efficient and adaptable charging infrastructure. Traditional fixed-location charging stations often face issues like underutilization or congestion due to the dynamic nature of charging demand. Mobile chargers have emerged as a flexible solution, capable of relocating to align with these demand fluctuations. This paper addresses the optimal planning and operation of hybrid charging infrastructures, integrating both fixed and mobile chargers within urban road networks. We introduce the Hybrid Charging Station Planning and Operation (HCSPO) problem, which simultaneously optimizes the location and configuration of fixed charging stations and schedules mobile chargers for dynamic operations. Our approach incorporates a charging demand prediction model grounded in Model Predictive Control (MPC) to enhance decision-making. To solve the HCSPO problem, we propose a deep reinforcement learning method, augmented with heuristic scheduling techniques, to effectively bridge the planning of fixed chargers with the real-time operation of mobile chargers. Extensive case studies using real-world urban scenarios demonstrate that our method significantly improves the availability of charging infrastructure and reduces user inconvenience compared to existing solutions and baselines.
[AI-32] Metapath-based Hyperbolic Contrastive Learning for Heterogeneous Graph Embedding
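[Quick read]: This paper addresses the inability of existing hyperbolic heterogeneous graph embedding models, which rely on a single hyperbolic space, to capture the diverse power-law structures in heterogeneous graphs. The key to the solution is a Metapath-based Hyperbolic Contrastive Learning framework (MHCL). It uses multiple hyperbolic spaces to capture the complex structures of heterogeneous graphs and optimizes the metapath embeddings with contrastive learning, which improves the discriminability between embeddings of different metapaths and represents the semantic information of heterogeneous graphs more effectively.
Link: https://arxiv.org/abs/2506.16754
Authors: Jongmin Park, Seunghoon Han, Won-Yong Shin, Sungsu Lim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 14 pages, 9 figures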
Abstract:The hyperbolic space, characterized by a constant negative curvature and exponentially expanding space, aligns well with the structural properties of heterogeneous graphs. However, although heterogeneous graphs inherently possess diverse power-law structures, most hyperbolic heterogeneous graph embedding models rely on a single hyperbolic space. This approach may fail to effectively capture the diverse power-law structures within heterogeneous graphs. To address this limitation, we propose a Metapath-based Hyperbolic Contrastive Learning framework (MHCL), which uses multiple hyperbolic spaces to capture diverse complex structures within heterogeneous graphs. Specifically, by learning each hyperbolic space to describe the distribution of complex structures corresponding to each metapath, it is possible to capture semantic information effectively. Since metapath embeddings represent distinct semantic information, preserving their discriminability is important when aggregating them to obtain node representations. Therefore, we use a contrastive learning approach to optimize MHCL and improve the discriminability of metapath embeddings. In particular, our contrastive learning method minimizes the distance between embeddings of the same metapath and maximizes the distance between those of different metapaths in hyperbolic space, thereby improving the separability of metapath embeddings with distinct semantic information. We conduct comprehensive experiments to evaluate the effectiveness of MHCL. The experimental results demonstrate that MHCL outperforms state-of-the-art baselines in various graph machine learning tasks, effectively capturing the complex structures of heterogeneous graphs.
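As a rough illustration of a distance-based contrastive objective in hyperbolic space, the sketch below uses the Poincaré ball model. The choice of the Poincaré ball, the InfoNCE-style loss, and the names `poincare_distance` and `contrastive_loss` are assumptions; the abstract does not specify which hyperbolic model or loss form MHCL uses.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """Assumed InfoNCE-style loss: small when the positive (same metapath) is
    close to the anchor and the negatives (other metapaths) are far away."""
    pos = math.exp(-poincare_distance(anchor, positive) / temperature)
    neg = sum(
        math.exp(-poincare_distance(anchor, n) / temperature) for n in negatives
    )
    return -math.log(pos / (pos + neg))
```

Minimizing this loss pulls embeddings of the same metapath together and pushes embeddings of different metapaths apart, which matches the separability goal described in the abstract.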
[AI-33] Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation ICML2025
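[Quick read]: This paper addresses the robustness of reinforcement learning (RL) to adversarial input observations, in particular the challenge of handling worst-case scenarios over long time horizons. Existing methods have had some success, but their alternating learning makes the agent and the adversary mutually dependent, which makes interaction with the environment inefficient and hinders the development of off-policy methods. The key to the proposed solution is to reformulate adversarial learning as a soft-constrained optimization problem, which removes the need for additional environment interactions and makes the method more efficient and practical.
Link: https://arxiv.org/abs/2506.16753
Authors: Kosuke Nakanishi, Akihiro Kubo, Yuji Yasui, Shin Ishii
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: ICML2025 poster, 39 pages, 6 figures, 13 tables. arXiv admin note: text overlap with arXiv:2409.00418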
Abstract:Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL’s inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent’s cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at this https URL.
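[AI-34] On Training-Test (Mis)alignment in Unsupervised Combinatorial Optimization: Observation, Empirical Exploration, and Analysis DATE ICML2025
[Quick read]: This paper addresses the mismatch between training and testing in unsupervised combinatorial optimization (UCO). Existing UCO methods train for continuous decisions that are promising in a probabilistic sense, and at test time apply derandomization to obtain the final deterministic decisions. Because the two procedures are not aligned, a lower training loss does not necessarily mean better performance after derandomization. The key idea explored is to include a differentiable version of derandomization in training so that the training and test stages are better aligned.
Link: https://arxiv.org/abs/2506.16732
Authors: Fanchen Bu, Kijung Shin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Probability (math.PR)
Comments: 2nd Workshop on Test-Time Adaptation: Putting Updates to the Test @ ICML 2025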
Abstract:In unsupervised combinatorial optimization (UCO), during training, one aims to have continuous decisions that are promising in a probabilistic sense for each training instance, which enables end-to-end training on initially discrete and non-differentiable problems. At test time, for each test instance, starting from continuous decisions, derandomization is typically applied to obtain the final deterministic decisions. Researchers have developed more and more powerful test-time derandomization schemes to enhance the empirical performance and the theoretical guarantee of UCO methods. However, we notice a misalignment between training and testing in the existing UCO methods. Consequently, lower training losses do not necessarily entail better post-derandomization performance, even for the training instances without any data distribution shift. Empirically, we indeed observe such undesirable cases. We explore a preliminary idea to better align training and testing in UCO by including a differentiable version of derandomization into training. Our empirical exploration shows that such an idea indeed improves training-test alignment, but also introduces nontrivial challenges into training.
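The test-time derandomization step can be sketched on a toy problem. The example below applies the classical method of conditional expectations to max-cut, starting from per-node probabilities. The choice of max-cut, the greedy node order, and the function names are assumptions for illustration; they are not taken from the paper.

```python
def expected_cut(edges, probs):
    """Expected cut size when node i is placed on side 1 independently
    with probability probs[i]."""
    return sum(
        probs[i] * (1 - probs[j]) + probs[j] * (1 - probs[i]) for i, j in edges
    )

def derandomize(edges, probs):
    """Method of conditional expectations: fix nodes one at a time to the
    value (0 or 1) that keeps the conditional expected cut largest."""
    decisions = list(probs)
    for node in range(len(decisions)):
        best_value, best_score = 0.0, -1.0
        for value in (0.0, 1.0):
            decisions[node] = value
            score = expected_cut(edges, decisions)
            if score > best_score:
                best_value, best_score = value, score
        decisions[node] = best_value
    return [int(d) for d in decisions]
```

Because the expected cut is linear in each individual probability, every fixing step keeps the objective at least as large as before, so the final discrete solution is never worse than the expectation of the continuous one. The training loss only sees the continuous expectation, which is where the misalignment discussed in the abstract can arise.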
[AI-35] Incentivizing High-quality Participation From Federated Learning Agents
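[Quick read]: This paper addresses two key problems in federated learning (FL). First, without proper incentives, participating agents may opt out of the system or contribute low-quality data. Second, existing game-theoretic FL approaches to data collection ignore the heterogeneous effort caused by the data that different agents contribute, which leads to unsatisfactory aggregated models. The key to the solution is an incentive-aware framework that accounts for data heterogeneity. It uses the Wasserstein distance to quantify heterogeneous effort and reformulate the convergence upper bound, designs score functions with a peer prediction mechanism to induce truthful reporting, and builds a two-stage Stackelberg game model to analyze the existence of equilibrium.
Link: https://arxiv.org/abs/2506.16731
Authors: Jinlong Pang, Jiaheng Wei, Yifan Hua, Chen Qian, Yang Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: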
Abstract:Federated learning (FL) provides a promising paradigm for facilitating collaboration between multiple clients that jointly learn a global model without directly sharing their local data. However, existing research suffers from two caveats: 1) From the perspective of agents, voluntary and unselfish participation is often assumed. But self-interested agents may opt out of the system or provide low-quality contributions without proper incentives; 2) From the mechanism designer’s perspective, the aggregated models can be unsatisfactory as the existing game-theoretical federated learning approach for data collection ignores the potential heterogeneous effort caused by contributed data. To alleviate the above challenges, we propose an incentive-aware framework for agent participation that considers data heterogeneity to accelerate the convergence process. Specifically, we first introduce the notion of Wasserstein distance to explicitly illustrate the heterogeneous effort and reformulate the existing upper bound of convergence. To induce truthful reporting from agents, we analyze and measure the generalization error gap of any two agents by leveraging the peer prediction mechanism to develop score functions. We further present a two-stage Stackelberg game model that formalizes the process and examines the existence of equilibrium. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed mechanism.
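For readers unfamiliar with the Wasserstein distance used here to describe heterogeneity, the following sketch computes the 1-Wasserstein distance between two equal-size one-dimensional samples. Restricting to one dimension and equal sample sizes is a simplifying assumption, and `wasserstein_1d` is a hypothetical helper name, not part of the paper's framework.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples on the
    real line: the mean absolute difference between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)
```

For instance, a client whose local feature values are all shifted by 2 relative to a reference sample sits at distance 2 from it, while a client whose sample matches the reference sits at distance 0.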
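[AI-36] TriCon-SF: A Triple-Shuffle and Contribution-Aware Serial Federated Learning Framework for Heterogeneous Healthcare Data
[Quick read]: This paper addresses privacy leakage and model security in cross-silo federated learning on heterogeneous data. Even without centralized aggregation, passing models directly between clients can violate privacy regulations and remains exposed to gradient leakage and linkage attacks. Ensuring robustness in the presence of semi-honest or malicious clients is a further major challenge. The key to the solution is the TriCon-SF framework, which applies three levels of randomization (shuffling model layers, data segments, and training sequences) to break deterministic learning patterns and thereby strengthen privacy and robustness. It also uses Shapley value methods to evaluate client contributions dynamically, which enables detection of dishonest behaviour and improves accountability.
Link: https://arxiv.org/abs/2506.16723
Authors: Yuping Yan, Yizhi Wang, Yuanshuai Li, Yaochu Jin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: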
Abstract:Serial pipeline training is an efficient paradigm for handling data heterogeneity in cross-silo federated learning with low communication overhead. However, even without centralized aggregation, direct transfer of models between clients can violate privacy regulations and remain susceptible to gradient leakage and linkage attacks. Additionally, ensuring resilience against semi-honest or malicious clients who may manipulate or misuse received models remains a grand challenge, particularly in privacy-sensitive domains such as healthcare. To address these challenges, we propose TriCon-SF, a novel serial federated learning framework that integrates triple shuffling and contribution awareness. TriCon-SF introduces three levels of randomization by shuffling model layers, data segments, and training sequences to break deterministic learning patterns and disrupt potential attack vectors, thereby enhancing privacy and robustness. In parallel, it leverages Shapley value methods to dynamically evaluate client contributions during training, enabling the detection of dishonest behavior and enhancing system accountability. Extensive experiments on non-IID healthcare datasets demonstrate that TriCon-SF outperforms standard serial and parallel federated learning in both accuracy and communication efficiency. Security analysis further supports its resilience against client-side privacy attacks.
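The contribution-awareness component relies on Shapley values. A minimal sketch of exact Shapley computation for a handful of clients follows. The utility function, the client names, and the additive toy gains are hypothetical; the abstract does not say how TriCon-SF defines utility or approximates the values at scale.

```python
from itertools import permutations

def shapley_values(clients, utility):
    """Exact Shapley values: each client's marginal contribution to the
    utility, averaged over every ordering in which clients could join."""
    totals = {c: 0.0 for c in clients}
    orderings = list(permutations(clients))
    for order in orderings:
        coalition = frozenset()
        for client in order:
            with_client = coalition | {client}
            totals[client] += utility(with_client) - utility(coalition)
            coalition = with_client
    return {c: total / len(orderings) for c, total in totals.items()}

# Hypothetical utility: accuracy gain of a model trained on a coalition,
# taken here to be additive purely for illustration.
toy_gain = {"hospital_a": 0.3, "hospital_b": 0.2, "hospital_c": 0.0}

def toy_utility(coalition):
    return sum(toy_gain[c] for c in coalition)

values = shapley_values(list(toy_gain), toy_utility)
```

In this toy setting `hospital_c` receives a value of roughly zero, which is the kind of signal a system could use to flag a client that adds nothing or behaves dishonestly. Exact enumeration grows factorially with the number of clients, so practical systems typically rely on sampling-based approximations.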
[AI-37] Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation
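[Quick read]: This paper addresses how a single agent in a multi-agent system can adapt to new environments and tasks while collaborating and competing effectively with unknown teammates and opponents. The key to the solution is a new modeling approach, Multi-Retrieval and Dynamic Generation (MRDG), which models teammates and opponents from their behavioral trajectories. It combines a positional encoder, a hypernetwork module, and a viewpoint alignment module to improve the agent's learning and adaptive capabilities, and thus its generalization across scenarios and tasks.
Link: https://arxiv.org/abs/2506.16718
Authors: Chenxu Wang, Yonggang Jin, Cheng Hu, Youpeng Zhao, Zipeng Dai, Jian Zhao, Shiyu Huang, Liuyu Xiang, Junge Zhang, Zhaofeng He
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: This manuscript is under submission to Neurocomputing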
Abstract:Adapting a single agent to a new multi-agent system brings challenges, necessitating adjustments across various tasks, environments, and interactions with unknown teammates and opponents. Addressing this challenge is highly complex, and researchers have proposed two simplified scenarios, multi-agent reinforcement learning for zero-shot learning and Ad-Hoc Teamwork. Building on these foundations, we propose a more comprehensive setting, Agent Collaborative-Competitive Adaptation (ACCA), which evaluates an agent’s ability to generalize across diverse scenarios, tasks, and interactions with both unfamiliar opponents and teammates. In ACCA, agents adjust to task and environmental changes, collaborate with unseen teammates, and compete against unknown opponents. We introduce a new modeling approach, Multi-Retrieval and Dynamic Generation (MRDG), that effectively models both teammates and opponents using their behavioral trajectories. This method incorporates a positional encoder for varying team sizes and a hypernetwork module to boost agents’ learning and adaptive capabilities. Additionally, a viewpoint alignment module harmonizes the observational perspectives of retrieved teammates and opponents with the learning agent. Extensive tests in benchmark scenarios like SMAC, Overcooked-AI, and Melting Pot show that MRDG significantly improves robust collaboration and competition with unseen teammates and opponents, surpassing established baselines. Our code is available at: this https URL
[AI-38] Interpretable Low-Dimensional Modeling of Spatiotemporal Agent States for Decision Making in Football Tactics
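[Quick read]: This paper addresses three shortcomings in football tactics analysis: traditional models are computationally expensive, learning-based models lack interpretability, and rule-based models have not fully considered the states of all players. The key to the solution is a low-dimensional, rule-based model built on spatiotemporal data. It defines interpretable state variables, such as the spatial positions and states of the ball-holder and potential pass receivers, and combines StatsBomb event data with SkillCorner tracking data to train an XGBoost model that predicts pass success, yielding a more interpretable and practical tactical analysis tool.
Link: https://arxiv.org/abs/2506.16696
Authors: Kenjiro Ide, Taiga Someya, Kohei Kawaguchi, Keisuke Fujii
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, presented in iCSports 2024 Abstract Track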
Abstract:Understanding football tactics is crucial for managers and analysts. Previous research has proposed models based on spatial and kinematic equations, but these are computationally expensive. Reinforcement learning approaches use player positions and velocities but lack interpretability and require large datasets. Rule-based models align with expert knowledge but have not fully considered all players’ states. This study explores whether low-dimensional, rule-based models using spatiotemporal data can effectively capture football tactics. Our approach defines interpretable state variables for both the ball-holder and potential pass receivers, based on criteria that explore options like passing. Through discussions with a manager, we identified key variables representing the game state. We then used StatsBomb event data and SkillCorner tracking data from the 2023/24 LaLiga season to train an XGBoost model to predict pass success. The analysis revealed that the distance between the player and the ball, as well as the player’s space score, were key factors in determining successful passes. Our interpretable low-dimensional modeling facilitates tactical analysis through the use of intuitive variables and provides practical value as a tool to support decision-making in football.
[AI-39] Fast and Stable Diffusion Planning through Variational Adaptive Weighting
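[Quick read]: This paper addresses the high training cost and slow convergence of diffusion models in offline reinforcement learning (offline RL), especially with transformer-based denoising backbones. The key to the solution is a variationally optimal uncertainty-aware weighting function, together with a closed-form polynomial approximation method for estimating it online under the flow-based generative modeling framework, which improves the stability and efficiency of training.
Link: https://arxiv.org/abs/2506.16688
Authors: Zhiying Qiu, Tao Lin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: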
Abstract:Diffusion models have recently shown promise in offline RL. However, these methods often suffer from high training costs and slow convergence, particularly when using transformer-based denoising backbones. While several optimization strategies have been proposed – such as modified noise schedules, auxiliary prediction targets, and adaptive loss weighting – challenges remain in achieving stable and efficient training. In particular, existing loss weighting functions typically rely on neural network approximators, which can be ineffective in early training phases due to limited generalization capacity of MLPs when exposed to sparse feedback in the early training stages. In this work, we derive a variationally optimal uncertainty-aware weighting function and introduce a closed-form polynomial approximation method for its online estimation under the flow-based generative modeling framework. We integrate our method into a diffusion planning pipeline and evaluate it on standard offline RL benchmarks. Experimental results on Maze2D and Kitchen tasks show that our method achieves competitive performance with up to 10 times fewer training steps, highlighting its practical effectiveness.
[AI-40] A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation
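[Quick read]: This paper addresses two problems in generative retrieval-based recommendation: the computational burden caused by the redundancy and sheer scale of the token space in large-scale recommender systems, and the mismatch between existing reconstruction-based quantization and the goal of generative retrieval, which is to differentiate among items. It also aims to integrate multi-modal side information (such as text, images, and geographical knowledge) effectively to improve recommendations. The key to the solution is SimCIT, an unsupervised deep quantization method based on contrastive learning. Its core idea is to use a learnable residual quantization module that jointly optimizes multi-modal knowledge alignment and semantic tokenization within a contrastive learning framework.
Link: https://arxiv.org/abs/2506.16683
Authors: Penglong Zhai, Yifang Yuan, Fanyi Di, Jie Li, Yue Liu, Chen Li, Jie Huang, Sicong Wang, Yao Xu, Xin Li
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures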
Abstract:Generative retrieval-based recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. However, in large-scale recommendation systems, this approach becomes increasingly cumbersome due to the redundancy and sheer scale of the token space. To overcome these limitations, recent research has explored the use of semantic tokens as an alternative to ID tokens, which typically leveraged reconstruction-based strategies, like RQ-VAE, to quantize content embeddings and significantly reduce the embedding size. However, reconstructive quantization aims for the precise reconstruction of each item embedding independently, which conflicts with the goal of generative retrieval tasks focusing more on differentiating among items. Moreover, multi-modal side information of items, such as descriptive text and images, and geographical knowledge in location-based recommendation services, has been shown to be effective in improving recommendations by providing richer contexts for interactions. Nevertheless, effectively integrating such complementary knowledge into existing generative recommendation frameworks remains challenging. To overcome these challenges, we propose a novel unsupervised deep quantization exclusively based on contrastive learning, named SimCIT (a Simple Contrastive Item Tokenization framework). Specifically, different from existing reconstruction-based strategies, SimCIT proposes to use a learnable residual quantization module to align with the signals from different modalities of the items, which combines multi-modal knowledge alignment and semantic tokenization in a mutually beneficial contrastive learning framework. Extensive experiments across public datasets and a large-scale industrial dataset from various domains demonstrate SimCIT’s effectiveness in LLM-based generative recommendation.
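The residual quantization step that turns an item embedding into a short sequence of semantic tokens can be sketched as follows. This shows only the tokenization mechanics with fixed, hand-written codebooks. In SimCIT the codebooks are learned with a contrastive objective, which this sketch does not attempt to reproduce; the function names are illustrative assumptions.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vector)),
    )

def residual_quantize(vector, codebooks):
    """Quantize level by level: pick the nearest codeword, subtract it, and
    pass the remaining residual to the next codebook."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens, residual
```

Each item is then identified by its token tuple, one index per level, so a generative recommender can emit a few tokens from small vocabularies instead of one ID from a very large one.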
[AI-41] A Minimalist Optimizer Design for LLM Pretraining
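[Quick read]: This paper addresses the large memory footprint of optimizer states in large language model (LLM) pretraining, and in particular how to minimize that memory while retaining state-of-the-art performance. The key to the solution is a new optimizer, SCALE (Stochastic Column-normalized Last-layer Momentum), which combines column-normalized stochastic gradient descent (SGD) with first-order momentum applied only to the output layer. Column normalization normalizes the gradient along the output dimension, and momentum is added only to the output layer, where gradient variance is highest, which reduces memory use while maintaining strong performance.
Link: https://arxiv.org/abs/2506.16659
Authors: Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: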
Abstract:Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which require significant memory to maintain first- and second-moment matrices, known as optimizer states. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What is the minimal amount of optimizer state that is truly necessary to retain state-of-the-art performance in LLM pretraining? In this work, we systematically investigate this question using a bottom-up approach. We find that two memory- and compute-efficient optimization techniques are particularly effective: (1) column-wise gradient normalization significantly boosts the performance of plain SGD without requiring momentum; and (2) adding first-order momentum only to the output layer - where gradient variance is highest - yields performance competitive with fully adaptive methods such as Muon. Based on these insights, we propose SCALE (Stochastic Column-normalized Last-layer Momentum), a new optimizer that combines column-normalized SGD with last-layer momentum, where column normalization refers to normalizing the gradient along the output dimension. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira, and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For the LLaMA 7B model, SCALE outperforms the state-of-the-art method APOLLO in terms of both perplexity and memory consumption. In addition, our method serves as a minimalist baseline for more sophisticated optimizer design.
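The two ingredients named in the abstract, column-wise gradient normalization and momentum on the last layer only, can be sketched in a few lines. This is not the authors' implementation. The matrix layout (columns taken as the output dimension), the order of momentum and normalization, and the names `column_normalize` and `scale_step` are all assumptions.

```python
import math

def column_normalize(grad, eps=1e-8):
    """Scale each column of a 2D gradient (a list of rows) to unit L2 norm.
    Assumption: columns correspond to the output dimension."""
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[r][c] ** 2 for r in range(n_rows))) + eps
        for c in range(n_cols)
    ]
    return [[grad[r][c] / norms[c] for c in range(n_cols)] for r in range(n_rows)]

def scale_step(weights, grad, lr, momentum=None, beta=0.9):
    """One update for one weight matrix. Pass a `momentum` buffer only for the
    last layer; every other layer is stateless normalized SGD.
    Assumption: momentum is accumulated first, then column-normalized."""
    direction = grad
    if momentum is not None:
        for r, row in enumerate(grad):
            for c, g in enumerate(row):
                momentum[r][c] = beta * momentum[r][c] + (1.0 - beta) * g
        direction = momentum
    update = column_normalize(direction)
    return [
        [w - lr * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weights, update)
    ]
```

The memory saving comes from the fact that only the last layer keeps a buffer of the same size as its weights, whereas Adam keeps two such buffers for every layer.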
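[AI-42] Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures
[Quick read]: This paper addresses how to perform end-to-end representation learning on multi-table relational databases, replacing traditional feature engineering. The key to the approach is to represent a relational database as a "relational entity graph" and model it with graph neural networks (GNNs) to capture the structural, temporal, and heterogeneous properties of the data. Within this framework, complex relational data can be handled more effectively, and multiple sub-fields of graph machine learning converge towards the design of foundation models.
Link: https://arxiv.org/abs/2506.16654
Authors: Vijay Prakash Dwivedi, Charilaos Kanatsoulis, Shenyang Huang, Jure Leskovec
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: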
Abstract:Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data and has been applied to molecules, social networks, recommendation systems, and transportation, among other domains. Data in multi-tabular relational databases can also be constructed as ‘relational entity graphs’ for Relational Deep Learning (RDL) - a new blueprint that enables end-to-end representation learning without traditional feature engineering. Compared to arbitrary graph-structured data, relational entity graphs have key properties: (i) their structure is defined by primary-foreign key relationships between entities in different tables, (ii) the structural connectivity is a function of the relational schema defining a database, and (iii) the graph connectivity is temporal and heterogeneous in nature. In this paper, we provide a comprehensive review of RDL by first introducing the representation of relational databases as relational entity graphs, and then reviewing public benchmark datasets that have been used to develop and evaluate recent GNN-based RDL models. We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data, while also surveying foundational neural network methods and recent architectural advances specialized for relational entity graphs. Finally, we explore opportunities to unify these distinct modeling challenges, highlighting how RDL converges multiple sub-fields in graph machine learning towards the design of foundation models that can transform the processing of relational data.
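[AI-43] LLMs in Coding and their Impact on the Commercial Software Engineering Landscape
[Quick read]: This paper addresses the security and accuracy problems that arise when generative AI is used in software engineering, including leakage of private data, security flaws in generated code snippets, and the tendency of models to agree with wrong ideas (sycophancy). The proposed remedy is that firms must tag and review every AI-generated line of code, keep prompts and outputs inside private or on-premises deployments, obey emerging safety regulations, and add tests that catch sycophantic answers, so that they gain development speed without losing security and accuracy.
Link: https://arxiv.org/abs/2506.16653
Authors: Vladislav Belozerov, Peter J Barclay, Askhan Sami
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: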
Abstract:Large-language-model coding tools are now mainstream in software engineering. But as these same tools move human effort up the development stack, they present fresh dangers: 10% of real prompts leak private data, 42% of generated snippets hide security flaws, and the models can even “agree” with wrong ideas, a trait called sycophancy. We argue that firms must tag and review every AI-generated line of code, keep prompts and outputs inside private or on-premises deployments, obey emerging safety regulations, and add tests that catch sycophantic answers – so they can gain speed without losing security and accuracy.
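[AI-44] SemAgent: A Semantics Aware Program Repair Agent
[Quick Read]: This paper addresses the tendency of existing agent-based automated program repair (APR) systems to over-localize when handling software engineering tasks: they focus only on the lines of code that look suspicious and fix them in isolation, without a deeper understanding of issue semantics, code semantics, or execution semantics, so the generated patches overfit to the user's issue. The key to the solution is SemAgent, a novel workflow-based approach that combines issue, code, and execution semantics to generate complete patches that fix every line of code relevant to the issue. It does this through a new pipeline that uses execution semantics to retrieve relevant context, understands issue semantics through generalized abstraction, isolates code semantics within the context of that abstraction, and applies this understanding in a two-stage architecture: the first stage proposes fine-grained fixes, and the second stage filters the relevant fixes according to the inferred issue semantics.
Link: https://arxiv.org/abs/2506.16650
Authors: Anvith Pabba, Alex Mathai, Anindya Chakraborty, Baishakhi Ray
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: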
Abstract:Large Language Models (LLMs) have shown impressive capabilities in downstream software engineering tasks such as Automated Program Repair (APR). In particular, there has been a lot of research on repository-level issue-resolution benchmarks such as SWE-Bench. Although there has been significant progress on this topic, we notice that in the process of solving such issues, existing agentic systems tend to hyper-localize on immediately suspicious lines of code and fix them in isolation, without a deeper understanding of the issue semantics, code semantics, or execution semantics. Consequently, many existing systems generate patches that overfit to the user issue, even when a more general fix is preferable. To address this limitation, we introduce SemAgent, a novel workflow-based procedure that leverages issue, code, and execution semantics to generate patches that are complete - identifying and fixing all lines relevant to the issue. We achieve this through a novel pipeline that (a) leverages execution semantics to retrieve relevant context, (b) comprehends issue-semantics via generalized abstraction, (c) isolates code-semantics within the context of this abstraction, and (d) leverages this understanding in a two-stage architecture: a repair stage that proposes fine-grained fixes, followed by a reviewer stage that filters relevant fixes based on the inferred issue-semantics. Our evaluations show that our methodology achieves a solve rate of 44.66% on the SWEBench-Lite benchmark beating all other workflow-based approaches, and an absolute improvement of 7.66% compared to our baseline, which lacks such deep semantic understanding. We note that our approach performs particularly well on issues requiring multi-line reasoning (and editing) and edge-case handling, suggesting that incorporating issue and code semantics into APR pipelines can lead to robust and semantically consistent repairs.
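[AI-45] History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation
[Quick Read]: This paper aims to address the insufficient contextual understanding and repetitive navigation behavior of conventional Object Goal Navigation (ObjectNav) methods when navigating to objects in unseen environments. Existing methods typically use Vision-Language Models (VLMs) only for simple vision-language embedding similarity checks and do not exploit their deeper reasoning abilities. The key to the proposed solution is a history-based dynamic prompting mechanism: given the action history as context, the VLM can produce semantic guidance scores that steer navigation actions and actively avoid decision loops. The paper also proposes a VLM-assisted waypoint generation mechanism to refine the final approach to the target object.
Link: https://arxiv.org/abs/2506.16623
Authors: Mobin Habibpour, Fatemeh Afghah
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: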
Abstract:Object Goal Navigation (ObjectNav) challenges robots to find objects in unseen environments, demanding sophisticated reasoning. While Vision-Language Models (VLMs) show potential, current ObjectNav methods often employ them superficially, primarily using vision-language embeddings for object-scene similarity checks rather than leveraging deeper reasoning. This limits contextual understanding and leads to practical issues like repetitive navigation behaviors. This paper introduces a novel zero-shot ObjectNav framework that pioneers the use of dynamic, history-aware prompting to more deeply integrate VLM reasoning into frontier-based exploration. Our core innovation lies in providing the VLM with action history context, enabling it to generate semantic guidance scores for navigation actions while actively avoiding decision loops. We also introduce a VLM-assisted waypoint generation mechanism for refining the final approach to detected objects. Evaluated on the HM3D dataset within Habitat, our approach achieves a 46% Success Rate (SR) and 24.8% Success weighted by Path Length (SPL). These results are comparable to state-of-the-art zero-shot methods, demonstrating the significant potential of our history-augmented VLM prompting strategy for more robust and context-aware robotic navigation.
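For readers unfamiliar with the two metrics quoted in this abstract, here is a minimal sketch of how Success Rate (SR) and Success weighted by Path Length (SPL) are commonly computed. It assumes the widely used definition in which each successful episode is weighted by the shortest-path length divided by the larger of the actual and shortest path lengths. The function names and the toy episodes are illustrative assumptions, not taken from the paper.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length, averaged over episodes.

    Each episode contributes S * l / max(p, l), where S is 1 on success
    and 0 otherwise, l is the shortest-path length to the goal, and p is
    the length of the path the agent actually took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)
    return total / len(successes)


# Hypothetical episodes: (success flag, shortest path, path taken).
successes = [1, 1, 0, 1]
shortest = [5.0, 5.0, 4.0, 2.0]
actual = [5.0, 10.0, 8.0, 4.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest, actual)
```

SPL can never exceed SR, because a successful episode earns full credit only when the agent follows a shortest path. That is why a method can report 46% SR alongside a noticeably lower SPL.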
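[AI-46] The Role of Explanation Styles and Perceived Accuracy on Decision Making in Predictive Process Monitoring
[Quick Read]: This paper addresses the limited interpretability of deep learning models in Predictive Process Monitoring (PPM), which undermines user trust and adoption. The key to the approach is to provide different types of explanations through Explainable AI (XAI), namely feature importance, rule-based, and counterfactual explanations, and to evaluate how these explanation styles and users' perception of the AI's accuracy affect decision-making. Through an experiment, the study analyzes the effects of explanation style and perceived accuracy on task performance, agreement, and decision confidence, in order to find effective ways to improve users' decisions and trust.
Link: https://arxiv.org/abs/2506.16617
Authors: Soobin Chae, Suhwan Lee, Hanna Hauptmann, Hajo A. Reijers, Xixi Lu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at CAiSE’25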
Abstract:Predictive Process Monitoring (PPM) often uses deep learning models to predict the future behavior of ongoing processes, such as predicting process outcomes. While these models achieve high accuracy, their lack of interpretability undermines user trust and adoption. Explainable AI (XAI) aims to address this challenge by providing the reasoning behind the predictions. However, current evaluations of XAI in PPM focus primarily on functional metrics (such as fidelity), overlooking user-centered aspects such as their effect on task performance and decision-making. This study investigates the effects of explanation styles (feature importance, rule-based, and counterfactual) and perceived AI accuracy (low or high) on decision-making in PPM. We conducted a decision-making experiment, where users were presented with the AI predictions, perceived accuracy levels, and explanations of different styles. Users’ decisions were measured both before and after receiving explanations, allowing the assessment of objective metrics (Task Performance and Agreement) and subjective metrics (Decision Confidence). Our findings show that perceived accuracy and explanation style have a significant effect.
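[AI-47] Distribution Parameter Actor-Critic: Shifting the Agent-Environment Boundary for Diverse Action Spaces
[Quick Read]: This paper aims to remove the restrictions that the type of action space imposes on conventional reinforcement learning (RL), that is, to learn policies more efficiently and stably across different kinds of action spaces (discrete, continuous, mixed, and so on). The key idea is to treat distribution parameters as actions, which redefines the boundary between agent and environment and makes the new action space continuous. On this basis the paper proposes a generalized deterministic policy gradient estimator, Distribution Parameter Policy Gradient (DPPG), which has lower variance than the gradient in the original action space, and pairs it with Interpolated Critic Learning (ICL) to improve learning. Finally, building on TD3, it proposes the Distribution Parameter Actor-Critic (DPAC) algorithm, which outperforms TD3 on continuous control tasks.
Link: https://arxiv.org/abs/2506.16608
Authors: Jiamin He, A. Rupam Mahmood, Martha White
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: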
Abstract:We introduce a novel reinforcement learning (RL) framework that treats distribution parameters as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, mixed, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distribution Parameter Policy Gradient (DPPG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce interpolated critic learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical DPPG-based actor-critic algorithm, Distribution Parameter Actor-Critic (DPAC). Empirically, DPAC outperforms TD3 in MuJoCo continuous control tasks from OpenAI Gym and DeepMind Control Suite, and demonstrates competitive performance on the same environments with discretized action spaces.
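[AI-48] FLAME: Towards Federated Fine-Tuning Large Language Models Through Adaptive SMoE
[Quick Read]: This paper addresses the information loss caused by compressing the global LoRA matrices in existing resource-adaptive LoRA federated fine-tuning methods, which degrades model performance. The key to the solution is FLAME, a framework based on the Sparse Mixture-of-Experts (SMoE) architecture that keeps the full global LoRA matrices and achieves client-side adaptability by varying the number of experts activated per client. FLAME also uses a lightweight rescaling mechanism and an activation-aware aggregation scheme to handle the mismatch in output magnitude caused by partial expert activation and the imbalance in expert training quality across clients.
Link: https://arxiv.org/abs/2506.16600
Authors: Khiem Le, Tuan Tran, Ting Hua, Nitesh V. Chawla
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: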
Abstract:Existing resource-adaptive LoRA federated fine-tuning methods enable clients to fine-tune models using compressed versions of global LoRA matrices, in order to accommodate various compute resources across clients. This compression requirement will lead to suboptimal performance due to information loss. To address this, we propose FLAME, a novel federated learning framework based on the Sparse Mixture-of-Experts (SMoE) architecture. Unlike prior approaches, FLAME retains full (uncompressed) global LoRA matrices and achieves client-side adaptability by varying the number of activated experts per client. However, incorporating SMoE into federated learning introduces unique challenges, specifically, the mismatch in output magnitude from partial expert activation and the imbalance in expert training quality across clients. FLAME tackles these challenges through a lightweight rescaling mechanism and an activation-aware aggregation scheme. Empirical results across diverse computational settings demonstrate that FLAME consistently outperforms existing methods, providing a robust and effective solution for resource-adaptive federated learning.
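[AI-49] A Community-driven vision for a new Knowledge Resource for AI
[Quick Read]: This paper addresses a critical deficiency in current AI infrastructure: the lack of verifiable, general-purpose, widely available knowledge resources. Despite existing resources such as WordNet, ConceptNet, and Wolfram|Alpha, AI systems still lack what they need in areas such as closing knowledge gaps, robotic planning, and detecting factually false information. The key to the proposed solution is a new community-driven knowledge infrastructure, which should include an open engineering framework for making effective use of knowledge modules, together with conventions and social structures adopted by contributors.
Link: https://arxiv.org/abs/2506.16596
Authors: Vinay K Chaudhri, Chaitan Baru, Brandon Bennett, Mehul Bhatt, Darion Cassel, Anthony G Cohn, Rina Dechter, Esra Erdem, Dave Ferrucci, Ken Forbus, Gregory Gelfond, Michael Genesereth, Andrew S. Gordon, Benjamin Grosof, Gopal Gupta, Jim Hendler, Sharat Israni, Tyler R. Josephson, Patrick Kyllonen, Yuliya Lierler, Vladimir Lifschitz, Clifton McFate, Hande K. McGinty, Leora Morgenstern, Alessandro Oltramari, Praveen Paritosh, Dan Roth, Blake Shepard, Cogan Shimzu, Denny Vrandečić, Mark Whiting, Michael Witbrock
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages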
Abstract:The long-standing goal of creating a comprehensive, multi-purpose knowledge resource, reminiscent of the 1984 Cyc project, still persists in AI. Despite the success of knowledge resources like WordNet, ConceptNet, Wolfram|Alpha and other commercial knowledge graphs, verifiable, general-purpose widely available sources of knowledge remain a critical deficiency in AI infrastructure. Large language models struggle due to knowledge gaps; robotic planning lacks necessary world knowledge; and the detection of factually false information relies heavily on human expertise. What kind of knowledge resource is most needed in AI today? How can modern technology shape its development and evaluation? A recent AAAI workshop gathered over 50 researchers to explore these questions. This paper synthesizes our findings and outlines a community-driven vision for a new knowledge infrastructure. In addition to leveraging contemporary advances in knowledge representation and reasoning, one promising idea is to build an open engineering framework to exploit knowledge modules effectively within the context of practical applications. Such a framework should include sets of conventions and social structures that are adopted by contributors.
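[AI-50] Energy-Based Transfer for Reinforcement Learning
[Quick Read]: This paper addresses the poor sample efficiency of reinforcement learning algorithms in multi-task or continual learning settings. The key to the solution is an energy-based transfer learning method that uses out-of-distribution detection to issue guidance selectively, so that the teacher policy intervenes only in states within its training distribution. This avoids the suboptimal guidance and the bias toward low-reward behavior that differences between tasks would otherwise cause.
Link: https://arxiv.org/abs/2506.16590
Authors: Zeyun Deng, Jasorsi Ghosh, Fiona Xie, Yuzhe Lu, Katia Sycara, Joseph Campbell
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: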
Abstract:Reinforcement learning algorithms often suffer from poor sample efficiency, making them challenging to apply in multi-task or continual learning settings. Efficiency can be improved by transferring knowledge from a previously trained teacher policy to guide exploration in new but related tasks. However, if the new task sufficiently differs from the teacher’s training task, the transferred guidance may be sub-optimal and bias exploration toward low-reward behaviors. We propose an energy-based transfer learning method that uses out-of-distribution detection to selectively issue guidance, enabling the teacher to intervene only in states within its training distribution. We theoretically show that energy scores reflect the teacher’s state-visitation density and empirically demonstrate improved sample efficiency and performance across both single-task and multi-task settings.
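The gating idea in this abstract can be illustrated with a minimal sketch. It assumes the free-energy score commonly used for out-of-distribution detection, E(x) = -T * logsumexp(f(x) / T), computed from a network's output logits; the paper's exact energy definition, its threshold selection, and the function names below are not given in this listing and are assumptions made for illustration.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T).

    Lower energy is conventionally read as 'more in-distribution'.
    The max is subtracted first for numerical stability.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp


def teacher_should_guide(logits, threshold):
    """Hypothetical gate: let the teacher act only on low-energy states."""
    return energy_score(logits) <= threshold


# Toy logits: a confident output versus a flat, uncertain one.
confident = [10.0, 0.0]
uncertain = [0.0, 0.0]
threshold = -5.0  # assumed value; in practice it would be calibrated

guide_confident = teacher_should_guide(confident, threshold)
guide_uncertain = teacher_should_guide(uncertain, threshold)
```

Under these assumptions the student would follow the teacher on the first state and explore on its own in the second, which mirrors the selective guidance described above.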
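[AI-51] AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions
[Quick Read]: This paper addresses the challenges that traditional Quality Assurance (QA) methods face with the complexity, scale, and rapid iteration cycles of modern software systems, in particular the substantial cost of poor quality when resources are limited. The key to the solution is to assess the benefits, challenges, and prospects of integrating modern AI-oriented tools into QA processes and to validate the approach empirically. As a proof of concept, AI agents ran end-to-end regression tests, and only 8.3% of executions of the generated test cases were flaky, which indicates significant potential. The study also notes that practical adoption still has to overcome problems such as generation of semantically identical coverage, the “black box” nature and lack of explainability of Large Language Models (LLMs), and the tendency to correct mutated test cases to match expected results, which underscores the need for thorough verification of generated artifacts and test results.
Link: https://arxiv.org/abs/2506.16586
Authors: Ihor Pysmennyi, Roman Kyslyi, Kyrylo Kleshch
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures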
Abstract:Traditional quality assurance (QA) methods face significant challenges in addressing the complexity, scale, and rapid iteration cycles of modern software systems and are strained by limited resources available, leading to substantial costs associated with poor quality. The object of this research is the Quality Assurance processes for modern distributed software applications. The subject of the research is the assessment of the benefits, challenges, and prospects of integrating modern AI-oriented tools into quality assurance processes. We performed a comprehensive analysis of implications on both verification and validation processes covering exploratory test analyses, equivalence partitioning and boundary analyses, metamorphic testing, finding inconsistencies in acceptance criteria (AC), static analyses, test case generation, unit test generation, test suite optimization and assessment, end to end scenario execution. End to end regression of sample enterprise application utilizing AI-agents over generated test scenarios was implemented as a proof of concept highlighting practical use of the study. The results, with only 8.3% flaky executions of generated test cases, indicate significant potential for the proposed approaches. However, the study also identified substantial challenges for practical adoption concerning generation of semantically identical coverage, “black box” nature and lack of explainability from state-of-the-art Large Language Models (LLMs), the tendency to correct mutated test cases to match expected results, underscoring the necessity for thorough verification of both generated artifacts and test execution results. The research demonstrates AI’s transformative potential for QA but highlights the importance of a strategic approach to implementing these technologies, considering the identified limitations and the need for developing appropriate verification methodologies.
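[AI-52] One Sample is Enough to Make Conformal Prediction Robust
[Quick Read]: This paper addresses how to efficiently produce prediction sets with rigorous coverage guarantees for inputs subject to worst-case noise. Existing smoothing-based robust conformal prediction (RCP) methods need many model forward passes per input, which is computationally expensive. The paper proposes single-sample robust conformal prediction (RCP1). Its key idea is to use any binary certificate to certify the conformal prediction procedure itself rather than individual scores, which keeps the prediction sets robust while giving a smaller average set size than methods that use many passes. The method is agnostic to the task setup (classification or regression) and is further extended to smoothing-based robust conformal risk control.
Link: https://arxiv.org/abs/2506.16553
Authors: Soroush H. Zargarbashi, Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: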
Abstract:Given any model, conformal prediction (CP) returns prediction sets guaranteed to include the true label with high adjustable probability. Robust CP (RCP) extends this to inputs with worst-case noise. A well-established approach is to use randomized smoothing for RCP since it is applicable to any black-box model and provides smaller sets compared to deterministic methods. However, current smoothing-based RCP requires many model forward passes per each input which is computationally expensive. We show that conformal prediction attains some robustness even with a forward pass on a single randomly perturbed input. Using any binary certificate we propose a single sample robust CP (RCP1). Our approach returns robust sets with smaller average set size compared to SOTA methods which use many (e.g. around 100) passes per input. Our key insight is to certify the conformal prediction procedure itself rather than individual scores. Our approach is agnostic to the setup (classification and regression). We further extend our approach to smoothing-based robust conformal risk control.
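As background for this abstract, the following is a minimal sketch of plain (non-robust) split conformal prediction for classification. It is not the RCP1 procedure; it only shows the baseline that robust variants extend. It assumes the common nonconformity score 1 - p(true label) and the usual finite-sample quantile at rank ceil((n + 1)(1 - alpha)); the function names and toy numbers are illustrative.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Threshold from n calibration scores at miscoverage level alpha.

    Returns the score of rank ceil((n + 1) * (1 - alpha)). If that rank
    exceeds n, no finite threshold gives the guarantee, so return infinity.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """All labels whose score 1 - p(label) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical calibration scores, each 1 - p(true label).
cal_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
qhat = conformal_quantile(cal_scores, alpha=0.5)

confident_set = prediction_set([0.6, 0.3, 0.1], qhat)
ambiguous_set = prediction_set([0.5, 0.5, 0.0], qhat)
```

The set grows when the model is less certain, which is why average set size is the natural efficiency measure when comparing robust conformal methods.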
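[AI-53] BIDA: A Bi-level Interaction Decision-making Algorithm for Autonomous Vehicles in Dynamic Traffic Scenarios
[Quick Read]: This paper aims to address the safety and decision-efficiency problems that the unpredictability of human behavior creates for autonomous vehicles (AVs) interacting with other traffic participants in complex real-world traffic, particularly in dynamic scenarios such as multi-lane highways and unsignalized T-intersections. The key to the solution is a Bi-level Interaction Decision-making Algorithm (BIDA) that combines interactive Monte Carlo Tree Search (MCTS) with Deep Reinforcement Learning (DRL). Reliable value and policy networks guide the online deduction process of interactive MCTS, and a dynamic trajectory planner and a trajectory tracking controller ensure smooth maneuvers, which improves the interaction rationality, efficiency, and safety of AVs in dynamic key traffic scenarios.
Link: https://arxiv.org/abs/2506.16546
Authors: Liyang Yu, Tianyi Wang, Junfeng Jiao, Fengwu Shan, Hongqing Chu, Bingzhao Gao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 6 pages, 3 figures, 4 tables, accepted for IEEE Intelligent Vehicles (IV) Symposium 2025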
Abstract:In complex real-world traffic environments, autonomous vehicles (AVs) need to interact with other traffic participants while making real-time and safety-critical decisions accordingly. The unpredictability of human behaviors poses significant challenges, particularly in dynamic scenarios, such as multi-lane highways and unsignalized T-intersections. To address this gap, we design a bi-level interaction decision-making algorithm (BIDA) that integrates interactive Monte Carlo tree search (MCTS) with deep reinforcement learning (DRL), aiming to enhance interaction rationality, efficiency and safety of AVs in dynamic key traffic scenarios. Specifically, we adopt three types of DRL algorithms to construct a reliable value network and policy network, which guide the online deduction process of interactive MCTS by assisting in value update and node selection. Then, a dynamic trajectory planner and a trajectory tracking controller are designed and implemented in CARLA to ensure smooth execution of planned maneuvers. Experimental evaluations demonstrate that our BIDA not only enhances interactive deduction and reduces computational costs, but also outperforms other latest benchmarks, which exhibits superior safety, efficiency and interaction rationality under varying traffic conditions.
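[AI-54] ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning
[Quick Read]: This paper addresses the problem that, in AI-driven development, agents based on Large Language Models (LLMs) cannot fully exploit the experience accumulated during exploration when they reason, which leads to inefficiency and suboptimal performance. The key to the solution is ML-Master, a novel AI-for-AI (AI4AI) agent whose core is a selectively scoped memory mechanism that integrates exploration and reasoning. It efficiently combines diverse insights from multiple parallel solution trajectories with analytical reasoning, without overwhelming the agent with excessive context.
Link: https://arxiv.org/abs/2506.16499
Authors: Zexi Liu, Yuzhu Cai, Xinyu Zhu, Yujie Zheng, Runkun Chen, Ying Wen, Yanfeng Wang, Weinan E, Siheng Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: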
Abstract:As AI capabilities advance toward and potentially beyond human-level performance, a natural transition emerges where AI-driven development becomes more efficient than human-centric approaches. A promising pathway toward this transition lies in AI-for-AI (AI4AI), which leverages AI techniques to automate and optimize the design, training, and deployment of AI systems themselves. While LLM-based agents have shown the potential to realize AI4AI, they are often unable to fully leverage the experience accumulated by agents during the exploration of solutions in the reasoning process, leading to inefficiencies and suboptimal performance. To address this limitation, we propose ML-Master, a novel AI4AI agent that seamlessly integrates exploration and reasoning by employing a selectively scoped memory mechanism. This approach allows ML-Master to efficiently combine diverse insights from parallel solution trajectories with analytical reasoning, guiding further exploration without overwhelming the agent with excessive context. We evaluate ML-Master on the MLE-Bench, where it achieves a 29.3% average medal rate, significantly surpassing existing methods, particularly in medium-complexity tasks, while accomplishing this superior performance within a strict 12-hour time constraint-half the 24-hour limit used by previous baselines. These results demonstrate ML-Master’s potential as a powerful tool for advancing AI4AI.
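[AI-55] Grounding Language Models with Semantic Digital Twins for Robotic Planning
[Quick Read]: This paper addresses adaptive, goal-driven robotic task execution in dynamic environments, in particular how to keep task completion reliable in the face of uncertainty and failure. The key to the solution is to combine Semantic Digital Twins (SDTs) with Large Language Models (LLMs): natural-language instructions are decomposed into structured action triplets, which are grounded in contextual environment data from the SDT so that object affordances and interaction rules can be interpreted, supporting action planning and real-time adaptation. When execution fails, the LLM uses error feedback and SDT insights to generate recovery strategies and iteratively revise the action plan.
Link: https://arxiv.org/abs/2506.16493
Authors: Mehreen Naeem, Andrew Melnik, Michael Beetz
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: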
Abstract:We introduce a novel framework that integrates Semantic Digital Twins (SDTs) with Large Language Models (LLMs) to enable adaptive and goal-driven robotic task execution in dynamic environments. The system decomposes natural language instructions into structured action triplets, which are grounded in contextual environmental data provided by the SDT. This semantic grounding allows the robot to interpret object affordances and interaction rules, enabling action planning and real-time adaptability. In case of execution failures, the LLM utilizes error feedback and SDT insights to generate recovery strategies and iteratively revise the action plan. We evaluate our approach using tasks from the ALFRED benchmark, demonstrating robust performance across various household scenarios. The proposed framework effectively combines high-level reasoning with semantic environment understanding, achieving reliable task completion in the face of uncertainty and failure.
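[AI-56] Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining
[Quick Read]: This paper addresses how to give quadrupedal robots autonomous, versatile manipulation skills in a scalable way; the core challenge is equipping quadrupeds efficiently with human-like manipulation abilities. The key to the solution is a cross-embodiment imitation learning system that integrates data from humans and from LocoMan (a quadruped with multiple manipulation modes), builds unified and modularized observation and action spaces, and uses an efficient modularized architecture that supports co-training and pretraining across embodiments. The study also builds the first manipulation dataset for the LocoMan robot, covering a variety of household tasks. The system substantially improves success rates on real-world tasks.
Link: https://arxiv.org/abs/2506.16475
Authors: Yaru Niu, Yunzhe Zhang, Mingyang Yu, Changyi Lin, Chenhao Li, Yikai Wang, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Bingqing Chen, Jonathan Francis, Zhenzhen Li, Jie Tan, Ding Zhao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: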
Abstract:Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a cross-embodiment imitation learning system for quadrupedal manipulation, leveraging data collected from both humans and LocoMan, a quadruped equipped with multiple manipulation modes. Specifically, we develop a teleoperation and data collection pipeline, which unifies and modularizes the observation and action spaces of the human and the robot. To effectively leverage the collected data, we propose an efficient modularized architecture that supports co-training and pretraining on structured modality-aligned data across different embodiments. Additionally, we construct the first manipulation dataset for the LocoMan robot, covering various household tasks in both unimanual and bimanual modes, supplemented by a corresponding human dataset. We validate our system on six real-world manipulation tasks, where it achieves an average success rate improvement of 41.9% overall and 79.7% under out-of-distribution (OOD) settings compared to the baseline. Pretraining with human data contributes a 38.6% success rate improvement overall and 82.7% under OOD settings, enabling consistently better performance with only half the amount of robot data. Our code, hardware, and data are open-sourced at: this https URL.
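[AI-57] Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities
[Quick Read]: This paper addresses efficient sampling from a target unnormalized probability density, a problem central to many high-impact scientific applications. Existing diffusion-based samplers cannot sample from distributions at the scale of even simple molecular systems, so the paper proposes a new framework, Progressive Inference-Time Annealing (PITA), whose key idea is to combine two complementary interpolation techniques: (I) annealing of the Boltzmann distribution and (II) diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperature, using easily accessible samples of the temperature-annealed target density for training. In the subsequent step, inference-time annealing with a novel Feynman-Kac PDE combined with Sequential Monte Carlo yields training samples at a lower temperature for the next diffusion model.
Link: https://arxiv.org/abs/2506.16471
Authors: Tara Akhound-Sadegh, Jungyoon Lee, Avishek Joey Bose, Valentin De Bortoli, Arnaud Doucet, Michael M. Bronstein, Dominique Beaini, Siamak Ravanbakhsh, Kirill Neklyudov, Alexander Tong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: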
Abstract:Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact scientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations. Code available at: this https URL
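To see why temperature annealing makes samples easier to obtain, here is a toy sketch of a tempered Boltzmann distribution over a handful of discrete states, p_T(x) proportional to exp(-E(x) / T). It is only an illustration of the annealing idea, not of PITA itself: the discrete state space, the energies, and the function name are assumptions, and the Boltzmann constant is absorbed into T.

```python
import math


def boltzmann_probs(energies, temperature):
    """Normalized probabilities p_T(x) proportional to exp(-E(x) / T).

    The minimum energy is subtracted before exponentiating so the
    largest weight is exactly 1, which avoids overflow.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical energies for three states.
energies = [0.0, 1.0, 4.0]

hot = boltzmann_probs(energies, temperature=10.0)   # nearly uniform
cold = boltzmann_probs(energies, temperature=0.5)   # concentrated on the minimum
```

At high temperature the probabilities are close to uniform, so every state is visited often; at low temperature the mass concentrates on low-energy states and the remaining ones become rare. Moving from high to low temperature in stages lets each stage start from samples that are already close to its target.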
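[AI-58] Joint Tensor-Train Parameterization for Efficient and Expressive Low-Rank Adaptation
[Quick Read]: This paper aims to overcome the limits of conventional Low-Rank Adaptation (LoRA) in parameter efficiency and expressivity. Standard LoRA optimizes its low-rank matrices independently, which restricts expressivity and generalization; classical tensor-train (TT) decomposition can be applied to individual LoRA matrices, but it neither significantly improves parameter efficiency nor delivers substantial performance gains. The proposed solution, TensorGuide, generates two correlated low-rank LoRA matrices from a unified TT structure driven by controlled Gaussian noise. This gives structured low-rank adaptation that significantly improves expressivity, generalization, and parameter efficiency without increasing the number of trainable parameters.
Link: https://arxiv.org/abs/2506.16456
Authors: Jun Qi, Chen-Yu Liu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Min-Hsiu Hsieh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Preprint. Under Review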
Abstract:Low-Rank Adaptation (LoRA) is widely recognized for its parameter-efficient fine-tuning of large-scale neural models. However, standard LoRA independently optimizes low-rank matrices, which inherently limits its expressivity and generalization capabilities. While classical tensor-train (TT) decomposition can be separately employed on individual LoRA matrices, this work demonstrates that the classical TT-based approach neither significantly improves parameter efficiency nor achieves substantial performance gains. This paper proposes TensorGuide, a novel tensor-train-guided adaptation framework to overcome these limitations. TensorGuide generates two correlated low-rank LoRA matrices through a unified TT structure driven by controlled Gaussian noise. The resulting joint TT representation inherently provides structured, low-rank adaptations, significantly enhancing expressivity, generalization, and parameter efficiency without increasing the number of trainable parameters. Theoretically, we justify these improvements through neural tangent kernel analyses, demonstrating superior optimization dynamics and enhanced generalization. Extensive experiments on quantum dot classification and GPT-2 fine-tuning benchmarks demonstrate that TensorGuide-based LoRA consistently outperforms standard LoRA and TT-LoRA, achieving improved accuracy and scalability with fewer parameters.
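For context, the sketch below shows the standard LoRA update that this abstract takes as its baseline: a frozen weight W plus a scaled low-rank product B A, where A and B are trained independently. It does not implement TensorGuide or any tensor-train structure. The pure-Python helpers, the toy matrices, and the 768-dimensional layer size are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for a rank-r adapter.

    W is d_out x d_in and stays frozen; A is r x d_in and B is d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in A and B combined."""
    return rank * (d_in + d_out)


# Toy rank-1 adapter on a 2 x 2 identity layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward(W, A, B, [1.0, 2.0], alpha=1.0)

# Adapter size versus full fine-tuning for an assumed 768 x 768 layer.
adapter_params = lora_param_count(768, 768, rank=8)
full_params = 768 * 768
```

With these assumed sizes the adapter trains about 2% of the parameters of the full matrix, which is the efficiency baseline that joint parameterizations of A and B try to improve on.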
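[AI-59] Consumer-friendly EEG-based Emotion Recognition System: A Multi-scale Convolutional Neural Network Approach
[Quick Read]: This paper aims to solve automatic emotion recognition from electroencephalogram (EEG) signals in real-life settings. The key to the solution is a novel multi-scale convolutional neural network approach that introduces feature extraction kernels with multiple ratio coefficients, together with a new type of kernel that learns key information from four separate areas of the brain, improving the prediction of valence, arousal, and dominance scores.
Link: https://arxiv.org/abs/2506.16448
Authors: Tri Duc Ly, Gia H. Ngo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 10 figures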
Abstract:EEG is a non-invasive, safe, and low-risk method to record electrophysiological signals inside the brain. Especially with recent technology developments like dry electrodes, consumer-grade EEG devices, and rapid advances in machine learning, EEG is commonly used as a resource for automatic emotion recognition. With the aim to develop a deep learning model that can perform EEG-based emotion recognition in a real-life context, we propose a novel approach to utilize multi-scale convolutional neural networks to accomplish such tasks. By implementing feature extraction kernels with many ratio coefficients as well as a new type of kernel that learns key information from four separate areas of the brain, our model consistently outperforms the state-of-the-art TSception model in predicting valence, arousal, and dominance scores across many performance evaluation metrics.
[AI-60] Leveraging Influence Functions for Resampling Data in Physics-Informed Neural Networks
[Quick Read]: This paper addresses how to improve the prediction accuracy of physics-informed neural networks (PINNs). The key to the solution is targeted resampling of the training data using influence-function-based sampling. By applying data attribution methods from explainable AI (XAI), the approach identifies training points that strongly affect the model's predictions and resamples accordingly, which improves PINN performance.
Link: https://arxiv.org/abs/2506.16443
Authors: Jonas R. Naujoks, Aleksander Krasowski, Moritz Weckbecker, Galip Ümit Yolcu, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, René P. Klausen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: This article was presented at “The 3rd World Conference on eXplainable Artificial Intelligence” (2025)
Abstract:Physics-informed neural networks (PINNs) offer a powerful approach to solving partial differential equations (PDEs), which are ubiquitous in the quantitative sciences. Applied to both forward and inverse problems across various scientific domains, PINNs have recently emerged as a valuable tool in the field of scientific machine learning. A key aspect of their training is that the data – spatio-temporal points sampled from the PDE’s input domain – are readily available. Influence functions, a tool from the field of explainable AI (XAI), approximate the effect of individual training points on the model, enhancing interpretability. In the present work, we explore the application of influence function-based sampling approaches for the training data. Our results indicate that such targeted resampling based on data attribution methods has the potential to enhance prediction accuracy in physics-informed neural networks, demonstrating a practical application of an XAI method in PINN training.
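[AI-61] Agentic Personalisation of Cross-Channel Marketing Experiences
[Quick read]: This paper addresses the personalisation and efficiency of content delivery in consumer applications, where the conventional approach relies on manual marketer work and struggles to personalise content, timing, frequency, and copy-writing efficiently. The key to its solution is to cast the task as a sequential decision-making problem, estimate Individual Treatment Effects with a Difference-in-Differences design, and use Thompson sampling to balance the explore-exploit trade-off, thereby optimising a modular decision-making policy that maximises incremental engagement.
Link: https://arxiv.org/abs/2506.16429
Authors: Sami Abboud,Eleanor Hanna,Olivier Jeunen,Vineesha Raheja,Schaun Wheeler
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: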
Abstract:Consumer applications provide ample opportunities to surface and communicate various forms of content to users. From promotional campaigns for new features or subscriptions, to evergreen nudges for engagement, or personalised recommendations; across e-mails, push notifications, and in-app surfaces. The conventional approach to orchestration for communication relies heavily on labour-intensive manual marketer work, and inhibits effective personalisation of content, timing, frequency, and copy-writing. We formulate this task under a sequential decision-making framework, where we aim to optimise a modular decision-making policy that maximises incremental engagement for any funnel event. Our approach leverages a Difference-in-Differences design for Individual Treatment Effect estimation, and Thompson sampling to balance the explore-exploit trade-off. We present results from a multi-service application, where our methodology has resulted in significant increases to a variety of goal events across several product features, and is currently deployed across 150 million users.
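To make the explore-exploit idea concrete, here is a minimal sketch of Thompson sampling on a Beta-Bernoulli bandit. It is not the authors' system: the binary "engaged / not engaged" reward, the uniform Beta(1, 1) prior, and the names `thompson_select` and `simulate` are assumptions for illustration, and the Difference-in-Differences estimation step is not modelled here.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample one plausible engagement rate per arm and pick the largest.

    Each arm keeps a Beta(1 + successes, 1 + failures) posterior
    (assumed uniform prior; hypothetical setup, not from the paper).
    """
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against fixed, hidden engagement rates."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes, failures = [0] * k, [0] * k
    for _ in range(rounds):
        arm = thompson_select(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1  # the user engaged with this variant
        else:
            failures[arm] += 1   # no engagement observed
    return successes, failures


if __name__ == "__main__":
    succ, fail = simulate([0.05, 0.5], rounds=2000)
    print("pulls per arm:", [s + f for s, f in zip(succ, fail)])
```

Arms with uncertain posteriors still get sampled occasionally, which is the exploration part; arms with high observed engagement win most draws, which is the exploitation part.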
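[AI-62] Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models
[Quick read]: This paper aims to address the load imbalance and accuracy loss that arise in Mixture of Experts (MoE) architectures when the router module performs poorly. The key to its solution is to design and implement several router architectures that improve expert selection, and to characterize them in terms of parameter efficiency, inference latency, routing entropy, and expert utilization. By comparing six router variants (Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and the new MLP-Hadamard), the study reveals the trade-offs between speed and expressiveness across routing mechanisms and confirms that MLP-Hadamard has a distinctive capability for structured, sparse routing.
Link: https://arxiv.org/abs/2506.16419
Authors: Daniel Fidel Harvey,George Weale,Berk Yilmaz
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: All authors contributed equally. 11 pages, 6 figures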
Abstract:Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that moves tokens to specialized experts. Poor routing can cause load imbalance and reduced accuracy. This project designed and implemented different router architectures within Transformer models to fix these limitations. We experimented with six distinct router variants: Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and our new MLP-Hadamard. We characterized these routers using BERT and the Qwen1.5-MoE model, looking at parameter efficiency, inference latency, routing entropy, and expert utilization patterns. Our evaluations showed distinct trade-offs: Linear routers offer speed, while MLP and Attention routers provide greater expressiveness. The MLP-Hadamard router shows a unique capability for structured, sparse routing. We successfully replaced and fine-tuned custom routers within the complex, quantized Qwen1.5-MoE model. This work provides a comparative analysis of MoE router designs and offers insights into optimizing their performance for efficient and effective large-scale model deployment.
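As a rough illustration of what a Linear router and the reported diagnostics could look like, the sketch below scores each expert with one weight row, keeps the top-k experts, and computes routing entropy and expert utilization. This is a generic top-k softmax router written in plain Python, not the paper's code; the function names, the renormalisation of the top-k gates, and the toy weights are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_router(token, weights, top_k=2):
    """Score experts with one weight row each, then keep the top_k.

    Returns the full routing distribution and a dict mapping each
    chosen expert index to its renormalised gate value (assumed scheme).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return probs, {i: probs[i] / norm for i in chosen}


def routing_entropy(probs):
    """Shannon entropy (nats) of one token's routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expert_utilization(tokens, weights, top_k=2):
    """Fraction of routed slots that each expert receives."""
    counts = [0] * len(weights)
    for token in tokens:
        _, gates = linear_router(token, weights, top_k)
        for i in gates:
            counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

Low entropy means the router is confident about a few experts; a utilization vector far from uniform is one sign of the load imbalance the abstract mentions.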
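[AI-63] Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
[Quick read]: This paper addresses the need to run a separate optimization for every dataset when adapting Large Language Models (LLMs) to downstream tasks, with the goal of lowering customization cost. The key to its solution is a prompt-conditioned parameter generation method, Drag-and-Drop LLMs (DnD), which maps a handful of unlabeled task prompts directly to LoRA (Low-Rank Adaptation) weight updates and thereby removes the need for per-task training. A lightweight text encoder compresses each prompt batch into condition embeddings, and a cascaded hyper-convolutional decoder turns them into the full set of LoRA matrices, so that task-specific parameters can be generated quickly.
Link: https://arxiv.org/abs/2506.16406
Authors: Zhiyuan Liang,Dongwen Tang,Yuhao Zhou,Xuanlei Zhao,Mingjia Shi,Wangbo Zhao,Zekai Li,Peihao Wang,Konstantin Schürholt,Damian Borth,Michael M. Bronstein,Yang You,Zhangyang Wang,Kai Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: We propose a method that can generate LoRA parameters in seconds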
Abstract:Modern Parameter-Efficient Fine-Tuning (PEFT) methods such as low-rank adaptation (LoRA) reduce the cost of customizing large language models (LLMs), yet still require a separate optimization run for every downstream dataset. We introduce Drag-and-Drop LLMs (DnD), a prompt-conditioned parameter generator that eliminates per-task training by mapping a handful of unlabeled task prompts directly to LoRA weight updates. A lightweight text encoder distills each prompt batch into condition embeddings, which are then transformed by a cascaded hyper-convolutional decoder into the full set of LoRA matrices. Once trained on a diverse collection of prompt-checkpoint pairs, DnD produces task-specific parameters in seconds, yielding i) up to 12,000× lower overhead than full fine-tuning, ii) average gains up to 30% in performance over the strongest training LoRAs on unseen common-sense reasoning, math, coding, and multimodal benchmarks, and iii) robust cross-domain generalization despite never seeing the target data or labels. Our results demonstrate that prompt-conditioned parameter generation is a viable alternative to gradient-based adaptation for rapidly specializing LLMs. Our project is available at this https URL.
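For readers unfamiliar with what a "LoRA weight update" is, the sketch below shows how a low-rank pair (A, B) is merged into a frozen weight matrix using the usual LoRA convention W' = W + (alpha / r) · B · A. It only illustrates how a generated pair would be applied; how DnD produces A and B from prompts is not modelled, and the function names and toy numbers are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, a, b, alpha):
    """Merge a LoRA update into a frozen weight matrix.

    w: d_out x d_in frozen weights
    a: r x d_in down-projection, b: d_out x r up-projection
    alpha: scaling hyperparameter (standard LoRA convention, assumed here)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # d_out x d_in, rank at most r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_count(d_out, d_in, r):
    """Number of values a generator must emit for one adapted matrix."""
    return r * (d_in + d_out)
```

The parameter count is the reason generating LoRA matrices is far cheaper than generating full weights: for a 4096 x 4096 matrix at rank 8, `lora_param_count` gives 65,536 values instead of roughly 16.8 million.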
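[AI-64] Improved Exploration in GFlownets via Enhanced Epistemic Neural Networks ICML2025
[Quick read]: This paper addresses the problem of efficiently identifying training trajectories in GFlowNets, in particular how to prioritize exploration in regions of the state space where the reward distribution has not yet been sufficiently learned. The key to its solution is an exploration mechanism driven by epistemic uncertainty: epistemic neural networks (ENN) are integrated with the conventional GFlowNets architecture to enable more efficient joint predictions and better uncertainty quantification, which improves exploration efficiency and the identification of optimal trajectories.
Link: https://arxiv.org/abs/2506.16313
Authors: Sajan Muhammad,Salem Lahlou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the EXAIT Workshop at ICML 2025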
Abstract:Efficiently identifying the right trajectories for training remains an open problem in GFlowNets. To address this, it is essential to prioritize exploration in regions of the state space where the reward distribution has not been sufficiently learned. This calls for uncertainty-driven exploration, in other words, the agent should be aware of what it does not know. This attribute can be measured by joint predictions, which are particularly important for combinatorial and sequential decision problems. In this research, we integrate epistemic neural networks (ENN) with the conventional architecture of GFlowNets to enable more efficient joint predictions and better uncertainty quantification, thereby improving exploration and the identification of optimal trajectories. Our proposed algorithm, ENN-GFN-Enhanced, is compared to the baseline method in GFlownets and evaluated in grid environments and structured sequence generation in various settings, demonstrating both its efficacy and efficiency.
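[AI-65] Approximation Fixpoint Theory with Refined Approximation Spaces KR2024
[Quick read]: This paper addresses the limitations that Approximation Fixpoint Theory (AFT) faces on certain relatively simple examples. Conventional AFT constructs or approximates the fixpoints of interest by working with intervals in a lattice, but in some cases this is not expressive enough. The key to its solution is to extend consistent AFT to approximations that are more refined than intervals by introducing a more general notion of approximation spaces, which improves the expressiveness and applicability of the theory.
Link: https://arxiv.org/abs/2506.16294
Authors: Linde Vanbesien,Bart Bogaerts,Marc Denecker
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Submitted to KR 2024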
Abstract:Approximation Fixpoint Theory (AFT) is a powerful theory covering various semantics of non-monotonic reasoning formalisms in knowledge representation such as Logic Programming and Answer Set Programming. Many semantics of such non-monotonic formalisms can be characterized as suitable fixpoints of a non-monotonic operator on a suitable lattice. Instead of working on the original lattice, AFT operates on intervals in such lattice to approximate or construct the fixpoints of interest. While AFT has been applied successfully across a broad range of non-monotonic reasoning formalisms, it is confronted by its limitations in other, relatively simple, examples. In this paper, we overcome those limitations by extending consistent AFT to deal with approximations that are more refined than intervals. Therefore, we introduce a more general notion of approximation spaces, showcase the improved expressiveness and investigate relations between different approximation spaces.
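[AI-66] Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective
[Quick read]: This paper addresses the computational complexity and performance bottlenecks that conventional auto-regressive foundation models face in next-token prediction under high ambiguity. It argues that current models run into computational intractability on high-ambiguity predictions, and that existing approaches do not distinguish between the computational demands of low- and high-ambiguity predictions, which results in a detrimental inductive bias. The key to its solution is the MetaHMM benchmark together with a method, motivated by cognitive science theories, for converting pre-trained models into Monte Carlo predictors that decouple task inference from token prediction and thereby improve performance in ambiguous contexts.
Link: https://arxiv.org/abs/2506.16288
Authors: Leo Gagnon,Eric Elmoznino,Sarthak Mittal,Tom Marty,Tejas Kasetty,Dhanya Sridhar,Guillaume Lajoie
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: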
Abstract:The rapid adaptation ability of auto-regressive foundation models is often attributed to the diversity of their pre-training data. This is because, from a Bayesian standpoint, minimizing prediction error in such settings requires integrating over all plausible latent hypotheses consistent with observations. While this behavior is desirable in principle, it often proves too ambitious in practice: under high ambiguity, the number of plausible latent alternatives makes Bayes-optimal prediction computationally intractable. Cognitive science has long recognized this limitation, suggesting that under such conditions, heuristics or information-seeking strategies are preferable to exhaustive inference. Translating this insight to next-token prediction, we hypothesize that low- and high-ambiguity predictions pose different computational demands, making ambiguity-agnostic next-token prediction a detrimental inductive bias. To test this, we introduce MetaHMM, a synthetic sequence meta-learning benchmark with rich compositional structure and a tractable Bayesian oracle. We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Motivated by cognitive theories, we propose a method to convert pre-trained models into Monte Carlo predictors that decouple task inference from token prediction. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference, though challenges remain.
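[AI-67] Artificial Intelligence for Atmospheric Sciences: A Research Roadmap
[Quick read]: This paper addresses how to integrate Artificial Intelligence (AI) effectively into atmospheric sciences in order to tackle key challenges in atmospheric research such as big data and infrastructure. The key to its solution is an interdisciplinary research roadmap intended to advance the use of AI in analyzing atmospheric phenomena and predicting natural disasters, thereby improving the efficiency and accuracy of atmospheric research.
Link: https://arxiv.org/abs/2506.16281
Authors: Martha Arbayani Zaidan,Naser Hossein Motlagh,Petteri Nurmi,Tareq Hussein,Markku Kulmala,Tuukka Petäjä,Sasu Tarkoma
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments: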
Abstract:Atmospheric sciences are crucial for understanding environmental phenomena ranging from air quality to extreme weather events, and climate change. Recent breakthroughs in sensing, communication, computing, and Artificial Intelligence (AI) have significantly advanced atmospheric sciences, enabling the generation of vast amounts of data through long-term Earth observations and providing powerful tools for analyzing atmospheric phenomena and predicting natural disasters. This paper contributes a critical interdisciplinary overview that bridges the fields of atmospheric science and computer science, highlighting the transformative potential of AI in atmospheric research. We identify key challenges associated with integrating AI into atmospheric research, including issues related to big data and infrastructure, and provide a detailed research roadmap that addresses both current and emerging challenges.
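[AI-68] CapsDT: Diffusion-Transformer for Capsule Robot Manipulation IROS2025
[Quick read]: This paper aims to make interaction with endoscopy robots, in particular capsule endoscopy robots operating inside the digestive system, more intuitive and efficient through Vision-Language-Action (VLA) models. The key to its solution is CapsDT, a Diffusion Transformer model that processes interleaved visual inputs and textual instructions and infers the corresponding robot control signals to carry out endoscopy tasks. The study also builds a capsule endoscopy robot system in which a robotic arm holds a magnet that drives the capsule robot, and uses a stomach simulator to carry out endoscopy tasks at several levels and to create the corresponding datasets.
Link: https://arxiv.org/abs/2506.16263
Authors: Xiting He,Mingwu Su,Xinqi Jiang,Long Bai,Jiewen Lai,Hongliang Ren
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IROS 2025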
Abstract:Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. The integration of VLA models into endoscopy robots allows more intuitive and efficient interactions between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs and textual instructions, CapsDT can infer corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, a capsule robot controlled by a robotic arm-held magnet, addressing different levels of four endoscopy tasks and creating corresponding capsule robot datasets within the stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance in various levels of endoscopy tasks while achieving a 26.25% success rate in real-world simulation manipulation.
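[AI-69] Synthetic ALS-EEG Data Augmentation for ALS Diagnosis Using Conditional WGAN with Weight Clipping
[Quick read]: This paper addresses the scarcity of high-quality electroencephalography (EEG) data from patients with Amyotrophic Lateral Sclerosis (ALS) and the severe imbalance between ALS and healthy control data, both of which hinder the training of reliable machine learning classifiers. The key to its solution is to generate synthetic ALS EEG signals with a Conditional Wasserstein Generative Adversarial Network (CWGAN): the CWGAN is trained on a private EEG dataset (ALS vs. non-ALS) to learn the distribution of ALS EEG signals and produce realistic synthetic samples, which serve as augmented data for classifiers with the aim of mitigating class imbalance and improving ALS detection accuracy.
Link: https://arxiv.org/abs/2506.16243
Authors: Abdulvahap Mutlu,Şengül Doğan,Türker Tuncer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The code is available on GitHub: this https URL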
Abstract:Amyotrophic Lateral Sclerosis (ALS) is a rare neurodegenerative disease, and high-quality EEG data from ALS patients are scarce. This data scarcity, coupled with severe class imbalance between ALS and healthy control recordings, poses a challenge for training reliable machine learning classifiers. In this work, we address these issues by generating synthetic EEG signals for ALS patients using a Conditional Wasserstein Generative Adversarial Network (CWGAN). We train CWGAN on a private EEG dataset (ALS vs. non-ALS) to learn the distribution of ALS EEG signals and produce realistic synthetic samples. We preprocess and normalize EEG recordings, and train a CWGAN model to generate synthetic ALS signals. The CWGAN architecture and training routine are detailed, with key hyperparameters chosen for stable training. Qualitative evaluation of generated signals shows that they closely mimic real ALS EEG patterns. The CWGAN training converged with generator and discriminator loss curves stabilizing, indicating successful learning. The synthetic EEG signals appear realistic and have potential use as augmented data for training classifiers, helping to mitigate class imbalance and improve ALS detection accuracy. We discuss how this approach can facilitate data sharing and enhance diagnostic models.
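The "weight clipping" in the title refers to the original WGAN recipe, which can be sketched in a few lines. This is a generic illustration and not the authors' code: the clip value 0.01 is the default from the original WGAN paper and is an assumption here, as are the function names; a conditional variant would additionally feed the class label (ALS vs. non-ALS) to both networks.

```python
def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: mean score on fakes minus mean score on reals.

    Minimising this pushes the critic to score real samples higher than
    generated ones.
    """
    mean_fake = sum(fake_scores) / len(fake_scores)
    mean_real = sum(real_scores) / len(real_scores)
    return mean_fake - mean_real


def generator_loss(fake_scores):
    """The generator tries to raise the critic's score on its samples."""
    return -sum(fake_scores) / len(fake_scores)


def clip_weights(weights, c=0.01):
    """Clamp every critic weight into [-c, c] after each critic update.

    This is the crude Lipschitz constraint used by the original WGAN;
    c = 0.01 is an assumed default, not a value reported in this abstract.
    """
    return [max(-c, min(c, w)) for w in weights]
```

In a training loop, `clip_weights` would be applied to the critic's parameters after every critic optimisation step, before the next generator step.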
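[AI-70] From Teacher to Student: Tracking Memorization Through Model Distillation ACL2025
[Quick read]: This paper addresses the privacy and security concerns raised by Large Language Models (LLMs) memorizing parts of their training data. The key to its solution is Knowledge Distillation (KD): the knowledge of a large teacher model fine-tuned on a specific dataset is transferred to a smaller student model, which lowers computational cost and model size while significantly reducing the risk of memorizing task data.
Link: https://arxiv.org/abs/2506.16170
Authors: Simardeep Singh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, in-proceedings L2M2 @ ACL 2025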
Abstract:Large language models (LLMs) are known to memorize parts of their training data, raising important concerns around privacy and security. While previous research has focused on studying memorization in pre-trained models, much less is known about how knowledge distillation (KD) affects memorization. In this study, we explore how different KD methods influence the memorization of fine-tuned task data when a large teacher model is distilled into smaller student models. This study demonstrates that distilling a larger teacher model, fine-tuned on a dataset, into a smaller variant not only lowers computational costs and model size but also significantly reduces the memorization risks compared to standard fine-tuning approaches.
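The abstract does not say which KD methods were compared, so the sketch below shows only one common variant, logit-based distillation with a softened softmax, as an assumption. The student is trained to match the teacher's temperature-scaled distribution through a KL term; the function names and the default temperature of 2.0 are illustrative.

```python
import math


def softmax_t(logits, temperature):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures, following the usual distillation convention.
    """
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

Because the student sees the teacher's output distribution rather than the raw fine-tuning examples, this kind of objective is one plausible route by which distillation could reduce verbatim memorization.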
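[AI-71] On using AI for EEG-based BCI applications: problems, current challenges and future trends
[Quick read]: This paper addresses how to apply Artificial Intelligence (AI) effectively to electroencephalography (EEG)-based brain-computer interfaces (BCIs) in order to achieve more practical and efficient brain-computer interaction. The core challenge is building powerful foundation models that cope with the particular complexities of EEG signal decoding, including low signal-to-noise ratio, inter-individual variability, and real-time processing requirements. The key to its solution is to start from a causal perspective to understand the basic paradigms of BCI systems, to propose systematic improvement strategies for the challenges AI models face in data modeling, feature extraction, and generalization, and to explore technological, methodological, and ethical avenues for progress, so that EEG-based BCI technology can be widely used in real-world environments.
Link: https://arxiv.org/abs/2506.16168
Authors: Thomas Barbera,Jacopo Burger,Alessandro D’Amelio,Simone Zini,Simone Bianco,Raffaella Lanzarotti,Paolo Napoletano,Giuseppe Boccignone,Jose Luis Contreras-Vidal
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: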
Abstract:Imagine unlocking the power of the mind to communicate, create, and even interact with the world around us. Recent breakthroughs in Artificial Intelligence (AI), especially in how machines “see” and “understand” language, are now fueling exciting progress in decoding brain signals from scalp electroencephalography (EEG). Prima facie, this opens the door to revolutionary brain-computer interfaces (BCIs) designed for real life, moving beyond traditional uses to envision Brain-to-Speech, Brain-to-Image, and even a Brain-to-Internet of Things (BCIoT). However, the journey is not as straightforward as it was for Computer Vision (CV) and Natural Language Processing (NLP). Applying AI to real-world EEG-based BCIs, particularly in building powerful foundational models, presents unique and intricate hurdles that could affect their reliability. Here, we unfold a guided exploration of this dynamic and rapidly evolving research area. Rather than merely outlining a map of current endeavors and results, the goal is to provide a principled navigation of this hot and cutting-edge research landscape. We consider the basic paradigms that emerge from a causal perspective and the attendant challenges presented to AI-based models. Looking ahead, we then discuss promising research avenues that could overcome today’s technological, methodological, and ethical limitations. Our aim is to lay out a clear roadmap for creating truly practical and effective EEG-based BCI solutions that can thrive in everyday environments.
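[AI-72] Large Language Models are Near-Optimal Decision-Makers with a Non-Human Learning Behavior
[Quick read]: This paper addresses how AI decision-making differs from human decision-making, in particular how Large Language Models (LLMs) perform on core decision-making dimensions such as uncertainty, risk, and set-shifting, and how interpretable their decision processes are. The key to its solution is to benchmark five leading LLMs on three validated experimental psychology tasks against 360 newly recruited human participants, which reveals both the LLMs' advantage in decision-making performance and the fundamental differences between their decision mechanisms and those of humans.
Link: https://arxiv.org/abs/2506.16163
Authors: Hao Li,Gengrui Zhang,Petter Holme,Shuyue Hu,Zhen Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: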
Abstract:Human decision-making belongs to the foundation of our society and civilization, but we are on the verge of a future where much of it will be delegated to artificial intelligence. The arrival of Large Language Models (LLMs) has transformed the nature and scope of AI-supported decision-making; however, the process by which they learn to make decisions, compared to humans, remains poorly understood. In this study, we examined the decision-making behavior of five leading LLMs across three core dimensions of real-world decision-making: uncertainty, risk, and set-shifting. Using three well-established experimental psychology tasks designed to probe these dimensions, we benchmarked LLMs against 360 newly recruited human participants. Across all tasks, LLMs often outperformed humans, approaching near-optimal performance. Moreover, the processes underlying their decisions diverged fundamentally from those of humans. On the one hand, our finding demonstrates the ability of LLMs to manage uncertainty, calibrate risk, and adapt to changes. On the other hand, this disparity highlights the risks of relying on them as substitutes for human judgment, calling for further inquiry.
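[AI-73] Geometric Learning in Black-Box Optimization: A GNN Framework for Algorithm Performance Prediction
[Quick read]: This paper aims to address automated algorithm performance prediction in numerical black-box optimization. Conventional methods usually feed problem characterizations (such as exploratory landscape analysis features) into machine learning models but tend to overlook algorithm configurations, a key factor influencing performance. The key to its solution is to use heterogeneous graph data structures and graph neural networks to capture the complex dependencies between problem characteristics, algorithm configurations, and performance outcomes, and so build a more comprehensive performance prediction model.
Link: https://arxiv.org/abs/2506.16144
Authors: Ana Kostovska,Carola Doerr,Sašo Džeroski,Panče Panov,Tome Eftimov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: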
Abstract:Automated algorithm performance prediction in numerical blackbox optimization often relies on problem characterizations, such as exploratory landscape analysis features. These features are typically used as inputs to machine learning models and are represented in a tabular format. However, such approaches often overlook algorithm configurations, a key factor influencing performance. The relationships between algorithm operators, parameters, problem characteristics, and performance outcomes form a complex structure best represented as a graph. This work explores the use of heterogeneous graph data structures and graph neural networks to predict the performance of optimization algorithms by capturing the complex dependencies between problems, algorithm configurations, and performance outcomes. We focus on two modular frameworks, modCMA-ES and modDE, which decompose two widely used derivative-free optimization algorithms: the covariance matrix adaptation evolution strategy (CMA-ES) and differential evolution (DE). We evaluate 324 modCMA-ES and 576 modDE variants on 24 BBOB problems across six runtime budgets and two problem dimensions. Achieving up to 36.6% improvement in MSE over traditional tabular-based methods, this work highlights the potential of geometric learning in black-box optimization.
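[AI-74] Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching INTERSPEECH2025
[Quick read]: This paper aims to address the poor intelligibility of speech from people with dysarthria by converting dysarthric speech effectively into regular speech. The key to its solution is to replace conventional mel-spectrograms with self-supervised learning (SSL) features and their quantized representations, and to propose a fully non-autoregressive approach that uses Conditional Flow Matching (CFM) with Diffusion Transformers to learn a direct mapping from dysarthric to clean speech, improving intelligibility and speeding up convergence.
Link: https://arxiv.org/abs/2506.16127
Authors: Shoutrik Das,Nishant Singh,Arjun Gangwar,S Umesh
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2025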
Abstract:Dysarthria is a neurological disorder that significantly impairs speech intelligibility, often rendering affected individuals unable to communicate effectively. This necessitates the development of robust dysarthric-to-regular speech conversion techniques. In this work, we investigate the utility and limitations of self-supervised learning (SSL) features and their quantized representations as an alternative to mel-spectrograms for speech generation. Additionally, we explore methods to mitigate speaker variability by generating clean speech in a single-speaker voice using features extracted from WavLM. To this end, we propose a fully non-autoregressive approach that leverages Conditional Flow Matching (CFM) with Diffusion Transformers to learn a direct mapping from dysarthric to clean speech. Our findings highlight the effectiveness of discrete acoustic units in improving intelligibility while achieving faster convergence compared to traditional mel-spectrogram-based approaches.
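[AI-75] GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks
[Quick read]: This paper aims to address the exposure bias problem in the critical fine-tuning step of Generative Recommendation (GR) frameworks. Existing methods rely mainly on the next-token prediction loss of supervised fine-tuning or on recommendation-specific direct preference optimization strategies, neither of which adequately explores potential positive samples, so the model struggles to learn high-quality recommendations that were never observed during training. The key to the proposed solution is to treat GR as a multi-step generation task and to build a GFlowNets-based fine-tuning framework (GFlowGR) that incorporates collaborative knowledge from traditional recommender systems to design an adaptive trajectory sampler and a comprehensive reward model. Combined with the diverse generation property of GFlowNets and with sampling and heuristic weighting techniques, this mitigates exposure bias.
Link: https://arxiv.org/abs/2506.16114
Authors: Yejing Wang,Shengyu Zhou,Jinyu Lu,Qidong Liu,Xinhang Li,Wenlin Zhang,Feng Li,Pengjie Wang,Jian Xu,Bo Zheng,Xiangyu Zhao
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: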
Abstract:Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remarkable success across a wide range of scenarios. The majority of existing research efforts primarily concentrate on developing powerful item tokenizers or advancing LLM decoding strategies to attain superior performance. However, the critical fine-tuning step in GR frameworks, which is essential for adapting LLMs to recommendation data, remains largely unexplored. Current approaches predominantly rely on either the next-token prediction loss of supervised fine-tuning (SFT) or recommendation-specific direct preference optimization (DPO) strategies. Both methods ignore the exploration of possible positive unobserved samples, which is commonly referred to as the exposure bias problem. To mitigate this problem, this paper treats the GR as a multi-step generation task and constructs a GFlowNets-based fine-tuning framework (GFlowGR). The proposed framework integrates collaborative knowledge from traditional recommender systems to create an adaptive trajectory sampler and a comprehensive reward model. Leveraging the diverse generation property of GFlowNets, along with sampling and heuristic weighting techniques, GFlowGR emerges as a promising approach to mitigate the exposure bias problem. Extensive empirical results on two real-world datasets and with two different GR backbones highlight the effectiveness and robustness of GFlowGR.
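[AI-76] A Brain-to-Population Graph Learning Framework for Diagnosing Brain Disorders
[Quick read]: This paper addresses the fact that graph-based methods for diagnosing brain disorders depend on predefined brain atlases yet overlook the rich information embedded in those atlases as well as the confounding effects of site and phenotype variability. The key to its solution is a two-stage Brain-to-Population Graph Learning (B2P-GL) framework that integrates the semantic similarity of brain regions with condition-based population graph modeling. The first stage uses brain atlas knowledge from GPT-4 to enrich the graph representation and refine the brain graph structure; the second stage incorporates phenotypic data to construct a population graph, reducing confounding factors and improving diagnostic performance.
Link: https://arxiv.org/abs/2506.16096
Authors: Qianqian Liao,Wuque Cai,Hongze Sun,Dongze Liu,Duo Chen,Dezhong Yao,Daqing Guo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures, 13 tables; this paper has been submitted for possible publication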
Abstract:Recently developed graph-based methods for diagnosing brain disorders using functional connectivity rely heavily on predefined brain atlases, but overlook the rich information embedded within atlases and the confounding effects of site and phenotype variability. To address these challenges, we propose a two-stage Brain-to-Population Graph Learning (B2P-GL) framework that integrates the semantic similarity of brain regions and condition-based population graph modeling. In the first stage, termed brain representation learning, we leverage brain atlas knowledge from GPT-4 to enrich the graph representation and refine the brain graph through an adaptive node reassignment graph attention network. In the second stage, termed population disorder diagnosis, phenotypic data is incorporated into population graph construction and feature fusion to mitigate confounding effects and enhance diagnosis performance. Experiments on the ABIDE I, ADHD-200, and Rest-meta-MDD datasets show that B2P-GL outperforms state-of-the-art methods in prediction accuracy while enhancing interpretability. Overall, our proposed framework offers a reliable and personalized approach to brain disorder diagnosis, advancing clinical applicability.
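[AI-77] Consistency Verification in Ontology-Based Process Models with Parameter Interdependencies
[Quick read]: This paper aims to address the consistency of data retrieval and interpretation when modeling parameter interdependencies in manufacturing processes, as well as the challenge of ensuring that mathematical expressions are evaluated correctly when applied across contexts. The key to its solution is an ontology-based process model that integrates standardized process semantics, data element definitions, and formal mathematical constructs, supported by three kinds of verification mechanisms: SPARQL-based filtering to retrieve relevant process data, a unit consistency check based on expected-unit annotations and semantic classification, and a data completeness check to validate that interdependencies can be evaluated.
Link: https://arxiv.org/abs/2506.16087
Authors: Tom Jeleniewski,Hamied Nabizada,Jonathan Reif,Felix Gehlhoff,Alexander Fay
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: This paper is accepted at IEEE ETFA 2025 and will be published in the conference proceedings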
Abstract:The formalization of process knowledge using ontologies enables consistent modeling of parameter interdependencies in manufacturing. These interdependencies are typically represented as mathematical expressions that define relations between process parameters, supporting tasks such as calculation, validation, and simulation. To support cross-context application and knowledge reuse, such expressions are often defined in a generic form and applied across multiple process contexts. This highlights the necessity of a consistent and semantically coherent model to ensure the correctness of data retrieval and interpretation. Consequently, dedicated mechanisms are required to address key challenges such as selecting context-relevant data, ensuring unit compatibility between variables and data elements, and verifying the completeness of input data required for evaluating mathematical expressions. This paper presents a set of verification mechanisms for a previously developed ontology-based process model that integrates standardized process semantics, data element definitions, and formal mathematical constructs. The approach includes (i) SPARQL-based filtering to retrieve process-relevant data, (ii) a unit consistency check based on expected-unit annotations and semantic classification, and (iii) a data completeness check to validate the evaluability of interdependencies. The applicability of the approach is demonstrated with a use case from Resin Transfer Molding (RTM), supporting the development of machine-interpretable and verifiable engineering models.
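The three verification steps can be mimicked on plain Python objects to show what each one checks. This is only a toy analogue under stated assumptions: a list filter stands in for the SPARQL query, units are compared as plain strings rather than through semantic classification, and the `DataElement` class, the function names, and the RTM-style variable names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataElement:
    """A measured or configured value attached to one process step."""
    name: str
    unit: str
    process: str
    value: Optional[float] = None


def select_for_process(elements, process):
    """(i) Context filter: keep only data bound to the given process.

    Stands in for the SPARQL-based filtering described in the abstract.
    """
    return [e for e in elements if e.process == process]


def unit_mismatches(expected_units, elements):
    """(ii) Unit check: variables whose data element has the wrong unit."""
    by_name = {e.name: e for e in elements}
    return sorted(v for v, unit in expected_units.items()
                  if v in by_name and by_name[v].unit != unit)


def missing_inputs(expected_units, elements):
    """(iii) Completeness check: variables with no element or no value."""
    by_name = {e.name: e for e in elements}
    return sorted(v for v in expected_units
                  if v not in by_name or by_name[v].value is None)
```

An expression is considered evaluable for a process only when both `unit_mismatches` and `missing_inputs` return empty lists for the data selected by `select_for_process`.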
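[AI-78] CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations
[Quick read]: This paper addresses the difficulty of extracting deep features from EEG data and effectively integrating information from multiple views, with the goal of a generalizable pre-training framework for EEG representation learning. Existing pre-training methods usually rely on the contextual semantics of a single view and fail to capture the complex synergistic interactions among views, which limits the expressiveness and generalization of the learned representations. The key to its solution is the CRIA framework, which uses variable-length and variable-channel coding to obtain a unified representation of EEG data across datasets, fuses temporal, spectral, and spatial features with a cross-attention mechanism, and combines an attention matrix masking strategy based on the information bottleneck principle with a novel viewpoint masking pre-training scheme to strengthen generalization.
Link: https://arxiv.org/abs/2506.16056
Authors: Puchun Liu,C. L. Philip Chen,Yubin He,Tong Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: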
Abstract:The difficulty of extracting deep features from EEG data and effectively integrating information from multiple views presents significant challenges for developing a generalizable pretraining framework for EEG representation learning. However, most existing pre-training methods rely solely on the contextual semantics of a single view, failing to capture the complex and synergistic interactions among different perspectives, limiting the expressiveness and generalization of learned representations. To address these issues, this paper proposes CRIA, an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. In this work, we define cross-view information as the integrated representation that emerges from the interaction among temporal, spectral, and spatial views of EEG signals. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively, and combines an attention matrix masking strategy based on the information bottleneck principle with a novel viewpoint masking pre-training scheme. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods with the same pre-training conditions, achieving a balanced accuracy of 57.02% for multi-class event classification and 80.03% for anomaly detection, highlighting its strong generalization ability.
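[AI-79] OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
[Quick Read]: This paper addresses the extremely high end-to-end latency of generative AI on computer-use tasks, which makes such systems unusable in practice. The key contribution is an analysis showing that large model calls for planning and reflection account for most of the overall latency, together with the OSWorld-Human dataset, which provides a human-determined trajectory for each task so that agent efficiency can be evaluated, offering guidance for the future development of computer-use agents.
Link: https://arxiv.org/abs/2506.16042
Authors: Reyna Abhyankar, Qi Qi, Yiying Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Operating Systems (cs.OS)
Comments: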
Abstract:Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning and reflection account for the majority of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the highest-scoring agents on OSWorld take 1.4-2.7x more steps than necessary.
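[AI-80] Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
[Quick Read]: This paper aims to address the limitations of traditional text-based document chunking when handling complex document structures, multi-page tables, embedded figures, and contextual dependencies across pages. The key to the solution is a novel multimodal document chunking method that uses Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Configurable page batches and cross-batch context preservation improve the handling of tables spanning multiple pages, embedded visual elements, and procedural content.
Link: https://arxiv.org/abs/2506.16035
Authors: Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 11 pages, 1 Figure, 1 Table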
Abstract:Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach achieves better accuracy compared to traditional vanilla RAG systems, with qualitative analysis showing superior preservation of document structure and semantic coherence.
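[AI-81] Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations
[Quick Read]: This paper aims to address the degradation of policy performance caused by hard constraints in reinforcement learning (RL), in particular how to optimize a policy while meeting distinct reward and penalty thresholds or multiple reward thresholds. The key to the solution is to combine Hamilton-Jacobi (HJ) equations with RL and propose two new value functions for the "Reach-Always-Avoid" and "Reach-Reach" problems. By decomposing each problem into reach, avoid, and reach-avoid subproblems, the method derives explicit, tractable Bellman forms, avoiding the automaton representations typical of temporal logic approaches and offering a new perspective on constrained decision-making. Based on this analysis, the authors propose the DO-HJ-PPO algorithm, which outperforms existing baselines on safe-arrival and multi-target achievement tasks.
Link: https://arxiv.org/abs/2506.16016
Authors: William Sharpless, Dylan Hirsch, Sander Tonkens, Nikhil Shinde, Sylvia Herbert
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: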
Abstract:Hard constraints in reinforcement learning (RL), whether imposed via the reward function or the model architecture, often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but often require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: (1) the Reach-Always-Avoid problem - of achieving distinct reward and penalty thresholds - and (2) the Reach-Reach problem - of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context by decomposing our problem into reach, avoid, and reach-avoid problems, so as to leverage the aforementioned recent advances. From a mathematical perspective, the Reach-Always-Avoid and Reach-Reach problems are complementary and fundamentally different from standard sum-of-rewards problems and temporal logic problems, providing a new perspective on constrained decision-making. We leverage our analysis to propose a variation of Proximal Policy Optimization (DO-HJ-PPO), which solves these problems. Across a range of tasks for safe-arrival and multi-target achievement, we demonstrate that DO-HJ-PPO produces qualitatively distinct behaviors from previous approaches and out-competes a number of baselines in various metrics.
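[AI-82] VRAIL: Vectorized Reward-based Attribution for Interpretable Learning
[Quick Read]: This paper addresses the difficulty of reward function design and the lack of model interpretability in reinforcement learning (RL). The key to the solution is VRAIL (Vectorized Reward-based Attribution for Interpretable Learning), a bi-level framework that learns interpretable weight representations from state features. It combines a deep learning stage that estimates a value function with an RL stage that shapes learning through potential-based reward transformations, which allows importance to be attributed to individual features and their interactions and improves training stability and interpretability.
Link: https://arxiv.org/abs/2506.16014
Authors: Jina Kim, Youjin Jang, Jeongjin Han
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: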
Abstract:We propose VRAIL (Vectorized Reward-based Attribution for Interpretable Learning), a bi-level framework for value-based reinforcement learning (RL) that learns interpretable weight representations from state features. VRAIL consists of two stages: a deep learning (DL) stage that fits an estimated value function using state features, and an RL stage that uses this to shape learning via potential-based reward transformations. The estimator is modeled in either linear or quadratic form, allowing attribution of importance to individual features and their interactions. Empirical results on the Taxi-v3 environment demonstrate that VRAIL improves training stability and convergence compared to standard DQN, without requiring environment modifications. Further analysis shows that VRAIL uncovers semantically meaningful subgoals, such as passenger possession, highlighting its ability to produce human-interpretable behavior. Our findings suggest that VRAIL serves as a general, model-agnostic framework for reward shaping that enhances both learning and interpretability.
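The potential-based reward transformation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard shaping form r' = r + gamma * Phi(s') - Phi(s) and a linear potential Phi(s) = w . phi(s); the function names, the feature vectors, and the weights below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def linear_potential(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear potential Phi(s) = w . phi(s); each weight scores one state feature."""
    return sum(w * f for w, f in zip(weights, features))


def shaped_reward(
    reward: float,
    weights: Sequence[float],
    features: Sequence[float],
    next_features: Sequence[float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s).

    The potential of a terminal next state is taken to be zero (an assumption
    commonly made so that shaping does not change the optimal policy).
    """
    phi_s = linear_potential(weights, features)
    phi_next = 0.0 if done else linear_potential(weights, next_features)
    return reward + gamma * phi_next - phi_s
```

With a linear potential, the learned weights can be read directly as per-feature importance, which is the kind of attribution the abstract describes; a quadratic form would add pairwise interaction terms.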
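[AI-83] AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction
[Quick Read]: This paper aims to achieve three competing objectives in time series forecasting simultaneously: strict temporal causality, sub-quadratic complexity for practical scalability, and multi-scale pattern recognition for accurate long-horizon forecasting. The key to the solution lies in three innovations: 1) hierarchical temporal modeling, which decomposes predictions into segment-level blocks processed in parallel followed by intra-segment sequential refinement, balancing temporal coherence and computational efficiency; 2) dynamic windowed attention, which uses learnable causal windows with exponential decay to reduce complexity while preserving precise temporal relationships; 3) adaptive temporal encoding, which combines fixed oscillating patterns with learnable decay rates to capture temporal patterns at multiple scales. Together these innovations improve the model's accuracy and efficiency.
Link: https://arxiv.org/abs/2506.16001
Authors: Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages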
Abstract:Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: A novel position encoding system is adopted to capture time patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and 6.06X memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These breakthroughs establish new benchmarks for efficient and precise time series modeling. Implementations of our method and all baselines in the hierarchical autoregressive mechanism are available at this https URL.
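To make the idea of causal windowed attention with exponential decay concrete, here is a small sketch in plain Python. It is not the AutoHFormer implementation: the window size and decay rate are fixed arguments here (the paper describes them as learnable), and the function name is hypothetical. Each query position t attends only to the last `window` positions up to and including t, and the raw score for key s is reduced by `decay * (t - s)` before the softmax, which multiplies its weight by exp(-decay * (t - s)).

```python
import math
from typing import List


def causal_window_weights(
    scores: List[List[float]], window: int, decay: float
) -> List[List[float]]:
    """Row-normalised attention weights under a causal sliding window.

    scores[t][s] is the raw similarity between query t and key s.
    Keys outside [t - window + 1, t] get weight 0, so no future position
    is ever used and the cost is O(T * window) rather than O(T^2).
    """
    length = len(scores)
    weights = []
    for t in range(length):
        lo = max(0, t - window + 1)
        # Subtracting decay * distance in logit space is an exponential decay
        # of the attention weight with temporal distance.
        logits = [scores[t][s] - decay * (t - s) for s in range(lo, t + 1)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
        total = sum(exps)
        row = [0.0] * length
        for offset, s in enumerate(range(lo, t + 1)):
            row[s] = exps[offset] / total
        weights.append(row)
    return weights
```

Because the window is bounded, memory and time grow linearly with sequence length for a fixed window, which is one simple way to obtain the sub-quadratic behaviour the abstract asks for.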
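[AI-84] Quantum Artificial Intelligence for Secure Autonomous Vehicle Navigation: An Architectural Proposal
[Quick Read]: This paper aims to address key problems in autonomous vehicle navigation, including multimodal sensor data fusion, navigation policy optimization in dynamic and complex environments, and the security of vehicle communication. The key to the solution is the use of quantum artificial intelligence: quantum neural networks based on quantum amplitude encoding for multimodal sensor fusion, a Nav-Q module that optimizes navigation policies with variational quantum circuits, and post-quantum cryptographic protocols that secure in-vehicle and V2X (Vehicle to Everything) communication, with the goal of improving the performance and security of autonomous driving systems.
Link: https://arxiv.org/abs/2506.16000
Authors: Hemanth Kannamarlapudi, Sowmya Chintalapudi
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Robotics (cs.RO); Quantum Physics (quant-ph)
Comments: 5 pages, 2 figures, 17 references. Architectural proposal for quantum AI integration in autonomous vehicle navigation systems for secured navigation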
Abstract:Navigation is a crucial aspect of the autonomous vehicle ecosystem, which relies heavily on collecting and processing large amounts of data in various states and making a confident and safe decision to define the next vehicle maneuver. In this paper, we propose a novel architecture based on Quantum Artificial Intelligence that enables quantum and AI techniques at various levels of the navigation decision-making and communication process in autonomous vehicles: Quantum Neural Networks for multimodal sensor fusion, Nav-Q for quantum reinforcement learning for navigation policy optimization, and post-quantum cryptographic protocols for secure communication. The quantum neural networks use quantum amplitude encoding to fuse data from various sensors such as LiDAR, radar, camera, GPS, and weather. This approach gives a unified quantum state representation across heterogeneous sensor modalities. The Nav-Q module processes the fused quantum states through variational quantum circuits to learn optimal navigation policies under rapidly changing and complex conditions. Finally, post-quantum cryptographic protocols secure the communication channels for both in-vehicle and V2X (Vehicle to Everything) communication, protecting autonomous vehicle communication from both classical and quantum security threats. The proposed framework thus addresses fundamental challenges in autonomous vehicle navigation by providing quantum performance and future-proof security. Index Terms: Quantum Computing, Autonomous Vehicles, Sensor Fusion
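[AI-85] TrainVerify: Equivalence-Based Verification for Distributed LLM Training
[Quick Read]: This paper addresses silent errors in the distributed training of large language models (LLMs), which arise from the lack of verification and can waste enormous computational resources. The key to the solution is TrainVerify, a system that formally verifies that a distributed parallel execution plan is mathematically equivalent to the logical specification of the original model, ensuring the correctness of the training process. To cope with the verification complexity caused by the scale of LLMs, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduce complexity while preserving formal correctness.
Link: https://arxiv.org/abs/2506.15961
Authors: Yunchi Lu, Youshan Miao, Cheng Tan, Peng Huang, Yi Zhu, Xian Zhang, Fan Yang
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: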
Abstract:Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet, these costly distributed trainings are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model’s logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs which often involves billions of variables and highly intricate computation graphs. Therefore, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduces complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including the successful verification of the Llama3 (405B) and DeepSeek-V3 (671B) training plans.
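As a toy illustration of what "a parallel execution plan is equivalent to the logical specification" means, the sketch below checks two common ways of sharding a matrix multiplication against the unsharded result. This is only a numerical spot check on small integer matrices, not the formal verification that TrainVerify performs, and the function names and the assumption that dimensions divide evenly by the number of shards are mine.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference ("logical") computation: plain matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_parallel(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard holds a block of W's columns; shard outputs are concatenated."""
    step = len(w[0]) // parts
    shards = [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]
    outputs = [matmul(x, shard) for shard in shards]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def row_parallel(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard holds a block of W's rows and the matching columns of X;
    partial products are summed, which stands in for an all-reduce."""
    step = len(w) // parts
    partials = []
    for p in range(parts):
        x_block = [row[p * step:(p + 1) * step] for row in x]
        w_block = w[p * step:(p + 1) * step]
        partials.append(matmul(x_block, w_block))
    return [
        [sum(part[i][j] for part in partials) for j in range(len(w[0]))]
        for i in range(len(x))
    ]
```

A check like this only covers the inputs it is run on; the point of a formal approach is to establish the equivalence for all inputs, and at shapes far beyond what can be tested exhaustively.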
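[AI-86] PNCS: Power-Norm Cosine Similarity for Diverse Client Selection in Federated Learning
[Quick Read]: This paper aims to address the fact that, in federated learning (FL), existing methods fail to capture the gradient correlations between clients under data heterogeneity, which degrades model aggregation. The key to the solution is a new framework based on Power-Norm Cosine Similarity (PNCS), which captures higher-order gradient moments to handle non-independent and identically distributed (non-IID) data, improving convergence speed and accuracy. The paper also introduces a simple algorithm that ensures diverse client selection through a selection history queue.
Link: https://arxiv.org/abs/2506.15923
Authors: Liangyan Li, Yangyi Liu, Yimo Ning, Stefano Rini, Jun Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: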
Abstract:Federated Learning (FL) has emerged as a powerful paradigm for leveraging diverse datasets from multiple sources while preserving data privacy by avoiding centralized storage. However, many existing approaches fail to account for the intricate gradient correlations between remote clients, a limitation that becomes especially problematic in data heterogeneity scenarios. In this work, we propose a novel FL framework utilizing Power-Norm Cosine Similarity (PNCS) to improve client selection for model aggregation. By capturing higher-order gradient moments, PNCS addresses non-IID data challenges, enhancing convergence speed and accuracy. Additionally, we introduce a simple algorithm ensuring diverse client selection through a selection history queue. Experiments with a VGG16 model across varied data partitions demonstrate consistent improvements over state-of-the-art methods.
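[AI-87] KG-FGNN: Knowledge-guided GNN Foundation Model for Fertilisation-oriented Soil GHG Flux Prediction
[Quick Read]: This paper addresses the scarcity of agricultural data, which obstructs the application of machine learning to precision soil greenhouse gas (GHG) flux prediction. The key to the solution is a knowledge-guided graph neural network framework that integrates prior knowledge from an agricultural process-based model with graph neural network techniques: the process-based model generates multi-dimensional agricultural datasets, key agricultural features are extracted from them, and the graph neural network captures the correlations among those features, enabling accurate prediction of fertilisation-oriented soil GHG fluxes.
Link: https://arxiv.org/abs/2506.15896
Authors: Yu Zhang, Gaoshan Bi, Simon Jeffery, Max Davis, Yang Li, Qing Xue, Po Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures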
Abstract:Precision soil greenhouse gas (GHG) flux prediction is essential in agricultural systems for assessing environmental impacts, developing emission mitigation strategies and promoting sustainable agriculture. Due to the lack of advanced sensor and network technologies on the majority of farms, there are challenges in obtaining comprehensive and diverse agricultural data. As a result, the scarcity of agricultural data seriously obstructs the application of machine learning approaches in precision soil GHG flux prediction. This research proposes a knowledge-guided graph neural network framework that addresses the above challenges by integrating knowledge embedded in an agricultural process-based model and graph neural network techniques. Specifically, we utilise the agricultural process-based model to simulate and generate multi-dimensional agricultural datasets for 47 countries that cover a wide range of agricultural variables. To extract key agricultural features and integrate correlations among agricultural features in the prediction process, we propose a machine learning framework that integrates the autoencoder and multi-target multi-graph based graph neural networks, which utilises the autoencoder to selectively extract significant agricultural features from the agricultural process-based model simulation data and the graph neural network to integrate correlations among agricultural features to accurately predict fertilisation-oriented soil GHG fluxes. Comprehensive experiments were conducted with both the agricultural simulation dataset and real-world agricultural dataset to evaluate the proposed approach in comparison with well-known baseline and state-of-the-art regression methods. The results demonstrate that our proposed approach provides superior accuracy and stability in fertilisation-oriented soil GHG prediction.
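[AI-88] Deep Reinforcement Learning Xiangqi Player with Monte Carlo Tree Search
[Quick Read]: This paper addresses the AI challenges posed by the complexity of Xiangqi (Chinese chess), a culturally significant strategy game, including its unique board layout, piece movement constraints, and victory conditions. The key to the solution is to combine Deep Reinforcement Learning (DRL) with Monte Carlo Tree Search (MCTS), using policy-value networks to simulate move consequences and refine decision-making, thereby overcoming difficulties such as Xiangqi's high branching factor and asymmetrical piece dynamics.
Link: https://arxiv.org/abs/2506.15880
Authors: Berk Yilmaz, Junyu Hu, Jinsong Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: All authors contributed equally to this work. 24 pages, 10 figures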
Abstract:This paper presents a Deep Reinforcement Learning (DRL) system for Xiangqi (Chinese Chess) that integrates neural networks with Monte Carlo Tree Search (MCTS) to enable strategic self-play and self-improvement. Addressing the underexplored complexity of Xiangqi, including its unique board layout, piece movement constraints, and victory conditions, our approach combines policy-value networks with MCTS to simulate move consequences and refine decision-making. By overcoming challenges such as Xiangqi’s high branching factor and asymmetrical piece dynamics, our work advances AI capabilities in culturally significant strategy games while providing insights for adapting DRL-MCTS frameworks to domain-specific rule systems.
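[AI-89] Uncertainty Estimation by Human Perception versus Neural Models
[Quick Read]: This paper addresses the inaccurate predictive uncertainty estimates of modern neural networks (NNs), which often produce overconfident predictions even when they are wrong. The key to the solution is to compare human-perceived uncertainty with model-predicted uncertainty and to incorporate human-provided soft labels to improve model calibration, increasing reliability without sacrificing accuracy.
Link: https://arxiv.org/abs/2506.15850
Authors: Pedro Mendes, Paolo Romano, David Garlan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: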
Abstract:Modern neural networks (NNs) often achieve high predictive accuracy but remain poorly calibrated, producing overconfident predictions even when wrong. This miscalibration poses serious challenges in applications where reliable uncertainty estimates are critical. In this work, we investigate how human perceptual uncertainty compares to uncertainty estimated by NNs. Using three vision benchmarks annotated with both human disagreement and crowdsourced confidence, we assess the correlation between model-predicted uncertainty and human-perceived uncertainty. Our results show that current methods only weakly align with human intuition, with correlations varying significantly across tasks and uncertainty metrics. Notably, we find that incorporating human-derived soft labels into the training process can improve calibration without compromising accuracy. These findings reveal a persistent gap between model and human uncertainty and highlight the potential of leveraging human insights to guide the development of more trustworthy AI systems.
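Miscalibration of the kind described above is often quantified with the expected calibration error (ECE). The abstract does not say which calibration metrics the authors use, so the sketch below is only an assumption about one common choice: predictions are grouped into equal-width confidence bins, and the gap between average confidence and accuracy is averaged over bins, weighted by bin size.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over [0, 1].

    confidences[i] is the model's probability for its predicted class and
    correct[i] says whether that prediction was right.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

An overconfident model that reports 0.95 confidence but is right only half the time contributes a gap of 0.45 in that bin, while a well-calibrated model has gaps near zero everywhere.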
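[AI-90] SafeMimic: Towards Safe and Autonomous Human-to-Robot Imitation for Mobile Manipulation
[Quick Read]: This paper aims to address how a robot can safely and autonomously learn new mobile manipulation skills by watching a single human video demonstration. The core challenge is to extract the task goal and execution strategy from a third-person video, translate them to a first-person perspective, and adapt them to the robot's own morphology for safe execution. The key to the solution is the SafeMimic framework, which parses the video into semantic and motion segments and translates them to an egocentric reference. It then samples candidate actions and verifies them in a receding-horizon fashion with an ensemble of safety Q-functions trained in simulation; when forward progress is not possible, it backtracks to previous states and adjusts the trajectory and grasping modes, enabling efficient and safe task learning.
Link: https://arxiv.org/abs/2506.15847
Authors: Arpit Bahety, Arnav Balaji, Ben Abbatematteo, Roberto Martín-Martín
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: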
Abstract:For robots to become efficient helpers in the home, they must learn to perform new mobile manipulation tasks simply by watching humans perform them. Learning from a single video demonstration from a human is challenging as the robot needs to first extract from the demo what needs to be done and how, translate the strategy from a third to a first-person perspective, and then adapt it to be successful with its own morphology. Furthermore, to mitigate the dependency on costly human monitoring, this learning process should be performed in a safe and autonomous manner. We present SafeMimic, a framework to learn new mobile manipulation skills safely and autonomously from a single third-person human video. Given an initial human video demonstration of a multi-step mobile manipulation task, SafeMimic first parses the video into segments, inferring both the semantic changes caused and the motions the human executed to achieve them and translating them to an egocentric reference. Then, it adapts the behavior to the robot’s own morphology by sampling candidate actions around the human ones, and verifying them for safety before execution in a receding horizon fashion using an ensemble of safety Q-functions trained in simulation. When safe forward progression is not possible, SafeMimic backtracks to previous states and attempts a different sequence of actions, adapting both the trajectory and the grasping modes when required for its morphology. As a result, SafeMimic yields a strategy that succeeds in the demonstrated behavior and learns task-specific actions that reduce exploration in future attempts. Our experiments show that our method allows robots to safely and efficiently learn multi-step mobile manipulation behaviors from a single human demonstration, from different users, and in different environments, with improvements over state-of-the-art baselines across seven tasks.
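[AI-91] Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning
[Quick Read]: This paper addresses planning failures of classical AI and robotics planning methods in real scenarios, caused by limited robot perception and the difficulty of grounding perceptions to planning predicates, as well as the shortcomings of Large Language Models (LLMs) in generating feasible and safe plans. The key to the solution is to integrate classical planning with LLMs, leveraging their ability to extract commonsense knowledge and ground actions, and to define functionally equivalent goals through a hierarchical formulation that gradually relaxes planning constraints. This makes unfeasible tasks tractable and supports partial achievement of the intended objective in the agent's specific context.
Link: https://arxiv.org/abs/2506.15828
Authors: Emanuele Musumeci, Michele Brienza, Francesco Argenziano, Vincenzo Suriani, Daniele Nardi, Domenico D. Bloisi
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: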
Abstract:Classical planning in AI and Robotics addresses complex tasks by shifting from imperative to declarative approaches (e.g., PDDL). However, these methods often fail in real scenarios due to limited robot perception and the need to ground perceptions to planning predicates. This often results in heavily hard-coded behaviors that struggle to adapt, even with scenarios where goals can be achieved through relaxed planning. Meanwhile, Large Language Models (LLMs) lead to planning systems that leverage commonsense reasoning but often at the cost of generating unfeasible and/or unsafe plans. To address these limitations, we present an approach integrating classical planning with LLMs, leveraging their ability to extract commonsense knowledge and ground actions. We propose a hierarchical formulation that enables robots to make unfeasible tasks tractable by defining functionally equivalent goals through gradual relaxation. This mechanism supports partial achievement of the intended objective, suited to the agent’s specific context. Our method demonstrates its ability to adapt and execute tasks effectively within environments modeled using 3D Scene Graphs through comprehensive qualitative and quantitative evaluations. We also show how this method succeeds in complex scenarios where other benchmark methods are more likely to fail. Code, dataset, and additional material are released to the community.
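[AI-92] Linearithmic Clean-up for Vector-Symbolic Key-Value Memory with Kroneker Rotation Products
[Quick Read]: This paper addresses a computational bottleneck in current Vector-Symbolic Architectures (VSAs): the high computational complexity of the "clean-up" step. Clean-up decodes the noisy vectors retrieved from the architecture and typically compares them against a "codebook" of prototype vectors, incurring quadratic or similar complexity. The key to the solution is a new codebook representation based on Kronecker products of rotation-like matrices, which reduces clean-up time complexity to linearithmic, O(N log N), with O(N) space complexity. The codebook does not need to be stored explicitly and requires only O(log N) space, while memory capacity remains comparable to standard approaches.
Link: https://arxiv.org/abs/2506.15793
Authors: Ruipeng Liu, Qinru Qiu, Simon Khan, Garrett E. Katz
Affiliation: Unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures, conference paper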
Abstract:A computational bottleneck in current Vector-Symbolic Architectures (VSAs) is the "clean-up" step, which decodes the noisy vectors retrieved from the architecture. Clean-up typically compares noisy vectors against a "codebook" of prototype vectors, incurring computational complexity that is quadratic or similar. We present a new codebook representation that supports efficient clean-up, based on Kronecker products of rotation-like matrices. The resulting clean-up time complexity is linearithmic, i.e. $\mathcal{O}(N\,\text{log}\,N)$, where $N$ is the vector dimension and also the number of vectors in the codebook. Clean-up space complexity is $\mathcal{O}(N)$. Furthermore, the codebook is not stored explicitly in computer memory: It can be represented in $\mathcal{O}(\text{log}\,N)$ space, and individual vectors in the codebook can be materialized in $\mathcal{O}(N)$ time and space. At the same time, asymptotic memory capacity remains comparable to standard approaches. Computer experiments confirm these results, demonstrating several orders of magnitude more scalability than baseline VSA techniques.
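The complexity claims can be made tangible with a small sketch of a Kronecker-structured codebook. The construction below is an assumption for illustration, not the authors' exact design: the codebook is the Kronecker product of k ordinary 2x2 rotation matrices, so N = 2^k and its rows are orthonormal prototypes. The similarities between a noisy vector and all N prototypes are then computed with k passes over the vector, and any single prototype can be materialised in O(N) time from the k factors alone.

```python
import math
from typing import List


def rotation(theta: float) -> List[List[float]]:
    """2x2 rotation matrix as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def kron_matvec(factors: List[List[List[float]]], x: List[float]) -> List[float]:
    """Compute (A_1 kron ... kron A_k) @ x for 2x2 factors without building the N x N matrix.

    Each factor is applied along one binary axis of x, so the total cost is
    k passes over N = 2**k entries, i.e. O(N log N).
    """
    k, n = len(factors), len(x)
    assert n == 1 << k, "len(x) must equal 2**len(factors)"
    y = list(x)
    for i, a in enumerate(factors):
        stride = 1 << (k - 1 - i)  # the first factor acts on the most significant bit
        for lo in range(n):
            if lo & stride:
                continue  # each (lo, hi) pair is processed once, from its lower index
            hi = lo + stride
            u, v = y[lo], y[hi]
            y[lo] = a[0][0] * u + a[0][1] * v
            y[hi] = a[1][0] * u + a[1][1] * v
    return y


def prototype(factors: List[List[List[float]]], m: int) -> List[float]:
    """Materialise row m of the implicit codebook in O(N) time and space."""
    k = len(factors)
    vec = [1.0]
    for i, a in enumerate(factors):
        bit = (m >> (k - 1 - i)) & 1
        vec = [v * r for v in vec for r in a[bit]]
    return vec


def clean_up(factors: List[List[List[float]]], noisy: List[float]) -> int:
    """Index of the codebook row most similar to the noisy vector."""
    sims = kron_matvec(factors, noisy)  # sims[m] = <row m, noisy>
    return max(range(len(sims)), key=sims.__getitem__)
```

Only the k rotation angles need to be stored, which is where the O(log N) codebook representation comes from in this toy version; a dense codebook would instead need N^2 numbers and an O(N^2) comparison.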
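[AI-93] Graphics4Science: Computer Graphics for Scientific Impacts
[Quick Read]: This paper addresses the communication and methodological gap between computer graphics and the sciences, aiming to promote knowledge exchange and collaboration between the two fields by reframing graphics as a modeling language for science. The key to the solution is to use core graphics methods, such as geometric reasoning and physical modeling, as inductive biases that help address scientific challenges in data-scarce settings, and to foster interdisciplinary collaboration by bridging vocabulary gaps between the two communities.
Link: https://arxiv.org/abs/2506.15786
Authors: Peter Yichen Chen, Minghao Guo, Hanspeter Pfister, Ming Lin, William Freeman, Qixing Huang, Han-Wei Shen, Wojciech Matusik
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Optics (physics.optics)
Comments: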
Abstract:Computer graphics, often associated with films, games, and visual effects, has long been a powerful tool for addressing scientific challenges–from its origins in 3D visualization for medical imaging to its role in modern computational modeling and simulation. This course explores the deep and evolving relationship between computer graphics and science, highlighting past achievements, ongoing contributions, and open questions that remain. We show how core methods, such as geometric reasoning and physical modeling, provide inductive biases that help address challenges in both fields, especially in data-scarce settings. To that end, we aim to reframe graphics as a modeling language for science by bridging vocabulary gaps between the two communities. Designed for both newcomers and experts, Graphics4Science invites the graphics community to engage with science, tackle high-impact problems where graphics expertise can make a difference, and contribute to the future of scientific discovery. Additional details are available on the course website: this https URL
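[AI-94] Advancing Stochastic 3-SAT Solvers by Dissipating Oversatisfied Constraints
[Quick Read]: This paper aims to solve the NP-complete 3-SAT problem, focusing on solving efficiency for critically hard instances. On these instances, existing solvers such as WalkSAT tend to get stuck in local minima that have more oversatisfied combinatorial constraints than true solutions. The key to the solution is DOCSAT, a stochastic local search heuristic that dissipates oversatisfied constraints (DOC), that is, reduces their unfavorable abundance so that they become critical, thereby avoiding or escaping local minima traps.
Link: https://arxiv.org/abs/2506.15774
Authors: J. Schwardt, J. C. Budich
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)
Comments: 5+1 pages, 6+2 figures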
Abstract:We introduce and benchmark a stochastic local search heuristic for the NP-complete satisfiability problem 3-SAT that drastically outperforms existing solvers in the notoriously difficult realm of critically hard instances. Our construction is based on the crucial observation that well established previous approaches such as WalkSAT are prone to get stuck in local minima that are distinguished from true solutions by a larger number of oversatisfied combinatorial constraints. To address this issue, the proposed algorithm, coined DOCSAT, dissipates oversatisfied constraints (DOC), i.e. reduces their unfavorable abundance so as to render them critical. We analyze and benchmark our algorithm on a randomly generated sample of hard but satisfiable 3-SAT instances with varying problem sizes up to N=15000. Quite remarkably, we find that DOCSAT outperforms both WalkSAT and other well known algorithms including the complete solver Kissat, even when comparing its ability to solve the hardest quintile of the sample to the average performance of its competitors. The essence of DOCSAT may be seen as a way of harnessing statistical structure beyond the primary cost function of a combinatorial problem to avoid or escape local minima traps in stochastic local search, which opens avenues for generalization to other optimization problems.
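The statistic that DOCSAT monitors can be sketched as a simple clause profile. The classification below is my reading of the abstract and should be treated as an assumption: a clause with no true literal is unsatisfied, a clause with exactly one true literal is critical (flipping that variable would break it), and a clause with two or more true literals is oversatisfied. Literals use the DIMACS-style convention of signed integers; the function name is hypothetical and the DOCSAT update rule itself is not reproduced here.

```python
from typing import Dict, Iterable, Tuple


def clause_profile(
    clauses: Iterable[Tuple[int, ...]], assignment: Dict[int, bool]
) -> Dict[str, int]:
    """Count unsatisfied, critical, and oversatisfied clauses.

    A literal k > 0 is true when variable k is True; a literal -k is true
    when variable k is False.
    """
    counts = {"unsatisfied": 0, "critical": 0, "oversatisfied": 0}
    for clause in clauses:
        true_literals = sum(
            1 for literal in clause if assignment[abs(literal)] == (literal > 0)
        )
        if true_literals == 0:
            counts["unsatisfied"] += 1
        elif true_literals == 1:
            counts["critical"] += 1
        else:
            counts["oversatisfied"] += 1
    return counts
```

A plain local search only looks at the "unsatisfied" count as its cost function. The observation in the abstract is that the "oversatisfied" count carries extra statistical information that distinguishes local minima from true solutions.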
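[AI-95] Linear-Time Primitives for Algorithm Development in Graphical Causal Inference
[Quick Read]: This paper addresses the efficiency of algorithmic primitives in graphical causal inference, in particular the computational inefficiency of traditional primitives such as moralization and latent projection. The key to the solution is the CIfly framework, which treats reachability as a reusable core operation, specifies algorithms through a rule table schema, and proves that they run in linear time, providing a more efficient alternative.
Link: https://arxiv.org/abs/2506.15758
Authors: Marcel Wienöbst, Sebastian Weichwald, Leonard Henckel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: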
Abstract:We introduce CIfly, a framework for efficient algorithmic primitives in graphical causal inference that isolates reachability as a reusable core operation. It builds on the insight that many causal reasoning tasks can be reduced to reachability in purpose-built state-space graphs that can be constructed on the fly during traversal. We formalize a rule table schema for specifying such algorithms and prove they run in linear time. We establish CIfly as a more efficient alternative to the common primitives moralization and latent projection, which we show are computationally equivalent to Boolean matrix multiplication. Our open-source Rust implementation parses rule table text files and runs the specified CIfly algorithms providing high-performance execution accessible from Python and R. We demonstrate CIfly’s utility by re-implementing a range of established causal inference tasks within the framework and by developing new algorithms for instrumental variables. These contributions position CIfly as a flexible and scalable backbone for graphical causal inference, guiding algorithm development and enabling easy and efficient deployment.
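A classic example of reducing a causal reasoning task to reachability is the textbook search for all nodes d-connected to a source given a conditioning set. The sketch below implements that well-known algorithm over states of the form (node, direction of arrival), which are generated on the fly during the traversal. It does not use CIfly's rule table format or its API; the function name and the edge-list input are assumptions made for illustration.

```python
from collections import deque
from typing import Hashable, Iterable, Set, Tuple


def d_connected(
    edges: Iterable[Tuple[Hashable, Hashable]],
    source: Hashable,
    given: Iterable[Hashable],
) -> Set[Hashable]:
    """Nodes d-connected to `source` given `given` in a DAG with directed edges (u, v)."""
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)
    given = set(given)

    # Conditioning nodes and their ancestors: colliders in this set are open.
    ancestors, stack = set(), list(given)
    while stack:
        node = stack.pop()
        if node not in ancestors:
            ancestors.add(node)
            stack.extend(parents.get(node, ()))

    reachable, visited = set(), set()
    # "up" = arrived from a child (or the start); "down" = arrived from a parent.
    queue = deque([(source, "up")])
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in given:
            reachable.add(node)
        if direction == "up" and node not in given:
            queue.extend((p, "up") for p in parents.get(node, ()))
            queue.extend((c, "down") for c in children.get(node, ()))
        elif direction == "down":
            if node not in given:  # chain: continue downwards
                queue.extend((c, "down") for c in children.get(node, ()))
            if node in ancestors:  # collider opened by conditioning
                queue.extend((p, "up") for p in parents.get(node, ()))
    reachable.discard(source)
    return reachable
```

Each (node, direction) state is visited at most once and each edge is examined a bounded number of times, so the search is linear in the size of the graph, which is the property the abstract emphasises for reachability-based primitives.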
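[AI-96] RecBayes: Recurrent Bayesian Ad Hoc Teamwork in Large Partially Observable Domains
[Quick Read]: This paper addresses how to effectively identify known teams and tasks under partial observability, in a setting where an agent is deployed on the fly into an environment in which an existing team operates and never requires access to the environment states or teammates' actions at any stage. The key to the solution is a recurrent Bayesian classifier trained on past experiences, which lets the ad hoc agent identify known teams and tasks from observations alone and thereby assist the team in completing its tasks.
Link: https://arxiv.org/abs/2506.15756
Authors: João G. Ribeiro, Yaniv Oren, Alberto Sardinha, Matthijs Spaan, Francisco S. Melo
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: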
Abstract:This paper proposes RecBayes, a novel approach for ad hoc teamwork under partial observability, a setting where agents are deployed on-the-fly to environments where pre-existing teams operate, that never requires, at any stage, access to the states of the environment or the actions of its teammates. We show that by relying on a recurrent Bayesian classifier trained using past experiences, an ad hoc agent is effectively able to identify known teams and tasks being performed from observations alone. Unlike recent approaches such as PO-GPL (Gu et al., 2021) and FEAT (Rahman et al., 2023), which require at some stage fully observable states of the environment, actions of teammates, or both, or approaches such as ATPO (Ribeiro et al., 2023), which require the environments to be small enough to be modelled tabularly (up to 4.8K states and 1.7K observations in that work), we show that RecBayes can handle arbitrarily large spaces while never relying on either states or teammates’ actions. Our results in benchmark domains from the multi-agent systems literature, adapted for partial observability and scaled up to 1M states and 2^125 observations, show that RecBayes is effective at identifying known teams and tasks being performed from partial observations alone, and as a result, is able to assist the teams in solving the tasks effectively.
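The idea of identifying a known team from observations alone can be illustrated with a plain recursive Bayesian update over a finite set of team hypotheses. This is a deliberately simplified stand-in: in RecBayes the classifier is a trained recurrent model, whereas here the per-observation likelihoods are simply passed in as a table, and the function name and team labels are hypothetical.

```python
from typing import Dict


def update_belief(
    belief: Dict[str, float], likelihoods: Dict[str, float]
) -> Dict[str, float]:
    """One step of Bayes' rule: posterior(team) is proportional to prior(team) * P(obs | team).

    belief maps each candidate team to its current probability;
    likelihoods maps each team to the probability of the latest observation.
    """
    unnormalised = {team: belief[team] * likelihoods[team] for team in belief}
    total = sum(unnormalised.values())
    if total == 0.0:
        # The observation is impossible under every hypothesis; keep the prior.
        return dict(belief)
    return {team: value / total for team, value in unnormalised.items()}
```

Applying the update once per time step concentrates the belief on the team whose behaviour best explains the observation history, without ever needing the true state or the teammates' actions.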
[AI-97] SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
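【Quick Read】: This paper addresses the risk that large language models (LLMs) deployed as autonomous agents in complex, long-horizon settings may harm users by pursuing hidden objectives. It evaluates how well LLMs can evade monitoring and achieve harmful hidden goals while completing a wide range of realistic tasks. The key to the solution is SHADE-Arena, a diverse evaluation dataset that pairs benign main tasks with harmful side objectives in complex environments. By measuring an agent's ability to complete the main task, complete the side task, and avoid detection, it characterizes both the performance and the limits of current frontier models at covert sabotage.
Link: https://arxiv.org/abs/2506.15740
Authors: Jonathan Kutasov,Yuqi Sun,Paul Colognese,Teun van der Weij,Linda Petrini,Chen Bo Calvin Zhang,John Hughes,Xiang Deng,Henry Sleight,Tyler Tracy,Buck Shlegeris,Joe Benton
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: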
Abstract:As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid detection, we find that the best performing frontier models score 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) as sabotage agents when overseen by Claude 3.6 Sonnet. For current frontier models, success on the side task relies heavily on having access to a hidden scratchpad that is not visible to the monitor. We also use SHADE-Arena to measure models’ monitoring abilities, with the top monitor (Gemini 2.5 Pro) achieving an AUC of 0.87 at distinguishing benign and malign transcripts. We find that for now, models still struggle at sabotage due to failures in long-context main task execution. However, our measurements already demonstrate the difficulty of monitoring for subtle sabotage attempts, which we expect to only increase in the face of more complex and longer-horizon tasks.
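The monitoring result above is reported as an AUC. As a reminder of what that number means, here is a minimal sketch that computes AUC as the probability that a randomly chosen malign transcript receives a higher suspicion score than a randomly chosen benign one, with ties counted as half. The function name and the idea of numeric "suspicion scores" are assumptions for illustration; the benchmark's own scoring code is not shown in the abstract.

```python
def auc(benign_scores, malign_scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    if not benign_scores or not malign_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in malign_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0      # monitor ranks the malign transcript higher
            elif m == b:
                wins += 0.5      # tie counts as half a win
    return wins / (len(benign_scores) * len(malign_scores))
```

An AUC of 0.5 corresponds to a monitor that cannot separate the two classes, and 1.0 to perfect separation.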
[AI-98] A Study of Hybrid and Evolutionary Metaheuristics for Single Hidden Layer Feedforward Neural Network Architecture
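【Quick Read】: This paper addresses the high computational cost and the tendency to converge to local optima when training artificial neural networks (ANNs) with stochastic gradient descent (SGD). The key to the solution is to use population-based metaheuristic optimizers (MHOs), namely Particle Swarm Optimization (PSO) and Genetic Algorithms (GAs), and to propose a hybrid PSO-SGD strategy that improves local search efficiency. Experiments show that the hybrid method substantially lowers training mean squared error (MSE) across several network sizes relative to conventional GA and PSO, and suggest that the Building Block Hypothesis (BBH) may still hold in evolutionary search.
Link: https://arxiv.org/abs/2506.15737
Authors: Gautam Siddharth Kashyap,Md Tabrez Nafis,Samar Wazir
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: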
Abstract:Training Artificial Neural Networks (ANNs) with Stochastic Gradient Descent (SGD) frequently encounters difficulties, including substantial computing expense and the risk of converging to local optima, attributable to its dependence on partial weight gradients. Therefore, this work investigates Particle Swarm Optimization (PSO) and Genetic Algorithms (GAs) - two population-based Metaheuristic Optimizers (MHOs) - as alternatives to SGD to mitigate these constraints. A hybrid PSO-SGD strategy is developed to improve local search efficiency. The findings indicate that the hybrid PSO-SGD technique decreases the median training MSE by 90 to 95 percent relative to conventional GA and PSO across various network sizes (e.g., from around 0.02 to approximately 0.001 in the Sphere function). RMHC attains substantial enhancements, reducing MSE by roughly 85 to 90 percent compared to GA. Simultaneously, RS consistently exhibits errors exceeding 0.3, signifying subpar performance. These findings underscore that hybrid and evolutionary procedures significantly improve training efficiency and accuracy compared to conventional optimization methods and imply that the Building Block Hypothesis (BBH) may still be valid, indicating that advantageous weight structures are retained during evolutionary search.
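One way to picture a hybrid of PSO and gradient descent is to let each particle take a gradient step after its usual velocity update. The sketch below does this on the Sphere function directly. It is an assumption-laden illustration: the abstract does not say how the two are interleaved, a full gradient is used instead of a stochastic one, no neural network is trained here, and the coefficients (0.7, 1.5, 1.5, learning rate 0.1) are conventional defaults chosen for the example.

```python
import random


def sphere(w):
    """Sphere benchmark: sum of squares, minimised at the origin."""
    return sum(x * x for x in w)


def sphere_grad(w):
    return [2.0 * x for x in w]


def hybrid_pso_gd(dim=5, n_particles=10, iters=50, lr=0.1, seed=0):
    """PSO where every particle also takes one gradient step per iteration.

    Returns (initial best loss, final best loss).
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=sphere)[:]
    initial = sphere(gbest)

    for _ in range(iters):
        for i in range(n_particles):
            # Standard PSO velocity and position update.
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Local refinement: one gradient-descent step.
            grad = sphere_grad(pos[i])
            pos[i] = [x - lr * g for x, g in zip(pos[i], grad)]
            # Track personal and global bests.
            if sphere(pos[i]) < sphere(pbest[i]):
                pbest[i] = pos[i][:]
                if sphere(pbest[i]) < sphere(gbest):
                    gbest = pbest[i][:]
    return initial, sphere(gbest)
```

Because the global best is only replaced by a strictly better position, the returned final loss never exceeds the initial one.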
[AI-99] ContextBench: Modifying Contexts for Targeted Latent Activation
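【Quick Read】: This paper addresses how to generate inputs that trigger specific behaviours or latent features in language models, a problem with many potential safety applications. The key to the solution is to formalise the approach as "context modification" and to build the ContextBench benchmark, whose tasks assess core method capabilities and potential safety applications. The authors enhance Evolutionary Prompt Optimisation (EPO) with LLM assistance and diffusion-model inpainting, improving the balance between elicitation strength and linguistic fluency.
Link: https://arxiv.org/abs/2506.15735
Authors: Robert Graham,Edward Stevinson,Leo Richter,Alexander Chia,Joseph Miller,Joseph Isaac Bloom
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: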
Abstract:Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench – a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.
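[AI-100] LLMs Struggle to Perform Counterfactual Reasoning with Parametric Knowledge ICML2025
【Quick Read】: This paper asks whether large language models (LLMs) can combine in-context knowledge with their parametric knowledge when deployed in novel settings, and studies the question through the lens of counterfactual reasoning. The findings are that LLMs generally fall back on their parametric knowledge in counterfactual reasoning tasks and struggle to integrate new or unfamiliar information, and that simple post-hoc finetuning struggles to instill counterfactual reasoning and can even degrade stored parametric knowledge.
Link: https://arxiv.org/abs/2506.15732
Authors: Khurram Yamin,Gaurav Ghosal,Bryan Wilder
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2025 Workshop on Scaling up Intervention Models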
Abstract:Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability – often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM’s abilities to re-purpose parametric knowledge in novel settings.
[AI-101] Graph Diffusion that can Insert and Delete
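【Quick Read】: This paper addresses a limitation of existing graph generative models based on discrete Denoising Diffusion Probabilistic Models (DDPMs): they cannot change the graph size (the number of atoms) during diffusion, which severely limits their effectiveness in conditional generation settings such as property-driven molecular design. The key to the solution is to reformulate the noising and denoising processes to support monotonic insertion and deletion of nodes, so that the chemical graph can grow or shrink during generation. The resulting model is called GrIDDD.
Link: https://arxiv.org/abs/2506.15725
Authors: Matteo Ninniri,Marco Podda,Davide Bacciu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: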
Abstract:Generative models of graphs based on discrete Denoising Diffusion Probabilistic Models (DDPMs) offer a principled approach to molecular generation by systematically removing structural noise through iterative atom and bond adjustments. However, existing formulations are fundamentally limited by their inability to adapt the graph size (that is, the number of atoms) during the diffusion process, severely restricting their effectiveness in conditional generation scenarios such as property-driven molecular design, where the targeted property often correlates with the molecular size. In this paper, we reformulate the noising and denoising processes to support monotonic insertion and deletion of nodes. The resulting model, which we call GrIDDD, dynamically grows or shrinks the chemical graph during generation. GrIDDD matches or exceeds the performance of existing graph diffusion models on molecular property targeting despite being trained on a more difficult problem. Furthermore, when applied to molecular optimization, GrIDDD exhibits competitive performance compared to specialized optimization models. This work paves the way for size-adaptive molecular generation with graph diffusion.
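[AI-102] UniMate: A Unified Model for Mechanical Metamaterial Generation, Property Prediction, and Condition Confirmation
【Quick Read】: This paper addresses the difficulty existing machine learning models have in jointly handling the three key modalities of mechanical metamaterial design: 3D topology, density condition, and mechanical property. Most prior work considers only two of them, for example predicting mechanical properties from the 3D topology or generating a 3D topology from required properties, and so covers only part of the design process. The key to the solution is a unified model, UNIMATE, built from a modality alignment module and a synergetic diffusion generation module, which models and generates all three modalities jointly.
Link: https://arxiv.org/abs/2506.15722
Authors: Wangzhi Zhan,Jianpeng Chen,Dongqi Fu,Dawei Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: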
Abstract:Metamaterials are artificial materials designed to exhibit properties not found in nature, such as ultra-stiffness and negative materials indices. In mechanical metamaterial design, three key modalities are typically involved, i.e., 3D topology, density condition, and mechanical property. Real-world complex application scenarios place demanding requirements on machine learning models to consider all three modalities together. However, a comprehensive literature review indicates that most existing works only consider two modalities, e.g., predicting mechanical properties given the 3D topology or generating 3D topology given the required properties. A significant gap therefore remains: state-of-the-art machine learning models do not capture all three modalities together. Hence, we propose a unified model named UNIMATE, which consists of a modality alignment module and a synergetic diffusion generation module. Experiments indicate that UNIMATE outperforms the other baseline models on the topology generation, property prediction, and condition confirmation tasks by up to 80.2%, 5.1%, and 50.2%, respectively. We open-source our proposed UNIMATE model and corresponding results at this https URL.
[AI-103] Alternates Assemble! Selecting Optimal Alternates for Citizens Assemblies
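【Quick Read】: This paper addresses the loss of representativeness in citizens’ assemblies caused by panelist dropout; existing selection methods do not account for how alternates are chosen. The key to the solution is an optimization framework that uses learning-theoretic tools to estimate dropout probabilities from historical data and selects alternates to minimize expected misrepresentation.
Link: https://arxiv.org/abs/2506.15716
Authors: Angelos Assos,Carmel Baharav,Bailey Flanigan,Ariel Procaccia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: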
Abstract:An increasingly influential form of deliberative democracy centers on citizens’ assemblies, where randomly selected people discuss policy questions. The legitimacy of these panels hinges on their representation of the broader population, but panelists often drop out, leading to an unbalanced composition. Although participant attrition is mitigated in practice by alternates, their selection is not taken into account by existing methods. To address this gap, we introduce an optimization framework for alternate selection. Our algorithmic approach, which leverages learning-theoretic machinery, estimates dropout probabilities using historical data and selects alternates to minimize expected misrepresentation. We establish theoretical guarantees for our approach, including worst-case bounds on sample complexity (with implications for computational efficiency) and on loss when panelists’ probabilities of dropping out are mis-estimated. Empirical evaluation using real-world data demonstrates that, compared to the status quo, our method significantly improves representation while requiring fewer alternates.
[AI-104] NeuronSeek: On Stability and Expressivity of Task-driven Neurons
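【Quick Read】: This paper addresses the design of task-driven neurons in deep learning, that is, how to adapt the neuron formulation to improve model performance. The key to the solution is to replace symbolic regression (SR) with tensor decomposition (TD) when searching for the optimal neuron formulation, which improves stability and speeds up convergence. The paper also gives theoretical guarantees that modifying the aggregation functions, combined with common activation functions, lets a network with a fixed number of parameters approximate any continuous function to arbitrarily small error, providing a mathematical foundation for the NeuronSeek framework.
Link: https://arxiv.org/abs/2506.15715
Authors: Hanyu Pei,Jing-Xiao Liao,Qibin Zhao,Ting Gao,Shijun Zhang,Xiaoge Zhang,Feng-Lei Fan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures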
Abstract:Drawing inspiration from our human brain that designs different neurons for different tasks, recent advances in deep learning have explored modifying a network’s neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons. Along this direction, this work replaces symbolic regression with tensor decomposition (TD) to discover optimal neuronal formulations, offering enhanced stability and faster convergence. Furthermore, we establish theoretical guarantees that modifying the aggregation functions with common activation functions can empower a network with a fixed number of parameters to approximate any continuous function with an arbitrarily small error, providing a rigorous mathematical foundation for the NeuronSeek framework. Extensive empirical evaluations demonstrate that our NeuronSeek-TD framework not only achieves superior stability, but also is competitive relative to the state-of-the-art models across diverse benchmarks. The code is available at this https URL.
[AI-105] BatteryBERT for Realistic Battery Fault Detection Using Point-Masked Signal Modeling
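【Quick Read】: This paper addresses accurate fault detection in lithium-ion batteries, in particular the difficulty of capturing complex temporal dependencies and exploiting abundant unlabeled data. Existing methods are limited on the numerical time-series data common in industrial settings, and although large language models have strong representation capabilities, their architectures are not directly suited to such data. The key to the solution is to adapt BERT-style pretraining to battery fault detection with a customized time-series-to-token representation module and a point-level Masked Signal Modeling (point-MSM) pretraining task tailored to battery applications. This enables self-supervised learning on sequential current, voltage, and other charge-discharge cycle data and yields distributionally robust, context-aware temporal embeddings.
Link: https://arxiv.org/abs/2506.15712
Authors: Songqi Zhou,Ruixue Liu,Yixing Wang,Jia Lu,Benben Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: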
Abstract:Accurate fault detection in lithium-ion batteries is essential for the safe and reliable operation of electric vehicles and energy storage systems. However, existing methods often struggle to capture complex temporal dependencies and cannot fully leverage abundant unlabeled data. Although large language models (LLMs) exhibit strong representation capabilities, their architectures are not directly suited to the numerical time-series data common in industrial settings. To address these challenges, we propose a novel framework that adapts BERT-style pretraining for battery fault detection by extending the standard BERT architecture with a customized time-series-to-token representation module and a point-level Masked Signal Modeling (point-MSM) pretraining task tailored to battery applications. This approach enables self-supervised learning on sequential current, voltage, and other charge-discharge cycle data, yielding distributionally robust, context-aware temporal embeddings. We then concatenate these embeddings with battery metadata and feed them into a downstream classifier for accurate fault classification. Experimental results on a large-scale real-world dataset show that models initialized with our pretrained parameters significantly improve both representation quality and classification accuracy, achieving an AUROC of 0.945 and substantially outperforming existing approaches. These findings validate the effectiveness of BERT-style pretraining for time-series fault detection.
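The point-level masking task mentioned above can be pictured as hiding individual samples of a signal and scoring a model only on how well it reconstructs the hidden points. The sketch below shows that data-side logic only. The mask ratio, the use of 0.0 as the mask value, and the mean-squared-error objective are assumptions made for illustration; the abstract does not specify them, and no model is included.

```python
import random


def mask_points(series, mask_ratio=0.15, mask_value=0.0, seed=0):
    """Replace a random subset of points with `mask_value`.

    Returns the masked copy and the sorted list of masked indices.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(series) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(series)), n_mask))
    masked = list(series)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


def masked_mse(prediction, target, masked_idx):
    """Reconstruction error measured only at the masked positions."""
    return sum((prediction[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)
```

In pretraining, `prediction` would be the model's output for the masked input, and the loss would be minimised over many charge-discharge sequences.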
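[AI-106] RAST: Reasoning Activation in LLMs via Small-model Transfer
【Quick Read】: This paper addresses the heavy computational cost of applying reinforcement learning (RL) to large language models (LLMs). The key to the solution is RAST, a method that transfers reasoning behaviors by injecting the probability adjustments induced by RL in a small model into a larger model, which improves the larger model's reasoning while requiring significantly less GPU memory.
Link: https://arxiv.org/abs/2506.15710
Authors: Siru Ouyang,Xinyu Zhu,Zilin Xiao,Minhao Jiang,Yu Meng,Jiawei Han
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: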
Abstract:Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs), as evidenced by recent successes such as OpenAI’s o1 and Deepseek-R1. However, applying RL at scale remains intimidatingly resource-intensive, requiring multiple model copies and extensive GPU workloads. On the other hand, while being powerful, recent studies suggest that RL does not fundamentally endow models with new knowledge; rather, it primarily reshapes the model’s output distribution to activate reasoning capabilities latent in the base model. Building on this insight, we hypothesize that the changes in output probabilities induced by RL are largely model-size invariant, opening the door to a more efficient paradigm: training a small model with RL and transferring its induced probability shifts to larger base models. To verify our hypothesis, we conduct a token-level analysis of decoding trajectories and find high alignment in RL-induced output distributions across model scales, validating our hypothesis. Motivated by this, we propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models. Experiments across multiple mathematical reasoning benchmarks show that RAST substantially and consistently enhances the reasoning capabilities of base models while requiring significantly lower GPU memory than direct RL training, sometimes even yielding better performance than the RL-trained counterparts. Our findings offer new insights into the nature of RL-driven reasoning and practical strategies for scaling its benefits without incurring its full computational cost. The project page of RAST is available at this https URL.
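The transfer step described above can be sketched at the level of a single decoding step. The assumption made here, which the abstract does not spell out, is that the RL-induced adjustment is applied as an additive difference in logits between the small RL-trained model and the small base model, in the spirit of logit-arithmetic decoding methods. The function names and the `scale` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_distribution(large_base, small_base, small_rl, scale=1.0):
    """Next-token distribution for the large model after adding the
    small model's RL-induced logit shift (small_rl - small_base)."""
    adjusted = [
        lb + scale * (sr - sb)
        for lb, sb, sr in zip(large_base, small_base, small_rl)
    ]
    return softmax(adjusted)
```

When the small RL-trained model and the small base model agree, the shift is zero and the large model's own distribution is returned unchanged. All three models are assumed to share a vocabulary so that the logits line up token by token.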
[AI-107] Studying and Improving Graph Neural Network-based Motif Estimation
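【Quick Read】: This paper addresses the under-explored use of Graph Neural Networks (GNNs) for predicting network motif significance profiles (SPs), a task with no established benchmarks. The key to the solution is to treat SP estimation as a task independent of subgraph frequency estimation: the SP is estimated directly and framed as a multitarget regression problem, with the formulation optimised for interpretability, stability, and scalability. The approach aims to get past the theoretical limitations of motif estimation based on subgraph counting.
Link: https://arxiv.org/abs/2506.15709
Authors: Pedro C. Vieira,Miguel E. P. Silva,Pedro Manuel Pinto Ribeiro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This manuscript represents a revised version from the paper on this https URL. Still a work in progress. Comments are welcome! 23 pages (12 main text + references), 9 figures, 5 tables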
Abstract:Graph Neural Networks (GNNs) are a predominant method for graph representation learning. However, beyond subgraph frequency estimation, their application to network motif significance-profile (SP) prediction remains under-explored, with no established benchmarks in the literature. We propose to address this problem, framing SP estimation as a task independent of subgraph frequency estimation. Our approach shifts from frequency counting to direct SP estimation and modulates the problem as multitarget regression. The reformulation is optimised for interpretability, stability and scalability on large graphs. We validate our method using a large synthetic dataset and further test it on real-world graphs. Our experiments reveal that 1-WL limited models struggle to make precise estimations of SPs. However, they can generalise to approximate the graph generation processes of networks by comparing their predicted SP with the ones originating from synthetic generators. This first study on GNN-based motif estimation also hints at how using direct SP estimation can help go past the theoretical limitations that motif estimation faces when performed through subgraph counting.
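For context, the significance profile that serves as the regression target is conventionally defined from subgraph counts: each motif's count in the real graph is converted to a z-score against counts in randomised graphs, and the z-score vector is normalised to unit length. The sketch below follows that conventional definition. The function names are hypothetical, the population standard deviation is used for simplicity, and how the paper's own pipeline computes its targets is not stated in the abstract.

```python
import math
from statistics import mean, pstdev


def z_scores(real_counts, random_counts):
    """z-score of each motif count against counts from randomised graphs.

    `random_counts` is a list of count vectors, one per randomised graph.
    A motif whose randomised counts have zero spread gets a z-score of 0.
    """
    zs = []
    for i, real in enumerate(real_counts):
        samples = [counts[i] for counts in random_counts]
        sd = pstdev(samples)
        zs.append((real - mean(samples)) / sd if sd > 0 else 0.0)
    return zs


def significance_profile(zs):
    """Normalise the z-score vector to unit Euclidean length."""
    norm = math.sqrt(sum(z * z for z in zs))
    return [z / norm for z in zs] if norm > 0 else [0.0] * len(zs)
```

A model that predicts the SP directly outputs one value per motif type, which is why the task fits a multitarget regression setup.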
[AI-108] Refined Causal Graph Structure Learning via Curvature for Brain Disease Classification
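【Quick Read】: This paper addresses the fact that conventional Graph Neural Networks (GNNs) for brain disease classification/detection rely on typical correlation values and do not consider causal relationships between brain regions of interest (ROIs). The key to the solution is a new framework, CGB (Causal Graphs for Brains), which models refined brain networks using a causal discovery method, transfer entropy, and a geometric curvature strategy. It uncovers causal relationships between ROIs and rewires the graph via the curvature strategy to make it more expressive and to reduce information bottlenecks.
Link: https://arxiv.org/abs/2506.15708
Authors: Falih Gozi Febrinanto,Adonia Simango,Chengpei Xu,Jingjing Zhou,Jiangang Ma,Sonika Tyagi,Feng Xia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: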
Abstract:Graph neural networks (GNNs) have been developed to model the relationship between regions of interest (ROIs) in brains and have shown significant improvement in detecting brain diseases. However, most of these frameworks do not consider the intrinsic relationship of causality factor between brain ROIs, which is arguably more essential to observe cause and effect interaction between signals rather than typical correlation values. We propose a novel framework called CGB (Causal Graphs for Brains) for brain disease classification/detection, which models refined brain networks based on the causal discovery method, transfer entropy, and geometric curvature strategy. CGB unveils causal relationships between ROIs that bring vital information to enhance brain disease classification performance. Furthermore, CGB also performs a graph rewiring through a geometric curvature strategy to refine the generated causal graph to become more expressive and reduce potential information bottlenecks when GNNs model it. Our extensive experiments show that CGB outperforms state-of-the-art methods in classification tasks on brain disease datasets, as measured by average F1 scores.
[AI-109] Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling
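【Quick Read】: This paper addresses how to allocate a limited compute budget at inference time in Test-Time Scaling (TTS) to improve the performance of Large Language Models (LLMs). Existing search methods allocate resources at the solution level, which favors reasoning directions that have more candidates and wastes compute. The key to the solution is Direction-Oriented Resource Allocation (DORA), a provably optimal method that decouples direction quality from candidate count and allocates resources at the direction level, mitigating this bias and improving search efficiency.
Link: https://arxiv.org/abs/2506.15707
Authors: Xinglin Wang,Yiwei Li,Shaoxiong Feng,Peiwen Yuan,Yueqi Zhang,Jiayi Shi,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint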
Abstract:Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs) by using additional inference-time computation to explore multiple reasoning paths through search. Yet how to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. To bridge this gap, we formulate test-time search as a resource allocation problem and derive the optimal allocation strategy that maximizes the probability of obtaining a correct solution under a fixed rollout budget. Within this formulation, we reveal a core limitation of existing search methods: solution-level allocation tends to favor reasoning directions with more candidates, leading to theoretically suboptimal and inefficient use of compute. To address this, we propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias by decoupling direction quality from candidate count and allocating resources at the direction level. To demonstrate DORA’s effectiveness, we conduct extensive experiments on challenging mathematical reasoning benchmarks including MATH500, AIME2024, and AIME2025. The empirical results show that DORA consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art accuracy. We hope our findings contribute to a broader understanding of optimal TTS for LLMs.
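The bias described above is easy to reproduce in a toy setting. The sketch below is not DORA itself, whose allocation rule the abstract does not give. It only contrasts two simple schemes: a solution-level split, where every candidate gets a softmax share and a direction's budget is the sum over its candidates, and a direction-level split, where each direction gets one share based on an aggregate quality. Using the mean candidate score as direction quality is an assumption made for the example, as are the function names.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def solution_level_allocation(scores, directions, budget):
    """Each candidate gets a softmax share; a direction's budget is the sum
    of its candidates' shares, so it grows with the number of candidates."""
    per_direction = {}
    for weight, d in zip(softmax(scores), directions):
        per_direction[d] = per_direction.get(d, 0.0) + weight * budget
    return per_direction


def direction_level_allocation(scores, directions, budget):
    """Each direction gets one softmax share based on its mean score,
    independent of how many candidates it contains."""
    groups = {}
    for s, d in zip(scores, directions):
        groups.setdefault(d, []).append(s)
    names = list(groups)
    quality = [sum(groups[d]) / len(groups[d]) for d in names]
    return {d: w * budget for d, w in zip(names, softmax(quality))}
```

With four equally scored candidates, three in direction "a" and one in "b", the solution-level split gives "a" three quarters of the rollouts even though the two directions are equally promising, while the direction-level split divides the budget evenly.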
[AI-110] MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
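【Quick Read】: This paper addresses the difficulty Large Language Models (LLMs) have in keeping every step of a mathematical derivation correct, in particular hallucinations that arise because incorrect outputs are not suppressed. Existing methods such as Direct Preference Optimization (DPO) bring limited gains in long-chain mathematical reasoning, mainly because DPO struggles to capture the differences between accepted and rejected answers in the preference data and because its training objective is inconsistent with the generation metric. The proposed Multi-Granularity Direct Preference Optimization (MDPO) optimizes mathematical reasoning at three granularities, Solution2Solution, Inference2Inference, and Step2Step, to improve logical reasoning and computational accuracy. The key is to unify the training objectives of the three granularities so that they align with the generation metric and suppress incorrect outputs more effectively.
Link: https://arxiv.org/abs/2506.15706
Authors: Yunze Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: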
Abstract:Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) as it requires ensuring the correctness of each reasoning step. Researchers have been strengthening the mathematical reasoning abilities of LLMs through supervised fine-tuning, but due to the inability to suppress incorrect outputs, hallucinations can easily arise. Recently, Direct Preference Optimization (DPO) has been widely adopted for aligning LLMs with human intent by using preference data to prevent LLMs from generating incorrect outputs. However, it has shown limited benefits in long-chain mathematical reasoning, mainly because DPO struggles to effectively capture the differences between accepted and rejected answers from preferences in long-chain data. The inconsistency between DPO training and LLMs’ generation metrics also affects the effectiveness of suppressing incorrect outputs. We propose the Multi-Granularity Direct Preference Optimization (MDPO) method, optimizing the mathematical reasoning of LLMs at three granularities: Solution2Solution, Inference2Inference, and Step2Step. Solution2Solution focuses on the correctness of entire long-chain reasoning; Inference2Inference concentrates on logical reasoning between steps; Step2Step corrects computational errors in steps, enhancing the computational capabilities of LLMs. Additionally, we unify the training objectives of the three granularities to align with the generation metrics. We conducted experiments on the open-source models Qwen2 and Llama3, achieving improvements of 1.7% and 0.9% on the GSM8K dataset, and 2.3% and 1.2% on the MATH dataset, outperforming DPO and other DPO variant methods. Furthermore, we also provide a simple pipeline for constructing MDPO training data that requires no manual annotation.
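For reference, the standard DPO objective that MDPO builds on can be written for a single preference pair as below. This is the usual published DPO loss, not MDPO's unified multi-granularity objective, which the abstract does not specify. The inputs are summed token log-probabilities of the accepted and rejected answers under the policy and under a frozen reference model, and `beta=0.1` is a commonly used default chosen here as an assumption.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)])
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under a multi-granularity scheme, the same pairwise form could in principle be applied to pairs of whole solutions, pairs of inference steps, or pairs of single computation steps; how the paper weights and unifies these is not described in the abstract.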
[AI-111] Generalisation Bounds of Zero-Shot Economic Forecasting using Time Series Foundation Models
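【Quick Read】: This paper addresses the need of conventional econometric models for bespoke training data and complex modelling pipelines when forecasting macroeconomic indicators, and proposes zero-shot forecasting with Time Series Foundation Models (TSFMs). The key to the solution is that pretrained TSFMs can capture rich economic dynamics, accommodate regime shifts, and deliver well-behaved uncertainty estimates without any fine-tuning, giving efficient and general-purpose forecasts in macroeconomics.
Link: https://arxiv.org/abs/2506.15705
Authors: Jittarin Jetwiriyanon,Teo Susnjak,Surangika Ranathunga
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: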
Abstract:This study investigates zero-shot forecasting capabilities of Time Series Foundation Models (TSFMs) for macroeconomic indicators. We apply TSFMs to forecasting economic indicators under univariate conditions, bypassing the need to train bespoke econometric models on extensive training datasets. Our experiments were conducted on a case study dataset, without additional customisation. We rigorously back-tested three state-of-the-art TSFMs (Chronos, TimeGPT and Moirai) under data-scarce conditions and structural breaks. Our results demonstrate that appropriately engineered TSFMs can internalise rich economic dynamics, accommodate regime shifts, and deliver well-behaved uncertainty estimates out of the box, while matching state-of-the-art multivariate models on this domain. Our findings suggest that, without any fine-tuning, TSFMs can match or exceed classical models during stable economic conditions. However, their performance is vulnerable to degradation during periods of rapid shocks. The findings offer guidance to practitioners on when zero-shot deployments are viable for macroeconomic monitoring and strategic planning.
[AI-112] Federated Incomplete Multi-view Clustering with Globally Fused Graph Guidance
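【Quick Read】: This paper addresses two problems in federated multi-view clustering: existing methods use global pseudo-labels only to guide downstream clustering and do not exploit global information during feature extraction, and the missing-data problem has received little attention. The key to the solution is FIMCFG, a federated incomplete multi-view clustering method with globally fused graph guidance. Each client uses a dual-head graph convolutional encoder to extract latent features carrying global and view-specific information, and these are fused into high-level features under the guidance of the fused graph, improving clustering performance.
Link: https://arxiv.org/abs/2506.15703
Authors: Guoqing Chao,Zhenghao Zhang,Lei Meng,Jie Wen,Dianhui Chu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: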
Abstract:Federated multi-view clustering has been proposed to mine the valuable information within multi-view data distributed across different devices and has achieved impressive results while preserving privacy. Despite great progress, most federated multi-view clustering methods use only global pseudo-labels to guide the downstream clustering process and fail to exploit the global information when extracting features. In addition, the missing-data problem in federated multi-view clustering remains under-explored. To address these problems, we propose a novel Federated Incomplete Multi-view Clustering method with globally Fused Graph guidance (FIMCFG). Specifically, we design a dual-head graph convolutional encoder at each client to extract two kinds of underlying features containing global and view-specific information. Subsequently, under the guidance of the fused graph, the two underlying features are fused into high-level features, based on which clustering is conducted under the supervision of pseudo-labeling. Finally, the high-level features are uploaded to the server to refine the graph fusion and pseudo-labeling computation. Extensive experimental results demonstrate the effectiveness and superiority of FIMCFG. Our code is publicly available at this https URL.
[AI-113] Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation
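【Quick Read】: This paper addresses the loss of general capability when a language model is finetuned on limited data. The key to the solution is minifinetuning (MFT), which uses corrective self-distillation individualized at the sample level to substantially reduce overfitting-induced degeneralization without any pre-training data for replay, and which remains robust even when new-domain data is scarce (as few as 500 samples).
Link: https://arxiv.org/abs/2506.15702
Authors: Peter Belcak,Greg Heinrich,Jan Kautz,Pavlo Molchanov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: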
Abstract:Finetuning language models for a new domain inevitably leads to the deterioration of their general performance. This becomes more pronounced the more limited the finetuning data resource. We introduce minifinetuning (MFT), a method for language model domain adaptation that considerably reduces the effects of overfitting-induced degeneralization in low-data settings and which does so in the absence of any pre-training data for replay. MFT demonstrates 2-10x more favourable specialization-to-degeneralization ratios than standard finetuning across a wide range of models and domains and exhibits an intrinsic robustness to overfitting when data in the new domain is scarce and down to as little as 500 samples. Employing corrective self-distillation that is individualized on the sample level, MFT outperforms parameter-efficient finetuning methods, demonstrates replay-like degeneralization mitigation properties, and is composable with either for a combined effect.
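[AI-114] Compiler-R1: Towards Agentic Compiler Auto-tuning with Reinforcement Learning
【Quick Read】: This paper addresses two key challenges in compiler auto-tuning: the lack of high-quality reasoning datasets for training agents, and limited effective interaction with the compilation environment. The key to the solution is Compiler-R1, a reinforcement learning (RL) framework that augments Large Language Model (LLM) capabilities for compiler auto-tuning. It combines a curated, high-quality reasoning dataset with a novel two-stage end-to-end RL training pipeline that uses an outcome-based reward for efficient environment exploration and learning.
Link: https://arxiv.org/abs/2506.15701
Authors: Haolin Pan,Hongyu Lin,Haoran Luo,Yang Liu,Kaichun Yao,Libo Zhang,Mingjie Xing,Yanjun Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: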
Abstract:Compiler auto-tuning optimizes pass sequences to improve performance metrics such as Intermediate Representation (IR) instruction count. Although recent advances leveraging Large Language Models (LLMs) have shown promise in automating compiler tuning, two significant challenges remain: the absence of high-quality reasoning datasets for agent training, and limited effective interactions with the compilation environment. In this work, we introduce Compiler-R1, the first reinforcement learning (RL)-driven framework specifically augmenting LLM capabilities for compiler auto-tuning. Compiler-R1 features a curated, high-quality reasoning dataset and a novel two-stage end-to-end RL training pipeline, enabling efficient environment exploration and learning through an outcome-based reward. Extensive experiments across seven datasets demonstrate Compiler-R1 achieving an average 8.46% IR instruction count reduction compared to opt -Oz, showcasing the strong potential of RL-trained LLMs for compiler optimization. Our code and datasets are publicly available at this https URL.
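[AI-115] Contraction Actor-Critic: Contraction Metric-Guided Reinforcement Learning for Robust Path Tracking
Quick read: This paper addresses the limitations of control contraction metrics (CCMs) when applied to complex systems: the synthesized controller only guarantees that trajectories converge to a single trajectory, with no notion of optimality over the whole trajectory, and constructing a CCM requires a known dynamics model and the solution of an infinite-dimensional convex feasibility problem, which limits scalability. The core of the solution is to integrate CCMs into reinforcement learning (RL) through an algorithm called contraction actor-critic (CAC), which learns contracting policies with long-term optimality in a fully automated way under unknown dynamics. Given a pre-trained dynamics model, CAC simultaneously learns a contraction metric generator (CMG) and an optimal tracking policy guided by the generated metric.
Link: https://arxiv.org/abs/2506.15700
Authors: Minjae Cho,Hiroyasu Tsukamoto,Huy Trong Tran
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: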
Abstract:Control contraction metrics (CCMs) provide a framework to co-synthesize a controller and a corresponding contraction metric – a positive-definite Riemannian metric under which a closed-loop system is guaranteed to be incrementally exponentially stable. However, the synthesized controller only ensures that all the trajectories of the system converge to one single trajectory and, as such, does not impose any notion of optimality across an entire trajectory. Furthermore, constructing CCMs requires a known dynamics model and non-trivial effort in solving an infinite-dimensional convex feasibility problem, which limits its scalability to complex systems featuring high dimensionality with uncertainty. To address these issues, we propose to integrate CCMs into reinforcement learning (RL), where CCMs provide dynamics-informed feedback for learning control policies that minimize cumulative tracking error under unknown dynamics. We show that our algorithm, called contraction actor-critic (CAC), formally enhances the capability of CCMs to provide a set of contracting policies with the long-term optimality of RL in a fully automated setting. Given a pre-trained dynamics model, CAC simultaneously learns a contraction metric generator (CMG) – which generates a contraction metric – and uses an actor-critic algorithm to learn an optimal tracking policy guided by that metric. We demonstrate the effectiveness of our algorithm relative to established baselines through extensive empirical studies, including simulated and real-world robot experiments, and provide a theoretical rationale for incorporating contraction theory into RL.
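[AI-116] BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap
Quick read: This paper addresses the problem that current machine unlearning benchmarks for large language models (LLMs) use highly disparate forget and retain sets, which gives a distorted picture of how effective unlearning methods are. The core of the solution is the BLUR benchmark, which enables more robust evaluation by providing more realistic forget-retain overlap scenarios, extended evaluation tasks, combined forget/retain queries, and relearning datasets of varying difficulty.
Link: https://arxiv.org/abs/2506.15699
Authors: Shengyuan Hu,Neil Kale,Pratiksha Thaker,Yiwei Fu,Steven Wu,Virginia Smith
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: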
Abstract:Machine unlearning has the potential to improve the safety of large language models (LLMs) by removing sensitive or harmful information post hoc. A key challenge in unlearning involves balancing between forget quality (effectively unlearning undesirable information) and retain quality (maintaining good performance on other, general tasks). Unfortunately, as we show, current LLM unlearning benchmarks contain highly disparate forget and retain sets, painting a false picture of the effectiveness of LLM unlearning methods. This can be particularly problematic because it opens the door for benign perturbations, such as relearning attacks, to easily reveal supposedly unlearned knowledge once models are deployed. To address this, we present BLUR: a benchmark for LLM unlearning that provides more realistic scenarios of forget-retain overlap. BLUR significantly expands on existing unlearning benchmarks by providing extended evaluation tasks, combined forget/retain queries, and relearning datasets of varying degrees of difficulty. Despite the benign nature of the queries considered, we find that the performance of existing methods drops significantly when evaluated on BLUR, with simple approaches performing better on average than more recent methods. These results highlight the importance of robust evaluation and suggest several important directions of future study. Our benchmark is publicly available at: this https URL
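[AI-117] What Do Latent Action Models Actually Learn?
Quick read: This paper studies whether latent action models (LAMs), which learn action-relevant changes from unlabeled videos, capture the controllable changes caused by actions or exogenous noise. The core of the approach is a linear model that captures the essence of LAM learning; together with numerical simulations, it sheds light on how the structure of observations, actions, and noise influences what a LAM learns, and it justifies strategies that encourage learning controllable changes, such as data augmentation, data cleaning, and auxiliary action prediction.
Link: https://arxiv.org/abs/2506.15691
Authors: Chuheng Zhang,Tim Pearce,Pushi Zhang,Kaixin Wang,Xiaoyu Chen,Wei Shen,Li Zhao,Jiang Bian
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: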
Abstract:Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern: do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning, while being tractable. This provides several insights, including connections between LAM and principal component analysis (PCA), desiderata of the data-generating policy, and justification of strategies to encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action-prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of observations, actions, and noise in data that influence LAM learning.
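[AI-118] LLM Web Dynamics: Tracing Model Collapse in a Network of LLMs
Quick read: This paper addresses model collapse, a risk that arises when large language model (LLM) training relies on synthetic data from the public Internet and that remains insufficiently explored in existing work. The core of the solution is the LLM Web Dynamics (LWD) framework, which simulates the Internet with a retrieval-augmented generation (RAG) database in order to analyze, at the network level, how model outputs converge, and which provides theoretical guarantees for this convergence by analogy with interacting Gaussian mixture models.
Link: https://arxiv.org/abs/2506.15690
Authors: Tianyu Wang,Lingyou Pang,Akira Horiguchi,Carey E. Priebe
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Methodology (stat.ME)
Comments: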
Abstract:The increasing use of synthetic data from the public Internet has enhanced data usage efficiency in large language model (LLM) training. However, the potential threat of model collapse remains insufficiently explored. Existing studies primarily examine model collapse in a single model setting or rely solely on statistical surrogates. In this work, we introduce LLM Web Dynamics (LWD), an efficient framework for investigating model collapse at the network level. By simulating the Internet with a retrieval-augmented generation (RAG) database, we analyze the convergence pattern of model outputs. Furthermore, we provide theoretical guarantees for this convergence by drawing an analogy to interacting Gaussian Mixture Models.
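The "single model" and "statistical surrogate" settings that the abstract contrasts with LWD can be illustrated with a toy loop in which one Gaussian is repeatedly refit to samples drawn from its own previous fit. This is only a sketch of that simpler baseline setting, not the LWD framework; the function name and parameters are hypothetical.

```python
import random
import statistics


def recursive_gaussian_fit(mu, sigma, n_samples, generations, seed=0):
    """Refit a Gaussian to its own samples for several generations.

    Returns the list of (mean, std) pairs, starting with the initial one.
    Each generation only sees data produced by the previous generation.
    """
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)        # maximum-likelihood mean
        sigma = statistics.pstdev(samples)    # maximum-likelihood std
        history.append((mu, sigma))
    return history
```

Plotting the second component of `history` for small `n_samples` is one way to watch how the fitted spread drifts across generations in such a toy setting.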
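[AI-119] Cellular Traffic Prediction via Deep State Space Models with Attention Mechanism
Quick read: This paper aims to improve the accuracy of cellular traffic prediction, which is difficult because traffic is highly dynamic and influenced by many exogenous factors. The core of the solution is an end-to-end framework that uses convolutional neural networks (CNNs) with an attention mechanism to capture spatial dynamics and a Kalman filter for temporal modeling, while exploiting auxiliary information such as social activities to improve prediction performance.
Link: https://arxiv.org/abs/2506.15688
Authors: Hui Ma,Kai Yang,Man-On Pun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: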
Abstract:Cellular traffic prediction is of great importance for operators to manage network resources and make decisions. Traffic is highly dynamic and influenced by many exogenous factors, which degrades prediction accuracy. This paper proposes an end-to-end framework with two variants to explicitly characterize the spatiotemporal patterns of cellular traffic among neighboring cells. It uses convolutional neural networks with an attention mechanism to capture the spatial dynamics and a Kalman filter for temporal modelling. In addition, the framework can fully exploit auxiliary information such as social activities to improve prediction performance. We conduct extensive experiments on three real-world datasets. The results show that our proposed models outperform the state-of-the-art machine learning techniques in terms of prediction accuracy.
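The temporal component named in this abstract is a Kalman filter. As a reminder of what the classical predict/update recursion looks like, here is a minimal scalar sketch under an assumed random-walk state model; it is not the paper's deep state space model, and the function name and default values are illustrative.

```python
def kalman_filter_1d(observations, process_var, obs_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state x_t = x_{t-1} + w_t
    observed as z_t = x_t + v_t, with Var(w_t) = process_var and
    Var(v_t) = obs_var. Returns the filtered state estimates."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: a random walk keeps the mean and inflates the variance.
        p = p + process_var
        # Update: blend prediction and observation using the Kalman gain.
        k = p / (p + obs_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

A larger `obs_var` relative to `process_var` makes the gain smaller, so the estimate follows noisy observations more cautiously.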
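[AI-120] Learning from M-Tuple Dominant Positive and Unlabeled Data
Quick read: This paper starts from learning from label proportions (LLP), where instances are grouped into bags and only the class proportions of each bag are given, and targets the practical difficulty of obtaining precise proportion supervision. The core of the solution is a generalized learning framework, MDPU, which mathematically models the distribution of instances within tuples of arbitrary size under the constraint that positive instances are no fewer than negative ones, and derives an unbiased risk estimator that satisfies risk consistency. To mitigate overfitting during training, a risk correction method is introduced, yielding a corrected risk estimator; theoretical analysis and experiments support the effectiveness of the method.
Link: https://arxiv.org/abs/2506.15686
Authors: Jiahe Qin,Junpeng Li,Changchun Hua,Yana Yang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: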
Abstract:Label Proportion Learning (LLP) addresses the classification problem where multiple instances are grouped into bags and each bag contains information about the proportion of each class. However, in practical applications, obtaining precise supervisory information regarding the proportion of instances in a specific class is challenging. To better align with real-world application scenarios and effectively leverage the proportional constraints of instances within tuples, this paper proposes a generalized learning framework, MDPU. Specifically, we first mathematically model the distribution of instances within tuples of arbitrary size, under the constraint that the number of positive instances is no less than that of negative instances. Then we derive an unbiased risk estimator that satisfies risk consistency based on the empirical risk minimization (ERM) method. To mitigate the inevitable overfitting issue during training, a risk correction method is introduced, leading to the development of a corrected risk estimator. The generalization error bounds of the unbiased risk estimator theoretically demonstrate the consistency of the proposed method. Extensive experiments on multiple datasets and comparisons with other relevant baseline methods comprehensively validate the effectiveness of the proposed learning framework.
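[AI-121] Ignition Phase: Standard Training for Fast Adversarial Robustness
Quick read: This paper addresses the tendency of adversarial training (AT) variants to focus on stronger attack generation while overlooking foundational feature representations. The core of the solution is Adversarial Evolution Training (AET), which strategically prepends an empirical risk minimization (ERM) phase to conventional AT in order to cultivate a favorable feature manifold, making robustness acquisition more efficient and effective.
Link: https://arxiv.org/abs/2506.15685
Authors: Wang Yu-Hang,Liu ying,Fang liang,Wang Xuelin,Junkang Guo,Shiwei Li,Lei Gao,Jian Liu,Wenfei Yin
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: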
Abstract:Adversarial Training (AT) is a cornerstone defense, but many variants overlook foundational feature representations by primarily focusing on stronger attack generation. We introduce Adversarial Evolution Training (AET), a simple yet powerful framework that strategically prepends an Empirical Risk Minimization (ERM) phase to conventional AT. We hypothesize this initial ERM phase cultivates a favorable feature manifold, enabling more efficient and effective robustness acquisition. Empirically, AET achieves comparable or superior robustness more rapidly, improves clean accuracy, and cuts training costs by 8-25%. Its effectiveness is shown across multiple datasets, architectures, and when augmenting established AT methods. Our findings underscore the impact of feature pre-conditioning via standard training for developing more efficient, principled robust defenses. Code is available in the supplementary material.
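[AI-122] Single-shot thermometry of simulated Bose–Einstein condensates using artificial intelligence
Quick read: This paper addresses the precise determination of thermodynamic parameters in ultracold Bose gases, which is hindered by the destructive nature of conventional measurement techniques and by inherent experimental uncertainties. The core of the solution is an artificial intelligence approach in which a convolutional neural network rapidly and non-destructively estimates the chemical potential and temperature from single-shot, in situ imaged density profiles of finite-temperature Bose gases. The model is trained only on quasi-2D "pancake" condensates in harmonic traps, extracts parameters within fractions of a second, and generalizes zero-shot across trap geometries and thermalization dynamics.
Link: https://arxiv.org/abs/2506.16925
Authors: Jack Griffiths,Steven A. Wrathmall,Simon A. Gardiner
Affiliation: unknown
Categories: Quantum Gases (cond-mat.quant-gas); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: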
Abstract:Precise determination of thermodynamic parameters in ultracold Bose gases remains challenging due to the destructive nature of conventional measurement techniques and inherent experimental uncertainties. We demonstrate an artificial intelligence approach for rapid, non-destructive estimation of the chemical potential and temperature from single-shot, in situ imaged density profiles of finite-temperature Bose gases. Our convolutional neural network is trained exclusively on quasi-2D 'pancake' condensates in harmonic trap configurations. It achieves parameter extraction within fractions of a second. The model also demonstrates zero-shot generalisation across both trap geometry and thermalisation dynamics, successfully estimating thermodynamic parameters for toroidally trapped condensates with errors of only a few nanokelvin despite no prior exposure to such geometries during training, and maintaining predictive accuracy during dynamic thermalisation processes after a relatively brief evolution without explicit training on non-equilibrium states. These results suggest that supervised learning can overcome traditional limitations in ultracold atom thermometry, with extension to broader geometric configurations, temperature ranges, and additional parameters potentially enabling comprehensive real-time analysis of quantum gas experiments. Such capabilities could significantly streamline experimental workflows whilst improving measurement precision across a range of quantum fluid systems.
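[AI-123] RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching INTERSPEECH2025
Quick read: This paper addresses the trade-off between speech quality and inference speed in ordinary differential equation (ODE)-based text-to-speech (TTS) generation, which produces natural-quality speech but typically needs many generation steps. The core of the solution is RapFlow-TTS, which adds velocity consistency constraints to flow matching (FM) training so that the velocity field stays consistent along the FM-straightened ODE trajectory, preserving synthesis quality with fewer generation steps. Time interval scheduling and adversarial learning further improve few-step synthesis quality.
Link: https://arxiv.org/abs/2506.16741
Authors: Hyun Joon Park,Jeongmin Liu,Jin Sob Kim,Jeong Yeol Yang,Sung Won Han,Eunwoo Song
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted at Interspeech 2025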
Abstract:We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than the conventional FM- and score-based approaches, respectively.
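[AI-124] Latent Noise Injection for Private and Statistically Aligned Synthetic Data Generation
Quick read: This paper addresses the slow convergence of synthetic data generation for privacy-preserving statistical analysis in high-dimensional settings, where standard generative-model approaches such as normalizing flows often converge more slowly than the canonical $1/\sqrt{n}$ rate. The core of the solution is a latent noise injection method based on Masked Autoregressive Flows (MAF): each data point is perturbed in the latent space and mapped back to the data domain, which preserves a one-to-one correspondence between observed and synthetic data and yields synthetic data that more closely reflect the underlying distribution. The method also satisfies local $(\epsilon, \delta)$-differential privacy and uses a single perturbation parameter to control the privacy-utility trade-off.
Link: https://arxiv.org/abs/2506.16636
Authors: Rex Shen,Lu Tian
Affiliation: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: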
Abstract:Synthetic Data Generation has become essential for scalable, privacy-preserving statistical analysis. While standard approaches based on generative models, such as Normalizing Flows, have been widely used, they often suffer from slow convergence in high-dimensional settings, frequently converging more slowly than the canonical $1/\sqrt{n}$ rate when approximating the true data distribution. To overcome these limitations, we propose a Latent Noise Injection method using Masked Autoregressive Flows (MAF). Instead of directly sampling from the trained model, our method perturbs each data point in the latent space and maps it back to the data domain. This construction preserves a one-to-one correspondence between observed and synthetic data, enabling synthetic outputs that closely reflect the underlying distribution, particularly in challenging high-dimensional regimes where traditional sampling struggles. Our procedure satisfies local $(\epsilon, \delta)$-differential privacy and introduces a single perturbation parameter to control the privacy-utility trade-off. Although estimators based on individual synthetic datasets may converge slowly, we show both theoretically and empirically that aggregating across $K$ studies in a meta-analysis framework restores classical efficiency and yields consistent, reliable inference. We demonstrate that with a well-calibrated perturbation parameter, Latent Noise Injection achieves strong statistical alignment with the original data and robustness against membership inference attacks. These results position our method as a compelling alternative to conventional flow-based sampling for synthetic data sharing in decentralized and privacy-sensitive domains, such as biomedical research.
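The encode, perturb, decode pattern described in this abstract can be sketched on one-dimensional data. Everything specific here is an assumption for illustration: plain standardization stands in for the trained Masked Autoregressive Flow, the mixing rule with weight `w` is one variance-preserving choice rather than the paper's stated formula, and no privacy guarantee is claimed for this toy.

```python
import math
import random


def latent_noise_injection(data, w, seed=0):
    """Toy latent-space perturbation on 1-D data.

    Each point is mapped to a latent value, mixed with fresh Gaussian noise
    using weight w in [0, 1], and mapped back, so every synthetic point
    corresponds to exactly one observed point.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    rng = random.Random(seed)
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    if sigma == 0.0:
        raise ValueError("data must not be constant")
    synthetic = []
    for x in data:
        z = (x - mu) / sigma                      # stand-in forward map
        eps = rng.gauss(0.0, 1.0)                 # fresh latent noise
        z_tilde = math.sqrt(1.0 - w) * z + math.sqrt(w) * eps
        synthetic.append(mu + sigma * z_tilde)    # stand-in inverse map
    return synthetic
```

With `w = 0` the toy returns the observed data unchanged, and with `w = 1` the output no longer depends on the individual latent values, which mirrors the role of a single parameter trading utility against protection.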
[AI-125] Category-based Galaxy Image Generation via Diffusion Models
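Quick read: This paper addresses the dependence of conventional galaxy generation methods on physical assumptions and parameter tuning, as well as the limited quality, diversity, and physical consistency of existing data-driven generative models. The core of the solution is GalCatDiff, the first framework in astronomy to use both galaxy image features and astrophysical properties in the network design of diffusion models. It combines an enhanced U-Net with a novel Astro-RAB (Residual Attention Block) that dynamically combines attention mechanisms with convolution operations to ensure global consistency and local feature fidelity, and it uses category embeddings for efficient, physically consistent class-specific generation.
Link: https://arxiv.org/abs/2506.16255
Authors: Xingzhong Fan,Hongming Tang,Yue Zeng,M.B.N.Kouwenhoven,Guangquan Zeng
Affiliation: unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures. Submitted to AAS Astronomical Journal (AJ) and is under revision. See another independent work for further reference: Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation (Ma, Sun et al.). Comments are welcome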
Abstract:Conventional galaxy generation methods rely on semi-analytical models and hydrodynamic simulations, which are highly dependent on physical assumptions and parameter tuning. In contrast, data-driven generative models do not have explicit physical parameters pre-determined, and instead learn them efficiently from observational data, making them alternative solutions to galaxy generation. Among these, diffusion models outperform Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) in quality and diversity. Incorporating physical prior knowledge into these models can further enhance their capabilities. In this work, we present GalCatDiff, the first framework in astronomy to leverage both galaxy image features and astrophysical properties in the network design of diffusion models. GalCatDiff incorporates an enhanced U-Net and a novel block entitled Astro-RAB (Residual Attention Block), which dynamically combines attention mechanisms with convolution operations to ensure global consistency and local feature fidelity. Moreover, GalCatDiff uses category embeddings for class-specific galaxy generation, avoiding the high computational costs of training separate models for each category. Our experimental results demonstrate that GalCatDiff significantly outperforms existing methods in terms of the consistency of sample color and size distributions, and the generated galaxies are both visually realistic and physically consistent. This framework will enhance the reliability of galaxy simulations and can potentially serve as a data augmentor to support future galaxy classification algorithm development.
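[AI-126] CP2: Leveraging Geometry for Conformal Prediction via Canonicalization UAI2025
Quick read: This paper addresses the loss of practicality of conformal prediction (CP) under geometric data shifts: when data samples undergo transformations such as rotations or flips, CP's coverage guarantee and the model's performance are affected. The core of the solution is to integrate geometric information, such as geometric pose, into the conformal procedure to reinstate its guarantees and make it robust to geometric shifts, using pose canonicalization as the information extractor.
Link: https://arxiv.org/abs/2506.16189
Authors: Putri A. van der Linden,Alexander Timans,Erik J. Bekkers
Affiliation: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, 9 tables (including appendix); published at UAI 2025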
Abstract:We study the problem of conformal prediction (CP) under geometric data shifts, where data samples are susceptible to transformations such as rotations or flips. While CP endows prediction models with post-hoc uncertainty quantification and formal coverage guarantees, its practicality breaks down under distribution shifts that deteriorate model performance. To address this issue, we propose integrating geometric information, such as geometric pose, into the conformal procedure to reinstate its guarantees and ensure robustness under geometric shifts. In particular, we explore recent advancements on pose canonicalization as a suitable information extractor for this purpose. Evaluating the combined approach across discrete and continuous shifts and against equivariant and augmentation-based baselines, we find that integrating geometric information with CP yields a principled way to address geometric shifts while maintaining broad applicability to black-box predictors.
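For readers unfamiliar with the conformal procedure this abstract builds on, the standard split conformal recipe can be sketched as follows: compute nonconformity scores on a held-out calibration set, take a finite-sample-corrected quantile, and keep every label whose score does not exceed it. The canonicalization step from the paper is not shown, and the function names are illustrative.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Threshold for split conformal prediction at miscoverage level alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score;
    returns infinity when the calibration set is too small for that rank.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Labels whose nonconformity score is at most the threshold."""
    return {label for label, score in label_scores.items() if score <= threshold}
```

The coverage guarantee of this recipe relies on calibration and test data being exchangeable, which is the assumption that geometric shifts such as rotations or flips can violate.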
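[AI-127] Unsupervised deep learning model for fast energy layer pre-selection of delivery-efficient proton arc therapy plan optimization of nasopharyngeal carcinoma
Quick read: This paper addresses the computational cost of optimizing the energy layer (EL) sequence in proton arc therapy (PAT), which is hard to do efficiently because of the large number of possible energy layer transitions. The core of the solution is SPArcdl, an unsupervised deep learning framework built on a new data representation, the spot-count representation, and a UNet architecture trained to optimize a tri-objective function: maximizing target coverage, minimizing exposure of organs at risk (OARs), and reducing energy switching time. The method delivers fast and effective EL pre-selection and significantly improves plan quality and delivery efficiency.
Link: https://arxiv.org/abs/2506.15803
Authors: Bohan Yang,Gang Liu,Rirao Dao,Yujia Qian,Ke Shi,Anke Tang,Yong Luo,Jingnan Liu
Affiliation: unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: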
Abstract:Objective. Proton arc therapy (PAT) is an emerging and promising modality in radiotherapy, offering several advantages over conventional intensity-modulated proton therapy (IMPT). However, identifying the optimal energy layer (EL) sequence remains computationally intensive due to the large number of possible energy layer transitions. This study proposes an unsupervised deep learning framework for fast and effective EL pre-selection, aiming to minimize energy layer switch time while preserving high plan quality. Approach. We introduce a novel data representation method, spot-count representation, which encodes the number of proton spots intersecting the target and organs at risk (OARs) in a matrix structured by sorted gantry angles and energy layers. This representation is the input of a UNet-based architecture, SPArcdl, which is trained to optimize a tri-objective function: maximizing target coverage, minimizing OAR exposure, and reducing energy switching time. The model is evaluated on 54 nasopharyngeal cancer cases, and its performance is benchmarked against plans generated by SPArc particle swarm. Main results. SPArcdl produces EL pre-selection that significantly improves both plan quality and delivery efficiency. Compared to SPArc particle swarm, it enhances the conformity index by 0.16 (p < 0.01), reduces the homogeneity index by 0.71 (p < 0.01), shortens the energy switching time by 38.4% (p < 0.01), and lowers the mean dose to brainstem by 0.21 (p < 0.01). The results also unexpectedly reveal that employing unchanged ELS is more time-efficient than descended ELS. SPArcdl's inference time is within 1 second. Significance. SPArcdl is a fast and effective tool for generating high-quality PAT plans by strategically pre-selecting energy layers to reduce delivery time while maintaining excellent dosimetric performance.
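[AI-128] TRUST: Transparent, Robust and Ultra-Sparse Trees
Quick read: This paper addresses the gap in predictive accuracy between traditional interpretable models such as CART and black-box models such as Random Forest, while preserving interpretability. The core of the solution is TRUST (Transparent, Robust, and Ultra-Sparse Trees), a model that combines the accuracy of Random Forests with the interpretability of shallow decision trees and sparse linear models, and that uses large language models to generate user-friendly explanations, substantially improving transparency while retaining high predictive accuracy.
Link: https://arxiv.org/abs/2506.15791
Authors: Albert Dorador
Affiliation: unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: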
Abstract:Piecewise-constant regression trees remain popular for their interpretability, yet often lag behind black-box models like Random Forest in predictive accuracy. In this work, we introduce TRUST (Transparent, Robust, and Ultra-Sparse Trees), a novel regression tree model that combines the accuracy of Random Forests with the interpretability of shallow decision trees and sparse linear models. TRUST further enhances transparency by leveraging Large Language Models to generate tailored, user-friendly explanations. Extensive validation on synthetic and real-world benchmark datasets demonstrates that TRUST consistently outperforms other interpretable models – including CART, Lasso, and Node Harvest – in predictive accuracy, while matching the accuracy of Random Forest and offering substantial gains in both accuracy and interpretability over M5’, a well-established model that is conceptually related.
Machine Learning
[LG-0] BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning
Link: https://arxiv.org/abs/2506.17211
Authors: Xuechen Zhang,Zijian Huang,Yingcong Li,Chenshun Ni,Jiasi Chen,Samet Oymak
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. The standard training approach combines a supervised fine-tuning (SFT) stage, often to distill capabilities of a larger model, followed by a reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Under a suitable theoretical model, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization has exponentially small likelihood of success. To address these, we introduce BREAD: a GRPO variant that unifies the SFT and RL stages via partial expert guidance and branched rollouts. When self-generated traces fail, BREAD adaptively inserts short expert prefixes/hints, allowing the small model to complete the rest of the reasoning path, and ensuring that each update includes at least one successful trace. This mechanism both densifies the reward signal and induces a natural learning curriculum. BREAD requires fewer than 40% of ground-truth traces, consistently outperforming standard GRPO while speeding up the training by about 3 times. Importantly, we demonstrate that BREAD helps the model solve problems that are otherwise unsolvable by the SFT + RL strategy, highlighting how branched rollouts and expert guidance can substantially boost SLM reasoning.
[LG-1] Optimal Implicit Bias in Linear Regression
Link: https://arxiv.org/abs/2506.17187
Authors: Kanumuri Nithin Varma,Babak Hassibi
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Most modern learning problems are over-parameterized, where the number of learnable parameters is much greater than the number of training data points. In this over-parameterized regime, the training loss typically has infinitely many global optima that completely interpolate the data with varying generalization performance. The particular global optimum we converge to depends on the implicit bias of the optimization algorithm. The question we address in this paper is, "What is the implicit bias that leads to the best generalization performance?" To find the optimal implicit bias, we provide a precise asymptotic analysis of the generalization performance of interpolators obtained from the minimization of convex functions/potentials for over-parameterized linear regression with non-isotropic Gaussian data. In particular, we obtain a tight lower bound on the best generalization error possible among this class of interpolators in terms of the over-parameterization ratio, the variance of the noise in the labels, the eigenspectrum of the data covariance, and the underlying distribution of the parameter to be estimated. Finally, we find the optimal convex implicit bias that achieves this lower bound under certain sufficient conditions involving the log-concavity of the distribution of a Gaussian convolved with the prior of the true underlying parameter.
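The class of interpolators discussed in this abstract can be written down compactly. The notation below is an assumption for illustration (data matrix $X \in \mathbb{R}^{n \times d}$ with $n < d$ and full row rank, labels $y$, convex potential $\psi$); the closed form on the right is the familiar minimum-$\ell_2$-norm special case, not the optimal potential derived in the paper.

```latex
\hat{w}_\psi \;=\; \arg\min_{w \in \mathbb{R}^d} \psi(w)
\quad \text{subject to} \quad Xw = y,
\qquad
\psi(w) = \tfrac{1}{2}\,\|w\|_2^2
\;\Longrightarrow\;
\hat{w}_\psi = X^\top \left( X X^\top \right)^{-1} y .
```

Different choices of $\psi$ select different interpolating solutions of the same linear system, which is why the generalization error depends on the potential even though every candidate fits the training data exactly.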
[LG-2] Variational Learning of Disentangled Representations
Link: https://arxiv.org/abs/2506.17182
Authors: Yuli Slavutsky,Ozgur Beker,David Blei,Bianca Dumitrascu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Disentangled representations enable models to separate factors of variation that are shared across experimental conditions from those that are condition-specific. This separation is essential in domains such as biomedical data analysis, where generalization to new treatments, patients, or species depends on isolating stable biological signals from context-dependent effects. While extensions of the variational autoencoder (VAE) framework have been proposed to address this problem, they frequently suffer from leakage between latent representations, limiting their ability to generalize to unseen conditions. Here, we introduce DISCoVeR, a new variational framework that explicitly separates condition-invariant and condition-specific factors. DISCoVeR integrates three key components: (i) a dual-latent architecture that models shared and specific factors separately; (ii) two parallel reconstructions that ensure both representations remain informative; and (iii) a novel max-min objective that encourages clean separation without relying on handcrafted priors, while making only minimal assumptions. Theoretically, we show that this objective maximizes data likelihood while promoting disentanglement, and that it admits a unique equilibrium. Empirically, we demonstrate that DISCoVeR achieves improved disentanglement on synthetic datasets, natural images, and single-cell RNA-seq data. Together, these results establish DISCoVeR as a principled approach for learning disentangled representations in multi-condition settings.
[LG-3] Deep generative models as the probability transformation functions
Link: https://arxiv.org/abs/2506.17171
Authors: Vitalii Bondar,Vira Babenko,Roman Trembovetskyi,Yurii Korobeinyk,Viktoriya Dzyuba
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, accepted for publication in "ICIST 2025 Springer Proceedings"
Abstract:This paper introduces a unified theoretical perspective that views deep generative models as probability transformation functions. Despite the apparent differences in architecture and training methodologies among various types of generative models - autoencoders, autoregressive models, generative adversarial networks, normalizing flows, diffusion models, and flow matching - we demonstrate that they all fundamentally operate by transforming simple predefined distributions into complex target data distributions. This unifying perspective facilitates the transfer of methodological improvements between model architectures and provides a foundation for developing universal theoretical approaches, potentially leading to more efficient and effective generative modeling techniques.
[LG-4] Neural Polar Decoders for DNA Data Storage
Link: https://arxiv.org/abs/2506.17076
Authors: Ziv Aharoni,Henry D. Pfister
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Synchronization errors, such as insertions and deletions, present a fundamental challenge in DNA-based data storage systems, arising from both synthesis and sequencing noise. These channels are often modeled as insertion-deletion-substitution (IDS) channels, for which designing maximum-likelihood decoders is computationally expensive. In this work, we propose a data-driven approach based on neural polar decoders (NPDs) to design low-complexity decoders for channels with synchronization errors. The proposed architecture enables decoding over IDS channels with reduced complexity O(AN log N), where A is a tunable parameter independent of the channel. NPDs require only sample access to the channel and can be trained without an explicit channel model. Additionally, NPDs provide mutual information (MI) estimates that can be used to optimize input distributions and code design. We demonstrate the effectiveness of NPDs on both synthetic deletion and IDS channels. For deletion channels, we show that NPDs achieve near-optimal decoding performance and accurate MI estimation, with significantly lower complexity than trellis-based decoders. We also provide numerical estimates of the channel capacity for the deletion channel. We extend our evaluation to realistic DNA storage settings, including channels with multiple noisy reads and real-world Nanopore sequencing data. Our results show that NPDs match or surpass the performance of existing methods while using significantly fewer parameters than the state-of-the-art. These findings highlight the promise of NPDs for robust and efficient decoding in DNA data storage systems.
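To make the insertion-deletion-substitution (IDS) channel concrete, here is a minimal simulation sketch. It is not the authors' channel model: the function name `ids_channel`, the rule of at most one insertion per input symbol, and the uniform choice of inserted or substituted symbols are all assumptions made for illustration. It does show the "sample access" setting in the abstract, where a decoder only sees input/output pairs.

```python
import random


def ids_channel(seq, p_ins, p_del, p_sub, alphabet="ACGT", seed=None):
    """Pass a sequence through a toy IDS channel (hypothetical model).

    For every input symbol:
      * with probability p_ins, a random symbol is inserted before it;
      * then the symbol is deleted with probability p_del,
        substituted with probability p_sub, or transmitted unchanged.
    """
    rng = random.Random(seed)
    out = []
    for symbol in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(alphabet))
        r = rng.random()
        if r < p_del:
            continue  # deletion: the symbol is dropped
        if r < p_del + p_sub:
            # substitution: pick any symbol different from the original
            out.append(rng.choice([a for a in alphabet if a != symbol]))
        else:
            out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    sent = "ACGTACGTACGT"
    received = ids_channel(sent, p_ins=0.05, p_del=0.1, p_sub=0.05, seed=1)
    print(sent, "->", received)
```

Because the output length varies from read to read, the receiver cannot align symbols position by position, which is what makes maximum-likelihood decoding expensive on such channels.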
[LG-5] Empowering Near-Field Communications in Low-Altitude Economy with LLM: Fundamentals, Potentials, Solutions, and Future Directions
Link: https://arxiv.org/abs/2506.17067
Authors: Zhuo Xu, Tianyue Zheng, Linglong Dai
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Abstract:The low-altitude economy (LAE) is gaining significant attention from academia and industry. Fortunately, LAE naturally aligns with near-field communications in extremely large-scale MIMO (XL-MIMO) systems. By leveraging near-field beamfocusing, LAE can precisely direct beam energy to unmanned aerial vehicles, while the additional distance dimension boosts overall spectrum efficiency. However, near-field communications in LAE still face several challenges, such as the increase in signal processing complexity and the necessity of distinguishing between far-field and near-field users. Inspired by the powerful ability of large language models (LLMs) to handle complex problems, we apply LLMs to address the challenges of near-field communications in LAE. The objective of this article is to provide a comprehensive analysis and discussion of LLM-empowered near-field communications in LAE. Specifically, we first introduce the fundamentals of LLMs and near-field communications, including the key advantages of LLMs and the key characteristics of near-field communications. Then, we reveal the opportunities and challenges of near-field communications in LAE. To address these challenges, we present an LLM-based scheme for near-field communications in LAE, and provide a case study which jointly distinguishes far-field and near-field users and designs the multi-user precoding matrix. Finally, we outline and highlight several future research directions and open issues.
[LG-6] Client Selection Strategies for Federated Semantic Communications in Heterogeneous IoT Networks
Link: https://arxiv.org/abs/2506.17063
Authors: Samer Lahoud, Kinda Khawam
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Abstract:The exponential growth of IoT devices presents critical challenges in bandwidth-constrained wireless networks, particularly regarding efficient data transmission and privacy preservation. This paper presents a novel federated semantic communication (SC) framework that enables collaborative training of bandwidth-efficient models for image reconstruction across heterogeneous IoT devices. By leveraging SC principles to transmit only semantic features, our approach dramatically reduces communication overhead while preserving reconstruction quality. We address the fundamental challenge of client selection in federated learning environments where devices exhibit significant disparities in dataset sizes and data distributions. Our framework implements three distinct client selection strategies that explore different trade-offs between system performance and fairness in resource allocation. The system employs an end-to-end SC architecture with semantic bottlenecks, coupled with a loss-based aggregation mechanism that naturally adapts to client heterogeneity. Experimental evaluation on image data demonstrates that while Utilitarian selection achieves the highest reconstruction quality, Proportional Fairness maintains competitive performance while significantly reducing participation inequality and improving computational efficiency. These results establish that federated SC can successfully balance reconstruction quality, resource efficiency, and fairness in heterogeneous IoT deployments, paving the way for sustainable and privacy-preserving edge intelligence applications.
[LG-7] Universal Music Representations? Evaluating Foundation Models on World Music Corpora
Link: https://arxiv.org/abs/2506.17055
Authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025
Abstract:Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize across diverse musical traditions. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora spanning Western popular, Greek, Turkish, and Indian classical traditions. We employ three complementary methodologies to investigate these models’ cross-cultural capabilities: probing to assess inherent representations, targeted supervised fine-tuning of 1-2 layers, and multi-label few-shot learning for low-resource scenarios. Our analysis shows varying cross-cultural generalization, with larger models typically outperforming on non-Western music, though results decline for culturally distant traditions. Notably, our approaches achieve state-of-the-art performance on five out of six evaluated datasets, demonstrating the effectiveness of foundation models for world music understanding. We also find that our targeted fine-tuning approach does not consistently outperform probing across all settings, suggesting foundation models already encode substantial musical knowledge. Our evaluation framework and benchmarking results contribute to understanding how far current models are from achieving universal music representations while establishing metrics for future progress.
[LG-8] Navigating the Deep: Signature Extraction on Deep Neural Networks
Link: https://arxiv.org/abs/2506.17047
Authors: Haolin Liu, Adrien Siproudhis, Samuel Experton, Peter Lorenz, Christina Boura, Thomas Peyrin
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 26 pages
Abstract:Neural network model extraction has emerged in recent years as an important security concern, as adversaries attempt to recover a network’s parameters via black-box queries. A key step in this process is signature extraction, which aims to recover the absolute values of the network’s weights layer by layer. Prior work, notably by Carlini et al. (2020), introduced a technique inspired by differential cryptanalysis to extract neural network parameters. However, their method suffers from several limitations that restrict its applicability to networks with a few layers only. Later works focused on improving sign extraction, but largely relied on the assumption that signature extraction itself was feasible. In this work, we revisit and refine the signature extraction process by systematically identifying and addressing for the first time critical limitations of Carlini et al.'s signature extraction method. These limitations include rank deficiency and noise propagation from deeper layers. To overcome these challenges, we propose efficient algorithmic solutions for each of the identified issues, greatly improving the efficiency of signature extraction. Our approach permits the extraction of much deeper networks than was previously possible. We validate our method through extensive experiments on ReLU-based neural networks, demonstrating significant improvements in extraction depth and accuracy. For instance, our extracted network matches the target network on at least 95% of the input space for each of the eight layers of a neural network trained on the CIFAR-10 dataset, while previous works could barely extract the first three layers. Our results represent a crucial step toward practical attacks on larger and more complex neural network architectures. 
[LG-9] Critical Appraisal of Fairness Metrics in Clinical Predictive AI
Link: https://arxiv.org/abs/2506.17035
Authors: João Matos, Ben Van Calster, Leo Anthony Celi, Paula Dhiman, Judy Wawira Gichoya, Richard D. Riley, Chris Russell, Sara Khalid, Gary S. Collins
Categories: Machine Learning (cs.LG)
Comments: 32 pages, 1 figure, 2 tables, 5 boxes, 4 linked supplementary materials
Abstract:Predictive artificial intelligence (AI) offers an opportunity to improve clinical practice and patient outcomes, but risks perpetuating biases if fairness is inadequately addressed. However, the definition of “fairness” remains unclear. We conducted a scoping review to identify and critically appraise fairness metrics for clinical predictive AI. We defined a “fairness metric” as a measure quantifying whether a model discriminates (societally) against individuals or groups defined by sensitive attributes. We searched five databases (2014-2024), screening 820 records, to include 41 studies, and extracted 62 fairness metrics. Metrics were classified by performance-dependency, model output level, and base performance metric, revealing a fragmented landscape with limited clinical validation and overreliance on threshold-dependent measures. Eighteen metrics were explicitly developed for healthcare, including only one clinical utility metric. Our findings highlight conceptual challenges in defining and quantifying fairness and identify gaps in uncertainty quantification, intersectionality, and real-world applicability. Future work should prioritise clinically meaningful metrics.
[LG-10] Scalable and Reliable Multi-agent Reinforcement Learning for Traffic Assignment
Link: https://arxiv.org/abs/2506.17029
Authors: Leizhen Wang, Peibo Duan, Cheng Lyu, Zewen Wang, Zhiqiang He, Nan Zheng, Zhenliang Ma
Categories: Machine Learning (cs.LG)
Abstract:The evolution of metropolitan cities and the increase in travel demands impose stringent requirements on traffic assignment methods. Multi-agent reinforcement learning (MARL) approaches outperform traditional methods in modeling adaptive routing behavior without requiring explicit system dynamics, which is beneficial for real-world deployment. However, MARL frameworks face challenges in scalability and reliability when managing extensive networks with substantial travel demand, which limits their practical applicability in solving large-scale traffic assignment problems. To address these challenges, this study introduces MARL-OD-DA, a new MARL framework for the traffic assignment problem, which redefines agents as origin-destination (OD) pair routers rather than individual travelers, significantly enhancing scalability. Additionally, a Dirichlet-based action space with action pruning and a reward function based on the local relative gap are designed to enhance solution reliability and improve convergence efficiency. Experiments demonstrate that the proposed MARL framework effectively handles medium-sized networks with extensive and varied city-level OD demand, surpassing existing MARL methods. When implemented in the SiouxFalls network, MARL-OD-DA achieves better assignment solutions in 10 steps, with a relative gap that is 94.99% lower than that of conventional methods.
[LG-11] The Hidden Cost of an Image: Quantifying the Energy Consumption of AI Image Generation
Link: https://arxiv.org/abs/2506.17016
Authors: Giulia Bertazzini, Chiara Albisani, Daniele Baracchi, Dasara Shullani, Roberto Verdecchia
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract:With the growing adoption of AI image generation, in conjunction with the ever-increasing environmental resources demanded by AI, we are urged to answer a fundamental question: What is the environmental impact hidden behind each image we generate? In this research, we present a comprehensive empirical experiment designed to assess the energy consumption of AI image generation. Our experiment compares 17 state-of-the-art image generation models by considering multiple factors that could affect their energy consumption, such as model quantization, image resolution, and prompt length. Additionally, we consider established image quality metrics to study potential trade-offs between energy consumption and generated image quality. Results show that image generation models vary drastically in terms of the energy they consume, with up to a 46x difference. Image resolution affects energy consumption inconsistently, ranging from a 1.3x to 4.7x increase when doubling resolution. U-Net-based models tend to consume less than Transformer-based ones. Model quantization instead deteriorates the energy efficiency of most models, while prompt length and content have no statistically significant impact. Improving image quality does not always come at the cost of higher energy consumption, with some of the models producing the highest-quality images also being among the most energy-efficient ones.
[LG-12] Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators
Link: https://arxiv.org/abs/2506.17007
Authors: Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel
Categories: Machine Learning (cs.LG)
Abstract:A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: this https URL.
[LG-13] RocketStack: A level-aware deep recursive ensemble learning framework with exploratory feature fusion and model pruning dynamics
Link: https://arxiv.org/abs/2506.16965
Authors: Çağatay Demirel
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 32 pages, 1 graphical abstract, 7 figures, 9 tables, 2 supplementary figures
Abstract:Ensemble learning remains a cornerstone of machine learning, with stacking used to integrate predictions from multiple base learners through a meta-model. However, deep stacking remains rare, as most designs prioritize horizontal diversity over recursive depth due to model complexity, feature redundancy, and computational burden. To address these challenges, RocketStack, a level-aware recursive ensemble framework, is introduced and explored up to ten stacking levels, extending beyond prior architectures. The framework incrementally prunes weaker learners at each level, enabling deeper stacking without excessive complexity. To mitigate early performance saturation, mild Gaussian noise is added to out-of-fold (OOF) scores before pruning, and compared against strict OOF pruning. Further, both per-level and periodic feature compressions are explored using attention-based selection, the Simple, Fast, Efficient (SFE) filter, and autoencoders. Across 33 datasets (23 binary, 10 multi-class), linear-trend tests confirmed rising accuracy with depth in most variants, and the top-performing meta-model at each level increasingly outperformed the strongest standalone ensemble. In the binary subset, periodic SFE with mild OOF-score randomization reached 97.08% at level 10, 5.14% above the strict-pruning configuration, and cut runtime by 10.5% relative to no compression. In the multi-class subset, periodic attention selection reached 98.60% at level 10, exceeding the strongest baseline by 6.11%, while reducing runtime by 56.1% and feature dimensionality by 74% compared to no compression. These findings highlight mild randomization as an effective regularizer and periodic compression as a stabilizer. Echoing the design of multistage rockets in aerospace (prune, compress, propel), RocketStack achieves deep recursive ensembling with tractable complexity.
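The pruning step described in the abstract (rank learners by OOF score, optionally perturb the scores with mild Gaussian noise, keep the best) can be sketched as follows. This is a hypothetical illustration, not the RocketStack implementation: the function name `prune_learners`, the fixed `keep` count, and the `noise_std` parameter are assumptions, and the paper's actual pruning rule may differ.

```python
import random


def prune_learners(oof_scores, keep, noise_std=0.0, seed=0):
    """Return the names of the `keep` learners with the best (noisy) OOF scores.

    oof_scores: dict mapping learner name -> out-of-fold score (higher is better).
    noise_std:  standard deviation of the Gaussian noise added before ranking;
                0.0 reproduces strict OOF pruning.
    """
    rng = random.Random(seed)
    noisy = {
        name: score + rng.gauss(0.0, noise_std)
        for name, score in oof_scores.items()
    }
    ranked = sorted(noisy, key=noisy.get, reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    scores = {"rf": 0.912, "gbm": 0.915, "svm": 0.905, "knn": 0.880}
    print(prune_learners(scores, keep=2))                   # strict pruning
    print(prune_learners(scores, keep=2, noise_std=0.01))   # randomized pruning
```

With noise enabled, learners whose scores are nearly tied can swap places, so a slightly weaker but possibly complementary learner sometimes survives to the next level.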
[LG-14] RCNet: ΔΣ IADCs as Recurrent AutoEncoders
Link: https://arxiv.org/abs/2506.16903
Authors: Arnaud Verdant, William Guicquero, Jérôme Chossat
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Abstract:This paper proposes a deep learning model (RCNet) for Delta-Sigma (ΔΣ) ADCs. Recurrent Neural Networks (RNNs) can describe both modulators and filters. This analogy is applied to Incremental ADCs (IADC). High-end optimizers combined with full-custom losses are used to define additional hardware design constraints: quantized weights, signal saturation, temporal noise injection, and device area. Focusing on DC conversion, our early results demonstrate that SNR defined as an Effective Number Of Bits (ENOB) can be optimized under a certain hardware mapping complexity. The proposed RCNet succeeded in providing design tradeoffs in terms of SNR ( 13bit) versus area constraints ( 14pF total capacitor) at a given OSR (80 samples). Interestingly, it appears that the best RCNet architectures do not necessarily rely on high-order modulators, leveraging additional topology exploration degrees of freedom.
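For readers unfamiliar with expressing SNR as an Effective Number Of Bits, the standard textbook relation for an ideal converter driven by a full-scale sine wave is SNR_dB = 6.02 · ENOB + 1.76. The helper below applies that general relation; it is an assumption that the paper uses this exact definition, since the abstract focuses on DC conversion and may define ENOB differently.

```python
def enob_from_snr_db(snr_db: float) -> float:
    """Convert an SNR in dB to an effective number of bits (textbook relation)."""
    return (snr_db - 1.76) / 6.02


def snr_db_from_enob(bits: float) -> float:
    """Inverse relation: the SNR in dB corresponding to a given ENOB."""
    return 6.02 * bits + 1.76


if __name__ == "__main__":
    # Under this relation, 13 effective bits correspond to roughly 80 dB of SNR.
    print(round(snr_db_from_enob(13), 2))
```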
[LG-15] Optimal Depth of Neural Networks
Link: https://arxiv.org/abs/2506.16862
Authors: Qian Qi
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:Determining the optimal depth of a neural network is a fundamental yet challenging problem, typically resolved through resource-intensive experimentation. This paper introduces a formal theoretical framework to address this question by recasting the forward pass of a deep network, specifically a Residual Network (ResNet), as an optimal stopping problem. We model the layer-by-layer evolution of hidden representations as a sequential decision process where, at each layer, a choice is made between halting computation to make a prediction or continuing to a deeper layer for a potentially more refined representation. This formulation captures the intrinsic trade-off between accuracy and computational cost. Our primary theoretical contribution is a proof that, under a plausible condition of diminishing returns on the residual functions, the expected optimal stopping depth is provably finite, even in an infinite-horizon setting. We leverage this insight to propose a novel and practical regularization term, $\mathcal{L}_{\rm depth}$, that encourages the network to learn representations amenable to efficient, early exiting. We demonstrate the generality of our framework by extending it to the Transformer architecture and exploring its connection to continuous-depth models via free-boundary problems. Empirical validation on ImageNet confirms that our regularizer successfully induces the theoretically predicted behavior, leading to significant gains in computational efficiency without compromising, and in some cases improving, final model accuracy.
[LG-16] Anomaly Detection in Event-triggered Traffic Time Series via Similarity Learning
Link: https://arxiv.org/abs/2506.16855
Authors: Shaoyu Dou, Kai Yang, Yang Jiao, Chengbo Qiu, Kui Ren
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 14 figures. Published in IEEE Transactions on Dependable and Secure Computing. arXiv admin note: substantial text overlap with arXiv:2207.08159
Abstract:Time series analysis has achieved great success in cyber security such as intrusion detection and device identification. Learning similarities among multiple time series is a crucial problem since it serves as the foundation for downstream analysis. Due to the complex temporal dynamics of the event-triggered time series, it often remains unclear which similarity metric is appropriate for security-related tasks, such as anomaly detection and clustering. The overarching goal of this paper is to develop an unsupervised learning framework that is capable of learning similarities among a set of event-triggered time series. From the machine learning vantage point, the proposed framework harnesses the power of both hierarchical multi-resolution sequential autoencoders and the Gaussian Mixture Model (GMM) to effectively learn the low-dimensional representations from the time series. Finally, the obtained similarity measure can be easily visualized for the explanation. The proposed framework aspires to offer a stepping stone that gives rise to a systematic approach to model and learn similarities among a multitude of event-triggered time series. Through extensive qualitative and quantitative experiments, it is revealed that the proposed method outperforms state-of-the-art methods considerably.
[LG-17] Reward-Agnostic Prompt Optimization for Text-to-Image Diffusion Models
Link: https://arxiv.org/abs/2506.16853
Authors: Semin Kim, Yeonwoo Cha, Jaehoon Yoo, Seunghoon Hong
Categories: Machine Learning (cs.LG)
Comments: 28 pages, Under review
Abstract:We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptimal performance when applied to new prompt engineering scenarios involving different reward models. To address this limitation, we introduce RATTPO (Reward-Agnostic Test-Time Prompt Optimization), a flexible test-time optimization method applicable across various reward scenarios without modification. RATTPO iteratively searches for optimized prompts by querying large language models (LLMs) without requiring reward-specific task descriptions. Instead, it uses the optimization trajectory and a novel reward-aware feedback signal (termed a “hint”) as context. Empirical results demonstrate the versatility of RATTPO, effectively enhancing user prompts across diverse reward setups that assess various generation aspects, such as aesthetics, general human preference, or spatial relationships between objects. RATTPO surpasses other test-time search baselines in search efficiency, using up to 3.5 times less inference budget, and, given sufficient inference budget, achieves performance comparable to learning-based baselines that require reward-specific fine-tuning. The code is available at this https URL.
[LG-18] Soft decision trees for survival analysis
Link: https://arxiv.org/abs/2506.16846
Authors: Antonio Consolo, Edoardo Amaldi, Emilio Carrizosa
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:Decision trees are popular in survival analysis for their interpretability and ability to model complex relationships. Survival trees, which predict the timing of singular events using censored historical data, are typically built through heuristic approaches. Recently, there has been growing interest in globally optimized trees, where the overall tree is trained by minimizing the error function over all its parameters. We propose a new soft survival tree model (SST), with a soft splitting rule at each branch node, trained via a nonlinear optimization formulation amenable to decomposition. Since SSTs provide for every input vector a specific survival function associated to a single leaf node, they satisfy the conditional computation property and inherit the related benefits. SST and the training formulation combine flexibility with interpretability: any smooth survival function (parametric, semiparametric, or nonparametric) estimated through maximum likelihood can be used, and each leaf node of an SST yields a cluster of distinct survival functions which are associated to the data points routed to it. Numerical experiments on 15 well-known datasets show that SSTs, with parametric and spline-based semiparametric survival functions, trained using an adaptation of the node-based decomposition algorithm proposed by Consolo et al. (2024) for soft regression trees, outperform three benchmark survival trees in terms of four widely-used discrimination and calibration measures. SSTs can also be extended to consider group fairness.
[LG-19] FedFitTech: A Baseline in Federated Learning for Fitness Tracking
Link: https://arxiv.org/abs/2506.16840
Authors: Zeyneddin Oz, Shreyas Korde, Marius Bock, Kristof Van Laerhoven
Categories: Machine Learning (cs.LG)
Comments: This submission includes a total of 7 pages and 6 figures
Abstract:The rapid evolution of sensors and resource-efficient machine learning models has spurred the widespread adoption of wearable fitness tracking devices. Equipped with inertial sensors, such devices can continuously capture physical movements for fitness technology (FitTech), enabling applications from sports optimization to preventive healthcare. Traditional centralized learning approaches to detect fitness activities struggle with privacy concerns, regulatory constraints, and communication inefficiencies. In contrast, Federated Learning (FL) enables decentralized model training by communicating model updates rather than private wearable sensor data. Applying FL to FitTech presents unique challenges, such as data imbalance, lack of labelled data, heterogeneous user activity patterns, and trade-offs between personalization and generalization. To simplify research on FitTech in FL, we present the FedFitTech baseline, under the Flower framework, which is publicly available and widely used by both industry and academic researchers. Additionally, to illustrate its usage, this paper presents a case study that implements a system based on the FedFitTech baseline, incorporating a client-side early stopping strategy and comparing the results. For instance, this system allows wearable devices to optimize the trade-off between capturing common fitness activity patterns and preserving individuals’ nuances, thereby enhancing both the scalability and efficiency of privacy-aware fitness tracking applications. Results show that this reduces overall redundant communications by 13 percent, while maintaining the overall recognition performance at a negligible recognition cost of 1 percent. Thus, the FedFitTech baseline creates a foundation for a wide range of new research and development opportunities in FitTech, and it is available as open-source at: this https URL
[LG-20] Predicting New Research Directions in Materials Science using Large Language Models and Concept Graphs
Link: https://arxiv.org/abs/2506.16824
Authors: Thomas Marwitz, Alexander Colsmann, Ben Breitung, Christoph Brabec, Christoph Kirchlechner, Eva Blasco, Gabriel Cadilha Marques, Horst Hahn, Michael Hirtz, Pavel A. Levkin, Yolita M. Eggeler, Tobias Schlöder, Pascal Friederich
Categories: Machine Learning (cs.LG)
Abstract:Due to an exponential increase in published research articles, it is impossible for individual scientists to read all publications, even within their own research field. In this work, we investigate the use of large language models (LLMs) for the purpose of extracting the main concepts and semantic information from scientific abstracts in the domain of materials science to find links that were not noticed by humans and thus to suggest inspiring near/mid-term future research directions. We show that LLMs can extract concepts more efficiently than automated keyword extraction methods to build a concept graph as an abstraction of the scientific literature. A machine learning model is trained to predict emerging combinations of concepts, i.e. new research ideas, based on historical data. We demonstrate that integrating semantic concept information leads to an increased prediction performance. The applicability of our model is demonstrated in qualitative interviews with domain experts based on individualized model suggestions. We show that the model can inspire materials scientists in their creative thinking process by predicting innovative combinations of topics that have not yet been investigated.
[LG-21] Robust Group Anomaly Detection for Quasi-Periodic Network Time Series
Link: https://arxiv.org/abs/2506.16815
Authors: Kai Yang, Shaoyu Dou, Pan Luo, Xin Wang, H. Vincent Poor
Categories: Machine Learning (cs.LG)
Comments: Published in IEEE Transactions on Network Science and Engineering
Abstract:Many real-world multivariate time series are collected from a network of physical objects embedded with software, electronics, and sensors. The quasi-periodic signals generated by these objects often follow a similar repetitive and periodic pattern, but have variations in the period, and come in different lengths caused by timing (synchronization) errors. Given a multitude of such quasi-periodic time series, can we build machine learning models to identify those time series that behave differently from the majority of the observations? In addition, can the models help human experts to understand how the decision was made? We propose a sequence to Gaussian Mixture Model (seq2GMM) framework. The overarching goal of this framework is to identify unusual and interesting time series within a network time series database. We further develop a surrogate-based optimization algorithm that can efficiently train the seq2GMM model. Seq2GMM exhibits strong empirical performance on a plurality of public benchmark datasets, outperforming state-of-the-art anomaly detection techniques by a significant margin. We also theoretically analyze the convergence property of the proposed training algorithm and provide numerical results to substantiate our theoretical claims.
[LG-22] Exploring and Improving Initialization for Deep Graph Neural Networks: A Signal Propagation Perspective
Link: https://arxiv.org/abs/2506.16790
Authors: Senmiao Wang, Yupeng Chen, Yushun Zhang, Ruoyu Sun, Tian Ding
Categories: Machine Learning (cs.LG)
Comments: Published in TMLR (2025)
Abstract:Graph Neural Networks (GNNs) often suffer from performance degradation as the network depth increases. This paper addresses this issue by introducing initialization methods that enhance signal propagation (SP) within GNNs. We propose three key metrics for effective SP in GNNs: forward propagation, backward propagation, and graph embedding variation (GEV). While the first two metrics derive from classical SP theory, the third is specifically designed for GNNs. We theoretically demonstrate that a broad range of commonly used initialization methods for GNNs, which exhibit performance degradation with increasing depth, fail to control these three metrics simultaneously. To deal with this limitation, a direct exploitation of the SP analysis–searching for weight initialization variances that optimize the three metrics–is shown to significantly enhance the SP in deep GCNs. This approach is called Signal Propagation on Graph-guided Initialization (SPoGInit). Our experiments demonstrate that SPoGInit outperforms commonly used initialization methods on various tasks and architectures. Notably, SPoGInit enables performance improvements as GNNs deepen, which represents a significant advancement in addressing depth-related challenges and highlights the validity and effectiveness of the SP analysis framework.
[LG-23] Revisiting LoRA through the Lens of Parameter Redundancy: Spectral Encoding Helps ACL2025
链接: https://arxiv.org/abs/2506.16787
作者: Jiashun Cheng,Aochuan Chen,Nuo Chen,Ziqi Gao,Yuhan Li,Jia Li,Fugee Tsung
类目: Machine Learning (cs.LG)
*备注: 18 pages; Accepted to ACL 2025 Findings
Abstract:Low-Rank Adaptation (LoRA) has emerged as a prominent technique for fine-tuning large foundation models. Despite its successes, the substantial parameter redundancy, which limits the capacity and efficiency of LoRA, has been recognized as a bottleneck. In this work, we systematically investigate the impact of redundancy in fine-tuning LoRA and reveal that reducing density redundancy does not degrade expressiveness. Based on this insight, we introduce Spectral-encoding Low-Rank Adaptation (SeLoRA), which harnesses the robust expressiveness of spectral bases to re-parameterize LoRA from a sparse spectral subspace. Designed with simplicity, SeLoRA enables seamless integration with various LoRA variants for performance boosting, serving as a scalable plug-and-play framework. Extensive experiments substantiate that SeLoRA achieves greater efficiency with fewer parameters, delivering superior performance enhancements over strong baselines on various downstream tasks, including commonsense reasoning, math reasoning, and code generation.
[LG-24] IsoNet: Causal Analysis of Multimodal Transformers for Neuromuscular Gesture Classification
链接: https://arxiv.org/abs/2506.16744
作者: Eion Tyacke,Kunal Gupta,Jay Patel,Rui Li
类目: Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP)
*备注:
Abstract:Hand gestures are a primary output of the human motor system, yet the decoding of their neuromuscular signatures remains a bottleneck for basic neuroscience and assistive technologies such as prosthetics. Traditional human-machine interface pipelines rely on a single biosignal modality, but multimodal fusion can exploit complementary information from sensors. We systematically compare linear and attention-based fusion strategies across three architectures: a Multimodal MLP, a Multimodal Transformer, and a Hierarchical Transformer, evaluating performance on scenarios with unimodal and multimodal inputs. Experiments use two publicly available datasets: NinaPro DB2 (sEMG and accelerometer) and HD-sEMG 65-Gesture (high-density sEMG and force). Across both datasets, the Hierarchical Transformer with attention-based fusion consistently achieved the highest accuracy, surpassing the multimodal and best single-modality linear-fusion MLP baseline by over 10% on NinaPro DB2 and 3.7% on HD-sEMG. To investigate how modalities interact, we introduce an Isolation Network that selectively silences unimodal or cross-modal attention pathways, quantifying each group of token interactions’ contribution to downstream decisions. Ablations reveal that cross-modal interactions contribute approximately 30% of the decision signal across transformer layers, highlighting the importance of attention-driven fusion in harnessing complementary modality information. Together, these findings reveal when and how multimodal fusion enhances biosignal classification and also provide mechanistic insights into human muscle activities. The study should be beneficial in the design of sensor arrays for neurorobotic systems.
[LG-25] Optimism Without Regularization: Constant Regret in Zero-Sum Games
链接: https://arxiv.org/abs/2506.16736
作者: John Lazarsfeld,Georgios Piliouras,Ryann Sim,Stratis Skoulakis
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:This paper studies the optimistic variant of Fictitious Play for learning in two-player zero-sum games. While it is known that Optimistic FTRL – a regularized algorithm with a bounded stepsize parameter – obtains constant regret in this setting, we show for the first time that similar, optimal rates are also achievable without regularization: we prove for two-strategy games that Optimistic Fictitious Play (using any tiebreaking rule) obtains only constant regret, providing surprising new evidence on the ability of non-no-regret algorithms for fast learning in games. Our proof technique leverages a geometric view of Optimistic Fictitious Play in the dual space of payoff vectors, where we show a certain energy function of the iterates remains bounded over time. Additionally, we also prove a regret lower bound of \Omega(\sqrt{T}) for Alternating Fictitious Play. In the unregularized regime, this separates the ability of optimism and alternation in achieving o(\sqrt{T}) regret.
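The unregularized dynamics described in this abstract can be illustrated with a minimal sketch. It assumes the common formulation of optimism in which each player best-responds to the cumulative payoff vector with the most recent payoff vector counted twice; the paper's exact definition and tiebreaking may differ. The function name `fictitious_play`, the first-index tiebreak, and the Matching Pennies example are illustrative choices, not taken from the paper.

```python
def fictitious_play(A, T, optimistic=True):
    """Run (Optimistic) Fictitious Play on a zero-sum game for T rounds.

    A[i][j] is the payoff to the row player (maximizer) when row i meets
    column j; the column player minimizes the same quantity.
    Returns (row_regret, col_regret) against the best fixed action in hindsight.
    Assumption: optimism adds the last observed payoff vector once more.
    """
    n, m = len(A), len(A[0])
    cum_row = [0.0] * n   # cumulative payoff of each row action vs. past columns
    cum_col = [0.0] * m   # cumulative payoff of each column action vs. past rows
    last_row = [0.0] * n  # most recent payoff vector seen by the row player
    last_col = [0.0] * m  # most recent payoff vector seen by the column player
    realized = 0.0        # payoff actually collected by the row player
    w = 1.0 if optimistic else 0.0

    for _ in range(T):
        score_row = [cum_row[k] + w * last_row[k] for k in range(n)]
        score_col = [cum_col[k] + w * last_col[k] for k in range(m)]
        # Ties are broken toward the lowest index (an arbitrary choice).
        i = max(range(n), key=score_row.__getitem__)
        j = min(range(m), key=score_col.__getitem__)

        realized += A[i][j]
        last_row = [A[k][j] for k in range(n)]
        last_col = [A[i][k] for k in range(m)]
        cum_row = [cum_row[k] + last_row[k] for k in range(n)]
        cum_col = [cum_col[k] + last_col[k] for k in range(m)]

    row_regret = max(cum_row) - realized
    col_regret = realized - min(cum_col)
    return row_regret, col_regret


# Matching Pennies: a two-strategy zero-sum game.
MATCHING_PENNIES = [[1, -1], [-1, 1]]

if __name__ == "__main__":
    for T in (100, 1000, 10000):
        opt = sum(fictitious_play(MATCHING_PENNIES, T, optimistic=True))
        std = sum(fictitious_play(MATCHING_PENNIES, T, optimistic=False))
        print(T, "optimistic:", opt, "standard:", std)
```

Running the script prints the total regret of both variants at several horizons, which makes it easy to compare how each one grows with T on this single example.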
[LG-26] DRARL: Disengagement-Reason-Augmented Reinforcement Learning for Efficient Improvement of Autonomous Driving Policy
链接: https://arxiv.org/abs/2506.16720
作者: Weitao Zhou,Bo Zhang,Zhong Cao,Xiang Li,Qian Cheng,Chunyang Liu,Yaqin Zhang,Diange Yang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:With the increasing presence of automated vehicles on open roads under driver supervision, disengagement cases are becoming more prevalent. While some data-driven planning systems attempt to directly utilize these disengagement cases for policy improvement, the inherent scarcity of disengagement data (often occurring as single instances) restricts training effectiveness. Furthermore, some disengagement data should be excluded since the disengagement may not always come from the failure of driving policies, e.g. the driver may casually intervene for a while. To this end, this work proposes disengagement-reason-augmented reinforcement learning (DRARL), which enhances the driving policy improvement process according to the reason for each disengagement case. Specifically, the reason for disengagement is identified by an out-of-distribution (OOD) state estimation model. When no such reason exists, the case is identified as a casual disengagement case, which doesn’t require additional policy adjustment. Otherwise, the policy can be updated under a reason-augmented imagination environment, improving the policy performance on disengagement cases with similar reasons. The method is evaluated using real-world disengagement cases collected by autonomous driving robotaxis. Experimental results demonstrate that the method accurately identifies policy-related disengagement reasons, allowing the agent to handle both original and semantically similar cases through reason-augmented training. Furthermore, the approach prevents the agent from becoming overly conservative after policy adjustments. Overall, this work provides an efficient way to improve driving policy performance with disengagement cases.
[LG-27] How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension
链接: https://arxiv.org/abs/2506.16704
作者: Cynthia Dwork,Lunjia Hu,Han Shao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study a fundamental question of domain generalization: given a family of domains (i.e., data distributions), how many randomly sampled domains do we need to collect data from in order to learn a model that performs reasonably well on every seen and unseen domain in the family? We model this problem in the PAC framework and introduce a new combinatorial measure, which we call the domain shattering dimension. We show that this dimension characterizes the domain sample complexity. Furthermore, we establish a tight quantitative relationship between the domain shattering dimension and the classic VC dimension, demonstrating that every hypothesis class that is learnable in the standard PAC setting is also learnable in our setting.
[LG-28] SIDE: Semantic ID Embedding for effective learning from sequences
链接: https://arxiv.org/abs/2506.16698
作者: Dinesh Ramasamy,Shakti Kumar,Chris Cadonic,Jiaxin Yang,Sohini Roychowdhury,Esam Abdel Rhman,Srihari Reddy
类目: Machine Learning (cs.LG)
*备注: 7 pages, 4 images, 6 tables
Abstract:Sequence-based recommendation models are driving the state-of-the-art for industrial ad-recommendation systems. Such systems typically deal with user histories or sequence lengths ranging in the order of O(10^3) to O(10^4) events. While adding embeddings at this scale is manageable in pre-trained models, incorporating them into real-time prediction models is challenging due to both storage and inference costs. To address this scaling challenge, we propose a novel approach that leverages vector quantization (VQ) to inject a compact Semantic ID (SID) as input to the recommendation models instead of a collection of embeddings. Our method builds on recent works of SIDs by introducing three key innovations: (i) a multi-task VQ-VAE framework, called VQ fusion, that fuses multiple content embeddings and categorical predictions into a single Semantic ID; (ii) a parameter-free, highly granular SID-to-embedding conversion technique, called SIDE, that is validated with two content embedding collections, thereby eliminating the need for a large parameterized lookup table; and (iii) a novel quantization method called Discrete-PCA (DPCA) which generalizes and enhances residual quantization techniques. The proposed enhancements, when applied to a large-scale industrial ads-recommendation system, achieve 2.4X improvement in normalized entropy (NE) gain and 3X reduction in data footprint compared to traditional SID methods.
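For readers unfamiliar with the residual quantization that the abstract says DPCA generalizes, the following is a minimal sketch of the plain technique, not of DPCA, VQ fusion, or SIDE themselves. The codebooks here are hand-picked toy values (in practice they would be learned), and the function names `encode` and `decode` are hypothetical.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))


def encode(codebooks, v):
    """Residual quantization: one code index per level.

    Each level quantizes whatever the previous levels failed to explain.
    The resulting tuple of indices plays the role of a compact Semantic ID.
    """
    residual = list(v)
    sid = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        sid.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(sid)


def decode(codebooks, sid):
    """Reconstruct an approximate embedding by summing the chosen codewords."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, k in zip(codebooks, sid):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Toy two-level codebooks (illustrative values only): a coarse level
# followed by a fine level that corrects the remaining residual.
CODEBOOKS = [
    [[0, 0], [4, 4], [4, -4]],
    [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]],
]

if __name__ == "__main__":
    embedding = [4.9, 4.2]
    sid = encode(CODEBOOKS, embedding)
    print("semantic id:", sid, "reconstruction:", decode(CODEBOOKS, sid))
```

Storing the short index tuple instead of the full embedding is what reduces the data footprint; the reconstruction quality depends entirely on how well the codebooks fit the data.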
[LG-29] Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections
链接: https://arxiv.org/abs/2506.16685
作者: Xiaomeng Xu,Yifan Hou,Zeyi Liu,Shuran Song
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:We address key challenges in Dataset Aggregation (DAgger) for real-world contact-rich manipulation: how to collect informative human correction data and how to effectively update policies with this new data. We introduce Compliant Residual DAgger (CR-DAgger), which contains two novel components: 1) a Compliant Intervention Interface that leverages compliance control, allowing humans to provide gentle, accurate delta action corrections without interrupting the ongoing robot policy execution; and 2) a Compliant Residual Policy formulation that learns from human corrections while incorporating force feedback and force control. Our system significantly enhances performance on precise contact-rich manipulation tasks using minimal correction data, improving base policy success rates by over 50% on two challenging tasks (book flipping and belt assembly) while outperforming both retraining-from-scratch and finetuning approaches. Through extensive real-world experiments, we provide practical guidance for implementing effective DAgger in real-world robot learning tasks. Result videos are available at: this https URL
[LG-30] The Hitchhiker's Guide to Efficient End-to-End and Tight DP Auditing
链接: https://arxiv.org/abs/2506.16666
作者: Meenatchi Sundaram Muthu Selva Annamalai,Borja Balle,Jamie Hayes,Georgios Kaissis,Emiliano De Cristofaro
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:This paper systematizes research on auditing Differential Privacy (DP) techniques, aiming to identify key insights into the current state of the art and open challenges. First, we introduce a comprehensive framework for reviewing work in the field and establish three cross-contextual desiderata that DP audits should target–namely, efficiency, end-to-end-ness, and tightness. Then, we systematize the modes of operation of state-of-the-art DP auditing techniques, including threat models, attacks, and evaluation functions. This allows us to highlight key details overlooked by prior work, analyze the limiting factors to achieving the three desiderata, and identify open research problems. Overall, our work provides a reusable and systematic methodology geared to assess progress in the field and identify friction points and future directions for our community to focus on.
[LG-31] Private Training Data Generation by Clustering Embeddings
链接: https://arxiv.org/abs/2506.16661
作者: Felix Zhou,Samson Zhou,Vahab Mirrokni,Alessandro Epasto,Vincent Cohen-Addad
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:Deep neural networks often use large, high-quality datasets to achieve high performance on many machine learning tasks. When training involves potentially sensitive data, this process can raise privacy concerns, as large models have been shown to unintentionally memorize and reveal sensitive information, including reconstructing entire training samples. Differential privacy (DP) provides a robust framework for protecting individual data and in particular, a new approach to privately training deep neural networks is to approximate the input dataset with a privately generated synthetic dataset, before any subsequent training algorithm. We introduce a novel principled method for DP synthetic image embedding generation, based on fitting a Gaussian Mixture Model (GMM) in an appropriate embedding space using DP clustering. Our method provably learns a GMM under separation conditions. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy on standard benchmark datasets. Additionally, we demonstrate that our method can generate realistic synthetic images that achieve downstream classification accuracy comparable to SOTA methods. Our method is quite general, as the encoder and decoder modules can be freely substituted to suit different tasks. It is also highly scalable, consisting only of subroutines that scale linearly with the number of samples and/or can be implemented efficiently in distributed systems.
[LG-32] Mesh-Informed Neural Operator: A Transformer Generative Approach
链接: https://arxiv.org/abs/2506.16656
作者: Yaozhong Shi,Zachary E. Ross,Domniki Asimaki,Kamyar Azizzadenesheli
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative models in function spaces, situated at the intersection of generative modeling and operator learning, are attracting increasing attention due to their immense potential in diverse scientific and engineering applications. While functional generative models are theoretically domain- and discretization-agnostic, current implementations heavily rely on the Fourier Neural Operator (FNO), limiting their applicability to regular grids and rectangular domains. To overcome these critical limitations, we introduce the Mesh-Informed Neural Operator (MINO). By leveraging graph neural operators and cross-attention mechanisms, MINO offers a principled, domain- and discretization-agnostic backbone for generative modeling in function spaces. This advancement significantly expands the scope of such models to more diverse applications in generative, inverse, and regression tasks. Furthermore, MINO provides a unified perspective on integrating neural operators with general advanced deep learning architectures. Finally, we introduce a suite of standardized evaluation metrics that enable objective comparison of functional generative models, addressing another critical gap in the field.
[LG-33] A Distributional-Lifting Theorem for PAC Learning COLT2025
链接: https://arxiv.org/abs/2506.16651
作者: Guy Blanc,Jane Lange,Carmen Strassle,Li-Yang Tan
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
*备注: COLT 2025
Abstract:The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a distributional-lifting theorem: This upgrades a learner that succeeds with respect to a limited distribution family \mathcal{D} to one that succeeds with respect to any distribution D^\star, with an efficiency overhead that scales with the complexity of expressing D^\star as a mixture of distributions in \mathcal{D}. Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting uniform-distribution learners and designed a lifter that uses a conditional sample oracle for D^\star, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns D^\star and then uses this information to lift. We show that their approach is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then take a different approach that sidesteps the need to learn D^\star, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.
[LG-34] Semantic Outlier Removal with Embedding Models and LLMs ACL2025
链接: https://arxiv.org/abs/2506.16644
作者: Eren Akbiyik,João Almeida,Rik Melis,Ritu Sriram,Viviana Petrescu,Vilhjálmur Vilhjálmsson
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) Industry Track, 10 pages
Abstract:Modern text processing pipelines demand robust methods to remove extraneous content while preserving a document’s core message. Traditional approaches such as HTML boilerplate extraction or keyword filters often fail in multilingual settings and struggle with context-sensitive nuances, whereas Large Language Models (LLMs) offer improved quality at high computational cost. We introduce SORE (Semantic Outlier Removal), a cost-effective, transparent method that leverages multilingual sentence embeddings and approximate nearest-neighbor search to identify and excise unwanted text segments. By first identifying core content via metadata embedding and then flagging segments that either closely match predefined outlier groups or deviate significantly from the core, SORE achieves near-LLM extraction precision at a fraction of the cost. Experiments on HTML datasets demonstrate that SORE outperforms structural methods and yields high precision in diverse scenarios. Our system is currently deployed in production, processing millions of documents daily across multiple languages while maintaining both efficiency and accuracy. To facilitate reproducibility and further research, we release our implementation and evaluation datasets.
[LG-35] Learning Causally Predictable Outcomes from Psychiatric Longitudinal Data
链接: https://arxiv.org/abs/2506.16629
作者: Eric V. Strobl
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: R code is available at this http URL
Abstract:Causal inference in longitudinal biomedical data remains a central challenge, especially in psychiatry, where symptom heterogeneity and latent confounding frequently undermine classical estimators. Most existing methods for treatment effect estimation presuppose a fixed outcome variable and address confounding through observed covariate adjustment. However, the assumption of unconfoundedness may not hold for a fixed outcome in practice. To address this foundational limitation, we directly optimize the outcome definition to maximize causal identifiability. Our DEBIAS (Durable Effects with Backdoor-Invariant Aggregated Symptoms) algorithm learns non-negative, clinically interpretable weights for outcome aggregation, maximizing durable treatment effects and empirically minimizing both observed and latent confounding by leveraging the time-limited direct effects of prior treatments in psychiatric longitudinal data. The algorithm also furnishes an empirically verifiable test for outcome unconfoundedness. DEBIAS consistently outperforms state-of-the-art methods in recovering causal effects for clinically interpretable composite outcomes across comprehensive experiments in depression and schizophrenia.
[LG-36] SlepNet: Spectral Subgraph Representation Learning for Neural Dynamics
链接: https://arxiv.org/abs/2506.16602
作者: Siddharth Viswanath,Rahul Singh,Yanlei Zhang,J. Adam Noah,Joy Hirsch,Smita Krishnaswamy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks have been useful in machine learning on graph-structured data, particularly for node classification and some types of graph classification tasks. However, they have had limited use in representing patterning of signals over graphs. Patterning of signals over graphs and in subgraphs carries important information in many domains including neuroscience. Neural signals are spatiotemporally patterned, high dimensional and difficult to decode. Graph signal processing and associated GCN models utilize the graph Fourier transform and are unable to efficiently represent spatially or spectrally localized signal patterning on graphs. Wavelet transforms have shown promise here, but offer non-canonical representations and cannot be tightly confined to subgraphs. Here we propose SlepNet, a novel GCN architecture that uses Slepian bases rather than graph Fourier harmonics. In SlepNet, the Slepian harmonics optimally concentrate signal energy on specifically relevant subgraphs that are automatically learned with a mask. Thus, they can produce canonical and highly resolved representations of neural activity, focusing energy of harmonics on areas of the brain which are activated. We evaluated SlepNet across three fMRI datasets, spanning cognitive and visual tasks, and two traffic dynamics datasets, comparing its performance against conventional GNNs and graph signal processing constructs. SlepNet outperforms the baselines in all datasets. Moreover, the representations of signal patterns extracted from SlepNet offer more resolution in distinguishing between similar patterns, and thus represent brain signaling transients as informative trajectories. We further show that these extracted trajectory representations can be used for other downstream untrained tasks. Thus we establish that SlepNet is useful both for prediction and representation learning in spatiotemporal data.
[LG-37] DRIVE Through the Unpredictability: From a Protocol Investigating Slip to a Metric Estimating Command Uncertainty
链接: https://arxiv.org/abs/2506.16593
作者: Nicolas Samson,William Larrivée-Hardy,William Dubois,Élie Roy-Brouard,Edith Brotherton,Dominic Baril,Julien Lépine,François Pomerleau
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This version is the preprint of a journal article with the same title, accepted in the IEEE Transactions on Field Robotics. To have a look at the early access version, use the following link this https URL
Abstract:Off-road autonomous navigation is a challenging task as it is mainly dependent on the accuracy of the motion model. Motion model performances are limited by their ability to predict the interaction between the terrain and the UGV, which an onboard sensor cannot directly measure. In this work, we propose using the DRIVE protocol to standardize the collection of data for system identification and characterization of the slip state space. We validated this protocol by acquiring a dataset with two platforms (from 75 kg to 470 kg) on six terrains (i.e., asphalt, grass, gravel, ice, mud, sand) for a total of 4.9 hours and 14.7 km. Using this data, we evaluate the DRIVE protocol’s ability to explore the velocity command space and identify the reachable velocities for terrain-robot interactions. We investigated the transfer function between the command velocity space and the resulting steady-state slip for an SSMR. An unpredictability metric is proposed to estimate command uncertainty and help assess risk likelihood and severity in deployment. Finally, we share our lessons learned on running system identification on large UGVs to help the community.
[LG-38] A Free Probabilistic Framework for Analyzing the Transformer-based Language Models
链接: https://arxiv.org/abs/2506.16550
作者: Swagatam Das
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We outline an operator-theoretic framework for analyzing transformer-based language models using the tools of free probability theory. By representing token embeddings and attention mechanisms as self-adjoint operators in a tracial probability space, we reinterpret attention as a non-commutative convolution and view the layer-wise propagation of representations as an evolution governed by free additive convolution. This formalism reveals a spectral dynamical system underpinning deep transformer stacks and offers insight into their inductive biases, generalization behavior, and entropy dynamics. We derive a generalization bound based on free entropy and demonstrate that the spectral trace of transformer layers evolves predictably with depth. Our approach bridges neural architecture with non-commutative harmonic analysis, enabling principled analysis of information flow and structural complexity in large language models.
[LG-39] Mr. Snuffleupagus at SemEval-2025 Task 4: Unlearning Factual Knowledge from LLMs Using Adaptive RMU SEMEVAL-2025
链接: https://arxiv.org/abs/2506.16548
作者: Arjun Dosajh,Mihika Sanghi
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, to be published in SemEval-2025
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their tendency to memorize training data raises concerns regarding privacy, copyright compliance, and security, particularly in cases involving Personally Identifiable Information (PII). Effective machine unlearning techniques are essential to mitigate these risks, yet existing methods remain underdeveloped for LLMs due to their open-ended output space. In this work, we apply the Adaptive Representation Misdirection Unlearning (RMU) technique to unlearn sensitive information from LLMs. Through extensive experiments, we analyze the effects of unlearning across different decoder layers to determine the most effective regions for sensitive information removal. Our technique ranked 4th on the official leaderboard of both 1B parameter and 7B parameter models.
[LG-40] Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic Semantic and NLI Approaches INTERSPEECH2025
链接: https://arxiv.org/abs/2506.16528
作者: Bornali Phukon,Xiuwen Zheng,Mark Hasegawa-Johnson
类目: Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, Interspeech 2025
Abstract:Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.
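As background for the abstract's criticism of error-based measures, here is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the reference length. This illustrates the baseline metric only, not the proposed NLI-based metric. The function name `word_error_rate` and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Computed with a standard Levenshtein dynamic program over words,
    keeping only the previous row of the table.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A repeated word leaves the meaning intact but still counts as an error.
    print(word_error_rate("i want water", "i want want water"))
```

The repeated word in the usage example is penalized as an insertion even though a listener would understand the sentence, which is the kind of mismatch between WER and intelligibility the abstract describes.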
[LG-41] Robust Reward Modeling via Causal Rubrics
链接: https://arxiv.org/abs/2506.16507
作者: Pragya Srivastava,Harman Singh,Rahul Madhavan,Gandharv Patil,Sravanti Addepalli,Arun Suggala,Rengarajan Aravamudhan,Soumya Sharma,Anirban Laha,Aravindan Raghuveer,Karthikeyan Shanmugam,Doina Precup
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues learned from correlations in training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce Crome (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model designed to mitigate reward hacking. Crome employs the following synthetic targeted augmentations during training: (1) Causal Augmentations, which are pairs that differ along specific causal attributes, to enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, which are tie-label pairs varying primarily in spurious attributes, to enforce invariance along spurious attributes. Notably, our augmentations are produced without any knowledge of spurious factors, via answer interventions only along causal rubrics, that are identified by querying an oracle LLM. Empirically, Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories. The robustness of Crome is further testified by the consistent gains obtained in a Best-of-N inference setting across increasing N, across various benchmarks, including the popular RewardBench (covering chat, chat-hard, safety, and reasoning tasks), the safety-focused WildGuardTest, and the reasoning-specific GSM8k.
[LG-42] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity ICML2025
链接: https://arxiv.org/abs/2506.16500
作者: Samir Khaki,Xiuyu Li,Junxian Guo,Ligeng Zhu,Chenfeng Xu,Konstantinos N. Plataniotis,Amir Yazdanbakhsh,Kurt Keutzer,Song Han,Zhijian Liu
类目: Machine Learning (cs.LG)
*备注: ICML 2025. The first three authors contributed equally to this work. Project page: this https URL
Abstract:Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. Also, we systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to 2.2 times and achieves a measured speedup of up to 1.6 times while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning, code generation, and instruction following.
[LG-43] Manifold Learning for Personalized and Label-Free Detection of Cardiac Arrhythmias
链接: https://arxiv.org/abs/2506.16494
作者: Amir Reza Vazifeh,Jason W. Fleischer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Electrocardiograms (ECGs) provide direct, non-invasive measurements of heart activity and are well-established tools for detecting and monitoring cardiovascular disease. However, manual ECG analysis can be time-consuming and prone to errors. Machine learning has emerged as a promising approach for automated heartbeat recognition and classification, but substantial variations in ECG signals make it challenging to develop generalizable models. ECG signals can vary widely across individuals and leads, while datasets often follow different labeling standards and may be biased, all of which greatly hinder supervised methods. Conventional unsupervised methods, e.g. principal component analysis, prioritize large (and often obvious) variances in the data and typically overlook subtle yet clinically relevant patterns. If labels are missing and/or variations are significant but small, both approaches fail. Here, we show that nonlinear dimensionality reduction (NLDR) can accommodate these issues and identify medically relevant features in ECG signals, with no need for training or prior information. Using the MLII and V1 leads of the MIT-BIH dataset, we demonstrate that t-distributed stochastic neighbor embedding and uniform manifold approximation and projection can discriminate individual recordings in mixed populations with = 90% accuracy and distinguish different arrhythmias in individual patients with a median accuracy of 98.96% and a median F1-score of 91.02%. The results show that NLDR holds much promise for cardiac monitoring, including the limiting cases of single-lead ECG and the current 12-lead standard of care, and for personalized health care beyond cardiology.
[LG-44] Black-Box Privacy Attacks on Shared Representations in Multitask Learning
Link: https://arxiv.org/abs/2506.16460
Authors: John Abascal,Nicolás Berrios,Alina Oprea,Jonathan Ullman,Adam Smith,Matthew Jagielski
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 30 pages, 8 figures
Abstract:Multitask learning (MTL) has emerged as a powerful paradigm that leverages similarities among multiple learning tasks, each with insufficient samples to train a standalone model, to solve them simultaneously while minimizing data sharing across users and organizations. MTL typically accomplishes this goal by learning a shared representation that captures common structure among the tasks by embedding data from all tasks into a common feature space. Despite being designed to be the smallest unit of shared information necessary to effectively learn patterns across multiple tasks, these shared representations can inadvertently leak sensitive information about the particular tasks they were trained on. In this work, we investigate what information is revealed by the shared representations through the lens of inference attacks. Towards this, we propose a novel, black-box task-inference threat model where the adversary, given the embedding vectors produced by querying the shared representation on samples from a particular task, aims to determine whether that task was present when training the shared representation. We develop efficient, purely black-box attacks on machine learning models that exploit the dependencies between embeddings from the same task without requiring shadow models or labeled reference data. We evaluate our attacks across vision and language domains for multiple use cases of MTL and demonstrate that even with access only to fresh task samples rather than training data, a black-box adversary can successfully infer a task’s inclusion in training. To complement our experiments, we provide theoretical analysis of a simplified learning setting and show a strict separation between adversaries with training samples and fresh samples from the target task’s distribution. 
[LG-45] An efficient neuromorphic approach for collision avoidance combining Stack-CNN with event cameras
Link: https://arxiv.org/abs/2506.16436
Authors: Antonio Giulio Coretti,Mattia Varile,Mario Edoardo Bertaina
Categories: Machine Learning (cs.LG)
Comments: 18th International Conference on Space Operations - Safety and sustainability of Space Operations (SSU)
Abstract:Space debris poses a significant threat, driving research into active and passive mitigation strategies. This work presents an innovative collision avoidance system utilizing event-based cameras - a novel imaging technology well-suited for Space Situational Awareness (SSA) and Space Traffic Management (STM). The system, employing a Stack-CNN algorithm (previously used for meteor detection), analyzes real-time event-based camera data to detect faint moving objects. Testing on terrestrial data demonstrates the algorithm’s ability to enhance signal-to-noise ratio, offering a promising approach for on-board space imaging and improving STM/SSA operations.
[LG-46] EFormer: An Effective Edge-based Transformer for Vehicle Routing Problems
Link: https://arxiv.org/abs/2506.16428
Authors: Dian Meng,Zhiguang Cao,Yaoxin Wu,Yaqing Hou,Hongwei Ge,Qiang Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent neural heuristics for the Vehicle Routing Problem (VRP) primarily rely on node coordinates as input, which may be less effective in practical scenarios where real cost metrics, such as edge-based distances, are more relevant. To address this limitation, we introduce EFormer, an Edge-based Transformer model that uses edge information as the sole input for VRPs. Our approach employs a precoder module with a mixed-score attention mechanism to convert edge information into temporary node embeddings. We also present a parallel encoding strategy characterized by a graph encoder and a node encoder, which process graph and node embeddings, respectively, in distinct feature spaces. This design yields a more comprehensive representation of the global relationships among edges. In the decoding phase, parallel context embedding and multi-query integration are used to compute separate attention mechanisms over the two encoded embeddings, facilitating efficient path construction. We train EFormer using reinforcement learning in an autoregressive manner. Extensive experiments on the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) reveal that EFormer outperforms established baselines on synthetic datasets, including large-scale and diverse distributions. Moreover, EFormer demonstrates strong generalization on real-world instances from TSPLib and CVRPLib. These findings confirm the effectiveness of EFormer’s core design in solving VRPs.
[LG-47] Generating Directed Graphs with Dual Attention and Asymmetric Encoding
Link: https://arxiv.org/abs/2506.16404
Authors: Alba Carballo-Castro,Manuel Madeira,Yiming Qin,Dorina Thanou,Pascal Frossard
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Directed graphs naturally model systems with asymmetric, ordered relationships, essential to applications in biology, transportation, social networks, and visual understanding. Generating such graphs enables tasks such as simulation, data augmentation and novel instance discovery; however, directed graph generation remains underexplored. We identify two key factors limiting progress in this direction: first, modeling edge directionality introduces a substantially larger dependency space, making the underlying distribution harder to learn; second, the absence of standardized benchmarks hinders rigorous evaluation. Addressing the former requires more expressive models that are sensitive to directional topologies. We propose Directo, the first generative model for directed graphs built upon the discrete flow matching framework. Our approach combines: (i) principled positional encodings tailored to asymmetric pairwise relations, (ii) a dual-attention mechanism capturing both incoming and outgoing dependencies, and (iii) a robust, discrete generative framework. To support evaluation, we introduce a benchmark suite covering synthetic and real-world datasets. It shows that our method performs strongly across diverse settings and even competes with specialized models for particular classes, such as directed acyclic graphs. Our results highlight the effectiveness and generality of our approach, establishing a solid foundation for future research in directed graph generation.
[LG-48] GoalLadder: Incremental Goal Discovery with Vision-Language Models
Link: https://arxiv.org/abs/2506.16396
Authors: Alexey Zakharov,Shimon Whiteson
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that can learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, GoalLadder, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent’s task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM’s feedback completely; instead, it uses it to rank potential goal states using an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, with an average final success rate of ~95% compared to only ~45% for the best competitor.
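The ranking idea in the abstract above (pairwise comparisons feeding an Elo-style rating so that a single noisy judgment cannot overturn the ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the standard Elo formula, the K-factor of 32, the initial rating of 1000, and the state names `s1` to `s3` are all assumptions chosen for illustration.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings, winner, loser, k=32.0):
    """Shift ratings after one pairwise comparison (k is an assumed K-factor)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_candidates(comparisons, k=32.0, initial=1000.0):
    """Return candidate states sorted from highest to lowest rating.

    comparisons: iterable of (preferred_state, other_state) pairs, e.g. the
    outcome of asking a (possibly noisy) judge which state shows more progress.
    """
    ratings = {}
    for winner, loser in comparisons:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        elo_update(ratings, winner, loser, k)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical feedback: s3 is usually preferred, but the fourth comparison
# is a noisy judgment that prefers s1 over s3.
comparisons = [("s3", "s1"), ("s3", "s2"), ("s2", "s1"), ("s1", "s3"), ("s3", "s2")]
top_goal = rank_candidates(comparisons)[0]
print(top_goal)
```

Because each comparison moves the ratings only by a bounded amount, the one contradictory judgment lowers the rating of `s3` without removing it from the top of the ranking.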
[LG-49] State-Space Kolmogorov Arnold Networks for Interpretable Nonlinear System Identification
Link: https://arxiv.org/abs/2506.16392
Authors: Gonçalo Granjal Cruz,Balazs Renczes,Mark C Runacres,Jan Decuyper
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted for IEEE Control Systems Letters
Abstract:While accurate, black-box system identification models lack interpretability of the underlying system dynamics. This paper proposes State-Space Kolmogorov-Arnold Networks (SS-KAN) to address this challenge by integrating Kolmogorov-Arnold Networks within a state-space framework. The proposed model is validated on two benchmark systems: the Silverbox and the Wiener-Hammerstein benchmarks. Results show that SS-KAN provides enhanced interpretability through sparsity-promoting regularization and the direct visualization of its learned univariate functions, which reveal system nonlinearities. This interpretability comes at the cost of accuracy when compared to state-of-the-art black-box models. The results highlight SS-KAN as a promising approach for interpretable nonlinear system identification that balances accuracy and interpretability of nonlinear system dynamics.
[LG-50] Classification of Cattle Behavior and Detection of Heat (Estrus) using Sensor Data
Link: https://arxiv.org/abs/2506.16380
Authors: Druva Dhakshinamoorthy,Avikshit Jha,Sabyasachi Majumdar,Devdulal Ghosh,Ranjita Chakraborty,Hena Ray
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 5 figures. Druva Dhakshinamoorthy and Avikshit Jha contributed equally as co-first authors. Work conducted during a summer internship at CDAC Kolkata by students of BITS Pilani
Abstract:This paper presents a novel system for monitoring cattle behavior and detecting estrus (heat) periods using sensor data and machine learning. We designed and deployed a low-cost Bluetooth-based neck collar equipped with accelerometer and gyroscope sensors to capture real-time behavioral data from real cows, which was synced to the cloud. A labeled dataset was created using synchronized CCTV footage to annotate behaviors such as feeding, rumination, lying, and others. We evaluated multiple machine learning models – Support Vector Machines (SVM), Random Forests (RF), and Convolutional Neural Networks (CNN) – for behavior classification. Additionally, we implemented a Long Short-Term Memory (LSTM) model for estrus detection using behavioral patterns and anomaly detection. Our system achieved over 93% behavior classification accuracy and 96% estrus detection accuracy on a limited test set. The approach offers a scalable and accessible solution for precision livestock monitoring, especially in resource-constrained environments.
[LG-51] Data-Driven Policy Mapping for Safe RL-based Energy Management Systems
Link: https://arxiv.org/abs/2506.16352
Authors: Theo Zangato,Aomar Osmani,Pegah Alizadeh
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Increasing global energy demand and renewable integration complexity have placed buildings at the center of sustainable energy management. We present a three-step reinforcement learning (RL)-based Building Energy Management System (BEMS) that combines clustering, forecasting, and constrained policy learning to address scalability, adaptability, and safety challenges. First, we cluster non-shiftable load profiles to identify common consumption patterns, enabling policy generalization and transfer without retraining for each new building. Next, we integrate an LSTM-based forecasting module to anticipate future states, improving the RL agents’ responsiveness to dynamic conditions. Lastly, domain-informed action masking ensures safe exploration and operation, preventing harmful decisions. Evaluated on real-world data, our approach reduces operating costs by up to 15% for certain building types, maintains stable environmental performance, and quickly classifies and optimizes new buildings with limited data. It also adapts to stochastic tariff changes without retraining. Overall, this framework delivers scalable, robust, and cost-effective building energy management.
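As a rough illustration of what action masking means in this setting, the sketch below zeroes out the probability of actions that a domain rule forbids before the agent samples. The abstract does not specify the masking rules, so the battery state-of-charge rule, the function names, and all numeric values here are hypothetical and serve only to show the general technique.

```python
import math


def battery_action_mask(soc, actions, capacity=1.0):
    """Hypothetical domain rule: an action (a change in state of charge)
    is allowed only if it keeps the battery between empty and full."""
    return [0.0 <= soc + a <= capacity for a in actions]


def masked_softmax(logits, allowed):
    """Softmax over logits in which disallowed actions get probability 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("no action is allowed in this state")
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Assumed example: battery at 10% charge; actions are discharge, idle, charge.
actions = [-0.2, 0.0, 0.2]
logits = [2.0, 0.5, 0.1]  # the raw policy prefers discharging
mask = battery_action_mask(0.1, actions)
probs = masked_softmax(logits, mask)
print(mask, probs)
```

Even though the unmasked policy favours discharging, the mask forces that action's probability to zero, so an unsafe action can be selected neither during exploration nor during operation.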
[LG-52] Bayesian Optimization over Bounded Domains with the Beta Product Kernel UAI2025
Link: https://arxiv.org/abs/2506.16316
Authors: Huy Hoang Nguyen,Han Zhou,Matthew B. Blaschko,Aleksei Tiulpin
Categories: Machine Learning (cs.LG)
Comments: Accepted as a conference paper at UAI 2025
Abstract:Bayesian optimization with Gaussian processes (GP) is commonly used to optimize black-box functions. The Matérn and the Radial Basis Function (RBF) covariance functions are used frequently, but they do not make any assumptions about the domain of the function, which may limit their applicability in bounded domains. To address the limitation, we introduce the Beta kernel, a non-stationary kernel induced by a product of Beta distribution density functions. Such a formulation allows our kernel to naturally model functions on bounded domains. We present statistical evidence supporting the hypothesis that the kernel exhibits an exponential eigendecay rate, based on empirical analyses of its spectral properties across different settings. Our experimental results demonstrate the robustness of the Beta kernel in modeling functions with optima located near the faces or vertices of the unit hypercube. The experiments show that our kernel consistently outperforms a wide range of kernels, including the well-known Matérn and RBF, in different problems, including synthetic function optimization and the compression of vision and language models.
[LG-53] Signatures to help interpretability of anomalies
Link: https://arxiv.org/abs/2506.16314
Authors: Emmanuel Gangler(1),Emille E. O. Ishida(1),Matwey V. Kornilov(2 and 3),Vladimir Korolev,Anastasia Lavrukhina(3),Konstantin Malanchev(4),Maria V. Pruzhinskaya(1 and 3),Etienne Russeil(1 and 5),Timofey Semenikhin(3 and 6),Sreevarsha Sreejith(7),Alina A. Volnova(8) ((1) Université Clermont Auvergne CNRS LPCA, Clermont-Ferrand, France, (2) National Research University Higher School of Economics, Moscow, Russia, (3) Sternberg Astronomical Institute Lomonosov Moscow State University, Moscow, Russia, (4) McWilliams Center for Cosmology and Astrophysics, Department of Physics, Carnegie Mellon University, Pittsburgh, PA, USA, (5) The Oskar Klein Centre Department of Astronomy, Stockholm University AlbaNova, Stockholm, Sweden, (6) Faculty of Physics, Lomonosov Moscow State University, Moscow, Russia, (7) Physics department, University of Surrey, Guildford, UK, (8) Space Research Institute of the Russian Academy of Sciences, Moscow, Russia)
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 7 pages, 3 figures, proceedings of the International Conference on Machine Learning for Astrophysics (ML4ASTRO2)
Abstract:Machine learning is often viewed as a black box when it comes to understanding its output, be it a decision or a score. Automatic anomaly detection is no exception to this rule, and quite often the astronomer is left to independently analyze the data in order to understand why a given event is tagged as an anomaly. We introduce here the idea of an anomaly signature, whose aim is to help the interpretability of anomalies by highlighting which features contributed to the decision.
[LG-54] Optimizing Multilingual Text-To-Speech with Accents & Emotions
Link: https://arxiv.org/abs/2506.16310
Authors: Pranav Pawar,Akshansh Dwivedi,Jenish Boricha,Himanshu Gohil,Aditya Dubey
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 12 pages, 8 figures
Abstract:While state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual environments, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions remains difficult owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent modelling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and Indian English accents. Our approach extends the Parler-TTS model by integrating a language-specific phoneme alignment hybrid encoder-decoder architecture and culture-sensitive emotion embedding layers trained on native speaker corpora, as well as incorporating dynamic accent code switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduction from 15.4% to 11.8%) and 85.3% emotion recognition accuracy from native listeners, surpassing METTS and VECL-TTS baselines. The novelty of the system is that it can mix codes in real time, generating statements such as “Namaste, let’s talk about Hindi phrase” with uninterrupted accent shifts while preserving emotional consistency. Subjective evaluation with 200 users reported a mean opinion score (MOS) of 4.2/5 for cultural correctness, much better than existing multilingual systems (p < 0.01). This research makes cross-lingual synthesis more feasible by showcasing scalable accent-emotion disentanglement, with direct application in South Asian EdTech and accessibility software.
[LG-55] Optimal Online Bookmaking for Any Number of Outcomes COLT
Link: https://arxiv.org/abs/2506.16253
Authors: Hadar Tal,Oron Sabag
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Optimization and Control (math.OC)
Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2025
Abstract:We study the Online Bookmaking problem, where a bookmaker dynamically updates betting odds on the possible outcomes of an event. In each betting round, the bookmaker can adjust the odds based on the cumulative betting behavior of gamblers, aiming to maximize profit while mitigating potential loss. We show that for any event and any number of betting rounds, in a worst-case setting over all possible gamblers and outcome realizations, the bookmaker’s optimal loss is the largest root of a simple polynomial. Our solution shows that bookmakers can be as fair as desired while avoiding financial risk, and the explicit characterization reveals an intriguing relation between the bookmaker’s regret and Hermite polynomials. We develop an efficient algorithm that computes the optimal bookmaking strategy: when facing an optimal gambler, the algorithm achieves the optimal loss, and in rounds where the gambler is suboptimal, it reduces the achieved loss to the optimal opportunistic loss, a notion that is related to subgame perfect Nash equilibrium. The key technical contribution to achieve these results is an explicit characterization of the Bellman-Pareto frontier, which unifies the dynamic programming updates for Bellman’s value function with the multi-criteria optimization framework of the Pareto frontier in the context of vector repeated games.
[LG-56] Active MRI Acquisition with Diffusion Guided Bayesian Experimental Design
Link: https://arxiv.org/abs/2506.16237
Authors: Jacopo Iollo,Geoffroy Oudoumanessah,Carole Lartizien,Michel Dojat,Florence Forbes
Categories: Machine Learning (cs.LG)
Comments:
Abstract:A key challenge in maximizing the benefits of Magnetic Resonance Imaging (MRI) in clinical settings is to accelerate acquisition times without significantly degrading image quality. This objective requires a balance between under-sampling the raw k-space measurements for faster acquisitions and gathering sufficient raw information for high-fidelity image reconstruction and analysis tasks. To achieve this balance, we propose to use sequential Bayesian experimental design (BED) to provide an adaptive and task-dependent selection of the most informative measurements. Measurements are sequentially augmented with new samples selected to maximize information gain on a posterior distribution over target images. Selection is performed via a gradient-based optimization of a design parameter that defines a subsampling pattern. In this work, we introduce a new active BED procedure that leverages diffusion-based generative models to handle the high dimensionality of the images and employs stochastic optimization to select among a variety of patterns while meeting the acquisition process constraints and budget. In doing so, we show that our setting can optimize not only standard image reconstruction but also any associated image analysis task. The versatility and performance of our approach are demonstrated on several MRI acquisitions.
[LG-57] Think Global, Act Local: Bayesian Causal Discovery with Language Models in Sequential Data
Link: https://arxiv.org/abs/2506.16234
Authors: Prakhar Verma,David Arbour,Sunav Choudhary,Harshita Chopra,Arno Solin,Atanu R. Sinha
Categories: Machine Learning (cs.LG)
Comments: 24 pages, preprint
Abstract:Causal discovery from observational data typically assumes full access to data and availability of domain experts. In practice, data often arrive in batches, and expert knowledge is scarce. Language Models (LMs) offer a surrogate but come with their own issues: hallucinations, inconsistencies, and bias. We present BLANCE (Bayesian LM-Augmented Causal Estimation), a hybrid Bayesian framework that bridges these gaps by adaptively integrating sequential batch data with noisy, LM-derived expert knowledge while accounting for both data-induced and LM-induced biases. Our proposed representation shift from Directed Acyclic Graph (DAG) to Partial Ancestral Graph (PAG) accommodates ambiguities within a coherent Bayesian framework, allowing the global LM knowledge to be grounded in local observational data. To guide LM interaction, we use a sequential optimization scheme that adaptively queries the most informative edges. Across varied datasets, BLANCE outperforms prior work in structural accuracy and extends to Bayesian parameter estimation, showing robustness to LM noise.
[LG-58] Malware Classification Leveraging NLP & Machine Learning for Enhanced Accuracy
Link: https://arxiv.org/abs/2506.16224
Authors: Bishwajit Prasad Gond,Rajneekant,Pushkar Kishore,Durga Prasad Mohapatra
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:This paper investigates the application of natural language processing (NLP)-based n-gram analysis and machine learning techniques to enhance malware classification. We explore how NLP can be used to extract and analyze textual features from malware samples through n-grams, i.e., contiguous string or API call sequences. This approach effectively captures distinctive linguistic patterns among malware and benign families, enabling finer-grained classification. We delve into n-gram size selection, feature representation, and classification algorithms. When evaluating our proposed method on real-world malware samples, we observe significantly improved accuracy compared to traditional methods. By implementing our n-gram approach, we achieved an accuracy of 99.02% across various machine learning algorithms, using a hybrid feature selection technique to address high dimensionality. This technique reduces the feature set to only 1.6% of the original features.
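To make the n-gram feature idea above concrete, here is a minimal sketch of extracting and counting contiguous n-grams from an API call sequence. It is not the paper's pipeline: the function names are hypothetical, the call trace is an invented illustration, and the feature selection and classification stages are omitted.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples.

    Sequences shorter than n yield an empty list.
    """
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def ngram_counts(sequence, n):
    """Count how often each n-gram occurs; the counts can serve as features."""
    return Counter(ngrams(sequence, n))


# Invented API call trace, for illustration only.
calls = ["CreateFile", "WriteFile", "CloseHandle", "CreateFile", "WriteFile"]
bigrams = ngram_counts(calls, 2)
print(bigrams.most_common(1))
```

The number of distinct n-grams grows quickly with n and with the size of the API vocabulary, which is why a feature selection step such as the one the abstract mentions is typically needed before classification.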
[LG-59] From Pixels to CSI: Distilling Latent Dynamics For Efficient Wireless Resource Management
Link: https://arxiv.org/abs/2506.16216
Authors: Charbel Bou Chaaya,Abanoub M. Girgis,Mehdi Bennis
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we aim to optimize the radio resource management of a communication system between a remote controller and its device, whose state is represented through image frames, without compromising the performance of the control task. We propose a novel machine learning (ML) technique to jointly model and predict the dynamics of the control system as well as the wireless propagation environment in latent space. Our method leverages two coupled joint-embedding predictive architectures (JEPAs): a control JEPA models the control dynamics and guides the predictions of a wireless JEPA, which captures the dynamics of the device’s channel state information (CSI) through cross-modal conditioning. We then train a deep reinforcement learning (RL) algorithm to derive a control policy from latent control dynamics and a power predictor to estimate scheduling intervals with favorable channel conditions based on latent CSI representations. As such, the controller minimizes the usage of radio resources by utilizing the coupled JEPA networks to imagine the device’s trajectory in latent space. We present simulation results on synthetic multimodal data and show that our proposed approach reduces transmit power by over 50% while maintaining control performance comparable to baseline methods that do not account for wireless optimization.
[LG-60] Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs ICML2025
Link: https://arxiv.org/abs/2506.16196
Authors: Xun Wang,Jing Xu,Franziska Boenisch,Michael Backes,Christopher A. Choquette-Choo,Adam Dziedzic
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML2025
Abstract:Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user’s token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally, optionally with differential privacy guarantees, and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.
[LG-61] Hallucination Level of Artificial Intelligence Whisperer: Case Speech Recognizing Pantterinousut Rap Song
Link: https://arxiv.org/abs/2506.16174
Authors: Ismo Horppu,Frederick Ayala,Erlin Gulbenkoglu
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 10 figures
Abstract:All languages are peculiar. Some of them are considered more challenging to understand than others. The Finnish language is known to be complex. Also, when languages are used by artists, the pronunciation and meaning might be trickier to understand. Therefore, we are putting AI to a fun, yet challenging trial: translating a Finnish rap song to text. We will compare the Faster Whisperer algorithm and YouTube’s internal speech-to-text functionality. The reference truth will be Finnish rap lyrics, which the main author’s little brother, Mc Timo, has written. Transcribing the lyrics will be challenging because the artist raps over synth music played by Syntikka Janne. The hallucination level and mishearing of AI speech-to-text extractions will be measured by comparing errors made against the original Finnish lyrics. The error function is informal but still works for our case.
[LG-62] Solving Zero-Sum Convex Markov Games ICML2025
Link: https://arxiv.org/abs/2506.16120
Authors: Fivos Kalogiannis,Emmanouil-Vasileios Vlatakis-Gkaragkounis,Ian Gemp,Georgios Piliouras
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: To appear in the Proceedings of the 2025 International Conference on Machine Learning (ICML 2025)
Abstract:We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) by using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite horizon. We follow a two-step approach. First, leveraging properties of hidden-convex–hidden-concave functions, we show that a simple nonconvex regularization transforms the min-max optimization problem into a nonconvex-proximal Polyak-Lojasiewicz (NC-pPL) objective. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address the general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.
[LG-63] Mitigating Over-Squashing in Graph Neural Networks by Spectrum-Preserving Sparsification ICML2025
Link: https://arxiv.org/abs/2506.16110
Authors: Langzhang Liang,Fanchen Bu,Zixing Song,Zenglin Xu,Shirui Pan,Kijung Shin
Categories: Machine Learning (cs.LG)
Comments: Published as a conference paper at ICML 2025
Abstract:The message-passing paradigm of Graph Neural Networks often struggles with exchanging information across distant nodes, typically due to structural bottlenecks in certain graph regions, a limitation known as over-squashing. To reduce such bottlenecks, graph rewiring, which modifies graph topology, has been widely used. However, existing graph rewiring techniques often overlook the need to preserve critical properties of the original graph, e.g., spectral properties. Moreover, many approaches rely on increasing edge count to improve connectivity, which introduces significant computational overhead and exacerbates the risk of over-smoothing. In this paper, we propose a novel graph rewiring method that leverages spectrum-preserving graph sparsification to mitigate over-squashing. Our method generates graphs with enhanced connectivity while maintaining sparsity and largely preserving the original graph spectrum, effectively balancing structural bottleneck reduction and graph property preservation. Experimental results validate the effectiveness of our approach, demonstrating its superiority over strong baseline methods in classification accuracy and retention of the Laplacian spectrum.
[LG-64] Investigating Lagrangian Neural Networks for Infinite Horizon Planning in Quadrupedal Locomotion
Link: https://arxiv.org/abs/2506.16079
Authors: Prakrut Kotecha,Aditya Shirwatkar,Shishir Kolathaya
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, Accepted at Advances in Robotics (AIR) Conference 2025
Abstract:Lagrangian Neural Networks (LNNs) present a principled and interpretable framework for learning the system dynamics by utilizing inductive biases. While traditional dynamics models struggle with compounding errors over long horizons, LNNs intrinsically preserve the physical laws governing any system, enabling accurate and stable predictions essential for sustainable locomotion. This work evaluates LNNs for infinite horizon planning in quadrupedal robots through four dynamics models: (1) full-order forward dynamics (FD) training and inference, (2) diagonalized representation of Mass Matrix in full order FD, (3) full-order inverse dynamics (ID) training with FD inference, (4) reduced-order modeling via torso centre-of-mass (CoM) dynamics. Experiments demonstrate that LNNs bring improvements in sample efficiency (10x) and superior prediction accuracy (up to 2-10x) compared to baseline methods. Notably, the diagonalization approach of LNNs reduces computational complexity while retaining some interpretability, enabling real-time receding horizon control. These findings highlight the advantages of LNNs in capturing the underlying structure of system dynamics in quadrupeds, leading to improved performance and efficiency in locomotion planning and control. Additionally, our approach achieves a higher control frequency than previous LNN methods, demonstrating its potential for real-world deployment on quadrupeds.
[LG-65] Joint User Priority and Power Scheduling for QoS-Aware WMMSE Precoding: A Constrained-Actor Attentive-Critic Approach
Link: https://arxiv.org/abs/2506.16074
Authors: Kexuan Wang,An Liu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:6G wireless networks are expected to support diverse quality-of-service (QoS) demands while maintaining high energy efficiency. Weighted Minimum Mean Square Error (WMMSE) precoding with fixed user priorities and transmit power is widely recognized for enhancing overall system performance but lacks flexibility to adapt to user-specific QoS requirements and time-varying channel conditions. To address this, we propose a novel constrained reinforcement learning (CRL) algorithm, Constrained-Actor Attentive-Critic (CAAC), which uses a policy network to dynamically allocate user priorities and power for WMMSE precoding. Specifically, CAAC integrates a Constrained Stochastic Successive Convex Approximation (CSSCA) method to optimize the policy, enabling more effective handling of energy efficiency goals and satisfaction of stochastic non-convex QoS constraints compared to traditional and existing CRL methods. Moreover, CAAC employs lightweight attention-enhanced Q-networks to evaluate policy updates without prior environment model knowledge. The network architecture not only enhances representational capacity but also boosts learning efficiency. Simulation results show that CAAC outperforms baselines in both energy efficiency and QoS satisfaction.
[LG-66] A Lightweight RL-Driven Deep Unfolding Network for Robust WMMSE Precoding in Massive MU-MIMO-OFDM Systems
Link: https://arxiv.org/abs/2506.16072
Authors: Kexuan Wang,An Liu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Weighted Minimum Mean Square Error (WMMSE) precoding is widely recognized for its near-optimal weighted sum rate performance. However, its practical deployment in massive multi-user (MU) multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) systems is hindered by the assumption of perfect channel state information (CSI) and high computational complexity. To address these issues, we first develop a wideband stochastic WMMSE (SWMMSE) algorithm that iteratively maximizes the ergodic weighted sum-rate (EWSR) under imperfect CSI. Building on this, we propose a lightweight reinforcement learning (RL)-driven deep unfolding (DU) network (RLDDU-Net), where each SWMMSE iteration is mapped to a network layer. Specifically, its DU module integrates approximation techniques and leverages beam-domain sparsity as well as frequency-domain subcarrier correlation, significantly accelerating convergence and reducing computational overhead. Furthermore, the RL module adaptively adjusts the network depth and generates compensation matrices to mitigate approximation errors. Simulation results under imperfect CSI demonstrate that RLDDU-Net outperforms existing baselines in EWSR performance while offering superior computational and convergence efficiency.
[LG-67] Floating-Point Neural Networks Are Provably Robust Universal Approximators
Link: https://arxiv.org/abs/2506.16065
Authors: Geonho Hwang,Wonyeol Lee,Yeachan Park,Sejun Park,Feras Saad
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Comments: 70 pages, 4 figures. Appearing in CAV 2025
Abstract:The classical universal approximation (UA) theorem for neural networks establishes mild conditions under which a feedforward neural network can approximate a continuous function f with arbitrary accuracy. A recent result shows that neural networks also enjoy a more general interval universal approximation (IUA) theorem, in the sense that the abstract interpretation semantics of the network using the interval domain can approximate the direct image map of f (i.e., the result of applying f to a set of inputs) with arbitrary accuracy. These theorems, however, rest on the unrealistic assumption that the neural network computes over infinitely precise real numbers, whereas their software implementations in practice compute over finite-precision floating-point numbers. An open question is whether the IUA theorem still holds in the floating-point setting. This paper introduces the first IUA theorem for floating-point neural networks that proves their remarkable ability to perfectly capture the direct image map of any rounded target function f, showing no limits exist on their expressiveness. Our IUA theorem in the floating-point setting exhibits material differences from the real-valued setting, which reflects the fundamental distinctions between these two computational models. This theorem also implies surprising corollaries, which include (i) the existence of provably robust floating-point neural networks; and (ii) the computational completeness of the class of straight-line programs that use only floating-point additions and multiplications for the class of all floating-point programs that halt.
[LG-68] From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
Link: https://arxiv.org/abs/2506.16051
Authors: Zhiwei Li,Carl Kesselman,Tran Huy Nguyen,Benjamin Yixing Xu,Kyle Bolo,Kimberley Yu
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
[LG-69] A Scalable Factorization Approach for High-Order Structured Tensor Recovery
Link: https://arxiv.org/abs/2506.16032
Authors: Zhen Qin,Michael B. Wakin,Zhihui Zhu
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments:
Abstract:Tensor decompositions, which represent an N-order tensor using approximately N factors of much smaller dimensions, can significantly reduce the number of parameters. This is particularly beneficial for high-order tensors, as the number of entries in a tensor grows exponentially with the order. Consequently, they are widely used in signal recovery and data analysis across domains such as signal processing, machine learning, and quantum physics. A computationally and memory-efficient approach to these problems is to optimize directly over the factors using local search algorithms such as gradient descent, a strategy known as the factorization approach in matrix and tensor optimization. However, the resulting optimization problems are highly nonconvex due to the multiplicative interactions between factors, posing significant challenges for convergence analysis and recovery guarantees. In this paper, we present a unified framework for the factorization approach to solving various tensor decomposition problems. Specifically, by leveraging the canonical form of tensor decompositions, in which most factors are constrained to be orthonormal to mitigate scaling ambiguity, we apply Riemannian gradient descent (RGD) to optimize these orthonormal factors on the Stiefel manifold. Under a mild condition on the loss function, we establish a Riemannian regularity condition for the factorized objective and prove that RGD converges to the ground-truth tensor at a linear rate when properly initialized. Notably, both the initialization requirement and the convergence rate scale polynomially rather than exponentially with N, improving upon existing results for Tucker and tensor-train format tensors.
[LG-70] Bridging Brain with Foundation Models through Self-Supervised Learning
Link: https://arxiv.org/abs/2506.16009
Authors: Hamdi Altaheri,Fakhri Karray,Md. Milon Islam,S M Taslim Uddin Raju,Amir-Hossein Karimi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Foundation models (FMs), powered by self-supervised learning (SSL), have redefined the capabilities of artificial intelligence, demonstrating exceptional performance in domains like natural language processing and computer vision. These advances present a transformative opportunity for brain signal analysis. Unlike traditional supervised learning, which is limited by the scarcity of labeled neural data, SSL offers a promising solution by enabling models to learn meaningful representations from unlabeled data. This is particularly valuable in addressing the unique challenges of brain signals, including high noise levels, inter-subject variability, and low signal-to-noise ratios. This survey systematically reviews the emerging field of bridging brain signals with foundation models through the innovative application of SSL. It explores key SSL techniques, the development of brain-specific foundation models, their adaptation to downstream tasks, and the integration of brain signals with other modalities in multimodal SSL frameworks. The review also covers commonly used evaluation metrics and benchmark datasets that support comparative analysis. Finally, it highlights key challenges and outlines future research directions. This work aims to provide researchers with a structured understanding of this rapidly evolving field and a roadmap for developing generalizable brain foundation models powered by self-supervision.
[LG-71] Data-Agnostic Cardinality Learning from Imperfect Workloads
Link: https://arxiv.org/abs/2506.16007
Authors: Peizhi Wu,Rong Kang,Tieying Zhang,Jianjun Chen,Ryan Marcus,Zachary G. Ives
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 14 pages. Technical Report (Extended Version)
Abstract:Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data access. Fortunately, query-driven cardinality estimation can learn CardEst models using query workloads. However, existing query-driven models often require access to data or summaries for best performance, and they assume perfect training workloads with complete and balanced join templates (or join graphs). Such assumptions rarely hold in real-world scenarios, in which join templates are incomplete and imbalanced. We present GRASP, a data-agnostic cardinality learning system designed to work under these real-world constraints. GRASP’s compositional design generalizes to unseen join templates and is robust to join template imbalance. It also introduces a new per-table CardEst model that handles value distribution shifts for range predicates, and a novel learned count sketch model that captures join correlations across base relations. Across three database instances, we demonstrate that GRASP consistently outperforms existing query-driven models on imperfect workloads, both in terms of estimation accuracy and query latency. Remarkably, GRASP achieves performance comparable to, or even surpassing, traditional approaches built over the underlying data on the complex CEB-IMDb-full benchmark – despite operating without any data access and using only 10% of all possible join templates.
[LG-72] On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond
Link: https://arxiv.org/abs/2506.15963
Authors: Jingyi Cui,Qi Zhang,Yifei Wang,Yisen Wang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs). They aim to decompose complex superposed polysemantic features into interpretable monosemantic ones through feature reconstruction via sparsely activated neural networks. Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, through theoretical analysis, we propose, for the first time, the necessary and sufficient conditions for identifiable SAEs (SAEs that learn unique and ground truth monosemantic features), including 1) extreme sparsity of the ground truth feature, 2) sparse activation of SAEs, and 3) enough hidden dimensions of SAEs. Moreover, when the identifiable conditions are not fully met, we propose a reweighting strategy to improve the identifiability. Specifically, following the theoretically suggested weight selection principle, we prove that the gap between the loss functions of SAE reconstruction and monosemantic feature reconstruction can be narrowed, so that the reweighted SAEs have better reconstruction of the ground truth monosemantic features than the uniformly weighted ones. In experiments, we validate our theoretical findings and show that our weighted SAE significantly improves feature monosemanticity and interpretability.
[LG-73] One Period to Rule Them All: Identifying Critical Learning Periods in Deep Networks
Link: https://arxiv.org/abs/2506.15954
Authors: Vinicius Yuiti Fukase,Heitor Gama,Barbara Bueno,Lucas Libanio,Anna Helena Reali Costa,Artur Jordao
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Critical learning periods are an important phenomenon in deep learning, where early epochs play a decisive role in the success of many training recipes, such as data augmentation. Existing works confirm the existence of this phenomenon and provide useful insights. However, the literature lacks efforts to precisely identify when critical periods occur. In this work, we fill this gap by introducing a systematic approach for identifying critical periods during the training of deep neural networks, focusing on eliminating computationally intensive regularization techniques and effectively applying mechanisms for reducing computational costs, such as data pruning. Our method leverages generalization prediction mechanisms to pinpoint critical phases where training recipes yield maximum benefits to the predictive ability of models. By halting resource-intensive recipes beyond these periods, we significantly accelerate the learning phase and achieve reductions in training time, energy consumption, and CO2 emissions. Experiments on standard architectures and benchmarks confirm the effectiveness of our method. Specifically, we achieve significant milestones by reducing the training time of popular architectures by up to 59.67%, leading to a 59.47% decrease in CO2 emissions and a 60% reduction in financial costs, without compromising performance. Our work enhances understanding of training dynamics and paves the way for more sustainable and efficient deep learning practices, particularly in resource-constrained environments. In the era of the race for foundation models, we believe our method emerges as a valuable framework. The repository is available at this https URL
[LG-74] On the optimal regret of collaborative personalized linear bandits
Link: https://arxiv.org/abs/2506.15943
Authors: Bruce Huang,Ruida Zhou,Lin F. Yang,Suhas Diggavi
Subjects: Machine Learning (cs.LG)
Comments: 30 pages, 4 figures
Abstract:Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward with expected value given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single-agent algorithms independently ignores cross-agent similarity and learning opportunities. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps with an optimal regret bound $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-\gamma}\sqrt{n})$, $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the range of $(0, \frac{d}{m\sigma^2})$, $[\frac{d}{m^{2\gamma}\sigma^2}, \frac{d}{\sigma^2}]$ and $(\frac{d}{\sigma^2}, \infty)$ respectively, where $\sigma$ measures the level of heterogeneity, $m$ is the number of agents, and $\gamma\in[0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound $O(dm\sqrt{n})$ at best.
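To make the three regimes in the abstract above concrete, the following is a minimal sketch that evaluates the stated rates for given values of n, d, m, sigma and gamma. It is a literal reading of the ranges quoted in the abstract, with constants and logarithmic factors dropped; the function names `regret_rate` and `independent_rate` are hypothetical and do not come from the paper.

```python
import math


def regret_rate(n, d, m, sigma, gamma):
    """Rate quoted in the abstract, ignoring constants and log factors.

    Returns (regime, rate). The rate is None when n falls outside the
    three ranges as they are literally stated in the abstract.
    """
    if not 0.0 <= gamma <= 0.5:
        raise ValueError("gamma must lie in [0, 1/2]")
    small = d / (m * sigma**2)                  # upper end of the first range
    mid_lo = d / (m ** (2 * gamma) * sigma**2)  # lower end of the middle range
    large = d / sigma**2                        # upper end of the middle range
    if 0 < n < small:
        return "small n", d * math.sqrt(m * n)
    if mid_lo <= n <= large:
        return "intermediate n", d * m ** (1 - gamma) * math.sqrt(n)
    if n > large:
        return "large n", d * m * math.sqrt(n)
    return "not covered", None


def independent_rate(n, d, m):
    """Best rate quoted for agents that do not collaborate: d * m * sqrt(n)."""
    return d * m * math.sqrt(n)


# Example with d=4, m=16, sigma=0.5, gamma=0.5 (illustrative values only).
print(regret_rate(4, 4, 16, 0.5, 0.5))   # ('intermediate n', 32.0)
print(independent_rate(4, 4, 16))        # 128.0
```

In this example the collaborative rate is a factor of m^gamma = 4 below the independent rate, which is the kind of gain the abstract attributes to collaboration before n exceeds d/sigma^2.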
[LG-75] CORAL: Disentangling Latent Representations in Long-Tailed Diffusion
Link: https://arxiv.org/abs/2506.15933
Authors: Esther Rodriguez,Monica Welfert,Samuel McDowell,Nathan Stromberg,Julian Antolin Camarena,Lalitha Sankar
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models have achieved impressive performance in generating high-quality and diverse synthetic data. However, their success typically assumes a class-balanced training distribution. In real-world settings, multi-class data often follow a long-tailed distribution, where standard diffusion models struggle – producing low-diversity and lower-quality samples for tail classes. While this degradation is well-documented, its underlying cause remains poorly understood. In this work, we investigate the behavior of diffusion models trained on long-tailed datasets and identify a key issue: the latent representations (from the bottleneck layer of the U-Net) for tail class subspaces exhibit significant overlap with those of head classes, leading to feature borrowing and poor generation quality. Importantly, we show that this is not merely due to limited data per class, but that the relative class imbalance significantly contributes to this phenomenon. To address this, we propose COntrastive Regularization for Aligning Latents (CORAL), a contrastive latent alignment framework that leverages supervised contrastive losses to encourage well-separated latent class representations. Experiments demonstrate that CORAL significantly improves both the diversity and visual quality of samples generated for tail classes relative to state-of-the-art methods.
[LG-76] Competing Bandits in Matching Markets via Super Stability
Link: https://arxiv.org/abs/2506.15926
Authors: Soumya Basu
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:We study bandit learning in matching markets with two-sided reward uncertainty, extending prior research primarily focused on single-sided uncertainty. Leveraging the concept of 'super-stability' from Irving (1994), we demonstrate the advantage of the Extended Gale-Shapley (GS) algorithm over the standard GS algorithm in achieving true stable matchings under incomplete information. By employing the Extended GS algorithm, our centralized algorithm attains a logarithmic pessimal stable regret dependent on an instance-dependent admissible gap parameter. This algorithm is further adapted to a decentralized setting with a constant regret increase. Finally, we establish a novel centralized instance-dependent lower bound for binary stable regret, elucidating the roles of the admissible gap and super-stable matching in characterizing the complexity of stable matching with bandit feedback.
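For readers unfamiliar with the baseline that the abstract above contrasts against, here is a minimal sketch of the standard Gale-Shapley deferred-acceptance algorithm. It is not the paper's Extended GS or its bandit algorithm: it assumes that both sides have complete, strict, fully known preference lists and that the two sides have equal size. The function name and the toy preferences are hypothetical.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Standard deferred acceptance with known, strict, complete preferences.

    proposer_prefs / receiver_prefs: dict mapping each agent to a list of
    agents on the other side, most preferred first.
    Returns a dict proposer -> matched receiver.
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next receiver each proposer will try
    engaged_to = {}                               # receiver -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p             # receiver was unmatched: accept
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p             # receiver trades up; old partner is free again
            free.append(current)
        else:
            free.append(p)                # proposal rejected; p tries the next receiver
    return {p: r for r, p in engaged_to.items()}


# Toy instance: both proposers prefer x, and x prefers b.
proposers = {"a": ["x", "y"], "b": ["x", "y"]}
receivers = {"x": ["b", "a"], "y": ["a", "b"]}
print(gale_shapley(proposers, receivers))  # {'b': 'x', 'a': 'y'}
```

With known preferences this procedure returns a stable matching; the abstract's point is that under two-sided reward uncertainty the preferences are only estimated, which is where the super-stability notion and the Extended GS variant come in.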
[LG-77] Pieceformer: Similarity-Driven Knowledge Transfer via Scalable Graph Transformer in VLSI
Link: https://arxiv.org/abs/2506.15907
Authors: Hang Yang,Yusheng Hu,Yong Liu,Cong (Callie) Hao
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 7 pages, 4 figures, 1 table, submitted
Abstract:Accurate graph similarity is critical for knowledge transfer in VLSI design, enabling the reuse of prior solutions to reduce engineering effort and turnaround time. We propose Pieceformer, a scalable, self-supervised similarity assessment framework, equipped with a hybrid message-passing and graph transformer encoder. To address transformer scalability, we incorporate a linear transformer backbone and introduce a partitioned training pipeline for efficient memory and parallelism management. Evaluations on synthetic and real-world CircuitNet datasets show that Pieceformer reduces mean absolute error (MAE) by 24.9% over the baseline and is the only method to correctly cluster all real-world design groups. We further demonstrate the practical usage of our model through a case study on a partitioning task, achieving up to 89% runtime reduction. These results validate the framework’s effectiveness for scalable, unbiased design reuse in modern VLSI systems.
[LG-78] VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics
Link: https://arxiv.org/abs/2506.15903
Authors: Josef Kuchař,Marek Kadlčík,Michal Spiegel,Michal Štefánik
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We introduce a large-scale dataset for instruction-guided vector image editing, consisting of over 270,000 pairs of SVG images paired with natural language edit instructions. Our dataset enables training and evaluation of models that modify vector graphics based on textual commands. We describe the data collection process, including image pairing via CLIP similarity and instruction generation with vision-language models. Initial experiments with state-of-the-art large language models reveal that current methods struggle to produce accurate and valid edits, underscoring the challenge of this task. To foster research in natural language-driven vector graphic generation and editing, we make our resources created within this work publicly available.
[LG-79] Clinically Interpretable Mortality Prediction for ICU Patients with Diabetes and Atrial Fibrillation: A Machine Learning Approach
Link: https://arxiv.org/abs/2506.15901
Authors: Li Sun,Shuheng Chen,Yong Si,Junyi Fan,Maryam Pishgar,Elham Pishgar,Kamiar Alaei,Greg Placencia
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Background: Patients with both diabetes mellitus (DM) and atrial fibrillation (AF) face elevated mortality in intensive care units (ICUs), yet models targeting this high-risk group remain limited. Objective: To develop an interpretable machine learning (ML) model predicting 28-day mortality in ICU patients with concurrent DM and AF using early-phase clinical data. Methods: A retrospective cohort of 1,535 adult ICU patients with DM and AF was extracted from the MIMIC-IV database. Data preprocessing involved median/mode imputation, z-score normalization, and early temporal feature engineering. A two-step feature selection pipeline (univariate filtering with an ANOVA F-test, followed by Random Forest-based multivariate ranking) yielded 19 interpretable features. Seven ML models were trained with stratified 5-fold cross-validation and SMOTE oversampling. Interpretability was assessed via ablation and Accumulated Local Effects (ALE) analysis. Results: Logistic regression achieved the best performance (AUROC: 0.825; 95% CI: 0.779-0.867), surpassing more complex models. Key predictors included RAS, age, bilirubin, and extubation. ALE plots showed intuitive, non-linear effects such as age-related risk acceleration and bilirubin thresholds. Conclusion: This interpretable ML model offers accurate risk prediction and clinical insights for early ICU triage in patients with DM and AF.
Submission history: [v1] Wed, 18 Jun 2025 22:04:12 UTC (2,329 KB), from Shuheng Chen
[LG-80] TrajDiff: Diffusion Bridge Network with Semantic Alignment for Trajectory Similarity Computation
Link: https://arxiv.org/abs/2506.15898
Authors: Xiao Zhang,Xingyu Zhao,Hong Xia,Yuan Cao,Guiyuan Jiang,Junyu Dong,Yanwei Yu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:With the proliferation of location-tracking technologies, massive volumes of trajectory data are continuously being collected. As a fundamental task in trajectory data mining, trajectory similarity computation plays a critical role in a wide range of real-world applications. However, existing learning-based methods face three challenges: First, they ignore the semantic gap between GPS and grid features in trajectories, making it difficult to obtain meaningful trajectory embeddings. Second, the noise inherent in the trajectories, as well as the noise introduced during grid discretization, obscures the true motion patterns of the trajectories. Third, existing methods focus solely on point-wise and pair-wise losses, without utilizing the global ranking information obtained by sorting all trajectories according to their similarity to a given trajectory. To address the aforementioned challenges, we propose a novel trajectory similarity computation framework, named TrajDiff. Specifically, the semantic alignment module relies on cross-attention and an attention score mask mechanism with adaptive fusion, effectively eliminating semantic discrepancies between data at two scales and generating a unified representation. Additionally, the DDBM-based Noise-robust Pre-Training introduces the transfer patterns between any two trajectories into the model training process, enhancing the model’s noise robustness. Finally, the overall ranking-aware regularization shifts the model’s focus from a local to a global perspective, enabling it to capture the holistic ordering information among trajectories. Extensive experiments on three publicly available datasets show that TrajDiff consistently outperforms state-of-the-art baselines. In particular, it achieves an average HR@1 gain of 33.38% across all three evaluation metrics and datasets.
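The abstract above reports gains in HR@1 without defining the metric. As a point of reference, the sketch below implements one common definition of hit ratio at k used in trajectory similarity work: the overlap between the k candidates a model ranks closest to a query and the k candidates that are truly closest, divided by k. Whether the paper uses exactly this definition is an assumption, and the function name and toy distances are hypothetical.

```python
def hit_ratio_at_k(pred_dist, true_dist, k):
    """HR@k under the assumed top-k overlap definition.

    pred_dist, true_dist: dict mapping candidate id -> distance to the query
    (smaller means more similar). Both dicts must share the same keys.
    """
    pred_top = set(sorted(pred_dist, key=pred_dist.get)[:k])  # model's top-k
    true_top = set(sorted(true_dist, key=true_dist.get)[:k])  # ground-truth top-k
    return len(pred_top & true_top) / k


# Toy query with three candidate trajectories.
true_dist = {"t1": 0.1, "t2": 0.5, "t3": 0.9}
pred_dist = {"t1": 0.2, "t2": 0.8, "t3": 0.3}
print(hit_ratio_at_k(pred_dist, true_dist, 1))  # 1.0
print(hit_ratio_at_k(pred_dist, true_dist, 2))  # 0.5
```

In practice the score would be averaged over many query trajectories; the single-query version is shown only to keep the definition visible.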
[LG-81] Formal Models of Active Learning from Contrastive Examples
Link: https://arxiv.org/abs/2506.15893
Authors: Farnam Mansouri,Hans U. Simon,Adish Singla,Yuxin Chen,Sandra Zilles
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Machine learning can greatly benefit from providing learning algorithms with pairs of contrastive training examples – typically pairs of instances that differ only slightly, yet have different class labels. Intuitively, the difference in the instances helps explain the difference in the class labels. This paper proposes a theoretical framework in which the effect of various types of contrastive examples on active learners is studied formally. The focus is on the sample complexity of learning concept classes and how it is influenced by the choice of contrastive examples. We illustrate our results with geometric concept classes and classes of Boolean functions. Interestingly, we reveal a connection between learning from contrastive examples and the classical model of self-directed learning.
[LG-82] Fair Contracts in Principal-Agent Games with Heterogeneous Types
Link: https://arxiv.org/abs/2506.15887
Authors: Jakub Tłuczek, Victor Villin, Christos Dimitrakakis
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract:Fairness is desirable yet challenging to achieve within multi-agent systems, especially when agents differ in latent traits that affect their abilities. This hidden heterogeneity often leads to unequal distributions of wealth, even when agents operate under the same rules. Motivated by real-world examples, we propose a framework based on repeated principal-agent games, where a principal, who also can be seen as a player of the game, learns to offer adaptive contracts to agents. By leveraging a simple yet powerful contract structure, we show that a fairness-aware principal can learn homogeneous linear contracts that equalize outcomes across agents in a sequential social dilemma. Importantly, this fairness does not come at the cost of efficiency: our results demonstrate that it is possible to promote equity and stability in the system while preserving overall performance.
[LG-83] T-SHRED: Symbolic Regression for Regularization and Model Discovery with Transformer Shallow Recurrent Decoders
Link: https://arxiv.org/abs/2506.15881
Authors: Alexey Yermakov, David Zoro, Mars Liyao Gao, J. Nathan Kutz
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, submitted to Transactions of the Royal Society (Symbolic Regression in the Physical Sciences)
Abstract:SHallow REcurrent Decoders (SHRED) are effective for system identification and forecasting from sparse sensor measurements. Such models are light-weight and computationally efficient, allowing them to be trained on consumer laptops. SHRED-based models rely on Recurrent Neural Networks (RNNs) and a simple Multi-Layer Perceptron (MLP) for the temporal encoding and spatial decoding respectively. Despite the relatively simple structure of SHRED, they are able to predict chaotic dynamical systems on different physical, spatial, and temporal scales directly from a sparse set of sensor measurements. In this work, we improve SHRED by leveraging transformers (T-SHRED) for the temporal encoding which improves performance on next-step state prediction on large datasets. We also introduce a sparse identification of nonlinear dynamics (SINDy) attention mechanism into T-SHRED to perform symbolic regression directly on the latent space as part of the model regularization architecture. Symbolic regression improves model interpretability by learning and regularizing the dynamics of the latent space during training. We analyze the performance of T-SHRED on three different dynamical systems ranging from low-data to high-data regimes. We observe that SINDy attention T-SHRED accurately predicts future frames based on an interpretable symbolic model across all tested datasets.
[LG-84] Job Market Cheat Codes: Prototyping Salary Prediction and Job Grouping with Synthetic Job Listings
Link: https://arxiv.org/abs/2506.15879
Authors: Abdel Rahman Alsheyab (1), Mohammad Alkhasawneh (1), Nidal Shahin (1) ((1) Jordan University of Science and Technology, Irbid, Jordan)
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, synthetic data only, experimental work
Abstract:This paper presents a machine learning methodology prototype using a large synthetic dataset of job listings to identify trends, predict salaries, and group similar job roles. Employing techniques such as regression, classification, clustering, and natural language processing (NLP) for text-based feature extraction and representation, this study aims to uncover the key features influencing job market dynamics and provide valuable insights for job seekers, employers, and researchers. Exploratory data analysis was conducted to understand the dataset’s characteristics. Subsequently, regression models were developed to predict salaries, classification models to predict job titles, and clustering techniques were applied to group similar jobs. The analyses revealed significant factors influencing salary and job roles, and identified distinct job clusters based on the provided data. While the results are based on synthetic data and not intended for real-world deployment, the methodology demonstrates a transferable framework for job market analysis.
[LG-85] Hidden Breakthroughs in Language Model Training
Link: https://arxiv.org/abs/2506.15872
Authors: Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 10 figures
Abstract:Loss curves are smooth during most of model training, so visible discontinuities stand out as possible conceptual breakthroughs. Studying these breakthroughs enables a deeper understanding of learning dynamics, but only when they are properly identified. This paper argues that similar breakthroughs occur frequently throughout training but they are obscured by a loss metric that collapses all variation into a single scalar. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We use our method to identify clusters of samples that share similar changes in loss during training, disaggregating the overall loss into that of smaller groups of conceptually similar data. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent interpretable breakthroughs in the model’s capabilities. We demonstrate the promise of these hidden phase transitions as a tool for unsupervised interpretability.
[LG-86] Improving Rectified Flow with Boundary Conditions
Link: https://arxiv.org/abs/2506.15864
Authors: Xixi Hu, Runlong Liao, Keyang Xu, Bo Liu, Yeqing Li, Eugene Ie, Hongliang Fei, Qiang Liu
Categories: Machine Learning (cs.LG)
Comments: 14 pages
Abstract:Rectified Flow offers a simple and effective approach to high-quality generative modeling by learning a velocity field. However, we identify a limitation in directly modeling the velocity with an unconstrained neural network: the learned velocity often fails to satisfy certain boundary conditions, leading to inaccurate velocity field estimations that deviate from the desired ODE. This issue is particularly critical during stochastic sampling at inference, as the score function’s errors are amplified near the boundary. To mitigate this, we propose a Boundary-enforced Rectified Flow Model (Boundary RF Model), in which we enforce boundary conditions with a minimal code modification. Boundary RF Model improves performance over vanilla RF model, demonstrating 8.01% improvement in FID score on ImageNet using ODE sampling and 8.98% improvement using SDE sampling.
[LG-87] In-field Calibration of Low-Cost Sensors through XGBoost Aggregate Sensor Data
Link: https://arxiv.org/abs/2506.15840
Authors: Kevin Yin, Julia Gersey, Pei Zhang
Categories: Machine Learning (cs.LG)
Comments: 6 pages including citations
Abstract:Effective large-scale air quality monitoring necessitates distributed sensing due to the pervasive and harmful nature of particulate matter (PM), particularly in urban environments. However, precision comes at a cost: highly accurate sensors are expensive, limiting the spatial deployments and thus their coverage. As a result, low-cost sensors have become popular, though they are prone to drift caused by environmental sensitivity and manufacturing variability. This paper presents a model for in-field sensor calibration using XGBoost ensemble learning to consolidate data from neighboring sensors. This approach reduces dependence on the presumed accuracy of individual sensors and improves generalization across different locations.
[LG-88] Code Rate Optimization via Neural Polar Decoders
Link: https://arxiv.org/abs/2506.15836
Authors: Ziv Aharoni, Bashar Huleihel, Henry D Pfister, Haim H Permuter
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:This paper proposes a method to optimize communication code rates via the application of neural polar decoders (NPDs). Employing this approach enables simultaneous optimization of code rates over input distributions while providing a practical coding scheme within the framework of polar codes. The proposed approach is designed for scenarios where the channel model is unknown, treating the channel as a black box that produces output samples from input samples. We employ polar codes to achieve our objectives, using NPDs to estimate mutual information (MI) between the channel inputs and outputs, and optimize a parametric model of the input distribution. The methodology involves a two-phase process: a training phase and an inference phase. In the training phase, two steps are repeated interchangeably. First, the estimation step estimates the MI of the channel inputs and outputs via NPDs. Second, the improvement step optimizes the input distribution parameters to maximize the MI estimate obtained by the NPDs. In the inference phase, the optimized model is used to construct polar codes. This involves incorporating the Honda-Yamamoto (HY) scheme to accommodate the optimized input distributions and list decoding to enhance decoding performance. Experimental results on memoryless and finite-state channels (FSCs) demonstrate the effectiveness of our approach, particularly in cases where the channel’s capacity-achieving input distribution is non-uniform. For these cases, we show significant improvements in MI and bit error rates (BERs) over those achieved by uniform and independent and identically distributed (i.i.d.) input distributions, validating our method for block lengths up to 1024. This scalable approach has potential applications in real-world communication systems, bridging theoretical capacity estimation and practical coding performance.
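To illustrate why a non-uniform input distribution can raise mutual information, here is a minimal sketch under strong simplifying assumptions. It uses a Z-channel with a known flip probability and a brute-force grid search over P(X = 1). The paper, by contrast, treats the channel as a black box and estimates MI with NPDs, so this is not the authors' method. The function names `binary_entropy`, `z_channel_mi`, and `best_input_probability` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def z_channel_mi(q: float, flip: float) -> float:
    """I(X;Y) in bits for a Z-channel (assumed known here).

    Input 0 is always received as 0; input 1 is received as 0 with
    probability `flip`. `q` is P(X = 1).
    """
    p_y1 = q * (1.0 - flip)  # P(Y = 1)
    # I(X;Y) = H(Y) - H(Y|X); only X = 1 contributes conditional entropy.
    return binary_entropy(p_y1) - q * binary_entropy(flip)


def best_input_probability(flip: float, steps: int = 1000) -> tuple[float, float]:
    """Grid-search P(X = 1) over [0, 1]; return (argmax, maximal MI)."""
    candidates = [i / steps for i in range(steps + 1)]
    q_star = max(candidates, key=lambda q: z_channel_mi(q, flip))
    return q_star, z_channel_mi(q_star, flip)


q_star, mi_star = best_input_probability(flip=0.5)
mi_uniform = z_channel_mi(0.5, flip=0.5)
print(f"best P(X=1) = {q_star:.3f}, MI = {mi_star:.4f} bits")
print(f"uniform input MI = {mi_uniform:.4f} bits")
```

For this toy channel the maximizing input is biased toward the noiseless symbol, so the uniform input falls short of capacity. That is the regime the abstract highlights.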
[LG-89] Heterogeneous Federated Reinforcement Learning Using Wasserstein Barycenters
Link: https://arxiv.org/abs/2506.15825
Authors: Luiz Pereira, M. Hadi Amini
Categories: Machine Learning (cs.LG)
Abstract:In this paper, we first propose a novel algorithm for model fusion that leverages Wasserstein barycenters in training a global Deep Neural Network (DNN) in a distributed architecture. To this end, we divide the dataset into equal parts that are fed to “agents” who have identical deep neural networks and train only over the dataset fed to them (known as the local dataset). After some training iterations, we perform an aggregation step where we combine the weight parameters of all neural networks using Wasserstein barycenters. These steps form the proposed algorithm referred to as FedWB. Moreover, we leverage the processes created in the first part of the paper to develop an algorithm to tackle Heterogeneous Federated Reinforcement Learning (HFRL). Our test experiment is the CartPole toy problem, where we vary the lengths of the poles to create heterogeneous environments. We train a deep Q-Network (DQN) in each environment to learn to control each cart, while occasionally performing a global aggregation step to generalize the local models; the end outcome is a global DQN that functions across all environments.
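The abstract does not say how the barycenter over network weights is computed, so the following is only a sketch of the simplest related case. For one-dimensional empirical distributions with equal sample counts, the 2-Wasserstein barycenter has a closed form: sort each sample set and average position by position. The function name `wasserstein_barycenter_1d` and the agent values are hypothetical and are not taken from FedWB.

```python
def wasserstein_barycenter_1d(samples_per_agent, weights=None):
    """W2 barycenter of 1-D empirical distributions with equal sample counts.

    In one dimension the barycenter's quantile function is the weighted
    average of the input quantile functions, so sorting each sample set
    and averaging position by position yields the barycenter's samples.
    """
    n_agents = len(samples_per_agent)
    n = len(samples_per_agent[0])
    if any(len(s) != n for s in samples_per_agent):
        raise ValueError("all agents must provide the same number of samples")
    if weights is None:
        weights = [1.0 / n_agents] * n_agents
    sorted_sets = [sorted(s) for s in samples_per_agent]
    return [
        sum(w * s[i] for w, s in zip(weights, sorted_sets))
        for i in range(n)
    ]


# Hypothetical scalar values held by two agents.
agent_a = [0.0, 2.0, 1.0]
agent_b = [12.0, 10.0, 11.0]
barycenter = wasserstein_barycenter_1d([agent_a, agent_b])
print(barycenter)  # [5.0, 6.0, 7.0]
```

Pooling the two sample sets would give a bimodal mixture. The barycenter instead keeps the shared shape and moves it to the midpoint.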
[LG-90] AI-based modular warning machine for risk identification in proximity healthcare
Link: https://arxiv.org/abs/2506.15823
Authors: Chiara Razzetta, Shahryar Noei, Federico Barbarossa, Edoardo Spairani, Monica Roascio, Elisa Barbi, Giulia Ciacci, Sara Sommariva, Sabrina Guastavino, Michele Piana, Matteo Lenge, Gabriele Arnulfo, Giovanni Magenes, Elvira Maranesi, Giulio Amabili, Anna Maria Massone, Federico Benvenuto, Giuseppe Jurman, Diego Sona, Cristina Campi
Categories: Machine Learning (cs.LG)
Abstract:“DHEAL-COM - Digital Health Solutions in Community Medicine” is a research and technology project funded by the Italian Department of Health for the development of digital solutions of interest in proximity healthcare. The activity within the DHEAL-COM framework allows scientists to gather a notable amount of multi-modal data whose interpretation can be performed by means of machine learning algorithms. The present study illustrates a general automated pipeline made of numerous unsupervised and supervised methods that can ingest such data, provide predictive results, and facilitate model interpretations via feature identification.
[LG-91] Optimizing Bidding Strategies in First-Price Auctions in Binary Feedback Setting with Predictions
Link: https://arxiv.org/abs/2506.15817
Authors: Jason Tandiary
Categories: Machine Learning (cs.LG)
Abstract:This paper studies Vickrey first-price auctions under binary feedback. Leveraging the enhanced performance of machine learning algorithms, the new algorithm uses past information to improve the regret bounds of the BROAD-OMD algorithm. Motivated by the growing relevance of first-price auctions and the predictive capabilities of machine learning models, this paper proposes a new algorithm within the BROAD-OMD framework (Hu et al., 2025) that leverages predictions of the highest competing bid. This paper’s main contribution is an algorithm that achieves zero regret under accurate predictions. Additionally, a bounded regret bound of O(T^(3/4) * Vt^(1/4)) is established under certain normality conditions.
[LG-92] DeepJ: Graph Convolutional Transformers with Differentiable Pooling for Patient Trajectory Modeling
Link: https://arxiv.org/abs/2506.15809
Authors: Deyi Li, Zijun Yao, Muxuan Liang, Mei Liu
Categories: Machine Learning (cs.LG)
Abstract:In recent years, graph learning has gained significant interest for modeling complex interactions among medical events in structured Electronic Health Record (EHR) data. However, existing graph-based approaches often work in a static manner, either restricting interactions within individual encounters or collapsing all historical encounters into a single snapshot. As a result, when it is necessary to identify meaningful groups of medical events spanning longitudinal encounters, existing methods are inadequate in modeling interactions cross encounters while accounting for temporal dependencies. To address this limitation, we introduce Deep Patient Journey (DeepJ), a novel graph convolutional transformer model with differentiable graph pooling to effectively capture intra-encounter and inter-encounter medical event interactions. DeepJ can identify groups of temporally and functionally related medical events, offering valuable insights into key event clusters pertinent to patient outcome prediction. DeepJ significantly outperformed five state-of-the-art baseline models while enhancing interpretability, demonstrating its potential for improved patient risk stratification.
[LG-93] Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Link: https://arxiv.org/abs/2506.15799
Authors: Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, Sergey Levine
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior – an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies – a state-of-the-art BC methodology – we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
[LG-94] Descriptor-based Foundation Models for Molecular Property Prediction
Link: https://arxiv.org/abs/2506.15792
Authors: Jackson Burns, Akshat Zalte, William Green
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Abstract:Fast and accurate prediction of molecular properties with machine learning is pivotal to scientific advancements across myriad domains. Foundation models in particular have proven especially effective, enabling accurate training on small, real-world datasets. This study introduces CheMeleon, a novel molecular foundation model pre-trained on deterministic molecular descriptors from the Mordred package, leveraging a Directed Message-Passing Neural Network to predict these descriptors in a noise-free setting. Unlike conventional approaches relying on noisy experimental data or biased quantum mechanical simulations, CheMeleon uses low-noise molecular descriptors to learn rich molecular representations. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a win rate of 79% on Polaris tasks, outperforming baselines like Random Forest (46%), fastprop (39%), and Chemprop (36%), and a 97% win rate on MoleculeACE assays, surpassing Random Forest (63%) and other foundation models. However, it struggles to distinguish activity cliffs like many of the tested models. The t-SNE projection of CheMeleon’s learned representations demonstrates effective separation of chemical series, highlighting its ability to capture structural nuances. These results underscore the potential of descriptor-based pre-training for scalable and effective molecular property prediction, opening avenues for further exploration of descriptor sets and unlabeled datasets.
[LG-95] Convergent Methods for Koopman Operators on Reproducing Kernel Hilbert Spaces
Link: https://arxiv.org/abs/2506.15782
Authors: Nicolas Boullé, Matthew J. Colbrook, Gustav Conradie
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS); Spectral Theory (math.SP); Machine Learning (stat.ML)
Abstract:Data-driven spectral analysis of Koopman operators is a powerful tool for understanding numerous real-world dynamical systems, from neuronal activity to variations in sea surface temperature. The Koopman operator acts on a function space and is most commonly studied on the space of square-integrable functions. However, defining it on a suitable reproducing kernel Hilbert space (RKHS) offers numerous practical advantages, including pointwise predictions with error bounds, improved spectral properties that facilitate computations, and more efficient algorithms, particularly in high dimensions. We introduce the first general, provably convergent, data-driven algorithms for computing spectral properties of Koopman and Perron–Frobenius operators on RKHSs. These methods efficiently compute spectra and pseudospectra with error control and spectral measures while exploiting the RKHS structure to avoid the large-data limits required in the L^2 settings. The function space is determined by a user-specified kernel, eliminating the need for quadrature-based sampling as in L^2 and enabling greater flexibility with finite, externally provided datasets. Using the Solvability Complexity Index hierarchy, we construct adversarial dynamical systems for these problems to show that no algorithm can succeed in fewer limits, thereby proving the optimality of our algorithms. Notably, this impossibility extends to randomized algorithms and datasets. We demonstrate the effectiveness of our algorithms on challenging, high-dimensional datasets arising from real-world measurements and high-fidelity numerical simulations, including turbulent channel flow, molecular dynamics of a binding protein, Antarctic sea ice concentration, and Northern Hemisphere sea surface height. The algorithms are publicly available in the software package SpecRKHS.
[LG-96] Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration
Link: https://arxiv.org/abs/2506.15721
Authors: Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi
Categories: Machine Learning (cs.LG)
Abstract:Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM’s varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM’s performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM’s updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM’s capabilities. Our code is available at this https URL.
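The abstract names Sliding Window Binomial Likelihood Ratio Testing (SWBLRT) but gives no details, so what follows is a generic sketch rather than Bohdi's procedure. It assumes the window is split into an older and a newer half and applies a standard two-sample binomial likelihood ratio test to their success rates. The class name, window size, and the 3.841 threshold (the 95% quantile of a chi-square distribution with one degree of freedom) are all illustrative assumptions.

```python
import math
from collections import deque


def _log_lik(successes: int, trials: int, p: float) -> float:
    """Binomial log-likelihood without the constant term (0 * log 0 := 0)."""
    total = 0.0
    if successes:
        total += successes * math.log(p)
    if trials - successes:
        total += (trials - successes) * math.log(1.0 - p)
    return total


def binomial_lrt(k_old: int, n_old: int, k_new: int, n_new: int) -> float:
    """Likelihood ratio statistic for 'both halves share one success rate'."""
    pooled = (k_old + k_new) / (n_old + n_new)
    separate = (_log_lik(k_old, n_old, k_old / n_old)
                + _log_lik(k_new, n_new, k_new / n_new))
    return 2.0 * (separate - _log_lik(k_old + k_new, n_old + n_new, pooled))


class SlidingWindowShiftDetector:
    """Flags a shift when the two halves of the window disagree (hypothetical)."""

    def __init__(self, window: int = 40, threshold: float = 3.841):
        if window < 2 or window % 2:
            raise ValueError("window must be an even integer >= 2")
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def update(self, correct: bool) -> bool:
        self.outcomes.append(1 if correct else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        half = self.outcomes.maxlen // 2
        items = list(self.outcomes)
        old, new = items[:half], items[half:]
        return binomial_lrt(sum(old), half, sum(new), half) > self.threshold


detector = SlidingWindowShiftDetector(window=40)
# Hypothetical per-domain feedback: accuracy jumps from 25% to 90%.
stream = [1, 0, 0, 0] * 5 + [1] * 18 + [0] * 2
flags = [detector.update(bool(x)) for x in stream]
print(flags[-1])  # True
```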
[LG-97] Data-Driven Heat Pump Management: Combining Machine Learning with Anomaly Detection for Residential Hot Water Systems
Link: https://arxiv.org/abs/2506.15719
Authors: Manal Rahal, Bestoun S. Ahmed, Roger Renstrom, Robert Stener, Albrecht Wurtz
Categories: Machine Learning (cs.LG)
Comments: 33 pages, accepted in Neural Networks and Applications
Abstract:Heat pumps (HPs) have emerged as a cost-effective and clean technology for sustainable energy systems, but their efficiency in producing hot water remains restricted by conventional threshold-based control methods. Although machine learning (ML) has been successfully implemented for various HP applications, optimization of household hot water demand forecasting remains understudied. This paper addresses this problem by introducing a novel approach that combines predictive ML with anomaly detection to create adaptive hot water production strategies based on household-specific consumption patterns. Our key contributions include: (1) a composite approach combining ML and isolation forest (iForest) to forecast household demand for hot water and steer responsive HP operations; (2) multi-step feature selection with advanced time-series analysis to capture complex usage patterns; (3) application and tuning of three ML models: Light Gradient Boosting Machine (LightGBM), Long Short-Term Memory (LSTM), and Bi-directional LSTM with the self-attention mechanism on data from different types of real HP installations; and (4) experimental validation on six real household installations. Our experiments show that the best-performing model LightGBM achieves superior performance, with RMSE improvements of up to 9.37% compared to LSTM variants with R^2 values between 0.748-0.983. For anomaly detection, our iForest implementation achieved an F1-score of 0.87 with a false alarm rate of only 5.2%, demonstrating strong generalization capabilities across different household types and consumption patterns, making it suitable for real-world HP deployments.
[LG-98] BuildingBRep-11K: Precise Multi-Storey B-Rep Building Solids with Rich Layout Metadata
Link: https://arxiv.org/abs/2506.15718
Authors: Yu Guo, Hongji Fang, Tianyu Fang, Zhe Cui
Categories: Machine Learning (cs.LG)
Abstract:With the rise of artificial intelligence, the automatic generation of building-scale 3-D objects has become an active research topic, yet training such models still demands large, clean and richly annotated datasets. We introduce BuildingBRep-11K, a collection of 11,978 multi-storey (2-10 floors) buildings (about 10 GB) produced by a shape-grammar-driven pipeline that encodes established building-design principles. Every sample consists of a geometrically exact B-rep solid (covering floors, walls, slabs and rule-based openings) together with a fast-loading .npy metadata file that records detailed per-floor parameters. The generator incorporates constraints on spatial scale, daylight optimisation and interior layout, and the resulting objects pass multi-stage filters that remove Boolean failures, undersized rooms and extreme aspect ratios, ensuring compliance with architectural standards. To verify the dataset’s learnability we trained two lightweight PointNet baselines. (i) Multi-attribute regression. A single encoder predicts storey count, total rooms, per-storey vector and mean room area from a 4,000-point cloud. On 100 unseen buildings it attains 0.37-storey MAE (87% within ±1), 5.7-room MAE, and 3.2 m^2 MAE on mean area. (ii) Defect detection. With the same backbone we classify GOOD versus DEFECT; on a balanced 100-model set the network reaches 54% accuracy, recalling 82% of true defects at 53% precision (41 TP, 9 FN, 37 FP, 13 TN). These pilots show that BuildingBRep-11K is learnable yet non-trivial for both geometric regression and topological quality assessment.
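The defect-detection figures quoted in the abstract can be checked directly from the reported confusion-matrix counts. The short calculation below uses only those four counts. The F1 value is derived here for convenience and is not reported in the abstract.

```python
# Counts reported in the abstract for the balanced 100-model test set.
tp, fn, fp, tn = 41, 9, 37, 13

total = tp + fn + fp + tn            # 50 defective + 50 good models
accuracy = (tp + tn) / total         # share of correct decisions
recall = tp / (tp + fn)              # share of true defects that were caught
precision = tp / (tp + fp)           # share of DEFECT predictions that were right
f1 = 2 * precision * recall / (precision + recall)  # derived, not in the abstract

print(f"accuracy  = {accuracy:.2%}")
print(f"recall    = {recall:.2%}")
print(f"precision = {precision:.2%}")
print(f"F1        = {f1:.3f}")
```

The counts match the quoted 54% accuracy, 82% recall, and 53% precision. The model flags 78 of the 100 buildings as defective, and that over-prediction is why recall is high while accuracy stays close to chance.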
[LG-99] An application of machine learning to the motion response prediction of floating assets
Link: https://arxiv.org/abs/2506.15713
Authors: Michael T.M.B. Morris-Thomas, Marius Martens
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn)
Comments: 17 pages, 6 figures
Abstract:The real-time prediction of floating offshore asset behavior under stochastic metocean conditions remains a significant challenge in offshore engineering. While traditional empirical and frequency-domain methods work well in benign conditions, they struggle with both extreme sea states and nonlinear responses. This study presents a supervised machine learning approach using multivariate regression to predict the nonlinear motion response of a turret-moored vessel in 400 m water depth. We developed a machine learning workflow combining a gradient-boosted ensemble method with a custom passive weathervaning solver, trained on approximately 10^6 samples spanning 100 features. The model achieved mean prediction errors of less than 5% for critical mooring parameters and vessel heading accuracy to within 2.5 degrees across diverse metocean conditions, significantly outperforming traditional frequency-domain methods. The framework has been successfully deployed on an operational facility, demonstrating its efficacy for real-time vessel monitoring and operational decision-making in offshore environments.
[LG-100] CoC: Chain-of-Cancer based on Cross-Modal Autoregressive Traction for Survival Prediction
Link: https://arxiv.org/abs/2506.15696
Authors: Haipeng Zhou, Sicheng Yang, Sihan Yang, Jing Qin, Lei Chen, Lei Zhu
Categories: Machine Learning (cs.LG)
Abstract:Survival prediction aims to evaluate the risk level of cancer patients. Existing methods primarily rely on pathology and genomics data, either individually or in combination. From the perspective of cancer pathogenesis, epigenetic changes, such as methylation data, could also be crucial for this task. Furthermore, no previous endeavors have utilized textual descriptions to guide the prediction. To this end, we are the first to explore the use of four modalities, including three clinical modalities and language, for conducting survival prediction. In detail, we are motivated by the Chain-of-Thought (CoT) to propose the Chain-of-Cancer (CoC) framework, focusing on intra-learning and inter-learning. We encode the clinical data as the raw features, which remain domain-specific knowledge for intra-learning. In terms of inter-learning, we use language to prompt the raw features and introduce an Autoregressive Mutual Traction module for synergistic representation. This tailored framework facilitates joint learning among multiple modalities. Our approach is evaluated across five public cancer datasets, and extensive experiments validate the effectiveness of our methods and proposed designs, leading to producing state-of-the-art results. Codes will be released.
[LG-101] SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models
Link: https://arxiv.org/abs/2506.15695
Authors: Xinxing Ren, Qianbo Zang, Zekun Guo
Categories: Machine Learning (cs.LG)
Abstract:Recent advances in large language models (LLMs) have shown impressive performance in mathematical reasoning and code generation. However, LLMs still struggle in the simulation domain, particularly in generating Simulink models, which are essential tools in engineering and scientific research. Our preliminary experiments indicate that LLM agents often fail to produce reliable and complete Simulink simulation code from text-only inputs, likely due to the lack of Simulink-specific data in their pretraining. To address this challenge, we propose SimuGen, a multimodal agent-based framework that automatically generates accurate Simulink simulation code by leveraging both the visual Simulink diagram and domain knowledge. SimuGen coordinates several specialized agents, including an investigator, unit test reviewer, code generator, executor, debug locator, and report writer, supported by a domain-specific knowledge base. This collaborative and modular design enables interpretable, robust, and reproducible Simulink simulation generation. Our source code is publicly available at this https URL.
[LG-102] Development of a Multiprocessing Interface Genetic Algorithm for Optimising a Multilayer Perceptron for Disease Prediction
Link: https://arxiv.org/abs/2506.15694
Authors: Iliyas Ibrahim Iliyas, Souley Boukari, Abdulsalam Yau Gital
Categories: Machine Learning (cs.LG)
Abstract:This study introduces a framework that integrates nonlinear feature extraction, classification, and efficient optimization. First, kernel principal component analysis with a radial basis function kernel reduces dimensionality while preserving 95% of the variance. Second, a multilayer perceptron (MLP) learns to predict disease status. Finally, a modified multiprocessing genetic algorithm (MIGA) optimizes MLP hyperparameters in parallel over ten generations. We evaluated this approach on three datasets: the Wisconsin Diagnostic Breast Cancer dataset, the Parkinson’s Telemonitoring dataset, and the chronic kidney disease dataset. The MLP tuned by the MIGA achieved the best accuracy of 99.12% for breast cancer, 94.87% for Parkinson’s disease, and 100% for chronic kidney disease. These results outperform those of other methods, such as grid search, random search, and Bayesian optimization. Compared with a standard genetic algorithm, kernel PCA revealed nonlinear relationships that improved classification, and the MIGA’s parallel fitness evaluations reduced the tuning time by approximately 60%. The genetic algorithm incurs high computational cost from sequential fitness evaluations, but our multiprocessing interface GA (MIGA) parallelizes this step, slashing the tuning time and steering the MLP toward the best accuracy score of 99.12%, 94.87%, and 100% for breast cancer, Parkinson’s disease, and CKD, respectively.
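The idea of parallelizing only the fitness evaluations can be sketched as a small genetic algorithm whose evaluation step goes through a pluggable `map_fn`. This is an illustration under stated assumptions, not the MIGA from the paper. The crossover, mutation, and elitism choices are generic, and `toy_fitness` is a hypothetical stand-in for training an MLP and returning its validation accuracy.

```python
import random


def toy_fitness(params):
    """Hypothetical stand-in for 'train an MLP, return validation accuracy'."""
    learning_rate, hidden_units = params
    return -((learning_rate - 0.01) ** 2) * 1e4 - ((hidden_units - 64) / 64) ** 2


def evolve(fitness, bounds, generations=10, pop_size=20, seed=0, map_fn=map):
    """Minimal GA; `map_fn` is the only place where parallelism enters."""
    rng = random.Random(seed)
    population = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                  for _ in range(pop_size)]
    best, best_score, history = None, float("-inf"), []
    for _ in range(generations):
        # Fitness evaluations are independent, so they can run in parallel.
        scores = list(map_fn(fitness, population))
        ranked = sorted(zip(scores, population), reverse=True)
        if ranked[0][0] > best_score:
            best_score, best = ranked[0]
        history.append(best_score)
        parents = [ind for _, ind in ranked[: pop_size // 2]]
        children = [best]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = []
            for (lo, hi), gene_a, gene_b in zip(bounds, a, b):
                gene = gene_a if rng.random() < 0.5 else gene_b  # uniform crossover
                gene += rng.gauss(0.0, 0.1 * (hi - lo))          # Gaussian mutation
                child.append(min(hi, max(lo, gene)))             # clip to bounds
            children.append(tuple(child))
        population = children
    return best, best_score, history


# Search space: (learning rate, hidden units); both ranges are assumptions.
bounds = [(1e-4, 1e-1), (8.0, 256.0)]
best, score, history = evolve(toy_fitness, bounds)
print(best, score)
```

To spread the evaluations across processes, pass `map_fn=pool.map` with a `multiprocessing.Pool`. Create the pool under an `if __name__ == "__main__":` guard and keep the fitness function at module level so it can be pickled.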
[LG-103] Verifiable Safety Q-Filters via Hamilton-Jacobi Reachability and Multiplicative Q-Networks
Link: https://arxiv.org/abs/2506.15693
Authors: Jiaxing Li, Hanjiang Hu, Yujie Yang, Changliu Liu
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures
Abstract:Recent learning-based safety filters have outperformed conventional methods, such as hand-crafted Control Barrier Functions (CBFs), by effectively adapting to complex constraints. However, these learning-based approaches lack formal safety guarantees. In this work, we introduce a verifiable model-free safety filter based on Hamilton-Jacobi reachability analysis. Our primary contributions include: 1) extending verifiable self-consistency properties for Q value functions, 2) proposing a multiplicative Q-network structure to mitigate zero-sublevel-set shrinkage issues, and 3) developing a verification pipeline capable of soundly verifying these self-consistency properties. Our proposed approach successfully synthesizes formally verified, model-free safety certificates across four standard safe-control benchmarks.
[LG-104] MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
Link: https://arxiv.org/abs/2506.15692
Authors: Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, Tomas Pfister
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 44% of the Kaggle competitions on the MLE-bench, significantly outperforming the best alternative.
[LG-105] S$^2$GPT-PINNs: Sparse and Small models for PDEs
Link: https://arxiv.org/abs/2506.15687
Authors: Yajie Ji, Yanlai Chen, Shawn Koohy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 17 pages, 6 figures
Abstract:We propose S$^2$GPT-PINN, a sparse and small model for solving parametric partial differential equations (PDEs). Similar to Small Language Models (SLMs), S$^2$GPT-PINN is tailored to domain-specific (families of) PDEs and characterized by its compact architecture and minimal computational power. Leveraging a small amount of extremely high quality data via a mathematically rigorous greedy algorithm that is enabled by the large full-order models, S$^2$GPT-PINN relies on orders of magnitude fewer parameters than PINNs to achieve extremely high efficiency via two levels of customizations. The first is knowledge distillation via task-specific activation functions that are transferred from Pre-Trained PINNs. The second is a judicious down-sampling when calculating the physics-informed loss of the network, compressing the number of data sites by orders of magnitude to the size of the small model.
[LG-106] Implementing Keyword Spotting on the MCUX947 Microcontroller with Integrated NPU
Link: https://arxiv.org/abs/2506.08911
Authors: Petar Jakuš, Hrvoje Džapo
Categories: Human-Computer Interaction (cs.HC); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF); Sound (cs.SD)
Comments: 4 pages
Abstract:This paper presents a keyword spotting (KWS) system implemented on the NXP MCXN947 microcontroller with an integrated Neural Processing Unit (NPU), enabling real-time voice interaction on resource-constrained devices. The system combines MFCC feature extraction with a CNN classifier, optimized using Quantization Aware Training to reduce model size with minimal accuracy drop. Experimental results demonstrate a 59x speedup in inference time when leveraging the NPU compared to CPU-only execution, achieving 97.06% accuracy with a model size of 30.58 KB, demonstrating the feasibility of efficient, low-power voice interfaces on embedded platforms.
[LG-107] Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres
Link: https://arxiv.org/abs/2506.17197
Authors: Samuel Howard, Peter Potaptchik, George Deligiannidis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Recent advances in flow-based generative modelling have provided scalable methods for computing the Schrödinger Bridge (SB) between distributions, a dynamic form of entropy-regularised Optimal Transport (OT) for the quadratic cost. The successful Iterative Markovian Fitting (IMF) procedure solves the SB problem via sequential bridge-matching steps, presenting an elegant and practical approach with many favourable properties over the more traditional Iterative Proportional Fitting (IPF) procedure. Beyond the standard setting, optimal transport can be generalised to the multi-marginal case in which the objective is to minimise a cost defined over several marginal distributions. Of particular importance are costs defined over a tree structure, from which Wasserstein barycentres can be recovered as a special case. In this work, we extend the IMF procedure to solve for the tree-structured SB problem. Our resulting algorithm inherits the many advantages of IMF over IPF approaches in the tree-based setting. In the specific case of Wasserstein barycentres, our approach can be viewed as extending fixed-point approaches for barycentre computation to the case of flow-based entropic OT solvers.
[LG-108] Empowering Near-Field Communications in Low-Altitude Economy with LLM: Fundamentals, Potentials, Solutions, and Future Directions
Link: https://arxiv.org/abs/2506.17067
Authors: Zhuo Xu, Tianyue Zheng, Linglong Dai
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:The low-altitude economy (LAE) is gaining significant attention from academia and industry. Fortunately, LAE naturally aligns with near-field communications in extremely large-scale MIMO (XL-MIMO) systems. By leveraging near-field beamfocusing, LAE can precisely direct beam energy to unmanned aerial vehicles, while the additional distance dimension boosts overall spectrum efficiency. However, near-field communications in LAE still face several challenges, such as the increase in signal processing complexity and the necessity of distinguishing between far- and near-field users. Inspired by large language models (LLMs) and their powerful ability to handle complex problems, we apply LLMs to solve the challenges of near-field communications in LAE. The objective of this article is to provide a comprehensive analysis and discussion of LLM-empowered near-field communications in LAE. Specifically, we first introduce the fundamentals of LLMs and near-field communications, including the key advantages of LLMs and the key characteristics of near-field communications. Then, we reveal the opportunities and challenges of near-field communications in LAE. To address these challenges, we present an LLM-based scheme for near-field communications in LAE, and provide a case study which jointly distinguishes far- and near-field users and designs the multi-user precoding matrix. Finally, we outline and highlight several future research directions and open issues.
[LG-109] Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings NEURIPS2025
Link: https://arxiv.org/abs/2506.17064
Authors: Aditya Sengar, Ali Hariri, Daniel Probst, Patrick Barth, Pierre Vandergheynst
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 10 pages (main text), 4 figures, 2 tables. Submitted to NeurIPS 2025. Code and data are publicly available
Abstract:Generating diverse, all-atom conformational ensembles of dynamic proteins such as G-protein-coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all-atom protein structures, including every side-chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low-dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue-based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a 2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue-based pooling strategy reproduces the reference ensemble with high structural fidelity (all-atom lDDT of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recovers backbone and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route to system-specific, all-atom ensemble generation for large proteins, providing a promising tool for structure-based therapeutic design on complex, dynamic targets. The D2R-MD dataset and our implementation are freely available to facilitate further research.
[LG-110] Bayesian Joint Model of Multi-Sensor and Failure Event Data for Multi-Mode Failure Prediction
Link: https://arxiv.org/abs/2506.17036
Authors: Sina Aghaee Dabaghan Fard, Minhee Kim, Akash Deep, Jaesung Lee
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Modern industrial systems are often subject to multiple failure modes, and their conditions are monitored by multiple sensors, generating multiple time-series signals. Additionally, time-to-failure data are commonly available. Accurately predicting a system’s remaining useful life (RUL) requires effectively leveraging multi-sensor time-series data alongside multi-mode failure event data. In most existing models, failure modes and RUL prediction are performed independently, ignoring the inherent relationship between these two tasks. Some models integrate multiple failure modes and event prediction using black-box machine learning approaches, which lack statistical rigor and cannot characterize the inherent uncertainty in the model and data. This paper introduces a unified approach to jointly model the multi-sensor time-series data and failure time concerning multiple failure modes. The proposed model integrates a Cox proportional hazards model, a Convolved Multi-output Gaussian Process, and multinomial failure mode distributions in a hierarchical Bayesian framework with corresponding priors, enabling accurate prediction with robust uncertainty quantification. Posterior distributions are effectively obtained by Variational Bayes, and prediction is performed with Monte Carlo sampling. The advantages of the proposed model are validated through extensive numerical and case studies with a jet-engine dataset.
[LG-111] Simulating Correlated Electrons with Symmetry-Enforced Normalizing Flows
Link: https://arxiv.org/abs/2506.17015
Authors: Dominic Schuh, Janik Kreit, Evan Berkowitz, Lena Funcke, Thomas Luu, Kim A. Nicoli, Marcel Rodekamp
Categories: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
Comments: 9 pages, 7 figures
Abstract:We present the first proof of principle that normalizing flows can accurately learn the Boltzmann distribution of the fermionic Hubbard model - a key framework for describing the electronic structure of graphene and related materials. State-of-the-art methods like Hybrid Monte Carlo often suffer from ergodicity issues near the time-continuum limit, leading to biased estimates. Leveraging symmetry-aware architectures as well as independent and identically distributed sampling, our approach resolves these issues and achieves significant speed-ups over traditional methods.
[LG-112] Enhancing Expressivity of Quantum Neural Networks Based on the SWAP test
Link: https://arxiv.org/abs/2506.16938
Authors: Sebastian Nagies, Emiliano Tolotti, Davide Pastorello, Enrico Blanzieri
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures
Abstract:Parameterized quantum circuits represent promising architectures for machine learning applications, yet many lack clear connections to classical models, potentially limiting their ability to translate the wide success of classical neural networks to the quantum realm. We examine a specific type of quantum neural network (QNN) built exclusively from SWAP test circuits, and discuss its mathematical equivalence to a classical two-layer feedforward network with quadratic activation functions under amplitude encoding. Our analysis across classical real-world and synthetic datasets reveals that while this architecture can successfully learn many practical tasks, it exhibits fundamental expressivity limitations due to violating the universal approximation theorem, particularly failing on harder problems like the parity check function. To address this limitation, we introduce a circuit modification using generalized SWAP test circuits that effectively implements classical neural networks with product layers. This enhancement enables successful learning of parity check functions in arbitrary dimensions which we analytically argue to be impossible for the original architecture beyond two dimensions regardless of network size. Our results establish a framework for enhancing QNN expressivity through classical task analysis and demonstrate that our SWAP test-based architecture offers broad representational capacity, suggesting potential promise also for quantum learning tasks.
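The link to quadratic activations can be seen from the standard SWAP test identity: for two normalized states, the ancilla is measured in state 0 with probability (1 + |⟨a|b⟩|²)/2. The sketch below simulates that outcome probability classically for real-valued inputs under amplitude encoding. It is an illustration of the identity only, not the authors' circuit or training code, and the function names are made up for this example.

```python
import math


def normalize(vector):
    """Amplitude encoding needs a unit-norm vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [x / norm for x in vector]


def swap_test_p0(x, w):
    """Probability of measuring the ancilla in |0> for states |x> and |w>.

    The result is a quadratic function of the inner product <x|w>, which is
    why one such unit behaves like a neuron with a quadratic activation.
    """
    a, b = normalize(x), normalize(w)
    overlap = sum(ai * bi for ai, bi in zip(a, b))
    return 0.5 * (1.0 + overlap ** 2)
```

Because the overlap is squared, `swap_test_p0(x, w)` equals `swap_test_p0([-v for v in x], w)`: the unit cannot see the sign of the inner product, which gives some intuition for why sign-sensitive targets such as parity are hard for this architecture.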
[LG-113] A Neural Operator based Hybrid Microscale Model for Multiscale Simulation of Rate-Dependent Materials
Link: https://arxiv.org/abs/2506.16918
Authors: Dhananjeyan Jeyaraj, Hamidreza Eivazi, Jendrik-Alexander Tröger, Stefan Wittek, Stefan Hartmann, Andreas Rausch
Categories: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:The behavior of materials is influenced by a wide range of phenomena occurring across various time and length scales. To better understand the impact of microstructure on macroscopic response, multiscale modeling strategies are essential. Numerical methods, such as the FE$^2$ approach, account for micro-macro interactions to predict the global response in a concurrent manner. However, these methods are computationally intensive due to the repeated evaluations of the microscale. This challenge has led to the integration of deep learning techniques into computational homogenization frameworks to accelerate multiscale simulations. In this work, we employ neural operators to predict the microscale physics, resulting in a hybrid model that combines data-driven and physics-based approaches. This allows for physics-guided learning and provides flexibility for different materials and spatial discretizations. We apply this method to time-dependent solid mechanics problems involving viscoelastic material behavior, where the state is represented by internal variables only at the microscale. The constitutive relations of the microscale are incorporated into the model architecture and the internal variables are computed based on established physical principles. The results for homogenized stresses (6% error) show that the approach is computationally efficient ($\sim 100\times$ faster).
[LG-114] Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
Link: https://arxiv.org/abs/2506.16658
Authors: Wenlong Ji, Yihan Pan, Ruihao Zhu, Lihua Lei
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent feature of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, provided that the correlation is non-zero – even in cases where the mean surrogate reward completely misaligns with the true mean rewards. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We compare MLA-UCB with the standard UCB on a range of numerical studies and show a sizable efficiency gain even when the size of the offline data and the correlation between predicted and true rewards are moderate.
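One generic way to see why a biased but correlated surrogate can still help is a control-variate estimate of an arm's mean reward. The sketch below is that textbook estimator, offered as an assumption about the flavour of the idea: the abstract does not spell out the MLA-UCB estimator or its confidence bounds, so this should not be read as the authors' algorithm. All names are hypothetical.

```python
from statistics import mean


def control_variate_mean(true_rewards, online_surrogates, offline_surrogates):
    """Estimate an arm's mean reward using correlated surrogate predictions.

    true_rewards / online_surrogates: paired observations from online pulls.
    offline_surrogates: surrogate predictions on (plentiful) offline data.
    The coefficient is estimated from the online pairs, so no prior
    knowledge of the covariance is needed.
    """
    n = len(true_rewards)
    if n != len(online_surrogates) or n < 2:
        raise ValueError("need at least two paired online observations")
    y_bar = mean(true_rewards)
    s_bar = mean(online_surrogates)
    cov = sum((y - y_bar) * (s - s_bar)
              for y, s in zip(true_rewards, online_surrogates)) / (n - 1)
    var = sum((s - s_bar) ** 2 for s in online_surrogates) / (n - 1)
    coef = cov / var if var > 0 else 0.0
    # A constant bias in the surrogate cancels in the difference of means.
    return y_bar - coef * (s_bar - mean(offline_surrogates))
```

In this sketch the surrogate's absolute level never enters the estimate, only its deviation between the online and offline samples, which is consistent with the abstract's remark that a misaligned surrogate mean need not hurt as long as the correlation is non-zero.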
[LG-115] Improvement of Nuclide Detection through Graph Spectroscopic Analysis Framework and its Application to Nuclear Facility Upset Detection
Link: https://arxiv.org/abs/2506.16522
Authors: Pedro Rodríguez Fernández, Christian Svinth, Alex Hagen
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:We present a method to improve the detection limit for radionuclides using spectroscopic radiation detectors and the arrival time of each detected radiation quantum. We enable this method using a neural network with an attention mechanism. We illustrate the method on the detection of Cesium release from a nuclear facility during an upset, and our method shows a $2\times$ improvement over the traditional spectroscopic method. We hypothesize that our method achieves this performance increase by modulating its detection probability by the overall rate of probable detections, specifically by adapting detection thresholds based on temporal event distributions and local spectral features, and show evidence to this effect. We believe this method is applicable broadly and may be more successful for radionuclides with more complicated decay chains than Cesium; we also note that our method can generalize beyond the addition of arrival time and could integrate other data about each detection event, such as pulse quality, location in detector, or even combining the energy and time from detections in different detectors.
[LG-116] On Continuous Monitoring of Risk Violations under Unknown Shift UAI2025
Link: https://arxiv.org/abs/2506.16416
Authors: Alexander Timans, Rajeev Verma, Eric Nalisnick, Christian A. Naesseth
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: AT and RV are joint first authors. Accepted at the Conference on Uncertainty in Artificial Intelligence (UAI 2025)
Abstract:Machine learning systems deployed in the real world must operate under dynamic and often unpredictable distribution shifts. This challenges the validity of statistical safety assurances on the system’s risk established beforehand. Common risk control frameworks rely on fixed assumptions and lack mechanisms to continuously monitor deployment reliability. In this work, we propose a general framework for the real-time monitoring of risk violations in evolving data streams. Leveraging the ‘testing by betting’ paradigm, we propose a sequential hypothesis testing procedure to detect violations of bounded risks associated with the model’s decision-making mechanism, while ensuring control on the false alarm rate. Our method operates under minimal assumptions on the nature of encountered shifts, rendering it broadly applicable. We illustrate the effectiveness of our approach by monitoring risks in outlier detection and set prediction under a variety of shifts.
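The 'testing by betting' idea can be sketched with a simple wealth process. The version below is a minimal illustration under stated assumptions: losses lie in [0, 1], the null hypothesis is that the conditional mean loss never exceeds `alpha`, and the bet size is fixed (the paper's actual betting strategy is not described in the abstract). Under the null the wealth is a non-negative supermartingale, so by Ville's inequality it exceeds 1/`delta` with probability at most `delta`.

```python
def monitor_risk(losses, alpha, delta=0.05, bet=0.5):
    """Return the first time step at which a risk violation is flagged.

    losses: iterable of bounded losses in [0, 1], observed one at a time.
    alpha:  risk level the deployed model is supposed to respect.
    delta:  bound on the false alarm probability.
    bet:    fixed bet size (an assumption of this sketch); must lie in
            [0, 1/alpha] so the wealth can never become negative.
    Returns None if no alarm is raised.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= bet <= 1.0 / alpha:
        raise ValueError("bet must lie in [0, 1/alpha]")
    wealth = 1.0
    for step, loss in enumerate(losses, start=1):
        # Wealth grows when the observed loss exceeds the allowed risk.
        wealth *= 1.0 + bet * (loss - alpha)
        if wealth >= 1.0 / delta:
            return step
    return None
```

With `alpha=0.1` and a stream of losses equal to 1.0, the wealth multiplies by 1.45 per step and crosses 1/0.05 = 20 at the ninth observation; a stream of zero losses never triggers an alarm.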
[LG-117] Identifying Heterogeneity in Distributed Learning
Link: https://arxiv.org/abs/2506.16394
Authors: Zelin Xiao, Jia Gu, Song Xi Chen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We study methods for identifying heterogeneous parameter components in distributed M-estimation with minimal data transmission. One is based on a re-normalized Wald test, which is shown to be consistent as long as the number of distributed data blocks K is of a smaller order of the minimum block sample size and the level of heterogeneity is dense. The second one is an extreme contrast test (ECT) based on the difference between the largest and smallest component-wise estimated parameters among data blocks. By introducing a sample splitting procedure, the ECT can avoid the bias accumulation arising from the M-estimation procedures, and exhibits consistency for K being much larger than the sample size while the heterogeneity is sparse. The ECT procedure is easy to operate and communication-efficient. A combination of the Wald and the extreme contrast tests is formulated to attain more robust power under varying levels of sparsity of the heterogeneity. We also conduct intensive numerical experiments to compare the family-wise error rate (FWER) and the power of the proposed methods. Additionally, we conduct a case study to present the implementation and validity of the proposed methods.
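The raw quantity behind the extreme contrast test, as described above, is the per-component gap between the largest and smallest block-level estimates. A minimal sketch of that statistic follows. The sample-splitting step, the standardization, and the rejection threshold are not given in the abstract, so they are deliberately left out rather than guessed; the function name is hypothetical.

```python
def extreme_contrast(block_estimates):
    """Per-component range of parameter estimates across data blocks.

    block_estimates: list of K parameter vectors, one per data block.
    Returns one value per parameter component; each block only has to
    transmit its estimate, which keeps communication low.
    """
    if not block_estimates:
        raise ValueError("need at least one block estimate")
    dims = len(block_estimates[0])
    if any(len(block) != dims for block in block_estimates):
        raise ValueError("all blocks must report the same number of components")
    return [
        max(block[j] for block in block_estimates)
        - min(block[j] for block in block_estimates)
        for j in range(dims)
    ]
```

A component whose estimates agree across blocks yields a small gap, while a component that differs in even a single block yields a large one, which matches the abstract's point that this test suits sparse heterogeneity.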
[LG-118] Feedback-driven recurrent quantum neural network universality
Link: https://arxiv.org/abs/2506.16332
Authors: Lukas Gonon, Rodrigo Martínez-Peña, Juan-Pablo Ortega
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 31 pages
Abstract:Quantum reservoir computing uses the dynamics of quantum systems to process temporal data, making it particularly well-suited for learning with noisy intermediate-scale quantum devices. Early experimental proposals, such as the restarting and rewinding protocols, relied on repeating previous steps of the quantum map to avoid backaction. However, this approach compromises real-time processing and increases computational overhead. Recent developments have introduced alternative protocols that address these limitations. These include online, mid-circuit measurement, and feedback techniques, which enable real-time computation while preserving the input history. Among these, the feedback protocol stands out for its ability to process temporal information with comparatively fewer components. Despite this potential advantage, the theoretical foundations of feedback-based quantum reservoir computing remain underdeveloped, particularly with regard to the universality and the approximation capabilities of this approach. This paper addresses this issue by presenting a recurrent quantum neural network architecture that extends a class of existing feedforward models to a dynamic, feedback-driven reservoir setting. We provide theoretical guarantees for variational recurrent quantum neural networks, including approximation bounds and universality results. Notably, our analysis demonstrates that the model is universal with linear readouts, making it both powerful and experimentally accessible. These results pave the way for practical and theoretically grounded quantum reservoir computing with real-time processing capabilities.
[LG-119] The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units
Link: https://arxiv.org/abs/2506.16289
Authors: Oswaldo Ludwig
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper explores the relationship between the condition number of a neural network’s weight tensor and the extent of information encoded by the associated processing unit, viewed through the lens of information theory. We argue that a high condition number, though not sufficient for effective knowledge encoding, may indicate that the unit has learned to selectively amplify and compress information. We formalize this intuition, particularly for linear units with Gaussian inputs, linking the condition number and the transformation’s log-volume scaling factor to the characteristics of the output entropy and the geometric properties of the learned transformation. Our analysis demonstrates that for a fixed weight norm, a concentrated distribution of singular values (high condition number) corresponds to reduced overall information transfer, indicating a specialized and efficient encoding strategy. Furthermore, we present a practical case study where these principles are applied to guide selective fine-tuning of a multimodal Large Language Model, aiming to mitigate catastrophic forgetting during cross-modal adaptation. Unlike many existing catastrophic forgetting mitigation methods that rely on access to pre-training statistics, which are often unavailable, our selective fine-tuning approach offers a way to bypass this common requirement.
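For the linear-Gaussian case mentioned above, the standard identities relating the singular values of the weight matrix to the output entropy are given below. They are textbook facts stated under the assumption that W is square and invertible; they are not a reproduction of the paper's own derivation.

```latex
% Assumptions: W is n x n and invertible, x ~ N(0, \Sigma), y = W x.
% \sigma_i(W) denotes the i-th singular value of W.
h(y) = h(x) + \log\lvert\det W\rvert
     = h(x) + \sum_{i=1}^{n} \log \sigma_i(W),
\qquad
\kappa(W) = \frac{\sigma_{\max}(W)}{\sigma_{\min}(W)} .
% Rescaling W -> cW (c \neq 0) shifts the log-volume term by n \log\lvert c\rvert
% but leaves \kappa(W) unchanged.
```

The last comment is the sense in which the condition number is scale-invariant: it depends only on the spread of the singular values, while the log-volume term also depends on their overall scale.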
[LG-120] Random feature approximation for general spectral methods
Link: https://arxiv.org/abs/2506.16283
Authors: Mike Nguyen, Nicole Mücke
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2308.15434, arXiv:2412.17518
Abstract:Random feature approximation is arguably one of the most widely used techniques for kernel methods in large-scale learning algorithms. In this work, we analyze the generalization properties of random feature methods, extending previous results for Tikhonov regularization to a broad class of spectral regularization techniques. This includes not only explicit methods but also implicit schemes such as gradient descent and accelerated algorithms like the Heavy-Ball and Nesterov method. Through this framework, we enable a theoretical analysis of neural networks and neural operators through the lens of the Neural Tangent Kernel (NTK) approach trained via gradient descent. For our estimators we obtain optimal learning rates over regularity classes (even for classes that are not included in the reproducing kernel Hilbert space), which are defined through appropriate source conditions. This improves or completes previous results obtained in related settings for specific kernel algorithms.
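As a reminder of what a random feature approximation looks like in its simplest form, the sketch below builds random Fourier features for a Gaussian (RBF) kernel. The kernel choice, the bandwidth, and all names are assumptions for illustration; the paper's analysis covers general spectral regularization on top of such features, which this example does not implement.

```python
import math
import random


def make_rff(dim, num_features, bandwidth=1.0, seed=0):
    """Return a feature map phi with phi(x) . phi(y) ~ exp(-|x-y|^2 / (2 bw^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density N(0, I / bw^2).
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [
            scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]

    return features


def rbf_kernel(x, y, bandwidth=1.0):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))
```

The inner product of two feature vectors is a Monte Carlo estimate of the kernel, with error shrinking roughly like one over the square root of `num_features`. Downstream estimators (ridge regression, gradient descent, and so on) then work with these finite-dimensional features instead of the full kernel matrix.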
[LG-121] Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation
Link: https://arxiv.org/abs/2506.16233
Authors: Chenrui Ma, Zechang Sun, Tao Jing, Zheng Cai, Yuan-Sen Ting, Song Huang, Mingyu Li
Categories: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: We have submitted to AAS journals. See another independent work for further reference – Category-based Galaxy Image Generation via Diffusion Models (Fan, Tang et al.). Comments are welcome
Abstract:Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle with generalization in large-scale surveys due to the limited representativeness of labeled datasets – whether from simulations or human annotation – a challenge pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model to synthesize realistic galaxy images for augmenting ML training data. Leveraging the Galaxy Zoo 2 dataset, which contains pairs of visual features and galaxy images from volunteer annotation, we demonstrate that our model generates diverse, high-fidelity galaxy images that closely adhere to the specified morphological feature conditions. Moreover, this model enables generative extrapolation to project well-annotated data into unseen domains and advance rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features ($\sim$0.1% in the GZ2 dataset) as a test case, our approach doubled the number of detected instances from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge gaps between scarce labeled data and the vast, uncharted parameter space of observational astronomy and offers insight for future astrophysical foundation model developments. Our project homepage is available at this https URL.
[LG-122] Diffusion-Based Hypothesis Testing and Change-Point Detection
Link: https://arxiv.org/abs/2506.16089
Authors: Sean Moushegian, Taposh Banerjee, Vahid Tarokh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Score-based methods have recently seen increasing popularity in modeling and generation. Methods have been constructed to perform hypothesis testing and change-point detection with score functions, but these methods are in general not as powerful as their likelihood-based peers. Recent works consider generalizing the score-based Fisher divergence into a diffusion-divergence by transforming score functions via multiplication with a matrix-valued function or a weight matrix. In this paper, we extend the score-based hypothesis test and change-point detection stopping rule into their diffusion-based analogs. Additionally, we theoretically quantify the performance of these diffusion-based algorithms and study scenarios where optimal performance is achievable. We propose a method of numerically optimizing the weight matrix and present numerical simulations to illustrate the advantages of diffusion-based algorithms.
[LG-123] Contactless Precision Steering of Particles in a Fluid inside a Cube with Rotating Walls
Link: https://arxiv.org/abs/2506.15958
Authors: Lucas Amoudruz, Petr Karnakov, Petros Koumoutsakos
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Contactless manipulation of small objects is essential for biomedical and chemical applications, such as cell analysis, assisted fertilisation, and precision chemistry. Established methods, including optical, acoustic, and magnetic tweezers, are now complemented by flow control techniques that use flow-induced motion to enable precise and versatile manipulation. However, trapping multiple particles in fluid remains a challenge. This study introduces a novel control algorithm capable of steering multiple particles in flow. The system uses rotating disks to generate flow fields that transport particles to precise locations. Disk rotations are governed by a feedback control policy based on the Optimising a Discrete Loss (ODIL) framework, which combines fluid dynamics equations with path objectives into a single loss function. Our experiments, conducted in both simulations and with the physical device, demonstrate the capability of the approach to transport two beads simultaneously to predefined locations, advancing robust contactless particle manipulation for biomedical applications.
[LG-124] From Local Interactions to Global Operators: Scalable Gaussian Process Operator for Physical Systems
Link: https://arxiv.org/abs/2506.15906
Authors: Sawan Kumar, Tapas Tripura, Rajdip Nayek, Souvik Chakraborty
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Operator learning offers a powerful paradigm for solving parametric partial differential equations (PDEs), but scaling probabilistic neural operators such as the recently proposed Gaussian Processes Operators (GPOs) to high-dimensional, data-intensive regimes remains a significant challenge. In this work, we introduce a novel, scalable GPO, which capitalizes on sparsity, locality, and structural information through judicious kernel design. Addressing the fundamental limitation of cubic computational complexity, our method leverages nearest-neighbor-based local kernel approximations in the spatial domain, sparse kernel approximation in the parameter space, and structured Kronecker factorizations to enable tractable inference on large-scale datasets and high-dimensional input. While local approximations often introduce accuracy trade-offs due to limited kernel interactions, we overcome this by embedding operator-aware kernel structures and employing expressive, task-informed mean functions derived from neural operator architectures. Through extensive evaluations on a broad class of nonlinear PDEs - including Navier-Stokes, wave advection, Darcy flow, and Burgers’ equations - we demonstrate that our framework consistently achieves high accuracy across varying discretization scales. These results underscore the potential of our approach to bridge the gap between scalability and fidelity in GPO, offering a compelling foundation for uncertainty-aware modeling in complex physical systems.
[LG-125] Superconducting Qubit Readout Using Next-Generation Reservoir Computing
Link: https://arxiv.org/abs/2506.15771
Authors: Robert Kent,Benjamin Lienhard,Gregory Lafyatis,Daniel J. Gauthier
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:
Abstract:Quantum processors require rapid and high-fidelity simultaneous measurements of many qubits. While superconducting qubits are among the leading modalities toward a useful quantum processor, their readout remains a bottleneck. Traditional approaches to processing measurement data often struggle to account for crosstalk present in frequency-multiplexed readout, the preferred method to reduce the resource overhead. Recent approaches to address this challenge use neural networks to improve the state-discrimination fidelity. However, they are computationally expensive to train and evaluate, resulting in increased latency and poor scalability as the number of qubits increases. We present an alternative machine learning approach based on next-generation reservoir computing that constructs polynomial features from the measurement signals and maps them to the corresponding qubit states. This method is highly parallelizable, avoids the costly nonlinear activation functions common in neural networks, and supports real-time training, enabling fast evaluation, adaptability, and scalability. Despite its lower computational complexity, our reservoir approach is able to maintain high qubit-state-discrimination fidelity. Relative to traditional methods, our approach achieves error reductions of up to 50% and 11% on single- and five-qubit datasets, respectively, and delivers up to 2.5x crosstalk reduction on the five-qubit dataset. Compared with recent machine-learning methods, evaluating our model requires 100x fewer multiplications for single-qubit and 2.5x fewer for five-qubit models. This work demonstrates that reservoir computing can enhance qubit-state discrimination while maintaining scalability for future quantum processors.
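The core idea in the abstract, polynomial features of the measurement signal followed by a linear map to qubit states, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the feature degree, the ridge-regression readout, the 0.5 decision threshold, and the synthetic two-cluster data standing in for readout signals are all assumptions made here.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(signals, degree=2):
    """Return columns [1, x_i, x_i * x_j, ...] up to `degree` for each row."""
    signals = np.atleast_2d(signals)
    n_samples, n_inputs = signals.shape
    columns = [np.ones(n_samples)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n_inputs), d):
            columns.append(np.prod(signals[:, list(idx)], axis=1))
    return np.stack(columns, axis=1)

def fit_readout(features, labels, ridge=1e-3):
    """Closed-form ridge regression: W = (F^T F + ridge * I)^{-1} F^T y."""
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ labels)

def predict_state(weights, features):
    """Threshold the linear output at 0.5 to obtain a 0/1 state label."""
    return (features @ weights > 0.5).astype(int)

# Synthetic stand-in for measurement data (hypothetical, not from the paper):
# state 0 clusters near (-1, 0), state 1 near (+1, 0).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
centers = np.where(labels[:, None] == 1, [1.0, 0.0], [-1.0, 0.0])
signals = centers + 0.3 * rng.standard_normal((400, 2))

features = polynomial_features(signals)
weights = fit_readout(features, labels.astype(float))
accuracy = np.mean(predict_state(weights, features) == labels)
```

Evaluation needs only multiplications and additions, with no nonlinear activation functions, which matches the cost argument made in the abstract.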
[LG-126] Approximate Ricci-flat Metrics for Calabi-Yau Manifolds
Link: https://arxiv.org/abs/2506.15766
Authors: Seung-Joo Lee,Andre Lukas
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Differential Geometry (math.DG)
*Comments: 15 pages, 6 figures
Abstract:We outline a method to determine analytic Kähler potentials with associated approximately Ricci-flat Kähler metrics on Calabi-Yau manifolds. Key ingredients are numerically calculating Ricci-flat Kähler potentials via machine learning techniques and fitting the numerical results to Donaldson’s Ansatz. We apply this method to the Dwork family of quintic hypersurfaces in $\mathbb{P}^4$ and an analogous one-parameter family of bi-cubic CY hypersurfaces in $\mathbb{P}^2\times\mathbb{P}^2$. In each case, a relatively simple analytic expression is obtained for the approximately Ricci-flat Kähler potentials, including the explicit dependence on the complex structure parameter. We find that these Kähler potentials only depend on the modulus of the complex structure parameter.
[LG-127] Implicit neural representations for accurate estimation of the standard model of white matter
Link: https://arxiv.org/abs/2506.15762
Authors: Tom Hendriks,Gerrit Arends,Edwin Versteeg,Anna Vilanova,Maxime Chamberland,Chantal M.W. Tax
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*Comments: 27 pages, 12 figures
Abstract:Diffusion magnetic resonance imaging (dMRI) enables non-invasive investigation of tissue microstructure. The Standard Model (SM) of white matter aims to disentangle dMRI signal contributions from intra- and extra-axonal water compartments. However, due to the model's high-dimensional nature, extensive acquisition protocols with multiple b-values and diffusion tensor shapes are typically required to mitigate parameter degeneracies. Even then, accurate estimation remains challenging due to noise. This work introduces a novel estimation framework based on implicit neural representations (INRs), which incorporate spatial regularization through the sinusoidal encoding of the input coordinates. The INR method is evaluated on both synthetic and in vivo datasets and compared to parameter estimates using cubic polynomials, supervised neural networks, and nonlinear least squares. Results demonstrate superior accuracy of the INR method in estimating SM parameters, particularly in low signal-to-noise conditions. Additionally, spatial upsampling of the INR can represent the underlying dataset anatomically plausibly in a continuous way, which is unattainable with linear or cubic interpolation. The INR is fully unsupervised, eliminating the need for labeled training data. It achieves fast inference (approximately 6 minutes), is robust to both Gaussian and Rician noise, supports joint estimation of SM kernel parameters and the fiber orientation distribution function with spherical harmonics orders up to at least 8 and non-negativity constraints, and accommodates spatially varying acquisition protocols caused by magnetic gradient non-uniformities. The combination of these properties, along with the possibility of easily adapting the framework to other dMRI models, positions INRs as a potentially important tool for analyzing and interpreting diffusion MRI data.
[LG-128] Compilation, Optimization, Error Mitigation, and Machine Learning in Quantum Algorithms
Link: https://arxiv.org/abs/2506.15760
Authors: Shuangbao Paul Wang,Jianzhou Mao,Eric Sakk
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:
Abstract:This paper discusses the compilation, optimization, and error mitigation of quantum algorithms, which are essential steps for executing real-world quantum algorithms. Quantum algorithms running on a hybrid platform with QPU and CPU/GPU take advantage of existing high-performance computing power with quantum-enabled exponential speedups. The proposed approximate quantum Fourier transform (AQFT) for quantum algorithm optimization improves circuit execution on top of the exponential speedup that the quantum Fourier transform already provides.
[LG-129] Quantum Fisher-Preconditioned Reinforcement Learning: From Single-Qubit Control to Rayleigh-Fading Link Adaptation
Link: https://arxiv.org/abs/2506.15753
Authors: Oluwaseyi Giwa,Muhammad Ahmed Mohsin,Muhammad Ali Jamshed
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 5 pages, 3 figures, submitted to IEEE Communications Letters
Abstract:In this letter, we propose Quantum-Preconditioned Policy Gradient (QPPG), a natural gradient-based algorithm for link adaptation that whitens policy updates using the full inverse quantum Fisher information with Tikhonov regularization. QPPG bridges classical and quantum geometry, achieving stable learning even under noise. Evaluated on classical and quantum environments, including noisy single-qubit Gym tasks and Rayleigh-fading channels, QPPG converges 4 times faster than REINFORCE and sustains a 1 dB gain under uncertainty. It reaches a 90 percent return in one hundred episodes with high noise robustness, showcasing the advantages of full QFI-based preconditioning for scalable quantum reinforcement learning.
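As a reading aid, a Tikhonov-regularized natural-gradient step of the kind the abstract describes can be written in the generic form below. The symbols are assumptions introduced here rather than notation taken from the letter: $\theta_k$ are the policy parameters at step $k$, $J$ is the expected return, $F$ is the quantum Fisher information matrix, $\lambda > 0$ is the regularization strength, and $\eta$ is the step size.

```latex
\theta_{k+1} = \theta_k + \eta \,\bigl(F(\theta_k) + \lambda I\bigr)^{-1} \nabla_\theta J(\theta_k)
```

In this form, replacing $F$ with the identity reduces the update to a plain policy-gradient step with effective step size $\eta / (1 + \lambda)$, so the preconditioner is what distinguishes the update from a REINFORCE-style one.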
[LG-130] InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Link: https://arxiv.org/abs/2506.15745
Authors: Minsoo Kim,Kyuhong Shim,Jungwook Choi,Simyung Chang
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*Comments:
Abstract:Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
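The threshold-triggered compression loop described above can be sketched in a few lines. This is a simplified illustration, not InfiniPot-V itself: it implements only a value-norm ranking, interpreted here as the L2 norm of each token's value vector (an assumption, since the abstract does not define the metric), and it omits the TaR step entirely. The function names, the `threshold` and `budget` parameters, and the array shapes are hypothetical.

```python
import numpy as np

def compress_by_value_norm(keys, values, budget):
    """Keep the `budget` tokens whose value vectors have the largest L2 norm."""
    if keys.shape[0] <= budget:
        return keys, values
    norms = np.linalg.norm(values, axis=1)
    keep = np.sort(np.argsort(norms)[-budget:])  # sort to preserve temporal order
    return keys[keep], values[keep]

def stream_with_cap(chunks, threshold, budget):
    """Append (keys, values) chunks; compress once the cache reaches `threshold`.

    Returns the final cache and the peak number of cached tokens observed.
    """
    keys = np.empty((0, chunks[0][0].shape[1]))
    values = np.empty((0, chunks[0][1].shape[1]))
    peak = 0
    for chunk_keys, chunk_values in chunks:
        keys = np.concatenate([keys, chunk_keys])
        values = np.concatenate([values, chunk_values])
        peak = max(peak, keys.shape[0])
        if keys.shape[0] >= threshold:
            keys, values = compress_by_value_norm(keys, values, budget)
    return keys, values, peak

# Ten chunks of four tokens each, standing in for encoded video frames.
rng = np.random.default_rng(0)
chunks = [(rng.standard_normal((4, 8)), rng.standard_normal((4, 8))) for _ in range(10)]
final_keys, final_values, peak_tokens = stream_with_cap(chunks, threshold=8, budget=4)
```

With these settings the cache never holds more than 8 tokens regardless of how many chunks arrive, which is the length-independent cap the abstract refers to.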
[LG-131] Sampling conditioned diffusions via Pathspace Projected Monte Carlo
Link: https://arxiv.org/abs/2506.15743
Authors: Tobias Grafke
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*Comments:
Abstract:We present an algorithm to sample stochastic differential equations conditioned on rather general constraints, including integral constraints, endpoint constraints, and stochastic integral constraints. The algorithm is a pathspace Metropolis-adjusted manifold sampling scheme, which samples stochastic paths on the submanifold of realizations that adhere to the conditioning constraint. We demonstrate the effectiveness of the algorithm by sampling a dynamical condensation phase transition, conditioning a random walk on a fixed Levy stochastic area, conditioning a stochastic nonlinear wave equation on high amplitude waves, and sampling a stochastic partial differential equation model of turbulent pipe flow conditioned on relaminarization events.
[LG-132] Modern approaches to building effective interpretable models of the property market using machine learning
Link: https://arxiv.org/abs/2506.15723
Authors: Irina G. Tanashkina,Alexey S. Tanashkin,Alexander S. Maksimchuik,Anna Yu. Poshivailo
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); General Economics (econ.GN); Applications (stat.AP)
*Comments: 42 pages, 22 figures
Abstract:In this article, we review modern approaches to building interpretable models of property markets using machine learning, based on the mass valuation of property in the Primorye region, Russia. A researcher lacking expertise in this topic encounters numerous difficulties in the effort to build a good model. The main source of these difficulties is the huge difference between noisy real market data and the ideal data that is very common in all types of tutorials on machine learning. This paper covers all stages of modeling: the collection of initial data, identification of outliers, the search and analysis of patterns in data, the formation and final choice of price factors, the building of the model, and the evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming emerging difficulties using actual examples. We show that the combination of classical linear regression with interpolation methods of geostatistics makes it possible to build an effective model for land parcels. For flats, where many objects are attributed to one spatial point, the application of geostatistical methods is difficult. Therefore we suggest linear regression with automatic generation and selection of additional rules on the basis of decision trees, the so-called RuleFit method. Thus we show that, despite the strong restriction imposed by the requirement of interpretability, which is important in practical aspects such as legal matters, it is still possible to build effective models of real property markets.
Information Retrieval
[IR-0] RAGentA: Multi-Agent Retrieval-Augmented Generation for Attributed Question Answering SIGIR2025
Link: https://arxiv.org/abs/2506.16988
Authors: Ines Besrour,Jingbo He,Tobias Schreieder,Michael Färber
Categories: Information Retrieval (cs.IR)
*Comments: Accepted at SIGIR 2025
Abstract:We present RAGentA, a multi-agent retrieval-augmented generation (RAG) framework for attributed question answering (QA). With the goal of trustworthy answer generation, RAGentA focuses on optimizing answer correctness, defined by coverage and relevance to the question and faithfulness, which measures the extent to which answers are grounded in retrieved documents. RAGentA uses a multi-agent architecture that iteratively filters retrieved documents, generates attributed answers with in-line citations, and verifies completeness through dynamic refinement. Central to the framework is a hybrid retrieval strategy that combines sparse and dense methods, improving Recall@20 by 12.5% compared to the best single retrieval model, resulting in more correct and well-supported answers. Evaluated on a synthetic QA dataset derived from the FineWeb index, RAGentA outperforms standard RAG baselines, achieving gains of 1.09% in correctness and 10.72% in faithfulness. These results demonstrate the effectiveness of the multi-agent architecture and hybrid retrieval in advancing trustworthy QA.
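The abstract does not say how the sparse and dense results are combined, so the sketch below swaps in reciprocal rank fusion (RRF), a common way to merge ranked lists, purely to illustrate what a hybrid retrieval step can look like. It should not be read as RAGentA's actual strategy. The document IDs, the two input rankings, and the constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a sparse (e.g. keyword-based) and a dense retriever.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Rank-based fusion avoids having to put sparse and dense scores on a common scale, which is one reason it is often chosen as a first baseline.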
[IR-1] Pyramid Mixer: Multi-dimensional Multi-period Interest Modeling for Sequential Recommendation SIGIR’25
Link: https://arxiv.org/abs/2506.16942
Authors: Zhen Gong,Zhifang Fan,Hui Lu,Qiwei Chen,Chenbin Zhang,Lin Guan,Yuchao Zheng,Feng Zhang,Xiao Yang,Zuotao Liu
Categories: Information Retrieval (cs.IR)
*Comments: Accepted by SIGIR’25
Abstract:Sequential recommendation, a critical task in recommendation systems, predicts the next user action based on the understanding of the user’s historical behaviors. Conventional studies mainly focus on cross-behavior modeling with self-attention based methods while neglecting comprehensive user interest modeling for more dimensions. In this study, we propose a novel sequential recommendation model, Pyramid Mixer, which leverages the MLP-Mixer architecture to achieve efficient and complete modeling of user interests. Our method learns comprehensive user interests via cross-behavior and cross-feature user sequence modeling. The mixer layers are stacked in a pyramid way for cross-period user temporal interest learning. Through extensive offline and online experiments, we demonstrate the effectiveness and efficiency of our method, and we obtain a +0.106% improvement in user stay duration and a +0.0113% increase in user active days in the online A/B test. The Pyramid Mixer has been successfully deployed on the industrial platform, demonstrating its scalability and impact in real-world applications.
[IR-2] Multi-Objective Recommendation in the Era of Generative AI: A Survey of Recent Progress and Future Prospects
Link: https://arxiv.org/abs/2506.16893
Authors: Zihan Hong,Yushi Wu,Zhiting Zhao,Shanshan Feng,Jianghong Ma,Jiao Liu,Tianjun Wei
Categories: Information Retrieval (cs.IR)
*Comments: 21 pages
Abstract:With the recent progress in generative artificial intelligence (Generative AI), particularly in the development of large language models, recommendation systems are evolving to become more versatile. Unlike traditional techniques, generative AI not only learns patterns and representations from complex data but also enables content generation, data synthesis, and personalized experiences. This generative capability plays a crucial role in the field of recommendation systems, helping to address the issue of data sparsity and improving the overall performance of recommendation systems. Numerous studies on generative AI have already emerged in the field of recommendation systems. Meanwhile, the current requirements for recommendation systems have surpassed the single utility of accuracy, leading to a proliferation of multi-objective research that considers various goals in recommendation systems. However, to the best of our knowledge, there remains a lack of comprehensive studies on multi-objective recommendation systems based on generative AI technologies, leaving a significant gap in the literature. Therefore, we investigate the existing research on multi-objective recommendation systems involving generative AI to bridge this gap. We compile current research on multi-objective recommendation systems based on generative techniques, categorizing them by objectives. Additionally, we summarize relevant evaluation metrics and commonly used datasets, concluding with an analysis of the challenges and future directions in this domain.
[IR-3] Neural Prioritisation for Web Crawling ICTIR2025
Link: https://arxiv.org/abs/2506.16146
Authors: Francesza Pezzuti,Sean MacAvaney,Nicola Tonellotto
Categories: Information Retrieval (cs.IR)
*Comments: Published at ACM ICTIR 2025
Abstract:Given the vast scale of the Web, crawling prioritisation techniques based on link graph traversal, popularity, link analysis, and textual content are frequently applied to surface documents that are most likely to be valuable. While existing techniques are effective for keyword-based search, both retrieval methods and user search behaviours are shifting from keyword-based matching to natural language semantic matching. The remarkable success of applying semantic matching and quality signals during ranking leads us to hypothesize that crawling could be improved by prioritizing Web pages with high semantic quality. To investigate this, we propose a semantic quality-driven prioritisation technique to enhance the effectiveness of crawling and align the crawler behaviour with the recent shift towards natural language search. We embed semantic understanding directly into the crawling process, leveraging recent neural semantic quality estimators to prioritise the crawling frontier, with the goal of surfacing content that is semantically rich and valuable for modern search needs. Our experiments on the English subset of ClueWeb22-B and the Researchy Questions query set show that, compared to existing crawling techniques, neural crawling policies significantly improve harvest rate, maxNDCG, and search effectiveness during the early stages of crawling. Meanwhile, crawlers based on our proposed neural policies maintain comparable search performance on keyword queries from the MS MARCO Web Search query set. While this work does not propose a definitive and complete solution, it presents a forward-looking perspective on Web crawling and opens the door to a new line of research on leveraging semantic analysis to effectively align crawlers with the ongoing shift toward natural language search.
[IR-4] SEP-GCN: Leveraging Similar Edge Pairs with Temporal and Spatial Contexts for Location-Based Recommender Systems SIGIR ICTIR
Link: https://arxiv.org/abs/2506.16003
Authors: Tan Loc Nguyen,Tin T. Tran
Categories: Information Retrieval (cs.IR); Information Theory (cs.IT)
*Comments: Accepted for ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) 2025, Padua, Italy
Abstract:Recommender systems play a crucial role in enabling personalized content delivery amidst the challenges of information overload and human mobility. Although conventional methods often rely on interaction matrices or graph-based retrieval, recent approaches have sought to exploit contextual signals such as time and location. However, most existing models focus on node-level representation or isolated edge attributes, underutilizing the relational structure between interactions. We propose SEP-GCN, a novel graph-based recommendation framework that learns from pairs of contextually similar interaction edges, each representing a user-item check-in event. By identifying edge pairs that occur within similar temporal windows or geographic proximity, SEP-GCN augments the user-item graph with contextual similarity links. These links bridge distant but semantically related interactions, enabling improved long-range information propagation. The enriched graph is processed via an edge-aware convolutional mechanism that integrates contextual similarity into the message-passing process. This allows SEP-GCN to model user preferences more accurately and robustly, especially in sparse or dynamic environments. Experiments on benchmark data sets show that SEP-GCN consistently outperforms strong baselines in both predictive accuracy and robustness.
[IR-5] Empowering Graph-based Approximate Nearest Neighbor Search with Adaptive Awareness Capabilities KDD2025
Link: https://arxiv.org/abs/2506.15986
Authors: Jiancheng Ruan,Tingyang Chen,Renchi Yang,Xiangyu Ke,Yunjun Gao
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
*Comments: Accepted by KDD2025
Abstract:Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces finds extensive applications in databases, information retrieval, recommender systems, etc. While graph-based methods have emerged as the leading solution for ANNS due to their superior query performance, they still face several challenges, such as struggling with local optima and redundant computations. These issues arise because existing methods (i) fail to fully exploit the topological information underlying the proximity graph G, and (ii) suffer from severe distribution mismatches between the base data and queries in practice. To this end, this paper proposes GATE, high-tier proximity Graph with Adaptive Topology and Query AwarEness, as a lightweight and adaptive module atop the graph-based indexes to accelerate ANNS. Specifically, GATE formulates the critical problem to identify an optimal entry point in the proximity graph for a given query, facilitating faster online search. By leveraging the inherent clusterability of high-dimensional data, GATE first extracts a small set of hub nodes V as candidate entry points. Then, resorting to a contrastive learning-based two-tower model, GATE encodes both the structural semantics underlying G and the query-relevant features into the latent representations of these hub nodes V. A navigation graph index on V is further constructed to minimize the model inference overhead. Extensive experiments demonstrate that GATE achieves a 1.2-2.0X speed-up in query performance compared to state-of-the-art graph-based indexes.
[IR-6] Architecture is All You Need: Improving LLM Recommenders by Dropping the Text
Link: https://arxiv.org/abs/2506.15833
Authors: Kevin Foley,Shaghayegh Agah,Kavya Priyanka Kakinada
Categories: Information Retrieval (cs.IR)
*Comments: 7 pages, 1 figure
Abstract:In recent years, there has been an explosion of interest in the applications of large pre-trained language models (PLMs) to recommender systems, with many studies showing strong performance of PLMs on common benchmark datasets. PLM-based recommender models benefit from flexible and customizable prompting, an unlimited vocabulary of recommendable items, and general “world knowledge” acquired through pre-training on massive text corpora. While PLM-based recommenders show promise in settings where data is limited, they are hard to implement in practice due to their large size and computational cost. Additionally, fine-tuning PLMs to improve performance on collaborative signals may degrade the model’s capacity for world knowledge and generalizability. We propose a recommender model that uses the architecture of large language models (LLMs) while reducing layer count and dimensions and replacing the text-based subword tokenization of a typical LLM with discrete tokens that uniquely represent individual content items. We find that this simplified approach substantially outperforms both traditional sequential recommender models and PLM-based recommender models at a tiny fraction of the size and computational complexity of PLM-based models. Our results suggest that the principal benefit of LLMs in recommender systems is their architecture, rather than the world knowledge acquired during extensive pre-training.
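The tokenization change described in the abstract, one discrete token per content item instead of subword pieces, can be sketched as below. This is an illustrative assumption about how such a vocabulary might be built, not the authors' code. The item IDs, the `<pad>` and `<unk>` special tokens, the left-padding scheme, and the function names are all hypothetical.

```python
def build_item_vocab(sequences, specials=("<pad>", "<unk>")):
    """Assign one integer token ID to each distinct item, after the special tokens."""
    vocab = {token: index for index, token in enumerate(specials)}
    for sequence in sequences:
        for item in sequence:
            if item not in vocab:
                vocab[item] = len(vocab)
    return vocab

def encode(sequence, vocab, max_len):
    """Map items to token IDs, keep the most recent `max_len`, and left-pad."""
    ids = [vocab.get(item, vocab["<unk>"]) for item in sequence][-max_len:]
    return [vocab["<pad>"]] * (max_len - len(ids)) + ids

# Hypothetical interaction histories, one list of item IDs per user.
histories = [["item_42", "item_7"], ["item_7", "item_99", "item_42"]]
vocab = build_item_vocab(histories)
batch = [encode(history, vocab, max_len=3) for history in histories]
```

The resulting integer sequences would then index an embedding table in a small transformer, so the vocabulary size equals the catalogue size plus the special tokens rather than the size of a text tokenizer's vocabulary.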